[mpich-discuss] mpich2-1.3 problems

Robert Graves rwgraves at usgs.gov
Wed Oct 27 17:04:53 CDT 2010


Hello-

We have just installed mpich2-1.3 on a cluster of 18 nodes. The nodes all run Fedora 13
and are 64-bit HP machines of various vintages, with 2 to 12 cores per node.

I have created a hostfile (named mpi.machinefile) with the following entries:

% cat mpi.machinefile
aki18:4 
aki17:4 
aki16:4 
aki15:4 
aki14:1 
aki13:1 
aki12:1 
aki11:1 
aki10:1 
aki09:1 
aki08:1 
aki07:1 
aki06:1 
aki05:1 
aki04:1 
aki03:1 
aki02:1 
aki01:1 

where my nodes are named aki01 ... aki18 (also resolved as aki01.urscorp.com ... aki18.urscorp.com).

Executing the following appears to work correctly:

% mpiexec -f mpi.machinefile -n 12 /opt/mpich2-1.3/examples/cpi

and gives the output:

Process 9 of 12 is on aki16.urscorp.com
Process 10 of 12 is on aki16.urscorp.com 
Process 11 of 12 is on aki16.urscorp.com 
Process 8 of 12 is on aki16.urscorp.com 
Process 6 of 12 is on aki17.urscorp.com 
Process 4 of 12 is on aki17.urscorp.com 
Process 5 of 12 is on aki17.urscorp.com 
Process 7 of 12 is on aki17.urscorp.com
Process 0 of 12 is on aki18.urscorp.com 
Process 1 of 12 is on aki18.urscorp.com
Process 2 of 12 is on aki18.urscorp.com 
Process 3 of 12 is on aki18.urscorp.com
pi is approximately 3.1415926544231256, Error is 0.0000000008333325 
wall clock time = 0.004010 
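
The placement above is consistent with mpiexec filling hosts in hostfile order, up to the
slot count given after each colon. A minimal Python sketch of that rule, assuming block
allocation in file order (inferred from the -n 12 output, not taken from the mpiexec
documentation) and assuming wrap-around once all slots are used; parse_hostfile and
place_ranks are hypothetical helper names:

```python
# Sketch only: models rank placement as "fill each host in hostfile order".
# This ordering is an assumption inferred from the -n 12 output above.

HOSTFILE = """\
aki18:4
aki17:4
aki16:4
aki15:4
aki14:1
aki13:1
aki12:1
aki11:1
aki10:1
aki09:1
aki08:1
aki07:1
aki06:1
aki05:1
aki04:1
aki03:1
aki02:1
aki01:1
"""


def parse_hostfile(text):
    """Return [(host, slots), ...] from 'host:slots' lines; slots default to 1."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        host, _, slots = line.partition(":")
        hosts.append((host, int(slots) if slots else 1))
    return hosts


def place_ranks(hosts, nprocs):
    """Assign ranks 0..nprocs-1 to hosts in file order (wrap-around is assumed)."""
    if nprocs > 0 and sum(slots for _, slots in hosts) <= 0:
        raise ValueError("hostfile provides no slots")
    placement = []
    while len(placement) < nprocs:
        for host, slots in hosts:
            for _ in range(slots):
                if len(placement) == nprocs:
                    return placement
                placement.append(host)
    return placement


if __name__ == "__main__":
    hosts = parse_hostfile(HOSTFILE)
    for nprocs in (12, 17):
        print("-n", nprocs)
        for rank, host in enumerate(place_ranks(hosts, nprocs)):
            print("  rank", rank, "->", host)
```

Under these assumptions, -n 12 uses only aki18, aki17 and aki16, and -n 16 would still stay
on the four nodes listed with 4 slots. -n 17 is the first request that places a rank
(rank 16) on aki14.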


However, raising the requested number of processes to 17 causes a fatal error:

% mpiexec -f mpi.machinefile -n 17 /opt/mpich2-1.3/examples/cpi

and gives the output:

Fatal error in MPI_Init: Other MPI error, error stack: 
MPIR_Init_thread(385).................: 
MPID_Init(135)........................: channel initialization failed 
MPIDI_CH3_Init(38)....................: 
MPID_nem_init(196)....................: 
MPIDI_CH3I_Seg_commit(366)............: 
MPIU_SHMW_Hnd_deserialize(324)........: 
MPIU_SHMW_Seg_open(863)...............: 
MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or directory 
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1) 



I also tried setting MPI_NO_LOCAL=1 but that did not help.

Any help you can provide is greatly appreciated.

Thanks,
Rob Graves
Research Geophysicist
US Geological Survey
Pasadena, CA