[mpich-discuss] mpich2-1.3 problems

Robert Graves rwgraves at usgs.gov
Wed Oct 27 20:06:22 CDT 2010


Hi Pavan-

Thanks for the suggestions. I have done both and now things appear to be working OK.

Question:
Should I still set PMI_SUBVERSION=0 with the new version of Hydra?
It seems to be working OK without setting it (i.e., leaving it at the default of 1).
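
For reference, if it does turn out to still be needed, my understanding is that it can be passed on the mpiexec command line (assuming the -genv option works the same way with the new Hydra), e.g.:

% mpiexec -genv PMI_SUBVERSION 0 -f mpi.machinefile -n 17 /opt/mpich2-1.3/examples/cpi

or exported in the shell before launching the run.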

-Rob





On Oct 27, 2010, at 3:12 PM, Pavan Balaji wrote:

> 
> I wonder if this is a bug that we just fixed. Can you try passing the environment variable PMI_SUBVERSION=0?
> 
> If that works, can you try the latest snapshot of Hydra? http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra
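> 
> In case it is useful, building the snapshot standalone is roughly the usual configure/make sequence (the exact tarball name depends on the snapshot date, so treat this as a sketch):
> 
> % tar xzf hydra-<snapshot>.tar.gz
> % cd hydra-<snapshot>
> % ./configure --prefix=/opt/hydra-nightly
> % make && make install
> 
> and then run your job with the mpiexec installed under that prefix instead of the one from mpich2-1.3.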
> 
> -- Pavan
> 
> On 10/27/2010 05:04 PM, Robert Graves wrote:
>> Hello-
>> 
>> We have just installed mpich2-1.3 on a cluster of 18 nodes. The nodes
>> are all running Fedora 13
>> and consist of 64-bit HP machines of various vintages and numbers of
>> cores (from 2 to 12 cores per node).
>> 
>> I have created a hostfile (named mpi.machinefile) with the following
>> entries:
>> 
>> % cat mpi.machinefile
>> aki18:4
>> aki17:4
>> aki16:4
>> aki15:4
>> aki14:1
>> aki13:1
>> aki12:1
>> aki11:1
>> aki10:1
>> aki09:1
>> aki08:1
>> aki07:1
>> aki06:1
>> aki05:1
>> aki04:1
>> aki03:1
>> aki02:1
>> aki01:1
>> 
>> where my nodes are named aki01 ... aki18 (also resolved as
>> aki01.urscorp.com ... aki18.urscorp.com).
>> 
>> Executing the following appears to work correctly:
>> 
>> % mpiexec -f mpi.machinefile -n 12 /opt/mpich2-1.3/examples/cpi
>> 
>> and gives the output:
>> 
>> Process 9 of 12 is on aki16.urscorp.com
>> Process 10 of 12 is on aki16.urscorp.com
>> Process 11 of 12 is on aki16.urscorp.com
>> Process 8 of 12 is on aki16.urscorp.com
>> Process 6 of 12 is on aki17.urscorp.com
>> Process 4 of 12 is on aki17.urscorp.com
>> Process 5 of 12 is on aki17.urscorp.com
>> Process 7 of 12 is on aki17.urscorp.com
>> Process 0 of 12 is on aki18.urscorp.com
>> Process 1 of 12 is on aki18.urscorp.com
>> Process 2 of 12 is on aki18.urscorp.com
>> Process 3 of 12 is on aki18.urscorp.com
>> pi is approximately 3.1415926544231256, Error is 0.0000000008333325
>> wall clock time = 0.004010
>> 
>> 
>> However, changing the requested number of CPUs to 17 causes a fatal error:
>> 
>> % mpiexec -f mpi.machinefile -n 17 /opt/mpich2-1.3/examples/cpi
>> 
>> and gives the output:
>> 
>> Fatal error in MPI_Init: Other MPI error, error stack:
>> MPIR_Init_thread(385).................:
>> MPID_Init(135)........................: channel initialization failed
>> MPIDI_CH3_Init(38)....................:
>> MPID_nem_init(196)....................:
>> MPIDI_CH3I_Seg_commit(366)............:
>> MPIU_SHMW_Hnd_deserialize(324)........:
>> MPIU_SHMW_Seg_open(863)...............:
>> MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or directory
>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>> 
>> 
>> 
>> I also tried setting MPI_NO_LOCAL=1 but that did not help.
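>> 
>> (For reference, I believe such a variable has to reach all of the ranks, not just my login shell, so I would expect it to be passed through mpiexec, e.g.
>> 
>> % mpiexec -genv MPI_NO_LOCAL 1 -f mpi.machinefile -n 17 /opt/mpich2-1.3/examples/cpi
>> 
>> assuming Hydra's -genv option forwards it to the remote processes.)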
>> 
>> Any help you can provide is greatly appreciated.
>> 
>> Thanks,
>> Rob Graves
>> Research Geophysicist
>> US Geological Survey
>> Pasadena, CA
>> 
>> 
>> 
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji


