[mpich-discuss] mpich2-1.3 problems

Pavan Balaji balaji at mcs.anl.gov
Wed Oct 27 17:12:25 CDT 2010


I wonder if this is a bug that we just fixed. Can you try passing the 
environment variable PMI_SUBVERSION=0?
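
For example (untested, assuming a bash-style shell), you can either set the 
variable in mpiexec's environment or forward it to the application processes 
with Hydra's -genv option:

% export PMI_SUBVERSION=0
% mpiexec -f mpi.machinefile -n 17 /opt/mpich2-1.3/examples/cpi

or, equivalently:

% mpiexec -genv PMI_SUBVERSION 0 -f mpi.machinefile -n 17 /opt/mpich2-1.3/examples/cpi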

If that works, can you try the latest snapshot of Hydra? 
http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra
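
If you want to try the snapshot standalone, a rough build sequence (the 
tarball name below is only a placeholder; use whatever the nightly page 
currently provides, and any install prefix you like) would be:

% tar xzf hydra-nightly.tar.gz
% cd hydra-nightly
% ./configure --prefix=/opt/hydra-nightly
% make && make install
% /opt/hydra-nightly/bin/mpiexec -f mpi.machinefile -n 17 /opt/mpich2-1.3/examples/cpi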

  -- Pavan

On 10/27/2010 05:04 PM, Robert Graves wrote:
> Hello-
>
> We have just installed mpich2-1.3 on a cluster of 18 nodes. The nodes
> are all running fedora 13
> and consist of 64-bit HP machines of various vintages and numbers of
> cores (from 2 to 12 cores per node).
>
> I have created a hostfile (named mpi.machinefile) with the following
> entries:
>
> % cat mpi.machinefile
> aki18:4
> aki17:4
> aki16:4
> aki15:4
> aki14:1
> aki13:1
> aki12:1
> aki11:1
> aki10:1
> aki09:1
> aki08:1
> aki07:1
> aki06:1
> aki05:1
> aki04:1
> aki03:1
> aki02:1
> aki01:1
>
> where my nodes are named aki01 ... aki18 (also resolved as
> aki01.urscorp.com ... aki18.urscorp.com).
>
> Executing the following appears to work correctly:
>
> % mpiexec -f mpi.machinefile -n 12 /opt/mpich2-1.3/examples/cpi
>
> and gives the output:
>
> Process 9 of 12 is on aki16.urscorp.com
> Process 10 of 12 is on aki16.urscorp.com
> Process 11 of 12 is on aki16.urscorp.com
> Process 8 of 12 is on aki16.urscorp.com
> Process 6 of 12 is on aki17.urscorp.com
> Process 4 of 12 is on aki17.urscorp.com
> Process 5 of 12 is on aki17.urscorp.com
> Process 7 of 12 is on aki17.urscorp.com
> Process 0 of 12 is on aki18.urscorp.com
> Process 1 of 12 is on aki18.urscorp.com
> Process 2 of 12 is on aki18.urscorp.com
> Process 3 of 12 is on aki18.urscorp.com
> pi is approximately 3.1415926544231256, Error is 0.0000000008333325
> wall clock time = 0.004010
>
>
> However, changing the requested number of CPUs to 17 causes a fatal error:
>
> % mpiexec -f mpi.machinefile -n 17 /opt/mpich2-1.3/examples/cpi
>
> and gives the output:
>
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(385).................:
> MPID_Init(135)........................: channel initialization failed
> MPIDI_CH3_Init(38)....................:
> MPID_nem_init(196)....................:
> MPIDI_CH3I_Seg_commit(366)............:
> MPIU_SHMW_Hnd_deserialize(324)........:
> MPIU_SHMW_Seg_open(863)...............:
> MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or
> directory
> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>
>
>
> I also tried setting MPI_NO_LOCAL=1 but that did not help.
>
> Any help you can provide is greatly appreciated.
>
> Thanks,
> Rob Graves
> Research Geophysicist
> US Geological Survey
> Pasadena, CA
>
>
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji
