[mpich-discuss] Interesting Problem

Evans Jeffrey jje at purdue.edu
Thu Feb 17 16:01:16 CST 2011


Pavan, 

I have a small 80-core cluster (10 nodes, each with dual quad-core processors) that I am running with MPICH2-1.3.2.

My work involves precise process binding and I have come across an interesting problem. 
My application is a tool that emulates parallel applications by simulating computation and then using the communication subsystem for data movement between processes, repeating this cycle over and over to emulate a parallel application. It's a much longer story as to what I use this for.
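
Roughly, the core of the tool is a loop like the following (a stripped-down sketch rather than the real code; the sleep duration, buffer size, and iteration count here are arbitrary placeholders):

#include <mpi.h>
#include <unistd.h>

/* Illustrative emulation loop: simulate computation, then
 * exercise the communication subsystem, over and over.     */
int main(int argc, char **argv)
{
    double buf[1024] = {0};
    int rank, size, iter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (iter = 0; iter < 100; iter++) {
        usleep(1000);  /* stand-in for the simulated computation phase */
        /* exchange a buffer with ring neighbors for the data movement phase */
        MPI_Sendrecv_replace(buf, 1024, MPI_DOUBLE,
                             (rank + 1) % size, 0,
                             (rank + size - 1) % size, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}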

I can use the tool in every possible permutation over subsets of the 80 cores, including using the entire machine (all 80 cores).

Today I came across a situation that puzzles me. When trying to bind all but 2 cores across N nodes, things work fine until I get to 30 cores (on 4 nodes). The error report is below.

You should know that so far MPICH2-1.3.2 has been working perfectly for us, so I don't think it's a build issue. 

This problem is not tied to particular nodes; in other words, if I use a different subset of nodes I get the same response (below), except that the error log reports whichever nodes I'm using. For the trace below, my host file is as follows:

hpn01:7 binding=user:1,2,3,4,5,6,7
hpn02:8 binding=user:0,1,2,3,4,5,6,7
hpn03:8 binding=user:0,1,2,3,4,5,6,7
hpn04:7 binding=user:0,1,2,3,4,5,6
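
For reference, I launch with something like this (assuming the host file above is saved as "hosts"; the executable name here is just a placeholder):

/opt/mpi/mpich2-1.3.2/64/nemesis-gcc-4.4.0/bin/mpiexec -f hosts -n 30 ./emulator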

My PBS script calls out the nodes (and their ppn=x values) explicitly (sketched below), and this normally is not an issue.
As I said earlier, I can configure the bindings in virtually any way I want (except this one) and the apps seem to function fine.
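
The node request in the PBS script is along these lines (a sketch matching the host file above; the rest of the script is omitted):

#PBS -l nodes=hpn01:ppn=7+hpn02:ppn=8+hpn03:ppn=8+hpn04:ppn=7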

Any thoughts on how I might begin to track this down? 

Jeff
 
Jeffrey J. Evans
jje at purdue.edu
http://web.ics.purdue.edu/~evans6/

/opt/mpi/mpich2-1.3.2/64/nemesis-gcc-4.4.0/bin/mpiexec
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(385).................: 
MPID_Init(135)........................: channel initialization failed
MPIDI_CH3_Init(38)....................: 
MPID_nem_init(196)....................: 
MPIDI_CH3I_Seg_commit(366)............: 
MPIU_SHMW_Hnd_deserialize(324)........: 
MPIU_SHMW_Seg_open(863)...............: 
MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or directory
[proxy:0:3 at hpn04] handle_pmi_response (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:417): assert (!closed) failed
[proxy:0:3 at hpn04] HYD_pmcd_pmip_control_cmd_cb (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:908): unable to handle PMI response
[proxy:0:3 at hpn04] HYDT_dmxu_poll_wait_for_event (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:3 at hpn04] main (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmip.c:221): demux engine error waiting for event
[mpiexec at hpn01] HYDT_bscu_wait_for_completion (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:99): one of the processes terminated badly; aborting
[mpiexec at hpn01] HYDT_bsci_wait_for_completion (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec at hpn01] HYD_pmci_wait_for_completion (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:352): bootstrap server returned error waiting for completion
[mpiexec at hpn01] main (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/ui/mpich/mpiexec.c:294): process manager error waiting for completion



