[mpich-discuss] Interesting Problem

Evans Jeffrey jje at purdue.edu
Fri Feb 18 09:38:14 CST 2011


Done, the output is attached. I probably won't be able to look at it myself for a while. I have also attached the PBS script (it's short) and the host file.
Thanks for any hints or suggestions.

jje

Jeffrey J. Evans
jje at purdue.edu
http://web.ics.purdue.edu/~evans6/

-------------- next part --------------
Three non-text attachments were scrubbed by the list archive:

Name: pacetest.o3771   Type: application/octet-stream   Size: 61038 bytes
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110218/3f61f65d/attachment-0003.obj>

Name: 30core   Type: application/octet-stream   Size: 144 bytes
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110218/3f61f65d/attachment-0004.obj>

Name: pacetest   Type: application/octet-stream   Size: 1285 bytes
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110218/3f61f65d/attachment-0005.obj>
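For readers without the attachments, here is a minimal sketch of the kind of PBS script and launch command being discussed. The application name (pacetest), host file name (30core), and mpiexec path are taken from the messages below, and the node/ppn request mirrors the quoted host file; the job name, walltime, and exact mpiexec arguments are assumptions, not the contents of the attached script.

    #!/bin/bash
    # Hypothetical PBS script (sketch only, not the attached "pacetest" script).
    # The nodes and ppn values are called out explicitly, matching the host file
    # quoted further down (7 + 8 + 8 + 7 = 30 processes).
    #PBS -N pacetest
    #PBS -l nodes=hpn01:ppn=7+hpn02:ppn=8+hpn03:ppn=8+hpn04:ppn=7
    #PBS -l walltime=00:30:00

    cd $PBS_O_WORKDIR

    # Launch with the Hydra mpiexec, reading the binding host file and
    # producing the verbose output that was requested.
    /opt/mpi/mpich2-1.3.2/64/nemesis-gcc-4.4.0/bin/mpiexec -verbose \
        -f 30core -n 30 ./pacetest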


On Feb 17, 2011, at 5:07 PM, Pavan Balaji wrote:

> Hi Evans,
> 
> Would you be able to run mpiexec with the -verbose flag and send us the 
> output?
> 
> Thanks,
> 
>  -- Pavan
> 
> On 02/17/2011 04:01 PM, Evans Jeffrey wrote:
>> Pavan,
>> 
>> I have a small 80-core cluster (10 nodes, each with two quad-core processors) that I am running with MPICH2-1.3.2.
>> 
>> My work involves precise process binding and I have come across an interesting problem.
>> My application is a tool that emulates parallel applications: it simulates computation, then uses the communication subsystem to move data between processes, and repeats this cycle over and over. It's a much longer story as to what I use it for.
>> 
>> I can use the tool in every possible permutation over subsets of the 80 cores, including the entire machine (all 80 cores).
>> 
>> Today I came across a situation that puzzles me. When trying to bind to all but two cores across N nodes, things work fine until I get to 30 cores (on 4 nodes). The error report is below.
>> 
>> You should know that so far MPICH2-1.3.2 has been working perfectly for us, so I don't think it's a build issue.
>> 
>> This problem travels with the nodes; in other words, if I use a different subset of nodes I get the same response (below), only with the error log reporting the nodes I'm using. For the trace below, my host file is as follows:
>> 
>> hpn01:7 binding=user:1,2,3,4,5,6,7
>> hpn02:8 binding=user:0,1,2,3,4,5,6,7
>> hpn03:8 binding=user:0,1,2,3,4,5,6,7
>> hpn04:7 binding=user:0,1,2,3,4,5,6
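>> 
>> (As a sketch of how a host file like this gets consumed, the launch boils down to
>> something like "mpiexec -f 30core -n 30 ./pacetest"; the exact command line is in
>> the attached PBS script.)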
>> 
>> My PBS script calls out the nodes (and ppn=x values) explicitly, and this is normally not an issue.
>> As I said earlier, I can configure bindings in virtually any way I want (except this one) and the apps seem to function fine.
>> 
>> Any thoughts on how I might begin to track this down?
>> 
>> Jeff
>> 
>> Jeffrey J. Evans
>> jje at purdue.edu
>> http://web.ics.purdue.edu/~evans6/
>> 
>> /opt/mpi/mpich2-1.3.2/64/nemesis-gcc-4.4.0/bin/mpiexec
>> Fatal error in MPI_Init: Other MPI error, error stack:
>> MPIR_Init_thread(385).................:
>> MPID_Init(135)........................: channel initialization failed
>> MPIDI_CH3_Init(38)....................:
>> MPID_nem_init(196)....................:
>> MPIDI_CH3I_Seg_commit(366)............:
>> MPIU_SHMW_Hnd_deserialize(324)........:
>> MPIU_SHMW_Seg_open(863)...............:
>> MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or directory
>> [proxy:0:3 at hpn04] handle_pmi_response (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:417): assert (!closed) failed
>> [proxy:0:3 at hpn04] HYD_pmcd_pmip_control_cmd_cb (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:908): unable to handle PMI response
>> [proxy:0:3 at hpn04] HYDT_dmxu_poll_wait_for_event (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:3 at hpn04] main (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmip.c:221): demux engine error waiting for event
>> [mpiexec at hpn01] HYDT_bscu_wait_for_completion (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:99): one of the processes terminated badly; aborting
>> [mpiexec at hpn01] HYDT_bsci_wait_for_completion (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
>> [mpiexec at hpn01] HYD_pmci_wait_for_completion (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:352): bootstrap server returned error waiting for completion
>> [mpiexec at hpn01] main (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/ui/mpich/mpiexec.c:294): process manager error waiting for completion
>> 
>> 
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji




