[mpich-discuss] Interesting Problem

Pavan Balaji balaji at mcs.anl.gov
Thu Feb 17 16:07:14 CST 2011


Hi Evans,

Would you be able to run mpiexec with the -verbose flag and send us the 
output?
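
For example, something along these lines (the host file and application 
names below are only placeholders for whatever you are actually launching):

   /opt/mpi/mpich2-1.3.2/64/nemesis-gcc-4.4.0/bin/mpiexec -verbose \
       -f hosts ./your_app > mpiexec-verbose.log 2>&1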

Thanks,

  -- Pavan

On 02/17/2011 04:01 PM, Evans Jeffrey wrote:
> Pavan,
>
> I have a small 80-core cluster (10 nodes, each with two quad-core processors) that I am running with MPICH2-1.3.2.
>
> My work involves precise process binding and I have come across an interesting problem.
> My application is a tool that emulates parallel applications: it simulates a computation phase, then uses the communication subsystem to move data between processes, and repeats that cycle over and over. It's a much longer story as to what I use it for.
>
> I can use the tool in every possible permutation over subsets of the 80 cores, including the entire machine (all 80 cores).
>
> Today I came across a situation that puzzles me. When trying to bind all but two cores across N nodes, things work fine until I get to 30 cores (on 4 nodes). The error report is below.
>
> You should know that so far MPICH2-1.3.2 has been working perfectly for us, so I don't think it's a build issue.
>
> This problem travels with the nodes; in other words, if I use a different subset of nodes I get the same response (below), except that the error log reports whichever nodes I'm using. For the trace below, my host file is as follows:
>
> hpn01:7 binding=user:1,2,3,4,5,6,7
> hpn02:8 binding=user:0,1,2,3,4,5,6,7
> hpn03:8 binding=user:0,1,2,3,4,5,6,7
> hpn04:7 binding=user:0,1,2,3,4,5,6
>
> My PBS script calls out the nodes (and ppn=x values) explicitly, and normally this is not an issue.
> As I said earlier, I can configure the bindings in virtually any way I want (except this one) and the applications seem to function fine.
>
> Any thoughts on how I might begin to track this down?
>
> Jeff
>
> Jeffrey J. Evans
> jje at purdue.edu
> http://web.ics.purdue.edu/~evans6/
>
> /opt/mpi/mpich2-1.3.2/64/nemesis-gcc-4.4.0/bin/mpiexec
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(385).................:
> MPID_Init(135)........................: channel initialization failed
> MPIDI_CH3_Init(38)....................:
> MPID_nem_init(196)....................:
> MPIDI_CH3I_Seg_commit(366)............:
> MPIU_SHMW_Hnd_deserialize(324)........:
> MPIU_SHMW_Seg_open(863)...............:
> MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or directory
> [proxy:0:3 at hpn04] handle_pmi_response (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:417): assert (!closed) failed
> [proxy:0:3 at hpn04] HYD_pmcd_pmip_control_cmd_cb (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:908): unable to handle PMI response
> [proxy:0:3 at hpn04] HYDT_dmxu_poll_wait_for_event (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:3 at hpn04] main (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmip.c:221): demux engine error waiting for event
> [mpiexec at hpn01] HYDT_bscu_wait_for_completion (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:99): one of the processes terminated badly; aborting
> [mpiexec at hpn01] HYDT_bsci_wait_for_completion (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
> [mpiexec at hpn01] HYD_pmci_wait_for_completion (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:352): bootstrap server returned error waiting for completion
> [mpiexec at hpn01] main (/scratch/program_tarballs/mpich/mpich2-1.3/src/pm/hydra/ui/mpich/mpiexec.c:294): process manager error waiting for completion
>
>
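(For context, the compute-then-communicate cycle described above boils down 
to roughly the following pattern. This is only an illustrative sketch, with a 
made-up iteration count, a sleep standing in for the simulated computation, 
and an arbitrary ring exchange; it is not the actual tool.)

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, iter;
    double sendval, recvval = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    sendval = (double) rank;

    for (iter = 0; iter < 100; iter++) {          /* illustrative iteration count */
        usleep(1000);                             /* stand-in for the simulated computation */
        /* arbitrary ring exchange standing in for the emulated data movement */
        MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, (rank + 1) % size, 0,
                     &recvval, 1, MPI_DOUBLE, (rank - 1 + size) % size, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}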

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

