[mpich-discuss] mpich2-1.4 works on one host, fails on multiple

Anthony Chan chan at mcs.anl.gov
Fri Dec 9 12:14:57 CST 2011


Is your program making any MPI calls from within OpenMP pragmas?
If it is, you need to initialize with MPI_Init_thread and
MPI_THREAD_MULTIPLE.  If not, a correct hybrid MPI program should
still call MPI_Init_thread with MPI_THREAD_FUNNELED.  I am not sure
this is the cause of the problem, but you should at least use the
correct MPI initialization call for your threaded program.
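
For example, here is a minimal sketch of the FUNNELED case (the
program name and structure are illustrative, not your actual code):

    program hybrid_init
      use mpi                ! or: include 'mpif.h' on older installs
      implicit none
      integer :: provided, ierr

      ! Request FUNNELED: OpenMP threads may exist, but only the
      ! main thread makes MPI calls.
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)

      ! The library reports the level it can actually give you.
      if (provided < MPI_THREAD_FUNNELED) then
         print *, 'MPI does not provide MPI_THREAD_FUNNELED'
         call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
      end if

      ! ... hybrid MPI/OpenMP work goes here ...

      call MPI_Finalize(ierr)
    end program hybrid_init

If you do need MPI_THREAD_MULTIPLE, pass that constant instead and
check 'provided' the same way.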

A.Chan

----- Original Message -----
> Sorry, one further detail that I've just discovered. The program is an
> MPI/OpenMP hybrid program. On a hunch, I commented out all the OpenMP
> pieces (the include statement, omp_set_num_threads, !$OMP DO, etc.) and
> tried to re-run the problem across several nodes, and it worked
> perfectly. Somehow the inclusion of the OpenMP code, when running
> across several nodes, causes corruption that manifests itself in the
> MPI_INIT call, the first real call in the program after variable
> initialization and include statements. I don't know whether this error
> is due to OpenMP, MPICH2, or something else. Hopefully this helps.
> 
> Thanks,
> Jack
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
> Sent: Friday, December 09, 2011 10:23 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails on multiple
> 
> Can you try running the hellow.f example from examples/f77?
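>
> Something like the following should work (assuming the MPICH2
> compiler wrappers are in your PATH; the source path is a
> placeholder):
>
>     cd <mpich2-source>/examples/f77
>     mpif77 hellow.f -o hellow
>     mpiexec -machinefile machines -np 2 ./hellow
>
> If that also fails across nodes, the problem is likely in the
> installation or network setup rather than in your code.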
> 
> Rajeev
> 
> On Dec 9, 2011, at 10:40 AM, Jack D. Galloway wrote:
> 
> > I have an mpich2 error that I have hit a wall trying to debug. When I
> > run on a single node, either the headnode or a slave node, with a
> > machinefile that lists only that node (i.e. ssh to "c302" and have the
> > machines file list only "c302"), I get no problems at all and the code
> > runs to completion just fine. If I try to run across any nodes,
> > though, I get the crash pasted below. I think the error may even be
> > some architectural setup issue, but I'm fairly stuck on how to debug
> > further. I ran the code under the "ddd" debugger and it crashes on the
> > first line of the program on the remote node (it's a Fortran program,
> > and it crashes on the first line, which simply names the program:
> > 'program laplace'). The instance running on the headnode, which
> > spawned the mpiexec job, crashes on the first "step" in its ddd
> > window, saying:
> >
> >
> > Program received signal SIGINT, Interrupt.
> >
> > 0x00002aaaaaee4920 in __read_nocancel () from /lib64/libpthread.so.0
> >
> > (gdb) step
> >
> >
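> > For reference, the machines file here is just a plain list of
> > hostnames, one per line, e.g. (hostnames illustrative):
> >
> >     tebow
> >     c302
> >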
> > I've pretty well exhausted my troubleshooting on this, and any help
> > would be greatly appreciated. We're running Ubuntu 10.04 (Lucid Lynx)
> > with mpich2-1.4.1p1. Feel free to ask any questions or offer some
> > troubleshooting tips. Thanks,
> >
> > ~Jack
> >
> > Error when running code:
> >
> > galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile machines -np 2 -print-all-exitcodes ./mpich_debug_exec
> >
> > =====================================================================================
> > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > = EXIT CODE: 11
> > = CLEANING UP REMAINING PROCESSES
> > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > =====================================================================================
> > [proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> > [proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> > [proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> > [mpiexec at tebow] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> > [mpiexec at tebow] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> > [mpiexec at tebow] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
> > [mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
> > galloway at tebow:~/Flow3D/hybrid-test$
> 
> _______________________________________________
> mpich-discuss mailing list mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

