[mpich-discuss] mpich2-1.4 works on one host, fails on multiple

Rajeev Thakur thakur at mcs.anl.gov
Fri Dec 9 12:22:14 CST 2011


Another possibility is that there is some problem with the OpenMP compiler/runtime on that system. Try running an OpenMP-only program (no MPI at all).
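
Something as small as this would do (just a sketch, assuming gfortran and a free-form source file, e.g. omp_only_test.f90 built with "gfortran -fopenmp omp_only_test.f90"):

      program omp_only_test
      use omp_lib
      implicit none
      integer :: tid, nthreads

!     Each thread reports its id; if the OpenMP runtime is broken on a
!     node, even this much will usually show it, with no MPI involved.
!$omp parallel private(tid, nthreads)
      tid = omp_get_thread_num()
      nthreads = omp_get_num_threads()
      print *, 'thread ', tid, ' of ', nthreads
!$omp end parallel
      end program omp_only_test

If that runs cleanly on every node, the OpenMP runtime itself is probably fine.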

Rajeev

On Dec 9, 2011, at 12:14 PM, Anthony Chan wrote:

> 
> Is your program making any MPI calls from within OpenMP pragmas?
> If it does, you need to use MPI_Init_thread with MPI_THREAD_MULTIPLE.
> If not, a correct hybrid program should still use MPI_Init_thread
> with MPI_THREAD_FUNNELED.  I'm not sure whether this is the cause of
> the problem, but you should at least use the correct MPI
> initialization call for your threaded program.
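> 
> A minimal sketch of what I mean (program and variable names here are
> just for illustration; adapt to your code):
> 
>       program hybrid_init
>       implicit none
>       include 'mpif.h'
>       integer :: provided, rank, ierr
> 
> !     Ask MPI for FUNNELED support (MPI called only by the master
> !     thread, outside parallel regions); request MPI_THREAD_MULTIPLE
> !     instead if MPI is called inside OpenMP pragmas.
>       call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
>       if (provided .lt. MPI_THREAD_FUNNELED) then
>          print *, 'MPI provides only thread level ', provided
>       end if
>       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>       print *, 'rank ', rank, ' initialized, thread level ', provided
>       call MPI_Finalize(ierr)
>       end program hybrid_init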
> 
> A.Chan
> 
> ----- Original Message -----
>> Sorry, one further detail that I've just discovered. The program is
>> an MPI/OpenMP hybrid program. On a hunch, I commented out all of the
>> OpenMP code (the include statement, omp_set_num_threads, !$OMP DO
>> directives, etc.), re-ran the program across several nodes, and it
>> worked perfectly. Somehow the inclusion of the OpenMP code, when
>> running across several nodes, causes corruption that manifests
>> itself in the MPI_INIT call, the first real call in the program
>> after variable initialization and include statements. I don't know
>> whether this error is due to OpenMP, MPICH2, or something else.
>> Hopefully this helps.
>> 
>> Thanks,
>> Jack
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
>> Sent: Friday, December 09, 2011 10:23 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails on
>> multiple
>> 
>> Can you try running the hellow.f example from examples/f77?
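>> 
>> It's just a plain MPI hello-world with no OpenMP, roughly along
>> these lines (paraphrased, so use the actual file from your MPICH
>> source tree):
>> 
>>       program main
>>       implicit none
>>       include 'mpif.h'
>>       integer :: ierr, rank, nprocs
>> 
>>       call MPI_INIT(ierr)
>>       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>>       call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
>>       print *, 'Hello world from rank ', rank, ' of ', nprocs
>>       call MPI_FINALIZE(ierr)
>>       end program main
>> 
>> If that also fails across nodes, the problem is in the MPI setup
>> rather than in your application or OpenMP.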
>> 
>> Rajeev
>> 
>> On Dec 9, 2011, at 10:40 AM, Jack D. Galloway wrote:
>> 
>>> I have an mpich2 error that I've hit a wall trying to debug. When I
>>> run on a single node, either the headnode or a slave node, with a
>>> machinefile that lists only that same node (i.e. ssh to "c302" and
>>> have the machines file list only "c302"), I get no problems at all
>>> and the code runs to completion just fine. If I try to run across
>>> multiple nodes, though, I get the crash shown below. I think the
>>> error may even be something in the system setup, but I'm fairly
>>> stuck on how to continue debugging. I ran the code under the "ddd"
>>> debugger: it crashes on the first line of the program on the remote
>>> node (it's a Fortran program, and it crashes on the line that simply
>>> names the program, 'program laplace'), and it crashes on the first
>>> "step" in the ddd window for the instance running on the headnode,
>>> which spawned the mpiexec job, saying:
>>> 
>>> 
>>> Program received signal SIGINT, Interrupt.
>>> 
>>> 0x00002aaaaaee4920 in __read_nocancel () from /lib64/libpthread.so.0
>>> 
>>> (gdb) step
>>> 
>>> 
>>> I've pretty well exhausted my troubleshooting on this, and any help
>>> would be greatly appreciated. We're running Ubuntu 10.04 (Lucid
>>> Lynx) with mpich2-1.4.1p1. Feel free to ask any questions or offer
>>> some troubleshooting tips. Thanks,
>>> 
>>> ~Jack
>>> 
>>> Error when running code:
>>> 
>>> galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile machines -np 2 -print-all-exitcodes ./mpich_debug_exec
>>> 
>>> =====================================================================================
>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> = EXIT CODE: 11
>>> = CLEANING UP REMAINING PROCESSES
>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> =====================================================================================
>>> [proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
>>> [proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
>>> [mpiexec at tebow] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
>>> [mpiexec at tebow] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
>>> [mpiexec at tebow] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
>>> [mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>>> galloway at tebow:~/Flow3D/hybrid-test$
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


