[mpich-discuss] mpich2-1.4 works on one host, fails on multiple

Jack D. Galloway jackg at lanl.gov
Fri Dec 9 12:04:19 CST 2011


Sorry, one further detail that I've just discovered.  The program is an
MPI/OpenMP hybrid.  On a hunch, I commented out all the OpenMP code (the
omp_lib include statement, the omp_set_num_threads call, the !$OMP DO
directives, etc.), re-ran across several nodes, and it worked perfectly.
Somehow including the OpenMP code, when running across several nodes,
causes corruption that manifests itself in the MPI_INIT call, the first
real call in the program after the variable declarations and include
statements.  I don't know whether this error is due to OpenMP, MPICH2,
or something else.  Hopefully this helps.
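
For concreteness, the structure of the code is roughly the following
(an illustrative sketch only; the thread count and loop body are
placeholders, not the real program):

  program laplace
    use omp_lib
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, nprocs, i
    real    :: a(1000)

    ! MPI_INIT is the first executable statement; this is where the
    ! cross-node runs die whenever the OpenMP pieces are compiled in.
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    ! Representative OpenMP usage of the kind that was commented out.
    call omp_set_num_threads(4)
    !$OMP PARALLEL
    !$OMP DO
    do i = 1, 1000
       a(i) = real(i)
    end do
    !$OMP END DO
    !$OMP END PARALLEL

    call MPI_FINALIZE(ierr)
  end program laplace

(For what it's worth, hybrid codes are usually advised to initialize
with MPI_INIT_THREAD, requesting MPI_THREAD_FUNNELED, rather than plain
MPI_INIT, although the crash described here occurs before
initialization returns at all.)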

Thanks,
Jack

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
Sent: Friday, December 09, 2011 10:23 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails on multiple

Can you try running the hellow.f example from examples/f77?
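
That example is essentially a bare MPI hello world, roughly along these
lines, so running it across the same machinefile isolates whether plain
MPI startup works independently of your application code:

      program main
      include 'mpif.h'
      integer ierr, rank, size
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
      print *, 'Hello world: rank', rank, 'of', size
      call MPI_FINALIZE(ierr)
      end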

Rajeev

On Dec 9, 2011, at 10:40 AM, Jack D. Galloway wrote:

> I have hit a wall trying to debug an mpich2 error.  When I run on a
> given node, either the headnode or a slave node, as long as the
> machinefile lists only that same node (i.e. ssh to "c302" and have
> the machines file list only "c302"), I get no problems at all and the
> code runs to completion just fine.  If I try to run across nodes,
> though, I get the crash shown below.  I suspect the error may even be
> something in the architectural setup, but I'm fairly stuck on how to
> keep debugging.  (A sample machines file is shown after the debugger
> output below.)  I ran the code through the "ddd" debugger and it
> crashes on the first line of the program on the cross node (it's a
> Fortran program, and crashes on the first line simply naming the
> program, i.e. 'program laplace'), and crashes on the first "step" in
> the ddd debugger in the window pertaining to the instance running on
> the headnode, which spawned the mpiexec job, saying:
>  
> 
> Program received signal SIGINT, Interrupt.
> 
> 0x00002aaaaaee4920 in __read_nocancel () from /lib64/libpthread.so.0
> 
> (gdb) step
> 
>  
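> For completeness, the machines file is nothing exotic, just one
> hostname per line, e.g. for the headnode plus the slave node
> mentioned above:
> 
> tebow
> c302
> 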
> I've pretty well exhausted my troubleshooting on this, and any help
> would be greatly appreciated.  We're running Ubuntu 10.04 (Lucid
> Lynx) with mpich2-1.4.1p1.  Feel free to ask any questions or offer
> some troubleshooting tips.  Thanks,
>  
> ~Jack
>  
> Error when running code:
>  
> galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile machines -np 2 -print-all-exitcodes ./mpich_debug_exec
>  
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 11
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> [proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> [proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec at tebow] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> [mpiexec at tebow] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at tebow] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
> [mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
> galloway at tebow:~/Flow3D/hybrid-test$
