[mpich-discuss] mpich2-1.4 works on one host, fails on multiple
    Jack D. Galloway
    jackg at lanl.gov
    Fri Dec  9 10:40:46 CST 2011

I have an mpich2 error that I have run into a wall trying to debug.  When I
run on a single node, either the headnode or a slave node, as long as the
machinefile contains only that same node (i.e. ssh to "c302" and have the
machinefile list only "c302"), I get no problems at all and the code runs
to completion just fine.  If I try to run across nodes, though, I get the
crash shown below.  I think the error may even be something in the cluster
setup, but I'm fairly stuck on how to keep debugging.  I ran the code under
the "ddd" debugger: on the remote node it crashes on the very first line of
the program (it's a Fortran program, and it crashes on the line simply
naming the program, 'program laplace').  In the ddd window for the instance
running on the headnode, which spawned the mpiexec job, it crashes on the
first "step", saying:
 
Program received signal SIGINT, Interrupt.
0x00002aaaaaee4920 in __read_nocancel () from /lib64/libpthread.so.0
(gdb) step
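 
Since it dies on the very first executable line, even a minimal program
ought to show the same behavior across nodes.  Here is a sketch of the
sort of stripped-down test I mean (the name 'mpitest' and the contents are
placeholders, not the actual 'laplace' source):
 
      program mpitest
        implicit none
        include 'mpif.h'
        integer :: ierr, rank, nprocs, namelen
        character(len=MPI_MAX_PROCESSOR_NAME) :: procname
 
        ! Report which host each rank actually landed on
        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
        call MPI_GET_PROCESSOR_NAME(procname, namelen, ierr)
        print *, 'rank ', rank, ' of ', nprocs, ' on ', procname(1:namelen)
        call MPI_FINALIZE(ierr)
      end program mpitest
 
(Compiled with mpif90 and launched with the same mpiexec line as below.)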
 
I've pretty well exhausted my troubleshooting on this, and any help would be
greatly appreciated.  We're running Ubuntu 10.04 (Lucid Lynx) with
mpich2-1.4.1p1.  Feel free to ask any questions or offer some
troubleshooting tips.  Thanks,
 
~Jack
 
Error when running code:
 
galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile machines -np 2 -print-all-exitcodes ./mpich_debug_exec
 
=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec at tebow] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec at tebow] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at tebow] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
galloway at tebow:~/Flow3D/hybrid-test$
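 
For reference, the machines file is just hostnames, one per line.  In this
run it is along these lines (illustrative; the second entry is whichever
slave node is in play):
 
tebow
c302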