[mpich-discuss] mpich2-1.4 works on one host, fails on multiple
Rajeev Thakur
thakur at mcs.anl.gov
Fri Dec 9 11:22:59 CST 2011
Can you try running the hellow.f example from examples/f77?
Rajeev
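
For reference, here is a minimal F77 hello-world along the lines of examples/f77/hellow.f (a sketch, not the exact file shipped with MPICH2), together with the compile and run commands using the same machinefile from the report:

      program hellow
C     Minimal MPI check: each rank reports its rank and the world size.
      implicit none
      include 'mpif.h'
      integer rank, nprocs, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      print *, 'Hello from rank', rank, 'of', nprocs
      call MPI_FINALIZE(ierr)
      end

Compile with the MPICH2 wrapper and launch exactly as before:

  mpif77 hellow.f -o hellow
  mpiexec -machinefile machines -np 2 ./hellow

If this trivial program fails the same way across nodes, the problem is likely in the cross-node setup rather than in the application code.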
On Dec 9, 2011, at 10:40 AM, Jack D. Galloway wrote:
> I have an mpich2 error that I have run into a wall trying to debug. When I run on a single node, either the headnode or a slave node, as long as the machinefile lists only that same node (i.e., ssh to “c302” and have the machines file list only “c302”), I get no problems at all and the code runs to completion just fine. If I try to run across nodes, though, I get the crashing error given below. I think the error may even be something in the architectural setup, but I’m fairly stuck on how to keep debugging. I ran the code under the “ddd” debugger and it crashes on the first line of the program on the remote node (it’s a Fortran program, and it crashes on the first line that simply names the program … ‘program laplace’); it also crashes on the first “step” in the ddd window for the instance running on the headnode, which spawned the mpiexec job, saying:
>
>
> Program received signal SIGINT, Interrupt.
>
> 0x00002aaaaaee4920 in __read_nocancel () from /lib64/libpthread.so.0
>
> (gdb) step
>
>
> I’ve pretty well exhausted my troubleshooting on this, and any help would be greatly appreciated. We’re running Ubuntu 10.04 (Lucid Lynx) with mpich2-1.4.1p1. Feel free to ask any questions or offer troubleshooting tips. Thanks,
>
> ~Jack
>
> Error when running code:
>
> galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile machines -np 2 -print-all-exitcodes ./mpich_debug_exec
>
> =====================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 11
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> [proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> [proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec at tebow] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> [mpiexec at tebow] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at tebow] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
> [mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
> galloway at tebow:~/Flow3D/hybrid-test$
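
Since the ranks die before executing any application code, it can also help to rule out the launcher and ssh setup first. A minimal cross-node sanity check, using only the stock mpiexec and ssh with the host names taken from the report above:

  # Launch a non-MPI command on both hosts with the same machinefile;
  # this exercises only ssh and the Hydra process manager, not the application.
  mpiexec -machinefile machines -np 2 hostname

  # Passwordless ssh from the headnode to the compute node (no prompts, no banners):
  ssh c302 true

  # Every node should see the same MPICH2 install (same path, same version):
  which mpiexec
  ssh c302 which mpiexec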