[mpich-discuss] mpich2-1.4 works on one host, fails on multiple

Jack D. Galloway jackg at lanl.gov
Fri Dec 9 11:35:01 CST 2011


Yes, the example works just fine over several nodes:

galloway at tebow:~/examples$ mpiexec -machinefile machines -np 4 ./hellow.exe 
 Process            0  of            4  is alive
 Process            2  of            4  is alive
 Process            1  of            4  is alive
 Process            3  of            4  is alive

galloway at tebow:~/examples$ more machines 
tebow
c301
c302
c303
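
For reference, the hellow test is just the standard MPI hello-world pattern; a
minimal Fortran sketch along these lines (not necessarily the exact
examples/f77/hellow.f source) is:

c     minimal MPI hello-world: init, query rank/size, report, finalize
      program main
      include 'mpif.h'
      integer ierr, rank, size
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'Process ', rank, ' of ', size, ' is alive'
      call MPI_FINALIZE( ierr )
      end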

I was able to glean one bit of evidence from the ddd debugger: the program
crashes on the "MPI_INIT" line, the first MPI call, where I obtained the
following stack trace:

[cli_0]: Command cmd=put kvsname=kvs_4374_0 key=P0-businesscard value=description#tebow$port#43370$ifname#128.165.143.21$ failed, reason='duplicate_keyP0-businesscard'
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(388)....: 
MPID_Init(139)...........: channel initialization failed
MPIDI_CH3_Init(38).......: 
MPID_nem_init(310).......: 
MPIDI_PG_SetConnInfo(630): PMI_KVS_Put returned -1

Program exited with code 01.
(gdb)
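
Since the failure is inside MPI_Init itself, I would expect a bare
init/finalize program with no application code to show the same behavior
across nodes. A minimal sketch (the initonly file names and the mpif77
invocation below are just placeholders):

c     bare init/finalize reproducer: no application code at all
      program initonly
      include 'mpif.h'
      integer ierr
      call MPI_INIT( ierr )
      print *, 'past MPI_INIT, ierr = ', ierr
      call MPI_FINALIZE( ierr )
      end

compiled and run along the lines of:

  mpif77 -g initonly.f -o initonly.exe
  mpiexec -machinefile machines -np 2 ./initonly.exe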

Thanks,
Jack

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
Sent: Friday, December 09, 2011 10:23 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails on multiple

Can you try running the hellow.f example from examples/f77?

Rajeev

On Dec 9, 2011, at 10:40 AM, Jack D. Galloway wrote:

> I have an mpich2 error that I have run into a wall trying to debug.  When
> I run on a given node, either the headnode or a slave node, as long as the
> machinefile lists only that same node (i.e. ssh to "c302" and have the
> machines file list only "c302"), I get no problems at all and the code runs
> to completion just fine (a sketch of that single-node invocation is given
> after the gdb output below).  If I try to run across any nodes, though, I
> get the crashing error given below.  I suspect the problem may even be
> something in the architectural setup, but I'm fairly stuck regarding
> continued debugging.  I ran the code under the "ddd" debugger and it
> crashes on the first line of the program on the cross node (it's a Fortran
> program, and the first line simply names the program: 'program laplace');
> it also crashes on the first "step" in the ddd window pertaining to the
> instance running on the headnode, which spawned the mpiexec job, saying:
>  
> 
> Program received signal SIGINT, Interrupt.
> 
> 0x00002aaaaaee4920 in __read_nocancel () from /lib64/libpthread.so.0
> 
> (gdb) step
> 
>  
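> For reference, the single-node case that works is roughly the following;
> machines.single is just a stand-in name for a one-line machinefile:
> 
>   ssh c302
>   echo c302 > machines.single
>   mpiexec -machinefile machines.single -np 2 ./mpich_debug_exec
> 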
> I've pretty well exhausted my troubleshooting on this, and any help would
> be greatly appreciated.  We're running Ubuntu 10.04 (Lucid Lynx) with
> mpich2-1.4.1p1.  Feel free to ask any questions or offer some
> troubleshooting tips.  Thanks,
>  
> ~Jack
>  
> Error when running code:
>  
> galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile machines -np 2 -print-all-exitcodes ./mpich_debug_exec
> 
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 11
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> [proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> [proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec at tebow] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> [mpiexec at tebow] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at tebow] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
> [mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
> galloway at tebow:~/Flow3D/hybrid-test$

_______________________________________________
mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
To manage subscription options or unsubscribe:
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


