[mpich-discuss] mpdboot hanging

Dave Goodell goodell at mcs.anl.gov
Mon Feb 15 08:33:23 CST 2010


We have had some trouble with mpd lately:

https://trac.mcs.anl.gov/projects/mpich2/ticket/963
https://trac.mcs.anl.gov/projects/mpich2/ticket/974

The fix specified in #963 may resolve the problem for you.
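
For context on the hang itself: the traceback quoted below shows the Ctrl-C landing in fd.readlines(). In Python, readlines() on a pipe does not return until it sees end-of-file, so it keeps waiting for as long as anything still holds the write end open. Whether that is what happens with your last host is an assumption on my part, not a diagnosis. The sketch below is not mpdboot code; read_available is a hypothetical helper that only illustrates the difference between waiting for EOF and reading with a timeout (POSIX only, since select() on pipes does not work on Windows).

```python
import os
import select

def read_available(fd, timeout=1.0, bufsize=4096):
    """Return the bytes ready on fd within `timeout` seconds, without waiting for EOF."""
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        return b""               # nothing arrived in time; the caller can give up
    return os.read(fd, bufsize)  # returns as soon as some data is available

r, w = os.pipe()
os.write(w, b"hello from child\n")      # writer stays open, like a shell that has not exited
partial = read_available(r)             # returns right away with the bytes written so far
idle = read_available(r, timeout=0.1)   # no data and no EOF: times out instead of hanging
os.close(w)                             # EOF is only signalled once every writer has closed
with os.fdopen(r, "rb") as f:
    rest = f.readlines()                # safe now: EOF is reachable, so this returns
```

Calling readlines() before os.close(w) in this sketch would block forever, which is the same symptom as an mpdboot that never returns.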

Overall, I recommend using hydra instead:

http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
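
As a rough sketch of what the switch looks like, the snippet below builds a Hydra command line from the same kind of host file (one hostname per line). The helper name hydra_command and the program ./a.out are hypothetical, and I am assuming your build's mpiexec.hydra accepts -f for the host file and -n for the process count; check `mpiexec.hydra -h` for your version. Nothing is launched here; the command is only constructed and printed.

```python
import shlex

def hydra_command(hostfile, nprocs, program, args=()):
    """Build an argv list for mpiexec.hydra; nothing is executed here."""
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    # Assumed flags: -f <hostfile> and -n <process count>.
    return ["mpiexec.hydra", "-f", hostfile, "-n", str(nprocs), program, *args]

# "./a.out" is a placeholder for your own MPI executable.
cmd = hydra_command("mpdfile2", 6, "./a.out")
print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run it, pass cmd to subprocess.run(cmd, check=True). No mpdboot step is needed beforehand.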

-Dave

On Feb 15, 2010, at 1:58 AM, Jason Palmer wrote:

> I tried shuffling the hosts, and it hangs on the last host in the list
> (see output below). I realized that it was working before because I
> was actually calling the mpich2 installed in /opt/mpich2, which uses
> an older gcc without OpenMP support. That mpdboot works fine, but the
> one I installed to use gcc-4.4.3 hangs on the last node, as seen
> below. The mpiCC, etc. that I built work OK, so I guess I could use
> the older mpdboot to launch the mpds, and use the mpiCC, etc. that I
> built to compile. It would be nice to know what the difference
> between the two mpdboots is, though. The mpich2version compilation
> options are the same (I recompiled the one I built several times).
>
> [jason at juggling ~]$ cat mpdfile2
> compute-0-20
> compute-0-16
> compute-0-17
> compute-0-18
> compute-0-19
> [jason at juggling ~]$ mpdboot -f mpdfile2 -n 6 --verbose
> running mpdallexit on juggling.ucsd.edu
> LAUNCHED mpd on juggling.ucsd.edu  via
> RUNNING: mpd on juggling.ucsd.edu
> LAUNCHED mpd on compute-0-20  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-16  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-17  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-18  via  juggling.ucsd.edu
> Traceback (most recent call last):
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 476, in ?
>     mpdboot()
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 347, in mpdboot
>     handle_mpd_output(fd,fd2idx,hostsAndInfo)
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 385, in handle_mpd_output
>     for line in fd.readlines():    # handle output from shells that echo stuff
> KeyboardInterrupt
> [jason at juggling ~]$ mpdtrace
> juggling
> compute-0-18
> compute-0-17
> compute-0-16
> compute-0-20
> [jason at juggling ~]$
>
> Thanks,
> Jason
>
>
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
> Sent: Sunday, February 14, 2010 5:24 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] mpdboot hanging
>
> It shouldn't need --maxbranch. Try shuffling the hosts in the  
> hostfile and see if the problem persists with the same host. In that  
> case, there may be something wrong with the networking configuration  
> for that host.
>
> Or try using the Hydra process manager, which doesn't require  
> setting up MPDs. You can use mpiexec.hydra.
>
> Rajeev
>
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Jason Palmer
> Sent: Friday, February 12, 2010 7:45 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] mpdboot hanging
>
> My problem may involve --maxbranch. I don't recall needing to set
> this before to start 6 mpd processes (one local and 5 on 5 remote
> hosts), but now, to start more than 4 remote mpds (which it says is
> the maxbranch default), I need to set --maxbranch=5, for example.
>
> Maybe I'm misremembering how mpdboot worked. It is supposed to
> return after starting the mpds, right?
>
> Is setting maxbranch always required to start more than 4 remote
> mpds?
>
> Here is what I’m getting, where “mpdfile” contains the hostnames …  
> the traceback occurs after hitting ctrl-c.
>
> [jason at juggling ~]$ mpdboot -f mpdfile -n 7 --verbose --maxbranch=6
> running mpdallexit on juggling.ucsd.edu
> LAUNCHED mpd on juggling.ucsd.edu  via
> RUNNING: mpd on juggling.ucsd.edu
> LAUNCHED mpd on compute-0-16  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-17  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-18  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-19  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-20  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-15  via  juggling.ucsd.edu
> Traceback (most recent call last):
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 476, in ?
>     mpdboot()
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 347, in mpdboot
>     handle_mpd_output(fd,fd2idx,hostsAndInfo)
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 385, in handle_mpd_output
>     for line in fd.readlines():    # handle output from shells that echo stuff
> KeyboardInterrupt
> [jason at juggling ~]$
>
> Thanks,
> Jason
>
>
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Jason Palmer
> Sent: Friday, February 12, 2010 3:13 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] mpdboot hanging
>
> Hi,
> This is probably something simple, but when I run mpdboot with a
> file containing node names, mpd is started on all the nodes but
> the last one in the list (in the mpd.hosts file), and mpdboot hangs
> without returning. If I hit Ctrl-C, it breaks, saying it was in a
> function “handle shells that echo”, with the mpds that were started
> still up.
>
> I ran mpdboot successfully before, as I recall, with no hanging and
> with all the requested mpds being started on all the nodes in the
> file, so it seems like something simple has changed to cause this issue.
>
> Any help greatly appreciated.
>
> Thanks,
> Jason
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


