[mpich-discuss] mpdboot hanging

Jason Palmer jason at sccn.ucsd.edu
Mon Feb 15 09:27:04 CST 2010


Yes, that fixes it, thanks. Is there a reason to prefer hydra over mpd aside
from fewer bugs? Say, in terms of process maintenance / efficiency in
cleaning up all processes associated with a run (I'm using SGE on Linux), or
speed? My preference would be to use the manager that does the best job of
cleaning up broken runs, or provides the best facility for automatically
killing all of a run's processes on all its nodes. Does hydra improve on the
way MPICH1 did things?

-Jason

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
Sent: Monday, February 15, 2010 6:33 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] mpdboot hanging

We have had some trouble with mpd lately:

https://trac.mcs.anl.gov/projects/mpich2/ticket/963
https://trac.mcs.anl.gov/projects/mpich2/ticket/974

The fix specified in #963 may resolve the problem for you.

Overall, I recommend using hydra instead:

http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager

-Dave

On Feb 15, 2010, at 1:58 AM, Jason Palmer wrote:

> I tried shuffling the hosts; it hangs on the last host in the list
> (see output below). I realized that it was working before because I
> was actually calling the mpich2 installed in /opt/mpich2, which uses
> an older, non-OpenMP-compatible gcc. That mpdboot works fine, but the
> one I installed to use gcc-4.4.3 hangs on the last node, as seen
> below. The mpiCC, etc. that I built work OK, so I guess I could use
> the older mpdboot to launch mpd's, and use the mpiCC etc. that I
> built to compile. It would be nice to know what the difference in
> the mpdboots is, though. The mpich2version compilation options are
> the same (I recompiled the one I built several times).
>
> [jason at juggling ~]$ cat mpdfile2
> compute-0-20
> compute-0-16
> compute-0-17
> compute-0-18
> compute-0-19
> [jason at juggling ~]$ mpdboot -f mpdfile2 -n 6 --verbose
> running mpdallexit on juggling.ucsd.edu
> LAUNCHED mpd on juggling.ucsd.edu  via
> RUNNING: mpd on juggling.ucsd.edu
> LAUNCHED mpd on compute-0-20  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-16  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-17  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-18  via  juggling.ucsd.edu
> Traceback (most recent call last):
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 476, in ?
>     mpdboot()
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 347, in mpdboot
>     handle_mpd_output(fd,fd2idx,hostsAndInfo)
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 385, in handle_mpd_output
>     for line in fd.readlines():    # handle output from shells that echo stuff
> KeyboardInterrupt
> [jason at juggling ~]$ mpdtrace
> juggling
> compute-0-18
> compute-0-17
> compute-0-16
> compute-0-20
> [jason at juggling ~]$
>
> Thanks,
> Jason
>
>
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
> Sent: Sunday, February 14, 2010 5:24 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] mpdboot hanging
>
> It shouldn't need --maxbranch. Try shuffling the hosts in the  
> hostfile and see if the problem persists with the same host. In that  
> case, there may be something wrong with the networking configuration  
> for that host.
>
> Or try using the Hydra process manager, which doesn't require  
> setting up MPDs. You can use mpiexec.hydra.
>
> Rajeev
>
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Jason Palmer
> Sent: Friday, February 12, 2010 7:45 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] mpdboot hanging
>
> My problem may involve --maxbranch. I don't recall needing to set  
> this before to start 6 mpd procs, one local and 5 on 5 remote hosts,  
> but now to start more than 4 remote mpd's, which it says is the  
> maxbranch default, I need to set --maxbranch=5, for example.
>
> Maybe I'm misremembering how mpdboot worked. It is supposed to
> return after starting the mpd's, right?
>
> Is setting maxbranch always required to start more than 4 remote  
> mpd's?
>
> Here is what I'm getting, where "mpdfile" contains the hostnames.
> The traceback occurs after hitting Ctrl-C.
>
> [jason at juggling ~]$ mpdboot -f mpdfile -n 7 --verbose --maxbranch=6
> running mpdallexit on juggling.ucsd.edu
> LAUNCHED mpd on juggling.ucsd.edu  via
> RUNNING: mpd on juggling.ucsd.edu
> LAUNCHED mpd on compute-0-16  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-17  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-18  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-19  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-20  via  juggling.ucsd.edu
> LAUNCHED mpd on compute-0-15  via  juggling.ucsd.edu
> Traceback (most recent call last):
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 476, in ?
>     mpdboot()
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 347, in mpdboot
>     handle_mpd_output(fd,fd2idx,hostsAndInfo)
>   File "/home/jason/mpich2-1.2.1-install/bin/mpdboot", line 385, in handle_mpd_output
>     for line in fd.readlines():    # handle output from shells that echo stuff
> KeyboardInterrupt
> [jason at juggling ~]$
>
> Thanks,
> Jason
>
>
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Jason Palmer
> Sent: Friday, February 12, 2010 3:13 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] mpdboot hanging
>
> Hi,
> This is probably something simple, but when I run mpdboot with a
> file containing node names, mpd is started on all the nodes but
> the last one in the list (in the mpd.hosts file), and mpdboot hangs
> without returning. If I hit Ctrl-C, it breaks, saying it was in a
> function "handle shells that echo", with the mpd's that were started
> still up.
>
> I ran mpdboot successfully before as I recall, with no hanging, and  
> all the mpd's requested being started on all the nodes in the file,  
> so it seems like something simple has changed to cause this issue.
>
> Any help greatly appreciated.
>
> Thanks,
> Jason
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
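
Note on the tracebacks quoted above: in both, the Ctrl-C lands inside a
"for line in fd.readlines()" loop. readlines() on a pipe does not return
until the writer closes its end (EOF), so one possible reading (an
assumption; the thread does not confirm the cause) is that the pipe to the
last host never reached EOF. The Python sketch below is not mpdboot's code.
It only illustrates the general pattern of reading whatever arrives on a
pipe with an idle timeout instead of waiting for EOF. The helper names
read_available_lines and demo, and the child process that prints "ready"
and then sleeps, are hypothetical stand-ins. It relies on select() working
on pipes, which holds on Linux and other POSIX systems.

```python
import os
import select
import subprocess
import sys


def read_available_lines(fd, idle_timeout_s):
    """Collect lines from file descriptor fd without requiring EOF.

    Stops at EOF, or when no data has arrived for idle_timeout_s seconds.
    """
    chunks = []
    while True:
        ready, _, _ = select.select([fd], [], [], idle_timeout_s)
        if not ready:            # idle timeout: the writer is still open but silent
            break
        data = os.read(fd, 4096)
        if not data:             # EOF: the writer closed its end of the pipe
            break
        chunks.append(data)
    return b"".join(chunks).decode().splitlines()


def demo():
    # Hypothetical stand-in for a launched process: it prints one line and
    # then keeps its stdout open, so a plain readlines() would block here.
    child = subprocess.Popen(
        [sys.executable, "-c",
         "import time; print('ready', flush=True); time.sleep(30)"],
        stdout=subprocess.PIPE,
    )
    try:
        return read_available_lines(child.stdout.fileno(), idle_timeout_s=3.0)
    finally:
        child.kill()
        child.wait()
        child.stdout.close()
```

Calling demo() returns the lines the child printed before going quiet,
after roughly the idle timeout, rather than blocking for the child's full
lifetime.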
