[mpich-discuss] mpdboot and hostsfile

Dave Goodell goodell at mcs.anl.gov
Fri Dec 4 14:42:40 CST 2009


(reposting this on this thread in case you didn't see the other thread)
---------8<--------
This has been fixed in the trunk.  Anyone who needs a fix in the short  
term should be able to download the following copy of mpd.py and drop  
it into src/pm/mpd/ in their MPICH2 source tree (and then re-install  
MPICH2):

https://trac.mcs.anl.gov/projects/mpich2/export/5923/mpich2/trunk/src/pm/mpd/mpd.py
---------8<--------

By the way, I tested the --ncpus=0 behavior and found that it doesn't
do what you would expect.  mpdboot simply behaves as if you had passed
--ncpus=1 on the mpdboot host.  If you want a "head node" with mpd, you
are probably better off dropping the "--ncpus=0" argument to mpdboot
and passing "-1" to your mpiexec commands so that processes are not
launched on the head node first.  Or use hydra, which has no trouble
launching processes on exactly the hosts that you specify.
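
To illustrate the kind of check that would make this corner case fail
loudly instead of silently acting like --ncpus=1, here is a minimal
sketch.  Note that parse_ncpus is a hypothetical helper written for
this example; it is not taken from mpdboot.py or mpd.py, and the real
option handling may be structured differently.

```python
def parse_ncpus(value):
    """Parse an --ncpus style value, rejecting anything below 1.

    Hypothetical helper for illustration only; not part of MPICH2.
    """
    try:
        ncpus = int(value)
    except (TypeError, ValueError):
        raise ValueError("ncpus must be an integer, got %r" % (value,))
    if ncpus < 1:
        # Zero and negative counts make no sense for a host that runs an mpd.
        raise ValueError("ncpus must be >= 1, got %d" % ncpus)
    return ncpus
```

With a guard like this, "--ncpus=0" or a negative value would produce
an error at startup rather than being accepted.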

-Dave

On Dec 2, 2009, at 3:46 PM, Kenin Coloma wrote:

> Thanks, Dave!
>
> The idea was that we wanted to run the mpd's under root so that
> anyone could use them and have a "job submission" node.  We haven't
> gotten to the point where we needed/wanted to set up real resource
> management/schedulers &c - but hopefully we will!
>
> -kenin
>
> On Wed, Dec 2, 2009 at 12:39 PM, Dave Goodell <goodell at mcs.anl.gov>  
> wrote:
> Hmm... I'm slightly surprised that "--ncpus=0" ever worked.   
> Glancing at the code right now, there's nothing that I see that  
> specifically would cause a problem, but it's likely that's a broken  
> corner case.  Skimming the code I bet that it will even accept a  
> negative ncpus argument, which clearly doesn't make any sense.
>
> Also, it seems strange that this would fail with the fairly minor  
> modifications that are present in the 1.2.1 mpd.
>
> It sounds like you have a reasonable workaround for this right now,  
> so I've filed this as a ticket to fix later: https://trac.mcs.anl.gov/projects/mpich2/ticket/963
>
> Another alternative if you don't need dynamic process support is to  
> use the hydra process manager: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
>
> -Dave
>
>
> On Dec 1, 2009, at 5:33 PM, Kenin Coloma wrote:
>
> In mpich2-1.2.1 (upgraded from mpich2-1.1.1), mpdboot stopped
> working for a fairly simple host file
>
> (on compute06)
> mpdboot --totalnum=6 --ncpus=0
>
> host file:
> compute07
> compute08
> compute09
> compute10
> compute11
>
> mpdboot will hang after trying to launch mpd on compute10
>
> [kcoloma at compute06 ~]$ /rd_personalization08/kcoloma/mpich_install/bin/mpdboot --totalnum=6 --ncpus=0 --file=/home/kcoloma/mpiHosts.txt --mpd=/rd_personalization08/kcoloma/mpich_install/bin/mpd --verbose
> running mpdallexit on compute06
> LAUNCHED mpd on compute06  via
> RUNNING: mpd on compute06
> LAUNCHED mpd on compute07  via  compute06
> LAUNCHED mpd on compute08  via  compute06
> LAUNCHED mpd on compute09  via  compute06
> LAUNCHED mpd on compute10  via  compute06
> Traceback (most recent call last):
>  File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 476, in ?
>    mpdboot()
>  File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 347, in mpdboot
>    handle_mpd_output(fd,fd2idx,hostsAndInfo)
>  File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 385, in handle_mpd_output
>    for line in fd.readlines():    # handle output from shells that echo stuff
> KeyboardInterrupt
>
> It will hang as long as --totalnum > 1.
>
> The mpdboot.py scripts are the same between the two versions of
> mpich, but the mpd.py scripts changed to address ticket #905.  I've
> found that rolling back to the mpich2-1.1.1p1 mpd.py fixes the
> mpdboot issue I'm having.
>
> _______________________________________________
> mpich-discuss mailing list
>
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>


