[mpich-discuss] mpdboot and hostsfile

Dave Goodell goodell at mcs.anl.gov
Wed Dec 2 14:39:32 CST 2009


Hmm... I'm slightly surprised that "--ncpus=0" ever worked.  Glancing
at the code right now, I don't see anything that would specifically
cause a problem, but it's likely a broken corner case.  Skimming the
code, I bet it will even accept a negative ncpus argument, which
clearly doesn't make any sense.
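
For illustration only, here is a minimal sketch of the kind of check
that would reject such values up front.  This is not the actual mpdboot
code: parse_ncpus is a hypothetical helper, and treating 0 as invalid
is an assumption on my part.

```python
def parse_ncpus(value):
    """Parse an --ncpus value, rejecting anything that is not a positive integer.

    Hypothetical helper for illustration; not taken from mpdboot.
    """
    try:
        ncpus = int(value)
    except (TypeError, ValueError):
        raise ValueError("--ncpus must be an integer, got %r" % (value,))
    # Assumption: a host must offer at least one CPU, so 0 and negatives are rejected.
    if ncpus < 1:
        raise ValueError("--ncpus must be at least 1, got %d" % ncpus)
    return ncpus
```

With a check like this, "--ncpus=0" and "--ncpus=-2" would fail
immediately with a clear message instead of reaching the boot logic.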

Also, it seems strange that this would fail given the fairly minor
modifications in the 1.2.1 mpd.

It sounds like you have a reasonable workaround for now, so I've filed
a ticket to fix this later: https://trac.mcs.anl.gov/projects/mpich2/ticket/963

Another alternative, if you don't need dynamic process support, is to
use the hydra process manager: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager

-Dave

On Dec 1, 2009, at 5:33 PM, Kenin Coloma wrote:

> In mpich2-1.2.1, mpdboot stopped working (upgraded from
> mpich2-1.1.1) for a fairly simple host file
>
> (on compute06)
> mpdboot --totalnum=6 --ncpus=0
>
> host file:
> compute07
> compute08
> compute09
> compute10
> compute11
>
> mpdboot will hang after trying to launch mpd on compute10
>
> [kcoloma@compute06 ~]$ /rd_personalization08/kcoloma/mpich_install/bin/mpdboot --totalnum=6 --ncpus=0 --file=/home/kcoloma/mpiHosts.txt --mpd=/rd_personalization08/kcoloma/mpich_install/bin/mpd --verbose
> running mpdallexit on compute06
> LAUNCHED mpd on compute06  via
> RUNNING: mpd on compute06
> LAUNCHED mpd on compute07  via  compute06
> LAUNCHED mpd on compute08  via  compute06
> LAUNCHED mpd on compute09  via  compute06
> LAUNCHED mpd on compute10  via  compute06
> Traceback (most recent call last):
>   File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 476, in ?
>     mpdboot()
>   File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 347, in mpdboot
>     handle_mpd_output(fd,fd2idx,hostsAndInfo)
>   File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 385, in handle_mpd_output
>     for line in fd.readlines():    # handle output from shells that echo stuff
> KeyboardInterrupt
>
> It will hang as long as --totalnum > 1.
>
> The mpdboot.py scripts are the same between the two versions of
> mpich, but the mpd.py scripts changed to address ticket #905.  I've
> found that rolling back to the mpich2-1.1.1p1 mpd.py fixes the
> mpdboot issue I'm having.


