[mpich-discuss] mpdboot and hostsfile
Dave Goodell
goodell at mcs.anl.gov
Wed Dec 2 14:39:32 CST 2009
Hmm... I'm slightly surprised that "--ncpus=0" ever worked. Glancing
at the code right now, I don't see anything that would specifically
cause a problem, but it's likely a broken corner case.
Skimming the code, I bet that it will even accept a negative ncpus
argument, which clearly doesn't make any sense.
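For illustration, the kind of check that would rule out zero and negative values can be sketched as follows. This is not the actual mpdboot code, just a minimal Python sketch using argparse; the names positive_int and parse_args, the program name, and the defaults of 1 are all assumptions made for the example.

```python
import argparse


def positive_int(text):
    """Parse text as an integer and reject anything below 1."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError("expected an integer, got %r" % text)
    if value < 1:
        raise argparse.ArgumentTypeError("must be at least 1, got %d" % value)
    return value


def parse_args(argv):
    # Hypothetical parser: option names mirror the mpdboot flags used in
    # this thread, but the defaults and program name are assumptions.
    parser = argparse.ArgumentParser(prog="mpdboot-sketch")
    parser.add_argument("--totalnum", type=positive_int, default=1)
    parser.add_argument("--ncpus", type=positive_int, default=1)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--totalnum=6", "--ncpus=2"])
    print(args.totalnum, args.ncpus)
```

With a check like this, "--ncpus=0" or "--ncpus=-3" is rejected at argument-parsing time with a usage error instead of being passed along to mpd.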
Also, it seems strange that this would fail with the fairly minor
modifications that are present in the 1.2.1 mpd.
It sounds like you have a reasonable workaround for this right now, so
I've filed this as a ticket to fix later: https://trac.mcs.anl.gov/projects/mpich2/ticket/963
Another option, if you don't need dynamic process support, is to
use the hydra process manager: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
-Dave
On Dec 1, 2009, at 5:33 PM, Kenin Coloma wrote:
> In mpich2-1.2.1 (upgraded from mpich2-1.1.1), mpdboot stopped
> working for a fairly simple host file.
>
> (on compute06)
> mpdboot --totalnum=6 --ncpus=0
>
> host file:
> compute07
> compute08
> compute09
> compute10
> compute11
>
> mpdboot will hang after trying to launch mpd on compute10
>
> [kcoloma@compute06 ~]$ /rd_personalization08/kcoloma/mpich_install/bin/mpdboot --totalnum=6 --ncpus=0 --file=/home/kcoloma/mpiHosts.txt --mpd=/rd_personalization08/kcoloma/mpich_install/bin/mpd --verbose
> running mpdallexit on compute06
> LAUNCHED mpd on compute06 via
> RUNNING: mpd on compute06
> LAUNCHED mpd on compute07 via compute06
> LAUNCHED mpd on compute08 via compute06
> LAUNCHED mpd on compute09 via compute06
> LAUNCHED mpd on compute10 via compute06
> Traceback (most recent call last):
>   File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 476, in ?
>     mpdboot()
>   File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 347, in mpdboot
>     handle_mpd_output(fd,fd2idx,hostsAndInfo)
>   File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 385, in handle_mpd_output
>     for line in fd.readlines():    # handle output from shells that echo stuff
> KeyboardInterrupt
>
> It will hang as long as --totalnum > 1.
>
> The mpdboot.py scripts are identical in the two versions of mpich,
> but the mpd.py scripts changed to address ticket #905. I've found
> that rolling back to the mpich2-1.1.1p1 mpd.py fixes the mpdboot
> issue I'm having.