[mpich-discuss] mpdboot and hostsfile

Kenin Coloma keninc at gmail.com
Wed Dec 2 15:46:54 CST 2009


Thanks, Dave!

The idea was that we wanted to run the mpds under root so that anyone could
use them, with one node acting as a "job submission" node.  We haven't gotten
to the point where we need or want to set up real resource management or
schedulers, but hopefully we will!

-kenin

On Wed, Dec 2, 2009 at 12:39 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:

> Hmm... I'm slightly surprised that "--ncpus=0" ever worked.  Glancing at
> the code right now, there's nothing that I see that specifically would cause
> a problem, but it's likely that's a broken corner case.  Skimming the code I
> bet that it will even accept a negative ncpus argument, which clearly
> doesn't make any sense.
>
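
For what it's worth, here is a minimal sketch of the kind of range check that would catch a zero or negative ncpus up front. This is not mpdboot's actual option parser; it uses the standard argparse module, and the names positive_int and build_parser are made up for illustration. Whether 0 should really be rejected is my assumption.

```python
import argparse


def positive_int(text):
    """Convert a command-line string to an int and require it to be >= 1."""
    value = int(text)  # a ValueError here is reported by argparse as a usage error
    if value < 1:
        raise argparse.ArgumentTypeError(
            "expected an integer >= 1, got %r" % text)
    return value


def build_parser():
    # Hypothetical stand-in for a boot script's options, not mpdboot itself.
    parser = argparse.ArgumentParser(prog="boot-sketch")
    parser.add_argument("--totalnum", type=positive_int, default=1)
    parser.add_argument("--ncpus", type=positive_int, default=1)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("totalnum=%d ncpus=%d" % (args.totalnum, args.ncpus))
```

With this in place, "--ncpus=0" or "--ncpus=-2" exits with a usage error instead of being passed along.
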
> Also, it seems strange that this would fail with the fairly minor
> modifications that are present in the 1.2.1 mpd.
>
> It sounds like you have a reasonable workaround for this right now, so I've
> filed this as a ticket to fix later:
> https://trac.mcs.anl.gov/projects/mpich2/ticket/963
>
> Another alternative if you don't need dynamic process support is to use the
> hydra process manager:
> http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
>
> -Dave
>
>
> On Dec 1, 2009, at 5:33 PM, Kenin Coloma wrote:
>
>> In mpich2-1.2.1, mpdboot stopped working (upgraded from mpich2-1.1.1)
>> for a fairly simple host file
>>
>> (on compute06)
>> mpdboot --totalnum=6 --ncpus=0
>>
>> host file:
>> compute07
>> compute08
>> compute09
>> compute10
>> compute11
>>
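
As a sanity check on the numbers: the five hosts in the file plus the local compute06 give the six mpds requested with --totalnum=6. A rough sketch of that count follows. The function names are hypothetical, and the handling of an optional ":ncpus" suffix and "#" comments is an assumption about the hosts file format, not something taken from mpdboot.

```python
def parse_hosts(text):
    """Return a list of (host, ncpus) pairs from hosts-file text."""
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumed: '#' starts a comment
        if not line:
            continue
        entry = line.split()[0]              # ignore anything after the host
        host, _, ncpus = entry.partition(":")
        hosts.append((host, int(ncpus) if ncpus else 1))  # assumed default of 1
    return hosts


def max_totalnum(hosts, local_host):
    """Count distinct machines available, including the local one."""
    names = {host for host, _ in hosts}
    names.add(local_host)
    return len(names)


HOSTS_TEXT = """\
compute07
compute08
compute09
compute10
compute11
"""

if __name__ == "__main__":
    print(max_totalnum(parse_hosts(HOSTS_TEXT), "compute06"))
```
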
>> mpdboot will hang after trying to launch mpd on compute10
>>
>> [kcoloma at compute06 ~]$
>> /rd_personalization08/kcoloma/mpich_install/bin/mpdboot --totalnum=6
>> --ncpus=0 --file=/home/kcoloma/mpiHosts.txt
>> --mpd=/rd_personalization08/kcoloma/mpich_install/bin/mpd --verbose
>> running mpdallexit on compute06
>> LAUNCHED mpd on compute06  via
>> RUNNING: mpd on compute06
>> LAUNCHED mpd on compute07  via  compute06
>> LAUNCHED mpd on compute08  via  compute06
>> LAUNCHED mpd on compute09  via  compute06
>> LAUNCHED mpd on compute10  via  compute06
>> Traceback (most recent call last):
>>   File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 476, in ?
>>     mpdboot()
>>   File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 347, in mpdboot
>>     handle_mpd_output(fd,fd2idx,hostsAndInfo)
>>   File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 385, in handle_mpd_output
>>     for line in fd.readlines():    # handle output from shells that echo stuff
>> KeyboardInterrupt
>>
>> It will hang as long as --totalnum > 1.
>>
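
The KeyboardInterrupt in the traceback lands inside fd.readlines(), which does not return until the other end closes the pipe, so a remote mpd that never finishes its output looks like a hang. Below is a small sketch of a bounded read that returns after a timeout instead. It is not how mpdboot is written; read_available is a hypothetical helper built on the standard select and os modules, and select on pipes assumes a Unix-like system.

```python
import os
import select


def read_available(fd, timeout_s, max_bytes=4096):
    """Read whatever is ready on fd, waiting at most timeout_s seconds.

    Returns None on timeout, b"" at end of file, otherwise the bytes read.
    """
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        return None                 # nothing arrived in time
    return os.read(fd, max_bytes)   # returns immediately with available data


if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"some output\n")
    print(read_available(r, 1.0))   # data is already waiting
    print(read_available(r, 0.2))   # writer is silent, so this times out
    os.close(w)
    os.close(r)
```

A caller can then report which host went quiet rather than blocking until someone hits Ctrl-C.
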
>> The mpdboot.py scripts are the same between the two versions of mpich, but
>> the mpd.py scripts changed to address ticket #905.  I've found that rolling
>> back to the mpich2-1.1.1p1 mpd.py fixes the mpdboot issue I'm having.
>>
>> _______________________________________________
>> mpich-discuss mailing list
>>
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>