[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK
Pavan Balaji
balaji at mcs.anl.gov
Tue Jan 26 09:51:08 CST 2010
Mario,
This is good information. Yes, it shouldn't matter which process does
the ssh, and yes it is possible that closing stdin is the culprit. Would
you be willing to try out the trunk version of Hydra, which has a bunch
of fixes in this area?
http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra
Note that the trunk has a few critical bugs that I'm working on right
now, so these nightly tarballs are only meant for testing, and not for
production use.
-- Pavan
On 01/21/2010 09:17 PM, Mário Costa wrote:
> Hi again,
>
> I found out the problem comes up only with some specific ssh versions
> (in my case OpenSSH_4.2p1, OpenSSL 0.9.8a 11 Oct 2005), and it depends
> on the order of the executables in the command.
>
> If I test
>
> 1. mpiexec.hydra -bootstrap fork -n 1 /bin/true : ssh gorgon002 hostname
> I get the problem: it hangs, and reports to stderr
>
> bad fd
> ssh_keysign: no reply
> key_sign failed
>
> After some googling I found this
> (http://l-sourcemotel.gsfc.nasa.gov/pipermail/test-proj-1-commits/2006-March/000350.html),
> which looks like the problem I have.
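A minimal sketch of that failure mode (my reconstruction, not from the thread itself): ssh-keysign sanity-checks the descriptors it inherits, and an fstat() on a closed or stale descriptor fails with EBADF ("Bad file descriptor"), which lines up with the "bad fd" message above.

```python
import errno
import os

def probe_closed_fd() -> str:
    """Return the errno name from fstat() on a closed descriptor."""
    fd = os.open(os.devnull, os.O_RDONLY)
    os.close(fd)  # fd is now stale, like a descriptor the parent closed
    try:
        os.fstat(fd)
        return "ok"
    except OSError as e:
        # EBADF is the condition a helper like ssh-keysign trips over
        return errno.errorcode[e.errno]

print(probe_closed_fd())  # -> EBADF
```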
>
> 2. mpiexec.hydra -bootstrap fork -n 1 ssh gorgon002 hostname : /bin/true
>
> Works fine!
>
> Shouldn't it behave the same, independently of the order? Could you be
> closing (or changing) the stdin of the second executable before its
> time?
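A plausible mechanism, as a sketch (assuming the launcher really does close stdin for executables after the first, which the thread suspects but doesn't confirm): POSIX open() always returns the lowest free descriptor, so a child that inherits a closed fd 0 sees its first opened file, a pipe or socket, show up where stdin should be.

```python
import os

def lowest_fd_after_closing_stdin() -> int:
    """In a forked child, close fd 0 and report where the next open() lands."""
    pid = os.fork()
    if pid == 0:
        try:
            os.close(0)  # simulate a launcher closing stdin before exec
        except OSError:
            pass  # fd 0 may already be closed
        fd = os.open(os.devnull, os.O_RDONLY)
        os._exit(fd)  # smuggle the descriptor number out via the exit status
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

print(lowest_fd_after_closing_stdin())  # -> 0: the new file became "stdin"
```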
>
> I replaced hostname with sleep 5m so I could list the open files via
> lsof; note the difference:
>
> 1. mpiexec.hydra -bootstrap fork -n 1 /bin/true : ssh gorgon002 sleep 5m
>
> $>ps auxf
> mjscosta 27653 0.0 0.0 10100 2416 pts/1 Ss 01:00 0:00 |
> \_ -bash
> mjscosta 28825 0.0 0.0 6076 704 pts/1 S+ 02:59 0:00 |
> \_ mpiexec.hydra -bootstrap fork -n 1 /bin/true : ssh gorgon002
> sleep 5m
> mjscosta 28826 0.0 0.0 6220 756 pts/1 S+ 02:59 0:00 |
> \_ /usr/bin/pmi_proxy --launch-mode 1 --proxy-port
> gorgon001 49063 --bootstrap fork --proxy-id 0
> mjscosta 28827 0.0 0.0 0 0 pts/1 Z+ 02:59 0:00 |
> \_ [true] <defunct>
> mjscosta 28828 0.0 0.0 24064 2500 pts/1 S+ 02:59 0:00 |
> \_ ssh gorgon002 sleep 5m
> mjscosta 28829 0.0 0.0 0 0 pts/1 Z+ 02:59 0:00 |
> \_ [ssh-keysign] <defunct>
>
> $> lsof -p 28828
> COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
> ...
> ssh 28828 mjscosta 0u IPv4 1839838 TCP
> gorgon001.lnec.pt:50989->gorgon002.lnec.pt:ssh (ESTABLISHED) <<
> ssh 28828 mjscosta 1w FIFO 0,6 1839832 pipe
> ssh 28828 mjscosta 2w FIFO 0,6 1839833 pipe
> ssh 28828 mjscosta 3u IPv4 1839820 TCP *:58133 (LISTEN)
> ssh 28828 mjscosta 4u IPv4 1839821 TCP *:49063 (LISTEN)
> ssh 28828 mjscosta 5r FIFO 0,6 1839822 pipe
> ssh 28828 mjscosta 6u IPv4 1839827 TCP
> gorgon001.lnec.pt:51911->gorgon001.lnec.pt:49063 (ESTABLISHED)
> ssh 28828 mjscosta 7u IPv4 1839838 TCP
> gorgon001.lnec.pt:50989->gorgon002.lnec.pt:ssh (ESTABLISHED)
> ssh 28828 mjscosta 8w FIFO 0,6 1839823 pipe
> ssh 28828 mjscosta 9w FIFO 0,6 1839829 pipe
> ssh 28828 mjscosta 10w FIFO 0,6 1839824 pipe
> ssh 28828 mjscosta 11r FIFO 0,6 1839830 pipe
> ssh 28828 mjscosta 12w FIFO 0,6 1839832 pipe
> ssh 28828 mjscosta 13r FIFO 0,6 1839831 pipe
> ssh 28828 mjscosta 14w FIFO 0,6 1839839 pipe
> ssh 28828 mjscosta 15w FIFO 0,6 1839833 pipe
> ssh 28828 mjscosta 16r FIFO 0,6 1839840 pipe
> ssh 28828 mjscosta 17w FIFO 0,6 1839832 pipe
> ssh 28828 mjscosta 18w FIFO 0,6 1839833 pipe
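The key detail in the listing above is fd 0: in the hanging case it is a TCP socket, not a pipe or tty. The same check lsof performs can be made directly against /proc (a Linux-specific sketch, independent of the thread):

```python
import os

def describe_fd(pid: int, fd: int) -> str:
    """Resolve what a process's descriptor points at, like lsof's NAME column."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")

# Demonstrate on our own process: open /dev/null and resolve the descriptor.
fd = os.open(os.devnull, os.O_RDONLY)
print(describe_fd(os.getpid(), fd))  # -> /dev/null
os.close(fd)
```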
>
> 2. mpiexec.hydra -bootstrap fork -n 1 ssh gorgon002 sleep 5m : /bin/true
>
> $>ps auxf
> mjscosta 27653 0.0 0.0 10100 2416 pts/1 Ss 01:00 0:00 |
> \_ -bash
> mjscosta 28870 0.0 0.0 6072 704 pts/1 S+ 03:03 0:00 |
> \_ mpiexec.hydra -bootstrap fork -n 1 ssh gorgon002 sleep 5m :
> /bin/true
> mjscosta 28871 0.0 0.0 6216 756 pts/1 S+ 03:03 0:00 |
> \_ /usr/bin/pmi_proxy --launch-mode 1 --proxy-port
> gorgon001 44391 --bootstrap fork --proxy-id 0
> mjscosta 28872 0.4 0.0 24064 2504 pts/1 S+ 03:03 0:00 |
> \_ ssh gorgon002 sleep 5m
> mjscosta 28873 0.0 0.0 0 0 pts/1 Z+ 03:03 0:00 |
> \_ [true] <defunct>
>
> $> lsof -p 28872
> COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
> ...
> ssh 28872 mjscosta 0r FIFO 0,6 1839988 pipe <<
> ssh 28872 mjscosta 1w FIFO 0,6 1839989 pipe
> ssh 28872 mjscosta 2w FIFO 0,6 1839990 pipe
> ssh 28872 mjscosta 3u IPv4 1839979 TCP *:41804 (LISTEN)
> ssh 28872 mjscosta 4u IPv4 1839980 TCP *:44391 (LISTEN)
> ssh 28872 mjscosta 5r FIFO 0,6 1839981 pipe
> ssh 28872 mjscosta 6u IPv4 1839986 TCP
> gorgon001.lnec.pt:45713->gorgon001.lnec.pt:44391 (ESTABLISHED)
> ssh 28872 mjscosta 7r FIFO 0,6 1839988 pipe
> ssh 28872 mjscosta 8w FIFO 0,6 1839982 pipe
> ssh 28872 mjscosta 9u IPv4 1839997 TCP
> gorgon001.lnec.pt:58955->gorgon002.lnec.pt:ssh (ESTABLISHED)
> ssh 28872 mjscosta 10w FIFO 0,6 1839983 pipe
> ssh 28872 mjscosta 12w FIFO 0,6 1839989 pipe
> ssh 28872 mjscosta 13w FIFO 0,6 1839989 pipe
> ssh 28872 mjscosta 14w FIFO 0,6 1839990 pipe
> ssh 28872 mjscosta 15w FIFO 0,6 1839990 pipe
>
> Anyway, it can be solved by updating to a more recent ssh version,
> which is why you can't reproduce it; nonetheless there is something in
> mpiexec.hydra that makes the behavior depend on the order in which the
> executables are listed.
>
> Let me know what you think about this...
>
> Thanks and Regards,
>
> 2010/1/17 Mário Costa <mario.silva.costa at gmail.com>:
>> 2010/1/17 Pavan Balaji <balaji at mcs.anl.gov>:
>>> On 01/16/2010 07:13 PM, Mário Costa wrote:
>>>> I have one question: does mpiexec.hydra aggregate the outputs from
>>>> all launched MPI processes?
>>> Yes.
>>>
>>>> I think it might hang waiting for the output of ssh, which for some
>>>> reason never arrives. Could this be the case?
>>> Yes, that's my guess too. This behavior is also possible if the MPI
>>> processes hang. But an ssh problem seems more likely in this case. In
>>> the previous email, when you tried a non-MPI program, did it hang as well?
>> Yes, the same, in a deterministic way ...
>>> % mpiexec.hydra -rmk pbs hostname
>>>
>>>> Here we use LDAP on the nodes of the cluster; I've read something
>>>> about ssh processes getting defunct due to LDAP ...
>>> Hmm.. This keeps getting more and more interesting :-).
>>>
>>> -- Pavan
>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>>
>>
>>
>> --
>> Mário Costa
>>
>> Laboratório Nacional de Engenharia Civil
>> LNEC.CTI.NTIEC
>> Avenida do Brasil 101
>> 1700-066 Lisboa, Portugal
>> Tel : ++351 21 844 3911
>>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji