[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK

Thu Jan 14 21:31:18 CST 2010

Hmm.. This is getting especially difficult since I can't reproduce this
issue on any machine here. This does look like an ssh issue, but I'm not
able to nail that down either.

Would you be able to run the attached program standalone on your system
(without mpiexec), as follows:

% gcc test.c -o test

% ./test gorgon125
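
Since the hang seems to depend on how ssh's standard streams are set up
(the tee versus plain redirect difference reported below), it might also
be worth comparing a few stdin setups directly. The following is only a
rough sketch, not the attached test.c and not anything Hydra does
internally: the function names probe and compare_stdin are made up for
illustration, the 30-second limit is arbitrary, and it assumes GNU
coreutils `timeout` is installed on the node.

```shell
#!/bin/bash
# Hypothetical helper (not part of Hydra or of the attached test.c).
# Runs a command under a time limit and prints "hung" if the limit was
# hit, or "exit N" with the command's exit status otherwise.
# Assumes GNU coreutils `timeout`, which exits with 124 on a timeout.
probe() {
    local limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "hung"
    else
        echo "exit $rc"
    fi
}

# Run the same ssh command with three different stdin setups.
# The host defaults to gorgon125, the node named in this thread.
compare_stdin() {
    local host=${1:-gorgon125}
    echo "stdin inherited: $(probe 30 ssh -x "$host" /bin/true)"
    echo "stdin /dev/null: $(probe 30 ssh -x "$host" /bin/true </dev/null)"
    echo "stdin closed:    $(probe 30 ssh -x "$host" /bin/true <&-)"
}
```

After sourcing the file, `compare_stdin gorgon125` prints one line per
setup. If only some of the setups hang or fail, that would narrow down
whether the way ssh is launched matters.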

 -- Pavan

On 01/14/2010 10:40 AM, Mário Costa wrote:
> Hello,
> 
> I ran test 1 below, and it reproduced the problem.
> 
> See hydra.test1.out for the output.
> In the end the test hung, and I used Ctrl-C to stop it.
> 
> I found this at the beginning of the output:
> 
> debug1: identity file /home/mjscosta/.ssh/identity type -1
> 
> debug3: Not a RSA1 key file /home/mjscosta/.ssh/id_rsa.
> 
> debug2: key_type_from_name: unknown key type '-----BEGIN'
> 
> debug3: key_read: missing keytype
> ...
> debug2: key_type_from_name: unknown key type '-----END'
> 
> debug3: key_read: missing keytype
> 
> debug1: identity file /home/mjscosta/.ssh/id_rsa type 1
> 
> debug1: identity file /home/mjscosta/.ssh/id_dsa type -1
> 
> bad fd
> 
> ssh_keysign: no reply
> 
> key_sign failed
> 
> My id_rsa file has the following header:
> -----BEGIN RSA PRIVATE KEY-----
> 
> See ssh.vvv.out for the output of ssh -vvv gorgon125 (no problem
> with publickey).
> 
> I've noticed the following:
> 
> ssh -vvv gorgon125 2>&1 | tee  ssh.vvv.out
> 
> did not hang, whereas
> 
> ssh -vvv gorgon125 >ssh.vvv.out
> 
> hung.
> 
> 
> Let me know if you still need output from the patched version.
> 
> Regards,
> Mário
> 
> 2010/1/14 Pavan Balaji <balaji at mcs.anl.gov>:
>> Can you try the following two options:
>>
>> 1. Allocate a 2-node partition in interactive mode. Supposing the nodes
>> allocated to you (cat $PBS_NODEFILE) are n0 and n1, run the following
>> command:
>>
>> % mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1 /bin/true
>>
>> ... and send me the output for this (assuming that it shows the error
>> you reported)? Note that the above command does not have -rmk pbs in it.
>>
>> 2. Apply the attached patch to mpich2, and recompile it. Then try
>> running your application as:
>>
>> % mpiexec.hydra -rmk pbs -verbose hostname
>>
>>  -- Pavan
>>
>> On 01/14/2010 04:26 AM, Mário Costa wrote:
>>> Thanks for your reply!
>>>
>>> On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>>> The error itself seems to be thrown by ssh, not Hydra. Based on some
>>>> googling, this seems to be a common problem with host-based
>>>> authentication in ssh. For example, see
>>>> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
>>> You are right, host-based authentication is not set up, only publickey.
>>>
>>>> Can someone check this on your system (my guess is that something is
>>>> wrong with nodes 125 and 126, if it helps)? Alternatively, can you set up
>>>> key-based ssh (either by using a passwordless key, or an ssh agent) to
>>>> work around this?
>>> No problem with ssh to those nodes: if I do it by hand (ssh using
>>> publickey authentication, which I forgot to mention in the test) it
>>> works properly.
>>>
>>> Could it be that hydra is somehow forcing host-based authentication?
>>>
>>>> Note that though both hydra and mpd use ssh, they use different models,
>>>> so which node ssh's to which other node will differ between the two
>>>> process managers.
>>> I wrote some wrapper scripts to create dynamic mpd rings spanning
>>> only the nodes PBS assigned to the job, so in this sense I think it
>>> would be similar!?
>>> All nodes can ssh to each other properly using publickey.
>>>
>>> If you need some additional info let me know.
>>>
>>> Regards,
>>> Mário
>>>>  -- Pavan
>>>>
>>>> On 01/13/2010 05:48 PM, Mário Costa wrote:
>>>>> Hello,
>>>>>
>>>>> I'm currently testing mpiexec.hydra under PBS (Torque) on Enterprise
>>>>> Suse 10, using the default bootstrap server (ssh).
>>>>>
>>>>> With MPD, I've managed to successfully execute jobs using any number
>>>>> of nodes (hosts)/processors.
>>>>> I've set up ssh keys, known_hosts, ... and I've been using a wrapper
>>>>> script to manage mpi rings that comply with the PBS-provided
>>>>> nodes/resources, to execute under PBS...
>>>>>
>>>>> With Hydra, I successfully managed to execute jobs that span only one
>>>>> node; I tested it with four processors or fewer.
>>>>>
>>>>> My test, a shell script:
>>>>>
>>>>> #!/bin/bash
>>>>> env | grep PMI
>>>>>
>>>>> When I submit a job that spans more than one node, I get the
>>>>> following errors.
>>>>>
>>>>> 1. Job hangs until it is killed by PBS for exceeding the time limit
>>>>> (used 3 nodes, 8 procs).
>>>>>
>>>>> stderr:
>>>>>
>>>>> bad fd
>>>>> ssh_keysign: no reply
>>>>> key_sign failed
>>>>> bad fd
>>>>> ssh_keysign: no reply
>>>>> key_sign failed
>>>>> =>> PBS: job killed: walltime 917 exceeded limit 900
>>>>> Killed by signal 15.
>>>>> Killed by signal 15.
>>>>>
>>>>> stdout:
>>>>>
>>>>> PMI_PORT=gorgon127:35454
>>>>> PMI_ID=0
>>>>> PMI_PORT=gorgon127:35454
>>>>> PMI_ID=1
>>>>> PMI_PORT=gorgon127:35454
>>>>> PMI_ID=2
>>>>>
>>>>> ps at the node where PBS executed the script that invoked hydra:
>>>>>
>>>>> userX 10187  0.0  0.0   3928   460 ?        S    23:03   0:00 pbs_demux
>>>>> userX 10205  0.0  0.0   9360  1548 ?        S    23:03   0:00 /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>>>>> userX 10208  0.0  0.0   6072   764 ?        S    23:03   0:00 mpiexec.hydra -rmk pbs ./test.sh
>>>>> userX 10209  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh] <defunct>
>>>>> userX 10210  0.0  0.0  24084  2508 ?        S    23:03   0:00 /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>>>>> userX 10211  0.0  0.0  24088  2508 ?        S    23:03   0:00 /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>>>>> userX 10215  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh-keysign] <defunct>
>>>>> userX 10255  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh-keysign] <defunct>
>>>>> root     10256  0.1  0.0  43580  3520 ?        Ss   23:04   0:00 sshd: userX [priv]
>>>>> userX 10258  0.0  0.0  43580  1968 ?        S    23:04   0:00 sshd: userX at pts/0
>>>>>
>>>>> 2. Job terminated immediately (used 2 nodes, 8 procs), but another
>>>>> test with 2 nodes had the same result as above.
>>>>>
>>>>> stderr:
>>>>>
>>>>> bad fd
>>>>> ssh_keysign: no reply
>>>>> key_sign failed
>>>>> Disconnecting: Bad packet length 232220199.
>>>>>
>>>>> stdout:
>>>>>
>>>>> PMI_PORT=gorgon116:52217
>>>>> PMI_ID=1
>>>>> PMI_PORT=gorgon116:52217
>>>>> PMI_ID=3
>>>>> PMI_PORT=gorgon116:52217
>>>>> PMI_ID=2
>>>>> PMI_PORT=gorgon116:52217
>>>>> PMI_ID=0
>>>>>
>>>>> Any idea what might be wrong?
>>>>>
>>>>> There is something wrong with ssh: in test 1, I ssh'ed to the node and
>>>>> executed the command shown in the ps output, and it executed properly,
>>>>> with the respective partition/PMI_IDs being displayed.
>>>>>
>>>>> Since I've managed to use MPD without any kind of problems, I would
>>>>> presume my ssh is working properly.
>>>>>
>>>>> Could it be that there is something wrong with hydra?
>>>>>
>>>>> Thanks, regards,
>>>>>
>>>>> Mário
>>>>> _______________________________________________
>>>>> mpich-discuss mailing list
>>>>> mpich-discuss at mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>> --
>>>> Pavan Balaji
>>>> http://www.mcs.anl.gov/~balaji
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>>
> 
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 624 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100114/87bbe7e1/attachment.c>