[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK

Darius Buntinas buntinas at mcs.anl.gov
Thu Jan 14 21:52:26 CST 2010


The error message about the unknown key type indicates that your id_rsa 
key may be in an incorrect format for the version of ssh installed on 
that system.  There are two formats, OpenSSH and SSH2.  ssh-keygen can 
convert between them (check the man page).  Try converting your key, 
then try again.
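For example, something along these lines should convert an SSH2/RFC4716
style private key into the OpenSSH format (the exact flags depend on your
ssh-keygen version, so check the man page; the file names below are just
placeholders, so you don't overwrite a key you still need):

% ssh-keygen -i -f ~/.ssh/id_rsa.ssh2 > ~/.ssh/id_rsa.openssh
% chmod 600 ~/.ssh/id_rsa.openssh

ssh-keygen -e goes the other way (OpenSSH to SSH2).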

-d

On 01/14/2010 09:31 PM, Pavan Balaji wrote:
>
> Hmm.. This is getting especially difficult since I can't reproduce this
> issue on any machine here. This does look like an ssh issue, but I'm not
> able to nail that down either.
>
> Would you be able to run the attached program on your system without
> mpiexec (just standalone):
>
> % gcc test.c -o test
>
> % ./test gorgon125
>
>   -- Pavan
>
> On 01/14/2010 10:40 AM, Mário Costa wrote:
>> Hello,
>>
>> I've performed test 1 below; it replicated the problem.
>>
>> Check hydra.test1.out; it's the output.
>> In the end the test hung, and I used Ctrl-C to stop it ...
>>
>> I found this at the beginning of the output:
>>
>> debug1: identity file /home/mjscosta/.ssh/identity type -1
>>
>> debug3: Not a RSA1 key file /home/mjscosta/.ssh/id_rsa.
>>
>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>>
>> debug3: key_read: missing keytype
>> ...
>> debug2: key_type_from_name: unknown key type '-----END'
>>
>> debug3: key_read: missing keytype
>>
>> debug1: identity file /home/mjscosta/.ssh/id_rsa type 1
>>
>> debug1: identity file /home/mjscosta/.ssh/id_dsa type -1
>>
>> bad fd
>>
>> ssh_keysign: no reply
>>
>> key_sign failed
>>
>> My id_rsa file has the following header:
>> -----BEGIN RSA PRIVATE KEY-----
>>
>> Check ssh.vvv.out; it's the output of ssh -vvv gorgon125 (no problem
>> with publickey).
>>
>> I've noticed the following:
>>
>> ssh -vvv gorgon125 2>&1 | tee  ssh.vvv.out
>>
>> did not hang ...
>>
>> ssh -vvv gorgon125 > ssh.vvv.out
>>
>> hung.
>>
>>
>> Let me know if you still need output from the patched version.
>>
>> Regards,
>> Mário
>>
>> 2010/1/14 Pavan Balaji<balaji at mcs.anl.gov>:
>>> Can you try the following two options:
>>>
>>> 1. Allocate a 2-node partition in interactive mode. Suppose the nodes
>>> allocated to you (cat $PBS_NODEFILE) are n0 and n1; run the following
>>> program:
>>>
>>> % mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1
>>> /bin/true
>>>
>>> ... and send me the output for this (assuming that it shows the error
>>> you reported)? Note that the above command does not have -rmk pbs in it.
>>>
>>> 2. Apply the attached patch to mpich2, and recompile it. Then try
>>> running your application as:
>>>
>>> % mpiexec.hydra -rmk pbs -verbose hostname
>>>
>>>   -- Pavan
>>>
>>> On 01/14/2010 04:26 AM, Mário Costa wrote:
>>>> Thanks for your reply!
>>>>
>>>> On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji<balaji at mcs.anl.gov>  wrote:
>>>>> The error itself seems to be thrown by ssh, not Hydra. Based on some
>>>>> googling, this seems to be a common problem with host-based
>>>>> authentication in ssh. For example, see
>>>>> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
>>>> You are right, host-based authentication is not set up; only publickey is.
>>>>
>>>>> Can someone check this on your system (my guess is that something is
>>>>> wrong with nodes 125 and 126, if it helps)? Alternatively, can you set up
>>>>> key-based ssh (either using a passwordless key or an ssh agent) to
>>>>> work around this?
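>>>>> For example, roughly (assuming OpenSSH and a home directory shared
>>>>> across the nodes; the key file name here is just a placeholder, and you
>>>>> would need to point ssh at it with -i or an IdentityFile entry in
>>>>> ~/.ssh/config):
>>>>>
>>>>> % ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa_nopass
>>>>> % cat ~/.ssh/id_rsa_nopass.pub >> ~/.ssh/authorized_keys
>>>>>
>>>>> Or, with an agent for your existing key:
>>>>>
>>>>> % eval `ssh-agent`
>>>>> % ssh-add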
>>>> No problem with ssh to those nodes; if I do it by hand (ssh using
>>>> publickey authentication, which I forgot to mention in the test) it works
>>>> properly.
>>>>
>>>> Could it be that Hydra is somehow forcing host-based authentication?
>>>>
>>>>> Note that though both hydra and mpd use ssh, they use different models,
>>>>> so which node ssh's to which other node will be different with both the
>>>>> process managers.
>>>> I wrote some wrapper scripts to create dynamic mpd rings spanning only
>>>> the nodes PBS assigned to the job, so in that sense I think it would be
>>>> similar?
>>>> All nodes can ssh to each other properly using publickey.
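>>>> (The wrapper essentially boils down to something like this, assuming a
>>>> shared filesystem and that $PBS_NODEFILE has one line per allocated slot;
>>>> the exact mpd options may differ with the MPICH2 version, and ./my_app is
>>>> just a placeholder:)
>>>>
>>>> sort -u $PBS_NODEFILE > mpd.hosts
>>>> mpdboot -n $(wc -l < mpd.hosts) -f mpd.hosts
>>>> mpiexec -n 8 ./my_app
>>>> mpdallexit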
>>>>
>>>> If you need some additional info let me know.
>>>>
>>>> Regards,
>>>> Mário
>>>>>   -- Pavan
>>>>>
>>>>> On 01/13/2010 05:48 PM, Mário Costa wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I'm currently testing mpiexec.hydra under PBS (Torque) on SUSE Linux
>>>>>> Enterprise 10, using the default bootstrap server (ssh).
>>>>>>
>>>>>> With MPD, I've managed to execute jobs successfully using any number
>>>>>> of nodes (hosts)/processors.
>>>>>> I've set up ssh keys, known_hosts, ... I've been using a wrapper script
>>>>>> to manage mpd rings matching the PBS-provided nodes/resources, in order
>>>>>> to execute under PBS...
>>>>>>
>>>>>> With Hydra, I have successfully managed to execute only jobs that span
>>>>>> a single node; I tested with four processors and fewer.
>>>>>>
>>>>>> My test, a shell script:
>>>>>>
>>>>>> #!/bin/bash
>>>>>> env | grep PMI
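>>>>>>
>>>>>> I submit it with a PBS script roughly along these lines (the resource
>>>>>> request shown here is only illustrative):
>>>>>>
>>>>>> #!/bin/bash
>>>>>> #PBS -l nodes=2:ppn=4
>>>>>> #PBS -l walltime=00:15:00
>>>>>> cd $PBS_O_WORKDIR
>>>>>> mpiexec.hydra -rmk pbs ./test.sh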
>>>>>>
>>>>>> When I submit a job that spans more than one node, I get the
>>>>>> following errors.
>>>>>>
>>>>>> 1. The job hangs until it is killed by PBS for exceeding the time limit
>>>>>> (used 3 nodes, 8 procs).
>>>>>>
>>>>>> stderr:
>>>>>>
>>>>>> bad fd
>>>>>> ssh_keysign: no reply
>>>>>> key_sign failed
>>>>>> bad fd
>>>>>> ssh_keysign: no reply
>>>>>> key_sign failed
>>>>>> =>>  PBS: job killed: walltime 917 exceeded limit 900
>>>>>> Killed by signal 15.
>>>>>> Killed by signal 15.
>>>>>>
>>>>>> stdout:
>>>>>>
>>>>>> PMI_PORT=gorgon127:35454
>>>>>> PMI_ID=0
>>>>>> PMI_PORT=gorgon127:35454
>>>>>> PMI_ID=1
>>>>>> PMI_PORT=gorgon127:35454
>>>>>> PMI_ID=2
>>>>>>
>>>>>> ps output at the node where PBS executed the script that invoked hydra:
>>>>>>
>>>>>> userX 10187  0.0  0.0   3928   460 ?        S    23:03   0:00 pbs_demux
>>>>>> userX 10205  0.0  0.0   9360  1548 ?        S    23:03   0:00
>>>>>> /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>>>>>> userX 10208  0.0  0.0   6072   764 ?        S    23:03   0:00
>>>>>> mpiexec.hydra -rmk pbs ./test.sh
>>>>>> userX 10209  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh]<defunct>
>>>>>> userX 10210  0.0  0.0  24084  2508 ?        S    23:03   0:00
>>>>>> /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1
>>>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>>>>>> userX 10211  0.0  0.0  24088  2508 ?        S    23:03   0:00
>>>>>> /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1
>>>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>>>>>> userX 10215  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>> [ssh-keysign]<defunct>
>>>>>> userX 10255  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>> [ssh-keysign]<defunct>
>>>>>> root     10256  0.1  0.0  43580  3520 ?        Ss   23:04   0:00 sshd:
>>>>>> userX [priv]
>>>>>> userX 10258  0.0  0.0  43580  1968 ?        S    23:04   0:00 sshd:
>>>>>> userX at pts/0
>>>>>>
>>>>>> 2. The job terminated immediately (used 2 nodes, 8 procs), though
>>>>>> another test with 2 nodes had the same result as above.
>>>>>>
>>>>>> stderr:
>>>>>>
>>>>>> bad fd
>>>>>> ssh_keysign: no reply
>>>>>> key_sign failed
>>>>>> Disconnecting: Bad packet length 232220199.
>>>>>>
>>>>>> stdout:
>>>>>>
>>>>>> PMI_PORT=gorgon116:52217
>>>>>> PMI_ID=1
>>>>>> PMI_PORT=gorgon116:52217
>>>>>> PMI_ID=3
>>>>>> PMI_PORT=gorgon116:52217
>>>>>> PMI_ID=2
>>>>>> PMI_PORT=gorgon116:52217
>>>>>> PMI_ID=0
>>>>>>
>>>>>> Any idea what might be wrong?
>>>>>>
>>>>>> There is something wrong with ssh: in test 1, I ssh'd to the node and
>>>>>> executed the command shown in the ps output, and it executed properly,
>>>>>> with the respective partition/PMI_IDs being displayed.
>>>>>>
>>>>>> Since I've managed to use MPD without any kind of problem, I would
>>>>>> presume my ssh is working properly.
>>>>>>
>>>>>> Could it be that there is something wrong with Hydra?
>>>>>>
>>>>>> Thanks, regards,
>>>>>>
>>>>>> Mário
>>>>> --
>>>>> Pavan Balaji
>>>>> http://www.mcs.anl.gov/~balaji
>>>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>>
>>
>>
>>
>
>
>

