[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK

Mário Costa mario.silva.costa at gmail.com
Fri Jan 15 05:27:36 CST 2010


Hi Darius, Pavan,

I checked the key format and it's OK: the key is rsa, not rsa1, and that's the
reason for the message. Even so, ssh authenticates properly...
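
For anyone who wants to repeat the check, here is a minimal sketch of how the
key header can be inspected. The function name key_format and the labels it
prints are my own, purely for illustration. It only looks at the first line of
the file and does not validate the key itself. The "---- BEGIN SSH2" prefix
for ssh2-style keys is an assumption on my part.

```shell
#!/bin/sh
# Hypothetical helper: classify a private key file by its first line.
# It inspects the header only and does not validate the key.
key_format() {
    # Read just the first line of the file given as $1.
    header=$(head -n 1 "$1") || return 1
    case "$header" in
        "-----BEGIN RSA PRIVATE KEY-----")
            # PEM-style header, as found in my id_rsa.
            echo "pem-rsa" ;;
        "---- BEGIN SSH2 "*)
            # Assumption: ssh2-style keys start with this prefix.
            echo "ssh2" ;;
        *)
            echo "unknown" ;;
    esac
}
```

Usage would be something like `key_format ~/.ssh/id_rsa`. For a file whose
first line is -----BEGIN RSA PRIVATE KEY----- it prints pem-rsa.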

I ran test.c with no problems. Output:

./test gorgon125
gorgon125
gorgon125
gorgon125
gorgon125

I've just checked the patch you sent and noticed that its file structure
is different from the one I have. I'm using version 1.2; I will
update to the latest, 1.2.1, since it is probably solved in that one. I will
let you know the results after the update.

Best regards,
Mário

2010/1/15 Darius Buntinas <buntinas at mcs.anl.gov>:
> The error message about unknown key type indicates that your id_rsa key may
> be in an incorrect format for the version of ssh installed on that system.
>  There are two formats, openssh and ssh2.  ssh-keygen has the ability to
> convert between them (check the man page).  Try converting your key then try
> again.
>
> -d
>
> On 01/14/2010 09:31 PM, Pavan Balaji wrote:
>>
>> Hmm.. This is getting especially difficult since I can't reproduce this
>> issue on any machine here. This does look like an ssh issue, but I'm not
>> able to nail that down either.
>>
>> Would you be able to run the attached program on your system without
>> mpiexec (just standalone):
>>
>> % gcc test.c -o test
>>
>> % ./test gorgon125
>>
>>  -- Pavan
>>
>> On 01/14/2010 10:40 AM, Mário Costa wrote:
>>>
>>> Hello,
>>>
>>> I've performed test 1 below, and it replicated the problem.
>>>
>>> Check hydra.test1.out; it's the output.
>>> In the end the test hung, and I used Ctrl-C to stop it...
>>>
>>> I found this at the beginning of the output:
>>>
>>> debug1: identity file /home/mjscosta/.ssh/identity type -1
>>>
>>> debug3: Not a RSA1 key file /home/mjscosta/.ssh/id_rsa.
>>>
>>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>>>
>>> debug3: key_read: missing keytype
>>> ...
>>> debug2: key_type_from_name: unknown key type '-----END'
>>>
>>> debug3: key_read: missing keytype
>>>
>>> debug1: identity file /home/mjscosta/.ssh/id_rsa type 1
>>>
>>> debug1: identity file /home/mjscosta/.ssh/id_dsa type -1
>>>
>>> bad fd
>>>
>>> ssh_keysign: no reply
>>>
>>> key_sign failed
>>>
>>> My id_rsa file has the following header :
>>> -----BEGIN RSA PRIVATE KEY-----
>>>
>>> Check ssh.vvv.out; it's the output of ssh -vvv gorgon125 (no problem
>>> with publickey)
>>>
>>> I've noticed the following:
>>>
>>> ssh -vvv gorgon125 2>&1 | tee  ssh.vvv.out
>>>
>>> did not hang ...
>>>
>>> ssh -vvv gorgon125>ssh.vvv.out
>>>
>>> hung
>>>
>>>
>>> Let me know if you still need output from the patched version.
>>>
>>> Regards,
>>> Mário
>>>
>>> 2010/1/14 Pavan Balaji<balaji at mcs.anl.gov>:
>>>>
>>>> Can you try the following two options:
>>>>
>>>> 1. Allocate a 2-node partition in interactive mode. Suppose the nodes
>>>> allocated to you (cat $PBS_NODEFILE) are n0 and n1, run the following
>>>> command:
>>>>
>>>> % mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1 /bin/true
>>>>
>>>> ... and send me the output for this (assuming that it shows the error
>>>> you reported)? Note that the above command does not have -rmk pbs in it.
>>>>
>>>> 2. Apply the attached patch to mpich2, and recompile it. Then try
>>>> running your application as:
>>>>
>>>> % mpiexec.hydra -rmk pbs -verbose hostname
>>>>
>>>>  -- Pavan
>>>>
>>>> On 01/14/2010 04:26 AM, Mário Costa wrote:
>>>>>
>>>>> Thanks for your reply!
>>>>>
>>>>> On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji<balaji at mcs.anl.gov>
>>>>>  wrote:
>>>>>>
>>>>>> The error itself seems to be thrown by ssh, not Hydra. Based on some
>>>>>> googling, this seems to be a common problem with host-based
>>>>>> authentication in ssh. For example, see
>>>>>> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
>>>>>
>>>>> You are right, host-based authentication is not set up, only publickey.
>>>>>
>>>>>> Can someone check this on your system (my guess is that something is
>>>>>> wrong with nodes 125 and 126, if it helps)? Alternatively, can you
>>>>>> set up a key-based ssh (either by using a passwordless key, or an ssh
>>>>>> agent) to work around this?
>>>>>
>>>>> No problem with ssh to those nodes, if I do it by hand (ssh using
>>>>> publickey authentication, I forgot to mention in the test)  it works
>>>>> properly.
>>>>>
>>>>> Could it be that hydra is somehow forcing host-based authentication?
>>>>>
>>>>>> Note that though both hydra and mpd use ssh, they use different
>>>>>> models, so which node ssh's to which other node will be different
>>>>>> with both the process managers.
>>>>>
>>>>> I wrote some wrapper scripts to create dynamic mpd rings spanning
>>>>> only the nodes PBS assigned to the job; in this sense I think it would
>>>>> be similar!?
>>>>> All nodes can ssh to each other properly using publickey.
>>>>>
>>>>> If you need some additional info let me know.
>>>>>
>>>>> Regards,
>>>>> Mário
>>>>>>
>>>>>>  -- Pavan
>>>>>>
>>>>>> On 01/13/2010 05:48 PM, Mário Costa wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm currently testing mpiexec.hydra under PBS(Torque) in Enterprise
>>>>>>> Suse 10, using the default bootstrap server (ssh).
>>>>>>>
>>>>>>> With MPD, I've managed to successfully execute jobs using any number
>>>>>>> of nodes(hosts)/processors.
>>>>>>> I've set up ssh keys, known_hosts, ... I've been using a wrapper
>>>>>>> script to manage mpd rings complying with the PBS-provided
>>>>>>> nodes/resources, to execute under PBS...
>>>>>>>
>>>>>>> With Hydra, I successfully managed to execute jobs that span over one
>>>>>>> node only, tested it with four processors and less.
>>>>>>>
>>>>>>> My test, a shell script:
>>>>>>>
>>>>>>> #!/bin/bash
>>>>>>> env | grep PMI
>>>>>>>
>>>>>>> When I submit a job that spans over more than one node I get the
>>>>>>> following errors.
>>>>>>>
>>>>>>> 1. Job hangs till it's killed by PBS due to exceeded time limit; used 3
>>>>>>> nodes, 8 procs.
>>>>>>>
>>>>>>> stderr:
>>>>>>>
>>>>>>> bad fd
>>>>>>> ssh_keysign: no reply
>>>>>>> key_sign failed
>>>>>>> bad fd
>>>>>>> ssh_keysign: no reply
>>>>>>> key_sign failed
>>>>>>> =>>  PBS: job killed: walltime 917 exceeded limit 900
>>>>>>> Killed by signal 15.
>>>>>>> Killed by signal 15.
>>>>>>>
>>>>>>> stdout:
>>>>>>>
>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>> PMI_ID=0
>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>> PMI_ID=1
>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>> PMI_ID=2
>>>>>>>
>>>>>>> ps at the node where PBS executed the script that invoked hydra:
>>>>>>>
>>>>>>> userX 10187  0.0  0.0   3928   460 ?        S    23:03   0:00 pbs_demux
>>>>>>> userX 10205  0.0  0.0   9360  1548 ?        S    23:03   0:00 /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>>>>>>> userX 10208  0.0  0.0   6072   764 ?        S    23:03   0:00 mpiexec.hydra -rmk pbs ./test.sh
>>>>>>> userX 10209  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh]<defunct>
>>>>>>> userX 10210  0.0  0.0  24084  2508 ?        S    23:03   0:00 /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>>>>>>> userX 10211  0.0  0.0  24088  2508 ?        S    23:03   0:00 /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>>>>>>> userX 10215  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh-keysign]<defunct>
>>>>>>> userX 10255  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh-keysign]<defunct>
>>>>>>> root     10256  0.1  0.0  43580  3520 ?        Ss   23:04   0:00 sshd: userX [priv]
>>>>>>> userX 10258  0.0  0.0  43580  1968 ?        S    23:04   0:00 sshd: userX@pts/0
>>>>>>>
>>>>>>> 2. Job terminated immediately; used 2 nodes, 8 procs (but another test
>>>>>>> with 2 nodes had the same result as above).
>>>>>>>
>>>>>>> stderr:
>>>>>>>
>>>>>>> bad fd
>>>>>>> ssh_keysign: no reply
>>>>>>> key_sign failed
>>>>>>> Disconnecting: Bad packet length 232220199.
>>>>>>>
>>>>>>> stdout:
>>>>>>>
>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>> PMI_ID=1
>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>> PMI_ID=3
>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>> PMI_ID=2
>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>> PMI_ID=0
>>>>>>>
>>>>>>> Any idea of what might be wrong ?
>>>>>>>
>>>>>>> There is something wrong with ssh: in test 1, I ssh'ed to the node
>>>>>>> and executed the command shown in the ps output, and it executed
>>>>>>> properly, with the respective partition/PMI_IDs being displayed.
>>>>>>>
>>>>>>> Since I've managed to use MPD without any kind of problems, I would
>>>>>>> presume my ssh is working properly.
>>>>>>>
>>>>>>> Could it be that there is something wrong with hydra?
>>>>>>>
>>>>>>> Thanks, regards,
>>>>>>>
>>>>>>> Mário
>>>>>>> _______________________________________________
>>>>>>> mpich-discuss mailing list
>>>>>>> mpich-discuss at mcs.anl.gov
>>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>>>
>>>>>> --
>>>>>> Pavan Balaji
>>>>>> http://www.mcs.anl.gov/~balaji
>>>> --
>>>> Pavan Balaji
>>>> http://www.mcs.anl.gov/~balaji



-- 
Mário Costa

Laboratório Nacional de Engenharia Civil
LNEC.CTI.NTIEC
Avenida do Brasil 101
1700-066 Lisboa, Portugal
Tel : ++351 21 844 3911

