[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK

Mário Costa mario.silva.costa at gmail.com
Fri Jan 15 12:27:23 CST 2010


Hi again Pavan,

I've updated to version 1.2.1 and performed the tests with the patched
version; the result was the same.
Check the outputs:

1.
For test 1, the output is in hydra.test1.txt.

Performing test 1, mpiexec.hydra hangs; then, with Ctrl+C, it dies
with the following message:

(% mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1
/bin/true)

[mpiexec at gorgon125] HYDU_sock_read (./utils/sock/sock.c:277): read
errno (Connection reset by peer)
[mpiexec at gorgon125] HYD_pmcd_pmi_serv_control_cb
(./pm/pmiserv/pmi_serv_cb.c:263): unable to read status from proxy
[mpiexec at gorgon125] HYDT_dmx_wait_for_event
(./tools/demux/demux.c:168): callback returned error status
[mpiexec at gorgon125] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmi_serv_launch.c:499): error waiting for event
[mpiexec at gorgon125] main (./ui/mpiexec/mpiexec.c:277): process manager
error waiting for completion

It looks like it's blocked in a socket after the ssh has exited...
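
To confirm where it is blocked, one thing I could do, just a generic diagnostic idea
rather than something already discussed in the thread, is attach strace to the hung
mpiexec.hydra:

% strace -f -e trace=read,poll,select -p $(pgrep -f mpiexec.hydra)

If the call it is sitting in is a read or poll on the proxy control socket, that
would match the HYDU_sock_read error above.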

2.

For test 2, with the patched version, the outputs are in hydra.test2.stdout.txt and hydra.stderr.txt.

The job was submitted to PBS and hung; it was then removed from
the queue ...
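
For reference, the test 2 job was submitted with a script along these lines; this is
a reconstruction rather than the exact script, and the resource request is only an
assumption:

#!/bin/bash
#PBS -l nodes=2:ppn=4,walltime=00:15:00
cd $PBS_O_WORKDIR
mpiexec.hydra -rmk pbs -verbose hostname > hydra.test2.stdout.txt 2> hydra.stderr.txt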

2010/1/15 Mário Costa <mario.silva.costa at gmail.com>:
> Hi Darius, Pavan,
>
> I checked the key format and it's OK; the key format is rsa, not rsa1, which is
> the reason for the message. Still, ssh authenticates properly...
>
> I've executed test.c with no problem; output:
>
> ./test gorgon125
> gorgon125
> gorgon125
> gorgon125
> gorgon125
>
> I've just checked the patch you sent and noticed the file structure
> is different from the one I have; I'm using version 1.2. I will
> update to the latest, 1.2.1; it's probably solved in that one. I will
> let you know the results after the update.
>
> Best regards,
> Mário
>
> 2010/1/15 Darius Buntinas <buntinas at mcs.anl.gov>:
>> The error message about an unknown key type indicates that your id_rsa key may
>> be in an incorrect format for the version of ssh installed on that system.
>> There are two formats, openssh and ssh2. ssh-keygen has the ability to
>> convert between them (check the man page). Try converting your key, then try
>> again.
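
For reference, the conversion Darius mentions would be along these lines for the
public key, using ssh-keygen's import/export options; the file names are only
placeholders:

% ssh-keygen -e -f ~/.ssh/id_rsa.pub > id_rsa.ssh2.pub    # OpenSSH -> SSH2/RFC4716
% ssh-keygen -i -f id_rsa.ssh2.pub > id_rsa.openssh.pub   # SSH2/RFC4716 -> OpenSSH

In my case ssh already authenticates fine with the existing key, so the conversion
was not needed.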
>>
>> -d
>>
>> On 01/14/2010 09:31 PM, Pavan Balaji wrote:
>>>
>>> Hmm.. This is getting especially difficult since I can't reproduce this
>>> issue on any machine here. This does look like an ssh issue, but I'm not
>>> able to nail that down either.
>>>
>>> Would you be able to run the attached program on your system without
>>> mpiexec (just standalone):
>>>
>>> % gcc test.c -o test
>>>
>>> % ./test gorgon125
>>>
>>>  -- Pavan
>>>
>>> On 01/14/2010 10:40 AM, Mário Costa wrote:
>>>>
>>>> Hello,
>>>>
>>>> I've performed test 1 below; it replicated the problem.
>>>>
>>>> Check hydra.test1.out; it's the output.
>>>> In the end the test hung; I used Ctrl+C to stop ...
>>>>
>>>> I found this at the beginning of the output:
>>>>
>>>> debug1: identity file /home/mjscosta/.ssh/identity type -1
>>>>
>>>> debug3: Not a RSA1 key file /home/mjscosta/.ssh/id_rsa.
>>>>
>>>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>>>>
>>>> debug3: key_read: missing keytype
>>>> ...
>>>> debug2: key_type_from_name: unknown key type '-----END'
>>>>
>>>> debug3: key_read: missing keytype
>>>>
>>>> debug1: identity file /home/mjscosta/.ssh/id_rsa type 1
>>>>
>>>> debug1: identity file /home/mjscosta/.ssh/id_dsa type -1
>>>>
>>>> bad fd
>>>>
>>>> ssh_keysign: no reply
>>>>
>>>> key_sign failed
>>>>
>>>> My id_rsa file has the following header:
>>>> -----BEGIN RSA PRIVATE KEY-----
>>>>
>>>> Check ssh.vvv.out; it's the output of ssh -vvv gorgon125 (no problem
>>>> with publickey).
>>>>
>>>> I've noticed the following:
>>>>
>>>> ssh -vvv gorgon125 2>&1 | tee  ssh.vvv.out
>>>>
>>>> did not hang ...
>>>>
>>>> ssh -vvv gorgon125 > ssh.vvv.out
>>>>
>>>> hung
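
The "bad fd" / ssh_keysign messages make me suspect the file descriptors ssh
inherits rather than the keys themselves. A crude probe, purely a guess on my part,
would be to run ssh by hand with descriptors closed the way a non-interactive
launcher might leave them:

% ssh -x gorgon125 hostname 0<&- 2>&-    # stdin and stderr closed

If that reproduces the bad fd / key_sign failure by hand, Hydra would only be
exposing an ssh/ssh-keysign quirk rather than causing it.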
>>>>
>>>>
>>>> Let me know if you still need output from the patched version.
>>>>
>>>> Regards,
>>>> Mário
>>>>
>>>> 2010/1/14 Pavan Balaji<balaji at mcs.anl.gov>:
>>>>>
>>>>> Can you try the following two options:
>>>>>
>>>>> 1. Allocate a 2-node partition in interactive mode. Suppose the nodes
>>>>> allocated to you (cat $PBS_NODEFILE) are n0 and n1; run the following
>>>>> program:
>>>>>
>>>>> % mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1
>>>>> /bin/true
>>>>>
>>>>> ... and send me the output for this (assuming that it shows the error
>>>>> you reported)? Note that the above command does not have -rmk pbs in it.
>>>>>
>>>>> 2. Apply the attached patch to mpich2, and recompile it. Then try
>>>>> running your application as:
>>>>>
>>>>> % mpiexec.hydra -rmk pbs -verbose hostname
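
A plausible sequence for option 2, assuming the patch applies at the top of the
mpich2 source tree; the patch file name below is only a placeholder:

% cd mpich2-1.2.1        # or wherever the mpich2 source tree lives
% patch -p0 < hydra.patch
% make && make install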
>>>>>
>>>>>  -- Pavan
>>>>>
>>>>> On 01/14/2010 04:26 AM, Mário Costa wrote:
>>>>>>
>>>>>> Thanks for your reply!
>>>>>>
>>>>>> On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji<balaji at mcs.anl.gov>
>>>>>>  wrote:
>>>>>>>
>>>>>>> The error itself seems to be thrown by ssh, not Hydra. Based on some
>>>>>>> googling, this seems to be a common problem with host-based
>>>>>>> authentication in ssh. For example, see
>>>>>>> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
>>>>>>
>>>>>> You are right; host-based authentication is not set up, only publickey.
>>>>>>
>>>>>>> Can someone check this on your system (my guess is that something is
>>>>>>> wrong with nodes 125 and 126, if it helps)? Alternatively, can you set up
>>>>>>> key-based ssh (either by using a passwordless key or an ssh agent) to
>>>>>>> work around this?
>>>>>>
>>>>>> No problem with ssh to those nodes; if I do it by hand (ssh using
>>>>>> publickey authentication, which I forgot to mention in the test) it works
>>>>>> properly.
>>>>>>
>>>>>> Could it be that hydra is somehow forcing host-based authentication?
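
If ssh is indeed attempting host-based authentication on its own, one workaround I
could try, just a guess at this point, would be to pin ssh to publickey in
~/.ssh/config:

Host gorgon* node*
    PreferredAuthentications publickey
    HostbasedAuthentication no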
>>>>>>
>>>>>>> Note that though both hydra and mpd use ssh, they use different models,
>>>>>>> so which node ssh's to which other node will be different between the two
>>>>>>> process managers.
>>>>>>
>>>>>> I wrote some wrapper scripts to create dynamic mpd rings spanning
>>>>>> only the nodes PBS assigned to the job; in that sense I think it would be
>>>>>> similar.
>>>>>> All nodes can ssh to each other properly using publickey.
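
A quick way to re-verify that from inside a job, looping over the PBS-assigned nodes
(just a sanity-check sketch):

for h in $(sort -u $PBS_NODEFILE); do
    ssh -x -o BatchMode=yes $h hostname || echo "ssh to $h FAILED"
done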
>>>>>>
>>>>>> If you need some additional info let me know.
>>>>>>
>>>>>> Regards,
>>>>>> Mário
>>>>>>>
>>>>>>>  -- Pavan
>>>>>>>
>>>>>>> On 01/13/2010 05:48 PM, Mário Costa wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm currently testing mpiexec.hydra under PBS (Torque) on SUSE Linux
>>>>>>>> Enterprise 10, using the default bootstrap server (ssh).
>>>>>>>>
>>>>>>>> With MPD, I've managed to successfully execute jobs using any number
>>>>>>>> of nodes (hosts)/processors.
>>>>>>>> I've set up ssh keys, known_hosts, ...; I've been using a wrapper script
>>>>>>>> to manage mpd rings that comply with the PBS-provided nodes/resources, to
>>>>>>>> execute under PBS...
>>>>>>>>
>>>>>>>> With Hydra, I successfully managed to execute jobs that span one
>>>>>>>> node only; I tested it with four processors and fewer.
>>>>>>>>
>>>>>>>> My test, a shell script:
>>>>>>>>
>>>>>>>> #!/bin/bash
>>>>>>>> env | grep PMI
>>>>>>>>
>>>>>>>> When I submit a job that spans more than one node, I get the
>>>>>>>> following errors.
>>>>>>>>
>>>>>>>> 1. The job hangs until it is killed by PBS for exceeding the time limit;
>>>>>>>> used 3 nodes, 8 procs.
>>>>>>>>
>>>>>>>> stderr:
>>>>>>>>
>>>>>>>> bad fd
>>>>>>>> ssh_keysign: no reply
>>>>>>>> key_sign failed
>>>>>>>> bad fd
>>>>>>>> ssh_keysign: no reply
>>>>>>>> key_sign failed
>>>>>>>> =>>  PBS: job killed: walltime 917 exceeded limit 900
>>>>>>>> Killed by signal 15.
>>>>>>>> Killed by signal 15.
>>>>>>>>
>>>>>>>> stdout:
>>>>>>>>
>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>> PMI_ID=0
>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>> PMI_ID=1
>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>> PMI_ID=2
>>>>>>>>
>>>>>>>> ps at the node where PBS executed the script that invoked hydra:
>>>>>>>>
>>>>>>>> userX 10187  0.0  0.0   3928   460 ?        S    23:03   0:00
>>>>>>>> pbs_demux
>>>>>>>> userX 10205  0.0  0.0   9360  1548 ?        S    23:03   0:00
>>>>>>>> /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>>>>>>>> userX 10208  0.0  0.0   6072   764 ?        S    23:03   0:00
>>>>>>>> mpiexec.hydra -rmk pbs ./test.sh
>>>>>>>> userX 10209  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>>>> [ssh]<defunct>
>>>>>>>> userX 10210  0.0  0.0  24084  2508 ?        S    23:03   0:00
>>>>>>>> /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1
>>>>>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>>>>>>>> userX 10211  0.0  0.0  24088  2508 ?        S    23:03   0:00
>>>>>>>> /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1
>>>>>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>>>>>>>> userX 10215  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>>>> [ssh-keysign]<defunct>
>>>>>>>> userX 10255  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>>>> [ssh-keysign]<defunct>
>>>>>>>> root     10256  0.1  0.0  43580  3520 ?        Ss   23:04   0:00
>>>>>>>> sshd:
>>>>>>>> userX [priv]
>>>>>>>> userX 10258  0.0  0.0  43580  1968 ?        S    23:04   0:00 sshd:
>>>>>>>> userX at pts/0
>>>>>>>>
>>>>>>>> 2. The job terminated immediately; used 2 nodes, 8 procs (but another
>>>>>>>> test with 2 nodes had the same result as above).
>>>>>>>>
>>>>>>>> stderr:
>>>>>>>>
>>>>>>>> bad fd
>>>>>>>> ssh_keysign: no reply
>>>>>>>> key_sign failed
>>>>>>>> Disconnecting: Bad packet length 232220199.
>>>>>>>>
>>>>>>>> stdout:
>>>>>>>>
>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>> PMI_ID=1
>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>> PMI_ID=3
>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>> PMI_ID=2
>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>> PMI_ID=0
>>>>>>>>
>>>>>>>> Any idea what might be wrong?
>>>>>>>>
>>>>>>>> There is something odd with ssh: in test 1, I ssh'ed to the node and
>>>>>>>> executed the command shown in the ps output, and it executed properly,
>>>>>>>> with the respective partition/PMI_IDs being displayed.
>>>>>>>>
>>>>>>>> Since I've managed to use MPD without any kind of problems, I would
>>>>>>>> presume my ssh is working properly.
>>>>>>>>
>>>>>>>> Could it be that there is something wrong with hydra?
>>>>>>>>
>>>>>>>> Thanks, regards,
>>>>>>>>
>>>>>>>> Mário
>>>>>>>
>>>>>>> --
>>>>>>> Pavan Balaji
>>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>>
>>>>> --
>>>>> Pavan Balaji
>>>>> http://www.mcs.anl.gov/~balaji
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
> --
> Mário Costa
>
> Laboratório Nacional de Engenharia Civil
> LNEC.CTI.NTIEC
> Avenida do Brasil 101
> 1700-066 Lisboa, Portugal
> Tel : ++351 21 844 3911
>



-- 
Mário Costa

Laboratório Nacional de Engenharia Civil
LNEC.CTI.NTIEC
Avenida do Brasil 101
1700-066 Lisboa, Portugal
Tel : ++351 21 844 3911
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpiexec.hydra.test.results.tgz
Type: application/x-gzip
Size: 5395 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100115/e9988ce2/attachment.bin>

