[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK
Pavan Balaji
balaji at mcs.anl.gov
Fri Jan 15 19:34:39 CST 2010
Mario,
I looked through the output you sent; it looks like Hydra is getting
launched correctly. I don't see any error in there.
Did you try running some other MPI programs in the examples directory,
such as CPI?
% mpiexec.hydra -rmk pbs ./examples/cpi
-- Pavan
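
Since the multi-node failures in this thread all involve ssh between the
PBS-assigned nodes, it may help to check each node outside of Hydra first.
The following is only a minimal sketch, not part of Hydra or MPICH2: the
function name check_nodes and the SSH override variable are hypothetical,
and the ssh options (-n, BatchMode, PreferredAuthentications,
HostbasedAuthentication) are standard OpenSSH options chosen here on the
assumption that publickey is the only method that should be tried.

```shell
#!/bin/sh
# check_nodes NODEFILE
# Try a non-interactive, publickey-only ssh to each distinct host listed in
# NODEFILE (one host per line, repeats allowed, as in $PBS_NODEFILE).
# SSH is a hypothetical override so the loop can be dry-run without a cluster.
check_nodes() {
    cn_status=0
    for cn_host in $(sort -u "$1"); do
        # -n reads stdin from /dev/null; BatchMode=yes never prompts.
        if "${SSH:-ssh}" -n -o BatchMode=yes \
               -o PreferredAuthentications=publickey \
               -o HostbasedAuthentication=no \
               "$cn_host" hostname >/dev/null 2>&1; then
            echo "$cn_host: ok"
        else
            echo "$cn_host: FAILED"
            cn_status=1
        fi
    done
    return "$cn_status"
}

# Only run automatically when a node file is available (inside a PBS job).
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    check_nodes "$PBS_NODEFILE"
fi
```

Run from inside an interactive PBS allocation, this prints one line per
node; a FAILED line points at plain ssh rather than at the process manager.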
On 01/15/2010 12:27 PM, Mário Costa wrote:
> Hi again Pavan,
>
> I've updated to version 1.2.1 and performed the tests with the patched
> version; the result was the same.
> Check the outputs:
>
> 1.
> For test 1. the output is in hydra.test1.txt
>
> When performing test 1, mpiexec.hydra hangs; then, on Ctrl+C, it dies
> with the following message:
>
> (% mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1
> /bin/true)
>
> [mpiexec@gorgon125] HYDU_sock_read (./utils/sock/sock.c:277): read errno (Connection reset by peer)
> [mpiexec@gorgon125] HYD_pmcd_pmi_serv_control_cb (./pm/pmiserv/pmi_serv_cb.c:263): unable to read status from proxy
> [mpiexec@gorgon125] HYDT_dmx_wait_for_event (./tools/demux/demux.c:168): callback returned error status
> [mpiexec@gorgon125] HYD_pmci_wait_for_completion (./pm/pmiserv/pmi_serv_launch.c:499): error waiting for event
> [mpiexec@gorgon125] main (./ui/mpiexec/mpiexec.c:277): process manager error waiting for completion
>
> It looks like it's blocked on a socket after ssh has exited...
>
> 2.
>
> For test 2 (the patched version), the outputs are in hydra.test2.stdout.txt
> and hydra.stderr.txt
>
> The job was submitted to PBS and hung; it was then removed from
> the queue ...
>
> 2010/1/15 Mário Costa <mario.silva.costa at gmail.com>:
>> Hi Darius, Pavan,
>>
>> I checked the key format and it's OK: the key format is rsa, not rsa1, which
>> is the reason for the message. Still, ssh authenticates properly...
>>
>> I've executed test.c with no problem. Output:
>>
>> ./test gorgon125
>> gorgon125
>> gorgon125
>> gorgon125
>> gorgon125
>>
>> I've just checked the patch you sent and noticed that the file structure
>> is different from the one I have. I'm using version 1.2, so I will
>> update to the latest, 1.2.1; it is probably solved in that one. I will
>> let you know the results after the update.
>>
>> Best regards,
>> Mário
>>
>> 2010/1/15 Darius Buntinas <buntinas at mcs.anl.gov>:
>>> The error message about unknown key type indicates that your id_rsa key may
>>> be in an incorrect format for the version of ssh installed on that system.
>>> There are two formats, openssh and ssh2. ssh-keygen has the ability to
>>> convert between them (check the man page). Try converting your key then try
>>> again.
>>>
>>> -d
>>>
>>> On 01/14/2010 09:31 PM, Pavan Balaji wrote:
>>>> Hmm.. This is getting especially difficult since I can't reproduce this
>>>> issue on any machine here. This does look like an ssh issue, but I'm not
>>>> able to nail that down either.
>>>>
>>>> Would you be able to run the attached program on your system without
>>>> mpiexec (just standalone):
>>>>
>>>> % gcc test.c -o test
>>>>
>>>> % ./test gorgon125
>>>>
>>>> -- Pavan
>>>>
>>>> On 01/14/2010 10:40 AM, Mário Costa wrote:
>>>>> Hello,
>>>>>
>>>>> I've performed test 1 below; it replicated the problem.
>>>>>
>>>>> Check hydra.test1.out; it's the output.
>>>>> In the end the test hung, and I used ^C to stop it ...
>>>>>
>>>>> I found this at the beginning of the output:
>>>>>
>>>>> debug1: identity file /home/mjscosta/.ssh/identity type -1
>>>>>
>>>>> debug3: Not a RSA1 key file /home/mjscosta/.ssh/id_rsa.
>>>>>
>>>>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>>>>>
>>>>> debug3: key_read: missing keytype
>>>>> ...
>>>>> debug2: key_type_from_name: unknown key type '-----END'
>>>>>
>>>>> debug3: key_read: missing keytype
>>>>>
>>>>> debug1: identity file /home/mjscosta/.ssh/id_rsa type 1
>>>>>
>>>>> debug1: identity file /home/mjscosta/.ssh/id_dsa type -1
>>>>>
>>>>> bad fd
>>>>>
>>>>> ssh_keysign: no reply
>>>>>
>>>>> key_sign failed
>>>>>
>>>>> My id_rsa file has the following header :
>>>>> -----BEGIN RSA PRIVATE KEY-----
>>>>>
>>>>> Check ssh.vvv.out; it's the output of ssh -vvv gorgon125 (no problem
>>>>> with publickey)
>>>>>
>>>>> I've noticed the following:
>>>>>
>>>>> ssh -vvv gorgon125 2>&1 | tee ssh.vvv.out
>>>>>
>>>>> did not hang ...
>>>>>
>>>>> ssh -vvv gorgon125>ssh.vvv.out
>>>>>
>>>>> hung
>>>>>
>>>>>
>>>>> Let me know if you still need output from the patched version.
>>>>>
>>>>> Regards,
>>>>> Mário
>>>>>
>>>>> 2010/1/14 Pavan Balaji<balaji at mcs.anl.gov>:
>>>>>> Can you try the following two options:
>>>>>>
>>>>>> 1. Allocate a 2-node partition in interactive mode. Suppose the nodes
>>>>>> allocated to you (cat $PBS_NODEFILE) are n0 and n1; run the following
>>>>>> program:
>>>>>>
>>>>>> % mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1 /bin/true
>>>>>>
>>>>>> ... and send me the output for this (assuming that it shows the error
>>>>>> you reported)? Note that the above command does not have -rmk pbs in it.
>>>>>>
>>>>>> 2. Apply the attached patch to mpich2, and recompile it. Then try
>>>>>> running your application as:
>>>>>>
>>>>>> % mpiexec.hydra -rmk pbs -verbose hostname
>>>>>>
>>>>>> -- Pavan
>>>>>>
>>>>>> On 01/14/2010 04:26 AM, Mário Costa wrote:
>>>>>>> Thanks for your reply!
>>>>>>>
>>>>>>> On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji<balaji at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>> The error itself seems to be thrown by ssh, not Hydra. Based on some
>>>>>>>> googling, this seems to be a common problem with host-based
>>>>>>>> authentication in ssh. For example, see
>>>>>>>> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
>>>>>>> You are right, host-based authentication is not set up, only publickey.
>>>>>>>
>>>>>>>> Can someone check this on your system (my guess is that something is
>>>>>>>> wrong with nodes 125 and 126, if it helps)? Alternatively, can you
>>>>>>>> set up a key-based ssh (either by using a passwordless key, or an ssh
>>>>>>>> agent) to work around this?
>>>>>>> No problem with ssh to those nodes, if I do it by hand (ssh using
>>>>>>> publickey authentication, I forgot to mention in the test) it works
>>>>>>> properly.
>>>>>>>
>>>>>>> Could it be that hydra is somehow forcing host-based authentication?
>>>>>>>
>>>>>>>> Note that though both hydra and mpd use ssh, they use different
>>>>>>>> models, so which node ssh's to which other node will be different
>>>>>>>> with both the process managers.
>>>>>>> I wrote some wrapper scripts to create dynamic mpd rings spanning
>>>>>>> only the nodes PBS assigned to the job; in this sense I think it would
>>>>>>> be similar!?
>>>>>>> All nodes can ssh to each other properly using publickey.
>>>>>>>
>>>>>>> If you need some additional info let me know.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mário
>>>>>>>> -- Pavan
>>>>>>>>
>>>>>>>> On 01/13/2010 05:48 PM, Mário Costa wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'm currently testing mpiexec.hydra under PBS(Torque) in Enterprise
>>>>>>>>> Suse 10, using the default bootstrap server (ssh).
>>>>>>>>>
>>>>>>>>> With MPD, I've managed to successfully execute jobs using any number
>>>>>>>>> of nodes(hosts)/processors.
>>>>>>>>> I've set up ssh keys, known_hosts, ... I've been using a wrapper
>>>>>>>>> script to manage mpi rings complying with the PBS-provided
>>>>>>>>> nodes/resources, to execute under PBS...
>>>>>>>>>
>>>>>>>>> With Hydra, I successfully managed to execute jobs that span one
>>>>>>>>> node only; I tested it with four processors or fewer.
>>>>>>>>>
>>>>>>>>> My test, a shell script:
>>>>>>>>>
>>>>>>>>> #!/bin/bash
>>>>>>>>> env | grep PMI
>>>>>>>>>
>>>>>>>>> When I submit a job that spans more than one node, I get the
>>>>>>>>> following errors.
>>>>>>>>>
>>>>>>>>> 1. Job hangs until it's killed by PBS for exceeding the time limit; used
>>>>>>>>> 3 nodes, 8 procs.
>>>>>>>>>
>>>>>>>>> stderr:
>>>>>>>>>
>>>>>>>>> bad fd
>>>>>>>>> ssh_keysign: no reply
>>>>>>>>> key_sign failed
>>>>>>>>> bad fd
>>>>>>>>> ssh_keysign: no reply
>>>>>>>>> key_sign failed
>>>>>>>>> =>> PBS: job killed: walltime 917 exceeded limit 900
>>>>>>>>> Killed by signal 15.
>>>>>>>>> Killed by signal 15.
>>>>>>>>>
>>>>>>>>> stdout:
>>>>>>>>>
>>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>>> PMI_ID=0
>>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>>> PMI_ID=1
>>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>>> PMI_ID=2
>>>>>>>>>
>>>>>>>>> ps at the node where PBS executed the script that invoked hydra:
>>>>>>>>>
>>>>>>>>> userX 10187 0.0 0.0 3928 460 ? S 23:03 0:00 pbs_demux
>>>>>>>>> userX 10205 0.0 0.0 9360 1548 ? S 23:03 0:00 /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>>>>>>>>> userX 10208 0.0 0.0 6072 764 ? S 23:03 0:00 mpiexec.hydra -rmk pbs ./test.sh
>>>>>>>>> userX 10209 0.0 0.0 0 0 ? Z 23:03 0:00 [ssh] <defunct>
>>>>>>>>> userX 10210 0.0 0.0 24084 2508 ? S 23:03 0:00 /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>>>>>>>>> userX 10211 0.0 0.0 24088 2508 ? S 23:03 0:00 /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>>>>>>>>> userX 10215 0.0 0.0 0 0 ? Z 23:03 0:00 [ssh-keysign] <defunct>
>>>>>>>>> userX 10255 0.0 0.0 0 0 ? Z 23:03 0:00 [ssh-keysign] <defunct>
>>>>>>>>> root 10256 0.1 0.0 43580 3520 ? Ss 23:04 0:00 sshd: userX [priv]
>>>>>>>>> userX 10258 0.0 0.0 43580 1968 ? S 23:04 0:00 sshd: userX@pts/0
>>>>>>>>>
>>>>>>>>> 2. Job terminated immediately, used 2 nodes, 8 procs (but another test
>>>>>>>>> with 2 nodes had the same result as above).
>>>>>>>>>
>>>>>>>>> stderr:
>>>>>>>>>
>>>>>>>>> bad fd
>>>>>>>>> ssh_keysign: no reply
>>>>>>>>> key_sign failed
>>>>>>>>> Disconnecting: Bad packet length 232220199.
>>>>>>>>>
>>>>>>>>> stdout:
>>>>>>>>>
>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>> PMI_ID=1
>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>> PMI_ID=3
>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>> PMI_ID=2
>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>> PMI_ID=0
>>>>>>>>>
>>>>>>>>> Any idea of what might be wrong ?
>>>>>>>>>
>>>>>>>>> There is something wrong with ssh. In test 1, I ssh'd to the node,
>>>>>>>>> executed the command shown in the ps output, and it executed properly
>>>>>>>>> with the respective partition/PMI_IDs being displayed.
>>>>>>>>>
>>>>>>>>> Since I've managed to use MPD without any kind of problem, I would
>>>>>>>>> presume my ssh is working properly.
>>>>>>>>>
>>>>>>>>> Could it be that there is something wrong with hydra?
>>>>>>>>>
>>>>>>>>> Thanks, regards,
>>>>>>>>>
>>>>>>>>> Mário
>>>>>>>>> _______________________________________________
>>>>>>>>> mpich-discuss mailing list
>>>>>>>>> mpich-discuss at mcs.anl.gov
>>>>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>>>>> --
>>>>>>>> Pavan Balaji
>>>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>>>
>>>>>> --
>>>>>> Pavan Balaji
>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>>
>> --
>> Mário Costa
>>
>> Laboratório Nacional de Engenharia Civil
>> LNEC.CTI.NTIEC
>> Avenida do Brasil 101
>> 1700-066 Lisboa, Portugal
>> Tel : ++351 21 844 3911
>>
>
>
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji
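
The key-format question raised in the quoted thread (openssh vs. ssh2) can
be checked by looking at the first line of the private key file. The
following is a rough sketch under stated assumptions: key_format is a
hypothetical helper, the returned labels are made up for illustration, and
the ssh2 header string is what ssh.com-style private keys are assumed to
start with. It only inspects the header and does not validate the key.

```shell
#!/bin/sh
# key_format FILE
# Print a guess at the private-key container format, based only on the
# first line of FILE. The labels printed here are illustrative, not standard.
key_format() {
    kf_first=$(head -n 1 "$1" 2>/dev/null) || { echo "unreadable"; return 1; }
    case $kf_first in
        "-----BEGIN RSA PRIVATE KEY-----")
            echo "openssh-pem-rsa" ;;   # the header reported in this thread
        "-----BEGIN DSA PRIVATE KEY-----")
            echo "openssh-pem-dsa" ;;
        "---- BEGIN SSH2 ENCRYPTED PRIVATE KEY ----")
            echo "ssh2" ;;              # assumed ssh.com-style header
        *)
            echo "unknown" ;;
    esac
}

# Example (not run automatically):
#   key_format "$HOME/.ssh/id_rsa"
```

A key starting with "-----BEGIN RSA PRIVATE KEY-----", as reported in the
thread, would be classified as the openssh PEM form by this sketch.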