[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK

Mário Costa mario.silva.costa at gmail.com
Sat Jan 16 19:13:00 CST 2010


Hi Pavan,

I've executed the cpi test; it runs properly, but mpiexec.hydra hangs.
Other tests also execute properly.

Hydra is launching the jobs and they execute and terminate.

I have one question: does mpiexec.hydra aggregate the outputs from all
launched MPI processes?

I think it might hang waiting for the output of ssh, which for some
reason doesn't come out. Could this be the case?
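
A possible check (just a sketch; the /tmp path is arbitrary, and ./examples/cpi
is the test from your mail below): let each process write its output on the
remote side, so mpiexec.hydra has nothing to forward:

% mpiexec.hydra -rmk pbs /bin/sh -c './examples/cpi >> /tmp/hydra.$(hostname).out 2>&1'

If this returns while the per-node files contain the expected output, the hang
is in the output forwarding rather than in the launch itself.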

We use LDAP on the nodes of the cluster here, and I've read something
about ssh processes becoming defunct due to LDAP ...

Thanks for your time!

Best regards,
Mário

2010/1/16 Pavan Balaji <balaji at mcs.anl.gov>:
> Mario,
>
> I looked through the output you sent; it looks like Hydra is getting
> launched correctly. I don't see any error in there.
>
> Did you try running some other MPI programs in the examples directory,
> such as CPI?
>
> % mpiexec.hydra -rmk pbs ./examples/cpi
>
>  -- Pavan
>
> On 01/15/2010 12:27 PM, Mário Costa wrote:
>> Hi again Pavan,
>>
>> I've updated to version 1.2.1 and performed the tests with the patched
>> version; the result was the same.
>> Check the outputs:
>>
>> 1.
>> For test 1, the output is in hydra.test1.txt.
>>
>> Performing test 1, mpiexec.hydra hangs; then with Ctrl+C it dies
>> with the following message:
>>
>> (% mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1
>> /bin/true)
>>
>> [mpiexec at gorgon125] HYDU_sock_read (./utils/sock/sock.c:277): read
>> errno (Connection reset by peer)
>> [mpiexec at gorgon125] HYD_pmcd_pmi_serv_control_cb
>> (./pm/pmiserv/pmi_serv_cb.c:263): unable to read status from proxy
>> [mpiexec at gorgon125] HYDT_dmx_wait_for_event
>> (./tools/demux/demux.c:168): callback returned error status
>> [mpiexec at gorgon125] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmi_serv_launch.c:499): error waiting for event
>> [mpiexec at gorgon125] main (./ui/mpiexec/mpiexec.c:277): process manager
>> error waiting for completion
>>
>> Looks like it's blocked on a socket after the ssh has exited...
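>>
>> A possible way to see where it is blocked (just a suggestion, assuming the
>> usual Linux tools are installed on the node) is to attach to the hung
>> mpiexec and look at the call it is stuck in and the descriptors it holds:
>>
>> % pgrep -f mpiexec.hydra     # find the pid of the hung mpiexec
>> % strace -p <pid>            # shows the read()/poll() it is blocked in
>> % lsof -p <pid>              # lists the sockets/pipes it still holds open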
>>
>> 2.
>>
>> With the patched version (test 2), the outputs are in hydra.test2.stdout.txt and hydra.stderr.txt.
>>
>> The job was submitted to PBS and hung, then it was removed from
>> the queue ...
>>
>> 2010/1/15 Mário Costa <mario.silva.costa at gmail.com>:
>>> Hi Darius, Pavan,
>>>
>>> I checked the key format and it's ok; the key format is rsa, not rsa1, which is
>>> the reason for the message. Still, ssh authenticates properly...
>>>
>>> I've executed test.c with no problem; output:
>>>
>>> ./test gorgon125
>>> gorgon125
>>> gorgon125
>>> gorgon125
>>> gorgon125
>>>
>>> I've just checked the patch you sent and noticed the file structure
>>> is different from the one I have. I'm using version 1.2, so I will
>>> update to the latest 1.2.1; probably it's solved in that one. I will
>>> let you know the results after the update.
>>>
>>> Best regards,
>>> Mário
>>>
>>> 2010/1/15 Darius Buntinas <buntinas at mcs.anl.gov>:
>>>> The error message about the unknown key type indicates that your id_rsa key may
>>>> be in an incorrect format for the version of ssh installed on that system.
>>>> There are two formats, openssh and ssh2.  ssh-keygen has the ability to
>>>> convert between them (check the man page).  Try converting your key, then try
>>>> again.
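>>>>
>>>> A minimal sketch of the conversion (filenames assumed): ssh-keygen's -e and
>>>> -i options export to and import from the RFC4716/SSH2 public key format.
>>>>
>>>> % ssh-keygen -e -f ~/.ssh/id_rsa.pub > ~/.ssh/id_rsa_rfc4716.pub   # OpenSSH -> SSH2
>>>> % ssh-keygen -i -f ~/.ssh/id_rsa_rfc4716.pub                       # SSH2 -> OpenSSH, to stdout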
>>>>
>>>> -d
>>>>
>>>> On 01/14/2010 09:31 PM, Pavan Balaji wrote:
>>>>> Hmm.. This is getting especially difficult since I can't reproduce this
>>>>> issue on any machine here. This does look like an ssh issue, but I'm not
>>>>> able to nail that down either.
>>>>>
>>>>> Would you be able to run the attached program on your system without
>>>>> mpiexec (just standalone):
>>>>>
>>>>> % gcc test.c -o test
>>>>>
>>>>> % ./test gorgon125
>>>>>
>>>>>  -- Pavan
>>>>>
>>>>> On 01/14/2010 10:40 AM, Mário Costa wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I've performed test 1 below; it replicated the problem.
>>>>>>
>>>>>> Check hydra.test1.out; it's the output.
>>>>>> In the end the test hung; I used Ctrl+C to stop ...
>>>>>>
>>>>>> I found this at the beginning of the output:
>>>>>>
>>>>>> debug1: identity file /home/mjscosta/.ssh/identity type -1
>>>>>>
>>>>>> debug3: Not a RSA1 key file /home/mjscosta/.ssh/id_rsa.
>>>>>>
>>>>>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>>>>>>
>>>>>> debug3: key_read: missing keytype
>>>>>> ...
>>>>>> debug2: key_type_from_name: unknown key type '-----END'
>>>>>>
>>>>>> debug3: key_read: missing keytype
>>>>>>
>>>>>> debug1: identity file /home/mjscosta/.ssh/id_rsa type 1
>>>>>>
>>>>>> debug1: identity file /home/mjscosta/.ssh/id_dsa type -1
>>>>>>
>>>>>> bad fd
>>>>>>
>>>>>> ssh_keysign: no reply
>>>>>>
>>>>>> key_sign failed
>>>>>>
>>>>>> My id_rsa file has the following header:
>>>>>> -----BEGIN RSA PRIVATE KEY-----
>>>>>>
>>>>>> Check ssh.vvv.out; it's the output of ssh -vvv gorgon125 (no problem
>>>>>> with publickey).
>>>>>>
>>>>>> I've noticed the following:
>>>>>>
>>>>>> ssh -vvv gorgon125 2>&1 | tee ssh.vvv.out
>>>>>>
>>>>>> did not hang ...
>>>>>>
>>>>>> ssh -vvv gorgon125 > ssh.vvv.out
>>>>>>
>>>>>> hung
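>>>>>>
>>>>>> A further cross-check, just a suggestion (the output filename is assumed):
>>>>>> redirect both streams to a file, so the only difference from the tee case
>>>>>> is the pipe itself.
>>>>>>
>>>>>> % ssh -vvv gorgon125 > ssh.vvv.both.out 2>&1
>>>>>>
>>>>>> If this still hangs, the pipe is what makes the tee case work; if it does
>>>>>> not, the hang is tied to stderr staying attached to the terminal.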
>>>>>>
>>>>>>
>>>>>> Let me know if you still need output from the patched version.
>>>>>>
>>>>>> Regards,
>>>>>> Mário
>>>>>>
>>>>>> 2010/1/14 Pavan Balaji<balaji at mcs.anl.gov>:
>>>>>>> Can you try the following two options:
>>>>>>>
>>>>>>> 1. Allocate a 2-node partition in interactive mode. Suppose the nodes
>>>>>>> allocated to you (cat $PBS_NODEFILE) are n0 and n1; run the following
>>>>>>> program:
>>>>>>>
>>>>>>> % mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1
>>>>>>> /bin/true
>>>>>>>
>>>>>>> ... and send me the output for this (assuming that it shows the error
>>>>>>> you reported)? Note that the above command does not have -rmk pbs in it.
>>>>>>>
>>>>>>> 2. Apply the attached patch to mpich2, and recompile it. Then try
>>>>>>> running your application as:
>>>>>>>
>>>>>>> % mpiexec.hydra -rmk pbs -verbose hostname
>>>>>>>
>>>>>>>  -- Pavan
>>>>>>>
>>>>>>> On 01/14/2010 04:26 AM, Mário Costa wrote:
>>>>>>>> Thanks for your reply!
>>>>>>>>
>>>>>>>> On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji<balaji at mcs.anl.gov>
>>>>>>>>  wrote:
>>>>>>>>> The error itself seems to be thrown by ssh, not Hydra. Based on some
>>>>>>>>> googling, this seems to be a common problem with host-based
>>>>>>>>> authentication in ssh. For example, see
>>>>>>>>> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
>>>>>>>> You are right, host-based authentication is not set up, only publickey.
>>>>>>>>
>>>>>>>>> Can someone check this on your system (my guess is that something is
>>>>>>>>> wrong with nodes 125 and 126, if it helps)? Alternatively, can you set up
>>>>>>>>> key-based ssh (either by using a passwordless key or an ssh agent) to
>>>>>>>>> work around this?
>>>>>>>> No problem with ssh to those nodes; if I do it by hand (ssh using
>>>>>>>> publickey authentication, which I forgot to mention in the test) it works
>>>>>>>> properly.
>>>>>>>>
>>>>>>>> Could it be that hydra is somehow forcing host-based authentication?
>>>>>>>>
>>>>>>>>> Note that though both hydra and mpd use ssh, they use different models,
>>>>>>>>> so which node ssh's to which other node will differ between the two
>>>>>>>>> process managers.
>>>>>>>> I wrote some wrapper scripts to create dynamic mpd rings spanning
>>>>>>>> only the nodes PBS assigned to the job, so in that sense I think it would be
>>>>>>>> similar!?
>>>>>>>> All nodes can ssh to each other properly using publickey.
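>>>>>>>>
>>>>>>>> A quick non-interactive check, just a sketch (run from inside a PBS job,
>>>>>>>> using the $PBS_NODEFILE that Torque provides), since manual ssh works but
>>>>>>>> ssh launched by hydra does not:
>>>>>>>>
>>>>>>>> for h in $(sort -u $PBS_NODEFILE); do
>>>>>>>>   ssh -x -o BatchMode=yes -o ConnectTimeout=5 $h true \
>>>>>>>>     && echo "$h ok" || echo "$h FAILED"
>>>>>>>> done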
>>>>>>>>
>>>>>>>> If you need some additional info let me know.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mário
>>>>>>>>>  -- Pavan
>>>>>>>>>
>>>>>>>>> On 01/13/2010 05:48 PM, Mário Costa wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I'm currently testing mpiexec.hydra under PBS (Torque) on Enterprise
>>>>>>>>>> SUSE 10, using the default bootstrap server (ssh).
>>>>>>>>>>
>>>>>>>>>> With MPD, I've managed to successfully execute jobs using any number
>>>>>>>>>> of nodes (hosts)/processors.
>>>>>>>>>> I've set up ssh keys, known_hosts, ...; I've been using a wrapper script
>>>>>>>>>> to manage mpd rings conforming to the PBS-provided nodes/resources, in
>>>>>>>>>> order to execute under PBS...
>>>>>>>>>>
>>>>>>>>>> With Hydra, I have successfully managed to execute only jobs that span a
>>>>>>>>>> single node; I tested it with four processors and fewer.
>>>>>>>>>>
>>>>>>>>>> My test, a shell script:
>>>>>>>>>>
>>>>>>>>>> #!/bin/bash
>>>>>>>>>> env | grep PMI
>>>>>>>>>>
>>>>>>>>>> When I submit a job that spans more than one node, I get the
>>>>>>>>>> following errors.
>>>>>>>>>>
>>>>>>>>>> 1. Job hangs until it is killed by PBS due to the exceeded time limit;
>>>>>>>>>> used 3 nodes, 8 procs.
>>>>>>>>>>
>>>>>>>>>> stderr:
>>>>>>>>>>
>>>>>>>>>> bad fd
>>>>>>>>>> ssh_keysign: no reply
>>>>>>>>>> key_sign failed
>>>>>>>>>> bad fd
>>>>>>>>>> ssh_keysign: no reply
>>>>>>>>>> key_sign failed
>>>>>>>>>> =>>  PBS: job killed: walltime 917 exceeded limit 900
>>>>>>>>>> Killed by signal 15.
>>>>>>>>>> Killed by signal 15.
>>>>>>>>>>
>>>>>>>>>> stdout:
>>>>>>>>>>
>>>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>>>> PMI_ID=0
>>>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>>>> PMI_ID=1
>>>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>>>> PMI_ID=2
>>>>>>>>>>
>>>>>>>>>> ps at the node where PBS executed the script that invoked hydra:
>>>>>>>>>>
>>>>>>>>>> userX 10187  0.0  0.0   3928   460 ?        S    23:03   0:00
>>>>>>>>>> pbs_demux
>>>>>>>>>> userX 10205  0.0  0.0   9360  1548 ?        S    23:03   0:00
>>>>>>>>>> /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>>>>>>>>>> userX 10208  0.0  0.0   6072   764 ?        S    23:03   0:00
>>>>>>>>>> mpiexec.hydra -rmk pbs ./test.sh
>>>>>>>>>> userX 10209  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>>>>>> [ssh]<defunct>
>>>>>>>>>> userX 10210  0.0  0.0  24084  2508 ?        S    23:03   0:00
>>>>>>>>>> /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1
>>>>>>>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>>>>>>>>>> userX 10211  0.0  0.0  24088  2508 ?        S    23:03   0:00
>>>>>>>>>> /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1
>>>>>>>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>>>>>>>>>> userX 10215  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>>>>>> [ssh-keysign]<defunct>
>>>>>>>>>> userX 10255  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>>>>>> [ssh-keysign]<defunct>
>>>>>>>>>> root     10256  0.1  0.0  43580  3520 ?        Ss   23:04   0:00
>>>>>>>>>> sshd:
>>>>>>>>>> userX [priv]
>>>>>>>>>> userX 10258  0.0  0.0  43580  1968 ?        S    23:04   0:00 sshd:
>>>>>>>>>> userX at pts/0
>>>>>>>>>>
>>>>>>>>>> 2. Job terminated immediately; used 2 nodes, 8 procs (but another test
>>>>>>>>>> with 2 nodes had the same result as above).
>>>>>>>>>>
>>>>>>>>>> stderr:
>>>>>>>>>>
>>>>>>>>>> bad fd
>>>>>>>>>> ssh_keysign: no reply
>>>>>>>>>> key_sign failed
>>>>>>>>>> Disconnecting: Bad packet length 232220199.
>>>>>>>>>>
>>>>>>>>>> stdout:
>>>>>>>>>>
>>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>>> PMI_ID=1
>>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>>> PMI_ID=3
>>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>>> PMI_ID=2
>>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>>> PMI_ID=0
>>>>>>>>>>
>>>>>>>>>> Any idea what might be wrong?
>>>>>>>>>>
>>>>>>>>>> There is something wrong with ssh. In test 1, I ssh'd to the node and
>>>>>>>>>> executed the command shown in the ps output, and it executed properly,
>>>>>>>>>> with the respective partition/PMI_IDs being displayed.
>>>>>>>>>>
>>>>>>>>>> Since I've managed to use MPD without any kind of problem, I would
>>>>>>>>>> presume my ssh is working properly.
>>>>>>>>>>
>>>>>>>>>> Could it be that there is something wrong with hydra?
>>>>>>>>>>
>>>>>>>>>> Thanks, regards,
>>>>>>>>>>
>>>>>>>>>> Mário
>>>>>>>>> --
>>>>>>>>> Pavan Balaji
>>>>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>>>>
>>>>>>> --
>>>>>>> Pavan Balaji
>>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Mário Costa
>>>
>>> Laboratório Nacional de Engenharia Civil
>>> LNEC.CTI.NTIEC
>>> Avenida do Brasil 101
>>> 1700-066 Lisboa, Portugal
>>> Tel : ++351 21 844 3911
>>>
>>
>>
>>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>



-- 
Mário Costa

Laboratório Nacional de Engenharia Civil
LNEC.CTI.NTIEC
Avenida do Brasil 101
1700-066 Lisboa, Portugal
Tel : ++351 21 844 3911

