[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK

Pavan Balaji balaji at mcs.anl.gov
Fri Jan 15 19:34:39 CST 2010


Mario,

I looked through the output you sent; it looks like Hydra is getting
launched correctly. I don't see any error in there.

Did you try running some other MPI programs in the examples directory,
such as CPI?

% mpiexec.hydra -rmk pbs ./examples/cpi

 -- Pavan

On 01/15/2010 12:27 PM, Mário Costa wrote:
> Hi again Pavan,
> 
> I've updated to version 1.2.1 and performed the tests with the patched
> version; the result was the same.
> Check the outputs:
> 
> 1.
> For test 1, the output is in hydra.test1.txt
> 
> Performing test 1, mpiexec.hydra hangs; then with Ctrl+C it dies
> with the following message:
> 
> (% mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1
> /bin/true)
> 
> [mpiexec at gorgon125] HYDU_sock_read (./utils/sock/sock.c:277): read
> errno (Connection reset by peer)
> [mpiexec at gorgon125] HYD_pmcd_pmi_serv_control_cb
> (./pm/pmiserv/pmi_serv_cb.c:263): unable to read status from proxy
> [mpiexec at gorgon125] HYDT_dmx_wait_for_event
> (./tools/demux/demux.c:168): callback returned error status
> [mpiexec at gorgon125] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmi_serv_launch.c:499): error waiting for event
> [mpiexec at gorgon125] main (./ui/mpiexec/mpiexec.c:277): process manager
> error waiting for completion
> 
> looks like it's blocked on a socket after the ssh has exited...
> 
> 2.
> 
> With the test/patched version, the outputs are in hydra.test2.stdout.txt and hydra.stderr.txt
> 
> The job was submitted to PBS and hung; it was then removed from
> the queue ...
> 
> 2010/1/15 Mário Costa <mario.silva.costa at gmail.com>:
>> Hi Darius, Pavan,
>>
>> I checked the key format and it's OK; the key format is rsa, not rsa1,
>> which is the reason for the message. Still, ssh authenticates properly...
>>
>> I've executed test.c with no problems; output:
>>
>> ./test gorgon125
>> gorgon125
>> gorgon125
>> gorgon125
>> gorgon125
>>
>> I've just checked the patch you sent and noticed the file structure is
>> different from the one I have; I'm using version 1.2. I will update to
>> the latest, 1.2.1; it is probably solved in that one. I will let you
>> know the results after the update.
>>
>> Best regards,
>> Mário
>>
>> 2010/1/15 Darius Buntinas <buntinas at mcs.anl.gov>:
>>> The error message about an unknown key type indicates that your id_rsa key
>>> may be in an incorrect format for the version of ssh installed on that
>>> system. There are two formats, openssh and ssh2. ssh-keygen has the
>>> ability to convert between them (check the man page). Try converting your
>>> key, then try again.
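For reference, a sketch of the conversion Darius describes, using ssh-keygen's `-e` (export) and `-i` (import) flags. It uses a throwaway key so the commands are runnable as-is; substitute your own ~/.ssh/id_rsa.pub in practice.

```shell
#!/bin/sh
# Generate a throwaway OpenSSH-format RSA key pair (no passphrase):
tmp=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$tmp/id_rsa"

# Export the OpenSSH-format public key to the ssh2 (RFC 4716) format:
ssh-keygen -e -f "$tmp/id_rsa.pub" > "$tmp/id_rsa_ssh2.pub"

# Import an ssh2-format public key back to OpenSSH format:
ssh-keygen -i -f "$tmp/id_rsa_ssh2.pub" > "$tmp/id_rsa_openssh.pub"

head -1 "$tmp/id_rsa_ssh2.pub"   # the ssh2 format starts with a BEGIN banner
```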
>>>
>>> -d
>>>
>>> On 01/14/2010 09:31 PM, Pavan Balaji wrote:
>>>> Hmm.. This is getting especially difficult since I can't reproduce this
>>>> issue on any machine here. This does look like an ssh issue, but I'm not
>>>> able to nail that down either.
>>>>
>>>> Would you be able to run the attached program on your system without
>>>> mpiexec (just standalone):
>>>>
>>>> % gcc test.c -o test
>>>>
>>>> % ./test gorgon125
>>>>
>>>>  -- Pavan
>>>>
>>>> On 01/14/2010 10:40 AM, Mário Costa wrote:
>>>>> Hello,
>>>>>
>>>>> I've performed test 1 below; it replicated the problem.
>>>>>
>>>>> Check hydra.test1.out; it's the output.
>>>>> In the end the test hung; I used Ctrl+C to stop ...
>>>>>
>>>>> I found this at the beginning of the output:
>>>>>
>>>>> debug1: identity file /home/mjscosta/.ssh/identity type -1
>>>>>
>>>>> debug3: Not a RSA1 key file /home/mjscosta/.ssh/id_rsa.
>>>>>
>>>>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>>>>>
>>>>> debug3: key_read: missing keytype
>>>>> ...
>>>>> debug2: key_type_from_name: unknown key type '-----END'
>>>>>
>>>>> debug3: key_read: missing keytype
>>>>>
>>>>> debug1: identity file /home/mjscosta/.ssh/id_rsa type 1
>>>>>
>>>>> debug1: identity file /home/mjscosta/.ssh/id_dsa type -1
>>>>>
>>>>> bad fd
>>>>>
>>>>> ssh_keysign: no reply
>>>>>
>>>>> key_sign failed
>>>>>
>>>>> My id_rsa file has the following header :
>>>>> -----BEGIN RSA PRIVATE KEY-----
>>>>>
>>>>> Check ssh.vvv.out; it's the output of ssh -vvv gorgon125 (no problem
>>>>> with publickey)
>>>>>
>>>>> I've noticed the following:
>>>>>
>>>>> ssh -vvv gorgon125 2>&1 | tee ssh.vvv.out
>>>>>
>>>>> did not hang ...
>>>>>
>>>>> ssh -vvv gorgon125 > ssh.vvv.out
>>>>>
>>>>> hung
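The difference between those two invocations is what the launched process inherits on its standard descriptors, which is consistent with the "bad fd" from ssh-keysign. A minimal sketch (Linux-specific, via /proc) of what a child process sees in each case:

```shell
#!/bin/sh
# When stdout is redirected to a file, a child process sees the file itself
# on fd 1; when piped (as with `2>&1 | tee`), it sees a pipe. Programs that
# inspect their inherited descriptors (like ssh-keysign) can behave
# differently in the two cases.
readlink /proc/self/fd/1 > /tmp/fd_redirected.txt   # fd 1 is the file itself
cat /tmp/fd_redirected.txt                          # prints the file's path
readlink /proc/self/fd/1 | cat                      # prints pipe:[...]
```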
>>>>>
>>>>>
>>>>> Let me know if you still need output from the patched version.
>>>>>
>>>>> Regards,
>>>>> Mário
>>>>>
>>>>> 2010/1/14 Pavan Balaji<balaji at mcs.anl.gov>:
>>>>>> Can you try the following two options:
>>>>>>
>>>>>> 1. Allocate a 2-node partition in interactive mode. Suppose the nodes
>>>>>> allocated to you (cat $PBS_NODEFILE) are n0 and n1; run the following
>>>>>> command:
>>>>>>
>>>>>> % mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1
>>>>>> /bin/true
>>>>>>
>>>>>> ... and send me the output for this (assuming that it shows the error
>>>>>> you reported)? Note that the above command does not have -rmk pbs in it.
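For context on why the node file matters here: with -rmk pbs, Hydra builds its host list from $PBS_NODEFILE, which lists one hostname per allocated slot. A quick sketch for inspecting a job's allocation (the node file contents below are fabricated for illustration; in a real job PBS sets the variable):

```shell
#!/bin/sh
# Fabricate a node file the way PBS lays one out: one line per slot.
PBS_NODEFILE=$(mktemp)
printf 'n0\nn0\nn1\n' > "$PBS_NODEFILE"

# Count slots per host, i.e. the allocation Hydra would see:
sort "$PBS_NODEFILE" | uniq -c
```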
>>>>>>
>>>>>> 2. Apply the attached patch to mpich2, and recompile it. Then try
>>>>>> running your application as:
>>>>>>
>>>>>> % mpiexec.hydra -rmk pbs -verbose hostname
>>>>>>
>>>>>>  -- Pavan
>>>>>>
>>>>>> On 01/14/2010 04:26 AM, Mário Costa wrote:
>>>>>>> Thanks for your reply!
>>>>>>>
>>>>>>> On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji<balaji at mcs.anl.gov>
>>>>>>>  wrote:
>>>>>>>> The error itself seems to be thrown by ssh, not Hydra. Based on some
>>>>>>>> googling, this seems to be a common problem with host-based
>>>>>>>> authentication in ssh. For example, see
>>>>>>>> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
>>>>>>> You are right, host-based authentication is not set up, only publickey.
>>>>>>>
>>>>>>>> Can someone check this on your system (my guess is that something is
>>>>>>>> wrong with nodes 125 and 126, if it helps)? Alternatively, can you
>>>>>>>> set up a key-based ssh (either by using a passwordless key, or an
>>>>>>>> ssh agent) to work around this?
>>>>>>> No problem with ssh to those nodes; if I do it by hand (ssh using
>>>>>>> publickey authentication, which I forgot to mention in the test) it
>>>>>>> works properly.
>>>>>>>
>>>>>>> Could it be that Hydra is somehow forcing host-based authentication?
>>>>>>>
>>>>>>>> Note that though both Hydra and MPD use ssh, they use different
>>>>>>>> models, so which node ssh's to which other node will differ between
>>>>>>>> the two process managers.
>>>>>>> I wrote some wrapper scripts to create dynamic mpd rings spanning
>>>>>>> only the nodes PBS assigned to the job, so in this sense I think it
>>>>>>> would be similar!?
>>>>>>> All nodes can ssh to each other properly using publickey.
>>>>>>>
>>>>>>> If you need some additional info let me know.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mário
>>>>>>>>  -- Pavan
>>>>>>>>
>>>>>>>> On 01/13/2010 05:48 PM, Mário Costa wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'm currently testing mpiexec.hydra under PBS (Torque) on SUSE
>>>>>>>>> Linux Enterprise 10, using the default bootstrap server (ssh).
>>>>>>>>>
>>>>>>>>> With MPD, I've managed to successfully execute jobs using any number
>>>>>>>>> of nodes (hosts)/processors.
>>>>>>>>> I've set up ssh keys, known_hosts, ... I've been using a wrapper
>>>>>>>>> script to manage mpd rings complying with the PBS-provided
>>>>>>>>> nodes/resources, to execute under PBS...
>>>>>>>>>
>>>>>>>>> With Hydra, I've successfully managed to execute jobs that span only
>>>>>>>>> one node; I tested it with four processors and fewer.
>>>>>>>>>
>>>>>>>>> My test, a shell script:
>>>>>>>>>
>>>>>>>>> #!/bin/bash
>>>>>>>>> env | grep PMI
>>>>>>>>>
>>>>>>>>> When I submit a job that spans more than one node, I get the
>>>>>>>>> following errors.
>>>>>>>>>
>>>>>>>>> 1. Job hangs until it is killed by PBS for exceeding the time
>>>>>>>>> limit; used 3 nodes, 8 procs.
>>>>>>>>>
>>>>>>>>> stderr:
>>>>>>>>>
>>>>>>>>> bad fd
>>>>>>>>> ssh_keysign: no reply
>>>>>>>>> key_sign failed
>>>>>>>>> bad fd
>>>>>>>>> ssh_keysign: no reply
>>>>>>>>> key_sign failed
>>>>>>>>> =>>  PBS: job killed: walltime 917 exceeded limit 900
>>>>>>>>> Killed by signal 15.
>>>>>>>>> Killed by signal 15.
>>>>>>>>>
>>>>>>>>> stdout:
>>>>>>>>>
>>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>>> PMI_ID=0
>>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>>> PMI_ID=1
>>>>>>>>> PMI_PORT=gorgon127:35454
>>>>>>>>> PMI_ID=2
>>>>>>>>>
>>>>>>>>> ps at the node where PBS executed the script that invoked hydra:
>>>>>>>>>
>>>>>>>>> userX 10187  0.0  0.0   3928   460 ?        S    23:03   0:00
>>>>>>>>> pbs_demux
>>>>>>>>> userX 10205  0.0  0.0   9360  1548 ?        S    23:03   0:00
>>>>>>>>> /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>>>>>>>>> userX 10208  0.0  0.0   6072   764 ?        S    23:03   0:00
>>>>>>>>> mpiexec.hydra -rmk pbs ./test.sh
>>>>>>>>> userX 10209  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>>>>> [ssh]<defunct>
>>>>>>>>> userX 10210  0.0  0.0  24084  2508 ?        S    23:03   0:00
>>>>>>>>> /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1
>>>>>>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>>>>>>>>> userX 10211  0.0  0.0  24088  2508 ?        S    23:03   0:00
>>>>>>>>> /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1
>>>>>>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>>>>>>>>> userX 10215  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>>>>> [ssh-keysign]<defunct>
>>>>>>>>> userX 10255  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>>>>>>> [ssh-keysign]<defunct>
>>>>>>>>> root     10256  0.1  0.0  43580  3520 ?        Ss   23:04   0:00
>>>>>>>>> sshd:
>>>>>>>>> userX [priv]
>>>>>>>>> userX 10258  0.0  0.0  43580  1968 ?        S    23:04   0:00 sshd:
>>>>>>>>> userX at pts/0
>>>>>>>>>
>>>>>>>>> 2. Job terminated immediately; used 2 nodes, 8 procs (but another
>>>>>>>>> test with 2 nodes had the same result as above).
>>>>>>>>>
>>>>>>>>> stderr:
>>>>>>>>>
>>>>>>>>> bad fd
>>>>>>>>> ssh_keysign: no reply
>>>>>>>>> key_sign failed
>>>>>>>>> Disconnecting: Bad packet length 232220199.
>>>>>>>>>
>>>>>>>>> stdout:
>>>>>>>>>
>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>> PMI_ID=1
>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>> PMI_ID=3
>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>> PMI_ID=2
>>>>>>>>> PMI_PORT=gorgon116:52217
>>>>>>>>> PMI_ID=0
>>>>>>>>>
>>>>>>>>> Any idea what might be wrong?
>>>>>>>>>
>>>>>>>>> There is something wrong with ssh. In test 1, I ssh'd to the node
>>>>>>>>> and executed the command shown in the ps output, and it ran
>>>>>>>>> properly, with the respective partition/PMI_IDs being displayed.
>>>>>>>>>
>>>>>>>>> Since I've managed to use MPD without any kind of problems, I would
>>>>>>>>> presume my ssh is working properly.
>>>>>>>>>
>>>>>>>>> Could it be that there is something wrong with Hydra?
>>>>>>>>>
>>>>>>>>> Thanks, regards,
>>>>>>>>>
>>>>>>>>> Mário
>>>>>>>>> _______________________________________________
>>>>>>>>> mpich-discuss mailing list
>>>>>>>>> mpich-discuss at mcs.anl.gov
>>>>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>>>>> --
>>>>>>>> Pavan Balaji
>>>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>> --
>>>>>> Pavan Balaji
>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>>
>> --
>> Mário Costa
>>
>> Laboratório Nacional de Engenharia Civil
>> LNEC.CTI.NTIEC
>> Avenida do Brasil 101
>> 1700-066 Lisboa, Portugal
>> Tel : ++351 21 844 3911
>>
> 
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

