[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK

Mário Costa mario.silva.costa at gmail.com
Thu Jan 14 10:40:08 CST 2010


Hello,

I've performed test 1 below, and it reproduced the problem.

The output is in the attached hydra.test1.out.
At the end the test hung, and I used Ctrl-C to stop it.
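
For reference, here is a minimal sketch of how the test 1 command quoted below could be assembled from the PBS node file instead of typing the host names by hand. It assumes the node file lists one host name per line, repeated once per allocated processor. The helper name build_test1_cmd is made up for illustration, and the function only prints the command, it does not run it.

```shell
#!/bin/bash
# Sketch: print the "test 1" command for the first two distinct hosts
# found in a PBS node file. build_test1_cmd is a hypothetical helper name.
build_test1_cmd() {
    local nodefile=$1
    local first="" second=""
    # awk keeps only the first occurrence of each host, in file order
    # (assumption: the node file repeats a host once per processor).
    { read -r first; read -r second; } < <(awk 'NF && !seen[$1]++ { print $1 }' "$nodefile")
    if [ -z "$first" ] || [ -z "$second" ]; then
        echo "need at least two distinct hosts in $nodefile" >&2
        return 1
    fi
    echo "mpiexec.hydra -bootstrap fork -n 1 ssh -vvv $first hostname : -n 1 ssh $second /bin/true"
}

# Example, inside an interactive PBS job:
#   build_test1_cmd "$PBS_NODEFILE"
```

Printing rather than executing keeps the sketch safe to try outside a job, and the printed line can be inspected before it is pasted into the shell.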

I found this at the beginning of the output:

debug1: identity file /home/mjscosta/.ssh/identity type -1
debug3: Not a RSA1 key file /home/mjscosta/.ssh/id_rsa.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
...
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file /home/mjscosta/.ssh/id_rsa type 1
debug1: identity file /home/mjscosta/.ssh/id_dsa type -1
bad fd
ssh_keysign: no reply
key_sign failed

My id_rsa file has the following header:
-----BEGIN RSA PRIVATE KEY-----
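
As a quick way to repeat this header check on other key files, the sketch below classifies a file by its first line. The helper name key_header_kind and the labels it prints are made up for illustration, and the list of headers is not exhaustive. Note that the debug output above later reports "type 1" for id_rsa, which suggests ssh did recognise the key despite the "unknown key type" messages. That reading is an assumption and is not confirmed in this thread.

```shell
#!/bin/bash
# Sketch: report what kind of key file this looks like, judging only by
# its first line. key_header_kind is a hypothetical helper name.
key_header_kind() {
    local first_line=""
    # Read only the first line; fail if the file is unreadable or empty.
    IFS= read -r first_line < "$1" || [ -n "$first_line" ] || return 1
    first_line=${first_line%$'\r'}   # tolerate CRLF line endings
    case "$first_line" in
        "-----BEGIN RSA PRIVATE KEY-----")     echo "rsa private key (PEM)" ;;
        "-----BEGIN DSA PRIVATE KEY-----")     echo "dsa private key (PEM)" ;;
        "-----BEGIN OPENSSH PRIVATE KEY-----") echo "openssh private key" ;;
        ssh-rsa\ *|ssh-dss\ *)                 echo "public key" ;;
        *)                                     echo "unknown" ;;
    esac
}

# Example:
#   key_header_kind ~/.ssh/id_rsa
```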

The attached ssh.vvv.out is the output of ssh -vvv gorgon125 (no problem
with publickey).

I've noticed the following:

ssh -vvv gorgon125 2>&1 | tee  ssh.vvv.out

did not hang ...

ssh -vvv gorgon125 >ssh.vvv.out

hung
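
To compare the two cases without needing Ctrl-C, a sketch like the following runs a command under a time limit and reports whether it timed out. It assumes the GNU coreutils timeout utility is installed (it exits with status 124 when the limit is hit). The helper name run_with_limit is made up for illustration. One possible explanation for the difference, not confirmed here, is that ssh without a remote command starts an interactive session, so with stdout redirected to a file the prompt is not visible and ssh simply waits for input. The example therefore runs a non-interactive remote command, /bin/true, as in test 1.

```shell
#!/bin/bash
# Sketch: run a command with a time limit, capturing stdout and stderr in
# a file. run_with_limit is a hypothetical helper name; it relies on the
# GNU coreutils "timeout" utility being available.
run_with_limit() {
    local limit=$1 outfile=$2
    shift 2
    # stdin comes from /dev/null so the command cannot wait on the terminal.
    timeout "$limit" "$@" >"$outfile" 2>&1 </dev/null
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s"
    else
        echo "exit status $status"
    fi
}

# Example, using the host name from this thread:
#   run_with_limit 30 ssh.vvv.out ssh -vvv gorgon125 /bin/true
```

If this variant returns promptly while the plain redirect still hangs, that would point at the interactive session rather than at authentication.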


Let me know if you still need output from the patched version.

Regards,
Mário

2010/1/14 Pavan Balaji <balaji at mcs.anl.gov>:
>
> Can you try the following two options:
>
> 1. Allocate a 2-node partition in interactive mode. Suppose the nodes
> allocated to you (cat $PBS_NODEFILE) are n0 and n1, run the following
> program:
>
> % mpiexec.hydra -bootstrap fork -n 1 ssh -vvv n0 hostname : -n 1 ssh n1
> /bin/true
>
> ... and send me the output for this (assuming that it shows the error
> you reported)? Note that the above command does not have -rmk pbs in it.
>
> 2. Apply the attached patch to mpich2, and recompile it. Then try
> running your application as:
>
> % mpiexec.hydra -rmk pbs -verbose hostname
>
>  -- Pavan
>
> On 01/14/2010 04:26 AM, Mário Costa wrote:
>> Thanks for your reply!
>>
>> On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>> The error itself seems to be thrown by ssh, not Hydra. Based on some
>>> googling, this seems to be a common problem with host-based
>>> authentication in ssh. For example, see
>>> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
>>
>> You are right, host-based authentication is not set up, only publickey.
>>
>>> Can someone check this on your system (my guess is that something is
>>> wrong with nodes 125 and 126, if it helps)? Alternatively, can you set up
>>> a key based ssh (either by using a passwordless key, or an ssh agent) to
>>> work around this?
>>
>> No problem with ssh to those nodes, if I do it by hand (ssh using
>> publickey authentication, I forgot to mention in the test)  it works
>> properly.
>>
>> Could it be that hydra is forcing somehow host-based authentication?
>>
>>> Note that though both hydra and mpd use ssh, they use different models,
>>> so which node ssh's to which other node will be different with both the
>>> process managers.
>>
>> I wrote some wrapping scripts to create dynamic mpd rings spanning
>> only PBS assigned nodes to the job, in this sense I think it would be
>> similar!?
>> All nodes can ssh to each other properly using publickey.
>>
>> If you need some additional info let me know.
>>
>> Regards,
>> Mário
>>>  -- Pavan
>>>
>>> On 01/13/2010 05:48 PM, Mário Costa wrote:
>>>> Hello,
>>>>
>>>> I'm currently testing mpiexec.hydra under PBS(Torque) in Enterprise
>>>> Suse 10, using the default bootstrap server (ssh).
>>>>
>>>> With MPD, I've managed to execute successfully jobs using any number
>>>> of nodes(hosts)/processors.
>>>> I've set up ssh keys, known_hosts, ... I've been using a wrapper script
>>>> to manage mpi rings complying with the PBS-provided nodes/resources, to
>>>> execute under PBS...
>>>>
>>>> With Hydra, I successfully managed to execute jobs that span over one
>>>> node only, tested it with four processors and less.
>>>>
>>>> My test, a shell script:
>>>>
>>>> #!/bin/bash
>>>> env | grep PMI
>>>>
>>>> When I submit a job that spans over more than one node I get the
>>>> following errors.
>>>>
>>>> 1. Job hangs until it's killed by PBS for exceeding the time limit; used 3
>>>> nodes, 8 procs.
>>>>
>>>> stderr:
>>>>
>>>> bad fd
>>>> ssh_keysign: no reply
>>>> key_sign failed
>>>> bad fd
>>>> ssh_keysign: no reply
>>>> key_sign failed
>>>> =>> PBS: job killed: walltime 917 exceeded limit 900
>>>> Killed by signal 15.
>>>> Killed by signal 15.
>>>>
>>>> stdout:
>>>>
>>>> PMI_PORT=gorgon127:35454
>>>> PMI_ID=0
>>>> PMI_PORT=gorgon127:35454
>>>> PMI_ID=1
>>>> PMI_PORT=gorgon127:35454
>>>> PMI_ID=2
>>>>
>>>> ps at the node where PBS executed the script that invoked hydra:
>>>>
>>>> userX 10187  0.0  0.0   3928   460 ?        S    23:03   0:00 pbs_demux
>>>> userX 10205  0.0  0.0   9360  1548 ?        S    23:03   0:00
>>>> /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>>>> userX 10208  0.0  0.0   6072   764 ?        S    23:03   0:00
>>>> mpiexec.hydra -rmk pbs ./test.sh
>>>> userX 10209  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh] <defunct>
>>>> userX 10210  0.0  0.0  24084  2508 ?        S    23:03   0:00
>>>> /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1
>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>>>> userX 10211  0.0  0.0  24088  2508 ?        S    23:03   0:00
>>>> /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1
>>>> --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>>>> userX 10215  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>> [ssh-keysign] <defunct>
>>>> userX 10255  0.0  0.0      0     0 ?        Z    23:03   0:00
>>>> [ssh-keysign] <defunct>
>>>> root     10256  0.1  0.0  43580  3520 ?        Ss   23:04   0:00 sshd:
>>>> userX [priv]
>>>> userX 10258  0.0  0.0  43580  1968 ?        S    23:04   0:00 sshd:
>>>> userX@pts/0
>>>>
>>>> 2. Job terminated immediately; used 2 nodes, 8 procs (but another test
>>>> with 2 nodes had the same result as above).
>>>>
>>>> stderr:
>>>>
>>>> bad fd
>>>> ssh_keysign: no reply
>>>> key_sign failed
>>>> Disconnecting: Bad packet length 232220199.
>>>>
>>>> stdout:
>>>>
>>>> PMI_PORT=gorgon116:52217
>>>> PMI_ID=1
>>>> PMI_PORT=gorgon116:52217
>>>> PMI_ID=3
>>>> PMI_PORT=gorgon116:52217
>>>> PMI_ID=2
>>>> PMI_PORT=gorgon116:52217
>>>> PMI_ID=0
>>>>
>>>> Any idea of what might be wrong ?
>>>>
>>>> There is something wrong with ssh: in test 1, I ssh'd to the node and
>>>> executed the command shown in the ps output, and it executed properly with
>>>> the respective partition/PMI_IDs being displayed.
>>>>
>>>> Since I've managed to use MPD without any kind of problem, I would
>>>> presume my ssh is working properly.
>>>>
>>>> Could it be that there is something wrong with hydra?
>>>>
>>>> Thanks, regards,
>>>>
>>>> Mário
>>>> _______________________________________________
>>>> mpich-discuss mailing list
>>>> mpich-discuss at mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>



-- 
Mário Costa

Laboratório Nacional de Engenharia Civil
LNEC.CTI.NTIEC
Avenida do Brasil 101
1700-066 Lisboa, Portugal
Tel : ++351 21 844 3911
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hydra.test1.out
Type: chemical/x-gulp
Size: 10315 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100114/09076bee/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ssh.vvv.out
Type: chemical/x-gulp
Size: 10849 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100114/09076bee/attachment-0003.bin>