[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK
Mário Costa
mario.silva.costa at gmail.com
Thu Jan 14 04:26:19 CST 2010
Thanks for your reply!
On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
> The error itself seems to be thrown by ssh, not Hydra. Based on some
> googling, this seems to be a common problem with host-based
> authentication in ssh. For example, see
> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
You are right: host-based authentication is not set up here, only publickey.
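To see which authentication methods the ssh client actually tries, its verbose output can be filtered. The snippet below is only a sketch: the function name auth_methods_tried is made up for illustration, and it relies on the "Next authentication method" lines that OpenSSH prints with -v.

```shell
#!/bin/bash
# Hypothetical helper (not part of OpenSSH, Hydra or MPICH): read `ssh -v`
# debug output on stdin and print the authentication methods the client
# attempted, one per line, in the order they were tried.
auth_methods_tried() {
    # Strip any carriage returns first, then keep only the method names.
    tr -d '\r' | sed -n 's/^debug1: Next authentication method: //p'
}

# Example use (node126 is one of the hosts from the ps listing):
#   ssh -v -x node126 true 2>&1 | auth_methods_tried
```

If "hostbased" shows up in the output, the client is attempting host-based authentication before (or instead of) publickey.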
>
> Can someone check this on your system (my guess is that something is
> wrong with nodes 125 and 126, if it helps)? Alternatively, can you set up
> key-based ssh (either by using a passwordless key, or an ssh agent) to
> work around this?
There is no problem with ssh to those nodes. If I do it by hand it works
properly (ssh uses publickey authentication; I forgot to mention that in
the description of my test).
Could it be that Hydra is somehow forcing host-based authentication?
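One way to test this by hand is to repeat the ssh launch with host-based authentication switched off. The following is a minimal sketch, not something Hydra does: the function name pubkey_only_ssh_cmd is invented here, and BatchMode, HostbasedAuthentication and PreferredAuthentications are standard OpenSSH client options.

```shell
#!/bin/bash
# Hypothetical helper (not part of Hydra or MPICH): print an ssh command line
# that allows only publickey authentication, so that host-based
# authentication (and with it ssh-keysign) is not attempted.
pubkey_only_ssh_cmd() {
    local host="$1"
    shift
    printf '%s' "ssh -x -o BatchMode=yes -o HostbasedAuthentication=no -o PreferredAuthentications=publickey $host"
    # Append the remote command words, if any, exactly as given.
    if [ "$#" -gt 0 ]; then
        printf ' %s' "$@"
    fi
    printf '\n'
}

# Example: print the command for one of the nodes from the ps listing.
pubkey_only_ssh_cmd node126 /bin/true
```

Running the printed command from the node where PBS started the job shows whether a publickey-only login succeeds without any ssh_keysign messages.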
>
> Note that though both Hydra and mpd use ssh, they use different models,
> so which node ssh's to which other node will differ between the two
> process managers.
I wrote some wrapper scripts that create dynamic mpd rings spanning
only the nodes PBS assigned to the job, so in that sense I think it
would be similar?
All nodes can ssh to each other properly using publickey.
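A check of this kind can be sketched as follows. It assumes PBS_NODEFILE points at the node list Torque hands to the job; check_nodes and the SSH_BIN variable are names invented for this sketch. It only tests connections from the node running the script, which matches the ps listing in the first mail, where the ssh processes are started on the node running mpiexec.hydra.

```shell
#!/bin/bash
# Hypothetical helper: try a non-interactive ssh from this node to every
# distinct host listed in a PBS node file and report which ones fail.
# SSH_BIN is an assumption of this sketch so the ssh binary can be swapped.
SSH_BIN="${SSH_BIN:-ssh}"

check_nodes() {
    local nodefile="$1" host failed=0
    # The node file has one line per processor slot, so de-duplicate hosts.
    for host in $(sort -u "$nodefile"); do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if "$SSH_BIN" -x -o BatchMode=yes "$host" true </dev/null >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    # Return 1 if at least one host could not be reached, 0 otherwise.
    return "$failed"
}

# Run only when a node file is available (that is, inside a PBS job).
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    check_nodes "$PBS_NODEFILE"
fi
```

Called from the job script just before mpiexec.hydra, this lists any assigned node that cannot be reached without a password prompt.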
If you need some additional info let me know.
Regards,
Mário
>
> -- Pavan
>
> On 01/13/2010 05:48 PM, Mário Costa wrote:
>> Hello,
>>
>> I'm currently testing mpiexec.hydra under PBS (Torque) on Enterprise
>> Suse 10, using the default bootstrap server (ssh).
>>
>> With MPD, I've managed to successfully execute jobs using any number
>> of nodes (hosts)/processors.
>> I've set up ssh keys, known_hosts, ... and I've been using a wrapper
>> script that manages mpd rings matching the nodes/resources provided
>> by PBS, so that jobs execute under PBS.
>>
>> With Hydra, I have only managed to execute jobs that span a single
>> node; I tested it with four processors or fewer.
>>
>> My test, a shell script:
>>
>> #!/bin/bash
>> env | grep PMI
>>
>> When I submit a job that spans more than one node, I get the
>> following errors.
>>
>> 1. Job hangs until it is killed by PBS for exceeding the time limit
>> (used 3 nodes, 8 procs).
>>
>> stderr:
>>
>> bad fd
>> ssh_keysign: no reply
>> key_sign failed
>> bad fd
>> ssh_keysign: no reply
>> key_sign failed
>> =>> PBS: job killed: walltime 917 exceeded limit 900
>> Killed by signal 15.
>> Killed by signal 15.
>>
>> stdout:
>>
>> PMI_PORT=gorgon127:35454
>> PMI_ID=0
>> PMI_PORT=gorgon127:35454
>> PMI_ID=1
>> PMI_PORT=gorgon127:35454
>> PMI_ID=2
>>
>> ps at the node where PBS executed the script that invoked hydra:
>>
>> userX 10187 0.0 0.0 3928 460 ? S 23:03 0:00 pbs_demux
>> userX 10205 0.0 0.0 9360 1548 ? S 23:03 0:00 /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>> userX 10208 0.0 0.0 6072 764 ? S 23:03 0:00 mpiexec.hydra -rmk pbs ./test.sh
>> userX 10209 0.0 0.0 0 0 ? Z 23:03 0:00 [ssh] <defunct>
>> userX 10210 0.0 0.0 24084 2508 ? S 23:03 0:00 /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>> userX 10211 0.0 0.0 24088 2508 ? S 23:03 0:00 /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>> userX 10215 0.0 0.0 0 0 ? Z 23:03 0:00 [ssh-keysign] <defunct>
>> userX 10255 0.0 0.0 0 0 ? Z 23:03 0:00 [ssh-keysign] <defunct>
>> root 10256 0.1 0.0 43580 3520 ? Ss 23:04 0:00 sshd: userX [priv]
>> userX 10258 0.0 0.0 43580 1968 ? S 23:04 0:00 sshd: userX@pts/0
>>
>> 2. Job terminates immediately (used 2 nodes, 8 procs; another test
>> with 2 nodes had the same result as in case 1).
>>
>> stderr:
>>
>> bad fd
>> ssh_keysign: no reply
>> key_sign failed
>> Disconnecting: Bad packet length 232220199.
>>
>> stdout:
>>
>> PMI_PORT=gorgon116:52217
>> PMI_ID=1
>> PMI_PORT=gorgon116:52217
>> PMI_ID=3
>> PMI_PORT=gorgon116:52217
>> PMI_ID=2
>> PMI_PORT=gorgon116:52217
>> PMI_ID=0
>>
>> Any idea what might be wrong?
>>
>> Something seems to be wrong with ssh, but in test 1 I ssh'ed to the
>> node by hand and ran the command shown in the ps output, and it
>> executed properly, with the respective partition/PMI_IDs being displayed.
>>
>> Since I've managed to use MPD without any kind of problems, I would
>> presume my ssh is working properly.
>>
>> Could it be that there is something wrong with Hydra?
>>
>> Thanks, regards,
>>
>> Mário
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji