[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK
Pavan Balaji
balaji at mcs.anl.gov
Wed Jan 13 19:37:56 CST 2010
The error itself seems to be thrown by ssh, not Hydra. Based on some
googling, this seems to be a common problem with host-based
authentication in ssh. For example, see
https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
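A quick way to confirm that it is the host-based authentication path that
is failing is something like the following (the hostname and the
ssh-keysign path are just examples; the path varies by distribution):

  # watch which authentication methods ssh tries and where it fails
  ssh -v node125 hostname

  # host-based auth needs these options enabled in the client's ssh_config
  grep -iE 'HostbasedAuthentication|EnableSSHKeysign' /etc/ssh/ssh_config

  # ssh-keysign normally has to be setuid root for host-based auth to work
  ls -l /usr/lib/ssh/ssh-keysign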
Can you check this on your system (my guess is that something is
wrong with nodes 125 and 126, if that helps)? Alternatively, can you set up
key-based ssh (either by using a passwordless key or an ssh agent) to
work around this?
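If you go the passwordless-key route, something along these lines should
be enough, assuming the home directory is shared across the nodes (the
hostname below is just an example):

  # generate a key with an empty passphrase and authorize it
  ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys

  # verify that a non-interactive login works, which is how Hydra
  # launches its proxies
  ssh -x node125 hostname

With a passphrase-protected key, you would instead start an ssh agent
inside the job and ssh-add the key before calling mpiexec.hydra.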
Note that though both Hydra and MPD use ssh, they use different models,
so which node ssh's to which other node will differ between the two
process managers.
-- Pavan
On 01/13/2010 05:48 PM, Mário Costa wrote:
> Hello,
>
> I'm currently testing mpiexec.hydra under PBS (Torque) on SUSE Linux
> Enterprise 10, using the default bootstrap server (ssh).
>
> With MPD, I've managed to successfully execute jobs using any number
> of nodes (hosts)/processors.
> I've set up ssh keys, known_hosts, ... To execute under PBS, I've been
> using a wrapper script that manages the MPD rings according to the
> nodes/resources provided by PBS.
>
> With Hydra, I've only managed to successfully execute jobs that span a
> single node; I tested it with four processors and fewer.
>
> My test, a shell script:
>
> #!/bin/bash
> env | grep PMI
>
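> The PBS job script that launches it looks roughly like this (the
> resource values here are only illustrative):
>
> #!/bin/bash
> #PBS -l nodes=2:ppn=4
> #PBS -l walltime=00:15:00
> cd $PBS_O_WORKDIR
> mpiexec.hydra -rmk pbs ./test.sh
>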
> When I submit a job that spans more than one node, I get the
> following errors.
>
> 1. The job hangs until it is killed by PBS for exceeding the time
> limit (3 nodes, 8 procs).
>
> stderr:
>
> bad fd
> ssh_keysign: no reply
> key_sign failed
> bad fd
> ssh_keysign: no reply
> key_sign failed
> =>> PBS: job killed: walltime 917 exceeded limit 900
> Killed by signal 15.
> Killed by signal 15.
>
> stdout:
>
> PMI_PORT=gorgon127:35454
> PMI_ID=0
> PMI_PORT=gorgon127:35454
> PMI_ID=1
> PMI_PORT=gorgon127:35454
> PMI_ID=2
>
> ps at the node where PBS executed the script that invoked hydra:
>
> userX 10187 0.0 0.0  3928  460 ?  S  23:03 0:00 pbs_demux
> userX 10205 0.0 0.0  9360 1548 ?  S  23:03 0:00 /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
> userX 10208 0.0 0.0  6072  764 ?  S  23:03 0:00 mpiexec.hydra -rmk pbs ./test.sh
> userX 10209 0.0 0.0     0    0 ?  Z  23:03 0:00 [ssh] <defunct>
> userX 10210 0.0 0.0 24084 2508 ?  S  23:03 0:00 /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 1
> userX 10211 0.0 0.0 24088 2508 ?  S  23:03 0:00 /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 2
> userX 10215 0.0 0.0     0    0 ?  Z  23:03 0:00 [ssh-keysign] <defunct>
> userX 10255 0.0 0.0     0    0 ?  Z  23:03 0:00 [ssh-keysign] <defunct>
> root  10256 0.1 0.0 43580 3520 ?  Ss 23:04 0:00 sshd: userX [priv]
> userX 10258 0.0 0.0 43580 1968 ?  S  23:04 0:00 sshd: userX at pts/0
>
> 2. The job terminated immediately (2 nodes, 8 procs), although another
> test with 2 nodes gave the same result as above.
>
> stderr:
>
> bad fd
> ssh_keysign: no reply
> key_sign failed
> Disconnecting: Bad packet length 232220199.
>
> stdout:
>
> PMI_PORT=gorgon116:52217
> PMI_ID=1
> PMI_PORT=gorgon116:52217
> PMI_ID=3
> PMI_PORT=gorgon116:52217
> PMI_ID=2
> PMI_PORT=gorgon116:52217
> PMI_ID=0
>
> Any idea what might be wrong?
>
> To check whether something is wrong with ssh: in test 1, I ssh'ed to
> the node and executed the command shown in the ps output, and it ran
> properly, with the respective partition/PMI_IDs being displayed.
>
> Since I've managed to use MPD without any kind of problems, I would
> presume my ssh is working properly.
>
> Could it be that there is something wrong with Hydra?
>
> Thanks, regards,
>
> Mário
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji