[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK

Wed Jan 13 17:48:00 CST 2010

Hello,

I'm currently testing mpiexec.hydra under PBS (Torque) on SUSE Linux
Enterprise 10, using the default bootstrap server (ssh).

With MPD, I've managed to successfully execute jobs using any number
of nodes (hosts)/processors.
I've set up ssh keys, known_hosts, ... and I've been using a wrapper
script to manage MPI rings that comply with the PBS-provided
nodes/resources, in order to execute under PBS...

With Hydra, I have only managed to execute jobs that run on a single
node; I tested this with four processors or fewer.

My test, a shell script:

#!/bin/bash
env | grep PMI
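
A variant of the same test that tags each line with the host it ran on
could make it easier to see which nodes actually produced output. This
is only a sketch: the helper name tag_pmi_env is hypothetical, and it
assumes the PMI variables all start with "PMI_", as in the output below.

```shell
#!/bin/bash
# Sketch: print every PMI_* environment variable, prefixed with the
# name of the host this process runs on.
# tag_pmi_env is a hypothetical helper name, not part of Hydra or PBS.
tag_pmi_env() {
    local host
    host=$(uname -n)    # node name of the current host
    env | grep '^PMI_' | sort | sed "s/^/${host}: /"
}

tag_pmi_env
```

Used in place of test.sh, each output line would then read like
"<host>: PMI_ID=<n>".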

When I submit a job that spans more than one node, I get the
following errors.

1. Job hangs until it's killed by PBS for exceeding the time limit;
used 3 nodes, 8 procs.

stderr:

bad fd
ssh_keysign: no reply
key_sign failed
bad fd
ssh_keysign: no reply
key_sign failed
=>> PBS: job killed: walltime 917 exceeded limit 900
Killed by signal 15.
Killed by signal 15.

stdout:

PMI_PORT=gorgon127:35454
PMI_ID=0
PMI_PORT=gorgon127:35454
PMI_ID=1
PMI_PORT=gorgon127:35454
PMI_ID=2

ps output on the node where PBS executed the script that invoked hydra:

userX 10187  0.0  0.0   3928   460 ?        S    23:03   0:00 pbs_demux
userX 10205  0.0  0.0   9360  1548 ?        S    23:03   0:00 /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
userX 10208  0.0  0.0   6072   764 ?        S    23:03   0:00 mpiexec.hydra -rmk pbs ./test.sh
userX 10209  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh] <defunct>
userX 10210  0.0  0.0  24084  2508 ?        S    23:03   0:00 /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 1
userX 10211  0.0  0.0  24088  2508 ?        S    23:03   0:00 /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 2
userX 10215  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh-keysign] <defunct>
userX 10255  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh-keysign] <defunct>
root     10256  0.1  0.0  43580  3520 ?        Ss   23:04   0:00 sshd: userX [priv]
userX 10258  0.0  0.0  43580  1968 ?        S    23:04   0:00 sshd: userX@pts/0

2. Job terminated immediately; used 2 nodes, 8 procs (but another test
with 2 nodes had the same result as above).

stderr:

bad fd
ssh_keysign: no reply
key_sign failed
Disconnecting: Bad packet length 232220199.

stdout:

PMI_PORT=gorgon116:52217
PMI_ID=1
PMI_PORT=gorgon116:52217
PMI_ID=3
PMI_PORT=gorgon116:52217
PMI_ID=2
PMI_PORT=gorgon116:52217
PMI_ID=0

Any idea what might be wrong?

Something seems to be wrong with ssh, yet in test 1 I ssh'ed to the
node and executed the command shown in the ps output, and it ran
properly, with the respective partition/PMI_IDs being displayed.

Since I've managed to use MPD without any kind of problem, I would
presume my ssh is working properly.
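
To check that presumption from inside a job, a loop like the one below
could try a non-interactive ssh to every node PBS handed out. It is a
minimal sketch, not something Hydra itself does: it assumes Torque's
PBS_NODEFILE (one hostname per line, repeated per processor slot), the
helper names unique_hosts and check_hosts are hypothetical, and the ssh
options (-x, -n, BatchMode=yes) are my choice for approximating a
launch with no terminal attached.

```shell
#!/bin/bash
# Sketch: try a non-interactive ssh to every host assigned to the job.
# SSH_CMD can be overridden so the loop can be tried outside PBS.
SSH_CMD=${SSH_CMD:-ssh -x -n -o BatchMode=yes}

# Print each distinct host from a nodefile once, in first-seen order.
# Blank lines are skipped.
unique_hosts() {
    awk 'NF && !seen[$1]++ { print $1 }' "$1"
}

# Run "hostname" on every host in the nodefile. Prints OK/FAIL per
# host and returns non-zero if at least one host failed.
check_hosts() {
    local nodefile=$1 host rc=0
    for host in $(unique_hosts "$nodefile"); do
        if $SSH_CMD "$host" hostname >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return $rc
}

# Only run automatically inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    check_hosts "$PBS_NODEFILE"
fi
```

Run from the PBS job script before mpiexec.hydra, a FAIL line would
point at a node where ssh does not work without a terminal, which an
interactive login to that node would not necessarily reveal.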

Could it be that there is something wrong with hydra?

Thanks, regards,

Mário