[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK
Pavan Balaji
balaji at mcs.anl.gov
Wed Jan 13 19:37:56 CST 2010
The error itself seems to be thrown by ssh, not Hydra. Based on some
googling, this seems to be a common problem with host-based
authentication in ssh. For example, see
https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication
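A quick way to confirm that it is the host-based authentication path that
is failing is something like the following (the hostname and the
ssh-keysign path are just examples; the path varies by distribution):

  # watch which authentication methods ssh tries and where it fails
  ssh -v node125 hostname

  # host-based auth needs these options enabled in the client's ssh_config
  grep -iE 'HostbasedAuthentication|EnableSSHKeysign' /etc/ssh/ssh_config

  # ssh-keysign normally has to be setuid root for host-based auth to work
  ls -l /usr/lib/ssh/ssh-keysign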
Can you check this on your system (my guess is that something is
wrong with nodes 125 and 126, if that helps)? Alternatively, can you set up
key-based ssh (either by using a passwordless key or an ssh agent) to
work around this?
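If you go the passwordless-key route, something along these lines should
be enough, assuming the home directory is shared across the nodes (the
hostname below is just an example):

  # generate a key with an empty passphrase and authorize it
  ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys

  # verify that a non-interactive login works, which is how Hydra
  # launches its proxies
  ssh -x node125 hostname

With a passphrase-protected key, you would instead start an ssh agent
inside the job and ssh-add the key before calling mpiexec.hydra.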
Note that though both Hydra and MPD use ssh, they use different models,
so which node ssh's to which other node will differ between the two
process managers.
-- Pavan
On 01/13/2010 05:48 PM, Mário Costa wrote:
> Hello,
>
> I'm currently testing mpiexec.hydra under PBS (Torque) on SUSE Linux
> Enterprise 10, using the default bootstrap server (ssh).
>
> With MPD, I've managed to successfully execute jobs using any number
> of nodes (hosts)/processors.
> I've set up ssh keys, known_hosts, ... To execute under PBS, I've been
> using a wrapper script that manages the MPD rings according to the
> nodes/resources provided by PBS.
>
> With Hydra, I've only managed to successfully execute jobs that span a
> single node; I tested it with four processors and fewer.
>
> My test, a shell script:
>
> #!/bin/bash
> env | grep PMI
>
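> The PBS job script that launches it looks roughly like this (the
> resource values here are only illustrative):
>
> #!/bin/bash
> #PBS -l nodes=2:ppn=4
> #PBS -l walltime=00:15:00
> cd $PBS_O_WORKDIR
> mpiexec.hydra -rmk pbs ./test.sh
>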
> When I submit a job that spans more than one node, I get the
> following errors.
>
> 1. The job hangs until it is killed by PBS for exceeding the time
> limit (3 nodes, 8 procs).
>
> stderr:
>
> bad fd
> ssh_keysign: no reply
> key_sign failed
> bad fd
> ssh_keysign: no reply
> key_sign failed
> =>> PBS: job killed: walltime 917 exceeded limit 900
> Killed by signal 15.
> Killed by signal 15.
>
> stdout:
>
> PMI_PORT=gorgon127:35454
> PMI_ID=0
> PMI_PORT=gorgon127:35454
> PMI_ID=1
> PMI_PORT=gorgon127:35454
> PMI_ID=2
>
> ps at the node where PBS executed the script that invoked hydra:
>
> userX 10187 0.0 0.0  3928  460 ?  S  23:03 0:00 pbs_demux
> userX 10205 0.0 0.0  9360 1548 ?  S  23:03 0:00 /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
> userX 10208 0.0 0.0  6072  764 ?  S  23:03 0:00 mpiexec.hydra -rmk pbs ./test.sh
> userX 10209 0.0 0.0     0    0 ?  Z  23:03 0:00 [ssh] <defunct>
> userX 10210 0.0 0.0 24084 2508 ?  S  23:03 0:00 /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 1
> userX 10211 0.0 0.0 24088 2508 ?  S  23:03 0:00 /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 2
> userX 10215 0.0 0.0     0    0 ?  Z  23:03 0:00 [ssh-keysign] <defunct>
> userX 10255 0.0 0.0     0    0 ?  Z  23:03 0:00 [ssh-keysign] <defunct>
> root  10256 0.1 0.0 43580 3520 ?  Ss 23:04 0:00 sshd: userX [priv]
> userX 10258 0.0 0.0 43580 1968 ?  S  23:04 0:00 sshd: userX at pts/0
>
> 2. The job terminated immediately (2 nodes, 8 procs), although another
> test with 2 nodes gave the same result as above.
>
> stderr:
>
> bad fd
> ssh_keysign: no reply
> key_sign failed
> Disconnecting: Bad packet length 232220199.
>
> stdout:
>
> PMI_PORT=gorgon116:52217
> PMI_ID=1
> PMI_PORT=gorgon116:52217
> PMI_ID=3
> PMI_PORT=gorgon116:52217
> PMI_ID=2
> PMI_PORT=gorgon116:52217
> PMI_ID=0
>
> Any idea what might be wrong?
>
> To check whether something is wrong with ssh: in test 1, I ssh'ed to
> the node and executed the command shown in the ps output, and it ran
> properly, with the respective partition/PMI_IDs being displayed.
>
> Since I've managed to use MPD without any kind of problems, I would
> presume my ssh is working properly.
>
> Could it be that there is something wrong with Hydra?
>
> Thanks, regards,
>
> Mário
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji