[mpich-discuss] Hydra unable to execute jobs that use more than one node(host) under PBS RMK

Mário Costa mario.silva.costa at gmail.com
Thu Jan 14 04:26:19 CST 2010


Thanks for your reply!

On Thu, Jan 14, 2010 at 1:37 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
> The error itself seems to be thrown by ssh, not Hydra. Based on some
> googling, this seems to be a common problem with host-based
> authentication in ssh. For example, see
> https://www.cs.uwaterloo.ca/twiki/view/CF/SSHHostBasedAuthentication

You are right: host-based authentication is not set up, only publickey.

>
> Can someone check this on your system (my guess is that something is
> wrong with nodes 125 and 126, if it helps)? Alternatively, can you set up
> a key-based ssh (either by using a passwordless key, or an ssh agent) to
> work around this?

There is no problem with ssh to those nodes: if I do it by hand, it
works properly (ssh uses publickey authentication, which I forgot to
mention in the test description).

Could it be that Hydra is somehow forcing host-based authentication?
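
One way to probe this by hand is to run ssh with host-based authentication explicitly disabled and password prompts turned off, so that only publickey is tried. A minimal sketch, assuming an OpenSSH client; the function name `ssh_pubkey_cmd` is made up for illustration, and node126 is simply one of the nodes from the ps listing quoted below:

```shell
#!/bin/bash
# Hypothetical helper: print an ssh command line that only attempts
# publickey authentication. BatchMode, HostbasedAuthentication and
# PreferredAuthentications are standard OpenSSH client options.
ssh_pubkey_cmd() {
    local host="$1"
    printf 'ssh -x -o BatchMode=yes -o HostbasedAuthentication=no -o PreferredAuthentications=publickey %s true\n' "$host"
}

# Dry run: show the command instead of executing it.
ssh_pubkey_cmd node126
```

Running the printed command with `<&-` appended closes stdin, which may be closer to the environment ssh sees inside a PBS job than an interactive login; that is an assumption on my part, not something confirmed in this thread.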

>
> Note that though both hydra and mpd use ssh, they use different models,
> so which node ssh's to which other node will be different with both the
> process managers.

I wrote some wrapper scripts to create dynamic mpd rings spanning only
the nodes PBS assigned to the job, so in this sense I think it would be
similar!?
All nodes can ssh to each other properly using publickey.
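
For context, the node-selection step of such a wrapper can be sketched roughly as follows. This is not the actual wrapper script, only a minimal illustration, assuming Torque's `$PBS_NODEFILE` (one line per assigned processor slot); the function name `unique_hosts` is hypothetical, and the `host:count` output is meant to match the `host:ncpus` form that an mpd hosts file accepts:

```shell
#!/bin/bash
# Hypothetical helper: collapse a PBS node file (one line per slot)
# into one "host:slots" line per distinct host.
unique_hosts() {
    local nodefile="$1"
    # uniq -c prefixes each host with its count; awk swaps the order.
    sort "$nodefile" | uniq -c | awk '{ print $2 ":" $1 }'
}

# Only run when actually inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    unique_hosts "$PBS_NODEFILE"
fi
```

The resulting list could be redirected into a hosts file for mpdboot, or looped over to check ssh connectivity between the assigned nodes.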

If you need any additional info, let me know.

Regards,
Mário
>
>  -- Pavan
>
> On 01/13/2010 05:48 PM, Mário Costa wrote:
>> Hello,
>>
>> I'm currently testing mpiexec.hydra under PBS (Torque) on Enterprise
>> Suse 10, using the default bootstrap server (ssh).
>>
>> With MPD, I've managed to successfully execute jobs using any number
>> of nodes (hosts)/processors.
>> I've set up ssh keys, known_hosts, ... I've been using a wrapper script
>> to manage mpd rings that comply with the PBS-provided nodes/resources,
>> to execute under PBS...
>>
>> With Hydra, I have only managed to execute jobs that span a single
>> node; I tested it with four processors or fewer.
>>
>> My test, a shell script:
>>
>> #!/bin/bash
>> env | grep PMI
>>
>> When I submit a job that spans more than one node, I get the
>> following errors.
>>
>> 1. Job hangs until it is killed by PBS for exceeding the time limit;
>> used 3 nodes, 8 procs.
>>
>> stderr:
>>
>> bad fd
>> ssh_keysign: no reply
>> key_sign failed
>> bad fd
>> ssh_keysign: no reply
>> key_sign failed
>> =>> PBS: job killed: walltime 917 exceeded limit 900
>> Killed by signal 15.
>> Killed by signal 15.
>>
>> stdout:
>>
>> PMI_PORT=gorgon127:35454
>> PMI_ID=0
>> PMI_PORT=gorgon127:35454
>> PMI_ID=1
>> PMI_PORT=gorgon127:35454
>> PMI_ID=2
>>
>> ps at the node where PBS executed the script that invoked hydra:
>>
>> userX 10187  0.0  0.0   3928   460 ?        S    23:03   0:00 pbs_demux
>> userX 10205  0.0  0.0   9360  1548 ?        S    23:03   0:00 /bin/bash /var/spool/torque/mom_priv/jobs/9950.clulne.SC
>> userX 10208  0.0  0.0   6072   764 ?        S    23:03   0:00 mpiexec.hydra -rmk pbs ./test.sh
>> userX 10209  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh] <defunct>
>> userX 10210  0.0  0.0  24084  2508 ?        S    23:03   0:00 /usr/bin/ssh -x node126 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 1
>> userX 10211  0.0  0.0  24088  2508 ?        S    23:03   0:00 /usr/bin/ssh -x node125 /usr/bin/pmi_proxy --launch-mode 1 --proxy-port node127:36669 --bootstrap ssh --partition-id 2
>> userX 10215  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh-keysign] <defunct>
>> userX 10255  0.0  0.0      0     0 ?        Z    23:03   0:00 [ssh-keysign] <defunct>
>> root     10256  0.1  0.0  43580  3520 ?        Ss   23:04   0:00 sshd: userX [priv]
>> userX 10258  0.0  0.0  43580  1968 ?        S    23:04   0:00 sshd: userX@pts/0
>>
>> 2. Job terminated immediately; used 2 nodes, 8 procs (though another
>> test with 2 nodes had the same result as above).
>>
>> stderr:
>>
>> bad fd
>> ssh_keysign: no reply
>> key_sign failed
>> Disconnecting: Bad packet length 232220199.
>>
>> stdout:
>>
>> PMI_PORT=gorgon116:52217
>> PMI_ID=1
>> PMI_PORT=gorgon116:52217
>> PMI_ID=3
>> PMI_PORT=gorgon116:52217
>> PMI_ID=2
>> PMI_PORT=gorgon116:52217
>> PMI_ID=0
>>
>> Any idea what might be wrong?
>>
>> Something seems to go wrong with ssh. In test 1, I ssh'ed to the node
>> and ran the command shown in the ps output, and it executed properly,
>> with the respective partition/PMI_IDs being displayed.
>>
>> Since I've managed to use MPD without any kind of problems, I would
>> presume my ssh is working properly.
>>
>> Could it be that there is something wrong with Hydra?
>>
>> Thanks, regards,
>>
>> Mário
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji

