[mpich-discuss] MPICH2 Hydra integration with SGE and PBS

Pavan Balaji balaji at mcs.anl.gov
Sat Aug 7 17:54:14 CDT 2010


Reuti,

I've added support for SGE as a bootstrap server in r7018. Can you try 
out the latest nightly snapshot to make sure I didn't miss anything?
http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra
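
If that works, using it from a job script should boil down to something 
like this (untested sketch; $NSLOTS is the slot count SGE exports into 
the job environment):

   mpiexec -bootstrap sge -n $NSLOTS ./mpihello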

Thanks,

  -- Pavan

On 08/07/2010 03:20 PM, Reuti wrote:
> Am 07.08.2010 um 20:32 schrieb Pavan Balaji:
>> On 08/07/2010 12:58 PM, Reuti wrote:
>>>>> job_is_first_task  FALSE
>>>>
>>>> I'm not sure I follow this. The script should already only launch
>>>> one process (which will be mpiexec) on the first node. mpiexec
>>>> will then launch the remaining processes.
>>>
>>> SGE will control the number of started slave processes. In the old
>>> MPICH(1) it was indeed the case that the started `mpirun` did some
>>> work in one of its forks and started only (n-1) slaves. What I
>>> observe in MPICH2 with Hydra is the following for a `qsub -pe mpich 2
>>> test_mpich.sh`:
>>
>> Ah, I see the confusion here. This has been fixed in Hydra recently, so for local node launches Hydra just does a fork instead of trying to ssh/rsh/qrsh. That was probably after 1.3a2. We are trying to get 1.3b1 out in the next few days which will have this fix. In the meanwhile, can you try out the nightly snapshot: http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra
>
> Indeed, the local startup is gone; it just forks. Okay.
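>
> For reference, the parallel environment behind `-pe mpich` here looks roughly like this (trimmed `qconf -sp mpich` output; `control_slaves TRUE` is what allows `qrsh -inherit` later on, the remaining values are site-specific and omitted):
>
> control_slaves     TRUE
> job_is_first_task  FALSE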
>
>
>>> a) Use of "SGE's internal launchers" (i.e. `qrsh ...` instead of a
>>> plain "ssh/rsh")
>>>
>>> This looks as shown above: all started master and slave processes
>>> are children of the `sge_execd`. The advantage is that all processes
>>> of a job can be removed by a single `qdel`. You will also get correct
>>> accounting of the memory and time consumed by a job, as SGE can
>>> track each of sge_execd's children (this is called a tight integration).
>>
>> I see, good point. Sounds like the only change that's required is to pass qrsh as the bootstrap executable (-bootstrap-exec qrsh), apart from using a newer version of Hydra as described above. Let me know if that works and I'll add sge as a bootstrap server which automatically does this.
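>>
>> In a job script I'd expect that to look roughly like this (untested sketch; "hosts" stands for a machinefile built from SGE's host list):
>>
>>    mpiexec -bootstrap rsh -bootstrap-exec qrsh -machinefile hosts -n $NSLOTS ./mpihello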
>
> Besides calling `qrsh`, it needs the argument "-inherit", as without it a new job would be created; we just want to use an already granted slot of our parallel job. It would be nice if this worked out of the box with MPICH2. As with the pbs bootstrap server, the node names and the slots per node can be read from the file the environment variable $PE_HOSTFILE points to. The format looks like:
>
> node14 1 parallel@node14 UNDEFINED
> node20 1 parallel@node20 UNDEFINED
>
> Hence the first column is the name of the node, and the second is the number of granted slots on this machine. (The other columns contain the name of the queue and the chosen core(s), if the job was submitted with processor core binding.)
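>
> Until this works out of the box, it can be scripted in the job script itself. A sketch, relying only on the two columns described above (the wrapper name is arbitrary; $TMPDIR and $NSLOTS are set by SGE):
>
> cat > "$TMPDIR/qrsh-inherit" <<'EOF'
> #!/bin/sh
> # rsh stand-in: reuse a granted slot instead of creating a new job
> exec qrsh -inherit "$@"
> EOF
> chmod 755 "$TMPDIR/qrsh-inherit"
>
> # build a Hydra machinefile ("host:slots" per line) from $PE_HOSTFILE
> awk '{print $1 ":" $2}' "$PE_HOSTFILE" > "$TMPDIR/machines"
>
> mpiexec -bootstrap rsh -bootstrap-exec "$TMPDIR/qrsh-inherit" \
>         -machinefile "$TMPDIR/machines" -n "$NSLOTS" ./mpihello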
>
>
>>> b) Use of "builtin" starter in an already tight integration
>>>
>>> Originally, slave tasks were started by `rsh`. If you need X11
>>> forwarding or a large number of slave tasks (rsh has a certain limit
>>> on file descriptors), `ssh` can be used instead. This of course means
>>> setting up host-based or passphraseless authentication for the slave
>>> tasks. Both methods (rsh/ssh) use a random port per job and per node
>>> to start the slave processes (a dedicated rshd/sshd is started for
>>> each slave process, so the system-wide daemons don't need to run all
>>> the time; i.e., rshd can be disabled in /etc/xinetd.d/rshd, and SGE
>>> can still use rsh). Whether the started slaves need any ports of
>>> their own is a different matter.
>>>
>>> The "builtin" method does not need a random port, allows also a
>>> larger number of file descriptors and need no authorization setup.
>>> X11 forwarding should be added later.
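>>>
>>> (Selecting it is just a matter of the cluster configuration; from memory, via `qconf -mconf`:
>>>
>>> rsh_command     builtin
>>> rsh_daemon      builtin
>>>
>>> and similarly for the rlogin/qlogin entries.)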
>>
>> By "builtin" here, do you mean using "qrsh" or something else?
>
> Something else: it resembles in some way the Torque task manager's way of starting a slave task on a node. Before it was implemented, SGE used a controlled rsh/ssh to start slave tasks. To illustrate this, here is the process chain for the job from my previous email on the master node, with `ssh` set up as the startup method in SGE:
>
> 31829 sgeadmin /usr/sge/bin/lx24-x86/sge_execd
> 31947 sgeadmin  \_ sge_shepherd-1789 -bg
> 31970 reuti     |   \_ /bin/sh /var/spool/sge/pc15370/job_scripts/1789
> 31971 reuti     |       \_ mpiexec -bootstrap rsh -bootstrap-exec rsh -machinefi
> 31972 reuti     |           \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15370 /home
> 31990 reuti     |           |   \_ /usr/bin/ssh -X -p 48357 pc15370.Chemie.Uni-M
> 31973 reuti     |           \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15381 /home
> 31991 reuti     |               \_ /usr/bin/ssh -X -p 50799 pc15381.Chemie.Uni-M
> 31988 sgeadmin  \_ sge_shepherd-1789 -bg
> 31989 root          \_ sshd: reuti [priv]
> 31997 reuti             \_ sshd: reuti@notty
> 32000 reuti                 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool
> 32074 reuti                     \_ /home/reuti/local/mpich2-1.3a2/bin/hydra_pmi_
> 32079 reuti                         \_ ./mpihello
>
> and the slave:
>
>   7607 sgeadmin /usr/sge/bin/lx24-x86/sge_execd
>   1403 sgeadmin  \_ sge_shepherd-1789 -bg
>   1404 root          \_ sshd: reuti [priv]
>   1409 reuti             \_ sshd: reuti@notty
>   1422 reuti                 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool
>   1440 reuti                     \_ /home/reuti/local/mpich2-1.3a2/bin/hydra_pmi_
>   1445 reuti                         \_ ./mpihello
>
> Dedicated "sshd"s and ports, still a tight integration.
>
> The original email showed the "builtin" method.
>
>
>> I thought qrsh does X-forwarding by default (or does it require us to pass an extra argument?).
>
> No. If it does in any SGE installation you have access to, it is either set up to use `ssh -X`, or you use a direct connection via X11's port 6000+ between your workstation and the node your job runs on. The latter is considered unsafe nowadays and often not possible, as the nodes are on a private subnet without external access.
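>
> For completeness, making qrsh use `ssh -X` is done in the cluster configuration (again via `qconf -mconf`; the exact paths depend on the installation):
>
> rsh_command     /usr/bin/ssh -X
> rsh_daemon      /usr/sbin/sshd -i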
>
> -- Reuti

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

