[mpich-discuss] MPICH2 Hydra integration with SGE and PBS

Reuti reuti at staff.uni-marburg.de
Sat Aug 7 15:20:07 CDT 2010


Am 07.08.2010 um 20:32 schrieb Pavan Balaji:
> On 08/07/2010 12:58 PM, Reuti wrote:
>>>> job_is_first_task  FALSE
>>> 
>>> I'm not sure I follow this. The script should already only launch
>>> one process (which will be mpiexec) on the first node. mpiexec
>>> will then launch the remaining processes.
>> 
>> SGE will control the number of started slave processes. In the old
>> MPICH(1) it was indeed the case that the started `mpirun` did some
>> work in one of its forks and started only (n-1) slaves. What I
>> observe in MPICH2 with Hydra is the following for a `qsub -pe mpich 2
>> test_mpich.sh`:
> 
> Ah, I see the confusion here. This has been fixed in Hydra recently, so for local node launches Hydra just does a fork instead of trying to ssh/rsh/qrsh. That was probably after 1.3a2. We are trying to get 1.3b1 out in the next few days which will have this fix. In the meanwhile, can you try out the nightly snapshot: http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra

Indeed, the local startup is gone, it just forks. Okay.


>> a) Use of "SGE's internal launchers" (i.e. `qrsh ...` instead of a
>> plain "ssh/rsh")
>> 
>> This looks as shown above, so that all started master and slave
>> processes are kids of the `sge_execd`. The advantage is that all
>> processes of a job can be removed by a single `qdel`. You will also
>> get correct accounting of the consumed memory and time for a job, as
>> SGE can track each of sge_execd's kids (called a tight integration).
> 
> I see, good point. Sounds like the only change that's required is to pass qrsh as the bootstrap executable (-bootstrap-exec qrsh), apart from using a newer version of Hydra as described above. Let me know if that works and I'll add sge as a bootstrap server which automatically does this.

Besides calling `qrsh`, it needs the argument "-inherit", as without it a new job would be created; we just want to use an already granted slot of our parallel job. It would be nice if this worked out-of-the-box with MPICH2. As with the pbs bootstrap server, the node names and slots per node can be read from the file the environment variable $PE_HOSTFILE points to. The format looks like:

node14 1 parallel@node14 UNDEFINED
node20 1 parallel@node20 UNDEFINED

Hence the first column is the name of the node and the second the number of granted slots on that machine. (The remaining columns give the name of the queue and the chosen core(s), if the job was submitted with processor core binding.)
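As a minimal sketch (not part of the original exchange) of what this means for a job script: the two-column part of $PE_HOSTFILE can be turned into a machinefile for Hydra. The "host:slots" machinefile syntax and the mpiexec flag names are assumptions based on the discussion above:

```shell
#!/bin/sh
# Sketch only: convert SGE's $PE_HOSTFILE into a Hydra machinefile.
# Column 1 is the host name, column 2 the number of granted slots;
# Hydra's "host:slots" machinefile syntax is assumed here.
pe_hostfile_to_machinefile() {
    # $1: path to the PE hostfile; machinefile is written to stdout
    awk '{ print $1 ":" $2 }' "$1"
}

# In a job script one would then start Hydra through qrsh -inherit,
# so that the slaves stay kids of sge_execd (tight integration):
#
#   pe_hostfile_to_machinefile "$PE_HOSTFILE" > machines.$JOB_ID
#   mpiexec -bootstrap rsh -bootstrap-exec qrsh_wrapper.sh \
#           -machinefile machines.$JOB_ID -np $NSLOTS ./mpihello
#
# where qrsh_wrapper.sh is a hypothetical two-line helper:
#   exec qrsh -inherit "$@"
```

The wrapper is needed because the "-inherit" argument must reach qrsh in front of the host name, so it cannot simply be appended to the bootstrap executable.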


>> b) Use of "builtin" starter in an already tight integration
>> 
>> Originally, slave tasks were started by an `rsh`. In case you need
>> X11 forwarding or a large number of slave tasks (rsh has a certain
>> limit of file descriptors), `ssh` can be used. This of course means
>> setting up host-based or passphraseless authentication for the slave
>> tasks. Both methods (rsh/ssh) will use a random port per job and per
>> node to start the slave processes (for each slave process a dedicated
>> rshd/sshd is therefore started; the system-wide ones don't need to
>> run all the time, i.e. rshd can be disabled in /etc/xinetd.d/rshd,
>> and SGE can still use rsh). Whether the started slaves need any port
>> of their own is a different thing.
>> 
>> The "builtin" method does not need a random port, also allows a
>> larger number of file descriptors, and needs no authorization setup.
>> X11 forwarding should be added later.
> 
> By "builtin" here, do you mean using "qrsh" or something else?

Something else: it resembles in some ways the function of Torque's task manager for starting a slave task on a node. Before this was implemented, SGE used a controlled rsh/ssh to start slave tasks. To illustrate this, here is the process chain for the job from my previous email, on the master node, with `ssh` set up as the startup method in SGE:

31829 sgeadmin /usr/sge/bin/lx24-x86/sge_execd
31947 sgeadmin  \_ sge_shepherd-1789 -bg
31970 reuti     |   \_ /bin/sh /var/spool/sge/pc15370/job_scripts/1789
31971 reuti     |       \_ mpiexec -bootstrap rsh -bootstrap-exec rsh -machinefi
31972 reuti     |           \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15370 /home
31990 reuti     |           |   \_ /usr/bin/ssh -X -p 48357 pc15370.Chemie.Uni-M
31973 reuti     |           \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15381 /home
31991 reuti     |               \_ /usr/bin/ssh -X -p 50799 pc15381.Chemie.Uni-M
31988 sgeadmin  \_ sge_shepherd-1789 -bg
31989 root          \_ sshd: reuti [priv]
31997 reuti             \_ sshd: reuti@notty
32000 reuti                 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool
32074 reuti                     \_ /home/reuti/local/mpich2-1.3a2/bin/hydra_pmi_
32079 reuti                         \_ ./mpihello

and the slave:

 7607 sgeadmin /usr/sge/bin/lx24-x86/sge_execd
 1403 sgeadmin  \_ sge_shepherd-1789 -bg
 1404 root          \_ sshd: reuti [priv]
 1409 reuti             \_ sshd: reuti@notty
 1422 reuti                 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool
 1440 reuti                     \_ /home/reuti/local/mpich2-1.3a2/bin/hydra_pmi_
 1445 reuti                         \_ ./mpihello

Dedicated "sshd"s and ports, still a tight integration.

The original email showed the "builtin" method.


> I thought qrsh does X-forwarding by default (or does it require us to pass an extra argument?).

No. If it does in any SGE installation you have access to, it is either set up to use `ssh -X`, or you are using a direct connection via X11's port 6000+ between your workstation and the node your job runs on. The latter is considered unsafe nowadays and is often not possible, as the nodes are on a private subnet without external access.
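For reference, a sketch of how such an `ssh -X` setup usually looks in the SGE host configuration (edited with `qconf -mconf`); the exact binary paths are assumptions for a typical Linux installation:

```
# Sketch, assuming standard paths: make SGE start remote tasks via
# ssh with X11 forwarding instead of plain rsh or the builtin method
rsh_command   /usr/bin/ssh -X
rsh_daemon    /usr/sbin/sshd -i
```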

-- Reuti

