[mpich-discuss] MPICH2 Hydra integration with SGE and PBS

Reuti reuti at staff.uni-marburg.de
Sat Aug 7 12:58:30 CDT 2010


Hi Pavan,

Am 07.08.2010 um 17:57 schrieb Pavan Balaji:

> Reuti,
> 
> Thanks. We do have some support for PBS and SGE, but I'll be happy to work with you to improve them.
> 
> On 08/06/2010 10:16 AM, Reuti wrote:
>> job_is_first_task  FALSE
> 
> I'm not sure I follow this. The script should already only launch one process (which will be mpiexec) on the first node. mpiexec will then launch the remaining processes.

SGE will control the number of started slave processes. In the old MPICH(1) it was indeed the case that the started `mpirun` did some of the work in one of its forks and started only (n-1) slaves. What I observe in MPICH2 with Hydra for a `qsub -pe mpich 2 test_mpich.sh` is the following:

On the master node of the parallel job:

 7607 sgeadmin /usr/sge/bin/lx24-x86/sge_execd
  548 sgeadmin  \_ sge_shepherd-1784 -bg
  571 reuti     |   \_ /bin/sh /var/spool/sge/pc15381/job_scripts/1784
  572 reuti     |       \_ mpiexec -bootstrap rsh -bootstrap-exec rsh -machinefi
  573 reuti     |           \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15381 /home
  574 reuti     |           \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15370 /home
  589 sgeadmin  \_ sge_shepherd-1784 -bg
  590 reuti         \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc1
  601 reuti             \_ /home/reuti/local/mpich2-1.3a2/bin/hydra_pmi_proxy --
  602 reuti                 \_ ./mpihello

On the slave:

31829 sgeadmin /usr/sge/bin/lx24-x86/sge_execd
27841 sgeadmin  \_ sge_shepherd-1784 -bg
27842 reuti         \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc1
27849 reuti             \_ /home/reuti/local/mpich2-1.3a2/bin/hydra_pmi_proxy --
27850 reuti                 \_ ./mpihello

Hence the `mpiexec` will do one `ssh/rsh` per node, even on the node where the `mpiexec` was issued. So SGE must be told to allow this by setting "job_is_first_task FALSE" in the parallel environment; "job" in this context refers to the job script itself.
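
For reference, such a tightly integrated parallel environment could look like this (a sketch of `qconf -sp mpich`; the start/stop script paths are assumptions for illustration). `control_slaves TRUE` is what permits the `qrsh -inherit` calls in the first place:

    pe_name            mpich
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /usr/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
    stop_proc_args     /usr/sge/mpi/stopmpi.sh
    allocation_rule    $round_robin
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min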


>> *) Note: the final communication method is set up solely in SGE,
>> which can be "builtin", "classic rsh" or also "ssh" (according to the
>> Howto at http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html).
>> From the point of view of the application, it's also possible to
>> instruct it to call "fubar" to reach another node. In SGE it would be
>> necessary in start_proc_args to create a link in $TMPDIR which is
>> named "fubar" and points to SGE's rsh-wrapper. Only inside the
>> rsh-wrapper will the `qrsh -inherit ...` use the method which is set
>> up in SGE to reach another node in the end.
> 
> The website seems to allow using ssh as well. So, why not use ssh? I'm not sure SGE's internal launchers give any benefit compared to ssh (or rsh).

Yes, "ssh" is possible besides "rsh" or "builtin". The term "SGE's internal launchers" can refer to two different things. So I explain both, as I'm not sure to which you refer:

a) Use of "SGE's internal launchers" (i.e. `qrsh ...` instead of a plain "ssh/rsh")

This looks like the process trees shown above: all started master and slave processes are children of the `sge_execd`. The advantage is that a single `qdel` removes all processes of a job. You will also get correct accounting of the memory and time consumed by a job, as SGE can track each of sge_execd's children (this is called a tight integration).
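
The whole job can then be handled at the job level, e.g. for job 1784 from the process trees above:

    qdel 1784        # removes mpiexec and all qrsh-started slaves on all nodes
    qacct -j 1784    # after the job ended: accounting summed over all tasks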

b) Use of "builtin" starter in an already tight integration

Originally, slave tasks were started by an `rsh`. In case you need X11 forwarding or a large number of slave tasks (rsh has a certain limit of file descriptors), `ssh` can be used. This of course means setting up host-based or passphraseless authentication for the slave tasks. Both methods (rsh/ssh) will use a random port per job and per node to start the slave processes: for each slave process a dedicated rshd/sshd is started, so the system-wide daemons don't need to run all the time (i.e. rshd can be disabled in /etc/xinetd.d/rshd, and SGE can still use rsh). Whether the started slaves need any port of their own is a different thing.

The "builtin" method does not need a random port, allows also a larger number of file descriptors and need no authorization setup. X11 forwarding should be added later.

==

The thing I wanted to point out is that in a tight integration the compiled-in starter does not matter. So for MPICH2 and other parallel libraries (a concrete sketch follows the list):

1) the library is configured, compiled and run to issue a call to: rsh / ssh / fubar / what_ever

2) SGE is set up to catch the call to: rsh / ssh / fubar / what_ever

3) SGE will solely use the method set up in SGE's configuration: rsh / ssh / builtin
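
As a sketch of 1) - 3) with Hydra (the wrapper path is an assumption for illustration; $TMPDIR leads the PATH, as SGE sets it up for the job):

    # 1) Hydra is told to call a plain "rsh", as in the process tree above;
    #    the name resolves to the wrapper in $TMPDIR:
    mpiexec -bootstrap rsh -bootstrap-exec rsh \
            -machinefile $TMPDIR/machines -n $NSLOTS ./mpihello

    # 2) start_proc_args created the catch beforehand:
    ln -s /usr/sge/mpi/rsh $TMPDIR/rsh

    # 3) the wrapper translates the call into `qrsh -inherit <host> <cmd>`,
    #    so only the method from SGE's configuration goes over the wire.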


>> but it looks like it will just get the list of nodes from PBS. For
>> the use of the task manager it's still necessary to use an external
>> `mpiexec` from OSC? Are there any plans to have it directly built
>> into MPICH2?
> 
> Correct. PBS support is only available as a resource management kernel (meaning that Hydra will only query it for the available nodes, but not use it to launch processes). Yes, supporting PBS as a bootstrap server is in our plans. See https://trac.mcs.anl.gov/projects/mpich2/ticket/443
> 
> Please feel free to add yourself to the ticket to track progress on it.

Ok, thx for confirmation.

-- Reuti


> 
> -- Pavan
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji


