[MPICH] one shot jobs in mpich2?

Benjamin Rutt rutt at bmi.osu.edu
Fri Jun 17 15:41:15 CDT 2005


Reuti <reuti at staff.uni-marburg.de> writes:

> I got this hint from the MPICH2 team when I wrote the Howto for
> integrating MPICH2 into SGE; it's still not in the MPICH2 documentation.
> If you have already compiled MPICH2 with smpd support, you just have to
> use these lines to start the job (adjust to your environment):
>
> export MPIEXEC_RSH=rsh
> export PATH=/usr/mpich2_smpd/bin:$PATH
> mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines ~/mpihello
>
> Otherwise the default is "ssh -x", AFAIR, but as SGE has private
> rshds dedicated to each job, it's safe to use the "rsh client" with
> "SGE's rshd" in a cluster.

Thank you, I think this is exactly what I want.  It works well for me.
I think this is much easier than any of the other "manage starting and
stopping daemons for just 1 job" solutions, which are so much more
complicated.
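
For anyone who wants to wrap the three quoted lines in a job script, here
is a minimal sketch. It uses only the commands and options quoted above.
The function name run_oneshot and the MPICH2_BIN and DRY_RUN variables are
my own illustrative additions, not part of MPICH2 or SGE. It assumes the
batch system sets NSLOTS and writes $TMPDIR/machines, as in the quoted
example.

```shell
#!/bin/sh
# Sketch only: daemonless ("-rsh -nopm") startup as quoted above.
# Assumptions: NSLOTS and $TMPDIR/machines are provided by the batch
# system; MPICH2_BIN and DRY_RUN are hypothetical helper variables.

run_oneshot() {
    prog=$1

    # Use plain rsh unless the caller already chose a remote shell.
    MPIEXEC_RSH=${MPIEXEC_RSH:-rsh}
    export MPIEXEC_RSH

    # Put the smpd-enabled MPICH2 build first in PATH.
    PATH=${MPICH2_BIN:-/usr/mpich2_smpd/bin}:$PATH
    export PATH

    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} mpiexec -rsh -nopm -n "$NSLOTS" \
        -machinefile "$TMPDIR/machines" "$prog"
}
```

Inside a job script, after the function definition, the call would simply
be: run_oneshot ~/mpihello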

> 1. To start the daemons, you have to give the option "-p $port" in
>    the rsh command to each node
>
> 2. To run the job:
>    mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port ~/mpihello
>
> 3. To shut down: $MPICH2_ROOT/bin/smpd -port $port -shutdown $host
>    (loop over all used hosts)
>
> But on a job abort it would be best to kill not only the daemon
> but the whole process group of the job (kill -9 -- -$pid); otherwise
> the job may survive the death of its parent. I
> also used "-d 0" to keep the smpds bound to the SGE daemons for
> the job, rather than letting them vanish into daemon land. This way I
> don't have to kill anything by hand, as SGE will take care of
> removing the whole process group (for this job) from the node.
> It doesn't matter whether it's an intended end of the program or a
> forced one via "qdel" in the middle of the job.

Thank you for writing this up.  I will look into it if for some reason
the previous solution stops working.
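
For reference, steps 2 and 3 of the quoted list could be sketched in shell
roughly as follows. The smpd and mpiexec options are exactly the ones
quoted above. The function names, the DRY_RUN variable, and the assumption
that the machinefile has one host per line with the hostname in the first
field are mine. Step 1 (starting the daemons) is left out because the
exact rsh command is not spelled out in the quoted text.

```shell
#!/bin/sh
# Sketch only: per-job smpd ring on a private port, steps 2 and 3.
# Assumptions: machinefile has one host per line (hostname = first
# field); DRY_RUN and all function names are hypothetical helpers.

# Print each distinct host in the machinefile once.
job_hosts() {
    awk 'NF { print $1 }' "$1" | sort -u
}

# Step 2: run the program against the smpds listening on $port.
run_with_port() {
    machinefile=$1
    port=$2
    prog=$3
    ${DRY_RUN:+echo} mpiexec -n "$NSLOTS" -machinefile "$machinefile" \
        -port "$port" "$prog"
}

# Step 3: shut down the smpd on every host used by the job.
shutdown_smpds() {
    machinefile=$1
    port=$2
    for host in $(job_hosts "$machinefile"); do
        ${DRY_RUN:+echo} "${MPICH2_ROOT}/bin/smpd" -port "$port" \
            -shutdown "$host"
    done
}
```

Setting DRY_RUN=1 prints the commands instead of running them, which is a
cheap way to check the host loop before touching a real cluster.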

> I really suggest looking into SGE, as the prolog and epilog scripts
> will set up the whole smpd universe for the user, who only has to use
> the correct port number for his/her job. Shutdown will also be
> handled by SGE, which removes the daemons.

Thank you.  Unfortunately, although I have lots of accounts, I only
have root on my workstation and not on any clusters.  Therefore, I
probably cannot install it myself, if I understand what it does
correctly (serve as a batch queueing system among other things).
However, I will take a look at it to see if our existing clusters
(which sorely need a batch scheduler) can make use of it.

Thank you kindly for all these pointers to undocumented features.
Hopefully, they will become "official" and not get phased out. :)
-- 
Benjamin



