[MPICH] one shot jobs in mpich2?
Benjamin Rutt
rutt at bmi.osu.edu
Fri Jun 17 15:41:15 CDT 2005
Reuti <reuti at staff.uni-marburg.de> writes:
> I got the hint from the MPICH2 team when I wrote the Howto for integrating
> it into SGE; it's still not in the MPICH2 documentation. If you have already
> compiled MPICH2 with smpd operation, you just have to use these lines to start
> the job (adjust to your environment):
>
> export MPIEXEC_RSH=rsh
> export PATH=/usr/mpich2_smpd/bin:$PATH
> mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines ~/mpihello
>
> The default is "ssh -x" AFAIR otherwise, but as SGE has private
> rshds dedicated to each job, it's safe to use the "rsh client" with
> "SGE's rshd" in a cluster.
Thank you, I think this is exactly what I want. It works well for me.
I think this is much easier than any of the other "manage starting and
stopping daemons for just 1 job" solutions, which are so much more
complicated.
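
For anyone finding this in the archive, here is a rough sketch of how the
three quoted lines might sit in an SGE job script. It is only a sketch:
build_cmd is a hypothetical helper of my own, the fallback values for NSLOTS
and TMPDIR exist only so the script can be tried outside a job, and I am
assuming that JOB_ID is set only inside an SGE job. The paths are the ones
from the quoted example and will need adjusting.

```shell
#!/bin/sh
# Sketch of an SGE job script for the daemonless (-rsh -nopm) startup.
# NSLOTS, TMPDIR and JOB_ID are set by SGE inside a job; the fallbacks
# below are assumptions so the script can be tried outside a job.
: "${NSLOTS:=1}"
: "${TMPDIR:=/tmp}"

export MPIEXEC_RSH=rsh
export PATH=/usr/mpich2_smpd/bin:$PATH   # adjust to your installation

# build_cmd is a hypothetical helper that only prints the command line.
build_cmd() {
    echo "mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines $HOME/mpihello"
}

if [ -n "${JOB_ID:-}" ]; then
    # Inside an SGE job: start the program.
    mpiexec -rsh -nopm -n "$NSLOTS" -machinefile "$TMPDIR/machines" "$HOME/mpihello"
else
    # Outside SGE: only show what would be run.
    build_cmd
fi
```

Run by hand outside SGE, this only prints the mpiexec line, which is a cheap
way to check the slot count and machinefile path before submitting.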
> 1. To start the daemons, add the option "-p $port" to the rsh command
> issued to each node
>
> 2. To run the job:
> mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port ~/mpihello
>
> 3. To shut down: $MPICH2_ROOT/bin/smpd -port $port -shutdown $host
> (loop over all used hosts)
>
> But it would be best not to kill only the daemon for a job abort,
> but the whole process group of the job (kill -9 -- -$pid), otherwise
> in case of an abort the job may survive the death of the parent. I
> also used "-d 0" to get the smpds still bound to the SGE daemons for
> the job and not to let them vanish into daemon land. This way I
> don't have to kill anything by hand, as SGE will take care of
> removing all the process group stuff (for this job) from the node.
> Whether it's an intended end of the program, or a forced one via
> "qdel" in the middle of the job doesn't matter.
Thank you for writing this up. I will look into this if for some reason
the previous solution stops working.
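
In case it helps someone else, the "loop over all used hosts" in step 3
could be sketched as below. This is untested against a real smpd ring.
shutdown_cmds is a hypothetical helper, and I am assuming the machinefile
has one host per line, possibly repeated and possibly with a ":n" suffix.
The smpd command itself is exactly the one quoted above.

```shell
#!/bin/sh
# Print one smpd shutdown command per distinct host in a machinefile.
# Assumed machinefile format: one host per line, optionally "host:n".
# shutdown_cmds is a hypothetical helper, not part of MPICH2 or SGE.
shutdown_cmds() {
    machinefile=$1
    port=$2
    # Take the first field, drop any ":n" suffix, skip blank lines,
    # then de-duplicate so each host is shut down only once.
    awk 'NF { sub(/:.*/, "", $1); if ($1 != "") print $1 }' "$machinefile" |
    sort -u |
    while read -r host; do
        echo "$MPICH2_ROOT/bin/smpd -port $port -shutdown $host"
    done
}
```

Called as "shutdown_cmds $TMPDIR/machines $port", it only prints the
commands; piping the output to sh would actually run them. Printing first
makes it easy to check the host list before touching any daemons.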
> I really suggest looking into SGE, as the prolog and epilog scripts
> will set up the whole smpd universe for the user, who only has to use
> the correct port number for his/her job. Also shutdown will be
> handled by SGE to remove the daemons.
Thank you. Unfortunately, although I have lots of accounts, I only
have root on my workstation and not on any clusters. Therefore, I
probably cannot install it myself, if I understand what it does
correctly (serve as a batch queueing system among other things).
However, I will take a look at it to see if our existing clusters
(which sorely need a batch scheduler) can make use of it.
Thank you kindly for all these pointers to undocumented features.
Hopefully, they will become "official" and not get phased out. :)
--
Benjamin