[MPICH] mpiexec -soft with Sun Grid Engine
Yusong Wang
ywang25 at aps.anl.gov
Fri Jul 21 17:49:13 CDT 2006
It finally works. Thanks for all of your help!
When we run MPI jobs with mpiexec under SGE, the number of processes is
still managed by SGE, so we have to let SGE decide how many processes
are acceptable instead of mpiexec (MPD). In our case, "-pe mpich2 10-20"
works well. It returns the maximum number of available slots between
10 and 20. If fewer than 10 processors are available at submission,
SGE holds the job until the minimum number (10) can be satisfied.
Also, $NSLOTS, which holds the actual number of processes at run time,
can be passed to mpiexec to start the MPI job.
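[For readers of the archive: a minimal job-script sketch along these
lines. The PE name "mpich2" and the $TMPDIR/machines file follow the
Howto referenced below; the job name and program path are placeholders,
and details will vary per site.]

```shell
#!/bin/sh
# Hypothetical SGE job script sketch -- queue/PE names are
# site-specific assumptions, not taken verbatim from this thread.
#$ -N mpi_job
#$ -cwd
#$ -pe mpich2 10-20    # ask SGE for between 10 and 20 slots

# $NSLOTS holds the number of slots SGE actually granted;
# $TMPDIR/machines lists the nodes assigned to this job.
mpiexec -machinefile $TMPDIR/machines -n $NSLOTS ./program
```

Submitted with "qsub jobscript.sh", SGE fills in $NSLOTS and
$TMPDIR/machines once the slot range has been granted.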
Yusong
On Fri, 2006-07-21 at 19:03 +0200, Reuti wrote:
> Hi,
>
> Am 21.07.2006 um 18:02 schrieb Yusong Wang:
>
> > Hi,
> >
> > I am trying to take advantage of the "-soft" option on a busy cluster.
> > In the Appendix D (P362) of "Using MPI-2":
> > mpiexec -n 10 -soft 2:10 program
> > will run the program on a varying number of processes, depending
> > on what is available.
> >
> > But this option doesn't work at all on our system, even though
> > there are
> > enough nodes for the lower bound in the soft option. The version of
> > MPICH2 is: mpich2_mpd_sock v 1.78 2005/09/23
> >
> > I read the python source code for mpiexec and tried:
> > mpiexec -n 10 -soft 2 program
> > It does not work either.
> >
> > I wonder what is available for this option in the current
> > distribution of MPICH2. Also, should mpiexec talk to Sun Grid
> > Engine periodically to see if there are enough resources to run
> > the job with the lower-bound number of CPUs after submission, or
> > does it just check the availability of CPUs once at submission?
>
> please have a look here:
>
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>
> The mpd startup method isn't integrated with SGE for now, so you
> have to use one of the other startup methods.
>
> What you have to request in SGE is a parallel environment, where you
> can give the range of slots as an argument (-pe mpich2 2-10). What
> SGE finally grants your job, you can see in your job script via the
> variable $NSLOTS and the file $TMPDIR/machines, which you should
> also use in your mpiexec call as outlined in the Howto, since at
> that point the number of slots and their nodes are already fixed
> by SGE.
>
> HTH - Reuti
>