[mpich-discuss] processor/memory affinity on quad core systems
Gus Correa
gus at ldeo.columbia.edu
Tue Jul 22 18:28:56 CDT 2008
Hello Franco and list
It seems to me that one of the problems with solutions based on
taskset and numactl is that you need to know
ahead of time which cores will be available/idle and which ones will be
busy when your program starts to run.
The same goes for the nodes on a cluster, for that matter.
On a machine that you use alone or share with a few people, using
taskset or numactl may be ok.
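For the single-user case, I mean something along these lines (the core
and node numbers and the program name below are just placeholders, of course):

   # launch a program pinned to cores 0-3
   taskset -c 0-3 ./my_program

   # or pin both CPUs and memory to NUMA node 0 with numactl
   numactl --physcpubind=0-3 --membind=0 ./my_program

   # an already running process can be (re)pinned by PID
   taskset -cp 2 <pid>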
However, this may not be the way to go on an installation where many
nodes are shared by many users, where the job queue is busy,
and which is perhaps managed by batch / resource management software
(SGE, Torque/PBS, etc.) that may complicate matters even more.
I wonder if the solution just proposed here would work for a job that
runs on a cluster and requests, say, only 2 processors,
particularly if, when you submit the job, you don't know
on which cluster nodes it will eventually run
or which idle cores/processors those nodes will have at runtime.
Even after the jobs are running and the process IDs are assigned, using
taskset or numactl to enforce processor affinity on a multi-user cluster
may require some hairy scripting to chase and match PIDs to cores.
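Just to give an idea of the kind of script I have in mind (a rough
sketch only, assuming the executable is called wrf.exe and that the
cores you pick happen to be free, which you cannot count on in a
shared queue):

   # bind each running wrf.exe rank on this node to a distinct core
   i=0
   for pid in $(pgrep wrf.exe); do
       taskset -cp $i $pid
       i=$((i+1))
   done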
Robert Kubrik kindly pointed out here a long parallel discussion of
this problem on the Beowulf list,
where a number of people advocate that the resource management software,
rather than mpiexec, should take care of processor affinity.
Please see these two threads:
http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00112.html
http://www.beowulf.org/archive/2008-June/021810.html
So, before we have both mpiexec and the resource management software
fighting over who is in charge of processor affinity assignment,
making the lives of users and system administrators
even more complicated and less productive than they already are
in the current scenario of memory contention,
it might be useful to reach a minimal agreement about where
processor affinity control should reside.
At a minimum, I believe all software layers (OS, resource manager, mpiexec)
should let the user or sysadmin choose whether or not to use
whatever processor affinity enforcement features are available.
The current MPICH2/mpiexec policy, which according to Rajeev Thakur
is to delegate processor affinity to the OS scheduler
(see:
http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00090.html),
has the downside that was pointed out in Franco's and other postings.
However, one upside is that at least it doesn't conflict with other
software.
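You can actually see that policy at work by checking the affinity mask
of a running rank; with no explicit binding the mask covers every core,
so the scheduler is free to migrate the process (the PID below is just
an example):

   taskset -p 12345
   # on a 16-core box this typically prints something like:
   # pid 12345's current affinity mask: ffff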
I would love to hear more about this topic, including planned solutions
for the problem,
from the expert subscribers of this list.
Thank you,
Gus Correa
--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
chong tan wrote:
> No easy way with mpiexec, especially if you do mpiexec -n. But this
> should work:
>
> mpiexec numactl --physcpubind N0 <1st of your procs> : \
>         numactl --physcpubind N1 <2nd of your procs> : \
>         ... <same for the rest>
>
> add --membind if you want (and you definitely want it for Opteron).
>
> tan
>
> --- On Tue, 7/22/08, Franco Catalano <franco.catalano at uniroma1.it>
> wrote:
>
> From: Franco Catalano <franco.catalano at uniroma1.it>
> Subject: [mpich-discuss] processor/memory affinity on quad core
> systems
> To: mpich-discuss at mcs.anl.gov
> Date: Tuesday, July 22, 2008, 2:28 AM
>
>Hi,
>Is it possible to ensure processor/memory affinity on mpi jobs launched
>with mpiexec (or mpirun)?
>I am using mpich2 1.0.7 with WRF on a 4-processor Opteron quad-core (16
>cores total) machine, and I have observed a noticeable (more than 20%)
>variability in the time needed to compute a single time step. Taking a
>look at the output of top, I have noticed that the system moves
>processes across the 16 cores with no regard for processor/memory affinity.
>So, when processes run on cores far from their memory, the time
>needed for the time advancement is longer.
>I know that, for example, OpenMPI provides a command line option for
>mpiexec (or mpirun) to enforce affinity binding:
>--mca mpi_paffinity_alone 1
>I have tried this with WRF and it works.
>Is there a way to do this with mpich2?
>Otherwise, I think it would be very useful to include such a
>capability in the next release.
>Thank you for any suggestion.
>
>Franco
>
>--
>____________________________________________________
>Eng. Franco Catalano
>Ph.D. Student
>
>D.I.T.S.
>Department of Hydraulics, Transportation and Roads.
>Via Eudossiana 18, 00184 Rome
>University of Rome "La Sapienza".
>tel: +390644585218
>
>