[mpich-discuss] processor/memory affinity on quad core systems

Gus Correa gus at ldeo.columbia.edu
Tue Jul 22 18:28:56 CDT 2008


Hello Franco and list

It seems to me that one of the problems with solutions based on
taskset and numactl is that you need to know ahead of time which cores
will be available/idle and which ones will be busy when your program
starts to run.
Likewise for the nodes on a cluster, for that matter.

On a machine that you use alone or share with a few people, using
taskset or numactl may be OK.
However, this may not be the way to go on an installation where many
nodes are shared by many users, or one that has a busy job queue and is
perhaps managed by a batch system / resource manager
(SGE, Torque/PBS, etc.), which may complicate matters even more.
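
Just to make the single-machine case concrete, here is a minimal sketch
of launching one rank per core through numactl, using mpiexec's MPMD
(colon) syntax.  The executable name and the core/node numbers are only
placeholders, and --membind=0 assumes cores 0-3 all belong to NUMA node
0, which may not match your hardware:

   mpiexec -n 1 numactl --physcpubind=0 --membind=0 ./wrf.exe : \
           -n 1 numactl --physcpubind=1 --membind=0 ./wrf.exe : \
           -n 1 numactl --physcpubind=2 --membind=0 ./wrf.exe : \
           -n 1 numactl --physcpubind=3 --membind=0 ./wrf.exe

Check the actual core-to-node layout with "numactl --hardware" before
hard-coding anything like this.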

I wonder whether the solution just proposed here would work for a job
that runs on a cluster requesting, say, only 2 processors, particularly
if at submission time you don't know which cluster nodes the job will
eventually run on, or which cores/processors will be idle on them at
runtime.
Even after the jobs are running and the process IDs are assigned, using
taskset or numactl to enforce processor affinity on a multi-user cluster
may require some hairy scripting to chase down and match PIDs to cores.
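
Just to illustrate the kind of after-the-fact scripting I mean, one
could do something along these lines, where the executable name
"wrf.exe" is purely illustrative and the PID-to-core mapping is
essentially arbitrary:

   core=0
   for pid in $(pgrep -u "$USER" wrf.exe); do
       taskset -p -c $core $pid   # rebind this already-running rank to one core
       core=$((core + 1))
   done

This says nothing about memory locality, of course, and it falls apart
as soon as another job (or another user) runs the same executable on
the node.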

Robert Kubrik kindly pointed out here a long parallel discussion of
this problem on the Beowulf list, where a number of people advocate
that the resource management software, rather than mpiexec, should take
care of processor affinity.
Please see these two threads:

http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00112.html
http://www.beowulf.org/archive/2008-June/021810.html

So, before we have both mpiexec and the resource management software
fighting over who is in charge of processor affinity assignment, making
the life of users and system administrators even more complicated and
less productive than it is in the current scenario of memory contention,
it might be useful to reach a minimal agreement about where processor
affinity control should reside.

At a minimum, I believe all software layers (OS, resource manager,
mpiexec) should allow the user or system administrator to choose
whether or not to use whatever processor affinity enforcement features
are available.

The current MPICH2/mpiexec policy, which according to Rajeev Thakur
is to delegate processor affinity to the OS scheduler
(see:
http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00090.html),
has the downside that was pointed out in Franco's and other postings.
However, one upside is that at least it doesn't conflict with other
software.

I would love to hear more about this topic, including planned solutions 
for the problem,
from the expert subscribers of this list.

Thank you,
Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


chong tan wrote:

> no easy way with mpiexec, especially if you do mpiexec -n.  But this
> should work:
>
> mpiexec numactl --physcpubind N0 <1st of your procs> :
>         numactl --physcpubind N1 <2nd of your procs> :
>         <same for the rest>
>
> add --membind if you want (and you definitely want it for Opteron).
>
> tan
>
> --- On Tue, 7/22/08, Franco Catalano <franco.catalano at uniroma1.it> wrote:
>
>     From: Franco Catalano <franco.catalano at uniroma1.it>
>     Subject: [mpich-discuss] processor/memory affinity on quad core
>     systems
>     To: mpich-discuss at mcs.anl.gov
>     Date: Tuesday, July 22, 2008, 2:28 AM
>
>Hi,
>Is it possible to ensure processor/memory affinity on mpi jobs launched
>with mpiexec (or mpirun)?
>I am using mpich2 1.0.7 with WRF on a 4 processor Opteron quad core (16
>cores total) machine and I have observed a noticeable (more than 20%)
>variability in the time needed to compute a single time step. Taking a
>look at the output of top, I have noticed that the system moves
>processes across the 16 cores regardless of processor/memory affinity. So,
>when processes are running on cores away from their memory, the time
>needed for the time advancement is longer.
>I know that, for example, OpenMPI provides a command line option for
>mpiexec (or mpirun) to ensure the affinity binding:
>--mca mpi_paffinity_alone 1
>I have tried this with WRF and it works.
>Is there a way to do this with mpich2?
>Otherwise, I think it would be very useful to include such a
>capability in the next release.
>Thank you for any suggestion.
>
>Franco
>
>-- 
>____________________________________________________
>Eng. Franco Catalano
>Ph.D. Student
>
>D.I.T.S.
>Department of Hydraulics, Transportation and Roads.
>Via Eudossiana 18, 00184 Rome 
>University of Rome "La Sapienza".
>tel: +390644585218
>
>



