[mpich-discuss] processor/memory affinity on quad core systems

William Gropp gropp at mcs.anl.gov
Wed Jul 23 11:55:18 CDT 2008


This is a good point - mpiexec needs to work with the resource  
manager when possible.  In fact, the design for the gforker and  
remshell mpiexec (and any that make use of the same supporting  
routines) includes a resource manager step, though at present this  
step is usually empty.

Here's what might happen in a full mpiexec implementation:

1) parse the arguments, looking for such things as -n (number of  
processes) or specific assignment to resources (e.g., a node file  
delivered by a batch system)

2) contact the relevant resource manager, acquiring the requested  
resources.  Note that in the batch case, there may be nothing to do  
here (the resources were already assigned and provided in step 1).   
In the case of the mpd version of mpiexec, this step usually returns  
a list of node names.

3) negotiate with the resource manager to see who is responsible for  
setting various resource controls.  This includes affinity, priority  
(e.g., the old nice setting), etc.  It also needs to coordinate with  
any thread package (e.g., if OpenMP is being used, the resource  
manager and mpiexec need to take that into account).

4) mpiexec must set appropriate resource controls, but avoid  
conflicting with the resource/process manager.

Note that because of the wide variety of resource and process  
managers, the reality is that mpiexec must adapt to the existing  
environment - it must be able to manage resources such as affinity  
and priority, but must also be able to defer to richer environments  
that already manage them.
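
To make step 4 a bit more concrete, here is a rough sketch - not MPICH  
code, just an illustration, assuming Linux, glibc's CPU_* macros, and  
the sched_getaffinity/sched_setaffinity calls - of how a forking  
launcher might bind each child to a core while deferring whenever the  
inherited mask shows that something upstream has already restricted  
the process:

#define _GNU_SOURCE            /* for CPU_* macros and sched_*affinity */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Bind the calling (child) process to one core, chosen round-robin by
   rank, but only if the affinity mask we inherited still allows every
   online CPU - i.e., nobody upstream (batch system, resource manager)
   has constrained us already. */
static void maybe_bind(int rank)
{
    cpu_set_t mask;
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    if (ncpus <= 0 || sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return;                /* can't tell - leave things alone */
    if (CPU_COUNT(&mask) < ncpus)
        return;                /* already restricted - defer to it */

    CPU_ZERO(&mask);
    CPU_SET(rank % ncpus, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

/* usage: launcher <nprocs> <program> [args...]   (illustration only) */
int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <nprocs> <program> [args...]\n", argv[0]);
        return 1;
    }
    int nprocs = atoi(argv[1]);

    for (int rank = 0; rank < nprocs; rank++) {
        pid_t pid = fork();
        if (pid == 0) {        /* child: set affinity, then exec */
            maybe_bind(rank);
            execvp(argv[2], &argv[2]);
            _exit(127);
        }
    }
    while (wait(NULL) > 0)     /* parent: wait for all children */
        ;
    return 0;
}

A real implementation would of course take the placement from the  
negotiation in steps 2 and 3 rather than a simple round-robin, and on  
NUMA machines it would also have to deal with memory binding (as the  
numactl example quoted below does).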

Bill


On Jul 22, 2008, at 6:28 PM, Gus Correa wrote:

> Hello Franco and list
>
> It looks to me like one of the problems with solutions based on
> taskset and numactl is that you need to know ahead of time which
> cores will be available/idle and which ones will be busy when your
> program starts to run.
> Likewise for the nodes on a cluster, for that matter.
>
> On a machine that you use alone or share with a few people, using
> taskset or numactl may be OK.
> However, this may not be the way to go on an installation where many
> nodes are shared by many users, or one with a busy job queue,
> perhaps managed by a batch system / resource management software
> (SGE, Torque/PBS, etc.), which may complicate matters even more.
>
> I wonder if the solution just proposed here would work for a job that
> runs on a cluster requesting, say, only 2 processors, particularly if
> you don't know when you submit the job which cluster nodes the job
> will eventually run on, and which idle cores/processors they will
> have at runtime.
> Even after the jobs are running and the process IDs are assigned,
> using taskset or numactl to enforce processor affinity on a
> multi-user cluster may require some hairy scripting to chase and
> match PIDs to cores.
>
> Robert Kubrik kindly pointed out here a looong parallel discussion of
> this problem on the Beowulf list, where a number of people advocate
> that the resource management software, rather than mpiexec, should
> take care of processor affinity.
> Please see these two threads:
>
> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00112.html
> http://www.beowulf.org/archive/2008-June/021810.html
>
> So, before we have both mpiexec and resource management software
> fighting over who is in charge of processor affinity assignment,
> making the life of users and system administrators even more
> complicated and less productive than it is in the current scenario
> of memory contention, it might be useful to get to a minimal
> agreement about where the processor affinity control should reside.
>
> Minimally, I believe all software layers (OS, resource manager,
> mpiexec) should allow the user or sysadmin to choose whether or not
> to use whatever processor affinity enforcement features are
> available.
>
> The current MPICH2/mpiexec policy, which according to Rajeev Thakur
> is to delegate processor affinity to the OS scheduler
> (see:
> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00090.html),
>
> has the downside that was pointed out in Franco's and other postings.
> However, one upside of it is that at least it doesn't conflict with
> other software.
>
> I would love to hear more about this topic, including planned  
> solutions
> for the problem,
> from the expert subscribers of this list.
>
> Thank you,
> Gus Correa
>
> --
> ---------------------------------------------------------------------
> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> Lamont-Doherty Earth Observatory - Columbia University
> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
> chong tan wrote:
>
>> no easy way with mpiexec, especially if you do plain mpiexec -n.
>> But this should work:
>>
>> mpiexec numactl --physcpubind=N0 <1st of your procs> : \
>>         numactl --physcpubind=N1 <2nd of your procs> : \
>>         <same for the rest>
>>
>> add --membind if you want (and you definitely want it for Opteron).
>>
>> tan
>>
>> --- On Tue, 7/22/08, Franco Catalano
>> <franco.catalano at uniroma1.it> wrote:
>>
>>     From: Franco Catalano <franco.catalano at uniroma1.it>
>>     Subject: [mpich-discuss] processor/memory affinity on quad core
>>     systems
>>     To: mpich-discuss at mcs.anl.gov
>>     Date: Tuesday, July 22, 2008, 2:28 AM
>>
>> Hi,
>> Is it possible to ensure processor/memory affinity on mpi jobs  
>> launched
>> with mpiexec (or mpirun)?
>> I am using mpich2 1.0.7 with WRF on a 4-processor Opteron quad core
>> (16 cores total) machine and I have observed a considerable (more
>> than 20%) variability in the time needed to compute a single time
>> step.  Taking a look at the output of top, I have noticed that the
>> system moves processes across the 16 cores regardless of
>> processor/memory affinity.  So, when processes are running on cores
>> away from their memory, the time needed for the time advancement is
>> longer.
>> I know that, for example, OpenMPI provides a command line option for
>> mpiexec (or mpirun) to ensure the affinity binding:
>> --mca mpi_paffinity_alone 1
>> I have tried this with WRF and it works.
>> Is there a way to do this with mpich2?
>> Otherwise, I think that it would be very useful to include such a
>> capability in the next release.
>> Thank you for any suggestion.
>>
>> Franco
>>
>> --
>> ____________________________________________________
>> Eng. Franco Catalano
>> Ph.D. Student
>>
>> D.I.T.S.
>> Department of Hydraulics, Transportation and Roads.
>> Via Eudossiana 18, 00184 Rome
>> University of Rome "La Sapienza".
>> tel: +390644585218
>>
>>
>

William Gropp
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign

