[mpich-discuss] processor/memory affinity on quad core systems

Gus Correa gus at ldeo.columbia.edu
Wed Jul 23 16:58:05 CDT 2008


Hi Bill and list

The proposed mpiexec extensions, and the section
"Support for Multithreaded and Multicore Applications", look very good.
I hope that a consensus will emerge,
and that a solution will be included, perhaps in the next release of MPICH2.

On behalf of all the folks that have been posting messages about this topic:
Many thanks!

It takes a real issue to catch the attention of the chief architect,
and I'm glad that it did.  :)

Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


William Gropp wrote:

> I've ported the mpiexec extensions web page over to the wiki and added 
> a strawman for affinity.  I don't think what I proposed is the right 
> way to do affinity, but I hope it will help get the discussion 
> started.  The wiki page 
> is http://wiki.mcs.anl.gov/mpich2/index.php/Proposed_MPIEXEC_Extensions .
>
> Bill
>
> On Jul 23, 2008, at 11:55 AM, William Gropp wrote:
>
>> This is a good point - mpiexec needs to work with the resource 
>> manager when possible.  In fact, the design for the gforker and 
>> remshell mpiexec (and any that make use of the same supporting 
>> routines) includes a resource manager step, though this is currently 
>> usually empty.  
>>
>> Here's what might happen in a full mpiexec implementation:
>>
>> 1) parse the arguments, looking for such things as -n (number of 
>> processes) or specific assignment to resources (e.g., a node file 
>> delivered by a batch system)
>>
>> 2) contact the relevant resource manager, acquiring the requested 
>> resources.  Note that in the batch case, there may be nothing to do 
>> here (the resources were already assigned and provided in step 1). 
>> In the case of the mpd version of mpiexec, this step usually returns 
>> a list of node names.  (A concrete sketch of steps 1 and 2 appears 
>> after step 4 below.)
>>
>> 3) negotiate with the resource manager to see who is responsible for 
>> setting various resource controls.  This includes affinity, priority 
>> (e.g., the old nice setting), etc.  It also needs to coordinate with 
>> any thread package (e.g., if OpenMP is being used, the resource 
>> manager and mpiexec need to take that into account).
>>
>> 4) mpiexec must set appropriate resource controls, but avoid 
>> conflicting with the resource/process manager.
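>>
>> As a concrete example of steps 1 and 2 as they work today (a minimal 
>> sketch, assuming a Torque/PBS batch environment and the mpd-based 
>> mpiexec; the executable name and process count are just placeholders):
>>
>>     # inside a PBS/Torque job script; the scheduler fills in PBS_NODEFILE
>>     mpdboot -n $(sort -u $PBS_NODEFILE | wc -l) -f $PBS_NODEFILE
>>     mpiexec -machinefile $PBS_NODEFILE -n 16 ./a.out
>>     mpdallexit
>>
>> Steps 3 and 4 are where affinity and priority control would have to be 
>> added, and are what the strawman on the wiki page tries to address.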
>>
>> Note that because of the wide variety of resource and process 
>> managers, the reality is that mpiexec must adapt to the existing 
>> environment - it must be able to manage resources such as affinity 
>> and priority, but must also be able to defer to richer environments 
>> that already manage them.
>>
>> Bill
>>
>>
>> On Jul 22, 2008, at 6:28 PM, Gus Correa wrote:
>>
>>> Hello Franco and list
>>>
>>> It looks to me like one of the problems with solutions based on
>>> taskset and numactl is that you need to know ahead of time which
>>> cores will be available/idle and which ones will be busy when your
>>> program starts to run.
>>> Likewise for the nodes on a cluster, for that matter.
>>>
>>> On a machine that you use alone or share with a few people, using
>>> taskset or numactl may be OK.
>>> However, this may not be the way to go on an installation where many
>>> nodes are shared by many users, or that has a busy job queue and is
>>> perhaps managed by a batch system / resource management software
>>> (SGE, Torque/PBS, etc.), which may complicate matters even more.
>>>
>>> I wonder if the solution just proposed here would work for a job that
>>> runs on a cluster requesting, say, only 2 processors, particularly if
>>> you don't know when you submit the job which cluster nodes the job
>>> will eventually run on, and which idle cores/processors they will have
>>> at runtime.
>>> Even after the jobs are running and the process IDs are assigned, using
>>> taskset or numactl to enforce processor affinity on a multi-user cluster
>>> may require some hairy scripting to chase and match PIDs to cores.
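>>>
>>> Just to illustrate what that scripting ends up looking like (a rough
>>> sketch, assuming the ranks are the only copies of an executable named
>>> wrf.exe that you own on that node):
>>>
>>>     cpu=0
>>>     for pid in $(pgrep -u $USER wrf.exe); do
>>>         taskset -cp $cpu $pid    # pin this already-running rank to one core
>>>         cpu=$((cpu + 1))
>>>     done
>>>
>>> and even that assumes the cores you are pinning to are actually the
>>> idle ones.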
>>>
>>> Robert Kubrik kindly pointed out here a looong parallel discussion of
>>> this problem on the
>>> Beowulf list, where a number of people advocate that the resource
>>> management software,
>>> rather than mpiexec, should take care of processor affinity.
>>> Please, see these two threads:
>>>
>>> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00112.html
>>> http://www.beowulf.org/archive/2008-June/021810.html
>>>
>>> So, before we have both mpiexec and resource management software
>>> fighting over who is in charge of processor affinity assignment,
>>> making the life of users and system administrators even more
>>> complicated and less productive than it is in the current scenario
>>> of memory contention, it might be useful to reach a minimal agreement
>>> about where the processor affinity control should reside.
>>>
>>> At a minimum, I believe, all software layers (OS, resource manager,
>>> mpiexec) should allow the user or sysadmin to choose whether or not to
>>> use whatever processor affinity enforcement features are available.
>>>
>>> The current MPICH2/mpiexec policy, which according to Rajeev Thakur
>>> is to delegate processor affinity to the OS scheduler
>>> (see:
>>> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00090.html),
>>>
>>> has the downside that was pointed out in Franco's and other postings.
>>> However, one upside of it is that at least it doesn't conflict with
>>> other software.
>>>
>>> I would love to hear more about this topic, including planned solutions
>>> for the problem,
>>> from the expert subscribers of this list.
>>>
>>> Thank you,
>>> Gus Correa
>>>
>>> --
>>> ---------------------------------------------------------------------
>>> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>>> Lamont-Doherty Earth Observatory - Columbia University
>>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>>> ---------------------------------------------------------------------
>>>
>>>
>>> chong tan wrote:
>>>
>>>> There is no easy way with mpiexec, especially if you do mpiexec -n,
>>>> but this should work:
>>>>
>>>> mpiexec numactl --physcpubind=N0 <1st of your procs> :
>>>>         numactl --physcpubind=N1 <2nd of your procs> :
>>>>         ... <same for the rest>
>>>>
>>>> Add --membind if you want (and you definitely want it for Opteron).
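>>>>
>>>> For instance, on a 4-socket quad-core Opteron where cores 0-3 sit on
>>>> memory node 0, cores 4-7 on node 1, and so on (check your actual
>>>> layout with numactl --hardware; this numbering is only an assumption),
>>>> a 4-process run pinned one rank per socket would look roughly like:
>>>>
>>>>     mpiexec -n 1 numactl --physcpubind=0  --membind=0 ./a.out : \
>>>>             -n 1 numactl --physcpubind=4  --membind=1 ./a.out : \
>>>>             -n 1 numactl --physcpubind=8  --membind=2 ./a.out : \
>>>>             -n 1 numactl --physcpubind=12 --membind=3 ./a.out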
>>>>
>>>> tan
>>>>
>>>> --- On Tue, 7/22/08, Franco Catalano <franco.catalano at uniroma1.it>
>>>> wrote:
>>>>
>>>>     From: Franco Catalano <franco.catalano at uniroma1.it>
>>>>     Subject: [mpich-discuss] processor/memory affinity on quad core
>>>>     systems
>>>>     To: mpich-discuss at mcs.anl.gov
>>>>     Date: Tuesday, July 22, 2008, 2:28 AM
>>>>
>>>> Hi,
>>>> Is it possible to ensure processor/memory affinity on mpi jobs launched
>>>> with mpiexec (or mpirun)?
>>>> I am using mpich2 1.0.7 with WRF on a 4-processor Opteron quad core (16
>>>> cores total) machine, and I have observed a noticeable (more than 20%)
>>>> variability in the time needed to compute a single time step. Taking a
>>>> look at the output of top, I have noticed that the system moves
>>>> processes over the 16 cores regardless of processor/memory 
>>>> affinity. So,
>>>> when processes are running on cores away from their memory, the time
>>>> needed for the time advancement is longer.
>>>> I know that, for example, OpenMPI provides a command line option for
>>>> mpiexec (or mpirun) to ensure the affinity binding:
>>>> --mca mpi_paffinity_alone 1
>>>> I have tried this with WRF and it works.
>>>> Is there a way to do this with mpich2?
>>>> Otherwise, I think that it would be very useful to include such a
>>>> capability in the next release.
>>>> Thank you for any suggestion.
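>>>>
>>>> (For reference, the full Open MPI command line I used is along these
>>>> lines, with the rank count matching my 16 cores:
>>>>
>>>>     mpirun --mca mpi_paffinity_alone 1 -np 16 ./wrf.exe
>>>>
>>>> and with that each rank stays on its own core for the whole run.)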
>>>>
>>>> Franco
>>>>
>>>> --
>>>> ____________________________________________________
>>>> Eng. Franco Catalano
>>>> Ph.D. Student
>>>>
>>>> D.I.T.S.
>>>> Department of Hydraulics, Transportation and Roads.
>>>> Via Eudossiana 18, 00184 Rome
>>>> University of Rome "La Sapienza".
>>>> tel: +390644585218
>>>>
>>>>
>>>
>>
>> William Gropp
>> Paul and Cynthia Saylor Professor of Computer Science
>> University of Illinois Urbana-Champaign
>>
>>
>
> William Gropp
> Paul and Cynthia Saylor Professor of Computer Science
> University of Illinois Urbana-Champaign
>
>



