[mpich-discuss] processor/memory affinity on quad core systems
William Gropp
gropp at mcs.anl.gov
Wed Jul 23 14:28:21 CDT 2008
I've ported the mpiexec extensions web page over to the wiki and
added a strawman for affinity. I don't think what I proposed is the
right way to do affinity, but I hope it will help get the discussion
started. The wiki page is
http://wiki.mcs.anl.gov/mpich2/index.php/Proposed_MPIEXEC_Extensions .
Bill
On Jul 23, 2008, at 11:55 AM, William Gropp wrote:
> This is a good point - mpiexec needs to work with the resource
> manager when possible. In fact, the design for the gforker and
> remshell mpiexec (and any that make use of the same supporting
> routines) includes a resource manager step, though this is
> currently usually empty.
>
> Here's what might happen in a full mpiexec implementation:
>
> 1) parse the arguments, looking for such things as -n (number of
> processes) or specific assignment to resources (e.g., a node file
> delivered by a batch system)
>
> 2) contact the relevant resource manager, acquiring the requested
> resources. Note that in the batch case, there may be nothing to do
> here (the resources were already assigned and provided in step 1).
> In the case of the mpd version of mpiexec, this step usually
> returns a list of node names.
>
> 3) negotiate with the resource manager to see who is responsible
> for setting various resource controls. This includes affinity,
> priority (e.g., the old nice setting), etc. It also needs to
> coordinate with any thread package (e.g., if OpenMP is being used,
> the resource manager and mpiexec need to take that into account).
>
> 4) mpiexec must set appropriate resource controls, but avoid
> conflicting with the resource/process manager.
>
> Note that because of the wide variety of resource and process
> managers, the reality is that mpiexec must adapt to the existing
> environment - it must be able to manage resources such as affinity
> and priority, but must also be able to defer to richer environments
> that already manage that.
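>
> As a rough illustration of step 4, here is a minimal sketch of how a
> simple forker-style launcher (or, lacking built-in support, a user)
> could pin each local rank to a core on Linux. It is only a sketch:
> the PMI_RANK variable name and the round-robin rank-to-core mapping
> are assumptions that vary with the process manager.
>
>    #!/bin/sh
>    # bind_rank.sh -- hypothetical per-rank wrapper; assumes Linux,
>    # the taskset utility, and that the launcher exports a rank
>    # variable such as PMI_RANK (names differ between launchers).
>    RANK=${PMI_RANK:-0}
>    NCORES=$(getconf _NPROCESSORS_ONLN)  # cores visible on this node
>    CORE=$(( RANK % NCORES ))            # simple round-robin mapping
>    exec taskset -c "$CORE" "$@"         # pin, then run the real program
>
>    # illustrative usage:  mpiexec -n 16 ./bind_rank.sh ./a.out
>
> A real mpiexec would do the equivalent internally (e.g., via
> sched_setaffinity) and only after confirming, per step 3, that no
> other layer is already managing affinity.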
>
> Bill
>
>
> On Jul 22, 2008, at 6:28 PM, Gus Correa wrote:
>
>> Hello Franco and list
>>
>> It seems to me that one of the problems with solutions based on
>> taskset and numactl is that you need to know ahead of time which
>> cores will be available/idle and which will be busy when your
>> program starts to run.
>> Likewise for the nodes on a cluster, for that matter.
>>
>> On a machine that you use alone or share with a few people, using
>> taskset or numactl may be ok.
>> However, it may not be the way to go on an installation where many
>> nodes are shared by many users, or that has a busy job queue and is
>> perhaps managed by a batch system / resource management software
>> (SGE, Torque/PBS, etc.), which may complicate matters even more.
>>
>> I wonder whether the solution just proposed here would work for a
>> job that runs on a cluster requesting, say, only 2 processors,
>> particularly since at submission time you don't know on which
>> cluster nodes the job will eventually run, or which idle
>> cores/processors those nodes will have at runtime.
>> Even after the jobs are running and the process IDs are assigned,
>> using taskset or numactl to enforce processor affinity on a
>> multi-user cluster may require some hairy scripting to chase and
>> match PIDs to cores.
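>>
>> Just to make that concrete, the kind of after-the-fact binding I have
>> in mind would look roughly like this (purely illustrative: it assumes
>> Linux, the taskset utility, a single node, and that pgrep can
>> identify the ranks by executable name; note that PID order need not
>> match rank order):
>>
>>    #!/bin/sh
>>    # Pin the already-running ranks of wrf.exe (placeholder name)
>>    # to consecutive cores on this node.
>>    CORE=0
>>    for PID in $(pgrep -f wrf.exe); do
>>        taskset -p -c "$CORE" "$PID"   # rebind an existing process
>>        CORE=$(( CORE + 1 ))
>>    done
>>
>> And even that handles only one node and says nothing about memory
>> placement.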
>>
>> Robert Kubrik kindly pointed out here a looong parallel discussion of
>> this problem on the
>> Beowulf list, where a number of people advocate that the resource
>> management software,
>> rather than mpiexec, should take care of processor affinity.
>> Please, see these two threads:
>>
>> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00112.html
>> http://www.beowulf.org/archive/2008-June/021810.html
>>
>> So, before we have both mpiexec and the resource management software
>> fighting over who is in charge of processor affinity assignment,
>> making the life of users and system administrators even more
>> complicated and less productive than it is in the current scenario
>> of memory contention, it might be useful to reach a minimal
>> agreement about where processor affinity control should reside.
>>
>> At a minimum, I believe all software layers (OS, resource manager,
>> mpiexec) should let the user or sysadmin choose whether or not to
>> use whatever processor affinity enforcement features are available.
>>
>> The current MPICH2/mpiexec policy, which according to Rajeev Thakur
>> is to delegate processor affinity to the OS scheduler (see:
>> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2008/07/msg00090.html),
>> has the downside that was pointed out in Franco's and other postings.
>> However, one upside is that at least it doesn't conflict with other
>> software.
>>
>> I would love to hear more about this topic, including planned
>> solutions for the problem, from the expert subscribers of this list.
>>
>> Thank you,
>> Gus Correa
>>
>> --
>> ---------------------------------------------------------------------
>> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>> Lamont-Doherty Earth Observatory - Columbia University
>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>>
>> chong tan wrote:
>>
>>> There is no easy way with mpiexec, especially if you just use
>>> mpiexec -n. But this should work:
>>>
>>>    mpiexec numactl --physcpubind N0 <1st of your procs> :
>>>            numactl --physcpubind N1 <2nd of your procs> :
>>>            <same for the rest>
>>>
>>> Add --membind if you want (and you definitely want it for Opteron).
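>>>
>>> For example, on a 4-socket quad-core Opteron box the full command
>>> might look something like this (the core and NUMA node numbers, the
>>> 4-rank count, and ./a.out are placeholders; check your actual
>>> layout with numactl --hardware first):
>>>
>>>    mpiexec -n 1 numactl --physcpubind=0  --membind=0 ./a.out : \
>>>            -n 1 numactl --physcpubind=4  --membind=1 ./a.out : \
>>>            -n 1 numactl --physcpubind=8  --membind=2 ./a.out : \
>>>            -n 1 numactl --physcpubind=12 --membind=3 ./a.out
>>>
>>> --membind keeps each process's memory on the same NUMA node as its
>>> core, which is what you want on Opteron.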
>>>
>>> tan
>>>
>>> --- On Tue, 7/22/08, Franco Catalano
>>> <franco.catalano at uniroma1.it> wrote:
>>>
>>> From: Franco Catalano <franco.catalano at uniroma1.it>
>>> Subject: [mpich-discuss] processor/memory affinity on quad core
>>> systems
>>> To: mpich-discuss at mcs.anl.gov
>>> Date: Tuesday, July 22, 2008, 2:28 AM
>>>
>>> Hi,
>>> Is it possible to ensure processor/memory affinity on mpi jobs
>>> launched with mpiexec (or mpirun)?
>>> I am using mpich2 1.0.7 with WRF on a 4-processor Opteron quad-core
>>> (16 cores total) machine, and I have observed a noticeable (more
>>> than 20%) variability in the time needed to compute a single time
>>> step. Looking at the output of top, I have noticed that the system
>>> moves processes across the 16 cores regardless of processor/memory
>>> affinity. So, when processes are running on cores away from their
>>> memory, the time needed for the time advancement is longer.
>>> I know that, for example, OpenMPI provides a command line option for
>>> mpiexec (or mpirun) to ensure the affinity binding:
>>> --mca mpi_paffinity_alone 1
>>> I have tried this with WRF and it works.
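>>>
>>> The full command line I use is of roughly this form (the rank count
>>> and executable name are just an example):
>>>
>>>    mpirun --mca mpi_paffinity_alone 1 -np 16 ./wrf.exe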
>>> Is there a way to do this with mpich2?
>>> Otherwise, I think it would be very useful to include such a
>>> capability in the next release.
>>> Thank you for any suggestion.
>>>
>>> Franco
>>>
>>> --
>>> ____________________________________________________
>>> Eng. Franco Catalano
>>> Ph.D. Student
>>>
>>> D.I.T.S.
>>> Department of Hydraulics, Transportation and Roads.
>>> Via Eudossiana 18, 00184 Rome
>>> University of Rome "La Sapienza".
>>> tel: +390644585218
>>>
>>>
>>
>
> William Gropp
> Paul and Cynthia Saylor Professor of Computer Science
> University of Illinois Urbana-Champaign
>
>
William Gropp
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign