[mpich-discuss] Poor scaling of MPI_WIN_CREATE?

Jeff Hammond jhammond at alcf.anl.gov
Wed May 30 11:35:56 CDT 2012


MPI-3 will help fix this in some cases: using a symmetric heap as the
back end for MPI_WIN_ALLOCATE can eliminate the need for collective
communication during window allocation.

I've implemented this beneath MPI on Cray XE, Blue Gene/P, and Blue
Gene/Q; it allows window allocation to be local if the user can assert
that they follow the rules for using a symmetric heap (see the SHMEM
docs for details).
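
For reference, a minimal sketch of the MPI-3 call (the size here is a
placeholder, and whether the memory actually comes from a symmetric heap
is up to the implementation):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Aint bytes = 1024 * sizeof(double);   /* placeholder window size */
    double  *base  = NULL;
    MPI_Win  win;

    /* Unlike MPI_Win_create, where every rank hands in its own buffer, the
     * implementation provides the memory here, so it can carve it out of a
     * symmetric heap and avoid exchanging base addresses collectively. */
    MPI_Win_allocate(bytes, sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);

    /* ... one-sided communication on win ... */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}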

(Tim - I will be sending you the code for this offline, btw.)

Jeff

On Wed, May 30, 2012 at 11:28 AM, Jim Dinan <dinan at mcs.anl.gov> wrote:
> Hi Tim,
>
> I would expect creation of a shared/one-sided memory segment to be expensive
> on most systems (with a few exceptions, e.g. Cray DMAPP), regardless of the
> one-sided communication library you use.  So, hoisting this from the
> critical path would be a good change to make.
>
> Cheers,
>  ~Jim.
>
>
> On 5/30/12 11:11 AM, Timothy Stitt wrote:
>>
>> Jim,
>>
>> The MPI_WIN_CREATE routine would definitely be called quite regularly. I
>> didn't realize the implementation of MPI_WIN_CREATE was so expensive, so
>> I may have been more liberal in my use of the routine than necessary.
>> I think I can definitely reuse the buffers, since the routine is called
>> every iteration and the window size is constant for all processes across
>> each iteration.
>>
>> Tim.
>>
>> On May 30, 2012, at 12:03 PM, Jim Dinan wrote:
>>
>>> Hi Tim,
>>>
>>> How often are you creating windows? As Jed mentioned, this is expected
>>> to be fairly expensive and synchronizing on most systems. The Cray XE
>>> has some special sauce that can make this cheap if you go through DMAPP
>>> directly, but if you want your performance tuning to be portable, taking
>>> window creation off the critical path would be a good change to make.
>>>
>>> ~Jim.
>>>
>>> On 5/30/12 10:48 AM, Timothy Stitt wrote:
>>>>
>>>> Thanks Jeff...you provided some good suggestions. I'll consult the DMAPP
>>>> documentation and also go back to the code to see if I can reuse window
>>>> buffers in some way.
>>>>
>>>> Would you happen to have links to the DMAPP docs on hand? I couldn't
>>>> seem to find any tutorials, etc. after a quick browse.
>>>>
>>>> Cheers,
>>>>
>>>> Tim.
>>>>
>>>> On May 30, 2012, at 11:40 AM, Jeff Hammond wrote:
>>>>
>>>>> If you don't care about portability, translating from MPI-2 RMA to
>>>>> DMAPP is mostly trivial and you can eliminate collective window
>>>>> creation altogether. However, I will note that my experience getting
>>>>> MPI and DMAPP to interoperate properly on XE6 (Hopper, in fact) was
>>>>> terrible. And yes, I did everything the NERSC documentation and Cray
>>>>> told me to do.
>>>>>
>>>>> I wonder if you can reduce the time spent in MPI_WIN_CREATE by calling
>>>>> it less often. Can you not allocate the window once and keep reusing
>>>>> it? You might need to restructure your code to reuse the underlying
>>>>> local buffers, but in some cases that isn't very complicated.
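>>>>>
>>>>> Roughly, the shape of that restructuring (hypothetical names, just a
>>>>> sketch of hoisting window creation out of the iteration loop):
>>>>>
>>>>> /* before: a window is created and freed every iteration, and each
>>>>>  * MPI_Win_create is a synchronizing collective over comm */
>>>>> for (int it = 0; it < niters; it++) {
>>>>>     MPI_Win win;
>>>>>     MPI_Win_create(buf, bytes, 1, MPI_INFO_NULL, comm, &win);
>>>>>     exchange_neighbor_data(win, buf);   /* hypothetical helper */
>>>>>     MPI_Win_free(&win);
>>>>> }
>>>>>
>>>>> /* after: create the window once and refill the same buffer in place,
>>>>>  * so the collective cost is paid a single time */
>>>>> MPI_Win win;
>>>>> MPI_Win_create(buf, bytes, 1, MPI_INFO_NULL, comm, &win);
>>>>> for (int it = 0; it < niters; it++)
>>>>>     exchange_neighbor_data(win, buf);
>>>>> MPI_Win_free(&win);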
>>>>>
>>>>> Best,
>>>>>
>>>>> Jeff
>>>>>
>>>>> On Wed, May 30, 2012 at 10:36 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>>>>
>>>>>> On Wed, May 30, 2012 at 10:29 AM, Timothy Stitt
>>>>>> <Timothy.Stitt.9 at nd.edu> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am currently trying to improve the scaling of a CFD code on some Cray
>>>>>>> machines at NERSC (I believe Cray systems leverage MPICH2 for their MPI
>>>>>>> communications, hence the posting to this list) and I am running into
>>>>>>> some scalability issues with the MPI_WIN_CREATE() routine.
>>>>>>>
>>>>>>> To cut a long story short, the CFD code requires each process to receive
>>>>>>> values from some neighboring processes. Unfortunately, each process
>>>>>>> doesn't know in advance which processes its neighbors will be.
>>>>>>
>>>>>>
>>>>>>
>>>>>> How often do the neighbors change? By what mechanism?
>>>>>>
>>>>>>>
>>>>>>> To overcome this we exploit the one-sided MPI_PUT() routine to
>>>>>>> communicate data from neighbors directly.
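>>>>>>>
>>>>>>> For context, the pattern per iteration is roughly the following (an
>>>>>>> illustration with made-up names, not our actual code):
>>>>>>>
>>>>>>> /* expose my receive buffer; each rank PUTs into the ranks it sends to,
>>>>>>>    so receivers need not know their sources in advance */
>>>>>>> MPI_Win_create(recvbuf, n * sizeof(double), sizeof(double),
>>>>>>>                MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>>>>>>> MPI_Win_fence(0, win);
>>>>>>> for (int i = 0; i < num_targets; i++)
>>>>>>>     MPI_Put(sendbuf[i], counts[i], MPI_DOUBLE, targets[i],
>>>>>>>             offsets[i], counts[i], MPI_DOUBLE, win);
>>>>>>> MPI_Win_fence(0, win);
>>>>>>> MPI_Win_free(&win);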
>>>>>>>
>>>>>>> Recent profiling at 256, 512 and 1024 processes shows that the
>>>>>>> MPI_WIN_CREATE routine is starting to dominate the walltime and reduce
>>>>>>> our scalability quite rapidly. For instance the %walltime for
>>>>>>> MPI_WIN_CREATE over various process sizes increases as follows:
>>>>>>>
>>>>>>> 256 cores - 4.0%
>>>>>>> 512 cores - 9.8%
>>>>>>> 1024 cores - 24.3%
>>>>>>
>>>>>>
>>>>>>
>>>>>> The current implementation of MPI_Win_create uses an Allgather, which is
>>>>>> synchronizing and relatively expensive.
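>>>>>>
>>>>>> Conceptually, before any rank can target RMA operations at another,
>>>>>> window creation has to do something like this (simplified, not the
>>>>>> actual MPICH2 code):
>>>>>>
>>>>>> /* every rank learns every other rank's window base address (plus size
>>>>>>    and disp_unit), an O(P) synchronizing collective */
>>>>>> MPI_Allgather(&my_base, sizeof(void *), MPI_BYTE,
>>>>>>               all_bases, sizeof(void *), MPI_BYTE, comm);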
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I was wondering if anyone in the MPICH2 community had any advice on how
>>>>>>> one can improve the performance of MPI_WIN_CREATE? Or maybe someone has
>>>>>>> a better strategy for communicating the data that bypasses the (poorly
>>>>>>> scaling?) MPI_WIN_CREATE routine.
>>>>>>>
>>>>>>> Thanks in advance for any help you can provide.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Tim.
>>>
>>
>>
>> Tim Stitt, PhD (User Support Manager)
>> Center for Research Computing | University of Notre Dame |
>> P.O. Box 539, Notre Dame, IN 46556 | Phone: 574-631-5287 | Email:
>> tstitt at nd.edu
>>
>>
>>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond

