[mpich-discuss] Poor scaling of MPI_WIN_CREATE?

Jim Dinan dinan at mcs.anl.gov
Wed May 30 11:28:35 CDT 2012


Hi Tim,

I would expect creation of a shared/one-sided memory segment to be 
expensive on most systems (with a few exceptions, e.g. Cray DMAPP), 
regardless of the one-sided communication library you use. So hoisting 
window creation out of the critical path would be a good change to make.
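
For what it's worth, here is a rough sketch of that reuse pattern: create
the window once, keep it alive for the whole iteration loop, and free it
only at the end. MAX_ELEMS and niters below are just placeholders for your
per-process buffer bound and iteration count, and the MPI_Put calls are
elided, so treat it as illustrative rather than drop-in code:

    #include <mpi.h>

    #define MAX_ELEMS 1024          /* placeholder: per-process buffer bound */

    int main(int argc, char **argv)
    {
        const int niters = 100;     /* placeholder: iteration count */
        double   *buf;
        MPI_Win   win;

        MPI_Init(&argc, &argv);

        /* Create the window once, outside the iteration loop. */
        MPI_Alloc_mem(MAX_ELEMS * sizeof(double), MPI_INFO_NULL, &buf);
        MPI_Win_create(buf, MAX_ELEMS * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        for (int iter = 0; iter < niters; iter++) {
            MPI_Win_fence(0, win);
            /* ... MPI_Put() values into neighbors' windows here ... */
            MPI_Win_fence(0, win);
            /* ... consume whatever the neighbors put into buf ... */
        }

        /* Free the window (and buffer) once, after the loop. */
        MPI_Win_free(&win);
        MPI_Free_mem(buf);
        MPI_Finalize();
        return 0;
    }

Since you mention the window size is constant for all processes across
iterations, the same buffer and window should be reusable as-is.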

Cheers,
  ~Jim.

On 5/30/12 11:11 AM, Timothy Stitt wrote:
> Jim,
>
> The MPI_WIN_CREATE routine would definitely be called quite regularly. I
> didn't realize the implementation of MPI_WIN_CREATE was so expensive, so
> I might have been more liberal in my use of the routine than necessary,
> i.e. I think I can definitely reuse the buffers, since the routine is
> called every iteration and the window size is constant for all processes
> across each iteration.
>
> Tim.
>
> On May 30, 2012, at 12:03 PM, Jim Dinan wrote:
>
>> Hi Tim,
>>
>> How often are you creating windows? As Jed mentioned, this is expected
>> to be fairly expensive and synchronizing on most systems. The Cray XE
>> has some special sauce that can make this cheap if you go through DMAPP
>> directly, but if you want your performance tuning to be portable, taking
>> window creation off the critical path would be a good change to make.
>>
>> ~Jim.
>>
>> On 5/30/12 10:48 AM, Timothy Stitt wrote:
>>> Thanks Jeff...you provided some good suggestions. I'll consult the DMAPP
>>> documentation and also go back to the code to see if I can reuse window
>>> buffers in some way.
>>>
>>> Would you happen to have links to the DMAPP docs on hand? I couldn't
>>> seem to find any tutorials etc. after a quick browse.
>>>
>>> Cheers,
>>>
>>> Tim.
>>>
>>> On May 30, 2012, at 11:40 AM, Jeff Hammond wrote:
>>>
>>>> If you don't care about portability, translating from MPI-2 RMA to
>>>> DMAPP is mostly trivial and you can eliminate collective window
>>>> creation altogether. However, I will note that my experience getting
>>>> MPI and DMAPP to interoperate properly on XE6 (Hopper, in fact) was
>>>> terrible. And yes, I did everything the NERSC documentation and Cray
>>>> told me to do.
>>>>
>>>> I wonder if you can reduce the time spent in MPI_WIN_CREATE by calling
>>>> it less often. Can you not allocate the window once and keep reusing
>>>> it? You might need to restructure your code to reuse the underlying
>>>> local buffers, but that isn't especially complicated in some cases.
>>>>
>>>> Best,
>>>>
>>>> Jeff
>>>>
>>>> On Wed, May 30, 2012 at 10:36 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>>> On Wed, May 30, 2012 at 10:29 AM, Timothy Stitt
>>>>> <Timothy.Stitt.9 at nd.edu> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am currently trying to improve the scaling of a CFD code on some
>>>>>> Cray machines at NERSC (I believe Cray systems leverage mpich2 for
>>>>>> their MPI communications, hence the posting to this list) and I am
>>>>>> running into some scalability issues with the MPI_WIN_CREATE() routine.
>>>>>>
>>>>>> To cut a long story short, the CFD code requires each process to
>>>>>> receive values from some neighboring processes. Unfortunately, each
>>>>>> process doesn't know who its neighbors should be in advance.
>>>>>
>>>>>
>>>>> How often do the neighbors change? By what mechanism?
>>>>>
>>>>>>
>>>>>> To overcome this we exploit the one-sided MPI_PUT() routine to
>>>>>> communicate data from neighbors directly.
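>>>>>>
>>>>>> Roughly, each iteration currently does something like the following
>>>>>> (a sketch with placeholder names, not our actual variables):
>>>>>>
>>>>>>    MPI_Win_create(recvbuf, winsize, sizeof(double),
>>>>>>                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>>>>>>    MPI_Win_fence(0, win);
>>>>>>    /* MPI_Put() values into the windows of the relevant neighbors */
>>>>>>    MPI_Win_fence(0, win);
>>>>>>    MPI_Win_free(&win);
>>>>>>
>>>>>> i.e. the window is created and freed in every single iteration.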
>>>>>>
>>>>>> Recent profiling at 256, 512 and 1024 processes shows that the
>>>>>> MPI_WIN_CREATE routine is starting to dominate the walltime and
>>>>>> reduce our scalability quite rapidly. For instance, the %walltime
>>>>>> for MPI_WIN_CREATE over various process counts increases as follows:
>>>>>>
>>>>>> 256 cores - 4.0%
>>>>>> 512 cores - 9.8%
>>>>>> 1024 cores - 24.3%
>>>>>
>>>>>
>>>>> The current implementation of MPI_Win_create uses an Allgather, which
>>>>> is synchronizing and relatively expensive.
>>>>>
>>>>>>
>>>>>>
>>>>>> I was wondering if anyone in the MPICH2 community had any advice on
>>>>>> how one can improve the performance of MPI_WIN_CREATE? Or maybe
>>>>>> someone has a better strategy for communicating the data that
>>>>>> bypasses the (poorly scaling?) MPI_WIN_CREATE routine.
>>>>>>
>>>>>> Thanks in advance for any help you can provide.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tim.
>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> Argonne Leadership Computing Facility
>>>> University of Chicago Computation Institute
>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>> http://www.linkedin.com/in/jeffhammond
>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>
>>> Tim Stitt, PhD (User Support Manager)
>>> Center for Research Computing | University of Notre Dame |
>>> P.O. Box 539, Notre Dame, IN 46556 | Phone: 574-631-5287 | Email:
>>> tstitt at nd.edu
>>>
>>>
>>>
>
> Tim Stitt, PhD (User Support Manager)
> Center for Research Computing | University of Notre Dame |
> P.O. Box 539, Notre Dame, IN 46556 | Phone: 574-631-5287 | Email:
> tstitt at nd.edu
>
>
>