[mpich-discuss] Poor scaling of MPI_WIN_CREATE?

Jeff Hammond jhammond at alcf.anl.gov
Wed May 30 11:17:10 CDT 2012


The special sauce is that DMAPP does not require collective allocation
for one-sided communication, which is semantically inconsistent with
MPI-2 RMA, where window creation is collective over the communicator.

Cray MPI does not use DMAPP for one-sided; it uses uGNI instead, which
is partly due to how MPICH2 implements one-sided.  We should take any
further discussion of this topic offline.
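
To make the semantic point concrete, here is a minimal sketch (mine,
illustrative only, not from any real application): MPI_Win_create must
be called by every rank in the communicator, even a rank that exposes
no memory, whereas DMAPP lets a process register memory without the
other processes participating.

/* Illustrative sketch: window creation is collective, even for ranks
 * that contribute no memory. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only even ranks expose a buffer; odd ranks expose nothing ... */
    MPI_Aint bytes = (rank % 2 == 0) ? 1024 * sizeof(double) : 0;
    double *buf = bytes ? malloc(bytes) : NULL;

    /* ... yet every rank must make the (collective) call. */
    MPI_Win win;
    MPI_Win_create(buf, bytes, sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    /* One-sided epochs (MPI_Win_fence / MPI_Put / ...) would go here. */

    MPI_Win_free(&win);   /* also collective */
    free(buf);
    MPI_Finalize();
    return 0;
}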

Jeff


On Wed, May 30, 2012 at 11:08 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> On Wed, May 30, 2012 at 11:03 AM, Jim Dinan <dinan at mcs.anl.gov> wrote:
>>
>> Hi Tim,
>>
>> How often are you creating windows?  As Jed mentioned, this is expected to
>> be fairly expensive and synchronizing on most systems.  The Cray XE has some
>> special sauce that can make this cheap if you go through DMAPP directly,
>
>
> Isn't the whole point of a "vendor optimized MPI" that they would have done
> this? Is there a semantic reason why MPI_Win_create() cannot be implemented
> in this fast way using DMAPP?
>
>>
>> but if you want your performance tuning to be portable, taking window
>> creation off the critical path would be a good change to make.
>>
>>  ~Jim.
>>
>>
>> On 5/30/12 10:48 AM, Timothy Stitt wrote:
>>>
>>> Thanks, Jeff... you provided some good suggestions. I'll consult the
>>> DMAPP documentation and also go back to the code to see if I can reuse
>>> window buffers in some way.
>>>
>>> Would you happen to have links to the DMAPP docs on hand? I couldn't
>>> find any tutorials or similar after a quick browse.
>>>
>>> Cheers,
>>>
>>> Tim.
>>>
>>> On May 30, 2012, at 11:40 AM, Jeff Hammond wrote:
>>>
>>>> If you don't care about portability, translating from MPI-2 RMA to
>>>> DMAPP is mostly trivial, and you can eliminate collective window
>>>> creation altogether. However, I will note that my experience getting
>>>> MPI and DMAPP to interoperate properly on the XE6 (Hopper, in fact)
>>>> was terrible. And yes, I did everything the NERSC documentation and
>>>> Cray told me to do.
>>>>
>>>> I wonder if you can reduce the time spent in MPI_WIN_CREATE by calling
>>>> it less often. Can you not allocate the window once and keep reusing
>>>> it? You might need to restructure your code so that the underlying
>>>> local buffers are reused, but in some cases that is not very
>>>> complicated.
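
[To illustrate the reuse suggestion above, here is a minimal sketch; it
is not the actual CFD code, and the buffer layout, neighbor list, and
fence-based synchronization are all made up for the example.  The
window is created once during setup and then reused for every
exchange.]

#include <mpi.h>

/* Created once at startup and reused for every halo exchange. */
static MPI_Win win;
static double *recv_buf;      /* memory exposed to the neighbors       */
#define SLOT 1024             /* doubles reserved per potential source */

void exchange_setup(MPI_Comm comm, int nprocs)
{
    MPI_Aint bytes = (MPI_Aint)nprocs * SLOT * sizeof(double);
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &recv_buf);
    /* Collective, but called exactly once instead of every iteration. */
    MPI_Win_create(recv_buf, bytes, sizeof(double),
                   MPI_INFO_NULL, comm, &win);
}

/* One exchange: push my data into the slot reserved for my rank on
 * each process that happens to be a target this iteration. */
void exchange(const double *send_buf, const int *targets, int ntargets,
              int myrank)
{
    /* Fence epochs work even though a rank does not know in advance
     * which neighbors will write into its window. */
    MPI_Win_fence(0, win);
    for (int i = 0; i < ntargets; ++i)
        MPI_Put(send_buf, SLOT, MPI_DOUBLE, targets[i],
                (MPI_Aint)myrank * SLOT, SLOT, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);    /* data is now visible in recv_buf */
}

void exchange_teardown(void)
{
    MPI_Win_free(&win);
    MPI_Free_mem(recv_buf);
}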
>>>>
>>>> Best,
>>>>
>>>> Jeff
>>>>
>>>> On Wed, May 30, 2012 at 10:36 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>>>
>>>>> On Wed, May 30, 2012 at 10:29 AM, Timothy Stitt <Timothy.Stitt.9 at nd.edu> wrote:
>>>>>>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am currently trying to improve the scaling of a CFD code on some
>>>>>> Cray machines at NERSC (I believe Cray systems use MPICH2 for their
>>>>>> MPI communications, hence the posting to this list), and I am
>>>>>> running into some scalability issues with the MPI_WIN_CREATE()
>>>>>> routine.
>>>>>>
>>>>>> To cut a long story short, the CFD code requires each process to
>>>>>> receive values from some neighboring processes. Unfortunately, each
>>>>>> process doesn't know in advance who its neighbors should be.
>>>>>
>>>>>
>>>>>
>>>>> How often do the neighbors change? By what mechanism?
>>>>>
>>>>>>
>>>>>> To overcome this, we exploit the one-sided MPI_PUT() routine to
>>>>>> communicate data from neighbors directly.
>>>>>>
>>>>>> Recent profiling at 256, 512, and 1024 processes shows that the
>>>>>> MPI_WIN_CREATE routine is starting to dominate the walltime and
>>>>>> reduce our scalability quite rapidly. For instance, the % walltime
>>>>>> for MPI_WIN_CREATE over various process counts increases as follows:
>>>>>>
>>>>>> 256 cores - 4.0%
>>>>>> 512 cores - 9.8%
>>>>>> 1024 cores - 24.3%
>>>>>
>>>>>
>>>>>
>>>>> The current implementation of MPI_Win_create uses an Allgather, which
>>>>> is synchronizing and relatively expensive.
>>>>>
>>>>>>
>>>>>>
>>>>>> I was wondering if anyone in the MPICH2 community had any advice on
>>>>>> how one can improve the performance of MPI_WIN_CREATE? Or maybe
>>>>>> someone has a better strategy for communicating the data that
>>>>>> bypasses the (poorly scaling?) MPI_WIN_CREATE routine.
>>>>>>
>>>>>> Thanks in advance for any help you can provide.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tim.
>>>>
>>>> --
>>>> Jeff Hammond
>>>> Argonne Leadership Computing Facility
>>>> University of Chicago Computation Institute
>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>>
>>>> http://www.linkedin.com/in/jeffhammond
>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>
>>>
>>> Tim Stitt, PhD (User Support Manager)
>>>
>>> Center for Research Computing | University of Notre Dame |
>>> P.O. Box 539, Notre Dame, IN 46556 | Phone: 574-631-5287 | Email:
>>> tstitt at nd.edu



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond

