[mpich-discuss] Poor scaling of MPI_WIN_CREATE?

Timothy Stitt Timothy.Stitt.9 at nd.edu
Wed May 30 11:33:34 CDT 2012


Thanks all for your contributions. I have learnt that MPI_WIN_CREATE is, in general, an expensive routine and that its use should be minimized where possible for scalability. I will try to cull calls to the routine before taking the last resort of dabbling in the dark arts of Cray's DMAPP library.

Thanks again,

Tim.

On May 30, 2012, at 12:28 PM, Jim Dinan wrote:

Hi Tim,

I would expect creation of a shared/one-sided memory segment to be
expensive on most systems (with a few exceptions, e.g. Cray DMAPP),
regardless of the one-sided communication library you use.  So, hoisting
this from the critical path would be a good change to make.

Cheers,
 ~Jim.

On 5/30/12 11:11 AM, Timothy Stitt wrote:
Jim,

The MPI_WIN_CREATE routine would definitely be called quite regularly. I didn't realize the implementation of MPI_WIN_CREATE was so expensive, so I may have been more liberal in my use of the routine than necessary, i.e. I think I can definitely reuse the buffers, since the routine is called every iteration and the window size is constant for all processes across each iteration.

Tim.

On May 30, 2012, at 12:03 PM, Jim Dinan wrote:

Hi Tim,

How often are you creating windows? As Jed mentioned, this is expected
to be fairly expensive and synchronizing on most systems. The Cray XE
has some special sauce that can make this cheap if you go through DMAPP
directly, but if you want your performance tuning to be portable, taking
window creation off the critical path would be a good change to make.

~Jim.

On 5/30/12 10:48 AM, Timothy Stitt wrote:
Thanks Jeff...you provided some good suggestions. I'll consult the DMAPP
documentation and also go back to the code to see if I can reuse window
buffers in some way.

Would you happen to have links to the DMAPP docs on-hand? I couldn't
seem to find any tutorials etc. after a quick browse.

Cheers,

Tim.

On May 30, 2012, at 11:40 AM, Jeff Hammond wrote:

If you don't care about portability, translating from MPI-2 RMA to
DMAPP is mostly trivial and you can eliminate collective window
creation altogether. However, I will note that my experience getting
MPI and DMAPP to inter-operate properly on XE6 (Hopper, in fact) was
terrible. And yes, I did everything the NERSC documentation and Cray
told me to do.

I wonder if you can reduce the time spent in MPI_WIN_CREATE by calling it less often. Can you not allocate the window once and keep reusing it? You might need to restructure your code to reuse the underlying local buffers, but that isn't too complicated in some cases.

Best,

Jeff

On Wed, May 30, 2012 at 10:36 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
On Wed, May 30, 2012 at 10:29 AM, Timothy Stitt <Timothy.Stitt.9 at nd.edu> wrote:

Hi all,

I am currently trying to improve the scaling of a CFD code on some Cray machines at NERSC (I believe Cray systems leverage MPICH2 for their MPI communications, hence the posting to this list) and I am running into some scalability issues with the MPI_WIN_CREATE() routine.

To cut a long story short, the CFD code requires each process to receive values from some neighborhood processes. Unfortunately, each process doesn't know who its neighbors should be in advance.


How often do the neighbors change? By what mechanism?


To overcome this we exploit the one-sided MPI_PUT() routine to communicate data from neighbors directly.

Recent profiling at 256, 512 and 1024 processes shows that the MPI_WIN_CREATE routine is starting to dominate the walltime and reduce our scalability quite rapidly. For instance, the %walltime for MPI_WIN_CREATE over various process counts increases as follows:

256 cores - 4.0%
512 cores - 9.8%
1024 cores - 24.3%


The current implementation of MPI_Win_create uses an Allgather, which is synchronizing and relatively expensive.



I was wondering if anyone in the MPICH2 community had any advice on how one can improve the performance of MPI_WIN_CREATE? Or maybe someone has a better strategy for communicating the data that bypasses the (poorly scaling?) MPI_WIN_CREATE routine.

Thanks in advance for any help you can provide.

Regards,

Tim.
_______________________________________________
mpich-discuss mailing list mpich-discuss at mcs.anl.gov
To manage subscription options or unsubscribe:
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss







--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond

Tim Stitt PhD (User Support Manager).
Center for Research Computing | University of Notre Dame |
P.O. Box 539, Notre Dame, IN 46556 | Phone: 574-631-5287 | Email: tstitt at nd.edu








Tim Stitt PhD (User Support Manager).
Center for Research Computing | University of Notre Dame |
P.O. Box 539, Notre Dame, IN 46556 | Phone: 574-631-5287 | Email: tstitt at nd.edu
