[mpich-discuss] MPI_WIN_FENCE versus MPI_WIN_{LOCK|UNLOCK}

Jeff Hammond jhammond at alcf.anl.gov
Fri Jun 1 11:17:55 CDT 2012


MPI-2.2 Section 11.7.2 discusses the possibilities for an implementation
of RMA.  An implementation may start communication as soon as a call to
MPI_{Put,Get,Accumulate} is made, although these operations are
nonblocking and are only guaranteed to be complete when the appropriate
synchronization call is made.  On the other hand, it is also possible
that MPI_{Put,Get,Accumulate} merely enqueues the associated data
transfer and that the data movement begins only when the synchronization
call is made.  In either case, it is only after the synchronization call
has returned that one can be sure the data has moved and can therefore
be accessed.

I believe that MPICH2 favors the latter (deferred transfer, but
perhaps not deferred entirely until synchronization), but I know that
some MPICH2-derived implementations start transfers immediately (Blue
Gene/P and Blue Gene/Q both do this).  I imagine that MVAPICH2 and
CrayMPI are more aggressive about moving data than MPICH2 is on
Ethernet, but I've not spent enough time looking at the source (where
possible) of these implementations to say anything authoritative.  I
spend all my time staring at MPI for BG.

In general, Attempt 2 should be faster because it does one collective
synchronization instead of N p2p synchronization calls from each
process.  However, I suspect that it is both implementation- and
usage-dependent, so you'll want to have both versions working and
compare the two.  Attempt 2 could also be implemented with MPI
collectives like alltoall*, though I think the reference Graph500
implementation already does this and thus you're probably not interested
in reproducing that effort.
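For what it's worth, a hedged sketch of that collective variant (all names
here are placeholders, and the counts/displacements would come from
whatever partitioning your code already does):

  /* one call replaces the N puts plus the fence: every process scatters  */
  /* its per-destination blocks and gathers the blocks destined for it    */
  MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                MPI_COMM_WORLD);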

Just to make life even more interesting, you'll find that the
performance of noncontiguous datatypes relative to their contiguous
counterparts varies enough that you'll want to have code for both.  For
example, some implementations do much better with N calls to
MPI_{Put,Get,Accumulate}, each with count M of a primitive datatype,
whereas others do better with a single M*N subarray.  I believe that
most implementations do the right thing and reward you for using
datatypes, but Blue Gene at least is an exception, in part because
datatype pack/unpack is more expensive on the associated CPUs.
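As a sketch of those two strategies (placeholder names: "src" is a
contiguous origin buffer, "ld" is the leading dimension of the target
array, and the enclosing synchronization calls are omitted):

  /* (a) N separate puts, each moving M contiguous doubles                */
  for (int i = 0; i < N; i++)
      MPI_Put(&src[i*M], M, MPI_DOUBLE,
              peer, (MPI_Aint)i*ld, M, MPI_DOUBLE, win);

  /* (b) one put describing the whole M*N strided block with a datatype   */
  MPI_Datatype strided;
  MPI_Type_vector(N, M, ld, MPI_DOUBLE, &strided);
  MPI_Type_commit(&strided);
  MPI_Put(src, N*M, MPI_DOUBLE, peer, 0, 1, strided, win);
  MPI_Type_free(&strided);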

Furthermore, in passive target mode, sometimes it is better to lock
and unlock around each of the N calls instead of locking/unlocking
once for all N of them.  Jim Dinan wrote an excellent paper that
elucidates these issues in the context of ARMCI over MPI-2 RMA
[http://www.mcs.anl.gov/publications/paper_detail.php?id=1535] and his
implementation (ARMCI-MPI) demonstrates all the possibilities in very
readable code that you can get just by downloading MPICH2
[http://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/trunk/src/mpix/armci].
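In code, the two passive-target variants look roughly like this
(placeholder names again; MPI_LOCK_SHARED could just as well be
MPI_LOCK_EXCLUSIVE depending on your access pattern):

  /* (a) lock/unlock per operation: each unlock completes that one put    */
  for (int i = 0; i < N; i++) {
      MPI_Win_lock(MPI_LOCK_SHARED, peer, 0, win);
      MPI_Put(&src[i], 1, MPI_DOUBLE, peer, (MPI_Aint)i, 1, MPI_DOUBLE, win);
      MPI_Win_unlock(peer, win);
  }

  /* (b) one lock around the whole batch: all N puts complete at unlock   */
  MPI_Win_lock(MPI_LOCK_SHARED, peer, 0, win);
  for (int i = 0; i < N; i++)
      MPI_Put(&src[i], 1, MPI_DOUBLE, peer, (MPI_Aint)i, 1, MPI_DOUBLE, win);
  MPI_Win_unlock(peer, win);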

For an application like NWChem that makes heavy use of one-sided communication, I
just tested all the different strategies and used the best one on each
architecture.  As I recall, CrayMPI and MVAPICH2 did better with
datatypes (as they should), but you'll have to read the paper to be
sure.

jeff

On Fri, Jun 1, 2012 at 9:37 AM, Timothy Stitt <Timothy.Stitt.9 at nd.edu> wrote:
> Hi all,
>
> Following a recent discussion on MPI_WIN_CREATE I was hoping some MPICH2 folks could shed some further light on how MPI_WIN_FENCE and MPI_WIN_{LOCK|UNLOCK} are implemented. Which of the two attempts below is better from a performance point-of-view when executed regularly within my code:
>
> * Attempt 1 *
>
> loop 1..n
>
>        call MPI_WIN_LOCK(...)
>        call MPI_PUT(...)
>        call MPI_WIN_UNLOCK(...)
>
> end loop
>
> or
>
> * Attempt 2 *
>
> loop 1..n
>
>        call MPI_PUT(...)
>
> end loop
>
> call MPI_WIN_FENCE(...)
>
> I also have a third attempt that uses active target RMA using MPI_WIN_{START|COMPLETE} and MPI_WIN_{POST|WAIT}. Is there any benefit to using one approach over the other, in general?
>
> Thanks in advance for your advice,
>
> Tim.
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond

