I thought this thread on MPI shared memory interfaces with possibly weak consistency models is relevant to our discussions about threading. Boehm's paper is<div><br></div><div><a href="http://www.cs.washington.edu/education/courses/cse590p/05au/HPL-2004-209.pdf">http://www.cs.washington.edu/education/courses/cse590p/05au/HPL-2004-209.pdf</a><br>

<br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Torsten Hoefler</b> <span dir="ltr"><<a href="mailto:htor@illinois.edu">htor@illinois.edu</a>></span><br>Date: Sun, Feb 12, 2012 at 10:01<br>

Subject: Re: [Mpi3-rma] [Mpi3-hybridpm] Fwd: MPI shared memory allocation issues<br>To: <a href="mailto:mpi3-hybridpm@lists.mpi-forum.org">mpi3-hybridpm@lists.mpi-forum.org</a><br>Cc: <a href="mailto:mpi3-rma@lists.mpi-forum.org">mpi3-rma@lists.mpi-forum.org</a><br>

<br><br>Dave,<br>

<br>

On Mon, Mar 28, 2011 at 01:28:48PM -0500, Dave Goodell wrote:<br>

> Forwarding this mail to a broader audience, based on our discussion<br>

> here at the March 2011 MPI Forum meeting.  There was additional<br>

> correspondence on this thread that I can forward as needed, but this<br>

> forwarded mail contains my core argument against the original<br>

> (allocate+free+fence) proposal.<br>

Thanks! This is an important discussion. Let me recap below what we<br>

discussed in the RMA group when moving towards the newer integrated<br>

allocate_shared.<br>

<br>

> Begin forwarded message:<br>

><br>

> > From: Dave Goodell <<a href="mailto:goodell@mcs.anl.gov">goodell@mcs.anl.gov</a>><br>

> > Date: February 24, 2011 11:18:20 AM CST<br>

> > To: Ron Brightwell <<a href="mailto:rbbrigh@sandia.gov">rbbrigh@sandia.gov</a>>, Douglas Miller <<a href="mailto:dougmill@us.ibm.com">dougmill@us.ibm.com</a>>, "Bronis R. de Supinski" <<a href="mailto:bronis@llnl.gov">bronis@llnl.gov</a>>, Jim Dinan <<a href="mailto:dinan@mcs.anl.gov">dinan@mcs.anl.gov</a>>, Pavan Balaji <<a href="mailto:balaji@mcs.anl.gov">balaji@mcs.anl.gov</a>>, Marc Snir <<a href="mailto:snir@illinois.edu">snir@illinois.edu</a>><br>


> > Subject: MPI shared memory allocation issues<br>

> ><br>

> > I voiced concerns at the last MPI forum meeting about the proposed<br>

> > MPI extensions for allocating shared memory.  In particular I was<br>

> > concerned about "MPI_Shm_fence".  Pavan asked me to write up a quick<br>

> > email to this group in order to help clarify my view in the<br>

> > discussion; this is that email.  Please widen the distribution list<br>

> > as appropriate, I just mailed the addresses that Pavan indicated to<br>

> > me.  FYI, I am not currently subscribed to the mpi3-hybrid list.<br>

> ><br>

> > First, I view multithreaded programming within an OS process and<br>

> > multiprocess programming using a shared memory region to be<br>

> > essentially the same problem.  There are probably alternative<br>

> > interpretations of the words "process" and "thread" that could muddy<br>

> > the picture here, but for the sake of clarity, let's use the<br>

> > conventional meanings for the moment.  Also, I am not interested in<br>

> > discussing distributed shared memory (DSM) here, I think that<br>

> > bringing it up just confuses the discussion further.  My primary<br>

> > objections to the proposal are valid entirely within a discussion of<br>

> > conventional shared memory, processes, and threads.<br>

> ><br>

> > Given that preface, I believe that many, if not all, of the issues<br>

> > raised by Boehm's paper, "Threads Cannot Be Implemented As a<br>

> > Library" [1], apply here.  In particular, some variation on the<br>

> > example from section 4.3 is probably an issue, but the others seem<br>

> > to apply as well.  The performance example is also relevant here,<br>

> > but in an even more dramatic fashion given the dearth of<br>

> > synchronization primitives offered by the proposal.<br>

> ><br>

> > I do not believe that we can specify a way to program the provided<br>

> > shared memory in any way that is robust and useful to the user,<br>

> > because C and Fortran do not give us enough of a specification in<br>

> > order to do so.  Without getting into the business of compiler<br>

> > writing, MPI has no way to give the user any meaningful guarantees.<br>

> > Just as Boehm noted about pthreads, we can probably come up with an<br>

> > approach that will work most of the time.  But that's a pretty<br>

> > flimsy guarantee for a standard like MPI.<br>

Yes, as Boehm points out, serial compiler optimizations can have<br>

very bad effects on concurrently running code accessing close-by memory.<br>

However, as you point out, the situation is equivalent to what we have<br>

today in pthreads and the proposal does not claim any more, it says "The<br>

consistency of load/store accesses from/to the shared memory as observed<br>

by the user program depends on the architecture.". We can extend it to<br>

include the compiler and maybe reference Boehm's paper (I would see this<br>

as a ticket 0 change).<br>

<br>

I agree with the general sentiment that it is impossible to implement<br>

shared memory semantics in a language that doesn't even have a real<br>

memory model. However, at the same time, I want to remind us that<br>

the behavior of Fortran was *never* 100% correct in MPI <= 2.2 (and we<br>

rely on this TR for MPI-3.0). Nevertheless, Fortran/MPI programs are<br>

ubiquitous :-).<br>

<br>

But let's discuss Boehm's identified correctness issues here:<br>

<br>

* 4.1 Concurrent modification<br>

<br>

This is only an issue if users rely on the consistency of the underlying<br>

hardware. If they use Win_flush and friends (as advised), such a<br>

reordering would be illegal. One downside is that this will only work in<br>

C; Fortran will probably have all kinds of wacky problems with code<br>

movement as usual (however, they should be in a position to fix this<br>

with the new bindings and the Fortran TR).<br>

<br>

* 4.2 Rewriting of Adjacent Data<br>

<br>

This applies to the unified window as well where we simply specify the<br>

byte granularity of updates (an architecture could work with larger<br>

chunks (e.g., words) and cause the same trouble). So this issue is not<br>

limited to the shared memory window, especially when fast remote memory<br>

access hardware is used. Here we face the general trade-off between fast<br>

hardware access and safety (securing it through a software layer). We<br>

decided that byte-consistency is something we can expect from vendors.<br>

Also, the vendor library is always free to return MPI_ERR_RMA_SHARED<br>

(just as it can always choose not to offer the unified memory model).<br>
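
<br>

To make the hazard concrete, here is a minimal single-threaded C sketch of a lost update under word-granularity stores. It is not MPI code and not taken from the proposal: the 32-bit word size, the manual interleaving, and the names pending_store, begin_byte_store, finish_byte_store and lost_update_demo are all assumptions made for illustration.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: the platform can only store whole 32-bit words,
 * so a one-byte store becomes "load word, patch byte, store word". */
typedef struct {
    uint32_t snapshot;   /* private copy of the word with the byte patched */
    size_t   word_base;  /* offset of the word inside the buffer */
} pending_store;

/* Phase 1: read the word containing buf[idx] and patch the byte in a
 * private copy. Nothing is written to buf yet. */
pending_store begin_byte_store(const uint8_t *buf, size_t len,
                               size_t idx, uint8_t val)
{
    pending_store p;
    uint8_t tmp[4];

    p.word_base = idx & ~(size_t)3;
    assert(p.word_base + 4 <= len);   /* the whole word must be in range */
    memcpy(tmp, buf + p.word_base, 4);
    tmp[idx - p.word_base] = val;
    memcpy(&p.snapshot, tmp, 4);
    return p;
}

/* Phase 2: write the whole word back, including the three bytes that
 * this writer never meant to touch. */
void finish_byte_store(uint8_t *buf, pending_store p)
{
    memcpy(buf + p.word_base, &p.snapshot, 4);
}

/* Interleave two writers by hand and return the neighbour byte. */
uint8_t lost_update_demo(void)
{
    uint8_t window[4] = {0, 0, 0, 0};

    /* "Process A" starts storing 0xAA into byte 0. */
    pending_store a = begin_byte_store(window, sizeof window, 0, 0xAA);
    /* "Process B" stores 0xBB into the adjacent byte 1. */
    window[1] = 0xBB;
    /* A's word-sized write-back overwrites B's byte with the stale 0. */
    finish_byte_store(window, a);

    return window[1];
}
```

In this model lost_update_demo() returns 0 rather than 0xBB: B's store to the adjacent byte is silently undone. Byte-granularity updates rule out exactly this interleaving.<br>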

<br>

* 4.3 Register Promotion<br>

<br>

While this is certainly a problem with threads, it would not be one with<br>

MPI windows because they have to dereference the address of the accessed<br>

memory and would thus prevent register promotion. Copying the data into<br>

a faster memory region would also do no harm because the remote side has<br>

to query the addresses anyway. Again, restrictions may apply for Fortran<br>

codes.<br>
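
<br>

For readers who have not seen the transformation, the following C sketch models by hand what register promotion does to a loop. It only illustrates the effect Boehm describes, under my own assumptions: the peer is simulated by a callback instead of a second process, and the names peer_adds_100, count_through_memory, count_promoted, demo_memory and demo_promoted are hypothetical.<br>

```c
#include <assert.h>

typedef void (*peer_fn)(int *shared);

/* Hypothetical peer: adds 100 to the shared location. */
void peer_adds_100(int *shared) { *shared += 100; }

/* Every update goes through the pointer, so the peer's write is seen. */
int count_through_memory(int *shared, int n, peer_fn peer)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        if (i == n / 2)
            peer(shared);        /* peer writes in the middle of the loop */
        *shared += 1;
    }
    return *shared;
}

/* Hand-written model of the promoted loop: the value lives in a local
 * ("register") for the whole loop and is written back once at the end. */
int count_promoted(int *shared, int n, peer_fn peer)
{
    assert(n > 0);
    int reg = *shared;           /* load once */
    for (int i = 0; i < n; i++) {
        if (i == n / 2)
            peer(shared);        /* peer's write lands in memory ... */
        reg += 1;
    }
    *shared = reg;               /* ... and is overwritten here */
    return *shared;
}

int demo_memory(void)
{
    int x = 0;
    return count_through_memory(&x, 10, peer_adds_100);
}

int demo_promoted(void)
{
    int x = 0;
    return count_promoted(&x, 10, peer_adds_100);
}
```

demo_memory() yields 110 (ten increments plus the peer's 100), while demo_promoted() yields 10 because the final write-back discards the peer's update.<br>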

<br>

> > If you ignore the difficulty in specifying an interface that can<br>

> > actually be used correctly, then another issue arises.  The only<br>

> > proposed synchronization mechanism virtually guarantees that the<br>

> > user can at best utilize the allocated shared memory region to share<br>

> > data that is written once and otherwise read-only. Any other shared<br>

> > memory programming techniques are either going to be non-portable<br>

> > (e.g., using pthread mutexes or calls/macros from some atomic<br>

> > operations library), or they will be limited to potentially slow<br>

> > dark-ages techniques such as Dekker's Algorithm with excessive<br>

> > MPI_Shm_fence-ing.  So does this proposal really empower the user in<br>

> > any meaningful way?<br>

I agree. This should be addressed by the merge into the RMA context<br>

which offers all required functionality (we avoided memory locks on<br>

purpose because they are evil).<br>

<br>

> > I don't see a compelling advantage to putting this into MPI as<br>

> > opposed to providing this as some third-party library on top of MPI.<br>

> > Sure, it's easy to implement the allocate/free calls inside of MPI<br>

> > because the machinery is typically there.  But a third-party library<br>

> > would be able to escape some of the extremely generic portability<br>

> > constraints of the MPI standard and would therefore be able to<br>

> > provide a more robust interface to the user.  A discussion of DSM<br>

> > might make putting it into MPI more compelling because access to the<br>

> > network hardware might be involved, but I'm not particularly<br>

> > interested in having that discussion right now.  I think that MPI-3<br>

> > RMA would probably be more suitable for that use case.<br>

First, I am against DSM. Second, I believe that it may be very valuable<br>

to have this kind of functionality in MPI because virtually all<br>

large-scale codes have to become hybrid. The main issue is the<br>

associated memory savings (on-node communication with MPI is often<br>

sufficiently fast). I believe the current practice of mixing OpenMP and<br>

MPI to achieve this simple goal may be suboptimal (OpenMP supports only<br>

the "shared everything" (threaded) model and thus enables a whole new<br>

class of bugs and races).<br>

<br>

All the Best,<br>

  Torsten<br>

<font color="#888888"><br>

--<br>

 bash$ :(){ :|:&};: --------------------- <a href="http://www.unixer.de/" target="_blank">http://www.unixer.de/</a> -----<br>

Torsten Hoefler         | Performance Modeling and Simulation Lead<br>

Blue Waters Directorate | University of Illinois (UIUC)<br>

1205 W Clark Street     | Urbana, IL, 61801<br>

NCSA Building           | +01 <a href="tel:%28217%29%20244-7736" value="+12172447736">(217) 244-7736</a><br>


</font></div><br></div>