I thought this thread on MPI shared memory interfaces with possibly weak consistency models is relevant to our discussions about threading. Boehm's paper is<div><br></div><div><a href="http://www.cs.washington.edu/education/courses/cse590p/05au/HPL-2004-209.pdf">http://www.cs.washington.edu/education/courses/cse590p/05au/HPL-2004-209.pdf</a><br>
<br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Torsten Hoefler</b> <span dir="ltr"><<a href="mailto:htor@illinois.edu">htor@illinois.edu</a>></span><br>Date: Sun, Feb 12, 2012 at 10:01<br>
Subject: Re: [Mpi3-rma] [Mpi3-hybridpm] Fwd: MPI shared memory allocation issues<br>To: <a href="mailto:mpi3-hybridpm@lists.mpi-forum.org">mpi3-hybridpm@lists.mpi-forum.org</a><br>Cc: <a href="mailto:mpi3-rma@lists.mpi-forum.org">mpi3-rma@lists.mpi-forum.org</a><br>
<br><br>Dave,<br>
<br>
On Mon, Mar 28, 2011 at 01:28:48PM -0500, Dave Goodell wrote:<br>
> Forwarding this mail to a broader audience, based on our discussion<br>
> here at the March 2011 MPI Forum meeting. There was additional<br>
> correspondence on this thread that I can forward as needed, but this<br>
> forwarded mail contains my core argument against the original<br>
> (allocate+free+fence) proposal.<br>
Thanks! This is an important discussion. Let me recap below what we<br>
discussed in the RMA group when moving towards the newer integrated<br>
allocate_shared.<br>
<br>
> Begin forwarded message:<br>
><br>
> > From: Dave Goodell <<a href="mailto:goodell@mcs.anl.gov">goodell@mcs.anl.gov</a>><br>
> > Date: February 24, 2011 11:18:20 AM CST<br>
> > To: Ron Brightwell <<a href="mailto:rbbrigh@sandia.gov">rbbrigh@sandia.gov</a>>, Douglas Miller <<a href="mailto:dougmill@us.ibm.com">dougmill@us.ibm.com</a>>, "Bronis R. de Supinski" <<a href="mailto:bronis@llnl.gov">bronis@llnl.gov</a>>, Jim Dinan <<a href="mailto:dinan@mcs.anl.gov">dinan@mcs.anl.gov</a>>, Pavan Balaji <<a href="mailto:balaji@mcs.anl.gov">balaji@mcs.anl.gov</a>>, Marc Snir <<a href="mailto:snir@illinois.edu">snir@illinois.edu</a>><br>
> > Subject: MPI shared memory allocation issues<br>
> ><br>
> > I voiced concerns at the last MPI forum meeting about the proposed<br>
> > MPI extensions for allocating shared memory. In particular I was<br>
> > concerned about "MPI_Shm_fence". Pavan asked me to write up a quick<br>
> > email to this group in order to help clarify my view in the<br>
> > discussion; this is that email. Please widen the distribution list<br>
> > as appropriate, I just mailed the addresses that Pavan indicated to<br>
> > me. FYI, I am not currently subscribed to the mpi3-hybrid list.<br>
> ><br>
> > First, I view multithreaded programming within an OS process and<br>
> > multiprocess programming using a shared memory region to be<br>
> > essentially the same problem. There are probably alternative<br>
> > interpretations of the words "process" and "thread" that could muddy<br>
> > the picture here, but for the sake of clarity, let's use the<br>
> > conventional meanings for the moment. Also, I am not interested in<br>
> > discussing distributed shared memory (DSM) here, I think that<br>
> > bringing it up just confuses the discussion further. My primary<br>
> > objections to the proposal are valid entirely within a discussion of<br>
> > conventional shared memory, processes, and threads.<br>
> ><br>
> > Given that preface, I believe that many, if not all, of the issues<br>
> > raised by Boehm's paper, "Threads Cannot Be Implemented As a<br>
> > Library" [1], apply here. In particular, some variation on the<br>
> > example from section 4.3 is probably an issue, but the others seem<br>
> > to apply as well. The performance example is also relevant here,<br>
> > but in an even more dramatic fashion given the dearth of<br>
> > synchronization primitives offered by the proposal.<br>
> ><br>
> > I do not believe that we can specify a way to program the provided<br>
> > shared memory in any way that is robust and useful to the user,<br>
> > because C and Fortran do not give us enough of a specification in<br>
> > order to do so. Without getting into the business of compiler<br>
> > writing, MPI has no way to give the user any meaningful guarantees.<br>
> > Just as Boehm noted about pthreads, we can probably come up with an<br>
> > approach that will work most of the time. But that's a pretty<br>
> > flimsy guarantee for a standard like MPI.<br>
Yes, as Boehm points out, serial compiler optimizations can have<br>
very bad effects on concurrently running code accessing close-by memory.<br>
However, as you point out, the situation is equivalent to what we have<br>
today in pthreads and the proposal does not claim any more, it says "The<br>
consistency of load/store accesses from/to the shared memory as observed<br>
by the user program depends on the architecture.". We can extend it to<br>
include the compiler and maybe reference Boehm's paper (I would see this<br>
as a ticket 0 change).<br>
<br>
I agree to the general sentiment that it is impossible to implement<br>
shared memory semantics in a language that doesn't even have a real<br>
memory model. However, at the same time, I want to remind us that<br>
the behavior of Fortran was *never* 100% correct in MPI <= 2.2 (and we<br>
rely in this TR for MPI-3.0). At the same time, Fortran/MPI programs are<br>
ubiquitous :-).<br>
<br>
But let's discuss Boehm's identified correctness issues here:<br>
<br>
* 4.1 Concurrent modification<br>
<br>
This is only an issue if users rely on the consistency of the underlying<br>
hardware, if they use Win_flush and friends (as advised), such a<br>
reordering would be illegal. One downside is that this will only work in<br>
C and Fortran will probably may have all kinds of wacky problems with code<br>
movement as usual (however, they should be in a position to fix this<br>
with the new bindings and the Fortran TR).<br>
<br>
* 4.2 Rewriting of Adjacent Data<br>
<br>
This applies to the unified window as well where we simply specify the<br>
byte granularity of updates (an architecture could work with larger<br>
chunks (e.g., words) and cause the same trouble). So this issue is not<br>
limited to the shared memory window, especially when fast remote memory<br>
access hardware is used. Here we face the general trade-off between fast<br>
hardware access and safety (securing it through a software layer). We<br>
decided that byte-consistency is something we can expect from vendors.<br>
Also, the vendor library is always free to return MPI_ERR_RMA_SHARED<br>
(like he can always choose to not offer the unified memory model).<br>
<br>
* 4.3 Register Promotion<br>
<br>
While this is certainly a problem with threads, it would not be one with<br>
MPI windows because they have to dereference the address of the accessed<br>
memory and would thus prevent register promotion. Copying the data into<br>
a faster memory region would also not harm because the remote side has<br>
to query the addresses anyway. Again, restrictions may apply for Fortran<br>
codes.<br>
<br>
> > If you ignore the difficulty in specifying an interface that can<br>
> > actually be used correctly, then another issue arises. The only<br>
> > proposed synchronization mechanism virtually guarantees that the<br>
> > user can at best utilize the allocated shared memory region to share<br>
> > data that is written once and otherwise read-only. Any other shared<br>
> > memory programming techniques are either going to be non-portable<br>
> > (e.g., using pthread mutexes or calls/macros from some atomic<br>
> > operations library), or they will be limited to potentially slow<br>
> > dark-ages techniques such as Dekker's Algorithm with excessive<br>
> > MPI_Shm_fence-ing. So does this proposal really empower the user in<br>
> > any meaningful way?<br>
I agree. This should be addressed by the merge into the RMA context<br>
which offers all required functionality (we avoided memory locks on<br>
purpose because they are evil).<br>
<br>
> > I don't see a compelling advantage to putting this into MPI as<br>
> > opposed to providing this as some third-party library on top of MPI.<br>
> > Sure, it's easy to implement the allocate/free calls inside of MPI<br>
> > because the machinery is typically there. But a third-party library<br>
> > would be able to escape some of the extremely generic portability<br>
> > constraints of the MPI standard and would therefore be able to<br>
> > provide a more robust interface to the user. A discussion of DSM<br>
> > might make putting it into MPI more compelling because access to the<br>
> > network hardware might be involved, but I'm not particularly<br>
> > interested in having that discussion right now. I think that MPI-3<br>
> > RMA would probably be more suitable for that use case.<br>
First, I am against DSM. Second, I believe that it may be very valuable<br>
to have this kind of functionality in MPI because virtually all<br>
large-scale codes have to become hybrid. The main issue are the<br>
associated memory saving (on-node communication with MPI is often<br>
sufficiently fast). I believe the current practice of mixing OpenMP and<br>
MPI to achieve this simple goal may be suboptimal (OpenMP supports only<br>
the "shared everything" (threaded) model and enables thus a whole new<br>
class of bugs and races).<br>
<br>
All the Best,<br>
Torsten<br>
<font color="#888888"><br>
--<br>
bash$ :(){ :|:&};: --------------------- <a href="http://www.unixer.de/" target="_blank">http://www.unixer.de/</a> -----<br>
Torsten Hoefler | Performance Modeling and Simulation Lead<br>
Blue Waters Directorate | University of Illinois (UIUC)<br>
1205 W Clark Street | Urbana, IL, 61801<br>
NCSA Building | +01 <a href="tel:%28217%29%20244-7736" value="+12172447736">(217) 244-7736</a><br>
_______________________________________________<br>
mpi3-rma mailing list<br>
<a href="mailto:mpi3-rma@lists.mpi-forum.org">mpi3-rma@lists.mpi-forum.org</a><br>
<a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma" target="_blank">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma</a><br>
</font></div><br></div>