[petsc-dev] IMP prototype

Fri Jan 3 07:16:39 CST 2014

Victor Eijkhout <eijkhout at tacc.utexas.edu> writes:

> On Jan 2, 2014, at 10:12 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>
>> the execution model is BSP
>
> No it's not. There are no barriers or syncs.

You have a communication phase followed by computation, followed by more
communication (in general).  Looks like BSP without explicit barriers
(which would be semantically meaningless if you added them).  An example
of a less structured communication pattern is Mark's asynchronous
Gauss-Seidel.

  http://crd.lbl.gov/assets/Uploads/ANAG/gs.pdf
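
To make the "communication phase followed by computation" structure
concrete, here is a minimal serial sketch in Python.  It is not IMP or
PETSc code: the list of chunks stands in for processes, and the names
exchange_halos, local_average, and superstep are made up for
illustration.  It assumes every chunk is non-empty and uses zero
boundary values.

```python
# Serial stand-in for P "processes": each owns one contiguous chunk of a vector.

def exchange_halos(chunks):
    """Communication phase: each chunk receives its neighbours' edge values."""
    halos = []
    for p, chunk in enumerate(chunks):
        left = chunks[p - 1][-1] if p > 0 else 0.0              # zero boundary
        right = chunks[p + 1][0] if p + 1 < len(chunks) else 0.0
        halos.append((left, right))
    return halos


def local_average(chunk, halo):
    """Computation phase: 3-point average using only local data plus halo."""
    padded = [halo[0]] + list(chunk) + [halo[1]]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]


def superstep(chunks):
    """One step: all communication first, then all computation."""
    halos = exchange_halos(chunks)
    return [local_average(c, h) for c, h in zip(chunks, halos)]
```

No barrier appears anywhere, yet no computation starts before the
exchange it depends on has finished, which is the sense in which the
pattern looks like BSP.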

> The callbacks are there just to make the code uniform. I've edited the
> document to reflect that in the MPI case you can dispense with them.

I thought you were targeting hybrid MPI/threads?

>> your transformation is recognizing a common
>> pattern of communication into temporary buffers, followed by
>> computation, followed by post-communication and putting a declarative
>> syntax on it
>
> Somewhat simplified, but not wrong. I'm kind of interested in the
> question what practically relevant algorithms do not conform to that
> model.

The GS above is one example.  More simply, how does MatMult_MPI look in
your model (note overlapped communication and computation)?  Also, you
can't implement an efficient setup for your communication pattern
without richer semantics; see PetscCommBuildTwoSided_Ibarrier and the paper

  http://unixer.de/publications/img/hoefler-dsde-protocols.pdf
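
For the MatMult question, this is roughly the overlap structure I have
in mind, as a serial Python sketch rather than actual PETSc code.  The
rows owned by one process are split into a "diagonal" block (columns
this process owns) and an "off-diagonal" block (columns owned
elsewhere).  The dict-per-row format, split_rows, and the fetch_ghosts
callback are assumptions made up for illustration; fetch_ghosts is
assumed to start a nonblocking exchange and return a callable that
completes it.

```python
def split_rows(rows, lo, hi):
    """Split owned rows ({column: value} dicts) by column ownership [lo, hi)."""
    diag = [{j: v for j, v in r.items() if lo <= j < hi} for r in rows]
    off = [{j: v for j, v in r.items() if not (lo <= j < hi)} for r in rows]
    return diag, off


def overlapped_matmult(rows, lo, hi, x_local, fetch_ghosts):
    """y = A x for the owned rows, with the local product done 'in flight'."""
    diag, off = split_rows(rows, lo, hi)

    # 1. Begin communication: request only the remote entries actually used.
    needed = sorted({j for r in off for j in r})
    complete = fetch_ghosts(needed)

    # 2. Local product needs no remote data, so it can overlap the exchange.
    y = [sum(v * x_local[j - lo] for j, v in r.items()) for r in diag]

    # 3. End communication, then add the off-diagonal contribution.
    ghosts = complete()
    for i, r in enumerate(off):
        y[i] += sum(v * ghosts[j] for j, v in r.items())
    return y
```

The point is that step 2 sits between the begin and the end of the
communication, so a model with a strict "communicate, then run the
callback" ordering has to say where that work goes.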

>> Your abstraction is not uniform
>> if you need to index into owned parts of shared data structures or
>> perform optimizations like cooperative prefetch.
>
> Not sure what you're saying here. In case you dug into my code deeply,
> at the moment gathering the halo region involves one send-to-self that
> should be optimized away in the future. Look, it's a prototype, all
> right?

I don't care about that.  What does the hybrid case look like?  Do you
prepare separate work buffers for each thread or do the threads work on
parts of a shared data structure, perhaps cooperating at higher
frequency than the MPI-level communication?  I thought you were creating
strict independence and private buffers, which would be MPI-like
semantics using threads (totally possible, and I think usually a good
thing, but most threaded programming models are explicitly trying to
avoid it).

>>  If you're going to use
>> separate buffers, why is a system like MPI not sufficient?
>
> 1. No idea why you're so fixated on buffers. I've just been going for
> the simplest implementation that makes it work on this one example. It
> can all be optimized.

What would the optimized version look like?  Someone has to decide how
to index into the data structures.  You're not providing much uniformity
relative to MPI+X (e.g., X=OpenMP) if the threaded part is always sorted
out by the user in an application-dependent way.

> 2. Why is MPI not sufficient: because you have to spell out too
> much. Calculating a halo in the case of a block-cyclically distributed
> vector is way too much work. An extension of IMP would make this very
> easy to specify, and all the hard work is done under the covers.

VecScatter and PetscSF are less intrusive interfaces on top of MPI.  I
agree that MPI_Isend is pretty low level, but what are you providing
relative to these less intrusive abstractions?  You said that you were
NOT looking for "a better notation for VecScatters", so let's assume
that such interfaces are available to the MPI programmer.
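
As an illustration of how little the user would need to spell out with
such an interface, here is a small Python sketch that computes the halo
of a block-cyclically distributed vector for a 3-point stencil.  The
function names and the particular layout convention (index i lives on
rank (i // block) % nprocs) are my assumptions for the example, not
anything taken from IMP or PETSc.  The result is the kind of index list
one would hand to a scatter-style interface.

```python
def owner(i, block, nprocs):
    """Rank owning global index i under a block-cyclic layout."""
    return (i // block) % nprocs


def owned_indices(rank, n, block, nprocs):
    """All global indices in [0, n) that live on `rank`."""
    return [i for i in range(n) if owner(i, block, nprocs) == rank]


def halo(rank, n, block, nprocs):
    """Ghost indices `rank` needs for a 3-point stencil, grouped by owner."""
    mine = set(owned_indices(rank, n, block, nprocs))
    ghosts = {}
    for i in mine:
        for j in (i - 1, i + 1):          # stencil neighbours of i
            if 0 <= j < n and j not in mine:
                ghosts.setdefault(owner(j, block, nprocs), set()).add(j)
    return {r: sorted(s) for r, s in sorted(ghosts.items())}
```

For example, with n=8, block=2, and two processes, rank 0 owns indices
0, 1, 4, 5 and halo(0, 8, 2, 2) lists the entries it needs from rank 1.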

>> What semantic does your abstraction provide for hybrid
>> distributed/shared memory that imperative communication systems cannot?
>
> How about having the exact same code that I've shown you, except that
> you specify that the processes are organized as N nodes times C cores?
> Right now I've not implemented hybrid programming, but it shouldn't be
> hard.

How does the code the user writes in the callback remain the same for
MPI and threads without the copies that AMPI or shared-memory MPI would
have done and without requiring the user to explicitly deal with the
threads (working on shared data structures as with OpenMP/TBB/pthreads)?

I'm looking for a precise statement of something user code can be
ignorant of with your model, yet reap the benefits of a model where they
explicitly used that information (e.g., a semantic not possible with
shared-memory MPI/AMPI, and possible using MPI+X only with some onerous
complexity that cannot be easily tucked into a generic library
function).