[petsc-dev] PetscBG

Jed Brown jedbrown at mcs.anl.gov
Tue Dec 20 21:48:10 CST 2011


I finished implementing the core functionality for PetscBG. Configure with
--with-bg=1 to activate this code.

Look at src/vec/bg/examples/tutorials/ex1.c

The man pages are mostly up (e.g.
http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/PetscBG/PetscBGSetGraph.html#PetscBGSetGraph);
the newly pushed ones should update tonight. You can also look at
include/petscbg.h.

There is test output for all operations except PetscBGFetchAndOpBegin/End(),
because that operation is (intentionally) non-deterministic. Don't mind that:
the implementations of gather and scatter use this call for setup, but they
sort, so their results are deterministic (unless you run with -bg_rank_order 0).

You can choose -bg_synchronization FENCE (currently the default), LOCK
(non-default, too sequential), or ACTIVE (least synchronous, but needs more
one-time setup). It makes sense to use FENCE if the graph will only be used
once, and ACTIVE if it will be used many times.
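
For example, to try the least synchronous mode on the tutorial example (a
hypothetical invocation; any other options ex1 wants are omitted):

  mpiexec -n 2 ./ex1 -bg_synchronization ACTIVE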

The communication concept is based on a generalization of a local-to-global
map: a bipartite graph in which each process has a set of "local" points, each
with at most one outgoing edge, and a set of "global" points, each with any
number of incoming edges. The entire graph structure is stored non-redundantly
on the local side (the global side has no knowledge of how many times its
points are accessed).
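
To make that concrete, here is a rough sketch of setting up such a graph. The
exact PetscBGSetGraph() argument order and the PetscBGNode field names below
are written from memory, so trust petscbg.h and the man page above over this
fragment.

  #include <petscbg.h>

  /* Sketch: this process owns `nowned` global points and mirrors `nlocal`
     remote points; remote[i] names the (rank,index) that local point i
     points to with its single outgoing edge. */
  PetscErrorCode BuildGraph(MPI_Comm comm,PetscBG *newbg)
  {
    PetscBG        bg;
    PetscBGNode    *remote;
    PetscInt       nowned = 4,nlocal = 2;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = PetscBGCreate(comm,&bg);CHKERRQ(ierr);
    ierr = PetscMalloc(nlocal*sizeof(PetscBGNode),&remote);CHKERRQ(ierr);
    remote[0].rank = 0; remote[0].index = 3; /* local point 0 -> point 3 on rank 0 */
    remote[1].rank = 1; remote[1].index = 0; /* local point 1 -> point 0 on rank 1 */
    /* ilocal = PETSC_NULL means the local points are numbered 0..nlocal-1 */
    ierr = PetscBGSetGraph(bg,nowned,nlocal,PETSC_NULL,PETSC_COPY_VALUES,remote,PETSC_OWN_POINTER);CHKERRQ(ierr);
    *newbg = bg;
    PetscFunctionReturn(0);
  }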

The most common operations are bcast (global-to-local) and reduce
(local-to-global with some combining function, e.g. MPI_SUM, MPI_MAX,
MPI_REPLACE).
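
In code, assuming the Begin/End calling sequences mirror petscbg.h (unit type,
source array, destination array, plus an MPI_Op for reduce), a sketch reusing
bg and ierr from the setup fragment above, not the authoritative interface:

  PetscScalar *global,*local; /* global: one entry per owned point; local: one per local point */
  /* ... allocate both and fill global ... */
  ierr = PetscBGBcastBegin(bg,MPIU_SCALAR,global,local);CHKERRQ(ierr);  /* owners -> ghosts */
  ierr = PetscBGBcastEnd(bg,MPIU_SCALAR,global,local);CHKERRQ(ierr);
  /* ... work on local ... */
  ierr = PetscBGReduceBegin(bg,MPIU_SCALAR,local,global,MPIU_SUM);CHKERRQ(ierr); /* ghosts -> owners, summed */
  ierr = PetscBGReduceEnd(bg,MPIU_SCALAR,local,global,MPIU_SUM);CHKERRQ(ierr);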

You can use any MPI datatype at each point, including heterogeneous
structures like (double,complex double,int,char). You just create the
"unit" type that you want to use.

There is atomic fetch-and-op. I usually use this as fetch-and-add as part
of a process for inverting a communication graph. You can see an example of
this in the internal function PetscBGGetMultiBG(). See ex1.c for a simpler
example.
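
A minimal sketch of that idiom (argument order as in step 5 of the recipe
further down; the variable names are just for illustration): each owner keeps a
counter, every ghoster atomically adds its contribution, and each ghoster gets
back the counter's previous value, i.e. a unique offset.

  PetscInt *owned_count; /* one counter per owned point, initialized to 0 */
  PetscInt *contrib;     /* per local point: how many items I want to deposit there */
  PetscInt *offset;      /* output: where my block of items starts at the owner */

  ierr = PetscBGFetchAndOpBegin(bg,MPIU_INT,owned_count,contrib,offset,MPIU_SUM);CHKERRQ(ierr);
  ierr = PetscBGFetchAndOpEnd(bg,MPIU_INT,owned_count,contrib,offset,MPIU_SUM);CHKERRQ(ierr);
  /* owned_count[i] now holds the total deposited at owned point i;
     offset[j] is where this process's items for local point j begin */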

There are gathers and scatters, which are useful if, at some point, the
algorithm needs to operate on independent data from all ghosters at once, or to
send different data to each ghoster. This is relatively rare, so the data
structure for these operations is created the first time they are used. Support
for these operations is the only part that is not arbitrarily memory-scalable.
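
A sketch of the gather direction and the matching scatter (argument order
assumed, not checked against petscbg.h; the "multi" array has one slot per
incoming edge, grouped by owned point):

  PetscScalar *local; /* one entry per local point */
  PetscScalar *multi; /* one entry per incoming edge on the owned side */

  ierr = PetscBGGatherBegin(bg,MPIU_SCALAR,local,multi);CHKERRQ(ierr);
  ierr = PetscBGGatherEnd(bg,MPIU_SCALAR,local,multi);CHKERRQ(ierr);
  /* ... the owner operates on the independent contributions in multi ... */
  ierr = PetscBGScatterBegin(bg,MPIU_SCALAR,multi,local);CHKERRQ(ierr); /* distinct data back to each ghoster */
  ierr = PetscBGScatterEnd(bg,MPIU_SCALAR,multi,local);CHKERRQ(ierr);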

All operations are split into Begin/End pairs. The same PetscBG can be used
in any number of simultaneous operations, provided that a different global
array is used for each.
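
For example, two broadcasts can be in flight at the same time as long as they
use different global arrays (placeholder array names below):

  ierr = PetscBGBcastBegin(bg,MPIU_SCALAR,globalA,localA);CHKERRQ(ierr);
  ierr = PetscBGBcastBegin(bg,MPIU_INT,globalB,localB);CHKERRQ(ierr);
  /* ... overlap with local work ... */
  ierr = PetscBGBcastEnd(bg,MPIU_SCALAR,globalA,localA);CHKERRQ(ierr);
  ierr = PetscBGBcastEnd(bg,MPIU_INT,globalB,localB);CHKERRQ(ierr);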

The current implementation relies on MPI-2 one-sided functionality and MPI
datatypes. Some operations, especially fetch-and-op (and consequently
gather/scatter setup), will benefit from native MPI-3 support, which will make
them less synchronous.

There is a bug in Open MPI's datatype handling, so you should use MPICH2 when
playing with this. I believe we are affected by at least the following two open
bugs:

https://svn.open-mpi.org/trac/ompi/ticket/2656
https://svn.open-mpi.org/trac/ompi/ticket/1905



I recently wrote the following description for Matt of how to invert the
communication graph as part of mesh redistribution from a partitioner; a code
sketch follows the description.

1. Count the number of ranks that your points should be sent to (nranks)
2. Create a PetscBG that maps from nranks local points to (rank, 0). That
is, each process has only one "owned" point. Call this bgcount.
3. Put the number of points destined to each rank in outgoing, and create
another array outoffset of the same size (nranks)
4. incoming[0] = 0
5.
PetscBGFetchAndOpBegin/End(bg,MPIU_INT,incoming,outgoing,outoffset,MPIU_SUM);

Now, incoming[0] holds the number of points you will receive and
outoffset[i] holds the offset on rank[i] at which to place outgoing[i]
points.

6. Call PetscBGSetGraph() to build this new graph that communicates all
your currently owned points to the remote processes at the offsets above.
7.
PetscBGReduceBegin/End(newbg,MPIU_POINT,outgoing_points,incoming_points,MPI_REPLACE);

I would typically make MPIU_POINT just represent a PetscBGNode indicating
where the point data is in my local space. That successfully inverts the
communication graph: I now have two-sided knowledge, and the local-to-global
map now goes from new owner to old owner.
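
Putting that recipe into code (a sketch only: the MPIU_POINT unit, the helper
array rank_of_point[], and the exact PetscBG argument orders are assumptions
used to illustrate the steps; nranks, destranks[], outgoing[], outoffset[],
incoming[], npoints, outgoing_points[], and incoming_points[] are assumed to be
set up as in the description above):

  /* Steps 1-2: one owned point per process; nranks local points pointing at (rank,0) */
  PetscBG     bgcount,newbg;
  PetscBGNode *rankremote,*pointremote;
  PetscInt    i,r;

  ierr = PetscMalloc(nranks*sizeof(PetscBGNode),&rankremote);CHKERRQ(ierr);
  for (i=0; i<nranks; i++) {rankremote[i].rank = destranks[i]; rankremote[i].index = 0;}
  ierr = PetscBGCreate(comm,&bgcount);CHKERRQ(ierr);
  ierr = PetscBGSetGraph(bgcount,1,nranks,PETSC_NULL,PETSC_COPY_VALUES,rankremote,PETSC_OWN_POINTER);CHKERRQ(ierr);

  /* Steps 3-5: outgoing[i] = number of points destined to destranks[i];
     fetch-and-add reserves a contiguous block of slots on each destination */
  incoming[0] = 0;
  ierr = PetscBGFetchAndOpBegin(bgcount,MPIU_INT,incoming,outgoing,outoffset,MPIU_SUM);CHKERRQ(ierr);
  ierr = PetscBGFetchAndOpEnd(bgcount,MPIU_INT,incoming,outgoing,outoffset,MPIU_SUM);CHKERRQ(ierr);
  /* incoming[0] = number of points I will receive;
     outoffset[i] = where my block starts on destranks[i] */

  /* Step 6: a graph sending each of my npoints current points to its destination slot */
  ierr = PetscMalloc(npoints*sizeof(PetscBGNode),&pointremote);CHKERRQ(ierr);
  for (i=0; i<npoints; i++) {
    r = rank_of_point[i];                  /* index into destranks[] for point i (hypothetical helper) */
    pointremote[i].rank  = destranks[r];
    pointremote[i].index = outoffset[r]++; /* consume this rank's reserved slots in order */
  }
  ierr = PetscBGCreate(comm,&newbg);CHKERRQ(ierr);
  ierr = PetscBGSetGraph(newbg,incoming[0],npoints,PETSC_NULL,PETSC_COPY_VALUES,pointremote,PETSC_OWN_POINTER);CHKERRQ(ierr);

  /* Step 7: move the point descriptors themselves */
  ierr = PetscBGReduceBegin(newbg,MPIU_POINT,outgoing_points,incoming_points,MPI_REPLACE);CHKERRQ(ierr);
  ierr = PetscBGReduceEnd(newbg,MPIU_POINT,outgoing_points,incoming_points,MPI_REPLACE);CHKERRQ(ierr);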


So far, I have found the implementation of PetscBG to be reasonably
straightforward and bug-free compared to raw 2-sided MPI code for similar
purposes. I'd like to hear suggestions about ways to make it better and
ways to improve the terminology/documentation.

I initially thought of this as a sibling of IS, so I put it in src/vec, but
perhaps it should be moved to src/sys because it's really generic.