[petsc-dev] PetscBG (to become PetscSF: Star Forest)

Jed Brown jedbrown at mcs.anl.gov
Sat Dec 24 22:42:21 CST 2011


1. I am prepared to rename PetscBG to PetscSF because this model is less
general than an arbitrary bipartite graph. Specifically, it is a "star
forest", a star being a very simple tree consisting of a root connected to
n>=0 leaves. Please speak up soon if you can think of a better name.

2. Documentation for SFBroadcast, SFReduce, and SFFetchAndAdd (with
illustrations) is attached. This is the "required" feature set in the sense
that gather and scatter are straightforward to implement in terms of these
primitives, but I will make the pictures for those operations too.

3. I anticipate moving PetscSF from src/vec/ to src/sys/ because it doesn't
use any feature of Vec and is a generic communication facility (similar to
src/sys/utils/mpimesg.c).

4. We should decide how to handle non-uniform block sizes. I'm inclined to
just add an additional optional argument to PetscSFSetGraph() (with a
default constant block size of 1 if NULL is passed).
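
A minimal sketch of how I imagine that looking (hypothetical only; nothing
below is implemented, and the existing arguments are abbreviated from memory
of the current interface, so treat them as assumptions):

    /* Hypothetical: an optional trailing array of per-point block sizes;
       passing NULL would mean a constant block size of 1 */
    PetscSFSetGraph(sf,nowned,nlocal,ilocal,PETSC_OWN_POINTER,
                    remote,PETSC_OWN_POINTER,
                    blocksizes);               /* or NULL */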

On Tue, Dec 20, 2011 at 21:48, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> I finished implementing the core functionality for PetscBG. Configure
> --with-bg=1 to activate this code.
>
> Look at src/vec/bg/examples/tutorials/ex1.c
>
> The man pages are mostly up (e.g.
> http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/PetscBG/PetscBGSetGraph.html#PetscBGSetGraph),
> the newly pushed ones should update tonight. You can also look at
> include/petscbg.h.
>
> There is test output for all operations except
> PetscBGFetchAndOpBegin/End() because this is (intentionally)
> non-deterministic. Don't mind that; the implementations of gather and
> scatter use this call for setup, but they sort so the results are
> deterministic (unless you pass -bg_rank_order 0).
>
> You can use -bg_synchronization FENCE (currently the default), LOCK
> (non-default, too sequential), or ACTIVE (least synchronous, but needs
> more one-time setup). It makes sense to use FENCE if the graph will only be
> used once, and ACTIVE if it will be used many times.
>
> The communication concept is based on a generalization of a
> local-to-global map, a bipartite graph where each process has a set of
> "local" points, each with at most 1 outgoing edge, and a set of "global"
> points, each with any number of incoming edges. The entire graph structure
> is stored non-redundantly on the local side (the global side has no
> knowledge of how many times points are accessed).
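>
> As a concrete illustration (a minimal sketch: error checking is omitted,
> and the exact PetscBGSetGraph/PetscBGNode argument and field names are
> written from memory of the man pages, so treat them as assumptions), a
> graph in which every rank owns one point and every rank except 0 ghosts
> point 0 of rank 0 could be set up like this:
>
>   PetscBG     bg;
>   PetscBGNode *remote = NULL;
>   PetscInt    nlocal  = 0;
>   PetscMPIInt rank;
>
>   MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
>   PetscBGCreate(PETSC_COMM_WORLD,&bg);
>   if (rank) {                                /* every rank except 0 has one ghost point */
>     nlocal = 1;
>     PetscMalloc(sizeof(PetscBGNode),&remote);
>     remote[0].rank  = 0;                     /* owning process */
>     remote[0].index = 0;                     /* owned point on that process */
>   }
>   /* one owned point per rank; NULL ilocal means ghost points are numbered contiguously from 0 */
>   PetscBGSetGraph(bg,1,nlocal,NULL,PETSC_OWN_POINTER,remote,PETSC_OWN_POINTER);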
>
> The most common operations are bcast (global-to-local) and reduce
> (local-to-global with some combining function, e.g. MPI_SUM, MPI_MAX,
> MPI_REPLACE).
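>
> Continuing that sketch, the two operations look like this (the argument
> order is my assumption from the man pages: broadcast goes global-to-local,
> reduce goes local-to-global with a combining op):
>
>   PetscInt gdata[1],ldata[1];
>
>   PetscBGBcastBegin(bg,MPIU_INT,gdata,ldata);           /* owner value -> ghosts */
>   PetscBGBcastEnd(bg,MPIU_INT,gdata,ldata);
>   PetscBGReduceBegin(bg,MPIU_INT,ldata,gdata,MPIU_SUM); /* ghost values summed into owner */
>   PetscBGReduceEnd(bg,MPIU_INT,ldata,gdata,MPIU_SUM);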
>
> You can use any MPI datatype at each point, including heterogeneous
> structures like (double,complex double,int,char). You just create the
> "unit" type that you want to use.
>
> There is atomic fetch-and-op. I usually use this as fetch-and-add as part
> of a process for inverting a communication graph. You can see an example of
> this in the internal function PetscBGGetMultiBG(). See ex1.c for a simpler
> example.
>
> There are gathers and scatters, which are useful if, at some point, the
> algorithm needs to operate on independent data from all ghosters at once,
> or to send different data to each ghoster. This is relatively rare, so the
> data structure for that use is created the first time it is used. Support
> for these operations is the only part that is not arbitrarily
> memory-scalable.
>
> All operations are split into Begin/End pairs. The same PetscBG can be
> used in any number of simultaneous operations, provided that a different
> global array is used for each.
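>
> For instance (continuing the earlier sketch), two broadcasts over the same
> graph can be in flight simultaneously as long as their global arrays differ:
>
>   PetscInt    gdata1[1],ldata1[1];
>   PetscScalar gdata2[1],ldata2[1];
>
>   PetscBGBcastBegin(bg,MPIU_INT,gdata1,ldata1);
>   PetscBGBcastBegin(bg,MPIU_SCALAR,gdata2,ldata2);   /* different global array */
>   /* ... independent local work overlaps both communications ... */
>   PetscBGBcastEnd(bg,MPIU_INT,gdata1,ldata1);
>   PetscBGBcastEnd(bg,MPIU_SCALAR,gdata2,ldata2);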
>
> The current implementation relies on MPI-2 one-sided functionality and MPI
> data types. Some operations, especially fetch-and-op (and consequently
> gather/scatter setup), will benefit from native MPI-3 support for this
> operation, which will make it less synchronous.
>
> There are bugs in Open MPI's datatype handling, so you should use MPICH2
> when playing with this. I believe we are affected by at least these two
> open bugs:
>
> https://svn.open-mpi.org/trac/ompi/ticket/2656
> https://svn.open-mpi.org/trac/ompi/ticket/1905
>
>
>
> I recently wrote the following description for Matt to invert the
> communication graph as part of mesh redistribution from a partitioner.
>
> 1. Count the number of ranks that your points should be sent to (nranks)
> 2. Create a PetscBG that maps from nranks local points to (rank, 0). That
> is, each process has only one "owned" point. Call this bgcount.
> 3. Put the number of points destined to each rank in outgoing, and create
> another array outoffset of the same size (nranks)
> 4. incoming[0] = 0
> 5.
> PetscBGFetchAndOpBegin/End(bgcount,MPIU_INT,incoming,outgoing,outoffset,MPIU_SUM);
>
> Now, incoming[0] holds the number of points you will receive and
> outoffset[i] holds the offset on rank[i] at which to place outgoing[i]
> points.
>
> 6. Call PetscBGSetGraph() to build this new graph that communicates all
> your currently owned points to the remote processes at the offsets above.
> 7.
> PetscBGReduceBegin/End(newbg,MPIU_POINT,outgoing_points,incoming_points,MPI_REPLACE);
>
> I would typically make MPIU_POINT just represent a PetscBGNode
> indicating where the point data is in my local space. That successfully
> inverts the communication graph: I now have two-sided knowledge, and the
> local-to-global map now runs from new owner to old owner.
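>
> Put together as code, the recipe looks roughly like the following sketch
> (error checking and the setup of the arrays are elided; destrank, npoints,
> newremote, and comm are placeholder names I introduce here, and the exact
> argument lists are again taken from memory of the man pages):
>
>   PetscBG     bgcount,newbg;
>   PetscBGNode *remote;
>   PetscInt    i,incoming[1];
>
>   /* Steps 1-2: nranks local points, each mapping to owned point 0 on a destination rank */
>   PetscBGCreate(comm,&bgcount);
>   PetscMalloc(nranks*sizeof(PetscBGNode),&remote);
>   for (i=0; i<nranks; i++) {remote[i].rank = destrank[i]; remote[i].index = 0;}
>   PetscBGSetGraph(bgcount,1,nranks,NULL,PETSC_OWN_POINTER,remote,PETSC_OWN_POINTER);
>
>   /* Steps 3-5: outgoing[i] = #points destined for destrank[i]; afterwards incoming[0]
>      is the number of points I will receive and outoffset[i] is my slot on destrank[i] */
>   incoming[0] = 0;
>   PetscBGFetchAndOpBegin(bgcount,MPIU_INT,incoming,outgoing,outoffset,MPIU_SUM);
>   PetscBGFetchAndOpEnd(bgcount,MPIU_INT,incoming,outgoing,outoffset,MPIU_SUM);
>
>   /* Step 6: each of my npoints points maps to (destination rank, outoffset[r] + its
>      position among my points going to that rank), stored in newremote */
>   PetscBGCreate(comm,&newbg);
>   PetscBGSetGraph(newbg,incoming[0],npoints,NULL,PETSC_OWN_POINTER,newremote,PETSC_OWN_POINTER);
>
>   /* Step 7: push the point descriptors to their new owners */
>   PetscBGReduceBegin(newbg,MPIU_POINT,outgoing_points,incoming_points,MPI_REPLACE);
>   PetscBGReduceEnd(newbg,MPIU_POINT,outgoing_points,incoming_points,MPI_REPLACE);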
>
>
> So far, I have found the implementation of PetscBG to be reasonably
> straightforward and bug-free compared to raw 2-sided MPI code for similar
> purposes. I'd like to hear suggestions about ways to make it better and
> ways to improve the terminology/documentation.
>
> I initially thought of this as a sibling of IS, so I put it in src/vec,
> but perhaps it should be moved to src/sys because it's really generic.
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sf.pdf
Type: application/pdf
Size: 178853 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20111224/7bfa1e70/attachment.pdf>

