1. I am prepared to rename PetscBG to PetscSF because this model is less general than an arbitrary bipartite graph. Specifically, it is a "star forest", a star being a very simple tree consisting of a root connected to n>=0 leaves. Please speak up soon if you can think of a better name.<div>

<br></div><div>2. Documentation for SFBroadcast, SFReduce, and SFFetchAndAdd (with illustrations) is attached. This is the "required" feature set in the sense that (gather and scatter are straightforword to implement in terms of these primitives), but I will make the pictures for those operations too.</div>

<div><br></div><div>3. I anticipate moving PetscSF from src/vec/ to src/sys/ because it doesn't use any feature of Vec and is a generic communication facility (similar to src/sys/utils/mpimesg.c).</div><div><br></div>

<div>4. We should decide how to handle non-uniform block sizes. I'm inclined to just an an additional option argument to PetscSFSetGraph() (with default constant size 1 if NULL is passed).<br><br><div class="gmail_quote">

On Tue, Dec 20, 2011 at 21:48, Jed Brown <span dir="ltr"><<a href="mailto:jedbrown@mcs.anl.gov">jedbrown@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I finished implementing the core functionality for PetscBG. Configure --with-bg=1 to activate this code.<div><br></div><div>Look at src/vec/bg/examples/tutorials/ex1.c</div><div><br></div><div>The man pages are mostly up (e.g. <a href="http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/PetscBG/PetscBGSetGraph.html#PetscBGSetGraph" target="_blank">http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/PetscBG/PetscBGSetGraph.html#PetscBGSetGraph</a>), the newly pushed ones should update tonight. You can also look at include/petscbg.h.</div>


<div><br></div><div>There is test output for all operations except PetscBGFetchAndOpBegin/End() because this is (intentionally) non-deterministic. Don't mind that, the implementation of gather and scatter use this call for setup, but they sort so the results are deterministic (unless you -bg_rank_order 0).</div>


<div><br></div><div>You can -bg_synchronization FENCE (currently the default), LOCK (non-default, too sequential), and ACTIVE (least synchronous, but needs more one-time setup). It makes sense to use FENCE if the graph will only be used once, and to use ACTIVE if it will be used many times.</div>


<div><br></div><div>The communication concept is based on a generalization of a local-to-global map, a bipartite graph where each process has a set of "local" points, each with at most 1 outgoing edge, and a set of "global" points, each with any number of incoming edges. The entire graph structure is stored non-redundantly on the local side (the global side has no knowledge of how many times points are accessed).</div>


<div><br></div><div>The most common operations are bcast (global-to-local) and reduce (local-to-global with some combining function, e.g. MPI_SUM, MPI_MAX, MPI_REPLACE).</div><div><br></div><div>You can use any MPI datatype at each point, including heterogeneous structures like (double,complex double,int,char). You just create the "unit" type that you want to use.</div>


<div><br></div><div>There is atomic fetch-and-op. I usually use this as fetch-and-add as part of a process for inverting a communication graph. You can see an example of this in the internal function PetscBGGetMultiBG(). See ex1.c for a simpler example.</div>


<div><br></div><div>There are gathers and scatters, which are useful if at some point, the algorithm needs to operate on independent data from all ghosters at once, or to send different data to each ghoster. This is relatively rare, so the data structure for that use is created the first time it is used. Support for these operations is the only part that is not arbitrarily memory-scalable.</div>


<div><br></div><div>All operations are split into Begin/End pairs. The same PetscBG can be used in any number of simultaneous operations, provided that a different global array is used for each.</div><div><br></div><div>

The current implementation relies on MPI-2 one-sided functionality and MPI data types. Some operations, especially fetch-and-op (and consequently gather/scatter setup) will benefit from native MPI-3 support for this operation, which will make it less synchronous.</div>


<div><br></div><div>There is a bug in Open MPI datatype handling, so you should use MPICH2 when playing with this. I believe we are affected by at least these two open bugs.</div><div><br></div><div><a href="https://svn.open-mpi.org/trac/ompi/ticket/2656" target="_blank">https://svn.open-mpi.org/trac/ompi/ticket/2656</a></div>


<div><a href="https://svn.open-mpi.org/trac/ompi/ticket/1905" target="_blank">https://svn.open-mpi.org/trac/ompi/ticket/1905</a></div><div><br></div><div><br></div><div><br></div><div>I recently wrote the following description for Matt to invert the communication graph as part of mesh redistribution from a partitioner.</div>


<div><br></div><div><div>1. Count the number of ranks that your points should be sent to (nranks)</div><div>2. Create a PetscBG that maps from nranks local points to (rank, 0). That is, each process has only one "owned" point. Call this bgcount.</div>


<div>3. Put the number of points destined to each rank in outgoing, create another array outoffset of same size (nranks)</div><div>4. incoming[0] = 0</div><div>5. PetscBGFetchAndOpBegin/End(bg,MPIU_INT,incoming,outgoing,outoffset,MPIU_SUM);</div>


<div><br></div><div>Now, incoming[0] holds the number of points you will receive and outoffset[i] holds the offset on rank[i] at which to place outgoing[i] points.</div><div><br></div><div>6. Call PetscBGSetGraph() to build this new graph that communicates all your currently owned points to the remote processes at the offsets above.</div>


<div>7. PetscBGReduceBegin/End(newbg,MPIU_POINT,outgoing_points,incoming_points,MPI_REPLACE);</div><div><br></div><div>I would typically make MPIU_POINT just represent a PetscBGNode() indicating where the point data is in my local space. That would have successfully inverted the communication graph: now I have two-sided knowledge and local-to-global now maps from newowner-to-oldowner.</div>


</div><div><br></div><div><br></div><div>So far, I have found the implementation of PetscBG to be reasonably straightforward and bug-free compared to raw 2-sided MPI code for similar purposes. I'd like to hear suggestions about ways to make it better and ways to improve the terminology/documentation.</div>


<div><br></div><div>I initially thought of this as a sibling of IS, so I put it in src/vec, but perhaps it should be moved to src/sys because it's really generic.</div>

</blockquote></div><br></div>