[petsc-dev] Generality of VecScatter

Thu Nov 24 13:28:51 CST 2011

   Jed,

     It sounds to me like you are proposing deprecating point to point in MPI and using the one-sided operations as the primitives? I  think this is still wrong because you still have no ability to combine different one-sided to do proper scheduling,

   Barry

On Nov 24, 2011, at 1:12 PM, Jed Brown wrote:

> On Thu, Nov 24, 2011 at 12:40, Mark F. Adams <mark.adams at columbia.edu> wrote:
> OK, that makes sense at least abstractly.  I'm skeptical that you are going to be able to do much with all this one sided stuff; it seems to be that you would always have the two sided data at some point and then you might setup one sided maps for repeated operations like for a MatVec that are efficient.  But this is kind of trivial.
> 
> Yeah, that's easy. The update for MatMult() just involves each process calling one MPI_Get() for each remote process that it needs data from. What to get is encoded in an MPI_Datatype so there is no user-visible packing on either side. These calls are all non-blocking and you finish the update with MPI_Win_fence().
> 
> You can test performance of this approach using -vecscatter_window (although it's slightly different because it does user-visible packing).
> 
> Of course we can easily build a two-sided representation, but it's not necessary. In the one-sided model above, the owner of a global point does not know how many local parts accessed it (and nowhere inside the MPI implementation or network stack is there storage proportional to the number of accessors, so there isn't a scalability problem with globally coupled dofs).
>  
> 
> I would suggest writing a non-trivail algorithm with your model.  I have two ideas:
> 
> 1) when a coarse AMG (use aggregation MG to me concrete) grid is constructed and you have a map (bipartite graph) with one edge for each fine grid point to a coarse grid point (not necessarily on your processor).  I have too few vertices per processor so I want to reduce the number of processors used in the coarse grid.
> 
> Do you assume the availability of a partitioner at this stage, or you just want to do something simple like rank-order aggregation? In my model, the owner of a point would somehow (e.g. by calling a partitioner) determine the new owner of that point. This information is communicated to the ghosters of that point (again without knowing how many there are), any other node/connectivity data is communicated from old owner to new owner (this is a 1-1 relation, thus trivial with these primitives), and finally the ghosters address the new location when they need updates.
>  
>  I can assume that my fine grid was well partitioned, and to keep the process local assume the fine grid was partitioned with a space filling curve, or something, such that when I aggregate processors (eg, (0-3), (4-7) ... if I reduced the number of active processor by 4) I still have decent data locality.  Eventually I would like a local algorithm to improve load balance at this point, but that is more than you might want to start with.
> 
> If you have a local graph partitioner that can tell you what to move for better load balance, then I think I can move it efficiently and without redundant storage or specification.
>  
> 
> 2) Any AMR algorithm that you like.  Again, you might want to punt on load balancing.
> 
> I think the primitives are fine for this. There is an algorithmic question of how to modify boundary zones with minimal communication, but if you decide how much you want to communicate when (e.g. on the chalk board), I will figure out how to implement it.
> 
> I think redistribution for load balancing is represented pretty elegantly, but it would be good to work out the details.