<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Nov 24, 2011, at 2:12 PM, Jed Brown wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div class="gmail_quote">On Thu, Nov 24, 2011 at 12:40, Mark F. Adams <span dir="ltr"><<a href="mailto:mark.adams@columbia.edu">mark.adams@columbia.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div>OK, that makes sense at least abstractly.  I'm skeptical that you are going to be able to do much with all this one sided stuff; it seems to be that you would always have the two sided data at some point and then you might setup one sided maps for repeated operations like for a MatVec that are efficient.  But this is kind of trivial.</div>

</blockquote><div><br></div><div>Yeah, that's easy. The update for MatMult() just involves each process calling one MPI_Get() for each remote process that it needs data from. What to get is encoded in an MPI_Datatype so there is no user-visible packing on either side. These calls are all non-blocking and you finish the update with MPI_Win_fence().</div>

<div><br></div><div>You can test performance of this approach using -vecscatter_window (although it's slightly different because it does user-visible packing).</div><div><br></div><div>Of course we can easily build a two-sided representation, but it's not necessary. In the one-sided model above, the owner of a global point does not know how many local parts accessed it (and nowhere inside the MPI implementation or network stack is there storage proportional to the number of accessors, so there isn't a scalability problem with globally coupled dofs).</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div><br></div><div>I would suggest writing a non-trivail algorithm with your model.  I have two ideas:</div>

<div><br></div><div>1) when a coarse AMG (use aggregation MG to me concrete) grid is constructed and you have a map (bipartite graph) with one edge for each fine grid point to a coarse grid point (not necessarily on your processor).  I have too few vertices per processor so I want to reduce the number of processors used in the coarse grid. </div>

</blockquote><div><br></div><div>Do you assume the availability of a partitioner at this stage, </div></div></blockquote><div><br></div><div>No, lets keep it simple for now.</div><br><blockquote type="cite"><div class="gmail_quote"><div>or you just want to do something simple like rank-order aggregation? </div></div></blockquote><div><br></div><div>Thats fine.</div><br><blockquote type="cite"><div class="gmail_quote"><div>In my model, the owner of a point would somehow (e.g. by calling a partitioner) determine the new owner of that point. </div></div></blockquote><div><br></div><div>My point is that at some point the rubber has to hit the road and you need two side information.  You can then set up pointers for some fast one-sided stuff, and that is fine but you seem to agree that this is trivial.  Its not clear to me that your model is not simply removing some (redundant) data from the user space, while it is probably in the MPI library.</div><br><blockquote type="cite"><div class="gmail_quote"><div>This information is communicated to the ghosters of that point (again without knowing how many there are), </div></div></blockquote><div><br></div><div>But you knew how many there were, and chose to forget it, and the MPI layer must know how many there are so it knows when its done communicating.  I'm not sure how you insure correctness with your model.</div><br><blockquote type="cite"><div class="gmail_quote"><div>any other node/connectivity data is communicated from old owner to new owner (this is a 1-1 relation, thus trivial with these primitives), and finally the ghosters address the new location when they need updates.</div>

<div> </div><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; "><div> I can assume that my fine grid was well partitioned, and to keep the process local assume the fine grid was partitioned with a space filling curve, or something, such that when I aggregate processors (eg, (0-3), (4-7) ... if I reduced the number of active processor by 4) I still have decent data locality.  Eventually I would like a local algorithm to improve load balance at this point, but that is more than you might want to start with.</div>

</blockquote><div><br></div><div>If you have a local graph partitioner that can tell you what to move for better load balance, then I think I can move it efficiently and without redundant storage or specification.</div></div></blockquote><div><br></div>How do you load balance with a local partitioner?</div><div><br></div><div>Mark</div><div><br><blockquote type="cite"><div class="gmail_quote"><div>

 </div><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; "><div><br></div><div>2) Any AMR algorithm that you like.  Again, you might want to punt on load balancing.</div>

</blockquote></div><br><div>I think the primitives are fine for this. There is an algorithmic question of how to modify boundary zones with minimal communication, but if you decide how much you want to communicate when (e.g. on the chalk board), I will figure out how to implement it.</div>

<div><br></div><div>I think redistribution for load balancing is represented pretty elegantly, but it would be good to work out the details.</div>

</blockquote></div><br></body></html>