[petsc-dev] Non-scalable matrix operations

Mark F. Adams mark.adams at columbia.edu
Fri Dec 23 12:27:42 CST 2011

On Dec 23, 2011, at 11:54 AM, Matthew Knepley wrote:

> On Fri, Dec 23, 2011 at 10:48 AM, Mark F. Adams <mark.adams at columbia.edu> wrote:
> On Dec 23, 2011, at 10:53 AM, Jed Brown wrote:
>> On Fri, Dec 23, 2011 at 09:50, Mark F. Adams <mark.adams at columbia.edu> wrote:
>> Humm, my G-S is not in PETSc and it is perfectly scalable.  It does have more complex communication patterns, but they are O(1) in latency and bandwidth.  I'm not sure I understand your description above.
>> It was more like, here's something that perhaps we want to put in PETSc, what rich communication pattern does it use, such that, if provided, the implementation would be simple?
> There is the implementation in Prometheus that uses my C++ linked lists and hash tables.  I would like to implement this with STL containers.  I also hack into MPIAIJ matrices to provide a primitive that applies G-S to an index set of local vertices, which the algorithm requires.  This should be rethought.  I would guess that it would take about a week or two to move this into PETSc.
> The complex communication required makes this code work much better with large subdomains, so it is getting less attractive in a flat MPI mode, as it is currently written.  If I do this I would like to think about doing it in the next programming model of PETSc (pthreads?).  Anyway, this would take enough work that I'd like to think a bit about its design, and even the algorithm, in a non-flat-MPI model.
> I think we should give at least some thought to how this would look in Thrust/OpenCL.

A simpler thing to do is do whatever you want in this new (hack) kernel that I mentioned.  This just applies G-S to an (MPI)AIJ matrix (or whatever you want to code up), however you want to do it.  This kernel just needs to apply G-S to a subset of the local equations.

A more interesting thing is to partition down to the thread level and keep about 100 vertices per thread (this might be too big for a GPU...), then use locks of some sort for the shared-memory synchronization and the existing MPI code for the distributed-memory part.  This would take a fair amount of work, but it would be very nice, and this type of synchronization comes up in other algorithms, like the fused multigrid that I'm working on now.


>    Matt
> Note, I see the win with G-S over Cheby in highly unsymmetric (convection, hyperbolic) problems where Cheby is not very good.
> Mark
> -- 
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener

