[petsc-dev] Non-scalable matrix operations

Fri Dec 23 12:27:42 CST 2011

On Dec 23, 2011, at 11:54 AM, Matthew Knepley wrote:

> On Fri, Dec 23, 2011 at 10:48 AM, Mark F. Adams <mark.adams at columbia.edu> wrote:
> 
> On Dec 23, 2011, at 10:53 AM, Jed Brown wrote:
> 
>> On Fri, Dec 23, 2011 at 09:50, Mark F. Adams <mark.adams at columbia.edu> wrote:
>> Humm, my G-S is not in PETSc and it is perfectly scalable.  It does have more complex communication patterns but they are O(1) in latency and bandwidth.  I'm not sure I understand your description above.
>> 
>> It was more like, here's something that perhaps we want to put in PETSc, what rich communication pattern does it use, such that, if provided, the implementation would be simple?
> 
> There is the implementation in Prometheus that uses my C++ linked lists and hash tables.  I would like to implement this with STLs.  I also hack into MPIAIJ matrices to provide a primitive of applying G-S on an index set of local vertices, required for the algorithm.  This should be rethought.  I would guess that it would take about a week or two to move this into PETSc.
> 
> The complex communication required makes this code work much better with large subdomains, so it is getting less attractive in a flat MPI mode, as it is currently written.  If I do this I would like to think about doing it in the next programming model of PETSc (pthreads?).  Anyway, this would take enough work that I'd like to think a bit about its design and even the algorithm in a non-flat MPI model.
> 
> I think we should give at least some thought to how this would look in Thrust/OpenCL.
> 

A simpler thing to do is to put whatever you want in this new (hack) kernel that I mentioned.  It just applies G-S on an (MPI)AIJ matrix (or whatever you want to code up), however you want to do it.  The kernel only needs to apply G-S to a subset of the local equations.
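
To make the kernel concrete, here is a minimal sketch of what "apply G-S to a subset of the local equations" could look like.  It is not the Prometheus or PETSc code: it assumes a purely local matrix in plain CSR arrays (`indptr`, `indices`, `data`), and the function name `gauss_seidel_subset` is made up for illustration.

```python
def gauss_seidel_subset(indptr, indices, data, b, x, rows):
    """Relax the given rows of a CSR matrix in place, in the order given.

    Hypothetical helper: indptr/indices/data are ordinary CSR arrays for
    the local block, b is the right-hand side, x the current iterate and
    rows the index set of local equations to update.
    """
    for i in rows:
        diag = 0.0
        acc = b[i]
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]            # remember a_ii
            else:
                acc -= data[k] * x[j]     # uses the newest values of x
        if diag == 0.0:
            raise ZeroDivisionError("zero diagonal in row %d" % i)
        x[i] = acc / diag
    return x


# Small example: the 3x3 tridiagonal matrix with 4 on the diagonal and -1
# off the diagonal; with this b the exact solution is all ones.
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
b = [3.0, 2.0, 3.0]

x = [0.0, 0.0, 0.0]
gauss_seidel_subset(indptr, indices, data, b, x, rows=[1])  # only row 1 changes
```

The caller decides which rows go into `rows` and in what order, which is the only thing the parallel algorithm needs from this primitive.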

A more interesting thing is to partition down to the thread level, keeping about 100 vertices per thread (this might be too big for a GPU...), and then use locks of some sort for the shared-memory synchronization and the existing MPI code for the distributed-memory part.  This would take a fair amount of work, but it would be very nice, and this type of synchronization comes up in other algorithms, like the fused multigrid that I'm working on now.
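
As a rough illustration of the shared-memory part only, the sketch below partitions the rows into chunks of about 100, gives each chunk one thread and one lock, and has a thread take the locks of every chunk a row touches before relaxing it.  All of this is an assumption about one possible design: the names `partition` and `threaded_gauss_seidel`, the one-lock-per-chunk granularity and the sorted lock order (used to avoid deadlock) are invented here, the MPI part is left out, and Python threads only show the locking pattern, not a speedup.

```python
import threading


def partition(n, chunk):
    """Split rows 0..n-1 into contiguous chunks of at most `chunk` rows."""
    return [list(range(s, min(s + chunk, n))) for s in range(0, n, chunk)]


def threaded_gauss_seidel(indptr, indices, data, b, x, chunk=100, sweeps=1):
    """Hypothetical lock-based G-S on a CSR matrix: one thread per chunk."""
    n = len(b)
    parts = partition(n, chunk)
    owner = [i // chunk for i in range(n)]       # chunk that owns each row
    locks = [threading.Lock() for _ in parts]    # one lock per chunk

    def relax(i):
        diag, acc = 0.0, b[i]
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]
            else:
                acc -= data[k] * x[j]
        x[i] = acc / diag                        # assumes a nonzero diagonal

    def worker(p):
        for i in parts[p]:
            # Lock every chunk this row reads or writes.  Taking the locks
            # in ascending order means two threads can never wait on each
            # other in a cycle.
            cols = range(indptr[i], indptr[i + 1])
            needed = sorted({owner[i]} | {owner[indices[k]] for k in cols})
            for q in needed:
                locks[q].acquire()
            try:
                relax(i)
            finally:
                for q in reversed(needed):
                    locks[q].release()

    for _ in range(sweeps):
        threads = [threading.Thread(target=worker, args=(p,))
                   for p in range(len(parts))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()                             # barrier between sweeps
    return x
```

The order in which rows on chunk boundaries get relaxed depends on thread scheduling, so the iterates are not reproducible run to run, although each update is still an ordinary G-S row relaxation.  A real implementation would presumably lock only the boundary vertices rather than whole chunks.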

Mark

>    Matt
>  
> Note, I see the win with G-S over Cheby in highly unsymmetric (convection, hyperbolic) problems where Cheby is not very good.
> 
> Mark
> 
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
