On Thu, Jul 9, 2009 at 7:31 AM, Jed Brown <span dir="ltr">&lt;<a href="mailto:jed@59a2.org">jed@59a2.org</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">Matthew Knepley wrote:<br>

<br>

> PCs which have high flop to memory access ratios look good.  No<br>

> surprise there.<br>

<br>

</div>My concern here is that almost all "good" preconditioners are<br>

multiplicative in the fine-grained kernels or do significant work on<br>

coarse levels.  Both of these are very bad for putting on a GPU.<br>

Switching from SOR or ILU to Jacobi or red-black GS will greatly improve<br>

the throughput on a GPU, but is normally much less effective.  Since the<br>

GPU typically needs thousands of threads to attain high performance,<br>

it's really hard to use on all but the finest level.</blockquote><div><br>I agree with all these comments. I have no idea how to make those PCs<br>work. I am counting on Barry's genius here.<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<br>

One of the more interesting preconditioners would be 3-level balancing<br>

or overlapping DD with very small subdomains (like thousands of<br>

subdomains per process).  There would then be 1 subregion per process<br>

and a global coarse level.  This would allow the PC to be additive with<br>

chunks of the right block size, while keeping a minimal amount of work<br>

on the coarser levels (which are handled by the CPU).  (It's really hard<br>

to get multigrid to coarsen this rapidly, as in 1M dofs to 10 dofs in 2<br>

levels.)  Unfortunately, this sort of scheme is rather problem- and<br>

discretization-dependent, as well as rather complex to implement.</blockquote><div><br>With regard to targets, my strategy is to implement things that I can<br>prove work well on a GPU. For starters, we have the fast multipole method (FMM). We have built<br>

a complete computational model and can prove that this will scale almost<br>indefinitely. The first paper is out, and the other two are almost done. We are<br>also implementing wavelets, since the structure and proofs are very similar<br>

to FMM.<br> <br>The strategy is to use FMM/wavelets, on the problems they can solve, to precondition<br>more complex problems. The prototype is constant-viscosity Stokes preconditioning variable-viscosity<br>Stokes, which I am working on with Dave May and Dave Yuen.<br>
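<br>A minimal sketch of the "simple solver preconditions a harder problem" idea, under loud assumptions: this is not the FMM or Stokes code discussed in this thread. A 1D variable-coefficient diffusion operator stands in for variable-viscosity Stokes, and an exact tridiagonal (Thomas) solve of the constant-coefficient operator stands in for FMM. The names apply_A, solve_constant, pcg and k_half are hypothetical, chosen only for this illustration.<br>

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def apply_A(k_half, x):
    """y = A x for the 1D operator -(k u')' with Dirichlet ends (h^2 scaling dropped).
    k_half[i] is the coefficient on the edge between nodes i-1 and i."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i + 1 != n else 0.0
        y[i] = k_half[i] * (x[i] - left) + k_half[i + 1] * (x[i] - right)
    return y

def solve_constant(r):
    """Solve P z = r exactly with P = tridiag(-1, 2, -1) (the k = 1 operator),
    using the Thomas algorithm. This plays the role of the fast 'simple' solver."""
    n = len(r)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = -0.5
    d[0] = r[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (r[i] + d[i - 1]) / denom
    z = [0.0] * n
    z[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        z[i] = d[i] - c[i] * z[i + 1]
    return z

def pcg(k_half, b, precond=None, tol=1e-8, maxiter=5000):
    """Preconditioned conjugate gradients; precond=None gives plain CG.
    Returns the approximate solution and the number of iterations used."""
    x = [0.0] * len(b)
    r = list(b)
    z = precond(r) if precond else list(r)
    p = list(z)
    rz = dot(r, z)
    bnorm = math.sqrt(dot(b, b))
    for it in range(1, maxiter + 1):
        Ap = apply_A(k_half, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        if tol * bnorm >= math.sqrt(dot(r, r)):
            return x, it
        z = precond(r) if precond else list(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, maxiter

n = 200
k_half = [1.0 + 9.0 * j / n for j in range(n + 1)]  # coefficient varies from 1 to 10
b = [1.0] * n
x_cg, its_cg = pcg(k_half, b)                           # no preconditioner
x_pcg, its_pcg = pcg(k_half, b, precond=solve_constant)  # constant-coefficient solve as PC
print("plain CG iterations:", its_cg)
print("preconditioned CG iterations:", its_pcg)
```

Because the coefficient stays between 1 and 10, the preconditioned operator has a condition number of at most 10 regardless of n, so the preconditioned iteration count should stay roughly flat as the grid is refined while plain CG grows with n. All of the expensive work then sits in the simple solve, which is the part one would hope to run fast on the GPU.<br>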

<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>

I'll be interested to see what sort of performance you can get for real<br>

preconditioners on a GPU.</blockquote><div><br>Felipe Cruz has preliminary numbers for FMM: 500 GFlop/s on a single Tesla C1060!<br>That is probably 10 times what you can hope to achieve with traditional<br>relaxation (I think).<br>

<br>   Matt<br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><font color="#888888"><br>

Jed<br></font></blockquote></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener<br>