GPU related stuff

Fri Jul 10 15:17:42 CDT 2009

Thanks, all, for the comments.

--- On Thu, 7/9/09, Matthew Knepley <knepley at gmail.com> wrote:

> From: Matthew Knepley <knepley at gmail.com>
> Subject: Re: GPU related stuff
> To: "For users of the development version of PETSc" <petsc-dev at mcs.anl.gov>
> Date: Thursday, July 9, 2009, 5:09 PM
> On Thu, Jul 9, 2009 at 7:31 AM, Jed Brown <jed at 59a2.org> wrote:
>
> Matthew Knepley wrote:
>
> > PCs which have high flop to memory access ratios look good.  No
> > surprise there.
>
> My concern here is that almost all "good" preconditioners are
> multiplicative in the fine-grained kernels or do significant work on
> coarse levels.  Both of these are very bad for putting on a GPU.
> Switching from SOR or ILU to Jacobi or red-black GS will greatly improve
> the throughput on a GPU, but is normally much less effective.  Since the
> GPU typically needs thousands of threads to attain high performance,
> it's really hard to use on all but the finest level.
>
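
The contrast drawn above between multiplicative and additive kernels can be
sketched on a toy problem. What follows is only an illustration under my own
assumptions, not anything from PETSc: a 1D Poisson model problem, plain
Python lists, and hypothetical function names. Jacobi reads only the old
vector, so every point could go to its own thread. Gauss-Seidel reads values
written earlier in the same sweep, which serializes it. Red-black ordering
recovers parallelism within each colour.

```python
def jacobi_sweep(u, f, h):
    """Additive update: every new value depends only on the old vector,
    so all interior points could be computed by independent threads."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return new


def gauss_seidel_sweep(u, f, h):
    """Multiplicative update: point i uses the value just written at
    i - 1, which forces the points to be processed one after another."""
    u = list(u)
    for i in range(1, len(u) - 1):
        u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u


def red_black_sweep(u, f, h):
    """Two half-sweeps: odd-indexed points first, then even-indexed ones.
    Points of one colour only read points of the other colour, so each
    half-sweep is parallel even though the two halves are sequential."""
    u = list(u)
    for start in (1, 2):
        for i in range(start, len(u) - 1, 2):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u


def max_error_after(sweep, n=17, sweeps=50):
    """Max-norm error after `sweeps` sweeps on -u'' = 0, u(0) = u(1) = 0.
    The exact solution is zero; the initial guess is 1 in the interior."""
    h = 1.0 / (n - 1)
    f = [0.0] * n
    u = [0.0] + [1.0] * (n - 2) + [0.0]
    for _ in range(sweeps):
        u = sweep(u, f, h)
    return max(abs(x) for x in u)
```

Calling max_error_after with each of the three sweeps shows the trade-off on
this model problem: the Jacobi sweep is the easiest to parallelize but
leaves the largest error after the same number of sweeps.
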
> I agree with all these comments. I have no idea how to make those PCs
> work. I am counting on Barry's genius here.
>
> One of the more interesting preconditioners would be 3-level balancing
> or overlapping DD with very small subdomains (like thousands of
> subdomains per process).  There would then be 1 subregion per process
> and a global coarse level.  This would allow the PC to be additive with
> chunks of the right block size, while keeping a minimal amount of work
> on the coarser levels (which are handled by the CPU).  (It's really hard
> to get multigrid to coarsen this rapidly, as in 1M dofs to 10 dofs in 2
> levels.)  Unfortunately, this sort of scheme is rather problem- and
> discretization-dependent, as well as rather complex to implement.
>
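
The parenthetical about coarsening rates can be checked with a little
arithmetic. This is a rough sketch under two assumptions of mine that are not
stated in the thread: that "in 2 levels" means two coarsening steps, and that
ordinary multigrid coarsens by roughly a factor of 8 per level (doubling the
mesh width in 3D). The helper names are hypothetical.

```python
import math


def levels_needed(fine_dofs, coarse_dofs, factor):
    """Coarsening steps required to shrink fine_dofs to at most
    coarse_dofs when every step divides the problem size by `factor`."""
    return math.ceil(math.log(fine_dofs / coarse_dofs) / math.log(factor))


def factor_needed(fine_dofs, coarse_dofs, steps):
    """Per-step coarsening factor required to shrink fine_dofs to
    coarse_dofs in exactly `steps` equal coarsening steps."""
    return (fine_dofs / coarse_dofs) ** (1.0 / steps)


if __name__ == "__main__":
    # Assumed factor of 8 per level; see the note above.
    print(levels_needed(1_000_000, 10, 8))
    # Factor each step must achieve to get there in two steps.
    print(factor_needed(1_000_000, 10, 2))
```

Under those assumptions, going from 1M dofs to 10 dofs takes 6 steps at a
factor of 8, while doing it in two steps needs a factor of about 316 per
step, which is why the scheme leans on very small subdomains rather than on
standard multigrid coarsening.
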
> With regard to targets, my strategy is to implement things that I can
> prove work well on a GPU. For starters, we have FMM. We have done
> a complete computational model and can prove that this will scale almost
> indefinitely. The first paper is out, and the other 2 are almost done.
> We are also implementing wavelets, since the structure and proofs are
> very similar to FMM.
>
> The strategy is to use FMM/Wavelets for problems they can solve to
> precondition more complex problems. The prototype is Stokes
> preconditioning variable-viscosity Stokes, which I am working on with
> Dave May and Dave Yuen.
>
> I'll be interested to see what sort of performance you can get for real
> preconditioners on a GPU.
>
> Felipe Cruz has preliminary numbers for FMM: 500 GF on a single 1060C!
> That is probably 10 times what you can hope to achieve with traditional
> relaxation (I think).
> 
> 
>    Matt
>  
> 
> Jed
> -- 
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener