[petsc-dev] Improving and stabilizing GPU support

Dave Nystrom dnystrom1 at comcast.net
Fri Jul 19 19:10:11 CDT 2013


Hi Karli,

Karl Rupp writes:
 > Hi Dave,
 > 
 > > Your list sounds great to me.  Glad that you and Paul are working on 
 > > this together.
 > >
 > > My main interests are in better preconditioner support and better
 > > multi-GPU/MPI scalability.
 > 
 > This is follow-up work then. There are a couple of 'simple'
 > preconditioners (polynomial preconditioning, maybe some point-block
 > Jacobi) which can also be useful as smoothers and which we can add in the
 > near future. We should just get the 'infrastructure' work done first so
 > that we don't have to unnecessarily adjust too much code later on.

That sounds very reasonable.  Regarding polynomial preconditioning, were you
thinking of least squares polynomial preconditioning or something else?
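
Either way, part of the appeal to me is that applying a polynomial
preconditioner amounts to a short sequence of matrix-vector products and
vector updates, which should map well onto a GPU.  As a rough sketch of what
I mean (plain NumPy, with the function name and the coefficients made up by
me; a least-squares variant would just choose the coefficients differently):

  import numpy as np

  def apply_poly_prec(A, r, coeffs):
      """Apply z = p(A) r for p(t) = c0 + c1*t + ... using Horner's rule.

      Only matrix-vector products and axpy-style updates are needed, which
      is why this kind of preconditioner looks attractive on a GPU.  The
      coefficients are supplied by the caller; a least-squares variant
      would pick them to minimize a weighted residual polynomial norm.
      """
      z = coeffs[-1] * r
      for c in reversed(coeffs[:-1]):
          z = A.dot(z) + c * r
      return z

  # Tiny illustration on a small diagonally dominant matrix.
  n = 5
  A = 2.0 * np.eye(n) + 0.1 * np.random.rand(n, n)
  r = np.ones(n)
  z = apply_poly_prec(A, r, coeffs=[1.0, -0.5, 0.25])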

 > > Is there any progress on Steve Dalton's work on the cusp algebraic multigrid
 > > preconditioner with PETSc?  I believe Jed said in a previous email that Steve
 > > was going to be working on adding MPI support for that as well as other
 > > enhancements.
 > 
 > Yes, Steve is working on this right here at our division. Jed can give a 
 > more detailed answer on this.

Glad to hear that project is going forward.

 > > Will there be any improvements for GPU preconditioners in ViennaCL 1.5.0?
 > > When do you expect ViennaCL 1.5.0 to be available in PETSc?
 > 
 > Jed gave me a good hint with respect to D-ILU0, which I'll also add to 
 > PETSc. As with other GPU-accelerated ILU variants, it will require a
 > proper matrix ordering to give good performance. I'm somewhat tempted to 
 > port the SA-AMG implementation in CUSP to OpenCL as well, but this 
 > certainly won't be in 1.5.0.

Porting SA-AMG to OpenCL also sounds attractive.  I thought the ViennaCL
documentation already mentioned an algebraic multigrid preconditioner in
alpha or beta status.
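
On the matrix ordering point: I assume that means reordering the matrix
before the ILU factorization, e.g. with reverse Cuthill-McKee, along the
lines of what -pc_factor_mat_ordering_type rcm does for the CPU ILU in
PETSc?  If so, it would be good to know whether the GPU path will pick up
such an ordering automatically or whether the user has to request it.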

 > > I'm also interested in trying the PETSc ViennaCL support on the Xeon Phi.
 > > Do you have a schedule for when that might be ready for friendly testing?
 > 
 > With OpenCL you can already test this now. Just install the Intel OpenCL 
 > SDK on your Xeon Phi machine, configure with --download-viennacl, 
 > --with-opencl-include=..., --with-opencl-lib=..., and pass the
 >    -viennacl_device_accelerator
 > flag in addition to -vec_type viennacl -mat_type aijviennacl when executing.
 > 
 > Unfortunately, the application memory bandwidth we get on the Xeon Phi is
 > too limited for off-loaded execution, which is what OpenCL gives us, to be
 > useful: even the folks at Intel could not obtain more than ~95 GB/sec when
 > filling the whole MIC with just two vectors to benchmark a simple copy
 > operation.
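
Thanks for the recipe.  Just so I have it straight, I am assuming the full
configure and run invocation would look roughly like the lines below, where
the include and library paths are placeholders for wherever the Intel OpenCL
SDK ends up on my machine, and ./my_app stands in for whatever PETSc
executable I test with:

  ./configure --download-viennacl \
              --with-opencl-include=/path/to/intel/opencl/include \
              --with-opencl-lib=/path/to/intel/opencl/lib64/libOpenCL.so

  ./my_app -vec_type viennacl -mat_type aijviennacl -viennacl_device_accelerator

Please correct me if I have any of that wrong.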

I'm still trying to get my mind around the memory bandwidth issue for sparse
linear algebra.  Your report above of the Intel result adds to my confusion.
From my understanding, the theoretical peak memory bandwidth for some systems
of interest is as follows:

Dual socket Sandy Bridge:  102 GB/s
Nvidia Kepler K20X:        250 GB/s
Intel Xeon Phi:            350 GB/s

What I am trying to understand is what sort of memory bandwidth is achievable
by a good implementation for the sparse linear algebra that PETSc does with
an iterative solver like CG using Jacobi preconditioning.  The plots which I
sent links to yesterday seemed to show memory bandwidth for a dual socket
Sandy Bridge to be well below the theoretical peak, perhaps less than 50 GB/s
for 16 threads.  For Xeon Phi, you are saying that Intel could not get more
than 95 GB/s.  But I saw a presentation last week where Nvidia was getting
about 200 GB/s for a matrix transpose.  So I wonder whether the different
systems are equally good at exploiting their theoretical peak memory
bandwidth, or whether one of them, like the Nvidia K20X, is simply better at
it.  If the latter, I would expect a good sparse linear algebra
implementation on a Kepler K20X to be 4-5 times faster than a good
implementation on a dual socket Sandy Bridge node, rather than the roughly
2.5x difference the peak numbers alone would suggest.
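
To make my confusion concrete, here is the back-of-envelope comparison I keep
coming back to, using only the numbers quoted above (the achieved figures come
from unrelated benchmarks, a copy on the Phi versus a transpose on the K20X,
so treat this as an illustration rather than a fair test):

Dual socket Sandy Bridge:  ~50 / 102 GB/s   (roughly 50% of peak)
Intel Xeon Phi:            ~95 / 350 GB/s   (roughly 27% of peak)
Nvidia Kepler K20X:       ~200 / 250 GB/s   (roughly 80% of peak)

Since sparse matrix-vector products and Jacobi-preconditioned CG are
essentially limited by how fast the matrix and vectors can be streamed from
memory, those achieved numbers would put the K20X about 4x ahead of the dual
socket Sandy Bridge node, which is where my 4-5 times estimate comes from.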

 > Thus, I don't think our efforts are currently well spent on trying a fully
 > native execution of PETSc on the MIC, because the trend is going more
 > towards a tighter CPU/accelerator integration on the same die rather than
 > piggy-backing via PCI-Express. Anyway, I'll let you know if there are any
 > updates on this front.

Thanks.  Please do.

Best regards,

Dave

 > Best regards,
 > 
 > Karli


