[petsc-dev] Improving and stabilizing GPU support

Karl Rupp rupp at mcs.anl.gov
Fri Jul 19 16:54:40 CDT 2013


Hi Dave,

> Your list sounds great to me.  Glad that you and Paul are working on
> this together.
>
> My main interests are in better preconditioner support and better multi-GPU/MPI
> scalability.

This is follow-up work then. There are a couple of 'simple' 
preconditioners (polynomial preconditioning, maybe some point-block 
Jacobi) which are also useful as smoothers and which we can add in 
the near future. We should just get the 'infrastructure' work done first 
so that we don't have to rework too much code later on.


> Is there any progress on Steve Dalton's work on the cusp algebraic multigrid
> preconditioner with PETSc?  I believe Jed said in a previous email that Steve
> was going to be working on adding MPI support for that as well as other
> enhancements.

Yes, Steve is working on this right here in our division. Jed can give a 
more detailed answer on this.


> Will there be any improvements for GPU preconditioners in ViennaCL 1.5.0?
> When do you expect ViennaCL 1.5.0 to be available in PETSc?

Jed gave me a good hint with respect to D-ILU0, which I'll also add to 
PETSc. As with other GPU-accelerated ILU variants, it will require a 
suitable matrix ordering to give good performance. I'm somewhat tempted 
to port the SA-AMG implementation in CUSP to OpenCL as well, but this 
certainly won't be in 1.5.0.


> I'm also interested in trying the PETSc ViennaCL support on the Xeon Phi.
> Do you have a schedule for when that might be ready for friendly testing?

You can already test this now via OpenCL. Just install the Intel OpenCL 
SDK on your Xeon Phi machine, configure with --download-viennacl, 
--with-opencl-include=..., --with-opencl-lib=..., and pass the
   -viennacl_device_accelerator
flag in addition to -vec_type viennacl -mat_type aijviennacl when executing.
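For reference, the whole sequence could look like the sketch below. The SDK paths and the example binary are placeholders for whatever your local installation and test case happen to be:

```shell
# Configure PETSc with ViennaCL and the Intel OpenCL SDK.
# The include/lib paths are placeholders -- substitute the locations
# from your local Intel OpenCL SDK installation.
./configure --download-viennacl \
            --with-opencl-include=/path/to/opencl/include \
            --with-opencl-lib=/path/to/opencl/lib/libOpenCL.so

# Run any PETSc example (here a hypothetical ./ex19) with ViennaCL
# vector/matrix types, targeting the accelerator (i.e. the Xeon Phi):
./ex19 -vec_type viennacl -mat_type aijviennacl \
       -viennacl_device_accelerator
```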

Unfortunately, the memory bandwidth available to applications on the 
Xeon Phi is too limited to be useful for off-loaded execution, which is 
how OpenCL drives the device: even the folks at Intel couldn't obtain 
more than ~95 GB/sec when filling up the whole MIC with just two vectors 
to benchmark a simple copy operation. I also don't think our efforts are 
currently well spent on a fully native execution of PETSc on the MIC, 
because the trend is going towards tighter CPU/accelerator integration 
on the same die rather than piggy-backing via PCI-Express. Anyway, I'll 
let you know if there are any updates on this front.

Best regards,
Karli

