[petsc-dev] Improving and stabilizing GPU support
Karl Rupp
rupp at mcs.anl.gov
Fri Jul 19 20:28:34 CDT 2013
Hi Dave,
> That sounds very reasonable. Regarding polynomial preconditioning, were you
> thinking of least squares polynomial preconditioning or something else?
I haven't thought about anything specific yet, just about the
infrastructure for applying any p(A).
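Roughly what I have in mind (just a sketch from my side, nothing is
implemented yet, and the coefficient handling below is only a placeholder)
is a PCSHELL-type apply that evaluates y = p(A) x via Horner's scheme, so
that any polynomial can be plugged in:

#include <petscksp.h>

typedef struct {
  Mat         A;
  PetscScalar coeff[4];   /* placeholder: coefficients c_0..c_3 of p */
  PetscInt    degree;     /* here 3 */
  Vec         work;
} PolyCtx;

static PetscErrorCode PCApply_Poly(PC pc, Vec x, Vec y)
{
  PolyCtx       *ctx;
  PetscInt       k;
  PetscErrorCode ierr;

  ierr = PCShellGetContext(pc, (void**)&ctx); CHKERRQ(ierr);
  /* Horner: y = c_n * x, then y = A*y + c_k * x for k = n-1, ..., 0 */
  ierr = VecCopy(x, y); CHKERRQ(ierr);
  ierr = VecScale(y, ctx->coeff[ctx->degree]); CHKERRQ(ierr);
  for (k = ctx->degree - 1; k >= 0; --k) {
    ierr = MatMult(ctx->A, y, ctx->work); CHKERRQ(ierr);            /* work = A*y       */
    ierr = VecWAXPY(y, ctx->coeff[k], x, ctx->work); CHKERRQ(ierr); /* y = c_k*x + work */
  }
  return 0;
}

One would hook this up via PCSetType(pc, PCSHELL), PCShellSetContext() and
PCShellSetApply(); a least-squares polynomial preconditioner would then only
differ in how the coefficients are chosen.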
> > > Will there be any improvements for GPU preconditioners in ViennaCL 1.5.0?
> > > When do you expect ViennaCL 1.5.0 to be available in PETSc?
> >
> > Jed gave me a good hint with respect to D-ILU0, which I'll also add to
> > PETSc. As with other GPU-accelerations using ILU, it will require a
> > proper matrix ordering to give good performance. I'm somewhat tempted to
> > port the SA-AMG implementation in CUSP to OpenCL as well, but this
> > certainly won't be in 1.5.0.
>
> Porting SA-AMG to OpenCL also sounds attractive. I was thinking that the
> ViennaCL documentation already mentioned an algebraic preconditioner that was
> in alpha or beta status.
The current AMG implementations all require a CPU-based setup stage and
thus limit the gain you can eventually get. In cases where the setup cost
is less pronounced (e.g. when lagging the preconditioner for nonlinear or
time-dependent problems) this is fine, but for stationary linear problems
with regular operators it is not very competitive.
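To illustrate what I mean by amortizing the setup (a sketch only, with
hypothetical names and using the current KSPSetOperators() calling
sequence, which has changed between PETSc versions): if the operator stays
fixed, the expensive CPU-side setup happens during the first KSPSolve(),
and all later solves with new right-hand sides reuse it.

#include <petscksp.h>

PetscErrorCode solve_time_steps(Mat A, Vec *rhs, Vec x, PetscInt nsteps)
{
  KSP            ksp;
  PetscInt       step;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);  /* e.g. select an AMG PC here */

  for (step = 0; step < nsteps; ++step) {
    /* The PC setup runs only in the first iteration; as long as the
     * operator is not re-set, the subsequent solves reuse it. */
    ierr = KSPSolve(ksp, rhs[step], x); CHKERRQ(ierr);
  }
  ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
  return 0;
}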
> I'm still trying to get my mind around the memory bandwidth issue for sparse
> linear algebra. Your report above of the Intel result adds to my confusion.
> From my understanding, the theoretical peak memory bandwidth for some systems
> of interest is as follows:
>
> Dual socket Sandy Bridge: 102 GB/s
> Nvidia Kepler K20X: 250 GB/s
> Intel Xeon Phi: 350 GB/s
>
> What I am trying to understand is what sort of memory bandwidth is achievable
> by a good implementation for the sparse linear algebra that PETSc does with
> an iterative solver like CG using Jacobi preconditioning. The plots which I
> sent links to yesterday seemed to show memory bandwidth for a dual socket
> Sandy Bridge to be well below the theoretical peak, perhaps less than 50 GB/s
> for 16 threads. For Xeon Phi, you are saying that Intel could not get more
> than 95 GB/s. But I saw a presentation last week where Nvidia was getting
> about 200 GB/s for a matrix transpose. So it makes me wonder if the
> different systems are equally good at exploiting their theoretical peak
> memory bandwidths or whether one, like the Nvidia K20X, might be better. If
> that were the case, then I might expect a good implementation of sparse
> linear algebra on a Kepler K20X to be 4-5 times faster than a good
> implementation on a dual socket Sandy Bridge node rather than a 2.5x
> difference.
Intel's marketing machinery was tricking you: the 350 GB/sec is the peak
bandwidth of the ring bus connecting the MIC cores to the GDDR5 memory.
However, the internal ring bus operates at only 220 GB/sec (see for
example the paper [1]). With some prefetching tricks and Intel
pragma/compiler magic one obtains about 160 GB/sec for the STREAM
benchmark, which is roughly 75% of that peak. The Intel OpenCL SDK adds
another loss on top, leaving only 95 GB/sec. This is why I got in contact
with Intel: to find out whether this is a weakness of the SDK or whether I
had missed something. It turned out to be the former...
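For reference, this is essentially all that such a streaming measurement
does (my own minimal version, not Intel's code; a single thread will of
course not saturate the bus, the actual benchmark runs this loop across
all cores):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
  const size_t N = 50u * 1000 * 1000;   /* ~1.2 GB over the three arrays */
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  struct timespec t0, t1;

  for (size_t i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < N; ++i) a[i] = b[i] + 3.0 * c[i];   /* STREAM triad */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  /* two loads + one store per entry -> 3*N*sizeof(double) bytes moved */
  double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
  printf("%.1f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);

  free(a); free(b); free(c);
  return 0;
}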
As you know, on dual-socket systems one only gets good bandwidth if the
data is placed in memory so as to respect NUMA. On such a dual-socket
system I recently managed to get 75 GB/sec with OpenCL, which is again 75%
of peak. Unfortunately OpenCL does not consider NUMA, so this number is
not very stable: you may get only half of it if all the data happens to
reside on the same memory link.
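The placement issue boils down to first-touch: pages end up on the socket
of the thread that writes them first. A sketch of the two initialization
variants (my illustration, assuming OpenMP with a static schedule):

#include <stddef.h>

/* All pages are first touched by one thread and thus land on one socket;
 * later parallel streaming then pulls half the data across the QPI link. */
void init_serial(double *x, size_t n)
{
  for (size_t i = 0; i < n; ++i) x[i] = 0.0;
}

/* First touch with the same static schedule the compute loops use later,
 * so each socket's memory controller serves the pages its cores work on. */
void init_first_touch(double *x, size_t n)
{
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < (long)n; ++i) x[i] = 0.0;
}

A runtime that fills its buffers from a single thread effectively ends up
in the first variant, which is where the factor of two comes from.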
On GPUs, including the K20X, one also obtains about 75% of peak: on a
Radeon 7970 I got 220 GB/sec out of a theoretical peak of 288 GB/sec,
other people have reported up to 250 GB/sec for a GTX Titan (288 GB/sec
theoretical peak), and I got 131 GB/sec out of a 159 GB/sec peak on a
rather dated GTX 285. Overall, the rule of thumb seems to be 75% of peak
if everything is done correctly and if one picks the right baseline (the
Xeon Phi is a beast in this regard). These numbers are all for sequential
reads, so cache effects or mechanisms such as paging do not introduce
additional spurious effects.
When it comes to actual optimizations for sparse linear algebra, CPUs and
GPUs call for slightly different techniques, because their cache lines and
memory controllers differ...
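As a concrete example (my sketch, plain CSR): the kernel below is what one
parallelizes over contiguous blocks of rows on a CPU, whereas on a GPU the
same loop is typically mapped so that a group of threads shares one row and
the loads of vals[] and cols[] coalesce; same arithmetic, different mapping
onto the memory system.

void spmv_csr(int nrows, const int *row_ptr, const int *cols,
              const double *vals, const double *x, double *y)
{
  #pragma omp parallel for schedule(static)  /* CPU: block of rows per core */
  for (int row = 0; row < nrows; ++row) {
    double sum = 0.0;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
      sum += vals[j] * x[cols[j]];
    y[row] = sum;
  }
}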
Best regards,
Karli
[1] http://arxiv.org/abs/1302.1078