[petsc-dev] Improving and stabilizing GPU support
Karl Rupp
rupp at mcs.anl.gov
Fri Jul 19 20:28:34 CDT 2013
Hi Dave,
> That sounds very reasonable. Regarding polynomial preconditioning, were you
> thinking of least squares polynomial preconditioning or something else?
I haven't thought about anything specific yet, just about the
infrastructure for applying any p(A).
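Roughly what I have in mind (just a sketch from my side, nothing is
implemented yet, and the coefficient handling below is only a placeholder)
is a PCSHELL-type apply that evaluates y = p(A) x via Horner's scheme, so
that any polynomial can be plugged in:

#include <petscksp.h>

typedef struct {
  Mat         A;
  PetscScalar coeff[4];   /* placeholder: coefficients c_0..c_3 of p */
  PetscInt    degree;     /* here 3 */
  Vec         work;
} PolyCtx;

static PetscErrorCode PCApply_Poly(PC pc, Vec x, Vec y)
{
  PolyCtx       *ctx;
  PetscInt       k;
  PetscErrorCode ierr;

  ierr = PCShellGetContext(pc, (void**)&ctx); CHKERRQ(ierr);
  /* Horner: y = c_n * x, then y = A*y + c_k * x for k = n-1, ..., 0 */
  ierr = VecCopy(x, y); CHKERRQ(ierr);
  ierr = VecScale(y, ctx->coeff[ctx->degree]); CHKERRQ(ierr);
  for (k = ctx->degree - 1; k >= 0; --k) {
    ierr = MatMult(ctx->A, y, ctx->work); CHKERRQ(ierr);            /* work = A*y       */
    ierr = VecWAXPY(y, ctx->coeff[k], x, ctx->work); CHKERRQ(ierr); /* y = c_k*x + work */
  }
  return 0;
}

One would hook this up via PCSetType(pc, PCSHELL), PCShellSetContext() and
PCShellSetApply(); a least-squares polynomial preconditioner would then only
differ in how the coefficients are chosen.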
> > > Will there be any improvements for GPU preconditioners in ViennaCL 1.5.0?
> > > When do you expect ViennaCL 1.5.0 to be available in PETSc?
> >
> > Jed gave me a good hint with respect to D-ILU0, which I'll also add to
> > PETSc. As with other GPU-accelerations using ILU, it will require a
> > proper matrix ordering to give good performance. I'm somewhat tempted to
> > port the SA-AMG implementation in CUSP to OpenCL as well, but this
> > certainly won't be in 1.5.0.
>
> Porting SA-AMG to OpenCL also sounds attractive. I was thinking that the
> ViennaCL documentation already mentioned an algebraic preconditioner that was
> in alpha or beta status.
The current AMG implementations all require a CPU-based setup stage and
thus limit the gain you can eventually get. In cases where the setup cost
is less pronounced (e.g. when lagging the preconditioner for nonlinear or
time-dependent problems) this is fine, but for stationary linear problems
with regular operators it is not very competitive.
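To illustrate what I mean by amortizing the setup (a sketch only, with
hypothetical names and using the current KSPSetOperators() calling
sequence, which has changed between PETSc versions): if the operator stays
fixed, the expensive CPU-side setup happens during the first KSPSolve(),
and all later solves with new right-hand sides reuse it.

#include <petscksp.h>

PetscErrorCode solve_time_steps(Mat A, Vec *rhs, Vec x, PetscInt nsteps)
{
  KSP            ksp;
  PetscInt       step;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);  /* e.g. select an AMG PC here */

  for (step = 0; step < nsteps; ++step) {
    /* The PC setup runs only in the first iteration; as long as the
     * operator is not re-set, the subsequent solves reuse it. */
    ierr = KSPSolve(ksp, rhs[step], x); CHKERRQ(ierr);
  }
  ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
  return 0;
}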
> I'm still trying to get my mind around the memory bandwidth issue for sparse
> linear algebra. Your report above of the Intel result adds to my confusion.
> From my understanding, the theoretical peak memory bandwidth for some systems
> of interest is as follows:
>
> Dual socket Sandy Bridge: 102 GB/s
> Nvidia Kepler K20X: 250 GB/s
> Intel Xeon Phi: 350 GB/s
>
> What I am trying to understand is what sort of memory bandwidth is achievable
> by a good implementation for the sparse linear algebra that PETSc does with
> an iterative solver like CG using Jacobi preconditioning. The plots which I
> sent links to yesterday seemed to show memory bandwidth for a dual socket
> Sandy Bridge to be well below the theoretical peak, perhaps less than 50 GB/s
> for 16 threads. For Xeon Phi, you are saying that Intel could not get more
> than 95 GB/s. But I saw a presentation last week where Nvidia was getting
> about 200 GB/s for a matrix transpose. So it makes me wonder if the
> different systems are equally good at exploiting their theoretical peak
> memory bandwidths or whether one, like the Nvidia K20X, might be better. If
> that were the case, then I might expect a good implementation of sparse
> linear algebra on a Kepler K20X to be 4-5 times faster than a good
> implementation on a dual socket Sandy Bridge node rather than a 2.5x
> difference.
Intel's marketing machinery was tricking you: the 350 GB/sec is the peak
bandwidth of the ring bus connecting the MIC cores to the GDDR5 memory.
However, the internal ring bus operates at only 220 GB/sec (see for
example the paper [1]). With some prefetching tricks and Intel
pragma/compiler magic one obtains about 160 GB/sec for the STREAM
benchmark, which is roughly 75% of that peak. The Intel OpenCL SDK adds
another loss on top, leaving only 95 GB/sec. This is why I got in contact
with Intel: to find out whether this is a weakness of the SDK or whether I
had missed something. It turned out to be the former...
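For reference, this is essentially all that such a streaming measurement
does (my own minimal version, not Intel's code; a single thread will of
course not saturate the bus, the actual benchmark runs this loop across
all cores):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
  const size_t N = 50u * 1000 * 1000;   /* ~1.2 GB over the three arrays */
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  struct timespec t0, t1;

  for (size_t i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < N; ++i) a[i] = b[i] + 3.0 * c[i];   /* STREAM triad */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  /* two loads + one store per entry -> 3*N*sizeof(double) bytes moved */
  double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
  printf("%.1f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);

  free(a); free(b); free(c);
  return 0;
}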
As you know, on dual-socket systems one only gets good bandwidth if the
data is placed in memory so as to respect NUMA. On such a dual-socket
system I recently managed to get 75 GB/sec with OpenCL, which is again 75%
of peak. Unfortunately OpenCL does not consider NUMA, so this number is
not very stable: you may get only half of it if all the data happens to
reside on the same memory link.
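The placement issue boils down to first-touch: pages end up on the socket
of the thread that writes them first. A sketch of the two initialization
variants (my illustration, assuming OpenMP with a static schedule):

#include <stddef.h>

/* All pages are first touched by one thread and thus land on one socket;
 * later parallel streaming then pulls half the data across the QPI link. */
void init_serial(double *x, size_t n)
{
  for (size_t i = 0; i < n; ++i) x[i] = 0.0;
}

/* First touch with the same static schedule the compute loops use later,
 * so each socket's memory controller serves the pages its cores work on. */
void init_first_touch(double *x, size_t n)
{
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < (long)n; ++i) x[i] = 0.0;
}

A runtime that fills its buffers from a single thread effectively ends up
in the first variant, which is where the factor of two comes from.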
On GPUs, including the K20X, one also obtains about 75% of peak: on a
Radeon 7970 I got 220 GB/sec out of a theoretical peak of 288 GB/sec,
other people have reported up to 250 GB/sec for a GTX Titan (288 GB/sec
theoretical peak), and I got 131 GB/sec out of a 159 GB/sec peak on a
rather dated GTX 285. Overall, the rule of thumb seems to be 75% of peak
if everything is done correctly and if one picks the right baseline (the
Xeon Phi is a beast in this regard). These numbers are all for sequential
reads, so cache effects or mechanisms such as paging do not introduce
additional spurious effects.
When it comes to actual optimizations for sparse linear algebra, CPUs and
GPUs call for slightly different techniques, because their cache lines and
memory controllers differ...
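As a concrete example (my sketch, plain CSR): the kernel below is what one
parallelizes over contiguous blocks of rows on a CPU, whereas on a GPU the
same loop is typically mapped so that a group of threads shares one row and
the loads of vals[] and cols[] coalesce; same arithmetic, different mapping
onto the memory system.

void spmv_csr(int nrows, const int *row_ptr, const int *cols,
              const double *vals, const double *x, double *y)
{
  #pragma omp parallel for schedule(static)  /* CPU: block of rows per core */
  for (int row = 0; row < nrows; ++row) {
    double sum = 0.0;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
      sum += vals[j] * x[cols[j]];
    y[row] = sum;
  }
}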
Best regards,
Karli
[1] http://arxiv.org/abs/1302.1078