[petsc-dev] Improving and stabilizing GPU support

Dave Nystrom dnystrom1 at comcast.net
Fri Jul 19 21:17:36 CDT 2013


Karl Rupp writes:
 > Hi Dave,
 > 
 > > That sounds very reasonable.  Regarding polynomial preconditioning, were you
 > > thinking of least squares polynomial preconditioning or something else?
 > 
 > I haven't thought about anything specific yet, just about the 
 > infrastructure for applying any p(A).

Okay.  I knew that Paul had implemented a least squares polynomial
preconditioner and published some results with it, so I wondered whether
you might be working along those lines.

 > >   > > Will there be any improvements for GPU preconditioners in ViennaCL 1.5.0?
 > >   > > When do you expect ViennaCL 1.5.0 to be available in PETSc?
 > >   >
 > >   > Jed gave me a good hint with respect to D-ILU0, which I'll also add to
 > >   > PETSc. As with other GPU-accelerations using ILU, it will require a
 > >   > proper matrix ordering to give good performance. I'm somewhat tempted to
 > >   > port the SA-AMG implementation in CUSP to OpenCL as well, but this
 > >   > certainly won't be in 1.5.0.
 > >
 > > Porting SA-AMG to OpenCL also sounds attractive.  I seem to recall that the
 > > ViennaCL documentation already mentions an algebraic multigrid
 > > preconditioner in alpha or beta status.
 > 
 > The current AMG implementations all require a CPU-based setup stage and 
 > thus limit the gain you can ultimately get. In cases where the setup 
 > cost is less pronounced (e.g. when lagging the preconditioner for 
 > nonlinear or time-dependent problems) this is fine, but for stationary 
 > linear problems with regular operators it is not very competitive.

Do you know anything about nvamg?  My understanding is that it is an Nvidia
project that has not been advertised much and has been used in a few
specialized applications such as Ansys Fluent.

 > > I'm still trying to get my mind around the memory bandwidth issue for sparse
 > > linear algebra.  Your report above of the Intel result adds to my confusion.
 > > From my understanding, the theoretical peak memory bandwidth for some systems
 > > of interest is as follows:
 > >
 > > Dual socket Sandy Bridge:  102 GB/s
 > > Nvidia Kepler K20X:        250 GB/s
 > > Intel Xeon Phi:            350 GB/s
 > >
 > > What I am trying to understand is what sort of memory bandwidth is achievable
 > > by a good implementation for the sparse linear algebra that PETSc does with
 > > an iterative solver like CG using Jacobi preconditioning.  The plots I
 > > sent links to yesterday seemed to show memory bandwidth for a dual socket
 > > Sandy Bridge to be well below the theoretical peak, perhaps less than 50 GB/s
 > > for 16 threads.  For Xeon Phi, you are saying that Intel could not get more
 > > than 95 GB/s.  But I saw a presentation last week where Nvidia was getting
 > > about 200 GB/s for a matrix transpose.  So it makes me wonder if the
 > > different systems are equally good at exploiting their theoretical peak
 > > memory bandwidths or whether one, like the Nvidia K20X, might be better.  If
 > > that were the case, then I might expect a good implementation of sparse
 > > linear algebra on a Kepler K20X to be 4-5 times faster than a good
 > > implementation on a dual socket Sandy Bridge node rather than a 2.5x
 > > difference.
 > 
 > Intel's marketing machinery was tricking you: the 350 GB/sec figure is 
 > the peak bandwidth of the ring bus connecting the MIC cores to GDDRAM. 
 > However, the internal ring bus operates at only 220 GB/sec (see, for 
 > example, the paper [1] below). With some prefetching tricks and Intel 
 > pragma/compiler magic one obtains about 160 GB/sec for the STREAM 
 > benchmark, which is roughly 75% of that peak. The Intel OpenCL SDK adds 
 > another loss on top, yielding only 95 GB/sec. That is why I contacted 
 > Intel, to find out whether this was a weakness of the SDK or whether I 
 > had missed something. It turned out to be the former...
 > 
 > As you know, on dual-socket systems one only gets good bandwidth if data 
 > is placed in memory in a NUMA-aware fashion. On such a dual-socket 
 > system I recently managed to get 75 GB/sec with OpenCL, which is again 
 > 75% of peak. Unfortunately OpenCL is not NUMA-aware, so this result is 
 > not very stable: you may get only half of it if all data happens to 
 > reside on the same memory link.
 > 
 > On GPUs, including the K20X, one also obtains about 75% of peak: on a 
 > Radeon 7970 I got 220 out of a theoretical peak of 288 GB/sec, other 
 > people have even reported up to 250 GB/sec on a GTX Titan (288 GB/sec 
 > theoretical peak), and I got 131 out of 159 GB/sec peak on a rather 
 > dated GTX 285.
 > 
 > Overall, the rule of thumb seems to be 75% of peak if everything is done 
 > correctly and if one uses the right baseline (the Xeon Phi is a beast 
 > in this regard). These numbers are for sequential reads, so cache 
 > effects and mechanisms such as paging do not introduce spurious 
 > effects. When it comes to actual optimizations for sparse linear 
 > algebra, CPUs and GPUs call for slightly different sets of 
 > optimizations because their cache lines and memory controllers differ...
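
[Editor's note: the sustained-bandwidth figures above can be sanity-checked
with a minimal STREAM-style triad.  The sketch below is illustrative code
of mine, not Karl's benchmark; it uses NumPy, whose internal temporaries
add extra traffic, so it will underestimate what tuned C with prefetching
and compiler pragmas sustains, but the byte counting is the same.]

```python
import time
import numpy as np

def triad_bandwidth(n=10_000_000, reps=5):
    """STREAM-style triad a = b + s*c, reporting GB/s.

    Counts only the 3 logical arrays of 8-byte doubles moved per
    repetition (2 reads + 1 write); NumPy's temporary for s*c adds
    unaccounted traffic, so this is a lower bound on hardware bandwidth.
    """
    b = np.random.rand(n)
    c = np.random.rand(n)
    a = np.empty_like(b)
    s = 3.0
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        a[:] = b + s * c
        best = min(best, time.perf_counter() - t0)
    return 3 * n * 8 / best / 1e9  # GB/s

if __name__ == "__main__":
    print(f"triad bandwidth: {triad_bandwidth():.1f} GB/s")
```

Comparing the reported figure against the machine's theoretical peak gives
the same "fraction of peak" Karl quotes for STREAM.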

Thanks for sharing your experiences.  I found the paper below earlier this
evening when I revisited the ViennaCL web site and looked at the benchmark
section.

 > Best regards,
 > 
 > Karli
 > 
 > [1] http://arxiv.org/abs/1302.1078
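
[Editor's note: a back-of-the-envelope byte count makes the CG-with-Jacobi
bandwidth question above concrete.  The sketch below is illustrative: the
5-nonzeros-per-row stencil, the exact vector-op sequence, and the 75%
efficiency factor are assumptions, and PETSc's actual CG differs in
detail; the Xeon Phi entry uses Karl's 220 GB/sec ring-bus figure rather
than the 350 GB/sec marketing number.]

```python
# Bytes moved per CG iteration with Jacobi preconditioning, assuming a
# CSR matrix from a 2D 5-point stencil and double precision.

def cg_iteration_bytes(n, nnz_per_row=5):
    # SpMV: 8-byte values + 4-byte column indices per nonzero,
    # 4-byte row pointers, plus reading x and writing y.
    spmv = n * nnz_per_row * (8 + 4) + (n + 1) * 4 + 2 * n * 8
    # Jacobi apply z = D^{-1} r: read D^{-1} and r, write z.
    jacobi = 3 * n * 8
    # Two dot products (2 reads each), three AXPY-type updates
    # (2 reads + 1 write each).
    vec_ops = 2 * (2 * n * 8) + 3 * (3 * n * 8)
    return spmv + jacobi + vec_ops

def time_per_iteration_ms(n, peak_gbs, efficiency=0.75):
    # Assume the solver sustains `efficiency` * peak bandwidth.
    return cg_iteration_bytes(n) / (peak_gbs * efficiency * 1e9) * 1e3

if __name__ == "__main__":
    n = 10_000_000  # unknowns
    for name, peak in [("dual Sandy Bridge", 102),
                       ("Kepler K20X", 250),
                       ("Xeon Phi (ring bus)", 220)]:
        print(f"{name:22s} {time_per_iteration_ms(n, peak):6.2f} ms/iter")
```

Because each system is assumed to sustain the same fraction of its peak,
the iteration-time ratios simply track the peak-bandwidth ratios: the
K20X comes out about 2.5x faster than the dual-socket node, matching the
ratio Dave mentions.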


