[petsc-dev] A closer look at the Xeon Phi

Matthew Knepley knepley at gmail.com
Tue Feb 12 17:17:32 CST 2013


On Tue, Feb 12, 2013 at 6:06 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Hi guys,
>
> I finally got to play with the Intel Xeon Phi Beta hardware here. It is
> supposed to have a slightly higher peak memory bandwidth (352 GB/sec) than
> the release hardware (320 GB/sec), and it gives a first impression of what
> can be done with it. One thing to keep in mind is that the ring bus
> connecting the memory controllers with the cores saturates at 220 GB/sec,
> so that is the effective peak memory bandwidth available to applications.
>
> A sparse matrix-vector multiplication paper [1] got published recently,
> but I'm more interested in what the Xeon Phi can do in terms of iterative
> solvers. Thus, I ran some benchmarks with ViennaCL on the Xeon Phi in both
> native mode (everything runs on the Xeon Phi) and using OpenCL. I also
> tried the offload mode, where one specifies via a #pragma that data should
> be moved to the MIC and that the computation runs there, but this
> pragma-based handling turned out to be fairly unusable for anything where
> PCI-Express can become a bottleneck. For PETSc purposes this means that it
> is completely useless. Even though I haven't tried it yet, I think the same
> consequently applies to OpenACC in general.
>
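For context, the offload model being described looks roughly like the sketch
below (Intel's #pragma offload extension, with made-up array names rather
than the actual benchmark code). Every such region copies the in()/out()
arrays across PCI-Express unless persistent buffers are managed explicitly,
which is why it is unusable whenever PCI-Express is the bottleneck:

    /* Sketch: x = y + z evaluated on the MIC via the offload pragma.   */
    /* The in/out clauses trigger a host<->MIC transfer per invocation. */
    void vec_add_offload(double *x, const double *y, const double *z, int n)
    {
    #pragma offload target(mic:0) in(y, z : length(n)) out(x : length(n))
      {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
          x[i] = y[i] + z[i];
      }
    }
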
> All benchmarks are run on Linux OSes using double precision. Blue colors
> in the graphs denote Intel hardware, red colors AMD, and green colors
> NVIDIA. Although the test machine gets occasional updates of the Intel
> toolchain, I'm not entirely sure whether the latest version is installed.
>
> The first STREAM-like benchmark is the vector addition x = y + z in
> vector-timings.png. It is surprising that the OpenCL overhead at small
> vector sizes (less than 10^6 entries) is fairly large, so either the
> Beta-stage of OpenCL on MIC is indeed very beta, or the MIC is not designed
> for fast responses to requests from the host. The memory bandwidth obtained
> via OpenCL on MIC is in the range of 25 GB/sec, which is unexpectedly far
> from peak. With native execution on the MIC one obtains around 75 GB/sec,
> which is in line with the results in [1]. Higher performance requires
> vectorization and prefetching, which apparently have to be injected by the
> programmer and are thus not very convenient. The GPUs are about a factor of
> two faster and get close to their peak performance without any explicit
> vectorization or prefetching.
>
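For reference, the kernel in question for native execution is just the
following (a generic OpenMP sketch, not the ViennaCL code that produced the
numbers above); per the text, getting beyond the ~75 GB/sec level apparently
requires icc-specific alignment, SIMD and prefetch hints added by hand:

    /* STREAM-style vector addition x = y + z, run natively on the MIC.  */
    /* To go beyond plain OpenMP one would add icc-specific hints such   */
    /* as "#pragma vector aligned" or "#pragma prefetch" above the loop. */
    void vec_add(double *restrict x, const double *restrict y,
                 const double *restrict z, long n)
    {
    #pragma omp parallel for
      for (long i = 0; i < n; ++i)
        x[i] = y[i] + z[i];
    }
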
> The second benchmark is a sparse matrix-vector multiplication for a
> standard 2D finite-difference discretization of the Laplace operator on the
> unit square (sparse-timings.png). The performance of the MIC is better than
> that of the CPU, but again the overhead at smaller problem sizes is
> considerable and larger than for NVIDIA GPUs (with both OpenCL and CUDA).
> The poor performance at around 10^4 unknowns on the MIC is reproducible,
> but I don't have an explanation for it. Overall, the GPUs are faster than
> the MIC by a factor of around 2-3. Further tuning might reduce this gap, as
> some experiments with vectorization on MIC have shown mild improvements
> (~30%).
>
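The kernel being timed here is a sparse matrix-vector product; in the common
CSR storage format it looks like the generic sketch below (not the ViennaCL
or MIC-specific implementation). For the 2D Laplacian each row has at most
five nonzeros, so the kernel is bandwidth- and latency-bound:

    /* y = A*x with A stored in CSR format (row_ptr/col_idx/vals). */
    void csr_spmv(int nrows, const int *row_ptr, const int *col_idx,
                  const double *vals, const double *x, double *y)
    {
    #pragma omp parallel for
      for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i+1]; ++j)
          sum += vals[j] * x[col_idx[j]];
        y[i] = sum;
      }
    }
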

Karl, am I right in assuming that the places in the article where the Phi
beats the K20 are for denser matrices where they have explicitly vectorized?

   Matt


> Finally, 50 iterations of a full conjugate gradient solver are benchmarked
> in cg-timings.png. One could hope that native execution on the Xeon Phi,
> which eliminates all the high-latency PCI-Express transfers required in the
> GPU case, would give it an advantage, but this is not what happens. While
> the MIC beats the OpenMP-accelerated CPU implementation, it fails to reach
> the performance of the GPUs. Some of the MIC overhead at smaller problem
> sizes was found to be due to OpenMP and can be reduced to a level somewhere
> between NVIDIA's CUDA and OpenCL implementations. However, either the cores
> on the MIC are too weak for the serial portions, or the ring-bus and
> thread-startup synchronization costs are too high to keep up with the GPUs.
>
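To make the synchronization point concrete: each CG iteration contains one
sparse matrix-vector product and two dot products, and every dot product is
a global reduction that the OpenMP threads (or the device) must complete
before the next step can proceed. A plain C sketch of the 50-iteration loop
(generic code, not ViennaCL's implementation):

    #include <stdlib.h>

    /* Dot product: a global reduction, hence a synchronization point. */
    static double dot(const double *a, const double *b, int n)
    {
      double s = 0.0;
    #pragma omp parallel for reduction(+:s)
      for (int i = 0; i < n; ++i) s += a[i] * b[i];
      return s;
    }

    /* 50 unpreconditioned CG iterations for a CSR matrix A and rhs b. */
    void cg_50(int n, const int *row_ptr, const int *col_idx,
               const double *vals, const double *b, double *x)
    {
      double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p),
             *Ap = malloc(n * sizeof *Ap);
      for (int i = 0; i < n; ++i) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
      double rr = dot(r, r, n);
      for (int it = 0; it < 50; ++it) {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i) {          /* Ap = A*p (CSR SpMV)   */
          double s = 0.0;
          for (int j = row_ptr[i]; j < row_ptr[i+1]; ++j)
            s += vals[j] * p[col_idx[j]];
          Ap[i] = s;
        }
        double alpha = rr / dot(p, Ap, n);     /* synchronization point */
    #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
          x[i] += alpha * p[i];
          r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r, n);          /* synchronization point */
        double beta = rr_new / rr;
    #pragma omp parallel for
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
      }
      free(r); free(p); free(Ap);
    }

At small problem sizes these per-iteration reductions and the thread
startup/fork-join overhead dominate, which matches the observation above.
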
> Overall, I'm not very impressed by the Xeon Phi. Compared to GPUs it seems
> to require even more effort to get good memory bandwidth. The OpenCL
> implementation on MIC could do a lot better, because in principle it allows
> for more aggressive optimizations, but this is not yet seen in practice.
> The offload pragma is useful, if at all, only for compute-intensive
> problems. The Xeon Phi might be a good fit for problems which map well to
> the 61 cores and can be pinned there, but I doubt that we want to run 61
> MPI processes on a MIC within PETSc.
>
> Best regards,
> Karli
>
> [1] http://arxiv.org/abs/1302.1078
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener