[petsc-dev] A closer look at the Xeon Phi
Karl Rupp
rupp at mcs.anl.gov
Tue Feb 12 17:06:44 CST 2013
Hi guys,
I finally got to play with the Intel Xeon Phi Beta hardware here. It is
supposed to have a slightly higher peak memory bandwidth (352 GB/sec)
than the release hardware (320 GB/sec), and it gives a first impression
of what can be done with it. One thing to keep in mind is that the ring
bus connecting the memory controllers with the cores saturates at 220
GB/sec, so this is the effective upper bound for applications.
A sparse matrix-vector multiplication paper [1] was published recently,
but I'm more interested in what the Xeon Phi can do in terms of
iterative solvers. Thus, I ran some benchmarks with ViennaCL on the Xeon
Phi, both in native mode (everything runs on the Xeon Phi) and via
OpenCL. I also tried the offload mode, where one specifies via a
#pragma that data should be moved to the MIC and computations run
there, but this #pragma handling turned out to be fairly unusable
for anything where PCI-Express can be a bottleneck. For PETSc purposes
this means it is completely useless. Although I haven't tried it yet,
I suspect the same consequently applies to OpenACC in general.
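To make the PCI-Express issue concrete, here is a minimal sketch of the
kind of offload region I mean (function and array names are made up for
illustration, not taken from any benchmark code): every such region ships
the inputs across PCI-Express and copies the result back, so for
bandwidth-bound kernels the transfer time dominates everything else.

  // Sketch of an Intel offload region (illustrative only).
  // x and y are copied to the MIC, z is copied back - all over PCI-Express.
  void add_offload(const double *x, const double *y, double *z, int n)
  {
    #pragma offload target(mic) in(x, y : length(n)) out(z : length(n))
    {
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
        z[i] = x[i] + y[i];
    }
  }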
All benchmarks were run on Linux using double precision. In the graphs,
blue denotes Intel hardware, red AMD, and green NVIDIA. The test machine
receives occasional updates of the Intel toolchain, but I'm not entirely
sure whether the latest version is installed.
The first STREAM-like benchmark is the vector addition x = y + z in
vector-timings.png. Surprisingly, the OpenCL overhead at small vector
sizes (below 10^6 entries) is fairly large, so either the Beta stage
of OpenCL on MIC is indeed very beta, or the MIC is not designed for
fast responses to requests from the host. The memory bandwidth obtained
via OpenCL on the MIC is only about 25 GB/sec, which is unexpectedly far
from peak. With native execution on the MIC, one obtains around 75 GB/sec,
which is in line with the results in [1]. Higher performance requires
vectorization and prefetching, which apparently have to be injected by
the programmer and are thus not very convenient. The GPUs are about a
factor of two faster and get close to their peak bandwidth without any
explicit vectorization or prefetching.
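For reference, the kernel behind this benchmark is nothing more than the
following (a native-mode sketch under my own assumptions about problem
size and timing, not the exact ViennaCL benchmark code); the reported
bandwidth counts two loads and one store per entry:

  #include <omp.h>
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  // Native-mode sketch of the x = y + z benchmark (assumed setup).
  int main()
  {
    const std::size_t n = 10000000;
    std::vector<double> x(n), y(n, 1.0), z(n, 2.0);

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
      x[i] = y[i] + z[i];
    double t1 = omp_get_wtime();

    // Two loads (y, z) and one store (x) per entry:
    double gbytes = 3.0 * n * sizeof(double) / 1.0e9;
    std::printf("effective bandwidth: %.1f GB/sec\n", gbytes / (t1 - t0));
    return 0;
  }

In practice one would of course repeat the loop several times and report
the best run rather than time a single pass.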
The second benchmark is a sparse matrix-vector multiplication for a
standard 2D finite-difference discretization of the Laplace operator on
the unit square (sparse-timings.png). The MIC performs better than the
CPU, but again the overhead at smaller problem sizes is considerable and
larger than for NVIDIA GPUs (with both OpenCL and CUDA). The poor
performance at around 10^4 unknowns on the MIC is reproducible, but I
don't have an explanation for it. Overall, the GPUs are faster than the
MIC by a factor of around 2-3. Further tuning might reduce this gap, as
some experiments with vectorization on the MIC have shown mild
improvements (~30%).
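For context, the kernel is the standard CSR matrix-vector product; a
plain OpenMP sketch (my illustration, not the actual ViennaCL kernel)
looks as follows. For the 2D Laplace stencil each row has at most five
nonzeros, so the kernel is again entirely bandwidth-bound:

  #include <cstddef>

  // Plain CSR sparse matrix-vector product y = A*x (illustrative sketch).
  void csr_spmv(std::size_t nrows,
                const int *row_ptr, const int *col_idx, const double *values,
                const double *x, double *y)
  {
    #pragma omp parallel for
    for (std::size_t row = 0; row < nrows; ++row)
    {
      double sum = 0.0;
      for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        sum += values[j] * x[col_idx[j]];
      y[row] = sum;
    }
  }

The irregular accesses to x through col_idx are what the vectorization
experiments mentioned above try to improve, e.g. via the 512-bit gather
instructions on the MIC.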
Finally, 50 iterations of a full conjugate gradient solver are
benchmarked in cg-timings.png. One might hope that native execution on
the Xeon Phi, which avoids all the high-latency PCI-Express transfers
the GPUs have to go through, would give it an edge, but this is not the
case. While the MIC beats the OpenMP-accelerated CPU implementation, it
fails to reach the performance of the GPUs. Some of the MIC overhead at
smaller problem sizes was found to be due to OpenMP and can be reduced
to a level somewhere between NVIDIA's CUDA and OpenCL implementations.
However, either the MIC cores are too weak for the serial portions, or
the ring-bus latencies and the thread startup and synchronization costs
are too high to keep up with the GPUs.
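For completeness, the ViennaCL side of the CG benchmark essentially
boils down to a call like the following (a sketch under my assumptions
about the setup, not the exact benchmark code). Each CG iteration
consists of one sparse matrix-vector product, two dot products, and
three vector updates; the dot products are global reductions, which is
exactly where the thread startup and synchronization costs mentioned
above hurt:

  #include "viennacl/compressed_matrix.hpp"
  #include "viennacl/vector.hpp"
  #include "viennacl/linalg/cg.hpp"

  // Sketch of the CG solve (assumed setup, not the exact benchmark code).
  // A and rhs hold the 2D finite-difference Laplace system.
  viennacl::vector<double>
  solve_laplace(viennacl::compressed_matrix<double> const &A,
                viennacl::vector<double> const &rhs)
  {
    // cg_tag(relative tolerance, maximum number of iterations);
    // the benchmark runs a fixed 50 iterations.
    viennacl::linalg::cg_tag tag(1e-10, 50);
    return viennacl::linalg::solve(A, rhs, tag);
  }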
Overall, I'm not very impressed by the Xeon Phi. In contrast to GPUs,
it seems to require even more effort to get good memory bandwidth. The
OpenCL implementation on the MIC could do a lot better, since OpenCL in
principle allows for more aggressive optimizations, but this is not yet
seen in practice. The offload pragma is useful, if at all, only for
compute-intensive problems. The Xeon Phi might be a good fit for
problems which map well to the 61 cores and can be pinned there, but I
doubt that we want to run 61 MPI processes on a MIC within PETSc.
Best regards,
Karli
[1] http://arxiv.org/abs/1302.1078
Attachments:
- vector-timings.png: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20130212/7ec67798/attachment.png>
- sparse-timings.png: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20130212/7ec67798/attachment-0001.png>
- cg-timings.png: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20130212/7ec67798/attachment-0002.png>