[petsc-dev] A closer look at the Xeon Phi

Karl Rupp rupp at mcs.anl.gov
Tue Feb 12 17:06:44 CST 2013


Hi guys,

I finally got to play with the Intel Xeon Phi Beta hardware here. It is 
supposed to have a slightly higher peak memory bandwidth (352 GB/sec) 
than the release hardware (320 GB/sec), and it gives a first impression 
of what can be done with it. One thing to keep in mind is that the ring 
bus connecting the memory controllers with the cores saturates at 220 
GB/sec, so this is the effective peak memory bandwidth available to 
applications.

A sparse matrix-vector multiplication paper [1] got published recently, 
but I'm more interested in what the Xeon Phi can do in terms of 
iterative solvers. Thus, I ran some benchmarks with ViennaCL on the Xeon 
Phi in both native mode (everything runs on the Xeon Phi) and using 
OpenCL. I also tried the offload mode, i.e. one specifies via a 
#pragma that data should be moved to the MIC and the computation run 
there, but this #pragma handling turned out to be fairly unusable for 
anything where PCI-Express can be a bottleneck. For PETSc purposes this 
means that it is completely useless. Even though I haven't tried it 
yet, I expect the same holds for OpenACC in general.
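
For illustration, here is a minimal sketch of what such an offloaded 
kernel looks like with the Intel compiler's offload pragmas (the 
function and array names are made up, this is not ViennaCL or PETSc 
code). Every in/out clause implies a PCI-Express transfer, which is 
exactly what kills bandwidth-limited kernels:

// Hedged sketch of the offload pragma, not actual ViennaCL/PETSc code.
// x is copied to the coprocessor, y is copied back; for a kernel that
// only streams through memory once, these PCI-Express transfers
// dominate the runtime.
void scaled_copy_on_mic(const double *x, double *y, int n, double alpha)
{
  #pragma offload target(mic) in(x:length(n)) out(y:length(n))
  {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      y[i] = alpha * x[i];
  }
}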

All benchmarks are run on Linux OSes using double precision. Blue colors 
in the graphs denote Intel hardware, red colors AMD, and green colors 
NVIDIA. Although the test machine gets occasional updates of the Intel 
toolchain, I'm not entirely sure whether the latest version is installed.

The first STREAM-like benchmark is the vector addition x = y + z in 
vector-timings.png. Surprisingly, the OpenCL overhead at small vector 
sizes (less than 10^6 entries) is fairly large, so either the 
Beta-stage of OpenCL on MIC is indeed very beta, or the MIC is not 
designed for fast responses to requests from the host. The memory 
bandwidth obtained with OpenCL on the MIC is in the range of 25 GB/sec, 
which is unexpectedly far from peak. With native execution on the MIC, 
one obtains around 75 GB/sec, in line with the results in [1]. Higher 
performance requires vectorization and prefetching, which apparently 
have to be inserted manually by the programmer and are thus not very 
convenient. The GPUs are about a factor of two faster and get close to 
their peak memory bandwidth without any explicit vectorization or 
prefetching.
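
For reference, the kernel behind this benchmark is essentially the 
following (a plain OpenMP sketch of x = y + z, not the actual ViennaCL 
code); it simply streams three vectors through memory, so the achieved 
GB/sec are a direct measure of memory bandwidth:

#include <cstddef>

// STREAM-like vector addition x = y + z. Without manual vectorization
// and software prefetching this reaches roughly 75 GB/sec natively on
// the MIC, well below the 220 GB/sec ring-bus limit.
void vector_add(double *x, const double *y, const double *z, std::size_t n)
{
  #pragma omp parallel for
  for (long i = 0; i < static_cast<long>(n); ++i)
    x[i] = y[i] + z[i];
}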

The second benchmark is a sparse matrix-vector multiplication for a 
standard 2D finite-difference discretization of the Laplace operator on 
the unit square (sparse-timings.png). The performance of the MIC is 
better than that of the CPU, but again the overhead at smaller problem 
sizes is considerable and larger than for NVIDIA GPUs (both OpenCL and 
CUDA). The poor performance at problem sizes around 10^4 on the MIC is 
reproducible, but I don't have an explanation for it. Overall, the GPUs 
are faster than the MIC by a factor of around 2-3. Further tuning might 
reduce this gap, as some experiments with vectorization on the MIC have 
shown mild improvements (~30%).
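
To make the memory access pattern explicit, here is a sketch of a 
standard CSR matrix-vector product with OpenMP (the CSRMatrix type is 
made up for this example; ViennaCL's kernels differ in the details). 
For the 2D finite-difference Laplacian each row holds at most five 
nonzeros, so the kernel is entirely memory-bandwidth bound, and the 
indirect access x[col_idx[k]] is what makes vectorization on the MIC 
awkward:

#include <vector>

// Compressed sparse row storage (hypothetical layout for this sketch).
struct CSRMatrix {
  std::vector<int>    row_start;  // size rows+1, offsets into col_idx/values
  std::vector<int>    col_idx;    // column index of each nonzero
  std::vector<double> values;     // nonzero entries
};

// y = A * x
void csr_spmv(const CSRMatrix &A,
              const std::vector<double> &x, std::vector<double> &y)
{
  const long rows = static_cast<long>(A.row_start.size()) - 1;
  #pragma omp parallel for
  for (long row = 0; row < rows; ++row) {
    double sum = 0.0;
    for (int k = A.row_start[row]; k < A.row_start[row + 1]; ++k)
      sum += A.values[k] * x[A.col_idx[k]];  // gather: hard to vectorize
    y[row] = sum;
  }
}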

Finally, 50 iterations of a full conjugate gradient solver are 
benchmarked in cg-timings.png. One could hope that native execution on 
the Xeon Phi, which avoids all the high-latency PCI-Express transfers 
the GPUs have to go through, would give it an edge, but this is not the 
case. While the MIC beats the OpenMP-accelerated CPU implementation, it 
fails to reach the performance of the GPUs. Some of the MIC's overhead 
at smaller problem sizes was found to be due to OpenMP and can be 
reduced to somewhere between the overheads of NVIDIA's CUDA and OpenCL 
implementations. However, either the MIC cores are too weak for the 
serial portions, or the ring-bus latencies and thread startup and 
synchronization costs are too high to keep up with the GPUs.
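
To see where this overhead enters, here is a schematic of the 
benchmarked iteration (plain unpreconditioned CG for a fixed 50 
iterations, reusing the csr_spmv sketch from above; again, this is not 
the ViennaCL implementation). Each dot product is a global reduction 
and every loop is a separate parallel region, so at small problem sizes 
the fixed per-operation cost (OpenMP fork/join on the MIC and CPU, 
kernel launches on the GPUs) dominates:

#include <vector>

// Dot product with an OpenMP reduction: a synchronization point per call.
double dot(const std::vector<double> &a, const std::vector<double> &b)
{
  double s = 0.0;
  #pragma omp parallel for reduction(+:s)
  for (long i = 0; i < static_cast<long>(a.size()); ++i)
    s += a[i] * b[i];
  return s;
}

// 50 unpreconditioned CG iterations for A*x = b, starting from x = 0.
// Per iteration: one SpMV, two dot products, three vector updates.
void cg_50(const CSRMatrix &A, const std::vector<double> &b,
           std::vector<double> &x)
{
  std::vector<double> r = b, p = b, Ap(b.size(), 0.0);
  x.assign(b.size(), 0.0);
  double rr = dot(r, r);
  for (int iter = 0; iter < 50; ++iter) {
    csr_spmv(A, p, Ap);                      // bandwidth-bound
    const double alpha = rr / dot(p, Ap);    // reduction -> synchronization
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(x.size()); ++i) {
      x[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];
    }
    const double rr_new = dot(r, r);         // reduction -> synchronization
    const double beta = rr_new / rr;
    rr = rr_new;
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(p.size()); ++i)
      p[i] = r[i] + beta * p[i];
  }
}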

Overall, I'm not very impressed by the Xeon Phi. In contrast to GPUs, 
it seems to require even more effort to get good memory bandwidth. The 
OpenCL implementation on MIC could do a lot better, since it allows for 
more aggressive optimizations in principle, but this is not yet seen in 
practice. The offload pragma is useful, if at all, only for 
compute-intensive problems. The MIC might be a good fit for problems 
which map well to the 61 cores and can be pinned there, but I doubt 
that we want to run 61 MPI processes on a MIC within PETSc.

Best regards,
Karli

[1] http://arxiv.org/abs/1302.1078
Attachments:
- vector-timings.png (image/png): <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20130212/7ec67798/attachment.png>
- sparse-timings.png (image/png): <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20130212/7ec67798/attachment-0001.png>
- cg-timings.png (image/png): <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20130212/7ec67798/attachment-0002.png>

