<div dir="ltr">On Tue, Feb 12, 2013 at 6:06 PM, Karl Rupp <span dir="ltr"><<a href="mailto:rupp@mcs.anl.gov" target="_blank">rupp@mcs.anl.gov</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi guys,<br>

<br>

I finally got to play with the Intel Xeon Phi Beta hardware here. It's supposed to have a slightly higher peak memory bandwidth (352 GB/sec) than the release hardware (320 GB/sec), and it gives a first impression of what can be done with it. One thing to keep in mind is that the ring bus connecting the memory controllers with the cores saturates at 220 GB/sec, so this is the effective upper bound on the memory bandwidth that applications can see.<br>


<br>

A sparse matrix-vector multiplication paper [1] got published recently, but I'm more interested in what the Xeon Phi can do in terms of iterative solvers. Thus, I ran some benchmarks with ViennaCL on the Xeon Phi in both native mode (everything runs on the Xeon Phi) and using OpenCL. I also tried the offload mode, where a #pragma specifies that data is moved to the MIC and the computation runs there, but this #pragma handling turned out to be fairly unusable for anything where PCI-Express can be a bottleneck. For PETSc purposes this means that it is completely useless. Even though I haven't tried it yet, I think the same applies to OpenACC in general.<br>


<br>

All benchmarks are run on Linux OSes using double precision. Blue colors in the graphs denote Intel hardware, red colors AMD, and green colors NVIDIA. Although the test machine gets occasional updates of the Intel toolchain, I'm not entirely sure whether the latest version is installed.<br>


<br>

The first STREAM-like benchmark is the vector addition x = y + z in vector-timings.png. Surprisingly, the OpenCL overhead at small vector sizes (less than 10^6) is fairly large, so either the Beta-stage of OpenCL on MIC is indeed very beta, or the MIC is not designed for fast responses to requests from the host. The effective memory bandwidth with OpenCL on MIC reaches about 25 GB/sec, which is unexpectedly far from peak. With native execution on the MIC, one obtains around 75 GB/sec. This is in line with the results in [1]. Higher performance requires vectorization and prefetching, which apparently have to be inserted by hand by the programmer and are thus not very convenient. The GPUs are about a factor of two faster and get close to their peak performance without any explicit vectorization or prefetching.<br>
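<br>
A minimal sketch of how such an effective bandwidth figure can be derived from a timing, assuming the usual STREAM-style count of three arrays of doubles per vector addition (two reads, one write). The function names are hypothetical and this is not the ViennaCL benchmark code; the pure-Python timing loop only illustrates the bookkeeping and says nothing about hardware bandwidth.<br>
<pre>
```python
import time

BYTES_PER_DOUBLE = 8


def vector_add_bytes(n):
    # x = y + z reads y and z and writes x: three arrays of n doubles.
    # Assumption: write-allocate traffic is not counted (STREAM convention).
    return 3 * n * BYTES_PER_DOUBLE


def effective_bandwidth_gb_per_s(n, seconds):
    # GB is taken as 10**9 bytes here.
    return vector_add_bytes(n) / seconds / 1e9


def time_vector_add(n, repeats=3):
    # Best-of-repeats timing of a plain Python vector addition.
    y = [1.0] * n
    z = [2.0] * n
    best = float("inf")
    x = []
    for _ in range(repeats):
        start = time.perf_counter()
        x = [a + b for a, b in zip(y, z)]
        best = min(best, time.perf_counter() - start)
    return x, best


if __name__ == "__main__":
    n = 10**6
    _, seconds = time_vector_add(n)
    print(n, seconds, effective_bandwidth_gb_per_s(n, seconds))
```
</pre>
With this counting, a hypothetical time of 3.2 ms for 10^7 entries would correspond to 75 GB/sec.<br>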


<br>

The second benchmark is a sparse matrix-vector multiplication for a standard 2D finite-difference discretization of the Laplace operator on the unit square (sparse-timings.png). The performance of MIC is better than that of the CPU, but again the overhead at smaller problem sizes is considerable and larger than for NVIDIA GPUs (both OpenCL and CUDA). The poor performance at around 10^4 on the MIC is reproducible, but I don't have an explanation for it. Overall, the GPUs are faster than MIC by a factor of around 2-3. Further tuning might reduce this gap, as some experiments with vectorization on MIC have shown mild improvements (~30%).<br>
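<br>
For reference, a small sketch of the kind of operator meant here: the 5-point finite-difference stencil on an n x n grid of interior points, stored in CSR format, together with a plain CSR matrix-vector product. This assumes lexicographic ordering and leaves out the 1/h^2 scaling; the function names are hypothetical and the code is not taken from ViennaCL.<br>
<pre>
```python
def laplace_2d_csr(n):
    # Assemble the unscaled 5-point stencil (4 on the diagonal, -1 for each
    # grid neighbor) for n x n interior points in CSR format.
    row_ptr = [0]
    col_idx = []
    values = []
    for i in range(n):
        for j in range(n):
            row = i * n + j
            if i != 0:
                col_idx.append(row - n)
                values.append(-1.0)
            if j != 0:
                col_idx.append(row - 1)
                values.append(-1.0)
            col_idx.append(row)
            values.append(4.0)
            if j != n - 1:
                col_idx.append(row + 1)
                values.append(-1.0)
            if i != n - 1:
                col_idx.append(row + n)
                values.append(-1.0)
            row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


def csr_matvec(row_ptr, col_idx, values, x):
    # y = A * x, one row at a time; the indirect access x[col_idx[k]] is
    # what makes this kernel memory-bound on real hardware.
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


if __name__ == "__main__":
    row_ptr, col_idx, values = laplace_2d_csr(3)
    print(csr_matvec(row_ptr, col_idx, values, [1.0] * 9))
```
</pre>
Each row holds at most five nonzeros, so the arithmetic per byte loaded is low and the kernel is limited by memory bandwidth rather than by floating point throughput.<br>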

</blockquote><div><br></div><div style>Karl, I am assuming that the places in the article where the Phi beats the K20 are for denser matrices</div><div style>where they have explicitly vectorized?</div><div style><br></div>

<div style>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Finally, 50 iterations of a full conjugate gradient solver are benchmarked in cg-timings.png. One could hope that native execution on the Xeon Phi pays off by avoiding the high-latency PCI-Express transfers that the GPUs incur, but this is not the case. While MIC beats the OpenMP-accelerated CPU implementation, it fails to reach the performance of GPUs. Some of the overhead of MIC at smaller problem sizes was found to be due to OpenMP and can be reduced to somewhere between NVIDIA's CUDA and OpenCL implementations. However, either the cores on MIC are too weak for the serial portions, or the ring-bus and thread startup synchronization costs are too high to keep up with GPUs.<br>
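<br>
To make the structure of this benchmark concrete, here is a minimal sketch of an unpreconditioned conjugate gradient loop with a fixed iteration count, assuming a matrix-free 5-point Laplace operator as the system matrix. All names are hypothetical and this is not the ViennaCL solver. Per iteration there is one operator application, two dot products, and three vector updates, and the dot products are the synchronization points.<br>
<pre>
```python
def apply_laplace_2d(n, x):
    # Matrix-free, unscaled 5-point stencil on an n x n interior grid.
    y = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            k = i * n + j
            acc = 4.0 * x[k]
            if i != 0:
                acc -= x[k - n]
            if i != n - 1:
                acc -= x[k + n]
            if j != 0:
                acc -= x[k - 1]
            if j != n - 1:
                acc -= x[k + 1]
            y[k] = acc
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_fixed_iterations(apply_a, b, iterations=50):
    # Unpreconditioned CG starting from x = 0. Stops early only if the
    # recursively updated residual or the curvature term is exactly zero.
    x = [0.0] * len(b)
    r = list(b)
    p = list(b)
    rr = dot(r, r)
    for _ in range(iterations):
        if rr == 0.0:
            break
        ap = apply_a(p)
        pap = dot(p, ap)
        if pap == 0.0:
            break
        alpha = rr / pap
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, rr ** 0.5


if __name__ == "__main__":
    n = 8
    b = [1.0] * (n * n)
    x, res = cg_fixed_iterations(lambda v: apply_laplace_2d(n, v), b)
    print(res)
```
</pre>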


<br>

Overall, I'm not very impressed by the Xeon Phi. Compared to GPUs, it seems to require even more effort to get good memory bandwidth. The OpenCL implementation on MIC could do a lot better because it allows for more aggressive optimizations in principle, but this is not yet seen in practice. The offload pragma is useful, if at all, only for compute-intensive problems. It might be a good fit for problems which map well to the 61 cores and can be pinned there, but I doubt that we want to run 61 MPI processes on a MIC within PETSc.<br>


<br>

Best regards,<br>

Karli<br>

<br>

[1] <a href="http://arxiv.org/abs/1302.1078" target="_blank">http://arxiv.org/abs/1302.1078</a><br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener

</div></div>