[petsc-dev] A closer look at the Xeon Phi

Tim Tautges tautges at mcs.anl.gov
Tue Feb 12 18:06:24 CST 2013


I'm kind of surprised at the > 10k element crossover myself.  For the strong scaling cases, at high core counts, that's 
not terribly far from the number of DOFs per processor, is it?  I guess the CPUs will be slower than the Xeon in most 
cases (BGx), or fewer in number (Titan), but still.

- tim

On 02/12/2013 05:06 PM, Karl Rupp wrote:
> Hi guys,
>
> I finally got to play with the Intel Xeon Phi beta hardware here. It's supposed to have a slightly higher peak memory
> bandwidth (352 GB/sec) than the release hardware (320 GB/sec), and it gives a first impression of what can be done with
> it. One thing to keep in mind is that the ring bus connecting the memory controllers with the cores saturates at
> 220 GB/sec, so that is the practical upper bound on memory bandwidth for applications.
>
> A sparse matrix-vector multiplication paper [1] was published recently, but I'm more interested in what the Xeon Phi can
> do in terms of iterative solvers. Thus, I ran some benchmarks with ViennaCL on the Xeon Phi, both in native mode
> (everything runs on the Xeon Phi) and using OpenCL. I also tried the offload mode, i.e. one specifies via a #pragma that
> data should be moved to the MIC and the computation run there, but this #pragma handling turned out to be fairly
> unusable for anything where PCI-Express can be a bottleneck. For PETSc purposes this means that it is completely
> useless. Even though I haven't tried it yet, I think the same conclusion applies to OpenACC in general.
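> To make clear what I mean by the offload mode, here is a minimal sketch of an offloaded vector addition using Intel's
> #pragma offload (the function and array names are placeholders of mine, not code from ViennaCL or PETSc):
>
>   /* compile with the Intel compiler, e.g.: icc -openmp offload_sketch.c */
>   void vector_add_offload(double *x, const double *y, const double *z, int N)
>   {
>     /* Every call ships y and z over PCI-Express to the coprocessor, runs the
>      * loop there, and ships x back, so for a bandwidth-bound kernel like this
>      * the data movement dominates the actual computation. */
>     #pragma offload target(mic) in(y : length(N)) in(z : length(N)) out(x : length(N))
>     {
>       #pragma omp parallel for
>       for (int i = 0; i < N; ++i)
>         x[i] = y[i] + z[i];
>     }
>   }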
>
> All benchmarks are run on Linux OSes using double precision. Blue colors in the graphs denote Intel hardware, red colors
> AMD, and green colors NVIDIA. Although the test machine gets occasional updates of the Intel toolchain, I'm not entirely
> sure whether the latest version is installed.
>
> The first STREAM-like benchmark is the vector addition x = y + z in vector-timings.png. It is surprising that the
> OpenCL overhead at small vector sizes (less than 10^6 entries) is fairly large, so either the beta stage of OpenCL on
> MIC is indeed very beta, or the MIC is not designed for fast responses to requests from the host. The memory bandwidth
> obtained through OpenCL on MIC is in the range of 25 GB/sec, which is unexpectedly far from peak. With native execution
> on the MIC, one obtains around 75 GB/sec, in line with the results in [1]. Higher bandwidth requires vectorization and
> prefetching, apparently to be inserted manually by the programmer and thus not very convenient. The GPUs are about a
> factor of two faster and get close to their peak bandwidth without any explicit vectorization or prefetching.
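> For reference, the kernel measured here is essentially the following STREAM-like loop (a simplified OpenMP sketch, not
> the actual ViennaCL code; the reported bandwidth counts two loads and one store per entry):
>
>   #include <omp.h>
>
>   /* x = y + z, returning the effective memory bandwidth in bytes per second. */
>   double benchmark_vector_add(double *x, const double *y, const double *z,
>                               int N, int repetitions)
>   {
>     double start = omp_get_wtime();
>     for (int r = 0; r < repetitions; ++r) {
>       #pragma omp parallel for
>       for (int i = 0; i < N; ++i)
>         x[i] = y[i] + z[i];
>     }
>     double elapsed = (omp_get_wtime() - start) / repetitions;
>     return 3.0 * N * sizeof(double) / elapsed;   /* 2 loads + 1 store per entry */
>   }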
>
> The second benchmark is a sparse matrix-vector multiplication for a standard 2D finite-difference discretization of the
> Laplace operator on the unit square (sparse-timings.png). The performance of MIC is better than that of the CPU, but
> again the overhead at smaller problem sizes is considerable and larger than for NVIDIA GPUs (both OpenCL and CUDA). The
> poor performance at around 10^4 unknowns on the MIC is reproducible, but I don't have an explanation for it. Overall,
> the GPUs are faster than the MIC by a factor of around 2-3. Further tuning might reduce this gap, as some experiments
> with vectorization on MIC have shown mild improvements (~30%).
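> The kernel in question is the usual CSR matrix-vector product, roughly the sketch below (not the tuned ViennaCL code;
> for the 2D Laplacian every interior row simply carries the five-point stencil):
>
>   /* y = A * x with A stored in CSR format (row_ptr, col_idx, values). */
>   void csr_spmv(int nrows, const int *row_ptr, const int *col_idx,
>                 const double *values, const double *x, double *y)
>   {
>     #pragma omp parallel for
>     for (int row = 0; row < nrows; ++row) {
>       double sum = 0.0;
>       for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
>         sum += values[j] * x[col_idx[j]];
>       y[row] = sum;
>     }
>   }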
>
> Finally, 50 iterations of a full conjugate gradient solver are benchmarked in cg-timings.png. One could hope that
> native execution on the Xeon Phi, which avoids all the high-latency PCI-Express transfers the GPUs have to deal with,
> would give it an edge, but it does not. While the MIC beats the OpenMP-accelerated CPU implementation, it fails to
> reach the performance of the GPUs. Some of the overhead of the MIC at smaller problem sizes was found to be due to
> OpenMP and can be reduced so that it lands somewhere between NVIDIA's CUDA and OpenCL implementations. However, either
> the cores on the MIC are too weak for the serial portions, or the ring-bus and thread-startup synchronization costs are
> too high to keep up with the GPUs.
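> For completeness, the benchmarked solver has the structure of the textbook unpreconditioned CG loop below, fixed at 50
> iterations as in the plot (a sketch only; csr_spmv is the kernel sketched above, and the BLAS-1 helpers are spelled out
> so the snippet is self-contained):
>
>   static double dot(int n, const double *a, const double *b)
>   {
>     double s = 0.0;
>     #pragma omp parallel for reduction(+:s)
>     for (int i = 0; i < n; ++i) s += a[i] * b[i];
>     return s;
>   }
>
>   static void axpy(int n, double alpha, const double *a, double *y)
>   {
>     #pragma omp parallel for
>     for (int i = 0; i < n; ++i) y[i] += alpha * a[i];
>   }
>
>   /* 50 unpreconditioned CG iterations for A*x = b, A given in CSR format;
>    * r, p, Ap are work vectors of length n. */
>   void cg_50(int n, const int *row_ptr, const int *col_idx, const double *values,
>              const double *b, double *x, double *r, double *p, double *Ap)
>   {
>     for (int i = 0; i < n; ++i) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
>     double rr = dot(n, r, r);
>     for (int iter = 0; iter < 50; ++iter) {
>       csr_spmv(n, row_ptr, col_idx, values, p, Ap);      /* Ap = A * p        */
>       double alpha = rr / dot(n, p, Ap);
>       axpy(n,  alpha, p,  x);                            /* x += alpha * p    */
>       axpy(n, -alpha, Ap, r);                            /* r -= alpha * Ap   */
>       double rr_new = dot(n, r, r);
>       double beta = rr_new / rr;
>       #pragma omp parallel for
>       for (int i = 0; i < n; ++i)                        /* p = r + beta * p  */
>         p[i] = r[i] + beta * p[i];
>       rr = rr_new;
>     }
>   }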
>
> Overall, I'm not very impressed by the Xeon Phi. In contrast to GPUs, it seems to require even more effort to get good
> memory bandwidth. The OpenCL implementation on MIC could do a lot better, since it allows for more aggressive
> optimizations in principle, but this is not yet seen in practice. The offload pragma is useful, if at all, only for
> compute-intensive problems. The Xeon Phi might be a good fit for problems which map well to the 61 cores and can be
> pinned there, but I doubt that we want to run 61 MPI processes on a MIC within PETSc.
>
> Best regards,
> Karli
>
> [1] http://arxiv.org/abs/1302.1078

-- 
================================================================
"You will keep in perfect peace him whose mind is
   steadfast, because he trusts in you."               Isaiah 26:3

              Tim Tautges            Argonne National Laboratory
          (tautges at mcs.anl.gov)      (telecommuting from UW-Madison)
  phone (gvoice): (608) 354-1459      1500 Engineering Dr.
             fax: (608) 263-4499      Madison, WI 53706



