[petsc-dev] MIC available for testing

Fri Dec 14 22:09:06 CST 2012

As a point of comparison, I've been running a PETSc CG algorithm on an
NVIDIA K20. The simulation has 1.4e7 elements.

The PETSc AXPY takes 0.001 seconds in single precision, which is about 26 GFlops.

In another simulation, using a double-precision complex BiCG algorithm with 1e6
unknowns, the PETSc MatMult on the K20 runs at 55 GFlops!

-Paul
> Hi guys,
>
> today I got a gentle introduction to our test machine equipped
> with two Intel MICs. They are still in beta, but I was able to run some
> simple kernels in native mode. As an example, existing OpenMP code for
> double-precision vector addition of 3e6 elements, run without any
> modification, gave the following timings:
>
> -- Native mode, i.e. all code executed on MIC --
> Single core time: 0.642 sec
> All-core time:    0.011 sec
>
> For offloaded execution (CPU <-> MIC, just like with GPUs), additional
> pragmas are required; I haven't tried that yet.
>
> For comparison, the same code on the CPU (Sandy Bridge, 8x2 cores, 2.6
> GHz) takes 0.060 sec without OpenMP and 0.030 sec with OpenMP. The
> conclusion is that one *really* needs to keep all cores on the MIC
> busy in order to get the full memory bandwidth. A plain 'just
> recompile for MIC and you get good performance' approach won't work for
> most applications in practice, simply because the serial performance is
> so limited: a single MIC core is about ten times slower than a single
> Sandy Bridge core here (0.642 sec vs. 0.060 sec).
>
> @Shri: It would be interesting to try pthreads, particularly to see
> how they compare with OpenMP. I'll be out of the lab until the
> beginning of January, but I can help you get an account and
> get started.
>
> Btw: I just got a call about Altera hardware; we might have a
> chance to get our hands on their OpenCL-enabled devices.
>
> Best regards,
> Karli
>