[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory
Karl Rupp
rupp at mcs.anl.gov
Sun Oct 7 09:34:53 CDT 2012
>
> Problem already solved http://dl.acm.org/citation.cfm?id=2145863 :-)
>
Oh, yes, these are the SnuCL guys, cf.
http://aces.snu.ac.kr/~jlee/Jaejin_Lees_Home/Pact11_tutorial.html
Their approach is to collect all devices across the cluster and make them
available as if they were all local on the same machine, cf.
http://www.acrc.a-star.edu.sg/astaratipreg_2012/Proceedings/Presentation%20-%20Jaejin%20Lee.pdf,
Slide 6.
Since they present results for the standard matrix-matrix multiplication
benchmark, memory transfer is not the bottleneck there. However, I doubt
that the same approach works well for BLAS level 1 and 2 operations.
By the way, I've attached benchmark results for a simple BLAS 1 kernel
in double precision on an NVIDIA GPU using OpenCL (CUDA is
qualitatively the same, but may have slightly lower latency). You
usually won't see this kind of benchmark result in publications on GPUs,
as they either consider BLAS 3 only or hide this unpleasant fact by
starting the diagrams at larger data sizes. For vector sizes below about
50k (!!), one can clearly see that one is in the kernel-launch-overhead
regime, even if data transfer to the GPU is ignored entirely. We are
talking about latencies of about 10us here, compared to roughly 1us when
running the same benchmark on the CPU (I still have to rerun the
benchmarks on the latest SDKs; maybe the situation has improved).
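For reference, the timing loop behind that plot boils down to something
like the sketch below. This is not the exact benchmark code: the kernel
body (a v1 = alpha*v2 + beta*v3 type operation), the work sizes, the
100-launch averaging and the wall-clock timer are placeholder choices of
mine, and all error checking is omitted.

#include <stdio.h>
#include <sys/time.h>
#include <CL/cl.h>

static const char *src =
  "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
  "__kernel void axpby(__global double *v1, __global const double *v2,\n"
  "                    __global const double *v3, double alpha, double beta, uint n) {\n"
  "  for (uint i = get_global_id(0); i < n; i += get_global_size(0))\n"
  "    v1[i] = alpha * v2[i] + beta * v3[i];\n"
  "}\n";

static double wtime(void)
{
  struct timeval t;
  gettimeofday(&t, NULL);
  return t.tv_sec + 1e-6 * t.tv_usec;
}

int main(void)
{
  cl_platform_id   platform;
  cl_device_id     device;
  cl_int           err;
  double           alpha = 1.1, beta = 2.2;
  size_t           global = 16384, local = 128;

  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
  cl_context       ctx    = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
  cl_command_queue queue  = clCreateCommandQueue(ctx, device, 0, &err);
  cl_program       prog   = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
  clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
  cl_kernel        kernel = clCreateKernel(prog, "axpby", &err);

  for (cl_uint n = 1000; n <= 10000000; n *= 10) {
    /* buffers stay uninitialized: we time enqueue + kernel execution only,
       no host<->device data transfers are involved at all */
    cl_mem v1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(double), NULL, &err);
    cl_mem v2 = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(double), NULL, &err);
    cl_mem v3 = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(double), NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &v1);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &v2);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &v3);
    clSetKernelArg(kernel, 3, sizeof(double), &alpha);
    clSetKernelArg(kernel, 4, sizeof(double), &beta);
    clSetKernelArg(kernel, 5, sizeof(cl_uint), &n);

    /* one warm-up launch (absorbs JIT/first-use costs), then average over 100 launches */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(queue);
    double t0 = wtime();
    for (int i = 0; i < 100; ++i)
      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(queue);
    printf("n = %8u: %g us per kernel\n", (unsigned)n, 1e6 * (wtime() - t0) / 100);

    clReleaseMemObject(v1); clReleaseMemObject(v2); clReleaseMemObject(v3);
  }
  clReleaseKernel(kernel); clReleaseProgram(prog);
  clReleaseCommandQueue(queue); clReleaseContext(ctx);
  return 0;
}

(Build with something like "gcc bench.c -lOpenCL".)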
Because of this launch overhead, I suggest refraining from
unconditionally submitting all types of jobs to an OpenCL scheduler and
instead preserving the option of starting a small job right away using
e.g. threadcomm (at the cost of an indirect function call, as Jed noted).
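To make the suggestion concrete, the kind of dispatch I have in mind
looks roughly as follows. This is only a sketch: the 50k cutoff, the
function names and the host/device stubs are illustrative placeholders,
not actual PETSc or threadcomm API.

/* Sketch only: cutoff, names and stubs are placeholders. */
#include <stddef.h>

#define SMALL_VEC_CUTOFF 50000 /* below this, launch latency dominates (cf. attached plot) */

typedef void (*AxpbyFn)(size_t, double, const double *, double, const double *, double *);

/* Host path: execute immediately; threadcomm would split this loop among threads. */
static void axpby_host(size_t n, double alpha, const double *v2,
                       double beta, const double *v3, double *v1)
{
  for (size_t i = 0; i < n; ++i) v1[i] = alpha * v2[i] + beta * v3[i];
}

/* Device path: stand-in for enqueueing the OpenCL/CUDA kernel on the scheduler. */
static void axpby_device(size_t n, double alpha, const double *v2,
                         double beta, const double *v3, double *v1)
{
  axpby_host(n, alpha, v2, beta, v3, v1); /* placeholder; real code would enqueue a kernel */
}

/* The price is one indirect function call, chosen by vector size. */
void axpby_dispatch(size_t n, double alpha, const double *v2,
                    double beta, const double *v3, double *v1)
{
  AxpbyFn f = (n < SMALL_VEC_CUTOFF) ? axpby_host : axpby_device;
  f(n, alpha, v2, beta, v3, v1);
}

The cutoff would of course have to be calibrated per device rather than
hard-coded, but the point is that small vectors never pay the ~10us
enqueue latency.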
Best regards,
Karli
-------------- next part --------------
Attachment: v1_peq_alpha_v2_p_beta_v3_double.png (benchmark plot, image/png, 7402 bytes)
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20121007/1844c25f/attachment.png>