[petsc-dev] PETSc programming model for multi-core systems
bsmith at mcs.anl.gov
Thu Nov 11 19:19:51 CST 2010
On Nov 11, 2010, at 7:15 PM, Jed Brown wrote:
> On Fri, Nov 12, 2010 at 02:03, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > I mean it's easy to tell a thread to do something, but I was not aware that pthreads had nice support
> > for telling all threads to do something at the same time. On a multicore, you want vector instructions,
> Why do you want vector instructions on multicore? Since each core has a full instruction stream, what do you get by vectorization?
> I agree with Barry that massively parallel vector instructions are not the way to go for multi-core. Each core has a vector unit (16 bytes today, 32 bytes next year (AVX), 64 or 128 likely in a few years, depending on who you believe), which gets you the fine-grained parallelism.
> The main problem with OpenMP across multiple cores is that you don't get any control over data locality. In reality, especially on a NUMA system (every multi-socket system worth discussing now is NUMA), the location of the physical pages is of critical importance. You can easily see a performance hit of more than a factor of 3 on a quad-core system due to physical pages getting mis-mapped. This is already an issue with separate processes using affinity, but only if the OS is sloppy or the sysadmins are incompetent (I ran into this when a process was leaving some stale ramdisk lying around).
> But it is a way bigger deal for OpenMP, where you have no control over how the threads get mapped, yet it is absolutely critical that every time you touch some memory, you use a thread bound to the same NUMA node (= socket, usually) as the thread that faulted it (not the one that allocated it; that doesn't matter, even if it's statically allocated). With pthreads, you get a more flexible programming model, and you can organize your memory so that almost all accesses are local. In exchange for this more explicit control over memory locality (and a generally more flexible programming model), you get some added complexity, and it can't just be "annotated" into an existing code to "parallelize" it. Projecting on someone I've never met, this is likely the primary issue Bill Gropp has with OpenMP, and I think it is entirely valid.
This is my understanding of what Bill told me. Better explained than I could have.
> I don't have experience using CUDA to generate CPU code, and I don't know how it performs. An obvious difference relative to the GPU is that the CPU can perform much less structured computation without a performance hit. I don't know if CUDA/OpenCL for CPU allows you to take advantage of this.