[petsc-dev] PETSc programming model for multi-core systems

Jed Brown jed at 59A2.org
Thu Nov 11 19:15:58 CST 2010


On Fri, Nov 12, 2010 at 02:03, Barry Smith <bsmith at mcs.anl.gov> wrote:

> > I mean it's easy to tell a thread to do something, but I was not aware
> > that pthreads had nice support for telling all threads to do something
> > at the same time. On a multicore, you want vector instructions,
>
>    Why do you want vector instructions on multicore? Since each core has
> a full instruction stream, what do you get by vectorization?


I agree with Barry that massively parallel vector instructions are not the
way to go for multi-core.  Each core already has a vector unit (16 bytes
today, 32 bytes next year with AVX, and 64 or 128 bytes likely in a few
years, depending on whom you believe), and that is where the fine-grained
parallelism comes from.
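
To make that concrete, a quick sketch: a plain stride-1 loop like the
axpy below is exactly what the per-core vector unit digests, and any
recent compiler vectorizes it at -O3 (add -mavx where available) with no
threads in sight.

  #include <stddef.h>

  /* Auto-vectorization target: one iteration per SIMD lane. */
  void axpy(size_t n, double a, const double *restrict x,
            double *restrict y)
  {
    for (size_t i = 0; i < n; i++)
      y[i] += a * x[i];
  }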

The main problem with OpenMP across multiple cores is that you get no
control over data locality.  In practice, especially on a NUMA system
(every multi-socket system worth discussing now is NUMA), the location of
the *physical* pages is of critical importance.  You can easily see a
performance hit of more than 3x on a quad-core system due to physical
pages getting mis-mapped.  This is already an issue with separate
processes using affinity, but only if the OS is sloppy or the sysadmins
are incompetent (I ran into it when a process had left a stale ramdisk
lying around).
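
For example (a quick sketch, assuming Linux's default first-touch page
placement policy): if a single thread initializes an array, every page
faults onto that thread's socket, and threads on the other socket pay
remote-access cost for the rest of the run.  The standard workaround is
to initialize in parallel with the same schedule as the compute loops:

  #include <stdlib.h>

  double *alloc_and_touch(size_t n)
  {
    double *x = malloc(n * sizeof(double));
    /* Fault each page from the thread that will later use it, so the
       kernel places it on that thread's NUMA node. */
  #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) x[i] = 0.0;
    return x;
  }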

But it is a much bigger deal for OpenMP, where you have no control over
how the threads get mapped, yet it is absolutely critical that every time
you touch some memory, you use a thread bound to the same NUMA node
(= socket, usually) as the thread that *faulted* it (not the one that
allocated it; allocation doesn't matter, even if the memory is statically
allocated).  With pthreads, you get a more flexible programming model,
and you can organize your memory so that almost all accesses are local.
In exchange for this more explicit control over memory locality (and the
generally more flexible programming model), you take on some added
complexity, and it can't just be "annotated" into existing code to
"parallelize" it.  Projecting onto someone I've never met, this is likely
the primary issue Bill Gropp has with OpenMP, and I think it is entirely
valid.
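
As a sketch of the sort of explicit control I mean (Linux-specific GNU
extension, needs _GNU_SOURCE), you can pin each worker to a core and then
have it fault its own working set:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Bind the calling thread to one core; pages it faults afterward land
     on that core's NUMA node under first-touch placement. */
  static int bind_self_to_core(int core)
  {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
  }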

I don't have experience using CUDA to generate CPU code, and I don't know
how it performs.  An obvious difference relative to the GPU is that the CPU
can perform much less structured computation without a performance hit.  I
don't know if CUDA/OpenCL for CPU allows you to take advantage of this.

Jed