[petsc-dev] Fwd: Poisson step in GTS

Jed Brown jed at 59A2.org
Sun Jun 19 08:13:00 CDT 2011


On Sun, Jun 19, 2011 at 03:33, Barry Smith <bsmith at mcs.anl.gov> wrote:

> No, Cray does not provide any threaded BLAS 1.  Generally speaking it is
> not worth threading a single nested loop unless the trip count is very high
> and generally that does not happen often enough to warrant the special BLAS.
> In fact, I am not even sure we omp BLAS 2, I don't think so.


This shouldn't be surprising. It's worth noting that inside multigrid (or
a surface-volume split problem), the same vector operation will be called
with quite different sizes. Managing the granularity so that threads are
used when they will actually be faster, but not when the size is too small
to offset the startup cost, is not trivial.
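To illustrate, here is a minimal sketch (not anything in PETSc) of the kind of
size cutoff I mean; the threshold below is made up, and choosing it portably
for a given machine and thread count is exactly the non-trivial part:

#include <omp.h>

/* Hypothetical cutoff below which threading costs more than it saves. */
#define VEC_OMP_THRESHOLD 10000

void waxpy(int n, double alpha, const double *x, const double *y, double *w)
{
  if (n < VEC_OMP_THRESHOLD) {            /* too small: fork/join would dominate */
    for (int i = 0; i < n; i++) w[i] = alpha*x[i] + y[i];
  } else {                                /* large enough to amortize thread startup */
#pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) w[i] = alpha*x[i] + y[i];
  }
}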

A related matter that I keep harping on is that the memory hierarchy is very
non-uniform. In the old days, it was reasonably uniform within a socket, but
some of the latest hardware has multiple dies within a socket, each with
more-or-less independent memory buses.

Of course you can always move MPI down to finer granularity (e.g. one MPI
process per die instead of one per socket). I think this is a good solution
for many applications, and may perform better than threads for reasonably
large subdomains (mostly because memory affinity can be reliably managed),
but it is less appealing in the strong scaling limit.
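Part of the reason affinity is easier to manage with one process per die is
that each process's pages naturally end up in its local memory. With threads
you are relying on first-touch placement, roughly as in the sketch below
(assuming a Linux-style first-touch policy and pinned threads); initialize
with the wrong loop and every page lands on one die.

#include <stdlib.h>
#include <omp.h>

/* First-touch: a page is mapped to the NUMA node of the thread that first
 * writes it, so initialize with the same static schedule the kernels use. */
double *alloc_and_place(int n)
{
  double *x = malloc((size_t)n*sizeof(double));
#pragma omp parallel for schedule(static)
  for (int i = 0; i < n; i++) x[i] = 0.0;   /* touch pages where they will be used */
  return x;
}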

I have yet to see a threading library/language that offers a good platform
for bandwidth-constrained strong scaling. The existing solutions tend to
base everything on the absurd assumption that parallel computation is about
parallelizing the computation. In reality, certainly for the sorts of
problems that most PETSc users care about, everything hard about parallelism
is communicating what needs to be communicated while retaining good data
locality for what doesn't need to be communicated.
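To make that concrete, the pattern that matters for something like a
distributed matrix-vector product or a ghosted update is roughly the sketch
below (a schematic 1-D example, not PETSc's actual implementation): start the
ghost exchange, do the work that needs only local data while the messages are
in flight, then finish the few entries that depend on the communicated values.

#include <mpi.h>

/* Schematic ghosted update in 1-D: u holds n owned values in u[1..n] plus
 * ghosts in u[0] and u[n+1]; ghosts at the physical domain ends are assumed
 * to have been set by the caller. */
void smooth_overlapped(double *u, double *unew, int n, MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
  int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
  MPI_Request req[4];

  /* Communicate only what must be communicated: one value to each neighbor. */
  MPI_Irecv(&u[0],   1, MPI_DOUBLE, left,  0, comm, &req[0]);
  MPI_Irecv(&u[n+1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
  MPI_Isend(&u[1],   1, MPI_DOUBLE, left,  1, comm, &req[2]);
  MPI_Isend(&u[n],   1, MPI_DOUBLE, right, 0, comm, &req[3]);

  /* Interior points need no ghosts; do them while the messages are in flight. */
  for (int i = 2; i <= n-1; i++) unew[i] = (u[i-1] + u[i] + u[i+1]) / 3.0;

  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

  /* Only the endpoints depend on the communicated values. */
  unew[1] = (u[0] + u[1] + u[2]) / 3.0;
  unew[n] = (u[n-1] + u[n] + u[n+1]) / 3.0;
}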

Getting high performance out of a local kernel is an easy, local
optimization. In contrast, optimizing for data locality often involves major
data structure and algorithm changes.

The current systems all seem to be really bad at this: data locality is
sometimes provided implicitly, but it is ultimately very fragile. The number
of recent threading papers that report less than 30% of peak hardware
bandwidth for STREAM is not inspiring. (Rather few authors actually state the
hardware peak, but if you look up the specs, it's really rare to see
something respectable, e.g. 80%.)
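The arithmetic is not subtle: triad moves 3*N*sizeof(double) bytes per pass
(4*N*sizeof(double) if you count the write-allocate), so measuring the
fraction of the spec-sheet number is a few lines. A crude sketch, with a
made-up placeholder for the vendor's peak:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define HW_PEAK_GBS 85.3   /* hypothetical vendor number for the socket, GB/s */

int main(void)
{
  const long n = 50*1000*1000L;
  double *a = malloc(n*sizeof(double)), *b = malloc(n*sizeof(double)), *c = malloc(n*sizeof(double));
  const double alpha = 3.0;

#pragma omp parallel for schedule(static)   /* first touch, same schedule as the kernel */
  for (long i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  double t = omp_get_wtime();
#pragma omp parallel for schedule(static)
  for (long i = 0; i < n; i++) a[i] = b[i] + alpha*c[i];   /* STREAM-style triad */
  t = omp_get_wtime() - t;

  double gbytes = 3.0*n*sizeof(double)/1e9;   /* read b, read c, write a */
  printf("triad: %.1f GB/s = %.0f%% of quoted peak\n", gbytes/t, 100.0*gbytes/t/HW_PEAK_GBS);
  free(a); free(b); free(c);
  return 0;
}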