[petsc-dev] OpenMP/Vec

Mon Feb 27 22:39:19 CST 2012

On Mon, Feb 27, 2012 at 16:31, Gerard Gorman <g.gorman at imperial.ac.uk> wrote:

> I had a quick go at trying to get some sensible benchmarks for this but
> I was getting too much system noise. I am particularly interested in
> seeing if the overhead goes to zero if num_threads(1) is used.
>

What timing method did you use? I did not see overhead going to zero when
num_threads goes to 1 when using GCC compilers, but Intel seems to do
fairly well.

>
> I'm surprised by this. I am not aware of any compiler that doesn't have
> OpenMP support - and when you do not actually enable OpenMP, compilers
> generally just ignore the pragma. Do you know of any compiler that does
> not have OpenMP support which will complain?
>

Sean points out that omp.h might not be available, but that misses the
point. As far as I know, recent mainstream compilers have enough sense to
at least ignore these directives, but I'm sure there are still cases where
it would be an issue. More importantly, #pragma was a misfeature that
should never be used now that _Pragma() exists. The latter is better not
just because it can be turned off, but because it can be manipulated using
macros and can be explicitly compiled out.
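As a rough illustration of that last point, here is a minimal sketch of wrapping _Pragma() in a macro. The names DemoPragmaOMP, DEMO_STRINGIFY, DEMO_NO_THREADS and demo_axpy are made up for this example and are not PETSc macros; _OPENMP is the standard macro that compilers define when OpenMP is enabled.

```c
#include <stddef.h>

/* Hypothetical helper: turn the macro arguments into a string literal. */
#define DEMO_STRINGIFY(...) #__VA_ARGS__

/* Hypothetical wrapper: emit "#pragma omp <clauses>" only when OpenMP is
 * enabled and the (made-up) DEMO_NO_THREADS switch is not set. */
#if defined(_OPENMP) && !defined(DEMO_NO_THREADS)
#  define DemoPragmaOMP(...) _Pragma(DEMO_STRINGIFY(omp __VA_ARGS__))
#else
#  define DemoPragmaOMP(...) /* explicitly compiled out */
#endif

/* y <- a*x + y, written once; threaded or serial depending on the macro. */
static void demo_axpy(int n, double a, const double *x, double *y)
{
  int i;
  DemoPragmaOMP(parallel for)
  for (i = 0; i < n; i++) y[i] += a * x[i];
}
```

Because the directive is now an ordinary macro argument, it can be rewritten, extended with extra clauses, or dropped in one place, which a bare #pragma line does not allow.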

> This may not be flexible enough. You frequently want to have a parallel
> region, and then have multiple omp for's within that one region.
>

PetscPragmaOMPObject(obj, parallel)
{
  PetscPragmaOMP(whatever you normally write for this loop)
  for (....) { }
  ...
  and so on
}
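For comparison, the plain OpenMP shape that this pseudocode is meant to cover (one parallel region containing several work-shared loops) might look like the sketch below. DEMO_OMP and demo_two_loops are hypothetical names for this example only; the proposed PetscPragmaOMPObject/PetscPragmaOMP macros are not implemented here.

```c
/* Hypothetical wrapper so the directives vanish when OpenMP is disabled. */
#if defined(_OPENMP)
#  define DEMO_OMP(s) _Pragma(s)
#else
#  define DEMO_OMP(s)
#endif

/* w <- x + y, then z <- 2*w, using two loops inside one parallel region. */
static void demo_two_loops(int n, const double *x, const double *y,
                           double *w, double *z)
{
  DEMO_OMP("omp parallel")
  {
    int i; /* declared inside the region, so each thread has its own copy */

    DEMO_OMP("omp for")
    for (i = 0; i < n; i++) w[i] = x[i] + y[i];
    /* "omp for" ends with an implicit barrier, so w is complete here. */

    DEMO_OMP("omp for")
    for (i = 0; i < n; i++) z[i] = 2.0 * w[i];
  }
}
```

The thread team is created once for both loops, which is the cost that a single enclosing region is meant to amortize.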

> I think what you describe is close to Fig 3 of this paper written by
> your neighbours:
> http://greg.bronevetsky.com/papers/2008IWOMP.pdf
> However, before making the implementation more complex, it would be good
> to benchmark the current approach and use a tool like likwid to measure
> the NUMA traffic so we can get a good handle on the costs.
>

Sure.

> Well this is where the implementation details get richer and there are
> many options - they also become less portable. For example, what does
> all this mean for the sparc64 processors which are UMA?
>

Delay to runtime, use an ignorant partition for UMA. (Blue Gene/Q is also
essentially uniform.) But note that even with uniform memory, cache still
makes it somewhat hierarchical.
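Reading "ignorant partition" as one that knows nothing about memory topology, a minimal sketch could be an even contiguous split of the index range. The function name demo_block_range and its signature are assumptions for illustration, not anything from PETSc.

```c
/* Topology-unaware split of [0, n) into nparts contiguous chunks whose
 * sizes differ by at most one. Writes the half-open range [*start, *end)
 * owned by chunk number `part` (0 <= part < nparts). */
static void demo_block_range(int n, int nparts, int part, int *start, int *end)
{
  int base = n / nparts; /* minimum chunk size */
  int rem  = n % nparts; /* the first `rem` chunks get one extra entry */

  *start = part * base + (part < rem ? part : rem);
  *end   = *start + base + (part < rem ? 1 : 0);
}
```

Contiguous chunks also keep each thread on its own stretch of memory, which helps with the cache hierarchy even when main memory is uniform.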

> Not to mention
> Intel MIC which also supports OpenMP. I guess I am cautious about
> getting too bogged down with very invasive optimisations until we have
> benchmarked the basic approach which in a wide range of use cases will
> achieve good thread/page locality as illustrated previously.
>

I guess I'm just interested in exposing enough semantic information to be
able to schedule a few different ways using run-time (or, if absolutely
necessary, configure-time) options. I don't want to have to revisit
individual loops.
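
One off-the-shelf approximation of run-time selectable scheduling is OpenMP's schedule(runtime) clause, which defers the choice of loop schedule to the OMP_SCHEDULE environment variable. This is a stand-in for the idea, not the richer semantic interface described above; DEMO_OMP and demo_dot are hypothetical names.

```c
/* Hypothetical wrapper so the directive vanishes when OpenMP is disabled. */
#if defined(_OPENMP)
#  define DEMO_OMP(s) _Pragma(s)
#else
#  define DEMO_OMP(s)
#endif

/* Dot product whose loop schedule is chosen when the program is launched,
 * not when the loop is written. */
static double demo_dot(int n, const double *x, const double *y)
{
  double sum = 0.0;
  int i;

  DEMO_OMP("omp parallel for schedule(runtime) reduction(+:sum)")
  for (i = 0; i < n; i++) sum += x[i] * y[i];

  return sum;
}
```

With OpenMP enabled, launching with OMP_SCHEDULE="static" or OMP_SCHEDULE="dynamic,64" changes how iterations are handed out without touching the loop itself.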