[petsc-dev] OpenMP/Vec

Gerard Gorman g.gorman at imperial.ac.uk
Mon Feb 27 16:31:21 CST 2012


Jed Brown emailed the following on 27/02/12 00:39:
> On Sun, Feb 26, 2012 at 04:07, Gerard Gorman <g.gorman at imperial.ac.uk
> <mailto:g.gorman at imperial.ac.uk>> wrote:
>
>     > Did you post a repository yet? I'd like to have a look at the code.
>
>     It's on Barry's favourite collaborative software development site of
>     course ;-)
>
>     https://bitbucket.org/wence/petsc-dev-omp/overview
>
>
> I looked through the code and I'm concerned that all the OpenMP code
> is inlined into vec/impls/seq and mat/impls/aij/seq with, as far as I
> can tell, no way to use OpenMP for some objects in a simulation, but
> not others. I think that all the pragmas should have
> num_threads(x->nthreads) clauses. We can compute the correct number of
> threads based on sizes when memory is allocated (or specified through
> command line options, inherited from related objects, etc).

The num_threads(x->nthreads) clause is worth investigating. However, in
the benchmarks I have run so far, the only two sensible values for
nthreads appear to be 1 or the total number of threads. With
num_threads(1), OpenMP (in some implementations at least) is sensible
enough not to do anything silly that would introduce
scheduling/synchronisation overheads. Once you use more than one thread,
you incur those overheads. As you increase the number of threads you may
see parallel efficiency decrease for small arrays, but I have not seen
the actual time increase. If this is a general result, then it may not
be a big deal that the OpenMP code is inlined into vec/impls/seq and
mat/impls/aij/seq, so long as num_threads is used as suggested.

For determining the cutoff array size at which to switch between one
thread and all threads, an interesting option would be to determine it
at run time, i.e. learn the appropriate cutoff as the program runs.


>
> I don't think we can get away from OpenMP schedule overhead (several
> hundred cycles) even for those objects that we choose not to use
> threads for, but (at least with my gcc tests), that overhead is only
> about a third of the cost of actually starting a parallel region.

I had a quick go at getting some sensible benchmarks for this, but there
was too much system noise. I am particularly interested in seeing
whether the overhead goes to zero when num_threads(1) is used. The next
step is to look at the EPCC OpenMP microbenchmarks to see if they have
pinned down these issues.


>
> It's really not acceptable to insert unguarded
>
> #pragma omp ...
>
> into the code because this will generate tons of warnings or errors
> with compilers that don't know about OpenMP. It would be better to
> test for _Pragma and use


I'm surprised by this. I am not aware of any compiler that lacks OpenMP
support, and when OpenMP is not actually enabled, compilers generally
just ignore the pragma. Do you know of a compiler without OpenMP support
that will complain?


>
> #define PetscPragmatize(x) _Pragma(#x)
> #if defined(PETSC_USE_OPENMP)
> #  define PetscPragmaOMP(x) PetscPragmatize(omp x)
> #else
> #  define PetscPragmaOMP(x)
> #endif
>
> then use
>
> PetscPragmaOMP(parallel for ...)
>
> We should probably use a variant for object-based threading
>
> #define PetscPragmaOMPObject(obj,x) PetscPragmaOMP(x
> num_threads((obj)->nthreads))


This may not be flexible enough. You frequently want one parallel
region, and then multiple omp for's within that single region.


>
> In the case of multiple objects, I think you usually want the object
> being modified to control the number of threads.

I take this point.


>
> In many cases, I would prefer more control over the partition of the
> loop. For example, in many cases, I'd be willing to tolerate a slight
> computational imbalance between threads in exchange for working
> exclusively within my page. Note that the arithmetic to compute such
> things is orders of magnitude less expensive than the
> schedule/distribution to threads. I don't know how to do that except to
>
> PragmaOMP(parallel) {
>   int nthreads = omp_get_num_threads();
>   int tnum = omp_get_thread_num();
>   int start,end;
>   // compute start and end
>   for (int i=start; i<end; i++) {
>     // the work
>   }
> }
>
> We could perhaps capture some of this common logic in a macro:
>
> #define VecOMPParallelBegin(X,args) do { \
>   PragmaOMPObject(X,parallel args) { \
>   PetscInt _start, _end; \
>   VecOMPGetThreadLocalPart(X,&_start,&_end); \
>   { do {} while(0)
>
> #define VecOMPParallelEnd() }}} while(0)
>
> VecOMPParallelBegin(X, shared/private ...);
> {
>   PetscInt i;
>   for (i=_start; i<_end; i++) {
>     // the work
>   }
> }
> VecOMPParallelEnd();
>
> That should reasonably give us complete run-time control of the number
> of parallel threads per object and their distribution, within the
> constraints of contiguous thread partition.

I think what you describe is close to Fig. 3 of this paper written by
your neighbours:
http://greg.bronevetsky.com/papers/2008IWOMP.pdf
However, before making the implementation more complex, it would be good
to benchmark the current approach and use a tool like likwid to measure
the NUMA traffic, so we can get a good handle on the costs.



> That also leaves open the possibility of using libnuma to query and
> migrate pages. (For example, a short vector that needs to be accessed
> from multiple NUMA nodes might intentionally be faulted with pages
> spread apart even though other vectors of similar size might be
> accessed from within one NUMA nodes and thus not use threads at all.
> (One 4 KiB page is only 512 doubles, but if the memory is local to a
> single NUMA node, we wouldn't use threads until the vector length was
> 4 to 8 times larger.)

Well, this is where the implementation details get richer and there are
many options - and they also become less portable. For example, what
does all this mean for SPARC64 processors, which are UMA? Not to mention
Intel MIC, which also supports OpenMP. I am cautious about getting too
bogged down in very invasive optimisations before we have benchmarked
the basic approach, which in a wide range of use cases will achieve good
thread/page locality, as illustrated previously.

Cheers
Gerard



