[petsc-dev] Status of pthreads and OpenMP support

Fri Oct 26 08:47:07 CDT 2012

On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>
>> What I see in your results is about 7x speedup by using 16 threads.  I
>> think you should get better results by running 8 threads with 2
>> processes because the memory can be allocated on separate memory
>> controllers, and the memory will be physically closer to the cores.
>> I'm surprised that you get worse results.
>
>
> Our intent is for the threads to use an explicit first-touch policy so that
> they get local memory even when you have threads across multiple NUMA zones.

Great.  I still think the performance using jacobi (as Dave does)
should be no worse using 2x(MPI) and 8x(thread) than it is with
1x(MPI) and 16x(thread).

>>
>> It doesn't surprise me that an explicit code gets much better speedup.
>
>
> The explicit code is much less dependent on memory bandwidth relative to
> floating point.
>
>>
>>
>> > I also get about the same performance results on the ex2 problem when
>> > running it with just
>> > mpi alone i.e. with 16 mpi processes.
>> >
>> > So from my perspective, the new pthreads/openmp support is looking
>> > pretty good assuming
>> > the issue with the MKL/external packages interaction can be fixed.
>> >
>> > I was just using jacobi preconditioning for ex2.  I'm wondering if there
>> > are any other preconditioners
>> > that might be multi-threaded.  Or maybe a polynomial preconditioner
>> > could work well for the
>> > new pthreads/openmp support.
>>
>> GAMG with SOR smoothing seems like a prime candidate for threading.  I
>> wonder if anybody has worked on this yet?
>
>
> SOR is not great because it's sequential.

For structured grids we have multi-color schemes and temporally
blocked schemes as in this paper,

http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf

For unstructured grids, could we do some analagous decomposition using
e.g. parmetis?

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764

Regards,
John

> A block Jacobi/SOR parallelizes
> fine, but does not guarantee stability without additional
> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
> with threads (but not all the kernels are ready).
>
> Coarsening and the Galerkin triple product is more difficult to thread.