[petsc-dev] Status of pthreads and OpenMP support

Nystrom, William D wdn at lanl.gov
Fri Oct 26 10:58:20 CDT 2012


Jed or Shri,

Are there other preconditioners I could use/try now with the petsc thread support besides jacobi?
I looked around in the documentation for something like the least-squares polynomial
preconditioning referenced in the paper by Li and Saad titled "GPU-Accelerated Preconditioned
Iterative Linear Solvers", but did not find anything like that.  Would block jacobi with
lu/cholesky for the block solves work with the current thread support?
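
If that combination should work, the sort of run I have in mind for ex2 is below -- hedging
that I have the threadcomm option names right (-threadcomm_type and -threadcomm_nthreads are
what I have seen in petsc-dev; please correct me if they have changed):

  mpiexec -n 1 ./ex2 -m 1000 -n 1000 -ksp_type cg \
      -pc_type bjacobi -pc_bjacobi_blocks 16 -sub_pc_type lu \
      -threadcomm_type pthread -threadcomm_nthreads 16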

Regarding the performance of my recent runs, I was surprised that I was not getting closer to a
16x speedup for the purely vector operations when using 16 threads compared to 1 thread.  I'm
running on a single node of a cluster whose nodes have dual-socket Sandy Bridge CPUs running
TOSS 2 Linux from Livermore, so I'm assuming that is not really an "unknown" sort of system.
One thing I am wondering is whether there is an issue with my thread affinities.  I am setting
them, but am wondering if there could be issues with which chunk of a vector a given thread
gets.  For instance, assuming a single mpi process on a single node and using 16 threads,
I would assume that the vector occupies a contiguous chunk of memory and that it will get divided
into 16 chunks.  If thread 13 is the first to launch, does it get the first chunk of the vector or the
13th chunk of the vector?  If the latter, then I would think my assignment of thread affinities is
optimal.  And if my thread assignment is optimal, is the less-than-16x speedup in the vector
operations due to memory bandwidth limitations or to cache effects?
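
To make that mental model concrete, here is the chunk assignment I am assuming, as a plain C
sketch -- this is my guess at the behavior, not the actual petsc-dev source:

  #include <petscsys.h>

  /* Assumption: thread "trank" owns the trank-th contiguous chunk of a
     length-n vector, independent of the order in which threads launch. */
  static void ChunkRange(PetscInt n,PetscInt nthreads,PetscInt trank,
                         PetscInt *start,PetscInt *end)
  {
    PetscInt chunk = n/nthreads, rem = n%nthreads;
    *start = trank*chunk + (trank < rem ? trank : rem);
    *end   = *start + chunk + (trank < rem ? 1 : 0);
    /* If thread trank also first-touches x[*start..*end-1], those pages
       should end up on trank's NUMA node. */
  }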

What profiling tools do you recommend for use with petsc?  I have investigated and tried
Open|SpeedShop, HPCToolkit and TAU, but have not tried any of them with petsc.  I was told that
there were some issues with using TAU with petsc, though I'm not sure what they are.  So far, I
have liked TAU best.
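
For what it's worth, petsc's built-in profiling, e.g.

  ./ex2 -m 1000 -n 1000 -log_summary

gives me the per-event timings and flop rates behind the numbers above, but I would like a tool
that can also attribute time inside the threaded kernels.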

Dave

________________________________________
From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
Sent: Friday, October 26, 2012 7:47 AM
To: For users of the development version of PETSc
Subject: Re: [petsc-dev] Status of pthreads and OpenMP support

On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>
>> What I see in your results is about 7x speedup by using 16 threads.  I
>> think you should get better results by running 8 threads with 2
>> processes because the memory can be allocated on separate memory
>> controllers, and the memory will be physically closer to the cores.
>> I'm surprised that you get worse results.
>
>
> Our intent is for the threads to use an explicit first-touch policy so that
> they get local memory even when you have threads across multiple NUMA zones.

Great.  I still think the performance using jacobi (as Dave does)
should be no worse using 2x(MPI) and 8x(thread) than it is with
1x(MPI) and 16x(thread).
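
By an explicit first-touch policy I take Jed to mean something like the sketch below --
illustrative only, not the petsc-dev code:

  #include <stdlib.h>

  /* Allocate a vector and have each thread initialize the chunk it will
     later compute on; Linux then places those pages on the initializing
     thread's NUMA node.  Compile with OpenMP enabled (e.g. -fopenmp). */
  double *alloc_first_touch(size_t n)
  {
    double *x = malloc(n*sizeof(double));
    if (!x) return NULL;
  #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) x[i] = 0.0;  /* first touch */
    return x;
  }

With 2x(MPI) and 8x(thread), each process's memory lands on its own socket without any of this,
which is why I expected it to be at least as fast.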

>>
>> It doesn't surprise me that an explicit code gets much better speedup.
>
>
> The explicit code is much less dependent on memory bandwidth relative to
> floating point.
>
>>
>>
>> > I also get about the same performance results on the ex2 problem when
>> > running it with just
>> > mpi alone i.e. with 16 mpi processes.
>> >
>> > So from my perspective, the new pthreads/openmp support is looking
>> > pretty good assuming
>> > the issue with the MKL/external packages interaction can be fixed.
>> >
>> > I was just using jacobi preconditioning for ex2.  I'm wondering if there
>> > are any other preconditioners
>> > that might be multi-threaded.  Or maybe a polynomial preconditioner
>> > could work well for the
>> > new pthreads/openmp support.
>>
>> GAMG with SOR smoothing seems like a prime candidate for threading.  I
>> wonder if anybody has worked on this yet?
>
>
> SOR is not great because it's sequential.

For structured grids we have multi-color schemes and temporally
blocked schemes as in this paper,

http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf

For unstructured grids, could we do some analogous decomposition using
e.g. parmetis?

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
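
To make the structured-grid case concrete, a red-black sweep threads trivially because points
of one color have no mutual dependence -- a sketch for the 5-point Laplacian, not PETSc code:

  /* One red-black Gauss-Seidel sweep for -Laplacian(u) = f on an m x m
     grid with spacing h (h2 = h*h; boundary rows/columns held fixed).
     Compile with OpenMP enabled.  Within one color, updates only read
     points of the other color, so each color's loop is safely parallel. */
  void rb_gs_sweep(int m, double h2, const double *f, double *u)
  {
    for (int color = 0; color < 2; color++) {
  #pragma omp parallel for schedule(static)
      for (int j = 1; j < m-1; j++) {
        for (int i = 1 + (j + color)%2; i < m-1; i += 2) {
          u[j*m+i] = 0.25*(u[j*m+i-1] + u[j*m+i+1] +
                           u[(j-1)*m+i] + u[(j+1)*m+i] + h2*f[j*m+i]);
        }
      }
    }
  }

A coloring of the matrix graph (via parmetis, or a greedy pass) would play the same role in the
unstructured case.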

Regards,
John

> A block Jacobi/SOR parallelizes
> fine, but does not guarantee stability without additional
> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
> with threads (but not all the kernels are ready).
>
> Coarsening and the Galerkin triple product is more difficult to thread.
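
For reference, my understanding is that the Chebyshev/Jacobi smoother inside gamg would be
selected with options along the lines of

  -pc_type gamg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi

(option names as I read current petsc-dev; corrections welcome).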


