[petsc-dev] Status of pthreads and OpenMP support

Nystrom, William D wdn at lanl.gov
Fri Oct 26 11:34:24 CDT 2012


Hi Karli,

Thanks for your reply.  Doing the memory bandwidth calculation seems like a useful exercise.  I'll
give that a try.  I was also trying to think of this from a higher level perspective.  Does this seem
reasonable?

T_vec_op = T_vec_compute + T_vec_memory

where these are times; using multiple threads only speeds up the T_vec_compute part, while
T_vec_memory stays roughly constant whether the memory operations are done with a single
thread or with multiple threads.
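
For illustration, this is essentially Amdahl's law with T_vec_memory playing the
role of the part that does not scale.  A minimal sketch in C (the 20% compute
fraction is only a placeholder, not a measured value):

    #include <stdio.h>

    int main(void)
    {
      /* Placeholder: fraction of single-thread time spent in T_vec_compute. */
      double f = 0.2;
      for (int p = 1; p <= 16; p *= 2) {
        /* Only the compute part scales with p; the memory part stays fixed. */
        double speedup = 1.0 / (f / p + (1.0 - f));
        printf("threads = %2d  predicted speedup = %.2fx\n", p, speedup);
      }
      return 0;
    }

With f = 0.2, for example, even 16 threads predict only about a 1.23x speedup
for such an operation.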

Thanks,

Dave

________________________________________
From: Karl Rupp [rupp at mcs.anl.gov]
Sent: Friday, October 26, 2012 10:20 AM
To: For users of the development version of PETSc
Cc: Nystrom, William D
Subject: Re: [petsc-dev] Status of pthreads and OpenMP support

Hi Dave,

Let me just comment on the expected speed-up: as the arithmetic
intensity of vector operations is small, you are in a memory-bandwidth-limited
regime. If you use smaller vectors in order to stay in cache,
you may still not obtain the expected speedup, because thread-management
overhead then becomes more of an issue. I suggest you compute the
effective memory bandwidth of your vector operations; I suspect
you are pretty close to bandwidth saturation already.
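
As a standalone illustration (plain C with OpenMP, not PETSc code): time a
threaded axpy and divide the bytes it moves by the elapsed time.  y = a*x + y
reads x and y and writes y, so one sweep moves roughly 3*n*sizeof(double)
bytes.  A minimal sketch, compiled with e.g. gcc -O2 -std=c99 -fopenmp:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
      const long n    = 50000000;   /* large enough to fall out of cache */
      const int  reps = 20;
      double *x = malloc((size_t)n * sizeof(double));
      double *y = malloc((size_t)n * sizeof(double));

      /* First-touch initialization with the same static partitioning
         as the timed loop below. */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

      double t0 = omp_get_wtime();
      for (int r = 0; r < reps; r++) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++) y[i] += 3.0 * x[i];   /* y = a*x + y */
      }
      double t = omp_get_wtime() - t0;

      /* Bytes moved: read x, read y, write y -> 3 doubles per entry per sweep. */
      double gbytes = (double)reps * 3.0 * (double)n * sizeof(double) / 1.0e9;
      printf("effective bandwidth: %.1f GB/s  (y[0] = %g)\n", gbytes / t, y[0]);
      free(x); free(y);
      return 0;
    }

If the number this prints for 8 or 16 threads is already close to what a
STREAM-like benchmark reports for the node, then the vector kernels are
bandwidth-saturated and more threads cannot help much.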

Best regards,
Karli


On 10/26/2012 10:58 AM, Nystrom, William D wrote:
> Jed or Shri,
>
> Are there other preconditioners I could use/try now with the petsc thread support besides jacobi?
> I looked around in the documentation for something like least squares polynomial preconditioning
> that is referenced in a paper by Li and Saad titled "GPU-Accelerated Preconditioned Iterative
> Linear Solvers" but did not find anything like that.  Would block jacobi with lu/cholesky for the
> block solves work with the current thread support?
>
> Regarding the performance of my recent runs, I was surprised that I was not getting closer to a
> 16x speedup for the purely vector operations when using 16 threads compared to 1 thread.  I'm
> running on a single node of a cluster where the nodes are dual-socket Sandy Bridge CPUs and
> the OS is TOSS 2 Linux from Livermore.  So I'm assuming that is not really an "unknown" sort
> of system.  One thing I am wondering is whether there is an issue with my thread affinities.  I am
> setting them, but am wondering if there could be issues with which chunk of a vector a given
> thread gets.  For instance, assuming a single MPI process on a single node and using 16 threads,
> I would assume that the vector occupies a contiguous chunk of memory and that it will get divided
> into 16 chunks.  If thread 13 is the first to launch, does it get the first chunk of the vector or the
> 13th chunk of the vector?  If the latter, then I would think my assignment of thread affinities is
> optimal.  If my thread assignment is optimal, then is the less-than-16x speedup in the vector
> operations because of memory bandwidth limitations or cache effects?
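
(A minimal OpenMP sketch for checking exactly this: each thread reports the
core it is running on and the contiguous chunk it would own under a
rank-ordered partition.  The rank-ordered chunking is an assumption for
illustration, not a statement about what the PETSc thread communicator
actually does, and the sketch is Linux-only because of sched_getcpu.  With
gcc, something like GOMP_CPU_AFFINITY="0-15" pins the threads to cores 0-15
in list order.)

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void)
    {
      const long n = 16000000;   /* length of a hypothetical vector */
      #pragma omp parallel
      {
        int  tid = omp_get_thread_num();
        int  nth = omp_get_num_threads();
        long lo  = tid * (n / nth);                       /* start of my chunk */
        long hi  = (tid == nth - 1) ? n : lo + n / nth;   /* end of my chunk   */
        printf("thread %2d on core %2d would own entries [%ld, %ld)\n",
               tid, sched_getcpu(), lo, hi);
      }
      return 0;
    }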
>
> What profiling tools do you recommend for use with PETSc?  I have investigated and tried Open|SpeedShop,
> HPCToolkit, and TAU, but have not tried any of them with PETSc.  I was told that there were some issues with
> using TAU with PETSc, though I'm not sure what they are.  So far, I have liked TAU best.
>
> Dave
>
> ________________________________________
> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
> Sent: Friday, October 26, 2012 7:47 AM
> To: For users of the development version of PETSc
> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>
> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>>
>>> What I see in your results is about 7x speedup by using 16 threads.  I
>>> think you should get better results by running 8 threads with 2
>>> processes because the memory can be allocated on separate memory
>>> controllers, and the memory will be physically closer to the cores.
>>> I'm surprised that you get worse results.
>>
>>
>> Our intent is for the threads to use an explicit first-touch policy so that
>> they get local memory even when you have threads across multiple NUMA zones.
>
> Great.  I still think the performance using jacobi (as Dave does)
> should be no worse using 2x(MPI) and 8x(thread) than it is with
> 1x(MPI) and 16x(thread).
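
(For reference, a minimal sketch of the first-touch idiom being described:
physical pages land on the NUMA node of the thread that first writes them, so
the initialization loop uses the same static partitioning as the later compute
loops.  Illustration only, written with OpenMP rather than the PETSc pthread
code.)

    #include <stdlib.h>

    double *alloc_first_touch(long n)
    {
      double *v = malloc((size_t)n * sizeof(double));  /* no pages placed yet */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++)
        v[i] = 0.0;          /* first write places each page near its thread */
      return v;
    }

    int main(void)
    {
      double *v = alloc_first_touch(1L << 24);
      /* ... later compute loops should use the same static partitioning ... */
      free(v);
      return 0;
    }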
>
>>>
>>> It doesn't surprise me that an explicit code gets much better speedup.
>>
>>
>> The explicit code is much less dependent on memory bandwidth relative to
>> floating point.
>>
>>>
>>>
>>>> I also get about the same performance results on the ex2 problem when
>>>> running it with just MPI alone, i.e., with 16 MPI processes.
>>>>
>>>> So from my perspective, the new pthreads/OpenMP support is looking
>>>> pretty good, assuming the issue with the MKL/external-packages
>>>> interaction can be fixed.
>>>>
>>>> I was just using Jacobi preconditioning for ex2.  I'm wondering if there
>>>> are any other preconditioners that might be multi-threaded.  Or maybe a
>>>> polynomial preconditioner could work well for the new pthreads/OpenMP
>>>> support.
>>>
>>> GAMG with SOR smoothing seems like a prime candidate for threading.  I
>>> wonder if anybody has worked on this yet?
>>
>>
>> SOR is not great because it's sequential.
>
> For structured grids we have multi-color schemes and temporally
> blocked schemes as in this paper,
>
> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
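
(As an illustration of the structured-grid multi-color idea: a red-black
Gauss-Seidel sweep for a 2-D 5-point Laplacian.  Points of one color only read
points of the other color, so each color sweep is an ordinary parallel loop.
This is just a sketch of the coloring, not the temporally blocked scheme from
the paper above.)

    #include <stdio.h>
    #include <stdlib.h>

    /* One red-black Gauss-Seidel sweep for -Laplace(u) = f on an nx-by-ny grid. */
    static void redblack_gs_sweep(int nx, int ny, double h2,
                                  double *u, const double *f)
    {
      for (int color = 0; color < 2; color++) {
        #pragma omp parallel for
        for (int j = 1; j < ny - 1; j++) {
          /* visit interior points with (i + j) % 2 == color */
          for (int i = 1 + (color + j + 1) % 2; i < nx - 1; i += 2) {
            u[j*nx + i] = 0.25 * (u[j*nx + i - 1] + u[j*nx + i + 1] +
                                  u[(j-1)*nx + i] + u[(j+1)*nx + i] +
                                  h2 * f[j*nx + i]);
          }
        }
      }
    }

    int main(void)
    {
      const int nx = 64, ny = 64;
      const double h = 1.0 / (nx - 1);
      double *u = calloc((size_t)nx * ny, sizeof(double));  /* zero boundary */
      double *f = malloc((size_t)nx * ny * sizeof(double));
      for (int k = 0; k < nx * ny; k++) f[k] = 1.0;
      for (int sweep = 0; sweep < 100; sweep++)
        redblack_gs_sweep(nx, ny, h * h, u, f);
      printf("u at center: %g\n", u[(ny/2)*nx + nx/2]);
      free(u); free(f);
      return 0;
    }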
>
> For unstructured grids, could we do some analogous decomposition using,
> e.g., ParMETIS?
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
>
> Regards,
> John
>
>> A block Jacobi/SOR parallelizes
>> fine, but does not guarantee stability without additional
>> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
>> with threads (but not all the kernels are ready).
>>
>> Coarsening and the Galerkin triple product are more difficult to thread.
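
(To make the last point about smoothers concrete: a sketch of the kernel a
Chebyshev/Jacobi smoother is built from, namely a damped Jacobi sweep
x <- x + omega * D^{-1} (b - A*x) on a CSR matrix.  Every row update is
independent, so the loop threads cleanly; the Chebyshev part wraps a short
recurrence of such sweeps with coefficients derived from an eigenvalue
estimate.  Illustration only, not PETSc source.)

    #include <stdio.h>

    typedef struct {
      int           n;        /* number of rows              */
      const int    *rowptr;   /* CSR row pointers, size n+1  */
      const int    *col;      /* CSR column indices          */
      const double *val;      /* CSR values                  */
    } CSRMatrix;

    /* xnew = x + omega * D^{-1} (b - A*x); every row is independent. */
    void damped_jacobi_sweep(const CSRMatrix *A, const double *b,
                             const double *x, double *xnew, double omega)
    {
      #pragma omp parallel for
      for (int i = 0; i < A->n; i++) {
        double diag = 1.0, r = b[i];
        for (int k = A->rowptr[i]; k < A->rowptr[i+1]; k++) {
          r -= A->val[k] * x[A->col[k]];
          if (A->col[k] == i) diag = A->val[k];
        }
        xnew[i] = x[i] + omega * r / diag;
      }
    }

    int main(void)
    {
      /* Tiny example: A = [4 1; 1 3], b = [1; 2]; exact solution (1/11, 7/11). */
      const int    rowptr[] = {0, 2, 4};
      const int    col[]    = {0, 1, 0, 1};
      const double val[]    = {4.0, 1.0, 1.0, 3.0};
      CSRMatrix A = {2, rowptr, col, val};
      double b[] = {1.0, 2.0}, x[] = {0.0, 0.0}, xnew[2];
      for (int s = 0; s < 20; s++) {
        damped_jacobi_sweep(&A, b, x, xnew, 2.0 / 3.0);
        x[0] = xnew[0]; x[1] = xnew[1];
      }
      printf("x = (%g, %g)\n", x[0], x[1]);
      return 0;
    }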



