[petsc-dev] Status of pthreads and OpenMP support

Nystrom, William D wdn at lanl.gov
Fri Oct 26 15:08:47 CDT 2012


Are there any PETSc examples that do cache blocking and would work with the new
threads support?  I was initially investigating DMDA, but it looks like that only works
with MPI processes.  I was looking at ex34.c and ex45.c in petsc-dev/src/ksp/ksp/examples/tutorials.

Thanks,

Dave

________________________________________
From: Nystrom, William D
Sent: Friday, October 26, 2012 10:53 AM
To: Karl Rupp
Cc: For users of the development version of PETSc; Nystrom, William D
Subject: RE: [petsc-dev] Status of pthreads and OpenMP support

Karli,

Thanks.  Sounds like I need to actually do the memory bandwidth calculation to be more
quantitative.

Thanks again,

Dave

________________________________________
From: Karl Rupp [rupp at mcs.anl.gov]
Sent: Friday, October 26, 2012 10:47 AM
To: Nystrom, William D
Cc: For users of the development version of PETSc
Subject: Re: [petsc-dev] Status of pthreads and OpenMP support

Hi,

 > Thanks for your reply.  Doing the memory bandwidth calculation seems like
 > a useful exercise.  I'll give that a try.  I was also trying to think of
 > this from a higher level perspective.  Does this seem reasonable?
>
> T_vec_op = T_vec_compute + T_vec_memory
>
> where these are times, but using multiple threads only speeds up the T_vec_compute part, while
> T_vec_memory stays roughly constant whether I do the memory operations with a single thread
> or with multiple threads.

Yes and no :-)
Because there can be multiple physical memory links and NUMA effects,
T_vec_memory depends on the number and affinity of the threads. Also,

  T_vec_op = max(T_vec_compute, T_vec_memory)

can be a better approximation, since memory transfers and the actual
arithmetic may overlap ('prefetching').

Still, the main speed-up when using threads (or multiple processes) is
in T_vec_compute. However, hardware processing speed has evolved such
that T_vec_memory is now often dominant (exceptions are mostly BLAS
level 3 algorithms), making proper data layout and affinity even more
important.
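
To make the max() picture concrete, here is a back-of-the-envelope sketch for a
VecAXPY-style update y <- y + alpha*x on 1e7 doubles. The flop rate and bandwidth
figures below are assumed round numbers for a dual-socket node, not measurements:

  #include <stdio.h>

  int main(void)
  {
    const double N     = 1.0e7;          /* vector length */
    const double flops = 2.0 * N;        /* one multiply + one add per entry */
    const double bytes = 3.0 * 8.0 * N;  /* read x, read y, write y */
    const double peak_flops = 330.0e9;   /* assumed aggregate flop rate [flop/s] */
    const double peak_bw    = 80.0e9;    /* assumed aggregate bandwidth [byte/s] */

    double t_compute = flops / peak_flops;
    double t_memory  = bytes / peak_bw;

    printf("T_vec_compute ~ %.1e s,  T_vec_memory ~ %.1e s\n", t_compute, t_memory);
    printf("memory term dominates by a factor of ~%.0f\n", t_memory / t_compute);
    return 0;
  }

With numbers of that order the memory term dominates by roughly a factor of 50,
which is why extra threads shorten T_vec_compute without moving T_vec_op much.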

Best regards,
Karli



> ________________________________________
> From: Karl Rupp [rupp at mcs.anl.gov]
> Sent: Friday, October 26, 2012 10:20 AM
> To: For users of the development version of PETSc
> Cc: Nystrom, William D
> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>
> Hi Dave,
>
> let me just comment on the expected speed-up: As the arithmetic
> intensity of vector operations is small, you are in a memory-bandwidth-limited
> regime. If you use smaller vectors in order to stay in cache,
> you may still not obtain the expected speedup because then thread
> management overhead becomes more of an issue. I suggest you compute the
> effective memory bandwidth of your vector operations, because I suspect
> you are pretty close to bandwidth saturation already.
>
> Best regards,
> Karli
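
A minimal way to carry out the effective-bandwidth measurement suggested above is to
time a VecAXPY-like update directly. The OpenMP sketch below is purely illustrative
(it is not PETSc code, and the 24 bytes/entry count ignores write-allocate traffic):

  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
    const long   n     = 20000000;   /* far larger than any cache */
    const int    nrep  = 50;
    const double alpha = 3.14;
    double *x = malloc((size_t)n * sizeof(double));
    double *y = malloc((size_t)n * sizeof(double));

    /* Initialize in parallel so pages are spread across the NUMA nodes
       (first touch); see also the note further down in the thread. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    double t0 = omp_get_wtime();
    for (int rep = 0; rep < nrep; rep++) {
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++) y[i] += alpha * x[i];
    }
    double t = omp_get_wtime() - t0;

    /* 24 bytes per entry: read x, read y, write y. */
    double gb = (double)nrep * 24.0 * (double)n / 1.0e9;
    printf("effective bandwidth: %.1f GB/s with %d threads\n",
           gb / t, omp_get_max_threads());
    free(x); free(y);
    return 0;
  }

Comparing the reported number against the node's nominal memory bandwidth shows
directly how close the vector kernels already are to saturation.
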
>
>
> On 10/26/2012 10:58 AM, Nystrom, William D wrote:
>> Jed or Shri,
>>
>> Are there other preconditioners I could use/try now with the petsc thread support besides jacobi?
>> I looked around in the documentation for something like least squares polynomial preconditioning
>> that is referenced in a paper by Li and Saad titled "GPU-Accelerated Preconditioned Iterative
>> Linear Solvers" but did not find anything like that.  Would block jacobi with lu/cholesky for the
>> block solves work with the current thread support?
>>
>> Regarding the performance of my recent runs, I was surprised that I was not getting closer to a
>> 16x speedup for the purely vector operations when using 16 threads compared to 1 thread.  I'm
>> running on a single node of a cluster where the nodes are dual-socket Sandy Bridge CPUs and
>> the OS is TOSS 2 Linux from Livermore.  So I'm assuming that is not really an "unknown" sort
>> of system.  One thing I am wondering is whether there is an issue with my thread affinities.  I am
>> setting them but am wondering if there could be issues with which chunk of a vector a given
>> thread gets.  For instance, assuming a single MPI process on a single node and using 16 threads,
>> I would assume that the vector occupies a contiguous chunk of memory and that it will get divided
>> into 16 chunks.  If thread 13 is the first to launch, does it get the first chunk of the vector or the
>> 13th chunk of the vector?  If the latter, then I would think my assignment of thread affinities is
>> optimal.  If my thread assignment is optimal, then is the less than 16x speedup in the vector
>> operations because of memory bandwidth limitations or cache effects?
>>
>> What profiling tools do you recommend for use with PETSc?  I have investigated and tried Open|SpeedShop,
>> HPCToolkit, and TAU but have not tried any of them with PETSc.  I was told that there were some issues with
>> using TAU with PETSc.  Not sure what they are.  So far, I have liked TAU best.
>>
>> Dave
>>
>> ________________________________________
>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
>> Sent: Friday, October 26, 2012 7:47 AM
>> To: For users of the development version of PETSc
>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>
>> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>>>
>>>> What I see in your results is about 7x speedup by using 16 threads.  I
>>>> think you should get better results by running 8 threads with 2
>>>> processes because the memory can be allocated on separate memory
>>>> controllers, and the memory will be physically closer to the cores.
>>>> I'm surprised that you get worse results.
>>>
>>>
>>> Our intent is for the threads to use an explicit first-touch policy so that
>>> they get local memory even when you have threads across multiple NUMA zones.
>>
>> Great.  I still think the performance using jacobi (as Dave does)
>> should be no worse using 2x(MPI) and 8x(thread) than it is with
>> 1x(MPI) and 16x(thread).
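
Since the first-touch policy comes up above: on Linux, a memory page is physically
allocated on the NUMA node of the thread that first writes to it, so initializing an
array with the same parallel loop and schedule that later operates on it keeps each
thread's chunk in local memory. A generic OpenMP sketch (not the PETSc
implementation) of such an allocation:

  #include <omp.h>
  #include <stdlib.h>

  static double *alloc_first_touch(long n)
  {
    double *v = malloc((size_t)n * sizeof(double));

    /* Parallel first touch: thread t faults in the pages of its own chunk,
       so they end up on the NUMA node that thread is pinned to. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) v[i] = 0.0;

    return v;
  }

A serial initialization loop would instead place all pages on the socket of the
initializing thread, and threads on the other socket would then be limited by
inter-socket traffic.
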
>>
>>>>
>>>> It doesn't surprise me that an explicit code gets much better speedup.
>>>
>>>
>>> The explicit code is much less dependent on memory bandwidth relative to
>>> floating point.
>>>
>>>>
>>>>
>>>>> I also get about the same performance results on the ex2 problem when
>>>>> running it with just
>>>>> mpi alone i.e. with 16 mpi processes.
>>>>>
>>>>> So from my perspective, the new pthreads/openmp support is looking
>>>>> pretty good assuming
>>>>> the issue with the MKL/external packages interaction can be fixed.
>>>>>
>>>>> I was just using jacobi preconditioning for ex2.  I'm wondering if there
>>>>> are any other preconditioners
>>>>> that might be multi-threaded.  Or maybe a polynomial preconditioner
>>>>> could work well for the
>>>>> new pthreads/openmp support.
>>>>
>>>> GAMG with SOR smoothing seems like a prime candidate for threading.  I
>>>> wonder if anybody has worked on this yet?
>>>
>>>
>>> SOR is not great because it's sequential.
>>
>> For structured grids we have multi-color schemes and temporally
>> blocked schemes as in this paper,
>>
>> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
>>
>> For unstructured grids, could we do some analogous decomposition using
>> e.g. parmetis?
>>
>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
>>
>> Regards,
>> John
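
As an illustration of the multi-color idea John mentions for structured grids, here is a
red-black Gauss-Seidel sweep for the 2D 5-point Laplacian (a generic sketch, not
PETSc code): points of one color depend only on points of the other color, so each
color can be updated by all threads at once.

  #include <omp.h>

  /* u and f are (n+2) x (n+2) row-major arrays including the boundary layer;
     h2 is the squared mesh width.  One call performs a full red-black sweep. */
  static void rb_gauss_seidel_sweep(double *u, const double *f, int n, double h2)
  {
    for (int color = 0; color < 2; color++) {
      #pragma omp parallel for schedule(static)
      for (int i = 1; i <= n; i++) {
        /* pick the starting j so that only one parity of (i + j) is touched */
        for (int j = 1 + (i + color) % 2; j <= n; j += 2) {
          int k = i * (n + 2) + j;
          u[k] = 0.25 * (u[k-1] + u[k+1] + u[k-(n+2)] + u[k+(n+2)] + h2 * f[k]);
        }
      }
    }
  }
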
>>
>>> A block Jacobi/SOR parallelizes
>>> fine, but does not guarantee stability without additional
>>> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
>>> with threads (but not all the kernels are ready).
>>>
>>> Coarsening and the Galerkin triple product are more difficult to thread.
>
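
For reference, the Chebyshev/Jacobi smoothing Jed mentions can be requested inside GAMG
with runtime options along the following lines; the option names are as in later PETSc
releases and should be treated as an illustration rather than something verified against
the petsc-dev snapshot discussed here:

  -pc_type gamg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi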



