[petsc-dev] Status of pthreads and OpenMP support

Shri abhyshr at mcs.anl.gov
Fri Oct 26 18:35:32 CDT 2012


On Oct 26, 2012, at 3:08 PM, Nystrom, William D wrote:

> Are there any PETSc examples that do cache blocking and would work with the new
> threads support?

I don't think there are any examples that can do cache blocking using threads.

> I was initially investigating DMDA, but that looks like it only works
> for MPI processes.  I was looking at ex34.c and ex45.c located in petsc-dev/src/ksp/ksp/examples/tutorials.
> 
> Thanks,
> 
> Dave
> 
> ________________________________________
> From: Nystrom, William D
> Sent: Friday, October 26, 2012 10:53 AM
> To: Karl Rupp
> Cc: For users of the development version of PETSc; Nystrom, William D
> Subject: RE: [petsc-dev] Status of pthreads and OpenMP support
> 
> Karli,
> 
> Thanks.  Sounds like I need to actually do the memory bandwidth calculation to get more
> quantitative.
> 
> Thanks again,
> 
> Dave
> 
> ________________________________________
> From: Karl Rupp [rupp at mcs.anl.gov]
> Sent: Friday, October 26, 2012 10:47 AM
> To: Nystrom, William D
> Cc: For users of the development version of PETSc
> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
> 
> Hi,
> 
>> Thanks for your reply.  Doing the memory bandwidth calculation seems
>> like a useful exercise.  I'll
>> give that a try.  I was also trying to think of this from a higher level perspective.  Does this seem
>> reasonable?
>> 
>> T_vec_op = T_vec_compute + T_vec_memory
>> 
>> where these are times but using multiple threads only speeds up the T_vec_compute part while
>> T_vec_memory is relatively constant whether I am doing memory operations with a single thread
>> or multiple threads.
> 
> Yes and no :-)
> Due to the possible presence of multiple physical memory links and NUMA, T_vec_memory
> shows a dependence on the number and affinity of threads. Also,
> 
>  T_vec_op = max(T_vec_compute, T_vec_memory)
> 
> can be a better approximation, as memory transfers and actual
> arithmetic may overlap ('prefetching').
> 
> Still, the main speed-up when using threads (or multiple processes) is
> in T_vec_compute. However, hardware processing speed has evolved such
> that T_vec_memory is now often dominant (exceptions are mostly BLAS
> level 3 algorithms), making proper data layout and affinity even more
> important.
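> 
> As a rough illustration of why (the numbers below are made up, purely to
> show the arithmetic): a VecAXPY on N doubles does 2*N flops but moves
> roughly 3*8*N bytes (read x, read y, write y).  On a socket with, say,
> 40 GB/s of memory bandwidth and 100 GFLOP/s of arithmetic peak,
> 
>  T_vec_compute ~ 2*N / 100e9      T_vec_memory ~ 24*N / 40e9
> 
> so T_vec_memory is about 30x larger and T_vec_op is essentially just
> T_vec_memory, regardless of how many threads share that socket.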
> 
> Best regards,
> Karli
> 
> 
> 
>> ________________________________________
>> From: Karl Rupp [rupp at mcs.anl.gov]
>> Sent: Friday, October 26, 2012 10:20 AM
>> To: For users of the development version of PETSc
>> Cc: Nystrom, William D
>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>> 
>> Hi Dave,
>> 
>> let me just comment on the expected speed-up: As the arithmetic
>> intensity of vector operations is small, you are in a memory-bandwidth
>> limited regime. If you use smaller vectors in order to stay in cache,
>> you may still not obtain the expected speedup because then thread
>> management overhead becomes more of an issue. I suggest you compute the
>> effective memory bandwidth of your vector operations, because I suspect
>> you are pretty close to bandwidth saturation already.
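>> 
>> For reference, here is a minimal sketch of such a measurement (not from
>> this thread; the vector length, repetition count, and the 3*8*N byte
>> estimate for VecAXPY are assumptions):
>> 
>>   #include <petscvec.h>
>> 
>>   int main(int argc, char **argv)
>>   {
>>     Vec         x, y;
>>     PetscInt    n = 10000000, i, nreps = 100;
>>     PetscScalar alpha = 2.0;
>>     double      t0, t1;
>> 
>>     PetscInitialize(&argc, &argv, NULL, NULL);
>>     VecCreate(PETSC_COMM_WORLD, &x);
>>     VecSetSizes(x, PETSC_DECIDE, n);
>>     VecSetFromOptions(x);
>>     VecDuplicate(x, &y);
>>     VecSet(x, 1.0);
>>     VecSet(y, 2.0);
>> 
>>     t0 = MPI_Wtime();
>>     for (i = 0; i < nreps; i++) VecAXPY(y, alpha, x);  /* y <- y + alpha*x */
>>     t1 = MPI_Wtime();
>> 
>>     /* VecAXPY streams roughly 3*8*n bytes per call (read x, read y, write y) */
>>     PetscPrintf(PETSC_COMM_WORLD, "effective bandwidth: %g GB/s\n",
>>                 3.0 * 8.0 * n * nreps / (t1 - t0) / 1e9);
>> 
>>     VecDestroy(&x);
>>     VecDestroy(&y);
>>     PetscFinalize();
>>     return 0;
>>   }
>> 
>> Comparing the printed number against the node's STREAM bandwidth shows
>> how close the vector kernels already are to saturation.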
>> 
>> Best regards,
>> Karli
>> 
>> 
>> On 10/26/2012 10:58 AM, Nystrom, William D wrote:
>>> Jed or Shri,
>>> 
>>> Are there other preconditioners I could use/try now with the petsc thread support besides jacobi?
>>> I looked around in the documentation for something like least squares polynomial preconditioning
>>> that is referenced in a paper by Li and Saad titled "GPU-Accelerated Preconditioned Iterative
>>> Linear Solvers" but did not find anything like that.  Would block jacobi with lu/cholesky for the
>>> block solves work with the current thread support?
>>> 
>>> Regarding the performance of my recent runs, I was surprised that I was not getting closer to a
>>> 16x speedup for the pure vector operations when using 16 threads compared to 1 thread.  I'm
>>> running on a single node of a cluster where the nodes are dual-socket Sandy Bridge CPUs and
>>> the OS is TOSS 2 Linux from Livermore.  So I'm assuming that is not really an "unknown" sort
>>> of system.  One thing I am wondering is whether there is an issue with my thread affinities.  I am
>>> setting them but am wondering if there could be issues with which chunk of a vector a given
>>> thread gets.  For instance, assuming a single MPI process on a single node and using 16 threads,
>>> I would assume that the vector occupies a contiguous chunk of memory and that it will get divided
>>> into 16 chunks.  If thread 13 is the first to launch, does it get the first chunk of the vector or the
>>> 13th chunk of the vector?  If the latter, then I would think my assignment of thread affinities is
>>> optimal.  If my thread assignment is optimal, then is the less than 16x speedup in the vector
>>> operations because of memory bandwidth limitations or cache effects?
>>> 
>>> What profiling tools do you recommend using with PETSc?  I have investigated and tried Open|SpeedShop,
>>> HPCToolkit and TAU but have not tried any with PETSc.  I was told that there were some issues with
>>> using TAU with PETSc.  Not sure what they are.  So far, I have liked TAU best.
>>> 
>>> Dave
>>> 
>>> ________________________________________
>>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
>>> Sent: Friday, October 26, 2012 7:47 AM
>>> To: For users of the development version of PETSc
>>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>> 
>>> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>>>> 
>>>>> What I see in your results is about 7x speedup by using 16 threads.  I
>>>>> think you should get better results by running 8 threads with 2
>>>>> processes because the memory can be allocated on separate memory
>>>>> controllers, and the memory will be physically closer to the cores.
>>>>> I'm surprised that you get worse results.
>>>> 
>>>> 
>>>> Our intent is for the threads to use an explicit first-touch policy so that
>>>> they get local memory even when you have threads across multiple NUMA zones.
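>>>> 
>>>> (For illustration only, not PETSc's actual code: the generic first-touch
>>>> pattern described above looks like this in plain OpenMP.  Pages are
>>>> faulted in by the thread that will later use them, so with a static
>>>> schedule each thread computes on memory local to its NUMA node.)
>>>> 
>>>>   #include <stdlib.h>
>>>> 
>>>>   void axpy_first_touch(int n, double alpha)
>>>>   {
>>>>     double *x = malloc(n * sizeof(double));
>>>>     double *y = malloc(n * sizeof(double));
>>>>     int i;
>>>> 
>>>>     /* first touch: each page ends up near the thread that touches it */
>>>>     #pragma omp parallel for schedule(static)
>>>>     for (i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
>>>> 
>>>>     /* same static schedule -> each thread reads/writes local memory */
>>>>     #pragma omp parallel for schedule(static)
>>>>     for (i = 0; i < n; i++) y[i] += alpha * x[i];
>>>> 
>>>>     free(x); free(y);
>>>>   }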
>>> 
>>> Great.  I still think the performance using jacobi (as Dave does)
>>> should be no worse using 2x(MPI) and 8x(thread) than it is with
>>> 1x(MPI) and 16x(thread).
>>> 
>>>>> 
>>>>> It doesn't surprise me that an explicit code gets much better speedup.
>>>> 
>>>> 
>>>> The explicit code is much less dependent on memory bandwidth relative to
>>>> floating point.
>>>> 
>>>>> 
>>>>> 
>>>>>> I also get about the same performance results on the ex2 problem when
>>>>>> running it with just
>>>>>> mpi alone i.e. with 16 mpi processes.
>>>>>> 
>>>>>> So from my perspective, the new pthreads/openmp support is looking
>>>>>> pretty good assuming
>>>>>> the issue with the MKL/external packages interaction can be fixed.
>>>>>> 
>>>>>> I was just using jacobi preconditioning for ex2.  I'm wondering if there
>>>>>> are any other preconditioners
>>>>>> that might be multi-threaded.  Or maybe a polynomial preconditioner
>>>>>> could work well for the
>>>>>> new pthreads/openmp support.
>>>>> 
>>>>> GAMG with SOR smoothing seems like a prime candidate for threading.  I
>>>>> wonder if anybody has worked on this yet?
>>>> 
>>>> 
>>>> SOR is not great because it's sequential.
>>> 
>>> For structured grids we have multi-color schemes and temporally
>>> blocked schemes as in this paper,
>>> 
>>> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
>>> 
>>> For unstructured grids, could we do some analogous decomposition using
>>> e.g. parmetis?
>>> 
>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
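>>> 
>>> As a toy sketch of the multi-color idea for structured grids (not code
>>> from either reference above): red-black Gauss-Seidel on a 2D 5-point
>>> stencil, where every point of one color depends only on points of the
>>> other color, so each color sweep can be threaded:
>>> 
>>>   void rb_gs_sweep(int nx, int ny, double *u, const double *f, double h2)
>>>   {
>>>     int color, i, j;
>>>     for (color = 0; color < 2; color++) {
>>>       /* points of the same color are never neighbors, so this loop is parallel */
>>>       #pragma omp parallel for private(i)
>>>       for (j = 1; j < ny - 1; j++) {
>>>         for (i = 1 + (j + color) % 2; i < nx - 1; i += 2) {
>>>           u[j*nx + i] = 0.25 * (u[j*nx + i - 1] + u[j*nx + i + 1] +
>>>                                 u[(j-1)*nx + i] + u[(j+1)*nx + i] +
>>>                                 h2 * f[j*nx + i]);
>>>         }
>>>       }
>>>     }
>>>   }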
>>> 
>>> Regards,
>>> John
>>> 
>>>> A block Jacobi/SOR parallelizes
>>>> fine, but does not guarantee stability without additional
>>>> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
>>>> with threads (but not all the kernels are ready).
>>>> 
>>>> Coarsening and the Galerkin triple product are more difficult to thread.
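>>>> 
>>>> (For reference, a Chebyshev/Jacobi smoother inside GAMG can be selected
>>>> with options along these lines; the exact spellings below are assumed
>>>> rather than quoted from this thread:
>>>> 
>>>>   -pc_type gamg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi
>>>> 
>>>> )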
>> 
> 



