[petsc-dev] Status of pthreads and OpenMP support

Shri abhyshr at mcs.anl.gov
Wed Oct 31 13:15:11 CDT 2012


Dave,
   I configured PETSc with MKL on our machine and tested ex2 with the options you sent (./ex2 -pc_type jacobi -m 1000 -n 1000 -threadcomm_type pthread -threadcomm_nthreads 1). However, I could not reproduce the problem you encountered; using more threads did not trigger it either. What configure options did you use?

Shri
On Oct 31, 2012, at 11:41 AM, Nystrom, William D wrote:

> Shri,
> 
> Have you had a chance to investigate the issues related to the new PETSc threads
> package and MKL?
> 
> Dave
> 
> ________________________________________
> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Shri [abhyshr at mcs.anl.gov]
> Sent: Friday, October 26, 2012 5:35 PM
> To: For users of the development version of PETSc
> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
> 
> On Oct 26, 2012, at 3:08 PM, Nystrom, William D wrote:
> 
>> Are there any petsc examples that do cache blocking that would work for the new
>> threads support?
> 
> I don't think there are any examples that can do cache blocking using threads.
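> 
> For context, "cache blocking" just means tiling a sweep so that each thread reuses a
> cache-sized block of data before moving on.  Below is a minimal generic sketch of the
> idea (not taken from any PETSc example); the tile size BS is a placeholder you would
> tune, and the tiles could be handed out to threads in any order:
> 
>   #define BS 64   /* tile edge length, chosen so a BS x BS tile fits in cache */
> 
>   /* Cache-blocked transpose: each BS x BS tile of A is reused while it is still
>      resident in cache, instead of striding through entire columns of A. */
>   void transpose_blocked(int n, const double *A, double *B)
>   {
>     for (int ii = 0; ii < n; ii += BS)
>       for (int jj = 0; jj < n; jj += BS)
>         for (int i = ii; i < ii + BS && i < n; i++)
>           for (int j = jj; j < jj + BS && j < n; j++)
>             B[j*n + i] = A[i*n + j];
>   }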
> 
>> I was initially investigating DMDA but that looks like it only works
>> for MPI processes.  I was looking at ex34.c and ex45.c located in petsc-dev/src/ksp/ksp/examples/tutorials.
>> 
>> Thanks,
>> 
>> Dave
>> 
>> ________________________________________
>> From: Nystrom, William D
>> Sent: Friday, October 26, 2012 10:53 AM
>> To: Karl Rupp
>> Cc: For users of the development version of PETSc; Nystrom, William D
>> Subject: RE: [petsc-dev] Status of pthreads and OpenMP support
>> 
>> Karli,
>> 
>> Thanks.  Sounds like I need to actually do the memory bandwidth calculation to be more
>> quantitative.
>> 
>> Thanks again,
>> 
>> Dave
>> 
>> ________________________________________
>> From: Karl Rupp [rupp at mcs.anl.gov]
>> Sent: Friday, October 26, 2012 10:47 AM
>> To: Nystrom, William D
>> Cc: For users of the development version of PETSc
>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>> 
>> Hi,
>> 
>>> Thanks for your reply.  Doing the memory bandwidth calculation seems
>>> like a useful exercise.  I'll give that a try.  I was also trying to think
>>> of this from a higher level perspective.  Does this seem reasonable?
>>> 
>>> T_vec_op = T_vec_compute + T_vec_memory
>>> 
>>> where these are times, but using multiple threads only speeds up the T_vec_compute part, while
>>> T_vec_memory stays relatively constant whether I do the memory operations with a single thread
>>> or with multiple threads.
>> 
>> Yes and no :-)
>> Due to possible multiple physical memory links and NUMA, T_vec_memory
>> shows a dependence on the number and affinity of threads. Also,
>> 
>> T_vec_op = max(T_vec_compute, T_vec_memory)
>> 
>> can be a better approximation, as memory transfers and actual
>> arithmetic may overlap ('prefetching').
>> 
>> Still, the main speed-up when using threads (or multiple processes) is
>> in T_vec_compute. However, hardware processing speed has evolved such
>> that T_vec_memory is now often dominant (exceptions are mostly BLAS
>> level 3 algorithms), making proper data layout and affinity even more
>> important.
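>> 
>> As a concrete version of the bandwidth calculation mentioned above, here is a rough
>> sketch of timing repeated VecAXPYs and converting that to an effective bandwidth.
>> The 3*n*sizeof(PetscScalar) traffic estimate (read x, read y, write y) and the
>> repetition count are assumptions, and error checking is omitted:
>> 
>>   #include <petscvec.h>
>>   #include <petsctime.h>
>> 
>>   /* Time 'reps' VecAXPYs and report the effective memory bandwidth, assuming
>>      each call streams roughly 3*n*sizeof(PetscScalar) bytes. */
>>   static PetscErrorCode ReportAXPYBandwidth(Vec x, Vec y, PetscInt reps)
>>   {
>>     PetscLogDouble t0, t1;
>>     PetscInt       n;
>> 
>>     VecGetSize(x, &n);
>>     PetscTime(&t0);
>>     for (PetscInt i = 0; i < reps; i++) VecAXPY(y, 1.0, x);   /* y <- y + x */
>>     PetscTime(&t1);
>>     PetscPrintf(PETSC_COMM_WORLD, "effective bandwidth: %g GB/s\n",
>>                 3.0*n*sizeof(PetscScalar)*reps/((t1 - t0)*1.e9));
>>     return 0;
>>   }
>> 
>> Comparing that number against what the node can actually sustain (e.g. a STREAM-like
>> measurement) tells you how much headroom is left for additional threads.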
>> 
>> Best regards,
>> Karli
>> 
>> 
>> 
>>> ________________________________________
>>> From: Karl Rupp [rupp at mcs.anl.gov]
>>> Sent: Friday, October 26, 2012 10:20 AM
>>> To: For users of the development version of PETSc
>>> Cc: Nystrom, William D
>>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>> 
>>> Hi Dave,
>>> 
>>> let me just comment on the expected speed-up: As the arithmetic
>>> intensity of vector operations is small, you are in a memory-bandwidth
>>> limited regime. If you use smaller vectors in order to stay in cache,
>>> you may still not obtain the expected speedup because then thread
>>> management overhead becomes more of an issue. I suggest you compute the
>>> effective memory bandwidth of your vector operations, because I suspect
>>> you are pretty close to bandwidth saturation already.
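>>> 
>>> To judge how close to saturation you are, it helps to have a rough number for the
>>> bandwidth the node can actually sustain.  Here is a minimal STREAM-triad-like loop
>>> (OpenMP only so that all cores participate); the array size and the 3*N*sizeof(double)
>>> traffic count are assumptions, and a careful measurement would repeat the timed loop
>>> and take the best time:
>>> 
>>>   #include <stdio.h>
>>>   #include <stdlib.h>
>>>   #include <omp.h>
>>> 
>>>   int main(void)
>>>   {
>>>     const long N = 50000000;                     /* large enough to defeat caches */
>>>     double *a = malloc(N * sizeof(double));
>>>     double *b = malloc(N * sizeof(double));
>>>     double *c = malloc(N * sizeof(double));
>>> 
>>>     /* initialize in parallel so pages tend to land near the threads that use them */
>>>     #pragma omp parallel for
>>>     for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }
>>> 
>>>     double t = omp_get_wtime();
>>>     #pragma omp parallel for
>>>     for (long i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];   /* triad */
>>>     t = omp_get_wtime() - t;
>>> 
>>>     printf("triad bandwidth: %.1f GB/s\n", 3.0 * N * sizeof(double) / t / 1e9);
>>>     free(a); free(b); free(c);
>>>     return 0;
>>>   }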
>>> 
>>> Best regards,
>>> Karli
>>> 
>>> 
>>> On 10/26/2012 10:58 AM, Nystrom, William D wrote:
>>>> Jed or Shri,
>>>> 
>>>> Are there other preconditioners I could use/try now with the petsc thread support besides jacobi?
>>>> I looked around in the documentation for something like least squares polynomial preconditioning
>>>> that is referenced in a paper by Li and Saad titled "GPU-Accelerated Preconditioned Iterative
>>>> Linear Solvers" but did not find anything like that.  Would block Jacobi with LU/Cholesky for the
>>>> block solves work with the current thread support?
>>>> 
>>>> Regarding the performance of my recent runs, I was surprised that I was not getting closer to a
>>>> 16x speedup for the purely vector operations when using 16 threads compared to 1 thread.  I'm
>>>> running on a single node of a cluster where the nodes have dual-socket Sandy Bridge CPUs and
>>>> the OS is TOSS 2 Linux from Livermore.  So I'm assuming that is not really an "unknown" sort
>>>> of system.  One thing I am wondering is whether there is an issue with my thread affinities.  I am
>>>> setting them but am wondering if there could be issues with which chunk of a vector a given
>>>> thread gets.  For instance, assuming a single MPI process on a single node and using 16 threads,
>>>> I would assume that the vector occupies a contiguous chunk of memory and that it will get divided
>>>> into 16 chunks.  If thread 13 is the first to launch, does it get the first chunk of the vector or the
>>>> 13th chunk of the vector?  If the latter, then I would think my assignment of thread affinities is
>>>> optimal.  If my thread assignment is optimal, then is the less than 16x speedup in the vector
>>>> operations because of memory bandwidth limitations or cache effects?
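>>>> 
>>>> For what it's worth, the layout I have in mind is a uniform block split where thread k
>>>> always owns the k-th contiguous chunk, independent of launch order, and each thread
>>>> touches its own chunk first so that those pages end up on its NUMA node.  The following
>>>> is only a generic pthread sketch of that idea, not how the PETSc threadcomm actually
>>>> assigns chunks (affinity pinning, e.g. via pthread_setaffinity_np, is omitted for brevity):
>>>> 
>>>>   #include <pthread.h>
>>>>   #include <stdlib.h>
>>>> 
>>>>   #define NTHREADS 16
>>>> 
>>>>   typedef struct { double *v; size_t n; int rank; } ChunkArg;
>>>> 
>>>>   /* Thread 'rank' owns [rank*n/NTHREADS, (rank+1)*n/NTHREADS): the same range every
>>>>      time, no matter which thread happens to start running first.  Writing it here
>>>>      ("first touch") places those pages on the owning thread's NUMA node. */
>>>>   static void *init_chunk(void *p)
>>>>   {
>>>>     ChunkArg *a     = (ChunkArg *)p;
>>>>     size_t    start = (size_t)a->rank * a->n / NTHREADS;
>>>>     size_t    end   = (size_t)(a->rank + 1) * a->n / NTHREADS;
>>>>     for (size_t i = start; i < end; i++) a->v[i] = 0.0;
>>>>     return NULL;
>>>>   }
>>>> 
>>>>   int main(void)
>>>>   {
>>>>     size_t    n = 1000000;
>>>>     double   *v = malloc(n * sizeof(double));
>>>>     pthread_t tid[NTHREADS];
>>>>     ChunkArg  arg[NTHREADS];
>>>> 
>>>>     for (int k = 0; k < NTHREADS; k++) {
>>>>       arg[k] = (ChunkArg){v, n, k};
>>>>       pthread_create(&tid[k], NULL, init_chunk, &arg[k]);
>>>>     }
>>>>     for (int k = 0; k < NTHREADS; k++) pthread_join(tid[k], NULL);
>>>>     free(v);
>>>>     return 0;
>>>>   }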
>>>> 
>>>> What profiling tools do you recommend for use with PETSc?  I have investigated and tried Openspeedshop,
>>>> HPC Toolkit and Tau but have not tried any of them with PETSc.  I was told that there were some issues with
>>>> using Tau with PETSc.  Not sure what they are.  So far, I have liked Tau best.
>>>> 
>>>> Dave
>>>> 
>>>> ________________________________________
>>>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
>>>> Sent: Friday, October 26, 2012 7:47 AM
>>>> To: For users of the development version of PETSc
>>>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>>> 
>>>> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>>>>> 
>>>>>> What I see in your results is about 7x speedup by using 16 threads.  I
>>>>>> think you should get better results by running 8 threads with 2
>>>>>> processes because the memory can be allocated on separate memory
>>>>>> controllers, and the memory will be physically closer to the cores.
>>>>>> I'm surprised that you get worse results.
>>>>> 
>>>>> 
>>>>> Our intent is for the threads to use an explicit first-touch policy so that
>>>>> they get local memory even when you have threads across multiple NUMA zones.
>>>> 
>>>> Great.  I still think the performance using jacobi (as Dave does)
>>>> should be no worse using 2x(MPI) and 8x(thread) than it is with
>>>> 1x(MPI) and 16x(thread).
>>>> 
>>>>>> 
>>>>>> It doesn't surprise me that an explicit code gets much better speedup.
>>>>> 
>>>>> 
>>>>> The explicit code is much less dependent on memory bandwidth relative to
>>>>> floating point.
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> I also get about the same performance results on the ex2 problem when
>>>>>>> running it with just
>>>>>>> MPI alone, i.e. with 16 MPI processes.
>>>>>>> 
>>>>>>> So from my perspective, the new pthreads/OpenMP support is looking
>>>>>>> pretty good assuming
>>>>>>> the issue with the MKL/external packages interaction can be fixed.
>>>>>>> 
>>>>>>> I was just using jacobi preconditioning for ex2.  I'm wondering if there
>>>>>>> are any other preconditioners
>>>>>>> that might be multi-threaded.  Or maybe a polynomial preconditioner
>>>>>>> could work well for the
>>>>>>> new pthreads/OpenMP support.
>>>>>> 
>>>>>> GAMG with SOR smoothing seems like a prime candidate for threading.  I
>>>>>> wonder if anybody has worked on this yet?
>>>>> 
>>>>> 
>>>>> SOR is not great because it's sequential.
>>>> 
>>>> For structured grids we have multi-color schemes and temporally
>>>> blocked schemes as in this paper,
>>>> 
>>>> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
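>>>> 
>>>> The basic structured-grid trick is the classic red-black ordering: points of one color
>>>> depend only on points of the other color, so all updates within a color are independent
>>>> and can be threaded.  A minimal sketch for a 2D 5-point Laplacian follows (generic
>>>> illustration only, not code from PETSc or from the paper above; omega is the usual SOR
>>>> relaxation factor and f is assumed to already include the h^2 scaling):
>>>> 
>>>>   /* One red-black SOR sweep on an n x n grid with a 5-point Laplacian and Dirichlet
>>>>      boundary values in the outer ring.  Within one color, no update reads a value
>>>>      written by another update of the same color, so the i-loop can be threaded,
>>>>      e.g. with "#pragma omp parallel for". */
>>>>   static void sor_rb_sweep(int n, double *u, const double *f, double omega)
>>>>   {
>>>>     for (int color = 0; color < 2; color++) {
>>>>       #pragma omp parallel for
>>>>       for (int i = 1; i < n - 1; i++) {
>>>>         for (int j = 1; j < n - 1; j++) {
>>>>           if ((i + j) % 2 != color) continue;      /* only this color's points */
>>>>           double gs = 0.25 * (u[(i-1)*n+j] + u[(i+1)*n+j] +
>>>>                               u[i*n+j-1]   + u[i*n+j+1]   + f[i*n+j]);
>>>>           u[i*n+j] = (1.0 - omega) * u[i*n+j] + omega * gs;
>>>>         }
>>>>       }
>>>>     }
>>>>   }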
>>>> 
>>>> For unstructured grids, could we do some analogous decomposition using
>>>> e.g. ParMETIS?
>>>> 
>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
>>>> 
>>>> Regards,
>>>> John
>>>> 
>>>>> A block Jacobi/SOR parallelizes
>>>>> fine, but does not guarantee stability without additional
>>>>> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
>>>>> with threads (but not all the kernels are ready).
>>>>> 
>>>>> Coarsening and the Galerkin triple product is more difficult to thread.
>>> 
>> 
> 



