[petsc-dev] Status of pthreads and OpenMP support
Nystrom, William D
wdn at lanl.gov
Wed Oct 31 11:41:57 CDT 2012
Shri,
Have you had a chance to investigate the issues related to the new PETSc threads
package and MKL?
Dave
________________________________________
From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Shri [abhyshr at mcs.anl.gov]
Sent: Friday, October 26, 2012 5:35 PM
To: For users of the development version of PETSc
Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
On Oct 26, 2012, at 3:08 PM, Nystrom, William D wrote:
> Are there any petsc examples that do cache blocking that would work for the new
> threads support?
I don't think there are any examples that can do cache blocking using threads.
> I was initially investigating DMDA but that looks like it only works
> for MPI processes. I was looking at ex34.c and ex45.c located in petsc-dev/src/ksp/ksp/examples/tutorials.
>
> Thanks,
>
> Dave
>
> ________________________________________
> From: Nystrom, William D
> Sent: Friday, October 26, 2012 10:53 AM
> To: Karl Rupp
> Cc: For users of the development version of PETSc; Nystrom, William D
> Subject: RE: [petsc-dev] Status of pthreads and OpenMP support
>
> Karli,
>
> Thanks. Sounds like I need to actually do the memory bandwidth calculation to get more
> quantitative.
>
> Thanks again,
>
> Dave
>
> ________________________________________
> From: Karl Rupp [rupp at mcs.anl.gov]
> Sent: Friday, October 26, 2012 10:47 AM
> To: Nystrom, William D
> Cc: For users of the development version of PETSc
> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>
> Hi,
>
>> Thanks for your reply. Doing the memory bandwidth calculation seems
>> like a useful exercise. I'll give that a try. I was also trying to think of
>> this from a higher level perspective. Does this seem reasonable?
>>
>> T_vec_op = T_vec_compute + T_vec_memory
>>
>> where these are times, but using multiple threads only speeds up the T_vec_compute part, while
>> T_vec_memory stays relatively constant whether the memory operations are done by a single
>> thread or by multiple threads.
>
> Yes and no :-)
> Due to possible multiple physical memory links and NUMA, T_vec_memory
> shows a dependence on the number and affinity of threads. Also,
>
> T_vec_op = max(T_vec_compute, T_vec_memory)
>
> can be a better approximation, as memory transfers and the actual
> arithmetic may overlap ('prefetching').
>
> Still, the main speed-up when using threads (or multiple processes) is
> in T_vec_compute. However, hardware processing speed has evolved such
> that T_vec_memory is now often dominant (exceptions are mostly BLAS
> level 3 algorithms), making proper data layout and affinity even more
> important.
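> 
> To make this concrete, here is a back-of-the-envelope sketch of the
> two-term estimate for y = y + alpha*x on n doubles; the bandwidth and
> flop-rate numbers are placeholder assumptions, not measurements of any
> particular machine:
> 
>   /* Rough estimate of T_vec_memory vs. T_vec_compute for a daxpy.
>      Bandwidth and flop rate are assumed placeholder values. */
>   #include <stdio.h>
> 
>   int main(void)
>   {
>     double n      = 1.0e7;      /* vector length (doubles)                 */
>     double bw     = 50.0e9;     /* assumed sustained memory bandwidth, B/s */
>     double rate   = 100.0e9;    /* assumed aggregate flop rate, flop/s     */
>     double bytes  = 3.0*8.0*n;  /* read x, read y, write y                 */
>     double T_mem  = bytes/bw;
>     double T_comp = 2.0*n/rate; /* one multiply and one add per entry      */
>     printf("T_mem = %g s, T_comp = %g s, T_op ~ %g s\n",
>            T_mem, T_comp, T_mem > T_comp ? T_mem : T_comp);
>     return 0;
>   }
> 
> With these (assumed) numbers the memory term is more than an order of
> magnitude larger than the compute term, so the max() is set almost
> entirely by T_vec_memory.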
>
> Best regards,
> Karli
>
>
>
>> ________________________________________
>> From: Karl Rupp [rupp at mcs.anl.gov]
>> Sent: Friday, October 26, 2012 10:20 AM
>> To: For users of the development version of PETSc
>> Cc: Nystrom, William D
>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>
>> Hi Dave,
>>
>> let me just comment on the expected speed-up: As the arithmetic
>> intensity of vector operations is small, you are in a memory-bandwidth
>> limited regime. If you use smaller vectors in order to stay in cache,
>> you may still not obtain the expected speedup because then thread
>> management overhead becomes more of an issue. I suggest you compute the
>> effective memory bandwidth of your vector operations, because I suspect
>> you are pretty close to bandwidth saturation already.
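>> 
>> A minimal sketch of such a measurement (the vector length, the repeat
>> count and the 3*8*n byte count assumed for VecAXPY are my own choices,
>> and error checking is omitted for brevity):
>> 
>>   #include <petscvec.h>
>> 
>>   int main(int argc, char **argv)
>>   {
>>     Vec      x, y;
>>     PetscInt n = 10000000, i, nreps = 50;
>>     double   t0, t1, bytes;
>> 
>>     PetscInitialize(&argc, &argv, NULL, NULL);
>>     VecCreate(PETSC_COMM_WORLD, &x);
>>     VecSetSizes(x, PETSC_DECIDE, n);
>>     VecSetFromOptions(x);
>>     VecDuplicate(x, &y);
>>     VecSet(x, 1.0);
>>     VecSet(y, 2.0);
>>     VecAXPY(y, 3.14, x);                     /* warm-up (also first touch) */
>>     t0 = MPI_Wtime();
>>     for (i = 0; i < nreps; i++) VecAXPY(y, 3.14, x);
>>     t1 = MPI_Wtime();
>>     bytes = 3.0*8.0*(double)n*(double)nreps; /* read x, read y, write y    */
>>     PetscPrintf(PETSC_COMM_WORLD, "effective bandwidth: %g GB/s\n",
>>                 1.0e-9*bytes/(t1 - t0));
>>     VecDestroy(&x);
>>     VecDestroy(&y);
>>     PetscFinalize();
>>     return 0;
>>   }
>> 
>> Comparing the printed number against the node's nominal memory bandwidth
>> should show how close to saturation the vector kernels already are.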
>>
>> Best regards,
>> Karli
>>
>>
>> On 10/26/2012 10:58 AM, Nystrom, William D wrote:
>>> Jed or Shri,
>>>
>>> Are there other preconditioners I could use/try now with the petsc thread support besides jacobi?
>>> I looked around in the documentation for something like least squares polynomial preconditioning
>>> that is referenced in a paper by Li and Saad titled "GPU-Accelerated Preconditioned Iterative
>>> Linear Solvers" but did not find anything like that. Would block jacobi with lu/cholesky for the
>>> block solves work with the current thread support?
>>>
>>> Regarding the performance of my recent runs, I was surprised that I was not getting closer to a
>>> 16x speedup for the purely vector operations when using 16 threads compared to 1 thread. I'm
>>> running on a single node of a cluster where the nodes are dual-socket Sandy Bridge CPUs and
>>> the OS is TOSS 2 Linux from Livermore. So I'm assuming that is not really an "unknown" sort
>>> of system. One thing I am wondering is whether there is an issue with my thread affinities. I am
>>> setting them but am wondering if there could be issues with which chunk of a vector a given
>>> thread gets. For instance, assuming a single MPI process on a single node and using 16 threads,
>>> I would assume that the vector occupies a contiguous chunk of memory and that it will get divided
>>> into 16 chunks. If thread 13 is the first to launch, does it get the first chunk of the vector or the
>>> 13th chunk of the vector? If the latter, then I would think my assignment of thread affinities is
>>> optimal. If my thread assignment is optimal, then is the less than 16x speedup in the vector
>>> operations because of memory bandwidth limitations or cache effects?
>>>
>>> What profiling tools do you recommend to use with petsc? I have investigated and tried Open|SpeedShop,
>>> HPCToolkit and TAU but have not tried any with petsc. I was told that there were some issues with
>>> using TAU with petsc. Not sure what they are. So far, I have liked TAU best.
>>>
>>> Dave
>>>
>>> ________________________________________
>>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
>>> Sent: Friday, October 26, 2012 7:47 AM
>>> To: For users of the development version of PETSc
>>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
>>>
>>> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
>>>>>
>>>>> What I see in your results is about 7x speedup by using 16 threads. I
>>>>> think you should get better results by running 8 threads with 2
>>>>> processes because the memory can be allocated on separate memory
>>>>> controllers, and the memory will be physically closer to the cores.
>>>>> I'm surprised that you get worse results.
>>>>
>>>>
>>>> Our intent is for the threads to use an explicit first-touch policy so that
>>>> they get local memory even when you have threads across multiple NUMA zones.
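>>>> 
>>>> Purely as an illustration of the first-touch idea (not the PETSc code
>>>> itself): each thread initializes the slice of the array it will later
>>>> work on, so the OS maps those pages into that thread's NUMA domain.
>>>> 
>>>>   /* Illustrative sketch only; compile with OpenMP enabled. */
>>>>   #include <stdlib.h>
>>>> 
>>>>   double *alloc_first_touch(long n)
>>>>   {
>>>>     double *a = malloc((size_t)n * sizeof(double));
>>>>     long    i;
>>>>     /* The static schedule should match the one used later by the
>>>>        compute kernels, so each thread touches the pages it will use. */
>>>>     #pragma omp parallel for schedule(static)
>>>>     for (i = 0; i < n; i++) a[i] = 0.0;
>>>>     return a;
>>>>   }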
>>>
>>> Great. I still think the performance using jacobi (as Dave does)
>>> should be no worse using 2x(MPI) and 8x(thread) than it is with
>>> 1x(MPI) and 16x(thread).
>>>
>>>>>
>>>>> It doesn't surprise me that an explicit code gets much better speedup.
>>>>
>>>>
>>>> The explicit code is much less dependent on memory bandwidth relative to
>>>> floating point.
>>>>
>>>>>
>>>>>
>>>>>> I also get about the same performance results on the ex2 problem when
>>>>>> running it with just
>>>>>> MPI alone, i.e. with 16 MPI processes.
>>>>>>
>>>>>> So from my perspective, the new pthreads/openmp support is looking
>>>>>> pretty good assuming
>>>>>> the issue with the MKL/external packages interaction can be fixed.
>>>>>>
>>>>>> I was just using jacobi preconditioning for ex2. I'm wondering if there
>>>>>> are any other preconditioners
>>>>>> that might be multi-threaded. Or maybe a polynomial preconditioner
>>>>>> could work well for the
>>>>>> new pthreads/openmp support.
>>>>>
>>>>> GAMG with SOR smoothing seems like a prime candidate for threading. I
>>>>> wonder if anybody has worked on this yet?
>>>>
>>>>
>>>> SOR is not great because it's sequential.
>>>
>>> For structured grids we have multi-color schemes and temporally
>>> blocked schemes as in this paper,
>>>
>>> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
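>>> 
>>> As a sketch of the multi-color idea (my own illustration, not taken from
>>> the paper): a red-black Gauss-Seidel sweep for a 2D 5-point Laplacian,
>>> where each half-sweep only reads points of the other color and therefore
>>> parallelizes cleanly over threads:
>>> 
>>>   /* Red-black Gauss-Seidel sweep on an nx-by-ny grid for the 5-point
>>>      Laplacian; h2 is the squared mesh width.  Illustrative only. */
>>>   void rb_gs_sweep(int nx, int ny, double *u, const double *f, double h2)
>>>   {
>>>     int color, i, j;
>>>     for (color = 0; color < 2; color++) {
>>>       #pragma omp parallel for private(i)
>>>       for (j = 1; j < ny - 1; j++) {
>>>         for (i = 1 + (j + color) % 2; i < nx - 1; i += 2) {
>>>           int k = j*nx + i;
>>>           u[k] = 0.25*(u[k-1] + u[k+1] + u[k-nx] + u[k+nx] + h2*f[k]);
>>>         }
>>>       }
>>>     }
>>>   }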
>>>
>>> For unstructured grids, could we do some analogous decomposition using
>>> e.g. parmetis?
>>>
>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
>>>
>>> Regards,
>>> John
>>>
>>>> A block Jacobi/SOR parallelizes
>>>> fine, but does not guarantee stability without additional
>>>> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
>>>> with threads (but not all the kernels are ready).
>>>>
>>>> Coarsening and the Galerkin triple product is more difficult to thread.
>>
>