[petsc-dev] Status of pthreads and OpenMP support

Dave Nystrom dnystrom1 at comcast.net
Wed Oct 31 13:25:33 CDT 2012


I'm on travel today but could send you that info tomorrow when I'm back in my
office.

Dave

Shri writes:
 > Dave,
 > 
 > I configured PETSc with MKL on our machine and tested ex2 using the
 > options you sent (./ex2 -pc_type jacobi -m 1000 -n 1000 -threadcomm_type
 > pthread -threadcomm_nthreads 1). However, I could not reproduce the
 > problem you encountered. Using more threads did not reproduce it
 > either. What configure options did you use?
 > 
 > Shri
 > 
 > On Oct 31, 2012, at 11:41 AM, Nystrom, William D wrote:
 > 
 > > Shri,
 > > 
 > > Have you had a chance to investigate the issues related to the new PETSc threads
 > > package and MKL?
 > > 
 > > Dave
 > > 
 > > ________________________________________
 > > From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Shri [abhyshr at mcs.anl.gov]
 > > Sent: Friday, October 26, 2012 5:35 PM
 > > To: For users of the development version of PETSc
 > > Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
 > > 
 > > On Oct 26, 2012, at 3:08 PM, Nystrom, William D wrote:
 > > 
 > >> Are there any petsc examples that do cache blocking that would work for the new
 > >> threads support?
 > > 
 > > I don't think there are any examples that can do cache blocking using threads.
 > > 
 > >> I was initially investigating DMDA but that looks like it only works
 > >> for mpi processes.  I was looking at ex34.c and ex45.c located in petsc-dev/src/ksp/ksp/examples/tutorials.
 > >> 
 > >> Thanks,
 > >> 
 > >> Dave
 > >> 
 > >> ________________________________________
 > >> From: Nystrom, William D
 > >> Sent: Friday, October 26, 2012 10:53 AM
 > >> To: Karl Rupp
 > >> Cc: For users of the development version of PETSc; Nystrom, William D
 > >> Subject: RE: [petsc-dev] Status of pthreads and OpenMP support
 > >> 
 > >> Karli,
 > >> 
 > >> Thanks.  Sounds like I need to actually do the memory bandwidth calculation to be more
 > >> quantitative.
 > >> 
 > >> Thanks again,
 > >> 
 > >> Dave
 > >> 
 > >> ________________________________________
 > >> From: Karl Rupp [rupp at mcs.anl.gov]
 > >> Sent: Friday, October 26, 2012 10:47 AM
 > >> To: Nystrom, William D
 > >> Cc: For users of the development version of PETSc
 > >> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
 > >> 
 > >> Hi,
 > >> 
 > >>> Thanks for your reply.  Doing the memory bandwidth calculation seems
 > >>> like a useful exercise.  I'll
 > >>> give that a try.  I was also trying to think of this from a higher level perspective.  Does this seem
 > >>> reasonable?
 > >>> 
 > >>> T_vec_op = T_vec_compute + T_vec_memory
 > >>> 
 > >>> where these are times, but using multiple threads only speeds up the T_vec_compute part, while
 > >>> T_vec_memory stays relatively constant whether I am doing the memory operations with a single thread
 > >>> or with multiple threads.
 > >> 
 > >> Yes and no :-)
 > >> Due to possible multiple physical memory links and NUMA, T_vec_memory
 > >> shows a dependence on the number and affinity of threads. Also,
 > >> 
 > >> T_vec_op = max(T_vec_compute, T_vec_memory)
 > >> 
 > >> can be a better approximation, as memory transfers and actual
 > >> arithmetic may overlap ('prefetching').
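 > >> 
 > >> As a rough illustration with assumed (not measured) numbers: a VecAXPY
 > >> y = y + alpha*x on n = 10^7 doubles does 2n flops and moves about
 > >> 3*n*8 bytes (read x, read y, write y). On a node with, say, 100 GF/s
 > >> of peak floating point and 40 GB/s of sustained bandwidth,
 > >> 
 > >>   T_vec_compute ~ 2e7 flops / 1e11 flops/s  = 0.2 ms
 > >>   T_vec_memory  ~ 2.4e8 bytes / 4e10 bytes/s = 6 ms
 > >> 
 > >> so T_vec_op is essentially T_vec_memory, and extra threads only help
 > >> to the extent that they drive additional memory links.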
 > >> 
 > >> Still, the main speed-up when using threads (or multiple processes) is
 > >> in T_vec_compute. However, hardware processing speed has evolved such
 > >> that T_vec_memory is now often dominant (exceptions are mostly BLAS
 > >> level 3 algorithms), making proper data layout and affinity even more
 > >> important.
 > >> 
 > >> Best regards,
 > >> Karli
 > >> 
 > >> 
 > >> 
 > >>> ________________________________________
 > >>> From: Karl Rupp [rupp at mcs.anl.gov]
 > >>> Sent: Friday, October 26, 2012 10:20 AM
 > >>> To: For users of the development version of PETSc
 > >>> Cc: Nystrom, William D
 > >>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
 > >>> 
 > >>> Hi Dave,
 > >>> 
 > >>> let me just comment on the expected speed-up: As the arithmetic
 > >>> intensity of vector operations is small, you are in a memory-bandwidth
 > >>> limited regime. If you use smaller vectors in order to stay in cache,
 > >>> you may still not obtain the expected speedup because then thread
 > >>> management overhead becomes more of an issue. I suggest you compute the
 > >>> effective memory bandwidth of your vector operations, because I suspect
 > >>> you are pretty close to bandwidth saturation already.
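 > >>> 
 > >>> A minimal, untested sketch of such a measurement (error checking
 > >>> omitted; n and the repeat count are arbitrary; the byte count assumes
 > >>> VecAXPY reads x, reads y, and writes y):
 > >>> 
 > >>> #include <petscvec.h>
 > >>> 
 > >>> int main(int argc, char **argv)
 > >>> {
 > >>>   Vec      x, y;
 > >>>   PetscInt n = 10000000, i, nreps = 20;
 > >>>   double   t0, t1, bytes;
 > >>> 
 > >>>   PetscInitialize(&argc, &argv, NULL, NULL);
 > >>>   VecCreate(PETSC_COMM_WORLD, &x);
 > >>>   VecSetSizes(x, PETSC_DECIDE, n);
 > >>>   VecSetFromOptions(x);               /* vector type from the command line */
 > >>>   VecDuplicate(x, &y);
 > >>>   VecSet(x, 1.0);
 > >>>   VecSet(y, 2.0);
 > >>> 
 > >>>   VecAXPY(y, 3.0, x);                 /* warm-up / first touch */
 > >>>   t0 = MPI_Wtime();
 > >>>   for (i = 0; i < nreps; i++) VecAXPY(y, 3.0, x);
 > >>>   t1 = MPI_Wtime();
 > >>> 
 > >>>   bytes = 3.0 * n * sizeof(PetscScalar) * nreps;
 > >>>   PetscPrintf(PETSC_COMM_WORLD, "effective bandwidth: %g GB/s\n",
 > >>>               bytes / (t1 - t0) * 1.e-9);
 > >>> 
 > >>>   VecDestroy(&x);
 > >>>   VecDestroy(&y);
 > >>>   PetscFinalize();
 > >>>   return 0;
 > >>> }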
 > >>> 
 > >>> Best regards,
 > >>> Karli
 > >>> 
 > >>> 
 > >>> On 10/26/2012 10:58 AM, Nystrom, William D wrote:
 > >>>> Jed or Shri,
 > >>>> 
 > >>>> Are there other preconditioners I could use/try now with the petsc thread support besides jacobi?
 > >>>> I looked around in the documentation for something like least squares polynomial preconditioning
 > >>>> that is referenced in a paper by Li and Saad titled "GPU-Accelerated Preconditioned Iterative
 > >>>> Linear Solvers" but did not find anything like that.  Would block jacobi with lu/cholesky for the
 > >>>> block solves work with the current thread support?
 > >>>> 
 > >>>> Regarding the performance of my recent runs, I was surprised that I was not getting closer to a
 > >>>> 16x speedup for the purely vector operations when using 16 threads compared to 1 thread.  I'm
 > >>>> running on a single node of a cluster where the nodes are dual-socket Sandy Bridge CPUs and
 > >>>> the OS is TOSS 2 Linux from Livermore.  So I'm assuming that is not really an "unknown" sort
 > >>>> of system.  One thing I am wondering is whether there is an issue with my thread affinities.  I am
 > >>>> setting them but am wondering if there could be issues with which chunk of a vector a given
 > >>>> thread gets.  For instance, assuming a single mpi process on a single node and using 16 threads,
 > >>>> I would assume that the vector occupies a contiguous chunk of memory and that it will get divided
 > >>>> into 16 chunks.  If thread 13 is the first to launch, does it get the first chunk of the vector or the
 > >>>> 13th chunk of the vector?  If the latter, then I would think my assignment of thread affinities is
 > >>>> optimal.  If my thread assignment is optimal, then is the less than 16x speedup in the vector
 > >>>> operations because of memory bandwidth limitations or cache effects?
 > >>>> 
 > >>>> What profiling tools do you recommend for use with petsc?  I have investigated and tried Open|SpeedShop,
 > >>>> HPCToolkit and TAU but have not tried any of them with petsc yet.  I was told that there were some issues with
 > >>>> using TAU with petsc, though I'm not sure what they are.  So far, I have liked TAU best.
 > >>>> 
 > >>>> Dave
 > >>>> 
 > >>>> ________________________________________
 > >>>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
 > >>>> Sent: Friday, October 26, 2012 7:47 AM
 > >>>> To: For users of the development version of PETSc
 > >>>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
 > >>>> 
 > >>>> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
 > >>>>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
 > >>>>>> 
 > >>>>>> What I see in your results is about 7x speedup by using 16 threads.  I
 > >>>>>> think you should get better results by running 8 threads with 2
 > >>>>>> processes because the memory can be allocated on separate memory
 > >>>>>> controllers, and the memory will be physically closer to the cores.
 > >>>>>> I'm surprised that you get worse results.
 > >>>>> 
 > >>>>> 
 > >>>>> Our intent is for the threads to use an explicit first-touch policy so that
 > >>>>> they get local memory even when you have threads across multiple NUMA zones.
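 > >>>>> 
 > >>>>> For what it's worth, the generic first-touch pattern (this is only an
 > >>>>> illustration of the idea in plain OpenMP, not the petsc-dev code) looks
 > >>>>> like:
 > >>>>> 
 > >>>>> /* malloc() does not place pages; the first write does.  With a static
 > >>>>>    schedule each thread initializes, and later works on, the same
 > >>>>>    contiguous block, so that block's pages land in its NUMA domain
 > >>>>>    (provided the threads are pinned to cores). */
 > >>>>> #include <stdlib.h>
 > >>>>> 
 > >>>>> void axpy_with_first_touch(int n, double alpha)
 > >>>>> {
 > >>>>>   double *x = malloc(n * sizeof(double));
 > >>>>>   double *y = malloc(n * sizeof(double));
 > >>>>>   int     i;
 > >>>>> 
 > >>>>>   #pragma omp parallel for schedule(static)  /* placement happens here */
 > >>>>>   for (i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
 > >>>>> 
 > >>>>>   #pragma omp parallel for schedule(static)  /* same block-to-thread map */
 > >>>>>   for (i = 0; i < n; i++) y[i] += alpha * x[i];
 > >>>>> 
 > >>>>>   free(x);
 > >>>>>   free(y);
 > >>>>> }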
 > >>>> 
 > >>>> Great.  I still think the performance using jacobi (as Dave does)
 > >>>> should be no worse using 2x(MPI) and 8x(thread) than it is with
 > >>>> 1x(MPI) and 16x(thread).
 > >>>> 
 > >>>>>> 
 > >>>>>> It doesn't surprise me that an explicit code gets much better speedup.
 > >>>>> 
 > >>>>> 
 > >>>>> The explicit code is much less dependent on memory bandwidth relative to
 > >>>>> floating point.
 > >>>>> 
 > >>>>>> 
 > >>>>>> 
 > >>>>>>> I also get about the same performance results on the ex2 problem when
 > >>>>>>> running it with just
 > >>>>>>> mpi alone i.e. with 16 mpi processes.
 > >>>>>>> 
 > >>>>>>> So from my perspective, the new pthreads/openmp support is looking
 > >>>>>>> pretty good assuming
 > >>>>>>> the issue with the MKL/external packages interaction can be fixed.
 > >>>>>>> 
 > >>>>>>> I was just using jacobi preconditioning for ex2.  I'm wondering if there
 > >>>>>>> are any other preconditioners
 > >>>>>>> that might be multi-threaded.  Or maybe a polynomial preconditioner
 > >>>>>>> could work well for the
 > >>>>>>> new pthreads/openmp support.
 > >>>>>> 
 > >>>>>> GAMG with SOR smoothing seems like a prime candidate for threading.  I
 > >>>>>> wonder if anybody has worked on this yet?
 > >>>>> 
 > >>>>> 
 > >>>>> SOR is not great because it's sequential.
 > >>>> 
 > >>>> For structured grids we have multi-color schemes and temporally
 > >>>> blocked schemes as in this paper,
 > >>>> 
 > >>>> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
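 > >>>> 
 > >>>> As a concrete example of the multi-color idea (a generic sketch for the
 > >>>> 2D 5-point Laplacian, not code from PETSc): with a red-black ordering,
 > >>>> points of the same color share no stencil neighbors, so each color is an
 > >>>> embarrassingly parallel loop.
 > >>>> 
 > >>>> /* One red-black Gauss-Seidel sweep on an (m+2)x(m+2) grid; boundary
 > >>>>    values are assumed to be stored in u already. */
 > >>>> #define U(i,j) u[(i)*(m+2)+(j)]
 > >>>> #define F(i,j) f[(i)*(m+2)+(j)]
 > >>>> 
 > >>>> void rb_gs_sweep(int m, double h, double *u, const double *f)
 > >>>> {
 > >>>>   int color, i, j;
 > >>>>   for (color = 0; color < 2; color++) {
 > >>>>     #pragma omp parallel for private(j)
 > >>>>     for (i = 1; i <= m; i++)
 > >>>>       for (j = 1 + (i + color) % 2; j <= m; j += 2)
 > >>>>         U(i,j) = 0.25 * (h*h*F(i,j) + U(i-1,j) + U(i+1,j)
 > >>>>                                     + U(i,j-1) + U(i,j+1));
 > >>>>   }
 > >>>> }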
 > >>>> 
 > >>>> For unstructured grids, could we do some analogous decomposition using
 > >>>> e.g. parmetis?
 > >>>> 
 > >>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
 > >>>> 
 > >>>> Regards,
 > >>>> John
 > >>>> 
 > >>>>> A block Jacobi/SOR parallelizes
 > >>>>> fine, but does not guarantee stability without additional
 > >>>>> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
 > >>>>> with threads (but not all the kernels are ready).
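 > >>>>> 
 > >>>>> (For reference, selecting that smoother from the command line should be
 > >>>>> something like
 > >>>>> 
 > >>>>>   -pc_type gamg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi
 > >>>>> 
 > >>>>> assuming the usual GAMG/PCMG option prefixes; how much of it actually
 > >>>>> runs threaded depends on which kernels are ready.)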
 > >>>>> 
 > >>>>> Coarsening and the Galerkin triple product are more difficult to thread.
 > >>> 
 > >> 
 > > 
 > 


