[petsc-dev] Status of pthreads and OpenMP support
Dave Nystrom
dnystrom1 at comcast.net
Wed Oct 31 13:25:33 CDT 2012
I'm on travel today but could send you that info tomorrow when I'm back in my
office.
Dave
Shri writes:
> Dave,
>
> I configured PETSc with MKL on our machine and tested ex2 using the
> options you sent (./ex2 -pc_type jacobi -m 1000 -n 1000 -threadcomm_type
> pthread -threadcomm_nthreads 1). However, I could not reproduce the
> problem you encountered. Using more threads did not reproduce it
> either. What configure options did you use?
>
> Shri
>
> On Oct 31, 2012, at 11:41 AM, Nystrom, William D wrote:
>
> > Shri,
> >
> > Have you had a chance to investigate the issues related to the new PETSc threads
> > package and MKL?
> >
> > Dave
> >
> > ________________________________________
> > From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Shri [abhyshr at mcs.anl.gov]
> > Sent: Friday, October 26, 2012 5:35 PM
> > To: For users of the development version of PETSc
> > Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
> >
> > On Oct 26, 2012, at 3:08 PM, Nystrom, William D wrote:
> >
> >> Are there any petsc examples that do cache blocking that would work for the new
> >> threads support?
> >
> > I don't think there are any examples that can do cache blocking using threads.
> >
> >> I was initially investigating DMDA but that looks like it only works
> >> for mpi processes. I was looking at ex34.c and ex45.c located in petsc-dev/src/ksp/ksp/examples/tutorials.
> >>
> >> Thanks,
> >>
> >> Dave
> >>
> >> ________________________________________
> >> From: Nystrom, William D
> >> Sent: Friday, October 26, 2012 10:53 AM
> >> To: Karl Rupp
> >> Cc: For users of the development version of PETSc; Nystrom, William D
> >> Subject: RE: [petsc-dev] Status of pthreads and OpenMP support
> >>
> >> Karli,
> >>
> >> Thanks. Sounds like I need to actually do the memory bandwidth calculation to get more
> >> quantitative.
> >>
> >> Thanks again,
> >>
> >> Dave
> >>
> >> ________________________________________
> >> From: Karl Rupp [rupp at mcs.anl.gov]
> >> Sent: Friday, October 26, 2012 10:47 AM
> >> To: Nystrom, William D
> >> Cc: For users of the development version of PETSc
> >> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
> >>
> >> Hi,
> >>
> >>> Thanks for your reply. Doing the memory bandwidth calculation seems
> >> like a useful exercise. I'll
> >>> give that a try. I was also trying to think of this from a higher level perspective. Does this seem
> >>> reasonable?
> >>>
> >>> T_vec_op = T_vec_compute + T_vec_memory
> >>>
> >>> where these are times but using multiple threads only speeds up the T_vec_compute part while
> >>> T_vec_memory is relatively constant whether I am doing memory operations with a single thread
> >>> or multiple threads.
> >>
> >> Yes and no :-)
> >> Due to possible multiple physical memory links and NUMA, T_vec_memory
> >> shows a dependence on the number and affinity of threads. Also,
> >>
> >> T_vec_op = max(T_vec_compute, T_vec_memory)
> >>
> >> can be a better approximation, as memory transfers and actual
> >> arithmetics may overlap ('prefetching').
> >>
> >> Still, the main speed-up when using threads (or multiple processes) is
> >> in T_vec_compute. However, hardware processing speed has evolved such
> >> that T_vec_memory is now often dominant (exceptions are mostly BLAS
> >> level 3 algorithms), making proper data layout and affinity even more
> >> important.
> >>
> >> Best regards,
> >> Karli
> >>
> >>
> >>
> >>> ________________________________________
> >>> From: Karl Rupp [rupp at mcs.anl.gov]
> >>> Sent: Friday, October 26, 2012 10:20 AM
> >>> To: For users of the development version of PETSc
> >>> Cc: Nystrom, William D
> >>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
> >>>
> >>> Hi Dave,
> >>>
> >>> let me just comment on the expected speed-up: As the arithmetic
> >>> intensity of vector operations is small, you are in a memory-bandwidth
> >>> limited regime. If you use smaller vectors in order to stay in cache,
> >>> you may still not obtain the expected speedup because then thread
> >>> management overhead becomes more of an issue. I suggest you compute the
> >>> effective memory bandwidth of your vector operations, because I suspect
> >>> you are pretty close to bandwidth saturation already.
> >>>
> >>> Best regards,
> >>> Karli
> >>>
> >>>
> >>> On 10/26/2012 10:58 AM, Nystrom, William D wrote:
> >>>> Jed or Shri,
> >>>>
> >>>> Are there other preconditioners I could use/try now with the petsc thread support besides jacobi?
> >>>> I looked around in the documentation for something like least squares polynomial preconditioning
> >>>> that is referenced in a paper by Li and Saad titled "GPU-Accelerated Preconditioned Iterative
> >>>> Linear Solvers" but did not find anything like that. Would block jacobi with lu/cholesky for the
> >>>> block solves work with the current thread support?
> >>>>
> >>>> Regarding the performance of my recent runs, I was surprised that I was not getting closer to a
> >>>> 16x speedup for the purely vector operations when using 16 threads compared to 1 thread. I'm
> >>>> running on a single node of a cluster where the nodes are dual socked sandybridge cpus and
> >>>> the OS is TOSS 2 Linux from Livermore. So I'm assuming that is not really an "unknown" sort
> >>>> of system. One thing I am wondering is whether there is an issue with my thread affinities. I am
> >>>> setting them but am wondering if there could be issues with which chunk of a vector a given
> >>>> threads gets. For instance, assuming a single mpi process on a single node and using 16 threads,
> >>>> I would assume that the vector occupies a contiguous chunk of memory and that it will get divided
> >>>> into 16 chunks. If thread 13 is the first to launch, does it get the first chunk of the vector or the
> >>>> 13th chunk of the vector? If the latter, then I would think my assignment of thread affinities is
> >>>> optimal. If my thread assignment is optimal, then is the less than 16x speedup in the vector
> >>>> operations because of memory bandwidth limitations or cache effects?
> >>>>
> >>>> What profiling tools do you recommend to use with petsc? I have investigated and tried Openspeedshop,
> >>>> HPC Toolkit and Tau but have not tried any with petsc. I was told that there were some issues with
> >>>> using Tau with petsc. Not sure what they are. So far, I have liked Tau best.
> >>>>
> >>>> Dave
> >>>>
> >>>> ________________________________________
> >>>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of John Fettig [john.fettig at gmail.com]
> >>>> Sent: Friday, October 26, 2012 7:47 AM
> >>>> To: For users of the development version of PETSc
> >>>> Subject: Re: [petsc-dev] Status of pthreads and OpenMP support
> >>>>
> >>>> On Thu, Oct 25, 2012 at 9:16 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> >>>>> On Thu, Oct 25, 2012 at 8:05 PM, John Fettig <john.fettig at gmail.com> wrote:
> >>>>>>
> >>>>>> What I see in your results is about 7x speedup by using 16 threads. I
> >>>>>> think you should get better results by running 8 threads with 2
> >>>>>> processes because the memory can be allocated on separate memory
> >>>>>> controllers, and the memory will be physically closer to the cores.
> >>>>>> I'm surprised that you get worse results.
> >>>>>
> >>>>>
> >>>>> Our intent is for the threads to use an explicit first-touch policy so that
> >>>>> they get local memory even when you have threads across multiple NUMA zones.
> >>>>
> >>>> Great. I still think the performance using jacobi (as Dave does)
> >>>> should be no worse using 2x(MPI) and 8x(thread) than it is with
> >>>> 1x(MPI) and 16x(thread).
> >>>>
> >>>>>>
> >>>>>> It doesn't surprise me that an explicit code gets much better speedup.
> >>>>>
> >>>>>
> >>>>> The explicit code is much less dependent on memory bandwidth relative to
> >>>>> floating point.
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>> I also get about the same performance results on the ex2 problem when
> >>>>>>> running it with just
> >>>>>>> mpi alone i.e. with 16 mpi processes.
> >>>>>>>
> >>>>>>> So from my perspective, the new pthreads/openmp support is looking
> >>>>>>> pretty good assuming
> >>>>>>> the issue with the MKL/external packages interaction can be fixed.
> >>>>>>>
> >>>>>>> I was just using jacobi preconditioning for ex2. I'm wondering if there
> >>>>>>> are any other preconditioners
> >>>>>>> that might be multi-threaded. Or maybe a polynomial preconditioner
> >>>>>>> could work well for the
> >>>>>>> new pthreads/openmp support.
> >>>>>>
> >>>>>> GAMG with SOR smoothing seems like a prime candidate for threading. I
> >>>>>> wonder if anybody has worked on this yet?
> >>>>>
> >>>>>
> >>>>> SOR is not great because it's sequential.
> >>>>
> >>>> For structured grids we have multi-color schemes and temporally
> >>>> blocked schemes as in this paper,
> >>>>
> >>>> http://www.it.uu.se/research/publications/reports/2006-018/2006-018-nc.pdf
> >>>>
> >>>> For unstructured grids, could we do some analagous decomposition using
> >>>> e.g. parmetis?
> >>>>
> >>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.4764
> >>>>
> >>>> Regards,
> >>>> John
> >>>>
> >>>>> A block Jacobi/SOR parallelizes
> >>>>> fine, but does not guarantee stability without additional
> >>>>> (operator-dependent) damping. Chebyshev/Jacobi smoothing will perform well
> >>>>> with threads (but not all the kernels are ready).
> >>>>>
> >>>>> Coarsening and the Galerkin triple product is more difficult to thread.
> >>>
> >>
> >
>
More information about the petsc-dev
mailing list