[petsc-dev] GAMG error with MKL

Jeff Hammond jeff.science at gmail.com
Mon Jul 9 18:30:27 CDT 2018


On Mon, Jul 9, 2018 at 11:36 AM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:

>
>
> > On Jul 9, 2018, at 8:33 AM, Jeff Hammond <jeff.science at gmail.com> wrote:
> >
> >
> >
> > On Fri, Jul 6, 2018 at 4:28 PM, Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
> >
> >   Richard,
> >
> >     The problem is that OpenMP is too large and has too many different
> programming models embedded in it (and it will get worse) to "support
> OpenMP" from PETSc.
> >
> > This is also true of MPI.  You can write CSP, BSP, PGAS, fork-join,
> agent-based, etc. in MPI.  Just like MPI, you don't have to use all the
> features.  PETSc doesn't use MPI_Comm_spawn, MPI_Rget_accumulate, or
> MPI_Neighborhood_alltoallv, does it?
>
>    Jeff,
>
>      The issue is that PETSc doesn't control which model and which
> features of OpenMP the user chooses to use. If we "support" OpenMP then we
> need to allow users to make their own choices. We can't write PETSc with
> our paradigm for OpenMP (which would not be terrible as you say since we
> get to pick the paradigm) since that paradigm won't match the paradigm
> chosen by most users.
>

I agree.  It's a pain.  We face this with MKL.  There are at least three
ways for a library to support OpenMP, not including any use of target.

Jeff


> >     One way to use #pragma-based optimization tools (which is one way
> to treat OpenMP) is to run the application code on a realistic-size
> problem, using the number of threads per MPI process they prefer, with
> profiling, and begin adding #pragmas to the most time-consuming code
> fragments/routines, measuring the (small) improvement in performance as
> they are added. This is the way I would proceed. The branch generated will
> not have very many pragmas in it, so it would likely be acceptable for
> inclusion in PETSc. It would also give a quantitative measure of the
> possible performance with the #pragma approach.
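
(For concreteness, a single step of what Barry describes would look roughly
like this; a sketch on a made-up kernel, not actual PETSc code:

    /* Hypothetical hot loop identified by profiling a realistic-size run.
       The only change is the pragma; re-measure with the same number of
       threads per MPI process the application intends to use. */
    void waxpy_kernel(int n, double alpha, const double *x,
                      const double *y, double *w)
    {
    #pragma omp parallel for schedule(static)
      for (int i = 0; i < n; i++) w[i] = alpha * x[i] + y[i];
    }

)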
> >
> > This is the textbook Wrong Way to write OpenMP and the reason that the
> thread-scalability of DOE applications using MPI+OpenMP sucks.  It leads to
> codes that do fork-join far too often and suffer from death by Amdahl,
> unless you do a second pass where you fuse all the OpenMP regions and
> replace the serial regions between them with critical sections or similar.
> >
> > This isn't how you'd write MPI, is it?  No, you'd figure out how to
> decompose your data properly to exploit locality and then implement an
> algorithm that minimizes communication and synchronization.  Do that with
> OpenMP.
> >
>    Note that for BLAS 1 operations the correct thing to do is likely to
> turn on MKL BLAS threading (being careful to make sure the number of
> threads MKL uses matches that used by other parts of the code). This way we
> don't need to OpenMP optimize many parts of PETSc's vector operations
> (norm, dot, scale, axpy). In fact, this is the first thing Mark should do:
> how much does it speed up the vector operations?
> >
> > BLAS1 operations are all memory-bound unless the working set fits in
> cache (in which case one shouldn't use threads anyway), and compilers do a
> great job with them.  Just put the pragmas on and let the compiler do its job.
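
For concreteness, either route is a one-liner. The "turn on MKL threading and
keep the thread counts consistent" option Barry mentions is roughly this (a
sketch with a made-up wrapper; it only has an effect if MKL is linked against
its threaded layer):

    #include <mkl.h>   /* mkl_set_num_threads, cblas_daxpy */
    #include <omp.h>   /* omp_get_max_threads */

    void threaded_axpy(int n, double alpha, const double *x, double *y)
    {
      /* Keep MKL's thread pool in sync with the rest of the code so the
         two pools don't oversubscribe the cores. */
      mkl_set_num_threads(omp_get_max_threads());
      cblas_daxpy(n, alpha, x, 1, y, 1);
    }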
> >
> >   The problem is whether ECP applications actually use OpenMP just as a
> #pragma optimization tool, or whether they use other features of OpenMP. For
> example, I remember Brian wanted to (or did) use OpenMP threads directly in
> BoxLib and didn't just stick to the #pragma model. If they did this, then we
> would need a custom PETSc to match their model.
> >
> > If this implies that BoxLib will use omp-parallel and then use explicit
> threading in a manner similar to MPI (omp_get_num_threads=MPI_Comm_size
> and omp_get_thread_num=MPI_Comm_rank), then this is the Right Way to
> write OpenMP.
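
I.e., something along these lines (a sketch of the SPMD style with a made-up
local kernel; the point is one long-lived parallel region with owner-computes
partitioning, not the particular loop body):

    #include <omp.h>

    void spmd_sweep(int n, double *u, const double *rhs)
    {
    #pragma omp parallel
      {
        int tid = omp_get_thread_num();   /* plays the role of MPI_Comm_rank */
        int nt  = omp_get_num_threads();  /* plays the role of MPI_Comm_size */
        /* Each thread owns a contiguous block, like an MPI rank owns a
           subdomain; no fork-join between kernels inside the region. */
        int lo = (int)(((long long)n * tid) / nt);
        int hi = (int)(((long long)n * (tid + 1)) / nt);
        for (int i = lo; i < hi; i++) u[i] += rhs[i];
      }
    }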
> >
> > Unfortunately, the Right Way to use OpenMP makes it hard to use MPI
> unless you use MPI_THREAD_MULTIPLE and endpoints.  ECP projects should be
> pushing the MPI folks harder to ratify and implement endpoints.  I don't
> know if the proposal is even active right now, but that doesn't prevent DOE
> from compelling Open-MPI and MPICH to support it.
> >
> > To end on a positive note, OpenMP tasking is a relatively composable
> model and supports DAG-based parallelism.  I suspect the initial results in
> a code like PETSc will be worse than with traditional implicit OpenMP
> (omp-for-simd on all the loops), but it eventually wins out because it
> doesn't require any unnecessary barriers and makes it much easier to fuse
> parallel regions.
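
For reference, the tasking style looks roughly like this (a toy sketch, not
PETSc code; array sections in depend clauses need OpenMP 4.5):

    void dag_example(int n, double *a, double *b, double *c)
    {
    #pragma omp parallel
    #pragma omp single
      {
        /* The depend clauses express the DAG; the runtime starts each task
           when its inputs are ready, with no barrier in between. */
    #pragma omp task depend(out: a[0:n])
        for (int i = 0; i < n; i++) a[i] = 1.0;

    #pragma omp task depend(out: b[0:n])
        for (int i = 0; i < n; i++) b[i] = 2.0;

    #pragma omp task depend(in: a[0:n], b[0:n]) depend(out: c[0:n])
        for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
      } /* tasks complete at the implicit barrier ending the single/parallel region */
    }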
> >
> > Jeff
> >
> >
> >   Barry
> >
> >
> > > On Jul 6, 2018, at 3:07 PM, Mills, Richard Tran <rtmills at anl.gov>
> wrote:
> > >
> > > True, Barry. But, unfortunately, I think Jed's argument has something
> to it because the hybrid MPI + OpenMP model has become so popular. I know
> of a few codes where adopting this model makes some sense, though I believe
> that, more often, the model has been adopted simply because it is the
> fashionable thing to do. Regardless of good or bad reasons for its
> adoption, I do have some real concern that codes that use this model have a
> difficult time using PETSc effectively because of the lack of thread
> support. Like many of us, I had hoped that endpoints would make it into the
> MPI standard and this would provide a reasonable mechanism for integrating
> PETSc with codes using MPI+threads, but progress on this seems to have
> stagnated. I hope that the MPI endpoints effort eventually goes somewhere,
> but what can we do in the meantime? Within the DOE ECP program, the
> MPI+threads approach is being pushed really hard, and many of the ECP
> subprojects have adopted it. I think it's mostly idiotic, but I think it's
> too late to turn the tide and convince most people that pure MPI is the way
> to go. Meanwhile, my understanding is that we need to be able to support
> more of the ECP application projects to justify the substantial funding we
> are getting from the program. Many of these projects are dead-set on using
> OpenMP. (I note that I believe that the folks Mark is trying to help with
> PETSc and OpenMP are people affiliated with Carl Steefel's ECP subsurface
> project.)
> > >
> > > Since it looks like MPI endpoints are going to be a long time (or
> possibly forever) in coming, I think we need (a) stopgap plan(s) to support
> this crappy MPI + OpenMP model in the meantime. One possible approach is to
> do what Mark is trying to do with MKL: use a third-party library that
> provides optimized OpenMP implementations of computationally expensive
> kernels. It might make sense to also consider using Karl's ViennaCL library
> in this manner, which we already use to support GPUs, but which I believe
> (Karl, please let me know if I am off-base here) we could also use to
> provide OpenMP-ized linear algebra operations on CPUs. Such
> approaches won't use threads for lots of the things that a PETSc code will
> do, but might be able to provide decent resource utilization for the most
> expensive parts for some codes.
> > >
> > > Clever ideas from anyone on this list about how to use an adequate
> number of MPI ranks for PETSc while using only a subset of these ranks for
> the MPI+OpenMP application code will be appreciated, though I don't know if
> there are any good solutions.
> > >
> > > --Richard
> > >
> > > On Wed, Jul 4, 2018 at 11:38 PM, Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
> > >
> > >    Jed,
> > >
> > >      You could use the same argument to argue that PETSc should do
> "something" to help people who have (rightly or wrongly) chosen to code
> their application in High Performance Fortran or any other similarly inane
> parallel programming model.
> > >
> > >    Barry
> > >
> > >
> > >
> > > > On Jul 4, 2018, at 11:51 PM, Jed Brown <jed at jedbrown.org> wrote:
> > > >
> > > > Matthew Knepley <knepley at gmail.com> writes:
> > > >
> > > >> On Wed, Jul 4, 2018 at 4:51 PM Jeff Hammond <jeff.science at gmail.com>
> wrote:
> > > >>
> > > >>> On Wed, Jul 4, 2018 at 6:31 AM Matthew Knepley <knepley at gmail.com>
> wrote:
> > > >>>
> > > >>>> On Tue, Jul 3, 2018 at 10:32 PM Jeff Hammond <
> jeff.science at gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Tue, Jul 3, 2018 at 4:35 PM Mark Adams <mfadams at lbl.gov>
> wrote:
> > > >>>>>
> > > >>>>>> On Tue, Jul 3, 2018 at 1:00 PM Richard Tran Mills <
> rtmills at anl.gov>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi Mark,
> > > >>>>>>>
> > > >>>>>>> I'm glad to see you trying out the AIJMKL stuff. I think you
> are the
> > > >>>>>>> first person trying to actually use it, so we are probably
> going to expose
> > > >>>>>>> some bugs and also some performance issues. My somewhat
> limited testing has
> > > >>>>>>> shown that the MKL sparse routines often perform worse than
> our own
> > > >>>>>>> implementations in PETSc.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>> My users just want OpenMP.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>> Why not just add OpenMP to PETSc? I know certain developers hate
> it, but
> > > >>>>> it is silly to let a principled objection stand in the way of
> enabling users
> > > >>>>>
> > > >>>>
> > > >>>> "if that would deliver the best performance for NERSC users."
> > > >>>>
> > > >>>> You have answered your own question.
> > > >>>>
> > > >>>
> > > >>> Please share the results of your experiments that prove OpenMP
> does not
> > > >>> improve performance for Mark’s users.
> > > >>>
> > > >>
> > > >> Oh God. I am supremely uninterested in minutely proving yet again
> that
> > > >> OpenMP is not better than MPI.
> > > >> There are already countless experiments. One more will not add
> anything of
> > > >> merit.
> > > >
> > > > Jeff assumes an absurd null hypothesis, Matt selfishly believes that
> > > > users should modify their code/execution environment to subscribe to
> a
> > > > more robust and equally performant approach, and the MPI forum
> abdicates
> > > > by stalling on endpoints.  How do we resolve this?
> > > >
> > > >>> Also we are not in the habit of fucking up our codebase in order
> to follow
> > > >>>> some fad.
> > > >>>>
> > > >>>
> > > >>> If you can’t use OpenMP without messing up your code base, you
> probably
> > > >>> don’t know how to design software.
> > > >>>
> > > >>
> > > >> That is an interesting, if wrong, opinion. It might be your
> contention that
> > > >> sticking any random paradigm in a library should
> > > >> be alright if it's "well designed"? I have never encountered such a
> > > >> well-designed library.
> > > >>
> > > >>
> > > >>> I guess if you refuse to use _Pragma because C99 is still a fad
> for you,
> > > >>> it is harder, but clearly _Complex is tolerated.
> > > >>>
> > > >>
> > > >> Yes, littering your code with preprocessor directives improves
> almost
> > > >> everything. Doing proper resource management
> > > >> using Pragmas, in an environment with several layers of libraries,
> is a
> > > >> dream.
> > > >>
> > > >>
> > > >>> More seriously, you’ve adopted OpenMP hidden behind MKL
> > > >>>
> > > >>
> > > >> Nope. We can use MKL with that crap shut off.
> > > >>
> > > >>
> > > >>> so I see no reason why you can’t wrap OpenMP implementations of
> the PETSc
> > > >>> sparse kernels in a similar manner.
> > > >>>
> > > >>
> > > >> We could, it's just a colossal waste of time and effort, as well as
> > > >> counterproductive for the codebase :)
> > > >
> > > > Endpoints either need to become a thing we can depend on or we need a
> > > > solution for users that insist on using threads (even if their
> decision
> > > > to use threads is objectively bad).  The problem Matt harps on is
> > > > legitimate: OpenMP parallel regions cannot reliably cross module
> > > > boundaries except for embarrassingly parallel operations.  This means
> loop-level omp parallel, which significantly increases overhead for small
> problem sizes (e.g., slowing coarse-grid solves and hurting at the strong
> scaling limit).  It can be done and isn't that hard, but the Imperial group
> > > > limits).  It can be done and isn't that hard, but the Imperial group
> > > > discarded their branch after observing that it also provided no
> > > > performance benefit.  However, I'm coming around to the idea that
> PETSc
> > > > should do it so that there is _a_ solution for users that insist on
> > > > using threads in a particular way.  Unless Endpoints become available
> > > > and reliable, in which case we could do it right.
> > >
> > >
> >
> >
> >
> >
> > --
> > Jeff Hammond
> > jeff.science at gmail.com
> > http://jeffhammond.github.io/
>
>


-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/