[petsc-dev] GAMG error with MKL

Mon Jul 9 08:38:54 CDT 2018

On Mon, Jul 9, 2018 at 9:34 AM Jeff Hammond <jeff.science at gmail.com> wrote:

> On Fri, Jul 6, 2018 at 4:28 PM, Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
>
>>
>>   Richard,
>>
>>     The problem is that OpenMP is too large and has too many different
>> programming models imbedded in it (and it will get worse) to "support
>> OpenMP" from PETSc.
>>
>
> This is also true of MPI.  You can write CSP, BSP, PGAS, fork-join,
> agent-based, etc. in MPI.  Just like MPI, you don't have to use all the
> features.  PETSc doesn't use MPI_Comm_spawn, MPI_Rget_accumulate, or
> MPI_Neighborhood_alltoallv, does it?
>
>
>>     One way to use #pragma based optimization tools (which is one way to
>> treat OpenMP) is to run the application code in a realistic size problem,
>> using the number of threads/MPI process they prefer with profiling and
>> begin adding #pragmas to the most time consuming code fragments/routines,
>> measuring the (small) improvement in performance as they are added. This is
>> the way I would proceed. The branch generated will not have very many
>> pragmas in it so would likely be acceptable to be included into PETSc. It
>> would also give a quantitative measure of the possible performance with the
>> #prama approach.
>>
>
> This is the textbook Wrong Way to write OpenMP and the reason that the
> thread-scalability of DOE applications using MPI+OpenMP sucks.  It leads to
> codes that do fork-join far too often and suffer from death by Amdahl,
> unless you do a second pass where you fuse all the OpenMP regions and
> replace the serial regions between them with critical sections or similar.
>
> This isn't how you'd write MPI, is it?  No, you'd figure out how to
> decompose your data properly to exploit locality and then implement an
> algorithm that minimizes communication and synchronization.  Do that with
> OpenMP.
>

This is the worst advice. It is ivory tower maxims that ignore any
practical considerations. What you propose, while
it would produce the bet performing OpenMP, is a lot of work for absolutely
no value. We already have that performance
with MPI, it would not get any better with all this work, and the shit-wits
that want OpenMP do not care about performance,
they only care about ass covering and doing what other idiots in the DOE
tell them to do.

  Matt

>
>
>>    Note: that for BLAS 1 operations likely the correct thing to do is
>> turn on MKL BLAS threading (being careful to make sure the number of
>> threads MKL uses matches that used by other parts of the code). This way we
>> don't need to OpenMP optimize many parts of PETSc's vector operations
>> (norm, dot, scale, axpy). In fact, this is the first thing Mark should do,
>> how much does it speed up the vector operations?
>>
>
> BLAS1 operations are all memory-bound unless running out of cache (in
> which case one shouldn't use threads) and compilers do a great job with
> them.  Just put the pragmas on and let the compiler do its job.
>
>
>>   The problem is how many ECP applications actually use OpenMP just as a
>> #pragma optimization tool, or do they use other features of OpenMP. For
>> example I remember Brian wanted to/did use OpenMP threads directly in
>> BoxLib and didn't just stick to the #pragma model. If they did this then we
>> would need custom PETSc to match their model.
>>
>
> If this implies that BoxLib will use omp-parallel and then use explicit
> threading in a manner similar to MPI (omp_get_num_threads=MPI_Comm_size and
> omp_get_thread_num=MPI_Comm_rank), then this is the Right Way to write
> OpenMP.
>
> Unfortunately, the Right Way to use OpenMP makes it hard to use MPI unless
> you use MPI_THREAD_MULTIPLE and endpoints.  ECP projects should be pushing
> the MPI folks harder to ratify and implement endpoints.  I don't know if
> the proposal is even active right now, but that doesn't prevent DOE from
> compelling Open-MPI and MPICH to support it.
>
> To end on a positive note, OpenMP tasking is a relatively composable model
> and supports DAG-based parallelism.  I suspect the initial results in a
> code like PETSc will be worse than with traditional implicit OpenMP
> (omp-for-simd on all the loops) but it eventually wins own because it
> doesn't require any unnecessary barriers and makes it much easier to fuse
> parallel regions.
>
> Jeff
>
>
>>
>>   Barry
>>
>>
>> > On Jul 6, 2018, at 3:07 PM, Mills, Richard Tran <rtmills at anl.gov>
>> wrote:
>> >
>> > True, Barry. But, unfortunately, I think Jed's argument has something
>> to it because the hybrid MPI + OpenMP model has become so popular. I know
>> of a few codes where adopting this model makes some sense, though I believe
>> that, more often, the model has been adopted simply because it is the
>> fashionable thing to do. Regardless of good or bad reasons for its
>> adoption, I do have some real concern that codes that use this model have a
>> difficult time using PETSc effectively because of the lack of thread
>> support. Like many of us, I had hoped that endpoints would make it into the
>> MPI standard and this would provide a reasonable mechanism for integrating
>> PETSc with codes using MPI+threads, but progress on this seems to have
>> stagnated. I hope that the MPI endpoints effort eventually goes somewhere,
>> but what can we do in the meantime? Within the DOE ECP program, the
>> MPI+threads approach is being pushed really hard, and many of the ECP
>> subprojects have adopted it. I think it's mostly idiotic, but I think it's
>> too late to turn the tide and convince most people that pure MPI is the way
>> to go. Meanwhile, my understanding is that we need to be able to support
>> more of the ECP application projects to justify the substantial funding we
>> are getting from the program. Many of these projects are dead-set on using
>> OpenMP. (I note that I believe that the folks Mark is trying to help with
>> PETSc and OpenMP are people affiliated with Carl Steefel's ECP subsurface
>> project.)
>> >
>> > Since it looks like MPI endpoints are going to be a long time (or
>> possibly forever) in coming, I think we need (a) stopgap plan(s) to support
>> this crappy MPI + OpenMP model in the meantime. One possible approach is to
>> do what Mark is trying with to do with MKL: Use a third party library that
>> provides optimized OpenMP implementations of computationally expensive
>> kernels. It might make sense to also consider using Karl's ViennaCL library
>> in this manner, which we already use to support GPUs, but which I believe
>> (Karl, please let me know if I am off-base here) we could also use to
>> provide OpenMP-ized linear algebra operations on CPUs as well. Such
>> approaches won't use threads for lots of the things that a PETSc code will
>> do, but might be able to provide decent resource utilization for the most
>> expensive parts for some codes.
>> >
>> > Clever ideas from anyone on this list about how to use an adequate
>> number of MPI ranks for PETSc while using only a subset of these ranks for
>> the MPI+OpenMP application code will be appreciated, though I don't know if
>> there are any good solutions.
>> >
>> > --Richard
>> >
>> > On Wed, Jul 4, 2018 at 11:38 PM, Smith, Barry F. <bsmith at mcs.anl.gov>
>> wrote:
>> >
>> >    Jed,
>> >
>> >      You could use your same argument to argue PETSc should do
>> "something" to help people who have (rightly or wrongly) chosen to code
>> their application in High Performance Fortran or any other similar inane
>> parallel programming model.
>> >
>> >    Barry
>> >
>> >
>> >
>> > > On Jul 4, 2018, at 11:51 PM, Jed Brown <jed at jedbrown.org> wrote:
>> > >
>> > > Matthew Knepley <knepley at gmail.com> writes:
>> > >
>> > >> On Wed, Jul 4, 2018 at 4:51 PM Jeff Hammond <jeff.science at gmail.com>
>> wrote:
>> > >>
>> > >>> On Wed, Jul 4, 2018 at 6:31 AM Matthew Knepley <knepley at gmail.com>
>> wrote:
>> > >>>
>> > >>>> On Tue, Jul 3, 2018 at 10:32 PM Jeff Hammond <
>> jeff.science at gmail.com>
>> > >>>> wrote:
>> > >>>>
>> > >>>>>
>> > >>>>>
>> > >>>>> On Tue, Jul 3, 2018 at 4:35 PM Mark Adams <mfadams at lbl.gov>
>> wrote:
>> > >>>>>
>> > >>>>>> On Tue, Jul 3, 2018 at 1:00 PM Richard Tran Mills <
>> rtmills at anl.gov>
>> > >>>>>> wrote:
>> > >>>>>>
>> > >>>>>>> Hi Mark,
>> > >>>>>>>
>> > >>>>>>> I'm glad to see you trying out the AIJMKL stuff. I think you
>> are the
>> > >>>>>>> first person trying to actually use it, so we are probably
>> going to expose
>> > >>>>>>> some bugs and also some performance issues. My somewhat limited
>> testing has
>> > >>>>>>> shown that the MKL sparse routines often perform worse than our
>> own
>> > >>>>>>> implementations in PETSc.
>> > >>>>>>>
>> > >>>>>>
>> > >>>>>> My users just want OpenMP.
>> > >>>>>>
>> > >>>>>>
>> > >>>>>
>> > >>>>> Why not just add OpenMP to PETSc? I know certain developers hate
>> it, but
>> > >>>>> it is silly to let a principled objection stand in the way of
>> enabling users
>> > >>>>>
>> > >>>>
>> > >>>> "if that would deliver the best performance for NERSC users."
>> > >>>>
>> > >>>> You have answered your own question.
>> > >>>>
>> > >>>
>> > >>> Please share the results of your experiments that prove OpenMP does
>> not
>> > >>> improve performance for Mark’s users.
>> > >>>
>> > >>
>> > >> Oh God. I am supremely uninterested in minutely proving yet again
>> that
>> > >> OpenMP is not better than MPI.
>> > >> There are already countless experiments. One more will not add
>> anything of
>> > >> merit.
>> > >
>> > > Jeff assumes an absurd null hypothesis, Matt selfishly believes that
>> > > users should modify their code/execution environment to subscribe to a
>> > > more robust and equally performant approach, and the MPI forum
>> abdicates
>> > > by stalling on endpoints.  How do we resolve this?
>> > >
>> > >>> Also we are not in the habit of fucking up our codebase in order to
>> follow
>> > >>>> some fad.
>> > >>>>
>> > >>>
>> > >>> If you can’t use OpenMP without messing up your code base, you
>> probably
>> > >>> don’t know how to design software.
>> > >>>
>> > >>
>> > >> That is an interesting, if wrong, opinion. It might be your
>> contention that
>> > >> sticking any random paradigm in a library should
>> > >> be alright if its "well designed"? I have never encountered such a
>> > >> well-designed library.
>> > >>
>> > >>
>> > >>> I guess if you refuse to use _Pragma because C99 is still a fad for
>> you,
>> > >>> it is harder, but clearly _Complex is tolerated.
>> > >>>
>> > >>
>> > >> Yes, littering your code with preprocessor directives improves almost
>> > >> everything. Doing proper resource management
>> > >> using Pragmas, in an environment with several layers of libraries,
>> is a
>> > >> dream.
>> > >>
>> > >>
>> > >>> More seriously, you’ve adopted OpenMP hidden behind MKL
>> > >>>
>> > >>
>> > >> Nope. We can use MKL with that crap shutoff.
>> > >>
>> > >>
>> > >>> so I see no reason why you can’t wrap OpenMP implementations of the
>> PETSc
>> > >>> sparse kernels in a similar manner.
>> > >>>
>> > >>
>> > >> We could, its just a colossal waste of time and effort, as well as
>> > >> counterproductive for the codebase :)
>> > >
>> > > Endpoints either need to become a thing we can depend on or we need a
>> > > solution for users that insist on using threads (even if their
>> decision
>> > > to use threads is objectively bad).  The problem Matt harps on is
>> > > legitimate: OpenMP parallel regions cannot reliably cross module
>> > > boundaries except for embarrassingly parallel operations.  This means
>> > > loop-level omp parallel which significantly increases overhead for
>> small
>> > > problem sizes (e.g., slowing coarse grid solves and strong scaling
>> > > limits).  It can be done and isn't that hard, but the Imperial group
>> > > discarded their branch after observing that it also provided no
>> > > performance benefit.  However, I'm coming around to the idea that
>> PETSc
>> > > should do it so that there is _a_ solution for users that insist on
>> > > using threads in a particular way.  Unless Endpoints become available
>> > > and reliable, in which case we could do it right.
>> >
>> >
>>
>>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20180709/417fd65a/attachment-0001.html>