[petsc-dev] GAMG error with MKL

Jeff Hammond jeff.science at gmail.com
Mon Jul 9 18:18:58 CDT 2018


On Mon, Jul 9, 2018 at 7:38 AM, Mark Adams <mfadams at lbl.gov> wrote:

> I agree with Matt's comment and let me add (somewhat redundantly)
>
>
>> This isn't how you'd write MPI, is it?  No, you'd figure out how to
>> decompose your data properly to exploit locality and then implement an
>> algorithm that minimizes communication and synchronization.  Do that with
>> OpenMP.
>>
>
> I have never seen a DOE app that does this correctly: get your data model
> figured out first, then implement.
>

Chris Kerr's weather code (GFDL Hiram) has a single OpenMP parallel
region.  He was at the last NERSC workshop I attended.  You should talk to
him.


> In fact, in my mind, the only advantage of OMP is that it is incremental.
> You can get something running quickly and then incrementally optimize as
> resources allow and performance demands.
>

This is why OpenMP has a bad reputation.  We have customers who use OpenMP
holistically and get great results.


> That is nice in theory, but in practice apps just say "we did it all on
> our own without that pesky distributed memory computing" and do science.
> Fine. We all have resource limits and have to make decisions.
>

> Jeff, with your approach, you do all the hard work of distributing your
> data intelligently, which must be done regardless of programming model, but
> you are probably left with a code that has more shared memory algorithms in
> it than if you had started with the MPI side.
>

Good OpenMP shares very little state between threads, although I suppose it
doesn't lead to halo buffers and explicit exchanges through them.  Is that
a bad thing?


> I thought *you* were one of the savants who preach that shared-memory code
> is just impossible to make correct for non-trivial codes, and thus hard to
> maintain.
>

Shared-memory programming is hard but so is raising children. Both are
feasible with sufficient effort.

I argue that:
- data sharing by default, which is implied by threading, is bad semantics,
but the issue is largely academic if you write OpenMP properly;
- incremental OpenMP leads to excessive fork-join overhead and
death-by-Amdahl;
- MPI's failure to support threads properly leads to a lot of stupid design
choices in MPI+OpenMP applications;
- good OpenMP looks like MPI and requires a lot more work up front than
incremental OpenMP (see the sketch below).
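
To make that last point concrete, here is a minimal sketch of the contrast
I have in mind.  The axpy kernel, sizes, and function names are made up for
illustration; this is not PETSc or hypre code, just per-loop fork-join
versus a single SPMD-style parallel region:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Incremental style: every sweep forks and joins its own thread team. */
static void axpy_incremental(double *y, const double *x, double a, int sweeps)
{
  for (int s = 0; s < sweeps; ++s) {
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
      y[i] += a * x[i];             /* fork/join overhead paid every sweep */
  }
}

/* SPMD style: one parallel region; each thread owns a contiguous block,
   much like an MPI rank owns the local part of a distributed vector. */
static void axpy_spmd(double *y, const double *x, double a, int sweeps)
{
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();    /* plays the role of MPI_Comm_rank */
    int nth = omp_get_num_threads();   /* plays the role of MPI_Comm_size */
    int lo  = (int)((long long)N *  tid      / nth);
    int hi  = (int)((long long)N * (tid + 1) / nth);

    for (int s = 0; s < sweeps; ++s) {
      for (int i = lo; i < hi; ++i)
        y[i] += a * x[i];
      #pragma omp barrier    /* synchronize only where the algorithm needs it */
    }
  }
}

int main(void)
{
  double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
  for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 0.0; }
  axpy_incremental(y, x, 2.0, 10);
  axpy_spmd(y, x, 2.0, 10);
  printf("y[0] = %g\n", y[0]);
  free(x); free(y);
  return 0;
}

The point is that the decomposition in the second version has to be thought
through up front, exactly as it would for MPI, but the synchronization
points become explicit and cheap.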


> Case in point. I recently tried to use hypre's OMP support and we had
> numerical problems. After a week of digging I found a hypre test case (i.e.,
> no PETSc) that seemed to work with -O1 and failed with -O2 (the solver just
> did not converge, and valgrind seemed clean). (This was using the 'ij' test
> problem.) I then ran a PETSc test problem with this -O1 hypre build, and
> it failed. I gave up at that point. Ulrike is in the loop and she agreed it
> looked like a compiler problem.
>

Sadly, I'm not aware of any bug-free compilers.


> If Intel can get this hypre test to work, they can tell me what they did and
> I can try it again in PETSc. BTW, I looked at the hypre code and they do not
> seem to do much, if any, fusing, etc.
>

Yeah, it's hard to get fusing right across subroutines.  Fusing only
matters when the amount of compute is limited, though.  Personally I prefer
tasks to old-school OpenMP loop fusing, but the implementation support isn't
ideal right now.
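
For what it's worth, this is roughly the kind of fusing being talked about
for memory-bound level-1 kernels; a made-up sketch, not hypre or PETSc
routines:

/* Unfused: y streams through memory twice, once per loop. */
double axpy_then_norm2(int n, double a, const double *x, double *y)
{
  double sum = 0.0;
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; ++i)
    sum += y[i] * y[i];
  return sum;
}

/* Fused: one pass, one parallel region, roughly half the traffic on y. */
double fused_axpy_norm2(int n, double a, const double *x, double *y)
{
  double sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; ++i) {
    y[i] += a * x[i];
    sum  += y[i] * y[i];
  }
  return sum;
}

The hard part is doing this when the axpy and the norm live in different
subroutines, which is exactly where the old-school approach breaks down and
part of why I'd rather reach for tasks.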

> And, this is all anecdotal and I do not want to imply that OMP or hypre or
> Intel are bad in any way (in fact I like both hypre and Intel).
>
>
>>
>>>    Note that for BLAS 1 operations, likely the correct thing to do is
>>> turn on MKL BLAS threading (being careful to make sure the number of
>>> threads MKL uses matches that used by other parts of the code). This way we
>>> don't need to OpenMP-optimize many parts of PETSc's vector operations
>>> (norm, dot, scale, axpy). In fact, this is the first thing Mark should do:
>>> how much does it speed up the vector operations?
>>>
>>
>> BLAS1 operations are all memory-bound unless running out of cache (in
>> which case one shouldn't use threads) and compilers do a great job with
>> them.  Just put the pragmas on and let the compiler do its job.
>>
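
To make the thread-count matching above concrete, something like the
following is what I'd expect; a sketch only, assuming MKL's
mkl_set_num_threads service call and the standard CBLAS entry points, with
arbitrary sizes:

#include <mkl.h>   /* mkl_set_num_threads, cblas_daxpy, cblas_ddot */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const MKL_INT n = 1 << 20;
  double *x = malloc((size_t)n * sizeof *x);
  double *y = malloc((size_t)n * sizeof *y);
  for (MKL_INT i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

  /* Give MKL the same team size the rest of the code uses, rather than
     letting it pick its own default and oversubscribing. */
  mkl_set_num_threads(omp_get_max_threads());

  cblas_daxpy(n, 3.0, x, 1, y, 1);          /* y := 3*x + y, threaded by MKL */
  double nrm2 = cblas_ddot(n, y, 1, y, 1);  /* threaded dot */

  printf("%d threads, ||y||^2 = %g\n", omp_get_max_threads(), nrm2);
  free(x); free(y);
  return 0;
}

For purely memory-bound BLAS1 kernels I wouldn't expect the threaded MKL
calls to beat a plain pragma on the loop by much, but that is exactly the
measurement being asked for.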
>>
>>>   The problem is how many ECP applications actually use OpenMP just as a
>>> #pragma optimization tool, and how many use other features of OpenMP. For
>>> example, I remember Brian wanted to (or did) use OpenMP threads directly in
>>> BoxLib and didn't just stick to the #pragma model. If they did this, then we
>>> would need custom PETSc to match their model.
>>>
>>
>> If this implies that BoxLib will use omp-parallel and then use explicit
>> threading in a manner similar to MPI (omp_get_num_threads=MPI_Comm_size
>> and omp_get_thread_num=MPI_Comm_rank), then this is the Right Way to
>> write OpenMP.
>>
>
> Note, Chombo (Phil Colella) split from BoxLib (John Bell) about 15 years
> ago (and added more C++), and BoxLib has been refactored into AMReX. Brian
> works with Chombo. Some staff are fungible and go between both projects. I
> don't think Brian is fungible.
>
>>
If I say "OpenMP and C++ are great", will I be able to hear the swearing
all the way from Buffalo? :-)

Jeff

-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/

