[petsc-dev] GAMG error with MKL

Mark Adams mfadams at lbl.gov
Mon Jul 9 09:38:37 CDT 2018


I agree with Matt's comment, and let me add (somewhat redundantly):


> This isn't how you'd write MPI, is it?  No, you'd figure out how to
> decompose your data properly to exploit locality and then implement an
> algorithm that minimizes communication and synchronization.  Do that with
> OpenMP.
>

I have never seen a DOE app that does this correctly: get your data model
figured out first, then implement. In fact, in my mind, the only advantage
of OMP is that it is incremental. You can get something running quickly and
then incrementally optimize as resources allow and performance demands
dictate. That is nice in theory, but in practice apps just say "we did it
all on our own without that pesky distributed-memory computing" and do
science. Fine. We all have resource limits and have to make decisions.
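
To be concrete about what "incremental" means here (a generic sketch, not
from any particular app): you take an existing serial loop, add a pragma,
and change nothing else; built without OpenMP it is still the exact same
serial code.

#include <stdio.h>
#include <stdlib.h>

/* The existing serial kernel; the only change for OpenMP is the pragma.
   Data layout, allocation, and the rest of the app are untouched. */
static void axpy(int n, double a, const double *x, double *y)
{
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}

int main(void)
{
  int     n = 1000000;
  double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
  for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
  axpy(n, 0.5, x, y);
  printf("y[0] = %g\n", y[0]);  /* 2.5 */
  free(x); free(y);
  return 0;
}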

Jeff, with your approach, you do all the hard work of distributing your
data intelligently, which must be done regardless of the programming model,
but you are probably left with a code that has more shared-memory
algorithms in it than if you had started on the MPI side. I thought *you*
were one of the savants who preach that shared-memory code is just about
impossible to make correct for non-trivial codes, and thus hard to maintain.

Case in point: I recently tried to use hypre's OMP support and we had
numerical problems. After a week of digging I found a hypre test case (i.e.,
no PETSc) that seemed to work with -O1 and failed with -O2 (the solver just
did not converge, and valgrind seemed clean). (This was using the 'ij' test
problem.) I then ran a PETSc test problem with this -O1 hypre build, and it
failed. I gave up at that point. Ulrike is in the loop and she agreed it
looked like a compiler problem.

If Intel can get this hypre test to work, they can tell me what they did
and I can try it again in PETSc. BTW, I looked at the hypre code and they
do not seem to do much, if any, fusing, etc.

And, this is all anecdotal and I do not want to imply that OMP or hypre or
Intel are bad in any way (in fact I like both hypre and Intel).


>
>>    Note: that for BLAS 1 operations likely the correct thing to do is
>> turn on MKL BLAS threading (being careful to make sure the number of
>> threads MKL uses matches that used by other parts of the code). This way we
>> don't need to OpenMP optimize many parts of PETSc's vector operations
>> (norm, dot, scale, axpy). In fact, this is the first thing Mark should do,
>> how much does it speed up the vector operations?
>>
>
> BLAS1 operations are all memory-bound unless running out of cache (in
> which case one shouldn't use threads) and compilers do a great job with
> them.  Just put the pragmas on and let the compiler do its job.
>
>
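
(A concrete version of that suggestion, just as a sketch and not anything
PETSc does today: keep MKL's thread count in sync with whatever OpenMP is
using, and let threaded MKL handle the BLAS1 calls. The sizes and calls
below are purely illustrative. Whether this buys much for memory-bound
BLAS1 at our problem sizes is exactly the question.)

#include <mkl.h>    /* mkl_set_num_threads, cblas_daxpy, cblas_ddot */
#include <omp.h>    /* omp_get_max_threads */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int     n = 1000000;
  double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
  for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

  /* Match MKL's thread pool to the rest of the (OpenMP) code, as Barry
     says; otherwise MKL uses its own default thread count. */
  mkl_set_num_threads(omp_get_max_threads());

  cblas_daxpy(n, 0.5, x, 1, y, 1);  /* y <- 0.5*x + y, threaded inside MKL */
  printf("dot = %g\n", cblas_ddot(n, x, 1, y, 1));

  free(x); free(y);
  return 0;
}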
>>   The problem is how many ECP applications actually use OpenMP just as a
>> #pragma optimization tool, or do they use other features of OpenMP. For
>> example I remember Brian wanted to/did use OpenMP threads directly in
>> BoxLib and didn't just stick to the #pragma model. If they did this then we
>> would need custom PETSc to match their model.
>>
>
> If this implies that BoxLib will use omp-parallel and then use explicit
> threading in a manner similar to MPI (omp_get_num_threads=MPI_Comm_size and
> omp_get_thread_num=MPI_Comm_rank), then this is the Right Way to write
> OpenMP.
>
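
For anyone following along, the pattern Jeff is describing looks roughly
like this (my sketch, not actual BoxLib/AMReX code): one parallel region,
with each thread doing its own explicit decomposition the way an MPI rank
would.

#include <omp.h>
#include <stdio.h>

int main(void)
{
  enum { N = 1000 };
  static double x[N];

  #pragma omp parallel
  {
    int size = omp_get_num_threads();  /* plays the role of MPI_Comm_size */
    int rank = omp_get_thread_num();   /* plays the role of MPI_Comm_rank */

    /* Each thread owns a contiguous block, just as an MPI rank would. */
    int chunk = (N + size - 1) / size;
    int lo    = rank * chunk;
    int hi    = lo + chunk < N ? lo + chunk : N;

    for (int i = lo; i < hi; i++)
      x[i] = 2.0 * i;
  }

  printf("x[%d] = %g\n", N - 1, x[N - 1]);
  return 0;
}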

Note: Chombo (Phil Colella) split from BoxLib (John Bell) about 15 years
ago (and added more C++), and BoxLib has since been refactored into AMReX.
Brian works with Chombo. Some staff are fungible and move between the two
projects. I don't think Brian is fungible.

