[petsc-dev] GAMG error with MKL

Jed Brown jed at jedbrown.org
Wed Jul 11 01:55:28 CDT 2018


Jeff Hammond <jeff.science at gmail.com> writes:

> On Tue, Jul 10, 2018 at 9:33 AM, Jed Brown <jed at jedbrown.org> wrote:
>
>> Mark Adams <mfadams at lbl.gov> writes:
>>
>> > On Mon, Jul 9, 2018 at 7:19 PM Jeff Hammond <jeff.science at gmail.com>
>> > wrote:
>> >
>> >>
>> >>
>> >> On Mon, Jul 9, 2018 at 7:38 AM, Mark Adams <mfadams at lbl.gov> wrote:
>> >>
>> >>> I agree with Matt's comment and let me add (somewhat redundantly)
>> >>>
>> >>>
>> >>>> This isn't how you'd write MPI, is it?  No, you'd figure out how to
>> >>>> decompose your data properly to exploit locality and then implement an
>> >>>> algorithm that minimizes communication and synchronization.  Do that
>> >>>> with OpenMP.
>> >>>>
>> >>>
>> >>> I have never seen a DOE app that does this correctly: get your data
>> >>> model figured out first, then implement.
>> >>>
>> >>
>> >> Chris Kerr's weather code (GFDL Hiram) has a single OpenMP parallel
>> >> region.  He was at the last NERSC workshop I attended.  You should talk
>> >> to him.
>> >>
>> >>
>> > He is the last person I need to talk to :) but I wish my fusion
>> > colleagues had walked down the road and talked with him 10 years ago.
>>
>> I don't know if Chris has ever lived there.  And he's great, but GFDL is
>> an application, not a library.
>>
>> Jeff, let us know when MKL can be called collectively from within an omp
>> parallel region.
>>
>
> It can be called like that already, although I expect it to serialize by
> default to avoid nested parallelism.  You can always do something like the
> following, which may benefit from KMP_HOT_TEAMS_MODE=1.
>
> /* ARGS is a placeholder for the routine's argument list */
> void mkl_foo_wrapper(ARGS)
> {
>   if (omp_in_parallel()) {
>     #pragma omp master
>     {
>       /* give MKL the whole team's worth of threads */
>       mkl_set_num_threads(omp_get_num_threads());
>       mkl_foo(ARGS);
>     }
>   } else {
>     mkl_foo(ARGS);
>   }
> }

If we do this with OpenMP and mkl_foo opens its own #pragma omp
parallel region, that nested region would run with only one thread by
default.  As an implementation matter, my understanding is that the
non-master threads wait at the next barrier while the master executes
the region, and I don't see how their control flow could transfer to a
nested parallel region inside mkl_foo.
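
A minimal sketch of that default behavior (fake_mkl_foo is a stand-in
for an MKL routine that opens its own parallel region; assumes nested
parallelism is left at OpenMP's default, disabled):

#include <omp.h>
#include <stdio.h>

/* stand-in for an MKL routine that opens its own parallel region */
static void fake_mkl_foo(void)
{
  #pragma omp parallel
  {
    #pragma omp master
    printf("inner team size: %d\n", omp_get_num_threads());
  }
}

int main(void)
{
  #pragma omp parallel
  {
    #pragma omp master
    fake_mkl_foo();  /* nested region: prints 1 unless nesting is enabled */
  }
  return 0;
}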

And if several threads call a dot product as above, how does MKL know
whether it is an independent dot product for each thread or a collective
dot product where all the threads contribute to a reduction and
(presumably) all threads get the result of the collective dot product?
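
To make the distinction concrete, here is a hand-rolled sketch of the
collective variant (collective_ddot is hypothetical, not an MKL entry
point): every thread of the existing team contributes its slice and
every thread receives the global result.  If each thread instead
called cblas_ddot on the full vectors, the call sites would look
identical to MKL.

#include <omp.h>
#include <mkl_cblas.h>

/* hypothetical collective dot product: must be called by every thread
   of one parallel region (one team at a time; a real API would carry
   team state); each thread contributes a slice of x and y, and all
   threads return the same global result */
double collective_ddot(int n, const double *x, const double *y)
{
  static double result;
  int t = omp_get_thread_num(), p = omp_get_num_threads();
  int chunk = (n + p - 1) / p;
  int lo = t * chunk < n ? t * chunk : n;
  int hi = lo + chunk < n ? lo + chunk : n;
  double part = cblas_ddot(hi - lo, x + lo, 1, y + lo, 1);

  #pragma omp single
  result = 0.0;            /* single has an implied barrier */
  #pragma omp atomic
  result += part;
  #pragma omp barrier      /* wait for all partial sums */
  return result;
}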

This is square one for any library that wishes to be called for
collective operations within parallel regions.  As far as I can tell,
the OpenMP committee has completely ignored such users.

> Of course, the right solution is for MKL to have entry points that don't
> create a parallel region.  We have discussed this but it's not clear how
> important it is.  What MKL functions would you like to support this
> interface?  "All of them" is not going to be productive for either of us.

The same question applies to PETSc, and because users are developing
new preconditioners and new Krylov methods (the E is for Extensible),
the answer is absolutely "all of them", from dot products to entire
solves.
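
As a concrete illustration, even a toy user-written solver touches
most of the primitive operations.  A minimal sketch using the public
PETSc API (error checking omitted; toy_richardson and the tolerance
are hypothetical):

#include <petscksp.h>

/* toy Richardson iteration x += omega * (b - A*x): even this needs
   MatMult, VecAYPX, VecNorm (a reduction underneath), and VecAXPY,
   so a threading scheme that cannot reach every primitive cannot
   support user-written solvers */
PetscErrorCode toy_richardson(Mat A, Vec b, Vec x, PetscReal omega, PetscInt its)
{
  Vec       r;
  PetscReal rnorm;

  VecDuplicate(b, &r);
  for (PetscInt i = 0; i < its; i++) {
    MatMult(A, x, r);            /* r = A x      */
    VecAYPX(r, -1.0, b);         /* r = b - r    */
    VecNorm(r, NORM_2, &rnorm);  /* residual norm */
    if (rnorm < 1e-12) break;    /* arbitrary tolerance for this sketch */
    VecAXPY(x, omega, r);        /* x += omega r */
  }
  VecDestroy(&r);
  return 0;
}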

