[petsc-dev] CUDA + OMP make error

Mark Adams mfadams at lbl.gov
Thu Apr 16 10:05:58 CDT 2020


On Thu, Apr 16, 2020 at 10:18 AM Matthew Knepley <knepley at gmail.com> wrote:

> On Thu, Apr 16, 2020 at 10:11 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>> On Thu, Apr 16, 2020 at 9:31 AM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Thu, Apr 16, 2020 at 8:42 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> Yea, GPU assembly would be great. I was figuring OMP might be simpler.
>>>>
>>>> As far as the interface goes, I am flexible. The simplest way to do it
>>>> would be to take an array of element matrices and a DMPlex and call
>>>> DMPlexMatSetClosure. You can see this code in
>>>> mark/feature-xgc-interface-rebase, at the bottom of
>>>> src/vec/vec/impls/seq/seqcuda/landau.cu.
>>>>
>>>> I was shy about putting a version of DMPlexMatSetClosure in CUDA, but
>>>> maybe that is easier: just plow through it and cut the stuff that we don't
>>>> need. OMP broke because there are some temp arrays that Matt caches that
>>>> need to be made "private" or dealt with in some way.
>>>>
>>>
>>> We should refactor so that all temp arrays are sized and constructed up
>>> front, and then the work is done in an internal function which is passed
>>> those arrays. I tried to do this, but might have crapped out here. Then
>>> you can just call the internal function directly with your arrays.
>>>
>>
>> Just to be clear, we have to copy the code to a .cu file and declare
>> everything as a device method (__global__), right?
>>
>> Or write a batched version of DMPlexMatSetClosure, and when it gets down
>> to some kernel like MatSetValues we, for example, move the Mat pointers to
>> the GPU, copy the element matrices to the device, and, in a .cu file, launch
>> a kernel that calls a __global__ version of MatSetValues_SeqAIJ (with the
>> mallocs stripped out), then copy the Mat pointers back to the CPU. All this
>> copy stuff is usually done with a shadow copy of the object, but that is
>> complicated by cusparse matrices, which, from what I can tell, have a
>> different, cusparse-friendly, device Mat object.
>>
>
> I would just get the closure indices for that batch of cells, push them to
> the GPU, and call MatSetValues() from the GPU.
>

How do you get the closure indices? I looked and could not find it.

This would fix the first OMP problem by avoiding Plex, and maybe
MatSetValues is already thread safe or can be made so easily (without
mallocs).
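
Something like this is what I would try (just a sketch of the idea, not the
actual PETSc API; I am guessing at DMPlexGetClosureIndices() and friends, and
assuming a fixed number of dofs per cell):

  /* Sketch: gather closure (global) indices for a batch of cells on the CPU,
     then push the flattened index array to the GPU. Names and signatures are
     guesses; nDofPerCell, cStart, cEnd, dm, section, globalSection are assumed
     to exist in the caller. */
  PetscInt      *cellIdx, *d_cellIdx;
  PetscInt       c, nIdx = 0, N = nDofPerCell*(cEnd - cStart);
  PetscErrorCode ierr;

  ierr = PetscMalloc1(N, &cellIdx);CHKERRQ(ierr);
  for (c = cStart; c < cEnd; ++c) {
    PetscInt nc, *idx = NULL;
    ierr = DMPlexGetClosureIndices(dm, section, globalSection, c, &nc, &idx, NULL);CHKERRQ(ierr);
    ierr = PetscArraycpy(&cellIdx[nIdx], idx, nc);CHKERRQ(ierr);
    nIdx += nc;
    ierr = DMPlexRestoreClosureIndices(dm, section, globalSection, c, &nc, &idx, NULL);CHKERRQ(ierr);
  }
  /* push the flattened index array to the device */
  cudaMalloc((void **)&d_cellIdx, nIdx*sizeof(PetscInt));
  cudaMemcpy(d_cellIdx, cellIdx, nIdx*sizeof(PetscInt), cudaMemcpyHostToDevice);
  ierr = PetscFree(cellIdx);CHKERRQ(ierr);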

Not that I'm a big OMP fan, but I think the work of an OMP implementation is
a subset of the GPU implementation. E.g., MatSetValues will have to be copied
into a __global__ method for GPUs, and it will essentially need to be made
thread safe. If we make it thread safe first, we get OMP for free. We can
then compare the two, which would make a paper.
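
For the GPU version, what I am picturing is roughly this (a sketch only, not
real PETSc code: a stripped-down, malloc-free MatSetValues for preallocated
AIJ storage, using atomics instead of coloring to make the adds thread safe;
atomicAdd on double needs a reasonably new GPU, and this would not work as-is
for complex scalars):

  /* Sketch of a malloc-free, device-side MatSetValues for preallocated CSR
     (AIJ) storage. ai/aj/aa are the usual CSR row pointers, column indices,
     and values; idx/elemMats hold the closure indices and element matrices
     for a batch of cells. All of these names are made up. */
  __device__ void MatSetValuesDevice(PetscInt nr, const PetscInt *rows, const PetscInt *cols,
                                     const PetscScalar *v, const PetscInt *ai,
                                     const PetscInt *aj, PetscScalar *aa)
  {
    for (PetscInt i = 0; i < nr; ++i) {
      for (PetscInt j = 0; j < nr; ++j) {
        /* search row rows[i] for column cols[j]; assumes the entry was
           preallocated, so there are no mallocs */
        for (PetscInt k = ai[rows[i]]; k < ai[rows[i]+1]; ++k) {
          if (aj[k] == cols[j]) { atomicAdd(&aa[k], v[i*nr + j]); break; }
        }
      }
    }
  }

  __global__ void AssembleBatch(PetscInt nCells, PetscInt nr, const PetscInt *idx,
                                const PetscScalar *elemMats, const PetscInt *ai,
                                const PetscInt *aj, PetscScalar *aa)
  {
    PetscInt c = blockIdx.x*blockDim.x + threadIdx.x; /* one thread per cell */
    if (c < nCells) MatSetValuesDevice(nr, idx + c*nr, idx + c*nr, elemMats + c*nr*nr, ai, aj, aa);
  }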


> Here is the way I am thinking. You are not going to push the Plex to the
> GPU, so you have to do Plex->Indices on the CPU. Just do that,
> push the indices down, and use the PETSc Mat from there.
>
>   Matt
>
>
>>
>>>   Matt
>>>
>>>
>>>> Coloring is not attractive to me because GPUs demand a lot of
>>>> parallelism, and this serial (velocity space) solver would be embedded in
>>>> a full 3D code that does not use a huge amount of MPI parallelism. For
>>>> instance, if the app code were to use 6 (or 7 max on SUMMIT) cores per GPU
>>>> (or even 4x that with hardware threads), then *I could imagine* there
>>>> would be enough parallelism, with coloring, to fuse the element
>>>> construction and assembly, so assembling the element matrices right
>>>> after they are created. That would be great in terms of not storing all
>>>> these matrices and then assembling them all at once. The app that I am
>>>> targeting does not use that much MPI parallelism, though. But we could
>>>> explore that (coloring) space, and my mental model could be inaccurate.
>>>> (Note, I did recently add 8x more parallelism to my code this week and got
>>>> a 25% speedup, using one whole GPU.)
>>>>
>>>> Or if you have some sort of lower-level synchronization that could
>>>> allow for fusing the assembly with the element creation, then, by all
>>>> means, we can explore that.
>>>>
>>>> I'd be happy to work with you on this.
>>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>> On Mon, Apr 13, 2020 at 7:08 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>> wrote:
>>>>
>>>>> Probably matrix assembly on the GPU is more important. Do you have an
>>>>> example for me to play with, to see what GPU interface we should have?
>>>>> --Junchao Zhang
>>>>>
>>>>> On Mon, Apr 13, 2020 at 5:44 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>>> I was looking into assembling matrices with threads. I have a
>>>>>> coloring to avoid conflicts.
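>>>>>>
>>>>>> Roughly what I have in mind is below (a sketch; the coloring data
>>>>>> structure and computeElementMatrix() are made up, and real code would
>>>>>> check the error codes):
>>>>>>
>>>>>>   /* Sketch: assemble element matrices with OpenMP, using a coloring so
>>>>>>      that no two threads touch the same matrix rows at the same time.
>>>>>>      colorCells[c] lists the cells of color c; the helpers are made up. */
>>>>>>   for (PetscInt color = 0; color < nColors; ++color) {
>>>>>>     #pragma omp parallel for
>>>>>>     for (PetscInt i = 0; i < nCellsOfColor[color]; ++i) {
>>>>>>       PetscInt    cell = colorCells[color][i];
>>>>>>       PetscScalar elemMat[MAX_DOF*MAX_DOF];
>>>>>>       PetscInt    idx[MAX_DOF];
>>>>>>       computeElementMatrix(cell, idx, elemMat);              /* hypothetical */
>>>>>>       MatSetValues(A, nDof, idx, nDof, idx, elemMat, ADD_VALUES);
>>>>>>     }
>>>>>>   }
>>>>>>   MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>>>>>   MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);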
>>>>>>
>>>>>> Turning off all the logging seems way overkill; for methods that
>>>>>> can get called in a thread, we could use PETSC_HAVE_THREADSAFETY
>>>>>> to protect the logging functions. So one can still get timings for the
>>>>>> whole assembly process, just not for MatSetValues. Few people are going to
>>>>>> do this. I don't think it will be a time sink, and if it is, we just revert
>>>>>> back to saying 'turn logging off'. Logging is pretty important, so I don't
>>>>>> see a good argument for insisting on turning it off if we just say
>>>>>> that we are going to protect methods as needed.
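>>>>>>
>>>>>> To be concrete, this is the kind of protection I mean (a sketch;
>>>>>> PetscLogEventBegin() and MAT_SetValues are real, but the critical
>>>>>> section here is just to illustrate the idea, not how it should
>>>>>> actually be done):
>>>>>>
>>>>>>   /* guard a logging call so MatSetValues can be called from a thread */
>>>>>>   #if defined(PETSC_HAVE_THREADSAFETY)
>>>>>>   #pragma omp critical (petsc_logging)
>>>>>>   #endif
>>>>>>   {
>>>>>>     PetscLogEventBegin(MAT_SetValues, mat, 0, 0, 0);
>>>>>>   }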
>>>>>>
>>>>>> It is not a big deal; I am just exploring this idea. It is such a
>>>>>> basic concept in shared-memory sparse linear algebra that it seems like a
>>>>>> good thing to be able to support and have in an example, to say we can
>>>>>> assemble matrices in threads (not that it is a great idea). We have all the
>>>>>> tools (e.g., coloring methods), so it is just a matter of protecting a
>>>>>> few methods. I use DMPlexMatSetClosure instead of MatSetValues, and this is
>>>>>> where I die now with non-thread-safe code. We have an idea, from Jed, on
>>>>>> how to fix it.
>>>>>>
>>>>>> Anyway, thanks for your help, but I think we should hold off on doing
>>>>>> anything until we have some consensus that this would be a good idea to put
>>>>>> some effort into getting a thread safe PETSc that can support OMP matrix
>>>>>> assembly with a nice compact example.
>>>>>>
>>>>>> Thanks again,
>>>>>> Mark
>>>>>>
>>>>>> On Mon, Apr 13, 2020 at 5:44 PM Junchao Zhang <
>>>>>> junchao.zhang at gmail.com> wrote:
>>>>>>
>>>>>>> Mark,
>>>>>>>  I saw you had "--with-threadsafety --with-log=0".  Do you really
>>>>>>> want to call PETSc from multiple threads (in contrast to letting PETSc call
>>>>>>> other libraries, e.g., BLAS, that do the multithreading)?  If not, you can
>>>>>>> drop --with-threadsafety.
>>>>>>>  I have https://gitlab.com/petsc/petsc/-/merge_requests/2714 that
>>>>>>> should fix your original compilation errors.
>>>>>>>
>>>>>>> --Junchao Zhang
>>>>>>>
>>>>>>> On Mon, Apr 13, 2020 at 2:07 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>>
>>>>>>>> https://www.mcs.anl.gov/petsc/miscellaneous/petscthreads.html
>>>>>>>>
>>>>>>>> and I see this on my Mac:
>>>>>>>>
>>>>>>>> 14:23 1 mark/feature-xgc-interface-rebase *= ~/Codes/petsc$
>>>>>>>> ../arch-macosx-gnu-O-omp.py
>>>>>>>> ===============================================================================
>>>>>>>>              Configuring PETSc to compile on your system
>>>>>>>> ===============================================================================
>>>>>>>> Warning: PETSC_ARCH from environment does not match command-line or
>>>>>>>> name of script.
>>>>>>>> Warning: Using from command-line or name of script:
>>>>>>>> arch-macosx-gnu-O-omp, ignoring environment: arch-macosx-gnu-g
>>>>>>>> ===============================================================================
>>>>>>>> TESTING: configureLibraryOptions from
>>>>>>>> PETSc.options.libraryOptions(config/PETSc/options/libraryOptions.py:37)
>>>>>>>> *******************************************************************************
>>>>>>>>          UNABLE to CONFIGURE with GIVEN OPTIONS    (see
>>>>>>>> configure.log for details):
>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>> Must use --with-log=0 with --with-threadsafety
>>>>>>>> *******************************************************************************
>>>>>>>>
>>>>>>>> On Mon, Apr 13, 2020 at 2:54 PM Junchao Zhang <
>>>>>>>> junchao.zhang at gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Apr 13, 2020 at 12:06 PM Mark Adams <mfadams at lbl.gov>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> BTW, I can build on SUMMIT with logging and OMP, apparently. I
>>>>>>>>>> also seem to be able to build with debugging. Neither of which is allowed
>>>>>>>>>> according to the docs. I am puzzled.
>>>>>>>>>>
>>>>>>>>>  What are "the docs"?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 13, 2020 at 12:05 PM Mark Adams <mfadams at lbl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think the problem is that you have to turn off logging with
>>>>>>>>>>> OpenMP, and the (newish) GPU code did not protect its timers.
>>>>>>>>>>>
>>>>>>>>>>> I don't see a good reason to require logging to be turned off with
>>>>>>>>>>> OMP. We could use PETSC_HAVE_THREADSAFETY to protect the logging calls
>>>>>>>>>>> that we care about (e.g., in MatSetValues), and as users discover more
>>>>>>>>>>> things that they want to call in an OMP thread block, tell them to turn
>>>>>>>>>>> logging off and we will fix it when we can.
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts on the idea of letting users keep logging with
>>>>>>>>>>> openmp?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 13, 2020 at 11:40 AM Junchao Zhang <
>>>>>>>>>>> junchao.zhang at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes. Looks like we need to include petsclog.h. I don't know why OMP
>>>>>>>>>>>> triggered the error.
>>>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Apr 13, 2020 at 9:59 AM Mark Adams <mfadams at lbl.gov>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Should I do an MR to fix this?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>