[petsc-users] MatCreateSeqAIJWithArrays for GPU / cusparse

Junchao Zhang junchao.zhang at gmail.com
Sat Jan 7 10:39:23 CST 2023


I see.  Thanks a lot.
--Junchao Zhang


On Sat, Jan 7, 2023 at 6:15 AM Mark Lohry <mlohry at gmail.com> wrote:

> I've worked on a few different codes doing matrix assembly on the GPU
> independently of petsc. In all instances, to plug into petsc all I need are
> the device CSR pointers and some guarantee they don't move around (on my
> first try, without SetPreallocation on the CPU, I saw the value array
> pointer move after the first solve). It would also be nice to have a
> guarantee there aren't any unnecessary copies, since memory constraints are
> always a concern.
>
> Here I call
> MatCreateSeqAIJCUSPARSE
> MatSeqAIJSetPreallocationCSR (filled from a preexisting CSR on the host,
> with the correct index arrays and zeros for the values)
> MatSeqAIJGetCSRAndMemType (to grab the allocated device CSR pointers and
> use them directly)
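>
> A minimal sketch of that sequence (not my actual code; rowptr, colind, and
> the row count n are hypothetical host-side CSR data) would look roughly like:
>
> Mat            J;
> const PetscInt *d_i, *d_j; /* device CSR row offsets / column indices */
> PetscScalar    *d_a;       /* device CSR values, to be filled on the GPU */
> PetscMemType   mtype;
>
> PetscCall(MatCreateSeqAIJCUSPARSE(PETSC_COMM_SELF, n, n, 0, NULL, &J));
> /* preallocate from the existing host CSR; NULL (i.e. zeros) for the values */
> PetscCall(MatSeqAIJSetPreallocationCSR(J, rowptr, colind, NULL));
> /* grab the device pointers petsc allocated; mtype reports where they live */
> PetscCall(MatSeqAIJGetCSRAndMemType(J, &d_i, &d_j, &d_a, &mtype));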
>
> Then in the Jacobian evaluation routine I fill that CSR directly, with no
> calls to MatSetValues; I just call
>
> MatAssemblyBegin(J,MAT_FINAL_ASSEMBLY);
> MatAssemblyEnd(J,MAT_FINAL_ASSEMBLY);
>
> afterward to put the matrix in the correct state.
>
> In this code, to fill the CSR coefficients each GPU thread gets one row and
> fills it, so there are no race conditions to contend with. Technically I'm
> duplicating some computations (a given dof could fill both its own row and
> its own column), but this is much faster than the linear solver anyway.
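>
> For illustration, a sketch of that one-row-per-thread fill (not the actual
> code; computeCoeff is a hypothetical stand-in for the real per-entry
> Jacobian evaluation, real scalars assumed, device pointers reused from
> above):
>
> __device__ PetscScalar computeCoeff(PetscInt row, PetscInt col)
> {
>   return (row == col) ? 1.0 : 0.0; /* placeholder for the real entry */
> }
>
> __global__ void fillJacobian(PetscInt n, const PetscInt *d_i,
>                              const PetscInt *d_j, PetscScalar *d_a)
> {
>   PetscInt row = blockIdx.x * blockDim.x + threadIdx.x;
>   if (row >= n) return;                 /* one thread owns one row */
>   for (PetscInt k = d_i[row]; k < d_i[row + 1]; k++)
>     d_a[k] = computeCoeff(row, d_j[k]); /* no other thread writes this row */
> }
>
> /* launch, then flag the new values as final */
> fillJacobian<<<(n + 255) / 256, 256>>>(n, d_i, d_j, d_a);
> PetscCall(MatAssemblyBegin(J, MAT_FINAL_ASSEMBLY));
> PetscCall(MatAssemblyEnd(J, MAT_FINAL_ASSEMBLY));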
>
> Other mesh-based codes did GPU assembly using either coloring or mutexes,
> but they too just need the CSR value array to fill.
>
>
> On Fri, Jan 6, 2023, 9:44 PM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>>
>>
>>
>> On Fri, Jan 6, 2023 at 7:35 PM Mark Lohry <mlohry at gmail.com> wrote:
>>
>>> Well, I think it's a moderately crazy idea unless it's less painful to
>>> implement than I'm thinking. Is there a use case for a mixed-device system
>>> where one petsc executable might be addressing both a HIP and a CUDA
>>> device, beyond some Frankenstein test system somebody cooked up? In all my
>>> code I implicitly assume I have either one host with one device or one
>>> host with zero devices. I guess you can support these weird scenarios, but
>>> why? Life is hard enough supporting one device compiler with one host
>>> compiler.
>>>
>>> Many thanks Junchao -- with combinations of SetPreallocation calls I was
>>> able to grab the allocated pointers out of petsc. Now I have all of the
>>> Jacobian construction on the device with no copies.
>>>
>> Hi, Mark, could you say a few words about how you assemble matrices on
>> GPUs?  We ported MatSetValues-like routines to GPUs but did not continue
>> that approach, since we would have to resolve data races between GPU
>> threads.
>>
>>
>>>
>>> On Fri, Jan 6, 2023 at 12:27 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>>   So Jed's "everyone" now consists of "no one", and Jed can stop
>>>> complaining that "everyone" thinks it is a bad idea.
>>>>
>>>>
>>>>
>>>> On Jan 5, 2023, at 11:50 PM, Junchao Zhang <junchao.zhang at gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jan 5, 2023 at 10:32 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>>
>>>>> > On Jan 5, 2023, at 3:42 PM, Jed Brown <jed at jedbrown.org> wrote:
>>>>> >
>>>>> > Mark Adams <mfadams at lbl.gov> writes:
>>>>> >
>>>>> >> Support of HIP and CUDA hardware together would be crazy,
>>>>> >
>>>>> > I don't think it's remotely crazy. libCEED supports both together, and
>>>>> > it's very convenient when testing on a development machine that has
>>>>> > one GPU of each brand; it also simplifies binary distribution for us
>>>>> > and for every package that uses us. Every day I wish PETSc could build
>>>>> > with both simultaneously, but everyone tells me it's silly.
>>>>>
>>>>>   Not everyone at all; just a subset of everyone. Junchao is really
>>>>> the hold-out :-)
>>>>>
>>>> I am not; instead, I think we should try (I fully agree it could ease
>>>> binary distribution).  But Satish needs to install such a machine first :)
>>>> There are issues out of our control if we want to mix GPUs in
>>>> execution.  For example, how do we do a VecAXPY on a CUDA vector and a
>>>> HIP vector? Shall we do it on the host? Also, there are no GPU-aware MPI
>>>> implementations supporting messages between CUDA memory and HIP memory.
>>>>
>>>>>
>>>>>   I just don't care about "binary packages" :-); I think they are an
>>>>> archaic and bad way of thinking about code distribution (yes, the
>>>>> alternatives need lots of work to make them flawless, but I think that
>>>>> is where the effort in the packaging world should go).
>>>>>
>>>>>    I go further and think one should be able to automatically use a CUDA
>>>>> vector on a HIP device as well; it is not hard in theory, but it
>>>>> requires thinking a little about how we handle classes and subclasses to
>>>>> make it straightforward; or perhaps Jacob has fixed that also?
>>>>
>>>>
>>>>
>>>>
>>>>

