[petsc-dev] ASM for each field solve on GPUs

Barry Smith bsmith at petsc.dev
Wed Dec 30 22:00:50 CST 2020


  I was assuming the triangular systems were all solvable and set up in the correct format, so the triangular solver would not be the place to first detect problems with the factor. Though not completely general, that seems to cover most cases.
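
  To make the Begin/End idea quoted below concrete, here is a rough, untested sketch of the kind of thing I have in mind. The Block struct and the BlockSolvesBegin/BlockSolvesEnd names are just placeholders; the descriptors, csrsv2 analysis info, work buffers, and per-block streams are all assumed to have been created during the factorization, and no zero-pivot query is issued in the solve path, for the reason above.

#include <cuda_runtime.h>
#include <cusparse.h>

/* Placeholder for what the factorization/analysis phase would have set up per
   block: device CSR of the factor, L/U descriptors, csrsv2 analysis info,
   work buffer, device vectors, and one stream per block. */
typedef struct {
  int                n, nnz;
  cusparseMatDescr_t descrL, descrU;
  csrsv2Info_t       infoL, infoU;
  double             *val;
  int                *rowptr, *colind;
  double             *b, *t, *x;     /* rhs, temporary, solution (device) */
  void               *buf;
  cudaStream_t       stream;
} Block;

/* "Begin": queue the two triangular solves for every block, each block on its
   own stream, assuming (this is the point under discussion) that the cusparse
   launches really are non-blocking on the host. */
static void BlockSolvesBegin(cusparseHandle_t handle, Block *blk, int nblocks)
{
  const double one = 1.0;
  for (int i = 0; i < nblocks; i++) {
    cusparseSetStream(handle, blk[i].stream);  /* route this block's kernels */
    /* L t = b */
    cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          blk[i].n, blk[i].nnz, &one, blk[i].descrL,
                          blk[i].val, blk[i].rowptr, blk[i].colind, blk[i].infoL,
                          blk[i].b, blk[i].t, CUSPARSE_SOLVE_POLICY_NO_LEVEL, blk[i].buf);
    /* U x = t */
    cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          blk[i].n, blk[i].nnz, &one, blk[i].descrU,
                          blk[i].val, blk[i].rowptr, blk[i].colind, blk[i].infoU,
                          blk[i].t, blk[i].x, CUSPARSE_SOLVE_POLICY_USE_LEVEL, blk[i].buf);
  }
}

/* "End": the only CPU-side synchronization, once all blocks are queued. */
static void BlockSolvesEnd(Block *blk, int nblocks)
{
  for (int i = 0; i < nblocks; i++) cudaStreamSynchronize(blk[i].stream);
}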

> On Dec 30, 2020, at 9:49 PM, Jed Brown <jed at jedbrown.org> wrote:
> 
> Mark Adams <mfadams at lbl.gov> writes:
> 
>> On Wed, Dec 30, 2020 at 8:57 PM Barry Smith <bsmith at petsc.dev> wrote:
>> 
>>> 
>>> 
>>>> On Dec 30, 2020, at 7:30 PM, Jed Brown <jed at jedbrown.org> wrote:
>>>> 
>>>> Barry Smith <bsmith at petsc.dev> writes:
>>>> 
>>>>> If you are using direct solvers on each block on each GPU (several
>>>>> matrices on each GPU) you could pull apart, for example,
>>>>> MatSolve_SeqAIJCUSPARSE() and launch each of the matrix solves on a
>>>>> separate stream. You could use a MatSolveBegin/MatSolveEnd style or, as
>>>>> Jed may prefer, a Wait() model. Maybe a couple of hours of coding to
>>>>> produce a prototype MatSolveBegin/MatSolveEnd from MatSolve_SeqAIJCUSPARSE.
>>>> 
>>>> I doubt cusparse_solve is a single kernel launch (and there are two of
>>>> them already). You'd almost certainly need a thread to keep driving it, or
>>>> an async/await model. Begin/End pairs for compute (even "offloaded"
>>>> compute) are no small change.
>>> 
>>>  Why, it can simply launch the 4 non-blocking kernels needed in the same
>>> stream for a given matrix and then go to the next matrix and do the same in
>>> the next stream. If the GPU is smart enough to manage utilizing the
>>> multiple streams, I don't see why any baby-sitting by the CPU is needed at
>>> all. Note there is no CPU work needed between each of the 4 kernels that I
>>> can see.
>>> 
>> 
>> I agree. The GPU scheduler can partition the GPU in space and time to keep
>> it busy. For instance, a simple model for my 10 solves is: loop over all
>> blocks, do a non-blocking Solve on each, then wait. My solves might fill
>> 1/10 of the GPU, say, and I get a 10x speedup. I think this is theoretically
>> possible, and there will be some inefficiency, but I have noticed that my
>> current code overlaps CPU and GPU work in separate MPI processes, which is
>> just one way to do things asynchronously. There are mechanisms to do this
>> with one process.
> 
> I missed that cusparseDcsrsv2_solve() supports asynchronous execution; however, it appears that it needs to do some work (launching a kernel to inspect device memory and waiting for it to complete) to know what error to return (at least for the factor that does not have a unit diagonal).
> 
> | Function csrsv2_solve() reports the first numerical zero, including a structural zero. If status is 0, no numerical zero was found. Furthermore, no numerical zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if A(j,j) is zero for some j. The user needs to call cusparseXcsrsv2_zeroPivot() to know where the numerical zero is.
> 
> https://docs.nvidia.com/cuda/cusparse/index.html#csrsv2_solve
> 
> As such, I remain skeptical that you can just fire off a bunch of these without incurring significant serialization penalty.
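
  For reference, the host-side query Jed refers to looks roughly like the sketch below (the wrapper name is just for illustration). Because position lives in host memory, the call has to wait for the device work recorded in the csrsv2 info before it can return; that wait is exactly the synchronization I would rather avoid by not checking pivots in the solve path at all.

#include <cusparse.h>

/* Illustrative wrapper around the blocking zero-pivot query. Returns the
   row/column of the first numerically zero diagonal entry, or -1 if none. */
static int CheckZeroPivot(cusparseHandle_t handle, csrsv2Info_t info)
{
  int position = -1;   /* host memory: the query must wait on the device */
  if (cusparseXcsrsv2_zeroPivot(handle, info, &position) == CUSPARSE_STATUS_ZERO_PIVOT)
    return position;
  return -1;
}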


