[petsc-dev] ASM for each field solve on GPUs

Jed Brown jed at jedbrown.org
Wed Dec 30 21:49:16 CST 2020


Mark Adams <mfadams at lbl.gov> writes:

> On Wed, Dec 30, 2020 at 8:57 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>
>> > On Dec 30, 2020, at 7:30 PM, Jed Brown <jed at jedbrown.org> wrote:
>> >
>> > Barry Smith <bsmith at petsc.dev> writes:
>> >
>> >>  If you are using direct solvers on each block on each GPU (several
>> >> matrices on each GPU) you could pull apart, for example,
>> >> MatSolve_SeqAIJCUSPARSE() and launch each of the matrix solves on a
>> >> separate stream. You could use a MatSolveBegin/MatSolveEnd style or, as
>> >> Jed may prefer, a Wait() model. Maybe a couple of hours of coding to
>> >> produce a prototype MatSolveBegin/MatSolveEnd from MatSolve_SeqAIJCUSPARSE.
>> >
>> > I doubt cusparse_solve is a single kernel launch (and there are two of
>> > them already). You'd almost certainly need a thread to keep driving it, or
>> > an async/await model. Begin/End pairs for compute (even "offloaded"
>> > compute) are no small change.
>>
>>   Why, it can simply launch the 4 non-blocking kernels needed in the same
>> stream for a given matrix and then go to the next matrix and do the same in
>> the next stream. If the GPU is smart enough to manage utilizing the
>> multiple streams, I don't see why any baby-sitting by the CPU is needed at
>> all. Note there is no CPU work needed between each of the 4 kernels that I
>> can see.
>>
>
> I agree. The GPU scheduler can partition the GPU in space and time to keep
> it busy. For instance, a simple model for my 10 solves is: loop over all
> blocks, do a non-blocking solve, and wait. My solves might fill 1/10 of the
> GPU, say, and I get a 10x speedup. I think this is theoretically possible,
> and there will be inefficiency, but I have noticed that my current code
> overlaps CPU and GPU work in separate MPI processes, which is just one way
> to do things asynchronously. There are mechanisms to do this with one
> process.
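
For concreteness, here is roughly the loop I understand is being proposed (a sketch only: the BlockFactors struct, the per-block device arrays, and the single shared cuSPARSE handle are placeholders, not the actual MatSolve_SeqAIJCUSPARSE internals, and error checking is omitted).

#include <cuda_runtime.h>
#include <cusparse.h>

/* Hypothetical per-block data: CSR factors, analysis info, and work buffers
   produced by the usual csrsv2_bufferSize/csrsv2_analysis setup. */
typedef struct {
  int                n, nnzL, nnzU;
  cusparseMatDescr_t descrL, descrU;
  double            *valL, *valU;
  int               *rowL, *colL, *rowU, *colU;
  csrsv2Info_t       infoL, infoU;
  void              *bufL, *bufU;
  double            *b, *tmp, *x;   /* device RHS, work vector, solution */
} BlockFactors;

static void SolveAllBlocksAsync(cusparseHandle_t handle, BlockFactors *blk,
                                cudaStream_t *stream, int nblocks)
{
  const double one = 1.0;
  for (int i = 0; i < nblocks; i++) {
    /* Bind subsequent cuSPARSE launches to this block's stream */
    cusparseSetStream(handle, stream[i]);
    /* L*tmp = b, then U*x = tmp; both launches return without waiting on the
       host, and in-stream ordering keeps the two solves correctly sequenced */
    cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          blk[i].n, blk[i].nnzL, &one, blk[i].descrL,
                          blk[i].valL, blk[i].rowL, blk[i].colL, blk[i].infoL,
                          blk[i].b, blk[i].tmp,
                          CUSPARSE_SOLVE_POLICY_USE_LEVEL, blk[i].bufL);
    cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          blk[i].n, blk[i].nnzU, &one, blk[i].descrU,
                          blk[i].valU, blk[i].rowU, blk[i].colU, blk[i].infoU,
                          blk[i].tmp, blk[i].x,
                          CUSPARSE_SOLVE_POLICY_USE_LEVEL, blk[i].bufU);
  }
  /* The "End" half of a Begin/End pair: wait for all streams at once */
  for (int i = 0; i < nblocks; i++) cudaStreamSynchronize(stream[i]);
}

(A handle per stream would avoid rebinding the stream on every iteration, but the structure is the same.)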

I had missed that cusparseDcsrsv2_solve() supports asynchronous execution; however, it appears that it needs to do some work (launching a kernel to inspect device memory and waiting for it to complete) to know what error to return (at least for the factor that does not have a unit diagonal).

| Function csrsv2_solve() reports the first numerical zero, including a structural zero. If status is 0, no numerical zero was found. Furthermore, no numerical zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if A(j,j) is zero for some j. The user needs to call cusparseXcsrsv2_zeroPivot() to know where the numerical zero is.

https://docs.nvidia.com/cuda/cusparse/index.html#csrsv2_solve
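
Concretely, the check implied by that documentation looks something like the sketch below (again using the hypothetical BlockFactors placeholders from the sketch above, not PETSc code), and it is where the per-block solves get serialized again.

#include <cusparse.h>

static cusparseStatus_t CheckBlockPivot(cusparseHandle_t handle,
                                        BlockFactors *blk, int i)
{
  int position = -1;
  /* Reading the pivot position back into host memory only makes sense once
     the asynchronous solve recorded in infoU has completed, so this call
     has to wait for that solve to finish -- the serialization point. */
  cusparseStatus_t stat = cusparseXcsrsv2_zeroPivot(handle, blk[i].infoU,
                                                    &position);
  if (stat == CUSPARSE_STATUS_ZERO_PIVOT) {
    /* position is the index of the zero (or structurally missing)
       diagonal entry in this block's factor */
  }
  return stat;
}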

As such, I remain skeptical that you can just fire off a bunch of these without incurring a significant serialization penalty.

