[petsc-dev] ASM for each field solve on GPUs

Mark Adams mfadams at lbl.gov
Wed Dec 30 22:14:54 CST 2020


For the near future I am going to be driving this with multiple MPI
processes per GPU, and the LU factorizations are the big problem. I can
use the existing serial ASM solver, which would call cuSparse from each
MPI process, so that will run as is. That is what I need for a paper: LU
factorization on the GPU, and splitting these 10 solves off for SuperLU
is the first step.
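
For reference, a rough sketch of the runtime options for that
configuration (per-block LU on the GPU via cuSparse under ASM; option
names as I understand them, so check them against the PETSc version in
use) would be something like:

  -mat_type aijcusparse -vec_type cuda \
  -pc_type asm -sub_ksp_type preonly -sub_pc_type lu \
  -sub_pc_factor_mat_solver_type cusparse

(Depending on how the matrices are created, -dm_mat_type aijcusparse may
be needed instead of -mat_type.) Swapping the sub-solver type to superlu
or superlu_dist would be the analogous way to hand the block
factorizations to SuperLU.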

A user, however, may not want to run with 7 MPI processes per GPU (on
Summit), so in that case some sort of asynchronous approach will be
needed. But that is later. A code that I work with uses Kokkos, and they
use OpenMP to drive asynchronous GPU work.
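
For what it is worth, a bare-bones sketch of that host-thread-driven
pattern (OpenMP threads each pushing work onto their own CUDA stream;
launch_block_solve() is just a placeholder for the real per-block solve)
could look like:

  #include <cuda_runtime.h>
  #include <omp.h>

  /* Placeholder for the real (PETSc/cuSparse) per-block solve: it should
     only enqueue asynchronous work on the given stream. */
  static void launch_block_solve(int block, cudaStream_t stream)
  {
    (void)block; (void)stream;
  }

  /* One OpenMP thread per block, each driving its own CUDA stream, so
     the block solves can proceed concurrently on the device. */
  static void solve_all_blocks(int nblocks)
  {
    #pragma omp parallel for
    for (int i = 0; i < nblocks; i++) {
      cudaStream_t stream;
      cudaStreamCreate(&stream);
      launch_block_solve(i, stream);  /* non-blocking launches           */
      cudaStreamSynchronize(stream);  /* each thread waits on its stream */
      cudaStreamDestroy(stream);
    }
  }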

On Wed, Dec 30, 2020 at 10:49 PM Jed Brown <jed at jedbrown.org> wrote:

> Mark Adams <mfadams at lbl.gov> writes:
>
> > On Wed, Dec 30, 2020 at 8:57 PM Barry Smith <bsmith at petsc.dev> wrote:
> >
> >>
> >>
> >> > On Dec 30, 2020, at 7:30 PM, Jed Brown <jed at jedbrown.org> wrote:
> >> >
> >> > Barry Smith <bsmith at petsc.dev> writes:
> >> >
> >> >>  If you are using direct solvers on each block on each GPU (several
> >> >> matrices on each GPU) you could pull apart, for example,
> >> >> MatSolve_SeqAIJCUSPARSE() and launch each of the matrix solves on a
> >> >> separate stream. You could use a MatSolveBegin/MatSolveEnd style or,
> >> >> as Jed may prefer, a Wait() model. Maybe a couple hours of coding to
> >> >> produce a prototype MatSolveBegin/MatSolveEnd from
> >> >> MatSolve_SeqAIJCUSPARSE.
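
For concreteness, a rough sketch of what such a Begin/End split might
look like at the raw cuSparse level (hypothetical names, independent of
the actual MatSolve_SeqAIJCUSPARSE internals, and assuming the factors
and the csrsv2 analysis data have already been set up):

  #include <cuda_runtime.h>
  #include <cusparse.h>

  /* Hypothetical per-block state, assumed to be filled in by the setup
     phase (numeric factorization, cusparseDcsrsv2_bufferSize/_analysis). */
  typedef struct {
    cusparseHandle_t   handle;
    cudaStream_t       stream;            /* one stream per block          */
    cusparseMatDescr_t descrL, descrU;    /* triangular factor descriptors */
    csrsv2Info_t       infoL, infoU;
    int                n, nnzL, nnzU;
    double            *valL, *valU;       /* factor values (CSR)           */
    int               *rowL, *colL, *rowU, *colU;
    double            *b, *t, *x;         /* rhs, temporary, solution      */
    void              *bufL, *bufU;       /* csrsv2 work buffers           */
  } BlockSolve;

  /* "Begin": enqueue L*t = b and U*x = t on this block's stream and
     return without waiting, so the next block can be launched right away. */
  static void BlockSolveBegin(BlockSolve *s)
  {
    const double one = 1.0;
    cusparseSetStream(s->handle, s->stream);
    cusparseDcsrsv2_solve(s->handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          s->n, s->nnzL, &one, s->descrL, s->valL, s->rowL,
                          s->colL, s->infoL, s->b, s->t,
                          CUSPARSE_SOLVE_POLICY_USE_LEVEL, s->bufL);
    cusparseDcsrsv2_solve(s->handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          s->n, s->nnzU, &one, s->descrU, s->valU, s->rowU,
                          s->colU, s->infoU, s->t, s->x,
                          CUSPARSE_SOLVE_POLICY_USE_LEVEL, s->bufU);
  }

  /* "End": wait for this block's stream to drain. */
  static void BlockSolveEnd(BlockSolve *s)
  {
    cudaStreamSynchronize(s->stream);
  }

The driver is then just the loop discussed below: BlockSolveBegin on
every block to get everything enqueued, then BlockSolveEnd on every
block. (The real MatSolve_SeqAIJCUSPARSE presumably also applies
row/column permutations around the two triangular solves, which I take to
be where the count of four kernels per matrix below comes from.)
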
> >> >
> >> > I doubt cusparse_solve is a single kernel launch (and there are two
> >> > of them already). You'd almost certainly need a thread to keep
> >> > driving it, or an async/await model. Begin/End pairs for compute
> >> > (even "offloaded" compute) are no small change.
> >>
> >>   Why, it can simply launch the 4 non-blocking kernels needed in the
> >> same stream for a given matrix and then go to the next matrix and do
> >> the same in the next stream. If the GPU is smart enough to manage
> >> utilizing the multiple streams, I don't see why any baby-sitting by
> >> the CPU is needed at all. Note there is no CPU work needed between
> >> each of the 4 kernels that I can see.
> >>
> >
> > I agree. The GPU scheduler can partition the GPU in space and time to
> > keep it busy. For instance, a simple model for my 10 solves is to loop
> > over all blocks, do a non-blocking Solve, and wait. My solves might
> > fill 1/10 of the GPU, say, and I get a 10x speedup. I think this is
> > theoretically possible, and there will be inefficiency, but I have
> > noticed that my current code overlaps CPU and GPU work in separate MPI
> > processes, which is just one way to do things asynchronously. There
> > are mechanisms to do this with one process.
>
> I missed that cusparseDcsrsv2_solve() supports asynchronous execution;
> however, it appears that it needs to do some work (launching a kernel to
> inspect device memory and waiting for it to complete) to know what error
> to return (at least for the factor that does not have a unit diagonal).
>
> | Function csrsv2_solve() reports the first numerical zero, including a
> | structural zero. If status is 0, no numerical zero was found.
> | Furthermore, no numerical zero is reported if CUSPARSE_DIAG_TYPE_UNIT
> | is specified, even if A(j,j) is zero for some j. The user needs to call
> | cusparseXcsrsv2_zeroPivot() to know where the numerical zero is.
>
> https://docs.nvidia.com/cuda/cusparse/index.html#csrsv2_solve
>
> As such, I remain skeptical that you can just fire off a bunch of these
> without incurring a significant serialization penalty.
>
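
To make the serialization concern concrete: cusparseXcsrsv2_zeroPivot()
is (per the cuSPARSE docs) a blocking call, so querying it right after
each solve would serialize the blocks. A rough sketch (hypothetical
names, assuming per-block handles and csrsv2 info objects from a setup
like the one sketched earlier) of deferring the queries until everything
has been launched:

  #include <cusparse.h>

  /* Hypothetical: after all block solves have been enqueued, check for
     zero pivots once per block rather than after every solve.  Note this
     only defers the explicit query; per the documentation quoted above,
     the solve itself still reports a numerical zero for the non-unit-
     diagonal factor, which implies some synchronization inside each call. */
  static int FirstZeroPivotBlock(int nblocks, cusparseHandle_t *handles,
                                 csrsv2Info_t *infosU, int *position)
  {
    for (int i = 0; i < nblocks; i++) {
      *position = -1;
      if (cusparseXcsrsv2_zeroPivot(handles[i], infosU[i], position)
          == CUSPARSE_STATUS_ZERO_PIVOT)
        return i;               /* block i has a zero pivot at *position */
    }
    return -1;                  /* no zero pivot reported in any block   */
  }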