[petsc-dev] ASM for each field solve on GPUs

Mark Adams mfadams at lbl.gov
Wed Dec 30 21:35:56 CST 2020


On Wed, Dec 30, 2020 at 8:57 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>
> > On Dec 30, 2020, at 7:30 PM, Jed Brown <jed at jedbrown.org> wrote:
> >
> > Barry Smith <bsmith at petsc.dev> writes:
> >
> >> If you are using direct solvers on each block on each GPU (several
> >> matrices on each GPU) you could pull apart, for example,
> >> MatSolve_SeqAIJCUSPARSE() and launch each of the matrix solves on a
> >> separate stream. You could use a MatSolveBegin/MatSolveEnd style or,
> >> as Jed may prefer, a Wait() model. Maybe a couple hours of coding to
> >> produce a prototype MatSolveBegin/MatSolveEnd from
> >> MatSolve_SeqAIJCUSPARSE.
> >
> > I doubt cusparse_solve is a single kernel launch (and there's two of
> > them already). You'd almost certainly need a thread to keep driving it,
> > or an async/await model. Begin/End pairs for compute (even "offloaded"
> > compute) are no small change.
>
>   Why? It can simply launch the 4 non-blocking kernels needed in the same
> stream for a given matrix, then go to the next matrix and do the same in
> the next stream. If the GPU is smart enough to manage utilizing the
> multiple streams, I don't see why any baby-sitting by the CPU is needed at
> all. Note there is no CPU work needed between each of the 4 kernels that I
> can see.
>

I agree. The GPU scheduler can partition the GPU in space and time to keep
it busy. For instance, a simple model for my 10 solves is: loop over all
blocks, do a non-blocking solve for each, and then wait. Each solve might
fill 1/10 of the GPU, say, and I would get a 10x speedup. I think this is
theoretically possible and there will be some inefficiency, but I have
noticed that my current code overlaps CPU and GPU work across separate MPI
processes, which is just one way to do things asynchronously. There are
mechanisms to do this with one process.
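
A minimal sketch of that loop, assuming a hypothetical stream-aware
per-block solve (block_solve_async below stands in for a
MatSolve_SeqAIJCUSPARSE variant that takes a stream argument; only the
CUDA runtime calls are real API):

#include <cuda_runtime.h>

#define NBLOCKS 10

/* Hypothetical stand-in for a stream-enabled MatSolve: enqueues the
   (already factored) triangular solves for block b on the given stream
   without blocking the host. */
void block_solve_async(int b, cudaStream_t stream);

void solve_all_blocks(void)
{
  cudaStream_t streams[NBLOCKS];
  int b;

  for (b = 0; b < NBLOCKS; b++) cudaStreamCreate(&streams[b]);
  /* Launch phase: nothing here blocks the host, so the GPU scheduler
     is free to overlap the small solves in space and time. */
  for (b = 0; b < NBLOCKS; b++) block_solve_async(b, streams[b]);
  /* Wait phase: the single synchronization point, i.e. the Begin/End
     (or Wait()) model. */
  for (b = 0; b < NBLOCKS; b++) cudaStreamSynchronize(streams[b]);
  for (b = 0; b < NBLOCKS; b++) cudaStreamDestroy(streams[b]);
}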


>
>
> >
> >> Note that pulling apart a non-coupled single MatAIJ that contains all
> >> the matrices would be hugely expensive. Better to build each matrix
> >> already separate or use MatNest with only diagonal matrices.
> >
> > Nonsense, the ND (nested dissection) ordering will notice that the
> > blocks are decoupled and you get more meat per kernel launch.
>
>   Yes, if the underlying GPU factorization and solver can take advantage
> of this, you are of course completely correct.


Exactly.


> It would be a good test of SuperLU_DIST's GPU support to just give it the
> big uncoupled matrix and see how it does, with profiling on the GPU.


I talked to Sherry today and she does not want that. She wants it
separated. That is why I am doing this.
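
For concreteness, a minimal sketch of the "build each matrix already
separate" option Barry mentions above (MatCreateSeqAIJ and the assembly
calls are real PETSc API; the block count, sizes, and nnz-per-row guess
are made up):

#include <petscmat.h>

/* Sketch: assemble each field's block as its own SeqAIJ Mat so
   SuperLU_DIST (or any per-block solver) sees the small decoupled
   systems directly, rather than one big block-diagonal AIJ it has to
   pull apart. */
PetscErrorCode BuildSeparateBlocks(PetscInt nblocks, PetscInt n, Mat blocks[])
{
  PetscErrorCode ierr;
  PetscInt       b;

  PetscFunctionBeginUser;
  for (b = 0; b < nblocks; b++) {
    ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 5, NULL, &blocks[b]);CHKERRQ(ierr);
    /* ... MatSetValues() with block b's entries only ... */
    ierr = MatAssemblyBegin(blocks[b], MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(blocks[b], MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}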


> It is playing the "I have information that I know, throw it away, and then
> expect the software to recover it" game.
>
>
>