[petsc-dev] ASM for each field solve on GPUs

Wed Dec 30 19:56:08 CST 2020

> On Dec 30, 2020, at 7:30 PM, Jed Brown <jed at jedbrown.org> wrote:
> 
> Barry Smith <bsmith at petsc.dev> writes:
> 
>>  If you are using direct solvers on each block on each GPU (several matrices on each GPU) you could pull apart, for example, MatSolve_SeqAIJCUSPARSE()
>> and launch each of the matrix solves on a separate stream.   You could use a MatSolveBegin/MatSolveEnd style or as Jed may prefer a Wait() model. Maybe a couple hours coding to produce a prototype MatSolveBegin/MatSolveEnd from MatSolve_SeqAIJCUSPARSE.
> 
> I doubt cusparse_solve is a single kernel launch (and there's two of them already). You'd almost certainly need a thread to keep driving it, or an async/await model. Begin/End pairs for compute (even "offloaded" compute) are no small change.

  Why? It can simply launch the 4 non-blocking kernels needed for a given matrix in the same stream, then go to the next matrix and do the same in the next stream. If the GPU is smart enough to manage utilizing the multiple streams, I don't see why any baby-sitting by the CPU is needed at all. Note that, as far as I can see, no CPU work is needed between any of the 4 kernels.
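
  A minimal CUDA sketch of that launch pattern, under stated assumptions: `stage_kernel` is a hypothetical placeholder for each of the 4 kernels (it is not a triangular solve and is not taken from MatSolve_SeqAIJCUSPARSE), and the block count and sizes are made up for illustration. It only shows that the host can queue all the work for every matrix without waiting, then wait once at the end.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one stage of a block solve (NOT a real triangular solve).
__global__ void stage_kernel(double *x, int n, int stage)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = 2.0 * x[i] + stage;
}

#define CHECK(call)                                                 \
  do {                                                              \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
      return 1;                                                     \
    }                                                               \
  } while (0)

int main()
{
  const int nblocks = 3;    // number of independent matrices (assumed)
  const int n       = 4096; // unknowns per matrix (assumed)
  const int nstages = 4;    // the "4 kernels" per solve

  std::vector<cudaStream_t> streams(nblocks);
  std::vector<double *>     x(nblocks);

  for (int b = 0; b < nblocks; b++) {
    CHECK(cudaStreamCreate(&streams[b]));
    CHECK(cudaMalloc((void **)&x[b], n * sizeof(double)));
    CHECK(cudaMemsetAsync(x[b], 0, n * sizeof(double), streams[b]));
  }

  // "Begin" phase: queue all stages for every matrix; the host never waits here.
  // Stages for one matrix share a stream, so they execute in launch order.
  for (int b = 0; b < nblocks; b++) {
    for (int s = 0; s < nstages; s++) {
      stage_kernel<<<(n + 255) / 256, 256, 0, streams[b]>>>(x[b], n, s);
    }
  }
  CHECK(cudaGetLastError());

  // "End" phase: wait for each stream, then inspect one entry per matrix.
  for (int b = 0; b < nblocks; b++) {
    double first = -1.0;
    CHECK(cudaStreamSynchronize(streams[b]));
    CHECK(cudaMemcpy(&first, x[b], sizeof(double), cudaMemcpyDeviceToHost));
    // Starting from 0, the in-order stages give 0 -> 1 -> 4 -> 11.
    printf("matrix %d: x[0] = %g (expected 11)\n", b, first);
  }

  for (int b = 0; b < nblocks; b++) {
    CHECK(cudaFree(x[b]));
    CHECK(cudaStreamDestroy(streams[b]));
  }
  return 0;
}
```

  Whether kernels from different streams actually overlap is up to the hardware and the resources each kernel uses; the sketch shows only the queuing pattern, not a measured speedup. A real solve may also need host-side setup or library calls that launch more than one kernel each, which is the concern raised in the quoted text above.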

> 
>>  Note pulling apart a non-coupled single MatAIJ that contains all the matrices would be hugely expensive. Better to build each matrix already separate or use MatNest with only diagonal matrices.
> 
> Nonsense, the ND will notice that they're decoupled and you get more meat per kernel launch.

  Yes, if the underlying GPU factorization and solver can take advantage of this you are of course completely correct. It would be a good test of SuperLU_DIST GPU to just give it the big uncoupled matrix and profile it on the GPU to see how it does. It is playing the "I have information that I throw away and then expect the software to recover" game.