<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Dec 30, 2020 at 8:57 PM Barry Smith <<a href="mailto:bsmith@petsc.dev">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

<br>

> On Dec 30, 2020, at 7:30 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>> wrote:<br>

> <br>

> Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> writes:<br>

> <br>

>>  If you are using direct solvers on each block on each GPU (several matrices on each GPU) you could pull apart, for example, MatSolve_SeqAIJCUSPARSE()<br>

>> and launch each of the matrix solves on a separate stream.   You could use a MatSolveBegin/MatSolveEnd style or, as Jed may prefer, a Wait() model. Maybe a couple of hours of coding to produce a prototype MatSolveBegin/MatSolveEnd from MatSolve_SeqAIJCUSPARSE.<br>

> <br>

> I doubt cusparse_solve is a single kernel launch (and there are two of them already). You'd almost certainly need a thread to keep driving it, or an async/await model. Begin/End pairs for compute (even "offloaded" compute) are no small change. <br>

<br>

  Why, it can simply launch the 4 non-blocking kernels needed in the same stream for a given matrix and then go to the next matrix and do the same in the next stream. If the GPU is smart enough to manage utilizing the multiple streams I don't see why any baby-sitting by the CPU is needed at all. Note there is no CPU work needed between each of the 4 kernels that I can see.<br></blockquote><div><br></div><div>I agree. The GPU scheduler can partition the GPU in space and time to keep it busy. For instance, a simple model for my 10 solves is: loop over all blocks, do a non-blocking Solve on each, and wait. Each solve might fill, say, 1/10 of the GPU, so running them concurrently could give up to a 10x speedup. I think this is theoretically possible, though there will be some inefficiency. I have noticed that my current code overlaps CPU and GPU work in separate MPI processes, which is just one way to do things asynchronously. There are mechanisms to do this within one process. </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

<br>

> <br>

>>  Note pulling apart a non-coupled single MatAIJ that contains all the matrices would be hugely expensive. Better to build each matrix already separate or use MatNest with only diagonal matrices.<br>

> <br>

> Nonsense, the ND will notice that they're decoupled and you get more meat per kernel launch.<br>

<br>

  Yes, if the underlying GPU factorization and solver can take advantage of this you are of course completely correct. </blockquote><div><br></div><div>Exactly.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">It would be a good test of SuperLU_DIST GPU to just give it the uncoupled big matrix and see how it does with profiling on the GPU. </blockquote><div><br></div><div>I talked to Sherry today and she does not want that. She wants it separated. That is why I am doing this. </div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">It is playing the "I have information that I know, I throw it away, and then expect the software to recover it" game.<br>

<br>

<br>

</blockquote></div></div>
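<div>For reference, the "one stream per block, launch everything, then wait" model discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions: <code>stage_kernel</code> and <code>run_block_solves</code> are hypothetical names, the kernel is a trivial stand-in that scales a vector rather than a real triangular solve, and the count of four stages per block simply mirrors the "4 kernels" mentioned in the thread. It does not use PETSc or cuSPARSE.</div>

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for one stage of a solve: it only scales the vector,
// which makes the final result easy to verify on the host.
__global__ void stage_kernel(double *x, int n, double factor)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= factor;
}

// Queue four dependent stages per block, each block in its own stream, and
// synchronize once at the end. Returns 0 on success, 1 on any failure.
int run_block_solves(int nblocks, int n)
{
  std::vector<cudaStream_t> streams(nblocks);
  std::vector<double *>     d_x(nblocks, nullptr);
  std::vector<double>       h_x(n, 1.0);
  const size_t              bytes  = n * sizeof(double);
  int                       status = 0;

  // Setup: one stream and one device vector per block.
  for (int b = 0; b < nblocks; ++b) {
    if (cudaStreamCreate(&streams[b]) != cudaSuccess) status = 1;
    if (cudaMalloc((void **)&d_x[b], bytes) != cudaSuccess) status = 1;
    if (!status && cudaMemcpy(d_x[b], h_x.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) status = 1;
  }

  // Launch phase: kernel launches return immediately on the host. Stages in
  // the same stream run in order; different streams may overlap on the GPU.
  const int threads = 256;
  const int grid    = (n + threads - 1) / threads;
  if (!status) {
    for (int b = 0; b < nblocks; ++b)
      for (int s = 0; s < 4; ++s)
        stage_kernel<<<grid, threads, 0, streams[b]>>>(d_x[b], n, 2.0);
    if (cudaGetLastError() != cudaSuccess) status = 1;
  }

  // Single wait for all streams; no host work is needed between stages.
  if (cudaDeviceSynchronize() != cudaSuccess) status = 1;

  // Check: every entry should be 1.0 * 2^4 = 16.0 after four stages.
  for (int b = 0; b < nblocks && !status; ++b) {
    if (cudaMemcpy(h_x.data(), d_x[b], bytes, cudaMemcpyDeviceToHost) != cudaSuccess) status = 1;
    for (int i = 0; i < n && !status; ++i)
      if (h_x[i] != 16.0) status = 1;
  }

  // Cleanup (cudaFree accepts a null pointer).
  for (int b = 0; b < nblocks; ++b) {
    cudaFree(d_x[b]);
    cudaStreamDestroy(streams[b]);
  }
  return status;
}
```

<div>Calling <code>run_block_solves(10, 1024)</code> would correspond to the 10-block case. Whether the streams actually overlap depends on how much of the GPU each kernel occupies, which is the open question in this thread.</div>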