<div dir="ltr">For the near future I am going to be driving this with multiple MPI processes per GPU, and the LU factorizations are the big problem. I can use the existing serial ASM solver, which would call cuSPARSE from each MPI process, so that will run as is. That is what I need for a paper: LU factorization on the GPU, and splitting these 10 solves off for SuperLU is the first step.<div><br></div><div>A user, however, may not want to run with 7 MPI processes per GPU (on Summit), so in that case some sort of asynchronous mechanism will be needed. But that is later. A code that I work with uses Kokkos, and they use OpenMP to drive asynchronous GPU processes.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Dec 30, 2020 at 10:49 PM Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> writes:<br>

<br>

> On Wed, Dec 30, 2020 at 8:57 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br>

><br>

>><br>

>><br>

>> > On Dec 30, 2020, at 7:30 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>> wrote:<br>

>> ><br>

>> > Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> writes:<br>

>> ><br>

>> >>  If you are using direct solvers on each block on each GPU (several<br>

>> matrices on each GPU) you could pull apart, for example,<br>

>> MatSolve_SeqAIJCUSPARSE()<br>

>> >> and launch each of the matrix solves on a separate stream.   You could<br>

>> use a MatSolveBegin/MatSolveEnd style or as Jed may prefer a Wait() model.<br>

>> Maybe a couple hours coding to produce a prototype<br>

>> MatSolveBegin/MatSolveEnd from MatSolve_SeqAIJCUSPARSE.<br>

>> ><br>

>> > I doubt cusparse_solve is a single kernel launch (and there's two of<br>

>> them already). You'd almost certainly need a thread to keep driving it, or<br>

>> an async/await model. Begin/End pairs for compute (even "offloaded")<br>

>> compute are no small change.<br>

>><br>

>>   Why, it can simply launch the 4 non-blocking kernels needed in the same<br>

>> stream for a given matrix and then go to the next matrix and do the same in<br>

>> the next stream. If the GPU is smart enough to manage utilizing the<br>

>> multiple streams I don't see why any baby-sitting by the CPU is needed at<br>

>> all. Note there is no CPU work needed between each of the 4 kernels that I<br>

>> can see.<br>

>><br>

><br>

> I agree. The GPU scheduler can partition the GPU in space and time to keep<br>

> it busy. For instance a simple model for my 10 solves is loop over all<br>

> blocks, do a non-blocking Solve, and wait. My solves might fill 1/10 of the<br>

> GPU, say, and I get 10x speed up. I think this is theoretically possible<br>

> and there will be inefficiency but I have noticed that my current code<br>

> overlaps CPU and GPU work in separate MPI processes, which is just one way<br>

> to do things asynchronously. There are mechanisms to do this with one<br>

> process.<br>

<br>

I missed that cusparseDcsrsv2_solve() supports asynchronous execution; however, it appears that it needs to do some work (launching a kernel to inspect device memory and waiting for it to complete) to know what error to return (at least on the factor that does not have a unit diagonal).<br>

<br>

| Function csrsv2_solve() reports the first numerical zero, including a structural zero. If status is 0, no numerical zero was found. Furthermore, no numerical zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if A(j,j) is zero for some j. The user needs to call cusparseXcsrsv2_zeroPivot() to know where the numerical zero is.<br>

<br>

<a href="https://docs.nvidia.com/cuda/cusparse/index.html#csrsv2_solve" rel="noreferrer" target="_blank">https://docs.nvidia.com/cuda/cusparse/index.html#csrsv2_solve</a><br>

<br>

As such, I remain skeptical that you can just fire off a bunch of these without incurring a significant serialization penalty.<br>

</blockquote></div>
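The "loop over all blocks, launch the kernels for each block on its own stream, then wait once" model discussed in the quoted thread can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PETSc or cuSPARSE implementation: `scale_by_diagonal` is a hypothetical stand-in kernel (a real sparse triangular solve has row dependencies and, as noted above, may need extra work to report zero pivots), and `solve_all_blocks` is an invented name. Whether the streams actually overlap depends on the device and on how much of the GPU each kernel fills.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Returns the CUDA error code on failure. A sketch: early return skips cleanup.
#define CHECK(call) do { cudaError_t e_ = (call); \
  if (e_ != cudaSuccess) return (int)e_; } while (0)

// Hypothetical stand-in for one stage of a solve: x[i] = x[i] / d[i].
__global__ void scale_by_diagonal(double *x, const double *d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] /= d[i];
}

// For each of `nblocks` independent systems, enqueue `stages` kernels on that
// system's own stream, then wait once for everything. Returns 0 on success.
int solve_all_blocks(int nblocks, int n, int stages, std::vector<double> &out) {
  const size_t bytes = sizeof(double) * (size_t)n;
  std::vector<cudaStream_t> streams(nblocks);
  std::vector<double *> x(nblocks, nullptr), d(nblocks, nullptr);
  std::vector<double> hx(n), hd(n, 2.0);  // every "diagonal" entry is 2
  out.assign((size_t)nblocks * n, 0.0);

  double scale = 1.0;                     // 2^stages, so the result is b + 1
  for (int s = 0; s < stages; ++s) scale *= 2.0;

  // Setup: one stream and one pair of device buffers per block.
  for (int b = 0; b < nblocks; ++b) {
    CHECK(cudaStreamCreate(&streams[b]));
    CHECK(cudaMalloc((void **)&x[b], bytes));
    CHECK(cudaMalloc((void **)&d[b], bytes));
    for (int i = 0; i < n; ++i) hx[i] = scale * (b + 1);
    CHECK(cudaMemcpy(x[b], hx.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d[b], hd.data(), bytes, cudaMemcpyHostToDevice));
  }

  // Launch phase: no host-side waiting between kernels or between blocks.
  // Kernels in the same stream run in order; different streams may overlap.
  const int threads = 128, grid = (n + threads - 1) / threads;
  for (int b = 0; b < nblocks; ++b)
    for (int s = 0; s < stages; ++s)
      scale_by_diagonal<<<grid, threads, 0, streams[b]>>>(x[b], d[b], n);
  CHECK(cudaGetLastError());

  // Single wait for all streams, then copy the results back.
  CHECK(cudaDeviceSynchronize());
  for (int b = 0; b < nblocks; ++b) {
    CHECK(cudaMemcpy(out.data() + (size_t)b * n, x[b], bytes,
                     cudaMemcpyDeviceToHost));
    CHECK(cudaFree(x[b]));
    CHECK(cudaFree(d[b]));
    CHECK(cudaStreamDestroy(streams[b]));
  }
  return 0;
}
```

Calling `solve_all_blocks(10, 256, 4, out)` mirrors the 10 solves with 4 kernels each mentioned in the thread. Profiling such a run (for example with Nsight Systems) would show whether the streams overlap or serialize on a given GPU.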