<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><blockquote type="cite" class=""><div dir="ltr" class=""><div class="">I could try this with fieldsplit/additive and Jacob's Vector object on Cuda, with Jacobi now of its ready. Jacob?</div></div></blockquote><div class=""><br class=""></div>Still under construction, although depending on the exact vec ops you require you may be able to tinker something together. Note so far only direct vec compute ops have been given the stream treatment, for scatter/gather I think Junchao covers that in his NVSHMEM MR. <div class=""><br class=""></div><div class="">If you are just pipelining vectors ops into one another without any intermittent modification of scalars though then you are good to go. As of writing this I haven’t gotten round to  implementing device-side scalar modifications. So in a nutshell:<div class=""><br class=""></div><div class="">VecDotAsync(…, &r00, stream);</div><div class="">VecScaleAsync(…, r00, stream);</div><div class=""><br class=""></div><div class="">Is fully pipelined as long as you directly feed output to input, but</div><div class=""><br class=""></div><div class="">VecDotAsync(…, &r00, stream);<br class="">VecScaleAsync(…, 1./r00, stream);    <————— note modification of result <br class=""><br class=""><div class="">Is not. </div><div class=""><br class=""><div class="">

<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div>Best regards,<br class=""><br class="">Jacob Faibussowitsch<br class="">(Jacob Fai - booss - oh - vitch)<br class="">Cell: (312) 694-3391</div></div>


</div>

<div><br class=""><blockquote type="cite" class=""><div class="">On Jan 11, 2021, at 08:14, Mark Adams <<a href="mailto:mfadams@lbl.gov" class="">mfadams@lbl.gov</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Ah, so first first maybe the Solver does not need a stream because it uses VecScatter ?<div class=""><br class=""></div><div class="">The model is 1) block to get your data, 2) doit non-blocking. If a vector is used in a non-GPU method that does not know about (1) you are hosed, but that is a detail.</div><div class=""><br class=""></div><div class="">I like the middle, single loop algo or whatever is easier, keep it simple, and once it's running it is easy to an experiment with more loops.</div><div class=""><br class=""></div><div class="">I could try this with fieldsplit/additive and Jacob's Vector object on Cuda, with Jacobi now of its ready. Jacob?</div><div class=""><br class=""></div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jan 10, 2021 at 9:38 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" class="">bsmith@petsc.dev</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class=""><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Jan 10, 2021, at 1:30 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank" class="">mfadams@lbl.gov</a>> wrote:</div><br class=""><div class=""><div dir="ltr" class=""><div dir="ltr" class=""><br class=""></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 9, 2021 at 11:14 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank" class="">bsmith@petsc.dev</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=""><div class=""><br class=""></div>  Ok, fieldsplit, pretty much the same as what I said for block Jacobi. <div class=""><br class=""></div><div class=""><div class="">static PetscErrorCode PCApply_FieldSplit(PC pc,Vec x,Vec y)</div><div class="">{</div><div class="">  PC_FieldSplit      *jac = (PC_FieldSplit*)pc->data;</div><div class="">  PetscErrorCode     ierr;</div><div class="">  PC_FieldSplitLink  ilink = jac->head;</div><div class="">  PetscInt           cnt,bs;</div><div class=""><br class=""></div><div class="">  PetscFunctionBegin;</div><div class="">  if (jac->type == PC_COMPOSITE_ADDITIVE) {</div><div class="">    if (jac->defaultsplit) {</div><div class="">      ierr = VecGetBlockSize(x,&bs);CHKERRQ(ierr);</div><div class="">      if (jac->bs > 0 && bs != jac->bs) SETERRQ2(PetscObjectComm((PetscObject)pc),PETSC_ERR_ARG_WRONGSTATE,"Blocksize of x vector %D does not match fieldsplit blocksize %D",bs,jac->bs);</div><div class="">      ierr = VecGetBlockSize(y,&bs);CHKERRQ(ierr);</div><div class="">      if (jac->bs > 0 && bs != jac->bs) SETERRQ2(PetscObjectComm((PetscObject)pc),PETSC_ERR_ARG_WRONGSTATE,"Blocksize of y vector %D does not match fieldsplit blocksize %D",bs,jac->bs);</div><div class="">      ierr = VecStrideGatherAll(x,jac->x,INSERT_VALUES);CHKERRQ(ierr);</div><div class="">      while (ilink) {</div><div class="">        ierr = PetscLogEventBegin(ilink->event,ilink->ksp,ilink->x,ilink->y,NULL);CHKERRQ(ierr);</div><div class="">        ierr = KSPSolve(ilink->ksp,ilink->x,ilink->y);CHKERRQ(ierr);</div><div class=""><br class=""></div><div class="">you need these KSPSolve() to not block which if you are using SuperLU_DIST means each call to </div><div class=""><br class=""></div><div class=""><div class=""> PetscStackCall("SuperLU_DIST:PStatInit",PStatInit(&stat));        /* Initialize the statistics variables. */</div><div class="">#if defined(PETSC_USE_COMPLEX)</div><div class="">  PetscStackCall("SuperLU_DIST:pzgssvx",pzgssvx(&lu->options,&lu->A_sup,&lu->ScalePermstruct,(doublecomplex*)bptr,m,1,&lu->grid,&lu->LUstruct,&lu->SOLVEstruct,berr,&stat,&info));</div><div class="">#else</div><div class="">  PetscStackCall("SuperLU_DIST:pdgssvx",pdgssvx(&lu->options,&lu->A_sup,&lu->ScalePermstruct,bptr,m,1,&lu->grid,&lu->LUstruct,&lu->SOLVEstruct,berr,&stat,&info));</div><div class="">#endif</div></div><div class=""><br class=""></div><div class="">must be on a different stream and cannot block. Which means essentially the PStatInit and pdgssvx have to run completely on the GPU or have some complex CPU babysitting that allows them to be interlaced on the GPU while doing their CPU stuff.  </div></div></div></blockquote><div class=""><br class=""></div><div class="">It is not clear why the library code would be that complex but I don't understand streams well. SuperLU currently uses Thrust driven by CPU code, but is developing a pure GPU version. I can wait for the pure GPU version.</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=""><div class=""><div class="">I doubt either thing is true and you are best off using the cuSparse Solvers which do run completely on the GPU in any stream you give them.</div></div></div></blockquote><div class=""><br class=""></div><div class="">CUSparse does not have an LU factorization.</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=""><div class=""><div class=""><br class=""></div><div class="">The factorization is even more of a nightmare, I think Sherry uses the GPU just to do blocks for her sequential factorization so I doubt very much it can handle multiple simultaneous factorizations. </div></div></div></blockquote><div class=""><br class=""></div><div class="">Oh you were talking about the solves. (I have only been thinking about factorizations and that has caused some confusion I now see).</div><div class=""><br class=""></div><div class="">I am still not clear about the parallel dispatch. If an MPI serial library method needs a lot of complex logic to work perhaps look at other models? The XGC code uses OMP a lot and they dispatch asynchronous "solves" in an OMP loop. I'm not sure if being in an OMP thread buys you anything for a GPU solver but this is what they have done for CPU solvers. At one point they used PETSc for a LU solve and we (Barry) made some fixes to make it thread safe enough for this operation, but don't know what GPU "solvers" they have now running in this model on the GPU.</div><div class=""><br class=""></div><div class="">Regardless, my best guess as the basic model is something like:</div></div></div></div></blockquote><div class=""><br class=""></div>  The thing is streams manage their own dependencies so you don't want or need any CPU waits on them.  Assume both VecScatter and Solve take a stream and are not blocking. The code for CUDA GPUs could be </div><div class=""><br class=""></div><div class=""><div class=""> for all blocks i</div><div class="">      VecScatter(i, i.stream)    // launches scatter kernel(s) </div><div class=""> for all blocks i</div><div class="">      Solve(i, i.stream)            // launches solve kernel(s)</div><div class=""> for all blocks i</div><div class="">      VecScatter(i, i.stream)     // launches scatter kernel(s)</div><div class=""><br class=""></div><div class=""><div class="">launch here means the CPU gives the stream and the code and data pointers to the GPU to put in its queue and schedule when it can. </div></div><div class=""><br class=""></div></div><div class="">  Since everything is non-blocking on the CPU just zips through the loops, taking about 10 micro-seconds for each "launch"  and the GPU manages the scheduling. You could also do </div><div class=""><br class=""></div><div class=""><div class=""><br class=""></div><div class=""><div class="">  for all blocks i</div><div class="">      VecScatter(i, i.stream)    // launches scatter kernel(s) </div><div class="">      Solve(i, i.stream)            // launches solve kernel(s)</div><div class="">      VecScatter(i, i.stream)     // launches scatter kerkel(s)</div></div><div class=""><br class=""></div></div><div class="">  Again the CPU just zips through all the loops giving the GPU a big list of kernels and the GPU manages the scheduling. The second approach would, I am guessing, be slightly slower since the GPU gets the scatter kernels with some time spacing between them (because the CPU is putting the later operations in the queue) so cannot schedule them as rapidly. </div><div class=""><br class=""></div><div class="">There is also</div><div class=""><br class=""></div><div class=""><div class=""> for all blocks i</div><div class="">      VecScatter(i, i.stream)    // launches scatter kernel(s) </div><div class=""> for all blocks i</div><div class="">      Solve(i, i.stream)            // launches solve kernel(s)</div><div class="">      VecScatter(i, i.stream)     // launches scatter kerkel(s)</div><div class=""><br class=""></div><div class="">which may be as good as the first assuming the scatters take long enough to cover the time for all the other kernels to be launched by the CPU and queued up by the GPU.</div></div><div class="">  </div><div class=""><br class=""></div><div class=""><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_quote"><div class=""><br class=""></div><div class="">For all blocks i,</div><div class="">   create a stream i.s, and launch a non-blocking vecScater with i.s</div><div class="">For all blocks i,<br class=""></div><div class="">  vecScaterEnd(i),</div><div class="">  Solve(i.s, ....)  // So library solvers and our built-in solvers would need to take a stream or I guess that would get passed in through Jacob's vector class and the library interface code would need to take the Cuda stream out and give it to SuperLU, say.</div><div class="">For all blocks i,<br class=""></div><div class="">  Wait (i.s)</div><div class="">  vecscatterBegin (i.s, ...)</div><div class="">For all blocks i,<br class=""></div><div class=""> vecscatterEnd (i.s, ...)  // non-overlapping blocks so problem with adding values</div><div class=""><br class=""></div></div></div>

</div></blockquote></div><br class=""></div></blockquote></div>

</div></blockquote></div><br class=""></div></div></div></body></html>