[petsc-dev] model for parallel ASM

Mark Adams mfadams at lbl.gov
Mon Jan 11 08:14:53 CST 2021


Ah, so first, maybe the Solver does not need a stream because it uses
VecScatter?

The model is 1) block to get your data, 2) do it non-blocking. If a vector
is used in a non-GPU method that does not know about (1), you are hosed, but
that is a detail.

I like the middle, single-loop algo, or whatever is easier; keep it simple,
and once it's running it is easy to experiment with more loops.

I could try this with fieldsplit/additive and Jacob's Vector object on
CUDA, with Jacobi for now, if it's ready. Jacob?
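
To make the single-loop idea concrete, here is a minimal CUDA sketch of the
model: one stream per block, every kernel launched non-blocking, and no CPU
waits inside the loop. The scatter_kernel/solve_kernel bodies and the
per-block array layout are made-up placeholders, not PETSc, cuSPARSE, or
SuperLU calls.

#include <cuda_runtime.h>
#include <stdlib.h>

/* placeholder kernels standing in for the per-block scatter and solve */
__global__ void scatter_kernel(const double *src, double *dst, int n)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) dst[i] = src[i];
}

__global__ void solve_kernel(const double *xl, double *yl, int n)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) yl[i] = 2.0*xl[i];   /* stand-in for the real block solve */
}

/* single-loop version: scatter in, solve, scatter out, all enqueued on the
   block's own stream; stream ordering supplies the dependencies */
void apply_blocks(int nblocks, double **x, double **xl, double **yl, const int *n)
{
  cudaStream_t *s = (cudaStream_t*)malloc(nblocks*sizeof(cudaStream_t));
  for (int i = 0; i < nblocks; i++) cudaStreamCreate(&s[i]);

  for (int i = 0; i < nblocks; i++) {
    int nb = (n[i] + 255)/256;
    scatter_kernel<<<nb,256,0,s[i]>>>(x[i],  xl[i], n[i]);  /* gather block data   */
    solve_kernel  <<<nb,256,0,s[i]>>>(xl[i], yl[i], n[i]);  /* per-block solve     */
    scatter_kernel<<<nb,256,0,s[i]>>>(yl[i], x[i],  n[i]);  /* scatter result back */
  }

  /* one synchronization point at the very end, only if the caller needs it */
  cudaDeviceSynchronize();
  for (int i = 0; i < nblocks; i++) cudaStreamDestroy(s[i]);
  free(s);
}

The three-separate-loops ordering discussed below only changes the order in
which the GPU receives the launches; per-block ordering still comes from the
streams.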


On Sun, Jan 10, 2021 at 9:38 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>
> On Jan 10, 2021, at 1:30 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
>
>
> On Sat, Jan 9, 2021 at 11:14 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   Ok, fieldsplit, pretty much the same as what I said for block Jacobi.
>>
>> static PetscErrorCode PCApply_FieldSplit(PC pc,Vec x,Vec y)
>> {
>>   PC_FieldSplit      *jac = (PC_FieldSplit*)pc->data;
>>   PetscErrorCode     ierr;
>>   PC_FieldSplitLink  ilink = jac->head;
>>   PetscInt           cnt,bs;
>>
>>   PetscFunctionBegin;
>>   if (jac->type == PC_COMPOSITE_ADDITIVE) {
>>     if (jac->defaultsplit) {
>>       ierr = VecGetBlockSize(x,&bs);CHKERRQ(ierr);
>>       if (jac->bs > 0 && bs != jac->bs) SETERRQ2(PetscObjectComm((PetscObject)pc),PETSC_ERR_ARG_WRONGSTATE,"Blocksize of x vector %D does not match fieldsplit blocksize %D",bs,jac->bs);
>>       ierr = VecGetBlockSize(y,&bs);CHKERRQ(ierr);
>>       if (jac->bs > 0 && bs != jac->bs) SETERRQ2(PetscObjectComm((PetscObject)pc),PETSC_ERR_ARG_WRONGSTATE,"Blocksize of y vector %D does not match fieldsplit blocksize %D",bs,jac->bs);
>>       ierr = VecStrideGatherAll(x,jac->x,INSERT_VALUES);CHKERRQ(ierr);
>>       while (ilink) {
>>         ierr = PetscLogEventBegin(ilink->event,ilink->ksp,ilink->x,ilink->y,NULL);CHKERRQ(ierr);
>>         ierr = KSPSolve(ilink->ksp,ilink->x,ilink->y);CHKERRQ(ierr);
>>
>> you need these KSPSolve() calls to not block, which, if you are using
>> SuperLU_DIST, means each call to
>>
>>  PetscStackCall("SuperLU_DIST:PStatInit",PStatInit(&stat));        /* Initialize the statistics variables. */
>> #if defined(PETSC_USE_COMPLEX)
>>  PetscStackCall("SuperLU_DIST:pzgssvx",pzgssvx(&lu->options,&lu->A_sup,&lu->ScalePermstruct,(doublecomplex*)bptr,m,1,&lu->grid,&lu->LUstruct,&lu->SOLVEstruct,berr,&stat,&info));
>> #else
>>  PetscStackCall("SuperLU_DIST:pdgssvx",pdgssvx(&lu->options,&lu->A_sup,&lu->ScalePermstruct,bptr,m,1,&lu->grid,&lu->LUstruct,&lu->SOLVEstruct,berr,&stat,&info));
>> #endif
>>
>> must be on a different stream and cannot block, which essentially means
>> the PStatInit and pdgssvx have to run completely on the GPU or have some
>> complex CPU babysitting that allows them to be interlaced on the GPU while
>> doing their CPU stuff.
>>
>
> It is not clear why the library code would be that complex but I don't
> understand streams well. SuperLU currently uses Thrust driven by CPU code,
> but is developing a pure GPU version. I can wait for the pure GPU version.
>
>
>> I doubt either thing is true and you are best off using the cuSparse
>> Solvers which do run completely on the GPU in any stream you give them.
>>
>
> CUSparse does not have an LU factorization.
>
>
>>
>> The factorization is even more of a nightmare; I think Sherry uses the
>> GPU just to do blocks for her sequential factorization, so I doubt very
>> much it can handle multiple simultaneous factorizations.
>>
>
> Oh, you were talking about the solves. (I have only been thinking about
> factorizations, and that has caused some confusion, I now see.)
>
> I am still not clear about the parallel dispatch. If an MPI-serial library
> method needs a lot of complex logic to work, perhaps look at other models?
> The XGC code uses OMP a lot, and they dispatch asynchronous "solves" in an
> OMP loop. I'm not sure if being in an OMP thread buys you anything for a
> GPU solver, but this is what they have done for CPU solvers. At one point
> they used PETSc for an LU solve and we (Barry) made some fixes to make it
> thread safe enough for this operation, but I don't know what GPU "solvers"
> they have now running in this model on the GPU.
>
> Regardless, my best guess at the basic model is something like:
>
>
>   The thing is, streams manage their own dependencies, so you don't want
> or need any CPU waits on them. Assume both VecScatter and Solve take a
> stream and are not blocking. The code for CUDA GPUs could be
>
>  for all blocks i
>       VecScatter(i, i.stream)    // launches scatter kernel(s)
>  for all blocks i
>       Solve(i, i.stream)            // launches solve kernel(s)
>  for all blocks i
>       VecScatter(i, i.stream)     // launches scatter kernel(s)
>
> "Launch" here means the CPU gives the stream and the code and data
> pointers to the GPU to put in its queue and schedule when it can.
>
>   Since everything is non-blocking, the CPU just zips through the loops,
> taking about 10 microseconds for each "launch", and the GPU manages the
> scheduling. You could also do
>
>
>   for all blocks i
>       VecScatter(i, i.stream)    // launches scatter kernel(s)
>       Solve(i, i.stream)            // launches solve kernel(s)
>       VecScatter(i, i.stream)     // launches scatter kernel(s)
>
>   Again, the CPU just zips through all the loops, giving the GPU a big
> list of kernels, and the GPU manages the scheduling. The second approach
> would, I am guessing, be slightly slower, since the GPU gets the scatter
> kernels with some time spacing between them (because the CPU is putting the
> later operations in the queue) and so cannot schedule them as rapidly.
>
> There is also
>
>  for all blocks i
>       VecScatter(i, i.stream)    // launches scatter kernel(s)
>  for all blocks i
>       Solve(i, i.stream)            // launches solve kernel(s)
>       VecScatter(i, i.stream)     // launches scatter kernel(s)
>
> which may be as good as the first, assuming the scatters take long enough
> to cover the time for all the other kernels to be launched by the CPU and
> queued up by the GPU.
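
For concreteness, the first ordering (all scatters launched, then all
solves, then all scatter-backs) might look like the sketch below in raw
CUDA. scatter_fwd, block_solve, and scatter_rev are hypothetical placeholder
kernels, and the streams are assumed to have been created once up front; the
only point is that the loops contain launches, never CPU waits.

#include <cuda_runtime.h>

/* hypothetical placeholder kernels for the per-block forward scatter,
   block solve, and reverse scatter */
__global__ void scatter_fwd(const double *x, double *xl, int n)
{ int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) xl[i] = x[i]; }

__global__ void block_solve(const double *xl, double *yl, int n)
{ int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) yl[i] = 2.0*xl[i]; }

__global__ void scatter_rev(const double *yl, double *y, int n)
{ int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] = yl[i]; }

/* first ordering: three separate launch loops; within each stream the three
   kernels for a block still run in order, so no CPU synchronization is needed */
void apply_separate_loops(int nblocks, cudaStream_t *s, double **x, double **xl,
                          double **yl, double **y, const int *n)
{
  for (int i = 0; i < nblocks; i++)
    scatter_fwd<<<(n[i]+255)/256,256,0,s[i]>>>(x[i], xl[i], n[i]);
  for (int i = 0; i < nblocks; i++)
    block_solve<<<(n[i]+255)/256,256,0,s[i]>>>(xl[i], yl[i], n[i]);
  for (int i = 0; i < nblocks; i++)
    scatter_rev<<<(n[i]+255)/256,256,0,s[i]>>>(yl[i], y[i], n[i]);
  /* no waits here: the GPU has the whole queue and schedules it itself */
}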
>
>
>
>
> For all blocks i,
>    create a stream i.s, and launch a non-blocking VecScatter with i.s
> For all blocks i,
>   VecScatterEnd(i),
>   Solve(i.s, ....)  // So library solvers and our built-in solvers would
> need to take a stream, or I guess that would get passed in through Jacob's
> vector class, and the library interface code would need to take the CUDA
> stream out and give it to SuperLU, say.
> For all blocks i,
>   Wait(i.s)
>   VecScatterBegin(i.s, ...)
> For all blocks i,
>   VecScatterEnd(i.s, ...)  // non-overlapping blocks, so no problem with
> adding values
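
On the interface question above, a purely hypothetical sketch of how the
stream might flow from a GPU vector object into a library solve is below.
GPUVec, its stream member, and library_solve_on_stream are invented names
used only to mark where the stream would have to be threaded through; they
are not real PETSc, SuperLU, or cuSPARSE API.

#include <cuda_runtime.h>

/* hypothetical GPU vector handle carrying its device array and a stream */
typedef struct {
  double       *d_array;
  int           n;
  cudaStream_t  stream;
} GPUVec;

/* placeholder kernel standing in for whatever the library would enqueue */
__global__ void placeholder_solve(const double *b, double *x, int n)
{ int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) x[i] = b[i]; }

/* hypothetical stream-aware library entry point: the solve is enqueued on
   the given stream and returns to the CPU immediately */
static void library_solve_on_stream(void *factored, const double *b, double *x,
                                    int n, cudaStream_t stream)
{
  (void)factored;  /* a real library would use its factored data here */
  placeholder_solve<<<(n+255)/256,256,0,stream>>>(b, x, n);
}

/* the PETSc-side interface code would pull the stream out of the vector
   object and hand it to the library, without blocking the CPU */
static void block_solve_interface(void *factored, const GPUVec *b, GPUVec *x)
{
  library_solve_on_stream(factored, b->d_array, x->d_array, b->n, b->stream);
}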
>
>
>