[petsc-dev] model for parallel ASM

Barry Smith bsmith at petsc.dev
Sat Jan 9 22:14:35 CST 2021


  Ok, fieldsplit is pretty much the same as what I said for block Jacobi.

static PetscErrorCode PCApply_FieldSplit(PC pc,Vec x,Vec y)
{
  PC_FieldSplit      *jac = (PC_FieldSplit*)pc->data;
  PetscErrorCode     ierr;
  PC_FieldSplitLink  ilink = jac->head;
  PetscInt           cnt,bs;

  PetscFunctionBegin;
  if (jac->type == PC_COMPOSITE_ADDITIVE) {
    if (jac->defaultsplit) {
      ierr = VecGetBlockSize(x,&bs);CHKERRQ(ierr);
      if (jac->bs > 0 && bs != jac->bs) SETERRQ2(PetscObjectComm((PetscObject)pc),PETSC_ERR_ARG_WRONGSTATE,"Blocksize of x vector %D does not match fieldsplit blocksize %D",bs,jac->bs);
      ierr = VecGetBlockSize(y,&bs);CHKERRQ(ierr);
      if (jac->bs > 0 && bs != jac->bs) SETERRQ2(PetscObjectComm((PetscObject)pc),PETSC_ERR_ARG_WRONGSTATE,"Blocksize of y vector %D does not match fieldsplit blocksize %D",bs,jac->bs);
      ierr = VecStrideGatherAll(x,jac->x,INSERT_VALUES);CHKERRQ(ierr);
      while (ilink) {
        ierr = PetscLogEventBegin(ilink->event,ilink->ksp,ilink->x,ilink->y,NULL);CHKERRQ(ierr);
        ierr = KSPSolve(ilink->ksp,ilink->x,ilink->y);CHKERRQ(ierr);

You need these KSPSolve() calls to not block, which, if you are using SuperLU_DIST, means that each call to

 PetscStackCall("SuperLU_DIST:PStatInit",PStatInit(&stat));        /* Initialize the statistics variables. */
#if defined(PETSC_USE_COMPLEX)
  PetscStackCall("SuperLU_DIST:pzgssvx",pzgssvx(&lu->options,&lu->A_sup,&lu->ScalePermstruct,(doublecomplex*)bptr,m,1,&lu->grid,&lu->LUstruct,&lu->SOLVEstruct,berr,&stat,&info));
#else
  PetscStackCall("SuperLU_DIST:pdgssvx",pdgssvx(&lu->options,&lu->A_sup,&lu->ScalePermstruct,bptr,m,1,&lu->grid,&lu->LUstruct,&lu->SOLVEstruct,berr,&stat,&info));
#endif

must be on a different stream and cannot block. That essentially means PStatInit and pdgssvx have to run completely on the GPU, or have some complex CPU babysitting that allows them to be interleaved on the GPU while doing their CPU work. I doubt either is the case, so you are best off using the cuSparse solvers, which do run completely on the GPU in any stream you give them.
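For the stream part, at least with the cuSparse solvers this is just the standard pattern of binding the library handle to a per-split stream before the solves are enqueued. A minimal sketch, assuming plain CUDA/cuSPARSE calls rather than PETSc's actual internal handle management:

#include <cuda_runtime.h>
#include <cusparse.h>

/* Illustrative only: give a split its own CUDA stream and point the cuSPARSE
   handle at it, so every solve issued through that handle afterwards is
   enqueued on that stream instead of the default (blocking) one. */
static void BindSplitToStream(cusparseHandle_t handle,cudaStream_t *stream)
{
  cudaStreamCreate(stream);
  cusparseSetStream(handle,*stream);
}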

The factorization is even more of a nightmare. I think Sherry uses the GPU just to do blocks for her sequential factorization, so I doubt very much it can handle multiple simultaneous factorizations.



        ierr = KSPCheckSolve(ilink->ksp,pc,ilink->y);CHKERRQ(ierr);
        ierr = PetscLogEventEnd(ilink->event,ilink->ksp,ilink->x,ilink->y,NULL);CHKERRQ(ierr);
        ilink = ilink->next;
      }
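Conceptually the loop above has to become a fork/join: enqueue every split's solve on its own stream without the CPU waiting, then synchronize once after the whole list has been walked. A rough sketch of that pattern, where SolveSplitOnStream() is a hypothetical stand-in for the non-blocking per-split solve (none of this is existing PETSc code):

#include <stdlib.h>
#include <cuda_runtime.h>

extern void SolveSplitOnStream(int split,cudaStream_t s); /* hypothetical: must return immediately, all work stays on the GPU */

static void ApplySplitsAsync(int nsplits)
{
  cudaStream_t *streams = (cudaStream_t*)malloc(nsplits*sizeof(cudaStream_t));
  for (int i=0; i<nsplits; i++) {
    cudaStreamCreate(&streams[i]);
    SolveSplitOnStream(i,streams[i]);    /* enqueue only, do not block */
  }
  for (int i=0; i<nsplits; i++) {
    cudaStreamSynchronize(streams[i]);   /* join only after everything is queued */
    cudaStreamDestroy(streams[i]);
  }
  free(streams);
}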



> On Jan 9, 2021, at 6:06 PM, Mark Adams <mfadams at lbl.gov> wrote:
> 
> 
> 
> On Sat, Jan 9, 2021 at 5:14 PM Barry Smith <bsmith at petsc.dev <mailto:bsmith at petsc.dev>> wrote:
> 
>   If it is non-overlapping, do you mean block Jacobi with multiple blocks per MPI rank? (One block per rank is trivial and should work now.)
> 
> Yes, only one MPI rank.
>  
> 
>   If you mean block Jacobi with multiple blocks per MPI rank, you should start with PCApply_BJacobi_Multiblock(). It monkeys with pointers into the vector and then calls KSPSolve() for each block. So you just need a non-blocking KSPSolve(). What KSP and what PC do you want to use per block?
> 
> SuperLU
>  
> If you want to use LU, then I think you proceed largely as I said a few days ago. All the routines in MatSolve_SeqAIJCUSparse can be made non-blocking as discussed, with each one using its own stream (supplied with a hack, or possibly with approaches from Junchao and Jacob in the future; but a hack is all that is needed for this trivial case).
> 
> I'm not sure what goes into this hack. 
> 
> I am using fieldsplit and I now see that I don't specify asm, just lu, so I guess PCApply_FieldSplit is the target
> 
>     KSP Object: 1 MPI processes
>       type: preonly
>       maximum iterations=10000, initial guess is zero
>       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>       left preconditioning
>       using NONE norm type for convergence test
>     PC Object: 1 MPI processes
>       type: fieldsplit
>         FieldSplit with ADDITIVE composition: total splits = 2
>         Solver info for each split is in the following KSP objects:
>       Split number 0 Defined by IS
>       KSP Object: (fieldsplit_e_) 1 MPI processes
>         type: preonly
>         maximum iterations=10000, initial guess is zero
>         tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>         left preconditioning
>         using NONE norm type for convergence test
>       PC Object: (fieldsplit_e_) 1 MPI processes
>         type: lu
>           out-of-place factorization
>           tolerance for zero pivot 2.22045e-14
>           matrix ordering: nd
>           factor fill ratio given 5., needed 1.30805
>             Factored matrix follows:
>               Mat Object: 1 MPI processes
>                 type: seqaij
>                 rows=448, cols=448
>                 package used to perform factorization: petsc
>                 total: nonzeros=14038, allocated nonzeros=14038
>                   using I-node routines: found 175 nodes, limit used is 5
>         linear system matrix = precond matrix:
>         Mat Object: (fieldsplit_e_) 1 MPI processes
>           type: seqaij
>           rows=448, cols=448
>           total: nonzeros=10732, allocated nonzeros=10732
>           total number of mallocs used during MatSetValues calls=0
>             using I-node routines: found 197 nodes, limit used is 5
>       Split number 1 Defined by IS
>       KSP Object: (fieldsplit_i1_) 1 MPI processes
>  ....
> 
