[petsc-dev] Memory problem with OpenMP and Fieldsplit sub solvers

Barry Smith bsmith at petsc.dev
Wed Jan 20 17:21:44 CST 2021



> On Jan 20, 2021, at 3:09 PM, Mark Adams <mfadams at lbl.gov> wrote:
> 
> So I put in a temporary hack to get the first Fieldsplit apply to NOT use OMP and it sort of works. 
> 
> Preonly/lu is fine. GMRES does vector creates/duplicates in every solve, so that is a big problem.

  It should definitely not be creating vectors in every solve. But it does do lazy allocation of the needed restart vectors, which may make it look like it is creating vectors in every solve. You can use -ksp_gmres_preallocate to force it to create all the restart vectors up front in KSPSetUp(). 
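
  For example, something along these lines should move all of the GMRES work-vector allocation into KSPSetUp() (just a sketch; "subksp" stands for one of your fieldsplit sub-KSPs):

    ierr = KSPGMRESSetPreAllocateVectors(subksp);CHKERRQ(ierr);  /* programmatic form of -ksp_gmres_preallocate */
    ierr = KSPSetUp(subksp);CHKERRQ(ierr);                       /* restart vectors get created here, not during KSPSolve() */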

  Why is creating vectors "at every solve" a problem? It is not thread safe, I guess?

> Richardson works, except the convergence test gets confused, presumably because MPI reductions on PETSC_COMM_SELF are not thread safe.

> 
> One fix for the norms might be to create each subdomain solver with a different communicator.

   Yes, you could do that. It might actually be the correct thing to do as well: if multiple threads call MPI reductions on the same communicator, that is a problem. Each KSP should get a new MPI_Comm. 
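
   Something like the following is what I mean (an untested sketch; nsplits and subksp[] are placeholders for the fieldsplit link data):

     for (i = 0; i < nsplits; i++) {
       MPI_Comm subcomm;
       ierr = MPI_Comm_dup(PETSC_COMM_SELF,&subcomm);CHKERRQ(ierr);  /* private communicator for this split */
       ierr = KSPCreate(subcomm,&subksp[i]);CHKERRQ(ierr);           /* this KSP's reductions use only subcomm */
     }

   Then two threads computing norms at the same time are reducing on different communicators and cannot interfere with each other.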



> 
> I could live with this for my project but maybe I should take a shot at something better.
> 
> Any thoughts?
> 
> 
> On Wed, Jan 20, 2021 at 2:28 PM Mark Adams <mfadams at lbl.gov> wrote:
> Barry, Here is a stack trace. [moving this to dev]
> 
> I am thinking that KSPSetUp must be called before the OMP loop. That is fine with me and I think it is a reasonable general model.
> 
> But will it work?
> 
> To be specific, just looking at LU, I see that SetUp does the numerical factorization, and the PCSetUp interface skips this after it is called once. I would guess that Solve will redo the factorization (in the OMP thread / on the GPU) if the data changes (i.e., MatAssembly has been called). And that could work.
> 
> Doing the factorization on the CPU, in serial, the first time is what I do with GPU assembly also. Not perfect, but if you can amortize it then it's OK. Then having all subsequent KSP solves work in OMP would be OK. 
> 
> So I'm going to give this model a try and see where I get. Comments welcome.
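> 
> Roughly the model, as a sketch with made-up names (subksp[], b[], and x[] stand for the per-field solvers and vectors; error checking is left out inside the parallel region):
> 
>   for (i = 0; i < nfields; i++) {      /* serial setup: factorizations, work vectors, GPU data */
>     ierr = KSPSetUp(subksp[i]);CHKERRQ(ierr);
>   }
>   #pragma omp parallel for
>   for (i = 0; i < nfields; i++) {      /* threaded solves: each thread only calls KSPSolve on its field */
>     KSPSolve(subksp[i], b[i], x[i]);
>   }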
> 
> This is a messy abstraction in general; for instance, you can separate
> #16 gomp_thread_start (xdata=<optimized out>) at /sw/summit/gcc/6.4.0/src/gcc-6.4.0/libgomp/team.c:119 (at 0x0000200020c4a51c)
> #15 PCApply_FieldSplit._omp_fn.0 () at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1285 (at 0x0000200000fb2584)
> #14 PetscFSOMPSolve (y=<optimized out>, x=<optimized out>, ilink=0x763ee450) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1238 (at 0x0000200000fb2584)
> #13 KSPSolve (ksp=<optimized out>, b=<optimized out>, x=<optimized out>) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:963 (at 0x000020000105af78)
> #12 KSPSolve_Private (ksp=0x763ef010, b=0x763f7fb0, x=<optimized out>) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:726 (at 0x00002000010586b0)
> #11 KSPSetUp (ksp=0x763ef010) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:365 (at 0x000020000105636c)
> #10 KSPSetUp_GMRES (ksp=0x763ef010) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/impls/gmres/gmres.c:85 (at 0x00002000010e2168)
> #9 KSPCreateVecs (ksp=0x763ef010, rightn=<optimized out>, right=0x200054007bb0, leftn=<optimized out>, left=0x0) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/iterativ.c:971 (at 0x0000200001063610)
> #8 VecDuplicateVecs (v=<optimized out>, m=<optimized out>, V=<optimized out>) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:439 (at 0x00002000003d3f58)
> #7 VecDuplicateVecs_Default (w=0x76400200, m=<optimized out>, V=0x200054007bb0) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:825 (at 0x00002000003d4500)
> #6 VecDuplicate (v=<optimized out>, newv=<optimized out>) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:369 (at 0x00002000003d3d14)
> #5 VecDuplicate_SeqCUDA (win=0x76400200, V=0x200054007d70) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:222 (at 0x00002000003c65b0)
> #4 VecCreateSeqCUDA (comm=<optimized out>, n=<optimized out>, v=0x200054007d70) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:213 (at 0x00002000003c64a4)
> #3 VecSetType (vec=0x200054007db0, method=0x20000132f308 "seqcuda") at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vecreg.c:91 (at 0x00002000003d7f90)
> #2 VecCreate_SeqCUDA (V=0x200054007db0) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:237 (at 0x00002000003c6c2c)
> #1 VecCUDAAllocateCheck (v=0x200054007db0) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda2.cu:53 (at 0x000020000072b698)
> #0 PetscOptionsEnd_Private (PetscOptionsObject=0x20003c712078) at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/aoptions.c:521 (at 0x000020000026e880)
> 
> On Tue, Jan 19, 2021 at 10:46 AM Mark Adams <mfadams at lbl.gov> wrote:
> Well, Summit is down for the day so I moved to NERSC, and OMP/FieldSplit/cuSparse/[ILU,LU] seem to be working there. I added PetscInfo lines, below, and the answers with 10 species/threads are perfect.
> 
> It looks like the OMP threads are serialized (see below), but in random order. You can see that field 0 ("e") calls MatSolve 14 times and converged in 13 iterations.
> 
> I'm not sure what to make of this. I think I'll try to see if I can see any difference in total run time with 1 and 10 OMP threads.
> 
>    9 SNES Function norm 4.741380472654e-13 
> [0] MatCUSPARSEGetDeviceMatWrite(): Assemble more than once already
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] PCApply_FieldSplit(): thread 2 in field 2
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] PCApply_FieldSplit(): thread 7 in field 7
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>   ....
> [0] PCApply_FieldSplit(): thread 0 in field 0
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>       Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 13
> [0] PCApply_FieldSplit(): thread 5 in field 5
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>  ...
> 
> On Tue, Jan 19, 2021 at 8:07 AM Mark Adams <mfadams at lbl.gov> wrote:
> 
> 
> On Mon, Jan 18, 2021 at 11:06 PM Barry Smith <bsmith at petsc.dev> wrote:
> 
>   Can valgrind run and help with OpenMP?
> 
> I am pretty sure. There is also cuda-memcheck, which has the same semantics and works on GPU code, but I'm not sure how good it is for CPU code.
>  
> 
>   You can run in the debugger, find any calls to the options checking inside your code block, and comment them all out to see if that eliminates the problem.
> 
> The stack trace does give me the method in which the fatal free is called, so I will try a breakpoint in there. DDT does work with threads, but not with GPU code.
>  
> 
>   Also, generically, how safe is CUDA inside OpenMP? That is, with multiple threads calling CUDA stuff?
> 
> I recall that the XGC code, which has a lot of OMP, CUDA (and Kokkos), does this. Not 100% sure. 
> 
> I know that they recently had to tear out some OMP loops that they had Kokkos'ized, because they had a problem mixing Kokkos-OMP and CUDA, so they reverted back to pure OMP.
>  
> 
> 
>   Barry
> 
> 
> 
>> On Jan 18, 2021, at 7:04 PM, Mark Adams <mfadams at lbl.gov> wrote:
>> 
>> 
>> Added this w/o luck:
>> 
>> #if defined(PETSC_HAVE_CUDA)
>>   ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);
>> #if defined(PETSC_HAVE_THREADSAFETY)
>>   ierr = PetscCUPMInitializeCheck();CHKERRQ(ierr);
>> #endif
>> #endif
>>  
>> Do you think I should keep this in or take it out? Seems like a good idea and when it all works we can see if we can make it lazy.
>> 
>> 1)  Calling PetscOptions inside threads. I looked quickly at the code and it seems like it should be OK, but perhaps not. This is one reason why having stuff like PetscOptionsBegin inside a low-level creation routine such as VecCreate_SeqCUDA_Private is normally not done in PETSc. Eventually this needs to be moved or reworked. 
>> 
>> 
>> I will try this next. It is hard to see the stack here. I think I will run it in DDT and put a breakpoint at PetscOptionsEnd_Private. Other ideas welcome.
>> 
>> Mark
>>  
>> 2) PetscCUDAInitializeCheck is not thread safe. If it is being called for the first time by multiple threads there can be trouble. So edit init.c, and under
>> 
>> #if defined(PETSC_HAVE_CUDA)
>>   ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);
>> #endif
>> 
>> #if defined(PETSC_HAVE_HIP)
>>   ierr = PetscOptionsCheckHIP(logView);CHKERRQ(ierr);
>> #endif
>> 
>> put in 
>> #if defined(PETSC_HAVE_THREADSAFETY)
>>   ierr = PetscCUPMInitializeCheck();CHKERRQ(ierr);
>> #endif
>> 
>> This will force the initialization to be done before any threads are used.
>> 
> 


