[petsc-dev] Memory problem with OpenMP and Fieldsplit sub solvers

Mark Adams mfadams at lbl.gov
Wed Jan 20 13:28:49 CST 2021


Barry, here is a stack trace. [moving this to dev]

I am thinking that KSPSetUp must be called before the OMP loop. That is
fine with me and I think it is a reasonable general model.

But will it work?

To be specific, just looking at LU, I see that SetUp does the numerical
factorization, and the PCSetUp interface skips this after it has been called
once. I would guess that Solve will redo the factorization (in the OMP
thread, on the GPU) if the data has changed (i.e., MatAssembly has been
called), and that could work.

Doing the factorization on the CPU, in serial, the first time is what I do
with GPU assembly as well. Not perfect, but if you can amortize the cost then
it's OK. Having all subsequent KSP solves run in the OMP loop would then be
fine.

So I'm going to give this model a try and see where I get. Comments welcome.
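
To make that concrete, here is a rough sketch of the model I have in mind
(illustrative only: pc is the fieldsplit PC, bvec/xvec are per-field
right-hand-side and solution vectors, and error checking inside the threaded
loop is omitted; this is not the actual fieldsplit code):

  KSP      *subksp;
  PetscInt  i,nsplits;

  ierr = PCFieldSplitGetSubKSP(pc,&nsplits,&subksp);CHKERRQ(ierr);
  /* serial setup: options processing, work vectors, and the first (numerical)
     factorization all happen here, before any threads exist */
  for (i=0; i<nsplits; i++) {
    ierr = KSPSetUp(subksp[i]);CHKERRQ(ierr);
  }
  /* subsequent solves (and any refactorization after a MatAssembly) run threaded */
  #pragma omp parallel for
  for (i=0; i<nsplits; i++) {
    KSPSolve(subksp[i],bvec[i],xvec[i]);
  }
  ierr = PetscFree(subksp);CHKERRQ(ierr);

The point being that the vector allocation and options calls that show up in
the trace below would then happen before the parallel region, not inside it.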

This is a messy abstraction in general; for instance, you can separate ...

#16 gomp_thread_start (xdata=<optimized out>) at
/sw/summit/gcc/6.4.0/src/gcc-6.4.0/libgomp/team.c:119 (at
0x0000200020c4a51c)
#15 PCApply_FieldSplit._omp_fn.0 () at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1285
(at 0x0000200000fb2584)
#14 PetscFSOMPSolve (y=<optimized out>, x=<optimized out>,
ilink=0x763ee450) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1238
(at 0x0000200000fb2584)
#13 KSPSolve (ksp=<optimized out>, b=<optimized out>, x=<optimized out>) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:963
(at 0x000020000105af78)
#12 KSPSolve_Private (ksp=0x763ef010, b=0x763f7fb0, x=<optimized out>) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:726
(at 0x00002000010586b0)
#11 KSPSetUp (ksp=0x763ef010) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:365
(at 0x000020000105636c)
#10 KSPSetUp_GMRES (ksp=0x763ef010) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/impls/gmres/gmres.c:85
(at 0x00002000010e2168)
#9 KSPCreateVecs (ksp=0x763ef010, rightn=<optimized out>,
right=0x200054007bb0, leftn=<optimized out>, left=0x0) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/iterativ.c:971
(at 0x0000200001063610)
#8 VecDuplicateVecs (v=<optimized out>, m=<optimized out>, V=<optimized
out>) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:439
(at 0x00002000003d3f58)
#7 VecDuplicateVecs_Default (w=0x76400200, m=<optimized out>,
V=0x200054007bb0) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:825
(at 0x00002000003d4500)
#6 VecDuplicate (v=<optimized out>, newv=<optimized out>) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:369
(at 0x00002000003d3d14)
#5 VecDuplicate_SeqCUDA (win=0x76400200, V=0x200054007d70) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:222
(at 0x00002000003c65b0)
#4 VecCreateSeqCUDA (comm=<optimized out>, n=<optimized out>,
v=0x200054007d70) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:213
(at 0x00002000003c64a4)
#3 VecSetType (vec=0x200054007db0, method=0x20000132f308 "seqcuda") at
/gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vecreg.c:91
(at 0x00002000003d7f90)
#2 VecCreate_SeqCUDA (V=0x200054007db0) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:237
(at 0x00002000003c6c2c)
#1 VecCUDAAllocateCheck (v=0x200054007db0) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/
veccuda2.cu:53 (at 0x000020000072b698)
#0 PetscOptionsEnd_Private (PetscOptionsObject=0x20003c712078) at
/gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/aoptions.c:521 (at
0x000020000026e880)

On Tue, Jan 19, 2021 at 10:46 AM Mark Adams <mfadams at lbl.gov> wrote:

> Well, Summit is down for the day, so I moved to NERSC, and
> OMP/FieldSplit/cuSparse/[ILU,LU] seems to be working there. I added
> PetscInfo lines (below), and the answers with 10 species/threads are
> perfect.
>
> It looks like the OMP threads are serialized (see below), but in random
> order. You can see that field 0 ("e") calls solve 14 times and converges in
> 13 iterations.
>
> I'm not sure what to make of this. I think I'll check whether there is
> any difference in total run time with 1 vs. 10 OMP threads.
>
>    9 SNES Function norm 4.741380472654e-13
> [0] MatCUSPARSEGetDeviceMatWrite(): Assemble more than once already
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] PCApply_FieldSplit(): thread 2 in field 2
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] PCApply_FieldSplit(): thread 7 in field 7
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>   ....
>
> [0] PCApply_FieldSplit(): thread 0 in field 0
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>   Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 13
> [0] PCApply_FieldSplit(): thread 5 in field 5
> [0] PCSetUp(): Setting up PC with same nonzero pattern
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
> [0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
>  ...
>
> On Tue, Jan 19, 2021 at 8:07 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Mon, Jan 18, 2021 at 11:06 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>   Can valgrind run and help with OpenMP?
>>>
>>
>> I am pretty sure. There is also cuda-memcheck, which has the same semantics
>> and works on GPU code, but I'm not sure how good it is for CPU code.
>>
>>
>>>
>>>   You can run in the debugger and find any calls to the options checking
>>> inside your code block and comment them all out to see if that eliminates
>>> the problem.
>>>
>>
>> The stack trace does give me the method in which the fatal Free is called,
>> so I will try a breakpoint there. DDT does work with threads, but not with
>> GPU code.
>>
>>
>>>
>>>   Also, generically, how safe is CUDA inside OpenMP? That is, with multiple
>>> threads calling CUDA stuff?
>>>
>>
>> I recall that the XGC code, which has a lot of OMP and Cuda (and Kokkos),
>> does this. Not 100% sure.
>>
>> I know that they recently had to tear out some OMP loops that they had
>> Kokkos'ized because they had some problem mixing Kokkos-OMP and Cuda, so
>> they reverted back to pure OMP.
>>
>>
>>>
>>>
>>>   Barry
>>>
>>>
>>>
>>> On Jan 18, 2021, at 7:04 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>
>>> Added this w/o luck:
>>>
>>> #if defined(PETSC_HAVE_CUDA)
>>>   ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);
>>> #if defined(PETSC_HAVE_THREADSAFETY)
>>>   ierr = PetscCUPMInitializeCheck();CHKERRQ(ierr);
>>> #endif
>>> #endif
>>>
>>> Do you think I should keep this in or take it out? Seems like a good
>>> idea and when it all works we can see if we can make it lazy.
>>>
>>> 1)  Calling PetscOptions inside threads. I looked quickly at the code
>>>> and it seems like it should be ok, but perhaps not. This is one reason why
>>>> having stuff like PetscOptionsBegin inside a low-level creation routine
>>>> like VecCreate_SeqCUDA_Private is normally not done in PETSc. Eventually
>>>> this needs to be moved or reworked.
>>>>
>>>>
>>> I will try this next. It is hard to see the stack here. I think I will
>>> run it in DDT and set a breakpoint on PetscOptionsEnd_Private. Other ideas
>>> welcome.
>>>
>>> Mark
>>>
>>>
>>>> 2) PetscCUDAInitializeCheck is not thread safe. If it is being called for
>>>> the first time by multiple threads there can be trouble. So edit init.c
>>>> and under
>>>>
>>>> #if defined(PETSC_HAVE_CUDA)
>>>>   ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);
>>>> #endif
>>>>
>>>> #if defined(PETSC_HAVE_HIP)
>>>>   ierr = PetscOptionsCheckHIP(logView);CHKERRQ(ierr);
>>>> #endif
>>>>
>>>> put in
>>>> #if defined(PETSC_HAVE_THREADSAFETY)
>>>>   ierr = PetscCUPMInitializeCheck();CHKERRQ(ierr);
>>>> #endif
>>>>
>>>> this will force the initialization to be done before any threads are used.
>>>>
>>>
>>>>
>>>

