[petsc-dev] Memory problem with OpenMP and Fieldsplit sub solvers

Mark Adams mfadams at lbl.gov
Tue Jan 19 09:46:01 CST 2021


Well, Summit is down for the day, so I moved to NERSC, and
OMP/FieldSplit/cuSparse/[ILU,LU] seems to be working there. I added the
PetscInfo lines below, and the answers with 10 species/threads are perfect.

It looks like the OMP threads are serialized (see below), but in random order.
You can see that field 0 ("e") calls MatSolve 14 times and converges in 13
iterations.

I'm not sure what to make of this. I will try to see whether there is any
difference in total run time between 1 and 10 OMP threads.
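
For reference, here is a minimal sketch of the kind of OMP-over-fields
dispatch that the "thread N in field N" PetscInfo lines below suggest. This
is not the actual PCApply_FieldSplit source; the function name, arguments,
and error handling are made up for illustration only.

#include <petscksp.h>

/* Sketch: give each OMP thread one split and run that field's sub-solve.
   CHKERRQ cannot be used inside the parallel region (it would return from
   inside the loop), so errors are captured with an atomic write instead. */
static PetscErrorCode ApplyFieldSplitThreaded_Sketch(PetscInt nfields, KSP subksp[], Vec subb[], Vec subx[])
{
  PetscErrorCode ierr = 0;
  #pragma omp parallel for schedule(static)
  for (PetscInt i = 0; i < nfields; i++) {
    /* with -info, this is roughly where "thread t in field i" gets logged */
    PetscErrorCode ierr_i = KSPSolve(subksp[i], subb[i], subx[i]);
    if (ierr_i) {
      #pragma omp atomic write
      ierr = ierr_i;
    }
  }
  return ierr;
}

With 10 splits and 10 threads, each field gets its own thread, which is
consistent with the thread/field pairs (0/0, 2/2, 5/5, 7/7) in the output.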

   9 SNES Function norm 4.741380472654e-13
[0] MatCUSPARSEGetDeviceMatWrite(): Assemble more than once already
[0] PCSetUp(): Setting up PC with same nonzero pattern
[0] PCApply_FieldSplit(): thread 2 in field 2
[0] PCSetUp(): Setting up PC with same nonzero pattern
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] PCApply_FieldSplit(): thread 7 in field 7
[0] PCSetUp(): Setting up PC with same nonzero pattern
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
  ....

[0] PCApply_FieldSplit(): thread 0 in field 0
[0] PCSetUp(): Setting up PC with same nonzero pattern
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
  Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 13
[0] PCApply_FieldSplit(): thread 5 in field 5
[0] PCSetUp(): Setting up PC with same nonzero pattern
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)
 ...

On Tue, Jan 19, 2021 at 8:07 AM Mark Adams <mfadams at lbl.gov> wrote:

>
>
> On Mon, Jan 18, 2021 at 11:06 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   Can valgrind run and help with OpenMP?
>>
>
> I am pretty sure it can. There is also cuda-memcheck, which has the same
> semantics and works on GPU code, but I'm not sure how good it is for CPU code.
>
>
>>
>>   You can run in the debugger, find any calls to the options checking
>> inside your code block, and comment them all out to see if that eliminates
>> the problem.
>>
>
> The stack trace does give me the method in which the fatal free is called,
> so I will try a breakpoint in there. DDT does work with threads, but not with
> GPU code.
>
>
>>
>>   Also, generically, how safe is CUDA inside OpenMP? That is, with multiple
>> threads calling CUDA stuff?
>>
>
> I recall that the XGC code, which has a lot of OMP and Cuda (and Kokkos),
> does this, but I'm not 100% sure.
>
> I know that they recently had to tear out some OMP loops that they had
> Kokkos'ized because they had a problem mixing Kokkos-OMP and Cuda, so they
> reverted back to pure OMP.
>
>
>>
>>
>>   Barry
>>
>>
>>
>> On Jan 18, 2021, at 7:04 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>>
>> Added this w/o luck:
>>
>> #if defined(PETSC_HAVE_CUDA)
>>   ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);
>> #if defined(PETSC_HAVE_THREADSAFETY)
>>   ierr = PetscCUPMInitializeCheck();CHKERRQ(ierr);
>> #endif
>> #endif
>>
>> Do you think I should keep this in or take it out? It seems like a good idea,
>> and when it all works we can see if we can make it lazy.
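>>
>> If we do make it lazy later, it would need something like a double-checked
>> guard so that the first call is safe from inside OMP threads. A rough,
>> untested sketch (the guard flag and wrapper name are made up, not anything
>> in PETSc; it assumes PetscCUPMInitializeCheck() is callable as above):
>>
>> #include <petscsys.h>
>>
>> static int cupm_initialized = 0;   /* illustrative guard flag */
>>
>> PetscErrorCode LazyCUPMInitializeCheck(void)
>> {
>>   PetscErrorCode ierr = 0;
>>   int            done;
>>   #pragma omp atomic read
>>   done = cupm_initialized;
>>   if (!done) {                     /* not yet initialized: take the slow path */
>>     #pragma omp critical (lazy_cupm_init)
>>     {
>>       if (!cupm_initialized) {     /* re-check while holding the critical section */
>>         ierr = PetscCUPMInitializeCheck();
>>         if (!ierr) {
>>           #pragma omp atomic write
>>           cupm_initialized = 1;
>>         }
>>       }
>>     }
>>   }
>>   return ierr;
>> }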
>>
>>> 1) Calling PetscOptions inside threads. I looked quickly at the code and
>>> it seems like it should be OK, but perhaps not. This is one reason why
>>> having stuff like PetscOptionsBegin inside a low-level creation routine
>>> such as VecCreate_SeqCUDA_Private is normally not done in PETSc. Eventually
>>> this needs to be moved or reworked.
>>>
>>>
>> I will try this next. It is hard to see the stack here. I think I will
>> run it in DDT and set a breakpoint at PetscOptionsEnd_Private. Other ideas
>> are welcome.
>>
>> Mark
>>
>>
>>> 2) PetscCUDAInitializeCheck is not thread safe. If it is being called for
>>> the first time by multiple threads there can be trouble. So edit init.c
>>> and, under
>>>
>>> #if defined(PETSC_HAVE_CUDA)
>>>   ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);
>>> #endif
>>>
>>> #if defined(PETSC_HAVE_HIP)
>>>   ierr = PetscOptionsCheckHIP(logView);CHKERRQ(ierr);
>>> #endif
>>>
>>> put in:
>>> #if defined(PETSC_HAVE_THREADSAFETY)
>>>   ierr = PetscCUPMInitializeCheck();CHKERRQ(ierr);
>>> #endif
>>>
>>> This will force the initialization to be done before any threads are used.
>>>
>>
>>>
>>