<div dir="ltr">Barry, Here is a stack trace. [moving this to dev]<div><br></div><div>I am thinking that KSPSetUp must be called before the OMP loop. That is fine with me and I think it is a reasonable general model.</div><div><br></div><div>But will it work?</div><div><br></div><div>To be specific, just looking at LU I see that SetUp does the numerical factorization and the PCSetUp interface skips this after it is called once. I would guess that Solve will do the factorization (in the OMP thread/GPU) if the data changes (MatAssembly has been called). And that could work.</div><div><br></div><div>Doing the factorization on the CPU, in serial, the first time is what I do with GPU assembly also. Not perfect but if you can amortize then its OK. Then having all subsequent KSP solves work in OMP would be OK. </div><div><br></div><div>So I'm going to give this model a try and see where I get. Comments welcome.</div><div><div><br></div><div>This is a messy abstraction in general, for instance, you can separated <div>#16 gomp_thread_start (xdata=<optimized out>) at /sw/summit/gcc/6.4.0/src/gcc-6.4.0/libgomp/team.c:119 (at 0x0000200020c4a51c)<br>#15 PCApply_FieldSplit._omp_fn.0 () at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1285 (at 0x0000200000fb2584)<br>#14 PetscFSOMPSolve (y=<optimized out>, x=<optimized out>, ilink=0x763ee450) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1238 (at 0x0000200000fb2584)<br>#13 KSPSolve (ksp=<optimized out>, b=<optimized out>, x=<optimized out>) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:963 (at 0x000020000105af78)<br>#12 KSPSolve_Private (ksp=0x763ef010, b=0x763f7fb0, x=<optimized out>) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:726 (at 0x00002000010586b0)<br>#11 KSPSetUp (ksp=0x763ef010) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:365 (at 0x000020000105636c)<br>#10 KSPSetUp_GMRES (ksp=0x763ef010) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/impls/gmres/gmres.c:85 (at 0x00002000010e2168)<br>#9 KSPCreateVecs (ksp=0x763ef010, rightn=<optimized out>, right=0x200054007bb0, leftn=<optimized out>, left=0x0) at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/iterativ.c:971 (at 0x0000200001063610)<br>#8 VecDuplicateVecs (v=<optimized out>, m=<optimized out>, V=<optimized out>) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:439 (at 0x00002000003d3f58)<br>#7 VecDuplicateVecs_Default (w=0x76400200, m=<optimized out>, V=0x200054007bb0) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:825 (at 0x00002000003d4500)<br>#6 VecDuplicate (v=<optimized out>, newv=<optimized out>) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vector.c:369 (at 0x00002000003d3d14)<br>#5 VecDuplicate_SeqCUDA (win=0x76400200, V=0x200054007d70) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:222 (at 0x00002000003c65b0)<br>#4 VecCreateSeqCUDA (comm=<optimized out>, n=<optimized out>, v=0x200054007d70) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:213 (at 0x00002000003c64a4)<br>#3 VecSetType (vec=0x200054007db0, method=0x20000132f308 "seqcuda") at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/interface/vecreg.c:91 (at 0x00002000003d7f90)<br>#2 VecCreate_SeqCUDA (V=0x200054007db0) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:237 (at 0x00002000003c6c2c)<br>#1 VecCUDAAllocateCheck (v=0x200054007db0) at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/vec/impls/seq/seqcuda/<a href="http://veccuda2.cu:53">veccuda2.cu:53</a> (at 0x000020000072b698)<br>#0 PetscOptionsEnd_Private (PetscOptionsObject=0x20003c712078) at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/aoptions.c:521 (at 0x000020000026e880)<br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jan 19, 2021 at 10:46 AM Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Well, Summit is down for the day so I moved to NERSc and OMP/FieldSplit/cuSparse/[ILU,LU] seem to be working there. I added PetscInfo lines, below, and the answers with 10 species/threads are perfect.<div><br></div><div>It looks like OMP threads are serialized (see below) but in random order.  You can see that field 0 ("e") calls solve 14 times and converged in 13 iterations.</div><div><br></div><div>I'm not sure what to make of this. I think I'll try to see if I can see any difference in total run time with 1 and 10 OMP threads.</div><div><br></div><div>   9 SNES Function norm 4.741380472654e-13 <br>[0] MatCUSPARSEGetDeviceMatWrite(): Assemble more than once already<br>[0] PCSetUp(): Setting up PC with same nonzero pattern<br>[0] PCApply_FieldSplit(): thread 2 in field 2<br>[0] PCSetUp(): Setting up PC with same nonzero pattern<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] PCApply_FieldSplit(): thread 7 in field 7<br>[0] PCSetUp(): Setting up PC with same nonzero pattern<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br></div><div><div>  ....</div><div><b>[0] PCApply_FieldSplit(): thread 0 in field 0<br>[0] PCSetUp(): Setting up PC with same nonzero pattern<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>      Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 13</b><br>[0] PCApply_FieldSplit(): thread 5 in field 5<br>[0] PCSetUp(): Setting up PC with same nonzero pattern<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br>[0] MatSolve_SeqAIJCUSPARSE_NaturalOrdering(): Cuda solve (NO)<br></div><div><div> ...</div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jan 19, 2021 at 8:07 AM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jan 18, 2021 at 11:06 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div>  Can valgrind run and help with OpenMP?</div></blockquote><div><br></div><div>I am pretty sure. There is also cuda-memcheck that has the same semantics that works on GPU code, but I'm not sure how good it is for CPU code.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div><div>  You can run in the debugger and find any calls to the options checking inside your code block and comment them all out to see if that eliminates the problem.</div></div></blockquote><div><br></div><div>The stack trace does give me the method that it calls the fatal Free in, so I will try a breakpoint in there. DDT does work with threads but not GPU code.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div><div>  Also generically how safe is CUDA inside OpenMP? That is with multiple threads calling CUDA stuff?</div></div></blockquote><div><br></div><div>I recall that the XGC code, which has a lot of OMP, Cuda (and Kokkos) does this. Not 100% sure. </div><div><br></div><div>I know that they recently had to tear out some OMP loops that they Kokkos'ized because they had some problem mixing Kokkos-OMP and Cuda so they reverted back to pure OMP.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div><div><br></div><div>  Barry</div><div><br></div><div><br><div><br><blockquote type="cite"><div>On Jan 18, 2021, at 7:04 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:</div><br><div><div dir="ltr"><div dir="ltr"><div><br></div><div>Added this w/o luck:</div><div><br></div><div>#if defined(PETSC_HAVE_CUDA)<br>  ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);<br>#if defined(PETSC_HAVE_THREADSAFETY)<br>  ierr = PetscCUPMInitializeCheck();CHKERRQ(ierr);<br>#endif<br>#endif<br></div><div> </div><div>Do you think I should keep this in or take it out? Seems like a good idea and when it all works we can see if we can make it lazy.</div><div><br></div></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

1)  Calling PetscOptions inside threads. I looked quickly at the code and it seems like it should be ok but perhaps not. This is one reason why having stuff like PetscOptionsBegin inside a low-level creation VecCreate_SeqCUDA_Private is normally not done in PETSc. Eventually this needs to be moved or reworked. <br>

<br></blockquote><div><br></div><div>I will try this next. It is hard to see the stack here. I think I will put it in ddt and put a breakpoint PetscOptionsEnd_Private. Other ideas welcome.</div><div><br></div><div>Mark</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

2) PetscCUDAInitializeCheck is not thread safe.If it is being call for the first timeby multiple threads there can be trouble. So edit init.c and under<br>

<br>

#if defined(PETSC_HAVE_CUDA)<br>

  ierr = PetscOptionsCheckCUDA(logView);CHKERRQ(ierr);<br>

#endif<br>

<br>

#if defined(PETSC_HAVE_HIP)<br>

  ierr = PetscOptionsCheckHIP(logView);CHKERRQ(ierr);<br>

#endif<br>

<br>

put in <br>

#if defined thread safety<br>

PetscCUPMInitializeCheck<br>

#endif<br>

<br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

this will force the initialize to be done before any threads are used <br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br></blockquote></div></div>

</div></blockquote></div><br></div></div></blockquote></div></div>

</blockquote></div>

</blockquote></div>