<div dir="ltr"><div dir="ltr">On Thu, Apr 16, 2020 at 10:11 AM Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Thu, Apr 16, 2020 at 9:31 AM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Thu, Apr 16, 2020 at 8:42 AM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Yeah, GPU assembly would be great. I was figuring OMP might be simpler.<div><br></div><div>As far as the interface goes, I am flexible. The simplest way to do it would be to take an array of element matrices and a DMPlex and call DMPlexMatSetClosure. You can see this code in mark/feature-xgc-interface-rebase, at the bottom of src/vec/vec/impls/seq/seqcuda/<a href="http://landau.cu" target="_blank">landau.cu</a>.</div><div><br></div><div>I was shy about putting a version of DMPlexMatSetClosure in CUDA, but maybe that is easier: just plow through it and cut stuff that we don't need. OMP broke because there are some temp arrays that Matt caches that need to be "private" or dealt with in some way.</div></div></blockquote><div><br></div><div>We should refactor so that all temp arrays are sized and constructed up front, and then the work is done in an internal function which is passed those arrays. I tried to do this, but might have crapped out here. 
Then</div><div>you can just call the internal function directly with your arrays.</div></div></div></blockquote><div><br></div><div>Just to be clear, we have to copy the code to a .cu file and declare everything as a device method (__global__), right? </div><div><br></div><div>Or write a batched version of DMPlexMatSetClosure and, when it gets down to some kernel like MatSetValues, we, for example, move the Mat pointers to the GPU, copy the element matrices to the device, launch (in a .cu file) a kernel that calls a __global__ version of MatSetValues_SeqAIJ (with mallocs stripped out), then copy the Mat pointers back to the CPU. All this copy stuff is usually done with a shadow copy of the object, but that is complicated by cusparse matrices which, from what I can tell, have a different, cusparse-friendly, device Mat object.</div></div></div></blockquote><div><br></div><div>I would just get the closure indices for that batch of cells, push them to the GPU, and call MatSetValues() from the GPU.</div><div><br></div><div>Here is the way I am thinking. You are not going to push the Plex to the GPU, so you have to do Plex->Indices on the CPU. Just do that,</div><div>push them down, and use PETSc Mat from there.</div><div><br></div><div>  Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>  Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Coloring is not attractive to me because GPUs demand a lot of parallelism, and this serial (velocity space) solver would be embedded in a full 3D code that does not use a huge amount of MPI parallelism. 
For instance, if the app code were to use 6 (or 7 max on SUMMIT) cores per GPU (or even 4x that with hardware threads), then <i>I could imagine</i> there would be enough parallelism, with coloring, to fuse the element construction and assembly, that is, to assemble the element matrices right after they are created. That would be great in terms of not storing all these matrices and then assembling them all at once. The app that I am targeting does not use that much MPI parallelism though. But we could explore the coloring space, and my mental model could be inaccurate. (Note: I added 8x more parallelism to my code this week and got a 25% speedup, using one whole GPU.)</div><div><br></div><div>Or if you have some sort of lower-level synchronization that could allow for fusing the assembly with the element creation, then, by all means, we can explore that.</div><div><br></div><div>I'd be happy to work with you on this.</div><div><br></div><div>Thanks,</div><div>Mark</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 13, 2020 at 7:08 PM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Probably matrix assembly on GPU is more important. Do you have an example for me to play with to see what GPU interface we should have?<br clear="all"><div><div dir="ltr"><div dir="ltr">--Junchao Zhang</div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 13, 2020 at 5:44 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I was looking into assembling matrices with threads. 
I have a coloring to avoid conflicts.<div><br></div><div>Turning off all the logging seems like overkill. For methods that can get called in a thread, we could use PETSC_HAVE_THREADSAFETY to protect logging functions. Then one can still get timings for the whole assembly process, just not for MatSetValues. Few people are going to do this. I don't think it will be a time sink, and if it is, we just revert to saying 'turn logging off'. Logging is pretty important, and I don't see a good argument for insisting on turning it off if we just say that we are going to protect methods as needed.</div><div><br></div><div>It is not a big deal; I am just exploring this idea. It is such a basic concept in shared-memory sparse linear algebra that it seems like a good thing to be able to support, and to have in an example, to say we can assemble matrices in threads (not that it is a great idea). We have all the tools (e.g., coloring methods), so it is just a matter of protecting a few methods. I use DMPlex MatClosure instead of MatSetValues, and this is where I die now with non-thread-safe code. We have an idea, from Jed, on how to fix it.</div><div><br></div><div>Anyway, thanks for your help, but I think we should hold off on doing anything until we have some consensus that it would be a good idea to put some effort into getting a thread-safe PETSc that can support OMP matrix assembly, with a nice compact example.</div><div><br></div><div>Thanks again,</div><div>Mark</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 13, 2020 at 5:44 PM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Mark,<div> I saw you had "--with-threadsafety --with-log=0".  
Do you really want to call PETSc from multiple threads (in contrast to letting PETSc call other libraries, e.g., BLAS, that do the multithreading)?  If not, you can drop --with-threadsafety.</div><div> I have <a href="https://gitlab.com/petsc/petsc/-/merge_requests/2714" target="_blank">https://gitlab.com/petsc/petsc/-/merge_requests/2714</a> that should fix your original compilation errors.</div><div><br></div><div><div><div dir="ltr"><div dir="ltr">--Junchao Zhang</div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 13, 2020 at 2:07 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><a href="https://www.mcs.anl.gov/petsc/miscellaneous/petscthreads.html" target="_blank">https://www.mcs.anl.gov/petsc/miscellaneous/petscthreads.html</a><br><div><br></div><div>and I see this on my Mac:</div><div><br></div><div>14:23 1 mark/feature-xgc-interface-rebase *= ~/Codes/petsc$ ../arch-macosx-gnu-O-omp.py<br>===============================================================================<br>             Configuring PETSc to compile on your system<br>===============================================================================<br>===============================================================================<br>Warning: PETSC_ARCH from environment does not match command-line or name of script.<br>Warning: Using from command-line or name of script: arch-macosx-gnu-O-omp, ignoring environment: arch-macosx-gnu-g<br>===============================================================================<br>TESTING: configureLibraryOptions from PETSc.options.libraryOptions(config/PETSc/options/libraryOptions.py:37)<br>
*******************************************************************************<br>         UNABLE to CONFIGURE with GIVEN OPTIONS    (see configure.log for details):<br>-------------------------------------------------------------------------------<br>Must use --with-log=0 with --with-threadsafety<br>*******************************************************************************<br><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 13, 2020 at 2:54 PM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 13, 2020 at 12:06 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">BTW, I can build on SUMMIT with logging and OMP, apparently. I also seem to be able to build with debugging. 
Neither is allowed according to the docs. I am puzzled.</div></blockquote><div> What are "the docs"?</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 13, 2020 at 12:05 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I think the problem is that you have to turn off logging with OpenMP, and the (newish) GPU timing code did not protect its timers.<div><br></div><div>I don't see a good reason to require logging be turned off with OMP. We could use PETSC_HAVE_THREADSAFETY to protect logs that we care about (e.g., in MatSetValues), and as users discover more things that they want to call in an OMP thread block, we tell them to turn logging off and we will fix it when we can.</div><div><br></div><div>Any thoughts on the idea of letting users keep logging with OpenMP?</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 13, 2020 at 11:40 AM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Yes. Looks like we need to include petsclog.h. 
Don't know why OMP triggered the error.<br clear="all"><div><div dir="ltr"><div dir="ltr">--Junchao Zhang</div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 13, 2020 at 9:59 AM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Should I do an MR to fix this?</div>

</blockquote></div>

</blockquote></div>

</blockquote></div>

</blockquote></div></div>

</blockquote></div>

</blockquote></div>

</blockquote></div>

</blockquote></div>

</blockquote></div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>

</blockquote></div></div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>