<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Menlo-Regular;}
@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
span.apple-converted-space
{mso-style-name:apple-converted-space;}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="blue" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">Thanks Jacob for looking into this – </p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">You can see the updated source code of ex11fc in the attachment – although I modified very little (apart from the extra diagnostic messages I print). I also attached the full output (ex11fc.log) along with the configure.log file. The machine is an old
dual Xeon workstation (one of my “production” machines) with Linux kernel 5.4.0 and gcc 9.3.
</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I simply ran the code with </p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">mpiexec -np 2 ex11fc -usecuda </p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">for the GPU test. As stated before, running without the “-usecuda” option shows no errors.
</p>
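<p class="MsoNormal">To compare the two runs side by side, here is a minimal sketch, assuming a POSIX shell with grep. The helper name count_petsc_errors and the file names cpu.log and gpu.log are hypothetical, and the mpiexec lines are left as comments because they need the attached binary:</p>
<pre>
```shell
# Hypothetical helper (not part of PETSc): print how many lines of a log
# file carry the "PETSC ERROR" tag that PETSc puts on its error output.
count_petsc_errors() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
  # keeps that case from being treated as a failure under "set -e".
  grep -c 'PETSC ERROR' "$1" || true
}

# Intended use (assumes ex11fc can be launched as in this message; the
# file names cpu.log and gpu.log are placeholders):
#   mpiexec -np 2 ex11fc           > cpu.log 2>&1
#   mpiexec -np 2 ex11fc -usecuda  > gpu.log 2>&1
#   count_petsc_errors cpu.log
#   count_petsc_errors gpu.log
```
</pre>
<p class="MsoNormal">If the behaviour is as described, the first count should be 0 and the second should not.</p>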
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Please let me know if you find anything wrong with the configure/code. </p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Cheers, </p>
<p class="MsoNormal">Hao</p>
<p class="MsoNormal"></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="mso-element:para-border-div;border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="border:none;padding:0in"><b>From: </b><a href="mailto:jacob.fai@gmail.com">Jacob Faibussowitsch</a><br>
<b>Sent: </b>Wednesday, January 19, 2022 3:38 AM<br>
<b>To: </b><a href="mailto:dong-hao@outlook.com">Hao DONG</a><br>
<b>Cc: </b><a href="mailto:junchao.zhang@gmail.com">Junchao Zhang</a>; <a href="mailto:petsc-users@mcs.anl.gov">
petsc-users</a><br>
<b>Subject: </b>Re: [petsc-users] Strange CUDA failure with a second petscfinalize with PETSc 3.16</p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Apologies, I forgot to mention this in my previous email, but can you also include a copy of the full printout of the error message that you get? It will include all the command-line flags that you ran with (if any) so I can exactly mirror your
environment.<o:p></o:p></p>
<div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<div>
<div>
<p class="MsoNormal"><span style="color:black">Best regards,<br>
<br>
Jacob Faibussowitsch<br>
(Jacob Fai - booss - oh - vitch)<o:p></o:p></span></p>
</div>
</div>
</div>
</div>
<div>
<p class="MsoNormal"><br>
<br>
<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<p class="MsoNormal">On Jan 18, 2022, at 14:06, Jacob Faibussowitsch <<a href="mailto:jacob.fai@gmail.com">jacob.fai@gmail.com</a>> wrote:<o:p></o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:9.0pt;font-family:"Menlo-Regular",serif">Can you send your updated source file as well as your configure.log (it should be at $PETSC_DIR/configure.log)? I will see if I can reproduce the error
on my end.<o:p></o:p></span></p>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Menlo-Regular",serif">Best regards,<br>
<br>
Jacob Faibussowitsch<br>
(Jacob Fai - booss - oh - vitch)<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Menlo-Regular",serif"><br>
<br>
<o:p></o:p></span></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:9.0pt;font-family:"Menlo-Regular",serif">On Jan 17, 2022, at 23:06, Hao DONG <<a href="mailto:dong-hao@outlook.com">dong-hao@outlook.com</a>> wrote:<o:p></o:p></span></p>
</blockquote>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<p class="MsoNormal"><span style="font-size:9.0pt"></span><span style="font-size:9.0pt;font-family:"Menlo-Regular",serif"><o:p></o:p></span></p>
<div>
<p class="MsoNormal">Dear Junchao and Jacob,<span class="apple-converted-space"> </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Thanks a lot for the response – I also don’t understand why this is related to the device, especially why the procedure can finish successfully *<b>once</b>* – As instructed, I added a CHKERRA() macro after (almost) every
PETSc call – initialization, matrix assembly, KSP creation, solve, matrix destruction, etc. However, all the other PETSc calls return error code 0. I only get a similar (still not very informative) error after I call PetscFinalize (for
the second time), with error code 97: <span class="apple-converted-space"> </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">[0]PETSC ERROR: GPU error<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">[0]PETSC ERROR: cuda error 709 (cudaErrorContextIsDestroyed) : context is destroyed<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">[0]PETSC ERROR: See <a href="https://petsc.org/release/faq/">
https://petsc.org/release/faq/</a> for trouble shooting.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">[0]PETSC ERROR: Petsc Release Version 3.16.3, unknown<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">[0]PETSC ERROR: ./ex11f on a named stratosphere by donghao Tue Jan 18 11:39:43 2022<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">[0]PETSC ERROR: Configure options --prefix=/opt/petsc/complex-double-with-cuda --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 COPTFLAGS="-O3 -mavx2" CXXOPTFLAGS="-O3 -mavx2" FOPTFLAGS="-O3 -ffree-line-length-none -mavx2" CUDAOPTFLAGS=-O3
--with-cxx-dialect=cxx14 --with-cuda-dialect=cxx14 --with-scalar-type=complex --with-precision=double --with-cuda-dir=/usr/local/cuda --with-debugging=1<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">[0]PETSC ERROR: #1 PetscFinalize() at /home/donghao/packages/petsc-current/src/sys/objects/pinit.c:1638<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">[0]PETSC ERROR: #2 User provided function() at User file:0<o:p></o:p></p>
</div>
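<div>
<p class="MsoNormal">The CUDA code reported here (709) differs from the one in my first message (201). A minimal sketch for listing the distinct codes that appear in a log, assuming a POSIX shell with grep and sort; list_cuda_errors is a hypothetical helper, not a PETSc tool:</p>
<pre>
```shell
# Hypothetical helper (not part of PETSc): list each distinct
# "cuda error NNN" code found in a log file, one per line.
list_cuda_errors() {
  # -o prints only the matched text; sort -u drops repeated codes.
  grep -o 'cuda error [0-9][0-9]*' "$1" | sort -u
}
```
</pre>
<p class="MsoNormal">For example, list_cuda_errors ex11fc.log would print one line per distinct code found in the attached log.</p>
</div>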
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">I can also confirm that the problem does *<b>not</b>* appear after rolling back to petsc 3.15, even with the new nvidia driver. And petsc 3.16.3 with an older nvidia driver (470.42) also gets the same error. So it’s probably not connected to the
nvidia driver.<span class="apple-converted-space"> </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Any idea where I should look next?<span class="apple-converted-space"> </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Thanks a lot in advance, and all the best,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Hao<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<div>
<p class="MsoNormal"><b>From:<span class="apple-converted-space"> </span></b><a href="mailto:jacob.fai@gmail.com">Jacob Faibussowitsch</a><br>
<b>Sent:<span class="apple-converted-space"> </span></b>Sunday, January 16, 2022 12:12 AM<br>
<b>To:<span class="apple-converted-space"> </span></b><a href="mailto:junchao.zhang@gmail.com">Junchao Zhang</a><br>
<b>Cc:<span class="apple-converted-space"> </span></b><a href="mailto:petsc-users@mcs.anl.gov">petsc-users</a>;<span class="apple-converted-space"> </span><a href="mailto:dong-hao@outlook.com">Hao DONG</a><br>
<b>Subject:<span class="apple-converted-space"> </span></b>Re: [petsc-users] Strange CUDA failure with a second petscfinalize with PETSc 3.16<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">I don’t quite understand how it is getting to the CUDA error to be honest. None of the code in the stack trace is anywhere near the device code. Reading the error message carefully, it first chokes on PetscLogGetStageLog() from a call to
PetscClassIdRegister():<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<div>
<p class="MsoNormal">PetscErrorCode PetscLogGetStageLog(PetscStageLog *stageLog)<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal">{<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> PetscFunctionBegin;<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> PetscValidPointer(stageLog,1);<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> if (!petsc_stageLog) {<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> fprintf(stderr, "PETSC ERROR: Logging has not been enabled.\nYou might have forgotten to call PetscInitialize().\n");<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> PETSCABORT(MPI_COMM_WORLD, PETSC_ERR_SUP); // Here<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> }<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> ...<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal">But then jumps to PetscFinalize(). You can also see the "You might have forgotten to call PetscInitialize()." message in the error message, just under the topmost level of the stack trace.<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal">Can you check the value of ierr after each function call (use the CHKERRA() macro to do so)? I suspect the problem here is that errors occurring earlier in the program are being ignored, leading to the garbled stack trace.<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">Best regards,<br>
<br>
Jacob Faibussowitsch<br>
(Jacob Fai - booss - oh - vitch)<o:p></o:p></p>
</div>
</div>
</div>
</div>
</div>
<div>
<div>
<p class="MsoNormal"><br>
<br>
<br>
<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal">On Jan 14, 2022, at 20:58, Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com">junchao.zhang@gmail.com</a>> wrote:<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">Jacob, <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> Could you have a look as it seems the "invalid device context" is in your newly added module?<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal"> Thanks<o:p></o:p></p>
</div>
</div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">--Junchao Zhang<o:p></o:p></p>
</div>
</div>
</div>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<div>
<div>
<p class="MsoNormal">On Fri, Jan 14, 2022 at 12:49 AM Hao DONG <<a href="mailto:dong-hao@outlook.com">dong-hao@outlook.com</a>> wrote:<o:p></o:p></p>
</div>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">Dear All, <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:40.8pt"> </p>
</div>
<div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">I have encountered a peculiar problem when fiddling with a code with PETSc 3.16.3 (which worked fine with PETSc 3.15). It is a very straightforward PDE-based optimization code which repeatedly solves a linearized PDE problem with KSP in
a subroutine (the rest of the code does not contain any PETSc-related content). The main program provides the subroutine with an MPI communicator. I then set that communicator as PETSC_COMM_WORLD to tell PETSc to attach to it (and detach from it when the solve is finished
each time).<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal" style="margin-left:40.8pt"> </p>
</div>
<div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">Strangely, I observe a CUDA failure whenever PetscFinalize is called for a *second* time. In other words, the first and second PDE calculations on the GPU are fine (with correct solutions). The PETSc code only fails after the SECOND
PetscFinalize call. You can also see the PETSc configuration in the error message:<span class="apple-converted-space"> </span><o:p></o:p></p>
</div>
<p class="MsoNormal" style="margin-left:40.8pt"> </p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1]PETSC ERROR: GPU error<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1]PETSC ERROR: cuda error 201 (cudaErrorDeviceUninitialized) : invalid device context<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1]PETSC ERROR: See<span class="apple-converted-space"> </span><a href="https://petsc.org/release/faq/" target="_blank">https://petsc.org/release/faq/</a><span class="apple-converted-space"> </span>for trouble shooting.<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1]PETSC ERROR: Petsc Release Version 3.16.3, unknown<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1]PETSC ERROR: maxwell.gpu on a named stratosphere by hao Fri Jan 14 10:21:05 2022<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1]PETSC ERROR: Configure options --prefix=/opt/petsc/complex-double-with-cuda --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 COPTFLAGS="-O3 -mavx2" CXXOPTFLAGS="-O3 -mavx2" FOPTFLAGS="-O3 -ffree-line-length-none -mavx2" CUDAOPTFLAGS=-O3
--with-cxx-dialect=cxx14 --with-cuda-dialect=cxx14 --with-scalar-type=complex --with-precision=double --with-cuda-dir=/usr/local/cuda --with-debugging=1<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1]PETSC ERROR: #1 PetscFinalize() at /home/hao/packages/petsc-current/src/sys/objects/pinit.c:1638<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">You might have forgotten to call PetscInitialize().<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">The EXACT line numbers in the error traceback are not available.<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">Instead the line number of the start of the function is given.<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1] #1 PetscAbortFindSourceFile_Private() at /home/hao/packages/petsc-current/src/sys/error/err.c:35<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1] #2 PetscLogGetStageLog() at /home/hao/packages/petsc-current/src/sys/logging/utils/stagelog.c:29<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1] #3 PetscClassIdRegister() at /home/hao/packages/petsc-current/src/sys/logging/plog.c:2376<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1] #4 MatMFFDInitializePackage() at /home/hao/packages/petsc-current/src/mat/impls/mffd/mffd.c:45<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1] #5 MatInitializePackage() at /home/hao/packages/petsc-current/src/mat/interface/dlregismat.c:163<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">[1] #6 MatCreate() at /home/hao/packages/petsc-current/src/mat/utils/gcreate.c:77<o:p></o:p></p>
</div>
<p class="MsoNormal" style="margin-left:40.8pt"> </p>
<div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">However, it doesn’t seem to affect the rest of my code, so the code can continue running until it reaches the PETSc part again (the *<b>third</b>* time). Unfortunately, I get no further information even with
debugging enabled in the configure options. It is also worth noting that PETSc without CUDA (i.e. with plain MATMPIAIJ) works perfectly fine. <o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal" style="margin-left:40.8pt"> </p>
</div>
<div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">I am able to reproduce the problem with a toy code modified from ex11f. Please see the attached file (ex11fc.F90) for details. Essentially the code does the same thing as ex11f, but three times in a do loop. To do that I added an extra
MPI_INIT/MPI_FINALIZE pair to ensure that the MPI communicator is not destroyed when PetscFinalize is called. I used the PetscOptionsHasName utility to check whether “-usecuda” is among the options, so running the code with and without that option gives
a comparison with and without CUDA. I can see that the code also fails after the second loop of the KSP operation. Could you kindly shed some light on this problem?<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal" style="margin-left:40.8pt"> </p>
</div>
<div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">I should say that I am not even sure if the problem is from PETSc, as I also accidentally updated the NVIDIA driver (it is now 510.06 with CUDA 11.6). And it is well known that NVIDIA can give you some surprises in its updates (yes,
I know I shouldn’t have touched it if it wasn’t broken). But my CUDA code without PETSc (which basically does the same PDE thing, but with cuSPARSE/cuBLAS directly) seems to work just fine after the update. It is also possible that my PETSc code related to
CUDA is not quite “legitimate” – I just use:<span class="apple-converted-space"> </span><o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal"> MatSetType(A, MATMPIAIJCUSPARSE, ierr)<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">and<span class="apple-converted-space"> </span><o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal"> MatCreateVecs(A, u, PETSC_NULL_VEC, ierr)<o:p></o:p></p>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">to move the data onto the GPU. I would very much appreciate it if you could show me the “right” way to do that. <o:p></o:p></p>
</div>
<p class="MsoNormal" style="margin-left:40.8pt"> </p>
<div style="margin-left:40.8pt">
<p class="MsoNormal">Thanks a lot in advance, and all the best,<o:p></o:p></p>
</div>
</div>
<div style="margin-left:40.8pt">
<p class="MsoNormal">Hao<o:p></o:p></p>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>