<div dir="ltr">cuda-memcheck is a valgrind clone, but like valgrind it does not report usage as it goes. Just in a report at the end.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 7, 2022 at 10:23 PM Barry Smith <<a href="mailto:bsmith@petsc.dev">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div><br></div>  Doesn't Nvidia supply a "valgrind" like tool that will allow tracking memory usage? I'm pretty sure I've seen one; it should be able to show memory usage as a function of time so you can see where the memory is being allocated<div>  </div><div>  Barry</div><div><br><div><br><blockquote type="cite"><div>On Jan 7, 2022, at 1:56 PM, Jacob Faibussowitsch <<a href="mailto:jacob.fai@gmail.com" target="_blank">jacob.fai@gmail.com</a>> wrote:</div><br><div><div style="overflow-wrap: break-word;"><blockquote type="cite"><div style="overflow-wrap: break-word;">it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists across the entire running time of an application. cupm_initialize contributes 0.36GB out of 0.73GB.</div></blockquote><div><br></div>If I had to guess this may be the latent overhead of CUDA streams and events, but even then 360 MB seems ludicrous. CUDA maintains a persistent pool of streams that is not freed until cudaDeviceReset() is called. <span>Maybe they initialize this pool immediately on start-up of the context? </span>AFAIK there is no way to disable or modify this behavior.<div><br><div>

<div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><div>Best regards,<br><br>Jacob Faibussowitsch<br>(Jacob Fai - booss - oh - vitch)<br></div></div></div>

</div>

<div><br><blockquote type="cite"><div>On Jan 7, 2022, at 13:23, Zhang, Hong <<a href="mailto:hongzhang@anl.gov" target="_blank">hongzhang@anl.gov</a>> wrote:</div><br><div>

<div style="overflow-wrap: break-word;">

Apart from the 1.2GB caused by importing torch, it seems that PETSc consumes 0.73GB of CUDA memory, and this overhead persists for the entire running time of an application. cupm_initialize contributes 0.36GB of the 0.73GB. It is still unclear what takes the remaining 0.37GB.

<div><br>

</div>

<div>The torch issue is really a mystery. If I import torch only and do some tensor operations on GPU, it consumes only 0.004GB CUDA memory.    </div>

<div><br>

<div>

<div>

<div>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

<div><br>

</div>

<div>1. Commenting out  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr); in device/impls/cupm/cupmcontext.hpp:L199</div>

<div><br>

</div>

<div>CUDA memory: 1.575GB</div>

<div>CUDA memory without importing torch:  0.370GB</div>

<div><br>

</div>

<div>This has the same effect as commenting out L437-L440 in interface/device.cxx </div>

<div><br>

</div>

<div>2. Comment out these two: </div>

<div>. <span>src/sys/objects/device/impls/cupm/cupmdevice.cxx:</span>327 [<span>ierr = _devices[_defaultDevice]->configure();CHKERRQ(ierr);]</span></div>

<div>. <span>src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [</span>ierr = _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]</div>

<div>

<div><br>

</div>

<div>

<div>CUDA memory: 1.936GB</div>

<div>CUDA memory without importing torch:   0.730GB</div>

</div>
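<div>The attribution implied by these measurements can be worked out as follows. This is only a sketch of the arithmetic: the function name and the dictionary labels are illustrative and are not names from PETSc, and it assumes the second experiment (1.936GB / 0.730GB) is equivalent to the unmodified baseline, since it matches the figures in the original report.</div>
<pre>
```python
def attribute_memory(baseline, baseline_no_torch, patched, patched_no_torch):
    """Split the reported CUDA memory figures (in GB) into rough contributions.

    baseline*: usage with the cupm __initialize call left in place.
    patched*:  usage with the cupm __initialize call commented out.
    """
    return {
        'torch (baseline)': round(baseline - baseline_no_torch, 3),
        'torch (patched)': round(patched - patched_no_torch, 3),
        'cupm __initialize': round(baseline_no_torch - patched_no_torch, 3),
        'unexplained': round(patched_no_torch, 3),
    }


# Figures quoted in this thread, in GB
parts = attribute_memory(1.936, 0.730, 1.575, 0.370)
for name, gb in parts.items():
    print('%-20s %.3fGB' % (name, gb))
```
</pre>
<div>Both torch rows come out at roughly 1.2GB, which suggests the torch overhead is independent of the cupm initialization.</div>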

</div>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch <<a href="mailto:jacob.fai@gmail.com" target="_blank">jacob.fai@gmail.com</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

<blockquote type="cite">

<div style="overflow-wrap: break-word;">

They had no influence on the memory usage. </div>

</blockquote>

???????????????????????????????????????????????????????????????????????

<div><br>

</div>

<div>Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line 360 in cupmdevice.cxx as well.</div>

<div><br>

<div>

<div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">

<div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">

<div>Best regards,<br>

<br>

Jacob Faibussowitsch<br>

(Jacob Fai - booss - oh - vitch)<br>

</div>

</div>

</div>

</div>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 12:18, Zhang, Hong <<a href="mailto:hongzhang@anl.gov" target="_blank">hongzhang@anl.gov</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

I have tried all of these. They had no influence on the memory usage. 

<div>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch <<a href="mailto:jacob.fai@gmail.com" target="_blank">jacob.fai@gmail.com</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

<blockquote type="cite">

<div style="overflow-wrap: break-word;">

<div>Initializing cuBLAS and cuSOLVER does not affect the memory usage. I did the following to turn them off:</div>

</div>

</blockquote>

<div><br>

</div>

Ok next things to try out in order:

<div><br>

</div>

<div>1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178 [PetscFunctionBegin;] </div>

<div>Put a PetscFunctionReturn(0); right after this</div>

<div><br>

</div>

<div>2. <span>src/sys/objects/device/impls/cupm/cupmdevice.cxx:</span>327 [<span>ierr = _devices[_defaultDevice]->configure();CHKERRQ(ierr);]</span></div>

<div><font><span>Comment this out</span></font></div>

<div>

<div><br>

</div>

<div>3. <span>src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [</span>ierr = _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]<br>

Comment this out</div>

<div><br>

</div>

</div>

<div>

<div>

<div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">

<div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">

<div>Best regards,<br>

<br>

Jacob Faibussowitsch<br>

(Jacob Fai - booss - oh - vitch)<br>

</div>

</div>

</div>

</div>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 12:02, Zhang, Hong <<a href="mailto:hongzhang@anl.gov" target="_blank">hongzhang@anl.gov</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

<div>Initializing cuBLAS and cuSOLVER does not affect the memory usage. I did the following to turn them off:</div>

<div><br>

</div>

<div>diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp b/src/sys/objects/device/impls/cupm/cupmcontext.hpp</div>

<div>index 51fed809e4d..9a5f068323a 100644</div>

<div>--- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp</div>

<div>+++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp</div>

<div>@@ -199,7 +199,7 @@ inline PetscErrorCode CUPMContext<T>::setUp(PetscDeviceContext dctx) noexcept</div>

<div> #if PetscDefined(USE_DEBUG)</div>

<div>   dci->timerInUse = PETSC_FALSE;</div>

<div> #endif</div>

<div>-  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);</div>

<div>+  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);</div>

<div>   PetscFunctionReturn(0);</div>

<div> }</div>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 10:53 AM, Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

<div><br>

</div>

  I don't think this is right. We want the device initialized by PETSc; we just don't want the cuBLAS and cuSOLVER stuff initialized, so that we can see how much memory initializing the BLAS and solver libraries takes.

<div><br>

</div>

<div>  So I think you need to comment out things in cupminterface.hpp like cublasCreate and cusolverDnCreate.</div>

<div><br>

</div>

<div>  Urgh, I hate C++ where huge chunks of real code are in header files.</div>

<div><br>

</div>

<div><br>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch <<a href="mailto:jacob.fai@gmail.com" target="_blank">jacob.fai@gmail.com</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

Hit send too early…

<div><br>

</div>

<div>If you don’t want to comment anything out, you can also run with the "-device_enable lazy" option. Normally this is the default behavior, but if -log_view or -log_summary is provided the default becomes “-device_enable eager”. See src/sys/objects/device/interface/device.cxx:398</div>
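<div>From a petsc4py script, the same option could be appended to the argument list before initialization. A minimal sketch, assuming petsc4py.init takes the argument list the same way the test script in this thread passes sys.argv; with_lazy_device and init_petsc_lazy are hypothetical helper names, and whether an explicit "-device_enable lazy" takes precedence over the -log_view default is taken from the message above, not verified here.</div>
<pre>
```python
def with_lazy_device(argv):
    """Return a copy of argv with '-device_enable lazy' appended.

    An explicit -device_enable already present in argv is left untouched.
    """
    args = list(argv)
    if '-device_enable' not in args:
        args += ['-device_enable', 'lazy']
    return args


def init_petsc_lazy(argv):
    """Initialize PETSc through petsc4py with lazy device initialization requested."""
    import petsc4py  # imported here so the helper above works without petsc4py installed
    petsc4py.init(with_lazy_device(argv))
```
</pre>
<div>In the test script this would replace the petsc4py.init(sys.argv) call with init_petsc_lazy(sys.argv).</div>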

<div><br>

</div>

<div>

<div>

<div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">

<div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">

<div>Best regards,<br>

<br>

Jacob Faibussowitsch<br>

(Jacob Fai - booss - oh - vitch)<br>

</div>

</div>

</div>

</div>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 11:29, Jacob Faibussowitsch <<a href="mailto:jacob.fai@gmail.com" target="_blank">jacob.fai@gmail.com</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

<blockquote type="cite">

<div style="overflow-wrap: break-word;">

<span style="color:rgb(29,28,29);font-family:Slack-Lato,appleLogo,sans-serif;font-size:15px;font-variant-ligatures:common-ligatures;background-color:rgb(248,248,248)">You need to go into the PetscInitialize() routine, find where it loads cuBLAS and cuSOLVER, comment out those lines, and then run with -log_view</span></div>

</blockquote>

<div><br>

</div>

Comment out

<div>

<div><br>

</div>

<div>#if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) || PetscDefined(HAVE_SYCL))</div>

<div>  ierr = PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);</div>

<div>#endif</div>

<div><br>

</div>

<div>At <span>src/sys/objects/pinit.c:956</span></div>

<div><br>

</div>

<div>

<div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">

<div dir="auto" style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">

<div>Best regards,<br>

<br>

Jacob Faibussowitsch<br>

(Jacob Fai - booss - oh - vitch)<br>

</div>

</div>

</div>

</div>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 11:24, Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

<div><br>

</div>

<span style="color:rgb(29,28,29);font-family:Slack-Lato,appleLogo,sans-serif;font-size:15px;font-variant-ligatures:common-ligatures;background-color:rgb(248,248,248)">Without -log_view it does not load any cuBLAS/cuSOLVER immediately; with -log_view it loads all that stuff at startup. You need to go into the PetscInitialize() routine, find where it loads cuBLAS and cuSOLVER, comment out those lines, and then run with -log_view</span>

<div>

<div><font color="#1d1c1d" face="Slack-Lato, appleLogo, sans-serif"><span style="font-size:15px;background-color:rgb(248,248,248)"><br>

</span></font></div>

<div><br>

<blockquote type="cite">

<div>On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>> wrote:</div>

<br>

<div>

<div style="overflow-wrap: break-word;">

<span>When PETSc is initialized, it takes about 2GB of CUDA memory. This is way too much for doing nothing. A test script is attached to reproduce the issue. If I remove the first line "import torch", PETSc consumes about 0.73GB, which is still significant. Does anyone have any idea about this behavior?</span>

<div><br>

</div>

<div>Thanks,</div>

<div>Hong<br>

<div><br>

</div>

<div>

<pre style="box-sizing:inherit;margin-top:4px;margin-bottom:4px;padding:8px;font-size:12px;line-height:1.50001;font-variant-ligatures:none;white-space:pre-wrap;word-break:normal;border-radius:4px;color:rgb(29,28,29);font-family:Monaco,Menlo,Consolas,"Courier New",monospace">hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 0.004GB
hongzhang@gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py -log_view :0.txt
CUDA memory before PETSc 0.000GB
CUDA memory after PETSc 1.936GB</pre>

<div><br>

</div>

</div>

<div>

<pre style="box-sizing:inherit;margin-top:4px;margin-bottom:4px;padding:8px;font-size:12px;line-height:1.50001;font-variant-ligatures:none;white-space:pre-wrap;word-break:normal;border-radius:4px;color:rgb(29,28,29);font-family:Monaco,Menlo,Consolas,"Courier New",monospace">import torch
import sys
import os
import nvidia_smi

# Read the used memory on GPU 0 before PETSc is initialized
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))

# Make the petsc4py build under $PETSC_DIR/$PETSC_ARCH/lib importable
petsc4py_path = os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
sys.path.append(petsc4py_path)
import petsc4py
petsc4py.init(sys.argv)  # forwards command-line options such as -log_view

# Read the used memory on GPU 0 again after PETSc initialization
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))</pre>

<div><br>

</div>

</div>
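<div>The before/after measurement in the script can be wrapped in a small helper so that individual steps (importing torch, initializing PETSc, and so on) are measured the same way. A sketch only: measure_delta_gb and nvml_used_bytes are hypothetical names, and the NVML reading is the device-wide used memory, so other processes running on the same GPU would also be counted.</div>
<pre>
```python
def measure_delta_gb(read_used_bytes, action):
    """Run action() and return the change in used memory, in GB.

    read_used_bytes is any zero-argument callable returning used bytes.
    """
    before = read_used_bytes()
    action()
    after = read_used_bytes()
    return (after - before) / 1e9


def nvml_used_bytes(index=0):
    """Used memory in bytes on GPU `index`, read with the same NVML calls as the script."""
    import nvidia_smi  # imported lazily so measure_delta_gb works without a GPU
    nvidia_smi.nvmlInit()
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(index)
    return nvidia_smi.nvmlDeviceGetMemoryInfo(handle).used


# Intended use on a GPU node (not executed here):
#   import sys, petsc4py
#   delta = measure_delta_gb(nvml_used_bytes, lambda: petsc4py.init(sys.argv))
#   print('PETSc initialization used %.3fGB' % delta)
```
</pre>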

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</div>

</div>

</div></blockquote></div><br></div></div></div></blockquote></div><br></div></div></blockquote></div>