[petsc-dev] PETSc init eats too much CUDA memory

Mark Adams mfadams at lbl.gov
Sat Jan 8 07:37:00 CST 2022


cuda-memcheck is a valgrind clone, but like valgrind it does not report
usage as it goes; it only produces a report at the end.
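
For usage as it goes, one option is to poll NVML from a background thread
while the application runs. This is only a minimal sketch, not an NVIDIA tool:
MemoryTimeline and largest_jump are hypothetical names, and read_used_bytes is
any callable that returns the bytes currently in use (for example one wrapping
the nvmlDeviceGetMemoryInfo(handle).used query from the test script quoted
below). NVML reports device-wide usage, so other processes on the same GPU
would show up in the samples as well.

```python
import threading
import time


class MemoryTimeline:
    """Record (elapsed_seconds, bytes_used) samples on a background thread.

    Hypothetical helper: `read_used_bytes` is supplied by the caller, e.g.
    lambda: nvidia_smi.nvmlDeviceGetMemoryInfo(handle).used
    """

    def __init__(self, read_used_bytes, interval=0.05):
        self._read = read_used_bytes
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples = []

    def _run(self):
        start = time.monotonic()
        while True:
            # Always take a sample before checking whether we were stopped,
            # so even a very short block yields at least one data point.
            self.samples.append((time.monotonic() - start, self._read()))
            if self._stop.wait(self._interval):
                break

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        return False


def largest_jump(samples):
    """Return (time, delta) of the biggest increase between two samples."""
    best = (None, 0)
    for (_, m0), (t1, m1) in zip(samples, samples[1:]):
        if m1 - m0 > best[1]:
            best = (t1, m1 - m0)
    return best
```

Wrapping the PETSc initialization in "with MemoryTimeline(reader) as tl:" and
then calling largest_jump(tl.samples) would give a rough idea of when the big
allocation happens, at the resolution of the polling interval.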

On Fri, Jan 7, 2022 at 10:23 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>   Doesn't NVIDIA supply a valgrind-like tool that will allow tracking
> memory usage? I'm pretty sure I've seen one; it should be able to show
> memory usage as a function of time so you can see where the memory is being
> allocated.
>
>   Barry
>
>
> On Jan 7, 2022, at 1:56 PM, Jacob Faibussowitsch <jacob.fai at gmail.com>
> wrote:
>
> it seems that PETSc consumes 0.73GB CUDA memory and this overhead persists
> across the entire running time of an application. cupm_initialize
> contributes 0.36GB out of 0.73GB.
>
>
> If I had to guess this may be the latent overhead of CUDA streams and
> events, but even then 360 MB seems ludicrous. CUDA maintains a persistent
> pool of streams that is not freed until cudaDeviceReset() is called. Maybe
> they initialize this pool immediately on start-up of the context? AFAIK
> there is no way to disable or modify this behavior.
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 13:23, Zhang, Hong <hongzhang at anl.gov> wrote:
>
> Apart from the 1.2GB caused by importing torch, it seems that PETSc
> consumes 0.73GB CUDA memory and this overhead persists across the entire
> running time of an application. cupm_initialize contributes 0.36GB out of
> 0.73GB. It is still unclear what takes the remaining 0.37GB.
>
> The torch issue is really a mystery. If I import torch only and do some
> tensor operations on GPU, it consumes only 0.004GB CUDA memory.
>
>
> On Jan 7, 2022, at 11:54 AM, Zhang, Hong via petsc-dev <
> petsc-dev at mcs.anl.gov> wrote:
>
>
> 1. Comment out
> ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
> in device/impls/cupm/cupmcontext.hpp:L199
>
> CUDA memory: 1.575GB
> CUDA memory without importing torch:  0.370GB
>
> This has the same effect as commenting out L437-L440 in
> interface/device.cxx
>
> 2. Comment out these two:
> . src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr =
> _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
> . src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr =
> _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
>
> CUDA memory: 1.936GB
> CUDA memory without importing torch:   0.730GB
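
The figures from the two experiments above can be cross-checked with a little
arithmetic. This is a sketch that only uses the numbers reported in this
thread; the dictionary keys and variable names are made up for illustration.

```python
# Values in GB, copied from the two experiments reported above.
measurements_gb = {
    "experiment 1": {"with_torch": 1.575, "without_torch": 0.370},
    "experiment 2": {"with_torch": 1.936, "without_torch": 0.730},
}

# Extra memory attributable to "import torch" in each experiment.
torch_overhead_gb = {
    name: round(m["with_torch"] - m["without_torch"], 3)
    for name, m in measurements_gb.items()
}

# Memory saved by skipping __initialize (experiment 1) relative to the
# run where the usage stayed at 0.730GB (experiment 2).
initialize_cost_gb = round(
    measurements_gb["experiment 2"]["without_torch"]
    - measurements_gb["experiment 1"]["without_torch"],
    3,
)

for name, overhead in torch_overhead_gb.items():
    print(f"{name}: torch overhead {overhead:.3f}GB")
print(f"__initialize cost {initialize_cost_gb:.3f}GB")
```

The torch overhead comes out at roughly 1.2GB in both experiments, and the
difference between the two runs without torch is the 0.36GB attributed to
__initialize.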
>
> On Jan 7, 2022, at 11:21 AM, Jacob Faibussowitsch <jacob.fai at gmail.com>
> wrote:
>
> They had no influence on the memory usage.
>
> ???????????????????????????????????????????????????????????????????????
>
> Comment out the ierr = _devices[id]->initialize();CHKERRQ(ierr); on line
> 360 in cupmdevice.cxx as well.
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 12:18, Zhang, Hong <hongzhang at anl.gov> wrote:
>
> I have tried all of these. They had no influence on the memory usage.
>
> On Jan 7, 2022, at 11:15 AM, Jacob Faibussowitsch <jacob.fai at gmail.com>
> wrote:
>
> Initializing cublas and cusolver does not affect the memory usage. I did
> the following to turn them off:
>
>
> Ok next things to try out in order:
>
> 1. src/sys/objects/device/impls/cupm/cupmcontext.hpp:178
> [PetscFunctionBegin;]
> Put a PetscFunctionReturn(0); right after this
>
> 2. src/sys/objects/device/impls/cupm/cupmdevice.cxx:327 [ierr =
> _devices[_defaultDevice]->configure();CHKERRQ(ierr);]
> Comment this out
>
> 3. src/sys/objects/device/impls/cupm/cupmdevice.cxx:326 [ierr =
> _devices[_defaultDevice]->initialize();CHKERRQ(ierr);]
> Comment this out
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 12:02, Zhang, Hong <hongzhang at anl.gov> wrote:
>
> Initializing cublas and cusolver does not affect the memory usage. I did
> the following to turn them off:
>
> diff --git a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
> b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
> index 51fed809e4d..9a5f068323a 100644
> --- a/src/sys/objects/device/impls/cupm/cupmcontext.hpp
> +++ b/src/sys/objects/device/impls/cupm/cupmcontext.hpp
> @@ -199,7 +199,7 @@ inline PetscErrorCode
> CUPMContext<T>::setUp(PetscDeviceContext dctx) noexcept
>  #if PetscDefined(USE_DEBUG)
>    dci->timerInUse = PETSC_FALSE;
>  #endif
> -  ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
> +  //ierr = __initialize(dctx->device->deviceId,dci);CHKERRQ(ierr);
>    PetscFunctionReturn(0);
>  }
>
> On Jan 7, 2022, at 10:53 AM, Barry Smith <bsmith at petsc.dev> wrote:
>
>
>   I don't think this is right. We want the device initialized by PETSc;
> we just don't want the cublas and cusolver stuff initialized, so that we
> can see how much memory initializing the BLAS and solvers takes.
>
>   So I think you need to comment out things in cupminterface.hpp
> like cublasCreate and cusolverDnCreate.
>
>   Urgh, I hate C++ where huge chunks of real code are in header files.
>
>
>
> On Jan 7, 2022, at 11:34 AM, Jacob Faibussowitsch <jacob.fai at gmail.com>
> wrote:
>
> Hit send too early…
>
> If you don't want to comment anything out, you can also run with the
> "-device_enable lazy" option. Normally this is the default behavior, but if
> -log_view or -log_summary is provided it defaults to "-device_enable eager".
> See src/sys/objects/device/interface/device.cxx:398
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 11:29, Jacob Faibussowitsch <jacob.fai at gmail.com>
> wrote:
>
> You need to go into the PetscInitialize() routine, find where it loads
> cublas and cusolver, comment out those lines, and then run with -log_view
>
>
> Comment out
>
> #if (PetscDefined(HAVE_CUDA) || PetscDefined(HAVE_HIP) ||
> PetscDefined(HAVE_SYCL))
>   ierr =
> PetscDeviceInitializeFromOptions_Internal(PETSC_COMM_WORLD);CHKERRQ(ierr);
> #endif
>
> At src/sys/objects/pinit.c:956
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On Jan 7, 2022, at 11:24, Barry Smith <bsmith at petsc.dev> wrote:
>
>
> Without -log_view it does not load any cuBLAS/cuSOLVER immediately; with
> -log_view it loads all that stuff at startup. You need to go into the
> PetscInitialize() routine, find where it loads cublas and cusolver, comment
> out those lines, and then run with -log_view.
>
>
> On Jan 7, 2022, at 11:14 AM, Zhang, Hong via petsc-dev <
> petsc-dev at mcs.anl.gov> wrote:
>
> When PETSc is initialized, it takes about 2GB CUDA memory. This is way too
> much for doing nothing. A test script is attached to reproduce the issue.
> If I remove the first line "import torch", PETSc consumes about 0.73GB,
> which is still significant. Does anyone have any idea about this behavior?
>
> Thanks,
> Hong
>
> hongzhang at gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py
> CUDA memory before PETSc 0.000GB
> CUDA memory after PETSc 0.004GB
> hongzhang at gpu02:/gpfs/jlse-fs0/users/hongzhang/Projects/pnode/examples (caidao22/update-examples)$ python3 test.py -log_view :0.txt
> CUDA memory before PETSc 0.000GB
> CUDA memory after PETSc 1.936GB
>
>
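> import torch  # removing this import drops the reported usage to about 0.73GB
> import sys
> import os
>
> import nvidia_smi
>
> # Query NVML for the memory in use on GPU 0 before PETSc is touched
> nvidia_smi.nvmlInit()
> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
> print('CUDA memory before PETSc %.3fGB' % (info.used/1e9))
>
> # Make the petsc4py build under $PETSC_DIR/$PETSC_ARCH/lib importable
> petsc4py_path = os.path.join(os.environ['PETSC_DIR'],os.environ['PETSC_ARCH'],'lib')
> sys.path.append(petsc4py_path)
> import petsc4py
> petsc4py.init(sys.argv)  # forwards command-line options such as -log_view
>
> # Query NVML again after petsc4py.init
> handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
> info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
> print('CUDA memory after PETSc %.3fGB' % (info.used/1e9))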
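
To attribute memory to individual steps (importing torch, initializing PETSc)
rather than to the script as a whole, the before/after pattern in the script
above could be wrapped in a small context manager. This is only a sketch:
report_memory_delta is a hypothetical helper, and read_used_bytes stands for
any callable returning bytes in use, such as a wrapper around the
nvmlDeviceGetMemoryInfo(handle).used query.

```python
from contextlib import contextmanager


@contextmanager
def report_memory_delta(label, read_used_bytes):
    """Print and record how much the reading changes across the block."""
    record = {"label": label, "before": read_used_bytes()}
    try:
        yield record
    finally:
        # Take the second reading even if the block raised, so that a
        # failed step still reports what it allocated.
        record["after"] = read_used_bytes()
        record["delta_gb"] = (record["after"] - record["before"]) / 1e9
        print('%s: %+.3fGB' % (label, record["delta_gb"]))
```

Each suspect step then gets its own "with report_memory_delta(...)" block, and
the per-step deltas can be compared between a run with and without -log_view.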