[petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

Smith, Barry F. bsmith at mcs.anl.gov
Fri Feb 7 14:31:41 CST 2020

  Run ldd on the executable from both linkings of your code and compare the output.

  My guess is that without PETSc it links the static versions of the needed libraries, and with PETSc the shared versions. And, in typical fashion, the shared libraries sit on some very slow file system, so they take a long time to be loaded and linked in on demand.

  That would still be a performance bug on Summit.


> On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> Hi all,
> I noticed previously that the first call to a CUDA function such as cudaMalloc or cudaFree in PETSc takes a long time (7.5 seconds) on Summit. I then prepared a simple example (attached) to help OLCF reproduce the problem. It turned out that the problem is caused by PETSc: the 7.5-second overhead is observed only when the PETSc library is linked. If I do not link PETSc, the example runs normally. Does anyone have any idea why this happens and how to fix it?
> Hong (Mr.)
> bash-4.2$ cat ex_simple.c
> #include <time.h>
> #include <cuda_runtime.h>
> #include <stdio.h>
> int main(int argc,char **args)
> {
>  clock_t start,s1,s2,s3;
>  double  cputime;
>  double   *init,tmp[100] = {0};
>  start = clock();
>  cudaFree(0);
>  s1 = clock();
>  cudaMalloc((void **)&init,100*sizeof(double));
>  s2 = clock();
>  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>  s3 = clock();
>  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);
>  return 0;
> }
