[petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

Zhang, Hong hongzhang at anl.gov
Fri Feb 7 14:26:09 CST 2020

A statically linked executable works fine. The dynamic linker is probably broken.


On Feb 7, 2020, at 12:53 PM, Matthew Knepley <knepley at gmail.com> wrote:

On Fri, Feb 7, 2020 at 1:23 PM Zhang, Hong via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
Hi all,

I noticed earlier that the first call to a CUDA function such as cudaMalloc or cudaFree in PETSc takes a long time (7.5 seconds) on Summit. I then prepared a simple example (attached) to help OLCF reproduce the problem. It turned out that the problem is caused by PETSc: the 7.5-second overhead is observed only when the PETSc library is linked. If I do not link PETSc, the example runs normally. Does anyone have any idea why this happens and how to fix it?

Hong, this sounds like a screwed up dynamic linker. Can you try this with a statically linked executable?



Hong (Mr.)

bash-4.2$ cat ex_simple.c
#include <time.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc,char **args)
{
  clock_t start,s1,s2,s3;
  double  cputime;
  double   *init,tmp[100] = {0};

  start = clock();
  cudaFree(0);   /* first CUDA call in the program */
  s1 = clock();
  cudaMalloc((void **)&init,100*sizeof(double));
  s2 = clock();
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);   /* host-to-device copy of tmp */
  s3 = clock();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);

  return 0;
}
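
A side note on the timing itself: clock() reports processor time used by the process, not elapsed wall-clock time, so any time spent blocked would not show up in the numbers above. The sketch below is one possible way to record both for a single call. It assumes a POSIX system with clock_gettime; the names wall_seconds, cpu_seconds, Elapsed, time_call, print_elapsed and sleep_50ms are hypothetical helpers made up for this illustration, and the sleeping workload is only a stand-in for the call of interest.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: wall-clock seconds read from a monotonic clock. */
static double wall_seconds(void)
{
  struct timespec ts;
  int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
  assert(rc == 0);
  (void)rc;
  return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: CPU seconds consumed by this process so far. */
static double cpu_seconds(void)
{
  return (double)clock() / CLOCKS_PER_SEC;
}

typedef struct {
  double wall; /* elapsed wall-clock seconds */
  double cpu;  /* CPU seconds used during the call */
} Elapsed;

/* Run fn(arg) once and measure it with both clocks. */
static Elapsed time_call(void (*fn)(void *), void *arg)
{
  Elapsed e;
  double w0 = wall_seconds();
  double c0 = cpu_seconds();
  fn(arg);
  e.wall = wall_seconds() - w0;
  e.cpu  = cpu_seconds() - c0;
  return e;
}

/* Print one measurement with a label. */
static void print_elapsed(const char *label, Elapsed e)
{
  printf("%s: wall =%f cpu =%f\n", label, e.wall, e.cpu);
}

/* Stand-in workload that only waits for about 50 ms. In the real test this
   would be replaced by a small wrapper around the first CUDA call. */
static void sleep_50ms(void *unused)
{
  struct timespec req = {0, 50 * 1000 * 1000};
  (void)unused;
  nanosleep(&req, NULL);
}
```

Calling print_elapsed("first call", time_call(sleep_50ms, NULL)) from main prints one line per measurement. A wall time much larger than the CPU time would suggest the call is mostly waiting rather than computing; roughly equal values would suggest the time is spent on the CPU.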

What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

