[petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

Matthew Knepley knepley at gmail.com
Wed Feb 12 13:44:11 CST 2020


On Wed, Feb 12, 2020 at 11:38 AM Zhang, Hong <hongzhang at anl.gov> wrote:

>
>
> On Feb 12, 2020, at 11:09 AM, Matthew Knepley <knepley at gmail.com> wrote:
>
> On Wed, Feb 12, 2020 at 11:06 AM Zhang, Hong via petsc-dev <
> petsc-dev at mcs.anl.gov> wrote:
>
>> Sorry for the long post. Here are the replies I have received from OLCF so
>> far. We still don’t know how to solve the problem.
>>
>> One interesting thing Tom noticed is that PetscInitialize() may have
>> called cudaFree(0) 32 times, as nvprof shows, and those calls all run very
>> fast. These calls may be triggered by other libraries such as cuBLAS. But
>> when PETSc calls cudaFree() explicitly, it is always very slow.
>>
>
> It sounds really painful, but I would start removing lines from
> PetscInitialize() until it runs fast.
>
>
> It may be more painful than it sounds. The problem is not really related
> to PetscInitialize(). In the following simple example, we do not call any
> PETSc function, but if we link it against the PETSc shared library,
> cudaFree(0) is still very slow. CUDA is a black box, so there is not much
> we can debug with this simple example.
>

I believe you can execute code on shared library load. I think you said that
PETSc was static, but libcuda was not. Can we 'strace' the run to see
what is happening when libcuda loads?
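
One way to do that is a constructor function in a preloaded shared library.
The following is only an illustrative sketch (it assumes a compiler that
accepts the GCC-style constructor attribute and is not code from this thread);
build it with something like cc -shared -fPIC loadprobe.c -o libloadprobe.so,
run the PETSc binary with LD_PRELOAD=./libloadprobe.so, and optionally run it
under strace -tt -e trace=openat,mmap to see when and from where libcuda and
libpetsc are actually mapped:

/* loadprobe.c: a sketch (not from this thread) that prints a wall-clock
 * timestamp when the dynamic loader runs this library's constructor, so
 * library-load time can be bracketed against the program's own timers. */
#include <stdio.h>
#include <time.h>

__attribute__((constructor))
static void loadprobe_init(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  fprintf(stderr, "[loadprobe] constructor ran at %ld.%09ld s\n",
          (long)ts.tv_sec, (long)ts.tv_nsec);
}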

  Thanks,

    Matt


> bash-4.2$ cat ex_simple.c
> #include <time.h>
> #include <cuda_runtime.h>
> #include <stdio.h>
>
> int main(int argc,char **args)
> {
>   clock_t start,s1,s2,s3;
>   double  cputime;
>   double   *init,tmp[100] = {0};
>
>   start = clock();
>   cudaFree(0);
>   s1 = clock();
>   cudaMalloc((void **)&init,100*sizeof(double));
>   s2 = clock();
>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>   s3 = clock();
>   printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1
> - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double)
> (s3 - s2)) / CLOCKS_PER_SEC);
>   return 0;
> }
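
A side note on the harness itself: clock() measures processor time, so time
the process spends blocked (for example, waiting on the file system while a
shared library is paged in) may be under-reported. A wall-clock variant is a
small change; the sketch below is illustrative only and is not the code that
was actually run:

/* wall_timing.c: a sketch using clock_gettime(CLOCK_MONOTONIC) instead of
 * clock(), so that time spent blocked (e.g. on the file system) is also
 * counted. Link with -lcudart as for the original example. */
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

static double wall(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

int main(void)
{
  double *init, tmp[100] = {0};
  double t0, t1, t2, t3;

  t0 = wall();
  cudaFree(0);                                   /* first CUDA runtime call */
  t1 = wall();
  cudaMalloc((void **)&init, 100 * sizeof(double));
  t2 = wall();
  cudaMemcpy(init, tmp, 100 * sizeof(double), cudaMemcpyHostToDevice);
  t3 = wall();
  printf("free time =%lf malloc time =%lf copy time =%lf\n",
         t1 - t0, t2 - t1, t3 - t2);
  return 0;
}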
>
>
>
>   Thanks,
>
>      Matt
>
>
>> Hong
>>
>>
>> On Wed Feb 12 09:51:33 2020, tpapathe wrote:
>>
>>  Something else I noticed from the nvprof output (see my previous post) is
>>  that the runs with PETSc initialized have 33 calls to cudaFree, whereas
>>  the non-PETSc versions have only 1 call to cudaFree. I'm not sure what is
>>  happening in the PETSc initialize/finalize, but it appears to be doing a
>>  lot under the hood. You can also see many additional CUDA calls that do
>>  not appear in the profiler output from the non-PETSc runs (e.g.,
>>  additional cudaMalloc and cudaMemcpy calls, cudaDeviceSynchronize, etc.).
>>  Which other systems have you tested this on? Which CUDA Toolkits and CUDA
>>  drivers were installed on those systems? Please let me know if there is
>>  any additional information you can share with me about this.
>>
>>  -Tom
>>  On Wed Feb 12 09:25:23 2020, tpapathe wrote:
>>
>>    Ok. Thanks for the additional info, Hong. I'll ask around to see if any
>>    local (PETSc or CUDA) experts have experienced this behavior. In the
>>    meantime, is this impacting your work or something you're just curious
>>    about? A 5-7 second initialization time is indeed unusual, but is it
>>    negligible relative to the overall walltime of your jobs, or is it
>>    somehow affecting your productivity?
>>
>>    -Tom
>>    On Tue Feb 11 17:04:25 2020, hongzhang at anl.gov wrote:
>>
>>      We know it happens with PETSc. But note that the slowdown occurs on
>> the first CUDA function call. In the example I sent to you, if we simply
>> link it against the PETSc shared library and don’t call any PETSc function,
>> the slowdown still happens on cudaFree(0). We have never seen this behavior
>> on other GPU systems.
>>
>> On Feb 11, 2020, at 3:31 PM, Thomas Papatheodore via RT <help at nccs.gov>
>> wrote:
>>
>> Thanks for the update. I have now reproduced the behavior you described
>> with
>> PETSc + CUDA using your example code:
>>
>> [tpapathe at batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun
>> -n1
>> -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>
>> ==16991== NVPROF is profiling process 16991, command:
>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>
>> ==16991== Profiling application:
>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>
>> free time =4.730000 malloc time =0.000000 copy time =0.000000
>>
>> ==16991== Profiling result:
>>
>> Type Time(%) Time Calls Avg Min Max Name
>>
>> GPU activities: 100.00% 9.3760us 6 1.5620us 1.3440us 1.7920us [CUDA memcpy
>> HtoD]
>>
>> API calls: 99.78% 5.99333s 33 181.62ms 883ns 4.71976s cudaFree
>>
>> 0.11% 6.3603ms 379 16.781us 233ns 693.40us cuDeviceGetAttribute
>>
>> 0.07% 4.1453ms 4 1.0363ms 1.0186ms 1.0623ms cuDeviceTotalMem
>>
>> 0.02% 1.0046ms 4 251.15us 131.45us 449.32us cuDeviceGetName
>>
>> 0.01% 808.21us 16 50.513us 6.7080us 621.54us cudaMalloc
>>
>> 0.01% 452.06us 450 1.0040us 830ns 6.4430us cudaFuncSetAttribute
>>
>> 0.00% 104.89us 6 17.481us 13.419us 21.338us cudaMemcpy
>>
>> 0.00% 102.26us 15 6.8170us 6.1900us 10.072us cudaDeviceSynchronize
>>
>> 0.00% 93.635us 80 1.1700us 1.0190us 2.1990us cudaEventCreateWithFlags
>>
>> 0.00% 92.168us 83 1.1100us 951ns 2.3550us cudaEventDestroy
>>
>> 0.00% 52.277us 74 706ns 592ns 1.5640us cudaDeviceGetAttribute
>>
>> 0.00% 34.558us 3 11.519us 9.5410us 15.129us cudaStreamDestroy
>>
>> 0.00% 27.778us 3 9.2590us 4.9120us 17.632us cudaStreamCreateWithFlags
>>
>> 0.00% 11.955us 1 11.955us 11.955us 11.955us cudaSetDevice
>>
>> 0.00% 10.361us 7 1.4800us 809ns 3.6580us cudaGetDevice
>>
>> 0.00% 5.4310us 3 1.8100us 1.6420us 1.9980us cudaEventCreate
>>
>> 0.00% 3.8040us 6 634ns 391ns 1.5350us cuDeviceGetCount
>>
>> 0.00% 3.5350us 1 3.5350us 3.5350us 3.5350us cuDeviceGetPCIBusId
>>
>> 0.00% 3.2210us 3 1.0730us 949ns 1.1640us cuInit
>>
>> 0.00% 2.6780us 5 535ns 369ns 1.0210us cuDeviceGet
>>
>> 0.00% 2.5080us 1 2.5080us 2.5080us 2.5080us cudaSetDeviceFlags
>>
>> 0.00% 1.6800us 4 420ns 392ns 488ns cuDeviceGetUuid
>>
>> 0.00% 1.5720us 3 524ns 398ns 590ns cuDriverGetVersion
>>
>>
>>
>> If I remove all mention of PETSc from the code, compile manually and run,
>> I get
>> the expected behavior:
>>
>> [tpapathe at batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ pgc++
>> -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple.c -o ex_simple
>>
>>
>> [tpapathe at batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun
>> -n1
>> -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>
>> ==17248== NVPROF is profiling process 17248, command:
>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>
>> ==17248== Profiling application:
>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>
>> free time =0.340000 malloc time =0.000000 copy time =0.000000
>>
>> ==17248== Profiling result:
>>
>> Type Time(%) Time Calls Avg Min Max Name
>>
>> GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy
>> HtoD]
>>
>> API calls: 98.56% 231.76ms 1 231.76ms 231.76ms 231.76ms cudaFree
>>
>> 0.67% 1.5764ms 97 16.251us 234ns 652.65us cuDeviceGetAttribute
>>
>> 0.46% 1.0727ms 1 1.0727ms 1.0727ms 1.0727ms cuDeviceTotalMem
>>
>> 0.23% 537.38us 1 537.38us 537.38us 537.38us cudaMalloc
>>
>> 0.07% 172.80us 1 172.80us 172.80us 172.80us cuDeviceGetName
>>
>> 0.01% 21.648us 1 21.648us 21.648us 21.648us cudaMemcpy
>>
>> 0.00% 3.3470us 1 3.3470us 3.3470us 3.3470us cuDeviceGetPCIBusId
>>
>> 0.00% 2.5310us 3 843ns 464ns 1.3700us cuDeviceGetCount
>>
>> 0.00% 1.7260us 2 863ns 490ns 1.2360us cuDeviceGet
>>
>> 0.00% 377ns 1 377ns 377ns 377ns cuDeviceGetUuid
>>
>>
>>
>> I also get the expected behavior if I add an MPI_Init and MPI_Finalize to
>> the
>> code instead of PETSc initialization:
>>
>> [tpapathe at login1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ mpicc
>> -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple_mpi.c -o ex_simple_mpi
>>
>>
>> [tpapathe at batch1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun
>> -n1
>> -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>
>> ==35166== NVPROF is profiling process 35166, command:
>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>
>> ==35166== Profiling application:
>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>
>> free time =0.340000 malloc time =0.000000 copy time =0.000000
>>
>> ==35166== Profiling result:
>>
>> Type Time(%) Time Calls Avg Min Max Name
>>
>> GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy
>> HtoD]
>>
>> API calls: 98.57% 235.61ms 1 235.61ms 235.61ms 235.61ms cudaFree
>>
>> 0.66% 1.5802ms 97 16.290us 239ns 650.72us cuDeviceGetAttribute
>>
>> 0.45% 1.0825ms 1 1.0825ms 1.0825ms 1.0825ms cuDeviceTotalMem
>>
>> 0.23% 542.73us 1 542.73us 542.73us 542.73us cudaMalloc
>>
>> 0.07% 174.77us 1 174.77us 174.77us 174.77us cuDeviceGetName
>>
>> 0.01% 26.431us 1 26.431us 26.431us 26.431us cudaMemcpy
>>
>> 0.00% 4.0330us 1 4.0330us 4.0330us 4.0330us cuDeviceGetPCIBusId
>>
>> 0.00% 2.8560us 3 952ns 528ns 1.6150us cuDeviceGetCount
>>
>> 0.00% 1.6190us 2 809ns 576ns 1.0430us cuDeviceGet
>>
>> 0.00% 341ns 1 341ns 341ns 341ns cuDeviceGetUuid
>>
>>
>> So this appears to be something specific happening within PETSc itself -
>> not
>> necessarily an OLCF issue. I would suggest asking this question within the
>> PETSc community to understand what's happening. Please let me know if you
>> have
>> any additional questions.
>>
>> -Tom
>>
>> On Feb 10, 2020, at 11:14 AM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>
>>
>>  gprof or some similar tool?
>>
>>
>> On Feb 10, 2020, at 11:18 AM, Zhang, Hong via petsc-dev <
>> petsc-dev at mcs.anl.gov> wrote:
>>
>> -cuda_initialize 0 does not make any difference. Actually this issue has
>> nothing to do with PetscInitialize(). I tried to call cudaFree(0) before
>> PetscInitialize(), and it still took 7.5 seconds.
>>
>> Hong
>>
>> On Feb 10, 2020, at 10:44 AM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>>
>> As I mentioned, have you tried -cuda_initialize 0? Also,
>> PetscCUDAInitialize contains
>> ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);
>> ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);
>> Have you tried commenting these out and testing again?
>> --Junchao Zhang
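
If those two calls turn out to be the expensive part, they can also be timed
outside PETSc. The sketch below is only illustrative and assumes the PETSc
handle initializers ultimately come down to cublasCreate()/cusolverDnCreate();
link it with -lcublas -lcusolver -lcudart:

/* handle_timing.c: a sketch (assumption: the PETSc handle initializers wrap
 * these library calls) that times cuBLAS/cuSOLVER handle creation directly. */
#include <stdio.h>
#include <time.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

int main(void)
{
  clock_t t0, t1, t2;
  cublasHandle_t     cublas;
  cusolverDnHandle_t cusolver;

  t0 = clock();
  cublasCreate(&cublas);            /* creates the CUDA context if needed */
  t1 = clock();
  cusolverDnCreate(&cusolver);
  t2 = clock();
  printf("cublasCreate =%lf s  cusolverDnCreate =%lf s\n",
         ((double)(t1 - t0)) / CLOCKS_PER_SEC,
         ((double)(t2 - t1)) / CLOCKS_PER_SEC);
  cusolverDnDestroy(cusolver);
  cublasDestroy(cublas);
  return 0;
}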
>>
>>
>> On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev <
>> petsc-dev at mcs.anl.gov> wrote:
>>
>>
>> On Feb 8, 2020, at 5:03 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev <
>> petsc-dev at mcs.anl.gov> wrote:
>> I did some further investigation. The overhead persists for both the
>> PETSc shared library and the static library. The previous example does not
>> call any PETSc function, yet the first CUDA call becomes very slow when the
>> executable is linked against the PETSc shared library. This indicates that
>> the slowdown occurs when the symbol (cudaFree) is searched through the
>> PETSc shared library, but not when the symbol is found directly in the CUDA
>> runtime library.
>>
>> So the issue has nothing to do with the dynamic linker. The following
>> example can be used to easily reproduce the problem (cudaFree(0) always
>> takes ~7.5 seconds).
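
One way to probe that symbol-search hypothesis directly, without linking
either PETSc or CUDA at build time, is sketched below (the dlopen/dlsym probe,
the soname strings, and the paths are illustrative assumptions, not something
that was run in this thread). Build with cc probe.c -ldl -o probe and compare
./probe against ./probe /path/to/libpetsc.so:

/* probe.c: a sketch that loads libraries at run time and times the first
 * CUDA call, so the effect of pulling libpetsc into the symbol search
 * space can be compared against loading only the CUDA runtime. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
  typedef int (*cudaFree_t)(void *);
  cudaFree_t my_cudaFree;
  clock_t t0, t1;

  /* Optionally pull the library named on the command line (e.g. libpetsc)
   * into the global symbol scope first. */
  if (argc > 1 && !dlopen(argv[1], RTLD_NOW | RTLD_GLOBAL)) {
    fprintf(stderr, "dlopen %s failed: %s\n", argv[1], dlerror());
    return 1;
  }
  /* The soname may need adjusting, e.g. libcudart.so.10.1 on Summit. */
  if (!dlopen("libcudart.so", RTLD_NOW | RTLD_GLOBAL)) {
    fprintf(stderr, "dlopen libcudart failed: %s\n", dlerror());
    return 1;
  }
  my_cudaFree = (cudaFree_t)dlsym(RTLD_DEFAULT, "cudaFree");
  if (!my_cudaFree) { fprintf(stderr, "cudaFree not found\n"); return 1; }

  t0 = clock();
  my_cudaFree(0);                   /* first CUDA runtime call */
  t1 = clock();
  printf("first cudaFree =%lf s\n", ((double)(t1 - t0)) / CLOCKS_PER_SEC);
  return 0;
}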
>>
>> 1) This should go to OLCF admin as Jeff suggests
>>
>>
>> I had sent this to the OLCF admins before the discussion started here.
>> Thomas Papatheodore has followed up. I am trying to help him reproduce the
>> problem on Summit.
>>
>>
>> 2) Just to make sure I understand, a static executable with this code is
>> still slow on the cudaFree(), since CUDA is a shared library by default.
>>
>>
>> I prepared the code as a minimal example to reproduce the problem. It
>> would be fair to say that any code using PETSc (with CUDA enabled, built
>> statically or dynamically) on Summit suffers a 7.5-second overhead on the
>> first CUDA function call (either in the user code or inside PETSc).
>>
>> Thanks,
>> Hong
>>
>>
>> I think we should try:
>>
>>  a) Forcing a full static link, if possible
>>
>>  b) Asking OLCF about link resolution order
>>
>> It sounds similar to something I have seen in the past, where link
>> resolution order can dramatically increase load time.
>>
>>  Thanks,
>>
>>     Matt
>>
>> bash-4.2$ cat ex_simple_petsc.c
>> #include <time.h>
>> #include <cuda_runtime.h>
>> #include <stdio.h>
>> #include <petscmat.h>
>>
>> int main(int argc,char **args)
>> {
>>  clock_t start,s1,s2,s3;
>>  double  cputime;
>>  double  *init,tmp[100] = {0};
>>  PetscErrorCode ierr=0;
>>
>>  ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;
>>  start = clock();
>>  cudaFree(0);
>>  s1 = clock();
>>  cudaMalloc((void **)&init,100*sizeof(double));
>>  s2 = clock();
>>  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>  s3 = clock();
>>  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1
>> - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double)
>> (s3 - s2)) / CLOCKS_PER_SEC);
>>  ierr = PetscFinalize();
>>  return ierr;
>> }
>>
>> Hong
>>
>> On Feb 7, 2020, at 3:09 PM, Zhang, Hong <hongzhang at anl.gov> wrote:
>>
>> Note that the overhead was triggered by the first call to a CUDA
>> function. So it seems that the first CUDA call triggered loading of the
>> PETSc shared library (if it is linked), which is slow on the Summit file
>> system.
>>
>> Hong
>>
>> On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev <
>> petsc-dev at mcs.anl.gov> wrote:
>>
>> Linking any other shared library does not slow down the execution. The
>> PETSc shared library is the only one causing trouble.
>>
>> Here is the ldd output for the two versions. For the first version, I
>> removed -lpetsc and it ran very fast. The second (slow) version was linked
>> against the PETSc shared library.
>>
>> bash-4.2$ ldd ex_simple
>>        linux-vdso64.so.1 =>  (0x0000200000050000)
>>        liblapack.so.0 =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
>> (0x0000200000070000)
>>        libblas.so.0 =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
>> (0x00002000009b0000)
>>        libhdf5hl_fortran.so.100 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
>> (0x0000200000e80000)
>>        libhdf5_fortran.so.100 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
>> (0x0000200000ed0000)
>>        libhdf5_hl.so.100 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
>> (0x0000200000f50000)
>>        libhdf5.so.103 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
>> (0x0000200000fb0000)
>>        libX11.so.6 => /usr/lib64/libX11.so.6 (0x00002000015e0000)
>>        libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10
>> (0x0000200001770000)
>>        libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10
>> (0x0000200009b00000)
>>        libcudart.so.10.1 =>
>> /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x000020000d950000)
>>        libcusparse.so.10 =>
>> /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x000020000d9f0000)
>>        libcusolver.so.10 =>
>> /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200012f50000)
>>        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000020001dc40000)
>>        libdl.so.2 => /usr/lib64/libdl.so.2 (0x000020001ddd0000)
>>        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x000020001de00000)
>>        libmpiprofilesupport.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3
>> (0x000020001de40000)
>>        libmpi_ibm_usempi.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so
>> (0x000020001de70000)
>>        libmpi_ibm_mpifh.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3
>> (0x000020001dea0000)
>>        libmpi_ibm.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3
>> (0x000020001df40000)
>>        libpgf90rtl.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so
>> (0x000020001e0b0000)
>>        libpgf90.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so
>> (0x000020001e0f0000)
>>        libpgf90_rpm1.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so
>> (0x000020001e6a0000)
>>        libpgf902.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so
>> (0x000020001e6d0000)
>>        libpgftnrtl.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so
>> (0x000020001e700000)
>>        libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x000020001e730000)
>>        libpgkomp.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so
>> (0x000020001e760000)
>>        libomp.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so
>> (0x000020001e790000)
>>        libomptarget.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so
>> (0x000020001e880000)
>>        libpgmath.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so
>> (0x000020001e8b0000)
>>        libpgc.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so
>> (0x000020001e9d0000)
>>        librt.so.1 => /usr/lib64/librt.so.1 (0x000020001eb40000)
>>        libm.so.6 => /usr/lib64/libm.so.6 (0x000020001eb70000)
>>        libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x000020001ec60000)
>>        libc.so.6 => /usr/lib64/libc.so.6 (0x000020001eca0000)
>>        libz.so.1 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1
>> (0x000020001ee90000)
>>        libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x000020001eef0000)
>>        /lib64/ld64.so.2 (0x0000200000000000)
>>        libcublasLt.so.10 =>
>> /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x000020001ef40000)
>>        libutil.so.1 => /usr/lib64/libutil.so.1 (0x0000200020e50000)
>>        libhwloc_ompi.so.15 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15
>> (0x0000200020e80000)
>>        libevent-2.1.so.6 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6
>> (0x0000200020ef0000)
>>        libevent_pthreads-2.1.so.6 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6
>> (0x0000200020f70000)
>>        libopen-rte.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3
>> (0x0000200020fa0000)
>>        libopen-pal.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3
>> (0x00002000210b0000)
>>        libXau.so.6 => /usr/lib64/libXau.so.6 (0x00002000211a0000)
>>
>>
>> bash-4.2$ ldd ex_simple_slow
>>        linux-vdso64.so.1 =>  (0x0000200000050000)
>>        libpetsc.so.3.012 =>
>> /autofs/nccs-svm1_home1/hongzh/Projects/petsc/arch-olcf-summit-sell-opt/lib/libpetsc.so.3.012
>> (0x0000200000070000)
>>        liblapack.so.0 =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0
>> (0x0000200002be0000)
>>        libblas.so.0 =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0
>> (0x0000200003520000)
>>        libhdf5hl_fortran.so.100 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100
>> (0x00002000039f0000)
>>        libhdf5_fortran.so.100 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100
>> (0x0000200003a40000)
>>        libhdf5_hl.so.100 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100
>> (0x0000200003ac0000)
>>        libhdf5.so.103 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103
>> (0x0000200003b20000)
>>        libX11.so.6 => /usr/lib64/libX11.so.6 (0x0000200004150000)
>>        libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10
>> (0x00002000042e0000)
>>        libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10
>> (0x000020000c670000)
>>        libcudart.so.10.1 =>
>> /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x00002000104c0000)
>>        libcusparse.so.10 =>
>> /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x0000200010560000)
>>        libcusolver.so.10 =>
>> /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200015ac0000)
>>        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002000207b0000)
>>        libdl.so.2 => /usr/lib64/libdl.so.2 (0x0000200020940000)
>>        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x0000200020970000)
>>        libmpiprofilesupport.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3
>> (0x00002000209b0000)
>>        libmpi_ibm_usempi.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so
>> (0x00002000209e0000)
>>        libmpi_ibm_mpifh.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3
>> (0x0000200020a10000)
>>        libmpi_ibm.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3
>> (0x0000200020ab0000)
>>        libpgf90rtl.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so
>> (0x0000200020c20000)
>>        libpgf90.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so
>> (0x0000200020c60000)
>>        libpgf90_rpm1.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so
>> (0x0000200021210000)
>>        libpgf902.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so
>> (0x0000200021240000)
>>        libpgftnrtl.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so
>> (0x0000200021270000)
>>        libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00002000212a0000)
>>        libpgkomp.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so
>> (0x00002000212d0000)
>>        libomp.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so
>> (0x0000200021300000)
>>        libomptarget.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so
>> (0x00002000213f0000)
>>        libpgmath.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so
>> (0x0000200021420000)
>>        libpgc.so =>
>> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so
>> (0x0000200021540000)
>>        librt.so.1 => /usr/lib64/librt.so.1 (0x00002000216b0000)
>>        libm.so.6 => /usr/lib64/libm.so.6 (0x00002000216e0000)
>>        libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00002000217d0000)
>>        libc.so.6 => /usr/lib64/libc.so.6 (0x0000200021810000)
>>        libz.so.1 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1
>> (0x0000200021a10000)
>>        libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x0000200021a60000)
>>        /lib64/ld64.so.2 (0x0000200000000000)
>>        libcublasLt.so.10 =>
>> /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x0000200021ab0000)
>>        libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002000239c0000)
>>        libhwloc_ompi.so.15 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15
>> (0x00002000239f0000)
>>        libevent-2.1.so.6 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6
>> (0x0000200023a60000)
>>        libevent_pthreads-2.1.so.6 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6
>> (0x0000200023ae0000)
>>        libopen-rte.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3
>> (0x0000200023b10000)
>>        libopen-pal.so.3 =>
>> /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3
>> (0x0000200023c20000)
>>        libXau.so.6 => /usr/lib64/libXau.so.6 (0x0000200023d10000)
>>
>>
>> On Feb 7, 2020, at 2:31 PM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>
>>
>> Run ldd on the executable for both linkings of your code.
>>
>> My guess is that without PETSc it is linking the static versions of the
>> needed libraries, and with PETSc the shared ones. And, in typical fashion,
>> the shared libraries live on some super slow file system, so they take a
>> long time to be loaded and linked in on demand.
>>
>>  Still a performance bug on Summit.
>>
>>  Barry
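
One direct way to test that guess would be to time the load of a single
shared library by itself, independent of CUDA. The sketch below is
illustrative only (not from the thread); point it at the libpetsc path that
ldd reports:

/* dlopen_time.c: a sketch that times loading one shared library with all
 * symbols bound up front.
 * Build: cc dlopen_time.c -ldl -o dlopen_time
 * Run:   ./dlopen_time /path/to/libpetsc.so.3.012 */
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
  struct timespec a, b;
  void *handle;

  if (argc < 2) { fprintf(stderr, "usage: %s <library>\n", argv[0]); return 1; }
  clock_gettime(CLOCK_MONOTONIC, &a);
  handle = dlopen(argv[1], RTLD_NOW);   /* load and relocate eagerly */
  clock_gettime(CLOCK_MONOTONIC, &b);
  if (!handle) { fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }
  printf("dlopen took %f s\n",
         (double)(b.tv_sec - a.tv_sec) + 1e-9 * (double)(b.tv_nsec - a.tv_nsec));
  return 0;
}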
>>
>>
>> On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev <
>> petsc-dev at mcs.anl.gov> wrote:
>>
>> Hi all,
>>
>> Previously I noticed that the first call to a CUDA function such as
>> cudaMalloc or cudaFree in PETSc takes a long time (7.5 seconds) on Summit.
>> I then prepared the attached simple example to help OLCF reproduce the
>> problem. It turned out that the problem was caused by PETSc: the
>> 7.5-second overhead is observed only when the PETSc library is linked. If I
>> do not link PETSc, it runs normally. Does anyone have any idea why this
>> happens and how to fix it?
>>
>> Hong (Mr.)
>>
>> bash-4.2$ cat ex_simple.c
>> #include <time.h>
>> #include <cuda_runtime.h>
>> #include <stdio.h>
>>
>> int main(int argc,char **args)
>> {
>>   clock_t start,s1,s2,s3;
>>   double  cputime;
>>   double   *init,tmp[100] = {0};
>>
>>   start = clock();
>>   cudaFree(0);
>>   s1 = clock();
>>   cudaMalloc((void **)&init,100*sizeof(double));
>>   s2 = clock();
>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>   s3 = clock();
>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 -
>> start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double)
>> (s3 - s2)) / CLOCKS_PER_SEC);
>>
>>   return 0;
>> }

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/