[petsc-dev] First call to cudaMalloc or cudaFree is very slow on summit

Smith, Barry F. bsmith at mcs.anl.gov
Thu Feb 13 09:39:07 CST 2020


  How are the two programs being compiled and linked? The same way, with one having the PETSc library in the path and the other not? Or does the PETSc one carry a lot of extra flags and libraries while the non-PETSc one is built simply by hand?

   Barry


> On Feb 12, 2020, at 7:29 PM, Zhang, Hong <hongzhang at anl.gov> wrote:
> 
> 
> 
>> On Feb 12, 2020, at 5:11 PM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>> 
>> 
>> Run ldd -o on the PETSc program (static) and the non-PETSc program (static); what are the differences?
> 
> There is no difference in the outputs.
> 
>> 
>> nm -o on both executables | grep cudaFree
> 
> Non petsc program:
> 
> [hongzh at login3.summit tests]$ nm ex_simple | grep cudaFree
> 0000000010000ae0 t 00000017.plt_call.cudaFree@@libcudart.so.10.1
>                 U cudaFree@@libcudart.so.10.1
> 
> Petsc program:
> 
> [hongzh at login3.summit tests]$ nm ex_simple_petsc | grep cudaFree
> 0000000010016550 t 00000017.plt_call.cudaFree@@libcudart.so.10.1
> 0000000010017010 t 00000017.plt_call.cudaFreeHost@@libcudart.so.10.1
> 00000000124c3f48 V _ZGVZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEEEPT_vE8resource
> 00000000124c3f50 V _ZGVZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEPT_vE8resource
> 0000000010726788 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEE11do_allocateEmm
> 00000000107267e8 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEE13do_deallocateENS_10device_ptrIvEEmm
> 0000000010726878 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEED0Ev
> 0000000010726848 W _ZN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEED1Ev
> 0000000010729f78 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE11do_allocateEmm
> 000000001072a218 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEE13do_deallocateES6_mm
> 000000001072a388 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED0Ev
> 000000001072a358 W _ZN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEED1Ev
> 0000000012122300 V _ZTIN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
> 0000000012122370 V _ZTIN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
> 0000000012122410 V _ZTSN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
> 00000000121225f0 V _ZTSN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
> 0000000012120630 V _ZTVN6thrust26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEE
> 00000000121205b0 V _ZTVN6thrust6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEE
> 00000000124c3f30 V _ZZN6thrust2mr19get_global_resourceINS_26device_ptr_memory_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEEEPT_vE8resource
> 00000000124c3f20 V _ZZN6thrust2mr19get_global_resourceINS_6system4cuda6detail20cuda_memory_resourceIXadL10cudaMallocEEXadL8cudaFreeEENS_8cuda_cub7pointerIvEEEEEEPT_vE8resource
>                 U cudaFree@@libcudart.so.10.1
>                 U cudaFreeHost@@libcudart.so.10.1
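> 
> For reference, a minimal sketch (assuming glibc's dlsym/dladdr, linked with -ldl plus the same libraries as the executables above) that asks the dynamic loader which shared object actually provides cudaFree at run time:
> 
> #define _GNU_SOURCE
> #include <dlfcn.h>
> #include <stdio.h>
> 
> int main(void)
> {
>   Dl_info info;
>   /* look the symbol up in global load order, then ask which object it lives in */
>   void *sym = dlsym(RTLD_DEFAULT, "cudaFree");
>   if (sym && dladdr(sym, &info) && info.dli_fname)
>     printf("cudaFree resolved from %s\n", info.dli_fname);
>   else
>     printf("could not locate cudaFree\n");
>   return 0;
> }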
> 
> Hong
> 
>> 
>> 
>> 
>> 
>> 
>>> On Feb 12, 2020, at 1:51 PM, Munson, Todd via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>> 
>>> 
>>> There are some side effects when loading shared libraries, such as initializations of
>>> static variables, etc.  Is something like that happening?
>>> 
>>> Another place to look is the initial runtime library that gets linked (libcrt0, maybe?). I
>>> think some MPI compilers insert their own version.
>>> 
>>> Todd.
>>> 
>>>> On Feb 12, 2020, at 11:38 AM, Zhang, Hong via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Feb 12, 2020, at 11:09 AM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>> 
>>>>> On Wed, Feb 12, 2020 at 11:06 AM Zhang, Hong via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>> Sorry for the long post. Here are the replies I have received from OLCF so far. We still don’t know how to solve the problem.
>>>>> 
>>>>> One interesting thing Tom noticed is that PetscInitialize() may have called cudaFree(0) 32 times, as nvprof shows, and those calls all run very fast. They may be triggered by other libraries such as cuBLAS. But when PETSc calls cudaFree() explicitly, it is always very slow.
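>>>>> 
>>>>> A minimal sketch to check this (assuming the same compile line as ex_simple) simply times cudaFree(0) several times in a row; only the first call should pay the context-initialization cost:
>>>>> 
>>>>> #include <time.h>
>>>>> #include <stdio.h>
>>>>> #include <cuda_runtime.h>
>>>>> 
>>>>> int main(void)
>>>>> {
>>>>>   clock_t t0,t1;
>>>>>   int i;
>>>>>   for (i = 0; i < 5; i++) {
>>>>>     t0 = clock();
>>>>>     cudaFree(0);  /* only the first call should be expensive */
>>>>>     t1 = clock();
>>>>>     printf("cudaFree(0) call %d: %lf s\n",i,((double)(t1-t0))/CLOCKS_PER_SEC);
>>>>>   }
>>>>>   return 0;
>>>>> }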
>>>>> 
>>>>> It sounds really painful, but I would start removing lines from PetscInitialize() until it runs fast.
>>>> 
>>>> It may be more painful than it sounds. The problem is not really related to PetscInitialize(). In the following simple example, we do not call any PETSc function, but if we link it against the PETSc shared library, cudaFree(0) is very slow. CUDA is a black box. There is not much we can debug with this simple example.
>>>> 
>>>> bash-4.2$ cat ex_simple.c
>>>> #include <time.h>
>>>> #include <cuda_runtime.h>
>>>> #include <stdio.h>
>>>> 
>>>> int main(int argc,char **args)
>>>> {
>>>>   clock_t start,s1,s2,s3;
>>>>   double  cputime;
>>>>   double  *init,tmp[100] = {0};
>>>> 
>>>>   start = clock();
>>>>   cudaFree(0);  /* first CUDA runtime call; triggers context initialization */
>>>>   s1 = clock();
>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>   s2 = clock();
>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>   s3 = clock();
>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);
>>>>   return 0;
>>>> }
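>>>> 
>>>> One more diagnostic that might help separate the cost of mapping the PETSc library from the cost of the first CUDA call: a sketch (assumptions: glibc, link with -ldl, and the library name below is a placeholder to be replaced by the libpetsc.so path that ldd reports) that only times dlopen() of the library, using a wall-clock timer in case the delay is file-system I/O rather than CPU work.
>>>> 
>>>> #include <dlfcn.h>
>>>> #include <stdio.h>
>>>> #include <time.h>
>>>> 
>>>> static double now(void)
>>>> {
>>>>   struct timespec ts;
>>>>   clock_gettime(CLOCK_MONOTONIC, &ts);
>>>>   return ts.tv_sec + 1e-9*ts.tv_nsec;
>>>> }
>>>> 
>>>> int main(void)
>>>> {
>>>>   double t0 = now();
>>>>   /* placeholder name: use the libpetsc.so.3.012 location shown by ldd */
>>>>   void *h = dlopen("libpetsc.so", RTLD_NOW | RTLD_GLOBAL);
>>>>   double t1 = now();
>>>>   printf("dlopen time =%lf (%s)\n", t1 - t0, h ? "ok" : dlerror());
>>>>   return 0;
>>>> }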
>>>> 
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>>   Matt
>>>>> 
>>>>> Hong
>>>>> 
>>>>> 
>>>>> On Wed Feb 12 09:51:33 2020, tpapathe wrote:
>>>>> 
>>>>> Something else I noticed from the nvprof output (see my previous post) is
>>>>> that the runs with PETSc initialized have 33 calls to cudaFree, whereas the
>>>>> non-PETSc versions have only 1 call to cudaFree. I'm not sure what is
>>>>> happening in the PETSc initialize/finalize, but it appears to be doing a
>>>>> lot under the hood. You can also see there are many additional CUDA calls
>>>>> that do not appear in the profiler output from the non-PETSc runs (e.g.,
>>>>> additional cudaMalloc and cudaMemcpy calls, cudaDeviceSynchronize, etc.).
>>>>> Which other systems have you tested this on? Which CUDA Toolkits and CUDA
>>>>> drivers were installed on those systems? Please let me know if there is any
>>>>> additional information you can share with me about this.
>>>>> 
>>>>> -Tom
>>>>> On Wed Feb 12 09:25:23 2020, tpapathe wrote:
>>>>> 
>>>>> Ok. Thanks for the additional info, Hong. I'll ask around to see if any
>>>>> local (PETSc or CUDA) experts have experienced this behavior. In the
>>>>> meantime, is this impacting your work or something you're just curious
>>>>> about? A 5-7 second initialization time is indeed unusual, but is it
>>>>> negligible relative to the overall walltime of your jobs, or is it
>>>>> somehow affecting your productivity?
>>>>> 
>>>>> -Tom
>>>>> On Tue Feb 11 17:04:25 2020, hongzhang at anl.gov wrote:
>>>>> 
>>>>>   We know it happens with PETSc. But note that the slowdown occurs on the first CUDA function call. In the example I sent you, if we simply link it against the PETSc shared library and don’t call any PETSc function, the slowdown still happens on cudaFree(0). We have never seen this behavior on other GPU systems.
>>>>> 
>>>>> On Feb 11, 2020, at 3:31 PM, Thomas Papatheodore via RT <help at nccs.gov> wrote:
>>>>> 
>>>>> Thanks for the update. I have now reproduced the behavior you described with
>>>>> PETSc + CUDA using your example code:
>>>>> 
>>>>> [tpapathe at batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1
>>>>> -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>>>> 
>>>>> ==16991== NVPROF is profiling process 16991, command:
>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>>>> 
>>>>> ==16991== Profiling application:
>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc
>>>>> 
>>>>> free time =4.730000 malloc time =0.000000 copy time =0.000000
>>>>> 
>>>>> ==16991== Profiling result:
>>>>> 
>>>>> Type Time(%) Time Calls Avg Min Max Name
>>>>> 
>>>>> GPU activities: 100.00% 9.3760us 6 1.5620us 1.3440us 1.7920us [CUDA memcpy HtoD]
>>>>> 
>>>>> API calls: 99.78% 5.99333s 33 181.62ms 883ns 4.71976s cudaFree
>>>>> 
>>>>> 0.11% 6.3603ms 379 16.781us 233ns 693.40us cuDeviceGetAttribute
>>>>> 
>>>>> 0.07% 4.1453ms 4 1.0363ms 1.0186ms 1.0623ms cuDeviceTotalMem
>>>>> 
>>>>> 0.02% 1.0046ms 4 251.15us 131.45us 449.32us cuDeviceGetName
>>>>> 
>>>>> 0.01% 808.21us 16 50.513us 6.7080us 621.54us cudaMalloc
>>>>> 
>>>>> 0.01% 452.06us 450 1.0040us 830ns 6.4430us cudaFuncSetAttribute
>>>>> 
>>>>> 0.00% 104.89us 6 17.481us 13.419us 21.338us cudaMemcpy
>>>>> 
>>>>> 0.00% 102.26us 15 6.8170us 6.1900us 10.072us cudaDeviceSynchronize
>>>>> 
>>>>> 0.00% 93.635us 80 1.1700us 1.0190us 2.1990us cudaEventCreateWithFlags
>>>>> 
>>>>> 0.00% 92.168us 83 1.1100us 951ns 2.3550us cudaEventDestroy
>>>>> 
>>>>> 0.00% 52.277us 74 706ns 592ns 1.5640us cudaDeviceGetAttribute
>>>>> 
>>>>> 0.00% 34.558us 3 11.519us 9.5410us 15.129us cudaStreamDestroy
>>>>> 
>>>>> 0.00% 27.778us 3 9.2590us 4.9120us 17.632us cudaStreamCreateWithFlags
>>>>> 
>>>>> 0.00% 11.955us 1 11.955us 11.955us 11.955us cudaSetDevice
>>>>> 
>>>>> 0.00% 10.361us 7 1.4800us 809ns 3.6580us cudaGetDevice
>>>>> 
>>>>> 0.00% 5.4310us 3 1.8100us 1.6420us 1.9980us cudaEventCreate
>>>>> 
>>>>> 0.00% 3.8040us 6 634ns 391ns 1.5350us cuDeviceGetCount
>>>>> 
>>>>> 0.00% 3.5350us 1 3.5350us 3.5350us 3.5350us cuDeviceGetPCIBusId
>>>>> 
>>>>> 0.00% 3.2210us 3 1.0730us 949ns 1.1640us cuInit
>>>>> 
>>>>> 0.00% 2.6780us 5 535ns 369ns 1.0210us cuDeviceGet
>>>>> 
>>>>> 0.00% 2.5080us 1 2.5080us 2.5080us 2.5080us cudaSetDeviceFlags
>>>>> 
>>>>> 0.00% 1.6800us 4 420ns 392ns 488ns cuDeviceGetUuid
>>>>> 
>>>>> 0.00% 1.5720us 3 524ns 398ns 590ns cuDriverGetVersion
>>>>> 
>>>>> 
>>>>> 
>>>>> If I remove all mention of PETSc from the code, compile manually and run, I get
>>>>> the expected behavior:
>>>>> 
>>>>> [tpapathe at batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ pgc++
>>>>> -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple.c -o ex_simple
>>>>> 
>>>>> 
>>>>> [tpapathe at batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1
>>>>> -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>>>> 
>>>>> ==17248== NVPROF is profiling process 17248, command:
>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>>>> 
>>>>> ==17248== Profiling application:
>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple
>>>>> 
>>>>> free time =0.340000 malloc time =0.000000 copy time =0.000000
>>>>> 
>>>>> ==17248== Profiling result:
>>>>> 
>>>>> Type Time(%) Time Calls Avg Min Max Name
>>>>> 
>>>>> GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy HtoD]
>>>>> 
>>>>> API calls: 98.56% 231.76ms 1 231.76ms 231.76ms 231.76ms cudaFree
>>>>> 
>>>>> 0.67% 1.5764ms 97 16.251us 234ns 652.65us cuDeviceGetAttribute
>>>>> 
>>>>> 0.46% 1.0727ms 1 1.0727ms 1.0727ms 1.0727ms cuDeviceTotalMem
>>>>> 
>>>>> 0.23% 537.38us 1 537.38us 537.38us 537.38us cudaMalloc
>>>>> 
>>>>> 0.07% 172.80us 1 172.80us 172.80us 172.80us cuDeviceGetName
>>>>> 
>>>>> 0.01% 21.648us 1 21.648us 21.648us 21.648us cudaMemcpy
>>>>> 
>>>>> 0.00% 3.3470us 1 3.3470us 3.3470us 3.3470us cuDeviceGetPCIBusId
>>>>> 
>>>>> 0.00% 2.5310us 3 843ns 464ns 1.3700us cuDeviceGetCount
>>>>> 
>>>>> 0.00% 1.7260us 2 863ns 490ns 1.2360us cuDeviceGet
>>>>> 
>>>>> 0.00% 377ns 1 377ns 377ns 377ns cuDeviceGetUuid
>>>>> 
>>>>> 
>>>>> 
>>>>> I also get the expected behavior if I add an MPI_Init and MPI_Finalize to the
>>>>> code instead of PETSc initialization:
>>>>> 
>>>>> [tpapathe at login1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ mpicc
>>>>> -L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple_mpi.c -o ex_simple_mpi
>>>>> 
>>>>> 
>>>>> [tpapathe at batch1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1
>>>>> -a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof
>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>>>> 
>>>>> ==35166== NVPROF is profiling process 35166, command:
>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>>>> 
>>>>> ==35166== Profiling application:
>>>>> /gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi
>>>>> 
>>>>> free time =0.340000 malloc time =0.000000 copy time =0.000000
>>>>> 
>>>>> ==35166== Profiling result:
>>>>> 
>>>>> Type Time(%) Time Calls Avg Min Max Name
>>>>> 
>>>>> GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy HtoD]
>>>>> 
>>>>> API calls: 98.57% 235.61ms 1 235.61ms 235.61ms 235.61ms cudaFree
>>>>> 
>>>>> 0.66% 1.5802ms 97 16.290us 239ns 650.72us cuDeviceGetAttribute
>>>>> 
>>>>> 0.45% 1.0825ms 1 1.0825ms 1.0825ms 1.0825ms cuDeviceTotalMem
>>>>> 
>>>>> 0.23% 542.73us 1 542.73us 542.73us 542.73us cudaMalloc
>>>>> 
>>>>> 0.07% 174.77us 1 174.77us 174.77us 174.77us cuDeviceGetName
>>>>> 
>>>>> 0.01% 26.431us 1 26.431us 26.431us 26.431us cudaMemcpy
>>>>> 
>>>>> 0.00% 4.0330us 1 4.0330us 4.0330us 4.0330us cuDeviceGetPCIBusId
>>>>> 
>>>>> 0.00% 2.8560us 3 952ns 528ns 1.6150us cuDeviceGetCount
>>>>> 
>>>>> 0.00% 1.6190us 2 809ns 576ns 1.0430us cuDeviceGet
>>>>> 
>>>>> 0.00% 341ns 1 341ns 341ns 341ns cuDeviceGetUuid
>>>>> 
>>>>> 
>>>>> So this appears to be something specific happening within PETSc itself - not
>>>>> necessarily an OLCF issue. I would suggest asking this question within the
>>>>> PETSc community to understand what's happening. Please let me know if you have
>>>>> any additional questions.
>>>>> 
>>>>> -Tom
>>>>> 
>>>>>> On Feb 10, 2020, at 11:14 AM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>>>>> 
>>>>>> 
>>>>>> gprof or some similar tool?
>>>>>> 
>>>>>> 
>>>>>>> On Feb 10, 2020, at 11:18 AM, Zhang, Hong via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>>>> 
>>>>>>> -cuda_initialize 0 does not make any difference. Actually this issue has nothing to do with PetscInitialize(). I tried to call cudaFree(0) before PetscInitialize(), and it still took 7.5 seconds.
>>>>>>> 
>>>>>>> Hong
>>>>>>> 
>>>>>>>> On Feb 10, 2020, at 10:44 AM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>>>>>>>> 
>>>>>>>> As I mentioned, have you tried -cuda_initialize 0? Also, PetscCUDAInitialize contains
>>>>>>>> ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);
>>>>>>>> ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);
>>>>>>>> Have you tried commenting them out and testing again?
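>>>>>>>> 
>>>>>>>> A minimal standalone sketch along those lines (assuming it is compiled with, e.g., -lcublas -lcusolver -lcudart against the same toolkit as PETSc; the handle-creation calls below are my guess at what those two routines do internally) would time the handle creation by itself:
>>>>>>>> 
>>>>>>>> #include <time.h>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <cublas_v2.h>
>>>>>>>> #include <cusolverDn.h>
>>>>>>>> 
>>>>>>>> int main(void)
>>>>>>>> {
>>>>>>>>   clock_t t0,t1,t2;
>>>>>>>>   cublasHandle_t blas;
>>>>>>>>   cusolverDnHandle_t solver;
>>>>>>>> 
>>>>>>>>   t0 = clock();
>>>>>>>>   cublasCreate(&blas);        /* creates the cuBLAS handle (and the CUDA context if none exists) */
>>>>>>>>   t1 = clock();
>>>>>>>>   cusolverDnCreate(&solver);  /* creates the cuSOLVER dense handle */
>>>>>>>>   t2 = clock();
>>>>>>>>   printf("cublasCreate =%lf cusolverDnCreate =%lf\n",((double)(t1-t0))/CLOCKS_PER_SEC,((double)(t2-t1))/CLOCKS_PER_SEC);
>>>>>>>>   cusolverDnDestroy(solver);
>>>>>>>>   cublasDestroy(blas);
>>>>>>>>   return 0;
>>>>>>>> }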
>>>>>>>> --Junchao Zhang
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Feb 8, 2020, at 5:03 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>>>>>> I did some further investigation. The overhead persists with both the PETSc shared library and the static library. The previous example does not call any PETSc function, yet the first CUDA function becomes very slow when the executable is linked against the PETSc .so. This indicates that the slowdown occurs when the symbol (cudaFree) is resolved through the PETSc .so, but not when the symbol is found directly in the CUDA runtime library.
>>>>>>>>> 
>>>>>>>>> So the issue has nothing to do with the dynamic linker. The following example can be used to easily reproduce the problem (cudaFree(0) always takes ~7.5 seconds).  
>>>>>>>>> 
>>>>>>>>> 1) This should go to OLCF admin as Jeff suggests
>>>>>>>> 
>>>>>>>> I had sent this to the OLCF admins before the discussion started here. Thomas Papatheodore has followed up. I am trying to help him reproduce the problem on Summit.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2) Just to make sure I understand, a static executable with this code is still slow on the cudaFree(), since CUDA is a shared library by default.
>>>>>>>> 
>>>>>>>> I prepared the code as a minimal example to reproduce the problem. It would be fair to say that any code using PETSc (with CUDA enabled, built statically or dynamically) on Summit suffers a 7.5-second overhead on the first CUDA function call (either in the user code or inside PETSc).
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Hong
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I think we should try:
>>>>>>>>> 
>>>>>>>>> a) Forcing a full static link, if possible
>>>>>>>>> 
>>>>>>>>> b) Asking OLCF about link resolution order
>>>>>>>>> 
>>>>>>>>> It sounds like a similar thing I have seen in the past where link resolution order can exponentially increase load time.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>>  Matt
>>>>>>>>> 
>>>>>>>>> bash-4.2$ cat ex_simple_petsc.c
>>>>>>>>> #include <time.h>
>>>>>>>>> #include <cuda_runtime.h>
>>>>>>>>> #include <stdio.h>
>>>>>>>>> #include <petscmat.h>
>>>>>>>>> 
>>>>>>>>> int main(int argc,char **args)
>>>>>>>>> {
>>>>>>>>>   clock_t start,s1,s2,s3;
>>>>>>>>>   double  cputime;
>>>>>>>>>   double  *init,tmp[100] = {0};
>>>>>>>>>   PetscErrorCode ierr=0;
>>>>>>>>> 
>>>>>>>>>   ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;
>>>>>>>>>   start = clock();
>>>>>>>>>   cudaFree(0);
>>>>>>>>>   s1 = clock();
>>>>>>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>>>>>>   s2 = clock();
>>>>>>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>>>>>>   s3 = clock();
>>>>>>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);
>>>>>>>>>   ierr = PetscFinalize();
>>>>>>>>>   return ierr;
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> Hong
>>>>>>>>> 
>>>>>>>>>> On Feb 7, 2020, at 3:09 PM, Zhang, Hong <hongzhang at anl.gov> wrote:
>>>>>>>>>> 
>>>>>>>>>> Note that the overhead was triggered by the first call to a CUDA function. So it seems that the first CUDA function call triggers loading the PETSc shared library (if it is linked), which is slow on the Summit file system.
>>>>>>>>>> 
>>>>>>>>>> Hong
>>>>>>>>>> 
>>>>>>>>>>> On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Linking any other shared library does not slow down the execution. The PETSc shared library is the only one causing trouble.
>>>>>>>>>>> 
>>>>>>>>>>> Here are the ldd outputs for the two versions. For the first version, I removed -lpetsc and it ran very fast. The second (slow) version was linked against the PETSc shared library.
>>>>>>>>>>> 
>>>>>>>>>>> bash-4.2$ ldd ex_simple
>>>>>>>>>>>     linux-vdso64.so.1 =>  (0x0000200000050000)
>>>>>>>>>>>     liblapack.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 (0x0000200000070000)
>>>>>>>>>>>     libblas.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 (0x00002000009b0000)
>>>>>>>>>>>     libhdf5hl_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 (0x0000200000e80000)
>>>>>>>>>>>     libhdf5_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 (0x0000200000ed0000)
>>>>>>>>>>>     libhdf5_hl.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 (0x0000200000f50000)
>>>>>>>>>>>     libhdf5.so.103 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 (0x0000200000fb0000)
>>>>>>>>>>>     libX11.so.6 => /usr/lib64/libX11.so.6 (0x00002000015e0000)
>>>>>>>>>>>     libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 (0x0000200001770000)
>>>>>>>>>>>     libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 (0x0000200009b00000)
>>>>>>>>>>>     libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x000020000d950000)
>>>>>>>>>>>     libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x000020000d9f0000)
>>>>>>>>>>>     libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200012f50000)
>>>>>>>>>>>     libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000020001dc40000)
>>>>>>>>>>>     libdl.so.2 => /usr/lib64/libdl.so.2 (0x000020001ddd0000)
>>>>>>>>>>>     libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x000020001de00000)
>>>>>>>>>>>     libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 (0x000020001de40000)
>>>>>>>>>>>     libmpi_ibm_usempi.so => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so (0x000020001de70000)
>>>>>>>>>>>     libmpi_ibm_mpifh.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 (0x000020001dea0000)
>>>>>>>>>>>     libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 (0x000020001df40000)
>>>>>>>>>>>     libpgf90rtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so (0x000020001e0b0000)
>>>>>>>>>>>     libpgf90.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so (0x000020001e0f0000)
>>>>>>>>>>>     libpgf90_rpm1.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so (0x000020001e6a0000)
>>>>>>>>>>>     libpgf902.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so (0x000020001e6d0000)
>>>>>>>>>>>     libpgftnrtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so (0x000020001e700000)
>>>>>>>>>>>     libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x000020001e730000)
>>>>>>>>>>>     libpgkomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so (0x000020001e760000)
>>>>>>>>>>>     libomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so (0x000020001e790000)
>>>>>>>>>>>     libomptarget.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so (0x000020001e880000)
>>>>>>>>>>>     libpgmath.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so (0x000020001e8b0000)
>>>>>>>>>>>     libpgc.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so (0x000020001e9d0000)
>>>>>>>>>>>     librt.so.1 => /usr/lib64/librt.so.1 (0x000020001eb40000)
>>>>>>>>>>>     libm.so.6 => /usr/lib64/libm.so.6 (0x000020001eb70000)
>>>>>>>>>>>     libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x000020001ec60000)
>>>>>>>>>>>     libc.so.6 => /usr/lib64/libc.so.6 (0x000020001eca0000)
>>>>>>>>>>>     libz.so.1 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 (0x000020001ee90000)
>>>>>>>>>>>     libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x000020001eef0000)
>>>>>>>>>>>     /lib64/ld64.so.2 (0x0000200000000000)
>>>>>>>>>>>     libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x000020001ef40000)
>>>>>>>>>>>     libutil.so.1 => /usr/lib64/libutil.so.1 (0x0000200020e50000)
>>>>>>>>>>>     libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 (0x0000200020e80000)
>>>>>>>>>>>     libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 (0x0000200020ef0000)
>>>>>>>>>>>     libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 (0x0000200020f70000)
>>>>>>>>>>>     libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 (0x0000200020fa0000)
>>>>>>>>>>>     libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 (0x00002000210b0000)
>>>>>>>>>>>     libXau.so.6 => /usr/lib64/libXau.so.6 (0x00002000211a0000)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> bash-4.2$ ldd ex_simple_slow
>>>>>>>>>>>     linux-vdso64.so.1 =>  (0x0000200000050000)
>>>>>>>>>>>     libpetsc.so.3.012 => /autofs/nccs-svm1_home1/hongzh/Projects/petsc/arch-olcf-summit-sell-opt/lib/libpetsc.so.3.012 (0x0000200000070000)
>>>>>>>>>>>     liblapack.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 (0x0000200002be0000)
>>>>>>>>>>>     libblas.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 (0x0000200003520000)
>>>>>>>>>>>     libhdf5hl_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 (0x00002000039f0000)
>>>>>>>>>>>     libhdf5_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 (0x0000200003a40000)
>>>>>>>>>>>     libhdf5_hl.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 (0x0000200003ac0000)
>>>>>>>>>>>     libhdf5.so.103 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 (0x0000200003b20000)
>>>>>>>>>>>     libX11.so.6 => /usr/lib64/libX11.so.6 (0x0000200004150000)
>>>>>>>>>>>     libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 (0x00002000042e0000)
>>>>>>>>>>>     libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 (0x000020000c670000)
>>>>>>>>>>>     libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x00002000104c0000)
>>>>>>>>>>>     libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x0000200010560000)
>>>>>>>>>>>     libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200015ac0000)
>>>>>>>>>>>     libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002000207b0000)
>>>>>>>>>>>     libdl.so.2 => /usr/lib64/libdl.so.2 (0x0000200020940000)
>>>>>>>>>>>     libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x0000200020970000)
>>>>>>>>>>>     libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 (0x00002000209b0000)
>>>>>>>>>>>     libmpi_ibm_usempi.so => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so (0x00002000209e0000)
>>>>>>>>>>>     libmpi_ibm_mpifh.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 (0x0000200020a10000)
>>>>>>>>>>>     libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 (0x0000200020ab0000)
>>>>>>>>>>>     libpgf90rtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so (0x0000200020c20000)
>>>>>>>>>>>     libpgf90.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so (0x0000200020c60000)
>>>>>>>>>>>     libpgf90_rpm1.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so (0x0000200021210000)
>>>>>>>>>>>     libpgf902.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so (0x0000200021240000)
>>>>>>>>>>>     libpgftnrtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so (0x0000200021270000)
>>>>>>>>>>>     libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00002000212a0000)
>>>>>>>>>>>     libpgkomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so (0x00002000212d0000)
>>>>>>>>>>>     libomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so (0x0000200021300000)
>>>>>>>>>>>     libomptarget.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so (0x00002000213f0000)
>>>>>>>>>>>     libpgmath.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so (0x0000200021420000)
>>>>>>>>>>>     libpgc.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so (0x0000200021540000)
>>>>>>>>>>>     librt.so.1 => /usr/lib64/librt.so.1 (0x00002000216b0000)
>>>>>>>>>>>     libm.so.6 => /usr/lib64/libm.so.6 (0x00002000216e0000)
>>>>>>>>>>>     libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00002000217d0000)
>>>>>>>>>>>     libc.so.6 => /usr/lib64/libc.so.6 (0x0000200021810000)
>>>>>>>>>>>     libz.so.1 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 (0x0000200021a10000)
>>>>>>>>>>>     libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x0000200021a60000)
>>>>>>>>>>>     /lib64/ld64.so.2 (0x0000200000000000)
>>>>>>>>>>>     libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x0000200021ab0000)
>>>>>>>>>>>     libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002000239c0000)
>>>>>>>>>>>     libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 (0x00002000239f0000)
>>>>>>>>>>>     libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 (0x0000200023a60000)
>>>>>>>>>>>     libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 (0x0000200023ae0000)
>>>>>>>>>>>     libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 (0x0000200023b10000)
>>>>>>>>>>>     libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 (0x0000200023c20000)
>>>>>>>>>>>     libXau.so.6 => /usr/lib64/libXau.so.6 (0x0000200023d10000)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Feb 7, 2020, at 2:31 PM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Run ldd -o on the executables from both linkings of your code.
>>>>>>>>>>>> 
>>>>>>>>>>>> My guess is that without PETSc it is linking the static versions of the needed libraries, and with PETSc the shared ones. And, in typical fashion, the shared libraries sit on some very slow file system, so they take a long time to be loaded and linked in on demand.
>>>>>>>>>>>> 
>>>>>>>>>>>> Still a performance bug on Summit.
>>>>>>>>>>>> 
>>>>>>>>>>>> Barry
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Previously I noticed that the first call to a CUDA function such as cudaMalloc or cudaFree in PETSc takes a long time (7.5 seconds) on Summit. I then prepared the attached simple example to help OLCF reproduce the problem. It turned out that the problem is caused by PETSc: the 7.5-second overhead is observed only when the PETSc library is linked. If I do not link PETSc, it runs normally. Does anyone have any idea why this happens and how to fix it?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hong (Mr.)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> bash-4.2$ cat ex_simple.c
>>>>>>>>>>>>> #include <time.h>
>>>>>>>>>>>>> #include <cuda_runtime.h>
>>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> int main(int argc,char **args)
>>>>>>>>>>>>> {
>>>>>>>>>>>>>   clock_t start,s1,s2,s3;
>>>>>>>>>>>>>   double  cputime;
>>>>>>>>>>>>>   double  *init,tmp[100] = {0};
>>>>>>>>>>>>> 
>>>>>>>>>>>>>   start = clock();
>>>>>>>>>>>>>   cudaFree(0);
>>>>>>>>>>>>>   s1 = clock();
>>>>>>>>>>>>>   cudaMalloc((void **)&init,100*sizeof(double));
>>>>>>>>>>>>>   s2 = clock();
>>>>>>>>>>>>>   cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);
>>>>>>>>>>>>>   s3 = clock();
>>>>>>>>>>>>>   printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);
>>>>>>>>>>>>> 
>>>>>>>>>>>>>   return 0;
>>>>>>>>>>>>> }
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>>> 
>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>> -- Norbert Wiener
>>>>> 
>>>>> https://www.cse.buffalo.edu/~knepley/
>>>> 
>>> 
>> 
> 


