<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Passing -cuda_initialize 0 does not make any difference. In fact, this issue has nothing to do with PetscInitialize(): I called cudaFree(0) before PetscInitialize(), and it still took 7.5 seconds.
<div class=""><br class="">
</div>
<div class="">Hong<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Feb 10, 2020, at 10:44 AM, Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov" class="">jczhang@mcs.anl.gov</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" class="">As I mentioned, have you tried -cuda_initialize 0? Also, PetscCUDAInitialize contains<br class="">
<blockquote style="margin:0 0 0 40px;border:none;padding:0px" class="">ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);<br class="">
ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);</blockquote>
<div class="">Have you tried to comment out them and test again?
<div class="">
<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">
<div dir="ltr" class="">--Junchao Zhang</div>
</div>
</div>
<br class="">
</div>
</div>
<br class="">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" class="">petsc-dev@mcs.anl.gov</a>> wrote:<br class="">
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div style="overflow-wrap: break-word;" class=""><br class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Feb 8, 2020, at 5:03 PM, Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank" class="">knepley@gmail.com</a>> wrote:</div>
<br class="">
<div class="">
<div dir="ltr" style="font-family:Verdana;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none" class="">
<div dir="ltr" class="">On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank" class="">petsc-dev@mcs.anl.gov</a>> wrote:<br class="">
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div style="overflow-wrap: break-word;" class="">I did some further investigation. The overhead persists for both the PETSc shared library and the static library. In the previous example, it does not call any PETSc function, the first CUDA function becomes
very slow when it is linked to the petsc so. This indicates that the slowdown occurs if the symbol (cudafree)is searched through the petsc so, but does not occur if the symbol is found directly in the cuda runtime lib.
<div class=""><br class="">
</div>
<div class="">So the issue has nothing to do with the dynamic linker. The following example can be used to easily reproduce the problem (cudaFree(0) always takes ~7.5 seconds). </div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">1) This should go to OLCF admin as Jeff suggests</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">I had sent this to OLCF admin before the discussion was started here. Thomas Papatheodore has followed up. I am trying to help him reproduce the problem on summit. </div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div dir="ltr" style="font-family:Verdana;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none" class="">
<div class="gmail_quote">
<div class=""><br class="">
</div>
<div class="">2) Just to make sure I understand, a static executable with this code is still slow on the cudaFree(), since CUDA is a shared library by default.</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">I prepared the code as a minimal example to reproduce the problem. It would be fair to say any code using PETSc (with CUDA enabled, built statically or dynamically) on summit suffers a 7.5-second overhead on the first CUDA function call (either
in the user code or inside PETSc).</div>
<div class=""><br class="">
</div>
<div class="">Thanks,</div>
<div class="">Hong</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div dir="ltr" style="font-family:Verdana;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none" class="">
<div class="gmail_quote">
<div class=""><br class="">
</div>
<div class="">I think we should try:</div>
<div class=""><br class="">
</div>
<div class=""> a) Forcing a full static link, if possible</div>
<div class=""><br class="">
</div>
<div class=""> b) Asking OLCF about link resolution order</div>
<div class=""><br class="">
</div>
<div class="">It sounds like a similar thing I have seen in the past where link resolution order can exponentially increase load time.</div>
<div class=""><br class="">
</div>
<div class=""> Thanks,</div>
<div class=""><br class="">
</div>
<div class=""> Matt</div>
<div class=""> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div style="overflow-wrap: break-word;" class="">
<div class="">
<div class="">
<div class="">
<div class="">bash-4.2$ cat ex_simple_petsc.c</div>
<div class="">#include <time.h></div>
<div class="">#include <cuda_runtime.h></div>
<div class="">#include <stdio.h></div>
<div class="">#include <petscmat.h></div>
<div class=""><br class="">
</div>
<div class="">int main(int argc,char **args)</div>
<div class="">{</div>
<div class=""> clock_t start,s1,s2,s3;</div>
<div class=""> double cputime;</div>
<div class=""> double *init,tmp[100] = {0};</div>
<div class=""> PetscErrorCode ierr=0;</div>
<div class=""><br class="">
</div>
<div class=""> ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;</div>
<div class=""> start = clock();</div>
<div class=""> cudaFree(0);</div>
<div class=""> s1 = clock();</div>
<div class=""> cudaMalloc((void **)&init,100*sizeof(double));</div>
<div class=""> s2 = clock();</div>
<div class=""> cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);</div>
<div class=""> s3 = clock();</div>
<div class=""> printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);</div>
<div class=""> ierr = PetscFinalize();</div>
<div class=""> return ierr;</div>
<div class="">}</div>
<div class=""><br class="">
</div>
<div class="">Hong</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Feb 7, 2020, at 3:09 PM, Zhang, Hong <<a href="mailto:hongzhang@anl.gov" target="_blank" class="">hongzhang@anl.gov</a>> wrote:</div>
<br class="">
<div class="">
<div style="overflow-wrap: break-word;" class="">Note that the overhead was triggered by the first call to a CUDA function. So it seems that the first CUDA function triggered loading petsc so (if petsc so is linked), which is slow on the summit file system.
<div class=""><br class="">
<div class="">
<div class="">
<div class="">Hong
<div class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank" class="">petsc-dev@mcs.anl.gov</a>> wrote:</div>
<br class="">
<div class="">
<div style="overflow-wrap: break-word;" class="">
<div class="">Linking any other shared library does not slow down the execution. The PETSc shared library is the only one causing trouble.</div>
<div class=""><br class="">
</div>
<div class="">Here are the ldd output for two different versions. For the first version, I removed -lpetsc and it ran very fast. The second (slow) version was linked to petsc so. </div>
<div class=""><br class="">
</div>
<div class="">bash-4.2$ ldd ex_simple</div>
<div class=""> linux-vdso64.so.1 => (0x0000200000050000)</div>
<div class=""> liblapack.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 (0x0000200000070000)</div>
<div class=""> libblas.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 (0x00002000009b0000)</div>
<div class=""> libhdf5hl_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 (0x0000200000e80000)</div>
<div class=""> libhdf5_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 (0x0000200000ed0000)</div>
<div class=""> libhdf5_hl.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 (0x0000200000f50000)</div>
<div class=""> libhdf5.so.103 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 (0x0000200000fb0000)</div>
<div class=""> libX11.so.6 => /usr/lib64/libX11.so.6 (0x00002000015e0000)</div>
<div class=""> libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 (0x0000200001770000)</div>
<div class=""> libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 (0x0000200009b00000)</div>
<div class=""> libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x000020000d950000)</div>
<div class=""> libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x000020000d9f0000)</div>
<div class=""> libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200012f50000)</div>
<div class=""> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000020001dc40000)</div>
<div class=""> libdl.so.2 => /usr/lib64/libdl.so.2 (0x000020001ddd0000)</div>
<div class=""> libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x000020001de00000)</div>
<div class=""> libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 (0x000020001de40000)</div>
<div class=""> libmpi_ibm_usempi.so => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so (0x000020001de70000)</div>
<div class=""> libmpi_ibm_mpifh.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 (0x000020001dea0000)</div>
<div class=""> libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 (0x000020001df40000)</div>
<div class=""> libpgf90rtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so (0x000020001e0b0000)</div>
<div class=""> libpgf90.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so (0x000020001e0f0000)</div>
<div class=""> libpgf90_rpm1.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so (0x000020001e6a0000)</div>
<div class=""> libpgf902.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so (0x000020001e6d0000)</div>
<div class=""> libpgftnrtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so (0x000020001e700000)</div>
<div class=""> libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x000020001e730000)</div>
<div class=""> libpgkomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so (0x000020001e760000)</div>
<div class=""> libomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so (0x000020001e790000)</div>
<div class=""> libomptarget.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so (0x000020001e880000)</div>
<div class=""> libpgmath.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so (0x000020001e8b0000)</div>
<div class=""> libpgc.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so (0x000020001e9d0000)</div>
<div class=""> librt.so.1 => /usr/lib64/librt.so.1 (0x000020001eb40000)</div>
<div class=""> libm.so.6 => /usr/lib64/libm.so.6 (0x000020001eb70000)</div>
<div class=""> libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x000020001ec60000)</div>
<div class=""> libc.so.6 => /usr/lib64/libc.so.6 (0x000020001eca0000)</div>
<div class=""> libz.so.1 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 (0x000020001ee90000)</div>
<div class=""> libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x000020001eef0000)</div>
<div class=""> /lib64/ld64.so.2 (0x0000200000000000)</div>
<div class=""> libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x000020001ef40000)</div>
<div class=""> libutil.so.1 => /usr/lib64/libutil.so.1 (0x0000200020e50000)</div>
<div class=""> libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 (0x0000200020e80000)</div>
<div class=""> libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 (0x0000200020ef0000)</div>
<div class=""> libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 (0x0000200020f70000)</div>
<div class=""> libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 (0x0000200020fa0000)</div>
<div class=""> libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 (0x00002000210b0000)</div>
<div class=""> libXau.so.6 => /usr/lib64/libXau.so.6 (0x00002000211a0000)</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class="">
<div class="">bash-4.2$ ldd ex_simple_slow</div>
<div class=""> linux-vdso64.so.1 => (0x0000200000050000)</div>
<div class=""><font color="#ff2600" class=""> libpetsc.so.3.012 => /autofs/nccs-svm1_home1/hongzh/Projects/petsc/arch-olcf-summit-sell-opt/lib/libpetsc.so.3.012 (0x0000200000070000)</font></div>
<div class=""> liblapack.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 (0x0000200002be0000)</div>
<div class=""> libblas.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 (0x0000200003520000)</div>
<div class=""> libhdf5hl_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 (0x00002000039f0000)</div>
<div class=""> libhdf5_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 (0x0000200003a40000)</div>
<div class=""> libhdf5_hl.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 (0x0000200003ac0000)</div>
<div class=""> libhdf5.so.103 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 (0x0000200003b20000)</div>
<div class=""> libX11.so.6 => /usr/lib64/libX11.so.6 (0x0000200004150000)</div>
<div class=""> libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 (0x00002000042e0000)</div>
<div class=""> libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 (0x000020000c670000)</div>
<div class=""> libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x00002000104c0000)</div>
<div class=""> libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x0000200010560000)</div>
<div class=""> libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200015ac0000)</div>
<div class=""> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002000207b0000)</div>
<div class=""> libdl.so.2 => /usr/lib64/libdl.so.2 (0x0000200020940000)</div>
<div class=""> libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x0000200020970000)</div>
<div class=""> libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 (0x00002000209b0000)</div>
<div class=""> libmpi_ibm_usempi.so => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so (0x00002000209e0000)</div>
<div class=""> libmpi_ibm_mpifh.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 (0x0000200020a10000)</div>
<div class=""> libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 (0x0000200020ab0000)</div>
<div class=""> libpgf90rtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so (0x0000200020c20000)</div>
<div class=""> libpgf90.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so (0x0000200020c60000)</div>
<div class=""> libpgf90_rpm1.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so (0x0000200021210000)</div>
<div class=""> libpgf902.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so (0x0000200021240000)</div>
<div class=""> libpgftnrtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so (0x0000200021270000)</div>
<div class=""> libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00002000212a0000)</div>
<div class=""> libpgkomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so (0x00002000212d0000)</div>
<div class=""> libomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so (0x0000200021300000)</div>
<div class=""> libomptarget.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so (0x00002000213f0000)</div>
<div class=""> libpgmath.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so (0x0000200021420000)</div>
<div class=""> libpgc.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so (0x0000200021540000)</div>
<div class=""> librt.so.1 => /usr/lib64/librt.so.1 (0x00002000216b0000)</div>
<div class=""> libm.so.6 => /usr/lib64/libm.so.6 (0x00002000216e0000)</div>
<div class=""> libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00002000217d0000)</div>
<div class=""> libc.so.6 => /usr/lib64/libc.so.6 (0x0000200021810000)</div>
<div class=""> libz.so.1 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 (0x0000200021a10000)</div>
<div class=""> libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x0000200021a60000)</div>
<div class=""> /lib64/ld64.so.2 (0x0000200000000000)</div>
<div class=""> libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x0000200021ab0000)</div>
<div class=""> libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002000239c0000)</div>
<div class=""> libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 (0x00002000239f0000)</div>
<div class=""> libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 (0x0000200023a60000)</div>
<div class=""> libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 (0x0000200023ae0000)</div>
<div class=""> libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 (0x0000200023b10000)</div>
<div class=""> libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 (0x0000200023c20000)</div>
<div class=""> libXau.so.6 => /usr/lib64/libXau.so.6 (0x0000200023d10000)</div>
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Feb 7, 2020, at 2:31 PM, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank" class="">bsmith@mcs.anl.gov</a>> wrote:</div>
<br class="">
<div class="">
<div class=""><br class="">
 Run ldd on the executables from both linkings of your code.<br class="">
<br class="">
 My guess is that without PETSc it links the static versions of the needed libraries, and with PETSc the shared ones. And, in typical fashion, the shared libraries sit on some super slow file system, so they take a long time to be loaded and linked in on demand.<br class="">
<br class="">
Still a performance bug in Summit.<span class=""> </span><br class="">
<br class="">
Barry<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank" class="">petsc-dev@mcs.anl.gov</a>> wrote:<br class="">
<br class="">
Hi all,<br class="">
<br class="">
I previously noticed that the first call to a CUDA function such as cudaMalloc or cudaFree in PETSc takes a long time (7.5 seconds) on Summit. I then prepared the simple example attached below to help OLCF reproduce the problem. It turned out that the problem
was caused by PETSc. The 7.5-second overhead is observed only when the PETSc library is linked. If I do not link PETSc, the example runs normally. Does anyone have any idea why this happens and how to fix it?<br class="">
<br class="">
Hong (Mr.)<br class="">
<br class="">
bash-4.2$ cat ex_simple.c<br class="">
#include <time.h><br class="">
#include <cuda_runtime.h><br class="">
#include <stdio.h><br class="">
<br class="">
int main(int argc,char **args)<br class="">
{<br class="">
clock_t start,s1,s2,s3;<br class="">
double cputime;<br class="">
double *init,tmp[100] = {0};<br class="">
<br class="">
start = clock();<br class="">
cudaFree(0);<br class="">
s1 = clock();<br class="">
cudaMalloc((void **)&init,100*sizeof(double));<br class="">
s2 = clock();<br class="">
cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);<br class="">
s3 = clock();<br class="">
printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);<br class="">
<br class="">
return 0;<br class="">
}<br class="">
<br class="">
<br class="">
</blockquote>
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</div>
</div>
</blockquote>
</div>
<br clear="all" class="">
<div class=""><br class="">
</div>
--<span class=""> </span><br class="">
<div dir="ltr" class="">
<div dir="ltr" class="">
<div class="">
<div dir="ltr" class="">
<div class="">
<div dir="ltr" class="">
<div class="">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br class="">
-- Norbert Wiener</div>
<div class=""><br class="">
</div>
<div class=""><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank" class="">https://www.cse.buffalo.edu/~knepley/</a></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</body>
</html>