<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Feb 12, 2020, at 11:09 AM, Matthew Knepley <<a href="mailto:knepley@gmail.com" class="">knepley@gmail.com</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" style="caret-color: rgb(0, 0, 0); font-family: Verdana; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<div dir="ltr" class="">On Wed, Feb 12, 2020 at 11:06 AM Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" class="">petsc-dev@mcs.anl.gov</a>> wrote:<br class="">
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">
<div style="word-wrap: break-word;" class="">
<div class="">Sorry for the long post. Here are replies I have got from OLCF so far. We still don’t know how to solve the problem.</div>
<div class=""><br class="">
</div>
<div class="">One interesting thing that Tom noticed is PetscInitialize() may have called cudaFree(0) 32 times as NVPROF shows, and they all run very fast. These calls may be triggered by some other libraries like cublas. But if PETSc calls cudaFree() explicitly,
 it is always very slow.</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">It sounds really painful, but I would start removing lines from PetscInitialize() until it runs fast.</div>
</div>
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>It may be more painful than it sounds. The problem is not really related to PetscInitialize(). The following simple example does not call any PETSc function, yet if we link it to the PETSc shared library, cudaFree(0) becomes very slow. CUDA is a black box, so there is not much we can debug with this simple example.</div>
<div><br class="">
</div>
<div>
<div>bash-4.2$ cat ex_simple.c</div>
<div>#include <time.h></div>
<div>#include <cuda_runtime.h></div>
<div>#include <stdio.h></div>
<div><br class="">
</div>
<div>int main(int argc,char **args)</div>
<div>{</div>
<div>  clock_t start,s1,s2,s3;</div>
<div>  double  cputime;</div>
<div>  double   *init,tmp[100] = {0};</div>
<div><br class="">
</div>
<div>  start = clock();</div>
<div>  cudaFree(0);</div>
<div>  s1 = clock();</div>
<div>  cudaMalloc((void **)&init,100*sizeof(double));</div>
<div>  s2 = clock();</div>
<div>  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);</div>
<div>  s3 = clock();</div>
<div>  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);</div>
<div>  return 0;</div>
<div>}</div>
</div>
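<div><br class="">
</div>
<div>Since clock() measures CPU time rather than wall-clock time, a variant using clock_gettime(CLOCK_MONOTONIC) would confirm that the delay is real elapsed time and not just CPU accounting. A rough, untested sketch (same test, different timer):</div>
<div><br class="">
</div>
<div>
<div>#include <time.h></div>
<div>#include <cuda_runtime.h></div>
<div>#include <stdio.h></div>
<div><br class=""></div>
<div>/* wall-clock seconds, unlike clock() which reports CPU time */</div>
<div>static double wtime(void)</div>
<div>{</div>
<div>  struct timespec ts;</div>
<div>  clock_gettime(CLOCK_MONOTONIC,&ts);</div>
<div>  return (double)ts.tv_sec + 1e-9*(double)ts.tv_nsec;</div>
<div>}</div>
<div><br class=""></div>
<div>int main(int argc,char **args)</div>
<div>{</div>
<div>  double *init,tmp[100] = {0};</div>
<div>  double t0,t1,t2,t3;</div>
<div><br class=""></div>
<div>  t0 = wtime();</div>
<div>  cudaFree(0);   /* first CUDA call; forces CUDA runtime/context initialization */</div>
<div>  t1 = wtime();</div>
<div>  cudaMalloc((void **)&init,100*sizeof(double));</div>
<div>  t2 = wtime();</div>
<div>  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);</div>
<div>  t3 = wtime();</div>
<div>  printf("free time =%lf malloc time =%lf copy time =%lf\n",t1-t0,t2-t1,t3-t2);</div>
<div>  return 0;</div>
<div>}</div>
</div>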
<div><br class="">
</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div dir="ltr" style="caret-color: rgb(0, 0, 0); font-family: Verdana; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class="">
<div class="gmail_quote">
<div class=""><br class="">
</div>
<div class="">  Thanks,</div>
<div class=""><br class="">
</div>
<div class="">     Matt</div>
<div class=""> </div>
<blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">
<div style="word-wrap: break-word;" class="">
<div class="">Hong</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
On Wed Feb 12 09:51:33 2020, tpapathe wrote:<br class="">
<br class="">
 Something else I noticed from the nvprof output (see my previous post) is<br class="">
 that the runs with PETSc initialized have 33 calls to cudaFree, whereas the<br class="">
 non-PETSc versions only have the 1 call to cudaFree. I'm not sure what is<br class="">
 happening in the PETSc initialize/finalize, but it appears to be doing a<br class="">
 lot under the hood. You can also see there are many additional CUDA calls<br class="">
 that are not shown in the profiler output from the non-PETSc runs (e.g.,<br class="">
 additional cudaMalloc and cudaMemcpy calls, cudaDeviceSynchronize, etc.).<br class="">
 Which other systems have you tested this on? Which CUDA Toolkits and CUDA<br class="">
 drivers were installed on those systems? Please let me know if there is any<br class="">
 additional information you can share with me about this.<br class="">
<br class="">
 -Tom<br class="">
 On Wed Feb 12 09:25:23 2020, tpapathe wrote:<br class="">
<br class="">
   Ok. Thanks for the additional info, Hong. I'll ask around to see if any<br class="">
   local (PETSc or CUDA) experts have experienced this behavior. In the<br class="">
   meantime, is this impacting your work or something you're just curious<br class="">
   about? A 5-7 second initialization time is indeed unusual, but is it<br class="">
   negligible relative to the overall walltime of your jobs, or is it<br class="">
   somehow affecting your productivity?<br class="">
<br class="">
   -Tom<br class="">
   On Tue Feb 11 17:04:25 2020, <a href="mailto:hongzhang@anl.gov" target="_blank" class="">hongzhang@anl.gov</a> wrote:<br class="">
<br class="">
     We know it happens with PETSc. But note that the slowdown occurs on the first CUDA function call. In the example I sent to you, if we simply link it to the PETSc shared library and don’t call any PETSc function, the slowdown still happens on cudaFree(0). We have never seen this behavior on other GPU systems.
<div class=""><br class="">
</div>
<div class="">
<div class="">On Feb 11, 2020, at 3:31 PM, Thomas Papatheodore via RT <<a href="mailto:help@nccs.gov" target="_blank" class="">help@nccs.gov</a>> wrote:<br class="">
<br class="">
Thanks for the update. I have now reproduced the behavior you described with<br class="">
PETSc + CUDA using your example code:<br class="">
<br class="">
[tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1<br class="">
-a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof<br class="">
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc<br class="">
<br class="">
==16991== NVPROF is profiling process 16991, command:<br class="">
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc<br class="">
<br class="">
==16991== Profiling application:<br class="">
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_petsc<br class="">
<br class="">
free time =4.730000 malloc time =0.000000 copy time =0.000000<br class="">
<br class="">
==16991== Profiling result:<br class="">
<br class="">
Type Time(%) Time Calls Avg Min Max Name<br class="">
<br class="">
GPU activities: 100.00% 9.3760us 6 1.5620us 1.3440us 1.7920us [CUDA memcpy<br class="">
HtoD]<br class="">
<br class="">
API calls: 99.78% 5.99333s 33 181.62ms 883ns 4.71976s cudaFree<br class="">
<br class="">
0.11% 6.3603ms 379 16.781us 233ns 693.40us cuDeviceGetAttribute<br class="">
<br class="">
0.07% 4.1453ms 4 1.0363ms 1.0186ms 1.0623ms cuDeviceTotalMem<br class="">
<br class="">
0.02% 1.0046ms 4 251.15us 131.45us 449.32us cuDeviceGetName<br class="">
<br class="">
0.01% 808.21us 16 50.513us 6.7080us 621.54us cudaMalloc<br class="">
<br class="">
0.01% 452.06us 450 1.0040us 830ns 6.4430us cudaFuncSetAttribute<br class="">
<br class="">
0.00% 104.89us 6 17.481us 13.419us 21.338us cudaMemcpy<br class="">
<br class="">
0.00% 102.26us 15 6.8170us 6.1900us 10.072us cudaDeviceSynchronize<br class="">
<br class="">
0.00% 93.635us 80 1.1700us 1.0190us 2.1990us cudaEventCreateWithFlags<br class="">
<br class="">
0.00% 92.168us 83 1.1100us 951ns 2.3550us cudaEventDestroy<br class="">
<br class="">
0.00% 52.277us 74 706ns 592ns 1.5640us cudaDeviceGetAttribute<br class="">
<br class="">
0.00% 34.558us 3 11.519us 9.5410us 15.129us cudaStreamDestroy<br class="">
<br class="">
0.00% 27.778us 3 9.2590us 4.9120us 17.632us cudaStreamCreateWithFlags<br class="">
<br class="">
0.00% 11.955us 1 11.955us 11.955us 11.955us cudaSetDevice<br class="">
<br class="">
0.00% 10.361us 7 1.4800us 809ns 3.6580us cudaGetDevice<br class="">
<br class="">
0.00% 5.4310us 3 1.8100us 1.6420us 1.9980us cudaEventCreate<br class="">
<br class="">
0.00% 3.8040us 6 634ns 391ns 1.5350us cuDeviceGetCount<br class="">
<br class="">
0.00% 3.5350us 1 3.5350us 3.5350us 3.5350us cuDeviceGetPCIBusId<br class="">
<br class="">
0.00% 3.2210us 3 1.0730us 949ns 1.1640us cuInit<br class="">
<br class="">
0.00% 2.6780us 5 535ns 369ns 1.0210us cuDeviceGet<br class="">
<br class="">
0.00% 2.5080us 1 2.5080us 2.5080us 2.5080us cudaSetDeviceFlags<br class="">
<br class="">
0.00% 1.6800us 4 420ns 392ns 488ns cuDeviceGetUuid<br class="">
<br class="">
0.00% 1.5720us 3 524ns 398ns 590ns cuDriverGetVersion<br class="">
<br class="">
<br class="">
<br class="">
If I remove all mention of PETSc from the code, compile manually and run, I get<br class="">
the expected behavior:<br class="">
<br class="">
[tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ pgc++<br class="">
-L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple.c -o ex_simple<br class="">
<br class="">
<br class="">
[tpapathe@batch2: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1<br class="">
-a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof<br class="">
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple<br class="">
<br class="">
==17248== NVPROF is profiling process 17248, command:<br class="">
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple<br class="">
<br class="">
==17248== Profiling application:<br class="">
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple<br class="">
<br class="">
free time =0.340000 malloc time =0.000000 copy time =0.000000<br class="">
<br class="">
==17248== Profiling result:<br class="">
<br class="">
Type Time(%) Time Calls Avg Min Max Name<br class="">
<br class="">
GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy<br class="">
HtoD]<br class="">
<br class="">
API calls: 98.56% 231.76ms 1 231.76ms 231.76ms 231.76ms cudaFree<br class="">
<br class="">
0.67% 1.5764ms 97 16.251us 234ns 652.65us cuDeviceGetAttribute<br class="">
<br class="">
0.46% 1.0727ms 1 1.0727ms 1.0727ms 1.0727ms cuDeviceTotalMem<br class="">
<br class="">
0.23% 537.38us 1 537.38us 537.38us 537.38us cudaMalloc<br class="">
<br class="">
0.07% 172.80us 1 172.80us 172.80us 172.80us cuDeviceGetName<br class="">
<br class="">
0.01% 21.648us 1 21.648us 21.648us 21.648us cudaMemcpy<br class="">
<br class="">
0.00% 3.3470us 1 3.3470us 3.3470us 3.3470us cuDeviceGetPCIBusId<br class="">
<br class="">
0.00% 2.5310us 3 843ns 464ns 1.3700us cuDeviceGetCount<br class="">
<br class="">
0.00% 1.7260us 2 863ns 490ns 1.2360us cuDeviceGet<br class="">
<br class="">
0.00% 377ns 1 377ns 377ns 377ns cuDeviceGetUuid<br class="">
<br class="">
<br class="">
<br class="">
I also get the expected behavior if I add an MPI_Init and MPI_Finalize to the<br class="">
code instead of PETSc initialization:<br class="">
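<br class="">
(ex_simple_mpi.c is just ex_simple.c with the MPI calls added, roughly along these lines:)<br class="">
<br class="">
#include <time.h><br class="">
#include <cuda_runtime.h><br class="">
#include <stdio.h><br class="">
#include <mpi.h><br class="">
<br class="">
int main(int argc,char **args)<br class="">
{<br class="">
  clock_t start,s1,s2,s3;<br class="">
  double  *init,tmp[100] = {0};<br class="">
<br class="">
  MPI_Init(&argc,&args);<br class="">
  start = clock();<br class="">
  cudaFree(0);<br class="">
  s1 = clock();<br class="">
  cudaMalloc((void **)&init,100*sizeof(double));<br class="">
  s2 = clock();<br class="">
  cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);<br class="">
  s3 = clock();<br class="">
  printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);<br class="">
  MPI_Finalize();<br class="">
  return 0;<br class="">
}<br class="">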
<br class="">
[tpapathe@login1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ mpicc<br class="">
-L$OLCF_CUDA_ROOT/lib64 -lcudart ex_simple_mpi.c -o ex_simple_mpi<br class="">
<br class="">
<br class="">
[tpapathe@batch1: /gpfs/alpine/scratch/tpapathe/stf007/petsc/src]$ jsrun -n1<br class="">
-a1 -c1 -g1 -r1 -l cpu-cpu -dpacked -bpacked:1 nvprof<br class="">
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi<br class="">
<br class="">
==35166== NVPROF is profiling process 35166, command:<br class="">
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi<br class="">
<br class="">
==35166== Profiling application:<br class="">
/gpfs/alpine/scratch/tpapathe/stf007/petsc/src/ex_simple_mpi<br class="">
<br class="">
free time =0.340000 malloc time =0.000000 copy time =0.000000<br class="">
<br class="">
==35166== Profiling result:<br class="">
<br class="">
Type Time(%) Time Calls Avg Min Max Name<br class="">
<br class="">
GPU activities: 100.00% 1.7600us 1 1.7600us 1.7600us 1.7600us [CUDA memcpy<br class="">
HtoD]<br class="">
<br class="">
API calls: 98.57% 235.61ms 1 235.61ms 235.61ms 235.61ms cudaFree<br class="">
<br class="">
0.66% 1.5802ms 97 16.290us 239ns 650.72us cuDeviceGetAttribute<br class="">
<br class="">
0.45% 1.0825ms 1 1.0825ms 1.0825ms 1.0825ms cuDeviceTotalMem<br class="">
<br class="">
0.23% 542.73us 1 542.73us 542.73us 542.73us cudaMalloc<br class="">
<br class="">
0.07% 174.77us 1 174.77us 174.77us 174.77us cuDeviceGetName<br class="">
<br class="">
0.01% 26.431us 1 26.431us 26.431us 26.431us cudaMemcpy<br class="">
<br class="">
0.00% 4.0330us 1 4.0330us 4.0330us 4.0330us cuDeviceGetPCIBusId<br class="">
<br class="">
0.00% 2.8560us 3 952ns 528ns 1.6150us cuDeviceGetCount<br class="">
<br class="">
0.00% 1.6190us 2 809ns 576ns 1.0430us cuDeviceGet<br class="">
<br class="">
0.00% 341ns 1 341ns 341ns 341ns cuDeviceGetUuid<br class="">
<br class="">
<br class="">
So this appears to be something specific happening within PETSc itself - not<br class="">
necessarily an OLCF issue. I would suggest asking this question within the<br class="">
PETSc community to understand what's happening. Please let me know if you have<br class="">
any additional questions.<br class="">
<br class="">
-Tom</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Feb 10, 2020, at 11:14 AM, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank" class="">bsmith@mcs.anl.gov</a>> wrote:</div>
<br class="">
<div class="">
<div class=""><br class="">
 gprof or some similar tool?<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 10, 2020, at 11:18 AM, Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank" class="">petsc-dev@mcs.anl.gov</a>> wrote:<br class="">
<br class="">
-cuda_initialize 0 does not make any difference. Actually this issue has nothing to do with PetscInitialize(). I tried to call cudaFree(0) before PetscInitialize(), and it still took 7.5 seconds.<br class="">
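<br class="">
(The test was essentially ex_simple_petsc.c with the timed cudaFree(0) moved ahead of PetscInitialize(), along these lines:)<br class="">
<br class="">
#include <time.h><br class="">
#include <cuda_runtime.h><br class="">
#include <stdio.h><br class="">
#include <petscmat.h><br class="">
<br class="">
int main(int argc,char **args)<br class="">
{<br class="">
 clock_t start,s1;<br class="">
 PetscErrorCode ierr;<br class="">
<br class="">
 start = clock();<br class="">
 cudaFree(0);  /* first CUDA call, issued before PETSc is initialized */<br class="">
 s1 = clock();<br class="">
 printf("free time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC);<br class="">
 ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;<br class="">
 ierr = PetscFinalize();<br class="">
 return ierr;<br class="">
}<br class="">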
<br class="">
Hong<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 10, 2020, at 10:44 AM, Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov" target="_blank" class="">jczhang@mcs.anl.gov</a>> wrote:<br class="">
<br class="">
As I mentioned, have you tried -cuda_initialize 0? Also, PetscCUDAInitialize contains<br class="">
ierr = PetscCUBLASInitializeHandle();CHKERRQ(ierr);<br class="">
ierr = PetscCUSOLVERDnInitializeHandle();CHKERRQ(ierr);<br class="">
Have you tried commenting them out and testing again?<br class="">
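Another quick check (a rough sketch, not tested on Summit; link with -lcublas -lcusolver -lcudart) would be to time the handle creation in a standalone program, to see whether cuBLAS/cuSOLVER initialization itself is the slow step:<br class="">
<br class="">
#include <time.h><br class="">
#include <stdio.h><br class="">
#include <cublas_v2.h><br class="">
#include <cusolverDn.h><br class="">
<br class="">
int main(void)<br class="">
{<br class="">
  clock_t t0,t1,t2;<br class="">
  cublasHandle_t blas;<br class="">
  cusolverDnHandle_t solver;<br class="">
<br class="">
  t0 = clock();<br class="">
  cublasCreate(&blas);        /* roughly what PetscCUBLASInitializeHandle() does */<br class="">
  t1 = clock();<br class="">
  cusolverDnCreate(&solver);  /* roughly what PetscCUSOLVERDnInitializeHandle() does */<br class="">
  t2 = clock();<br class="">
  printf("cublasCreate =%lf cusolverDnCreate =%lf\n",((double) (t1 - t0)) / CLOCKS_PER_SEC,((double) (t2 - t1)) / CLOCKS_PER_SEC);<br class="">
  cusolverDnDestroy(solver);<br class="">
  cublasDestroy(blas);<br class="">
  return 0;<br class="">
}<br class="">
<br class="">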
--Junchao Zhang<br class="">
<br class="">
<br class="">
On Sat, Feb 8, 2020 at 5:22 PM Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank" class="">petsc-dev@mcs.anl.gov</a>> wrote:<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 8, 2020, at 5:03 PM, Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank" class="">knepley@gmail.com</a>> wrote:<br class="">
<br class="">
On Sat, Feb 8, 2020 at 4:34 PM Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank" class="">petsc-dev@mcs.anl.gov</a>> wrote:<br class="">
I did some further investigation. The overhead persists for both the PETSc shared library and the static library. The previous example does not call any PETSc function, yet the first CUDA call becomes very slow when the executable is linked against libpetsc.so. This indicates that the slowdown occurs when the symbol (cudaFree) is resolved through libpetsc.so, but not when the symbol is found directly in the CUDA runtime library.<br class="">
<br class="">
So the issue has nothing to do with the dynamic linker. The following example can be used to easily reproduce the problem (cudaFree(0) always takes ~7.5 seconds).  <br class="">
<br class="">
1) This should go to OLCF admin as Jeff suggests<br class="">
</blockquote>
<br class="">
I had sent this to the OLCF admins before the discussion started here. Thomas Papatheodore has followed up. I am trying to help him reproduce the problem on Summit.<br class="">
<br class="">
<blockquote type="cite" class=""><br class="">
2) Just to make sure I understand, a static executable with this code is still slow on the cudaFree(), since CUDA is a shared library by default.<br class="">
</blockquote>
<br class="">
I prepared the code as a minimal example to reproduce the problem. It would be fair to say that any code using PETSc (with CUDA enabled, built statically or dynamically) on Summit suffers a 7.5-second overhead on the first CUDA function call (either in the user code or inside PETSc).<br class="">
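<br class="">
One more thing worth trying (I have not done this yet) is to time loading libpetsc.so by itself with dlopen, to see whether the library load alone accounts for the 7.5 seconds. A rough sketch (link with -ldl, and adjust the library name/path or LD_LIBRARY_PATH to point at the PETSc install):<br class="">
<br class="">
#include <dlfcn.h><br class="">
#include <time.h><br class="">
#include <stdio.h><br class="">
<br class="">
int main(void)<br class="">
{<br class="">
  clock_t t0,t1;<br class="">
  void *h;<br class="">
<br class="">
  t0 = clock();<br class="">
  h = dlopen("libpetsc.so",RTLD_NOW | RTLD_GLOBAL);  /* load and resolve everything up front */<br class="">
  t1 = clock();<br class="">
  if (!h) { fprintf(stderr,"dlopen failed: %s\n",dlerror()); return 1; }<br class="">
  printf("dlopen time =%lf\n",((double) (t1 - t0)) / CLOCKS_PER_SEC);<br class="">
  dlclose(h);<br class="">
  return 0;<br class="">
}<br class="">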
<br class="">
Thanks,<br class="">
Hong<br class="">
<br class="">
<blockquote type="cite" class=""><br class="">
I think we should try:<br class="">
<br class="">
 a) Forcing a full static link, if possible<br class="">
<br class="">
 b) Asking OLCF about link resolution order<br class="">
<br class="">
It sounds similar to something I have seen in the past, where link resolution order can dramatically increase load time.<br class="">
<br class="">
 Thanks,<br class="">
<br class="">
    Matt<br class="">
<br class="">
bash-4.2$ cat ex_simple_petsc.c<br class="">
#include <time.h><br class="">
#include <cuda_runtime.h><br class="">
#include <stdio.h><br class="">
#include <petscmat.h><br class="">
<br class="">
int main(int argc,char **args)<br class="">
{<br class="">
 clock_t start,s1,s2,s3;<br class="">
 double  cputime;<br class="">
 double  *init,tmp[100] = {0};<br class="">
 PetscErrorCode ierr=0;<br class="">
<br class="">
 ierr = PetscInitialize(&argc,&args,(char*)0,NULL);if (ierr) return ierr;<br class="">
 start = clock();<br class="">
 cudaFree(0);<br class="">
 s1 = clock();<br class="">
 cudaMalloc((void **)&init,100*sizeof(double));<br class="">
 s2 = clock();<br class="">
 cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);<br class="">
 s3 = clock();<br class="">
 printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);<br class="">
 ierr = PetscFinalize();<br class="">
 return ierr;<br class="">
}<br class="">
<br class="">
Hong<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 7, 2020, at 3:09 PM, Zhang, Hong <<a href="mailto:hongzhang@anl.gov" target="_blank" class="">hongzhang@anl.gov</a>> wrote:<br class="">
<br class="">
Note that the overhead was triggered by the first call to a CUDA function. So it seems that the first CUDA call triggers loading libpetsc.so (when it is linked), which is slow on the Summit file system.<br class="">
<br class="">
Hong<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 7, 2020, at 2:54 PM, Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank" class="">petsc-dev@mcs.anl.gov</a>> wrote:<br class="">
<br class="">
Linking any other shared library does not slow down the execution. The PETSc shared library is the only one causing trouble.<br class="">
<br class="">
Here is the ldd output for the two versions. For the first version, I removed -lpetsc and it ran very fast. The second (slow) version was linked against libpetsc.so.<br class="">
<br class="">
bash-4.2$ ldd ex_simple<br class="">
       linux-vdso64.so.1 =>  (0x0000200000050000)<br class="">
       liblapack.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 (0x0000200000070000)<br class="">
       libblas.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 (0x00002000009b0000)<br class="">
       libhdf5hl_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 (0x0000200000e80000)<br class="">
       libhdf5_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 (0x0000200000ed0000)<br class="">
       libhdf5_hl.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 (0x0000200000f50000)<br class="">
       libhdf5.so.103 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 (0x0000200000fb0000)<br class="">
       libX11.so.6 => /usr/lib64/libX11.so.6 (0x00002000015e0000)<br class="">
       libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 (0x0000200001770000)<br class="">
       libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 (0x0000200009b00000)<br class="">
       libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x000020000d950000)<br class="">
       libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x000020000d9f0000)<br class="">
       libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200012f50000)<br class="">
       libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000020001dc40000)<br class="">
       libdl.so.2 => /usr/lib64/libdl.so.2 (0x000020001ddd0000)<br class="">
       libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x000020001de00000)<br class="">
       libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 (0x000020001de40000)<br class="">
       libmpi_ibm_usempi.so => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so (0x000020001de70000)<br class="">
       libmpi_ibm_mpifh.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 (0x000020001dea0000)<br class="">
       libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 (0x000020001df40000)<br class="">
       libpgf90rtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so (0x000020001e0b0000)<br class="">
       libpgf90.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so (0x000020001e0f0000)<br class="">
       libpgf90_rpm1.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so (0x000020001e6a0000)<br class="">
       libpgf902.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so (0x000020001e6d0000)<br class="">
       libpgftnrtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so (0x000020001e700000)<br class="">
       libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x000020001e730000)<br class="">
       libpgkomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so (0x000020001e760000)<br class="">
       libomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so (0x000020001e790000)<br class="">
       libomptarget.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so (0x000020001e880000)<br class="">
       libpgmath.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so (0x000020001e8b0000)<br class="">
       libpgc.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so (0x000020001e9d0000)<br class="">
       librt.so.1 => /usr/lib64/librt.so.1 (0x000020001eb40000)<br class="">
       libm.so.6 => /usr/lib64/libm.so.6 (0x000020001eb70000)<br class="">
       libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x000020001ec60000)<br class="">
       libc.so.6 => /usr/lib64/libc.so.6 (0x000020001eca0000)<br class="">
       libz.so.1 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 (0x000020001ee90000)<br class="">
       libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x000020001eef0000)<br class="">
       /lib64/ld64.so.2 (0x0000200000000000)<br class="">
       libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x000020001ef40000)<br class="">
       libutil.so.1 => /usr/lib64/libutil.so.1 (0x0000200020e50000)<br class="">
       libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 (0x0000200020e80000)<br class="">
       libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 (0x0000200020ef0000)<br class="">
       libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 (0x0000200020f70000)<br class="">
       libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 (0x0000200020fa0000)<br class="">
       libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 (0x00002000210b0000)<br class="">
       libXau.so.6 => /usr/lib64/libXau.so.6 (0x00002000211a0000)<br class="">
<br class="">
<br class="">
bash-4.2$ ldd ex_simple_slow<br class="">
       linux-vdso64.so.1 =>  (0x0000200000050000)<br class="">
       libpetsc.so.3.012 => /autofs/nccs-svm1_home1/hongzh/Projects/petsc/arch-olcf-summit-sell-opt/lib/libpetsc.so.3.012 (0x0000200000070000)<br class="">
       liblapack.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/liblapack.so.0 (0x0000200002be0000)<br class="">
       libblas.so.0 => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libblas.so.0 (0x0000200003520000)<br class="">
       libhdf5hl_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5hl_fortran.so.100 (0x00002000039f0000)<br class="">
       libhdf5_fortran.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_fortran.so.100 (0x0000200003a40000)<br class="">
       libhdf5_hl.so.100 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5_hl.so.100 (0x0000200003ac0000)<br class="">
       libhdf5.so.103 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/hdf5-1.10.3-pgiul2yf4auv7krecd72t6vupd7e3qgn/lib/libhdf5.so.103 (0x0000200003b20000)<br class="">
       libX11.so.6 => /usr/lib64/libX11.so.6 (0x0000200004150000)<br class="">
       libcufft.so.10 => /sw/summit/cuda/10.1.168/lib64/libcufft.so.10 (0x00002000042e0000)<br class="">
       libcublas.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublas.so.10 (0x000020000c670000)<br class="">
       libcudart.so.10.1 => /sw/summit/cuda/10.1.168/lib64/libcudart.so.10.1 (0x00002000104c0000)<br class="">
       libcusparse.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusparse.so.10 (0x0000200010560000)<br class="">
       libcusolver.so.10 => /sw/summit/cuda/10.1.168/lib64/libcusolver.so.10 (0x0000200015ac0000)<br class="">
       libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002000207b0000)<br class="">
       libdl.so.2 => /usr/lib64/libdl.so.2 (0x0000200020940000)<br class="">
       libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x0000200020970000)<br class="">
       libmpiprofilesupport.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpiprofilesupport.so.3 (0x00002000209b0000)<br class="">
       libmpi_ibm_usempi.so => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_usempi.so (0x00002000209e0000)<br class="">
       libmpi_ibm_mpifh.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm_mpifh.so.3 (0x0000200020a10000)<br class="">
       libmpi_ibm.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libmpi_ibm.so.3 (0x0000200020ab0000)<br class="">
       libpgf90rtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90rtl.so (0x0000200020c20000)<br class="">
       libpgf90.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90.so (0x0000200020c60000)<br class="">
       libpgf90_rpm1.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf90_rpm1.so (0x0000200021210000)<br class="">
       libpgf902.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgf902.so (0x0000200021240000)<br class="">
       libpgftnrtl.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgftnrtl.so (0x0000200021270000)<br class="">
       libatomic.so.1 => /usr/lib64/libatomic.so.1 (0x00002000212a0000)<br class="">
       libpgkomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgkomp.so (0x00002000212d0000)<br class="">
       libomp.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomp.so (0x0000200021300000)<br class="">
       libomptarget.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libomptarget.so (0x00002000213f0000)<br class="">
       libpgmath.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgmath.so (0x0000200021420000)<br class="">
       libpgc.so => /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/libpgc.so (0x0000200021540000)<br class="">
       librt.so.1 => /usr/lib64/librt.so.1 (0x00002000216b0000)<br class="">
       libm.so.6 => /usr/lib64/libm.so.6 (0x00002000216e0000)<br class="">
       libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00002000217d0000)<br class="">
       libc.so.6 => /usr/lib64/libc.so.6 (0x0000200021810000)<br class="">
       libz.so.1 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/zlib-1.2.11-2htm7ws4hgrthi5tyjnqxtjxgpfklxsc/lib/libz.so.1 (0x0000200021a10000)<br class="">
       libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x0000200021a60000)<br class="">
       /lib64/ld64.so.2 (0x0000200000000000)<br class="">
       libcublasLt.so.10 => /sw/summit/cuda/10.1.168/lib64/libcublasLt.so.10 (0x0000200021ab0000)<br class="">
       libutil.so.1 => /usr/lib64/libutil.so.1 (0x00002000239c0000)<br class="">
       libhwloc_ompi.so.15 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libhwloc_ompi.so.15 (0x00002000239f0000)<br class="">
       libevent-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent-2.1.so.6 (0x0000200023a60000)<br class="">
       libevent_pthreads-2.1.so.6 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libevent_pthreads-2.1.so.6 (0x0000200023ae0000)<br class="">
       libopen-rte.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-rte.so.3 (0x0000200023b10000)<br class="">
       libopen-pal.so.3 => /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib/libopen-pal.so.3 (0x0000200023c20000)<br class="">
       libXau.so.6 => /usr/lib64/libXau.so.6 (0x0000200023d10000)<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 7, 2020, at 2:31 PM, Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank" class="">bsmith@mcs.anl.gov</a>> wrote:<br class="">
<br class="">
<br class="">
 Run ldd on the executables from both linkings of your code.<br class="">
<br class="">
My guess is that without PETSc it is linking the static versions of the needed libraries, and with PETSc the shared ones. And, in typical fashion, the shared libraries are off on some super slow file system, so they take a long time to be loaded and linked in on demand.<br class="">
<br class="">
 Still a performance bug on Summit.<br class="">
<br class="">
 Barry<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Feb 7, 2020, at 12:23 PM, Zhang, Hong via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank" class="">petsc-dev@mcs.anl.gov</a>> wrote:<br class="">
<br class="">
Hi all,<br class="">
<br class="">
I previously noticed that the first call to a CUDA function such as cudaMalloc or cudaFree in PETSc takes a long time (7.5 seconds) on Summit. I then prepared a simple example, shown below, to help OLCF reproduce the problem. It turned out that the problem was caused by PETSc: the 7.5-second overhead is observed only when the PETSc library is linked. If I do not link PETSc, it runs normally. Does anyone have any idea why this happens and how to fix it?<br class="">
<br class="">
Hong (Mr.)<br class="">
<br class="">
bash-4.2$ cat ex_simple.c<br class="">
#include <time.h><br class="">
#include <cuda_runtime.h><br class="">
#include <stdio.h><br class="">
<br class="">
int main(int argc,char **args)<br class="">
{<br class="">
clock_t start,s1,s2,s3;<br class="">
double  cputime;<br class="">
double   *init,tmp[100] = {0};<br class="">
<br class="">
start = clock();<br class="">
cudaFree(0);<br class="">
s1 = clock();<br class="">
cudaMalloc((void **)&init,100*sizeof(double));<br class="">
s2 = clock();<br class="">
cudaMemcpy(init,tmp,100*sizeof(double),cudaMemcpyHostToDevice);<br class="">
s3 = clock();<br class="">
printf("free time =%lf malloc time =%lf copy time =%lf\n",((double) (s1 - start)) / CLOCKS_PER_SEC,((double) (s2 - s1)) / CLOCKS_PER_SEC,((double) (s3 - s2)) / CLOCKS_PER_SEC);<br class="">
<br class="">
return 0;<br class="">
}<br class="">
<br class="">
<br class="">
</blockquote>
<br class="">
</blockquote>
<br class="">
</blockquote>
<br class="">
</blockquote>
<br class="">
<br class="">
<br class="">
--<span class="Apple-converted-space"> </span><br class="">
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br class="">
-- Norbert Wiener<br class="">
<br class="">
<a href="https://www.cse.buffalo.edu/~knepley/" target="_blank" class="">https://www.cse.buffalo.edu/~knepley/</a><br class="">
</blockquote>
<br class="">
</blockquote>
<br class="">
</blockquote>
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</div>
<br clear="all" class="">
<div class=""><br class="">
</div>
--<span class="Apple-converted-space"> </span><br class="">
<div dir="ltr" class="gmail_signature">
<div dir="ltr" class="">
<div class="">
<div dir="ltr" class="">
<div class="">
<div dir="ltr" class="">
<div class="">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br class="">
-- Norbert Wiener</div>
<div class=""><br class="">
</div>
<div class=""><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank" class="">https://www.cse.buffalo.edu/~knepley/</a></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</body>
</html>