<div class="">On Dec 22, 2020, at 3:38 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" class="">mfadams@lbl.gov</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" class="">I am MPI serial LU solving a smallish matrix (2D, Q3, 8K equations) on a Summit node (42 P9 cores, 6 V100 GPUs) using cuSparse and Kokkos kernels. The cuSparse performance is terrible.
<div class=""><br class="">
</div>
<div class="">I solve the same TS problem in MPI serial on each global process. I run with NP=1 or (all) 7 cores/MPI per GPU:
<div class=""><font face="monospace" class="">MatLUFactorNum time, using all 6 GPUs:<br class="">
</font></div>
<div class=""><font face="monospace" class="">NP/GPU cuSparse Kokkos kernels <br class="">
</font></div>
<div class=""><font face="monospace" class="">1 0.12 0.075</font></div>
<div class=""><font face="monospace" class="">7 0.55 0.072 // some noise here</font></div>
<div class="">
<div class="">
<div class="">So cuSparse is about 2x slower on one process and 8x slower when using all the cores, from memory contention I assume.</div>
<div class=""><br class="">
</div>
<div class="">I found that the problem is in MatSeqAIJCUSPARSEBuildILULower[Upper]TriMatrix. Most of this excess time is in:</div>
<div class=""><br class="">
</div>
<div class=""><font face="monospace" class=""> cerr = cudaMallocHost((void**) &AALo, nzLower*sizeof(PetscScalar));CHKERRCUDA(cerr);</font></div>
</div>
</div>
<div class=""><br class="">
</div>
<div class="">and</div>
<div class=""><br class="">
</div>
<div class=""><font face="monospace" class=""> cerr = cudaFreeHost(AALo);CHKERRCUDA(cerr);<br class="">
</font></div>
<div class=""><br class="">
</div>
<div class="">
<div class="">nzLower is about 140K. Here is my timer data, in a stage after a "warm up stage":<br class="">
</div>
<div class=""><br class="">
</div>
<div class=""> Inner-MatSeqAIJCUSPARSEBuildILULowerTriMatrix 12 1.0 <b class="">
2.3514e-01 </b>1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 23 0 0 0 0 0 0 12 1.34e+01 0 0.00e+00 0<br class="">
MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaMallocHost 12 1.0 <b class="">
1.5448e-01</b> 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 15 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
MatSeqAIJCUSPARSEBuildILULowerTriMatrix: cudaFreeHost 12 1.0 <b class="">
8.3908e-02 </b>1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 8 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0</div>
</div>
</div>
</div>
</div>

Allocation and freeing of pinned memory is slow, usually on the order of several
milliseconds per call, so these numbers look normal. Is there any opportunity to
reuse the pinned memory in these functions?
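
Something along these lines might work. This is only a sketch of the reuse
pattern, and the struct and function names below are hypothetical, not the actual
MatSeqAIJCUSPARSE members: cache the pinned staging buffer and its capacity with
the factor, reallocate only when the required size grows, and free it once when
the matrix is destroyed.

  /* Sketch of a pinned-buffer reuse pattern (hypothetical names, not the
     actual PETSc struct members). */
  #include <cuda_runtime.h>
  #include <stddef.h>

  typedef struct {
    double *AALo;      /* cached pinned staging buffer (double stands in for PetscScalar) */
    size_t  capacity;  /* current capacity, in entries                                     */
  } PinnedCache;

  /* Return a pinned buffer of at least n entries, reusing the cached one
     whenever it is already large enough. */
  static cudaError_t GetPinnedBuffer(PinnedCache *c, size_t n, double **buf)
  {
    cudaError_t cerr = cudaSuccess;
    if (n > c->capacity) {
      if (c->AALo) {
        cerr = cudaFreeHost(c->AALo);
        if (cerr != cudaSuccess) return cerr;
      }
      cerr = cudaMallocHost((void **)&c->AALo, n * sizeof(double));
      if (cerr != cudaSuccess) return cerr;
      c->capacity = n;
    }
    *buf = c->AALo;
    return cerr;
  }

  /* Free the cached buffer once, e.g. in the matrix destroy routine. */
  static cudaError_t DestroyPinnedCache(PinnedCache *c)
  {
    cudaError_t cerr = c->AALo ? cudaFreeHost(c->AALo) : cudaSuccess;
    c->AALo     = NULL;
    c->capacity = 0;
    return cerr;
  }

The build routines would then call something like GetPinnedBuffer(&cache, nzLower,
&AALo) instead of the cudaMallocHost/cudaFreeHost pair, so the ~0.23 sec you see
per factorization would only be paid the first time (or when the pattern grows).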
<div><br class="">
</div>
<div>Hong (Mr.) </div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div dir="ltr" class="">
<div class="">
<div class="">
<div class="">This 0.23 sec happens in Upper also, for a total of ~0.46, which pretty much matches the difference with Kokkos.</div>
<div class=""><br class="">
</div>
<div class="">Any ideas?</div>
</div>
</div>
<div class=""><br class="">
</div>
<div class="">Thanks,</div>
<div class="">Mark</div>
</div>
</div>
</blockquote>
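
For reference, a standalone check like the one below (just a sketch, nothing
PETSc-specific; the buffer size and call count are taken from the numbers above)
is a quick way to confirm the per-call cost of cudaMallocHost/cudaFreeHost on a
Summit node:

  /* Standalone timing check (not PETSc code): time cudaMallocHost/cudaFreeHost
     for a buffer of roughly nzLower (~140K) PetscScalar-sized entries. */
  #include <cuda_runtime.h>
  #include <stdio.h>
  #include <time.h>

  int main(void)
  {
    const size_t n    = 140000;   /* ~nzLower entries                   */
    const int    reps = 12;       /* same call count as in the log view */
    double *buf = NULL;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
      if (cudaMallocHost((void **)&buf, n * sizeof(double)) != cudaSuccess) return 1;
      if (cudaFreeHost(buf) != cudaSuccess) return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("cudaMallocHost + cudaFreeHost: %.3f ms per pair\n", 1e3 * sec / reps);
    return 0;
  }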