<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body>


<div dir="ltr">




<div class="gmail_quote">


<div dir="ltr" class="gmail_attr">On Sat, Sep 21, 2019 at 11:08 PM Karl Rupp via petsc-dev &lt;<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>&gt; wrote:<br>


</div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


Hi Junchao,<br>


<br>


thanks, these numbers are interesting.<br>


<br>


Do you have an easy way to evaluate the benefits of a CUDA-aware MPI vs. <br>


a non-CUDA-aware MPI that still keeps the benefits of your <br>


packing/unpacking routines?<br>


<br>


I'd like to get a feeling of where the performance gains come from. Is <br>


it due to the reduced PCI-Express transfer for the scatters (i.e. <br>


packing/unpacking and transferring only the relevant entries) on each <br>


rank, or is it some low-level optimization that makes the MPI-part of <br>


the communication faster? Your current MR includes both; it would be <br>


helpful to know whether we can extract similar benefits for other GPU <br>


backends without having to require "CUDA-awareness" of MPI. If the <br>


benefits are mostly due to the packing/unpacking, we could carry over <br>


the benefits to other GPU backends (e.g. upcoming Intel GPUs) without <br>


having to wait for an "Intel-GPU-aware MPI".<br>


<br>


</blockquote>


<div>Your argument is fair. I will add this support later. Besides the performance benefit, GPU-aware MPI can simplify users' code, which is why I think all vendors will converge on it.</div>


<div>This post, <a href="https://devblogs.nvidia.com/introduction-cuda-aware-mpi/">https://devblogs.nvidia.com/introduction-cuda-aware-mpi/</a>, has a detailed explanation of CUDA-aware MPI. In short, it avoids CPU involvement and redundant memory copies.</div>
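<div>To make the two data paths concrete, here is a minimal sketch that lists the steps of each one and counts the explicit host staging copies in application code. It is only a model: the step descriptions are hypothetical labels, not PETSc or MPI calls, and the "staged" path is the variant Karl asks about (GPU packing kept, MPI given host buffers), which I am assuming would look like this.</div>

```python
# Illustrative model only: step names are hypothetical labels, not real API calls.
# Each step is (description, kind), where kind is "gpu", "pcie" or "mpi".

# Assumed shape of a non-CUDA-aware path that still packs/unpacks on the GPU.
STAGED_PATH = [
    ("pack send entries on GPU", "gpu"),
    ("copy packed buffer from GPU to host", "pcie"),
    ("MPI send/recv using host buffers", "mpi"),
    ("copy received buffer from host to GPU", "pcie"),
    ("unpack on GPU", "gpu"),
]

# CUDA-aware path: device pointers are handed to MPI directly.
CUDA_AWARE_PATH = [
    ("pack send entries on GPU", "gpu"),
    ("MPI send/recv using device buffers", "mpi"),
    ("unpack on GPU", "gpu"),
]


def count_steps(path, kind):
    """Count the steps of a given kind in a path."""
    return sum(1 for _, step_kind in path if step_kind == kind)


staged_copies = count_steps(STAGED_PATH, "pcie")
aware_copies = count_steps(CUDA_AWARE_PATH, "pcie")
print("explicit host staging copies, staged path:", staged_copies)
print("explicit host staging copies, CUDA-aware path:", aware_copies)
```

<div>The model says nothing about what the MPI library does internally; it only shows which copies disappear from the application side.</div>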


<div> </div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


Best regards,<br>


Karli<br>


<br>


<br>


On 9/21/19 6:22 AM, Zhang, Junchao via petsc-dev wrote:<br>


> I downloaded a sparse matrix (HV15R <br>


> &lt;<a href="https://sparse.tamu.edu/Fluorem/HV15R" rel="noreferrer" target="_blank">https://sparse.tamu.edu/Fluorem/HV15R</a>&gt;) from Florida Sparse Matrix Collection


<br>


> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 <br>


> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I <br>


> found MatMult was almost dominated by VecScatter in this simple test. <br>


> Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve <br>


> performance. But if I enabled Multi-Process Service on Summit and used <br>


> 24 ranks + 6 GPUs, I found CUDA aware SF hurt performance. I don't know <br>


> why and have to profile it. I will also collect  data with multiple <br>


> nodes. Are the matrix and tests proper?<br>


> <br>


> ------------------------------------------------------------------------------------------------------------------------<br>


> Event                Count      Time (sec)     Flop                     <br>


>           --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   <br>


> - GpuToCpu - GPU<br>


>                     Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen <br>


>   Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   <br>


> Count   Size  %F<br>


> ---------------------------------------------------------------------------------------------------------------------------------------------------------------<br>


> 6 MPI ranks (CPU version)<br>


> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 <br>


> 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00   <br>


>   0 0.00e+00  0<br>


> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 <br>


> 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00   <br>


>   0 0.00e+00  0<br>


> VecScatterEnd        100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 <br>


> 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00   <br>


>   0 0.00e+00  0<br>


> <br>


> 6 MPI ranks + 6 GPUs + regular SF<br>


> MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 <br>


> 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02 <br>


>   100 2.69e+02 100<br>


> VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 <br>


> 0.0e+00  0  0 97 18  0  64  0100100  0     0       0      0 0.00e+00 <br>


>   100 2.69e+02  0<br>


> VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 <br>


> 0.0e+00  0  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00   <br>


>   0 0.00e+00  0<br>


> VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 <br>


> 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0    100 1.02e+02   <br>


>   0 0.00e+00  0<br>


> VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 <br>


> 0.0e+00  0  0  0  0  0  54  0  0  0  0     0       0      0 0.00e+00 <br>


>   100 2.69e+02  0<br>


> <br>


> 6 MPI ranks + 6 GPUs + CUDA-aware SF<br>


> MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 <br>


> 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00   <br>


>   0 0.00e+00 100<br>


> VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 <br>


> 0.0e+00  1  0 97 18  0  70  0100100  0     0       0      0 0.00e+00   <br>


>   0 0.00e+00  0<br>


> VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 <br>


> 0.0e+00  0  0  0  0  0  17  0  0  0  0     0       0      0 0.00e+00   <br>


>   0 0.00e+00  0<br>


> <br>


> 24 MPI ranks + 6 GPUs + regular SF<br>


> MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 <br>


> 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01 <br>


>   100 6.72e+01 100<br>


> VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 <br>


> 0.0e+00  0  0 97 25  0  34  0100100  0     0       0      0 0.00e+00 <br>


>   100 6.72e+01  0<br>


> VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 <br>


> 0.0e+00  1  0  0  0  0  42  0  0  0  0     0       0      0 0.00e+00   <br>


>   0 0.00e+00  0<br>


> VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 <br>


> 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01   <br>


>   0 0.00e+00  0<br>


> VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 <br>


> 0.0e+00  0  0  0  0  0  29  0  0  0  0     0       0      0 0.00e+00 <br>


>   100 6.72e+01  0<br>


> <br>


> 24 MPI ranks + 6 GPUs + CUDA-aware SF<br>


> MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 <br>


> 0.0e+00  1 99 97 25  0 100100100100  0 387864   973391    0 0.00e+00   <br>


>   0 0.00e+00 100<br>


> VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 <br>


> 0.0e+00  1  0 97 25  0  35  0100100  0     0       0      0 0.00e+00   <br>


>   0 0.00e+00  0<br>


> VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 <br>


> 0.0e+00  1  0  0  0  0  48  0  0  0  0     0       0      0 0.00e+00   <br>


>   0 0.00e+00  0<br>


> <br>


> <br>


> --Junchao Zhang<br>


</blockquote>
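<div>As a rough reading of the MatMult rows quoted above, the ratio of regular-SF time to CUDA-aware-SF time can be computed directly. The sketch below only copies the Max times (seconds, 100 calls) from the log; the dictionary and function names are made up for illustration.</div>

```python
# MatMult Max times in seconds for 100 calls, copied from the quoted log.
# The keys are hypothetical labels chosen for this sketch.
matmult_time = {
    ("6 ranks + 6 GPUs", "regular SF"): 1.7800e-01,
    ("6 ranks + 6 GPUs", "CUDA-aware SF"): 1.1112e-01,
    ("24 ranks + 6 GPUs", "regular SF"): 1.1094e-01,
    ("24 ranks + 6 GPUs", "CUDA-aware SF"): 1.4597e-01,
}


def speedup(config):
    """Ratio of regular-SF time to CUDA-aware-SF time for one configuration."""
    regular = matmult_time[(config, "regular SF")]
    cuda_aware = matmult_time[(config, "CUDA-aware SF")]
    return regular / cuda_aware


for config in ("6 ranks + 6 GPUs", "24 ranks + 6 GPUs"):
    print(f"{config}: regular / CUDA-aware = {speedup(config):.2f}")
```

<div>A ratio above 1 means CUDA-aware SF was faster (the 6-rank case); a ratio below 1 means it was slower (the 24-rank case with MPS).</div>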


</div>


</div>


</body>


</html>