<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

</head>

<body>

<div dir="ltr">The figure did not clearly say all cores share L3.  Instead, we should look at p.16 of <a href="https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf">https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf</a>

<div><br>

</div>

<div>"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, and an interconnection system that connects all components within the chip at 7 TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 embedded DRAM (eDRAM)."<br>

</div>

<div>

<div>

<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">

<div dir="ltr">--Junchao Zhang</div>

</div>

</div>

<br>

</div>

</div>

<br>

<div class="gmail_quote">

<div dir="ltr" class="gmail_attr">On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov">petsc-dev@mcs.anl.gov</a>> wrote:<br>

</div>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div bgcolor="#FFFFFF">L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF output from a Summit compute node to see an illustration of the node layout.<br>

<br>

--Richard<br>

<br>

<div>On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:<br>

</div>

<blockquote type="cite">

<div dir="ltr">

<div><font face="arial, sans-serif">I also ran the OpenMP stream test and found a mismatch between the OpenMP and MPI results. That reminded me of a subtle issue on Summit: each pair of cores shares an L2 cache, so one has to place MPI ranks on different pairs to get the best bandwidth.

 See the different bindings</font></div>

<div><a href="https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0=" target="_blank">https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0=</a> and <a href="https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=" target="_blank">https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=</a>.

 Note that each socket has 21 cores; I assume that means 11 pairs. The new results are below. They match what I got from OpenMP. The bandwidth almost doubles from 1 to 2 cores per socket. The IBM document also says each socket has two memory controllers. I could not find the core-to-memory-controller affinity info. I tried different bindings but did not find a huge difference.</div>

<div>  </div>

<div><font face="monospace">#Ranks  Rate (MB/s)    Ratio over 2 ranks<br>

1         29229.8       -<br>

2         59091.0      1.0<br>

4        112260.7      1.9<br>

6        159852.8      2.7<br>

8        194351.7      3.3<br>

10       215841.0      3.7<br>

12       232316.6      3.9<br>

14       244615.7      4.1<br>

16       254450.8      4.3<br>

18       262185.7      4.4<br>

20       267181.0      4.5<br>

22       270290.4      4.6<br>

<font color="#ff0000">24       221944.9      3.8<br>

26       238302.8      4.0</font></font><font color="#ff0000"><br>

</font></div>
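<div>The "Ratio over 2 ranks" column above can be checked with a short sketch. This is only an illustration in Python, not part of the PETSc streams benchmark; the name <code>rates</code> is an assumption, and the values are copied from the table.</div>

<pre>
```python
# Rates in MB/s, copied from the table above (key = number of MPI ranks).
rates = {
    1: 29229.8, 2: 59091.0, 4: 112260.7, 6: 159852.8, 8: 194351.7,
    10: 215841.0, 12: 232316.6, 14: 244615.7, 16: 254450.8,
    18: 262185.7, 20: 267181.0, 22: 270290.4, 24: 221944.9, 26: 238302.8,
}

# The baseline is the 2-rank run, i.e. one rank on each socket.
base = rates[2]
ratios = {n: round(rate / base, 1) for n, rate in rates.items()}

for n, ratio in ratios.items():
    print(f"{n:2d} ranks: {rates[n]:9.1f} MB/s, ratio {ratio}")
```
</pre>

<div>The rate peaks at 22 ranks and drops for the two red rows (24 and 26 ranks).</div>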

<div><br>

</div>

<div><br>

</div>

<div>

<div dir="ltr">

<div dir="ltr">--Junchao Zhang</div>

</div>

</div>

<br>

</div>

<br>

<div class="gmail_quote">

<div dir="ltr" class="gmail_attr">On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>

</div>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

  Junchao,<br>

<br>

     For completeness, could you please also run with a single core? But keep the ratio relative to 2 ranks, as you have it, since that is the correct model.<br>

<br>

   Thanks<br>

<br>

     Barry<br>

<br>

<br>

> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>> wrote:<br>

> <br>

> I ran the STREAM test on Summit. I used the MPI version from PETSc but greatly increased the array size N, since one socket of Summit has 120 MB of L3 cache. I used the MPI version because it made it easy to distribute ranks evenly across the two sockets.

<br>

> The result matches the data released by OLCF (see attached figure) and the data given by Jed. We can see the bandwidth saturates around 24 ranks.<br>

> <br>

> #Ranks     Rate (MB/s)     Ratio over 2 ranks<br>

> ------------------------------------------<br>

> 2          59012.2834        1.00<br>

> 4          70959.1475        1.20<br>

> 6         106639.9837        1.81<br>

> 8         138638.6929        2.35<br>

> 10        171125.0873        2.90<br>

> 12        196162.5197        3.32<br>

> 14        215272.7810        3.65<br>

> 16        229562.4040        3.89<br>

> 18        242587.4913        4.11<br>

> 20        251057.1731        4.25<br>

> 22        258569.7794        4.38<br>

> 24        265443.2924        4.50<br>

> 26        266562.7872        4.52<br>

> 28        267043.6367        4.53<br>

> 30        266833.7212        4.52<br>

> 32        267183.8474        4.53<br>

> <br>

> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>> wrote:<br>

> <br>

>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine. <br>

> <br>

>   There are a few differences: the compiler, the reported results being with OpenMP, and a different number of cores. But yes, the performance is a bit low. For DOE that is great, makes GPUs look better :-)<br>

> <br>

> <br>

> > On Sep 21, 2019, at 11:11 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>> wrote:<br>

> > <br>

> > For an AIJ matrix with 32-bit integers, this is 1 flop per 6 bytes, or 165<br>

> > GB/s for the node for the best case (42 ranks).<br>

> > <br>

> > My understanding is that these systems have 8 channels of DDR4-2666 per<br>

> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket<br>

> > system, and 270 GB/s STREAM Triad according to this post<br>

> > <br>

> >  <a href="https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/" rel="noreferrer" target="_blank">

https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/</a><br>

> > <br>

> > Is this 60% of Triad the best we can get for SpMV?<br>

> > <br>
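<div>The arithmetic in the message above can be spelled out as a rough sketch. It assumes 12 bytes of matrix data per nonzero (one 8-byte value plus one 4-byte column index) and 2 flops per nonzero, ignoring vector and row-pointer traffic, and it takes the 27493 Mflop/s MatMult rate from the 42-rank log quoted below. All variable names are illustrative.</div>

<pre>
```python
# Assumed AIJ traffic model: 8-byte value + 4-byte column index per nonzero,
# and one multiply plus one add per nonzero.
bytes_per_nonzero = 8 + 4
flops_per_nonzero = 2
bytes_per_flop = bytes_per_nonzero / flops_per_nonzero  # 6 bytes per flop

# MatMult rate for 42 MPI ranks, from the -log_view output in this thread.
matmult_mflops = 27493
spmv_gbs = matmult_mflops * bytes_per_flop / 1000  # MB/s to GB/s

# Theoretical DDR4-2666 peak: 2666e6 transfers/s, 8 bytes per transfer
# per channel, 8 channels per socket, 2 sockets.
peak_gbs = 2666e6 * 8 * 8 * 2 / 1e9

# STREAM Triad figure quoted in the message.
triad_gbs = 270
fraction_of_triad = spmv_gbs / triad_gbs

print(f"SpMV effective bandwidth: {spmv_gbs:.0f} GB/s")
print(f"Theoretical peak:         {peak_gbs:.0f} GB/s")
print(f"Fraction of Triad:        {fraction_of_triad:.0%}")
```
</pre>

<div>With these inputs the sketch gives about 165 GB/s for SpMV, about 341 GB/s of theoretical peak, and about 61% of Triad, in line with the figures in the message.</div>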

> > "Zhang, Junchao via petsc-dev" <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>> writes:<br>

> > <br>

> >> 42 cores have better performance.<br>

> >> <br>

> >> 36 MPI ranks<br>

> >> MatMult              100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 0.0e+00  6 99 97 28  0 100100100100  0 25145       0      0 0.00e+00    0 0.00e+00  0<br>

> >> VecScatterBegin      100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 0.0e+00  0  0 97 28  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >> VecScatterEnd        100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >> <br>

> >> --Junchao Zhang<br>

> >> <br>

> >> <br>

> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a><mailto:<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>>> wrote:<br>

> >> <br>

> >>  Junchao,<br>

> >> <br>

> >>    Mark has a good point; could you also try for completeness the CPU with 36 cores and see if it is any better than the 42 core case?<br>

> >> <br>

> >>  Barry<br>

> >> <br>

> >>  So, extrapolating, about 20 nodes of the CPUs are equivalent to 1 node of the GPUs for the multiply for this problem size.<br>

> >> <br>
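<div>The "about 20 nodes" extrapolation can be reproduced from the MatMult times in the logs quoted below. This is a rough sketch that assumes the 42-rank CPU run and the fastest GPU run are the right pair to compare, and that the CPU version would scale perfectly across nodes; the variable names are illustrative.</div>

<pre>
```python
# MatMult max time in seconds for 100 multiplies, from the quoted -log_view output.
cpu_42_ranks = 2.0519e+00
gpu_times = {
    "6 ranks + 6 GPUs + regular SF": 1.7800e-01,
    "6 ranks + 6 GPUs + CUDA-aware SF": 1.1112e-01,
    "24 ranks + 6 GPUs + regular SF": 1.1094e-01,
    "24 ranks + 6 GPUs + CUDA-aware SF": 1.4597e-01,
}

# Pick the fastest GPU configuration and compare it with the best CPU run.
best_label = min(gpu_times, key=gpu_times.get)
cpu_nodes_per_gpu_node = cpu_42_ranks / gpu_times[best_label]

print(f"Fastest GPU run: {best_label}")
print(f"CPU nodes per GPU node: {cpu_nodes_per_gpu_node:.1f}")
```
</pre>

<div>The ratio comes out between 18 and 19, which is consistent with the "about 20" estimate.</div>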

> >>> On Sep 21, 2019, at 6:40 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a><mailto:<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>>> wrote:<br>

> >>> <br>

> >>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty saturated at that point.<br>

> >>> <br>

> >>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a><mailto:<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>>> wrote:<br>

> >>> Here are the CPU version results on one node with 24 cores and with 42 cores. Click the links for the core layout.<br>

> >>> <br>

> >>> 24 MPI ranks, <a href="https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=" rel="noreferrer" target="_blank">

https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=</a><br>

> >>> MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0<br>

> >>> VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>> VecScatterEnd        100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>> <br>

> >>> 42 MPI ranks, <a href="https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=" rel="noreferrer" target="_blank">

https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=</a><br>

> >>> MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0<br>

> >>> VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>> VecScatterEnd        100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>> <br>

> >>> --Junchao Zhang<br>

> >>> <br>

> >>> <br>

> >>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a><mailto:<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>>> wrote:<br>

> >>> <br>

> >>>  Junchao,<br>

> >>> <br>

> >>>   Very interesting. For completeness please run also 24 and 42 CPUs without the GPUs. Note that the default layout for CPU cores is not good. You will want 3 cores on each socket then 12 on each.<br>

> >>> <br>

> >>>  Thanks<br>

> >>> <br>

> >>>   Barry<br>

> >>> <br>

> >>>  Since Tim is one of our reviewers next week this is a very good test matrix :-)<br>

> >>> <br>

> >>> <br>

> >>>> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a><mailto:<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>>> wrote:<br>

> >>>> <br>

> >>>> Click the links to visualize it.<br>

> >>>> <br>

> >>>> 6 ranks<br>

> >>>> <a href="https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=" rel="noreferrer" target="_blank">

https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=</a><br>

> >>>> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view<br>

> >>>> <br>

> >>>> 24 ranks<br>

> >>>> <a href="https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=" rel="noreferrer" target="_blank">

https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=</a><br>

> >>>> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view<br>

> >>>> <br>

> >>>> --Junchao Zhang<br>

> >>>> <br>

> >>>> <br>

> >>>> On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a><mailto:<a href="mailto:petsc-dev@mcs.anl.gov" target="_blank">petsc-dev@mcs.anl.gov</a>>> wrote:<br>

> >>>> Junchao,<br>

> >>>> <br>

> >>>> Can you share your 'jsrun' command so that we can see how you are mapping things to resource sets?<br>

> >>>> <br>

> >>>> --Richard<br>

> >>>> <br>

> >>>> On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:<br>

> >>>>> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. I then ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. MatMult was largely dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found that CUDA-aware SF improved performance. But when I enabled Multi-Process Service on Summit and used 24 ranks + 6 GPUs, CUDA-aware SF hurt performance. I don't know why and will have to profile it. I will also collect data with multiple nodes. Are the matrix and tests appropriate?<br>

> >>>>> <br>

> >>>>> ------------------------------------------------------------------------------------------------------------------------<br>

> >>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU<br>

> >>>>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F<br>

> >>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------<br>

> >>>>> 6 MPI ranks (CPU version)<br>

> >>>>> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0<br>

> >>>>> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>>>> VecScatterEnd        100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>>>> <br>

> >>>>> 6 MPI ranks + 6 GPUs + regular SF<br>

> >>>>> MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 2.69e+02 100<br>

> >>>>> VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0     0       0      0 0.00e+00  100 2.69e+02  0<br>

> >>>>> VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>>>> VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0<br>

> >>>>> VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0     0       0      0 0.00e+00  100 2.69e+02  0<br>

> >>>>> <br>

> >>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF<br>

> >>>>> MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00    0 0.00e+00 100<br>

> >>>>> VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>>>> VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>>>> <br>

> >>>>> 24 MPI ranks + 6 GPUs + regular SF<br>

> >>>>> MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 6.72e+01 100<br>

> >>>>> VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0     0       0      0 0.00e+00  100 6.72e+01  0<br>

> >>>>> VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>>>> VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01    0 0.00e+00  0<br>

> >>>>> VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0     0       0      0 0.00e+00  100 6.72e+01  0<br>

> >>>>> <br>

> >>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF<br>

> >>>>> MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864   973391    0 0.00e+00    0 0.00e+00 100<br>

> >>>>> VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>>>> VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0<br>

> >>>>> <br>

> >>>>> <br>

> >>>>> --Junchao Zhang<br>

> >>>> <br>

> >>> <br>

> <br>

> <SummitNode.png><br>

<br>

</blockquote>

</div>

</blockquote>

<br>

</div>

</blockquote>

</div>

</body>

</html>