[petsc-dev] MatMult on Summit

Mills, Richard Tran rtmills at anl.gov
Mon Sep 23 13:43:22 CDT 2019


OK, I wrote to the OLCF Consultants and they told me that

* Yes, the jsrun Visualizer numberings correspond to the 'lstopo' ones.

and, from this I can conclude that

* If I ask for 6 resource sets, each with 1 core and 1 GPU, then some of the cores in different resource sets will share L2/L3 cache.

* For the above case, in which I want 6 MPI ranks that don't share anything, I need to ask for 6 resource sets, each with *2 cores* and 1 GPU. When I ask for 2 cores, each resource set consists of the 2 cores that share an L2/L3, so this is how you get resource sets that don't share L2/L3 between them.
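For concreteness, the layout I have in mind would look roughly like this on the jsrun command line (a sketch only, adapted from the jsrun invocations later in this thread; -n is the number of resource sets, -a the MPI tasks per set, -c the cores per set, -g the GPUs per set, -r the sets per node, and <executable> stands for whatever you are running; I believe --bind packed:2 gives each rank both cores of its set):

jsrun -n 6 -a 1 -c 2 -g 1 -r 6 --bind packed:2 <executable>

i.e., 6 resource sets per node, each with one rank bound to a 2-core block (one L2/L3 pair) and its own GPU.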

--Richard

On 9/23/19 11:10 AM, Mills, Richard Tran wrote:
To further muddy the waters, the OLCF Summit User Guide (https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide) states that

"The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The processor provides 22 SMCs with separate 32kB L1 data and instruction caches. Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache."

And there is some funny stuff in that lstopo output. On the first socket, I see one "SMC" that doesn't share L2/L3 with any other. This may be because it actually shares them with a "service" core that is hidden from jsrun. But why are there three such SMCs on the second socket?!

I've written to the OLCF Consultants to see if they can provide any clarification on this. In particular, I want to know if the jsrun Visualizer hardware thread and core numberings correspond to the lstopo ones. I think that's the only way to tell if we are getting cores that don't share L2/L3 resources or not.
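(Another empirical check, I think, would be to print the actual binding at run time with js_task_info, as in the jsrun commands later in this thread, e.g.

jsrun -n 6 -a 1 -c 2 -g 1 -r 6 js_task_info

and compare the hardware threads it reports for each rank against the lstopo numbering, assuming js_task_info indeed reports the hardware-thread and GPU assignment per task.)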

--Richard


On 9/23/19 10:58 AM, Zhang, Junchao wrote:
The figure does not clearly show that all cores share L3. Instead, we should look at p. 16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, and an interconnection system that connects all components within the chip at 7 TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF output from a Summit compute node for an illustration of the node layout.
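(If anyone wants to regenerate that output themselves, something along these lines should work from a batch job, assuming hwloc/lstopo is available on the compute nodes, e.g. via a module:

jsrun -n 1 lstopo summit-node.pdf

lstopo infers the output format from the file name extension; summit-node.pdf is just an example name.)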

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did an OpenMP stream test and found a mismatch between the OpenMP and MPI results. That reminded me of a subtle issue on Summit: pairs of cores share an L2 cache. One has to place MPI ranks on different pairs to get the best bandwidth. See the different bindings in
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each socket shows 21 cores (out of 22 SMCs), which I assume means 11 pairs. The new results are below (a jsrun sketch of the two bindings follows the table). They match what I got with OpenMP. The bandwidth almost doubles from 1 to 2 cores per socket. The IBM documentation also says each socket has two memory controllers, but I could not find core-to-memory-controller affinity information; I tried different bindings and did not see a big difference.

#Ranks  Rate (MB/s)    Ratio over 2 ranks
1         29229.8       -
2         59091.0      1.0
4        112260.7      1.9
6        159852.8      2.7
8        194351.7      3.3
10       215841.0      3.7
12       232316.6      3.9
14       244615.7      4.1
16       254450.8      4.3
18       262185.7      4.4
20       267181.0      4.5
22       270290.4      4.6
24       221944.9      3.8
26       238302.8      4.0
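For reference, my reading is that the two visualizer links above differ only in the binding width, roughly like this (a sketch, with ./stream standing in for the benchmark executable and 6 ranks per socket as an illustration):

jsrun -n 2 -a 6 -c 21 -g 3 -r 2 --bind packed:1 ./stream
jsrun -n 2 -a 6 -c 21 -g 3 -r 2 --bind packed:2 ./stream

With packed:1 each rank is bound to a single core, so neighboring ranks can land on the same L2/L3 pair; with packed:2 each rank gets a 2-core block, i.e., a pair to itself, so consecutive ranks go to different pairs.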


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:

  Junchao,

     For completeness, could you please also run with a single core? But leave the ratio normalized to 2 ranks, as you have it, since that is the correct model.

   Thanks

     Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>
> I did a stream test on Summit. I used the MPI version from PETSc, but greatly increased the array size N since one socket of Summit has 120 MB of L3 cache. I used the MPI version since it was easy for me to distribute ranks evenly across the two sockets.
> The results match the data released by OLCF (see attached figure) and the data given by Jed. We can see the bandwidth saturates at around 24 ranks.
>
> #Ranks     Rate (MB/s)     Ratio over 2 ranks
> ------------------------------------------
> 2          59012.2834        1.00
> 4          70959.1475        1.20
> 6         106639.9837        1.81
> 8         138638.6929        2.35
> 10        171125.0873        2.90
> 12        196162.5197        3.32
> 14        215272.7810        3.65
> 16        229562.4040        3.89
> 18        242587.4913        4.11
> 20        251057.1731        4.25
> 22        258569.7794        4.38
> 24        265443.2924        4.50
> 26        266562.7872        4.52
> 28        267043.6367        4.53
> 30        266833.7212        4.52
> 32        267183.8474        4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
> >   There are a few differences (compiler, the reported results are with OpenMP, a different number of cores), but yes, the performance is a bit low. For DOE that is great; it makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown <jed at jedbrown.org> wrote:
> >
> > > For an AIJ matrix with 32-bit integers, this is 1 flop per 6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
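(Presumably the arithmetic there is: per nonzero, an AIJ MatMult does 2 flops (a multiply and an add) and reads at least 12 bytes, an 8-byte double value plus a 4-byte column index, i.e. 2 flops / 12 bytes = 1 flop per 6 bytes; the best MatMult rate quoted below, 27493 Mflop/s with 42 ranks, then corresponds to roughly 27.5 GF/s * 6 bytes/flop, about 165 GB/s.)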
> >
> > "Zhang, Junchao via petsc-dev" <petsc-dev at mcs.anl.gov<mailto:petsc-dev at mcs.anl.gov>> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult              100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 0.0e+00  6 99 97 28  0 100100100100  0 25145       0      0 0.00e+00    0 0.00e+00  0
> >> VecScatterBegin      100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 0.0e+00  0  0 97 28  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> VecScatterEnd        100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>
> >> --Junchao Zhang
> >>
> >>
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> >>
> >>  Junchao,
> >>
> >>    Mark has a good point; could you also try for completeness the CPU with 36 cores and see if it is any better than the 42 core case?
> >>
> >>  Barry
> >>
> >>  So, extrapolating, about 20 nodes of CPUs are equivalent to 1 node of GPUs for the multiply at this problem size.
> >>
> >>> On Sep 21, 2019, at 6:40 PM, Mark Adams <mfadams at lbl.gov> wrote:
> >>>
> >>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty saturated at that point.
> >>>
> >>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> >>> Here are the CPU-version results on one node with 24 and 42 cores. Click the links for the core layouts.
> >>>
> >>> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> >>> MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0
> >>> VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>> VecScatterEnd        100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>
> >>> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> >>> MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0
> >>> VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>> VecScatterEnd        100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>
> >>> --Junchao Zhang
> >>>
> >>>
> >>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> >>>
> >>>  Junchao,
> >>>
> >>>   Very interesting. For completeness, please also run 24 and 42 CPU ranks without the GPUs. Note that the default layout for CPU cores is not good: you will want 3 cores on each socket, then 12 on each.
> >>>
> >>>  Thanks
> >>>
> >>>   Barry
> >>>
> >>>  Since Tim is one of our reviewers next week this is a very good test matrix :-)
> >>>
> >>>
> >>>> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> >>>>
> >>>> Click the links to visualize it.
> >>>>
> >>>> 6 ranks
> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> >>>> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >>>>
> >>>> 24 ranks
> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> >>>> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >>>>
> >>>> --Junchao Zhang
> >>>>
> >>>>
> >>>> On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> >>>> Junchao,
> >>>>
> >>>> Can you share your 'jsrun' command so that we can see how you are mapping things to resource sets?
> >>>>
> >>>> --Richard
> >>>>
> >>>> On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> >>>>> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was almost dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve performance. But if I enabled the Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I found CUDA-aware SF hurt performance. I don't know why and will have to profile it. I will also collect data with multiple nodes. Are the matrix and the tests appropriate?
> >>>>>
> >>>>> ------------------------------------------------------------------------------------------------------------------------
> >>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
> >>>>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> >>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>> 6 MPI ranks (CPU version)
> >>>>> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterEnd        100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>>>
> >>>>> 6 MPI ranks + 6 GPUs + regular SF
> >>>>> MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 2.69e+02 100
> >>>>> VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0     0       0      0 0.00e+00  100 2.69e+02  0
> >>>>> VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0
> >>>>> VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0     0       0      0 0.00e+00  100 2.69e+02  0
> >>>>>
> >>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> >>>>> MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00    0 0.00e+00 100
> >>>>> VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>>>
> >>>>> 24 MPI ranks + 6 GPUs + regular SF
> >>>>> MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 6.72e+01 100
> >>>>> VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0     0       0      0 0.00e+00  100 6.72e+01  0
> >>>>> VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01    0 0.00e+00  0
> >>>>> VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0     0       0      0 0.00e+00  100 6.72e+01  0
> >>>>>
> >>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF
> >>>>> MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864   973391    0 0.00e+00    0 0.00e+00 100
> >>>>> VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>>> VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >>>>>
> >>>>>
> >>>>> --Junchao Zhang
> >>>>
> >>>
>
> <SummitNode.png>



