[petsc-dev] MatMult on Summit
Matthew Knepley
knepley at gmail.com
Mon Sep 23 11:10:12 CDT 2019
Thanks Junchao. That is very clear and helpful.
Matt
On Mon, Sep 23, 2019 at 12:01 PM Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> I also ran an OpenMP stream test and found a mismatch between the OpenMP
> and MPI results. That reminded me of a subtle issue on Summit: each pair of
> cores shares an L2 cache, so one has to place MPI ranks on different pairs
> to get the best bandwidth. See the different bindings at
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note that
> each socket has 21 usable cores, which I assume means 11 pairs. The new
> results are below; they match what I got with OpenMP. The bandwidth almost
> doubles going from 1 to 2 cores per socket. The IBM documentation also says
> each socket has two memory controllers, but I could not find the
> core-to-memory-controller affinity info. I tried different bindings and did
> not find a huge difference.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> ------------------------------------------
> 1 29229.8 -
> 2 59091.0 1.0
> 4 112260.7 1.9
> 6 159852.8 2.7
> 8 194351.7 3.3
> 10 215841.0 3.7
> 12 232316.6 3.9
> 14 244615.7 4.1
> 16 254450.8 4.3
> 18 262185.7 4.4
> 20 267181.0 4.5
> 22 270290.4 4.6
> 24 221944.9 3.8
> 26 238302.8 4.0
>
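> For reference, one way to keep ranks on separate L2 pairs is to give each
> rank two cores in the binding; something like the following (a sketch
> reconstructed from the second visualizer link above, not the exact command
> I ran; ./stream is a placeholder executable):
>
>   jsrun -n 2 -a 2 -c 4 -r 2 -b packed:2 ./stream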
>
> --Junchao Zhang
>
>
> On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
>
>>
>> Junchao,
>>
>> For completeness could you please run with a single core? But leave
>> the ratio relative to 2 ranks as you have it, since that is the correct model.
>>
>> Thanks
>>
>> Barry
>>
>>
>> > On Sep 22, 2019, at 11:14 AM, Zhang, Junchao <jczhang at mcs.anl.gov>
>> wrote:
>> >
>> > I did the stream test on Summit. I used the MPI version from PETSc, but
>> largely increased the array size N, since one socket of Summit has a 120MB
>> L3 cache. I used the MPI version because it was easy for me to distribute
>> ranks evenly across the two sockets.
>> > The result matches the data released by OLCF (see the attached figure)
>> and the data given by Jed. We can see the bandwidth saturates around 24
>> ranks.
>> >
>> > #Ranks Rate (MB/s) Ratio over 2 ranks
>> > ------------------------------------------
>> > 2 59012.2834 1.00
>> > 4 70959.1475 1.20
>> > 6 106639.9837 1.81
>> > 8 138638.6929 2.35
>> > 10 171125.0873 2.90
>> > 12 196162.5197 3.32
>> > 14 215272.7810 3.65
>> > 16 229562.4040 3.89
>> > 18 242587.4913 4.11
>> > 20 251057.1731 4.25
>> > 22 258569.7794 4.38
>> > 24 265443.2924 4.50
>> > 26 266562.7872 4.52
>> > 28 267043.6367 4.53
>> > 30 266833.7212 4.52
>> > 32 267183.8474 4.53
>> >
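>> > For reference, the timed kernel is essentially the STREAM triad. A
>> > minimal MPI sketch of the test (not the exact benchmark source that
>> > ships with PETSc under src/benchmarks/streams) looks like:
>> >
>> > #include <mpi.h>
>> > #include <stdio.h>
>> > #include <stdlib.h>
>> >
>> > int main(int argc, char **argv)
>> > {
>> >   /* N chosen so the three arrays together overflow the 120MB L3 */
>> >   const long N = 64*1024*1024;
>> >   double *a = malloc(N*sizeof(double));
>> >   double *b = malloc(N*sizeof(double));
>> >   double *c = malloc(N*sizeof(double));
>> >   double scalar = 3.0, t;
>> >   int    rank, size;
>> >   long   j;
>> >
>> >   MPI_Init(&argc, &argv);
>> >   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >   MPI_Comm_size(MPI_COMM_WORLD, &size);
>> >   for (j = 0; j < N; j++) { b[j] = 2.0; c[j] = 1.0; }
>> >   MPI_Barrier(MPI_COMM_WORLD);
>> >   t = MPI_Wtime();
>> >   for (j = 0; j < N; j++) a[j] = b[j] + scalar*c[j]; /* triad: 24 bytes moved per iteration */
>> >   MPI_Barrier(MPI_COMM_WORLD);
>> >   t = MPI_Wtime() - t;
>> >   if (!rank) printf("Rate (MB/s) %.1f\n", 1.0e-6*size*3.0*sizeof(double)*N/t);
>> >   free(a); free(b); free(c);
>> >   MPI_Finalize();
>> >   return 0;
>> > }
>> >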
>> > On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. <bsmith at mcs.anl.gov>
>> wrote:
>> >
>> > Junchao could try the PETSc (and non-PETSc) streams tests on the
>> machine.
>> >
>> > There are a few differences (compiler, the reported results are with
>> OpenMP, a different number of cores), but yes, the performance is a bit
>> low. For DOE that is great; it makes GPUs look better :-)
>> >
>> >
>> > > On Sep 21, 2019, at 11:11 PM, Jed Brown <jed at jedbrown.org> wrote:
>> > >
>> > > For an AIJ matrix with 32-bit integers, this is 1 flop per 6 bytes, or
>> > > 165 GB/s for the node in the best case (42 ranks).
>> > >
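>> > > (Arithmetic: per nonzero an AIJ SpMV does 2 flops against 8 bytes of
>> > > matrix value plus 4 bytes of column index, i.e. 1 flop per 6 bytes; the
>> > > 42-rank MatMult rate of 27493 Mflop/s times 6 bytes/flop is about 165
>> > > GB/s.)
>> > >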
>> > > My understanding is that these systems have 8 channels of DDR4-2666
>> per
>> > > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
>> > > system, and 270 GB/s STREAM Triad according to this post
>> > >
>> > >
>> https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
>> > >
>> > > Is this 60% of Triad the best we can get for SpMV?
>> > >
>> > > "Zhang, Junchao via petsc-dev" <petsc-dev at mcs.anl.gov> writes:
>> > >
>> > >> 42 cores have better performance.
>> > >>
>> > >> 36 MPI ranks
>> > >> MatMult              100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 0.0e+00  6 99 97 28  0 100100100100  0  25145       0      0 0.00e+00    0 0.00e+00  0
>> > >> VecScatterBegin      100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 0.0e+00  0  0 97 28  0   1  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>> > >> VecScatterEnd        100 1.0 7.9205e-01 52.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  22  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> > >>
>> > >> --Junchao Zhang
>> > >>
>> > >>
>> > >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>> > >>
>> > >> Junchao,
>> > >>
>> > >> Mark has a good point; could you also try for completeness the
>> CPU with 36 cores and see if it is any better than the 42 core case?
>> > >>
>> > >> Barry
>> > >>
>> > >> So, extrapolating, about 20 nodes of the CPUs are equivalent to 1 node
>> of the GPUs for the multiply at this problem size (509496 vs. 27493 Mflop/s).
>> > >>
>> > >>> On Sep 21, 2019, at 6:40 PM, Mark Adams <mfadams at lbl.gov> wrote:
>> > >>>
>> > >>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is
>> pretty saturated at that point.
>> > >>>
>> > >>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <
>> petsc-dev at mcs.anl.gov<mailto:petsc-dev at mcs.anl.gov>> wrote:
>> pretty saturated at that point.
>> > >>>
>> > >>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>> > >>> Here are the CPU-version results on one node with 24 and 42 cores.
>> Click the links for the core layouts.
>> > >>>
>> > >>> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>> > >>> MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0  17948       0      0 0.00e+00    0 0.00e+00  0
>> > >>> VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>> > >>> VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> > >>>
>> > >>> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
>> > >>> MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0  27493       0      0 0.00e+00    0 0.00e+00  0
>> > >>> VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>> > >>> VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> > >>>
>> > >>> --Junchao Zhang
>> > >>>
>> > >>>
>> > >>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>> > >>>
>> > >>> Junchao,
>> > >>>
>> > >>> Very interesting. For completeness please also run 24 and 42 CPU
>> ranks without the GPUs. Note that the default layout for CPU cores is not
>> good: you will want 3 cores on each socket, then 12 on each.
>> > >>>
>> > >>> Thanks
>> > >>>
>> > >>> Barry
>> > >>>
>> > >>> Since Tim is one of our reviewers next week, this is a very good
>> test matrix :-)
>> > >>>
>> > >>>
>> > >>>> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>> > >>>>
>> > >>>> Click the links to visualize it.
>> > >>>>
>> > >>>> 6 ranks
>> > >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
>> > >>>> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> > >>>>
>> > >>>> 24 ranks
>> > >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>> > >>>> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> > >>>>
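>> > >>>> (For readers unfamiliar with jsrun: -n is the number of resource
>> > >>>> sets, -a the tasks per set, -c the cores per set, -g the GPUs per
>> > >>>> set, and -r the sets per node, so -n 6 -a 4 above gives the 24 ranks.)
>> > >>>>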
>> > >>>> --Junchao Zhang
>> > >>>>
>> > >>>>
>> > >>>> On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>> > >>>> Junchao,
>> > >>>>
>> > >>>> Can you share your 'jsrun' command so that we can see how you are
>> mapping things to resource sets?
>> > >>>>
>> > >>>> --Richard
>> > >>>>
>> > >>>> On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
>> > >>>>> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix
>> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100
>> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I
>> found MatMult was almost dominated by VecScatter in this simple test. Using
>> 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve performance. But
>> if I enabled the Multi-Process Service (MPS) on Summit and used 24 ranks +
>> 6 GPUs, I found CUDA-aware SF hurt performance. I don't know why and will
>> have to profile it. I will also collect data with multiple nodes. Are the
>> matrix and tests proper?
>> > >>>>>
>> > >>>>> ------------------------------------------------------------------------------------------------------------------------
>> > >>>>> Event                Count      Time (sec)     Flop              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>> > >>>>>                         Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct   %T %F %M %L %R  %T %F %M %L %R  Mflop/s Mflop/s Count   Size   Count   Size  %F
>> > >>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> > >>>>> 6 MPI ranks (CPU version)
>> > >>>>> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00  0
>> > >>>>> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>> > >>>>> VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> > >>>>>
>> > >>>>> 6 MPI ranks + 6 GPUs + regular SF
>> > >>>>> MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057 3084009    100 1.02e+02  100 2.69e+02 100
>> > >>>>> VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0      0       0      0 0.00e+00  100 2.69e+02   0
>> > >>>>> VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
>> > >>>>> VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00   0
>> > >>>>> VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02   0
>> > >>>>>
>> > >>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF
>> > >>>>> MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496 3133521      0 0.00e+00    0 0.00e+00 100
>> > >>>>> VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0      0       0      0 0.00e+00    0 0.00e+00   0
>> > >>>>> VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
>> > >>>>>
>> > >>>>> 24 MPI ranks + 6 GPUs + regular SF
>> > >>>>> MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337  951558    100 4.61e+01  100 6.72e+01 100
>> > >>>>> VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0      0       0      0 0.00e+00  100 6.72e+01   0
>> > >>>>> VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
>> > >>>>> VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0    100 4.61e+01    0 0.00e+00   0
>> > >>>>> VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0      0       0      0 0.00e+00  100 6.72e+01   0
>> > >>>>>
>> > >>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF
>> > >>>>> MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864  973391      0 0.00e+00    0 0.00e+00 100
>> > >>>>> VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0      0       0      0 0.00e+00    0 0.00e+00   0
>> > >>>>> VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00   0
>> > >>>>>
>> > >>>>>
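>> > >>>>> For context, the difference between the two SF modes amounts to
>> > >>>>> whether MPI is handed the GPU buffer directly or a host copy of it.
>> > >>>>> A sketch (the function and buffer names are illustrative, not PETSc
>> > >>>>> internals):
>> > >>>>>
>> > >>>>> #include <mpi.h>
>> > >>>>> #include <cuda_runtime.h>
>> > >>>>>
>> > >>>>> void send_from_gpu(double *d_buf, double *h_buf, int n, int dest, int cuda_aware)
>> > >>>>> {
>> > >>>>>   if (cuda_aware) {
>> > >>>>>     /* CUDA-aware path: MPI reads the device buffer itself */
>> > >>>>>     MPI_Send(d_buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
>> > >>>>>   } else {
>> > >>>>>     /* regular path: stage through the host first, as in the
>> > >>>>>        VecCUDACopyTo/VecCopyFromSome entries above */
>> > >>>>>     cudaMemcpy(h_buf, d_buf, n*sizeof(double), cudaMemcpyDeviceToHost);
>> > >>>>>     MPI_Send(h_buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
>> > >>>>>   }
>> > >>>>> }
>> > >>>>>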
>> > >>>>> --Junchao Zhang
>> > >>>>
>> > >>>
>> >
>> > <SummitNode.png>
>>
>>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/