[petsc-dev] MatMult on Summit
Matthew Knepley
knepley at gmail.com
Sun Sep 22 16:48:42 CDT 2019
On Sun, Sep 22, 2019 at 12:15 PM Zhang, Junchao via petsc-dev <
petsc-dev at mcs.anl.gov> wrote:
> I ran the STREAM test on Summit. I used the MPI version from PETSc, but
> greatly increased the array size N since one socket of Summit has 120 MB of
> L3 cache. I used the MPI version because it made it easy to distribute ranks
> evenly across the two sockets.
> The results match the data released by OLCF (see attached figure) and the
> data given by Jed. We can see the bandwidth saturates around 24 ranks.
>
Junchao, maybe you can explain this to me. I see that bandwidth saturates
at 24 cores, but it's only 9x that of 1 core.
Why does it take so long to get to saturation?
Thanks,
Matt
> #Ranks    Rate (MB/s)    Ratio over 2 ranks
> -------------------------------------------
>      2     59012.2834    1.00
>      4     70959.1475    1.20
>      6    106639.9837    1.81
>      8    138638.6929    2.35
>     10    171125.0873    2.90
>     12    196162.5197    3.32
>     14    215272.7810    3.65
>     16    229562.4040    3.89
>     18    242587.4913    4.11
>     20    251057.1731    4.25
>     22    258569.7794    4.38
>     24    265443.2924    4.50
>     26    266562.7872    4.52
>     28    267043.6367    4.53
>     30    266833.7212    4.52
>     32    267183.8474    4.53
>
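For reference, the "9x of 1 core" figure can be reproduced from the table above. This is only a sketch of the arithmetic: the table has no 1-rank entry, so it assumes the 2-rank run places one rank on each socket and that a single core therefore sustains about half of the 2-rank rate. The function names are made up for illustration.

```python
# STREAM rates in MB/s, copied from the table above (MPI ranks -> rate).
rates = {
    2: 59012.2834, 4: 70959.1475, 6: 106639.9837, 8: 138638.6929,
    10: 171125.0873, 12: 196162.5197, 14: 215272.7810, 16: 229562.4040,
    18: 242587.4913, 20: 251057.1731, 22: 258569.7794, 24: 265443.2924,
    26: 266562.7872, 28: 267043.6367, 30: 266833.7212, 32: 267183.8474,
}

# Assumption (not measured): the 2-rank run has one rank per socket, so a
# single core sustains about half of the 2-rank rate.
one_core_rate = rates[2] / 2


def speedup_over_one_core(ranks):
    """Aggregate bandwidth relative to the estimated single-core rate."""
    return rates[ranks] / one_core_rate


def per_rank_rate(ranks):
    """Average bandwidth each rank sees at a given rank count."""
    return rates[ranks] / ranks


for ranks in sorted(rates):
    print(f"{ranks:3d} ranks: {speedup_over_one_core(ranks):5.2f}x one core, "
          f"{per_rank_rate(ranks):9.1f} MB/s per rank")
```

Under that assumption, the 24-rank row comes out at roughly 9x the single-core estimate, while the average rate per rank drops to well under half of what one core gets alone.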
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
>
>>
>> Junchao could try the PETSc (and non-PETSc) streams tests on the
>> machine.
>>
>> There are a few differences (compiler, the reported results are with
>> OpenMP, a different number of cores), but yes, the performance is a bit
>> low. For DOE that is great, makes GPUs look better :-)
>>
>>
>> > On Sep 21, 2019, at 11:11 PM, Jed Brown <jed at jedbrown.org> wrote:
>> >
>> > For an AIJ matrix with 32-bit integers, this is 1 flop/6 bytes, or 165
>> > GB/s for the node for the best case (42 ranks).
>> >
>> > My understanding is that these systems have 8 channels of DDR4-2666 per
>> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
>> > system, and 270 GB/s STREAM Triad according to this post
>> >
>> >
>> https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
>> >
>> > Is this 60% of Triad the best we can get for SpMV?
>> >
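The numbers in this estimate can be checked with a few lines of arithmetic. The sketch below assumes the usual SpMV traffic model for AIJ (per nonzero, one 8-byte value plus one 4-byte column index is read for one multiply and one add, ignoring vector and row-pointer traffic) and takes the 27493 Mflop/s figure from the 42-rank MatMult row quoted later in this thread. The variable names are illustrative only.

```python
# Traffic model for AIJ (CSR) SpMV with 32-bit indices; vector and
# row-pointer traffic is ignored, which is an assumption of this sketch.
BYTES_PER_VALUE = 8      # double-precision matrix entry
BYTES_PER_INDEX = 4      # 32-bit column index
FLOPS_PER_NONZERO = 2    # one multiply and one add

bytes_per_flop = (BYTES_PER_VALUE + BYTES_PER_INDEX) / FLOPS_PER_NONZERO

# MatMult rate reported by -log_view for the 42-rank CPU run (Mflop/s).
matmult_mflops = 27493
spmv_gb_per_s = matmult_mflops * bytes_per_flop / 1000

# Theoretical DRAM bandwidth: 8 channels of DDR4-2666 per socket,
# 8 bytes per transfer, 2 sockets.
channels, mega_transfers_per_s, bytes_per_transfer, sockets = 8, 2666, 8, 2
peak_gb_per_s = channels * mega_transfers_per_s * bytes_per_transfer * sockets / 1000

# STREAM Triad figure cited from the linked post (GB/s).
triad_gb_per_s = 270
fraction_of_triad = spmv_gb_per_s / triad_gb_per_s

print(f"bytes per flop:      {bytes_per_flop:.1f}")
print(f"SpMV bandwidth:      {spmv_gb_per_s:.1f} GB/s")
print(f"theoretical peak:    {peak_gb_per_s:.1f} GB/s")
print(f"fraction of Triad:   {fraction_of_triad:.2f}")
```

With these inputs, the model gives 6 bytes per flop, about 165 GB/s of effective SpMV bandwidth, a theoretical peak of about 341 GB/s, and roughly 61% of the cited Triad figure.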
>> > "Zhang, Junchao via petsc-dev" <petsc-dev at mcs.anl.gov> writes:
>> >
>> >> 42 cores have better performance.
>> >>
>> >> 36 MPI ranks
>> >> MatMult           100   1.0 2.2435e+00   1.0 1.75e+09   1.3 2.9e+04 4.5e+04 0.0e+00    6  99  97  28   0  100 100 100 100   0   25145       0     0 0.00e+00     0 0.00e+00   0
>> >> VecScatterBegin   100   1.0 2.1869e-02   3.3 0.00e+00   0.0 2.9e+04 4.5e+04 0.0e+00    0   0  97  28   0    1   0 100 100   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >> VecScatterEnd     100   1.0 7.9205e-01  52.6 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    1   0   0   0   0   22   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>
>> >> --Junchao Zhang
>> >>
>> >>
>> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>> >>
>> >> Junchao,
>> >>
>> >> Mark has a good point; could you also try for completeness the CPU
>> with 36 cores and see if it is any better than the 42 core case?
>> >>
>> >> Barry
>> >>
>> >> So extrapolating about 20 nodes of the CPUs is equivalent to 1 node
>> of the GPUs for the multiply for this problem size.
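The "about 20 nodes" figure is consistent with the MatMult timings quoted elsewhere in this thread. The sketch below is only an illustration: which runs the estimate was based on is an assumption here (the 42-rank CPU run against the two fastest GPU runs), and it assumes the CPU runs would scale perfectly across nodes, which the thread does not measure.

```python
# MatMult wall times in seconds for 100 multiplies, copied from the
# -log_view rows quoted in this thread.
cpu_42_ranks_s = 2.0519e+00

gpu_times_s = {
    "6 ranks + 6 GPUs, CUDA-aware SF": 1.1112e-01,
    "24 ranks + 6 GPUs, regular SF": 1.1094e-01,
}


def cpu_nodes_per_gpu_node(gpu_s, cpu_s=cpu_42_ranks_s):
    """CPU-only nodes needed to match one GPU node.

    Assumes ideal strong scaling across CPU nodes (hypothetical).
    """
    return cpu_s / gpu_s


for label, gpu_s in gpu_times_s.items():
    print(f"{label}: {cpu_nodes_per_gpu_node(gpu_s):.1f} CPU nodes")
```

Both GPU configurations give a ratio between 18 and 19, which rounds to the quoted figure.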
>> >>
>> >>> On Sep 21, 2019, at 6:40 PM, Mark Adams <mfadams at lbl.gov> wrote:
>> >>>
>> >>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is
>> pretty saturated at that point.
>> >>>
>> >>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>> >>> Here are the CPU-version results on one node with 24 cores and 42 cores.
>> >>> Click the links for the core layout.
>> >>>
>> >>> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>> >>> MatMult           100   1.0 3.1431e+00   1.0 2.63e+09   1.2 1.9e+04 5.9e+04 0.0e+00    8  99  97  25   0  100 100 100 100   0   17948       0     0 0.00e+00     0 0.00e+00   0
>> >>> VecScatterBegin   100   1.0 2.0583e-02   2.3 0.00e+00   0.0 1.9e+04 5.9e+04 0.0e+00    0   0  97  25   0    0   0 100 100   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>> VecScatterEnd     100   1.0 1.0639e+00  50.0 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    2   0   0   0   0   19   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>
>> >>> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
>> >>> MatMult           100   1.0 2.0519e+00   1.0 1.52e+09   1.3 3.5e+04 4.1e+04 0.0e+00   23  99  97  30   0  100 100 100 100   0   27493       0     0 0.00e+00     0 0.00e+00   0
>> >>> VecScatterBegin   100   1.0 2.0971e-02   3.4 0.00e+00   0.0 3.5e+04 4.1e+04 0.0e+00    0   0  97  30   0    1   0 100 100   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>> VecScatterEnd     100   1.0 8.5184e-01  62.0 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    6   0   0   0   0   24   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>
>> >>> --Junchao Zhang
>> >>>
>> >>>
>> >>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>> >>>
>> >>> Junchao,
>> >>>
>> >>> Very interesting. For completeness please run also 24 and 42 CPUs
>> without the GPUs. Note that the default layout for CPU cores is not good.
>> You will want 3 cores on each socket then 12 on each.
>> >>>
>> >>> Thanks
>> >>>
>> >>> Barry
>> >>>
>> >>> Since Tim is one of our reviewers next week this is a very good test
>> matrix :-)
>> >>>
>> >>>
>> >>>> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>> >>>>
>> >>>> Click the links to visualize it.
>> >>>>
>> >>>> 6 ranks
>> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
>> >>>> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> >>>>
>> >>>> 24 ranks
>> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>> >>>> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> >>>>
>> >>>> --Junchao Zhang
>> >>>>
>> >>>>
>> >>>> On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>> >>>> Junchao,
>> >>>>
>> >>>> Can you share your 'jsrun' command so that we can see how you are
>> mapping things to resource sets?
>> >>>>
>> >>>> --Richard
>> >>>>
>> >>>> On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
>> >>>>> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix
>> >>>>> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100
>> >>>>> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I
>> >>>>> found that MatMult was almost dominated by VecScatter in this simple test.
>> >>>>> Using 6 MPI ranks + 6 GPUs, I found that CUDA-aware SF could improve
>> >>>>> performance. But when I enabled Multi-Process Service on Summit and used
>> >>>>> 24 ranks + 6 GPUs, I found that CUDA-aware SF hurt performance. I don't
>> >>>>> know why and will have to profile it. I will also collect data with
>> >>>>> multiple nodes. Are the matrix and tests appropriate?
>> >>>>>
>> >>>>>
>> >>>>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >>>>> Event              Count       Time (sec)         Flop                                 --- Global ---       --- Stage ----      Total     GPU  - CpuToGpu -   - GpuToCpu -  GPU
>> >>>>>                   Max Ratio        Max Ratio      Max Ratio    Mess  AvgLen  Reduct   %T  %F  %M  %L  %R   %T  %F  %M  %L  %R Mflop/s Mflop/s Count     Size Count     Size  %F
>> >>>>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >>>>> 6 MPI ranks (CPU version)
>> >>>>> MatMult           100   1.0 1.1895e+01   1.0 9.63e+09   1.1 2.8e+03 2.2e+05 0.0e+00   24  99  97  18   0  100 100 100 100   0    4743       0     0 0.00e+00     0 0.00e+00   0
>> >>>>> VecScatterBegin   100   1.0 4.9145e-02   3.0 0.00e+00   0.0 2.8e+03 2.2e+05 0.0e+00    0   0  97  18   0    0   0 100 100   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>>> VecScatterEnd     100   1.0 2.9441e+00   133 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    3   0   0   0   0   13   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>>>
>> >>>>> 6 MPI ranks + 6 GPUs + regular SF
>> >>>>> MatMult           100   1.0 1.7800e-01   1.0 9.66e+09   1.1 2.8e+03 2.2e+05 0.0e+00    0  99  97  18   0  100 100 100 100   0  318057 3084009   100 1.02e+02   100 2.69e+02 100
>> >>>>> VecScatterBegin   100   1.0 1.2786e-01   1.3 0.00e+00   0.0 2.8e+03 2.2e+05 0.0e+00    0   0  97  18   0   64   0 100 100   0       0       0     0 0.00e+00   100 2.69e+02   0
>> >>>>> VecScatterEnd     100   1.0 6.2196e-02   3.0 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    0   0   0   0   0   22   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>>> VecCUDACopyTo     100   1.0 1.0850e-02   2.3 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    0   0   0   0   0    5   0   0   0   0       0       0   100 1.02e+02     0 0.00e+00   0
>> >>>>> VecCopyFromSome   100   1.0 1.0263e-01   1.2 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    0   0   0   0   0   54   0   0   0   0       0       0     0 0.00e+00   100 2.69e+02   0
>> >>>>>
>> >>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF
>> >>>>> MatMult           100   1.0 1.1112e-01   1.0 9.66e+09   1.1 2.8e+03 2.2e+05 0.0e+00    1  99  97  18   0  100 100 100 100   0  509496 3133521     0 0.00e+00     0 0.00e+00 100
>> >>>>> VecScatterBegin   100   1.0 7.9461e-02   1.1 0.00e+00   0.0 2.8e+03 2.2e+05 0.0e+00    1   0  97  18   0   70   0 100 100   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>>> VecScatterEnd     100   1.0 2.2805e-02   1.5 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    0   0   0   0   0   17   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>>>
>> >>>>> 24 MPI ranks + 6 GPUs + regular SF
>> >>>>> MatMult           100   1.0 1.1094e-01   1.0 2.63e+09   1.2 1.9e+04 5.9e+04 0.0e+00    1  99  97  25   0  100 100 100 100   0  510337  951558   100 4.61e+01   100 6.72e+01 100
>> >>>>> VecScatterBegin   100   1.0 4.8966e-02   1.8 0.00e+00   0.0 1.9e+04 5.9e+04 0.0e+00    0   0  97  25   0   34   0 100 100   0       0       0     0 0.00e+00   100 6.72e+01   0
>> >>>>> VecScatterEnd     100   1.0 7.2969e-02   4.9 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    1   0   0   0   0   42   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>>> VecCUDACopyTo     100   1.0 4.4487e-03   1.8 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    0   0   0   0   0    3   0   0   0   0       0       0   100 4.61e+01     0 0.00e+00   0
>> >>>>> VecCopyFromSome   100   1.0 4.3315e-02   1.9 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    0   0   0   0   0   29   0   0   0   0       0       0     0 0.00e+00   100 6.72e+01   0
>> >>>>>
>> >>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF
>> >>>>> MatMult           100   1.0 1.4597e-01   1.2 2.63e+09   1.2 1.9e+04 5.9e+04 0.0e+00    1  99  97  25   0  100 100 100 100   0  387864  973391     0 0.00e+00     0 0.00e+00 100
>> >>>>> VecScatterBegin   100   1.0 6.4899e-02   2.9 0.00e+00   0.0 1.9e+04 5.9e+04 0.0e+00    1   0  97  25   0   35   0 100 100   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>>> VecScatterEnd     100   1.0 1.1179e-01   4.1 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00    1   0   0   0   0   48   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>> >>>>>
>> >>>>>
>> >>>>> --Junchao Zhang
>> >>>>
>> >>>
>>
>>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/