[petsc-dev] MatMult on Summit
Zhang, Junchao
jczhang at mcs.anl.gov
Sat Sep 21 22:17:29 CDT 2019
42 cores still perform better than 36 (2.05 s vs. 2.24 s for MatMult).
36 MPI ranks
MatMult 100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 0.0e+00 6 99 97 28 0 100 100 100 100 0 25145 0 0 0.00e+00 0 0.00e+00 0
VecScatterBegin 100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 0.0e+00 0 0 97 28 0 1 0 100 100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd 100 1.0 7.9205e-01 52.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 22 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
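
To put the three CPU runs side by side, here is a minimal Python sketch that computes speedup and parallel efficiency relative to the 24-rank run. The MatMult times are copied from the -log_view rows in this thread; the names `cpu_matmult_time`, `speedup`, and `parallel_efficiency` are made up for illustration and are not part of PETSc.

```python
# Max time (s) for 100 MatMults on one Summit node, CPU only,
# copied from the MatMult rows in this thread. Key = number of MPI ranks.
cpu_matmult_time = {
    24: 3.1431e+00,
    36: 2.2435e+00,
    42: 2.0519e+00,
}


def speedup(base_ranks, ranks, times=cpu_matmult_time):
    """How many times faster the `ranks` run is than the `base_ranks` run."""
    return times[base_ranks] / times[ranks]


def parallel_efficiency(base_ranks, ranks, times=cpu_matmult_time):
    """Speedup divided by the growth in rank count (1.0 would be ideal scaling)."""
    return speedup(base_ranks, ranks, times) / (ranks / base_ranks)


if __name__ == "__main__":
    for ranks in (36, 42):
        print(
            f"{ranks} ranks vs 24: "
            f"speedup {speedup(24, ranks):.2f}, "
            f"efficiency {parallel_efficiency(24, ranks):.2f}"
        )
```

Efficiency that drops as ranks are added would be consistent with Mark's remark that the memory bus is close to saturated around 36 cores per node, although three data points alone do not establish that.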
--Junchao Zhang
On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
Junchao,
Mark has a good point; for completeness, could you also try the CPU with 36 cores and see whether it is any better than the 42-core case?
Barry
So, extrapolating, about 20 nodes of CPUs are equivalent to 1 node of GPUs for the multiply at this problem size.
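
The arithmetic behind that estimate can be sketched as follows. This is a rough illustration in Python, using the MatMult times reported in this thread and ASSUMING perfect strong scaling across CPU nodes, which none of the single-node runs here actually test. The function name `cpu_nodes_per_gpu_node` is hypothetical.

```python
# Max time (s) for 100 MatMults on one Summit node, copied from this thread.
cpu_42_ranks = 2.0519e+00          # 42 MPI ranks, CPU only
gpu_6_ranks_cuda_sf = 1.1112e-01   # 6 MPI ranks + 6 GPUs + CUDA-aware SF
gpu_24_ranks_regular = 1.1094e-01  # 24 MPI ranks + 6 GPUs + regular SF


def cpu_nodes_per_gpu_node(cpu_time, gpu_time):
    """CPU nodes needed to match one GPU node.

    ASSUMPTION: time drops linearly with the number of CPU nodes
    (ideal strong scaling, no extra inter-node communication cost).
    """
    return cpu_time / gpu_time


if __name__ == "__main__":
    for label, gpu_time in (
        ("6 ranks, CUDA-aware SF", gpu_6_ranks_cuda_sf),
        ("24 ranks, regular SF", gpu_24_ranks_regular),
    ):
        ratio = cpu_nodes_per_gpu_node(cpu_42_ranks, gpu_time)
        print(f"{label}: about {ratio:.1f} CPU nodes per GPU node")
```

Both of the best GPU configurations give a ratio between 18 and 19 under this idealized assumption, which is where the "about 20" comes from.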
> On Sep 21, 2019, at 6:40 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty saturated at that point.
>
> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> Here are the CPU-version results on one node with 24 cores and with 42 cores. Click the links to see the core layout.
>
> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> MatMult 100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 8 99 97 25 0 100 100 100 100 0 17948 0 0 0.00e+00 0 0.00e+00 0
> VecScatterBegin 100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 0 0 100 100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd 100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 19 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> MatMult 100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30 0 100 100 100 100 0 27493 0 0 0.00e+00 0 0.00e+00 0
> VecScatterBegin 100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00 0 0 97 30 0 1 0 100 100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd 100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 6 0 0 0 0 24 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> --Junchao Zhang
>
>
> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
> Junchao,
>
> Very interesting. For completeness, please also run 24 and 42 CPU cores without the GPUs. Note that the default layout for CPU cores is not good. You will want 3 cores on each socket then 12 on each.
>
> Thanks
>
> Barry
>
> Since Tim is one of our reviewers next week this is a very good test matrix :-)
>
>
> > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> >
> > Click the links to visualize it.
> >
> > 6 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > 24 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > --Junchao Zhang
> >
> >
> > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
> > Junchao,
> >
> > Can you share your 'jsrun' command so that we can see how you are mapping things to resource sets?
> >
> > --Richard
> >
> > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> >> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. I then ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. In this simple test, MatMult was largely dominated by VecScatter. With 6 MPI ranks + 6 GPUs, CUDA-aware SF improved performance. But when I enabled Multi-Process Service (MPS) on Summit and used 24 ranks + 6 GPUs, CUDA-aware SF hurt performance. I don't know why yet and will have to profile it. I will also collect data with multiple nodes. Are the matrix and the tests appropriate?
> >>
> >> ------------------------------------------------------------------------------------------------------------------------
> >> Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU
> >> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F
> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> 6 MPI ranks (CPU version)
> >> MatMult 100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18 0 100 100 100 100 0 4743 0 0 0.00e+00 0 0.00e+00 0
> >> VecScatterBegin 100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 0 0 100 100 0 0 0 0 0.00e+00 0 0.00e+00 0
> >> VecScatterEnd 100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 13 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> >>
> >> 6 MPI ranks + 6 GPUs + regular SF
> >> MatMult 100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 0 99 97 18 0 100 100 100 100 0 318057 3084009 100 1.02e+02 100 2.69e+02 100
> >> VecScatterBegin 100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 64 0 100 100 0 0 0 0 0.00e+00 100 2.69e+02 0
> >> VecScatterEnd 100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 22 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> >> VecCUDACopyTo 100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 0 0 0 0 0 0 100 1.02e+02 0 0.00e+00 0
> >> VecCopyFromSome 100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 54 0 0 0 0 0 0 0 0.00e+00 100 2.69e+02 0
> >>
> >> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> >> MatMult 100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 1 99 97 18 0 100 100 100 100 0 509496 3133521 0 0.00e+00 0 0.00e+00 100
> >> VecScatterBegin 100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 1 0 97 18 0 70 0 100 100 0 0 0 0 0.00e+00 0 0.00e+00 0
> >> VecScatterEnd 100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 17 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> >>
> >> 24 MPI ranks + 6 GPUs + regular SF
> >> MatMult 100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100 100 100 100 0 510337 951558 100 4.61e+01 100 6.72e+01 100
> >> VecScatterBegin 100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 34 0 100 100 0 0 0 0 0.00e+00 100 6.72e+01 0
> >> VecScatterEnd 100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 42 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> >> VecCUDACopyTo 100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 100 4.61e+01 0 0.00e+00 0
> >> VecCopyFromSome 100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 29 0 0 0 0 0 0 0 0.00e+00 100 6.72e+01 0
> >>
> >> 24 MPI ranks + 6 GPUs + CUDA-aware SF
> >> MatMult 100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100 100 100 100 0 387864 973391 0 0.00e+00 0 0.00e+00 100
> >> VecScatterBegin 100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 1 0 97 25 0 35 0 100 100 0 0 0 0 0.00e+00 0 0.00e+00 0
> >> VecScatterEnd 100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 48 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> >>
> >>
> >> --Junchao Zhang
> >
>
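
For reference, the effect of CUDA-aware SF in the original message quoted above can be expressed as a simple ratio of MatMult times. The Python sketch below uses only the four GPU timings from that message; the names `gpu_matmult_time` and `cuda_aware_gain` are made up for illustration and do not correspond to anything in PETSc.

```python
# Max time (s) for 100 MatMults on one Summit node with 6 GPUs,
# copied from the quoted -log_view rows. Key = (MPI ranks, SF variant).
gpu_matmult_time = {
    (6, "regular"): 1.7800e-01,
    (6, "cuda-aware"): 1.1112e-01,
    (24, "regular"): 1.1094e-01,
    (24, "cuda-aware"): 1.4597e-01,
}


def cuda_aware_gain(ranks, times=gpu_matmult_time):
    """Regular-SF time divided by CUDA-aware-SF time.

    A value above 1 means CUDA-aware SF was faster at this rank count;
    a value below 1 means it was slower.
    """
    return times[(ranks, "regular")] / times[(ranks, "cuda-aware")]


if __name__ == "__main__":
    for ranks in (6, 24):
        print(f"{ranks} ranks: regular / CUDA-aware = {cuda_aware_gain(ranks):.2f}")
```

A ratio above 1 at 6 ranks and below 1 at 24 ranks restates the observation that CUDA-aware SF helped without MPS and hurt with it; the numbers say nothing about the cause, which still needs profiling.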