[petsc-dev] MatMult on Summit

Sun Sep 22 11:14:54 CDT 2019

I ran the STREAM test on Summit. I used the MPI version from PETSc, but greatly increased the array size N, since one socket of Summit has 120 MB of L3 cache. I chose the MPI version because it made it easy to distribute ranks evenly across the two sockets.
The results match the data released by OLCF (see attached figure) and the data given by Jed. The bandwidth saturates at around 24 ranks.

#Ranks     Rate (MB/s)     Ratio over 2 ranks
------------------------------------------
2          59012.2834        1.00
4          70959.1475        1.20
6         106639.9837        1.81
8         138638.6929        2.35
10        171125.0873        2.90
12        196162.5197        3.32
14        215272.7810        3.65
16        229562.4040        3.89
18        242587.4913        4.11
20        251057.1731        4.25
22        258569.7794        4.38
24        265443.2924        4.50
26        266562.7872        4.52
28        267043.6367        4.53
30        266833.7212        4.52
32        267183.8474        4.53
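
The ratio column and the saturation point can be rechecked from the rates above. This is only a minimal Python sketch: the function names are illustrative, and the 1% tolerance used to define "saturated" is an assumption made here, not part of the STREAM benchmark.

```python
# STREAM rates (MB/s) copied from the table above, keyed by MPI rank count.
rates = {
    2: 59012.2834, 4: 70959.1475, 6: 106639.9837, 8: 138638.6929,
    10: 171125.0873, 12: 196162.5197, 14: 215272.7810, 16: 229562.4040,
    18: 242587.4913, 20: 251057.1731, 22: 258569.7794, 24: 265443.2924,
    26: 266562.7872, 28: 267043.6367, 30: 266833.7212, 32: 267183.8474,
}


def ratios_over_baseline(rates, baseline_ranks=2):
    """Return each rate divided by the rate at `baseline_ranks`."""
    base = rates[baseline_ranks]
    return {n: r / base for n, r in rates.items()}


def saturation_ranks(rates, tolerance=0.01):
    """Smallest rank count whose rate is within `tolerance` of the peak rate.

    The tolerance is an assumed threshold, chosen only for illustration.
    """
    peak = max(rates.values())
    return min(n for n, r in rates.items() if r >= (1.0 - tolerance) * peak)


if __name__ == "__main__":
    for n, ratio in sorted(ratios_over_baseline(rates).items()):
        print(f"{n:2d} ranks: {rates[n]:12.4f} MB/s  ratio {ratio:.2f}")
    print("saturates at about", saturation_ranks(rates), "ranks")
```

With a looser or tighter tolerance the reported rank count moves, so the result should be read as "around 24" rather than as a sharp threshold.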

On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:

  Junchao could try the PETSc (and non-PETSc) streams tests on the machine.

  There are a few differences (the compiler, the reported results being with OpenMP, a different number of cores), but yes, the performance is a bit low. For DOE that is great, it makes GPUs look better :-)

> On Sep 21, 2019, at 11:11 PM, Jed Brown <jed at jedbrown.org> wrote:
>
> For an AIJ matrix with 32-bit integers, this is 1 flop/6 bytes, or 165
> GB/s for the node in the best case (42 ranks).
>
> My understanding is that these systems have 8 channels of DDR4-2666 per
> socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> system, and 270 GB/s STREAM Triad according to this post
>
>  https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
>
> Is this 60% of Triad the best we can get for SpMV?
>
> "Zhang, Junchao via petsc-dev" <petsc-dev at mcs.anl.gov> writes:
>
>> 42 cores have better performance.
>>
>> 36 MPI ranks
>> MatMult              100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 0.0e+00  6 99 97 28  0 100100100100  0 25145       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterBegin      100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 0.0e+00  0  0 97 28  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd        100 1.0 7.9205e-01 52.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>
>> --Junchao Zhang
>>
>>
>> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>
>>  Junchao,
>>
>>    Mark has a good point; could you also try for completeness the CPU with 36 cores and see if it is any better than the 42 core case?
>>
>>  Barry
>>
>>  So, extrapolating, about 20 CPU-only nodes are equivalent to 1 GPU node for the multiply at this problem size.
>>
>>> On Sep 21, 2019, at 6:40 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty saturated at that point.
>>>
>>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>> Here are the CPU-version results on one node with 24 cores and with 42 cores. Click the links for the core layout.
>>>
>>> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>>> MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0
>>> VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>
>>> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
>>> MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0
>>> VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>>> VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>>
>>>  Junchao,
>>>
>>>   Very interesting. For completeness, please also run 24 and 42 CPUs without the GPUs. Note that the default layout for CPU cores is not good. You will want 3 cores on each socket then 12 on each.
>>>
>>>  Thanks
>>>
>>>   Barry
>>>
>>>  Since Tim is one of our reviewers next week this is a very good test matrix :-)
>>>
>>>
>>>> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>
>>>> Click the links to visualize it.
>>>>
>>>> 6 ranks
>>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
>>>> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>>>>
>>>> 24 ranks
>>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>>>> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>>>>
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>> Junchao,
>>>>
>>>> Can you share your 'jsrun' command so that we can see how you are mapping things to resource sets?
>>>>
>>>> --Richard
>>>>
>>>> On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
>>>>> I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. I then ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found that MatMult was largely dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found that CUDA-aware SF could improve performance. But when I enabled Multi-Process Service on Summit and used 24 ranks + 6 GPUs, CUDA-aware SF hurt performance. I don't know why and will have to profile it. I will also collect data with multiple nodes. Are the matrix and tests appropriate?
>>>>>
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>> 6 MPI ranks (CPU version)
>>>>> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + regular SF
>>>>> MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 2.69e+02 100
>>>>> VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0     0       0      0 0.00e+00  100 2.69e+02  0
>>>>> VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0
>>>>> VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0     0       0      0 0.00e+00  100 2.69e+02  0
>>>>>
>>>>> 6 MPI ranks + 6 GPUs + CUDA-aware SF
>>>>> MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00    0 0.00e+00 100
>>>>> VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + regular SF
>>>>> MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 6.72e+01 100
>>>>> VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0     0       0      0 0.00e+00  100 6.72e+01  0
>>>>> VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01    0 0.00e+00  0
>>>>> VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0     0       0      0 0.00e+00  100 6.72e+01  0
>>>>>
>>>>> 24 MPI ranks + 6 GPUs + CUDA-aware SF
>>>>> MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864   973391    0 0.00e+00    0 0.00e+00 100
>>>>> VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>>
>>>>>
>>>>> --Junchao Zhang
>>>>
>>>

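Two figures from the quoted thread above can be rechecked from the log output: Jed's estimate of about 165 GB/s for SpMV at 42 ranks, and Barry's "about 20 nodes" extrapolation. The following is a minimal Python sketch under stated assumptions: the 6 bytes/flop figure is taken from Jed's message (it corresponds to an 8-byte value plus a 4-byte column index per nonzero for 2 flops, ignoring vector and row-pointer traffic), the 270 GB/s Triad number is the one he cites, and the function names are illustrative.

```python
# Values copied from the -log_view output quoted above.
MATMULT_MFLOPS = {24: 17948, 36: 25145, 42: 27493}  # CPU-only MatMult, Mflop/s
CPU_42_RANKS_TIME = 2.0519e+00      # seconds, 100 MatMults, 42 MPI ranks (CPU)
GPU_6_RANKS_CUDA_SF_TIME = 1.1112e-01  # seconds, 6 ranks + 6 GPUs, CUDA-aware SF

BYTES_PER_FLOP = 6.0      # assumption from the thread: AIJ with 32-bit indices
STREAM_TRIAD_GBS = 270.0  # Triad figure cited in the thread


def spmv_bandwidth_gbs(mflops, bytes_per_flop=BYTES_PER_FLOP):
    """Memory bandwidth (GB/s) implied by a flop rate at a fixed byte/flop cost."""
    return mflops * 1.0e6 * bytes_per_flop / 1.0e9


def fraction_of_triad(mflops, triad_gbs=STREAM_TRIAD_GBS):
    """Implied SpMV bandwidth as a fraction of the STREAM Triad rate."""
    return spmv_bandwidth_gbs(mflops) / triad_gbs


if __name__ == "__main__":
    for ranks, mflops in sorted(MATMULT_MFLOPS.items()):
        print(f"{ranks} ranks: {spmv_bandwidth_gbs(mflops):6.1f} GB/s, "
              f"{100 * fraction_of_triad(mflops):4.1f}% of Triad")
    speedup = CPU_42_RANKS_TIME / GPU_6_RANKS_CUDA_SF_TIME
    print(f"GPU node vs. 42-rank CPU node for MatMult: {speedup:.1f}x")
```

The time ratio between the 42-rank CPU run and the 6-GPU CUDA-aware run comes out a little below 20, which is consistent with the rounded extrapolation in the thread for this matrix and problem size.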
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SummitNode.png
Type: image/png
Size: 275167 bytes
Desc: SummitNode.png
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20190922/4567844d/attachment-0001.png>