[petsc-users] Make stream
Jed Brown
jed at jedbrown.org
Fri Jun 12 20:43:24 CDT 2020
Jed Brown <jed at jedbrown.org> writes:
> Fande Kong <fdkong.jd at gmail.com> writes:
>
>>> There's a lot more to AMG setup than memory bandwidth (architecture
>>> matters a lot, even between different generation CPUs).
>>
>>
>> Could you elaborate a bit more on this? From my understanding, one big part
>> of AMG setup is the RAP product (PtAP), which should be pretty much bandwidth-limited.
>
> The RAP isn't "pretty much bandwidth". See below for some
> Skylake/POWER9/EPYC results and analysis (copied from an off-list
> thread). I'll leave in some other bandwidth comments that may or may
> not be relevant to you. The short story is that Skylake and EPYC are
> both much better than POWER9 at MatPtAP despite POWER9 having similar
> bandwidth to EPYC and thus being significantly faster than Skylake for
> MatMult/smoothing.
>
>
> Jed Brown <jed at jedbrown.org> writes:
>
>> I'm attaching a log from my machine (Noether), which is 2-socket EPYC
>> 7452 (32 cores each). Each socket has 8xDDR4-3200 and 128 MB of L3
>> cache. This is the same node architecture as the new BER/E3SM machine
>> being installed at Argonne (though that one will probably have
> higher clocks and/or more cores per socket). Note that these CPUs are
>> about $2k each while Skylake 8180 are about $10k.
>>
>> Some excerpts/comments below.
>>
>
> [...]
>
> In addition to the notes below, I'd like to call out how important
> streaming stores are on EPYC. With vanilla code or _mm256_store_pd, we
> get the following performance
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> Copy 162609.2392 Scale 159119.8259 Add 174687.6250 Triad 175840.1587
>
> but replacing _mm256_store_pd with _mm256_stream_pd gives this
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> Copy 259951.9936 Scale 259381.0589 Add 250216.3389 Triad 249292.9701
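For anyone who wants to reproduce the effect outside the PETSc benchmark, the store vs. stream distinction looks roughly like this (a minimal sketch assuming an x86-64 compiler with AVX; `triad_stream` and `run_triad` are illustrative names, not part of the benchmark):

```c
/* Sketch of a STREAM-style Triad with non-temporal stores. _mm256_stream_pd
 * writes around the cache, so the destination line is not read-for-ownership
 * first; that is where the extra bandwidth comes from on EPYC. */
#include <assert.h>
#include <immintrin.h>
#include <stdlib.h>

/* a[i] = b[i] + scalar * c[i], with streaming stores (a must be 32-byte aligned) */
__attribute__((target("avx")))
static void triad_stream(double *a, const double *b, const double *c,
                         double scalar, size_t n) {
  __m256d s = _mm256_set1_pd(scalar);
  for (size_t i = 0; i < n; i += 4) {
    __m256d v = _mm256_add_pd(_mm256_loadu_pd(&b[i]),
                              _mm256_mul_pd(s, _mm256_loadu_pd(&c[i])));
    _mm256_stream_pd(&a[i], v); /* swap for _mm256_store_pd to compare */
  }
  _mm_sfence(); /* order the streaming stores before subsequent loads */
}

/* run the kernel on a small array and verify the result */
int run_triad(void) {
  size_t n = 4096; /* multiple of 4 so the unrolled loop covers everything */
  double *a = aligned_alloc(32, n * sizeof(double));
  double *b = aligned_alloc(32, n * sizeof(double));
  double *c = aligned_alloc(32, n * sizeof(double));
  int ok = a && b && c;
  if (ok) {
    for (size_t i = 0; i < n; i++) { b[i] = 2.0; c[i] = 0.5; }
    triad_stream(a, b, c, 3.0, n);
    for (size_t i = 0; i < n; i++)
      if (a[i] != 3.5) ok = 0; /* 2.0 + 3.0 * 0.5, exact in binary FP */
  }
  free(a); free(b); free(c);
  return ok;
}
```

Timing the two store variants over a working set much larger than L3 is what produces the gap shown above.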
I turned on NPS4 (a BIOS setting that creates a NUMA node for each pair
of memory channels) and get a modest performance boost.
$ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
Copy 289645.3776 Scale 289186.2783 Add 273220.0133 Triad 272911.2263
On this architecture, best performance comes from one process per 4-core CCX (shared L3).
$ mpiexec -n 16 --bind-to core --map-by core:4 src/benchmarks/streams/MPIVersion
Copy 300704.8859 Scale 304556.3380 Add 295970.1132 Triad 298891.3821
> This is just preposterously huge, but very repeatable with both gcc and
> clang, and confirmed by inspecting the assembly. This suggests that it would be
> useful for vector kernels to have streaming and non-streaming variants.
> That is, if I drop the vector length by a factor of 20 (so the working
> set is 2.3 MB/core instead of 46 MB/core in the default version), then we
> get 2.4 TB/s Triad with _mm256_store_pd:
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> Copy 2159915.7058 Scale 2212671.7087 Add 2414758.2757 Triad 2402671.1178
>
> and a thoroughly embarrassing 353 GB/s with _mm256_stream_pd:
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> Copy 235934.6653 Scale 237446.8507 Add 352805.7288 Triad 352992.9692
>
>
> I don't know a good way to automatically determine whether to expect the
> memory to be in cache, but we could make it a global (or per-object)
> run-time selection.
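A per-object run-time selection could be as simple as function-pointer dispatch keyed on an estimated working-set size (a hypothetical sketch; `select_scale` and the LLC heuristic are illustrative, not PETSc API):

```c
/* Hypothetical sketch of a run-time (global or per-object) switch between
 * cache-friendly and streaming variants of a vector kernel. The streaming
 * variant here is a plain-C stand-in for one that would use
 * _mm256_stream_pd. */
#include <assert.h>
#include <stddef.h>

typedef void (*ScaleKernel)(double *y, double a, const double *x, size_t n);

/* ordinary stores: best when y is expected to stay resident in cache */
static void scale_cached(double *y, double a, const double *x, size_t n) {
  for (size_t i = 0; i < n; i++) y[i] = a * x[i];
}

/* streaming variant: a real implementation would emit non-temporal stores
 * here so a large y never pollutes the cache */
static void scale_stream(double *y, double a, const double *x, size_t n) {
  for (size_t i = 0; i < n; i++) y[i] = a * x[i];
}

/* heuristic selection: stream when the working set clearly exceeds the
 * last-level cache; could equally be set from an options database */
static ScaleKernel select_scale(size_t working_set_bytes, size_t llc_bytes) {
  return working_set_bytes > llc_bytes ? scale_stream : scale_cached;
}
```

The hard part is not the dispatch but knowing the working-set size per object, which is why a user-settable option may be the practical answer.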
>
>> Jed Brown <jed at jedbrown.org> writes:
>>
>>> "Smith, Barry F." <bsmith at mcs.anl.gov> writes:
>>>
>>>> Thanks. The PowerPC is pretty crappy compared to Skylake.
>>>
>>> Compare the MGSmooth times. The POWER9 is faster than the Skylake
>>> because it has more memory bandwidth.
>>>
>>> $ rg 'MGInterp Level 4|MGSmooth Level 4' ex56*
>>> ex56-JLSE-skylake-56ranks-converged.txt
>>> 254:MGSmooth Level 4 68 1.0 1.8808e+00 1.2 7.93e+08 1.3 3.6e+04 1.9e+04 3.4e+01 8 29 10 16 3 62 60 18 54 25 22391
>>> 256:MGInterp Level 4 68 1.0 4.0043e-01 1.8 1.45e+08 1.3 2.2e+04 2.5e+03 0.0e+00 1 5 6 1 0 9 11 11 4 0 19109
>>>
>>> ex56-summit-cpu-36ranks-converged.txt
>>> 265:MGSmooth Level 4 68 1.0 1.1531e+00 1.1 1.22e+09 1.2 2.3e+04 2.6e+04 3.4e+01 3 29 7 13 3 61 60 12 54 25 36519 0 0 0.00e+00 0 0.00e+00 0
>>> 267:MGInterp Level 4 68 1.0 2.0749e-01 1.1 2.23e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 11 11 7 4 0 36925 0 0 0.00e+00 0 0.00e+00 0
>>>
>>> ex56-summit-gpu-24ranks-converged.txt
>>> 275:MGSmooth Level 4 68 1.0 1.4499e-01 1.2 1.85e+09 1.2 1.0e+04 5.3e+04 3.4e+01 0 29 7 13 3 26 60 12 55 25 299156 940881 115 2.46e+01 116 8.64e+01 100
>>> 277:MGInterp Level 4 68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 42715 621223 36 2.98e+01 136 3.95e+00 100
>>>
>>> ex56-summit-gpu-36ranks-converged.txt
>>> 275:MGSmooth Level 4 68 1.0 1.4877e-01 1.2 1.25e+09 1.2 2.3e+04 2.6e+04 3.4e+01 0 29 7 13 3 19 60 12 54 25 291548 719522 115 1.83e+01 116 5.80e+01 100
>>> 277:MGInterp Level 4 68 1.0 2.4317e-01 1.0 2.20e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 31062 586044 36 1.99e+01 136 2.82e+00 100
>>
>> 258:MGSmooth Level 4 68 1.0 9.6950e-01 1.3 6.15e+08 1.3 4.0e+04 1.4e+04 2.0e+00 6 28 10 15 0 59 59 18 54 25 39423
>> 260:MGInterp Level 4 68 1.0 2.5707e-01 1.5 1.23e+08 1.2 2.7e+04 1.9e+03 0.0e+00 1 5 7 1 0 13 12 12 5 0 29294
>>
>> EPYC is faster than POWER9, which is faster than Skylake.
>>
>>>
>>> The Skylake is a lot faster at PtAP. It'd be interesting to better
>>> understand why. Perhaps it has to do with caching or aggressiveness of
>>> out-of-order execution.
>>>
>>> $ rg 'PtAP' ex56*
>>> ex56-JLSE-skylake-56ranks-converged.txt
>>> 164:MatPtAP 4 1.0 1.4214e+00 1.0 3.94e+08 1.5 1.1e+04 7.4e+04 4.4e+01 6 13 3 20 4 8 28 8 39 5 13754
>>> 165:MatPtAPSymbolic 4 1.0 8.3981e-01 1.0 0.00e+00 0.0 6.5e+03 7.3e+04 2.8e+01 4 0 2 12 2 5 0 5 23 3 0
>>> 166:MatPtAPNumeric 4 1.0 5.8402e-01 1.0 3.94e+08 1.5 4.5e+03 7.5e+04 1.6e+01 2 13 1 8 1 3 28 3 16 2 33474
>>>
>>> ex56-summit-cpu-36ranks-converged.txt
>>> 164:MatPtAP 4 1.0 3.9077e+00 1.0 5.89e+08 1.4 1.6e+04 7.4e+04 4.4e+01 9 13 5 26 4 11 28 12 46 5 4991 0 0 0.00e+00 0 0.00e+00 0
>>> 165:MatPtAPSymbolic 4 1.0 1.9525e+00 1.0 0.00e+00 0.0 1.2e+04 7.3e+04 2.8e+01 5 0 4 19 3 5 0 9 34 3 0 0 0 0.00e+00 0 0.00e+00 0
>>> 166:MatPtAPNumeric 4 1.0 1.9621e+00 1.0 5.89e+08 1.4 4.0e+03 7.5e+04 1.6e+01 5 13 1 7 1 5 28 3 12 2 9940 0 0 0.00e+00 0 0.00e+00 0
>>>
>>> ex56-summit-gpu-24ranks-converged.txt
>>> 167:MatPtAP 4 1.0 5.7210e+00 1.0 8.48e+08 1.3 7.5e+03 1.3e+05 4.4e+01 8 13 5 25 4 11 28 12 46 5 3415 0 16 3.36e+01 4 6.30e-02 0
>>> 168:MatPtAPSymbolic 4 1.0 2.8717e+00 1.0 0.00e+00 0.0 5.5e+03 1.3e+05 2.8e+01 4 0 4 19 3 5 0 9 34 3 0 0 0 0.00e+00 0 0.00e+00 0
>>> 169:MatPtAPNumeric 4 1.0 2.8537e+00 1.0 8.48e+08 1.3 2.0e+03 1.3e+05 1.6e+01 4 13 1 7 1 5 28 3 12 2 6846 0 16 3.36e+01 4 6.30e-02 0
>>>
>>> ex56-summit-gpu-36ranks-converged.txt
>>> 167:MatPtAP 4 1.0 4.0340e+00 1.0 5.89e+08 1.4 1.6e+04 7.4e+04 4.4e+01 8 13 5 26 4 11 28 12 46 5 4835 0 16 2.30e+01 4 5.18e-02 0
>>> 168:MatPtAPSymbolic 4 1.0 2.0355e+00 1.0 0.00e+00 0.0 1.2e+04 7.3e+04 2.8e+01 4 0 4 19 3 5 0 9 34 3 0 0 0 0.00e+00 0 0.00e+00 0
>>> 169:MatPtAPNumeric 4 1.0 2.0050e+00 1.0 5.89e+08 1.4 4.0e+03 7.5e+04 1.6e+01 4 13 1 7 1 5 28 3 12 2 9728 0 16 2.30e+01 4 5.18e-02 0
>>
>> 153:MatPtAPSymbolic 4 1.0 7.6053e-01 1.0 0.00e+00 0.0 7.6e+03 5.8e+04 2.8e+01 5 0 2 12 2 6 0 5 22 3 0
>> 154:MatPtAPNumeric 4 1.0 6.5172e-01 1.0 3.21e+08 1.4 6.4e+03 4.8e+04 2.4e+01 4 14 2 8 2 5 27 4 16 2 28861
>>
>> EPYC is similar to Skylake here.
>>
>>> I'd really like to compare an EPYC for these operations. I bet it's
>>> pretty good. (More bandwidth than Skylake, bigger caches, but no
>>> AVX512.)
>>>
>>>> So the biggest consumer is MatPtAP; I guess that should be done first.
>>>>
>>>> It would be good to have these results exclude the Jacobian and Function evaluation, which really dominate the time and add clutter, making it difficult to see the problems with the rest of SNESSolve.
>>>>
>>>>
>>>> Did you notice:
>>>>
>>>> MGInterp Level 4 68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 42715 621223 36 2.98e+01 136 3.95e+00 100
>>>>
>>>> it is terrible! Well over half of the KSPSolve time is in this one relatively minor routine. All of the interps are terribly slow. Is it related to the transpose multiply or something?
>>>
>>> Yes, it's definitely the MatMultTranspose, which must be about 3x more
>>> expensive than the forward MatMult even on the CPU. PCMG/PCGAMG should
>>> explicitly transpose (unless the user sets an option to aggressively
>>> minimize memory usage).
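The reason an explicit transpose wins is visible in a plain CSR sketch: y = A^T x must scatter read-modify-write updates through the column indices, while a stored transpose runs as an ordinary row-wise multiply with sequential writes to y (a minimal illustration, not PETSc's implementation):

```c
/* Why MatMultTranspose on CSR is slower than multiplying by an explicitly
 * stored transpose: the transpose product scatters into y through col[],
 * defeating streaming writes and hardware prefetch. */
#include <assert.h>
#include <stddef.h>

typedef struct {
  size_t m;             /* number of rows */
  const size_t *rowptr; /* length m + 1 */
  const size_t *col;    /* column index per nonzero */
  const double *val;    /* value per nonzero */
} CSR;

/* y = A x: one sequential write to y per row */
static void csr_mult(const CSR *A, const double *x, double *y) {
  for (size_t i = 0; i < A->m; i++) {
    double sum = 0.0;
    for (size_t k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
      sum += A->val[k] * x[A->col[k]];
    y[i] = sum;
  }
}

/* y = A^T x (n columns): scattered read-modify-write of y[col[k]] */
static void csr_mult_transpose(const CSR *A, const double *x, double *y,
                               size_t n) {
  for (size_t j = 0; j < n; j++) y[j] = 0.0;
  for (size_t i = 0; i < A->m; i++)
    for (size_t k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
      y[A->col[k]] += A->val[k] * x[i];
}
```

Storing P^T once per setup trades memory for turning every restriction into the first, streaming-friendly loop.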
>>>
>>> $ rg 'MGInterp|MultTrans' ex56*
>>> ex56-JLSE-skylake-56ranks-converged.txt
>>> 222:MatMultTranspose 136 1.0 3.5105e-01 3.7 7.91e+07 1.3 2.5e+04 1.3e+03 0.0e+00 1 3 7 1 0 5 6 13 3 0 11755
>>> 247:MGInterp Level 1 68 1.0 3.3894e-04 2.2 2.35e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 693
>>> 250:MGInterp Level 2 68 1.0 1.1212e-0278.0 1.17e+06 0.0 1.8e+03 7.7e+02 0.0e+00 0 0 1 0 0 0 0 1 0 0 2172
>>> 253:MGInterp Level 3 68 1.0 6.7105e-02 5.3 1.23e+07 1.8 2.7e+04 4.2e+02 0.0e+00 0 0 8 0 0 1 1 14 1 0 8594
>>> 256:MGInterp Level 4 68 1.0 4.0043e-01 1.8 1.45e+08 1.3 2.2e+04 2.5e+03 0.0e+00 1 5 6 1 0 9 11 11 4 0 19109
>>>
>>> ex56-summit-cpu-36ranks-converged.txt
>>> 229:MatMultTranspose 136 1.0 1.4832e-01 1.4 1.21e+08 1.2 1.9e+04 1.5e+03 0.0e+00 0 3 6 1 0 6 6 10 3 0 27842 0 0 0.00e+00 0 0.00e+00 0
>>> 258:MGInterp Level 1 68 1.0 2.9145e-04 1.5 1.08e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 370 0 0 0.00e+00 0 0.00e+00 0
>>> 261:MGInterp Level 2 68 1.0 5.7095e-03 1.5 9.16e+05 2.5 2.4e+03 7.1e+02 0.0e+00 0 0 1 0 0 0 0 1 0 0 4093 0 0 0.00e+00 0 0.00e+00 0
>>> 264:MGInterp Level 3 68 1.0 3.5654e-02 2.8 1.77e+07 1.5 2.3e+04 3.9e+02 0.0e+00 0 0 7 0 0 1 1 12 1 0 16095 0 0 0.00e+00 0 0.00e+00 0
>>> 267:MGInterp Level 4 68 1.0 2.0749e-01 1.1 2.23e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 11 11 7 4 0 36925 0 0 0.00e+00 0 0.00e+00 0
>>>
>>> ex56-summit-gpu-24ranks-converged.txt
>>> 236:MatMultTranspose 136 1.0 2.1445e-01 1.0 1.72e+08 1.2 9.5e+03 2.6e+03 0.0e+00 0 3 6 1 0 39 6 11 3 0 18719 451131 8 3.11e+01 272 2.19e+00 100
>>> 268:MGInterp Level 1 68 1.0 4.0388e-03 2.8 1.08e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 27 79 37 5.84e-04 68 6.80e-05 100
>>> 271:MGInterp Level 2 68 1.0 2.9033e-02 2.9 1.25e+06 1.9 1.6e+03 7.8e+02 0.0e+00 0 0 1 0 0 5 0 2 0 0 812 11539 36 1.14e-01 136 5.41e-02 100
>>> 274:MGInterp Level 3 68 1.0 4.9503e-02 1.1 2.50e+07 1.4 1.1e+04 6.3e+02 0.0e+00 0 0 7 0 0 9 1 13 1 0 11476 100889 36 2.29e+00 136 3.74e-01 100
>>> 277:MGInterp Level 4 68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 42715 621223 36 2.98e+01 136 3.95e+00 100
>>>
>>> ex56-summit-gpu-36ranks-converged.txt
>>> 236:MatMultTranspose 136 1.0 2.9692e-01 1.0 1.17e+08 1.2 1.9e+04 1.5e+03 0.0e+00 1 3 6 1 0 40 6 10 3 0 13521 336701 8 2.08e+01 272 1.59e+00 100
>>> 268:MGInterp Level 1 68 1.0 3.8752e-03 2.5 1.03e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 27 79 37 3.95e-04 68 4.53e-05 100
>>> 271:MGInterp Level 2 68 1.0 3.5465e-02 2.2 9.12e+05 2.5 2.4e+03 7.1e+02 0.0e+00 0 0 1 0 0 4 0 1 0 0 655 5989 36 8.16e-02 136 4.89e-02 100
>>> 274:MGInterp Level 3 68 1.0 6.7101e-02 1.1 1.75e+07 1.5 2.3e+04 3.9e+02 0.0e+00 0 0 7 0 0 9 1 12 1 0 8455 56175 36 1.55e+00 136 3.03e-01 100
>>> 277:MGInterp Level 4 68 1.0 2.4317e-01 1.0 2.20e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 31062 586044 36 1.99e+01 136 2.82e+00 100
>>
>> 223:MatMultTranspose 136 1.0 2.0702e-01 2.9 6.59e+07 1.2 2.7e+04 1.1e+03 0.0e+00 1 3 7 1 0 7 6 12 3 0 19553
>> 251:MGInterp Level 1 68 1.0 2.8062e-04 1.5 9.79e+04 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 349
>> 254:MGInterp Level 2 68 1.0 6.2506e-0331.9 9.69e+05 0.0 2.1e+03 6.3e+02 0.0e+00 0 0 1 0 0 0 0 1 0 0 3458
>> 257:MGInterp Level 3 68 1.0 4.8159e-02 6.5 9.62e+06 1.5 2.5e+04 4.2e+02 0.0e+00 0 0 6 0 0 1 1 11 1 0 11199
>> 260:MGInterp Level 4 68 1.0 2.5707e-01 1.5 1.23e+08 1.2 2.7e+04 1.9e+03 0.0e+00 1 5 7 1 0 13 12 12 5 0 29294
>>
>> POWER9 still has an edge here.