[petsc-users] Make stream
Jed Brown
jed at jedbrown.org
Fri Jun 12 20:43:24 CDT 2020
Jed Brown <jed at jedbrown.org> writes:
> Fande Kong <fdkong.jd at gmail.com> writes:
>
>>> There's a lot more to AMG setup than memory bandwidth (architecture
>>> matters a lot, even between different generation CPUs).
>>
>>
>> Could you elaborate a bit more on this? From my understanding, one big part
>> of AMG setup is the RAP product (PtAP), which should be pretty much bandwidth-limited.
>
> The RAP isn't "pretty much bandwidth". See below for some
> Skylake/POWER9/EPYC results and analysis (copied from an off-list
> thread). I'll leave in some other bandwidth comments that may or may
> not be relevant to you. The short story is that Skylake and EPYC are
> both much better than POWER9 at MatPtAP despite POWER9 having similar
> bandwidth to EPYC and thus being significantly faster than Skylake for
> MatMult/smoothing.
>
>
> Jed Brown <jed at jedbrown.org> writes:
>
>> I'm attaching a log from my machine (Noether), which is 2-socket EPYC
>> 7452 (32 cores each). Each socket has 8xDDR4-3200 and 128 MB of L3
>> cache. This is the same node architecture as the new BER/E3SM machine
>> being installed at Argonne (though that one will probably have
> higher clocks and/or more cores per socket). Note that these CPUs are
>> about $2k each while Skylake 8180 are about $10k.
>>
>> Some excerpts/comments below.
>>
>
> [...]
>
> In addition to the notes below, I'd like to call out how important
> streaming stores are on EPYC. With vanilla code or _mm256_store_pd, we
> get the following performance
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> Copy 162609.2392 Scale 159119.8259 Add 174687.6250 Triad 175840.1587
>
> but replacing _mm256_store_pd with _mm256_stream_pd gives this
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> Copy 259951.9936 Scale 259381.0589 Add 250216.3389 Triad 249292.9701
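For anyone who wants to reproduce the effect outside the PETSc benchmark, the store vs. stream distinction looks roughly like this (a minimal sketch assuming an x86-64 compiler with AVX; `triad_stream` and `run_triad` are illustrative names, not part of the benchmark):

```c
/* Sketch of a STREAM-style Triad with non-temporal stores. _mm256_stream_pd
 * writes around the cache, so the destination line is not read-for-ownership
 * first; that is where the extra bandwidth comes from on EPYC. */
#include <assert.h>
#include <immintrin.h>
#include <stdlib.h>

/* a[i] = b[i] + scalar * c[i], with streaming stores (a must be 32-byte aligned) */
__attribute__((target("avx")))
static void triad_stream(double *a, const double *b, const double *c,
                         double scalar, size_t n) {
  __m256d s = _mm256_set1_pd(scalar);
  for (size_t i = 0; i < n; i += 4) {
    __m256d v = _mm256_add_pd(_mm256_loadu_pd(&b[i]),
                              _mm256_mul_pd(s, _mm256_loadu_pd(&c[i])));
    _mm256_stream_pd(&a[i], v); /* swap for _mm256_store_pd to compare */
  }
  _mm_sfence(); /* order the streaming stores before subsequent loads */
}

/* run the kernel on a small array and verify the result */
int run_triad(void) {
  size_t n = 4096; /* multiple of 4 so the unrolled loop covers everything */
  double *a = aligned_alloc(32, n * sizeof(double));
  double *b = aligned_alloc(32, n * sizeof(double));
  double *c = aligned_alloc(32, n * sizeof(double));
  int ok = a && b && c;
  if (ok) {
    for (size_t i = 0; i < n; i++) { b[i] = 2.0; c[i] = 0.5; }
    triad_stream(a, b, c, 3.0, n);
    for (size_t i = 0; i < n; i++)
      if (a[i] != 3.5) ok = 0; /* 2.0 + 3.0 * 0.5, exact in binary FP */
  }
  free(a); free(b); free(c);
  return ok;
}
```

Timing the two store variants over a working set much larger than L3 is what produces the gap shown above.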
I turned on NPS4 (a BIOS setting that creates a NUMA node for each pair
of memory channels) and get a modest performance boost.
$ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
Copy 289645.3776 Scale 289186.2783 Add 273220.0133 Triad 272911.2263
On this architecture, best performance comes from one process per 4-core CCX (shared L3).
$ mpiexec -n 16 --bind-to core --map-by core:4 src/benchmarks/streams/MPIVersion
Copy 300704.8859 Scale 304556.3380 Add 295970.1132 Triad 298891.3821
> This is just preposterously huge, but very repeatable with both gcc and
> clang, and confirmed by inspecting the assembly. This suggests that it would be
> useful for vector kernels to have streaming and non-streaming variants.
> That is, if I drop the vector length by a factor of 20 (so the working
> set is 2.3 MB/core instead of 46 MB/core in the default version), then we
> get 2.4 TB/s Triad with _mm256_store_pd:
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> Copy 2159915.7058 Scale 2212671.7087 Add 2414758.2757 Triad 2402671.1178
>
> and a thoroughly embarrassing 353 GB/s with _mm256_stream_pd:
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> Copy 235934.6653 Scale 237446.8507 Add 352805.7288 Triad 352992.9692
>
>
> I don't know a good way to automatically determine whether to expect the
> memory to be in cache, but we could make it a global (or per-object)
> run-time selection.
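A per-object run-time selection could be as simple as function-pointer dispatch keyed on an estimated working-set size (a hypothetical sketch; `select_scale` and the LLC heuristic are illustrative, not PETSc API):

```c
/* Hypothetical sketch of a run-time (global or per-object) switch between
 * cache-friendly and streaming variants of a vector kernel. The streaming
 * variant here is a plain-C stand-in for one that would use
 * _mm256_stream_pd. */
#include <assert.h>
#include <stddef.h>

typedef void (*ScaleKernel)(double *y, double a, const double *x, size_t n);

/* ordinary stores: best when y is expected to stay resident in cache */
static void scale_cached(double *y, double a, const double *x, size_t n) {
  for (size_t i = 0; i < n; i++) y[i] = a * x[i];
}

/* streaming variant: a real implementation would emit non-temporal stores
 * here so a large y never pollutes the cache */
static void scale_stream(double *y, double a, const double *x, size_t n) {
  for (size_t i = 0; i < n; i++) y[i] = a * x[i];
}

/* heuristic selection: stream when the working set clearly exceeds the
 * last-level cache; could equally be set from an options database */
static ScaleKernel select_scale(size_t working_set_bytes, size_t llc_bytes) {
  return working_set_bytes > llc_bytes ? scale_stream : scale_cached;
}
```

The hard part is not the dispatch but knowing the working-set size per object, which is why a user-settable option may be the practical answer.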
>
>> Jed Brown <jed at jedbrown.org> writes:
>>
>>> "Smith, Barry F." <bsmith at mcs.anl.gov> writes:
>>>
>>>> Thanks. The PowerPC is pretty crappy compared to Skylake.
>>>
>>> Compare the MGSmooth times. The POWER9 is faster than the Skylake
>>> because it has more memory bandwidth.
>>>
>>> $ rg 'MGInterp Level 4|MGSmooth Level 4' ex56*
>>> ex56-JLSE-skylake-56ranks-converged.txt
>>> 254:MGSmooth Level 4 68 1.0 1.8808e+00 1.2 7.93e+08 1.3 3.6e+04 1.9e+04 3.4e+01 8 29 10 16 3 62 60 18 54 25 22391
>>> 256:MGInterp Level 4 68 1.0 4.0043e-01 1.8 1.45e+08 1.3 2.2e+04 2.5e+03 0.0e+00 1 5 6 1 0 9 11 11 4 0 19109
>>>
>>> ex56-summit-cpu-36ranks-converged.txt
>>> 265:MGSmooth Level 4 68 1.0 1.1531e+00 1.1 1.22e+09 1.2 2.3e+04 2.6e+04 3.4e+01 3 29 7 13 3 61 60 12 54 25 36519 0 0 0.00e+00 0 0.00e+00 0
>>> 267:MGInterp Level 4 68 1.0 2.0749e-01 1.1 2.23e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 11 11 7 4 0 36925 0 0 0.00e+00 0 0.00e+00 0
>>>
>>> ex56-summit-gpu-24ranks-converged.txt
>>> 275:MGSmooth Level 4 68 1.0 1.4499e-01 1.2 1.85e+09 1.2 1.0e+04 5.3e+04 3.4e+01 0 29 7 13 3 26 60 12 55 25 299156 940881 115 2.46e+01 116 8.64e+01 100
>>> 277:MGInterp Level 4 68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 42715 621223 36 2.98e+01 136 3.95e+00 100
>>>
>>> ex56-summit-gpu-36ranks-converged.txt
>>> 275:MGSmooth Level 4 68 1.0 1.4877e-01 1.2 1.25e+09 1.2 2.3e+04 2.6e+04 3.4e+01 0 29 7 13 3 19 60 12 54 25 291548 719522 115 1.83e+01 116 5.80e+01 100
>>> 277:MGInterp Level 4 68 1.0 2.4317e-01 1.0 2.20e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 31062 586044 36 1.99e+01 136 2.82e+00 100
>>
>> 258:MGSmooth Level 4 68 1.0 9.6950e-01 1.3 6.15e+08 1.3 4.0e+04 1.4e+04 2.0e+00 6 28 10 15 0 59 59 18 54 25 39423
>> 260:MGInterp Level 4 68 1.0 2.5707e-01 1.5 1.23e+08 1.2 2.7e+04 1.9e+03 0.0e+00 1 5 7 1 0 13 12 12 5 0 29294
>>
>> EPYC is faster than POWER9, which is faster than Skylake.
>>
>>>
>>> The Skylake is a lot faster at PtAP. It'd be interesting to better
>>> understand why. Perhaps it has to do with caching or aggressiveness of
>>> out-of-order execution.
>>>
>>> $ rg 'PtAP' ex56*
>>> ex56-JLSE-skylake-56ranks-converged.txt
>>> 164:MatPtAP 4 1.0 1.4214e+00 1.0 3.94e+08 1.5 1.1e+04 7.4e+04 4.4e+01 6 13 3 20 4 8 28 8 39 5 13754
>>> 165:MatPtAPSymbolic 4 1.0 8.3981e-01 1.0 0.00e+00 0.0 6.5e+03 7.3e+04 2.8e+01 4 0 2 12 2 5 0 5 23 3 0
>>> 166:MatPtAPNumeric 4 1.0 5.8402e-01 1.0 3.94e+08 1.5 4.5e+03 7.5e+04 1.6e+01 2 13 1 8 1 3 28 3 16 2 33474
>>>
>>> ex56-summit-cpu-36ranks-converged.txt
>>> 164:MatPtAP 4 1.0 3.9077e+00 1.0 5.89e+08 1.4 1.6e+04 7.4e+04 4.4e+01 9 13 5 26 4 11 28 12 46 5 4991 0 0 0.00e+00 0 0.00e+00 0
>>> 165:MatPtAPSymbolic 4 1.0 1.9525e+00 1.0 0.00e+00 0.0 1.2e+04 7.3e+04 2.8e+01 5 0 4 19 3 5 0 9 34 3 0 0 0 0.00e+00 0 0.00e+00 0
>>> 166:MatPtAPNumeric 4 1.0 1.9621e+00 1.0 5.89e+08 1.4 4.0e+03 7.5e+04 1.6e+01 5 13 1 7 1 5 28 3 12 2 9940 0 0 0.00e+00 0 0.00e+00 0
>>>
>>> ex56-summit-gpu-24ranks-converged.txt
>>> 167:MatPtAP 4 1.0 5.7210e+00 1.0 8.48e+08 1.3 7.5e+03 1.3e+05 4.4e+01 8 13 5 25 4 11 28 12 46 5 3415 0 16 3.36e+01 4 6.30e-02 0
>>> 168:MatPtAPSymbolic 4 1.0 2.8717e+00 1.0 0.00e+00 0.0 5.5e+03 1.3e+05 2.8e+01 4 0 4 19 3 5 0 9 34 3 0 0 0 0.00e+00 0 0.00e+00 0
>>> 169:MatPtAPNumeric 4 1.0 2.8537e+00 1.0 8.48e+08 1.3 2.0e+03 1.3e+05 1.6e+01 4 13 1 7 1 5 28 3 12 2 6846 0 16 3.36e+01 4 6.30e-02 0
>>>
>>> ex56-summit-gpu-36ranks-converged.txt
>>> 167:MatPtAP 4 1.0 4.0340e+00 1.0 5.89e+08 1.4 1.6e+04 7.4e+04 4.4e+01 8 13 5 26 4 11 28 12 46 5 4835 0 16 2.30e+01 4 5.18e-02 0
>>> 168:MatPtAPSymbolic 4 1.0 2.0355e+00 1.0 0.00e+00 0.0 1.2e+04 7.3e+04 2.8e+01 4 0 4 19 3 5 0 9 34 3 0 0 0 0.00e+00 0 0.00e+00 0
>>> 169:MatPtAPNumeric 4 1.0 2.0050e+00 1.0 5.89e+08 1.4 4.0e+03 7.5e+04 1.6e+01 4 13 1 7 1 5 28 3 12 2 9728 0 16 2.30e+01 4 5.18e-02 0
>>
>> 153:MatPtAPSymbolic 4 1.0 7.6053e-01 1.0 0.00e+00 0.0 7.6e+03 5.8e+04 2.8e+01 5 0 2 12 2 6 0 5 22 3 0
>> 154:MatPtAPNumeric 4 1.0 6.5172e-01 1.0 3.21e+08 1.4 6.4e+03 4.8e+04 2.4e+01 4 14 2 8 2 5 27 4 16 2 28861
>>
>> EPYC is similar to Skylake here.
>>
>>> I'd really like to compare an EPYC for these operations. I bet it's
>>> pretty good. (More bandwidth than Skylake, bigger caches, but no
>>> AVX512.)
>>>
>>>> So the biggest consumer is MatPtAP; I guess that should be done first.
>>>>
>>>> It would be good to have these results exclude the Jacobian and Function evaluation, which really dominate the time and add clutter, making it difficult to see the problems with the rest of SNESSolve.
>>>>
>>>>
>>>> Did you notice:
>>>>
>>>> MGInterp Level 4 68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 42715 621223 36 2.98e+01 136 3.95e+00 100
>>>>
>>>> it is terrible! Well over half of the KSPSolve time is in this one relatively minor routine. All of the interps are terribly slow. Is it related to the transpose multiply or something?
>>>
>>> Yes, it's definitely the MatMultTranspose, which must be about 3x more
>>> expensive than the forward MatMult even on the CPU. PCMG/PCGAMG should
>>> explicitly transpose (unless the user sets an option to aggressively
>>> minimize memory usage).
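The reason an explicit transpose wins is visible in a plain CSR sketch: y = A^T x must scatter read-modify-write updates through the column indices, while a stored transpose runs as an ordinary row-wise multiply with sequential writes to y (a minimal illustration, not PETSc's implementation):

```c
/* Why MatMultTranspose on CSR is slower than multiplying by an explicitly
 * stored transpose: the transpose product scatters into y through col[],
 * defeating streaming writes and hardware prefetch. */
#include <assert.h>
#include <stddef.h>

typedef struct {
  size_t m;             /* number of rows */
  const size_t *rowptr; /* length m + 1 */
  const size_t *col;    /* column index per nonzero */
  const double *val;    /* value per nonzero */
} CSR;

/* y = A x: one sequential write to y per row */
static void csr_mult(const CSR *A, const double *x, double *y) {
  for (size_t i = 0; i < A->m; i++) {
    double sum = 0.0;
    for (size_t k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
      sum += A->val[k] * x[A->col[k]];
    y[i] = sum;
  }
}

/* y = A^T x (n columns): scattered read-modify-write of y[col[k]] */
static void csr_mult_transpose(const CSR *A, const double *x, double *y,
                               size_t n) {
  for (size_t j = 0; j < n; j++) y[j] = 0.0;
  for (size_t i = 0; i < A->m; i++)
    for (size_t k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
      y[A->col[k]] += A->val[k] * x[i];
}
```

Storing P^T once per setup trades memory for turning every restriction into the first, streaming-friendly loop.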
>>>
>>> $ rg 'MGInterp|MultTrans' ex56*
>>> ex56-JLSE-skylake-56ranks-converged.txt
>>> 222:MatMultTranspose 136 1.0 3.5105e-01 3.7 7.91e+07 1.3 2.5e+04 1.3e+03 0.0e+00 1 3 7 1 0 5 6 13 3 0 11755
>>> 247:MGInterp Level 1 68 1.0 3.3894e-04 2.2 2.35e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 693
>>> 250:MGInterp Level 2 68 1.0 1.1212e-0278.0 1.17e+06 0.0 1.8e+03 7.7e+02 0.0e+00 0 0 1 0 0 0 0 1 0 0 2172
>>> 253:MGInterp Level 3 68 1.0 6.7105e-02 5.3 1.23e+07 1.8 2.7e+04 4.2e+02 0.0e+00 0 0 8 0 0 1 1 14 1 0 8594
>>> 256:MGInterp Level 4 68 1.0 4.0043e-01 1.8 1.45e+08 1.3 2.2e+04 2.5e+03 0.0e+00 1 5 6 1 0 9 11 11 4 0 19109
>>>
>>> ex56-summit-cpu-36ranks-converged.txt
>>> 229:MatMultTranspose 136 1.0 1.4832e-01 1.4 1.21e+08 1.2 1.9e+04 1.5e+03 0.0e+00 0 3 6 1 0 6 6 10 3 0 27842 0 0 0.00e+00 0 0.00e+00 0
>>> 258:MGInterp Level 1 68 1.0 2.9145e-04 1.5 1.08e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 370 0 0 0.00e+00 0 0.00e+00 0
>>> 261:MGInterp Level 2 68 1.0 5.7095e-03 1.5 9.16e+05 2.5 2.4e+03 7.1e+02 0.0e+00 0 0 1 0 0 0 0 1 0 0 4093 0 0 0.00e+00 0 0.00e+00 0
>>> 264:MGInterp Level 3 68 1.0 3.5654e-02 2.8 1.77e+07 1.5 2.3e+04 3.9e+02 0.0e+00 0 0 7 0 0 1 1 12 1 0 16095 0 0 0.00e+00 0 0.00e+00 0
>>> 267:MGInterp Level 4 68 1.0 2.0749e-01 1.1 2.23e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 11 11 7 4 0 36925 0 0 0.00e+00 0 0.00e+00 0
>>>
>>> ex56-summit-gpu-24ranks-converged.txt
>>> 236:MatMultTranspose 136 1.0 2.1445e-01 1.0 1.72e+08 1.2 9.5e+03 2.6e+03 0.0e+00 0 3 6 1 0 39 6 11 3 0 18719 451131 8 3.11e+01 272 2.19e+00 100
>>> 268:MGInterp Level 1 68 1.0 4.0388e-03 2.8 1.08e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 27 79 37 5.84e-04 68 6.80e-05 100
>>> 271:MGInterp Level 2 68 1.0 2.9033e-02 2.9 1.25e+06 1.9 1.6e+03 7.8e+02 0.0e+00 0 0 1 0 0 5 0 2 0 0 812 11539 36 1.14e-01 136 5.41e-02 100
>>> 274:MGInterp Level 3 68 1.0 4.9503e-02 1.1 2.50e+07 1.4 1.1e+04 6.3e+02 0.0e+00 0 0 7 0 0 9 1 13 1 0 11476 100889 36 2.29e+00 136 3.74e-01 100
>>> 277:MGInterp Level 4 68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 42715 621223 36 2.98e+01 136 3.95e+00 100
>>>
>>> ex56-summit-gpu-36ranks-converged.txt
>>> 236:MatMultTranspose 136 1.0 2.9692e-01 1.0 1.17e+08 1.2 1.9e+04 1.5e+03 0.0e+00 1 3 6 1 0 40 6 10 3 0 13521 336701 8 2.08e+01 272 1.59e+00 100
>>> 268:MGInterp Level 1 68 1.0 3.8752e-03 2.5 1.03e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 27 79 37 3.95e-04 68 4.53e-05 100
>>> 271:MGInterp Level 2 68 1.0 3.5465e-02 2.2 9.12e+05 2.5 2.4e+03 7.1e+02 0.0e+00 0 0 1 0 0 4 0 1 0 0 655 5989 36 8.16e-02 136 4.89e-02 100
>>> 274:MGInterp Level 3 68 1.0 6.7101e-02 1.1 1.75e+07 1.5 2.3e+04 3.9e+02 0.0e+00 0 0 7 0 0 9 1 12 1 0 8455 56175 36 1.55e+00 136 3.03e-01 100
>>> 277:MGInterp Level 4 68 1.0 2.4317e-01 1.0 2.20e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 31062 586044 36 1.99e+01 136 2.82e+00 100
>>
>> 223:MatMultTranspose 136 1.0 2.0702e-01 2.9 6.59e+07 1.2 2.7e+04 1.1e+03 0.0e+00 1 3 7 1 0 7 6 12 3 0 19553
>> 251:MGInterp Level 1 68 1.0 2.8062e-04 1.5 9.79e+04 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 349
>> 254:MGInterp Level 2 68 1.0 6.2506e-0331.9 9.69e+05 0.0 2.1e+03 6.3e+02 0.0e+00 0 0 1 0 0 0 0 1 0 0 3458
>> 257:MGInterp Level 3 68 1.0 4.8159e-02 6.5 9.62e+06 1.5 2.5e+04 4.2e+02 0.0e+00 0 0 6 0 0 1 1 11 1 0 11199
>> 260:MGInterp Level 4 68 1.0 2.5707e-01 1.5 1.23e+08 1.2 2.7e+04 1.9e+03 0.0e+00 1 5 7 1 0 13 12 12 5 0 29294
>>
>> POWER9 still has an edge here.