[petsc-users] Make stream

Fande Kong fdkong.jd at gmail.com
Wed Jun 17 00:21:09 CDT 2020


Thanks, Jed,

This is fascinating. I will check whether I can do anything to get this
kind of improvement as well.

Thanks,

Fande,

On Fri, Jun 12, 2020 at 7:43 PM Jed Brown <jed at jedbrown.org> wrote:

> Jed Brown <jed at jedbrown.org> writes:
>
> > Fande Kong <fdkong.jd at gmail.com> writes:
> >
> >>> There's a lot more to AMG setup than memory bandwidth (architecture
> >>> matters a lot, even between different generation CPUs).
> >>
> >>
> >> Could you elaborate a bit more on this? From my understanding, one big
> >> part of AMG SetUp is RAP, which should be pretty much bandwidth-bound.
> >
> > The RAP isn't "pretty much bandwidth".  See below for some
> > Skylake/POWER9/EPYC results and analysis (copied from an off-list
> > thread).  I'll leave in some other bandwidth comments that may or may
> > not be relevant to you.  The short story is that Skylake and EPYC are
> > both much better than POWER9 at MatPtAP despite POWER9 having similar
> > bandwidth to EPYC and thus being significantly faster than Skylake for
> > MatMult/smoothing.
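> >
> > For reference, the RAP here is the Galerkin triple product A_c = P^T A P,
> > which PETSc forms with MatPtAP.  A minimal sketch of the call, assuming
> > Mat A (fine-grid operator) and Mat P (prolongation) are already assembled,
> > with error checking omitted:
> >
> >    #include <petscmat.h>
> >
> >    Mat Ac;                                             /* coarse operator A_c = P^T A P */
> >    MatPtAP(A,P,MAT_INITIAL_MATRIX,PETSC_DEFAULT,&Ac);  /* symbolic + numeric setup */
> >    /* later setups with unchanged sparsity reuse the symbolic phase: */
> >    MatPtAP(A,P,MAT_REUSE_MATRIX,PETSC_DEFAULT,&Ac);
> >    MatDestroy(&Ac);
> >
> > The MatPtAPSymbolic/MatPtAPNumeric events in the logs below are exactly
> > these two phases.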
> >
> >
> > Jed Brown <jed at jedbrown.org> writes:
> >
> >> I'm attaching a log from my machine (Noether), which is 2-socket EPYC
> >> 7452 (32 cores each).  Each socket has 8xDDR4-3200 and 128 MB of L3
> >> cache.  This is the same node architecture as the new BER/E3SM machine
> >> being installed at Argonne (though that one will probably have
> >> higher clocks and/or more cores per socket).  Note that these CPUs are
> >> about $2k each while Skylake 8180 are about $10k.
> >>
> >> Some excerpts/comments below.
> >>
> >
> >  [...]
> >
> >  In addition to the notes below, I'd like to call out how important
> >  streaming stores are on EPYC.  With vanilla code or _mm256_store_pd, we
> >  get the following performance
> >
> >    $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> >    Copy 162609.2392   Scale 159119.8259   Add 174687.6250   Triad 175840.1587
> >
> >  but replacing _mm256_store_pd with _mm256_stream_pd gives this
> >
> >    $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> >    Copy 259951.9936   Scale 259381.0589   Add 250216.3389   Triad 249292.9701
>
> I turned on NPS4 (a BIOS setting that creates a NUMA node for each pair
> of memory channels) and get a modest performance boost.
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
>
> Copy 289645.3776   Scale 289186.2783   Add 273220.0133   Triad 272911.2263
>
> On this architecture, best performance comes from one process per 4-core
> CCX (shared L3).
>
> $ mpiexec -n 16 --bind-to core --map-by core:4 src/benchmarks/streams/MPIVersion
>
> Copy 300704.8859   Scale 304556.3380   Add 295970.1132   Triad 298891.3821
>
> >  This is just preposterously huge, but very repeatable using gcc and
> >  clang, and confirmed by inspecting the assembly.  This suggests that it
> >  would be useful for vector kernels to have streaming and non-streaming
> >  variants.  That is, if I drop the vector length by a factor of 20 (so the
> >  working set is 2.3 MB/core instead of 46 MB in the default version), then
> >  we get 2.4 TB/s Triad with _mm256_store_pd:
> >
> >    $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> >    Copy 2159915.7058   Scale 2212671.7087   Add 2414758.2757   Triad 2402671.1178
> >
> >  and a thoroughly embarrassing 353 GB/s with _mm256_stream_pd:
> >
> >    $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> >    Copy 235934.6653   Scale 237446.8507   Add 352805.7288   Triad 352992.9692
> >
> >
> >  I don't know a good way to automatically determine whether to expect the
> >  memory to be in cache, but we could make it a global (or per-object)
> >  run-time selection.
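> >
> >  As a concrete illustration, here is a minimal sketch of the two triad
> >  variants compared above, plus the kind of run-time selection suggested
> >  (AVX2 intrinsics; arrays are assumed 32-byte aligned with n a multiple of
> >  4, and the expect_in_cache flag is hypothetical):
> >
> >    #include <immintrin.h>
> >    #include <stddef.h>
> >
> >    /* Triad: a[i] = b[i] + scalar*c[i] */
> >    static void triad_store(double *a, const double *b, const double *c,
> >                            double scalar, size_t n)
> >    {
> >      __m256d s = _mm256_set1_pd(scalar);
> >      for (size_t i = 0; i < n; i += 4)
> >        _mm256_store_pd(&a[i],                /* regular store: a lands in cache */
> >                        _mm256_add_pd(_mm256_load_pd(&b[i]),
> >                                      _mm256_mul_pd(s, _mm256_load_pd(&c[i]))));
> >    }
> >
> >    static void triad_stream(double *a, const double *b, const double *c,
> >                             double scalar, size_t n)
> >    {
> >      __m256d s = _mm256_set1_pd(scalar);
> >      for (size_t i = 0; i < n; i += 4)
> >        _mm256_stream_pd(&a[i],               /* non-temporal store: bypasses cache */
> >                         _mm256_add_pd(_mm256_load_pd(&b[i]),
> >                                       _mm256_mul_pd(s, _mm256_load_pd(&c[i]))));
> >      _mm_sfence();                           /* order streaming stores before later reads */
> >    }
> >
> >    static void triad(double *a, const double *b, const double *c,
> >                      double scalar, size_t n, int expect_in_cache)
> >    {
> >      if (expect_in_cache) triad_store(a, b, c, scalar, n);   /* small working set */
> >      else                 triad_stream(a, b, c, scalar, n);  /* streaming traffic */
> >    }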
> >
> >> Jed Brown <jed at jedbrown.org> writes:
> >>
> >>> "Smith, Barry F." <bsmith at mcs.anl.gov> writes:
> >>>
> >>>>    Thanks. The PowerPC is pretty crappy compared to Skylake.
> >>>
> >>> Compare the MGSmooth times.  The POWER9 is faster than the Skylake
> >>> because it has more memory bandwidth.
> >>>
> >>> $ rg 'MGInterp Level 4|MGSmooth Level 4' ex56*
> >>> ex56-JLSE-skylake-56ranks-converged.txt
> >>> 254:MGSmooth Level 4      68 1.0 1.8808e+00 1.2 7.93e+08 1.3 3.6e+04 1.9e+04 3.4e+01  8 29 10 16  3  62 60 18 54 25 22391
> >>> 256:MGInterp Level 4      68 1.0 4.0043e-01 1.8 1.45e+08 1.3 2.2e+04 2.5e+03 0.0e+00  1  5  6  1  0   9 11 11  4  0 19109
> >>>
> >>> ex56-summit-cpu-36ranks-converged.txt
> >>> 265:MGSmooth Level 4      68 1.0 1.1531e+00 1.1 1.22e+09 1.2 2.3e+04 2.6e+04 3.4e+01  3 29  7 13  3  61 60 12 54 25 36519       0      0 0.00e+00    0 0.00e+00  0
> >>> 267:MGInterp Level 4      68 1.0 2.0749e-01 1.1 2.23e+08 1.2 1.4e+04 3.4e+03 0.0e+00  0  5  4  1  0  11 11  7  4  0 36925       0      0 0.00e+00    0 0.00e+00  0
> >>>
> >>> ex56-summit-gpu-24ranks-converged.txt
> >>> 275:MGSmooth Level 4      68 1.0 1.4499e-01 1.2 1.85e+09 1.2 1.0e+04 5.3e+04 3.4e+01  0 29  7 13  3  26 60 12 55 25 299156   940881    115 2.46e+01  116 8.64e+01 100
> >>> 277:MGInterp Level 4      68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00  0  5  4  1  0  33 11  7  4  0 42715   621223     36 2.98e+01  136 3.95e+00 100
> >>>
> >>> ex56-summit-gpu-36ranks-converged.txt
> >>> 275:MGSmooth Level 4      68 1.0 1.4877e-01 1.2 1.25e+09 1.2 2.3e+04 2.6e+04 3.4e+01  0 29  7 13  3  19 60 12 54 25 291548   719522    115 1.83e+01  116 5.80e+01 100
> >>> 277:MGInterp Level 4      68 1.0 2.4317e-01 1.0 2.20e+08 1.2 1.4e+04 3.4e+03 0.0e+00  0  5  4  1  0  33 11  7  4  0 31062   586044     36 1.99e+01  136 2.82e+00 100
> >>
> >> 258:MGSmooth Level 4      68 1.0 9.6950e-01 1.3 6.15e+08 1.3 4.0e+04 1.4e+04 2.0e+00  6 28 10 15  0  59 59 18 54 25 39423
> >> 260:MGInterp Level 4      68 1.0 2.5707e-01 1.5 1.23e+08 1.2 2.7e+04 1.9e+03 0.0e+00  1  5  7  1  0  13 12 12  5  0 29294
> >>
> >> EPYC is faster than POWER9, which is faster than Skylake.
> >>
> >>>
> >>> The Skylake is a lot faster at PtAP.  It'd be interesting to better
> >>> understand why.  Perhaps it has to do with caching or aggressiveness of
> >>> out-of-order execution.
> >>>
> >>> $ rg 'PtAP' ex56*
> >>> ex56-JLSE-skylake-56ranks-converged.txt
> >>> 164:MatPtAP                4 1.0 1.4214e+00 1.0 3.94e+08 1.5 1.1e+04 7.4e+04 4.4e+01  6 13  3 20  4   8 28  8 39  5 13754
> >>> 165:MatPtAPSymbolic        4 1.0 8.3981e-01 1.0 0.00e+00 0.0 6.5e+03 7.3e+04 2.8e+01  4  0  2 12  2   5  0  5 23  3     0
> >>> 166:MatPtAPNumeric         4 1.0 5.8402e-01 1.0 3.94e+08 1.5 4.5e+03 7.5e+04 1.6e+01  2 13  1  8  1   3 28  3 16  2 33474
> >>>
> >>> ex56-summit-cpu-36ranks-converged.txt
> >>> 164:MatPtAP                4 1.0 3.9077e+00 1.0 5.89e+08 1.4 1.6e+04 7.4e+04 4.4e+01  9 13  5 26  4  11 28 12 46  5  4991       0      0 0.00e+00    0 0.00e+00  0
> >>> 165:MatPtAPSymbolic        4 1.0 1.9525e+00 1.0 0.00e+00 0.0 1.2e+04 7.3e+04 2.8e+01  5  0  4 19  3   5  0  9 34  3     0       0      0 0.00e+00    0 0.00e+00  0
> >>> 166:MatPtAPNumeric         4 1.0 1.9621e+00 1.0 5.89e+08 1.4 4.0e+03 7.5e+04 1.6e+01  5 13  1  7  1   5 28  3 12  2  9940       0      0 0.00e+00    0 0.00e+00  0
> >>>
> >>> ex56-summit-gpu-24ranks-converged.txt
> >>> 167:MatPtAP                4 1.0 5.7210e+00 1.0 8.48e+08 1.3 7.5e+03 1.3e+05 4.4e+01  8 13  5 25  4  11 28 12 46  5  3415       0     16 3.36e+01    4 6.30e-02  0
> >>> 168:MatPtAPSymbolic        4 1.0 2.8717e+00 1.0 0.00e+00 0.0 5.5e+03 1.3e+05 2.8e+01  4  0  4 19  3   5  0  9 34  3     0       0      0 0.00e+00    0 0.00e+00  0
> >>> 169:MatPtAPNumeric         4 1.0 2.8537e+00 1.0 8.48e+08 1.3 2.0e+03 1.3e+05 1.6e+01  4 13  1  7  1   5 28  3 12  2  6846       0     16 3.36e+01    4 6.30e-02  0
> >>>
> >>> ex56-summit-gpu-36ranks-converged.txt
> >>> 167:MatPtAP                4 1.0 4.0340e+00 1.0 5.89e+08 1.4 1.6e+04 7.4e+04 4.4e+01  8 13  5 26  4  11 28 12 46  5  4835       0     16 2.30e+01    4 5.18e-02  0
> >>> 168:MatPtAPSymbolic        4 1.0 2.0355e+00 1.0 0.00e+00 0.0 1.2e+04 7.3e+04 2.8e+01  4  0  4 19  3   5  0  9 34  3     0       0      0 0.00e+00    0 0.00e+00  0
> >>> 169:MatPtAPNumeric         4 1.0 2.0050e+00 1.0 5.89e+08 1.4 4.0e+03 7.5e+04 1.6e+01  4 13  1  7  1   5 28  3 12  2  9728       0     16 2.30e+01    4 5.18e-02  0
> >>
> >> 153:MatPtAPSymbolic        4 1.0 7.6053e-01 1.0 0.00e+00 0.0 7.6e+03 5.8e+04 2.8e+01  5  0  2 12  2   6  0  5 22  3     0
> >> 154:MatPtAPNumeric         4 1.0 6.5172e-01 1.0 3.21e+08 1.4 6.4e+03 4.8e+04 2.4e+01  4 14  2  8  2   5 27  4 16  2 28861
> >>
> >> EPYC is similar to Skylake here.
> >>
> >>> I'd really like to compare an EPYC for these operations.  I bet it's
> >>> pretty good.  (More bandwidth than Skylake, bigger caches, but no
> >>> AVX512.)
> >>>
> >>>>    So the biggest consumer is MatPtAP; I guess that should be done first.
> >>>>
> >>>>    It would be good to have these results exclude the Jacobian and Function evaluation, which really dominate the time and add clutter, making it difficult to see the problems with the rest of SNESSolve.
> >>>>
> >>>>
> >>>>    Did you notice:
> >>>>
> >>>> MGInterp Level 4      68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00  0  5  4  1  0  33 11  7  4  0 42715   621223     36 2.98e+01  136 3.95e+00 100
> >>>>
> >>>> it is terrible! Well over half of the KSPSolve time is in this one relatively minor routine. All of the interps are terribly slow. Is it related to the transpose multiply or something?
> >>>
> >>> Yes, it's definitely the MatMultTranspose, which must be about 3x more
> >>> expensive than restriction even on the CPU.  PCMG/PCGAMG should
> >>> explicitly transpose (unless the user sets an option to aggressively
> >>> minimize memory usage).
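> >>>
> >>> A minimal sketch of that fix, assuming pc is the PCMG/PCGAMG
> >>> preconditioner with its levels already set up (error checking omitted):
> >>>
> >>>    PetscInt nlevels;
> >>>    PCMGGetLevels(pc,&nlevels);
> >>>    for (PetscInt l=1; l<nlevels; l++) {
> >>>      Mat P, R;
> >>>      PCMGGetInterpolation(pc,l,&P);          /* P interpolates level l-1 -> l */
> >>>      MatTranspose(P,MAT_INITIAL_MATRIX,&R);  /* form R = P^T explicitly */
> >>>      PCMGSetRestriction(pc,l,R);             /* restriction is now a forward MatMult */
> >>>      MatDestroy(&R);                         /* the PC holds its own reference */
> >>>    }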
> >>>
> >>> $ rg 'MGInterp|MultTrans' ex56*
> >>> ex56-JLSE-skylake-56ranks-converged.txt
> >>> 222:MatMultTranspose     136 1.0 3.5105e-01 3.7 7.91e+07 1.3 2.5e+04 1.3e+03 0.0e+00  1  3  7  1  0   5  6 13  3  0 11755
> >>> 247:MGInterp Level 1      68 1.0 3.3894e-04 2.2 2.35e+05 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   693
> >>> 250:MGInterp Level 2      68 1.0 1.1212e-0278.0 1.17e+06 0.0 1.8e+03 7.7e+02 0.0e+00  0  0  1  0  0   0  0  1  0  0  2172
> >>> 253:MGInterp Level 3      68 1.0 6.7105e-02 5.3 1.23e+07 1.8 2.7e+04 4.2e+02 0.0e+00  0  0  8  0  0   1  1 14  1  0  8594
> >>> 256:MGInterp Level 4      68 1.0 4.0043e-01 1.8 1.45e+08 1.3 2.2e+04 2.5e+03 0.0e+00  1  5  6  1  0   9 11 11  4  0 19109
> >>>
> >>> ex56-summit-cpu-36ranks-converged.txt
> >>> 229:MatMultTranspose     136 1.0 1.4832e-01 1.4 1.21e+08 1.2 1.9e+04 1.5e+03 0.0e+00  0  3  6  1  0   6  6 10  3  0 27842       0      0 0.00e+00    0 0.00e+00  0
> >>> 258:MGInterp Level 1      68 1.0 2.9145e-04 1.5 1.08e+05 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   370       0      0 0.00e+00    0 0.00e+00  0
> >>> 261:MGInterp Level 2      68 1.0 5.7095e-03 1.5 9.16e+05 2.5 2.4e+03 7.1e+02 0.0e+00  0  0  1  0  0   0  0  1  0  0  4093       0      0 0.00e+00    0 0.00e+00  0
> >>> 264:MGInterp Level 3      68 1.0 3.5654e-02 2.8 1.77e+07 1.5 2.3e+04 3.9e+02 0.0e+00  0  0  7  0  0   1  1 12  1  0 16095       0      0 0.00e+00    0 0.00e+00  0
> >>> 267:MGInterp Level 4      68 1.0 2.0749e-01 1.1 2.23e+08 1.2 1.4e+04 3.4e+03 0.0e+00  0  5  4  1  0  11 11  7  4  0 36925       0      0 0.00e+00    0 0.00e+00  0
> >>>
> >>> ex56-summit-gpu-24ranks-converged.txt
> >>> 236:MatMultTranspose     136 1.0 2.1445e-01 1.0 1.72e+08 1.2 9.5e+03 2.6e+03 0.0e+00  0  3  6  1  0  39  6 11  3  0 18719   451131      8 3.11e+01  272 2.19e+00 100
> >>> 268:MGInterp Level 1      68 1.0 4.0388e-03 2.8 1.08e+05 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    27      79     37 5.84e-04   68 6.80e-05 100
> >>> 271:MGInterp Level 2      68 1.0 2.9033e-02 2.9 1.25e+06 1.9 1.6e+03 7.8e+02 0.0e+00  0  0  1  0  0   5  0  2  0  0   812   11539     36 1.14e-01  136 5.41e-02 100
> >>> 274:MGInterp Level 3      68 1.0 4.9503e-02 1.1 2.50e+07 1.4 1.1e+04 6.3e+02 0.0e+00  0  0  7  0  0   9  1 13  1  0 11476   100889     36 2.29e+00  136 3.74e-01 100
> >>> 277:MGInterp Level 4      68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00  0  5  4  1  0  33 11  7  4  0 42715   621223     36 2.98e+01  136 3.95e+00 100
> >>>
> >>> ex56-summit-gpu-36ranks-converged.txt
> >>> 236:MatMultTranspose     136 1.0 2.9692e-01 1.0 1.17e+08 1.2 1.9e+04 1.5e+03 0.0e+00  1  3  6  1  0  40  6 10  3  0 13521   336701      8 2.08e+01  272 1.59e+00 100
> >>> 268:MGInterp Level 1      68 1.0 3.8752e-03 2.5 1.03e+05 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    27      79     37 3.95e-04   68 4.53e-05 100
> >>> 271:MGInterp Level 2      68 1.0 3.5465e-02 2.2 9.12e+05 2.5 2.4e+03 7.1e+02 0.0e+00  0  0  1  0  0   4  0  1  0  0   655    5989     36 8.16e-02  136 4.89e-02 100
> >>> 274:MGInterp Level 3      68 1.0 6.7101e-02 1.1 1.75e+07 1.5 2.3e+04 3.9e+02 0.0e+00  0  0  7  0  0   9  1 12  1  0  8455   56175     36 1.55e+00  136 3.03e-01 100
> >>> 277:MGInterp Level 4      68 1.0 2.4317e-01 1.0 2.20e+08 1.2 1.4e+04 3.4e+03 0.0e+00  0  5  4  1  0  33 11  7  4  0 31062   586044     36 1.99e+01  136 2.82e+00 100
> >>
> >> 223:MatMultTranspose     136 1.0 2.0702e-01 2.9 6.59e+07 1.2 2.7e+04 1.1e+03 0.0e+00  1  3  7  1  0   7  6 12  3  0 19553
> >> 251:MGInterp Level 1      68 1.0 2.8062e-04 1.5 9.79e+04 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   349
> >> 254:MGInterp Level 2      68 1.0 6.2506e-0331.9 9.69e+05 0.0 2.1e+03 6.3e+02 0.0e+00  0  0  1  0  0   0  0  1  0  0  3458
> >> 257:MGInterp Level 3      68 1.0 4.8159e-02 6.5 9.62e+06 1.5 2.5e+04 4.2e+02 0.0e+00  0  0  6  0  0   1  1 11  1  0 11199
> >> 260:MGInterp Level 4      68 1.0 2.5707e-01 1.5 1.23e+08 1.2 2.7e+04 1.9e+03 0.0e+00  1  5  7  1  0  13 12 12  5  0 29294
> >>
> >> POWER9 still has an edge here.
>