[petsc-users] Understanding streams test on AMD EPYC 7502
Blaise A Bourdin
bourdin at lsu.edu
Fri Apr 16 21:59:06 CDT 2021
Thanks for the reference timing. I can use this to talk to the vendor (or switch vendors…).
I am on a 2-socket system. It looks like the node the vendor built for me has 4 DIMMs, possibly all attached to the same socket — numactl reports node 1 with no memory at all:
[amduser@gigi ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 0 size: 257877 MB
node 0 free: 225820 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10
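One way to confirm the physical DIMM population (assuming dmidecode is installed and I can get root on the box) would be:

[amduser@gigi ~]$ sudo dmidecode -t memory | grep -E 'Locator:|Size:'

Populated slots report a size (e.g. "Size: 64 GB") while empty ones show "No Module Installed", and the Locator field says which socket/channel each slot sits on.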
Regards,
Blaise
On Apr 16, 2021, at 5:50 PM, Jed Brown <jed at jedbrown.org> wrote:
Junchao Zhang <junchao.zhang at gmail.com> writes:
Why do I see that the max bandwidth of the EPYC 7502 is 200 GB/s
(https://www.cpu-world.com/CPUs/Zen/AMD-EPYC%207502.html)?
That's theoretical peak per socket, but he's probably on a 2-socket system.
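(For reference, Rome has 8 DDR4-3200 channels per socket: 8 channels x 3200 MT/s x 8 bytes = 204.8 GB/s per socket, so 409.6 GB/s theoretical for a 2-socket node.)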
Your bandwidth is around 1/8 of that max. Is it because your machine has
only one DIMM, and thus uses only one memory channel?
Here's an lstopo for my 2x7452 system with the NPS4 BIOS setting.
[attachment: noether-nps4-lstopo.png — lstopo topology of the 2x7452 node]
And running the standard benchmark. Note that the highest performance is achieved with 16 ranks, which is one process per memory channel (2 sockets x 8 channels). The drops just past each multiple of 8 ranks (9, 17, 25, ...) are likely because --map-by numa deals ranks round-robin over the 8 NPS4 domains, so those counts load some domains more heavily than others.
$ make stream MPI_BINDING='--bind-to core --map-by numa'
mpicc -o MPIVersion.o -c -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -O3 -funsafe-math-optimizations -march=native -I/projects/petsc/include -I/projects/petsc/mpich-O/include -I/opt/rocm/include `pwd`/MPIVersion.c
Running streams with 'mpiexec --bind-to core --map-by numa' using 'NPMAX=64'
1 22833.4179 Rate (MB/s)
2 45445.0356 Rate (MB/s) 1.99029
3 67893.5186 Rate (MB/s) 2.97343
4 90691.7604 Rate (MB/s) 3.97189
5 113578.2249 Rate (MB/s) 4.97421
6 134311.8951 Rate (MB/s) 5.88226
7 157968.6688 Rate (MB/s) 6.91832
8 179669.5008 Rate (MB/s) 7.86871
9 116836.9905 Rate (MB/s) 5.11693
10 129766.7959 Rate (MB/s) 5.6832
11 142340.1754 Rate (MB/s) 6.23386
12 155720.9245 Rate (MB/s) 6.81987
13 167771.0425 Rate (MB/s) 7.34762
14 179843.1413 Rate (MB/s) 7.87632
15 193897.7955 Rate (MB/s) 8.49185
16 206523.2769 Rate (MB/s) 9.04479
17 142789.6490 Rate (MB/s) 6.25354
18 151205.7782 Rate (MB/s) 6.62213
19 158845.6213 Rate (MB/s) 6.95672
20 167435.0701 Rate (MB/s) 7.3329
21 175731.9719 Rate (MB/s) 7.69627
22 183984.5220 Rate (MB/s) 8.05769
23 192058.1005 Rate (MB/s) 8.41128
24 200761.0267 Rate (MB/s) 8.79243
25 155965.3011 Rate (MB/s) 6.83058
26 161673.4841 Rate (MB/s) 7.08057
27 167871.3408 Rate (MB/s) 7.35201
28 173951.9592 Rate (MB/s) 7.61831
29 179456.7696 Rate (MB/s) 7.8594
30 186474.7412 Rate (MB/s) 8.16675
31 191749.9724 Rate (MB/s) 8.39778
32 198041.2958 Rate (MB/s) 8.67332
33 164697.4378 Rate (MB/s) 7.21301
34 168645.1579 Rate (MB/s) 7.3859
35 173776.6503 Rate (MB/s) 7.61063
36 179109.2764 Rate (MB/s) 7.84418
37 183488.6248 Rate (MB/s) 8.03597
38 188954.7149 Rate (MB/s) 8.27536
39 193140.8746 Rate (MB/s) 8.4587
40 198804.6800 Rate (MB/s) 8.70675
41 169620.7845 Rate (MB/s) 7.42863
42 173306.4149 Rate (MB/s) 7.59004
43 177089.0440 Rate (MB/s) 7.7557
44 181301.7744 Rate (MB/s) 7.9402
45 184888.0697 Rate (MB/s) 8.09726
46 189267.4148 Rate (MB/s) 8.28906
47 193386.6666 Rate (MB/s) 8.46946
48 197338.0962 Rate (MB/s) 8.64252
49 171754.3212 Rate (MB/s) 7.52207
50 174416.7410 Rate (MB/s) 7.63867
51 177590.9451 Rate (MB/s) 7.77768
52 181843.2566 Rate (MB/s) 7.96391
53 184956.9725 Rate (MB/s) 8.10028
54 188459.9627 Rate (MB/s) 8.2537
55 191294.8101 Rate (MB/s) 8.37785
56 195167.9061 Rate (MB/s) 8.54747
57 173077.5973 Rate (MB/s) 7.58002
58 175707.2648 Rate (MB/s) 7.69519
59 178524.3544 Rate (MB/s) 7.81856
60 181446.2196 Rate (MB/s) 7.94653
61 184176.0972 Rate (MB/s) 8.06608
62 187132.5388 Rate (MB/s) 8.19556
63 190094.3249 Rate (MB/s) 8.32527
64 192888.0887 Rate (MB/s) 8.44763
Here's a hacked version that uses nontemporal stores, just to show that the 300 GB/s you see in some publications is "real".
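The change amounts to replacing the Triad store with a streaming store. A minimal sketch of the idea (not the exact patch; assumes AVX, 32-byte-aligned arrays, and n divisible by 4):

#include <immintrin.h>

/* STREAM Triad a[i] = b[i] + scalar*c[i], writing a[] with
   nontemporal stores so the destination lines are not read
   into cache before being overwritten. */
void triad_nt(double *a, const double *b, const double *c,
              double scalar, long n)
{
  __m256d s = _mm256_set1_pd(scalar);
  for (long i = 0; i < n; i += 4) {
    __m256d bv = _mm256_load_pd(&b[i]);
    __m256d cv = _mm256_load_pd(&c[i]);
    _mm256_stream_pd(&a[i], _mm256_add_pd(bv, _mm256_mul_pd(s, cv)));
  }
  _mm_sfence(); /* order the streaming stores before any later reads */
}

With ordinary stores, write-allocate causes a read-for-ownership of each destination line, so plain Triad moves four words per element (read b, read c, read a, write a) while the streaming version moves three — roughly the 200 vs 300 GB/s difference.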
$ make stream MPI_BINDING='--bind-to core --map-by numa' NPMAX=16
mpicc -o MPIVersion.o -c -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -O3 -funsafe-math-optimizations -march=native -I/projects/petsc/include -I/projects/petsc/mpich-O/include -I/opt/rocm/include `pwd`/MPIVersion.c
Running streams with 'mpiexec --bind-to core --map-by numa' using 'NPMAX=16'
Copy 33486.2539 Scale 34071.3326 Add 32054.4926 Triad 31648.2821
Copy 66152.1368 Scale 67692.2040 Add 63900.9276 Triad 63483.8342
Copy 99017.3661 Scale 100531.6449 Add 95109.5224 Triad 94124.0088
Copy 132296.5106 Scale 132442.5912 Add 127105.7793 Triad 125468.3513
Copy 164162.0199 Scale 167233.3935 Add 158407.6986 Triad 156593.6716
Copy 196532.3832 Scale 198430.1791 Add 189974.2218 Triad 188255.0783
Copy 229330.8877 Scale 227943.5409 Add 220342.2619 Triad 215985.3785
Copy 262284.7885 Scale 263016.6022 Add 251791.8289 Triad 248723.9313
Copy 171321.9641 Scale 172641.8351 Add 169675.2284 Triad 168820.0418
Copy 180864.2214 Scale 182373.5147 Add 187165.1475 Triad 187378.3674
Copy 199656.4142 Scale 199780.0876 Add 204114.6705 Triad 204703.4734
Copy 218084.4093 Scale 219094.7833 Add 223458.4367 Triad 222833.6622
Copy 236477.1224 Scale 236699.9477 Add 240783.0674 Triad 240799.8947
Copy 253032.9327 Scale 254071.3971 Add 260259.9976 Triad 260421.9921
Copy 272682.9868 Scale 272881.0838 Add 279639.9825 Triad 278932.2833
Copy 290096.2550 Scale 287402.4978 Add 297025.8896 Triad 295550.2586
--Junchao Zhang
On Fri, Apr 16, 2021 at 3:27 PM Jed Brown <jed at jedbrown.org> wrote:
Blaise A Bourdin <bourdin at lsu.edu> writes:
Hi,
I am test-driving hardware for a new machine for my group and having a
hard time making sense of the output of the stream test.
I am attaching the results and my reference (Xeon 8260 nodes on QueenBee
3 at LONI).
If I understand correctly, on the AMD node, the memory bandwidth is
saturated with a single core. Is this expected?
The comparison is not totally fair in that QB3 uses Intel MPI and the Intel
compilers, whereas the AMD node uses mvapich2, which I compiled with the
following options: ./configure
--prefix=/home/amduser/Development/mvapich2-2.3.5-gcc9.3
--with-device=ch3:nemesis:tcp --with-rdma=gen2 --enable-cxx --enable-romio
--enable-fast=all --enable-g=dbg --enable-shared-libs=gcc --enable-shared
Am I doing something wrong on the AMD node?
It looks like it's oversubscribing some cores rather than spreading the
processes over the node. You should get around 200 GB/s on this node without
using streaming instructions (closer to 300 GB/s with them, but that isn't
representative of real-world code), and slightly less if you don't have NPS4
activated.
You can check your MPI docs and use make MPI_BINDING='--bind-to core', for
example.
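With mvapich2, placement is controlled through environment variables rather than mpiexec flags; something like the following (variable names from the mvapich2 user guide — check the version you built):

$ MV2_ENABLE_AFFINITY=1 MV2_CPU_BINDING_POLICY=scatter \
  MV2_CPU_BINDING_LEVEL=numanode mpiexec -n 16 ./MPIVersion

The scatter policy spreads ranks across sockets/NUMA domains instead of packing them onto the first socket.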
--
A.K. & Shirley Barton Professor of Mathematics
Adjunct Professor of Mechanical Engineering
Adjunct of the Center for Computation & Technology
Louisiana State University, Lockett Hall Room 344, Baton Rouge, LA 70803, USA
Tel. +1 (225) 578 1612, Fax +1 (225) 578 4276 Web http://www.math.lsu.edu/~bourdin