[petsc-users] Understanding streams test on AMD EPYC 7502
Blaise A Bourdin
bourdin at lsu.edu
Fri Apr 16 21:59:06 CDT 2021
Thanks for the reference timing. I can use this to talk to the vendor (or switch vendors…).
I am on a 2-socket system. It looks like the node the vendor built for me has 4 DIMMs, possibly all attached to the same socket — numactl reports node 1 with no memory at all:
[amduser@gigi ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 0 size: 257877 MB
node 0 free: 225820 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10
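One way to confirm the physical DIMM population (assuming dmidecode is installed and I can get root on the box) would be:

[amduser@gigi ~]$ sudo dmidecode -t memory | grep -E 'Locator:|Size:'

Populated slots report a size (e.g. "Size: 64 GB") while empty ones show "No Module Installed", and the Locator field says which socket/channel each slot sits on.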
Regards,
Blaise
On Apr 16, 2021, at 5:50 PM, Jed Brown <jed at jedbrown.org> wrote:
Junchao Zhang <junchao.zhang at gmail.com> writes:
Why do I see that the max bandwidth of the EPYC 7502 is 200 GB/s
(https://www.cpu-world.com/CPUs/Zen/AMD-EPYC%207502.html)?
That's theoretical peak per socket, but he's probably on a 2-socket system.
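(For reference, Rome has 8 DDR4-3200 channels per socket: 8 channels x 3200 MT/s x 8 bytes = 204.8 GB/s per socket, so 409.6 GB/s theoretical for a 2-socket node.)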
Your bandwidth is around 1/8 of that max. Is it because your machine has
only one DIMM, and thus uses only one memory channel?
Here's an lstopo for my 2x7452 system with the NPS4 BIOS setting.
[attachment: noether-nps4-lstopo.png — lstopo topology of the 2x7452 node]
And running the standard benchmark. Note that the highest performance is achieved with 16 ranks, which is one process per memory channel (2 sockets x 8 channels). The drops just past each multiple of 8 ranks (9, 17, 25, ...) are likely because --map-by numa deals ranks round-robin over the 8 NPS4 domains, so those counts load some domains more heavily than others.
$ make stream MPI_BINDING='--bind-to core --map-by numa'
mpicc -o MPIVersion.o -c -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -O3 -funsafe-math-optimizations -march=native -I/projects/petsc/include -I/projects/petsc/mpich-O/include -I/opt/rocm/include `pwd`/MPIVersion.c
Running streams with 'mpiexec --bind-to core --map-by numa' using 'NPMAX=64'
1 22833.4179 Rate (MB/s)
2 45445.0356 Rate (MB/s) 1.99029
3 67893.5186 Rate (MB/s) 2.97343
4 90691.7604 Rate (MB/s) 3.97189
5 113578.2249 Rate (MB/s) 4.97421
6 134311.8951 Rate (MB/s) 5.88226
7 157968.6688 Rate (MB/s) 6.91832
8 179669.5008 Rate (MB/s) 7.86871
9 116836.9905 Rate (MB/s) 5.11693
10 129766.7959 Rate (MB/s) 5.6832
11 142340.1754 Rate (MB/s) 6.23386
12 155720.9245 Rate (MB/s) 6.81987
13 167771.0425 Rate (MB/s) 7.34762
14 179843.1413 Rate (MB/s) 7.87632
15 193897.7955 Rate (MB/s) 8.49185
16 206523.2769 Rate (MB/s) 9.04479
17 142789.6490 Rate (MB/s) 6.25354
18 151205.7782 Rate (MB/s) 6.62213
19 158845.6213 Rate (MB/s) 6.95672
20 167435.0701 Rate (MB/s) 7.3329
21 175731.9719 Rate (MB/s) 7.69627
22 183984.5220 Rate (MB/s) 8.05769
23 192058.1005 Rate (MB/s) 8.41128
24 200761.0267 Rate (MB/s) 8.79243
25 155965.3011 Rate (MB/s) 6.83058
26 161673.4841 Rate (MB/s) 7.08057
27 167871.3408 Rate (MB/s) 7.35201
28 173951.9592 Rate (MB/s) 7.61831
29 179456.7696 Rate (MB/s) 7.8594
30 186474.7412 Rate (MB/s) 8.16675
31 191749.9724 Rate (MB/s) 8.39778
32 198041.2958 Rate (MB/s) 8.67332
33 164697.4378 Rate (MB/s) 7.21301
34 168645.1579 Rate (MB/s) 7.3859
35 173776.6503 Rate (MB/s) 7.61063
36 179109.2764 Rate (MB/s) 7.84418
37 183488.6248 Rate (MB/s) 8.03597
38 188954.7149 Rate (MB/s) 8.27536
39 193140.8746 Rate (MB/s) 8.4587
40 198804.6800 Rate (MB/s) 8.70675
41 169620.7845 Rate (MB/s) 7.42863
42 173306.4149 Rate (MB/s) 7.59004
43 177089.0440 Rate (MB/s) 7.7557
44 181301.7744 Rate (MB/s) 7.9402
45 184888.0697 Rate (MB/s) 8.09726
46 189267.4148 Rate (MB/s) 8.28906
47 193386.6666 Rate (MB/s) 8.46946
48 197338.0962 Rate (MB/s) 8.64252
49 171754.3212 Rate (MB/s) 7.52207
50 174416.7410 Rate (MB/s) 7.63867
51 177590.9451 Rate (MB/s) 7.77768
52 181843.2566 Rate (MB/s) 7.96391
53 184956.9725 Rate (MB/s) 8.10028
54 188459.9627 Rate (MB/s) 8.2537
55 191294.8101 Rate (MB/s) 8.37785
56 195167.9061 Rate (MB/s) 8.54747
57 173077.5973 Rate (MB/s) 7.58002
58 175707.2648 Rate (MB/s) 7.69519
59 178524.3544 Rate (MB/s) 7.81856
60 181446.2196 Rate (MB/s) 7.94653
61 184176.0972 Rate (MB/s) 8.06608
62 187132.5388 Rate (MB/s) 8.19556
63 190094.3249 Rate (MB/s) 8.32527
64 192888.0887 Rate (MB/s) 8.44763
Here's a hacked version that uses nontemporal stores, just to show that the 300 GB/s you see in some publications is "real".
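The change amounts to replacing the Triad store with a streaming store. A minimal sketch of the idea (not the exact patch; assumes AVX, 32-byte-aligned arrays, and n divisible by 4):

#include <immintrin.h>

/* STREAM Triad a[i] = b[i] + scalar*c[i], writing a[] with
   nontemporal stores so the destination lines are not read
   into cache before being overwritten. */
void triad_nt(double *a, const double *b, const double *c,
              double scalar, long n)
{
  __m256d s = _mm256_set1_pd(scalar);
  for (long i = 0; i < n; i += 4) {
    __m256d bv = _mm256_load_pd(&b[i]);
    __m256d cv = _mm256_load_pd(&c[i]);
    _mm256_stream_pd(&a[i], _mm256_add_pd(bv, _mm256_mul_pd(s, cv)));
  }
  _mm_sfence(); /* order the streaming stores before any later reads */
}

With ordinary stores, write-allocate causes a read-for-ownership of each destination line, so plain Triad moves four words per element (read b, read c, read a, write a) while the streaming version moves three — roughly the 200 vs 300 GB/s difference.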
$ make stream MPI_BINDING='--bind-to core --map-by numa' NPMAX=16
mpicc -o MPIVersion.o -c -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -O3 -funsafe-math-optimizations -march=native -I/projects/petsc/include -I/projects/petsc/mpich-O/include -I/opt/rocm/include `pwd`/MPIVersion.c
Running streams with 'mpiexec --bind-to core --map-by numa' using 'NPMAX=16'
Copy 33486.2539 Scale 34071.3326 Add 32054.4926 Triad 31648.2821
Copy 66152.1368 Scale 67692.2040 Add 63900.9276 Triad 63483.8342
Copy 99017.3661 Scale 100531.6449 Add 95109.5224 Triad 94124.0088
Copy 132296.5106 Scale 132442.5912 Add 127105.7793 Triad 125468.3513
Copy 164162.0199 Scale 167233.3935 Add 158407.6986 Triad 156593.6716
Copy 196532.3832 Scale 198430.1791 Add 189974.2218 Triad 188255.0783
Copy 229330.8877 Scale 227943.5409 Add 220342.2619 Triad 215985.3785
Copy 262284.7885 Scale 263016.6022 Add 251791.8289 Triad 248723.9313
Copy 171321.9641 Scale 172641.8351 Add 169675.2284 Triad 168820.0418
Copy 180864.2214 Scale 182373.5147 Add 187165.1475 Triad 187378.3674
Copy 199656.4142 Scale 199780.0876 Add 204114.6705 Triad 204703.4734
Copy 218084.4093 Scale 219094.7833 Add 223458.4367 Triad 222833.6622
Copy 236477.1224 Scale 236699.9477 Add 240783.0674 Triad 240799.8947
Copy 253032.9327 Scale 254071.3971 Add 260259.9976 Triad 260421.9921
Copy 272682.9868 Scale 272881.0838 Add 279639.9825 Triad 278932.2833
Copy 290096.2550 Scale 287402.4978 Add 297025.8896 Triad 295550.2586
--Junchao Zhang
On Fri, Apr 16, 2021 at 3:27 PM Jed Brown <jed at jedbrown.org> wrote:
Blaise A Bourdin <bourdin at lsu.edu> writes:
Hi,
I am test-driving hardware for a new machine for my group and having a
hard time making sense of the output of the stream test.
I am attaching the results and my reference (Xeon 8260 nodes on QueenBee
3 at LONI).
If I understand correctly, on the AMD node, the memory bandwidth is
saturated with a single core. Is this expected?
The comparison is not totally fair in that QB3 uses Intel MPI and the Intel
compilers, whereas the AMD node uses mvapich2, which I compiled with the
following options: ./configure
--prefix=/home/amduser/Development/mvapich2-2.3.5-gcc9.3
--with-device=ch3:nemesis:tcp --with-rdma=gen2 --enable-cxx --enable-romio
--enable-fast=all --enable-g=dbg --enable-shared-libs=gcc --enable-shared
Am I doing something wrong on the AMD node?
It looks like it's oversubscribing some cores rather than spreading the
processes over the node. You should get around 200 GB/s on this node without
using streaming instructions (closer to 300 GB/s with them, but that isn't
representative of real-world code), and slightly less if you don't have NPS4
activated.
You can check your MPI docs and use make MPI_BINDING='--bind-to core', for
example.
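With mvapich2, placement is controlled through environment variables rather than mpiexec flags; something like the following (variable names from the mvapich2 user guide — check the version you built):

$ MV2_ENABLE_AFFINITY=1 MV2_CPU_BINDING_POLICY=scatter \
  MV2_CPU_BINDING_LEVEL=numanode mpiexec -n 16 ./MPIVersion

The scatter policy spreads ranks across sockets/NUMA domains instead of packing them onto the first socket.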
--
A.K. & Shirley Barton Professor of Mathematics
Adjunct Professor of Mechanical Engineering
Adjunct of the Center for Computation & Technology
Louisiana State University, Lockett Hall Room 344, Baton Rouge, LA 70803, USA
Tel. +1 (225) 578 1612, Fax +1 (225) 578 4276 Web http://www.math.lsu.edu/~bourdin