[petsc-dev] configuring hypre on batch system
Jed Brown
jed at jedbrown.org
Fri Jan 9 17:55:04 CST 2015
Mark Adams <mfadams at lbl.gov> writes:
> I have a test up and running, but hypre and GAMG are running very, very
> slow. The test only has about 100 equations per core. Jed mentioned 20K
> cycles to start an OMP parallel region (really?), which would explain a
> lot. Do I understand that correctly, Jed?
Yes, >20k cycles on KNC is what John McCalpin reports [1]. Somewhat
less on more reasonable architectures like Xeon (which also has a faster
clock rate), but still huge. Cycle counts for my attached test code:
cg.mcs.anl.gov (4x Opteron 6274 @ 2.2 GHz), ICC 13.1.3
$ make -B CC=icc CFLAGS='-std=c99 -fopenmp -fast' omp-test
icc -std=c99 -fopenmp -fast omp-test.c -o omp-test
$ for n in 1 2 4 8 12 16 24 32 48 64; do ./omp-test $n 10000 10 16; done
1 threads, 64 B: Min 647 Max 2611 Avg 649
2 threads, 128 B: Min 6817 Max 12689 Avg 7400
4 threads, 256 B: Min 7602 Max 15105 Avg 8910
8 threads, 512 B: Min 10408 Max 21640 Avg 11769
12 threads, 768 B: Min 13588 Max 22176 Avg 15608
16 threads, 1024 B: Min 15748 Max 26853 Avg 17397
24 threads, 1536 B: Min 19503 Max 32095 Avg 22130
32 threads, 2048 B: Min 21213 Max 36480 Avg 23688
48 threads, 3072 B: Min 25306 Max 613552 Avg 29799
64 threads, 4096 B: Min 106807 Max 47592474 Avg 291975
(The largest size may not be representative because someone's
8-process job was running. The machine was otherwise idle.)
For comparison, we can execute in serial with the same buffer sizes:
$ for n in 1 2 4 8 12 16 24 32 48 64; do ./omp-test 1 1000 1000 $[16*$n]; done
1 threads, 64 B: Min 645 Max 696 Avg 662
1 threads, 128 B: Min 667 Max 769 Avg 729
1 threads, 256 B: Min 682 Max 718 Avg 686
1 threads, 512 B: Min 770 Max 838 Avg 802
1 threads, 768 B: Min 788 Max 890 Avg 833
1 threads, 1024 B: Min 849 Max 899 Avg 870
1 threads, 1536 B: Min 941 Max 1007 Avg 953
1 threads, 2048 B: Min 1071 Max 1130 Avg 1102
1 threads, 3072 B: Min 1282 Max 1354 Avg 1299
1 threads, 4096 B: Min 1492 Max 1686 Avg 1514
es.mcs.anl.gov (2x E5-2650v2 @ 2.6 GHz), ICC 13.1.3
$ make -B CC=icc CFLAGS='-std=c99 -fopenmp -fast' omp-test
icc -std=c99 -fopenmp -fast omp-test.c -o omp-test
$ for n in 1 2 4 8 12 16 24 32; do ./omp-test $n 10000 10 16; done
1 threads, 64 B: Min 547 Max 19195 Avg 768
2 threads, 128 B: Min 1896 Max 9821 Avg 1966
4 threads, 256 B: Min 4489 Max 23076 Avg 5891
8 threads, 512 B: Min 6954 Max 24801 Avg 7784
12 threads, 768 B: Min 7146 Max 23007 Avg 7946
16 threads, 1024 B: Min 8296 Max 30338 Avg 9427
24 threads, 1536 B: Min 8930 Max 14236 Avg 9815
32 threads, 2048 B: Min 47937 Max 38485441 Avg 54358
(This machine was idle.)
And the serial comparison:
$ for n in 1 2 4 8 12 16 24 32; do ./omp-test 1 1000 1000 $[16*$n]; done
1 threads, 64 B: Min 406 Max 1293 Avg 500
1 threads, 128 B: Min 418 Max 557 Avg 427
1 threads, 256 B: Min 428 Max 589 Avg 438
1 threads, 512 B: Min 469 Max 641 Avg 471
1 threads, 768 B: Min 505 Max 631 Avg 508
1 threads, 1024 B: Min 536 Max 733 Avg 538
1 threads, 1536 B: Min 588 Max 813 Avg 605
1 threads, 2048 B: Min 627 Max 809 Avg 630
So we're talking about roughly 3 µs (Xeon, ~8k cycles at 2.6 GHz) to
10 µs (Opteron, ~21k cycles at 2.2 GHz) of overhead for an omp parallel
region, even at these modest core counts. That is more than a ping-pong
round trip on a decent network, and 20 µs (one parallel region to pack
and one to unpack, on the Opteron) is more than the cost of
MPI_Allreduce on a million cores of BG/Q [2]. You're welcome to run it
for yourself on Titan or wherever else.
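In case the attachment gets scrubbed, here is a minimal sketch of a
benchmark in the same spirit (not the original omp-test.c; the argument
meanings, the rdtsc-based timing, and the per-thread float buffer are
assumptions):

/* Sketch: cycles per "#pragma omp parallel" region.
 * Assumed usage: ./omp-test NTHREADS NSAMPLES NREPS NFLOATS_PER_THREAD */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

static inline uint64_t rdtsc(void) {        /* x86 cycle counter */
  uint32_t lo, hi;
  __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo;
}

int main(int argc, char **argv) {
  int  nthreads = argc > 1 ? atoi(argv[1]) : 1;
  long nsamples = argc > 2 ? atol(argv[2]) : 10000;
  long nreps    = argc > 3 ? atol(argv[3]) : 10;
  long nper     = argc > 4 ? atol(argv[4]) : 16;  /* floats per thread */
  omp_set_num_threads(nthreads);
  float *buf = calloc((size_t)nthreads * nper, sizeof(float));
  if (!buf) return 1;
  uint64_t min = UINT64_MAX, max = 0, sum = 0;
  for (long s = 0; s < nsamples; s++) {
    uint64_t t0 = rdtsc();
    for (long r = 0; r < nreps; r++) {
      /* Timed: fork/join plus each thread touching its own chunk */
      #pragma omp parallel
      {
        int tid = omp_get_thread_num();
        for (long i = 0; i < nper; i++) buf[(size_t)tid * nper + i] += 1.0f;
      }
    }
    uint64_t dt = (rdtsc() - t0) / nreps;   /* cycles per parallel region */
    if (dt < min) min = dt;
    if (dt > max) max = dt;
    sum += dt;
  }
  printf("%d threads, %ld B: Min %llu Max %llu Avg %llu\n",
         nthreads, (long)(nthreads * nper * sizeof(float)),
         (unsigned long long)min, (unsigned long long)max,
         (unsigned long long)(sum / nsamples));
  fprintf(stderr, "%g\n", (double)buf[0]);  /* keep the buffer live */
  free(buf);
  return 0;
}

Compile and run it the same way as above (icc or gcc with -fopenmp).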
The simple conclusion is that putting omp parallel in the critical path
is a terrible plan for strong scaling and downright silly if you're
spending money on a low-latency network.
[1] https://software.intel.com/en-us/forums/topic/537436#comment-1808790
[2] http://www.mcs.anl.gov/~fischer/bgq_all_reduce.png
Attachment: omp-test.c (text/x-csrc, 1201 bytes)
<http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20150109/2e90fa7c/attachment.bin>