[petsc-dev] configuring hypre on batch system

Jed Brown jed at jedbrown.org
Fri Jan 9 17:55:04 CST 2015


Mark Adams <mfadams at lbl.gov> writes:

> I have a test up and running, but hypre and GAMG are running very, very
> slowly.  The test only has about 100 equations per core.  Jed mentioned
> 20K cycles to start an OMP parallel region (really?), which would
> explain a lot.  Do I understand that correctly, Jed?

Yes, >20k cycles on KNC is what John McCalpin reports [1].  It is
somewhat less on more reasonable architectures like Xeon (which also has
a faster clock rate), but still huge.  Cycle counts for my attached test
code (a sketch of the harness appears after the discussion below):

cg.mcs.anl.gov (4x Opteron 6274 @ 2.2 GHz), ICC 13.1.3
$ make -B CC=icc CFLAGS='-std=c99 -fopenmp -fast' omp-test
icc -std=c99 -fopenmp -fast    omp-test.c   -o omp-test
$ for n in 1 2 4 8 12 16 24 32 48 64; do ./omp-test $n 10000 10 16; done                                                                                                                                                            
  1 threads,   64 B: Min      647  Max     2611  Avg      649
  2 threads,  128 B: Min     6817  Max    12689  Avg     7400
  4 threads,  256 B: Min     7602  Max    15105  Avg     8910
  8 threads,  512 B: Min    10408  Max    21640  Avg    11769
 12 threads,  768 B: Min    13588  Max    22176  Avg    15608
 16 threads, 1024 B: Min    15748  Max    26853  Avg    17397
 24 threads, 1536 B: Min    19503  Max    32095  Avg    22130
 32 threads, 2048 B: Min    21213  Max    36480  Avg    23688
 48 threads, 3072 B: Min    25306  Max   613552  Avg    29799
 64 threads, 4096 B: Min   106807  Max 47592474  Avg   291975

  (The largest size may not be representative because someone's
  8-process job was running.  The machine was otherwise idle.)

For comparison, we can execute in serial with the same buffer sizes:

$ for n in 1 2 4 8 12 16 24 32 48 64; do ./omp-test 1 1000 1000 $[16*$n]; done
  1 threads,   64 B: Min      645  Max      696  Avg      662
  1 threads,  128 B: Min      667  Max      769  Avg      729
  1 threads,  256 B: Min      682  Max      718  Avg      686
  1 threads,  512 B: Min      770  Max      838  Avg      802
  1 threads,  768 B: Min      788  Max      890  Avg      833
  1 threads, 1024 B: Min      849  Max      899  Avg      870
  1 threads, 1536 B: Min      941  Max     1007  Avg      953
  1 threads, 2048 B: Min     1071  Max     1130  Avg     1102
  1 threads, 3072 B: Min     1282  Max     1354  Avg     1299
  1 threads, 4096 B: Min     1492  Max     1686  Avg     1514



es.mcs.anl.gov (2x E5-2650v2 @ 2.6 GHz), ICC 13.1.3
$ make -B CC=icc CFLAGS='-std=c99 -fopenmp -fast' omp-test
icc -std=c99 -fopenmp -fast    omp-test.c   -o omp-test
$ for n in 1 2 4 8 12 16 24 32; do ./omp-test $n 10000 10 16; done                                                                                                                                                       
  1 threads,   64 B: Min      547  Max    19195  Avg      768
  2 threads,  128 B: Min     1896  Max     9821  Avg     1966
  4 threads,  256 B: Min     4489  Max    23076  Avg     5891
  8 threads,  512 B: Min     6954  Max    24801  Avg     7784
 12 threads,  768 B: Min     7146  Max    23007  Avg     7946
 16 threads, 1024 B: Min     8296  Max    30338  Avg     9427
 24 threads, 1536 B: Min     8930  Max    14236  Avg     9815
 32 threads, 2048 B: Min    47937  Max 38485441  Avg    54358

  (This machine was idle.)

And the serial comparison:

$ for n in 1 2 4 8 12 16 24 32; do ./omp-test 1 1000 1000 $[16*$n]; done
  1 threads,   64 B: Min      406  Max     1293  Avg      500
  1 threads,  128 B: Min      418  Max      557  Avg      427
  1 threads,  256 B: Min      428  Max      589  Avg      438
  1 threads,  512 B: Min      469  Max      641  Avg      471
  1 threads,  768 B: Min      505  Max      631  Avg      508
  1 threads, 1024 B: Min      536  Max      733  Avg      538
  1 threads, 1536 B: Min      588  Max      813  Avg      605
  1 threads, 2048 B: Min      627  Max      809  Avg      630


So we're talking about roughly 3 µs (Xeon) to 10 µs (Opteron) of
overhead per omp parallel region even with these small core counts
(about 7900 cycles / 2.6 GHz ≈ 3 µs on the Xeon; about 22000 cycles /
2.2 GHz ≈ 10 µs on the Opteron).  That is more than a ping-pong round
trip on a decent network, and 20 µs (one region to pack and one to
unpack, on the Opteron) is more than the cost of an MPI_Allreduce on a
million cores of BG/Q [2].  You're welcome to run it for yourself on
Titan or wherever else.
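
Since the attachment only survives as a binary link in the archive, here
is a minimal sketch of this kind of timing harness.  It is my
reconstruction from the invocations and output above, not the original
omp-test.c: the argument meanings (threads, samples, regions per sample,
floats per thread) and the use of rdtsc are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

/* Read the x86 time-stamp counter (gcc/icc inline asm). */
static inline uint64_t rdtsc(void)
{
  uint32_t lo, hi;
  __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo;
}

int main(int argc, char **argv)
{
  int nthreads = argc > 1 ? atoi(argv[1]) : 1;     /* threads per parallel region */
  int samples  = argc > 2 ? atoi(argv[2]) : 10000; /* timed samples */
  int inner    = argc > 3 ? atoi(argv[3]) : 10;    /* parallel regions per sample */
  int elems    = argc > 4 ? atoi(argv[4]) : 16;    /* floats touched per thread */
  float *buf   = calloc((size_t)nthreads * elems, sizeof(float));
  uint64_t min = UINT64_MAX, max = 0, sum = 0;

  omp_set_num_threads(nthreads);
  for (int s = 0; s < samples; s++) {
    uint64_t t0 = rdtsc();
    for (int i = 0; i < inner; i++) {
#pragma omp parallel
      {
        int tid = omp_get_thread_num();
        for (int j = 0; j < elems; j++)
          buf[tid * elems + j] += 1.0f; /* each thread touches its own slice */
      }
    }
    uint64_t t = (rdtsc() - t0) / inner; /* cycles per parallel region */
    if (t < min) min = t;
    if (t > max) max = t;
    sum += t;
  }
  printf("%3d threads, %4zu B: Min %8llu  Max %8llu  Avg %8llu\n",
         nthreads, (size_t)nthreads * elems * sizeof(float),
         (unsigned long long)min, (unsigned long long)max,
         (unsigned long long)(sum / samples));
  free(buf);
  return 0;
}

Run as, e.g., ./omp-test 8 10000 10 16 to mimic the parallel runs above;
the reported numbers are cycles per omp parallel region.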


The simple conclusion is that putting omp parallel in the critical path
is a terrible plan for strong scaling and downright silly if you're
spending money on a low-latency network.
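
To make that concrete (this illustration is mine, not part of the
original message): with only ~100 equations per core, the per-call work
of a typical kernel is on the order of a few hundred cycles, so a
fork/join of several thousand to tens of thousands of cycles around
every kernel call dominates the runtime.  The overhead is only avoided
by hoisting the parallel region out of the critical path, e.g. opening
it once around the whole solve, or by staying MPI-everywhere.

/* Hypothetical kernel, for illustration only. */
void axpy_region_per_call(int n, double a, const double *x, double *y)
{
  /* Pays the full omp parallel fork/join cost on every call. */
#pragma omp parallel for
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}

void axpy_inside_existing_region(int n, double a, const double *x, double *y)
{
  /* Assumes the caller already opened one long-lived omp parallel
     region; only the worksharing barrier at the end of the loop is
     paid here. */
#pragma omp for
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}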


[1] https://software.intel.com/en-us/forums/topic/537436#comment-1808790
[2] http://www.mcs.anl.gov/~fischer/bgq_all_reduce.png

-------------- next part --------------
Attachment: omp-test.c (text/x-csrc, 1201 bytes)
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20150109/2e90fa7c/attachment.bin>