[petsc-dev] configuring hypre on batch system

Barry Smith bsmith at mcs.anl.gov
Fri Jan 9 19:38:06 CST 2015


  Jed,

    Can you do the same tests with pthreads and a thread pool? How much better is it? What about better threads than pthreads, like the ones they are allegedly developing for Argo at ANL? Have you asked Pete Beckman for his thread region times?

   Barry
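
For reference, the pthreads/thread-pool measurement asked about above could be done along the following lines: a persistent pool of workers parked on a condition variable, with the main thread timing each dispatch/join cycle the same way the attached OpenMP test times omp parallel. This is only a minimal sketch under those assumptions (the pool design, the file name pool-test.c, and the argument conventions are illustrative, not code from this thread):

/* pool-test.c: time dispatch/join through a persistent pthreads pool.
 * Workers park on a condition variable; the main thread bumps a
 * generation counter to release them and waits until all have checked
 * back in, timing each cycle with rdtsc like the OpenMP test below. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

typedef unsigned long long cycles_t;
cycles_t rdtsc(void) {
  unsigned hi, lo;
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((cycles_t)lo)|(((cycles_t)hi)<<32);
}

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  go    = PTHREAD_COND_INITIALIZER;   /* main -> workers */
static pthread_cond_t  done  = PTHREAD_COND_INITIALIZER;   /* workers -> main */
static int nthreads, lsize, generation, ncheckedin, shutdown_flag;
static int *buf;

static void *worker(void *arg) {
  int id = (int)(long)arg, mygen = 0;
  for (;;) {
    pthread_mutex_lock(&mutex);
    while (generation == mygen && !shutdown_flag) pthread_cond_wait(&go,&mutex);
    if (shutdown_flag) { pthread_mutex_unlock(&mutex); return NULL; }
    mygen = generation;
    pthread_mutex_unlock(&mutex);
    for (int k=id*lsize; k<(id+1)*lsize; k++) buf[k]++;  /* the "parallel for" body */
    pthread_mutex_lock(&mutex);
    if (++ncheckedin == nthreads) pthread_cond_signal(&done);
    pthread_mutex_unlock(&mutex);
  }
}

int main(int argc,char *argv[]) {
  if (argc != 4) {
    fprintf(stderr,"Usage: %s NUM_THREADS NUM_SAMPLES LOCAL_SIZE\n",argv[0]);
    return 1;
  }
  nthreads = atoi(argv[1]);
  int num_samples = atoi(argv[2]);
  lsize = atoi(argv[3]);
  buf = calloc((size_t)nthreads*lsize,sizeof(int));
  pthread_t *tid = malloc(nthreads*sizeof(pthread_t));
  for (long i=0; i<nthreads; i++) pthread_create(&tid[i],NULL,worker,(void*)i);

  cycles_t max=0,min=~(cycles_t)0,sum=0;
  for (int i=0; i<num_samples; i++) {
    cycles_t t = rdtsc();
    pthread_mutex_lock(&mutex);
    ncheckedin = 0;
    generation++;                     /* release the pool */
    pthread_cond_broadcast(&go);
    while (ncheckedin < nthreads) pthread_cond_wait(&done,&mutex);
    pthread_mutex_unlock(&mutex);
    t = rdtsc() - t;                  /* cycles for one dispatch/join */
    if (t > max) max = t;
    if (t < min) min = t;
    sum += t;
  }
  printf("% 3d threads, %4zu B: Min %8llu  Max %8llu  Avg %8llu\n",
         nthreads,(size_t)nthreads*lsize*sizeof(int),min,max,sum/num_samples);

  pthread_mutex_lock(&mutex);         /* tell the workers to exit */
  shutdown_flag = 1;
  pthread_cond_broadcast(&go);
  pthread_mutex_unlock(&mutex);
  for (int i=0; i<nthreads; i++) pthread_join(tid[i],NULL);
  free(tid); free(buf);
  return 0;
}

Built with something like "icc -std=c99 -O2 -pthread pool-test.c -o pool-test" and run as "./pool-test 16 10000 16", the Min/Avg columns would be directly comparable to the omp parallel cycle counts in the tables below (each sample here times a single dispatch rather than averaging over SAMPLE_ITERATIONS).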


> On Jan 9, 2015, at 5:55 PM, Jed Brown <jed at jedbrown.org> wrote:
> 
> Mark Adams <mfadams at lbl.gov> writes:
> 
>> I have a test up and running, but hypre and GAMG are running very, very
>> slowly.  The test only has about 100 equations per core.  Jed mentioned 20K
>> cycles to start an OMP parallel region (really?), which would explain a lot.  Do I
>> understand that correctly, Jed?
> 
> Yes, >20k cycles on KNC is what John McCalpin reports [1].  Somewhat
> less on more reasonable architectures like Xeon (which also has a faster
> clock rate), but still huge.  Cycle counts for my attached test code:
> 
> cg.mcs.anl.gov (4x Opteron 6274 @ 2.2 GHz), ICC 13.1.3
> $ make -B CC=icc CFLAGS='-std=c99 -fopenmp -fast' omp-test
> icc -std=c99 -fopenmp -fast    omp-test.c   -o omp-test
> $ for n in 1 2 4 8 12 16 24 32 48 64; do ./omp-test $n 10000 10 16; done
>  1 threads,   64 B: Min      647  Max     2611  Avg      649
>  2 threads,  128 B: Min     6817  Max    12689  Avg     7400
>  4 threads,  256 B: Min     7602  Max    15105  Avg     8910
>  8 threads,  512 B: Min    10408  Max    21640  Avg    11769
> 12 threads,  768 B: Min    13588  Max    22176  Avg    15608
> 16 threads, 1024 B: Min    15748  Max    26853  Avg    17397
> 24 threads, 1536 B: Min    19503  Max    32095  Avg    22130
> 32 threads, 2048 B: Min    21213  Max    36480  Avg    23688
> 48 threads, 3072 B: Min    25306  Max   613552  Avg    29799
> 64 threads, 4096 B: Min   106807  Max 47592474  Avg   291975
> 
>  (The largest size may not be representative because someone's
>  8-process job was running.  The machine was otherwise idle.)
> 
> For comparison, we can execute in serial with the same buffer sizes:
> 
> $ for n in 1 2 4 8 12 16 24 32 48 64; do ./omp-test 1 1000 1000 $[16*$n]; done
>  1 threads,   64 B: Min      645  Max      696  Avg      662
>  1 threads,  128 B: Min      667  Max      769  Avg      729
>  1 threads,  256 B: Min      682  Max      718  Avg      686
>  1 threads,  512 B: Min      770  Max      838  Avg      802
>  1 threads,  768 B: Min      788  Max      890  Avg      833
>  1 threads, 1024 B: Min      849  Max      899  Avg      870
>  1 threads, 1536 B: Min      941  Max     1007  Avg      953
>  1 threads, 2048 B: Min     1071  Max     1130  Avg     1102
>  1 threads, 3072 B: Min     1282  Max     1354  Avg     1299
>  1 threads, 4096 B: Min     1492  Max     1686  Avg     1514
> 
> 
> 
> es.mcs.anl.gov (2x E5-2650v2 @ 2.6 GHz), ICC 13.1.3
> $ make -B CC=icc CFLAGS='-std=c99 -fopenmp -fast' omp-test
> icc -std=c99 -fopenmp -fast    omp-test.c   -o omp-test
> $ for n in 1 2 4 8 12 16 24 32; do ./omp-test $n 10000 10 16; done
>  1 threads,   64 B: Min      547  Max    19195  Avg      768
>  2 threads,  128 B: Min     1896  Max     9821  Avg     1966
>  4 threads,  256 B: Min     4489  Max    23076  Avg     5891
>  8 threads,  512 B: Min     6954  Max    24801  Avg     7784
> 12 threads,  768 B: Min     7146  Max    23007  Avg     7946
> 16 threads, 1024 B: Min     8296  Max    30338  Avg     9427
> 24 threads, 1536 B: Min     8930  Max    14236  Avg     9815
> 32 threads, 2048 B: Min    47937  Max 38485441  Avg    54358
> 
>  (This machine was idle.)
> 
> And the serial comparison:
> 
> $ for n in 1 2 4 8 12 16 24 32; do ./omp-test 1 1000 1000 $[16*$n]; done
>  1 threads,   64 B: Min      406  Max     1293  Avg      500
>  1 threads,  128 B: Min      418  Max      557  Avg      427
>  1 threads,  256 B: Min      428  Max      589  Avg      438
>  1 threads,  512 B: Min      469  Max      641  Avg      471
>  1 threads,  768 B: Min      505  Max      631  Avg      508
>  1 threads, 1024 B: Min      536  Max      733  Avg      538
>  1 threads, 1536 B: Min      588  Max      813  Avg      605
>  1 threads, 2048 B: Min      627  Max      809  Avg      630
> 
> 
> So we're talking about roughly 3 µs (Xeon) to 10 µs (Opteron) of overhead for
> an omp parallel region even with these modest core counts: ~8k cycles at
> 2.6 GHz is about 3 µs, and ~22k cycles at 2.2 GHz is about 10 µs.  That is
> more than a ping-pong round trip on a decent network, and 20 µs (one region
> to pack and one to unpack on the Opteron) is more than the cost of
> MPI_Allreduce on a million cores of BG/Q [2].  You're welcome to run it for
> yourself on Titan or wherever else.
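> 
> (For scale, the "ping-pong round trip" above is what a standard two-rank
> latency microbenchmark measures.  A minimal sketch of such a benchmark,
> using plain MPI_Send/MPI_Recv and MPI_Wtime and not part of the attached
> test, would look like the following; run it with mpiexec -n 2:
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> int main(int argc,char **argv) {
>   MPI_Init(&argc,&argv);
>   int rank;
>   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>   const int nreps = 10000;
>   char msg[8] = {0};
>   MPI_Barrier(MPI_COMM_WORLD);
>   double t = MPI_Wtime();
>   for (int i=0; i<nreps; i++) {
>     if (rank == 0) {        /* rank 0 sends a small message and waits for the echo */
>       MPI_Send(msg,sizeof msg,MPI_CHAR,1,0,MPI_COMM_WORLD);
>       MPI_Recv(msg,sizeof msg,MPI_CHAR,1,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
>     } else if (rank == 1) { /* rank 1 echoes every message straight back */
>       MPI_Recv(msg,sizeof msg,MPI_CHAR,0,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
>       MPI_Send(msg,sizeof msg,MPI_CHAR,0,0,MPI_COMM_WORLD);
>     }
>   }
>   t = MPI_Wtime() - t;
>   if (rank == 0) printf("round trip %.2f us, one-way %.2f us\n",1e6*t/nreps,0.5e6*t/nreps);
>   MPI_Finalize();
>   return 0;
> }
> 
> The one-way number it reports is the figure being compared against the omp
> parallel overhead here.)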
> 
> 
> The simple conclusion is that putting omp parallel in the critical path
> is a terrible plan for strong scaling and downright silly if you're
> spending money on a low-latency network.
> 
> 
> [1] https://software.intel.com/en-us/forums/topic/537436#comment-1808790
> [2] http://www.mcs.anl.gov/~fischer/bgq_all_reduce.png
> 
> #define _POSIX_C_SOURCE 199309L
> #include <stdio.h>
> #include <omp.h>
> #include <stdlib.h>
> 
> typedef unsigned long long cycles_t;
> /* Read the x86 time-stamp counter (raw cycle count). */
> cycles_t rdtsc(void) {
>  unsigned hi, lo;
>  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
>  return ((cycles_t)lo)|( ((cycles_t)hi)<<32);
> }
> 
> int main(int argc,char *argv[]) {
>  if (argc != 5) {
>    fprintf(stderr,"Usage: %s NUM_THREADS NUM_SAMPLES SAMPLE_ITERATIONS LOCAL_SIZE\n",argv[0]);
>    return 1;
>  }
>  int nthreads = atoi(argv[1]),num_samples = atoi(argv[2]),sample_its = atoi(argv[3]),lsize = atoi(argv[4]);
> 
>  omp_set_num_threads(nthreads);
> 
>  int *buf = calloc(nthreads*lsize,sizeof(int));
>  // Warm up the thread pool so the timed regions do not pay thread-creation cost
> #pragma omp parallel for
>  for (int k=0; k<nthreads*lsize; k++) buf[k]++;
> 
>  // Take num_samples timings, each averaged over sample_its omp parallel regions
>  cycles_t max=0,min=1e10,sum=0;
>  for (int i=0; i<num_samples; i++) {
>    cycles_t t = rdtsc();
>    for (int j=0; j<sample_its; j++) {
> #pragma omp parallel for
>      for (int k=0; k<nthreads*lsize; k++) buf[k]++;
>    }
>    t = (rdtsc() - t)/sample_its;  // cycles per parallel region
>    if (t > max) max = t;
>    if (t < min) min = t;
>    sum += t;
>  }
>  printf("% 3d threads, %4zu B: Min %8llu  Max %8llu  Avg %8llu\n",nthreads,nthreads*lsize*sizeof(int),min,max,sum/num_samples);
>  free(buf);
>  return 0;
> }



