[petsc-dev] configuring hypre on batch system
Barry Smith
bsmith at mcs.anl.gov
Fri Jan 9 19:38:06 CST 2015
Jed,
Can you run the same tests with pthreads and a thread pool? How much better is that? What about threading runtimes better than pthreads, such as the one allegedly being developed for Argo at ANL? Have you asked Pete Beckman for his thread-region times?
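
For concreteness, a micro-benchmark along the following lines would measure the dispatch cost of a persistent pthread pool, giving a number directly comparable to the omp parallel timings below. This is only a sketch (the barrier-based pool here is hypothetical, not an existing test); build with something like icc -std=c99 -pthread.

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

typedef unsigned long long cycles_t;
static cycles_t rdtsc(void) {
  unsigned hi, lo;
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((cycles_t)lo) | (((cycles_t)hi)<<32);
}

/* Hypothetical pool: workers park on a barrier, wake to update their slice, park again. */
static int nthreads, lsize, done;
static int *buf;
static pthread_barrier_t start_bar, end_bar;

static void *worker(void *arg) {
  int tid = (int)(size_t)arg;
  for (;;) {
    pthread_barrier_wait(&start_bar);            /* wait for work to be released */
    if (done) return NULL;
    for (int k = tid*lsize; k < (tid+1)*lsize; k++) buf[k]++;
    pthread_barrier_wait(&end_bar);              /* signal completion */
  }
}

int main(int argc, char *argv[]) {
  if (argc != 4) {
    fprintf(stderr, "Usage: %s NUM_THREADS NUM_SAMPLES LOCAL_SIZE\n", argv[0]);
    return 1;
  }
  nthreads = atoi(argv[1]);
  int num_samples = atoi(argv[2]);
  lsize = atoi(argv[3]);
  buf = calloc((size_t)nthreads*lsize, sizeof(int));

  /* Main thread acts as worker 0, so each barrier counts nthreads participants. */
  pthread_barrier_init(&start_bar, NULL, nthreads);
  pthread_barrier_init(&end_bar, NULL, nthreads);
  pthread_t tids[nthreads];
  for (int t = 1; t < nthreads; t++) pthread_create(&tids[t], NULL, worker, (void*)(size_t)t);

  cycles_t min = ~0ULL, max = 0, sum = 0;
  for (int i = 0; i < num_samples; i++) {
    cycles_t c = rdtsc();
    pthread_barrier_wait(&start_bar);            /* "fork": release the pool */
    for (int k = 0; k < lsize; k++) buf[k]++;    /* main thread's share */
    pthread_barrier_wait(&end_bar);              /* "join": wait for the pool */
    c = rdtsc() - c;
    if (c < min) min = c;
    if (c > max) max = c;
    sum += c;
  }
  done = 1;
  pthread_barrier_wait(&start_bar);              /* release workers so they can exit */
  for (int t = 1; t < nthreads; t++) pthread_join(tids[t], NULL);
  printf("%3d threads: Min %llu Max %llu Avg %llu cycles per dispatch\n",
         nthreads, min, max, sum/num_samples);
  free(buf);
  return 0;
}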
Barry
> On Jan 9, 2015, at 5:55 PM, Jed Brown <jed at jedbrown.org> wrote:
>
> Mark Adams <mfadams at lbl.gov> writes:
>
>> I have a test up and running, but hypre and GAMG are running very,
>> very slowly. The test only has about 100 equations per core. Jed
>> mentioned 20K cycles to start an OMP parallel region (really?), which
>> would explain a lot. Do I understand that correctly, Jed?
>
> Yes, >20k cycles on KNC is what John McCalpin reports [1]. Somewhat
> less on more reasonable architectures like Xeon (which also has a faster
> clock rate), but still huge. Cycle counts for my attached test code:
>
> cg.mcs.anl.gov (4x Opteron 6274 @ 2.2 GHz), ICC 13.1.3
> $ make -B CC=icc CFLAGS='-std=c99 -fopenmp -fast' omp-test
> icc -std=c99 -fopenmp -fast omp-test.c -o omp-test
> $ for n in 1 2 4 8 12 16 24 32 48 64; do ./omp-test $n 10000 10 16; done
> 1 threads, 64 B: Min 647 Max 2611 Avg 649
> 2 threads, 128 B: Min 6817 Max 12689 Avg 7400
> 4 threads, 256 B: Min 7602 Max 15105 Avg 8910
> 8 threads, 512 B: Min 10408 Max 21640 Avg 11769
> 12 threads, 768 B: Min 13588 Max 22176 Avg 15608
> 16 threads, 1024 B: Min 15748 Max 26853 Avg 17397
> 24 threads, 1536 B: Min 19503 Max 32095 Avg 22130
> 32 threads, 2048 B: Min 21213 Max 36480 Avg 23688
> 48 threads, 3072 B: Min 25306 Max 613552 Avg 29799
> 64 threads, 4096 B: Min 106807 Max 47592474 Avg 291975
>
> (The largest size may not be representative because someone's
> 8-process job was running. The machine was otherwise idle.)
>
> For comparison, we can execute in serial with the same buffer sizes:
>
> $ for n in 1 2 4 8 12 16 24 32 48 64; do ./omp-test 1 1000 1000 $[16*$n]; done
> 1 threads, 64 B: Min 645 Max 696 Avg 662
> 1 threads, 128 B: Min 667 Max 769 Avg 729
> 1 threads, 256 B: Min 682 Max 718 Avg 686
> 1 threads, 512 B: Min 770 Max 838 Avg 802
> 1 threads, 768 B: Min 788 Max 890 Avg 833
> 1 threads, 1024 B: Min 849 Max 899 Avg 870
> 1 threads, 1536 B: Min 941 Max 1007 Avg 953
> 1 threads, 2048 B: Min 1071 Max 1130 Avg 1102
> 1 threads, 3072 B: Min 1282 Max 1354 Avg 1299
> 1 threads, 4096 B: Min 1492 Max 1686 Avg 1514
>
>
>
> es.mcs.anl.gov (2x E5-2650v2 @ 2.6 GHz), ICC 13.1.3
> $ make -B CC=icc CFLAGS='-std=c99 -fopenmp -fast' omp-test
> icc -std=c99 -fopenmp -fast omp-test.c -o omp-test
> $ for n in 1 2 4 8 12 16 24 32; do ./omp-test $n 10000 10 16; done
> 1 threads, 64 B: Min 547 Max 19195 Avg 768
> 2 threads, 128 B: Min 1896 Max 9821 Avg 1966
> 4 threads, 256 B: Min 4489 Max 23076 Avg 5891
> 8 threads, 512 B: Min 6954 Max 24801 Avg 7784
> 12 threads, 768 B: Min 7146 Max 23007 Avg 7946
> 16 threads, 1024 B: Min 8296 Max 30338 Avg 9427
> 24 threads, 1536 B: Min 8930 Max 14236 Avg 9815
> 32 threads, 2048 B: Min 47937 Max 38485441 Avg 54358
>
> (This machine was idle.)
>
> And the serial comparison:
>
> $ for n in 1 2 4 8 12 16 24 32; do ./omp-test 1 1000 1000 $[16*$n]; done
> 1 threads, 64 B: Min 406 Max 1293 Avg 500
> 1 threads, 128 B: Min 418 Max 557 Avg 427
> 1 threads, 256 B: Min 428 Max 589 Avg 438
> 1 threads, 512 B: Min 469 Max 641 Avg 471
> 1 threads, 768 B: Min 505 Max 631 Avg 508
> 1 threads, 1024 B: Min 536 Max 733 Avg 538
> 1 threads, 1536 B: Min 588 Max 813 Avg 605
> 1 threads, 2048 B: Min 627 Max 809 Avg 630
>
>
> So we're talking about roughly 3 µs (Xeon) to 10 µs (Opteron) of
> overhead per omp parallel region even at these small core counts
> (about 8k cycles at 2.6 GHz on the Xeon and 22k cycles at 2.2 GHz on
> the Opteron). That is more than a ping-pong round trip on a decent
> network, and 20 µs (one region to pack and one to unpack, on the
> Opteron) is more than the cost of MPI_Allreduce on a million cores of
> BG/Q [2]. You're welcome to run it for yourself on Titan or wherever
> else.
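>
> (For reference, the round-trip number can be measured with a minimal
> MPI ping-pong loop like the sketch below; this is illustrative only,
> not the benchmark behind the BG/Q figure in [2]. Run it with two
> ranks, e.g. mpiexec -n 2 ./pingpong.)
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv) {
>   MPI_Init(&argc, &argv);
>   int rank, nreps = 10000;
>   char byte = 0;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   MPI_Barrier(MPI_COMM_WORLD);          /* start both ranks together */
>   double t = MPI_Wtime();
>   for (int i = 0; i < nreps; i++) {
>     if (rank == 0) {                    /* send a byte and wait for the echo */
>       MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>       MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     } else if (rank == 1) {             /* echo it back */
>       MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>       MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
>     }
>   }
>   t = MPI_Wtime() - t;
>   if (rank == 0) printf("1-byte round trip: %.2f us\n", 1e6*t/nreps);
>   MPI_Finalize();
>   return 0;
> }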
>
>
> The simple conclusion is that putting omp parallel in the critical path
> is a terrible plan for strong scaling and downright silly if you're
> spending money on a low-latency network.
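>
> (To make the alternative concrete: the fork/join cost is paid per
> parallel region, so the only way to hide it is to hoist the region out
> of the critical path and use worksharing inside one long-lived region.
> The toy below is a sketch of that comparison, not a proposal for any
> particular hypre or PETSc change.)
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <omp.h>
>
> int main(int argc, char *argv[]) {
>   int n = (argc > 1) ? atoi(argv[1]) : 1000, nits = 10000;
>   double *x = malloc(n*sizeof(double)), *y = calloc(n, sizeof(double));
>   for (int k = 0; k < n; k++) x[k] = 1.0;
>
>   /* Version A: a parallel region per kernel call pays fork/join every iteration */
>   double t = omp_get_wtime();
>   for (int it = 0; it < nits; it++) {
>     #pragma omp parallel for
>     for (int k = 0; k < n; k++) y[k] += 2.0*x[k];
>   }
>   printf("region per kernel: %.2f us/iter\n", 1e6*(omp_get_wtime()-t)/nits);
>
>   /* Version B: one long-lived region; only the worksharing barrier remains per iteration */
>   t = omp_get_wtime();
>   #pragma omp parallel
>   for (int it = 0; it < nits; it++) {
>     #pragma omp for
>     for (int k = 0; k < n; k++) y[k] += 2.0*x[k];
>   }
>   printf("hoisted region:    %.2f us/iter\n", 1e6*(omp_get_wtime()-t)/nits);
>   free(x); free(y);
>   return 0;
> }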
>
>
> [1] https://software.intel.com/en-us/forums/topic/537436#comment-1808790
> [2] http://www.mcs.anl.gov/~fischer/bgq_all_reduce.png
>
> #define _POSIX_C_SOURCE 199309L
> #include <stdio.h>
> #include <stdlib.h>
> #include <omp.h>
>
> typedef unsigned long long cycles_t;
>
> /* Read the x86 time-stamp counter */
> cycles_t rdtsc(void) {
>   unsigned hi, lo;
>   __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
>   return ((cycles_t)lo) | (((cycles_t)hi)<<32);
> }
>
> int main(int argc, char *argv[]) {
>   if (argc != 5) {
>     fprintf(stderr, "Usage: %s NUM_THREADS NUM_SAMPLES SAMPLE_ITERATIONS LOCAL_SIZE\n", argv[0]);
>     return 1;
>   }
>   int nthreads = atoi(argv[1]), num_samples = atoi(argv[2]), sample_its = atoi(argv[3]), lsize = atoi(argv[4]);
>
>   omp_set_num_threads(nthreads);
>
>   int *buf = calloc(nthreads*lsize, sizeof(int));
>   // Warm up the thread pool so the first timed sample does not pay runtime startup cost
>   #pragma omp parallel for
>   for (int k=0; k<nthreads*lsize; k++) buf[k]++;
>
>   // Each sample times sample_its back-to-back parallel regions and records the per-region cost
>   cycles_t max = 0, min = (cycles_t)1e10, sum = 0;
>   for (int i=0; i<num_samples; i++) {
>     cycles_t t = rdtsc();
>     for (int j=0; j<sample_its; j++) {
>       #pragma omp parallel for
>       for (int k=0; k<nthreads*lsize; k++) buf[k]++;
>     }
>     t = (rdtsc() - t)/sample_its;
>     if (t > max) max = t;
>     if (t < min) min = t;
>     sum += t;
>   }
>   printf("% 3d threads, %4zu B: Min %8llu Max %8llu Avg %8llu\n", nthreads, nthreads*lsize*sizeof(int), min, max, sum/num_samples);
>   free(buf);
>   return 0;
> }