[petsc-users] Question about matrix permutation
Jed Brown
jed at 59A2.org
Sun Jan 31 15:26:45 CST 2010
On Sat, 30 Jan 2010 14:41:03 -0600, Barry Smith <bsmith at mcs.anl.gov> wrote:
> This looks great; push it to petsc-dev if it really works.
I will, but it needs to be cleaned up a bit before it can be pushed
(e.g. so that it works with other compilers).
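The non-portable part is the prefetch intrinsic itself, which each
compiler spells differently. A minimal sketch of the kind of guard I
have in mind (not the actual patch, and the macro name is made up):

#if defined(__GNUC__)                /* gcc and compatibles */
#  define PREFETCH_NTA(a) __builtin_prefetch((a), 0, 0)  /* read, no reuse */
#elif defined(_MSC_VER) || defined(__INTEL_COMPILER)
#  include <xmmintrin.h>
#  define PREFETCH_NTA(a) _mm_prefetch((const char *)(a), _MM_HINT_NTA)
#else
#  define PREFETCH_NTA(a) ((void)0)  /* no-op on unknown compilers */
#endif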
> BUT, this is NOT hardware prefetching, this is specifically
> software prefetching, you are telling it exactly what to prefetch and
> when.
Absolutely, but my original statement was

  it seems possible for deficient hardware prefetch to be at fault for a
  nontrivial amount of the observed difference between Inode and BAIJ.
I never claimed that hardware prefetch was doing anything useful
(although the tests below show that it is), only that deficient
prefetch was the reason why Inode was not as fast as our standard
performance model would indicate.
> I would call it hardware prefetching only when the hardware detects a
> particular pattern of access and then automatically extrapolates that
> pattern to prefetch further along the pattern.
Right, and that extrapolation was not very good, for the reasons
described earlier in this thread. But to investigate how much good
hardware prefetch does in this case, I ran the following tests.
First I disabled hardware prefetch. This is poorly documented on the
web, so here's the scoop:
With Core 2 on Linux, you install msr-tools and check register 0x1a0
(IA32_MISC_ENABLE):
# rdmsr -p 1 0x1a0
1364970489
The details are processor dependent, but on both 32- and 64-bit Intel
chips, you set bit 9 to disable hardware prefetch:
# wrmsr -p 1 0x1a0 0x1364970689
To turn it on again, clear that bit:
# wrmsr -p 1 0x1a0 0x1364970489
Note that this all needs to be run as root. For details on MSRs and
which bits to use for chips other than Core 2, see Appendix B of the
Intel® 64 and IA-32 Architectures Software Developer's Manual,
Volume 3B: System Programming Guide. All the manuals are at
http://www.intel.com/products/processor/manuals/
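Incidentally, msr-tools is just poking MSR 0x1a0 (IA32_MISC_ENABLE);
the msr kernel module also exposes the registers as /dev/cpu/N/msr,
with the MSR number as the file offset. A rough C equivalent of the
rdmsr/wrmsr calls above (a sketch, assuming the msr module is loaded;
run as root):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  uint64_t v;
  int fd = open("/dev/cpu/1/msr", O_RDWR);  /* core 1, as in rdmsr -p 1 */
  if (fd < 0) { perror("open"); return 1; }
  if (pread(fd, &v, sizeof v, 0x1a0) != sizeof v) { perror("pread"); return 1; }
  printf("IA32_MISC_ENABLE = %#llx\n", (unsigned long long)v);
  v |= UINT64_C(1) << 9;       /* set bit 9: streaming hardware prefetch off */
  /* v |= UINT64_C(1) << 37;      set bit 37: DCU prefetch off (used later) */
  if (pwrite(fd, &v, sizeof v, 0x1a0) != sizeof v) { perror("pwrite"); return 1; }
  close(fd);
  return 0;
}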
I then ran the benchmarks using taskset to bind to the correct core and
nice -n -10 to ensure that mostly-idle background processes would always
run on the inactive core. I checked that the results were consistent
regardless of which core was being used, and ran each test twice. All
of these are *without* the software prefetch patch. The difference is
comfortably over 2 percent in all cases; in fact, disabling hardware
prefetch drops the flop rate by roughly 10 to 14 percent here.
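Each timing run was launched along these lines (a sketch; the bound
core and the ex19 options varied as described):
# taskset -c 0 nice -n -10 ./ex19 <options as in the runs below>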
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 1000 -snes_max_it 1 -da_grid_x 50 -da_grid_y 50 -log_summary
on: MatMult 2001 1.0 5.3418e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 41 41 0 0 0 82 81 0 0 0 568
MatMult 2001 1.0 5.3506e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 41 41 0 0 0 82 81 0 0 0 567
off: MatMult 2001 1.0 5.9154e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41 0 0 0 76 81 0 0 0 513
MatMult 2001 1.0 5.9271e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41 0 0 0 76 81 0 0 0 512
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
on: MatMult 201 1.0 7.9634e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38 0 0 0 61 77 0 0 0 626
MatMult 201 1.0 7.9570e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38 0 0 0 61 77 0 0 0 626
off: MatMult 201 1.0 9.2377e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 59 77 0 0 0 539
MatMult 201 1.0 9.2294e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 59 77 0 0 0 540
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 300 -da_grid_y 300 -log_summary
on: MatMult 201 1.0 1.7588e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 59 77 0 0 0 639
MatMult 201 1.0 1.7642e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 60 77 0 0 0 637
off: MatMult 201 1.0 2.0519e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 26 38 0 0 0 58 77 0 0 0 548
MatMult 201 1.0 2.0601e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 26 38 0 0 0 58 77 0 0 0 545
Then I did one comparison *with* the software prefetch patch:
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
on: MatMult 201 1.0 6.1305e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 24 38 0 0 0 54 77 0 0 0 813
MatMult 201 1.0 6.1379e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 24 38 0 0 0 54 77 0 0 0 812
off: MatMult 201 1.0 7.8732e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 38 0 0 0 55 77 0 0 0 633
MatMult 201 1.0 7.8841e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 38 0 0 0 55 77 0 0 0 632
So the performance loss from disabling hardware prefetch is *greater*
when using explicit software prefetch than when no software prefetch is
done.
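For concreteness, here is schematically what the software prefetch
patch does (a sketch using the PREFETCH_NTA shim from above, not the
actual PETSc code): while processing row i of an AIJ-format matrix,
issue prefetches for the index and value data of row i+1, exactly the
short irregular streams that the hardware extrapolation handles badly.

static void csr_mult_prefetch(int m, const int *ai, const int *aj,
                              const double *aa, const double *x, double *y)
{
  int i, j;
  for (i = 0; i < m; i++) {
    double sum = 0.0;
    PREFETCH_NTA(aj + ai[i+1]);  /* column indices of the next row */
    PREFETCH_NTA(aa + ai[i+1]);  /* values of the next row */
    for (j = ai[i]; j < ai[i+1]; j++) sum += aa[j] * x[aj[j]];
    y[i] = sum;
  }
}

(On the last row this prefetches one element past the end of the
arrays, which is harmless: prefetch instructions never fault.)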
Finally, I reverted the software prefetch patch and disabled DCU
prefetch (bit 37). I'm not clear on the details of the difference
between this and the streaming prefetch that the tests above use, but
the manual says:

  The DCU prefetcher is an L1 data cache prefetcher. When the DCU
  prefetcher detects multiple loads from the same line done within a
  time limit, the DCU prefetcher assumes the next line will be
  required. The next line is prefetched into the L1 data cache from
  memory or L2.
Compare these numbers to the corresponding 200x200 ones above
(~626 Mflop/s with both prefetchers on, ~539 with streaming prefetch
off).
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
dcu-off: MatMult 201 1.0 8.2517e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38 0 0 0 61 77 0 0 0 604
both-off: MatMult 201 1.0 9.6107e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 59 77 0 0 0 518
On my system, these were achieved with
# wrmsr -p 1 0x1a0 0x3364970489 # stream on, DCU off
# wrmsr -p 1 0x1a0 0x3364970689 # stream off, DCU off
My system details:
$ uname -a
Linux kunyang 2.6.32-ARCH #1 SMP PREEMPT Mon Jan 25 20:33:50 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz GenuineIntel GNU/Linux
$ gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared --enable-languages=c,c++,fortran,objc,obj-c++,ada --enable-threads=posix --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic
Thread model: posix
gcc version 4.4.3 (GCC)
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
stepping : 10
cpu MHz : 2534.000
cache size : 3072 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm ida tpr_shadow vnmi flexpriority
bogomips : 5057.73
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
[processor 1 omitted; it is identical to processor 0 apart from
core id, apicid, initial apicid, and bogomips]
Jed