[petsc-users] Question about matrix permutation
Jed Brown
jed at 59A2.org
Sun Jan 31 15:26:45 CST 2010
On Sat, 30 Jan 2010 14:41:03 -0600, Barry Smith <bsmith at mcs.anl.gov> wrote:
> This looks great; push it to petsc-dev if it really works.
I will, but it needs to be cleaned up a bit before it can be pushed
(e.g. so that it works with other compilers).
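The non-portable part is the prefetch intrinsic itself, which each
compiler spells differently. A minimal sketch of the kind of guard I
have in mind (not the actual patch, and the macro name is made up):

#if defined(__GNUC__)                /* gcc and compatibles */
#  define PREFETCH_NTA(a) __builtin_prefetch((a), 0, 0)  /* read, no reuse */
#elif defined(_MSC_VER) || defined(__INTEL_COMPILER)
#  include <xmmintrin.h>
#  define PREFETCH_NTA(a) _mm_prefetch((const char *)(a), _MM_HINT_NTA)
#else
#  define PREFETCH_NTA(a) ((void)0)  /* no-op on unknown compilers */
#endif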
> BUT, this is NOT hardware prefetching, this is specifically
> software prefetching, you are telling it exactly what to prefetch and
> when.
Absolutely, but my original statement was

  it seems possible for deficient hardware prefetch to be at fault for a
  nontrivial amount of the observed difference between Inode and BAIJ.
I never claimed that hardware prefetch was doing anything useful
(although the tests below show that it is), only that deficient
prefetch was the reason why Inode was not as fast as our standard
performance model would indicate.
> I would call it hardware prefetching only when the hardware detects a
> particular pattern of access and then automatically extrapolates that
> pattern to prefetch further along the pattern.
Right, and that extrapolation was not very good, for the reasons
described earlier in this thread. But to investigate how much good
hardware prefetch does in this case, I ran the following tests.
First I disabled hardware prefetch. This is poorly documented on the
web, so here's the scoop:
With Core 2 on Linux, you install msr-tools and check register 0x1a0
(IA32_MISC_ENABLE):
# rdmsr -p 1 0x1a0
1364970489
The details are processor dependent, but on both 32- and 64-bit Intel
chips, you set bit 9 to disable hardware prefetch:
# wrmsr -p 1 0x1a0 0x1364970689
To turn it on again, clear that bit:
# wrmsr -p 1 0x1a0 0x1364970489
Note that this all needs to be run as root. For details on MSRs and
which bits to use for chips other than Core 2, see Appendix B of the
Intel® 64 and IA-32 Architectures Software Developer's Manual,
Volume 3B: System Programming Guide. All the manuals are at
http://www.intel.com/products/processor/manuals/
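Incidentally, msr-tools is just poking MSR 0x1a0 (IA32_MISC_ENABLE);
the msr kernel module also exposes the registers as /dev/cpu/N/msr,
with the MSR number as the file offset. A rough C equivalent of the
rdmsr/wrmsr calls above (a sketch, assuming the msr module is loaded;
run as root):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  uint64_t v;
  int fd = open("/dev/cpu/1/msr", O_RDWR);  /* core 1, as in rdmsr -p 1 */
  if (fd < 0) { perror("open"); return 1; }
  if (pread(fd, &v, sizeof v, 0x1a0) != sizeof v) { perror("pread"); return 1; }
  printf("IA32_MISC_ENABLE = %#llx\n", (unsigned long long)v);
  v |= UINT64_C(1) << 9;       /* set bit 9: streaming hardware prefetch off */
  /* v |= UINT64_C(1) << 37;      set bit 37: DCU prefetch off (used later) */
  if (pwrite(fd, &v, sizeof v, 0x1a0) != sizeof v) { perror("pwrite"); return 1; }
  close(fd);
  return 0;
}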
I then ran the benchmarks using taskset to bind to the correct core and
nice -n -10 to ensure that mostly-idle background processes would always
run on the inactive core. I checked that the results were consistent
regardless of which core was being used, and ran each test twice. All
of these are *without* the software prefetch patch. The difference is
comfortably over 2 percent in all cases; in fact, disabling hardware
prefetch drops the flop rate by roughly 10 to 14 percent here.
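Each timing run was launched along these lines (a sketch; the bound
core and the ex19 options varied as described):
# taskset -c 0 nice -n -10 ./ex19 <options as in the runs below>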
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 1000 -snes_max_it 1 -da_grid_x 50 -da_grid_y 50 -log_summary
on: MatMult 2001 1.0 5.3418e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 41 41 0 0 0 82 81 0 0 0 568
MatMult 2001 1.0 5.3506e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 41 41 0 0 0 82 81 0 0 0 567
off: MatMult 2001 1.0 5.9154e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41 0 0 0 76 81 0 0 0 513
MatMult 2001 1.0 5.9271e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41 0 0 0 76 81 0 0 0 512
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
on: MatMult 201 1.0 7.9634e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38 0 0 0 61 77 0 0 0 626
MatMult 201 1.0 7.9570e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38 0 0 0 61 77 0 0 0 626
off: MatMult 201 1.0 9.2377e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 59 77 0 0 0 539
MatMult 201 1.0 9.2294e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 59 77 0 0 0 540
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 300 -da_grid_y 300 -log_summary
on: MatMult 201 1.0 1.7588e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 59 77 0 0 0 639
MatMult 201 1.0 1.7642e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 60 77 0 0 0 637
off: MatMult 201 1.0 2.0519e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 26 38 0 0 0 58 77 0 0 0 548
MatMult 201 1.0 2.0601e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 26 38 0 0 0 58 77 0 0 0 545
Then I did one comparison *with* the software prefetch patch:
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
on: MatMult 201 1.0 6.1305e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 24 38 0 0 0 54 77 0 0 0 813
MatMult 201 1.0 6.1379e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 24 38 0 0 0 54 77 0 0 0 812
off: MatMult 201 1.0 7.8732e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 38 0 0 0 55 77 0 0 0 633
MatMult 201 1.0 7.8841e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 38 0 0 0 55 77 0 0 0 632
So the performance loss from disabling hardware prefetch is *greater*
when using explicit software prefetch than when no software prefetch is
done.
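For concreteness, here is schematically what the software prefetch
patch does (a sketch using the PREFETCH_NTA shim from above, not the
actual PETSc code): while processing row i of an AIJ-format matrix,
issue prefetches for the index and value data of row i+1, exactly the
short irregular streams that the hardware extrapolation handles badly.

static void csr_mult_prefetch(int m, const int *ai, const int *aj,
                              const double *aa, const double *x, double *y)
{
  int i, j;
  for (i = 0; i < m; i++) {
    double sum = 0.0;
    PREFETCH_NTA(aj + ai[i+1]);  /* column indices of the next row */
    PREFETCH_NTA(aa + ai[i+1]);  /* values of the next row */
    for (j = ai[i]; j < ai[i+1]; j++) sum += aa[j] * x[aj[j]];
    y[i] = sum;
  }
}

(On the last row this prefetches one element past the end of the
arrays, which is harmless: prefetch instructions never fault.)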
Finally, I reverted the software prefetch patch and disabled DCU
prefetch (bit 37). I'm not clear on the details of the difference
between this and the streaming prefetch that the tests above use, but
the manual says:

  The DCU prefetcher is an L1 data cache prefetcher. When the DCU
  prefetcher detects multiple loads from the same line done within a
  time limit, the DCU prefetcher assumes the next line will be
  required. The next line is prefetched into the L1 data cache from
  memory or L2.
Compare these numbers to the corresponding 200x200 ones above
(~626 Mflop/s with both prefetchers on, ~539 with streaming prefetch
off).
./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
dcu-off: MatMult 201 1.0 8.2517e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38 0 0 0 61 77 0 0 0 604
both-off: MatMult 201 1.0 9.6107e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38 0 0 0 59 77 0 0 0 518
On my system, these were achieved with
# wrmsr -p 1 0x1a0 0x3364970489 # stream on, DCU off
# wrmsr -p 1 0x1a0 0x3364970689 # stream off, DCU off
My system details:
$ uname -a
Linux kunyang 2.6.32-ARCH #1 SMP PREEMPT Mon Jan 25 20:33:50 CET 2010 x86_64 Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz GenuineIntel GNU/Linux
$ gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared --enable-languages=c,c++,fortran,objc,obj-c++,ada --enable-threads=posix --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic
Thread model: posix
gcc version 4.4.3 (GCC)
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
stepping : 10
cpu MHz : 2534.000
cache size : 3072 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm ida tpr_shadow vnmi flexpriority
bogomips : 5057.73
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
[processor 1 omitted; it is identical to processor 0 apart from
core id, apicid, initial apicid, and bogomips]
Jed