[petsc-users] Question about matrix permutation
Barry Smith
bsmith at mcs.anl.gov
Sun Jan 31 16:02:17 CST 2010
Ok, I owe you!
>> This looks great; push it to petsc-dev if it really works.
>
> I will, but it needs to be cleaned up a bit before it can be pushed
> (e.g. so that it works with other compilers).
Check config/PETSc/Configure.py configurePrefetch() for the current
prefetch support in PETSc.  It is used, for example, in
src/mat/impls/sbaij/seq/relax.h; you may have found better ways of doing
this, so feel free to change the current support if you have something
better.
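
The general idea is just to ask for an upcoming row's index and value data
while the current row is being processed.  Below is a rough sketch of that
pattern for a generic CSR product, written with the plain GCC
__builtin_prefetch builtin rather than the macro configure selects; the
function name, the one-row prefetch distance, and the locality hints here
are illustrative, not a copy of what relax.h does.

  #include <stddef.h>

  /* y = A*x for a CSR matrix: ai holds row offsets, aj column indices,
     aa values.  Each iteration hints the next row's data so it is in
     cache by the time that row is reached. */
  static void csr_mult_prefetch(size_t m, const int *ai, const int *aj,
                                const double *aa, const double *x, double *y)
  {
    size_t i;
    int    j;
    for (i = 0; i < m; i++) {
      const int    *jrow = aj + ai[i];
      const double *arow = aa + ai[i];
      const int     n    = ai[i+1] - ai[i];
      double        sum  = 0.0;
      /* Hint the next row's indices and values (read-only, low temporal locality). */
      __builtin_prefetch(aj + ai[i+1], 0, 1);
      __builtin_prefetch(aa + ai[i+1], 0, 1);
      for (j = 0; j < n; j++) sum += arow[j] * x[jrow[j]];
      y[i] = sum;
    }
  }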
Barry
On Jan 31, 2010, at 3:26 PM, Jed Brown wrote:
> On Sat, 30 Jan 2010 14:41:03 -0600, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> This looks great; push it to petsc-dev if it really works.
>
> I will, but it needs to be cleaned up a bit before it can be pushed
> (e.g. so that it works with other compilers).
>
>> BUT, this is NOT hardware prefetching; this is specifically
>> software prefetching: you are telling it exactly what to prefetch and
>> when.
>
> Absolutely, but my original statement was
>
>   it seems possible for deficient hardware prefetch to be at fault for
>   a nontrivial amount of the observed difference between Inode and BAIJ.
>
> I never claimed that hardware prefetch was doing anything useful
> (although the tests below show that it is); I claimed that prefetch was
> the reason why Inode was not as fast as our standard performance model
> would indicate.
>
>> I would call it hardware prefetching only when the hardware detects a
>> particular pattern of access and then automatically extrapolates that
>> pattern to prefetch further along the pattern.
>
> Right, and that extrapolation was not very good, for the reasons
> described earlier in this thread.  But to investigate how much good
> hardware prefetch does in this case, I ran the following tests.
>
> First I disabled hardware prefetch. This is poorly documented on the
> web, so here's the scoop:
>
> With a Core 2 on Linux, install msr-tools and check register 0x1a0.
>
> # rdmsr -p 1 0x1a0
> 1364970489
>
> The details are processor dependent, but on both 32- and 64-bit Intel
> chips you set bit 9 to disable hardware prefetch:
>
> # wrmsr -p 1 0x1a0 0x1364970689
>
> To turn hardware prefetch back on, clear that bit:
>
> # wrmsr -p 1 0x1a0 0x1364970489
>
> Note that this all needs to be run as root. For details on MSR and
> which bits to use for chips other than Core 2, see Appendix B of
>
> Intel® 64 and IA-32 Architectures Software Developer's Manual
> Volume 3B: System Programming Guide
>
> All the manuals are here
>
> http://www.intel.com/products/processor/manuals/
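>
> (Aside: rdmsr/wrmsr are thin wrappers around the msr kernel module,
> which exposes each core's registers as /dev/cpu/N/msr.  A sketch of the
> same read-modify-write from C follows; the device path is standard, but
> treat the core number and the bit position as the example values from
> this thread, and check the appendix above for other chips.)
>
>   #include <fcntl.h>
>   #include <stdint.h>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   /* Set the hardware-prefetch-disable bit (bit 9 here) of MSR 0x1a0 on
>      core 1, equivalent to the wrmsr command above.  Needs root and the
>      msr module loaded. */
>   int main(void)
>   {
>     const off_t msr = 0x1a0;
>     uint64_t    val;
>     int         fd = open("/dev/cpu/1/msr", O_RDWR);
>     if (fd < 0) { perror("open"); return 1; }
>     if (pread(fd, &val, sizeof val, msr) != sizeof val) { perror("pread"); return 1; }
>     printf("before: %#llx\n", (unsigned long long)val);
>     val |= UINT64_C(1) << 9;   /* bit 9 set => hardware prefetch disabled */
>     if (pwrite(fd, &val, sizeof val, msr) != sizeof val) { perror("pwrite"); return 1; }
>     close(fd);
>     return 0;
>   }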
>
>
> I then ran the benchmarks using taskset to bind to the correct core and
> nice -n -10 to ensure that mostly-idle background processes would always
> run on the inactive core.  I checked that the results were consistent
> regardless of which core was being used, and ran each test twice.  All
> of these are *without* the software prefetch patch.  It appears that the
> difference is comfortably over 2 percent in all cases.
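>
> (The pinning itself is nothing exotic; for reference, the same thing can
> be done from inside a benchmark with sched_setaffinity and setpriority.
> A sketch, with the core index and niceness as example values, not a
> record of how these runs were launched:)
>
>   #define _GNU_SOURCE
>   #include <sched.h>
>   #include <stdio.h>
>   #include <sys/resource.h>
>
>   /* Bind the calling process to one core and raise its priority,
>      mirroring "nice -n -10 taskset -c 1 ./ex19 ...".  Raising priority
>      needs root (or CAP_SYS_NICE). */
>   int main(void)
>   {
>     cpu_set_t set;
>     CPU_ZERO(&set);
>     CPU_SET(1, &set);  /* core 1: example value */
>     if (sched_setaffinity(0, sizeof set, &set)) { perror("sched_setaffinity"); return 1; }
>     if (setpriority(PRIO_PROCESS, 0, -10)) { perror("setpriority"); return 1; }
>     /* ... run the timed kernel here ... */
>     return 0;
>   }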
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 1000 -snes_max_it 1 -da_grid_x 50 -da_grid_y 50 -log_summary
> on:  MatMult 2001 1.0 5.3418e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 41 41  0  0  0  82 81  0  0  0   568
>      MatMult 2001 1.0 5.3506e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 41 41  0  0  0  82 81  0  0  0   567
> off: MatMult 2001 1.0 5.9154e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  76 81  0  0  0   513
>      MatMult 2001 1.0 5.9271e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  76 81  0  0  0   512
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
> on:  MatMult  201 1.0 7.9634e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38  0  0  0  61 77  0  0  0   626
>      MatMult  201 1.0 7.9570e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38  0  0  0  61 77  0  0  0   626
> off: MatMult  201 1.0 9.2377e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  59 77  0  0  0   539
>      MatMult  201 1.0 9.2294e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  59 77  0  0  0   540
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
> on:  MatMult  201 1.0 1.7588e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  59 77  0  0  0   639
>      MatMult  201 1.0 1.7642e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  60 77  0  0  0   637
> off: MatMult  201 1.0 2.0519e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 26 38  0  0  0  58 77  0  0  0   548
>      MatMult  201 1.0 2.0601e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 26 38  0  0  0  58 77  0  0  0   545
>
>
> Then I did one comparison *with* the software prefetch patch:
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
> on:  MatMult  201 1.0 6.1305e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 24 38  0  0  0  54 77  0  0  0   813
>      MatMult  201 1.0 6.1379e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 24 38  0  0  0  54 77  0  0  0   812
> off: MatMult  201 1.0 7.8732e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 38  0  0  0  55 77  0  0  0   633
>      MatMult  201 1.0 7.8841e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 38  0  0  0  55 77  0  0  0   632
>
> So the performance loss from disabling hardware prefetch is *greater*
> when using explicit software prefetch than when no software prefetch is
> done.
>
>
> Finally, I reverted the software prefetch patch and disabled DCU
> prefetch (bit 37).  I'm not clear on the details of the difference
> between this and the streaming prefetch that the tests above use, but
> the manual says
>
>   The DCU prefetcher is an L1 data cache prefetcher.  When the DCU
>   prefetcher detects multiple loads from the same line done within a
>   time limit, the DCU prefetcher assumes the next line will be
>   required.  The next line is prefetched into the L1 data cache from
>   memory or L2.
>
> Compare these numbers to the ones above (~638 Mflop/s with both on,
> ~547 Mflop/s with stream prefetch off).
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
> dcu-off:  MatMult  201 1.0 8.2517e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38  0  0  0  61 77  0  0  0   604
> both-off: MatMult  201 1.0 9.6107e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  59 77  0  0  0   518
>
> On my system, these were achieved with
>
> # wrmsr -p 1 0x1a0 0x3364970489 # stream on, DCU off
> # wrmsr -p 1 0x1a0 0x3364970689 # stream off, DCU off
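>
> (As a sanity check: those two values are just the register contents read
> above with the DCU-disable and stream-prefetch-disable bits ORed in.  A
> tiny C check of that arithmetic, with the bit positions taken from this
> thread:)
>
>   #include <inttypes.h>
>   #include <stdio.h>
>
>   int main(void)
>   {
>     uint64_t base    = UINT64_C(0x1364970489); /* read back with both prefetchers on */
>     uint64_t dcu_off = UINT64_C(1) << 37;      /* DCU prefetch disable */
>     uint64_t hw_off  = UINT64_C(1) << 9;       /* stream (hardware) prefetch disable */
>     printf("stream on,  DCU off: %#" PRIx64 "\n", base | dcu_off);          /* 0x3364970489 */
>     printf("stream off, DCU off: %#" PRIx64 "\n", base | dcu_off | hw_off); /* 0x3364970689 */
>     return 0;
>   }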
>
>
> My system details:
>
> $ uname -a
> Linux kunyang 2.6.32-ARCH #1 SMP PREEMPT Mon Jan 25 20:33:50 CET
> 2010 x86_64 Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz GenuineIntel
> GNU/Linux
> $ gcc -v
> Using built-in specs.
> Target: x86_64-unknown-linux-gnu
> Configured with: ../configure --prefix=/usr --enable-shared
>   --enable-languages=c,c++,fortran,objc,obj-c++,ada --enable-threads=posix
>   --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit
>   --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib
>   --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic
> Thread model: posix
> gcc version 4.4.3 (GCC)
> $ cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 23
> model name : Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
> stepping : 10
> cpu MHz : 2534.000
> cache size : 3072 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 2
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
> pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good
> aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
> pdcm sse4_1 xsave lahf_lm ida tpr_shadow vnmi flexpriority
> bogomips : 5057.73
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
> processor : 1
> vendor_id : GenuineIntel
> cpu family : 6
> model : 23
> model name : Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
> stepping : 10
> cpu MHz : 2534.000
> cache size : 3072 KB
> physical id : 0
> siblings : 2
> core id : 1
> cpu cores : 2
> apicid : 1
> initial apicid : 1
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
> pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good
> aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
> pdcm sse4_1 xsave lahf_lm ida tpr_shadow vnmi flexpriority
> bogomips : 5059.12
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
>
> Jed