[petsc-users] Question about matrix permutation
Barry Smith
bsmith at mcs.anl.gov
Sun Jan 31 16:02:17 CST 2010
Ok, I owe you!
>> This looks great; push it to petsc-dev if it really works.
>
> I will, but it needs to be cleaned up a bit before it can be pushed
> (e.g. so that it works with other compilers).
Check config/PETSc/Configure.py configurePrefetch() for the current
prefetch support in PETSc.  It is used, for example, in
src/mat/impls/sbaij/seq/relax.h; you may have found better ways of doing
this, so feel free to change the current support if you have something
better.
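
The general idea is just to ask for an upcoming row's index and value data
while the current row is being processed.  Below is a rough sketch of that
pattern for a generic CSR product, written with the plain GCC
__builtin_prefetch builtin rather than the macro configure selects; the
function name, the one-row prefetch distance, and the locality hints here
are illustrative, not a copy of what relax.h does.

  #include <stddef.h>

  /* y = A*x for a CSR matrix: ai holds row offsets, aj column indices,
     aa values.  Each iteration hints the next row's data so it is in
     cache by the time that row is reached. */
  static void csr_mult_prefetch(size_t m, const int *ai, const int *aj,
                                const double *aa, const double *x, double *y)
  {
    size_t i;
    int    j;
    for (i = 0; i < m; i++) {
      const int    *jrow = aj + ai[i];
      const double *arow = aa + ai[i];
      const int     n    = ai[i+1] - ai[i];
      double        sum  = 0.0;
      /* Hint the next row's indices and values (read-only, low temporal locality). */
      __builtin_prefetch(aj + ai[i+1], 0, 1);
      __builtin_prefetch(aa + ai[i+1], 0, 1);
      for (j = 0; j < n; j++) sum += arow[j] * x[jrow[j]];
      y[i] = sum;
    }
  }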
Barry
On Jan 31, 2010, at 3:26 PM, Jed Brown wrote:
> On Sat, 30 Jan 2010 14:41:03 -0600, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> This looks great; push it to petsc-dev if it really works.
>
> I will, but it needs to be cleaned up a bit before it can be pushed
> (e.g. so that it works with other compilers).
>
>> BUT, this is NOT hardware prefetching; this is specifically
>> software prefetching: you are telling it exactly what to prefetch and
>> when.
>
> Absolutely, but my original statement was
>
>   it seems possible for deficient hardware prefetch to be at fault for
>   a nontrivial amount of the observed difference between Inode and BAIJ.
>
> I never claimed that hardware prefetch was doing anything useful
> (although the tests below show that it is); I claimed that prefetch was
> the reason why Inode was not as fast as our standard performance model
> would indicate.
>
>> I would call it hardware prefetching only when the hardware detects a
>> particular pattern of access and then automatically extrapolates that
>> pattern to prefetch further along the pattern.
>
> Right, and that extrapolation was not very good, for the reasons
> described earlier in this thread.  But to investigate how much good
> hardware prefetch does in this case, I ran the following tests.
>
> First I disabled hardware prefetch. This is poorly documented on the
> web, so here's the scoop:
>
> With a Core 2 on Linux, install msr-tools and check register 0x1a0.
>
> # rdmsr -p 1 0x1a0
> 1364970489
>
> The details are processor dependent, but on both 32- and 64-bit Intel
> chips you set bit 9 to disable hardware prefetch:
>
> # wrmsr -p 1 0x1a0 0x1364970689
>
> To turn hardware prefetch back on, clear that bit:
>
> # wrmsr -p 1 0x1a0 0x1364970489
>
> Note that this all needs to be run as root. For details on MSR and
> which bits to use for chips other than Core 2, see Appendix B of
>
> Intel® 64 and IA-32 Architectures Software Developer's Manual
> Volume 3B: System Programming Guide
>
> All the manuals are here
>
> http://www.intel.com/products/processor/manuals/
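>
> (Aside: rdmsr/wrmsr are thin wrappers around the msr kernel module,
> which exposes each core's registers as /dev/cpu/N/msr.  A sketch of the
> same read-modify-write from C follows; the device path is standard, but
> treat the core number and the bit position as the example values from
> this thread, and check the appendix above for other chips.)
>
>   #include <fcntl.h>
>   #include <stdint.h>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   /* Set the hardware-prefetch-disable bit (bit 9 here) of MSR 0x1a0 on
>      core 1, equivalent to the wrmsr command above.  Needs root and the
>      msr module loaded. */
>   int main(void)
>   {
>     const off_t msr = 0x1a0;
>     uint64_t    val;
>     int         fd = open("/dev/cpu/1/msr", O_RDWR);
>     if (fd < 0) { perror("open"); return 1; }
>     if (pread(fd, &val, sizeof val, msr) != sizeof val) { perror("pread"); return 1; }
>     printf("before: %#llx\n", (unsigned long long)val);
>     val |= UINT64_C(1) << 9;   /* bit 9 set => hardware prefetch disabled */
>     if (pwrite(fd, &val, sizeof val, msr) != sizeof val) { perror("pwrite"); return 1; }
>     close(fd);
>     return 0;
>   }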
>
>
> I then ran the benchmarks using taskset to bind to the correct core and
> nice -n -10 to ensure that mostly-idle background processes would always
> run on the inactive core.  I checked that the results were consistent
> regardless of which core was being used, and ran each test twice.  All
> of these are *without* the software prefetch patch.  It appears that the
> difference is comfortably over 2 percent in all cases.
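>
> (The pinning itself is nothing exotic; for reference, the same thing can
> be done from inside a benchmark with sched_setaffinity and setpriority.
> A sketch, with the core index and niceness as example values, not a
> record of how these runs were launched:)
>
>   #define _GNU_SOURCE
>   #include <sched.h>
>   #include <stdio.h>
>   #include <sys/resource.h>
>
>   /* Bind the calling process to one core and raise its priority,
>      mirroring "nice -n -10 taskset -c 1 ./ex19 ...".  Raising priority
>      needs root (or CAP_SYS_NICE). */
>   int main(void)
>   {
>     cpu_set_t set;
>     CPU_ZERO(&set);
>     CPU_SET(1, &set);  /* core 1: example value */
>     if (sched_setaffinity(0, sizeof set, &set)) { perror("sched_setaffinity"); return 1; }
>     if (setpriority(PRIO_PROCESS, 0, -10)) { perror("setpriority"); return 1; }
>     /* ... run the timed kernel here ... */
>     return 0;
>   }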
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 1000 -snes_max_it 1 -da_grid_x 50 -da_grid_y 50 -log_summary
> on:  MatMult 2001 1.0 5.3418e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 41 41  0  0  0  82 81  0  0  0   568
>      MatMult 2001 1.0 5.3506e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 41 41  0  0  0  82 81  0  0  0   567
> off: MatMult 2001 1.0 5.9154e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  76 81  0  0  0   513
>      MatMult 2001 1.0 5.9271e+00 1.0 3.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  76 81  0  0  0   512
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
> on:  MatMult  201 1.0 7.9634e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38  0  0  0  61 77  0  0  0   626
>      MatMult  201 1.0 7.9570e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38  0  0  0  61 77  0  0  0   626
> off: MatMult  201 1.0 9.2377e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  59 77  0  0  0   539
>      MatMult  201 1.0 9.2294e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  59 77  0  0  0   540
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
> on:  MatMult  201 1.0 1.7588e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  59 77  0  0  0   639
>      MatMult  201 1.0 1.7642e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  60 77  0  0  0   637
> off: MatMult  201 1.0 2.0519e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 26 38  0  0  0  58 77  0  0  0   548
>      MatMult  201 1.0 2.0601e+01 1.0 1.12e+10 1.0 0.0e+00 0.0e+00 0.0e+00 26 38  0  0  0  58 77  0  0  0   545
>
>
> Then I did one comparison *with* the software prefetch patch:
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
> on:  MatMult  201 1.0 6.1305e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 24 38  0  0  0  54 77  0  0  0   813
>      MatMult  201 1.0 6.1379e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 24 38  0  0  0  54 77  0  0  0   812
> off: MatMult  201 1.0 7.8732e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 38  0  0  0  55 77  0  0  0   633
>      MatMult  201 1.0 7.8841e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 25 38  0  0  0  55 77  0  0  0   632
>
> So the performance loss from disabling hardware prefetch is *greater*
> when using explicit software prefetch than when no software prefetch is
> done.
>
>
> Finally, I reverted the software prefetch patch and disabled DCU
> prefetch (bit 37).  I'm not clear on the details of the difference
> between this and the streaming prefetch that the tests above use, but
> the manual says
>
>   The DCU prefetcher is an L1 data cache prefetcher.  When the DCU
>   prefetcher detects multiple loads from the same line done within a
>   time limit, the DCU prefetcher assumes the next line will be
>   required.  The next line is prefetched into the L1 data cache from
>   memory or L2.
>
> Compare these numbers to the ones above (~638 Mflop/s with both on,
> ~547 Mflop/s with stream prefetch off).
>
> ./ex19 -ksp_type cgs -pc_type none -ksp_monitor -ksp_max_it 100 -snes_max_it 1 -da_grid_x 200 -da_grid_y 200 -log_summary
> dcu-off:  MatMult  201 1.0 8.2517e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 38  0  0  0  61 77  0  0  0   604
> both-off: MatMult  201 1.0 9.6107e+00 1.0 4.98e+09 1.0 0.0e+00 0.0e+00 0.0e+00 27 38  0  0  0  59 77  0  0  0   518
>
> On my system, these were achieved with
>
> # wrmsr -p 1 0x1a0 0x3364970489 # stream on, DCU off
> # wrmsr -p 1 0x1a0 0x3364970689 # stream off, DCU off
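>
> (As a sanity check: those two values are just the register contents read
> above with the DCU-disable and stream-prefetch-disable bits ORed in.  A
> tiny C check of that arithmetic, with the bit positions taken from this
> thread:)
>
>   #include <inttypes.h>
>   #include <stdio.h>
>
>   int main(void)
>   {
>     uint64_t base    = UINT64_C(0x1364970489); /* read back with both prefetchers on */
>     uint64_t dcu_off = UINT64_C(1) << 37;      /* DCU prefetch disable */
>     uint64_t hw_off  = UINT64_C(1) << 9;       /* stream (hardware) prefetch disable */
>     printf("stream on,  DCU off: %#" PRIx64 "\n", base | dcu_off);          /* 0x3364970489 */
>     printf("stream off, DCU off: %#" PRIx64 "\n", base | dcu_off | hw_off); /* 0x3364970689 */
>     return 0;
>   }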
>
>
> My system details:
>
> $ uname -a
> Linux kunyang 2.6.32-ARCH #1 SMP PREEMPT Mon Jan 25 20:33:50 CET
> 2010 x86_64 Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz GenuineIntel
> GNU/Linux
> $ gcc -v
> Using built-in specs.
> Target: x86_64-unknown-linux-gnu
> Configured with: ../configure --prefix=/usr --enable-shared
>   --enable-languages=c,c++,fortran,objc,obj-c++,ada --enable-threads=posix
>   --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit
>   --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib
>   --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic
> Thread model: posix
> gcc version 4.4.3 (GCC)
> $ cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 23
> model name : Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
> stepping : 10
> cpu MHz : 2534.000
> cache size : 3072 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 2
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
> pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good
> aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
> pdcm sse4_1 xsave lahf_lm ida tpr_shadow vnmi flexpriority
> bogomips : 5057.73
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
> processor : 1
> vendor_id : GenuineIntel
> cpu family : 6
> model : 23
> model name : Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
> stepping : 10
> cpu MHz : 2534.000
> cache size : 3072 KB
> physical id : 0
> siblings : 2
> core id : 1
> cpu cores : 2
> apicid : 1
> initial apicid : 1
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
> pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good
> aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
> pdcm sse4_1 xsave lahf_lm ida tpr_shadow vnmi flexpriority
> bogomips : 5059.12
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
>
> Jed