Inefficient Multiplication?
Aron Ahmadia
aja2111 at columbia.edu
Mon Aug 6 10:56:26 CDT 2007
This is a direct consequence of the so-called memory mountain, i.e.
the increasing cost of accessing memory farther and farther away from
the processor on a standard von Neumann architecture: cache -> RAM ->
hard drive. Larger matrices (and vectors) don't fit completely in the
cache, so you're paying a cache-miss penalty. See Bryant and
O'Hallaron, "Computer Systems: A Programmer's Perspective" for more on
this, or, for excruciating detail, check out ATLAS (Automatically
Tuned Linear Algebra Software).
~A
On 8/6/07, John R. Wicks <jwicks at cs.brown.edu> wrote:
> I wrote a test program to compare the performance of petsc and the Matrix
> Template Library (mtl) on packed-array multiplication, and although petsc
> outperformed mtl, I was surprised that the performance of both was not
> strictly linear in the number of entries:
>
> 1) I created a packed column-major matrix of doubles in mtl with the
> corresponding row-major version in petsc; all matrices were sparse with 10
> entries per column (row), with the indices chosen randomly.
>
> 2) For each column size, I created a fixed (dense) vector of doubles and
> performed 1000 multiplications (transposed multiplications) and recorded
> the timings, with the results as follows:
>
> Iterations rows columns entries mtl petsc
> 1000 1000 1000 10000 5.35 0.1
> 1000 1000 2000 10000 5.69 0.1
> 1000 1000 4000 10000 5.91 0.11
> 1000 1000 8000 10000 7.85 0.13
> 1000 1000 16000 10000 9.82 0.21
> 1000 1000 32000 10000 12.18 0.21
> 1000 2000 2000 20000 10.94 0.21
> 1000 2000 4000 20000 11.11 0.21
> 1000 2000 8000 20000 12.12 0.26
> 1000 2000 16000 20000 13.47 0.31
> 1000 2000 32000 20000 16.69 0.4
> 1000 2000 64000 20000 23.66 0.68
> 1000 4000 4000 40000 24.7 0.44
> 1000 4000 8000 40000 26.24 0.52
> 1000 4000 16000 40000 27.73 0.64
> 1000 4000 32000 40000 31.23 0.76
> 1000 4000 64000 40000 37.91 1.26
> 1000 4000 128000 40000 50 2.02
> 1000 8000 8000 80000 49.79 1.08
> 1000 8000 16000 80000 53.24 1.29
> 1000 8000 32000 80000 56.63 1.61
> 1000 8000 64000 80000 59.54 2.44
> 1000 8000 128000 80000 69.85 3.55
> 1000 8000 256000 80000 95.39 4.61
> 1000 16000 16000 160000 103.58 2.52
> 1000 16000 32000 160000 100.83 2.92
> 1000 16000 64000 160000 104.57 4.68
> 1000 16000 128000 160000 127.82 6.67
> 1000 16000 256000 160000 151.82 8.18
> 1000 16000 512000 160000 200.33 9.97
> 1000 32000 32000 320000 206.65 6
> 1000 32000 64000 320000 218.44 9.41
> 1000 32000 128000 320000 234.82 12.95
> 1000 32000 256000 320000 260.71 15.22
> 1000 32000 512000 320000 304.17 18.38
> 1000 32000 1024000 320000 405.6 21.76
>
> As you can see, while petsc was 30-60 times faster than mtl, the performance
> of both packages also depended significantly on the number of rows. Maybe
> this is not surprising, since I believe they both use the BLAS libraries at
> some level; but since I can write down a simple algorithm that performs the
> multiplication in time linear in the number of entries, and I would expect
> such a standard library to use the most efficient algorithm available, I am
> wondering why this is happening. Any ideas?
>
>
More information about the petsc-users mailing list