Inefficient Multiplication?
Aron Ahmadia
aja2111 at columbia.edu
Mon Aug 6 10:56:26 CDT 2007
This is a direct consequence of the so-called memory mountain, i.e.
the increasing cost of accessing memory farther and farther away from
the processor on a standard von Neumann architecture: cache -> RAM ->
hard drive. Larger matrices (and vectors) don't fit completely in the
cache, so you're paying a cache-miss penalty. See Bryant and
O'Hallaron, "Computer Systems: A Programmer's Perspective" for more on
this, or, for excruciating detail, check out ATLAS (Automatically
Tuned Linear Algebra Software).
~A
On 8/6/07, John R. Wicks <jwicks at cs.brown.edu> wrote:
> I wrote a test program to compare the performance of petsc and the Matrix
> Template Library (mtl) on packed-array multiplication, and although petsc
> outperformed mtl, I was surprised that the performance of both was not
> strictly linear in the number of entries:
>
> 1) I created a packed column-major matrix of doubles in mtl with the
> corresponding row-major version in petsc; all matrices were sparse with 10
> entries per column (row), with the indices chosen randomly.
>
> 2) For each column size, I created a fixed (dense) vector of doubles and
> performed 1000 multiplications (transposed multiplications) and recorded
> the timings, with the results as follows:
>
> Iterations rows columns entries mtl petsc
> 1000 1000 1000 10000 5.35 0.1
> 1000 1000 2000 10000 5.69 0.1
> 1000 1000 4000 10000 5.91 0.11
> 1000 1000 8000 10000 7.85 0.13
> 1000 1000 16000 10000 9.82 0.21
> 1000 1000 32000 10000 12.18 0.21
> 1000 2000 2000 20000 10.94 0.21
> 1000 2000 4000 20000 11.11 0.21
> 1000 2000 8000 20000 12.12 0.26
> 1000 2000 16000 20000 13.47 0.31
> 1000 2000 32000 20000 16.69 0.4
> 1000 2000 64000 20000 23.66 0.68
> 1000 4000 4000 40000 24.7 0.44
> 1000 4000 8000 40000 26.24 0.52
> 1000 4000 16000 40000 27.73 0.64
> 1000 4000 32000 40000 31.23 0.76
> 1000 4000 64000 40000 37.91 1.26
> 1000 4000 128000 40000 50 2.02
> 1000 8000 8000 80000 49.79 1.08
> 1000 8000 16000 80000 53.24 1.29
> 1000 8000 32000 80000 56.63 1.61
> 1000 8000 64000 80000 59.54 2.44
> 1000 8000 128000 80000 69.85 3.55
> 1000 8000 256000 80000 95.39 4.61
> 1000 16000 16000 160000 103.58 2.52
> 1000 16000 32000 160000 100.83 2.92
> 1000 16000 64000 160000 104.57 4.68
> 1000 16000 128000 160000 127.82 6.67
> 1000 16000 256000 160000 151.82 8.18
> 1000 16000 512000 160000 200.33 9.97
> 1000 32000 32000 320000 206.65 6
> 1000 32000 64000 320000 218.44 9.41
> 1000 32000 128000 320000 234.82 12.95
> 1000 32000 256000 320000 260.71 15.22
> 1000 32000 512000 320000 304.17 18.38
> 1000 32000 1024000 320000 405.6 21.76
>
> As you can see, while petsc was 30-60 times faster than mtl, the performance
> of both packages also depended significantly on the number of rows. Maybe
> this is not surprising, since I believe they both use the BLAS libraries at
> some level; but since I can write down a simple algorithm that performs the
> multiplication in time linear in the number of entries, and I would expect
> such a standard library to use the most efficient algorithm available, I am
> wondering why this is happening. Any ideas?
>
>
More information about the petsc-users mailing list