Inefficient Multiplication?
John R. Wicks
jwicks at cs.brown.edu
Mon Aug 6 10:10:50 CDT 2007
I wrote a test program to compare the performance of the Matrix Template
Library on packed array multiplication and although petsc outperformed mtl,
I was surprised that its performance was not strictly linear in the number
of entries:
1) I created a packed column major matrix of doubles in mtl with the
corresponding row major version in petsc; all matrices were dense with 10
entries per column (row), where the indices are chosen randomly.
2) For each column size, I created a fixed (dense) vector of doubles and
performed 1000 multiplications (transposed multiplications) and recored the
timings, with the results as follows:
Iterations rows columns entries mtl petc
1000 1000 1000 10000 5.35 0.1
1000 1000 2000 10000 5.69 0.1
1000 1000 4000 10000 5.91 0.11
1000 1000 8000 10000 7.85 0.13
1000 1000 16000 10000 9.82 0.21
1000 1000 32000 10000 12.18 0.21
1000 2000 2000 20000 10.94 0.21
1000 2000 4000 20000 11.11 0.21
1000 2000 8000 20000 12.12 0.26
1000 2000 16000 20000 13.47 0.31
1000 2000 32000 20000 16.69 0.4
1000 2000 64000 20000 23.66 0.68
1000 4000 4000 40000 24.7 0.44
1000 4000 8000 40000 26.24 0.52
1000 4000 16000 40000 27.73 0.64
1000 4000 32000 40000 31.23 0.76
1000 4000 64000 40000 37.91 1.26
1000 4000 128000 40000 50 2.02
1000 8000 8000 80000 49.79 1.08
1000 8000 16000 80000 53.24 1.29
1000 8000 32000 80000 56.63 1.61
1000 8000 64000 80000 59.54 2.44
1000 8000 128000 80000 69.85 3.55
1000 8000 256000 80000 95.39 4.61
1000 16000 16000 160000 103.58 2.52
1000 16000 32000 160000 100.83 2.92
1000 16000 64000 160000 104.57 4.68
1000 16000 128000 160000 127.82 6.67
1000 16000 256000 160000 151.82 8.18
1000 16000 512000 160000 200.33 9.97
1000 32000 32000 320000 206.65 6
1000 32000 64000 320000 218.44 9.41
1000 32000 128000 320000 234.82 12.95
1000 32000 256000 320000 260.71 15.22
1000 32000 512000 320000 304.17 18.38
1000 32000 1024000 320000 405.6 21.76
As you can see, while petsc was 30-60 times faster than mtl, the performance
of both packages also depended significantly on the number of rows. Maybe
this is not surprising, since I believe they both utilize the BLAS libraries
at some level, but since I can write down a simple algorithm for performing
the multiplication linearly in the number of entries, and I would expect
such a standard library to use the most efficient algorithm available, I am
wondering why this is happening? Any ideas?
More information about the petsc-users
mailing list