Inefficient Multiplication?

Mon Aug 6 10:10:50 CDT 2007

I wrote a test program to compare the performance of the Matrix Template
Library on packed array multiplication and although petsc outperformed mtl,
I was surprised that its performance was not strictly linear in the number
of entries:

1) I created a packed column major matrix of doubles in mtl with the
corresponding row major version in petsc; all matrices were dense with 10
entries per column (row), where the indices are chosen randomly.  

2) For each column size, I created a fixed (dense) vector of doubles and
performed 1000 multiplications (transposed multiplications) and recored the
timings, with the results as follows:

Iterations	 rows	 columns 	 entries	 mtl	 petc	
1000	1000	1000	10000	5.35	0.1
1000	1000	2000	10000	5.69	0.1
1000	1000	4000	10000	5.91	0.11
1000	1000	8000	10000	7.85	0.13
1000	1000	16000	10000	9.82	0.21
1000	1000	32000	10000	12.18	0.21
1000	2000	2000	20000	10.94	0.21
1000	2000	4000	20000	11.11	0.21
1000	2000	8000	20000	12.12	0.26
1000	2000	16000	20000	13.47	0.31
1000	2000	32000	20000	16.69	0.4
1000	2000	64000	20000	23.66	0.68
1000	4000	4000	40000	24.7	0.44
1000	4000	8000	40000	26.24	0.52
1000	4000	16000	40000	27.73	0.64
1000	4000	32000	40000	31.23	0.76
1000	4000	64000	40000	37.91	1.26
1000	4000	128000	40000	50	2.02
1000	8000	8000	80000	49.79	1.08
1000	8000	16000	80000	53.24	1.29
1000	8000	32000	80000	56.63	1.61
1000	8000	64000	80000	59.54	2.44
1000	8000	128000	80000	69.85	3.55
1000	8000	256000	80000	95.39	4.61
1000	16000	16000	160000	103.58	2.52
1000	16000	32000	160000	100.83	2.92
1000	16000	64000	160000	104.57	4.68
1000	16000	128000	160000	127.82	6.67
1000	16000	256000	160000	151.82	8.18
1000	16000	512000	160000	200.33	9.97
1000	32000	32000	320000	206.65	6
1000	32000	64000	320000	218.44	9.41
1000	32000	128000	320000	234.82	12.95
1000	32000	256000	320000	260.71	15.22
1000	32000	512000	320000	304.17	18.38
1000	32000	1024000	320000	405.6	21.76

As you can see, while petsc was 30-60 times faster than mtl, the performance
of both packages also depended significantly on the number of rows.  Maybe
this is not surprising, since I believe they both utilize the BLAS libraries
at some level, but since I can write down a simple algorithm for performing
the multiplication linearly in the number of entries, and I would expect
such a standard library to use the most efficient algorithm available, I am
wondering why this is happening?  Any ideas?