Agreed. According to your timings, it took ~130 seconds to invert a 3000 x 3000 matrix. With dense algorithms, the cost of LU with partial pivoting is 2/3 n^3, and performing triangle solves against n right hand sides requires n^3 flops. Thus the total work is 5/3 n^3. With n=3000, this is roughly 45 billion flops. Nearly any machine can easily hit at least 50% of theoretical peak on both of these operations, and thus, if your laptop has a theoretical peak of 5 GFlops, it should take 18 seconds with a dense algorithm. This is a pretty substantial improvement over what you currently are seeing, and dense algorithms parallelize nicely since there are O(n^3) operations to perform and only O(n^2) data.<br>

<br>On the other hand, when you increase your problem size to n ~= 30,000, the asymptotic difference between sparse and dense algorithms will start to become obvious. In 2d, sparse factorization is O(n^3/2) and the cost of doing n triangle solves is O(n^2 log(n)), leading to a total complexity of O(n^2 log(n)) versus the dense O(n^3) approach. For 3d problems, the cost of the triangle solves again dominates at O(n^7/3) versus the direct O(n^3) cost.<br>

<br>As a practical matter, according to the information you attached, you are both running in debug mode and not using optimized computational kernels. If you fix both of these issues, your performance should increase substantially.<br>

<br>Jack<br><br><div class="gmail_quote">On Mon, Aug 1, 2011 at 4:43 PM, Aron Ahmadia <span dir="ltr">&lt;<a href="mailto:aron.ahmadia@kaust.edu.sa">aron.ahmadia@kaust.edu.sa</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div dir="ltr">What Matt is getting at is that typically we measure the computational difficulty of a problem as a function of the &#39;unknowns&#39;.  If you are looking at turning a sparse matrix O(n) bytes into a dense inverse O(n^2) bytes, you&#39;ve taken what was originally a potentially optimal problem and turned it into one of quadratic difficulty in terms of memory, and even more in terms of flops.  You may have better luck using an explicitly dense method/algorithm for computing the inverse: it will scale well in the sense that adding more processors will work faster (this is the heart of the LINPACK computation after all), but you will need a lot of core-hours and memory to get there...<div>


<br></div><div>A</div></div>

</blockquote></div><br>