[petsc-users] Scalability issue

Fri Jul 24 10:41:43 CDT 2015

Hello,

I have been using PETSc for a few months now, and it truly is fantastic 
piece of software.

In my particular example I am working with a large, sparse distributed 
(MPI AIJ) matrix we can refer as 'G'.
G is a horizontal - retangular matrix (for example, 1,1 Million rows 
per 2,1 Million columns). This matrix is commonly very sparse and not 
diagonal 'heavy' (for example 5,2 Million nnz in which ~50% are on the 
diagonal block of MPI AIJ representation).
To work with this matrix, I also have a few parallel vectors (created 
using MatCreate Vec), we can refer as 'm' and 'k'.
I am trying to parallelize an iterative algorithm in which the most 
computational heavy operations are:

->Matrix-Vector Multiplication, more precisely G * m + k = b 
(MatMultAdd). From what I have been reading, to achive a good speedup in 
this operation, G should be as much diagonal as possible, due to 
overlapping communication and computation. But even when using a G 
matrix in which the diagonal block has ~95% of the nnz, I cannot get a 
decent speedup. Most of the times, the performance even gets worse.

->Matrix-Matrix Multiplication, in this case I need to perform G * G' = 
A, where A is later used on the linear solver and G' is transpose of G. 
The speedup in this operation is not worse, although is not very good.

->Linear problem solving. Lastly, In this operation I compute "Ax=b" 
from the last two operations. I tried to apply a RCM permutation to A to 
make it more diagonal, for better performance. However, the problem I 
faced was that, the permutation is performed locally in each processor 
and thus, the final result is different with different number of 
processors. I assume this was intended to reduce communication. The 
solution I found was
1-calculate A
2-calculate, localy to 1 machine, the RCM permutation IS using A
3-apply this permutation to the lines of G.
This works well, and A is generated as if RCM permuted. It is fine to 
do this operation in one machine because it is only done once while 
reading the input. The nnz of G become more spread and less diagonal, 
causing problems when calculating G * m + k = b.

These 3 operations (except the permutation) are performed in each 
iteration of my algorithm.

So, my questions are.
-What are the characteristics of G that lead to a good speedup in the 
operations I described? Am I missing something and too much obsessed 
with the diagonal block?

-Is there a better way to permute A without permute G and still get the 
same result using 1 or N machines?

I have been avoiding asking for help for a while. I'm very sorry for 
the long email.
Thank you very much for your time.
Best Regards,
Nelson