[petsc-users] Scalability issue
Nelson Filipe Lopes da Silva
nelsonflsilva at ist.utl.pt
Fri Jul 24 10:41:43 CDT 2015
Hello,
I have been using PETSc for a few months now, and it truly is a
fantastic piece of software.
In my particular example I am working with a large, sparse, distributed
(MPI AIJ) matrix we can refer to as 'G'.
G is a horizontal, rectangular matrix (for example, 1.1 million rows
by 2.1 million columns). This matrix is commonly very sparse and not
diagonal 'heavy' (for example 5.2 million nonzeros, of which ~50% are in
the diagonal block of the MPI AIJ representation).
To work with this matrix, I also have a few parallel vectors (created
using MatCreateVecs), which we can refer to as 'm' and 'k'.
I am trying to parallelize an iterative algorithm in which the most
computational heavy operations are:
->Matrix-vector multiplication, more precisely G * m + k = b
(MatMultAdd). From what I have been reading, to achieve a good speedup
in this operation G should have as many of its nonzeros as possible in
the diagonal block, so that communication can overlap with computation.
But even with a G matrix whose diagonal block holds ~95% of the
nonzeros, I cannot get a decent speedup. Most of the time, performance
even gets worse.
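To make the diagonal/off-diagonal split concrete, here is a tiny serial SciPy sketch (a stand-in for the PETSc calls, not PETSc itself; the sizes, seeds, and variable names are made up for illustration). It computes b = G*m + k in one shot, and then again by splitting one rank's columns into a "diagonal block" (locally owned columns, no communication needed) and an "off-diagonal block" (columns whose vector entries would arrive from other ranks), which is how the MPIAIJ format overlaps communication with computation:

```python
# Serial SciPy sketch of MatMultAdd: b = G*m + k, with the MPIAIJ-style
# split of this rank's rows into a diagonal and an off-diagonal block.
# Toy sizes: pretend this rank owns rows 0..5 and columns 0..5.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
rows, cols = 6, 12
G = sp.random(rows, cols, density=0.4, random_state=0, format="csr")
m = rng.random(cols)
k = rng.random(rows)

# Full product in one call (what MatMultAdd computes).
b = G @ m + k

# MPIAIJ-style split: A_diag needs only local entries of m and can be
# applied immediately; A_off needs ghost entries from other ranks, so
# its product is what gets overlapped with the communication.
A_diag, A_off = G[:, :rows], G[:, rows:]
b_split = A_diag @ m[:rows] + A_off @ m[rows:] + k

assert np.allclose(b, b_split)  # the split reproduces the full product
```

The fewer nonzeros land in A_off, the less vector data has to travel between ranks before the second partial product can finish.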
->Matrix-matrix multiplication. In this case I need to perform G * G' =
A, where G' is the transpose of G and A is later used by the linear
solver. The speedup of this operation does not degrade, although it is
not very good either.
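For reference, a serial SciPy sketch of this product (again a stand-in, not PETSc; in PETSc this corresponds to MatMatTransposeMult, and the toy sizes are made up). The key point is that A = G*G' is square and symmetric, which is what makes the RCM ordering below applicable to it:

```python
# Serial SciPy sketch of A = G * G' (MatMatTransposeMult in PETSc).
import numpy as np
import scipy.sparse as sp

G = sp.random(5, 9, density=0.5, random_state=1, format="csr")
A = (G @ G.T).tocsr()

# A is square (rows x rows) and symmetric by construction.
assert A.shape == (5, 5)
assert np.allclose(A.toarray(), A.T.toarray())
```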
->Linear problem solving. Lastly, in this operation I solve "Ax=b"
using the results of the last two operations. I tried to apply an RCM
permutation to A to make it more diagonal, for better performance.
However, the problem I faced was that the permutation is performed
locally on each processor, so the final result differs depending on the
number of processors. I assume this was intended to reduce
communication. The solution I found was:
1-calculate A
2-calculate, locally on one machine, the RCM permutation IS using A
3-apply this permutation to the rows of G.
This works well, and A is generated as if it had been RCM permuted. It
is fine to do this operation on one machine because it is only done once
while reading the input. However, the nonzeros of G become more spread
out and less diagonal, causing problems when calculating G * m + k = b.
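The three steps above can be sketched in serial SciPy (a stand-in for the PETSc IS/MatPermute machinery; toy sizes and seeds). The identity being exploited is P A P' = (P G)(P G)': symmetrically RCM-permuting A is the same as permuting only the rows of G and rebuilding A:

```python
# Serial SciPy sketch of the three-step RCM trick: an RCM ordering of
# A = G*G', applied to the rows of G, reproduces the symmetrically
# permuted A, because P A P' = (P G)(P G)'.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

G = sp.random(8, 14, density=0.3, random_state=2, format="csr")
A = (G @ G.T).tocsr()                                # step 1: calculate A

perm = reverse_cuthill_mckee(A, symmetric_mode=True) # step 2: RCM IS from A
A_rcm = A[perm, :][:, perm]                          # symmetric permutation of A
G_perm = G[perm, :]                                  # step 3: permute rows of G
A_from_G = (G_perm @ G_perm.T).tocsr()               # rebuild A from permuted G

assert np.allclose(A_rcm.toarray(), A_from_G.toarray())
```

The sketch also makes the trade-off visible: perm reorders only the rows of G, so the column distribution (and hence the diagonal/off-diagonal split in the matrix-vector product) is disturbed.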
These 3 operations (except the permutation) are performed in each
iteration of my algorithm.
So, my questions are:
-What characteristics of G lead to a good speedup in the operations I
described? Am I missing something, and obsessing too much over the
diagonal block?
-Is there a better way to permute A, without permuting G, that still
gives the same result on 1 or N machines?
I have been avoiding asking for help for a while. I'm very sorry for
the long email.
Thank you very much for your time.
Best Regards,
Nelson