[petsc-users] Scalability issue
bsmith at mcs.anl.gov
Fri Jul 24 10:50:43 CDT 2015
It would be very helpful if you ran the code on say 1, 2, 4, 8, 16 ... processes with the option -log_summary and send (as attachments) the log summary information.
Also on the same machine run the streams benchmark; with recent releases of PETSc you only need to do
make streams NPMAX=16 (or whatever your largest process count is)
and send the output.
I suspect that you are doing everything fine and it is more an issue with the configuration of your machine. Also read the information at http://www.mcs.anl.gov/petsc/documentation/faq.html#computers on "binding"
> On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt> wrote:
> I have been using PETSc for a few months now, and it truly is fantastic piece of software.
> In my particular example I am working with a large, sparse distributed (MPI AIJ) matrix we can refer as 'G'.
> G is a horizontal - retangular matrix (for example, 1,1 Million rows per 2,1 Million columns). This matrix is commonly very sparse and not diagonal 'heavy' (for example 5,2 Million nnz in which ~50% are on the diagonal block of MPI AIJ representation).
> To work with this matrix, I also have a few parallel vectors (created using MatCreate Vec), we can refer as 'm' and 'k'.
> I am trying to parallelize an iterative algorithm in which the most computational heavy operations are:
> ->Matrix-Vector Multiplication, more precisely G * m + k = b (MatMultAdd). From what I have been reading, to achive a good speedup in this operation, G should be as much diagonal as possible, due to overlapping communication and computation. But even when using a G matrix in which the diagonal block has ~95% of the nnz, I cannot get a decent speedup. Most of the times, the performance even gets worse.
> ->Matrix-Matrix Multiplication, in this case I need to perform G * G' = A, where A is later used on the linear solver and G' is transpose of G. The speedup in this operation is not worse, although is not very good.
> ->Linear problem solving. Lastly, In this operation I compute "Ax=b" from the last two operations. I tried to apply a RCM permutation to A to make it more diagonal, for better performance. However, the problem I faced was that, the permutation is performed locally in each processor and thus, the final result is different with different number of processors. I assume this was intended to reduce communication. The solution I found was
> 1-calculate A
> 2-calculate, localy to 1 machine, the RCM permutation IS using A
> 3-apply this permutation to the lines of G.
> This works well, and A is generated as if RCM permuted. It is fine to do this operation in one machine because it is only done once while reading the input. The nnz of G become more spread and less diagonal, causing problems when calculating G * m + k = b.
> These 3 operations (except the permutation) are performed in each iteration of my algorithm.
> So, my questions are.
> -What are the characteristics of G that lead to a good speedup in the operations I described? Am I missing something and too much obsessed with the diagonal block?
> -Is there a better way to permute A without permute G and still get the same result using 1 or N machines?
> I have been avoiding asking for help for a while. I'm very sorry for the long email.
> Thank you very much for your time.
> Best Regards,
More information about the petsc-users