[petsc-users] MatMult

Benjamin Sanderse B.Sanderse at cwi.nl
Tue May 29 10:52:15 CDT 2012


Hello all,

I have a simple question about using MatMult (or MatMultAdd) in parallel.

I am performing the matrix-vector multiplication

z = A*x + y

in my code by using

call MatMultAdd(A,x,y,z,ierr); CHKERRQ(ierr)

A is a sparse matrix of type MPIAIJ, and x, y, and z have been obtained with:

call MatGetVecs(A,x,y,ierr); CHKERRQ(ierr)    
call MatGetVecs(A,PETSC_NULL_OBJECT,z,ierr); CHKERRQ(ierr)    

x, y, and z are Vecs of type mpi (distributed MPI vectors).
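
For completeness, below is a minimal self-contained sketch of how these calls fit together. It is not my actual code: the sparsity pattern is a placeholder, error checking is omitted for brevity, and it assumes a recent PETSc with Fortran modules, where MatGetVecs has been renamed MatCreateVecs and PETSC_NULL_OBJECT has become PETSC_NULL_VEC.

      program matmultadd_sketch
#include <petsc/finclude/petscmat.h>
      use petscmat
      implicit none

      Mat            A
      Vec            x, y, z
      PetscErrorCode ierr
      PetscInt       i, istart, iend
      PetscInt       row(1), cols(2)
      PetscInt       nrows, ncolsA, one, two
      PetscScalar    vals(2), vone

      nrows  = 1000
      ncolsA = 900
      one    = 1
      two    = 2
      vone   = 1.0

      call PetscInitialize(PETSC_NULL_CHARACTER, ierr)

      ! Rectangular MPIAIJ matrix with at most 2 nonzeros per row,
      ! mimicking the sizes reported by -mat_view_info below
      call MatCreate(PETSC_COMM_WORLD, A, ierr)
      call MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, nrows, ncolsA, ierr)
      call MatSetFromOptions(A, ierr)
      call MatSetUp(A, ierr)

      call MatGetOwnershipRange(A, istart, iend, ierr)
      do i = istart, iend - 1
         ! placeholder stencil: the real sparsity of A is not shown in this post
         row(1)  = i
         cols(1) = mod(i, ncolsA)
         cols(2) = mod(i + 1, ncolsA)
         vals(1) = 1.0
         vals(2) = -1.0
         call MatSetValues(A, one, row, two, cols, vals, INSERT_VALUES, ierr)
      end do
      call MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY, ierr)
      call MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY, ierr)

      ! x is laid out like the columns of A, y and z like its rows
      call MatCreateVecs(A, x, y, ierr)
      call MatCreateVecs(A, PETSC_NULL_VEC, z, ierr)
      call VecSet(x, vone, ierr)
      call VecSet(y, vone, ierr)

      ! z = A*x + y
      call MatMultAdd(A, x, y, z, ierr)

      call VecDestroy(x, ierr)
      call VecDestroy(y, ierr)
      call VecDestroy(z, ierr)
      call MatDestroy(A, ierr)
      call PetscFinalize(ierr)
      end program matmultadd_sketch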

The problem is that in the sequential case MatMultAdd is MUCH faster than in the parallel case (at least a factor of 100 difference).
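
(In case it is useful: here is a minimal sketch of how this comparison could be isolated in the PETSc performance summary, e.g. with -log_summary, or -log_view in newer releases, so that the multiply is reported separately. The stage name and the declaration are assumptions, not taken from my actual code; the fragment is meant to sit inside a program like the sketch above.)

      PetscLogStage stage

      call PetscLogStageRegister('MatMultAdd timing', stage, ierr)
      call PetscLogStagePush(stage, ierr)
      call MatMultAdd(A, x, y, z, ierr)
      call PetscLogStagePop(ierr)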

As an example, here is the output showing some properties of A when running with -mat_view_info and -info:

2 processors:
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
[0] MatStashScatterBegin_Private(): No of messages: 0 
[1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
[0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used
[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used
[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
[0] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines
[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
[1] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
[0] MatSetUpMultiply_MPIAIJ(): Using block index set to define scatter
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
[0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
[0] VecScatterCreate(): General case: MPI to Seq
[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used
[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used
[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
Matrix Object: 2 MPI processes
  type: mpiaij
  rows=1000, cols=900
  total: nonzeros=1800, allocated nonzeros=2000
  total number of mallocs used during MatSetValues calls =0
    not using I-node (on process 0) routines

1 processor:
[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374783
[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1000 X 900; storage space: 200 unneeded,1800 used
[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
[0] Mat_CheckInode(): Found 1000 nodes out of 1000 rows. Not using Inode routines
Matrix Object: 1 MPI processes
  type: seqaij
  rows=1000, cols=900
  total: nonzeros=1800, allocated nonzeros=2000
  total number of mallocs used during MatSetValues calls =0
    not using I-node routines

When I look at the partitioning (ownership ranges) of the vectors in the parallel case, I have the following:
x:
  process 0:     0    450
  process 1:   450    900
y:
  process 0:     0    500
  process 1:   500   1000
z:
  process 0:     0    500
  process 1:   500   1000

This seems OK to me. 
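
(For reference, these numbers are the per-process ownership ranges; a minimal fragment for querying them, with assumed variable names, would be:)

      PetscInt istart, iend

      ! this process owns entries istart .. iend-1 of x
      call VecGetOwnershipRange(x, istart, iend, ierr)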

I am certainly missing something needed to perform this matrix-vector multiplication efficiently. Any ideas?

Best regards,

Benjamin

