On Tue, May 29, 2012 at 3:52 PM, Benjamin Sanderse <span dir="ltr">&lt;<a href="mailto:B.Sanderse@cwi.nl" target="_blank">B.Sanderse@cwi.nl</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hello all,<br>

<br>

I have a simple question about using MatMult (or MatMultAdd) in parallel.<br>

<br>

I am performing the matrix-vector multiplication<br>

<br>

z = A*x + y<br>

<br>

in my code by using<br>

<br>

call MatMultAdd(A,x,y,z,ierr); CHKERRQ(ierr)<br>

<br>

A is a sparse matrix, type MPIAIJ, and x, y, and z have been obtained using<br>

<br>

call MatGetVecs(A,x,y,ierr); CHKERRQ(ierr)<br>

call MatGetVecs(A,PETSC_NULL_OBJECT,z,ierr); CHKERRQ(ierr)<br>

<br>

x, y, and z are vecs of type mpi.<br>

<br>

The problem is that MatMultAdd is MUCH faster in the sequential case than in the parallel case (by at least a factor of 100).<br></blockquote><div><br></div><div>With any performance question, always always always send the output of -log_summary to <a href="mailto:petsc-maint@mcs.anl.gov">petsc-maint@mcs.anl.gov</a>.</div>

<div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

As an example, here is the output with some properties of A when using -mat_view_info and -info:<br>

<br>

2 processors:<br>

[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781<br>

[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780<br>

[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782<br>

[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782<br>

[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780<br>

[0] MatStashScatterBegin_Private(): No of messages: 0<br>

[1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>

[0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>

[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used<br>

[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used<br>

[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2<br>

[0] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines<br>

[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2<br>

[1] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines<br>

[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780<br>

[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782<br>

[0] MatSetUpMultiply_MPIAIJ(): Using block index set to define scatter<br>

[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780<br>

[1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782<br>

[0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter<br>

[0] VecScatterCreate(): General case: MPI to Seq<br>

[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used<br>

[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used<br>

[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0<br>

[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0<br>

Matrix Object: 2 MPI processes<br>

  type: mpiaij<br>

  rows=1000, cols=900<br>

  total: nonzeros=1800, allocated nonzeros=2000<br>

  total number of mallocs used during MatSetValues calls =0<br>

    not using I-node (on process 0) routines<br>

<br>

1 processor:<br>

[0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374783<br>

[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1000 X 900; storage space: 200 unneeded,1800 used<br>

[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2<br>

[0] Mat_CheckInode(): Found 1000 nodes out of 1000 rows. Not using Inode routines<br>

Matrix Object: 1 MPI processes<br>

  type: seqaij<br>

  rows=1000, cols=900<br>

  total: nonzeros=1800, allocated nonzeros=2000<br>

  total number of mallocs used during MatSetValues calls =0<br>

    not using I-node routines<br>

<br>

When I look at the partitioning of the vectors, I have the following for the parallel case:<br>

x:<br>

          0         450<br>

         450         900<br>

y:<br>

           0         500<br>

         500        1000<br>

z:<br>

           0         500<br>

         500        1000<br>

<br>

This seems OK to me.<br>
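<br>
The ranges above match a contiguous, near-even split of 900 and 1000 entries over two processes. A minimal sketch of that arithmetic in plain Python (not PETSc code; the helper name ownership_ranges is hypothetical, and it assumes the local sizes were left for the library to choose rather than set by hand):<br>

```python
def ownership_ranges(global_size, nprocs):
    """Return [(start, end), ...] for a contiguous, near-even split.

    Hypothetical helper for illustration only; it is not a PETSc routine.
    """
    ranges = []
    start = 0
    for rank in range(nprocs):
        # Every rank gets floor(N / p) entries; the first N % p ranks get one more.
        local = global_size // nprocs + (1 if rank < global_size % nprocs else 0)
        ranges.append((start, start + local))
        start += local
    return ranges


print(ownership_ranges(900, 2))   # layout of x (columns of A)
print(ownership_ranges(1000, 2))  # layout of y and z (rows of A)
```

For 2 processes this gives 0-450 and 450-900 for x, and 0-500 and 500-1000 for y and z, which agrees with the listing above.<br>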

<br>

I am certainly missing something needed to perform this matrix-vector multiplication efficiently. Any ideas?<br>

<br>

Best regards,<br>

<br>

Benjamin<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener<br>