<div class="gmail_quote">On Wed, May 30, 2012 at 2:23 AM, Benjamin Sanderse <span dir="ltr"><<a href="mailto:B.Sanderse@cwi.nl" target="_blank">B.Sanderse@cwi.nl</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Sorry for forgetting -log_summary. Attached are log_summary for 1 and 2 processors, for both a problem with about 1000 unknowns and one with 125000 unknowns. The summary is for a run of the entire code, which involves many MatMults. I hope this still provides insight on what is going on.<br>


As you can see there is an extraordinary use of MatGetRow - I am working to change this - but they should not influence speed of the MatMults. Any thoughts?<br></blockquote><div><br></div><div>1. What computer is this running on? Specifically, how is its memory hierarchy laid out?  <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html#computers">http://www.mcs.anl.gov/petsc/documentation/faq.html#computers</a>  Can you run the benchmarks in src/benchmarks/streams/?</div>

<div><br></div><div>2. It's worth heeding this message; the performance will look significantly different in an optimized build. If the parallel version is still much slower, please send that -log_summary.</div><div><br></div><div><div><font face="courier new, monospace">      ##########################################################</font></div>

<div><font face="courier new, monospace">      #                                                        #</font></div><div><font face="courier new, monospace">      #                          WARNING!!!                    #</font></div>

<div><font face="courier new, monospace">      #                                                        #</font></div><div><font face="courier new, monospace">      #   This code was compiled with a debugging option,      #</font></div>

<div><font face="courier new, monospace">      #   To get timing results run ./configure                #</font></div><div><font face="courier new, monospace">      #   using --with-debugging=no, the performance will      #</font></div>

<div><font face="courier new, monospace">      #   be generally two or three times faster.              #</font></div><div><font face="courier new, monospace">      #                                                        #</font></div>

<div><font face="courier new, monospace">      ##########################################################</font></div></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<span class="HOEnZb"><font color="#888888"><br>

Benjamin<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

----- Original Message -----<br>

From: "Jed Brown" <<a href="mailto:jedbrown@mcs.anl.gov">jedbrown@mcs.anl.gov</a>><br>

To: "PETSc users list" <<a href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a>><br>

Sent: Tuesday, May 29, 2012 5:56:51 PM<br>

Subject: Re: [petsc-users] MatMult<br>

<br>

On Tue, May 29, 2012 at 10:52 AM, Benjamin Sanderse <<a href="mailto:B.Sanderse@cwi.nl">B.Sanderse@cwi.nl</a>>wrote:<br>

<br>

> Hello all,<br>

><br>

> I have a simple question about using MatMult (or MatMultAdd) in parallel.<br>

><br>

> I am performing the matrix-vector multiplication<br>

><br>

> z = A*x + y<br>

><br>

> in my code by using<br>

><br>

> call MatMultAdd(A,x,y,z,ierr); CHKERRQ(ierr)<br>

><br>

> A is a sparse matrix, type MPIAIJ, and x, y, and z have been obtained using<br>

><br>

> call MatGetVecs(A,x,y,ierr); CHKERRQ(ierr)<br>

> call MatGetVecs(A,PETSC_NULL_OBJECT,z,ierr); CHKERRQ(ierr)<br>

><br>

> x, y, and z are vecs of type mpi.<br>

><br>

> The problem is that in the sequential case the MatMultAdd is MUCH faster<br>

> than in the parallel case (at least a factor 100 difference).<br>

><br>

<br>

1. Send output of -log_summary<br>

<br>

2. This matrix is tiny (1000x1000) and very sparse (at most 2 nonzeros per<br>

row) so you should not expect speedup from running in parallel.<br>

<br>

<br>

><br>

> As an example, here is the output with some properties of A when using<br>

> -mat_view_info and -info:<br>

><br>

> 2 processors:<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688<br>

> -2080374781<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [0] MatStashScatterBegin_Private(): No of messages: 0<br>

> [1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>

> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>

> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100<br>

> unneeded,900 used<br>

> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100<br>

> unneeded,900 used<br>

> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2<br>

> [0] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode<br>

> routines<br>

> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2<br>

> [1] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode<br>

> routines<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] MatSetUpMultiply_MPIAIJ(): Using block index set to define scatter<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374780<br>

> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689<br>

> -2080374782<br>

> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter<br>

> [0] VecScatterCreate(): General case: MPI to Seq<br>

> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0<br>

> unneeded,0 used<br>

> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0<br>

> unneeded,0 used<br>

> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0<br>

> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0<br>

> Matrix Object: 2 MPI processes<br>

>  type: mpiaij<br>

>  rows=1000, cols=900<br>

>  total: nonzeros=1800, allocated nonzeros=2000<br>

>  total number of mallocs used during MatSetValues calls =0<br>

>    not using I-node (on process 0) routines<br>

><br>

> 1 processor:<br>

> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688<br>

> -2080374783<br>

> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1000 X 900; storage space: 200<br>

> unneeded,1800 used<br>

> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>

> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2<br>

> [0] Mat_CheckInode(): Found 1000 nodes out of 1000 rows. Not using Inode<br>

> routines<br>

> Matrix Object: 1 MPI processes<br>

>  type: seqaij<br>

>  rows=1000, cols=900<br>

>  total: nonzeros=1800, allocated nonzeros=2000<br>

>  total number of mallocs used during MatSetValues calls =0<br>

>    not using I-node routines<br>

><br>

> When I look at the partitioning of the vectors, I have the following for<br>

> the parallel case:<br>

> x:<br>

>          0         450<br>

>         450         900<br>

> y:<br>

>           0         500<br>

>         500        1000<br>

> z:<br>

>           0         500<br>

>         500        1000<br>

><br>

> This seems OK to me.<br>

><br>

> Certainly I am missing something in performing this matrix-vector<br>

> multiplication efficiently. Any ideas?<br>

><br>

> Best regards,<br>

><br>

> Benjamin<br>

><br>

</div></div></blockquote></div><br>
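<div>The point in the thread that a 1000 x 900 matrix with 1800 nonzeros is too small to benefit from a second process can be made concrete with a rough estimate of the memory traffic for one multiply. This is only a sketch: the function name <font face="courier new, monospace">aij_matmult_bytes</font> is hypothetical, and the 8-byte scalars, 4-byte indices, and plain compressed-row layout are assumptions about the build, not values taken from the attached logs.</div>

```python
def aij_matmult_bytes(rows, cols, nnz, scalar_bytes=8, index_bytes=4):
    """Rough lower bound on bytes moved for one y = A*x in compressed-row storage.

    Assumes (not confirmed by the thread) 8-byte scalars and 4-byte indices.
    """
    values = nnz * scalar_bytes              # one stored value per nonzero
    col_indices = nnz * index_bytes          # one column index per nonzero
    row_offsets = (rows + 1) * index_bytes   # row pointer array
    x_read = cols * scalar_bytes             # input vector, read once
    y_write = rows * scalar_bytes            # output vector, written once
    return values + col_indices + row_offsets + x_read + y_write


def matmult_flops(nnz):
    """One multiply and one add per stored nonzero."""
    return 2 * nnz


# Sizes reported by -mat_view_info in the original message.
total_bytes = aij_matmult_bytes(rows=1000, cols=900, nnz=1800)
total_flops = matmult_flops(1800)
print(f"bytes per MatMult: {total_bytes}")
print(f"flops per MatMult: {total_flops}")
```

<div>Under these assumptions one multiply touches about 40 KB and performs 3600 floating-point operations, so there is very little work to divide between two processes and any fixed per-call cost is likely to dominate the timing.</div>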