[petsc-users] MatMult

Jed Brown jedbrown at mcs.anl.gov
Wed May 30 06:43:26 CDT 2012


On Wed, May 30, 2012 at 2:23 AM, Benjamin Sanderse <B.Sanderse at cwi.nl> wrote:

> Sorry for forgetting -log_summary. Attached are log_summary for 1 and 2
> processors, for both a problem with about 1000 unknowns and one with 125000
> unknowns. The summary is for a run of the entire code, which involves many
> MatMults. I hope this still provides insight on what is going on.
> As you can see there is an extraordinary number of MatGetRow calls - I am
> working to change this - but it should not influence the speed of the
> MatMults. Any thoughts?
>

1. What computer is this running on? Specifically, how is its memory
hierarchy laid out?
http://www.mcs.anl.gov/petsc/documentation/faq.html#computers  Can you run
the benchmarks in src/benchmarks/streams/?
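For reference, running the streams benchmark from a PETSc source tree looks roughly like the following; the exact make target and the NPMAX variable are assumptions here, since they vary across PETSc versions, so check the makefile in that directory.

```shell
# Rough sketch: measure achievable memory bandwidth with PETSc's bundled
# STREAMS benchmark. Assumes PETSC_DIR/PETSC_ARCH are set from your build.
cd $PETSC_DIR/src/benchmarks/streams
# NPMAX is a placeholder for the largest MPI process count to test;
# the benchmark reports bandwidth as the process count increases.
make streams NPMAX=2
```

If the reported bandwidth stops increasing after one or two processes, that memory-bandwidth ceiling, not PETSc, is what limits MatMult scaling.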

2. It's worth heeding this message; with an optimized build the performance
will look significantly different. If the parallel version is still much
slower after rebuilding, please send that -log_summary.

      ##########################################################
      #                                                        #
      #                          WARNING!!!                    #
      #                                                        #
      #   This code was compiled with a debugging option,      #
      #   To get timing results run ./configure                #
      #   using --with-debugging=no, the performance will      #
      #   be generally two or three times faster.              #
      #                                                        #
      ##########################################################
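Acting on that warning means reconfiguring and rebuilding PETSc without debugging. A minimal sketch follows; the --with-debugging=no option comes from the warning itself, while the PETSC_ARCH name and any compiler/MPI flags are placeholders you would adapt to your installation.

```shell
# Sketch: build a separate optimized PETSc configuration alongside the
# debug one. "arch-opt" is an arbitrary name for the new build directory.
cd $PETSC_DIR
./configure PETSC_ARCH=arch-opt --with-debugging=no
make PETSC_ARCH=arch-opt all
# Then run your application with PETSC_ARCH=arch-opt so it links the
# optimized libraries before re-collecting -log_summary output.
```

Keeping the debug and optimized builds as separate PETSC_ARCH directories lets you switch between them without reconfiguring.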



>
> Benjamin
>
> ----- Original Message -----
> From: "Jed Brown" <jedbrown at mcs.anl.gov>
> To: "PETSc users list" <petsc-users at mcs.anl.gov>
> Sent: Tuesday, May 29, 2012 5:56:51 PM
> Subject: Re: [petsc-users] MatMult
>
> On Tue, May 29, 2012 at 10:52 AM, Benjamin Sanderse <B.Sanderse at cwi.nl>
> wrote:
>
> > Hello all,
> >
> > I have a simple question about using MatMult (or MatMultAdd) in parallel.
> >
> > I am performing the matrix-vector multiplication
> >
> > z = A*x + y
> >
> > in my code by using
> >
> > call MatMultAdd(A,x,y,z,ierr); CHKERRQ(ierr)
> >
> > A is a sparse matrix, type MPIAIJ, and x, y, and z have been obtained
> using
> >
> > call MatGetVecs(A,x,y,ierr); CHKERRQ(ierr)
> > call MatGetVecs(A,PETSC_NULL_OBJECT,z,ierr); CHKERRQ(ierr)
> >
> > x, y, and z are vecs of type mpi.
> >
> > The problem is that in the sequential case the MatMultAdd is MUCH faster
> > than in the parallel case (at least a factor of 100 difference).
> >
>
> 1. Send output of -log_summary
>
> 2. This matrix is tiny (1000x1000) and very sparse (at most 2 nonzeros per
> row) so you should not expect speedup from running in parallel.
>
>
> >
> > As an example, here is the output with some properties of A when using
> > -mat_view_info and -info:
> >
> > 2 processors:
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
> > -2080374781
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> > -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> > -2080374782
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> > -2080374782
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> > -2080374780
> > [0] MatStashScatterBegin_Private(): No of messages: 0
> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100
> > unneeded,900 used
> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100
> > unneeded,900 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [0] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode
> > routines
> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [1] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode
> > routines
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> > -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> > -2080374782
> > [0] MatSetUpMultiply_MPIAIJ(): Using block index set to define scatter
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> > -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689
> > -2080374782
> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> > [0] VecScatterCreate(): General case: MPI to Seq
> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0
> > unneeded,0 used
> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0
> > unneeded,0 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> > Matrix Object: 2 MPI processes
> >  type: mpiaij
> >  rows=1000, cols=900
> >  total: nonzeros=1800, allocated nonzeros=2000
> >  total number of mallocs used during MatSetValues calls =0
> >    not using I-node (on process 0) routines
> >
> > 1 processor:
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688
> > -2080374783
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1000 X 900; storage space: 200
> > unneeded,1800 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [0] Mat_CheckInode(): Found 1000 nodes out of 1000 rows. Not using Inode
> > routines
> > Matrix Object: 1 MPI processes
> >  type: seqaij
> >  rows=1000, cols=900
> >  total: nonzeros=1800, allocated nonzeros=2000
> >  total number of mallocs used during MatSetValues calls =0
> >    not using I-node routines
> >
> > When I look at the partitioning of the vectors, I have the following for
> > the parallel case:
> > x:
> >          0         450
> >         450         900
> > y:
> >           0         500
> >         500        1000
> > z:
> >           0         500
> >         500        1000
> >
> > This seems OK to me.
> >
> > Certainly I am missing something in performing this matrix-vector
> > multiplication efficiently. Any ideas?
> >
> > Best regards,
> >
> > Benjamin
> >
>