[petsc-users] MatMult
Benjamin Sanderse
B.Sanderse at cwi.nl
Wed May 30 09:15:44 CDT 2012
Hi Jed,
Moving to the optimized version of PETSc (without debugging) basically removed the issue. Thanks a lot!
Benjamin
On 30 May 2012, at 13:43, Jed Brown wrote:
> On Wed, May 30, 2012 at 2:23 AM, Benjamin Sanderse <B.Sanderse at cwi.nl> wrote:
> Sorry for forgetting -log_summary. Attached are log_summary outputs for 1 and 2 processors, for both a problem with about 1000 unknowns and one with 125000 unknowns. The summary covers a run of the entire code, which involves many MatMults. I hope this still provides insight into what is going on.
> As you can see, there is an extraordinary number of MatGetRow calls - I am working to change this - but they should not influence the speed of the MatMults. Any thoughts?
>
> 1. What computer is this running on? Specifically, how is its memory hierarchy laid out? http://www.mcs.anl.gov/petsc/documentation/faq.html#computers Can you run the benchmarks in src/benchmarks/streams/?
>
> 2. It's worth heeding this message; the performance will look significantly different. If the parallel version is still much slower, please send that -log_summary.
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was compiled with a debugging option, #
> # To get timing results run ./configure #
> # using --with-debugging=no, the performance will #
> # be generally two or three times faster. #
> # #
> ##########################################################
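For reference, the reconfigure step the warning describes might look roughly like the following. The `PETSC_ARCH` name and the use of `$PETSC_DIR` are illustrative assumptions, not details from this thread; adjust them to the local installation.

```shell
# Illustrative sketch: rebuild PETSc without debugging before timing runs.
# PETSC_ARCH name is an assumption; pick any name for the optimized build.
cd "$PETSC_DIR"
./configure PETSC_ARCH=arch-linux-opt --with-debugging=no
make PETSC_ARCH=arch-linux-opt all
```

Keeping a separate `PETSC_ARCH` for the optimized build lets the debugging build coexist for development.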
>
>
>
> Benjamin
>
> ----- Original Message -----
> From: "Jed Brown" <jedbrown at mcs.anl.gov>
> To: "PETSc users list" <petsc-users at mcs.anl.gov>
> Sent: Tuesday, May 29, 2012 5:56:51 PM
> Subject: Re: [petsc-users] MatMult
>
> On Tue, May 29, 2012 at 10:52 AM, Benjamin Sanderse <B.Sanderse at cwi.nl> wrote:
>
> > Hello all,
> >
> > I have a simple question about using MatMult (or MatMultAdd) in parallel.
> >
> > I am performing the matrix-vector multiplication
> >
> > z = A*x + y
> >
> > in my code by using
> >
> > call MatMultAdd(A,x,y,z,ierr); CHKERRQ(ierr)
> >
> > A is a sparse matrix, type MPIAIJ, and x, y, and z have been obtained using
> >
> > call MatGetVecs(A,x,y,ierr); CHKERRQ(ierr)
> > call MatGetVecs(A,PETSC_NULL_OBJECT,z,ierr); CHKERRQ(ierr)
> >
> > x, y, and z are vecs of type mpi.
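As a plain-Python sketch (not PETSc; the banded sparsity pattern and matrix entries are invented for illustration), the operation being timed is simply a sparse matrix-vector product plus a vector, with the 1000 x 900 dimensions reported later in the log:

```python
def mat_mult_add(A, x, y):
    """Return z = A*x + y, with A given as a dict {row: [(col, val), ...]}."""
    z = list(y)
    for row, entries in A.items():
        for col, val in entries:
            z[row] += val * x[col]
    return z

nrows, ncols = 1000, 900
# Toy banded pattern: two nonzeros of 1.0 per row (entries are made up).
A = {i: [(i % ncols, 1.0), ((i + 1) % ncols, 1.0)] for i in range(nrows)}
x = [1.0] * ncols
y = [2.0] * nrows
z = mat_mult_add(A, x, y)  # each z[i] = 1.0 + 1.0 + 2.0 = 4.0
```

The point of the sketch is only to show the shape of the computation; PETSc distributes the rows of A (and the entries of x, y, z) across processes and adds a scatter of the needed off-process entries of x.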
> >
> > The problem is that in the sequential case the MatMultAdd is MUCH faster
> > than in the parallel case (at least a factor of 100 difference).
> >
>
> 1. Send output of -log_summary
>
> 2. This matrix is tiny (1000x1000) and very sparse (at most 2 nonzeros per
> row) so you should not expect speedup from running in parallel.
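A rough way to see point 2 (all numbers below are order-of-magnitude guesses for a typical machine, not measurements from this thread): with only 1800 nonzeros, the entire multiply costs a few microseconds of arithmetic, which is comparable to the latency of a single MPI message.

```python
# Back-of-envelope sketch: why a 1000 x 900 matrix with 1800 nonzeros
# cannot benefit from running on 2 processes. Rates are rough assumptions.
nnz = 1800
flops = 2 * nnz                   # one multiply + one add per nonzero
flop_rate = 1e9                   # ~1 Gflop/s per core (conservative guess)
compute_time = flops / flop_rate  # ~3.6 microseconds of arithmetic total
mpi_latency = 1e-6                # ~1 microsecond per message (typical guess)
# A parallel MatMult needs at least one vector scatter (several messages),
# so communication latency alone rivals the whole serial compute time.
```

In other words, the problem is far too small for the communication cost to be amortized; this is independent of the debugging-build issue.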
>
>
> >
> > As an example, here is the output with some properties of A when using
> > -mat_view_info and -info:
> >
> > 2 processors:
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [0] MatStashScatterBegin_Private(): No of messages: 0
> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used
> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [0] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines
> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [1] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [0] MatSetUpMultiply_MPIAIJ(): Using block index set to define scatter
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> > [0] VecScatterCreate(): General case: MPI to Seq
> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used
> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> > Matrix Object: 2 MPI processes
> > type: mpiaij
> > rows=1000, cols=900
> > total: nonzeros=1800, allocated nonzeros=2000
> > total number of mallocs used during MatSetValues calls =0
> > not using I-node (on process 0) routines
> >
> > 1 processor:
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374783
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1000 X 900; storage space: 200 unneeded,1800 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [0] Mat_CheckInode(): Found 1000 nodes out of 1000 rows. Not using Inode routines
> > Matrix Object: 1 MPI processes
> > type: seqaij
> > rows=1000, cols=900
> > total: nonzeros=1800, allocated nonzeros=2000
> > total number of mallocs used during MatSetValues calls =0
> > not using I-node routines
> >
> > When I look at the partitioning of the vectors, I have the following for
> > the parallel case:
> > x:
> > 0 450
> > 450 900
> > y:
> > 0 500
> > 500 1000
> > z:
> > 0 500
> > 500 1000
> >
> > This seems OK to me.
> >
> > Certainly I am missing something in performing this matrix-vector
> > multiplication efficiently. Any ideas?
> >
> > Best regards,
> >
> > Benjamin
> >
>
--
Ir. B. Sanderse
Centrum Wiskunde en Informatica
Science Park 123
1098 XG Amsterdam
t: +31 20 592 4161
e: sanderse at cwi.nl