[petsc-users] MatMult
Benjamin Sanderse
B.Sanderse at cwi.nl
Wed May 30 09:15:44 CDT 2012
Hi Jed,
Moving to the optimized version of PETSc (without debugging) basically removed the issue. Thanks a lot!
Benjamin
On 30 May 2012, at 13:43, Jed Brown wrote:
> On Wed, May 30, 2012 at 2:23 AM, Benjamin Sanderse <B.Sanderse at cwi.nl> wrote:
> Sorry for forgetting -log_summary. Attached are log_summary outputs for 1 and 2 processors, for both a problem with about 1000 unknowns and one with 125000 unknowns. The summary covers a run of the entire code, which involves many MatMults. I hope this still provides insight into what is going on.
> As you can see, there is an extraordinary number of MatGetRow calls - I am working to change this - but they should not influence the speed of the MatMults. Any thoughts?
>
> 1. What computer is this running on? Specifically, how is its memory hierarchy laid out? http://www.mcs.anl.gov/petsc/documentation/faq.html#computers Can you run the benchmarks in src/benchmarks/streams/?
>
> 2. It's worth heeding this message; the performance will look significantly different. If the parallel version is still much slower, please send that -log_summary.
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was compiled with a debugging option, #
> # To get timing results run ./configure #
> # using --with-debugging=no, the performance will #
> # be generally two or three times faster. #
> # #
> ##########################################################
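For reference, the reconfigure step the warning describes might look roughly like the following. The `PETSC_ARCH` name and the use of `$PETSC_DIR` are illustrative assumptions, not details from this thread; adjust them to the local installation.

```shell
# Illustrative sketch: rebuild PETSc without debugging before timing runs.
# PETSC_ARCH name is an assumption; pick any name for the optimized build.
cd "$PETSC_DIR"
./configure PETSC_ARCH=arch-linux-opt --with-debugging=no
make PETSC_ARCH=arch-linux-opt all
```

Keeping a separate `PETSC_ARCH` for the optimized build lets the debugging build coexist for development.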
>
>
>
> Benjamin
>
> ----- Original Message -----
> From: "Jed Brown" <jedbrown at mcs.anl.gov>
> To: "PETSc users list" <petsc-users at mcs.anl.gov>
> Sent: Tuesday, May 29, 2012 5:56:51 PM
> Subject: Re: [petsc-users] MatMult
>
> On Tue, May 29, 2012 at 10:52 AM, Benjamin Sanderse <B.Sanderse at cwi.nl> wrote:
>
> > Hello all,
> >
> > I have a simple question about using MatMult (or MatMultAdd) in parallel.
> >
> > I am performing the matrix-vector multiplication
> >
> > z = A*x + y
> >
> > in my code by using
> >
> > call MatMultAdd(A,x,y,z,ierr); CHKERRQ(ierr)
> >
> > A is a sparse matrix, type MPIAIJ, and x, y, and z have been obtained using
> >
> > call MatGetVecs(A,x,y,ierr); CHKERRQ(ierr)
> > call MatGetVecs(A,PETSC_NULL_OBJECT,z,ierr); CHKERRQ(ierr)
> >
> > x, y, and z are vecs of type mpi.
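As a plain-Python sketch (not PETSc; the banded sparsity pattern and matrix entries are invented for illustration), the operation being timed is simply a sparse matrix-vector product plus a vector, with the 1000 x 900 dimensions reported later in the log:

```python
def mat_mult_add(A, x, y):
    """Return z = A*x + y, with A given as a dict {row: [(col, val), ...]}."""
    z = list(y)
    for row, entries in A.items():
        for col, val in entries:
            z[row] += val * x[col]
    return z

nrows, ncols = 1000, 900
# Toy banded pattern: two nonzeros of 1.0 per row (entries are made up).
A = {i: [(i % ncols, 1.0), ((i + 1) % ncols, 1.0)] for i in range(nrows)}
x = [1.0] * ncols
y = [2.0] * nrows
z = mat_mult_add(A, x, y)  # each z[i] = 1.0 + 1.0 + 2.0 = 4.0
```

The point of the sketch is only to show the shape of the computation; PETSc distributes the rows of A (and the entries of x, y, z) across processes and adds a scatter of the needed off-process entries of x.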
> >
> > The problem is that in the sequential case the MatMultAdd is MUCH faster
> > than in the parallel case (at least a factor of 100 difference).
> >
>
> 1. Send output of -log_summary
>
> 2. This matrix is tiny (1000x1000) and very sparse (at most 2 nonzeros per
> row) so you should not expect speedup from running in parallel.
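A rough way to see point 2 (all numbers below are order-of-magnitude guesses for a typical machine, not measurements from this thread): with only 1800 nonzeros, the entire multiply costs a few microseconds of arithmetic, which is comparable to the latency of a single MPI message.

```python
# Back-of-envelope sketch: why a 1000 x 900 matrix with 1800 nonzeros
# cannot benefit from running on 2 processes. Rates are rough assumptions.
nnz = 1800
flops = 2 * nnz                   # one multiply + one add per nonzero
flop_rate = 1e9                   # ~1 Gflop/s per core (conservative guess)
compute_time = flops / flop_rate  # ~3.6 microseconds of arithmetic total
mpi_latency = 1e-6                # ~1 microsecond per message (typical guess)
# A parallel MatMult needs at least one vector scatter (several messages),
# so communication latency alone rivals the whole serial compute time.
```

In other words, the problem is far too small for the communication cost to be amortized; this is independent of the debugging-build issue.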
>
>
> >
> > As an example, here is the output with some properties of A when using
> > -mat_view_info and -info:
> >
> > 2 processors:
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [0] MatStashScatterBegin_Private(): No of messages: 0
> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used
> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [0] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines
> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [1] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [0] MatSetUpMultiply_MPIAIJ(): Using block index set to define scatter
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> > [0] VecScatterCreate(): General case: MPI to Seq
> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used
> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> > Matrix Object: 2 MPI processes
> > type: mpiaij
> > rows=1000, cols=900
> > total: nonzeros=1800, allocated nonzeros=2000
> > total number of mallocs used during MatSetValues calls =0
> > not using I-node (on process 0) routines
> >
> > 1 processor:
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374783
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1000 X 900; storage space: 200 unneeded,1800 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [0] Mat_CheckInode(): Found 1000 nodes out of 1000 rows. Not using Inode routines
> > Matrix Object: 1 MPI processes
> > type: seqaij
> > rows=1000, cols=900
> > total: nonzeros=1800, allocated nonzeros=2000
> > total number of mallocs used during MatSetValues calls =0
> > not using I-node routines
> >
> > When I look at the partitioning of the vectors, I have the following for
> > the parallel case:
> > x:
> > 0 450
> > 450 900
> > y:
> > 0 500
> > 500 1000
> > z:
> > 0 500
> > 500 1000
> >
> > This seems OK to me.
> >
> > Certainly I am missing something in performing this matrix-vector
> > multiplication efficiently. Any ideas?
> >
> > Best regards,
> >
> > Benjamin
> >
>
--
Ir. B. Sanderse
Centrum Wiskunde en Informatica
Science Park 123
1098 XG Amsterdam
t: +31 20 592 4161
e: sanderse at cwi.nl