[petsc-users] Slow MatAssembly and MatMat Mult

Barry Smith bsmith at mcs.anl.gov
Wed Feb 17 21:30:34 CST 2016


  You need to bite the bullet and do the communication needed to get the info on the sub communicator and then get the result back out to the entire communicator. 

   If the two matrices are generated on MPI_COMM_WORLD you can use MatCreateRedundantMatrix() to get entire copies of them on sub communicators. You can then have each sub communicator do a certain number of multiplies of the sparse matrix with different columns of the dense matrix, so that instead of having one sub communicator do 2500 sparse matrix-vector products you can have each sub communicator (say you have 5 of them) do 500 sparse matrix-vector products (giving two levels of parallelism). The results are dense matrices, so you would need to write some code to get the parts of the resulting dense matrices back to the processes where you want them. I would suggest using MPI calls directly for this; I don't think PETSc has anything particularly useful to do that. A rough sketch of this approach follows below.
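  For concreteness, here is a minimal, untested sketch of that approach. The names (RedundantMatMatMult, Aglobal, Bglobal, nsub) are placeholders, not anything from this thread; it assumes MatCreateRedundantMatrix() accepts both the MPIAIJ and MPIDENSE matrices as described above, uses MatGetSubMatrix() (renamed MatCreateSubMatrix() in later PETSc releases) to peel off the block of columns each sub communicator is responsible for, and leaves the final redistribution of the dense result to direct MPI calls.

  #include <petscmat.h>

  /* Aglobal (MPIAIJ) and Bglobal (MPIDENSE) live on PETSC_COMM_WORLD.
     Each of the nsub sub communicators gets full copies of both matrices
     and multiplies the sparse copy with its own block of columns of the
     dense copy, returned in *Cslice (an MPIDENSE matrix on the sub comm). */
  PetscErrorCode RedundantMatMatMult(Mat Aglobal, Mat Bglobal, PetscInt nsub, Mat *Cslice)
  {
    Mat            Ared, Bred, Bslice;
    MPI_Comm       subcomm;
    IS             isrow, iscol;
    PetscInt       N, rstart, rend, col0, col1;
    PetscMPIInt    wrank, wsize, color;
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    /* Split PETSC_COMM_WORLD into nsub contiguous sub communicators */
    ierr  = MPI_Comm_rank(PETSC_COMM_WORLD, &wrank);CHKERRQ(ierr);
    ierr  = MPI_Comm_size(PETSC_COMM_WORLD, &wsize);CHKERRQ(ierr);
    color = (PetscMPIInt)((wrank * nsub) / wsize);
    ierr  = MPI_Comm_split(PETSC_COMM_WORLD, color, wrank, &subcomm);CHKERRQ(ierr);

    /* Entire copies of the sparse and dense matrices on each sub communicator */
    ierr = MatCreateRedundantMatrix(Aglobal, nsub, subcomm, MAT_INITIAL_MATRIX, &Ared);CHKERRQ(ierr);
    ierr = MatCreateRedundantMatrix(Bglobal, nsub, subcomm, MAT_INITIAL_MATRIX, &Bred);CHKERRQ(ierr);

    /* This sub communicator handles columns [col0, col1) of B */
    ierr = MatGetSize(Bred, NULL, &N);CHKERRQ(ierr);
    col0 = (N *  color)      / nsub;
    col1 = (N * (color + 1)) / nsub;

    /* Pull out that block of columns: all locally owned rows, and the same
       column IS on every rank of the sub communicator */
    ierr = MatGetOwnershipRange(Bred, &rstart, &rend);CHKERRQ(ierr);
    ierr = ISCreateStride(subcomm, rend - rstart, rstart, 1, &isrow);CHKERRQ(ierr);
    ierr = ISCreateStride(subcomm, col1 - col0, col0, 1, &iscol);CHKERRQ(ierr);
    ierr = MatGetSubMatrix(Bred, isrow, iscol, MAT_INITIAL_MATRIX, &Bslice);CHKERRQ(ierr);

    /* Roughly N/nsub sparse matrix-vector products per sub communicator */
    ierr = MatMatMult(Ared, Bslice, MAT_INITIAL_MATRIX, PETSC_DEFAULT, Cslice);CHKERRQ(ierr);

    /* Sending the pieces of *Cslice back to the ranks of PETSC_COMM_WORLD
       that need them is left to direct MPI calls (e.g. MPI_Send/MPI_Recv
       or MPI_Allgatherv), as suggested above. */
    ierr = MatDestroy(&Bslice);CHKERRQ(ierr);
    ierr = ISDestroy(&isrow);CHKERRQ(ierr);
    ierr = ISDestroy(&iscol);CHKERRQ(ierr);
    ierr = MatDestroy(&Bred);CHKERRQ(ierr);
    ierr = MatDestroy(&Ared);CHKERRQ(ierr);
    ierr = MPI_Comm_free(&subcomm);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }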

  Barry


> On Feb 17, 2016, at 9:17 PM, Bikash Kanungo <bikash at umich.edu> wrote:
> 
> Hi Barry,
> 
> I had thought of using a sub-communicator for these operations. But the matrix entries have contributions from all the processors. Moreover, after the MatMatMult operation is done, I need to retrieve certain values from the resultant matrix through non-local calls (MatGetSubMatrices) on all processors. Defining these matrices to reside on a sub-communicator would prevent me from adding contributions and calling MatGetSubMatrices from processors outside the sub-communicator. What would be a good workaround for it?
> 
> Regards,
> Bikash
> 
> On Wed, Feb 17, 2016 at 9:33 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>   This is an absolutely tiny problem for 480 processors; I am not surprised by the terrible performance. You should run this sub-calculation on a small subset of the processes.
> 
>   Barry
> 
> > On Feb 17, 2016, at 7:03 PM, Bikash Kanungo <bikash at umich.edu> wrote:
> >
> > Hi,
> >
> > I have two small (2500x2500) matrices parallelized across 480 processors. One of them is an MPIAIJ matrix while the other is an MPIDENSE matrix. I perform a MatMatMult involving these two matrices. I tried these operations on two machines: the local cluster at the University of Michigan and the XSEDE Comet machine. The Comet machine takes 10-20 times longer in the steps involving MatAssembly and MatMatMult of the aforementioned matrices. I have other PETSc MatMult operations in the same code involving larger matrices (4 million x 4 million) which show similar timings on both machines. It's just those small parallel matrices that are inconsistent in terms of their timings. I used the same compilers and MPI libraries on both machines, except that I suppressed the "avx2" flag on Comet. I believe avx2 affects floating point operations and not communication. I would like to know what might be causing these inconsistencies only in the case of the small matrices. Are there any network settings that I can look into and compare?
> >
> > Regards,
> > Bikash
> >
> > --
> > Bikash S. Kanungo
> > PhD Student
> > Computational Materials Physics Group
> > Mechanical Engineering
> > University of Michigan
> >
> 
> 
> 
> 
> -- 
> Bikash S. Kanungo
> PhD Student
> Computational Materials Physics Group
> Mechanical Engineering 
> University of Michigan
> 
