[petsc-users] Optimizing MatMatSolve

Barry Smith bsmith at mcs.anl.gov
Mon Aug 1 16:56:42 CDT 2011


On Aug 1, 2011, at 3:31 PM, Adam Byrd wrote:

> On Mon, Aug 1, 2011 at 5:09 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
> On Aug 1, 2011, at 3:00 PM, Adam Byrd wrote:
> 
> > Hello,
> >
> > I'm looking for help reducing the time and communication of a parallel MatMatSolve using MUMPS. On a single processor I experience decent solve times (~9 seconds each), but when moving to multiple processors I see longer times with more cores. I've run with -log_summary and confirmed (practically) all the time is spent in MatMatSolve. I'm fairly certain it's all communication between nodes and I'm trying to figure out where I can make optimizations, or if it is even feasible for this type of problem. It is a parallel, dense,
> 
>     I hope you mean that the original matrix you use with MUMPS is sparse (you should not use MUMPS to solve dense linear systems).
> 
> Oops, yes. The original matrix is sparse. It requires the solution and identity matrices to be dense. I was typing faster than thinking.
> 
> > direct solve using MUMPS with an LU preconditioner. I know there are many smaller optimizations that can be done in other areas, but at the moment it is only the solve that concerns me.
> 
>     MUMPS will run slower on 2 processors than on 1; this is just a fact of life. You will only gain from running MUMPS in parallel on large problems.
> 
> I see. It looks like I took off in the wrong direction then. I'm trying to solve for the inverse of a sparse matrix in parallel. I'm starting at 3600x3600 and will be moving to 30,000x30,000+ in the future. Which solver suits this sort of problem?
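
For reference, the setup being timed here (a parallel MUMPS LU factorization driven through PCLU, then MatMatSolve against a dense identity) presumably looks roughly like the sketch below. This is a guess at the structure rather than the attached code: A, Identity, Inverse, and n are placeholder names, and the calls are the PETSc 3.1 forms, with the solver picked up from -pc_type lu -pc_factor_mat_solver_package mumps.

    KSP ksp;  PC pc;  Mat F, Identity, Inverse;
    PetscInt i, Istart, Iend;

    /* KSP/PC only drive the factorization; the options supply
       -pc_type lu -pc_factor_mat_solver_package mumps */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);
    KSPSetType(ksp, KSPPREONLY);
    KSPSetFromOptions(ksp);
    KSPSetUp(ksp);                      /* parallel MUMPS symbolic + numeric factorization */
    KSPGetPC(ksp, &pc);
    PCFactorGetMatrix(pc, &F);          /* the factored matrix produced by MUMPS */

    /* dense identity as the right-hand sides; the dense result holds inv(A) */
    MatCreateMPIDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, PETSC_NULL, &Identity);
    MatGetOwnershipRange(Identity, &Istart, &Iend);
    for (i = Istart; i < Iend; i++)
      MatSetValue(Identity, i, i, 1.0, INSERT_VALUES);
    MatAssemblyBegin(Identity, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(Identity, MAT_FINAL_ASSEMBLY);
    MatDuplicate(Identity, MAT_DO_NOT_COPY_VALUES, &Inverse);

    MatMatSolve(F, Identity, Inverse);  /* the call that dominates the -log_summary output */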

   Actually this problem is so small that you will not want to parallelize inside the solve. Instead, have different processes handle different right-hand sides, with each process calling MUMPS on the entire matrix. There will be no communication in the solver and you will get very good speedup.
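
   A minimal sketch of that layout, assuming every rank has already assembled the whole sparse matrix as Aseq (a MATSEQAIJ on PETSC_COMM_SELF), n is the matrix size, and the PETSc 3.1 names are used (MAT_SOLVER_MUMPS became MATSOLVERMUMPS in later releases):

    PetscMPIInt rank, size;
    PetscInt    nloc, first, j;
    Mat         B, X, F;
    KSP         ksp;
    PC          pc;

    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
    MPI_Comm_size(PETSC_COMM_WORLD, &size);

    /* this rank handles columns first .. first+nloc-1 of the identity */
    nloc  = n / size + (rank < n % size ? 1 : 0);
    first = rank * (n / size) + PetscMin(rank, n % size);

    MatCreateSeqDense(PETSC_COMM_SELF, n, nloc, PETSC_NULL, &B);
    MatCreateSeqDense(PETSC_COMM_SELF, n, nloc, PETSC_NULL, &X);
    for (j = 0; j < nloc; j++)
      MatSetValue(B, first + j, j, 1.0, INSERT_VALUES);
    MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);

    /* sequential MUMPS LU of the whole matrix, repeated independently on every rank */
    KSPCreate(PETSC_COMM_SELF, &ksp);
    KSPSetOperators(ksp, Aseq, Aseq, SAME_NONZERO_PATTERN);
    KSPSetType(ksp, KSPPREONLY);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCLU);
    PCFactorSetMatSolverPackage(pc, MAT_SOLVER_MUMPS);
    KSPSetUp(ksp);
    PCFactorGetMatrix(pc, &F);

    MatMatSolve(F, B, X);   /* X holds this rank's block of columns of inv(A); no communication */

   Each right-hand-side block is independent, so the ranks only need to communicate afterwards if the application requires the column blocks of the inverse to be gathered in one place.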

   Barry

> 
>   Barry
> 
> 
> 
> >
> > ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> >
> > ./cntor on a complex-c named hpc-1-0.local with 2 processors, by abyrd Mon Aug  1 16:25:51 2011
> > Using Petsc Release Version 3.1.0, Patch 8, Thu Mar 17 13:37:48 CDT 2011
> >
> >                          Max       Max/Min        Avg      Total
> > Time (sec):           1.307e+02      1.00000   1.307e+02
> > Objects:              1.180e+02      1.00000   1.180e+02
> > Flops:                0.000e+00      0.00000   0.000e+00  0.000e+00
> > Flops/sec:            0.000e+00      0.00000   0.000e+00  0.000e+00
> > Memory:               2.091e+08      1.00001              4.181e+08
> > MPI Messages:         7.229e+03      1.00000   7.229e+03  1.446e+04
> > MPI Message Lengths:  4.141e+08      1.00000   5.729e+04  8.283e+08
> > MPI Reductions:       1.464e+04      1.00000
> >
> > Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> >                             e.g., VecAXPY() for real vectors of length N --> 2N flops
> >                             and VecAXPY() for complex vectors of length N --> 8N flops
> >
> > Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
> >                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> >  0:      Main Stage: 1.3072e+02 100.0%  0.0000e+00   0.0%  1.446e+04 100.0%  5.729e+04      100.0%  1.730e+02   1.2%
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > See the 'Profiling' chapter of the users' manual for details on interpreting output.
> > Phase summary info:
> >    Count: number of times phase was executed
> >    Time and Flops: Max - maximum over all processors
> >                    Ratio - ratio of maximum to minimum over all processors
> >    Mess: number of messages sent
> >    Avg. len: average message length
> >    Reduct: number of global reductions
> >    Global: entire computation
> >    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> >       %T - percent time in this phase         %F - percent flops in this phase
> >       %M - percent messages in this phase     %L - percent message lengths in this phase
> >       %R - percent reductions in this phase
> >    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> > ------------------------------------------------------------------------------------------------------------------------
> >
> >
> >       ##########################################################
> >       #                                                        #
> >       #                          WARNING!!!                    #
> >       #                                                        #
> >       #   This code was compiled with a debugging option,      #
> >       #   To get timing results run config/configure.py        #
> >       #   using --with-debugging=no, the performance will      #
> >       #   be generally two or three times faster.              #
> >       #                                                        #
> >       ##########################################################
> >
> >
> >
> >
> >       ##########################################################
> >       #                                                        #
> >       #                          WARNING!!!                    #
> >       #                                                        #
> >       #   The code for various complex numbers numerical       #
> >       #   kernels uses C++, which generally is not well        #
> >       #   optimized.  For performance that is about 4-5 times  #
> >       #   faster, specify --with-fortran-kernels=1             #
> >       #   when running config/configure.py.                    #
> >       #                                                        #
> >       ##########################################################
> >
> >
> > Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > --- Event Stage 0: Main Stage
> >
> > MatSolve           14400 1.0 1.2364e+02 1.0 0.00e+00 0.0 1.4e+04 5.7e+04 2.0e+01 95  0100100  0  95  0100100 12     0
> > MatLUFactorSym         4 1.0 2.0027e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatLUFactorNum         4 1.0 3.4223e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01  3  0  0  0  0   3  0  0  0 14     0
> > MatConvert             1 1.0 2.3644e-01 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+01  0  0  0  0  0   0  0  0  0  6     0
> > MatAssemblyBegin      14 1.0 1.9959e-01 9.3 0.00e+00 0.0 3.0e+01 5.2e+04 1.2e+01  0  0  0  0  0   0  0  0  0  7     0
> > MatAssemblyEnd        14 1.0 1.9908e-01 1.1 0.00e+00 0.0 4.0e+00 2.8e+01 2.0e+01  0  0  0  0  0   0  0  0  0 12     0
> > MatGetRow             32 1.0 4.2677e-05 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatGetSubMatrice       4 1.0 7.6661e-03 1.0 0.00e+00 0.0 1.6e+01 1.2e+05 2.4e+01  0  0  0  0  0   0  0  0  0 14     0
> > MatMatSolve            4 1.0 1.2380e+02 1.0 0.00e+00 0.0 1.4e+04 5.7e+04 2.0e+01 95  0100100  0  95  0100100 12     0
> > VecSet                 4 1.0 1.8590e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecScatterBegin    28800 1.0 2.2810e+00 2.2 0.00e+00 0.0 1.4e+04 5.7e+04 0.0e+00  1  0100100  0   1  0100100  0     0
> > VecScatterEnd      14400 1.0 4.1534e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> > KSPSetup               4 1.0 1.1060e-02 12.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > PCSetUp                4 1.0 3.4280e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.6e+01  3  0  0  0  0   3  0  0  0 32     0
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > Memory usage is given in bytes:
> >
> > Object Type          Creations   Destructions     Memory  Descendants' Mem.
> > Reports information only for process 0.
> >
> > --- Event Stage 0: Main Stage
> >
> >               Matrix    27             27    208196712     0
> >                  Vec    36             36      1027376     0
> >          Vec Scatter    11             11         7220     0
> >            Index Set    42             42        22644     0
> >        Krylov Solver     1              1        34432     0
> >       Preconditioner     1              1          752     0
> > ========================================================================================================================
> > Average time to get PetscTime(): 1.90735e-07
> > Average time for MPI_Barrier(): 3.8147e-06
> > Average time for zero size MPI_Send(): 7.51019e-06
> > #PETSc Option Table entries:
> > -log_summary
> > -pc_factor_mat_solver_package mumps
> > -pc_type lu
> > #End of PETSc Option Table entries
> > Compiled without FORTRAN kernels
> > Compiled with full precision matrices (default)
> > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 16
> > Configure run at: Mon Jul 11 15:28:42 2011
> > Configure options: PETSC_ARCH=complex-cpp-mumps --with-cc=mpicc --with-fc=mpif90 --with-blas-lapack-dir=/usr/lib64 --with-shared --with-clanguage=c++ --with-scalar-type=complex --download-mumps=1 --download-blacs=1 --download-scalapack=1 --download-parmetis=1 --with-cxx=mpicxx
> > -----------------------------------------
> > Libraries compiled on Mon Jul 11 15:39:58 EDT 2011 on sc.local
> > Machine characteristics: Linux sc.local 2.6.18-194.11.1.el5 #1 SMP Tue Aug 10 19:05:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> > Using PETSc directory: /panfs/storage.local/scs/home/abyrd/petsc-3.1-p8
> > Using PETSc arch: complex-cpp-mumps
> > -----------------------------------------
> > Using C compiler: mpicxx -Wall -Wwrite-strings -Wno-strict-aliasing -g   -fPIC
> > Using Fortran compiler: mpif90 -fPIC -Wall -Wno-unused-variable -g
> > -----------------------------------------
> > Using include paths: -I/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/include -I/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/include -I/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/include -I/usr/mpi/gnu/openmpi-1.4.2/include -I/usr/mpi/gnu/openmpi-1.4.2/lib64
> > ------------------------------------------
> > Using C linker: mpicxx -Wall -Wwrite-strings -Wno-strict-aliasing -g
> > Using Fortran linker: mpif90 -fPIC -Wall -Wno-unused-variable -g
> > Using libraries: -Wl,-rpath,/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib -L/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib -lpetsc       -lX11 -Wl,-rpath,/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib -L/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs -Wl,-rpath,/usr/lib64 -L/usr/lib64 -llapack -lblas -lnsl -lrt -Wl,-rpath,/usr/mpi/gnu/openmpi-1.4.2/lib64 -L/usr/mpi/gnu/openmpi-1.4.2/lib64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -ldl -lmpi -lopen-rte -lopen-pal -lnsl -lutil -lgcc_s -lpthread -lmpi_f90 -lmpi_f77 -lgfortran -lm -lm -lm -lm -lmpi_cxx -lstdc++ -lmpi_cxx -lstdc++ -ldl -lmpi -lopen-rte -lopen-pal -lnsl -lutil -lgcc_s -lpthread -ldl
> >
> > Respectfully,
> > Adam Byrd
> > <PETScCntor.zip>
> 
> 


