[petsc-users] Optimizing MatMatSolve

Mon Aug 1 16:34:30 CDT 2011

On Mon, Aug 1, 2011 at 9:31 PM, Adam Byrd <adam1.byrd at gmail.com> wrote:

> On Mon, Aug 1, 2011 at 5:09 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>> On Aug 1, 2011, at 3:00 PM, Adam Byrd wrote:
>>
>> > Hello,
>> >
>> > I'm looking for help reducing the time and communication of a parallel
>> > MatMatSolve using MUMPS. On a single processor I experience decent solve
>> > times (~9 seconds each), but when moving to multiple processors I see longer
>> > times with more cores. I've run with -log_summary and confirmed
>> > (practically) all the time is spent in MatMatSolve. I'm fairly certain it's
>> > all communication between nodes and I'm trying to figure out where I can
>> > make optimizations, or if it is even feasible for this type of problem. It
>> > is a parallel, dense,
>>
>>      I hope you mean that the original matrix you use with MUMPS is sparse
>> (you should not use MUMPS to solve dense linear systems).
>>
>
> Oops, yes. The original matrix is sparse. It requires the solution and
> identity matrix to be dense. I was typing faster than thinking.
>
>>
>> > direct solve using MUMPS with an LU preconditioner. I know there are
>> > many smaller optimizations that can be done in other areas, but at the
>> > moment it is only the solve that concerns me.
>>
>>      MUMPS will run slower on 2 processors than on 1; this is just a fact of
>> life. You will only gain from running MUMPS in parallel on large problems.
>>
>
> I see. It looks like I took off in the wrong direction then. I'm trying to
> solve for the inverse of a sparse matrix in parallel. I'm starting at
> 3600x3600 and will be moving to 30,000x30,000+ in the future. Which solver
> suits this sort of problem?
>

The key to parallel computing (and most other things) is choosing the right
problem. This, unfortunately, is not a problem that lends itself to
parallelism.
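
For a rough sense of scale, here is a back-of-the-envelope sketch in Python
using only figures from the -log_summary output quoted below: 16 bytes per
entry (sizeof(PetscScalar) for the complex build), 14400 MatSolve calls, and
1.2364e+02 seconds spent in MatSolve. The function names are illustrative,
and reading the 14400 MatSolve calls as 4 MatMatSolve calls times 3600
right-hand-side columns is an assumption, although the counts are consistent
with it.

```python
# Back-of-the-envelope figures based on the quoted -log_summary output.
# All names here are illustrative; nothing below calls PETSc or MUMPS.

BYTES_PER_SCALAR = 16  # sizeof(PetscScalar) reported for the complex build


def dense_matrix_bytes(n, bytes_per_scalar=BYTES_PER_SCALAR):
    """Storage needed for one dense n x n matrix (e.g. the explicit inverse)."""
    return n * n * bytes_per_scalar


def seconds_per_solve(total_seconds, solve_count):
    """Average wall-clock time of a single right-hand-side solve."""
    return total_seconds / solve_count


if __name__ == "__main__":
    for n in (3600, 30000):
        print(f"n = {n}: dense n x n matrix needs {dense_matrix_bytes(n) / 1e9:.2f} GB")

    # MatSolve row of the event table: 14400 calls, 1.2364e+02 seconds.
    per_solve = seconds_per_solve(1.2364e+02, 14400)
    print(f"average time per MatSolve: {per_solve * 1e3:.2f} ms")
```

Under these assumptions a single dense 3600x3600 complex matrix takes about
0.21 GB, and a 30,000x30,000 one about 14.4 GB, before counting the dense
identity right-hand side. Each individual solve averages under 10 ms, so
there is little work per right-hand side to amortize the communication.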

   Matt

>
>>   Barry
>>
>>
>>
>> >
>> > ---------------------------------------------- PETSc Performance
>> Summary: ----------------------------------------------
>> >
>> > ./cntor on a complex-c named hpc-1-0.local with 2 processors, by abyrd
>> Mon Aug  1 16:25:51 2011
>> > Using Petsc Release Version 3.1.0, Patch 8, Thu Mar 17 13:37:48 CDT 2011
>> >
>> >                          Max       Max/Min        Avg      Total
>> > Time (sec):           1.307e+02      1.00000   1.307e+02
>> > Objects:              1.180e+02      1.00000   1.180e+02
>> > Flops:                0.000e+00      0.00000   0.000e+00  0.000e+00
>> > Flops/sec:            0.000e+00      0.00000   0.000e+00  0.000e+00
>> > Memory:               2.091e+08      1.00001              4.181e+08
>> > MPI Messages:         7.229e+03      1.00000   7.229e+03  1.446e+04
>> > MPI Message Lengths:  4.141e+08      1.00000   5.729e+04  8.283e+08
>> > MPI Reductions:       1.464e+04      1.00000
>> >
>> > Flop counting convention: 1 flop = 1 real number operation of type
>> (multiply/divide/add/subtract)
>> >                             e.g., VecAXPY() for real vectors of length N
>> --> 2N flops
>> >                             and VecAXPY() for complex vectors of length
>> N --> 8N flops
>> >
>> > Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages
>> ---  -- Message Lengths --  -- Reductions --
>> >                         Avg     %Total     Avg     %Total   counts
>> %Total     Avg         %Total   counts   %Total
>> >  0:      Main Stage: 1.3072e+02 100.0%  0.0000e+00   0.0%  1.446e+04
>> 100.0%  5.729e+04      100.0%  1.730e+02   1.2%
>> >
>> >
>> ------------------------------------------------------------------------------------------------------------------------
>> > See the 'Profiling' chapter of the users' manual for details on
>> interpreting output.
>> > Phase summary info:
>> >    Count: number of times phase was executed
>> >    Time and Flops: Max - maximum over all processors
>> >                    Ratio - ratio of maximum to minimum over all
>> processors
>> >    Mess: number of messages sent
>> >    Avg. len: average message length
>> >    Reduct: number of global reductions
>> >    Global: entire computation
>> >    Stage: stages of a computation. Set stages with PetscLogStagePush()
>> and PetscLogStagePop().
>> >       %T - percent time in this phase         %F - percent flops in this
>> phase
>> >       %M - percent messages in this phase     %L - percent message
>> lengths in this phase
>> >       %R - percent reductions in this phase
>> >    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
>> over all processors)
>> >
>> ------------------------------------------------------------------------------------------------------------------------
>> >
>> >
>> >       ##########################################################
>> >       #                                                        #
>> >       #                          WARNING!!!                    #
>> >       #                                                        #
>> >       #   This code was compiled with a debugging option,      #
>> >       #   To get timing results run config/configure.py        #
>> >       #   using --with-debugging=no, the performance will      #
>> >       #   be generally two or three times faster.              #
>> >       #                                                        #
>> >       ##########################################################
>> >
>> >
>> >
>> >
>> >       ##########################################################
>> >       #                                                        #
>> >       #                          WARNING!!!                    #
>> >       #                                                        #
>> >       #   The code for various complex numbers numerical       #
>> >       #   kernels uses C++, which generally is not well        #
>> >       #   optimized.  For performance that is about 4-5 times  #
>> >       #   faster, specify --with-fortran-kernels=1             #
>> >       #   when running config/configure.py.                    #
>> >       #                                                        #
>> >       ##########################################################
>> >
>> >
>> > Event                Count      Time (sec)     Flops
>>         --- Global ---  --- Stage ---   Total
>> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>> >
>> ------------------------------------------------------------------------------------------------------------------------
>> >
>> > --- Event Stage 0: Main Stage
>> >
>> > MatSolve           14400 1.0 1.2364e+02 1.0 0.00e+00 0.0 1.4e+04 5.7e+04
>> 2.0e+01 95  0100100  0  95  0100100 12     0
>> > MatLUFactorSym         4 1.0 2.0027e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> > MatLUFactorNum         4 1.0 3.4223e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 2.4e+01  3  0  0  0  0   3  0  0  0 14     0
>> > MatConvert             1 1.0 2.3644e-01 2.4 0.00e+00 0.0 0.0e+00 0.0e+00
>> 1.1e+01  0  0  0  0  0   0  0  0  0  6     0
>> > MatAssemblyBegin      14 1.0 1.9959e-01 9.3 0.00e+00 0.0 3.0e+01 5.2e+04
>> 1.2e+01  0  0  0  0  0   0  0  0  0  7     0
>> > MatAssemblyEnd        14 1.0 1.9908e-01 1.1 0.00e+00 0.0 4.0e+00 2.8e+01
>> 2.0e+01  0  0  0  0  0   0  0  0  0 12     0
>> > MatGetRow             32 1.0 4.2677e-05 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> > MatGetSubMatrice       4 1.0 7.6661e-03 1.0 0.00e+00 0.0 1.6e+01 1.2e+05
>> 2.4e+01  0  0  0  0  0   0  0  0  0 14     0
>> > MatMatSolve            4 1.0 1.2380e+02 1.0 0.00e+00 0.0 1.4e+04 5.7e+04
>> 2.0e+01 95  0100100  0  95  0100100 12     0
>> > VecSet                 4 1.0 1.8590e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> > VecScatterBegin    28800 1.0 2.2810e+00 2.2 0.00e+00 0.0 1.4e+04 5.7e+04
>> 0.0e+00  1  0100100  0   1  0100100  0     0
>> > VecScatterEnd      14400 1.0 4.1534e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>> > KSPSetup               4 1.0 1.1060e-0212.6 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> > PCSetUp                4 1.0 3.4280e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 5.6e+01  3  0  0  0  0   3  0  0  0 32     0
>> >
>> ------------------------------------------------------------------------------------------------------------------------
>> >
>> > Memory usage is given in bytes:
>> >
>> > Object Type          Creations   Destructions     Memory  Descendants'
>> Mem.
>> > Reports information only for process 0.
>> >
>> > --- Event Stage 0: Main Stage
>> >
>> >               Matrix    27             27    208196712     0
>> >                  Vec    36             36      1027376     0
>> >          Vec Scatter    11             11         7220     0
>> >            Index Set    42             42        22644     0
>> >        Krylov Solver     1              1        34432     0
>> >       Preconditioner     1              1          752     0
>> >
>> ========================================================================================================================
>> > Average time to get PetscTime(): 1.90735e-07
>> > Average time for MPI_Barrier(): 3.8147e-06
>> > Average time for zero size MPI_Send(): 7.51019e-06
>> > #PETSc Option Table entries:
>> > -log_summary
>> > -pc_factor_mat_solver_package mumps
>> > -pc_type lu
>> > #End of PETSc Option Table entries
>> > Compiled without FORTRAN kernels
>> > Compiled with full precision matrices (default)
>> > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>> sizeof(PetscScalar) 16
>> > Configure run at: Mon Jul 11 15:28:42 2011
>> > Configure options: PETSC_ARCH=complex-cpp-mumps --with-cc=mpicc
>> --with-fc=mpif90 --with-blas-lapack-dir=/usr/lib64 --with-shared
>> --with-clanguage=c++ --with-scalar-type=complex --download-mumps=1
>> --download-blacs=1 --download-scalapack=1 --download-parmetis=1
>> --with-cxx=mpicxx
>> > -----------------------------------------
>> > Libraries compiled on Mon Jul 11 15:39:58 EDT 2011 on sc.local
>> > Machine characteristics: Linux sc.local 2.6.18-194.11.1.el5 #1 SMP Tue
>> Aug 10 19:05:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>> > Using PETSc directory: /panfs/storage.local/scs/home/abyrd/petsc-3.1-p8
>> > Using PETSc arch: complex-cpp-mumps
>> > -----------------------------------------
>> > Using C compiler: mpicxx -Wall -Wwrite-strings -Wno-strict-aliasing -g
>> -fPIC
>> > Using Fortran compiler: mpif90 -fPIC -Wall -Wno-unused-variable -g
>> > -----------------------------------------
>> > Using include paths:
>> -I/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/include
>> -I/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/include
>> -I/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/include
>> -I/usr/mpi/gnu/openmpi-1.4.2/include -I/usr/mpi/gnu/openmpi-1.4.2/lib64
>> > ------------------------------------------
>> > Using C linker: mpicxx -Wall -Wwrite-strings -Wno-strict-aliasing -g
>> > Using Fortran linker: mpif90 -fPIC -Wall -Wno-unused-variable -g
>> > Using libraries:
>> -Wl,-rpath,/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib
>> -L/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib
>> -lpetsc       -lX11
>> -Wl,-rpath,/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib
>> -L/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib
>> -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lparmetis -lmetis
>> -lscalapack -lblacs -Wl,-rpath,/usr/lib64 -L/usr/lib64 -llapack -lblas -lnsl
>> -lrt -Wl,-rpath,/usr/mpi/gnu/openmpi-1.4.2/lib64
>> -L/usr/mpi/gnu/openmpi-1.4.2/lib64
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -ldl -lmpi -lopen-rte -lopen-pal
>> -lnsl -lutil -lgcc_s -lpthread -lmpi_f90 -lmpi_f77 -lgfortran -lm -lm -lm
>> -lm -lmpi_cxx -lstdc++ -lmpi_cxx -lstdc++ -ldl -lmpi -lopen-rte -lopen-pal
>> -lnsl -lutil -lgcc_s -lpthread -ldl
>> >
>> > Respectfully,
>> > Adam Byrd
>> > <PETScCntor.zip>
>>
>>
>

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener