On Mon, Aug 1, 2011 at 9:31 PM, Adam Byrd <span dir="ltr">&lt;<a href="mailto:adam1.byrd@gmail.com">adam1.byrd@gmail.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="gmail_quote">On Mon, Aug 1, 2011 at 5:09 PM, Barry Smith <span dir="ltr">&lt;<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div><br>

On Aug 1, 2011, at 3:00 PM, Adam Byrd wrote:<br>

<br>

&gt; Hello,<br>

&gt;<br>

&gt; I&#39;m looking for help reducing the time and communication of a parallel MatMatSolve using MUMPS. On a single processor I experience decent solve times (~9 seconds each), but when moving to multiple processors I see longer times with more cores. I&#39;ve run with -log_summary and confirmed (practically) all the time is spent in MatMatSolve. I&#39;m fairly certain it&#39;s all communication between nodes and I&#39;m trying to figure out where I can make optimizations, or if it is even feasible for this type of problem. It is a parallel, dense,<br>


<br>

</div>     I hope you mean that the original matrix you use with MUMPS is sparse (you should not use MUMPS to solve dense linear systems).<br></blockquote><div><br>Oops, yes. The original matrix is sparse. MatMatSolve requires the solution and identity (right-hand-side) matrices to be dense. I was typing faster than I was thinking.<br>


</div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">

<div><br>

&gt; direct solve using MUMPS with an LU preconditioner. I know there are many smaller optimizations that can be done in other areas, but at the moment it is only the solve that concerns me.<br>

<br>

</div>     MUMPS will run slower on 2 processors than on 1; this is just a fact of life. You will only gain from running MUMPS in parallel on large problems.<br></blockquote><div><br>I see. It looks like I took off in the wrong direction then. I&#39;m trying to solve for the inverse of a sparse matrix in parallel. I&#39;m starting at 3600x3600 and will be moving to 30,000x30,000+ in the future. Which solver suits this sort of problem?<br>

</div></div></blockquote><div><br></div><div>The key to parallel computing (and most other things) is choosing the right problem. This, unfortunately, is not a problem that lends itself to parallelism.</div><div><br></div><div>
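<div>A rough back-of-the-envelope sketch of why this is a hard fit, added for illustration and not taken from the original messages. It uses only figures from the -log_summary output quoted below (14400 MatSolve events, 4 MatMatSolve calls, sizeof(PetscScalar) 16) plus the matrix sizes mentioned above. The Python function names are hypothetical helpers, not PETSc API.</div>
<pre>
```python
# Back-of-the-envelope numbers taken from the -log_summary output in this thread.
# All names here are illustrative helpers; none of them are PETSc functions.

SCALAR_BYTES = 16                 # sizeof(PetscScalar) reported for this complex build
MATSOLVE_COUNT = 14400            # MatSolve events in the log
MATMATSOLVE_COUNT = 4             # MatMatSolve events in the log
MATMATSOLVE_SECONDS = 1.2380e+02  # total MatMatSolve time on 2 processes


def columns_per_matmatsolve(matsolves, matmatsolves):
    """Number of single right-hand-side solves behind each MatMatSolve."""
    return matsolves // matmatsolves


def dense_matrix_bytes(n, scalar_bytes=SCALAR_BYTES):
    """Storage for one dense n-by-n matrix of scalars."""
    return n * n * scalar_bytes


def vector_bytes(n, scalar_bytes=SCALAR_BYTES):
    """Storage for one length-n vector (one column of the inverse)."""
    return n * scalar_bytes


if __name__ == "__main__":
    cols = columns_per_matmatsolve(MATSOLVE_COUNT, MATMATSOLVE_COUNT)
    print("columns per MatMatSolve:", cols)
    print("seconds per MatMatSolve:", MATMATSOLVE_SECONDS / MATMATSOLVE_COUNT)
    print("bytes in one column:", vector_bytes(cols))
    for n in (3600, 30000):
        print(n, "dense matrix size in GB:", dense_matrix_bytes(n) / 1e9)
```
</pre>
<div>With the logged values this gives 3600 single-vector solves per MatMatSolve (one per column of the inverse), about 31 seconds per MatMatSolve on 2 processes, and 57,600 bytes per column, which is close to the 5.7e+04 average message length in the log. That suggests, though the log alone does not prove it, that each column solve ships roughly a whole vector between processes. One dense 30,000 x 30,000 complex matrix would need 14.4 GB, versus about 0.21 GB at 3600 x 3600.</div><div><br></div>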

   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div>

</div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">

<br>

   Barry<br>

<div><div></div><div><br>

<br>

<br>

&gt;<br>

&gt; ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------<br>

&gt;<br>

&gt; ./cntor on a complex-c named hpc-1-0.local with 2 processors, by abyrd Mon Aug  1 16:25:51 2011<br>

&gt; Using Petsc Release Version 3.1.0, Patch 8, Thu Mar 17 13:37:48 CDT 2011<br>

&gt;<br>

&gt;                          Max       Max/Min        Avg      Total<br>

&gt; Time (sec):           1.307e+02      1.00000   1.307e+02<br>

&gt; Objects:              1.180e+02      1.00000   1.180e+02<br>

&gt; Flops:                0.000e+00      0.00000   0.000e+00  0.000e+00<br>

&gt; Flops/sec:            0.000e+00      0.00000   0.000e+00  0.000e+00<br>

&gt; Memory:               2.091e+08      1.00001              4.181e+08<br>

&gt; MPI Messages:         7.229e+03      1.00000   7.229e+03  1.446e+04<br>

&gt; MPI Message Lengths:  4.141e+08      1.00000   5.729e+04  8.283e+08<br>

&gt; MPI Reductions:       1.464e+04      1.00000<br>

&gt;<br>

&gt; Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)<br>

&gt;                             e.g., VecAXPY() for real vectors of length N --&gt; 2N flops<br>

&gt;                             and VecAXPY() for complex vectors of length N --&gt; 8N flops<br>

&gt;<br>

&gt; Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --<br>

&gt;                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total<br>

&gt;  0:      Main Stage: 1.3072e+02 100.0%  0.0000e+00   0.0%  1.446e+04 100.0%  5.729e+04      100.0%  1.730e+02   1.2%<br>

&gt;<br>

&gt; ------------------------------------------------------------------------------------------------------------------------<br>

&gt; See the &#39;Profiling&#39; chapter of the users&#39; manual for details on interpreting output.<br>

&gt; Phase summary info:<br>

&gt;    Count: number of times phase was executed<br>

&gt;    Time and Flops: Max - maximum over all processors<br>

&gt;                    Ratio - ratio of maximum to minimum over all processors<br>

&gt;    Mess: number of messages sent<br>

&gt;    Avg. len: average message length<br>

&gt;    Reduct: number of global reductions<br>

&gt;    Global: entire computation<br>

&gt;    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().<br>

&gt;       %T - percent time in this phase         %F - percent flops in this phase<br>

&gt;       %M - percent messages in this phase     %L - percent message lengths in this phase<br>

&gt;       %R - percent reductions in this phase<br>

&gt;    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)<br>

&gt; ------------------------------------------------------------------------------------------------------------------------<br>

&gt;<br>

&gt;<br>

&gt;       ##########################################################<br>

&gt;       #                                                        #<br>

&gt;       #                          WARNING!!!                    #<br>

&gt;       #                                                        #<br>

&gt;       #   This code was compiled with a debugging option,      #<br>

&gt;       #   To get timing results run config/configure.py        #<br>

&gt;       #   using --with-debugging=no, the performance will      #<br>

&gt;       #   be generally two or three times faster.              #<br>

&gt;       #                                                        #<br>

&gt;       ##########################################################<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt;       ##########################################################<br>

&gt;       #                                                        #<br>

&gt;       #                          WARNING!!!                    #<br>

&gt;       #                                                        #<br>

&gt;       #   The code for various complex numbers numerical       #<br>

&gt;       #   kernels uses C++, which generally is not well        #<br>

&gt;       #   optimized.  For performance that is about 4-5 times  #<br>

&gt;       #   faster, specify --with-fortran-kernels=1             #<br>

&gt;       #   when running config/configure.py.                    #<br>

&gt;       #                                                        #<br>

&gt;       ##########################################################<br>

&gt;<br>

&gt;<br>

&gt; Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total<br>

&gt;                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s<br>

&gt; ------------------------------------------------------------------------------------------------------------------------<br>

&gt;<br>

&gt; --- Event Stage 0: Main Stage<br>

&gt;<br>

&gt; MatSolve           14400 1.0 1.2364e+02 1.0 0.00e+00 0.0 1.4e+04 5.7e+04 2.0e+01 95  0100100  0  95  0100100 12     0<br>

&gt; MatLUFactorSym         4 1.0 2.0027e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

&gt; MatLUFactorNum         4 1.0 3.4223e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+01  3  0  0  0  0   3  0  0  0 14     0<br>

&gt; MatConvert             1 1.0 2.3644e-01 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+01  0  0  0  0  0   0  0  0  0  6     0<br>

&gt; MatAssemblyBegin      14 1.0 1.9959e-01 9.3 0.00e+00 0.0 3.0e+01 5.2e+04 1.2e+01  0  0  0  0  0   0  0  0  0  7     0<br>

&gt; MatAssemblyEnd        14 1.0 1.9908e-01 1.1 0.00e+00 0.0 4.0e+00 2.8e+01 2.0e+01  0  0  0  0  0   0  0  0  0 12     0<br>

&gt; MatGetRow             32 1.0 4.2677e-05 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

&gt; MatGetSubMatrice       4 1.0 7.6661e-03 1.0 0.00e+00 0.0 1.6e+01 1.2e+05 2.4e+01  0  0  0  0  0   0  0  0  0 14     0<br>

&gt; MatMatSolve            4 1.0 1.2380e+02 1.0 0.00e+00 0.0 1.4e+04 5.7e+04 2.0e+01 95  0100100  0  95  0100100 12     0<br>

&gt; VecSet                 4 1.0 1.8590e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

&gt; VecScatterBegin    28800 1.0 2.2810e+00 2.2 0.00e+00 0.0 1.4e+04 5.7e+04 0.0e+00  1  0100100  0   1  0100100  0     0<br>

&gt; VecScatterEnd      14400 1.0 4.1534e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0<br>

&gt; KSPSetup               4 1.0 1.1060e-0212.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

&gt; PCSetUp                4 1.0 3.4280e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.6e+01  3  0  0  0  0   3  0  0  0 32     0<br>

&gt; ------------------------------------------------------------------------------------------------------------------------<br>

&gt;<br>

&gt; Memory usage is given in bytes:<br>

&gt;<br>

&gt; Object Type          Creations   Destructions     Memory  Descendants&#39; Mem.<br>

&gt; Reports information only for process 0.<br>

&gt;<br>

&gt; --- Event Stage 0: Main Stage<br>

&gt;<br>

&gt;               Matrix    27             27    208196712     0<br>

&gt;                  Vec    36             36      1027376     0<br>

&gt;          Vec Scatter    11             11         7220     0<br>

&gt;            Index Set    42             42        22644     0<br>

&gt;        Krylov Solver     1              1        34432     0<br>

&gt;       Preconditioner     1              1          752     0<br>

&gt; ========================================================================================================================<br>

&gt; Average time to get PetscTime(): 1.90735e-07<br>

&gt; Average time for MPI_Barrier(): 3.8147e-06<br>

&gt; Average time for zero size MPI_Send(): 7.51019e-06<br>

&gt; #PETSc Option Table entries:<br>

&gt; -log_summary<br>

&gt; -pc_factor_mat_solver_package mumps<br>

&gt; -pc_type lu<br>

&gt; #End of PETSc Option Table entries<br>

&gt; Compiled without FORTRAN kernels<br>

&gt; Compiled with full precision matrices (default)<br>

&gt; sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 16<br>

&gt; Configure run at: Mon Jul 11 15:28:42 2011<br>

&gt; Configure options: PETSC_ARCH=complex-cpp-mumps --with-cc=mpicc --with-fc=mpif90 --with-blas-lapack-dir=/usr/lib64 --with-shared --with-clanguage=c++ --with-scalar-type=complex --download-mumps=1 --download-blacs=1 --download-scalapack=1 --download-parmetis=1 --with-cxx=mpicxx<br>


&gt; -----------------------------------------<br>

&gt; Libraries compiled on Mon Jul 11 15:39:58 EDT 2011 on sc.local<br>

&gt; Machine characteristics: Linux sc.local 2.6.18-194.11.1.el5 #1 SMP Tue Aug 10 19:05:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux<br>

&gt; Using PETSc directory: /panfs/storage.local/scs/home/abyrd/petsc-3.1-p8<br>

&gt; Using PETSc arch: complex-cpp-mumps<br>

&gt; -----------------------------------------<br>

&gt; Using C compiler: mpicxx -Wall -Wwrite-strings -Wno-strict-aliasing -g   -fPIC<br>

&gt; Using Fortran compiler: mpif90 -fPIC -Wall -Wno-unused-variable -g<br>

&gt; -----------------------------------------<br>

&gt; Using include paths: -I/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/include -I/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/include -I/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/include -I/usr/mpi/gnu/openmpi-1.4.2/include -I/usr/mpi/gnu/openmpi-1.4.2/lib64<br>


&gt; ------------------------------------------<br>

&gt; Using C linker: mpicxx -Wall -Wwrite-strings -Wno-strict-aliasing -g<br>

&gt; Using Fortran linker: mpif90 -fPIC -Wall -Wno-unused-variable -g<br>

&gt; Using libraries: -Wl,-rpath,/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib -L/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib -lpetsc       -lX11 -Wl,-rpath,/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib -L/panfs/storage.local/scs/home/abyrd/petsc-3.1-p8/complex-cpp-mumps/lib -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lparmetis -lmetis -lscalapack -lblacs -Wl,-rpath,/usr/lib64 -L/usr/lib64 -llapack -lblas -lnsl -lrt -Wl,-rpath,/usr/mpi/gnu/openmpi-1.4.2/lib64 -L/usr/mpi/gnu/openmpi-1.4.2/lib64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -ldl -lmpi -lopen-rte -lopen-pal -lnsl -lutil -lgcc_s -lpthread -lmpi_f90 -lmpi_f77 -lgfortran -lm -lm -lm -lm -lmpi_cxx -lstdc++ -lmpi_cxx -lstdc++ -ldl -lmpi -lopen-rte -lopen-pal -lnsl -lutil -lgcc_s -lpthread -ldl<br>


&gt;<br>

&gt; Respectfully,<br>

&gt; Adam Byrd<br>

</div></div>&gt; &lt;PETScCntor.zip&gt;<br>

<br>

</blockquote></div><br>

</blockquote></div><br>