Look at the timing. The symbolic factorization takes about 3.6e-4 seconds and the numeric factorization<br>only about 10s, out of 543s total. MatSolve takes 517s over 28 calls. If you have a problem, it is likely there.<br>However, MatSolve looks balanced across processes (max/min time ratio 1.0).<br>
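For reference, the arithmetic behind that reading can be sketched in a few lines of Python. The figures are copied from the -log_summary table quoted below; the helper name summarize is only an illustration, not part of PETSc.<br>

```python
# Max time over ranks (seconds) and call counts, copied from the quoted
# -log_summary output. "summarize" is a hypothetical helper for illustration.
total_time = 5.429e+02
events = {
    "MatLUFactorSym": (1, 3.6097e-04),
    "MatLUFactorNum": (1, 1.0464e+01),
    "MatSolve": (28, 5.1757e+02),
}


def summarize(events, total):
    """Return {event: (percent_of_total_time, seconds_per_call)}."""
    return {
        name: (100.0 * seconds / total, seconds / count)
        for name, (count, seconds) in events.items()
    }


summary = summarize(events, total_time)
for name, (percent, per_call) in summary.items():
    print(f"{name:15s} {percent:6.2f}% of run, {per_call:.3e} s per call")
```

With these numbers each MatSolve call costs about 18.5s, which is more than the entire numeric factorization (about 10.5s).<br>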

<br>  Matt<br><br><div class="gmail_quote">On Fri, May 8, 2009 at 10:59 AM, Fredrik Bengzon <span dir="ltr">&lt;<a href="mailto:fredrik.bengzon@math.umu.se">fredrik.bengzon@math.umu.se</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Hi,<br>

Here is the output from the KSP and EPS objects, and the log summary.<br>

/ Fredrik<br>

<br>

<br>

Reading Triangle/Tetgen mesh<br>

#nodes=19345<br>

#elements=81895<br>

#nodes per element=4<br>

Partitioning mesh with METIS 4.0<br>

Element distribution (rank | #elements)<br>

0 | 19771<br>

1 | 20954<br>

2 | 20611<br>

3 | 20559<br>

rank 1 has 257 ghost nodes<br>

rank 0 has 127 ghost nodes<br>

rank 2 has 143 ghost nodes<br>

rank 3 has 270 ghost nodes<br>

Calling 3D Navier-Lame Eigenvalue Solver<br>

Assembling stiffness and mass matrix<br>

Solving eigensystem with SLEPc<br>

KSP Object:(st_)<br>

 type: preonly<br>

 maximum iterations=100000, initial guess is zero<br>

 tolerances:  relative=1e-08, absolute=1e-50, divergence=10000<br>

 left preconditioning<br>

PC Object:(st_)<br>

 type: lu<br>

   LU: out-of-place factorization<br>

     matrix ordering: natural<br>

   LU: tolerance for zero pivot 1e-12<br>

EPS Object:<br>

 problem type: generalized symmetric eigenvalue problem<br>

 method: krylovschur<br>

 extraction type: Rayleigh-Ritz<br>

 selected portion of the spectrum: largest eigenvalues in magnitude<br>

 number of eigenvalues (nev): 4<br>

 number of column vectors (ncv): 19<br>

 maximum dimension of projected problem (mpd): 19<br>

 maximum number of iterations: 6108<br>

 tolerance: 1e-05<br>

 dimension of user-provided deflation space: 0<br>

 IP Object:<br>

   orthogonalization method:   classical Gram-Schmidt<br>

   orthogonalization refinement:   if needed (eta: 0.707100)<br>

 ST Object:<br>

   type: sinvert<br>

   shift: 0<br>

 Matrices A and B have same nonzero pattern<br>

     Associated KSP object<br>

     ------------------------------<br>

     KSP Object:(st_)<br>

       type: preonly<br>

       maximum iterations=100000, initial guess is zero<br>

       tolerances:  relative=1e-08, absolute=1e-50, divergence=10000<br>

       left preconditioning<br>

     PC Object:(st_)<br>

       type: lu<br>

         LU: out-of-place factorization<br>

           matrix ordering: natural<br>

         LU: tolerance for zero pivot 1e-12<br>

         LU: factor fill ratio needed 0<br>

              Factored matrix follows<br>

             Matrix Object:<br>

               type=mpiaij, rows=58035, cols=58035<br>

               package used to perform factorization: superlu_dist<br>

               total: nonzeros=0, allocated nonzeros=116070<br>

                 SuperLU_DIST run parameters:<br>

                   Process grid nprow 2 x npcol 2<br>

                   Equilibrate matrix TRUE<br>

                   Matrix input mode 1<br>

                   Replace tiny pivots TRUE<br>

                   Use iterative refinement FALSE<br>

                   Processors in row 2 col partition 2<br>

                   Row permutation LargeDiag<br>

                   Column permutation PARMETIS<br>

                   Parallel symbolic factorization TRUE<br>

                   Repeated factorization SamePattern<br>

       linear system matrix = precond matrix:<br>

       Matrix Object:<br>

         type=mpiaij, rows=58035, cols=58035<br>

         total: nonzeros=2223621, allocated nonzeros=2233584<br>

           using I-node (on process 0) routines: found 4695 nodes, limit used is 5<br>

     ------------------------------<br>

Number of iterations in the eigensolver: 1<br>

Number of requested eigenvalues: 4<br>

Stopping condition: tol=1e-05, maxit=6108<br>

Number of converged eigenpairs: 8<br>

<br>

Writing binary .vtu file /scratch/fredrik/output/mode-0.vtu<br>

Writing binary .vtu file /scratch/fredrik/output/mode-1.vtu<br>

Writing binary .vtu file /scratch/fredrik/output/mode-2.vtu<br>

Writing binary .vtu file /scratch/fredrik/output/mode-3.vtu<br>

Writing binary .vtu file /scratch/fredrik/output/mode-4.vtu<br>

Writing binary .vtu file /scratch/fredrik/output/mode-5.vtu<br>

Writing binary .vtu file /scratch/fredrik/output/mode-6.vtu<br>

Writing binary .vtu file /scratch/fredrik/output/mode-7.vtu<br>

************************************************************************************************************************<br>

***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use &#39;enscript -r -fCourier9&#39; to print this document            ***<br>

************************************************************************************************************************<br>

<br>

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------<br>

<br>

/home/fredrik/Hakan/cmlfet/a.out on a linux-gnu named medusa1 with 4 processors, by fredrik Fri May  8 17:57:28 2009<br>

Using Petsc Release Version 3.0.0, Patch 5, Mon Apr 13 09:15:37 CDT 2009<br>

<br>

                        Max       Max/Min        Avg      Total<br>

Time (sec):           5.429e+02      1.00001   5.429e+02<br>

Objects:              1.380e+02      1.00000   1.380e+02<br>

Flops:                1.053e+08      1.05695   1.028e+08  4.114e+08<br>

Flops/sec:            1.939e+05      1.05696   1.894e+05  7.577e+05<br>

Memory:               5.927e+07      1.03224              2.339e+08<br>

MPI Messages:         2.880e+02      1.51579   2.535e+02  1.014e+03<br>

MPI Message Lengths:  4.868e+07      1.08170   1.827e+05  1.853e+08<br>

MPI Reductions:       1.122e+02      1.00000<br>

<br>

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)<br>

                           e.g., VecAXPY() for real vectors of length N --&gt; 2N flops<br>

                           and VecAXPY() for complex vectors of length N --&gt; 8N flops<br>

<br>

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --<br>

                       Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total<br>

0:      Main Stage: 5.4292e+02 100.0%  4.1136e+08 100.0%  1.014e+03 100.0%  1.827e+05      100.0%  3.600e+02  80.2%<br>

<br>

------------------------------------------------------------------------------------------------------------------------<br>

See the &#39;Profiling&#39; chapter of the users&#39; manual for details on interpreting output.<br>

Phase summary info:<br>

  Count: number of times phase was executed<br>

  Time and Flops: Max - maximum over all processors<br>

                  Ratio - ratio of maximum to minimum over all processors<br>

  Mess: number of messages sent<br>

  Avg. len: average message length<br>

  Reduct: number of global reductions<br>

  Global: entire computation<br>

  Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().<br>

     %T - percent time in this phase         %F - percent flops in this phase<br>

     %M - percent messages in this phase     %L - percent message lengths in this phase<br>

     %R - percent reductions in this phase<br>

  Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)<br>

------------------------------------------------------------------------------------------------------------------------<br>

<br>

<br>

     ##########################################################<br>

     #                                                        #<br>

     #                          WARNING!!!                    #<br>

     #                                                        #<br>

     #   This code was compiled with a debugging option,      #<br>

     #   To get timing results run config/configure.py        #<br>

     #   using --with-debugging=no, the performance will      #<br>

     #   be generally two or three times faster.              #<br>

     #                                                        #<br>

     ##########################################################<br>

<br>

<br>

Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total<br>

                  Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s<br>

------------------------------------------------------------------------------------------------------------------------<br>

<br>

--- Event Stage 0: Main Stage<br>

<br>

STSetUp                1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  2  0  0  0  2   2  0  0  0  2     0<br>

STApply               28 1.0 5.1775e+02 1.0 3.15e+07 1.1 1.7e+02 4.2e+03 2.8e+01 95 30 17  0  6  95 30 17  0  8     0<br>

EPSSetUp               1 1.0 1.0482e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.6e+01  2  0  0  0 10   2  0  0  0 13     0<br>

EPSSolve               1 1.0 3.7193e+02 1.0 9.59e+07 1.1 3.5e+02 4.2e+03 9.7e+01 69 91 35  1 22  69 91 35  1 27     1<br>

IPOrthogonalize       19 1.0 3.4406e-01 1.1 6.75e+07 1.1 2.3e+02 4.2e+03 7.6e+01  0 64 22  1 17   0 64 22  1 21   767<br>

IPInnerProduct       153 1.0 3.1410e-01 1.0 5.63e+07 1.1 2.3e+02 4.2e+03 3.9e+01  0 53 23  1  9   0 53 23  1 11   700<br>

IPApplyMatrix         39 1.0 2.4903e-01 1.1 4.38e+07 1.1 2.3e+02 4.2e+03 0.0e+00  0 42 23  1  0   0 42 23  1  0   687<br>

UpdateVectors          1 1.0 4.2958e-03 1.2 4.51e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  4107<br>

VecDot                 1 1.0 5.6815e-04 4.7 2.97e+04 1.1 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0   204<br>

VecNorm                8 1.0 2.5260e-03 3.2 2.38e+05 1.1 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0  2   368<br>

VecScale              27 1.0 5.9605e-04 1.1 4.01e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2629<br>

VecCopy               53 1.0 4.0610e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

VecSet                77 1.0 6.2165e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

VecAXPY               38 1.0 2.7709e-03 1.7 1.13e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  1592<br>

VecMAXPY              38 1.0 2.5925e-02 1.1 1.13e+07 1.1 0.0e+00 0.0e+00 0.0e+00  0 11  0  0  0   0 11  0  0  0  1701<br>

VecAssemblyBegin       5 1.0 9.0070e-03 2.3 0.00e+00 0.0 3.6e+01 2.1e+04 1.5e+01  0  0  4  0  3   0  0  4  0  4     0<br>

VecAssemblyEnd         5 1.0 3.4809e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

VecScatterBegin       73 1.0 8.5931e-03 1.5 0.00e+00 0.0 4.6e+02 8.9e+03 0.0e+00  0  0 45  2  0   0  0 45  2  0     0<br>

VecScatterEnd         73 1.0 2.2542e-02 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

VecReduceArith        76 1.0 3.0838e-02 1.1 1.24e+07 1.1 0.0e+00 0.0e+00 0.0e+00  0 12  0  0  0   0 12  0  0  0  1573<br>

VecReduceComm         38 1.0 4.8040e-02 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.8e+01  0  0  0  0  8   0  0  0  0 11     0<br>

VecNormalize           8 1.0 2.7280e-03 2.8 3.56e+05 1.1 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0  2   511<br>

MatMult               67 1.0 4.1397e-01 1.1 7.53e+07 1.1 4.0e+02 4.2e+03 0.0e+00  0 71 40  1  0   0 71 40  1  0   710<br>

MatSolve              28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 95  0  0  0  0  95  0  0  0  0     0<br>

MatLUFactorSym         1 1.0 3.6097e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

MatLUFactorNum         1 1.0 1.0464e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0<br>

MatAssemblyBegin       9 1.0 3.3842e-01 46.7 0.00e+00 0.0 5.4e+01 6.0e+04 8.0e+00  0  0  5  2  2   0  0  5  2  2     0<br>

MatAssemblyEnd         9 1.0 2.3042e-01 1.0 0.00e+00 0.0 3.6e+01 9.4e+02 3.1e+01  0  0  4  0  7   0  0  4  0  9     0<br>

MatGetRow           5206 1.1 3.1164e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

MatGetSubMatrice       5 1.0 8.7580e-01 1.2 0.00e+00 0.0 1.5e+02 1.1e+06 2.5e+01  0  0 15 88  6   0  0 15 88  7     0<br>

MatZeroEntries         2 1.0 1.0233e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

MatView                2 1.0 1.0149e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  1     0<br>

KSPSetup               1 1.0 2.8610e-06 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>

KSPSolve              28 1.0 5.1758e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.8e+01 95  0  0  0  6  95  0  0  0  8     0<br>

PCSetUp                1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  2  0  0  0  2   2  0  0  0  2     0<br>

PCApply               28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 95  0  0  0  0  95  0  0  0  0     0<br>

------------------------------------------------------------------------------------------------------------------------<br>

<br>

Memory usage is given in bytes:<br>

<br>

Object Type          Creations   Destructions   Memory  Descendants&#39; Mem.<br>

<br>

--- Event Stage 0: Main Stage<br>

<br>

 Spectral Transform     1              1        536     0<br>

Eigenproblem Solver     1              1        824     0<br>

      Inner product     1              1        428     0<br>

          Index Set    38             38    1796776     0<br>

  IS L to G Mapping     1              1      58700     0<br>

                Vec    65             65    5458584     0<br>

        Vec Scatter     9              9       7092     0<br>

  Application Order     1              1     155232     0<br>

             Matrix    17             16   17715680     0<br>

      Krylov Solver     1              1        832     0<br>

     Preconditioner     1              1        744     0<br>

             Viewer     2              2       1088     0<br>

========================================================================================================================<br>

Average time to get PetscTime(): 1.90735e-07<br>

Average time for MPI_Barrier(): 5.9557e-05<br>

Average time for zero size MPI_Send(): 2.97427e-05<br>

#PETSc Option Table entries:<br>

-log_summary<br>

-mat_superlu_dist_parsymbfact<br>

#End o PETSc Option Table entries<br>

Compiled without FORTRAN kernels<br>

Compiled with full precision matrices (default)<br>

sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8<br>

Configure run at: Wed May  6 15:14:39 2009<br>

Configure options: --download-superlu_dist=1 --download-parmetis=1 --with-mpi-dir=/usr/lib/mpich --with-shared=0<br>

-----------------------------------------<br>

Libraries compiled on Wed May  6 15:14:49 CEST 2009 on medusa1<br>

Machine characteristics: Linux medusa1 2.6.18-6-amd64 #1 SMP Fri Dec 12 05:49:32 UTC 2008 x86_64 GNU/Linux<br>

Using PETSc directory: /home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5<br>

Using PETSc arch: linux-gnu-c-debug<br>

-----------------------------------------<br>

Using C compiler: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3<br>

Using Fortran compiler: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g<br>

-----------------------------------------<br>

Using include paths: -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/include -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include -I/usr/lib/mpich/include<br>

------------------------------------------<br>


Using C linker: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3<br>

Using Fortran linker: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g<br>

Using libraries: -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc        -lX11 -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -lsuperlu_dist_2.3 -llapack -lblas -lparmetis -lmetis -lm -L/usr/lib/mpich/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib64 -L/lib64 -ldl -lmpich -lpthread -lrt -lgcc_s -lg2c -lm -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/lib -lm -ldl -lmpich -lpthread -lrt -lgcc_s -ldl<br>


------------------------------------------<br>

<br>

real    9m10.616s<br>

user    0m23.921s<br>

sys    0m6.944s<br>
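To compare event timings between runs (for example 2 versus 4 processes), the event rows above can be read programmatically. This is a minimal sketch in Python, assuming the row layout of this PETSc 3.0.0 output (event name, count, count ratio, max time, time ratio, ...); parse_event_row is a hypothetical helper, not a PETSc utility. It also tolerates a time field printed without a separating space before a wide ratio, which this fixed-width format can produce.<br>

```python
import re

# One numeric field: scientific notation with a two-digit exponent, a plain
# decimal, or an integer. Stopping the exponent after two digits lets a fused
# field such as "3.3842e-0146.7" split into 3.3842e-01 and 46.7.
NUMBER = re.compile(r"\d+\.\d+e[+-]\d{2}|\d+\.\d+|\d+")


def parse_event_row(row):
    """Return (name, count, max_time_sec, time_ratio) for one event row."""
    name, rest = row.split(None, 1)
    fields = [float(token) for token in NUMBER.findall(rest)]
    count, _count_ratio, max_time, time_ratio = fields[:4]
    return name, int(count), max_time, time_ratio


rows = [
    "MatSolve              28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 95  0  0  0  0  95  0  0  0  0     0",
    "MatLUFactorNum         1 1.0 1.0464e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0",
    "MatAssemblyBegin       9 1.0 3.3842e-0146.7 0.00e+00 0.0 5.4e+01 6.0e+04 8.0e+00  0  0  5  2  2   0  0  5  2  2     0",
]

# Rank events by maximum time, slowest first.
parsed = sorted((parse_event_row(r) for r in rows), key=lambda e: e[2], reverse=True)
for name, count, max_time, ratio in parsed:
    print(f"{name:18s} calls={count:3d} max_time={max_time:10.4e} ratio={ratio}")
```

The time ratio column is the quick check for load imbalance: a value near 1.0 means all processes spent about the same time in that event.<br>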

<br>


Satish Balay wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Just a note about scalability: it&#39;s a function of the hardware as<br>

well. For proper scalability studies you&#39;ll need a true distributed<br>

system with a fast network [not SMP nodes].<br>

<br>

Satish<br>

<br>

On Fri, 8 May 2009, Fredrik Bengzon wrote:<br>

<br>

  <br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Hong,<br>

Thank you for the suggestions, but I have looked at the EPS and KSP objects<br>

and I cannot find anything wrong. The problem is that it takes longer to<br>

solve with 4 CPUs than with 2, so the scalability seems to be absent when using<br>

superlu_dist. I have stored my mass and stiffness matrices in the mpiaij format<br>

and just passed them on to SLEPc. When using the PETSc iterative Krylov<br>

solvers I see 100% workload on all processors, but when I switch to<br>

superlu_dist only two CPUs seem to do the whole work of LU factoring. I don&#39;t<br>

want to use the Krylov solver though, since it might cause SLEPc not to<br>

converge.<br>

Regards,<br>

Fredrik<br>

<br>

Hong Zhang wrote:<br>

    <br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Run your code with &#39;-eps_view -ksp_view&#39; to check<br>

which methods are used<br>

and &#39;-log_summary&#39; to see which operations dominate<br>

the computation.<br>

<br>

You can turn on parallel symbolic factorization<br>

with &#39;-mat_superlu_dist_parsymbfact&#39;.<br>

<br>

Unless you use a large number of processors, symbolic factorization<br>

takes negligible execution time. The numeric<br>

factorization usually dominates.<br>

<br>

Hong<br>

<br>

On Fri, 8 May 2009, Fredrik Bengzon wrote:<br>

<br>

      <br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Hi PETSc team,<br>

Sorry for posting questions not really concerning the PETSc core, but when<br>

I run superlu_dist from within SLEPc I notice that the load balance is<br>

poor. It is just fine during assembly (I use METIS to partition my finite<br>

element mesh) but when calling the SLEPc solver it dramatically changes. I<br>

use superlu_dist as solver for the eigenvalue iteration. My question is:<br>

can this have something to do with the fact that the option &#39;Parallel<br>

symbolic factorization&#39; is set to false? If so, can I change the options<br>

to superlu_dist using MatSetOption for instance? Also, does this mean that<br>

superlu_dist is not using ParMETIS to reorder the matrix?<br>

Best Regards,<br>

Fredrik Bengzon<br>

<br>

<br>

        <br>

</blockquote></blockquote>

    <br>

</blockquote>

<br>

<br>

  <br>

</blockquote>

<br>

</blockquote></div><br><br clear="all"><br>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener<br>