[petsc-users] Need help improving working PETSC code

Ganesh Vijayakumar ganesh.iitm at gmail.com
Thu Jul 9 14:56:34 CDT 2015


Thanks. I apologize for any stupid assumptions/mistakes. One thing that
became clear to me was to get the machinery working first and improve it
later. I respond to each of your questions below.

On Wed, Jul 8, 2015 at 4:27 PM Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>    Please provide a bit more detail about the operator. Is it a pressure
> solver for CFD? Cell centered? Does the matrix have a null space of the
> constant functions?


The operator: yes, it's a pressure solver for a PISO/SIMPLE-type algorithm.
The underlying code is OpenFOAM.

Regarding a null space of the constant functions, let me verify that I
understand: you mean that A applied to a constant vector gives zero, so
that if Ax = b then A(x + c) = b for any constant offset c, right? The
basic equation is a Poisson equation and contains only the second
derivative, so if the entire solution were offset by a constant it would
still satisfy the equation. However, the problem does have Dirichlet
boundary conditions on one boundary, so I'm thinking there is no null
space.
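
In case it matters later: if the matrix did turn out to have the constant
null space (e.g. with all-Neumann pressure boundaries), my understanding
is that one would tell PETSc about it roughly like this (a sketch only,
using the PETSc 3.5 calls; nothing I have in the code yet):

MatNullSpace nullsp;
// PETSC_TRUE = the null space contains the constant vector
MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);
KSPSetNullSpace(ksp, nullsp); // PETSc 3.5 API; newer releases use MatSetNullSpace(A, nullsp)
MatNullSpaceDestroy(&nullsp);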


> Is it the same linear system for each "time-step?" in the CFD solver or
> different?
>

Different. The structure of the matrix doesn't change from one time step
to the next (for now), but the values do.
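
Since only the values change, I assume I can zero and refill the same
matrix every step so the preallocation and nonzero pattern are reused. A
rough sketch (assuming the A and ksp objects are kept alive across time
steps rather than recreated, which is not what the code does today):

MatZeroEntries(A);                      // keep the nonzero pattern, reset the values
// ... MatSetValue() calls with the new coefficients ...
MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
KSPSetOperators(ksp, A, A);             // new values, so the preconditioner gets rebuilt
// KSPSetReusePreconditioner(ksp, PETSC_TRUE); // optional: keep the old AMG setup if values change slowly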

>    How many iterations is hypre BoomerAMG typically taking?
>

 Around 5 after I relaxed the tolerances (it used to be around 10-13 with
the tighter tolerances). I solve the system of equations 3 times per
"time-step", as shown here.

Mat Object: 256 MPI processes
  type: mpiaij
  rows=12254823, cols=12254823
  total: nonzeros=7.65502e+07, allocated nonzeros=1.83822e+08
  total number of mallocs used during MatSetValues calls =0
    not using I-node (on process 0) routines
simple.corrNonOrtho() = 1 simple.nNonOrthCorr() = 2
  1 KSP Residual norm 1.000000000000e+00
End residual = 0.000816181009378
Number of iterations = 4
simple.corrNonOrtho() = 2 simple.nNonOrthCorr() = 2
  1 KSP Residual norm 1.000000000000e+00
End residual = 0.00118562245298
Number of iterations = 3
simple.corrNonOrtho() = 3 simple.nNonOrthCorr() = 2
  1 KSP Residual norm 1.000000000000e+00
End residual = 1.15194716042e-06
Number of iterations = 5
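
In case it's relevant, I also see that BoomerAMG can be tuned through
run-time options without code changes. A sketch of what I might try (the
option names come from the PETSc/hypre interface; the particular values
are only guesses for a 3D problem, nothing I have tested yet):

PetscOptionsSetValue("-pc_hypre_boomeramg_strong_threshold", "0.5"); // default is 0.25; 0.5 is often suggested for 3D
PetscOptionsSetValue("-pc_hypre_boomeramg_coarsen_type", "HMIS");
PetscOptionsSetValue("-pc_hypre_boomeramg_agg_nl", "1");             // one level of aggressive coarsening
PCSetFromOptions(pc); // called after PCHYPRESetType() so the hypre options are picked up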

>    This is an extremely tight tolerance, why do you set it so small?
> KSPSetTolerances(ksp, 1e-13,  1e-13,  PETSC_DEFAULT,  PETSC_DEFAULT) ;
>

Well, that was just a first attempt; I wanted to make sure it worked. But
relaxing the required tolerances to 1e-5, 1e-5 doesn't change the run time
by much either: 5 "time-steps" now take 250 s compared to 260 s earlier.
As a comparison, I can use OpenFOAM's built-in PCG solver + DIC
preconditioner to do the same in about 13 s (thirteen). Sure, the
convergence tolerances are not quite the same, but I don't think that's
the issue here. I'm trying to do one better than OpenFOAM's existing
solver suite, and I think a good first step would be to match it.
Technically I'm comparing apples and oranges by comparing PCG + DIC vs.
FGMRES + AMG, but I was hoping I could do better. I'm open to changing
pretty much anything.
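
For a closer apples-to-apples comparison on the PETSc side, I suppose the
counterpart to OpenFOAM's PCG + DIC would be CG with block Jacobi/ICC. A
minimal sketch of what I have in mind (my own guess, and it assumes the
assembled matrix is symmetric positive definite):

KSPSetType(ksp, KSPCG);                      // CG instead of FGMRES for the symmetric system
PCSetType(pc, PCBJACOBI);                    // block Jacobi, one block per process
PetscOptionsSetValue("-sub_pc_type", "icc"); // ICC(0) on each block, roughly in the spirit of DIC
KSPSetFromOptions(ksp);
KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);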

>
>    Send the output from a run with -log_summary so we can see where the
> time is being spent.
>

Thanks for this. The -log_summary output is below.

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
-fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary:
----------------------------------------------

/scratch/02504/ganesh10/OpenFOAM/ganesh10-2.1.x/platforms/linux64IccDPOpt/bin/SRFSimpleFoamPETSC
on a sandybridge named c415-502.stampede.tacc.utexas.edu with 256 processors, by ganesh10 Thu Jul  9 13:29:19 2015
Using Petsc Release Version 3.5.3, Jan, 31, 2015

                         Max       Max/Min        Avg      Total
Time (sec):           2.473e+02      1.00143   2.472e+02
Objects:              5.210e+02      1.00000   5.210e+02
Flops:                1.256e+08      1.17291   1.184e+08  3.031e+10
Flops/sec:            5.086e+05      1.17366   4.790e+05  1.226e+08
MPI Messages:         3.956e+03     21.50000   1.167e+03  2.986e+05
MPI Message Lengths:  6.769e+06      4.61934   3.787e+03  1.131e+09
MPI Reductions:       3.680e+02      1.00000

Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N
--> 2N flops
                            and VecAXPY() for complex vectors of length N
--> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 2.4720e+02 100.0%  3.0311e+10 100.0%  2.986e+05 100.0%  3.787e+03      100.0%  3.670e+02  99.7%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector   484            482    184328584     0
      Vector Scatter     1              1         1060     0
              Matrix     3              3     10175808     0
           Index Set     2              2        38884     0
              Viewer     1              0            0     0
       Krylov Solver    15             15       284400     0
      Preconditioner    15             15        16440     0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 1.64032e-05
Average time for zero size MPI_Send(): 0.000232594
#PETSc Option Table entries:
-info blah
-log_summary
-mat_view ::ascii_info
-parallel
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-x=0 -with-pic
--with-external-packages-dir=/opt/apps/intel13/mvapich2_1_9/petsc/3.5/externalpackages
--with-mpi-compilers=1 --with-mpi-dir=/opt/apps/intel13/mvapich2/1.9
--with-scalar-type=real --with-shared-libraries=1 --with-precision=double
--with-hypre=1 --download-hypre --with-ml=1 --download-ml --with-ml=1
--download-ml --with-superlu_dist=1 --download-superlu_dist
--with-superlu=1 --download-superlu --with-parmetis=1 --download-parmetis
--with-metis=1 --download-metis --with-spai=1 --download-spai
--with-mumps=1 --download-mumps --with-parmetis=1 --download-parmetis
--with-metis=1 --download-metis --with-scalapack=1 --download-scalapack
--with-blacs=1 --download-blacs --with-spooles=1 --download-spooles
--with-hdf5=1 --with-hdf5-dir=/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9
--with-debugging=no
--with-blas-lapack-dir=/opt/apps/intel/13/composer_xe_2013.2.146/mkl
--with-mpiexec=mpirun_rsh --COPTFLAGS= --FOPTFLAGS= --CXXOPTFLAGS=
-----------------------------------------
Libraries compiled on Thu Apr  2 10:06:57 2015 on
staff.stampede.tacc.utexas.edu
Machine characteristics:
Linux-2.6.32-431.17.1.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /opt/apps/intel13/mvapich2_1_9/petsc/3.5
Using PETSc arch: sandybridge
-----------------------------------------

Using C compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpicc  -fPIC -wd1572
${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpif90  -fPIC
${FOPTFLAGS} ${FFLAGS}
-----------------------------------------

Using include paths:
-I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include
-I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include
-I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include
-I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include
-I/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/include
-I/opt/apps/intel13/mvapich2/1.9/include


ganesh


>
>    Barry
>
>
>
> > On Jul 8, 2015, at 11:11 AM, Ganesh Vijayakumar <ganesh.iitm at gmail.com>
> > wrote:
> >
> > Hello,
> >
> >  First of all, thanks to the PETSc developers and everyone else
> > contributing supporting material on the web.
> >
> >  I need to solve a system of equations Ax = b, where A is symmetric,
> > sparse, unstructured and in parallel as a part of a finite volume solver
> > for CFD. Decomposition is already done. I initialize the matrix as MPIAIJ
> > even though it's symmetric as that's the only thing I could get to work.
> >
> > MatCreate(PETSC_COMM_WORLD,&A);
> > MatSetType(A,MATMPIAIJ);
> > MatSetSizes(A,nCellsCurProc,nCellsCurProc,nTotalCells,nTotalCells);
> > MatMPIAIJSetPreallocation(A,10,PETSC_NULL,5,PETSC_NULL);
> > MatSetUp(A) ;
> > // Assemble matrix using MatSetValue; assemble the full matrix even
> > // though it's symmetric.
> > MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY) ;
> > MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY) ;
> > //Create solver
> > KSPCreate(PETSC_COMM_WORLD,&ksp);
> > KSPSetType(ksp, KSPFGMRES);
> > KSPGetPC(ksp, &pc);
> > PCFactorSetShiftType(pc, MAT_SHIFT_POSITIVE_DEFINITE);
> >
> > //PCSetType(pc, PCML); //Trilinos - Multilevel
> >
> > //PCSetType(pc, PCBJACOBI) ; // Block Jacobi (ILU on each block via -sub_pc_type ilu)
> > //PetscOptionsSetValue("-sub_pc_type", "ilu") ;
> > KSPSetFromOptions(ksp);
> >
> > // PCSetType(pc, PCASM); //Additive Schwarz Methods
> > // PCASMSetOverlap(pc,overlap);
> >
> > PCSetType(pc, PCHYPRE); //HYPRE
> > PCHYPRESetType(pc, "boomeramg");
> >
> > //PCHYPRESetType(pc, "parasails");
> >
> > KSPSetPC(ksp, pc);
> > KSPSetTolerances(ksp, 1e-13,  1e-13,  PETSC_DEFAULT,  PETSC_DEFAULT) ;
> > KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);
> > KSPSetOperators(ksp, A, A);
> > KSPMonitorDefault(ksp, 1, 1, pVoid) ;
> >
> > KSPSolve(ksp,b,x);
> > VecGetArray(x, &getResultArray) ;
> > KSPGetResidualNorm(ksp, &endNorm) ;
> > KSPGetIterationNumber(ksp, &nIterations);
> >
> > KSPDestroy(&ksp);
> >
> >
> >  I know that this works and gives me the correct result. However,
> > BoomerAMG is proving to be too slow, at least the way I use it at 256
> > processors with around 12 million unknowns. I need your advice on the
> > preconditioner. Am I using it right? Is there anything else I could be
> > doing better?
> >
> > ganesh
>
>

