[petsc-users] Need help improving working PETSC code

Barry Smith bsmith at mcs.anl.gov
Thu Jul 9 17:46:56 CDT 2015


> On Jul 9, 2015, at 2:56 PM, Ganesh Vijayakumar <ganesh.iitm at gmail.com> wrote:
> 
> Thanks. I apologize for any stupid assumptions/mistakes.

  You have not made any stupid assumptions or mistakes.

> One thing that became clear to me was to get the machinery working first, improve later. I respond to each of your questions below them.
> 
> On Wed, Jul 8, 2015 at 4:27 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>    Please provide a bit more detail about the operator. Is it a pressure solver for CFD? Cell centered? Does the matrix has a null space of the constant functions?
> 
> The operator.. yes it's a pressure solver for a PISO/SIMPLE type algorithm. The underlying code is OpenFOAM. 

Make sure to read all the way down.

Ok, I understand.

> 
> Reg. the null space of constant functions, let me verify... Do you mean Ax = b becomes A (x_1 + x_2) = (c + d), where A x_1 = c and A x_2 = d, with d constant or zero? The basic equation is a Poisson equation containing only the 2nd derivative, so if the entire solution is offset by a constant it would still satisfy the equations. However, the problem does have Dirichlet boundary conditions on one boundary. So I'm thinking no null space.

 Yes, no null space
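
  For reference only: if the pressure equation had pure Neumann boundaries and therefore a null space of the constant vectors, a minimal sketch of telling PETSc about it would look like the following (not needed here because of the Dirichlet boundary):

  MatNullSpace nullsp;
  /* PETSC_TRUE says the null space contains the constant vector */
  MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);
  MatSetNullSpace(A, nullsp);
  MatNullSpaceDestroy(&nullsp);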
>  
> Is it the same linear system for each "time-step?" in the CFD solver or different?
> 
> Different. The structure of the matrix doesn't change every time step (for now), however the values do.
> 
>    How many iterations is hypre BoomerAMG typically taking?
> 
> Around 5 after I relaxed the tolerances (it used to be around 10-13 with the tighter tolerances). I solve the system of equations 3 times per "time-step", as shown here.
> 
> Mat Object: 256 MPI processes
>   type: mpiaij
>   rows=12254823, cols=12254823
>   total: nonzeros=7.65502e+07, allocated nonzeros=1.83822e+08
>   total number of mallocs used during MatSetValues calls =0
>     not using I-node (on process 0) routines
> simple.corrNonOrtho() = 1 simple.nNonOrthCorr() = 2
>   1 KSP Residual norm 1.000000000000e+00
> End residual = 0.000816181009378
> Number of iterations = 4
> simple.corrNonOrtho() = 2 simple.nNonOrthCorr() = 2
>   1 KSP Residual norm 1.000000000000e+00
> End residual = 0.00118562245298
> Number of iterations = 3
> simple.corrNonOrtho() = 3 simple.nNonOrthCorr() = 2
>   1 KSP Residual norm 1.000000000000e+00
> End residual = 1.15194716042e-06
> Number of iterations = 5
> 
>    This is an extremely tight tolerance, why do you set it so small? KSPSetTolerances(ksp, 1e-13,  1e-13,  PETSC_DEFAULT,  PETSC_DEFAULT) ;
> 
> Well, that was just a first start.. I wanted to make sure it worked. But reducing the required tolerance to 1e-5, 1e-5 doesn't change the run time much either: 5 "time-steps" now take 250s compared to 260s earlier. As a comparison, I can use OpenFOAM's built-in PCG solver + DIC preconditioner

  What is DIC?

> to do the same in about 13s (thirteen). Sure, the convergence tolerances are not quite the same, but I don't think that's the issue here. I'm trying to do one better than OpenFOAM's existing solver suite; I think a good first step would be to match it. Technically I'm comparing apples and oranges by comparing PCG + DIC against FGMRES + AMG, but I was hoping I could do better. I'm open to changing pretty much anything.
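
For reference, the relaxed tolerance setting described above corresponds to the KSPSetTolerances() call shown later in the code, with both rtol and abstol loosened to 1e-5 and the remaining arguments left at their defaults:

  /* relaxed tolerances: rtol = 1e-5, abstol = 1e-5; the divergence
     tolerance and maximum iteration count stay at PETSC_DEFAULT */
  KSPSetTolerances(ksp, 1e-5, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT);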
> 
>    Send the output from a run with -log_summary so we can see where the time is being spent.
> 
> Thanks for this. This is below.
> 
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
> ************************************************************************************************************************
> 
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> 
> /scratch/02504/ganesh10/OpenFOAM/ganesh10-2.1.x/platforms/linux64IccDPOpt/bin/SRFSimpleFoamPETSC on a sandybridge named c415-502.stampede.tacc.utexas.edu with 256 processors, by ganesh10 Thu Jul  9 13:29:19 2015
> Using Petsc Release Version 3.5.3, Jan, 31, 2015
> 
>                          Max       Max/Min        Avg      Total
> Time (sec):           2.473e+02      1.00143   2.472e+02
> Objects:              5.210e+02      1.00000   5.210e+02
> Flops:                1.256e+08      1.17291   1.184e+08  3.031e+10
> Flops/sec:            5.086e+05      1.17366   4.790e+05  1.226e+08
> MPI Messages:         3.956e+03     21.50000   1.167e+03  2.986e+05
> MPI Message Lengths:  6.769e+06      4.61934   3.787e+03  1.131e+09
> MPI Reductions:       3.680e+02      1.00000
> 
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
>                             and VecAXPY() for complex vectors of length N --> 8N flops
> 
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 2.4720e+02 100.0%  3.0311e+10 100.0%  2.986e+05 100.0%  3.787e+03      100.0%  3.670e+02  99.7%
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
Where is all the data?  It should list all the events and the time spent in each. Did you use PetscOptionsSetValue() to provide -log_summary? That won't work; you need to pass it on the command line, in the PETSC_OPTIONS environment variable, or in a file called petscrc.
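
A minimal sketch of where command-line options like -log_summary get picked up (assuming the usual main() signature with argc/argv passed through):

  /* options given on the command line, in PETSC_OPTIONS, or in a petscrc
     file are read here at initialization; PetscOptionsSetValue() called
     later in the code is too late for -log_summary */
  PetscInitialize(&argc, &argv, (char *)0, NULL);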


Well, regardless, it is likely that the matrix is well conditioned enough that the large setup time of hypre BoomerAMG is killing performance. How many iterations is the PCG plus DIC taking?

BTW: you can also run with, for example, -ksp_type cg -pc_type jacobi or -ksp_type cg -pc_type bjacobi -sub_pc_type icc to get simpler preconditioners with much lower setup times. What do you get for those?
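
If you would rather set the first of these in the code instead of on the command line, a rough sketch of the equivalent calls would be (the bjacobi/icc variant would use PCBJACOBI here together with -sub_pc_type icc at run time):

  KSPSetType(ksp, KSPCG);   /* conjugate gradient, suitable for the symmetric system */
  PCSetType(pc, PCJACOBI);  /* plain Jacobi: essentially no setup cost */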

Barry



> ------------------------------------------------------------------------------------------------------------------------
> 
> Memory usage is given in bytes:
> 
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
> 
> --- Event Stage 0: Main Stage
> 
>               Vector   484            482    184328584     0
>       Vector Scatter     1              1         1060     0
>               Matrix     3              3     10175808     0
>            Index Set     2              2        38884     0
>               Viewer     1              0            0     0
>        Krylov Solver    15             15       284400     0
>       Preconditioner    15             15        16440     0
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> Average time for MPI_Barrier(): 1.64032e-05
> Average time for zero size MPI_Send(): 0.000232594
> #PETSc Option Table entries:
> -info blah
> -log_summary
> -mat_view ::ascii_info
> -parallel
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-x=0 -with-pic --with-external-packages-dir=/opt/apps/intel13/mvapich2_1_9/petsc/3.5/externalpackages --with-mpi-compilers=1 --with-mpi-dir=/opt/apps/intel13/mvapich2/1.9 --with-scalar-type=real --with-shared-libraries=1 --with-precision=double --with-hypre=1 --download-hypre --with-ml=1 --download-ml --with-ml=1 --download-ml --with-superlu_dist=1 --download-superlu_dist --with-superlu=1 --download-superlu --with-parmetis=1 --download-parmetis --with-metis=1 --download-metis --with-spai=1 --download-spai --with-mumps=1 --download-mumps --with-parmetis=1 --download-parmetis --with-metis=1 --download-metis --with-scalapack=1 --download-scalapack --with-blacs=1 --download-blacs --with-spooles=1 --download-spooles --with-hdf5=1 --with-hdf5-dir=/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9 --with-debugging=no --with-blas-lapack-dir=/opt/apps/intel/13/composer_xe_2013.2.146/mkl --with-mpiexec=mpirun_rsh --COPTFLAGS= --FOPTFLAGS= --CXXOPTFLAGS=
> -----------------------------------------
> Libraries compiled on Thu Apr  2 10:06:57 2015 on staff.stampede.tacc.utexas.edu
> Machine characteristics: Linux-2.6.32-431.17.1.el6.x86_64-x86_64-with-centos-6.6-Final
> Using PETSc directory: /opt/apps/intel13/mvapich2_1_9/petsc/3.5
> Using PETSc arch: sandybridge
> -----------------------------------------
> 
> Using C compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpicc  -fPIC -wd1572   ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpif90  -fPIC    ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
> 
> Using include paths: -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include -I/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/include -I/opt/apps/intel13/mvapich2/1.9/include
> 
> 
> ganesh
>  
> 
>    Barry
> 
> 
> 
> > On Jul 8, 2015, at 11:11 AM, Ganesh Vijayakumar <ganesh.iitm at gmail.com> wrote:
> >
> > Hello,
> >
> >  First of all.. thanks to the PETSc developers and everyone else contributing supporting material on the web.
> >
> >  I need to solve a system of equations Ax = b, where A is symmetric, sparse, and unstructured, in parallel, as part of a finite volume solver for CFD. The decomposition is already done. I initialize the matrix as MPIAIJ even though it's symmetric, as that's the only thing I could get to work.
> >
> > MatCreate(PETSC_COMM_WORLD,&A);
> > MatSetType(A,MATMPIAIJ);
> > MatSetSizes(A,nCellsCurProc,nCellsCurProc,nTotalCells,nTotalCells);
> > MatMPIAIJSetPreallocation(A,10,PETSC_NULL,5,PETSC_NULL);
> > MatSetUp(A) ;
> > // Assemble matrix using MatSetValue.. assemble full matrix even though it's symmetric.
> > MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY) ;
> > MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY) ;
> > //Create solver
> > KSPCreate(PETSC_COMM_WORLD,&ksp);
> > KSPSetType(ksp, KSPFGMRES);
> > KSPGetPC(ksp, &pc);
> > PCFactorSetShiftType(pc, MAT_SHIFT_POSITIVE_DEFINITE);
> >
> > //PCSetType(pc, PCML); //Trilinos - Multilevel
> >
> > //PCSetType(pc, PCBJACOBI) ; //Block Jacobi (ILU on each block via -sub_pc_type ilu below)
> > //PetscOptionsSetValue("-sub_pc_type", "ilu") ;
> > KSPSetFromOptions(ksp);
> >
> > // PCSetType(pc, PCASM); //Additive Schwarz Methods
> > // PCASMSetOverlap(pc,overlap);
> >
> > PCSetType(pc, PCHYPRE); //HYPRE
> > PCHYPRESetType(pc, "boomeramg");
> >
> > //PCHYPRESetType(pc, "parasails");
> >
> > KSPSetPC(ksp, pc);
> > KSPSetTolerances(ksp, 1e-13,  1e-13,  PETSC_DEFAULT,  PETSC_DEFAULT) ;
> > KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);
> > KSPSetOperators(ksp, A, A);
> > KSPMonitorDefault(ksp, 1, 1, pVoid) ;
> >
> > KSPSolve(ksp,b,x);
> > VecGetArray(x, &getResultArray) ;
> > KSPGetResidualNorm(ksp, &endNorm) ;
> > KSPGetIterationNumber(ksp, &nIterations);
> >
> > KSPDestroy(&ksp);
> >
> >
> >  I know that this works and gives me the correct result. However, BoomerAMG is proving to be too slow, at least the way I use it, at 256 processors with around 12 million unknowns. I need your advice on the preconditioner. Am I using it right? Is there anything else I could be doing better?
> >
> > ganesh
> 


