[petsc-users] Need help improving working PETSC code
Barry Smith
bsmith at mcs.anl.gov
Fri Jul 10 13:32:09 CDT 2015
> On Jul 10, 2015, at 12:34 PM, Ganesh Vijayakumar <ganesh.iitm at gmail.com> wrote:
>
> Hello,
>
> On Thu, Jul 9, 2015 at 7:32 PM Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> Ok, it is block Jacobi with ICC on each block (one per process) so -ksp_type cg -pc_type bjacobi -sub_pc_type icc with PETSc should give similar results to what they get.
>
> >
> > Where is all the data? It should list all the events and the time it spends in each. Did you use PetscOptionsSetValue() to provide -log_summary? That won't work; you need to pass it on the command line, in the PETSC_OPTIONS environment variable, or in a file called petscrc.
>
> Using Petsc Release Version 3.5.3, Jan, 31, 2015
>
> Max Max/Min Avg Total
> Time (sec): 9.756e+01 1.00369 9.726e+01
> Objects: 4.500e+01 1.00000 4.500e+01
> Flops: 1.256e+08 1.17291 1.184e+08 3.031e+10
> Flops/sec: 1.292e+06 1.17364 1.217e+06 3.116e+08
> MPI Messages: 3.956e+03 21.50000 1.167e+03 2.986e+05
> MPI Message Lengths: 6.769e+06 4.61934 3.787e+03 1.131e+09
> MPI Reductions: 3.120e+02 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N --> 2N flops
> and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts %Total Avg %Total counts %Total
> 0: Main Stage: 9.7259e+01 100.0% 3.0311e+10 100.0% 2.986e+05 100.0% 3.787e+03 100.0% 3.110e+02 99.7%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in this phase
> %M - percent messages in this phase %L - percent message lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flops --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecMDot 75 1.0 1.1153e-01 2.0 2.28e+07 1.0 0.0e+00 0.0e+00 7.5e+01 0 19 0 0 24 0 19 0 0 24 51865
> VecNorm 105 1.0 2.6864e-01 1.1 1.02e+07 1.0 0.0e+00 0.0e+00 1.0e+02 0 8 0 0 34 0 8 0 0 34 9580
> VecScale 90 1.0 2.2329e-02 6.7 4.35e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 49394
> VecSet 121 1.0 1.1327e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 15 1.0 1.6739e-03 1.2 1.45e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 219629
> VecWAXPY 15 1.0 2.1994e-03 1.9 7.25e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 83578
> VecMAXPY 90 1.0 2.7625e-02 1.8 3.01e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 25 0 0 0 0 25 0 0 0 275924
> VecAssemblyBegin 30 1.0 1.2747e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 9.0e+01 0 0 0 0 29 0 0 0 0 29 0
> VecAssemblyEnd 30 1.0 5.1475e-04 26.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecScatterBegin 90 1.0 1.4369e-02 5.4 0.00e+00 0.0 2.9e+05 3.9e+03 0.0e+00 0 0 98 99 0 0 0 98 99 0 0
> VecScatterEnd 90 1.0 4.2581e-02 11.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMult 90 1.0 1.6290e-01 1.6 5.63e+07 1.5 2.9e+05 3.9e+03 0.0e+00 0 42 98 99 0 0 42 98 99 0 77813
> MatConvert 5 1.0 4.0061e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyBegin 10 1.0 1.2128e-01 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+01 0 0 0 0 6 0 0 0 0 6 0
> MatAssemblyEnd 10 1.0 5.6291e-02 1.0 0.00e+00 0.0 6.5e+03 9.6e+02 8.0e+00 0 0 2 1 3 0 0 2 1 3 0
> MatGetRowIJ 10 1.0 9.0599e-06 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatZeroEntries 5 1.0 5.0242e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatView 5 1.0 3.0882e-03 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00 0 0 0 0 2 0 0 0 0 2 0
> KSPGMRESOrthog 75 1.0 1.2176e-01 1.9 4.56e+07 1.0 0.0e+00 0.0e+00 7.5e+01 0 38 0 0 24 0 38 0 0 24 95014
> KSPSetUp 1 1.0 2.6391e-03 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 15 1.0 9.3209e+01 1.0 1.26e+08 1.2 2.9e+05 3.9e+03 1.8e+02 96 100 98 99 59 96 100 98 99 59 325
The next two lines are the important ones. The run spends 80% of its time setting up the hypre BoomerAMG preconditioner and 16% applying it. Everything else is negligible.
> PCSetUp 5 1.0 7.7425e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 80 0 0 0 1 80 0 0 0 1 0
> PCApply 75 1.0 1.5272e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16 0 0 0 0 0
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Vector 36 34 12361096 0
> Vector Scatter 1 1 1060 0
> Matrix 3 3 10175808 0
> Krylov Solver 1 1 18960 0
> Preconditioner 1 1 1096 0
> Viewer 1 0 0 0
> Index Set 2 2 38884 0
> ========================================================================================================================
> Average time to get PetscTime():
> Average time for MPI_Barrier(): 1.7786e-05
> Average time for zero size MPI_Send(): 0.000176195
The times for MPI_Barrier() and MPI_Send() are HUGE on your machine. This will limit how fast anything can run. I am surprised they are so large; isn't Stampede supposed to be a high-end parallel machine?
> #PETSc Option Table entries:
> -info blah
> -log_summary
> -mat_view ::ascii_info
> -parallel
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-x=0 -with-pic --with-external-packages-dir=/opt/apps/intel13/mvapich2_1_9/petsc/3.5/externalpackages --with-mpi-compilers=1 --with-mpi-dir=/opt/apps/intel13/mvapich2/1.9 --with-scalar-type=real --with-shared-libraries=1 --with-precision=double --with-hypre=1 --download-hypre --with-ml=1 --download-ml --with-ml=1 --download-ml --with-superlu_dist=1 --download-superlu_dist --with-superlu=1 --download-superlu --with-parmetis=1 --download-parmetis --with-metis=1 --download-metis --with-spai=1 --download-spai --with-mumps=1 --download-mumps --with-parmetis=1 --download-parmetis --with-metis=1 --download-metis --with-scalapack=1 --download-scalapack --with-blacs=1 --download-blacs --with-spooles=1 --download-spooles --with-hdf5=1 --with-hdf5-dir=/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9 --with-debugging=no --with-blas-lapack-dir=/opt/apps/intel/13/composer_xe_2013.2.146/mkl --with-mpiexec=mpirun_rsh --COPTFLAGS= --FOPTFLAGS= --CXXOPTFLAGS=
You should set --COPTFLAGS=, --FOPTFLAGS=, and --CXXOPTFLAGS= to at least -O1, maybe -O3. Currently you are compiling without optimization, which is BAD.
> -----------------------------------------
> Libraries compiled on Thu Apr 2 10:06:57 2015 on staff.stampede.tacc.utexas.edu
> Machine characteristics: Linux-2.6.32-431.17.1.el6.x86_64-x86_64-with-centos-6.6-Final
> Using PETSc directory: /opt/apps/intel13/mvapich2_1_9/petsc/3.5
> Using PETSc arch: sandybridge
> -----------------------------------------
>
> Using C compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpicc -fPIC -wd1572 ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpif90 -fPIC ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
>
> Using include paths: -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include -I/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/include -I/opt/apps/intel13/mvapich2/1.9/include
> -----------------------------------------
> Using C linker: /opt/apps/intel13/mvapich2/1.9/bin/mpicc
> Using Fortran linker: /opt/apps/intel13/mvapich2/1.9/bin/mpif90
> Using libraries: -Wl,-rpath,/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib -L/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib -lpetsc -Wl,-rpath,/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib -L/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib -lsuperlu_4.3 -lHYPRE -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib -L/opt/apps/intel13/mvapich2/1.9/lib -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 -L/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -lmpichcxx -lml -lmpichcxx -lspai -lsuperlu_dist_3.3 -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lscalapack -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/mkl/lib/intel64 -L/opt/apps/intel/13/composer_xe_2013.2.146/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -lparmetis -lmetis -lpthread -lssl -lcrypto -Wl,-rpath,/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib -L/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lmpichf90 -lifport -lifcore -lm -lmpichcxx -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib -ldl -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib -L/opt/apps/intel13/mvapich2/1.9/lib -lmpich -lopa -lmpl -libmad -lrdmacm -libumad -libverbs -lrt -llimic2 -lpthread -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib -L/opt/apps/intel13/mvapich2/1.9/lib 
-Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 -L/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64 -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib -limf -lsvml -lirng -lipgo -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib -L/opt/apps/intel13/mvapich2/1.9/lib -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 -L/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -ldl
> -----------------------------------------
>
> Finalising parallel run
>
>
> > 1. How do I tell PETSc that my matrix is symmetric? I tried setting up my matrix as follows, but am apprehensive about it.
> >
> > MatCreateSBAIJ(PETSC_COMM_WORLD, 1, nCellsCurProc, nCellsCurProc, nTotalCells, nTotalCells, 10, NULL, 5, NULL, &A);
> >
> > Could I still use MatSetValue on both the upper and lower triangular parts of the matrix? Will PETSc understand that it's redundant?
>
> Yes, run with -mat_ignore_lower_triangular or call MatSetOption(mat,MAT_IGNORE_LOWER_TRIANGULAR,PETSC_TRUE)
>
> This is very useful.. thanks.
>
> I have a question on setting block sizes. Should I create 1 block per processor?
No, the block size has nothing to do with parallelism; it is 1 in your case because you are solving a scalar PDE (pressure).
> If so what do I set the d_nz and o_nz as? Right now I allocate memory for 10 non-zero elements per row that are local to the processor and 5 non-zero elements that are non-local. So my understanding was that
>
> MatCreateSBAIJ(PETSC_COMM_WORLD, 1, nCellsCurProc, nCellsCurProc, nTotalCells, nTotalCells, 10, NULL, 5, NULL, &A);
>
> should become
>
> MatCreateSBAIJ(PETSC_COMM_WORLD, nCellsCurProc, nCellsCurProc, nCellsCurProc, nTotalCells, nTotalCells, 10*nCellsCurProc, NULL, 5*nCellsCurProc, NULL, &A);
>
>
> But PETSc doesn't seem to like this. It complains that it's out of memory and throws a whole lot of error messages. Clearly something's wrong. Could you please tell me what it is?
>
> > 2. Do I need PCFactorSetShiftType(pc,MAT_SHIFT_POSITIVE_DEFINITE); ?
>
> I hope not. But you might.
>
> Ok. I tried with and without it... doesn't seem to make a difference. So off for now. Will turn it on if necessary.
>
> > 3. What does KSPSetReusePreconditioner(ksp, PETSC_TRUE) do? Should I use it?
>
> Not at first. What it does is not build a new preconditioner for each solve. If the matrix is changing "slowly" you can often get away with setting this for some number of linear solves, then set it back to false for the next solve, then set it to true again for some number of linear solves. You could try it with hypre, say keeping it the same for 10, 50, or 100 solves, and see what happens time-wise.
>
> This was most useful. I did two things. First I shifted the creation of the KSP object to the initialization stage. So no more creation and deletion of KSP objects. Second I set ReusePreconditioner to true when the matrix changes and false when it doesn't. All of this got my execution time down from 250s to about 103s! I think that's great. Thanks again.
>
> ganesh