understanding the output from -info
Barry Smith
bsmith at mcs.anl.gov
Sat Feb 10 21:26:07 CST 2007
My recommendation is just to try to optimize the sequential runs
by using the most appropriate solver algorithms and the best sequential
processor with the fastest memory and the slickest code.
Parallel computing is for solving big problems, not for solving little problems
fast. (Anything less than 100k unknowns, or even more, is in my opinion small.)
Barry
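
A rough sketch of the kind of experimentation this suggests, using the ex2f driver
discussed below (the -ksp_type/-pc_type choices here are only illustrative, and the
-m/-n options are assumed to be the ones ex2f reads for the grid size):

    # serial run of the 800x800 case, profiling one Krylov method / preconditioner
    ./ex2f -m 800 -n 800 -ksp_type gmres -pc_type ilu -log_summary

    # the same case with a different combination, for comparison
    ./ex2f -m 800 -n 800 -ksp_type bcgs -pc_type sor -log_summary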
On Sun, 11 Feb 2007, Ben Tay wrote:
> Hi,
>
> In other words, for my CFD code, it is not possible to parallelize it
> effectively because the problem is too small?
>
> Is this true for all parallel solvers, or just PETSc? I was hoping to reduce
> the runtime, since mine is an unsteady problem which requires many time steps
> to reach a periodic state, and it takes many hours to get there.
>
> Lastly, if I run on 2 processors, is an improvement likely?
>
> Thank you.
>
>
> On 2/11/07, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >
> >
> > On Sat, 10 Feb 2007, Ben Tay wrote:
> >
> > > Hi,
> > >
> > > I've repeated the test with n,m = 800. Now serial takes around 11 mins while
> > > parallel with 4 processors took 6 mins. Does it mean that the problem must be
> > > pretty large before it is advantageous to use parallel? Moreover, 800x800
> > > means there's 640000 unknowns. My problem is a 2D CFD code which typically
> > > has 200x80=16000 unknowns. Does it mean that I won't be able to benefit from
> >                                                                  ^^^^^^^^^^^
> >    You'll never get much performance past 2 processors; it's not even worth
> > all the work of having a parallel code in this case. I'd just optimize the
> > heck out of the serial code.
> >
> > Barry
> >
> >
> >
> > > running in parallel?
> > >
> > > Btw, this is the parallel run's log_summary:
> > >
> > >
> > > Event                Count      Time (sec)     Flops/sec                         --- Global ---  --- Stage ---   Total
> > >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > > ------------------------------------------------------------------------------------------------------------------------
> > >
> > > --- Event Stage 0: Main Stage
> > >
> > > MatMult          1265 1.0 7.0615e+01 1.2 3.22e+07 1.2 7.6e+03 6.4e+03 0.0e+00 16 11100100 0 16 11100100 0 103
> > > MatSolve         1265 1.0 4.7820e+01 1.2 4.60e+07 1.2 0.0e+00 0.0e+00 0.0e+00 11 11 0 0 0 11 11 0 0 0 152
> > > MatLUFactorNum      1 1.0 2.5703e-01 2.3 1.27e+07 2.3 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 22
> > > MatILUFactorSym     1 1.0 1.8933e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > MatAssemblyBegin    1 1.0 4.2153e-01 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > MatAssemblyEnd      1 1.0 3.6475e-01 1.5 0.00e+00 0.0 6.0e+00 3.2e+03 1.3e+01 0 0 0 0 0 0 0 0 0 0 0
> > > MatGetOrdering      1 1.0 1.2088e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > VecMDot          1224 1.0 1.5314e+02 1.2 4.63e+07 1.2 0.0e+00 0.0e+00 1.2e+03 36 36 0 0 31 36 36 0 0 31 158
> > > VecNorm          1266 1.0 1.0215e+02 1.1 4.31e+06 1.1 0.0e+00 0.0e+00 1.3e+03 24 2 0 0 33 24 2 0 0 33 16
> > > VecScale         1265 1.0 3.7467e+00 1.5 8.34e+07 1.5 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 216
> > > VecCopy            41 1.0 2.5530e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > VecSet           1308 1.0 3.2717e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> > > VecAXPY            82 1.0 5.3338e-01 2.8 1.40e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 197
> > > VecMAXPY         1265 1.0 4.6234e+01 1.2 1.74e+08 1.2 0.0e+00 0.0e+00 0.0e+00 10 38 0 0 0 10 38 0 0 0 557
> > > VecScatterBegin  1265 1.0 1.5684e-01 1.6 0.00e+00 0.0 7.6e+03 6.4e+03 0.0e+00 0 0100100 0 0 0100100 0 0
> > > VecScatterEnd    1265 1.0 4.3167e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0
> > > VecNormalize     1265 1.0 1.0459e+02 1.1 6.21e+06 1.1 0.0e+00 0.0e+00 1.3e+03 25 4 0 0 32 25 4 0 0 32 23
> > > KSPGMRESOrthog   1224 1.0 1.9035e+02 1.1 7.00e+07 1.1 0.0e+00 0.0e+00 1.2e+03 45 72 0 0 31 45 72 0 0 31 254
> > > KSPSetup            2 1.0 5.1674e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0
> > > KSPSolve            1 1.0 4.0269e+02 1.0 4.16e+07 1.0 7.6e+03 6.4e+03 3.9e+03 99100100100 99 99100100100 99 166
> > > PCSetUp             2 1.0 4.5924e-01 2.6 8.23e+06 2.6 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 12
> > > PCSetUpOnBlocks     1 1.0 4.5847e-01 2.6 8.26e+06 2.6 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 13
> > > PCApply          1265 1.0 5.0990e+01 1.2 4.33e+07 1.2 0.0e+00 0.0e+00 1.3e+03 12 11 0 0 32 12 11 0 0 32 143
> > >
> > > ------------------------------------------------------------------------------------------------------------------------
> > >
> > > Memory usage is given in bytes:
> > >
> > > Object Type          Creations   Destructions   Memory  Descendants' Mem.
> > >
> > > --- Event Stage 0: Main Stage
> > >
> > > Matrix 4 4 643208 0
> > > Index Set 5 5 1924296 0
> > > Vec 41 41 47379984 0
> > > Vec Scatter 1 1 0 0
> > > Krylov Solver 2 2 16880 0
> > > Preconditioner 2 2 196 0
> > >
> > ========================================================================================================================
> > > Average time to get PetscTime(): 1.00136e-06
> > > Average time for MPI_Barrier(): 4.00066e-05
> > > Average time for zero size MPI_Send(): 1.70469e-05
> > > OptionTable: -log_summary
> > > Compiled without FORTRAN kernels
> > > Compiled with full precision matrices (default)
> > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4
> > > sizeof(PetscScalar) 8
> > > Configure run at: Thu Jan 18 12:23:31 2007
> > > Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared
> > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32
> > > --with-mpi-dir=/opt/mpich/myrinet/intel/
> > > -----------------------------------------
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 2/10/07, Ben Tay <zonexo at gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I tried to use ex2f.F as a test code. I've changed the numbers n,m from 3
> > > > to 500 each. I ran the code using 1 processor and then with 4 processors. I
> > > > then repeated the same with the following modification:
> > > >
> > > > do i=1,10
> > > >
> > > >    call KSPSolve(ksp,b,x,ierr)
> > > >
> > > > end do
> > > >
> > > > I've added a do loop to make the solve repeat 10 times.
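
For reference, a minimal sketch of how such a 1-versus-4-process comparison is
usually launched; the mpirun path is assumed from the --with-mpi-dir reported
further down, and a batch system may be required on the actual cluster:

    # one process
    /opt/mpich/myrinet/intel/bin/mpirun -np 1 ./ex2f -log_summary

    # four processes on the same problem
    /opt/mpich/myrinet/intel/bin/mpirun -np 4 ./ex2f -log_summary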
> > > >
> > > > In both cases, the serial code is faster, e.g. one takes 2.4 min while the
> > > > other takes 3.3 min.
> > > >
> > > > Here's the log_summary:
> > > >
> > > >
> > > > ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> > > >
> > > > ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by
> > > > g0306332 Sat Feb 10 16:21:36 2007
> > > > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST 2007
> > > > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80
> > > >
> > > > Max Max/Min Avg Total
> > > > Time (sec): 2.213e+02 1.00051 2.212e+02
> > > > Objects: 5.500e+01 1.00000 5.500e+01
> > > > Flops: 4.718e+09 1.00019 4.718e+09 1.887e+10
> > > > Flops/sec: 2.134e+07 1.00070 2.133e+07 8.531e+07
> > > >
> > > > Memory: 3.186e+07 1.00069 1.274e+08
> > > > MPI Messages: 1.832e+03 2.00000 1.374e+03 5.496e+03
> > > > MPI Message Lengths: 7.324e+06 2.00000 3.998e+03 2.197e+07
> > > > MPI Reductions: 7.112e+02 1.00000
> > > >
> > > > Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> > > >                           e.g., VecAXPY() for real vectors of length N --> 2N flops
> > > >                           and VecAXPY() for complex vectors of length N --> 8N flops
> > > >
> > > > Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
> > > >                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> > > >  0:      Main Stage: 2.2120e+02 100.0%  1.8871e+10 100.0%  5.496e+03 100.0%  3.998e+03      100.0%  2.845e+03 100.0%
> > > >
> > > >
> > > >
> > > >
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > > > See the 'Profiling' chapter of the users' manual for details on interpreting output.
> > > > Phase summary info:
> > > >    Count: number of times phase was executed
> > > >    Time and Flops/sec: Max - maximum over all processors
> > > >                        Ratio - ratio of maximum to minimum over all processors
> > > >    Mess: number of messages sent
> > > >    Avg. len: average message length
> > > >    Reduct: number of global reductions
> > > >    Global: entire computation
> > > >    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> > > >       %T - percent time in this phase         %F - percent flops in this phase
> > > >       %M - percent messages in this phase     %L - percent message lengths in this phase
> > > >       %R - percent reductions in this phase
> > > >    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
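
To make the Total Mflop/s formula above concrete with numbers from this run: the
KSPSolve row further down reports a maximum time of 2.1892e+02 s, and the global
flop total above is 1.8871e+10, so 1.8871e+10 / 2.1892e+02 is about 8.6e+07 flops/s,
i.e. the 86 Mflop/s shown for KSPSolve, or roughly 22 Mflop/s per process on 4
processes.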
> > > >
> > > >
> > > >
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > > >
> > > > ##########################################################
> > > > # #
> > > > # WARNING!!! #
> > > > # #
> > > > # This code was compiled with a debugging option, #
> > > > # To get timing results run config/configure.py #
> > > > # using --with-debugging=no, the performance will #
> > > > # be generally two or three times faster. #
> > > > # #
> > > > ##########################################################
> > > >
> > > >
> > > >
> > > >
> > > > ##########################################################
> > > > # #
> > > > # WARNING!!! #
> > > > # #
> > > > # This code was run without the PreLoadBegin() #
> > > > # macros. To get timing results we always recommend #
> > > > # preloading. otherwise timing numbers may be #
> > > > # meaningless. #
> > > > ##########################################################
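
An aside on the first warning above: a sketch of the reconfigure it asks for,
simply reusing the configure options reported at the end of this log and adding
the debugging switch (the directory paths are specific to this installation):

    ./config/configure.py --with-vendor-compilers=intel --with-x=0 --with-shared \
        --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 \
        --with-mpi-dir=/opt/mpich/myrinet/intel/ --with-debugging=no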
> > > >
> > > >
> > > > Event                Count      Time (sec)     Flops/sec                         --- Global ---  --- Stage ---   Total
> > > >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > > > ------------------------------------------------------------------------------------------------------------------------
> > > >
> > > > --- Event Stage 0: Main Stage
> > > >
> > > > MatMult           915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03 4.0e+03 0.0e+00 18 11100100 0 18 11100100 0 46
> > > > MatSolve          915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00 0.0e+00 0.0e+00 7 11 0 0 0 7 11 0 0 0 131
> > > > MatLUFactorNum      1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 43
> > > > MatILUFactorSym     1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > MatAssemblyBegin    1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > MatAssemblyEnd      1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00 2.0e+03 1.3e+01 1 0 0 0 0 1 0 0 0 0 0
> > > > MatGetOrdering      1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > VecMDot           885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00 0.0e+00 8.8e+02 36 36 0 0 31 36 36 0 0 31 80
> > > > VecNorm           916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00 0.0e+00 9.2e+02 29 2 0 0 32 29 2 0 0 32 7
> > > > VecScale          915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 200
> > > > VecCopy            30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > VecSet            947 1.0 7.8979e-01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > VecAXPY            60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 542
> > > > VecMAXPY          915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00 0.0e+00 0.0e+00 6 38 0 0 0 6 38 0 0 0 483
> > > > VecScatterBegin   915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03 4.0e+03 0.0e+00 0 0100100 0 0 0100100 0 0
> > > > VecScatterEnd     915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 0
> > > > VecNormalize      915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00 0.0e+00 9.2e+02 30 4 0 0 32 30 4 0 0 32 10
> > > > KSPGMRESOrthog    885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00 0.0e+00 8.8e+02 42 72 0 0 31 42 72 0 0 31 138
> > > > KSPSetup            2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0
> > > > KSPSolve            1 1.0 2.1892e+02 1.0 2.15e+07 1.0 5.5e+03 4.0e+03 2.8e+03 99100100100 99 99100100100 99 86
> > > > PCSetUp             2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 30
> > > > PCSetUpOnBlocks     1 1.0 7.2706e-02 1.3 9.97e+06 1.3 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 31
> > > > PCApply           915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00 0.0e+00 9.2e+02 7 11 0 0 32 7 11 0 0 32 124
> > > >
> > > > ------------------------------------------------------------------------------------------------------------------------
> > > >
> > > >
> > > > Memory usage is given in bytes:
> > > >
> > > > Object Type          Creations   Destructions   Memory  Descendants' Mem.
> > > >
> > > > --- Event Stage 0: Main Stage
> > > >
> > > > Matrix 4 4 252008 0
> > > > Index Set 5 5 753096 0
> > > > Vec 41 41 18519984 0
> > > > Vec Scatter 1 1 0 0
> > > > Krylov Solver 2 2 16880 0
> > > > Preconditioner 2 2 196 0
> > > >
> > ========================================================================================================================
> > > >
> > > > Average time to get PetscTime(): 1.09673e-06
> > > > Average time for MPI_Barrier(): 4.18186e-05
> > > > Average time for zero size MPI_Send(): 2.62856e-05
> > > > OptionTable: -log_summary
> > > > Compiled without FORTRAN kernels
> > > > Compiled with full precision matrices (default)
> > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4
> > > > sizeof(PetscScalar) 8
> > > > Configure run at: Thu Jan 18 12:23:31 2007
> > > > Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared
> > > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32
> > > > --with-mpi-dir=/opt/mpich/myrinet/intel/
> > > > -----------------------------------------
> > > > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on atlas1.nus.edu.sg
> > > > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 SMP
> > > > Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux
> > > > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8
> > > > Using PETSc arch: linux-mpif90
> > > > -----------------------------------------
> > > > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g
> > > > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w
> > > > -----------------------------------------
> > > > Using include paths: -I/nas/lsftmp/g0306332/petsc-2.3.2-p8
> > > > -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/bmake/linux-mpif90
> > > > -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/include
> > > > -I/opt/mpich/myrinet/intel/include
> > > > ------------------------------------------
> > > > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g
> > > > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w
> > > > Using libraries:
> > > > -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90
> > > > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts
> > > > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
> > > > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32
> > > > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide
> > > > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib
> > > > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm
> > > > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > -L/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts
> > -lcxa
> > > > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib
> > -lPEPCF90
> > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > -L/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90
> > -lm -Wl,-rpath,\
> > > > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread
> > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > -L/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl
> > > > ------------------------------------------
> > > >
> > > > So is there something wrong with the server's MPI implementation?
> > > >
> > > > Thank you.
> > > >
> > > >
> > > >
> > > > On 2/10/07, Satish Balay <balay at mcs.anl.gov> wrote:
> > > > >
> > > > > Looks like MatMult = 24 sec. Out of this, the scatter time is 22 sec.
> > > > > Either something is wrong with your run - or MPI is really broken.
> > > > > Satish
> > > > >
> > > > > > > > MatMult          3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 2.4e+04 1.3e+03
> > > > > > > > VecScatterBegin  3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 2.4e+04 1.3e+03
> > > > > > > > VecScatterEnd    3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> > > > >
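
One quick sanity check related to Satish's point above, sketched under the
assumption that mpirun lives under the --with-mpi-dir reported by configure:
confirm that the job really uses the Myrinet MPICH build PETSc was compiled
against, and look at the raw MPI timings that -log_summary prints.

    # is the mpirun being used the Myrinet MPICH PETSc was configured with?
    which mpirun
    ls /opt/mpich/myrinet/intel/bin/mpirun

    # rerun and inspect the raw MPI timings reported near the end of -log_summary
    /opt/mpich/myrinet/intel/bin/mpirun -np 4 ./ex2f -log_summary | grep "Average time"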
> > > > >
> > > >
> > >
> >
> >
>