Subject: assembly
From: Matthew Knepley <knepley at gmail.com>
Date: Mon Feb 4 10:47:44 CST 2008
On Feb 4, 2008 10:41 AM, Thomas Geenen <geenen at gmail.com> wrote:
> On Feb 3, 2008 8:51 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > Hmmm, are you saying the first round of setting values still
> > takes much longer than the second round?
>
> yes
>
> > Or is the time
> > in MatAssemblyBegin() much longer the first time?
> >
> > The MatAssembly process has one piece of code whose
> > work is order n*size, where n is the stash size and size is the
> > number of processes; all other work is only order n.
> >
> > Could you send the -log_summary output?
>
> The timing is cumulative, I guess?
> In between these two solves I solve a smaller system for which I do
> not include the timing.
I am having a little trouble reading this. I think the easiest thing to do
is to wrap the two sections of code in their own logging stages:
PetscLogStageRegister(&stage1, "First assembly");
PetscLogStageRegister(&stage2, "Second assembly");
PetscLogStagePush(stage1);
<code for first assembly>
PetscLogStagePop();
PetscLogStagePush(stage2);
<code for second assembly>
PetscLogStagePop();
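For concreteness, here is a minimal self-contained sketch of this stage
wrapping. It assumes the PETSc 2.3-era PetscLogStageRegister() argument order
shown above, and assembleMatrix() is a hypothetical stand-in for the
application's own MatSetValues()/MatAssembly round:

#include "petscmat.h"

/* Hypothetical helper standing in for the application's own
   MatSetValues() + MatAssemblyBegin/End() code. */
extern PetscErrorCode assembleMatrix(Mat A);

PetscErrorCode ProfileTwoAssemblies(Mat A)
{
  PetscLogStage  stage1, stage2;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* Each named stage becomes its own section in the -log_summary output. */
  ierr = PetscLogStageRegister(&stage1, "First assembly");CHKERRQ(ierr);
  ierr = PetscLogStageRegister(&stage2, "Second assembly");CHKERRQ(ierr);

  ierr = PetscLogStagePush(stage1);CHKERRQ(ierr);
  ierr = assembleMatrix(A);CHKERRQ(ierr);        /* first assembly round */
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = PetscLogStagePush(stage2);CHKERRQ(ierr);
  ierr = assembleMatrix(A);CHKERRQ(ierr);        /* second assembly round */
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  PetscFunctionReturn(0);
}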
Then we can also get a look at how many messages are sent
and how big they are.
Thanks,
Matt
> run 1
> Max Max/Min Avg Total
> Time (sec): 2.154e+02 1.00001 2.154e+02
> Objects: 2.200e+01 1.00000 2.200e+01
> Flops: 0.000e+00 0.00000 0.000e+00 0.000e+00
> Flops/sec: 0.000e+00 0.00000 0.000e+00 0.000e+00
> MPI Messages: 1.750e+01 1.25000 1.633e+01 9.800e+01
> MPI Message Lengths: 3.460e+06 1.29903 1.855e+05 1.818e+07
> MPI Reductions: 4.167e+00 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length
> N --> 2N flops
> and VecAXPY() for complex vectors of
> length N --> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flops ----- ---
> Messages --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts
> %Total Avg %Total counts %Total
> 0: Main Stage: 2.1537e+02 100.0% 0.0000e+00 0.0% 9.800e+01
> 100.0% 1.855e+05 100.0% 2.500e+01 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops/sec: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in this phase
> %M - percent messages in this phase %L - percent message
> lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event Count Time (sec) Flops/sec
> --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg
> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatAssemblyBegin 1 1.0 2.9536e-01 4.0 0.00e+00 0.0 4.2e+01
> 4.2e+05 2.0e+00 0 0 43 98 8 0 0 43 98 8 0
> MatAssemblyEnd 1 1.0 2.1410e+02 1.0 0.00e+00 0.0 2.8e+01
> 8.2e+03 7.0e+00 99 0 29 1 28 99 0 29 1 28 0
> MatZeroEntries 1 1.0 9.3739e-02 5.9 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 4 1.0 3.9721e-04 2.2 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 3 0 0 0
> Index Set 6 6 45500 0
> Vec 6 1 196776 0
> Vec Scatter 3 0 0 0
> IS L to G Mapping 2 0 0 0
> Krylov Solver 1 0 0 0
> Preconditioner 1 0 0 0
> ========================================================================================================================
> Average time to get PetscTime(): 1.71661e-06
> Average time for MPI_Barrier(): 0.000159979
> Average time for zero size MPI_Send(): 1.29938e-05
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Fri Sep 28 23:34:20 2007
>
> run 2
> Max Max/Min Avg Total
> Time (sec): 2.298e+02 1.00000 2.298e+02
> Objects: 2.600e+02 1.00000 2.600e+02
> Flops: 1.265e+09 1.17394 1.161e+09 6.969e+09
> Flops/sec: 5.505e+06 1.17394 5.054e+06 3.032e+07
> MPI Messages: 1.436e+03 1.20816 1.326e+03 7.956e+03
> MPI Message Lengths: 2.120e+07 1.23141 1.457e+04 1.159e+08
> MPI Reductions: 4.192e+02 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length
> N --> 2N flops
> and VecAXPY() for complex vectors of
> length N --> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flops ----- ---
> Messages --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts
> %Total Avg %Total counts %Total
> 0: Main Stage: 2.2943e+02 99.8% 6.9689e+09 100.0% 7.944e+03
> 99.8% 1.457e+04 100.0% 2.230e+02 8.9%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops/sec: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in this phase
> %M - percent messages in this phase %L - percent message
> lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event Count Time (sec) Flops/sec
> --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg
> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult 135 1.0 2.0830e+00 1.4 2.30e+08 1.9 2.4e+03
> 1.3e+04 0.0e+00 1 26 30 26 0 1 26 30 26 0 862
> MatMultAdd 40 1.0 3.2598e-01 4.5 2.68e+07 5.7 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 31
> MatSolve 44 1.0 6.7841e-01 1.7 1.93e+08 1.7 0.0e+00
> 0.0e+00 0.0e+00 0 7 0 0 0 0 7 0 0 0 672
> MatRelax 80 1.0 4.2949e+00 1.6 8.77e+07 1.2 2.2e+03
> 1.3e+04 0.0e+00 1 23 28 26 0 1 23 28 26 0 374
> MatLUFactorSym 1 1.0 7.6739e-02 1.1 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLUFactorNum 2 1.0 2.9370e+00 1.5 2.53e+08 1.5 0.0e+00
> 0.0e+00 0.0e+00 1 44 0 0 0 1 44 0 0 0 1037
> MatILUFactorSym 1 1.0 6.7334e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyBegin 7 1.0 7.1652e-01 4.4 0.00e+00 0.0 1.3e+02
> 2.9e+05 8.0e+00 0 0 2 32 0 0 0 2 32 4 0
> MatAssemblyEnd 7 1.0 2.1473e+02 1.0 0.00e+00 0.0 8.4e+01
> 3.2e+03 2.2e+01 93 0 1 0 1 94 0 1 0 10 0
> MatGetRowIJ 2 1.0 1.8899e-03 1.1 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetSubMatrice 1 1.0 3.1915e-02 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 1 0
> MatGetOrdering 2 1.0 1.2184e-02 1.6 0.00e+00 0.0 0.0e+00
> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
> MatIncreaseOvrlp 1 1.0 7.6865e-02 1.0 0.00e+00 0.0 1.1e+03
> 2.4e+03 2.0e+01 0 0 13 2 1 0 0 13 2 9 0
> MatZeroEntries 3 1.0 1.0429e-01 4.7 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MAT_GetRedundantMatrix 1 1.0 3.0144e-02 1.2 0.00e+00 0.0 9.0e+01
> 8.0e+04 2.0e+00 0 0 1 6 0 0 0 1 6 1 0
> VecDot 39 1.0 9.0680e-01 140.0 2.95e+08 191.3 0.0e+00
> 0.0e+00 3.9e+01 0 0 0 0 2 0 0 0 0 17 11
> VecMDot 8 1.0 2.8777e-03 2.8 4.16e+07 3.3 0.0e+00
> 0.0e+00 8.0e+00 0 0 0 0 0 0 0 0 0 4 84
> VecNorm 31 1.0 6.2301e-02 4.2 7.17e+07 5.8 0.0e+00
> 0.0e+00 3.1e+01 0 0 0 0 1 0 0 0 0 14 86
> VecScale 85 1.0 1.7729e-03 1.4 4.94e+08 1.4 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2138
> VecCopy 4 1.0 6.2108e-04 1.5 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 139 1.0 3.5934e-03 1.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 98 1.0 5.1496e-03 1.3 9.24e+08 1.1 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 4705
> VecAYPX 40 1.0 1.0311e-02 1.4 1.90e+08 1.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 969
> VecWAXPY 75 1.0 2.6060e-02 1.4 1.22e+08 1.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 618
> VecMAXPY 9 1.0 2.1315e-04 1.6 3.01e+08 1.2 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1355
> VecAssemblyBegin 4 1.0 1.9898e-03 1.2 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 5 0
> VecAssemblyEnd 4 1.0 2.1219e-05 1.2 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecScatterBegin 267 1.0 1.0224e-01 1.6 0.00e+00 0.0 6.3e+03
> 1.1e+04 0.0e+00 0 0 79 59 0 0 0 79 59 0 0
> VecScatterEnd 267 1.0 7.8653e-01 11.3 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPGMRESOrthog 4 1.0 3.7677e-03 1.9 4.78e+07 2.4 0.0e+00
> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 5 141
> KSPSetup 6 1.0 1.0260e-02 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
> KSPSolve 2 1.0 7.8238e+00 1.0 1.16e+08 1.2 5.9e+03
> 1.1e+04 8.1e+01 3 70 75 54 3 3 70 75 54 36 622
> PCSetUp 3 1.0 5.1323e+00 1.2 1.29e+08 1.2 1.6e+03
> 9.1e+03 7.3e+01 2 47 20 13 3 2 47 20 13 33 636
> PCSetUpOnBlocks 1 1.0 1.3325e+00 1.0 1.47e+08 1.0 0.0e+00
> 0.0e+00 3.0e+00 1 17 0 0 0 1 17 0 0 1 871
> PCApply 44 1.0 5.7917e+00 1.1 9.60e+07 1.2 4.7e+03
> 1.0e+04 0.0e+00 2 41 59 42 0 2 41 59 42 0 497
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 16 0 0 0
> Index Set 36 29 256760 0
> Vec 176 93 16582464 0
> Vec Scatter 10 0 0 0
> IS L to G Mapping 4 0 0 0
> Krylov Solver 6 0 0 0
> Preconditioner 6 0 0 0
> Viewer 4 2 0 0
> Container 2 0 0 0
> ========================================================================================================================
> Average time to get PetscTime(): 8.10623e-07
> Average time for MPI_Barrier(): 0.000178194
> Average time for zero size MPI_Send(): 1.33117e-05
> OptionTable: -mg_levels_ksp_type richardson
> OptionTable: -mg_levels_pc_sor_omega 1.05
> OptionTable: -mg_levels_pc_type sor
> OptionTable: -pc_ml_PrintLevel 4
> OptionTable: -pc_ml_maxNlevels 2
> OptionTable: -pc_type ml
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Fri Sep 28 23:34:20 2007
>
>
> >
> > Barry
> >
> >
> >
> > On Feb 3, 2008, at 6:44 AM, Thomas Geenen wrote:
> >
> > > I call
> > > ierr = MatStashSetInitialSize(A[*seqsolve], stash_size,
> > > stash_size);CHKERRQ(ierr);
> > > with 100,000,000 for the stash size to make sure that's not the
> > > bottleneck.
> > >
> > > The assembly time remains unchanged, however.
> > >
> > > nstash in MatAssemblyBegin_MPIAIJ (CPU=0) = 109485
> > > reallocs in MatAssemblyBegin_MPIAIJ = 0
> > >
> > > cheers
> > > Thomas
> > >
> > > On Saturday 02 February 2008 23:19, Barry Smith wrote:
> > >> The matstash has a concept of preallocation also. During the first
> > >> setvalues
> > >> it is allocating more and more memory for the stash. In the second
> > >> setvalues
> > >> the stash is large enough, so it does not require any additional
> > >> allocation.
> > >>
> > >> You can use the option -matstash_initial_size <size> to allocate
> > >> enough space
> > >> initially so that the first setvalues is also fast. It does not look
> > >> like there is a way
> > >> coded in to get the <size> that you should use. It should be set to the
> > >> maximum number of nonzeros
> > >> any process has that belong to other processes. The stash handling
> > >> code is
> > >> in src/mat/utils/matstash.c; perhaps you can figure out how to
> > >> print out the sizes needed with PetscInfo()?
> > >>
> > >>
> > >> Barry
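A minimal sketch of this stash preallocation, using the
MatStashSetInitialSize() call Thomas quotes above; stash_size is a
hypothetical upper bound on the number of entries this rank sets in rows
owned by other processes (the command-line route is
-matstash_initial_size <size>):

#include "petscmat.h"

/* Reserve stash space before the first round of MatSetValues() so the
   stash never grows by repeated reallocation.  The second argument sizes
   the scalar stash, the third the block stash. */
PetscErrorCode PreallocateStash(Mat A, PetscInt stash_size)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatStashSetInitialSize(A, stash_size, stash_size);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}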
> > >>
> > >> On Feb 2, 2008, at 12:30 PM, Thomas Geenen wrote:
> > >>> On Saturday 02 February 2008 18:33, Hong Zhang wrote:
> > >>>> On Sat, 2 Feb 2008, Thomas Geenen wrote:
> > >>>>> Dear Petsc users,
> > >>>>>
> > >>>>> I would like to understand what is slowing down the assembly phase
> > >>>>> of my matrix. I create a matrix with MatCreateMPIAIJ; I make a rough
> > >>>>> guess of the number of off-diagonal entries and then use a
> > >>>>> conservative value to make sure I do not need extra mallocs (the
> > >>>>> number of diagonal entries is exact).
> > >>>>> Next I call MatSetValues and MatAssemblyBegin, MatAssemblyEnd.
> > >>>>> The first time I call MatSetValues and MatAssemblyBegin,
> > >>>>> MatAssemblyEnd it takes about 170 seconds;
> > >>>>> the second time 0.3 seconds.
> > >>>>> I run it on 6 cpus and I do fill quite a number of row entries on
> > >>>>> the "wrong" cpu. However, that's also the case in the second run. I
> > >>>>> checked that there are no additional mallocs:
> > >>>>> MatGetInfo reports info.mallocs=0 both after MatSetValues and after
> > >>>>> MatAssemblyBegin, MatAssemblyEnd.
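A minimal sketch of the setup being described, with hypothetical local sizes
and application-supplied per-row nonzero counts (d_nnz/o_nnz); the
MatGetInfo() call at the end mirrors the info.mallocs check mentioned above:

#include "petscmat.h"

/* Create an MPIAIJ matrix with per-row preallocation, assemble it, and
   report whether assembly needed any extra mallocs.  m and n are local
   row/column counts; d_nnz and o_nnz are per-row estimates of the
   diagonal-block and off-diagonal-block nonzeros. */
PetscErrorCode BuildAndCheck(MPI_Comm comm, PetscInt m, PetscInt n,
                             PetscInt d_nnz[], PetscInt o_nnz[], Mat *A)
{
  MatInfo        info;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatCreateMPIAIJ(comm, m, n, PETSC_DETERMINE, PETSC_DETERMINE,
                         0, d_nnz, 0, o_nnz, A);CHKERRQ(ierr);

  /* ... MatSetValues() calls go here; entries destined for rows owned
     by other processes are accumulated in the stash ... */

  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* With sufficient preallocation this prints 0. */
  ierr = MatGetInfo(*A, MAT_LOCAL, &info);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_SELF, "mallocs during assembly: %g\n",
                     info.mallocs);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}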
> > >>>>
> > >>>> Run your code with the option '-log_summary' and check which
> > >>>> function
> > >>>> call dominates the execution time.
> > >>>
> > >>> the time is spent in MatStashScatterGetMesg_Private
> > >>>
> > >>>>> I run it on 6 cpus and I do fill quite a number of row entries on
> > >>>>> the "wrong" cpu.
> > >>>>
> > >>>> Likely, the communication that sends the entries to the
> > >>>> correct cpu consumes the time. Can you fill the entries on the
> > >>>> correct cpu?
> > >>>
> > >>> The second time the entries are filled on the wrong CPU as well.
> > >>> I am curious about the difference in time between run 1 and 2.
> > >>>
> > >>>> Hong
> > >>>>
> > >>>>> cheers
> > >>>>> Thomas
> > >
> >
> >
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener