assembly
Barry Smith
bsmith at mcs.anl.gov
Mon Feb 4 12:21:05 CST 2008
MatAssemblyEnd 1 1.0 1.4709e+02
In MatAssemblyEnd() all the messages carrying off-process values are
received, and MatSetValues() is then called with them on the local
matrices. The only way this can be taking such a huge amount of time is
if the preallocation was not done correctly. Could you please run with
-info and send all the output?
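
For reference, a minimal sketch of exact per-row preallocation with
MatCreateMPIAIJ() (the local size and the per-row counts below are
placeholders, not taken from your code; compute d_nnz/o_nnz from your
own mesh connectivity):

  PetscErrorCode ierr;
  PetscInt       i, nlocal = 1000;   /* locally owned rows (assumed value) */
  PetscInt       *d_nnz, *o_nnz;
  Mat            A;

  ierr = PetscMalloc(nlocal*sizeof(PetscInt),&d_nnz);CHKERRQ(ierr);
  ierr = PetscMalloc(nlocal*sizeof(PetscInt),&o_nnz);CHKERRQ(ierr);
  for (i=0; i<nlocal; i++) {
    d_nnz[i] = 9;  /* nonzeros of row i in locally owned columns            */
    o_nnz[i] = 3;  /* nonzeros coupling to columns owned by other processes */
  }
  /* The scalar d_nz/o_nz arguments are ignored when the per-row arrays
     are supplied, so 0 is passed for them. */
  ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD,nlocal,nlocal,
                         PETSC_DETERMINE,PETSC_DETERMINE,
                         0,d_nnz,0,o_nnz,&A);CHKERRQ(ierr);
  ierr = PetscFree(d_nnz);CHKERRQ(ierr);
  ierr = PetscFree(o_nnz);CHKERRQ(ierr);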
Barry
On Feb 4, 2008, at 11:34 AM, Thomas Geenen wrote:
> Hi Matt,
>
> This is indeed much clearer.
> I put the push and pop around MatAssemblyBegin/End:
>
> 1: First_assembly:  1.4724e+02 90.8%  0.0000e+00 0.0%  7.000e+01 0.9%  2.266e+03 15.5%  9.000e+00 0.2%
> 2: Second_assembly: 2.3823e-01  0.1%  0.0000e+00 0.0%  7.000e+01 0.9%  1.276e+02  0.9%  9.000e+00 0.2%
> 3: Third_assembly:  5.0168e-01  0.3%  0.0000e+00 0.0%  4.200e+01 0.5%  2.237e+03 15.4%  3.000e+00 0.1%
>
> The second assembly stage is another system of equations (the pressure
> correction in SIMPLER), so stages 1 and 3 correspond to assemblies 1
> and 2 of the main system.
>
> cheers
> Thomas
>
>
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> Unknown Name on a linux-gnu named etna.geo.uu.nl with 6 processors, by
> geenen Mon Feb 4 18:27:48 2008
> Using Petsc Release Version 2.3.3, Patch 3, Fri Jun 15 16:51:25 CDT
> 2007 HG revision: f051789beadcd36f77fb6111d20225e26ed7cc0d
>
> Max Max/Min Avg Total
> Time (sec): 1.621e+02 1.00000 1.621e+02
> Objects: 2.600e+02 1.00000 2.600e+02
> Flops: 1.265e+09 1.17394 1.161e+09 6.969e+09
> Flops/sec: 7.806e+06 1.17393 7.166e+06 4.300e+07
> MPI Messages: 1.436e+03 1.20816 1.326e+03 7.956e+03
> MPI Message Lengths: 2.120e+07 1.23141 1.457e+04 1.159e+08
> MPI Reductions: 9.862e+02 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length
> N --> 2N flops
> and VecAXPY() for complex vectors of
> length N --> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flops ----- ---
> Messages --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts
> %Total Avg %Total counts %Total
> 0: Main Stage: 1.3160e+01 8.1% 6.9689e+09 100.0% 7.762e+03
> 97.6% 9.941e+03 68.2% 2.020e+02 3.4%
> 1: First_assembly: 1.4724e+02 90.8% 0.0000e+00 0.0% 7.000e+01
> 0.9% 2.266e+03 15.5% 9.000e+00 0.2%
> 2: Second_assembly: 2.3823e-01 0.1% 0.0000e+00 0.0% 7.000e+01
> 0.9% 1.276e+02 0.9% 9.000e+00 0.2%
> 3: Third_assembly: 5.0168e-01 0.3% 0.0000e+00 0.0% 4.200e+01
> 0.5% 2.237e+03 15.4% 3.000e+00 0.1%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops/sec: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all
> processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in
> this phase
> %M - percent messages in this phase %L - percent message
> lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event Count Time (sec) Flops/sec
> --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg
> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult 135 1.0 1.5541e+00 1.3 3.03e+08 1.8 2.4e+03
> 1.3e+04 0.0e+00 1 26 30 26 0 11 26 30 38 0 1155
> MatMultAdd 40 1.0 3.2611e-01 8.3 3.64e+07 7.7 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 2 0 0 0 0 31
> MatSolve 44 1.0 6.7682e-01 1.7 1.94e+08 1.7 0.0e+00
> 0.0e+00 0.0e+00 0 7 0 0 0 4 7 0 0 0 673
> MatRelax 80 1.0 3.4453e+00 1.4 9.46e+07 1.1 2.2e+03
> 1.3e+04 0.0e+00 2 23 28 26 0 22 23 29 38 0 466
> MatLUFactorSym 1 1.0 6.7567e-02 1.1 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLUFactorNum 2 1.0 2.8804e+00 1.4 2.53e+08 1.4 0.0e+00
> 0.0e+00 0.0e+00 1 44 0 0 0 17 44 0 0 0 1058
> MatILUFactorSym 1 1.0 6.7676e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00 0 0 0 0 0 5 0 0 0 0 0
> MatAssemblyBegin 4 1.0 2.7711e-0237.3 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 1 0
> MatAssemblyEnd 4 1.0 2.4401e-02 1.2 0.00e+00 0.0 2.8e+01
> 4.7e+02 7.0e+00 0 0 0 0 0 0 0 0 0 3 0
> MatGetRowIJ 2 1.0 1.2948e-02 7.5 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetSubMatrice 1 1.0 2.8603e-02 1.1 0.00e+00 0.0 0.0e+00
> 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 1 0
> MatGetOrdering 2 1.0 2.3054e-02 2.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
> MatIncreaseOvrlp 1 1.0 8.1528e-02 1.0 0.00e+00 0.0 1.1e+03
> 2.4e+03 2.0e+01 0 0 13 2 0 1 0 14 3 10 0
> MatZeroEntries 3 1.0 3.4422e-02 1.8 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MAT_GetRedundantMatrix 1 1.0 3.5774e-02 1.6 0.00e+00 0.0 9.0e+01
> 8.0e+04 2.0e+00 0 0 1 6 0 0 0 1 9 1 0
> VecDot 39 1.0 5.8092e-0131.0 1.02e+0842.3 0.0e+00
> 0.0e+00 3.9e+01 0 0 0 0 1 2 0 0 0 19 17
> VecMDot 8 1.0 3.4735e-03 2.9 4.52e+07 5.5 0.0e+00
> 0.0e+00 8.0e+00 0 0 0 0 0 0 0 0 0 4 69
> VecNorm 31 1.0 3.8690e-02 4.1 1.11e+08 5.6 0.0e+00
> 0.0e+00 3.1e+01 0 0 0 0 1 0 0 0 0 15 139
> VecScale 85 1.0 1.7631e-03 1.2 5.59e+08 1.9 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2150
> VecCopy 4 1.0 5.5027e-04 1.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 139 1.0 3.0956e-03 1.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 98 1.0 5.0848e-03 1.3 9.35e+08 1.1 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 4765
> VecAYPX 40 1.0 1.0264e-02 1.4 2.01e+08 1.1 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 973
> VecWAXPY 75 1.0 2.6191e-02 1.4 1.22e+08 1.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 615
> VecMAXPY 9 1.0 2.1935e-04 1.7 2.93e+08 1.1 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1317
> VecAssemblyBegin 4 1.0 1.9331e-03 1.2 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 6 0
> VecAssemblyEnd 4 1.0 2.3842e-05 1.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecScatterBegin 267 1.0 1.0370e-01 1.5 0.00e+00 0.0 6.3e+03
> 1.1e+04 0.0e+00 0 0 79 59 0 1 0 81 86 0 0
> VecScatterEnd 267 1.0 4.1189e-01 3.1 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 2 0 0 0 0 0
> KSPGMRESOrthog 4 1.0 4.5178e-03 2.0 5.22e+07 3.8 0.0e+00
> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 6 117
> KSPSetup 6 1.0 7.9882e-03 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
> KSPSolve 2 1.0 6.6590e+00 1.0 1.36e+08 1.2 5.9e+03
> 1.1e+04 8.1e+01 4 70 75 54 1 51 70 76 80 40 731
> PCSetUp 3 1.0 5.0877e+00 1.2 1.29e+08 1.2 1.6e+03
> 9.1e+03 7.3e+01 3 47 20 13 1 34 47 21 19 36 642
> PCSetUpOnBlocks 1 1.0 1.3292e+00 1.0 1.46e+08 1.0 0.0e+00
> 0.0e+00 3.0e+00 1 17 0 0 0 10 17 0 0 1 873
> PCApply 44 1.0 4.7444e+00 1.1 1.16e+08 1.2 4.7e+03
> 1.0e+04 0.0e+00 3 41 59 42 0 34 41 60 61 0 607
>
> --- Event Stage 1: First_assembly
>
> MatAssemblyBegin 1 1.0 3.0375e-01 3.5 0.00e+00 0.0 4.2e+01
> 4.2e+05 2.0e+00 0 0 1 15 0 0 0 60 99 22 0
> MatAssemblyEnd 1 1.0 1.4709e+02 1.0 0.00e+00 0.0 2.8e+01
> 8.2e+03 7.0e+00 91 0 0 0 0 100 0 40 1 78 0
>
> --- Event Stage 2: Second_assembly
>
> MatAssemblyBegin 1 1.0 1.7451e-02 5.0 0.00e+00 0.0 4.2e+01
> 2.4e+04 2.0e+00 0 0 1 1 0 5 0 60 98 22 0
> MatAssemblyEnd 1 1.0 2.3056e-01 1.0 0.00e+00 0.0 2.8e+01
> 8.4e+02 7.0e+00 0 0 0 0 0 95 0 40 2 78 0
>
> --- Event Stage 3: Third_assembly
>
> MatAssemblyBegin 1 1.0 3.3676e-01 3.8 0.00e+00 0.0 4.2e+01
> 4.2e+05 2.0e+00 0 0 1 15 0 45 0100100 67 0
> MatAssemblyEnd 1 1.0 3.3125e-01 1.7 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00 0 0 0 0 0 55 0 0 0 33 0
>
> --- Event Stage 4: Unknown
>
>
> --- Event Stage 5: Unknown
>
>
> --- Event Stage 6: Unknown
>
>
> --- Event Stage 7: Unknown
>
>
> --- Event Stage 8: Unknown
>
>
> --- Event Stage 9: Unknown
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants'
> Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 16 0 0 0
> Index Set 32 25 207888 0
> Vec 172 91 16374224 0
> Vec Scatter 8 0 0 0
> IS L to G Mapping 4 0 0 0
> Krylov Solver 6 0 0 0
> Preconditioner 6 0 0 0
> Viewer 4 2 0 0
> Container 2 0 0 0
>
> --- Event Stage 1: First_assembly
>
> Index Set 2 2 44140 0
> Vec 2 1 196776 0
> Vec Scatter 1 0 0 0
>
> --- Event Stage 2: Second_assembly
>
> Index Set 2 2 4732 0
> Vec 2 1 11464 0
> Vec Scatter 1 0 0 0
>
> --- Event Stage 3: Third_assembly
>
>
> --- Event Stage 4: Unknown
>
>
> --- Event Stage 5: Unknown
>
>
> --- Event Stage 6: Unknown
>
>
> --- Event Stage 7: Unknown
>
>
> --- Event Stage 8: Unknown
>
>
> --- Event Stage 9: Unknown
>
> ========================================================================================================================
> Average time to get PetscTime(): 8.82149e-07
> Average time for MPI_Barrier(): 0.000153208
> Average time for zero size MPI_Send(): 1.86761e-05
> OptionTable: -mg_levels_ksp_type richardson
> OptionTable: -mg_levels_pc_sor_omega 1.05
> OptionTable: -mg_levels_pc_type sor
> OptionTable: -pc_ml_PrintLevel 4
> OptionTable: -pc_ml_maxNlevels 2
> OptionTable: -pc_type ml
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Fri Sep 28 23:34:20 2007
>
> On Feb 4, 2008 5:47 PM, Matthew Knepley <knepley at gmail.com> wrote:
>> On Feb 4, 2008 10:41 AM, Thomas Geenen <geenen at gmail.com> wrote:
>>> On Feb 3, 2008 8:51 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>
>>>> Hmmm, are you saying the first round of setting values still
>>>> takes much longer than the second round?
>>>
>>> yes
>>>
>>>> Or is it that the time
>>>> in MatAssemblyBegin() is much longer the first time?
>>>>
>>>> The MatAssembly process has one piece of code whose
>>>> work is order n*size, where n is the stash size and size is the
>>>> number of processes; all other work is only order n.
>>>>
>>>> Could you send the -log_summary output?
>>>
>>> The timing is cumulative, I guess?
>>> In between these two solves I solve a smaller system for which I do
>>> not include the timing.
>>
>> I am having a little trouble reading this. I think the easiest
>> thing to do is to wrap the two sections of code in their own
>> logging stages:
>>
>> PetscLogStageRegister(&stage1, "First assembly");
>> PetscLogStageRegister(&stage2, "Second assembly");
>>
>> PetscLogStagePush(stage1);
>> <code for first assembly>
>> PetscLogStagePop();
>>
>> PetscLogStagePush(stage2);
>> <code for second assembly>
>> PetscLogStagePop();
>>
>> Then we can also get a look at how many messages are sent
>> and how big they are.
>>
>> Thanks,
>>
>> Matt
>>
>>
>>> run 1
>>> Max Max/Min Avg Total
>>> Time (sec): 2.154e+02 1.00001 2.154e+02
>>> Objects: 2.200e+01 1.00000 2.200e+01
>>> Flops: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> Flops/sec: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> MPI Messages: 1.750e+01 1.25000 1.633e+01 9.800e+01
>>> MPI Message Lengths: 3.460e+06 1.29903 1.855e+05 1.818e+07
>>> MPI Reductions: 4.167e+00 1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of
>>> length
>>> N --> 2N flops
>>> and VecAXPY() for complex vectors of
>>> length N --> 8N flops
>>>
>>> Summary of Stages: ----- Time ------ ----- Flops ----- ---
>>> Messages --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total counts
>>> %Total Avg %Total counts %Total
>>> 0: Main Stage: 2.1537e+02 100.0% 0.0000e+00 0.0% 9.800e+01
>>> 100.0% 1.855e+05 100.0% 2.500e+01 100.0%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on
>>> interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flops/sec: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all
>>> processors
>>> Mess: number of messages sent
>>> Avg. len: average message length
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with
>>> PetscLogStagePush()
>>> and PetscLogStagePop().
>>> %T - percent time in this phase %F - percent flops in
>>> this phase
>>> %M - percent messages in this phase %L - percent message
>>> lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max
>>> time
>>> over all processors)
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> ##########################################################
>>> # #
>>> # WARNING!!! #
>>> # #
>>> # This code was run without the PreLoadBegin() #
>>> # macros. To get timing results we always recommend #
>>> # preloading. otherwise timing numbers may be #
>>> # meaningless. #
>>> ##########################################################
>>>
>>>
>>> Event Count Time (sec) Flops/sec
>>> --- Global --- --- Stage --- Total
>>> Max Ratio Max Ratio Max Ratio Mess Avg
>>> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> MatAssemblyBegin 1 1.0 2.9536e-01 4.0 0.00e+00 0.0 4.2e+01
>>> 4.2e+05 2.0e+00 0 0 43 98 8 0 0 43 98 8 0
>>> MatAssemblyEnd 1 1.0 2.1410e+02 1.0 0.00e+00 0.0 2.8e+01
>>> 8.2e+03 7.0e+00 99 0 29 1 28 99 0 29 1 28 0
>>> MatZeroEntries 1 1.0 9.3739e-02 5.9 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecSet 4 1.0 3.9721e-04 2.2 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory
>>> Descendants' Mem.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Matrix 3 0 0 0
>>> Index Set 6 6 45500 0
>>> Vec 6 1 196776 0
>>> Vec Scatter 3 0 0 0
>>> IS L to G Mapping 2 0 0 0
>>> Krylov Solver 1 0 0 0
>>> Preconditioner 1 0 0 0
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 1.71661e-06
>>> Average time for MPI_Barrier(): 0.000159979
>>> Average time for zero size MPI_Send(): 1.29938e-05
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8
>>> Configure run at: Fri Sep 28 23:34:20 2007
>>>
>>> run 2
>>> Max Max/Min Avg Total
>>> Time (sec): 2.298e+02 1.00000 2.298e+02
>>> Objects: 2.600e+02 1.00000 2.600e+02
>>> Flops: 1.265e+09 1.17394 1.161e+09 6.969e+09
>>> Flops/sec: 5.505e+06 1.17394 5.054e+06 3.032e+07
>>> MPI Messages: 1.436e+03 1.20816 1.326e+03 7.956e+03
>>> MPI Message Lengths: 2.120e+07 1.23141 1.457e+04 1.159e+08
>>> MPI Reductions: 4.192e+02 1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of
>>> length
>>> N --> 2N flops
>>> and VecAXPY() for complex vectors of
>>> length N --> 8N flops
>>>
>>> Summary of Stages: ----- Time ------ ----- Flops ----- ---
>>> Messages --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total counts
>>> %Total Avg %Total counts %Total
>>> 0: Main Stage: 2.2943e+02 99.8% 6.9689e+09 100.0% 7.944e+03
>>> 99.8% 1.457e+04 100.0% 2.230e+02 8.9%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on
>>> interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flops/sec: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all
>>> processors
>>> Mess: number of messages sent
>>> Avg. len: average message length
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with
>>> PetscLogStagePush()
>>> and PetscLogStagePop().
>>> %T - percent time in this phase %F - percent flops in
>>> this phase
>>> %M - percent messages in this phase %L - percent message
>>> lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max
>>> time
>>> over all processors)
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> ##########################################################
>>> # #
>>> # WARNING!!! #
>>> # #
>>> # This code was run without the PreLoadBegin() #
>>> # macros. To get timing results we always recommend #
>>> # preloading. otherwise timing numbers may be #
>>> # meaningless. #
>>> ##########################################################
>>>
>>>
>>> Event Count Time (sec) Flops/sec
>>> --- Global --- --- Stage --- Total
>>> Max Ratio Max Ratio Max Ratio Mess Avg
>>> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> MatMult 135 1.0 2.0830e+00 1.4 2.30e+08 1.9 2.4e+03
>>> 1.3e+04 0.0e+00 1 26 30 26 0 1 26 30 26 0 862
>>> MatMultAdd 40 1.0 3.2598e-01 4.5 2.68e+07 5.7 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 31
>>> MatSolve 44 1.0 6.7841e-01 1.7 1.93e+08 1.7 0.0e+00
>>> 0.0e+00 0.0e+00 0 7 0 0 0 0 7 0 0 0 672
>>> MatRelax 80 1.0 4.2949e+00 1.6 8.77e+07 1.2 2.2e+03
>>> 1.3e+04 0.0e+00 1 23 28 26 0 1 23 28 26 0 374
>>> MatLUFactorSym 1 1.0 7.6739e-02 1.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatLUFactorNum 2 1.0 2.9370e+00 1.5 2.53e+08 1.5 0.0e+00
>>> 0.0e+00 0.0e+00 1 44 0 0 0 1 44 0 0 0 1037
>>> MatILUFactorSym 1 1.0 6.7334e-01 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatAssemblyBegin 7 1.0 7.1652e-01 4.4 0.00e+00 0.0 1.3e+02
>>> 2.9e+05 8.0e+00 0 0 2 32 0 0 0 2 32 4 0
>>> MatAssemblyEnd 7 1.0 2.1473e+02 1.0 0.00e+00 0.0 8.4e+01
>>> 3.2e+03 2.2e+01 93 0 1 0 1 94 0 1 0 10 0
>>> MatGetRowIJ 2 1.0 1.8899e-03 1.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatGetSubMatrice 1 1.0 3.1915e-02 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 1 0
>>> MatGetOrdering 2 1.0 1.2184e-02 1.6 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
>>> MatIncreaseOvrlp 1 1.0 7.6865e-02 1.0 0.00e+00 0.0 1.1e+03
>>> 2.4e+03 2.0e+01 0 0 13 2 1 0 0 13 2 9 0
>>> MatZeroEntries 3 1.0 1.0429e-01 4.7 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MAT_GetRedundantMatrix 1 1.0 3.0144e-02 1.2 0.00e+00 0.0 9.0e
>>> +01
>>> 8.0e+04 2.0e+00 0 0 1 6 0 0 0 1 6 1 0
>>> VecDot 39 1.0 9.0680e-01140.0 2.95e+08191.3 0.0e+00
>>> 0.0e+00 3.9e+01 0 0 0 0 2 0 0 0 0 17 11
>>> VecMDot 8 1.0 2.8777e-03 2.8 4.16e+07 3.3 0.0e+00
>>> 0.0e+00 8.0e+00 0 0 0 0 0 0 0 0 0 4 84
>>> VecNorm 31 1.0 6.2301e-02 4.2 7.17e+07 5.8 0.0e+00
>>> 0.0e+00 3.1e+01 0 0 0 0 1 0 0 0 0 14 86
>>> VecScale 85 1.0 1.7729e-03 1.4 4.94e+08 1.4 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2138
>>> VecCopy 4 1.0 6.2108e-04 1.5 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecSet 139 1.0 3.5934e-03 1.4 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecAXPY 98 1.0 5.1496e-03 1.3 9.24e+08 1.1 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 4705
>>> VecAYPX 40 1.0 1.0311e-02 1.4 1.90e+08 1.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 969
>>> VecWAXPY 75 1.0 2.6060e-02 1.4 1.22e+08 1.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 618
>>> VecMAXPY 9 1.0 2.1315e-04 1.6 3.01e+08 1.2 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1355
>>> VecAssemblyBegin 4 1.0 1.9898e-03 1.2 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 5 0
>>> VecAssemblyEnd 4 1.0 2.1219e-05 1.2 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecScatterBegin 267 1.0 1.0224e-01 1.6 0.00e+00 0.0 6.3e+03
>>> 1.1e+04 0.0e+00 0 0 79 59 0 0 0 79 59 0 0
>>> VecScatterEnd 267 1.0 7.8653e-0111.3 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> KSPGMRESOrthog 4 1.0 3.7677e-03 1.9 4.78e+07 2.4 0.0e+00
>>> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 5 141
>>> KSPSetup 6 1.0 1.0260e-02 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
>>> KSPSolve 2 1.0 7.8238e+00 1.0 1.16e+08 1.2 5.9e+03
>>> 1.1e+04 8.1e+01 3 70 75 54 3 3 70 75 54 36 622
>>> PCSetUp 3 1.0 5.1323e+00 1.2 1.29e+08 1.2 1.6e+03
>>> 9.1e+03 7.3e+01 2 47 20 13 3 2 47 20 13 33 636
>>> PCSetUpOnBlocks 1 1.0 1.3325e+00 1.0 1.47e+08 1.0 0.0e+00
>>> 0.0e+00 3.0e+00 1 17 0 0 0 1 17 0 0 1 871
>>> PCApply 44 1.0 5.7917e+00 1.1 9.60e+07 1.2 4.7e+03
>>> 1.0e+04 0.0e+00 2 41 59 42 0 2 41 59 42 0 497
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory
>>> Descendants' Mem.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Matrix 16 0 0 0
>>> Index Set 36 29 256760 0
>>> Vec 176 93 16582464 0
>>> Vec Scatter 10 0 0 0
>>> IS L to G Mapping 4 0 0 0
>>> Krylov Solver 6 0 0 0
>>> Preconditioner 6 0 0 0
>>> Viewer 4 2 0 0
>>> Container 2 0 0 0
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 8.10623e-07
>>> Average time for MPI_Barrier(): 0.000178194
>>> Average time for zero size MPI_Send(): 1.33117e-05
>>> OptionTable: -mg_levels_ksp_type richardson
>>> OptionTable: -mg_levels_pc_sor_omega 1.05
>>> OptionTable: -mg_levels_pc_type sor
>>> OptionTable: -pc_ml_PrintLevel 4
>>> OptionTable: -pc_ml_maxNlevels 2
>>> OptionTable: -pc_type ml
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8
>>> Configure run at: Fri Sep 28 23:34:20 2007
>>>
>>>
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>> On Feb 3, 2008, at 6:44 AM, Thomas Geenen wrote:
>>>>
>>>>> I call
>>>>>   ierr = MatStashSetInitialSize(A[*seqsolve],stash_size,stash_size);CHKERRQ(ierr);
>>>>> with 100 000 000 for the stash size, to make sure that is not the
>>>>> bottleneck.
>>>>>
>>>>> The assembly time remains unchanged, however.
>>>>>
>>>>> nstash in MatAssemblyBegin_MPIAIJ (CPU=0) = 109485
>>>>> reallocs in MatAssemblyBegin_MPIAIJ = 0
>>>>>
>>>>> cheers
>>>>> Thomas
>>>>>
>>>>> On Saturday 02 February 2008 23:19, Barry Smith wrote:
>>>>>> The matstash has a concept of preallocation also. During the first
>>>>>> round of MatSetValues calls it is allocating more and more memory for
>>>>>> the stash. In the second round the stash is already large enough, so
>>>>>> it does not require any additional allocation.
>>>>>>
>>>>>> You can use the option -matstash_initial_size <size> to allocate
>>>>>> enough space initially, so that the first round of SetValues is also
>>>>>> fast. It does not look like there is a way coded to get the <size>
>>>>>> that you should use; it should be set to the maximum number of
>>>>>> nonzeros any process has that belong to other processes. The
>>>>>> stash-handling code is in src/mat/utils/matstash.c; perhaps you can
>>>>>> figure out how to print out the sizes needed with PetscInfo()?
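>>>>>>
>>>>>> A rough sketch of what such a line might look like (the function to
>>>>>> put it in, e.g. MatStashScatterBegin_Private(), and the field names
>>>>>> stash->n and stash->nmax are guesses; check the MatStash struct in
>>>>>> matimpl.h first):
>>>>>>
>>>>>>   ierr = PetscInfo2(0,"Stash has %D entries, space allocated for %D\n",
>>>>>>                     stash->n,stash->nmax);CHKERRQ(ierr);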
>>>>>>
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>> On Feb 2, 2008, at 12:30 PM, Thomas Geenen wrote:
>>>>>>> On Saturday 02 February 2008 18:33, Hong Zhang wrote:
>>>>>>>> On Sat, 2 Feb 2008, Thomas Geenen wrote:
>>>>>>>>> Dear Petsc users,
>>>>>>>>>
>>>>>>>>> I would like to understand what is slowing down the assembly phase
>>>>>>>>> of my matrix. I create a matrix with MatCreateMPIAIJ; I make a
>>>>>>>>> rough guess of the number of off-diagonal entries and then use a
>>>>>>>>> conservative value to make sure I do not need extra mallocs (the
>>>>>>>>> number of diagonal entries is exact).
>>>>>>>>> Next I call MatSetValues and MatAssemblyBegin, MatAssemblyEnd.
>>>>>>>>> The first time I call MatSetValues and MatAssemblyBegin,
>>>>>>>>> MatAssemblyEnd it takes about 170 seconds; the second time, 0.3
>>>>>>>>> seconds.
>>>>>>>>> I run it on 6 CPUs and I do fill quite a number of row entries on
>>>>>>>>> the "wrong" CPU. However, that is also the case in the second run.
>>>>>>>>> I checked with MatGetInfo that there are no additional mallocs
>>>>>>>>> (info.mallocs = 0), both after MatSetValues and after
>>>>>>>>> MatAssemblyBegin, MatAssemblyEnd.
>>>>>>>>
>>>>>>>> Run your code with the option '-log_summary' and check which
>>>>>>>> function call dominates the execution time.
>>>>>>>
>>>>>>> The time is spent in MatStashScatterGetMesg_Private.
>>>>>>>
>>>>>>>>> I run it on 6 CPUs and I do fill quite a number of row entries on
>>>>>>>>> the "wrong" CPU.
>>>>>>>>
>>>>>>>> Likely, the communication that sends the entries to the correct
>>>>>>>> CPU consumes the time. Can you fill the entries on the correct CPU?
>>>>>>>
>>>>>>> The second time the entries are filled on the wrong CPU as well.
>>>>>>> I am curious about the difference in time between runs 1 and 2.
>>>>>>>
>>>>>>>> Hong
>>>>>>>>
>>>>>>>>> cheers
>>>>>>>>> Thomas
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>>
>>
>