assembly
Barry Smith
bsmith at mcs.anl.gov
Mon Feb 4 12:21:05 CST 2008
MatAssemblyEnd 1 1.0 1.4709e+02
In MatAssemblyEnd() all the messages carrying off-process values are
received, and MatSetValues() is then called with them on the local
matrices. The only way this can be taking such a huge amount of time is
if the preallocation was not done correctly. Could you please run with
-info and send all the output?
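
For reference, a minimal sketch of exact per-row preallocation with
MatCreateMPIAIJ() (the local size and the per-row counts below are
placeholders, not taken from your code; compute d_nnz/o_nnz from your
own mesh connectivity):

  PetscErrorCode ierr;
  PetscInt       i, nlocal = 1000;   /* locally owned rows (assumed value) */
  PetscInt       *d_nnz, *o_nnz;
  Mat            A;

  ierr = PetscMalloc(nlocal*sizeof(PetscInt),&d_nnz);CHKERRQ(ierr);
  ierr = PetscMalloc(nlocal*sizeof(PetscInt),&o_nnz);CHKERRQ(ierr);
  for (i=0; i<nlocal; i++) {
    d_nnz[i] = 9;  /* nonzeros of row i in locally owned columns            */
    o_nnz[i] = 3;  /* nonzeros coupling to columns owned by other processes */
  }
  /* The scalar d_nz/o_nz arguments are ignored when the per-row arrays
     are supplied, so 0 is passed for them. */
  ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD,nlocal,nlocal,
                         PETSC_DETERMINE,PETSC_DETERMINE,
                         0,d_nnz,0,o_nnz,&A);CHKERRQ(ierr);
  ierr = PetscFree(d_nnz);CHKERRQ(ierr);
  ierr = PetscFree(o_nnz);CHKERRQ(ierr);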
Barry
On Feb 4, 2008, at 11:34 AM, Thomas Geenen wrote:
> Hi Matt,
>
> This is indeed much clearer.
> I put the push and pop around MatAssemblyBegin/End:
>
> 1: First_assembly:  1.4724e+02 90.8%  0.0000e+00 0.0%  7.000e+01 0.9%  2.266e+03 15.5%  9.000e+00 0.2%
> 2: Second_assembly: 2.3823e-01  0.1%  0.0000e+00 0.0%  7.000e+01 0.9%  1.276e+02  0.9%  9.000e+00 0.2%
> 3: Third_assembly:  5.0168e-01  0.3%  0.0000e+00 0.0%  4.200e+01 0.5%  2.237e+03 15.4%  3.000e+00 0.1%
>
> The second assembly stage is another system of equations (the pressure
> correction in SIMPLER), so stages 1 and 3 correspond to assemblies 1
> and 2 of the main system.
>
> cheers
> Thomas
>
>
>
> ---------------------------------------------- PETSc Performance
> Summary: ----------------------------------------------
>
> Unknown Name on a linux-gnu named etna.geo.uu.nl with 6 processors, by
> geenen Mon Feb 4 18:27:48 2008
> Using Petsc Release Version 2.3.3, Patch 3, Fri Jun 15 16:51:25 CDT
> 2007 HG revision: f051789beadcd36f77fb6111d20225e26ed7cc0d
>
> Max Max/Min Avg Total
> Time (sec): 1.621e+02 1.00000 1.621e+02
> Objects: 2.600e+02 1.00000 2.600e+02
> Flops: 1.265e+09 1.17394 1.161e+09 6.969e+09
> Flops/sec: 7.806e+06 1.17393 7.166e+06 4.300e+07
> MPI Messages: 1.436e+03 1.20816 1.326e+03 7.956e+03
> MPI Message Lengths: 2.120e+07 1.23141 1.457e+04 1.159e+08
> MPI Reductions: 9.862e+02 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length
> N --> 2N flops
> and VecAXPY() for complex vectors of
> length N --> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flops ----- ---
> Messages --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts
> %Total Avg %Total counts %Total
> 0: Main Stage: 1.3160e+01 8.1% 6.9689e+09 100.0% 7.762e+03
> 97.6% 9.941e+03 68.2% 2.020e+02 3.4%
> 1: First_assembly: 1.4724e+02 90.8% 0.0000e+00 0.0% 7.000e+01
> 0.9% 2.266e+03 15.5% 9.000e+00 0.2%
> 2: Second_assembly: 2.3823e-01 0.1% 0.0000e+00 0.0% 7.000e+01
> 0.9% 1.276e+02 0.9% 9.000e+00 0.2%
> 3: Third_assembly: 5.0168e-01 0.3% 0.0000e+00 0.0% 4.200e+01
> 0.5% 2.237e+03 15.4% 3.000e+00 0.1%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops/sec: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all
> processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush()
> and PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in
> this phase
> %M - percent messages in this phase %L - percent message
> lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event Count Time (sec) Flops/sec
> --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg
> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult 135 1.0 1.5541e+00 1.3 3.03e+08 1.8 2.4e+03
> 1.3e+04 0.0e+00 1 26 30 26 0 11 26 30 38 0 1155
> MatMultAdd 40 1.0 3.2611e-01 8.3 3.64e+07 7.7 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 2 0 0 0 0 31
> MatSolve 44 1.0 6.7682e-01 1.7 1.94e+08 1.7 0.0e+00
> 0.0e+00 0.0e+00 0 7 0 0 0 4 7 0 0 0 673
> MatRelax 80 1.0 3.4453e+00 1.4 9.46e+07 1.1 2.2e+03
> 1.3e+04 0.0e+00 2 23 28 26 0 22 23 29 38 0 466
> MatLUFactorSym 1 1.0 6.7567e-02 1.1 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLUFactorNum 2 1.0 2.8804e+00 1.4 2.53e+08 1.4 0.0e+00
> 0.0e+00 0.0e+00 1 44 0 0 0 17 44 0 0 0 1058
> MatILUFactorSym 1 1.0 6.7676e-01 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00 0 0 0 0 0 5 0 0 0 0 0
> MatAssemblyBegin 4 1.0 2.7711e-0237.3 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 1 0
> MatAssemblyEnd 4 1.0 2.4401e-02 1.2 0.00e+00 0.0 2.8e+01
> 4.7e+02 7.0e+00 0 0 0 0 0 0 0 0 0 3 0
> MatGetRowIJ 2 1.0 1.2948e-02 7.5 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetSubMatrice 1 1.0 2.8603e-02 1.1 0.00e+00 0.0 0.0e+00
> 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 1 0
> MatGetOrdering 2 1.0 2.3054e-02 2.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
> MatIncreaseOvrlp 1 1.0 8.1528e-02 1.0 0.00e+00 0.0 1.1e+03
> 2.4e+03 2.0e+01 0 0 13 2 0 1 0 14 3 10 0
> MatZeroEntries 3 1.0 3.4422e-02 1.8 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MAT_GetRedundantMatrix 1 1.0 3.5774e-02 1.6 0.00e+00 0.0 9.0e+01
> 8.0e+04 2.0e+00 0 0 1 6 0 0 0 1 9 1 0
> VecDot 39 1.0 5.8092e-0131.0 1.02e+0842.3 0.0e+00
> 0.0e+00 3.9e+01 0 0 0 0 1 2 0 0 0 19 17
> VecMDot 8 1.0 3.4735e-03 2.9 4.52e+07 5.5 0.0e+00
> 0.0e+00 8.0e+00 0 0 0 0 0 0 0 0 0 4 69
> VecNorm 31 1.0 3.8690e-02 4.1 1.11e+08 5.6 0.0e+00
> 0.0e+00 3.1e+01 0 0 0 0 1 0 0 0 0 15 139
> VecScale 85 1.0 1.7631e-03 1.2 5.59e+08 1.9 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2150
> VecCopy 4 1.0 5.5027e-04 1.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 139 1.0 3.0956e-03 1.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 98 1.0 5.0848e-03 1.3 9.35e+08 1.1 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 4765
> VecAYPX 40 1.0 1.0264e-02 1.4 2.01e+08 1.1 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 973
> VecWAXPY 75 1.0 2.6191e-02 1.4 1.22e+08 1.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 615
> VecMAXPY 9 1.0 2.1935e-04 1.7 2.93e+08 1.1 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1317
> VecAssemblyBegin 4 1.0 1.9331e-03 1.2 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 6 0
> VecAssemblyEnd 4 1.0 2.3842e-05 1.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecScatterBegin 267 1.0 1.0370e-01 1.5 0.00e+00 0.0 6.3e+03
> 1.1e+04 0.0e+00 0 0 79 59 0 1 0 81 86 0 0
> VecScatterEnd 267 1.0 4.1189e-01 3.1 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00 0 0 0 0 0 2 0 0 0 0 0
> KSPGMRESOrthog 4 1.0 4.5178e-03 2.0 5.22e+07 3.8 0.0e+00
> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 6 117
> KSPSetup 6 1.0 7.9882e-03 1.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
> KSPSolve 2 1.0 6.6590e+00 1.0 1.36e+08 1.2 5.9e+03
> 1.1e+04 8.1e+01 4 70 75 54 1 51 70 76 80 40 731
> PCSetUp 3 1.0 5.0877e+00 1.2 1.29e+08 1.2 1.6e+03
> 9.1e+03 7.3e+01 3 47 20 13 1 34 47 21 19 36 642
> PCSetUpOnBlocks 1 1.0 1.3292e+00 1.0 1.46e+08 1.0 0.0e+00
> 0.0e+00 3.0e+00 1 17 0 0 0 10 17 0 0 1 873
> PCApply 44 1.0 4.7444e+00 1.1 1.16e+08 1.2 4.7e+03
> 1.0e+04 0.0e+00 3 41 59 42 0 34 41 60 61 0 607
>
> --- Event Stage 1: First_assembly
>
> MatAssemblyBegin 1 1.0 3.0375e-01 3.5 0.00e+00 0.0 4.2e+01
> 4.2e+05 2.0e+00 0 0 1 15 0 0 0 60 99 22 0
> MatAssemblyEnd 1 1.0 1.4709e+02 1.0 0.00e+00 0.0 2.8e+01
> 8.2e+03 7.0e+00 91 0 0 0 0 100 0 40 1 78 0
>
> --- Event Stage 2: Second_assembly
>
> MatAssemblyBegin 1 1.0 1.7451e-02 5.0 0.00e+00 0.0 4.2e+01
> 2.4e+04 2.0e+00 0 0 1 1 0 5 0 60 98 22 0
> MatAssemblyEnd 1 1.0 2.3056e-01 1.0 0.00e+00 0.0 2.8e+01
> 8.4e+02 7.0e+00 0 0 0 0 0 95 0 40 2 78 0
>
> --- Event Stage 3: Third_assembly
>
> MatAssemblyBegin 1 1.0 3.3676e-01 3.8 0.00e+00 0.0 4.2e+01
> 4.2e+05 2.0e+00 0 0 1 15 0 45 0100100 67 0
> MatAssemblyEnd 1 1.0 3.3125e-01 1.7 0.00e+00 0.0 0.0e+00
> 0.0e+00 1.0e+00 0 0 0 0 0 55 0 0 0 33 0
>
> --- Event Stage 4: Unknown
>
>
> --- Event Stage 5: Unknown
>
>
> --- Event Stage 6: Unknown
>
>
> --- Event Stage 7: Unknown
>
>
> --- Event Stage 8: Unknown
>
>
> --- Event Stage 9: Unknown
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants'
> Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 16 0 0 0
> Index Set 32 25 207888 0
> Vec 172 91 16374224 0
> Vec Scatter 8 0 0 0
> IS L to G Mapping 4 0 0 0
> Krylov Solver 6 0 0 0
> Preconditioner 6 0 0 0
> Viewer 4 2 0 0
> Container 2 0 0 0
>
> --- Event Stage 1: First_assembly
>
> Index Set 2 2 44140 0
> Vec 2 1 196776 0
> Vec Scatter 1 0 0 0
>
> --- Event Stage 2: Second_assembly
>
> Index Set 2 2 4732 0
> Vec 2 1 11464 0
> Vec Scatter 1 0 0 0
>
> --- Event Stage 3: Third_assembly
>
>
> --- Event Stage 4: Unknown
>
>
> --- Event Stage 5: Unknown
>
>
> --- Event Stage 6: Unknown
>
>
> --- Event Stage 7: Unknown
>
>
> --- Event Stage 8: Unknown
>
>
> --- Event Stage 9: Unknown
>
> ========================================================================================================================
> Average time to get PetscTime(): 8.82149e-07
> Average time for MPI_Barrier(): 0.000153208
> Average time for zero size MPI_Send(): 1.86761e-05
> OptionTable: -mg_levels_ksp_type richardson
> OptionTable: -mg_levels_pc_sor_omega 1.05
> OptionTable: -mg_levels_pc_type sor
> OptionTable: -pc_ml_PrintLevel 4
> OptionTable: -pc_ml_maxNlevels 2
> OptionTable: -pc_type ml
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Fri Sep 28 23:34:20 2007
>
> On Feb 4, 2008 5:47 PM, Matthew Knepley <knepley at gmail.com> wrote:
>> On Feb 4, 2008 10:41 AM, Thomas Geenen <geenen at gmail.com> wrote:
>>> On Feb 3, 2008 8:51 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>
>>>> Hmmm, are you saying the first round of setting values still
>>>> takes much longer than the second round?
>>>
>>> yes
>>>
>>>> Or is it that the time
>>>> in MatAssemblyBegin() is much longer the first time?
>>>>
>>>> The MatAssembly process has one piece of code whose
>>>> work is order n*size, where n is the stash size and size is the
>>>> number of processes; all other work is only order n.
>>>>
>>>> Could you send the -log_summary output?
>>>
>>> The timing is cumulative, I guess?
>>> In between these two solves I solve a smaller system for which I do
>>> not include the timing.
>>
>> I am having a little trouble reading this. I think the easiest
>> thing to do is to wrap the two sections of code in their own
>> logging stages:
>>
>> PetscLogStageRegister(&stage1, "First assembly");
>> PetscLogStageRegister(&stage2, "Second assembly");
>>
>> PetscLogStagePush(stage1);
>> <code for first assembly>
>> PetscLogStagePop();
>>
>> PetscLogStagePush(stage2);
>> <code for second assembly>
>> PetscLogStagePop();
>>
>> Then we can also get a look at how many messages are sent
>> and how big they are.
>>
>> Thanks,
>>
>> Matt
>>
>>
>>> run 1
>>> Max Max/Min Avg Total
>>> Time (sec): 2.154e+02 1.00001 2.154e+02
>>> Objects: 2.200e+01 1.00000 2.200e+01
>>> Flops: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> Flops/sec: 0.000e+00 0.00000 0.000e+00 0.000e+00
>>> MPI Messages: 1.750e+01 1.25000 1.633e+01 9.800e+01
>>> MPI Message Lengths: 3.460e+06 1.29903 1.855e+05 1.818e+07
>>> MPI Reductions: 4.167e+00 1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of
>>> length
>>> N --> 2N flops
>>> and VecAXPY() for complex vectors of
>>> length N --> 8N flops
>>>
>>> Summary of Stages: ----- Time ------ ----- Flops ----- ---
>>> Messages --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total counts
>>> %Total Avg %Total counts %Total
>>> 0: Main Stage: 2.1537e+02 100.0% 0.0000e+00 0.0% 9.800e+01
>>> 100.0% 1.855e+05 100.0% 2.500e+01 100.0%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on
>>> interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flops/sec: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all
>>> processors
>>> Mess: number of messages sent
>>> Avg. len: average message length
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with
>>> PetscLogStagePush()
>>> and PetscLogStagePop().
>>> %T - percent time in this phase %F - percent flops in
>>> this phase
>>> %M - percent messages in this phase %L - percent message
>>> lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max
>>> time
>>> over all processors)
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> ##########################################################
>>> # #
>>> # WARNING!!! #
>>> # #
>>> # This code was run without the PreLoadBegin() #
>>> # macros. To get timing results we always recommend #
>>> # preloading. otherwise timing numbers may be #
>>> # meaningless. #
>>> ##########################################################
>>>
>>>
>>> Event Count Time (sec) Flops/sec
>>> --- Global --- --- Stage --- Total
>>> Max Ratio Max Ratio Max Ratio Mess Avg
>>> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> MatAssemblyBegin 1 1.0 2.9536e-01 4.0 0.00e+00 0.0 4.2e+01
>>> 4.2e+05 2.0e+00 0 0 43 98 8 0 0 43 98 8 0
>>> MatAssemblyEnd 1 1.0 2.1410e+02 1.0 0.00e+00 0.0 2.8e+01
>>> 8.2e+03 7.0e+00 99 0 29 1 28 99 0 29 1 28 0
>>> MatZeroEntries 1 1.0 9.3739e-02 5.9 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecSet 4 1.0 3.9721e-04 2.2 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory
>>> Descendants' Mem.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Matrix 3 0 0 0
>>> Index Set 6 6 45500 0
>>> Vec 6 1 196776 0
>>> Vec Scatter 3 0 0 0
>>> IS L to G Mapping 2 0 0 0
>>> Krylov Solver 1 0 0 0
>>> Preconditioner 1 0 0 0
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 1.71661e-06
>>> Average time for MPI_Barrier(): 0.000159979
>>> Average time for zero size MPI_Send(): 1.29938e-05
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8
>>> Configure run at: Fri Sep 28 23:34:20 2007
>>>
>>> run 2
>>> Max Max/Min Avg Total
>>> Time (sec): 2.298e+02 1.00000 2.298e+02
>>> Objects: 2.600e+02 1.00000 2.600e+02
>>> Flops: 1.265e+09 1.17394 1.161e+09 6.969e+09
>>> Flops/sec: 5.505e+06 1.17394 5.054e+06 3.032e+07
>>> MPI Messages: 1.436e+03 1.20816 1.326e+03 7.956e+03
>>> MPI Message Lengths: 2.120e+07 1.23141 1.457e+04 1.159e+08
>>> MPI Reductions: 4.192e+02 1.00000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of
>>> length
>>> N --> 2N flops
>>> and VecAXPY() for complex vectors of
>>> length N --> 8N flops
>>>
>>> Summary of Stages: ----- Time ------ ----- Flops ----- ---
>>> Messages --- -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total counts
>>> %Total Avg %Total counts %Total
>>> 0: Main Stage: 2.2943e+02 99.8% 6.9689e+09 100.0% 7.944e+03
>>> 99.8% 1.457e+04 100.0% 2.230e+02 8.9%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on
>>> interpreting output.
>>> Phase summary info:
>>> Count: number of times phase was executed
>>> Time and Flops/sec: Max - maximum over all processors
>>> Ratio - ratio of maximum to minimum over all
>>> processors
>>> Mess: number of messages sent
>>> Avg. len: average message length
>>> Reduct: number of global reductions
>>> Global: entire computation
>>> Stage: stages of a computation. Set stages with
>>> PetscLogStagePush()
>>> and PetscLogStagePop().
>>> %T - percent time in this phase %F - percent flops in
>>> this phase
>>> %M - percent messages in this phase %L - percent message
>>> lengths in this phase
>>> %R - percent reductions in this phase
>>> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max
>>> time
>>> over all processors)
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> ##########################################################
>>> # #
>>> # WARNING!!! #
>>> # #
>>> # This code was run without the PreLoadBegin() #
>>> # macros. To get timing results we always recommend #
>>> # preloading. otherwise timing numbers may be #
>>> # meaningless. #
>>> ##########################################################
>>>
>>>
>>> Event Count Time (sec) Flops/sec
>>> --- Global --- --- Stage --- Total
>>> Max Ratio Max Ratio Max Ratio Mess Avg
>>> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> MatMult 135 1.0 2.0830e+00 1.4 2.30e+08 1.9 2.4e+03
>>> 1.3e+04 0.0e+00 1 26 30 26 0 1 26 30 26 0 862
>>> MatMultAdd 40 1.0 3.2598e-01 4.5 2.68e+07 5.7 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 31
>>> MatSolve 44 1.0 6.7841e-01 1.7 1.93e+08 1.7 0.0e+00
>>> 0.0e+00 0.0e+00 0 7 0 0 0 0 7 0 0 0 672
>>> MatRelax 80 1.0 4.2949e+00 1.6 8.77e+07 1.2 2.2e+03
>>> 1.3e+04 0.0e+00 1 23 28 26 0 1 23 28 26 0 374
>>> MatLUFactorSym 1 1.0 7.6739e-02 1.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatLUFactorNum 2 1.0 2.9370e+00 1.5 2.53e+08 1.5 0.0e+00
>>> 0.0e+00 0.0e+00 1 44 0 0 0 1 44 0 0 0 1037
>>> MatILUFactorSym 1 1.0 6.7334e-01 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatAssemblyBegin 7 1.0 7.1652e-01 4.4 0.00e+00 0.0 1.3e+02
>>> 2.9e+05 8.0e+00 0 0 2 32 0 0 0 2 32 4 0
>>> MatAssemblyEnd 7 1.0 2.1473e+02 1.0 0.00e+00 0.0 8.4e+01
>>> 3.2e+03 2.2e+01 93 0 1 0 1 94 0 1 0 10 0
>>> MatGetRowIJ 2 1.0 1.8899e-03 1.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatGetSubMatrice 1 1.0 3.1915e-02 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 1 0
>>> MatGetOrdering 2 1.0 1.2184e-02 1.6 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
>>> MatIncreaseOvrlp 1 1.0 7.6865e-02 1.0 0.00e+00 0.0 1.1e+03
>>> 2.4e+03 2.0e+01 0 0 13 2 1 0 0 13 2 9 0
>>> MatZeroEntries 3 1.0 1.0429e-01 4.7 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MAT_GetRedundantMatrix 1 1.0 3.0144e-02 1.2 0.00e+00 0.0 9.0e
>>> +01
>>> 8.0e+04 2.0e+00 0 0 1 6 0 0 0 1 6 1 0
>>> VecDot 39 1.0 9.0680e-01140.0 2.95e+08191.3 0.0e+00
>>> 0.0e+00 3.9e+01 0 0 0 0 2 0 0 0 0 17 11
>>> VecMDot 8 1.0 2.8777e-03 2.8 4.16e+07 3.3 0.0e+00
>>> 0.0e+00 8.0e+00 0 0 0 0 0 0 0 0 0 4 84
>>> VecNorm 31 1.0 6.2301e-02 4.2 7.17e+07 5.8 0.0e+00
>>> 0.0e+00 3.1e+01 0 0 0 0 1 0 0 0 0 14 86
>>> VecScale 85 1.0 1.7729e-03 1.4 4.94e+08 1.4 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2138
>>> VecCopy 4 1.0 6.2108e-04 1.5 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecSet 139 1.0 3.5934e-03 1.4 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecAXPY 98 1.0 5.1496e-03 1.3 9.24e+08 1.1 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 4705
>>> VecAYPX 40 1.0 1.0311e-02 1.4 1.90e+08 1.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 969
>>> VecWAXPY 75 1.0 2.6060e-02 1.4 1.22e+08 1.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 618
>>> VecMAXPY 9 1.0 2.1315e-04 1.6 3.01e+08 1.2 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1355
>>> VecAssemblyBegin 4 1.0 1.9898e-03 1.2 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 5 0
>>> VecAssemblyEnd 4 1.0 2.1219e-05 1.2 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecScatterBegin 267 1.0 1.0224e-01 1.6 0.00e+00 0.0 6.3e+03
>>> 1.1e+04 0.0e+00 0 0 79 59 0 0 0 79 59 0 0
>>> VecScatterEnd 267 1.0 7.8653e-0111.3 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> KSPGMRESOrthog 4 1.0 3.7677e-03 1.9 4.78e+07 2.4 0.0e+00
>>> 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 5 141
>>> KSPSetup 6 1.0 1.0260e-02 1.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 2 0
>>> KSPSolve 2 1.0 7.8238e+00 1.0 1.16e+08 1.2 5.9e+03
>>> 1.1e+04 8.1e+01 3 70 75 54 3 3 70 75 54 36 622
>>> PCSetUp 3 1.0 5.1323e+00 1.2 1.29e+08 1.2 1.6e+03
>>> 9.1e+03 7.3e+01 2 47 20 13 3 2 47 20 13 33 636
>>> PCSetUpOnBlocks 1 1.0 1.3325e+00 1.0 1.47e+08 1.0 0.0e+00
>>> 0.0e+00 3.0e+00 1 17 0 0 0 1 17 0 0 1 871
>>> PCApply 44 1.0 5.7917e+00 1.1 9.60e+07 1.2 4.7e+03
>>> 1.0e+04 0.0e+00 2 41 59 42 0 2 41 59 42 0 497
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type Creations Destructions Memory
>>> Descendants' Mem.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> Matrix 16 0 0 0
>>> Index Set 36 29 256760 0
>>> Vec 176 93 16582464 0
>>> Vec Scatter 10 0 0 0
>>> IS L to G Mapping 4 0 0 0
>>> Krylov Solver 6 0 0 0
>>> Preconditioner 6 0 0 0
>>> Viewer 4 2 0 0
>>> Container 2 0 0 0
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 8.10623e-07
>>> Average time for MPI_Barrier(): 0.000178194
>>> Average time for zero size MPI_Send(): 1.33117e-05
>>> OptionTable: -mg_levels_ksp_type richardson
>>> OptionTable: -mg_levels_pc_sor_omega 1.05
>>> OptionTable: -mg_levels_pc_type sor
>>> OptionTable: -pc_ml_PrintLevel 4
>>> OptionTable: -pc_ml_maxNlevels 2
>>> OptionTable: -pc_type ml
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8
>>> Configure run at: Fri Sep 28 23:34:20 2007
>>>
>>>
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>> On Feb 3, 2008, at 6:44 AM, Thomas Geenen wrote:
>>>>
>>>>> I call
>>>>>   ierr = MatStashSetInitialSize(A[*seqsolve],stash_size,stash_size);CHKERRQ(ierr);
>>>>> with 100 000 000 for the stash size, to make sure that is not the
>>>>> bottleneck.
>>>>>
>>>>> The assembly time remains unchanged, however.
>>>>>
>>>>> nstash in MatAssemblyBegin_MPIAIJ (CPU=0) = 109485
>>>>> reallocs in MatAssemblyBegin_MPIAIJ = 0
>>>>>
>>>>> cheers
>>>>> Thomas
>>>>>
>>>>> On Saturday 02 February 2008 23:19, Barry Smith wrote:
>>>>>> The matstash has a concept of preallocation also. During the first
>>>>>> round of MatSetValues calls it is allocating more and more memory for
>>>>>> the stash. In the second round the stash is already large enough, so
>>>>>> it does not require any additional allocation.
>>>>>>
>>>>>> You can use the option -matstash_initial_size <size> to allocate
>>>>>> enough space initially, so that the first round of SetValues is also
>>>>>> fast. It does not look like there is a way coded to get the <size>
>>>>>> that you should use; it should be set to the maximum number of
>>>>>> nonzeros any process has that belong to other processes. The
>>>>>> stash-handling code is in src/mat/utils/matstash.c; perhaps you can
>>>>>> figure out how to print out the sizes needed with PetscInfo()?
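>>>>>>
>>>>>> A rough sketch of what such a line might look like (the function to
>>>>>> put it in, e.g. MatStashScatterBegin_Private(), and the field names
>>>>>> stash->n and stash->nmax are guesses; check the MatStash struct in
>>>>>> matimpl.h first):
>>>>>>
>>>>>>   ierr = PetscInfo2(0,"Stash has %D entries, space allocated for %D\n",
>>>>>>                     stash->n,stash->nmax);CHKERRQ(ierr);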
>>>>>>
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>> On Feb 2, 2008, at 12:30 PM, Thomas Geenen wrote:
>>>>>>> On Saturday 02 February 2008 18:33, Hong Zhang wrote:
>>>>>>>> On Sat, 2 Feb 2008, Thomas Geenen wrote:
>>>>>>>>> Dear Petsc users,
>>>>>>>>>
>>>>>>>>> I would like to understand what is slowing down the assembly phase
>>>>>>>>> of my matrix. I create a matrix with MatCreateMPIAIJ; I make a
>>>>>>>>> rough guess of the number of off-diagonal entries and then use a
>>>>>>>>> conservative value to make sure I do not need extra mallocs (the
>>>>>>>>> number of diagonal entries is exact).
>>>>>>>>> Next I call MatSetValues and MatAssemblyBegin, MatAssemblyEnd.
>>>>>>>>> The first time I call MatSetValues and MatAssemblyBegin,
>>>>>>>>> MatAssemblyEnd it takes about 170 seconds; the second time, 0.3
>>>>>>>>> seconds.
>>>>>>>>> I run it on 6 CPUs and I do fill quite a number of row entries on
>>>>>>>>> the "wrong" CPU. However, that is also the case in the second run.
>>>>>>>>> I checked with MatGetInfo that there are no additional mallocs
>>>>>>>>> (info.mallocs = 0), both after MatSetValues and after
>>>>>>>>> MatAssemblyBegin, MatAssemblyEnd.
>>>>>>>>
>>>>>>>> Run your code with the option '-log_summary' and check which
>>>>>>>> function call dominates the execution time.
>>>>>>>
>>>>>>> The time is spent in MatStashScatterGetMesg_Private.
>>>>>>>
>>>>>>>>> I run it on 6 CPUs and I do fill quite a number of row entries on
>>>>>>>>> the "wrong" CPU.
>>>>>>>>
>>>>>>>> Likely, the communication that sends the entries to the correct
>>>>>>>> CPU consumes the time. Can you fill the entries on the correct CPU?
>>>>>>>
>>>>>>> The second time the entries are filled on the wrong CPU as well.
>>>>>>> I am curious about the difference in time between runs 1 and 2.
>>>>>>>
>>>>>>>> Hong
>>>>>>>>
>>>>>>>>> cheers
>>>>>>>>> Thomas
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>>
>>
>