[petsc-users] Fwd: Smaller assemble time with increasing processors

Runfeng Jin jsfaraway at gmail.com
Fri Jun 30 21:25:48 CDT 2023


Hi,
    Thanks for your reply. I tried using PetscLogEvent(), and the result
shows the same conclusion.
    What I have done is:
----------------
    PetscLogEvent Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
    PetscClassId classid;
    PetscLogDouble user_event_flops;
    PetscClassIdRegister("Test assemble and set value", &classid);
    PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
    PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
    PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
    PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
    PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
    /* ... compute elements and call MatSetValues(); no call to assembly here ... */
    PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);

    PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
    PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
----------------
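
    For reference, here is a minimal self-contained sketch of the same timing
pattern with error checking added. The matrix size, the inserted values, and
the event names are placeholders, and PetscCall() assumes a reasonably recent
PETSc release (older releases use the ierr/CHKERRQ idiom); run with -log_view
to see the registered events in the summary.
----------------
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat           A;
  PetscLogEvent set_event, asm_event;
  PetscClassId  classid;
  PetscInt      i, col, n = 100;    /* placeholder global size */
  PetscScalar   v = 1.0;            /* placeholder value */

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* register a class id and one event for each phase to be timed */
  PetscCall(PetscClassIdRegister("AssemblyTest", &classid));
  PetscCall(PetscLogEventRegister("SetValues only", classid, &set_event));
  PetscCall(PetscLogEventRegister("Assembly only", classid, &asm_event));

  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetFromOptions(A));
  PetscCall(MatSetUp(A));

  /* phase 1: generate entries; every rank inserts every diagonal entry,
     so some insertions are off-process and must be communicated later */
  PetscCall(PetscLogEventBegin(set_event, 0, 0, 0, 0));
  for (i = 0; i < n; i++) {
    col = i;
    PetscCall(MatSetValues(A, 1, &i, 1, &col, &v, INSERT_VALUES));
  }
  PetscCall(PetscLogEventEnd(set_event, 0, 0, 0, 0));

  /* phase 2: assembly, which performs the off-process communication */
  PetscCall(PetscLogEventBegin(asm_event, 0, 0, 0, 0));
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
  PetscCall(PetscLogEventEnd(asm_event, 0, 0, 0, 0));

  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}
----------------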

    And the output is as follows. By the way, does PETSc record all of the
time between PetscLogEventBegin and PetscLogEventEnd, or only the time spent
inside PETSc API calls?
----------------
Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total
                     Max  Ratio  Max        Ratio  Max  Ratio   Mess     AvgLen   Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
64new                1    1.0    2.3775e+02 1.0    0.00e+00 0.0 6.2e+03  2.3e+04  9.0e+00 52  0  1  1  2  52  0  1  1  2      0
128new               1    1.0    6.9945e+01 1.0    0.00e+00 0.0 2.5e+04  1.1e+04  9.0e+00 30  0  1  1  2  30  0  1  1  2      0
256new               1    1.0    1.7445e+01 1.0    0.00e+00 0.0 9.9e+04  5.2e+03  9.0e+00 10  0  1  1  2  10  0  1  1  2      0

64:
only assemble        1    1.0    2.6596e+02 1.0    0.00e+00 0.0 7.0e+03  2.8e+05  1.1e+01 55  0  1  8  3  55  0  1  8  3      0
only setvalues       1    1.0    1.9987e+02 1.0    0.00e+00 0.0 0.0e+00  0.0e+00  0.0e+00 41  0  0  0  0  41  0  0  0  0      0
Test both            1    1.0    4.6580e+02 1.0    0.00e+00 0.0 7.0e+03  2.8e+05  1.5e+01 96  0  1  8  4  96  0  1  8  4      0

128:
only assemble        1    1.0    6.9718e+01 1.0    0.00e+00 0.0 2.6e+04  8.1e+04  1.1e+01 30  0  1  4  3  30  0  1  4  3      0
only setvalues       1    1.0    1.4438e+02 1.1    0.00e+00 0.0 0.0e+00  0.0e+00  0.0e+00 60  0  0  0  0  60  0  0  0  0      0
Test both            1    1.0    2.1417e+02 1.0    0.00e+00 0.0 2.6e+04  8.1e+04  1.5e+01 91  0  1  4  4  91  0  1  4  4      0

256:
only assemble        1    1.0    1.7482e+01 1.0    0.00e+00 0.0 1.0e+05  2.3e+04  1.1e+01 10  0  1  3  3  10  0  1  3  3      0
only setvalues       1    1.0    1.3717e+02 1.1    0.00e+00 0.0 0.0e+00  0.0e+00  0.0e+00 78  0  0  0  0  78  0  0  0  0      0
Test both            1    1.0    1.5475e+02 1.0    0.00e+00 0.0 1.0e+05  2.3e+04  1.5e+01 91  0  1  3  4  91  0  1  3  4      0



Runfeng

Barry Smith <bsmith at petsc.dev> wrote on Fri, Jun 30, 2023 at 23:35:

>
>    You cannot look just at the VecAssemblyEnd() time; that will very
> likely give the wrong impression of the total time it takes to put the
> values in.
>
>    You need to register a new event, put a PetscLogEventBegin() just before
> you start generating the vector entries and calling VecSetValues(), and put
> the PetscLogEventEnd() just after the VecAssemblyEnd(). This is the only way
> to get an accurate accounting of the time.
>
>   Barry
>
>
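
For reference, a rough sketch of the bracketing Barry describes, applied to
the vector case; the parallel Vec x is assumed to exist already, and the
index, value, and event name are placeholders:
----------------
  PetscLogEvent vec_fill_event;
  PetscInt      row = 0;      /* placeholder index */
  PetscScalar   val = 1.0;    /* placeholder value */

  PetscCall(PetscLogEventRegister("Vec fill+assembly", VEC_CLASSID, &vec_fill_event));

  PetscCall(PetscLogEventBegin(vec_fill_event, 0, 0, 0, 0));
  /* ... generate the vector entries and call VecSetValues() ... */
  PetscCall(VecSetValues(x, 1, &row, &val, INSERT_VALUES));
  PetscCall(VecAssemblyBegin(x));
  PetscCall(VecAssemblyEnd(x));
  /* end the event only after VecAssemblyEnd() so communication time is included */
  PetscCall(PetscLogEventEnd(vec_fill_event, 0, 0, 0, 0));
----------------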
> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <jsfaraway at gmail.com> wrote:
> >
> > Hello!
> >
> > When I use PETSc to build an SBAIJ matrix, I notice a strange thing. When I
> > increase the number of processors, the assembly time becomes smaller, even
> > though all runs build exactly the same matrix. The assembly time mainly
> > comes from message passing, because I use a dynamic workload, so it is
> > random which elements are computed by which processor.
> > Intuitively, with more processors it should be more likely that a
> > processor computes elements that are stored on other processors. But from
> > the output of -log_view, it seems that with more processors, each processor
> > computes more elements that are stored locally (inferred from the fact
> > that, with more processors, the total amount of passed messages is smaller).
> >
> > What could cause this? Thank you!
> >
> >
> > Following is the output of -log_view for 64/128/256 processors. Each
> > row is the time profile of VecAssemblyEnd.
> >
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > processors   Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total
> >              Max  Ratio  Max        Ratio  Max  Ratio   Mess     AvgLen   Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
> > 64           1    1.0    2.3775e+02 1.0    0.00e+00 0.0 6.2e+03  2.3e+04  9.0e+00 52  0  1  1  2  52  0  1  1  2      0
> > 128          1    1.0    6.9945e+01 1.0    0.00e+00 0.0 2.5e+04  1.1e+04  9.0e+00 30  0  1  1  2  30  0  1  1  2      0
> > 256          1    1.0    1.7445e+01 1.0    0.00e+00 0.0 9.9e+04  5.2e+03  9.0e+00 10  0  1  1  2  10  0  1  1  2      0
> >
> > Runfeng Jin
>
>

