[petsc-users] Smaller assemble time with increasing processors

Mon Jul 3 10:03:23 CDT 2023

Thank you very much☺!

Runfeng

Barry Smith <bsmith at petsc.dev> 于2023年7月3日周一 22:52写道：

>
>
> On Jul 3, 2023, at 10:11 AM, Runfeng Jin <jsfaraway at gmail.com> wrote:
>
> Hi,
>
>> We use a hash table to store the nonzeros on the fly, and then convert to
>> packed storage on assembly.
>>
>
>    There is "extra memory" since the matrix entries are first stored in a
> hash and then converted into the regular CSR format, so for a short while,
> both copies are in memory.
>
>     We use the amazing khash package, include/petsc/private/khash/khash.h,
> our code is scattered around a bit depending on the matrix format we will
> be forming.
>
> cd src/mat
>
> git grep "_Hash("
>
> impls/aij/mpi/mpiaij.c:/* defines MatSetValues_MPI*_Hash(*),
> MatAssemblyBegin_MPI*_Hash(*), and  MatAssemblyEnd_MPI*_Hash(*) */
>
> impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:/* defines
> MatSetValues_MPICUSPARSE*_Hash(*) */
>
> impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:  PetscCall(MatSetUp_MPI
> *_Hash(*A));
>
> impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetValues_MPI*_Hash(*Mat
> A, PetscInt m, const PetscInt *rows, PetscInt n, const PetscInt *cols,
> const PetscScalar *values, InsertMode addv)
>
> impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyBegin_MPI
> *_Hash(*Mat A, PETSC_UNUSED MatAssemblyType type)
>
> impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyEnd_MPI
> *_Hash(*Mat A, MatAssemblyType type)
>
> impls/aij/mpi/mpihashmat.h:        PetscCall(MatSetValues_MPI*_Hash(*A,
> 1, row + i, ncols, col + i, val + i, A->insertmode));
>
> impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatDestroy_MPI*_Hash(*Mat
> A)
>
> impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatZeroEntries_MPI
> *_Hash(*PETSC_UNUSED Mat A)
>
> impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetRandom_MPI*_Hash(*Mat
> A, PETSC_UNUSED PetscRandom r)
>
> impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetUp_MPI*_Hash(*Mat
> A)
>
> impls/aij/seq/aij.c:/* defines MatSetValues_Seq*_Hash(*),
> MatAssemblyEnd_Seq*_Hash(*), MatSetUp_Seq*_Hash(*) */
>
> impls/aij/seq/seqhashmat.h:static PetscErrorCode MatAssemblyEnd_Seq
> *_Hash(*Mat A, MatAssemblyType type)
>
> impls/aij/seq/seqhashmat.h:  A->preallocated = PETSC_FALSE; /* this was
> set to true for the MatSetValues*_Hash(*) to work */
>
> impls/aij/seq/seqhashmat.h:static PetscErrorCode MatDestroy_Seq*_Hash(*Mat
> A)
>
> impls/aij/seq/seqhashmat.h:static PetscErrorCode MatZeroEntries_Seq
> *_Hash(*Mat A)
>
> impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetRandom_Seq*_Hash(*Mat
> A, PetscRandom r)
>
> impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetUp_Seq*_Hash(*Mat
> A)
>
> impls/baij/mpi/mpibaij.c:/* defines MatSetValues_MPI*_Hash(*),
> MatAssemblyBegin_MPI*_Hash(*), and  MatAssemblyEnd_MPI*_Hash(*) */
>
> impls/baij/seq/baij.c:/* defines MatSetValues_Seq*_Hash(*),
> MatAssemblyEnd_Seq*_Hash(*), MatSetUp_Seq*_Hash(*) */
>
> impls/sbaij/mpi/mpisbaij.c:/* defines MatSetValues_MPI*_Hash(*),
> MatAssemblyBegin_MPI*_Hash(*), MatAssemblyEnd_MPI*_Hash(*), MatSetUp_MPI
> *_Hash(*) */
>
> impls/sbaij/seq/sbaij.c:/* defines MatSetValues_Seq*_Hash(*),
> MatAssemblyEnd_Seq*_Hash(*), MatSetUp_Seq*_Hash(*) */
>
> Thanks for the numbers, it is good to see the performance is so similar to
> that obtained when providing preallocation information.
>
>
> Maybe can you tell me which file implements this function?
>
> Runfeng
>
>
>
> Runfeng Jin <jsfaraway at gmail.com> 于2023年7月3日周一 22:05写道：
>
>> Thank you for all your help!
>>
>> Runfeng
>>
>> Matthew Knepley <knepley at gmail.com> 于2023年7月3日周一 22:03写道：
>>
>>> On Mon, Jul 3, 2023 at 9:56 AM Runfeng Jin <jsfaraway at gmail.com> wrote:
>>>
>>>> Hi, impressive performance!
>>>>   I use the newest version of petsc(release branch), and it almost
>>>> deletes all assembly and stash time in large processors (assembly time
>>>> 64-4s/128-2s/256-0.2s, stash time all below 2s). For the zero programming
>>>> cost, it really incredible.
>>>>   The order code has a regular arrangement of the number of
>>>> nonzero-elements across rows, so I can have a good rough preallocation. And
>>>> from the data, dedicatedly arrange data and roughly acquiring the max
>>>> number of non-zero elements in rows can have a better performance than the
>>>> new version without preallocation. However, in reality, I will use the
>>>> newer version without preallocation for:1)less effort in programming and
>>>> also nearly the same good performance 2) good memory usage(I see no
>>>> unneeded memory after assembly) 3) dedicated preallocation is usually not
>>>> very easy and cause extra time cost.
>>>>    Maybe it will be better that leave some space for the user to do a
>>>> slight direction for the preallocation and thus acquire better performance.
>>>> But have no idea how to direct it.
>>>>    And I am very curious about how petsc achieves this. How can it not
>>>> know anything but achieve so good performance, and no wasted memory? May
>>>> you have an explanation about this?
>>>>
>>>
>>> We use a hash table to store the nonzeros on the fly, and then convert
>>> to packed storage on assembly.
>>>
>>>   Thanks,
>>>
>>>      Matt
>>>
>>>
>>>> assemble time:
>>>> version\processors               4            8        16         32
>>>>        64        128         256
>>>>      old                             14677s   4694s   1124s     572s
>>>>     38s         8s          2s
>>>>      new                                50s      28s       15s
>>>> 7.8s         4s          2s        0.4s
>>>>      older                              27s       24s        19s
>>>>  12s         14s         -              -
>>>> stash time(max among all processors):
>>>> version\processors               4            8        16         32
>>>>        64        128         256
>>>>      old                                 3145s   2554s   673s     329s
>>>>      201s     142s     138s
>>>>      new                                2s         1s        ~0s
>>>> ~0s         ~0s          ~0s       ~0s
>>>>      older                              10s       73s        18s
>>>>  5s            1s         -              -
>>>> old: my poor preallocation
>>>> new: newest version of petsc that do not preallocation
>>>> older: the best preallocation version of my code.
>>>>
>>>>
>>>> Runfeng
>>>>
>>>> Barry Smith <bsmith at petsc.dev> 于2023年7月3日周一 12:19写道：
>>>>
>>>>>
>>>>>    The main branch of PETSc now supports filling sparse matrices
>>>>> without providing any preallocation information.
>>>>>
>>>>>    You can give it a try. Use your current fastest code but just
>>>>> remove ALL the preallocation calls. I would be interested in what kind of
>>>>> performance you get compared to your best current performance.
>>>>>
>>>>>   Barry
>>>>>
>>>>>
>>>>>
>>>>> On Jul 2, 2023, at 11:24 PM, Runfeng Jin <jsfaraway at gmail.com> wrote:
>>>>>
>>>>> Hi! Good advice!
>>>>>     I set value with MatSetValues() API, which sets one part of a row
>>>>> at a time(I use a kind of tiling technology so I cannot get all values of a
>>>>> row at a time).
>>>>>     I tested the number of malloc in these three cases.  The number of
>>>>> mallocs is decreasing with the increase of processors, and all these are
>>>>> very large(the matrix is 283234*283234, as can see in the following). This
>>>>> should be due to the unqualified preallocation. I use a rough
>>>>> preallocation, that every processor counts the number of nonzero elements
>>>>> for the first 10 rows, and uses the largest one to preallocate memory for
>>>>> all local rows. It seems that not work well.
>>>>>
>>>>> number_of_processors   number_of_max_mallocs_among_all_processors
>>>>> 64                                     20000
>>>>> 128                                   17000
>>>>> 256                                   11000
>>>>>
>>>>>     I change my way to preallocate. I evenly take 100 rows in every
>>>>> local matrix and take the largest one to preallocate memory for all local
>>>>> rows. Now the assemble time is reduced to a very small time.
>>>>>
>>>>> number_of_processors   number_of_max_mallocs_among_all_processors
>>>>> 64                                     3000
>>>>> 128                                   700
>>>>> 256                                   500
>>>>>
>>>>> Event                Count          Time (sec)            Flop
>>>>>                                                       --- Global ---
>>>>>           --- Stage ----              Total
>>>>>                    Max Ratio        Max     Ratio       Max  Ratio
>>>>>  Mess            AvgLen  Reduct  %T %F %M %L %R        %T %F %M %L %R
>>>>> Mflop/s
>>>>> 64                 1    1.0       3.8999e+01 1.0     0.00e+00 0.0
>>>>> 7.1e+03     2.9e+05 1.1e+01 15  0  1  8  3                     15  0  1  8
>>>>>  3                  0
>>>>>
>>>>> 128               1    1.0       8.5714e+00 1.0     0.00e+00 0.0
>>>>> 2.6e+04     8.1e+04 1.1e+01  5  0  1  4  3                       5  0  1  4
>>>>>  3                   0
>>>>> 256               1    1.0        2.5512e+00 1.0    0.00e+00 0.0
>>>>> 1.0e+05     2.3e+04 1.1e+01  2  0  1  3  3                       2  0  1  3
>>>>>  3                   0
>>>>>
>>>>> So the reason "why assemble time is smaller with the increasing number
>>>>> of processors " may be because more processors divide the malloc job so
>>>>> that total time is reduced?
>>>>>  If so, I still have some questions:
>>>>>     1. If preallocation is not accurate, will the performance of the
>>>>> assembly be affected? I mean, when processors receive the elements that
>>>>> should be stored in their local by MPI, then will the new mallocs  happen
>>>>> at this time point?
>>>>>     2. I can not give an accurate preallocation for the large cost, so
>>>>> is there any better way to preallocate for my situation?
>>>>>
>>>>>
>>>>>
>>>>> Barry Smith <bsmith at petsc.dev> 于2023年7月2日周日 00:16写道：
>>>>>
>>>>>>
>>>>>>    I see no reason not to trust the times below, they seem
>>>>>> reasonable. You get more than 2 times speed from 64 to 128 and then about
>>>>>> 1.38 from 128 to 256.
>>>>>>
>>>>>>    The total amount of data moved (number of messages moved times
>>>>>> average length) goes from 7.0e+03 * 2.8e+05  1.9600e+09 to 2.1060e+09
>>>>>> to 2.3000e+09. A pretty moderate amount of data increase, but note that
>>>>>> each time you double the number of ranks, you also increase substantially
>>>>>> the network's hardware to move data, so one would hope for a good speed up.
>>>>>>
>>>>>>    Also, the load balance is very good, near 1. Often with assembly,
>>>>>> we see very out-of-balance, and it is difficult to get good speedup when
>>>>>> the balance is really off.
>>>>>>
>>>>>>    It looks like over 90% of the entire run time is coming from
>>>>>> setting and assembling the values? Also the setting values time dominates
>>>>>> assembly time more with more ranks.  Are you setting a single value at a
>>>>>> time or a collection of them? How big are the vectors?
>>>>>>
>>>>>>    Run all three cases with -info :vec to see some information about
>>>>>> how many mallocs where move to hold the stashed vector entries.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <jsfaraway at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>     Thanks for your reply. I try to use PetscLogEvent(), and the
>>>>>> result shows same conclusion.
>>>>>>     What I have done is :
>>>>>> ----------------
>>>>>>     PetscLogEvent Mat_assemble_event, Mat_setvalue_event,
>>>>>> Mat_setAsse_event;
>>>>>>     PetscClassId classid;
>>>>>>     PetscLogDouble user_event_flops;
>>>>>>     PetscClassIdRegister("Test assemble and set value", &classid);
>>>>>>     PetscLogEventRegister("Test only assemble", classid,
>>>>>> &Mat_assemble_event);
>>>>>>     PetscLogEventRegister("Test only set values", classid,
>>>>>> &Mat_setvalue_event);
>>>>>>     PetscLogEventRegister("Test both assemble and set values",
>>>>>> classid, &Mat_setAsse_event);
>>>>>>     PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>>>>>>     PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>>>>>>     ...compute elements and use MatSetValues. No call for assembly
>>>>>>     PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>>>>>>
>>>>>>     PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>>>>>>     MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>>>>>     MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>>>>>     PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>>>>>>     PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
>>>>>> ----------------
>>>>>>
>>>>>>     And the output as follows. By the way, dose petsc recorde all
>>>>>> time between PetscLogEventBegin and PetscLogEventEnd? or just test the time
>>>>>> of petsc API?
>>>>>>
>>>>>>
>>>>>>    It is all of the time.
>>>>>>
>>>>>> ----------------
>>>>>> Event                Count      Time (sec)     Flop
>>>>>>            --- Global ---  --- Stage ----  Total
>>>>>>                    Max Ratio  *Max*     Ratio   Max  Ratio  Mess
>>>>>> AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>> 64new               1 1.0 *2.3775e+02* 1.0 0.00e+00 0.0 6.2e+03
>>>>>> 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2     0
>>>>>> 128new              1 1.0* 6.9945e+01* 1.0 0.00e+00 0.0 2.5e+04
>>>>>> 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2     0
>>>>>> 256new              1 1.0 *1.7445e+01* 1.0 0.00e+00 0.0 9.9e+04
>>>>>> 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2     0
>>>>>>
>>>>>> 64:
>>>>>> only assemble       1 1.0 *2.6596e+02 *1.0 0.00e+00 0.0 7.0e+03
>>>>>> 2.8e+05 1.1e+01 55  0  1  8  3  55  0  1  8  3     0
>>>>>> only setvalues      1 1.0 *1.9987e+02* 1.0 0.00e+00 0.0 0.0e+00
>>>>>> 0.0e+00 0.0e+00 41  0  0  0  0  41  0  0  0  0     0
>>>>>> Test both           1 1.0 4.*6580e+02* 1.0 0.00e+00 0.0 7.0e+03
>>>>>> 2.8e+05 1.5e+01 96  0  1  8  4  96  0  1  8  4     0
>>>>>>
>>>>>> 128:
>>>>>>  only assemble      1 1.0 *6.9718e+01* 1.0 0.00e+00 0.0 2.6e+04
>>>>>> 8.1e+04 1.1e+01 30  0  1  4  3  30  0  1  4  3     0
>>>>>> only setvalues      1 1.0 *1.4438e+02* 1.1 0.00e+00 0.0 0.0e+00
>>>>>> 0.0e+00 0.0e+00 60  0  0  0  0  60  0  0  0  0     0
>>>>>> Test both           1 1.0 *2.1417e+02* 1.0 0.00e+00 0.0 2.6e+04
>>>>>> 8.1e+04 1.5e+01 91  0  1  4  4  91  0  1  4  4     0
>>>>>>
>>>>>> 256:
>>>>>> only assemble       1 1.0 *1.7482e+01* 1.0 0.00e+00 0.0 1.0e+05
>>>>>> 2.3e+04 1.1e+01 10  0  1  3  3  10  0  1  3  3     0
>>>>>> only setvalues      1 1.0 *1.3717e+02* 1.1 0.00e+00 0.0 0.0e+00
>>>>>> 0.0e+00 0.0e+00 78  0  0  0  0  78  0  0  0  0     0
>>>>>> Test both           1 1.0 *1.5475e+02* 1.0 0.00e+00 0.0 1.0e+05
>>>>>> 2.3e+04 1.5e+01 91  0  1  3  4  91  0  1  3  4     0
>>>>>>
>>>>>>
>>>>>>
>>>>>> Runfeng
>>>>>>
>>>>>> Barry Smith <bsmith at petsc.dev> 于2023年6月30日周五 23:35写道：
>>>>>>
>>>>>>>
>>>>>>>    You cannot look just at the VecAssemblyEnd() time, that will very
>>>>>>> likely give the wrong impression of the total time it takes to put the
>>>>>>> values in.
>>>>>>>
>>>>>>>    You need to register a new Event and put a PetscLogEvent() just
>>>>>>> before you start generating the vector entries and calling VecSetValues()
>>>>>>> and put the PetscLogEventEnd() just after the VecAssemblyEnd() this is the
>>>>>>> only way to get an accurate accounting of the time.
>>>>>>>
>>>>>>>   Barry
>>>>>>>
>>>>>>>
>>>>>>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <jsfaraway at gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hello!
>>>>>>> >
>>>>>>> > When I use PETSc build a sbaij matrix, I find a strange thing.
>>>>>>> When I increase the number of processors, the assemble time become smaller.
>>>>>>> All these are totally same matrix. The assemble time mainly arouse from
>>>>>>> message passing, which because I use dynamic workload that it is random for
>>>>>>> which elements are computed by which processor.
>>>>>>> > But from instinct, if use more processors, then more possible that
>>>>>>> the processor computes elements storing in other processors. But from the
>>>>>>> output of log_view, It seems when use more processors, the processors
>>>>>>> compute more elements storing in its local(infer from that, with more
>>>>>>> processors, less total amount of passed messages).
>>>>>>> >
>>>>>>> > What could cause this happened? Thank you!
>>>>>>> >
>>>>>>> >
>>>>>>> >  Following is the output of log_view for 64\128\256 processors.
>>>>>>> Every row is time profiler of VecAssemblyEnd.
>>>>>>> >
>>>>>>> >
>>>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>>> > processors                Count                      Time (sec)
>>>>>>>                                   Flop
>>>>>>>                          --- Global ---                               ---
>>>>>>> Stage ----                Total
>>>>>>> >                               Max    Ratio         Max
>>>>>>>       Ratio                 Max  Ratio      Mess        AvgLen
>>>>>>>  Reduct               %T %F %M %L %R         %T %F %M %L %R       Mflop/s
>>>>>>> > 64                            1     1.0            2.3775e+02
>>>>>>> 1.0                   0.00e+00 0.0      6.2e+03    2.3e+04     9.0e+00
>>>>>>>            52  0      1    1    2             52   0    1      1     2
>>>>>>>        0
>>>>>>> > 128                          1     1.0            6.9945e+01
>>>>>>> 1.0                   0.00e+00 0.0      2.5e+04    1.1e+04     9.0e+00
>>>>>>>           30   0      1     1  2              30   0    1       1    2
>>>>>>>        0
>>>>>>> > 256                          1     1.0           1.7445e+01
>>>>>>> 1.0                  0.00e+00 0.0      9.9e+04     5.2e+03    9.0e+00
>>>>>>>           10   0      1     1  2              10   0    1        1   2
>>>>>>>        0
>>>>>>> >
>>>>>>> > Runfeng Jin
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>> <http://www.cse.buffalo.edu/~knepley/>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20230703/33406907/attachment-0001.html>