[petsc-users] Smaller assemble time with increasing processors

Barry Smith bsmith at petsc.dev
Mon Jul 3 09:52:16 CDT 2023



> On Jul 3, 2023, at 10:11 AM, Runfeng Jin <jsfaraway at gmail.com> wrote:
> 
> Hi, 
>> We use a hash table to store the nonzeros on the fly, and then convert to packed storage on assembly.

   There is "extra memory" since the matrix entries are first stored in a hash and then converted into the regular CSR format, so for a short while, both copies are in memory. 

    We use the amazing khash package, include/petsc/private/khash/khash.h; our code is scattered around a bit, depending on the matrix format being formed. 

cd src/mat
git grep "_Hash("
impls/aij/mpi/mpiaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), and  MatAssemblyEnd_MPI_Hash() */
impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:/* defines MatSetValues_MPICUSPARSE_Hash() */
impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:  PetscCall(MatSetUp_MPI_Hash(A));
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetValues_MPI_Hash(Mat A, PetscInt m, const PetscInt *rows, PetscInt n, const PetscInt *cols, const PetscScalar *values, InsertMode addv)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyBegin_MPI_Hash(Mat A, PETSC_UNUSED MatAssemblyType type)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyEnd_MPI_Hash(Mat A, MatAssemblyType type)
impls/aij/mpi/mpihashmat.h:        PetscCall(MatSetValues_MPI_Hash(A, 1, row + i, ncols, col + i, val + i, A->insertmode));
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatDestroy_MPI_Hash(Mat A)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatZeroEntries_MPI_Hash(PETSC_UNUSED Mat A)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetRandom_MPI_Hash(Mat A, PETSC_UNUSED PetscRandom r)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetUp_MPI_Hash(Mat A)
impls/aij/seq/aij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatAssemblyEnd_Seq_Hash(Mat A, MatAssemblyType type)
impls/aij/seq/seqhashmat.h:  A->preallocated = PETSC_FALSE; /* this was set to true for the MatSetValues_Hash() to work */
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatDestroy_Seq_Hash(Mat A)
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatZeroEntries_Seq_Hash(Mat A)
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetRandom_Seq_Hash(Mat A, PetscRandom r)
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetUp_Seq_Hash(Mat A)
impls/baij/mpi/mpibaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), and  MatAssemblyEnd_MPI_Hash() */
impls/baij/seq/baij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
impls/sbaij/mpi/mpisbaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), MatAssemblyEnd_MPI_Hash(), MatSetUp_MPI_Hash() */
impls/sbaij/seq/sbaij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
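To illustrate the idea (a minimal Python sketch only, not PETSc's actual khash-based C implementation; all names here are illustrative): entries accumulate in a hash keyed by (row, column) during MatSetValues(), and at assembly time they are sorted and packed into CSR arrays, which is why both copies briefly coexist in memory.

```python
# Illustrative sketch only -- PETSc's real code uses the khash C hash table
# (include/petsc/private/khash/khash.h), not a Python dict.

def coo_hash_to_csr(entries, nrows):
    """Pack a {(row, col): value} hash into CSR arrays (row_ptr, cols, vals)."""
    # Sort by (row, col) so each row's columns come out in order.
    items = sorted(entries.items())
    row_ptr = [0] * (nrows + 1)
    cols, vals = [], []
    for (i, j), v in items:
        row_ptr[i + 1] += 1          # count nonzeros per row
        cols.append(j)
        vals.append(v)
    for i in range(nrows):           # prefix sum -> row offsets
        row_ptr[i + 1] += row_ptr[i]
    return row_ptr, cols, vals

# "MatSetValues" phase: insertions go into the hash; duplicates accumulate
# (ADD_VALUES semantics).
hash_store = {}
for i, j, v in [(0, 0, 1.0), (1, 2, 3.0), (1, 0, 2.0), (0, 0, 0.5)]:
    hash_store[(i, j)] = hash_store.get((i, j), 0.0) + v

# "MatAssemblyEnd" phase: convert to packed CSR; both copies exist briefly.
row_ptr, cols, vals = coo_hash_to_csr(hash_store, nrows=2)
```

No preallocation information is needed because the hash grows on demand; the only cost is the transient duplicate storage during the conversion.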

Thanks for the numbers, it is good to see the performance is so similar to that obtained when providing preallocation information.

> 
> Could you tell me which file implements this function? 
> 
> Runfeng
> 
> 
> 
> Runfeng Jin <jsfaraway at gmail.com> wrote on Mon, Jul 3, 2023 at 22:05:
>> Thank you for all your help!
>> 
>> Runfeng
>> 
>> Matthew Knepley <knepley at gmail.com> wrote on Mon, Jul 3, 2023 at 22:03:
>>> On Mon, Jul 3, 2023 at 9:56 AM Runfeng Jin <jsfaraway at gmail.com> wrote:
>>>> Hi, impressive performance!
>>>>   I use the newest version of PETSc (release branch), and it almost eliminates all assembly and stash time at large processor counts (assembly time 64-4s/128-2s/256-0.2s, stash time all below 2s). For zero programming cost, it is really incredible. 
>>>>   The older code has a regular arrangement of the number of nonzero elements across rows, so I can make a good rough preallocation. And from the data, dedicatedly arranging the data and roughly acquiring the maximum number of nonzero elements per row can give better performance than the new version without preallocation. However, in practice I will use the newer version without preallocation because of: 1) less programming effort with nearly the same performance; 2) good memory usage (I see no unneeded memory after assembly); 3) dedicated preallocation is usually not easy and causes extra time cost.
>>>>    Maybe it would be better to leave some room for the user to give a slight hint for the preallocation and thus get better performance, but I have no idea how to direct it.
>>>>    And I am very curious how PETSc achieves this. How can it know nothing in advance yet achieve such good performance with no wasted memory? Could you explain this?
>>> 
>>> We use a hash table to store the nonzeros on the fly, and then convert to packed storage on assembly.
>>> 
>>>   Thanks,
>>> 
>>>      Matt
>>>  
>>>> assemble time:
>>>> version\processors               4            8        16         32           64        128         256
>>>>      old                             14677s   4694s   1124s     572s        38s         8s          2s
>>>>      new                                50s      28s       15s        7.8s         4s          2s        0.4s
>>>>      older                              27s       24s        19s       12s         14s         -              -
>>>> stash time(max among all processors):
>>>> version\processors               4            8        16         32           64        128         256
>>>>      old                                 3145s   2554s   673s     329s       201s     142s     138s
>>>>      new                                2s         1s        ~0s        ~0s         ~0s          ~0s       ~0s
>>>>      older                              10s       73s        18s       5s            1s         -              -
>>>> old: my poor preallocation
>>>> new: the newest version of PETSc, with no preallocation
>>>> older: the best-preallocation version of my code
>>>> 
>>>> 
>>>> Runfeng
>>>> 
>>>> Barry Smith <bsmith at petsc.dev> wrote on Mon, Jul 3, 2023 at 12:19:
>>>>> 
>>>>>    The main branch of PETSc now supports filling sparse matrices without providing any preallocation information.
>>>>> 
>>>>>    You can give it a try. Use your current fastest code but just remove ALL the preallocation calls. I would be interested in what kind of performance you get compared to your best current performance.
>>>>> 
>>>>>   Barry
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Jul 2, 2023, at 11:24 PM, Runfeng Jin <jsfaraway at gmail.com> wrote:
>>>>>> 
>>>>>> Hi! Good advice!
>>>>>>     I set values with the MatSetValues() API, which sets one part of a row at a time (I use a kind of tiling technique, so I cannot get all values of a row at once).
>>>>>>     I tested the number of mallocs in these three cases. The number of mallocs decreases as the number of processors increases, and all of them are very large (the matrix is 283234 x 283234, as can be seen in the following). This should be due to the poor preallocation. I use a rough preallocation: every processor counts the number of nonzero elements in its first 10 rows and uses the largest count to preallocate memory for all local rows. It seems that this does not work well. 
>>>>>> number_of_processors   number_of_max_mallocs_among_all_processors  
>>>>>> 64                                     20000
>>>>>> 128                                   17000
>>>>>> 256                                   11000
>>>>>>     I changed my way of preallocating: I evenly sample 100 rows in every local matrix and take the largest count to preallocate memory for all local rows. Now the assembly time is reduced to a very small value.
>>>>>> number_of_processors   number_of_max_mallocs_among_all_processors  
>>>>>> 64                                     3000
>>>>>> 128                                   700
>>>>>> 256                                   500
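[Editorial note: the sampling heuristic described above could be sketched roughly as follows; this is a hypothetical Python sketch, where count_nnz_in_row stands in for whatever application-specific counting routine is available.]

```python
# Sketch of the rough preallocation heuristic described above: evenly
# sample some local rows, count their nonzeros, and preallocate every
# local row with the largest count seen.  count_nnz_in_row() is a
# hypothetical application-specific callback.

def estimate_row_nnz(local_rows, count_nnz_in_row, nsamples=100):
    """Return a single per-row nonzero estimate from an even sample of rows."""
    n = len(local_rows)
    if n == 0:
        return 0
    step = max(1, n // nsamples)
    sampled = local_rows[::step]     # evenly spaced sample of local rows
    return max(count_nnz_in_row(r) for r in sampled)

# Example with a synthetic nonzero pattern: row r has (r % 7) + 1 nonzeros.
rows = list(range(1000))
est = estimate_row_nnz(rows, lambda r: (r % 7) + 1)
# 'est' would then be passed as the per-row preallocation (e.g. to
# MatSeqAIJSetPreallocation or the MPIAIJ variant) before MatSetValues().
```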
>>>>>> Event              Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>> 64                  1 1.0 3.8999e+01 1.0 0.00e+00 0.0 7.1e+03 2.9e+05 1.1e+01 15  0  1  8  3  15  0  1  8  3     0
>>>>>> 128                 1 1.0 8.5714e+00 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.1e+01  5  0  1  4  3   5  0  1  4  3     0
>>>>>> 256                 1 1.0 2.5512e+00 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.1e+01  2  0  1  3  3   2  0  1  3  3     0
>>>>>> So the reason why the assembly time is smaller with an increasing number of processors may be that more processors divide the malloc work, so the total time is reduced? 
>>>>>>  If so, I still have some questions:
>>>>>>     1. If the preallocation is not accurate, will the performance of the assembly be affected? I mean, when processors receive, via MPI, the elements that should be stored locally, will new mallocs happen at that point?
>>>>>>     2. I cannot give an accurate preallocation because of its large cost, so is there a better way to preallocate in my situation?
>>>>>> 
>>>>>> 
>>>>>> Barry Smith <bsmith at petsc.dev> wrote on Sun, Jul 2, 2023 at 00:16:
>>>>>>> 
>>>>>>>    I see no reason not to trust the times below; they seem reasonable. You get more than a 2x speedup from 64 to 128 and then about 1.38x from 128 to 256. 
>>>>>>> 
>>>>>>>    The total amount of data moved (number of messages times average length) goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.1060e+09 to 2.3000e+09. A pretty moderate increase in data, but note that each time you double the number of ranks, you also substantially increase the network hardware available to move data, so one would hope for a good speedup.
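[Editorial note: the arithmetic above can be reproduced directly from the -log_view columns; a quick sketch, taking message counts and average lengths from the Mess/AvgLen columns and speedups from the "Test both" times quoted later in the thread.]

```python
# Total data moved = number of messages * average message length,
# read off the Mess and AvgLen columns of -log_view for each rank count.
cases = {64: (7.0e3, 2.8e5), 128: (2.6e4, 8.1e4), 256: (1.0e5, 2.3e4)}
volume = {p: mess * avglen for p, (mess, avglen) in cases.items()}

# Speedups from the combined set-values + assembly ("Test both") times:
# 4.6580e+02 s (64 ranks), 2.1417e+02 s (128), 1.5475e+02 s (256).
speedup_64_128 = 4.6580e2 / 2.1417e2    # a bit more than 2x
speedup_128_256 = 2.1417e2 / 1.5475e2   # about 1.38x
```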
>>>>>>> 
>>>>>>>    Also, the load balance is very good, near 1. Often with assembly we see things very out of balance, and it is difficult to get a good speedup when the balance is really off.
>>>>>>> 
>>>>>>>    It looks like over 90% of the entire run time is coming from setting and assembling the values? Also, the set-values time dominates the assembly time more with more ranks. Are you setting a single value at a time or a collection of them? How big are the vectors?
>>>>>>> 
>>>>>>>    Run all three cases with -info :vec to see some information about how many mallocs were used to hold the stashed vector entries.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <jsfaraway at gmail.com> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>>     Thanks for your reply. I tried to use PetscLogEvent(), and the result shows the same conclusion.
>>>>>>>>     What I have done is:
>>>>>>>> ----------------
>>>>>>>>     PetscLogEvent Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
>>>>>>>>     PetscClassId classid;
>>>>>>>>     PetscLogDouble user_event_flops;
>>>>>>>>     PetscClassIdRegister("Test assemble and set value", &classid);
>>>>>>>>     PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
>>>>>>>>     PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
>>>>>>>>     PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
>>>>>>>>     PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>>>>>>>>     PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>>>>>>>>     ...compute elements and use MatSetValues. No call for assembly
>>>>>>>>     PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>>>>>>>> 
>>>>>>>>     PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>>>>>>>>     MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>>>>>>>     MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>>>>>>>     PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>>>>>>>>     PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
>>>>>>>> ----------------
>>>>>>>> 
>>>>>>>>     And the output is as follows. By the way, does PETSc record all the time between PetscLogEventBegin and PetscLogEventEnd, or just the time spent in PETSc APIs?
>>>>>>> 
>>>>>>>    It is all of the time. 
>>>>>>> 
>>>>>>>> ----------------
>>>>>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>>>>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>>>> 64new               1 1.0 2.3775e+02 1.0 0.00e+00 0.0 6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2     0
>>>>>>>> 128new              1 1.0 6.9945e+01 1.0 0.00e+00 0.0 2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2     0
>>>>>>>> 256new              1 1.0 1.7445e+01 1.0 0.00e+00 0.0 9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2     0
>>>>>>>> 
>>>>>>>> 64:
>>>>>>>> only assemble       1 1.0 2.6596e+02 1.0 0.00e+00 0.0 7.0e+03 2.8e+05 1.1e+01 55  0  1  8  3  55  0  1  8  3     0
>>>>>>>> only setvalues      1 1.0 1.9987e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 41  0  0  0  0  41  0  0  0  0     0
>>>>>>>> Test both           1 1.0 4.6580e+02 1.0 0.00e+00 0.0 7.0e+03 2.8e+05 1.5e+01 96  0  1  8  4  96  0  1  8  4     0
>>>>>>>> 
>>>>>>>> 128:
>>>>>>>>  only assemble      1 1.0 6.9718e+01 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.1e+01 30  0  1  4  3  30  0  1  4  3     0
>>>>>>>> only setvalues      1 1.0 1.4438e+02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 60  0  0  0  0  60  0  0  0  0     0
>>>>>>>> Test both           1 1.0 2.1417e+02 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.5e+01 91  0  1  4  4  91  0  1  4  4     0
>>>>>>>> 
>>>>>>>> 256:
>>>>>>>> only assemble       1 1.0 1.7482e+01 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.1e+01 10  0  1  3  3  10  0  1  3  3     0
>>>>>>>> only setvalues      1 1.0 1.3717e+02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  78  0  0  0  0     0
>>>>>>>> Test both           1 1.0 1.5475e+02 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.5e+01 91  0  1  3  4  91  0  1  3  4     0 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Runfeng
>>>>>>>> 
>>>>>>>> Barry Smith <bsmith at petsc.dev> wrote on Fri, Jun 30, 2023 at 23:35:
>>>>>>>>> 
>>>>>>>>>    You cannot look just at the VecAssemblyEnd() time; that will very likely give a wrong impression of the total time it takes to put the values in.
>>>>>>>>> 
>>>>>>>>>    You need to register a new event, put a PetscLogEventBegin() just before you start generating the vector entries and calling VecSetValues(), and put the PetscLogEventEnd() just after the VecAssemblyEnd(); this is the only way to get an accurate accounting of the time.
>>>>>>>>> 
>>>>>>>>>   Barry
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <jsfaraway at gmail.com> wrote:
>>>>>>>>> > 
>>>>>>>>> > Hello!
>>>>>>>>> > 
>>>>>>>>> > When I use PETSc to build an SBAIJ matrix, I notice a strange thing: when I increase the number of processors, the assembly time becomes smaller. All of these runs use exactly the same matrix. The assembly time mainly arises from message passing, because I use a dynamic workload, so it is random which elements are computed by which processor.
>>>>>>>>> > But intuitively, with more processors it becomes more likely that a processor computes elements stored on other processors. Yet from the output of -log_view, it seems that with more processors, the processors compute more elements stored locally (I infer this from the fact that, with more processors, the total amount of passed messages is smaller).
>>>>>>>>> > 
>>>>>>>>> > What could cause this happened? Thank you!
>>>>>>>>> > 
>>>>>>>>> > 
>>>>>>>>> >  Following is the output of -log_view for 64/128/256 processors. Every row is the time profile of VecAssemblyEnd.
>>>>>>>>> > 
>>>>>>>>> > ------------------------------------------------------------------------------------------------------------------------
>>>>>>>>> > processors         Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>>>>>>>>> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>>>>> > 64                  1 1.0 2.3775e+02 1.0 0.00e+00 0.0 6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2     0
>>>>>>>>> > 128                 1 1.0 6.9945e+01 1.0 0.00e+00 0.0 2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2     0
>>>>>>>>> > 256                 1 1.0 1.7445e+01 1.0 0.00e+00 0.0 9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2     0
>>>>>>>>> > 
>>>>>>>>> > Runfeng Jin
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>>> 
>>> -- 
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>> 
>>> https://www.cse.buffalo.edu/~knepley/

