[petsc-users] Smaller assemble time with increasing processors
Runfeng Jin
jsfaraway at gmail.com
Mon Jul 3 09:05:25 CDT 2023
Thank you for all your help!
Runfeng
Matthew Knepley <knepley at gmail.com> wrote on Mon, Jul 3, 2023 at 22:03:
> On Mon, Jul 3, 2023 at 9:56 AM Runfeng Jin <jsfaraway at gmail.com> wrote:
>
>> Hi, impressive performance!
>> I use the newest version of PETSc (release branch), and it almost
>> eliminates all of the assembly and stash time at large processor counts
>> (assembly time 64-4s/128-2s/256-0.2s, stash time all below 2s). For zero
>> programming cost, that is really incredible.
>> The older code has a regular arrangement of the number of nonzero
>> elements across rows, so I can do a good rough preallocation. From the
>> data, carefully arranging the data and roughly estimating the maximum
>> number of nonzeros per row still performs better than the new version
>> without preallocation. However, in practice I will use the newer version
>> without preallocation because: 1) it takes less programming effort and
>> gives nearly the same performance; 2) memory usage is good (I see no
>> unneeded memory after assembly); 3) a dedicated preallocation is usually
>> not easy and adds extra time cost.
>> Maybe it would be better to leave some room for the user to give a
>> slight hint for the preallocation and thus get better performance, but I
>> have no idea what such a hint should look like.
>> I am also very curious how PETSc achieves this. How can it know nothing
>> in advance and still get such good performance with no wasted memory?
>> Could you explain this?
>>
>
> We use a hash table to store the nonzeros on the fly, and then convert to
> packed storage on assembly.
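>
> Purely for illustration, here is a toy sketch of that hash-then-pack idea
> in plain C. This is NOT PETSc's actual implementation; the types, names,
> and fixed-size table are made up, and real code would grow the table and
> handle off-process entries.
>
> #include <stdio.h>
> #include <stdlib.h>
>
> /* one stored nonzero: key = row*ncols + col */
> typedef struct { long key; double val; int used; } Entry;
> typedef struct { Entry *slots; size_t cap, n; long ncols; } HashMat;
>
> static size_t probe(const HashMat *h, long key)
> {
>   size_t i = (size_t)key % h->cap;
>   while (h->slots[i].used && h->slots[i].key != key) i = (i + 1) % h->cap;
>   return i;                                  /* linear probing */
> }
>
> /* accumulate a value; no preallocation information is needed */
> static void set_value(HashMat *h, long row, long col, double v)
> {
>   long   key = row * h->ncols + col;
>   size_t i   = probe(h, key);
>   if (!h->slots[i].used) {
>     h->slots[i].used = 1; h->slots[i].key = key; h->slots[i].val = 0.0; h->n++;
>   }
>   h->slots[i].val += v;                      /* duplicate entries are summed */
> }
>
> static int cmp_entry(const void *a, const void *b)
> {
>   long ka = ((const Entry *)a)->key, kb = ((const Entry *)b)->key;
>   return (ka > kb) - (ka < kb);
> }
>
> int main(void)
> {
>   HashMat h = { calloc(64, sizeof(Entry)), 64, 0, 4 };  /* toy 4-column matrix */
>
>   set_value(&h, 0, 0, 1.0);
>   set_value(&h, 2, 3, 5.0);
>   set_value(&h, 0, 0, 2.0);                  /* summed with the first (0,0) entry */
>
>   /* "assembly": gather used slots, sort by (row,col), emit packed triplets */
>   Entry *packed = malloc(h.n * sizeof(Entry));
>   size_t m = 0;
>   for (size_t i = 0; i < h.cap; i++) if (h.slots[i].used) packed[m++] = h.slots[i];
>   qsort(packed, m, sizeof(Entry), cmp_entry);
>   for (size_t i = 0; i < m; i++)
>     printf("(%ld,%ld) = %g\n", packed[i].key / h.ncols, packed[i].key % h.ncols, packed[i].val);
>
>   free(packed); free(h.slots);
>   return 0;
> }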
>
> Thanks,
>
> Matt
>
>
>> assemble time:
>> version\processors      4       8      16     32    64    128    256
>> old                 14677s   4694s   1124s   572s   38s     8s     2s
>> new                    50s     28s     15s   7.8s    4s     2s   0.4s
>> older                  27s     24s     19s    12s   14s      -      -
>>
>> stash time (max among all processors):
>> version\processors      4       8      16     32    64    128    256
>> old                  3145s   2554s    673s   329s  201s   142s   138s
>> new                     2s      1s     ~0s    ~0s   ~0s    ~0s    ~0s
>> older                  10s     73s     18s     5s    1s      -      -
>> old: my poor preallocation
>> new: the newest version of PETSc, with no preallocation at all
>> older: the best-preallocated version of my code
>>
>>
>> Runfeng
>>
>> Barry Smith <bsmith at petsc.dev> wrote on Mon, Jul 3, 2023 at 12:19:
>>
>>>
>>> The main branch of PETSc now supports filling sparse matrices without
>>> providing any preallocation information.
>>>
>>> You can give it a try. Use your current fastest code but just remove
>>> ALL the preallocation calls. I would be interested in what kind of
>>> performance you get compared to your best current performance.
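>>>
>>> For reference, a rough sketch of what that no-preallocation call
>>> sequence could look like (untested; the matrix type and global size are
>>> just taken from this thread, and the value-setting loop is left as a
>>> comment):
>>>
>>> #include <petscmat.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>   Mat      A;
>>>   PetscInt rstart, rend, N = 283234;       /* global size from this thread */
>>>
>>>   PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
>>>   PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
>>>   PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N));
>>>   PetscCall(MatSetType(A, MATMPISBAIJ));   /* or MATMPIAIJ */
>>>   PetscCall(MatSetFromOptions(A));
>>>   PetscCall(MatSetUp(A));                  /* note: no preallocation call anywhere */
>>>   PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
>>>   for (PetscInt i = rstart; i < rend; i++) {
>>>     /* the application would call MatSetValues(A, 1, &i, ncols, cols, vals,
>>>        INSERT_VALUES) here, possibly several times per row (tiling) */
>>>   }
>>>   PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
>>>   PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
>>>   PetscCall(MatDestroy(&A));
>>>   PetscCall(PetscFinalize());
>>>   return 0;
>>> }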
>>>
>>> Barry
>>>
>>>
>>>
>>> On Jul 2, 2023, at 11:24 PM, Runfeng Jin <jsfaraway at gmail.com> wrote:
>>>
>>> Hi! Good advice!
>>> I set values with the MatSetValues() API, which sets one part of a row
>>> at a time (I use a kind of tiling technique, so I cannot get all values
>>> of a row at once).
>>> I tested the number of mallocs in these three cases. The number of
>>> mallocs decreases as the number of processors increases, and all of them
>>> are very large (the matrix is 283234 x 283234, as can be seen below).
>>> This should be due to the poor preallocation. I use a rough
>>> preallocation: every processor counts the number of nonzero elements in
>>> its first 10 rows and uses the largest count to preallocate memory for
>>> all local rows. That does not seem to work well.
>>>
>>> processors   max mallocs among all processors
>>> 64           20000
>>> 128          17000
>>> 256          11000
>>>
>>> I changed the way I preallocate: I sample 100 evenly spaced rows in
>>> every local matrix and use the largest count to preallocate memory for
>>> all local rows (a rough sketch of this sampling is shown after the
>>> numbers below). Now the assemble time is reduced to a very small value.
>>>
>>> processors   max mallocs among all processors
>>> 64           3000
>>> 128          700
>>> 256          500
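>>>
>>> The sampling mentioned above looks roughly like the following untested
>>> sketch (count_row_nnz() is a hypothetical application routine that
>>> returns the number of nonzeros of one global row, using a single bound
>>> for both the diagonal and off-diagonal blocks is deliberately crude, and
>>> A is assumed to already have its type and sizes set):
>>>
>>> PetscInt rstart, rend, nloc, stride, max_nnz = 0;
>>>
>>> PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
>>> nloc   = rend - rstart;
>>> stride = (nloc > 100) ? nloc / 100 : 1;    /* ~100 evenly spaced sample rows */
>>> for (PetscInt i = rstart; i < rend; i += stride) {
>>>   PetscInt nnz = count_row_nnz(i);         /* hypothetical helper */
>>>   if (nnz > max_nnz) max_nnz = nnz;
>>> }
>>> /* preallocate every local row with the sampled maximum (block size 1) */
>>> PetscCall(MatMPISBAIJSetPreallocation(A, 1, max_nnz, NULL, max_nnz, NULL));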
>>>
>>> Event   Count      Time (sec)       Flop                             --- Global ---   --- Stage ----   Total
>>>         Max Ratio  Max        Ratio Max       Ratio Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>> 64      1   1.0    3.8999e+01 1.0   0.00e+00  0.0   7.1e+03 2.9e+05 1.1e+01 15  0  1  8  3  15  0  1  8  3       0
>>> 128     1   1.0    8.5714e+00 1.0   0.00e+00  0.0   2.6e+04 8.1e+04 1.1e+01  5  0  1  4  3   5  0  1  4  3       0
>>> 256     1   1.0    2.5512e+00 1.0   0.00e+00  0.0   1.0e+05 2.3e+04 1.1e+01  2  0  1  3  3   2  0  1  3  3       0
>>>
>>> So could the reason why the assemble time is smaller with an increasing
>>> number of processors be that more processors share the malloc work, so
>>> the total time is reduced?
>>> If so, I still have some questions:
>>> 1. If the preallocation is not accurate, will assembly performance be
>>> affected? That is, when a processor receives, via MPI, elements that
>>> should be stored locally, do the new mallocs happen at that point?
>>> 2. I cannot afford an accurate preallocation because of its cost, so is
>>> there a better way to preallocate in my situation?
>>>
>>>
>>>
>>> Barry Smith <bsmith at petsc.dev> wrote on Sun, Jul 2, 2023 at 00:16:
>>>
>>>>
>>>> I see no reason not to trust the times below; they seem reasonable.
>>>> You get more than 2 times the speed from 64 to 128 and then about 1.38
>>>> times from 128 to 256.
>>>>
>>>> The total amount of data moved (number of messages times average
>>>> length) goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.1060e+09 to
>>>> 2.3000e+09. That is a pretty moderate increase in data, but note that
>>>> each time you double the number of ranks you also substantially
>>>> increase the network hardware available to move data, so one would hope
>>>> for a good speedup.
>>>>
>>>> Also, the load balance is very good, near 1. With assembly we often
>>>> see very unbalanced times, and it is difficult to get good speedup when
>>>> the balance is really off.
>>>>
>>>> It looks like over 90% of the entire run time is coming from setting
>>>> and assembling the values? Also, the set-values time dominates the
>>>> assembly time more as the number of ranks grows. Are you setting a
>>>> single value at a time or a collection of them? How big are the vectors?
>>>>
>>>> Run all three cases with -info :vec to see some information about how
>>>> many mallocs were needed to hold the stashed vector entries.
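>>>>
>>>> For example (the executable name and rank count here are just
>>>> placeholders):
>>>>
>>>> mpiexec -n 64 ./your_app -info :vec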
>>>>
>>>>
>>>>
>>>>
>>>> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <jsfaraway at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Hi,
>>>> Thanks for your reply. I tried PetscLogEvent(), and the result shows
>>>> the same conclusion.
>>>> What I have done is:
>>>> ----------------
>>>> PetscLogEvent Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
>>>> PetscClassId classid;
>>>> PetscLogDouble user_event_flops;
>>>> PetscClassIdRegister("Test assemble and set value", &classid);
>>>> PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
>>>> PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
>>>> PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
>>>> PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>>>> PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>>>> /* ... compute elements and call MatSetValues(); no assembly call here ... */
>>>> PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>>>>
>>>> PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>>>> MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>>> MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>>> PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>>>> PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
>>>> ----------------
>>>>
>>>> The output is as follows. By the way, does PETSc record all of the
>>>> time between PetscLogEventBegin and PetscLogEventEnd, or only the time
>>>> spent inside PETSc API calls?
>>>>
>>>>
>>>> It is all of the time.
>>>>
>>>> ----------------
>>>> Event            Count      Time (sec)       Flop                             --- Global ---   --- Stage ----   Total
>>>>                  Max Ratio  Max        Ratio Max       Ratio Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>>> 64new            1   1.0    2.3775e+02 1.0   0.00e+00  0.0   6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2       0
>>>> 128new           1   1.0    6.9945e+01 1.0   0.00e+00  0.0   2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2       0
>>>> 256new           1   1.0    1.7445e+01 1.0   0.00e+00  0.0   9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2       0
>>>>
>>>> 64:
>>>> only assemble    1   1.0    2.6596e+02 1.0   0.00e+00  0.0   7.0e+03 2.8e+05 1.1e+01 55  0  1  8  3  55  0  1  8  3       0
>>>> only setvalues   1   1.0    1.9987e+02 1.0   0.00e+00  0.0   0.0e+00 0.0e+00 0.0e+00 41  0  0  0  0  41  0  0  0  0       0
>>>> Test both        1   1.0    4.6580e+02 1.0   0.00e+00  0.0   7.0e+03 2.8e+05 1.5e+01 96  0  1  8  4  96  0  1  8  4       0
>>>>
>>>> 128:
>>>> only assemble    1   1.0    6.9718e+01 1.0   0.00e+00  0.0   2.6e+04 8.1e+04 1.1e+01 30  0  1  4  3  30  0  1  4  3       0
>>>> only setvalues   1   1.0    1.4438e+02 1.1   0.00e+00  0.0   0.0e+00 0.0e+00 0.0e+00 60  0  0  0  0  60  0  0  0  0       0
>>>> Test both        1   1.0    2.1417e+02 1.0   0.00e+00  0.0   2.6e+04 8.1e+04 1.5e+01 91  0  1  4  4  91  0  1  4  4       0
>>>>
>>>> 256:
>>>> only assemble    1   1.0    1.7482e+01 1.0   0.00e+00  0.0   1.0e+05 2.3e+04 1.1e+01 10  0  1  3  3  10  0  1  3  3       0
>>>> only setvalues   1   1.0    1.3717e+02 1.1   0.00e+00  0.0   0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  78  0  0  0  0       0
>>>> Test both        1   1.0    1.5475e+02 1.0   0.00e+00  0.0   1.0e+05 2.3e+04 1.5e+01 91  0  1  3  4  91  0  1  3  4       0
>>>>
>>>>
>>>>
>>>> Runfeng
>>>>
>>>> Barry Smith <bsmith at petsc.dev> wrote on Fri, Jun 30, 2023 at 23:35:
>>>>
>>>>>
>>>>> You cannot look just at the VecAssemblyEnd() time; that will very
>>>>> likely give the wrong impression of the total time it takes to put the
>>>>> values in.
>>>>>
>>>>> You need to register a new event, put a PetscLogEventBegin() just
>>>>> before you start generating the vector entries and calling
>>>>> VecSetValues(), and put the PetscLogEventEnd() just after the
>>>>> VecAssemblyEnd(); this is the only way to get an accurate accounting of
>>>>> the time.
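>>>>>
>>>>> A minimal sketch of that pattern for a vector v (untested; error
>>>>> handling and the entry-generation loop are omitted):
>>>>>
>>>>> PetscLogEvent fill_event;
>>>>> PetscCall(PetscLogEventRegister("Fill+Assemble", VEC_CLASSID, &fill_event));
>>>>> PetscCall(PetscLogEventBegin(fill_event, 0, 0, 0, 0));
>>>>> /* ... generate entries and call VecSetValues(v, ...) here ... */
>>>>> PetscCall(VecAssemblyBegin(v));
>>>>> PetscCall(VecAssemblyEnd(v));
>>>>> PetscCall(PetscLogEventEnd(fill_event, 0, 0, 0, 0));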
>>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <jsfaraway at gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Hello!
>>>>> >
>>>>> > When I use PETSc to build an sbaij matrix, I notice something
>>>>> strange. When I increase the number of processors, the assemble time
>>>>> becomes smaller, even though it is exactly the same matrix in every
>>>>> case. The assemble time mainly arises from message passing, because I
>>>>> use a dynamic workload, so it is random which elements are computed by
>>>>> which processor.
>>>>> > Intuitively, with more processors it should be more likely that a
>>>>> processor computes elements that are stored on other processors. But
>>>>> from the log_view output, it seems that with more processors each
>>>>> processor computes more of the elements stored locally (inferred from
>>>>> the fact that, with more processors, the total amount of passed
>>>>> messages is smaller).
>>>>> >
>>>>> > What could cause this? Thank you!
>>>>> >
>>>>> >
>>>>> > The following is the log_view output for 64/128/256 processors.
>>>>> Every row is the time profile of VecAssemblyEnd.
>>>>> >
>>>>> >
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> > processors  Count      Time (sec)       Flop                             --- Global ---   --- Stage ----   Total
>>>>> >             Max Ratio  Max        Ratio Max       Ratio Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>>>> > 64          1   1.0    2.3775e+02 1.0   0.00e+00  0.0   6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2       0
>>>>> > 128         1   1.0    6.9945e+01 1.0   0.00e+00  0.0   2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2       0
>>>>> > 256         1   1.0    1.7445e+01 1.0   0.00e+00  0.0   9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2       0
>>>>> >
>>>>> > Runfeng Jin
>>>>>
>>>>>
>>>>
>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>