[petsc-dev] MatAssembly, debug, and compile flags

Fri Mar 17 04:04:47 CDT 2017

On Thu, 16 Mar 2017 15:37:17 -0500, Barry Smith wrote:
>> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet 
>> <pierre.jolivet at enseeiht.fr> wrote:
>>
>> Thanks Barry.
>> I actually tried the application myself with my optimized build + 
>> your option. I'm attaching two logs for a strong scaling analysis, if 
>> someone could spend a minute or two looking at the numbers I'd be 
>> really grateful:
>> 1) MatAssembly still takes a rather long time IMHO. This is actually 
>> the bottleneck of my application. On 1600 cores especially, I don't 
>> know whether the huge time (almost a 5x slow-down w.r.t. the run on 
>> 320 cores) is due to MatMPIAIJSetPreallocationCSR (which I had assumed 
>> was a no-op, but which is clearly not the case looking at the run on 
>> 320 cores) or to the option -pc_bjacobi_blocks 320, which also does 
>> one MatAssembly.
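
To put the "almost a 5x slow-down" from item 1 in strong-scaling terms, here is a minimal sketch in C. Only the core counts (320 and 1600) and the approximate 5x ratio come from the message; the function name and the normalized timings in the usage note are illustrative assumptions, not values read from the attached logs.

```c
#include <assert.h>

/* Relative strong-scaling efficiency of a run on p_new cores taking t_new,
 * measured against a reference run on p_ref cores taking t_ref.
 * 1.0 means ideal scaling; values below 1.0 mean lost efficiency. */
double strong_scaling_efficiency(double p_ref, double t_ref,
                                 double p_new, double t_new)
{
    return (p_ref * t_ref) / (p_new * t_new);
}
```

With a normalized reference time of 1.0 on 320 cores and 5.0 on 1600 cores, strong_scaling_efficiency(320, 1.0, 1600, 5.0) evaluates to 0.04: five times the cores taking five times as long retains about 4% of the reference efficiency for that phase.
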
>
>     There is one additional synchronization point in the
> MatAssemblyEnd that has not/cannot be removed. This is the
> construction of the VecScatter; I think that likely explains the huge
> amount of time there.
>
>> 2) The other bottleneck is MatMult, which itself calls VecScatter. 
>> Since the structure of the matrix is rather dense, I'm guessing the 
>> communication pattern should be similar to an all-to-all. After having 
>> a look at the thread "VecScatter scaling problem on KNL", would you 
>> also suggest that I use -vecscatter_alltoall, or do you think this 
>> would not be appropriate for the MatMult?
>
>    Please run with
>
>    -vecscatter_view ::ascii_info
>
>  This will give information about the number and sizes of the messages
> needed in the VecScatter, to help decide what to do next.

Here are two more logs. The first was run with -vecscatter_view 
::ascii_info; I don't really know how to analyze it, though I've 
spotted a couple of negative integers in the data counters (maybe you 
are using long instead of long long?). The second was run with 
-vecscatter_alltoall. That option gives a 2x speed-up for the MatMult, 
and for the PCApply too. The latter is weird to me: there should be no 
global communication with bjacobi, and the diagonal blocks are only of 
size "5 processes", so the speed-up seems rather huge for just doing 
VecScatter to gather and scatter the RHS/solution for all 320 MUMPS 
instances.
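
Negative counters like the ones mentioned above would be consistent with a byte count exceeding the range of a 32-bit signed integer; whether that is what actually happens inside PETSc is not established in this thread. A minimal C sketch of the effect follows. The function names and the message sizes in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum message sizes (in bytes) in a 64-bit accumulator. */
int64_t total_bytes_64(const int64_t *sizes, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += sizes[i];
    return total;
}

/* Value that a non-negative total would show if only its low 32 bits were
 * kept and read back as a two's-complement signed integer. Written with
 * explicit arithmetic so that no undefined signed overflow occurs. */
int32_t as_wrapped_int32(int64_t total)
{
    uint32_t low = (uint32_t)((uint64_t)total & 0xFFFFFFFFu);
    if (low >= 0x80000000u)
        return (int32_t)((int64_t)low - 4294967296LL);
    return (int32_t)low;
}
```

For three hypothetical messages of 1000000000 bytes each, total_bytes_64 returns 3000000000, while as_wrapped_int32 of that total is -1294967296, i.e. a negative "count".
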

Thanks for your help,
Pierre

>   Barry
>
>
>
>
>>
>> Thank you very much,
>> Pierre
>>
>> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
>>> I don't think the lack of the --with-debugging=no is important here.
>>> Though he/she should use --with-debugging=no for production runs.
>>>
>>>   I think the reason for the "funny" numbers is that
>>> MatAssemblyBegin and End in this case have explicit synchronization
>>> points, so some processes are waiting for other processes to get to
>>> the synchronization point. Thus it looks like some processes are
>>> spending a lot of time in the assembly routines when they are not
>>> really; they are just waiting.
>>>
>>>   You can remove the synchronization point by calling
>>>
>>>    MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE); before
>>> calling MatMPIAIJSetPreallocationCSR()
>>>
>>>   Barry
>>>
>>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet 
>>>> <Pierre.Jolivet at enseeiht.fr> wrote:
>>>>
>>>> Hello,
>>>> I have an application with a matrix with lots of nonzero entries 
>>>> (that are perfectly load balanced between processes and rows).
>>>> An end user is currently using a PETSc library compiled with the 
>>>> following flags (among others):
>>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11" 
>>>> --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
>>>> Notice the lack of --with-debugging=no.
>>>> The matrix is assembled using MatMPIAIJSetPreallocationCSR and we 
>>>> end up with something like this in the -log_view output:
>>>> MatAssemblyBegin       2 1.0 1.2520e+00 2602.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0  2     0
>>>> MatAssemblyEnd         2 1.0 4.5104e+01    1.0 0.00e+00 0.0 8.2e+05 3.2e+04 4.6e+01 40  0 14  4  9  40  0 14  4  9     0
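
One way to read the MatAssemblyBegin entry: in -log_view output the time is reported as the maximum over processes followed by the max/min ratio, so the two values here are 1.2520e+00 s and 2602.1. A minimal C sketch (the function name is an illustrative assumption) recovers the fastest process's time from that pair:

```c
#include <assert.h>

/* Given the maximum time over all processes and the max/min ratio,
 * return the minimum time over all processes. */
double min_time_from_ratio(double max_time, double ratio)
{
    return max_time / ratio;
}
```

min_time_from_ratio(1.2520e+00, 2602.1) is roughly 4.8e-04 s, so at least one process spent well under a millisecond in MatAssemblyBegin while another spent about 1.25 s there. A spread that large points to time spent waiting at a synchronization point rather than to uniformly slow code.
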
>>>>
>>>> For reference, here is what the matrix looks like (keep in mind it 
>>>> is well balanced)
>>>> Mat Object:   640 MPI processes
>>>>   type: mpiaij
>>>>   rows=10682560, cols=10682560
>>>>   total: nonzeros=51691212800, allocated nonzeros=51691212800
>>>>   total number of mallocs used during MatSetValues calls =0
>>>>     not using I-node (on process 0) routines
>>>>
>>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to the 
>>>> --with-debugging option on x86 even though the corresponding code is 
>>>> compiled with -O2? That is, should I tell the user to have their 
>>>> PETSc lib recompiled, or would you recommend that I use another 
>>>> routine for assembling such a matrix?
>>>>
>>>> Thanks,
>>>> Pierre
>> <AD-3D-320_7531028.o><AD-3D-1600_7513074.o>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AD-3D-1600_7533982_info.o
Type: application/x-object
Size: 120644 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20170317/33b3407d/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AD-3D-1600_7533637_alltoall.o
Type: application/x-object
Size: 27026 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20170317/33b3407d/attachment-0001.bin>