[petsc-dev] MatAssembly, debug, and compile flags

Thu Mar 16 15:37:17 CDT 2017

> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet <pierre.jolivet at enseeiht.fr> wrote:
> 
> Thanks Barry.
> I actually tried the application myself with my optimized build + your option. I'm attaching two logs for a strong scaling analysis, if someone could spend a minute or two looking at the numbers I'd be really grateful:
> 1) MatAssembly still takes a rather long time IMHO. This is actually the bottleneck of my application, especially on 1600 cores. The problem is that I don't know whether the huge time (almost a 5x slow-down w.r.t. the run on 320 cores) is due to MatMPIAIJSetPreallocationCSR (which I had assumed was a no-op, but which is clearly not the case looking at the run on 320 cores) or to the option -pc_bjacobi_blocks 320, which also does one MatAssembly.

    There is one additional synchronization point in the MatAssemblyEnd that has not/cannot be removed. This is the construction of the VecScatter; I think that likely explains the huge amount of time there.

> 2) The other bottleneck is MatMult, which itself calls VecScatter. Since the structure of the matrix is rather dense, I'm guessing the communication pattern should be similar to an all-to-all. After having a look at the thread "VecScatter scaling problem on KNL", would you also suggest that I use -vecscatter_alltoall, or do you think this would not be appropriate for the MatMult?

   Please run with 

   -vecscatter_view ::ascii_info

 This will report the number and sizes of the messages needed in the VecScatter, which will help decide what to do next.

  Barry

> 
> Thank you very much,
> Pierre
> 
> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
>> I don't think the lack of the --with-debugging=no is important here.
>> Though he/she should use --with-debugging=no for production runs.
>> 
>>   I think the reason for the "funny" numbers is that
>> MatAssemblyBegin and End in this case have explicit synchronization
>> points, so some processes wait for the others to reach them. It
>> therefore looks as if some processes spend a lot of time in the
>> assembly routines when they are really just waiting.
>> 
>>   You can remove the synchronization point by calling
>> 
>>    MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE); before
>> calling MatMPIAIJSetPreallocationCSR()
>> 
>>   Barry
>> 
>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet <Pierre.Jolivet at enseeiht.fr> wrote:
>>> 
>>> Hello,
>>> I have an application with a matrix with lots of nonzero entries (that are perfectly load balanced between processes and rows).
>>> An end user is currently using a PETSc library compiled with the following flags (among others):
>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11" --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
>>> Notice the lack of --with-debugging=no
>>> The matrix is assembled using MatMPIAIJSetPreallocationCSR and we end up with something like that in the -log_view:
>>> MatAssemblyBegin       2 1.0 1.2520e+00 2602.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0  2     0
>>> MatAssemblyEnd         2 1.0 4.5104e+01 1.0 0.00e+00 0.0 8.2e+05 3.2e+04 4.6e+01 40  0 14  4  9  40  0 14  4  9     0
>>> 
>>> For reference, here is what the matrix looks like (keep in mind it is well balanced)
>>> Mat Object:   640 MPI processes
>>>   type: mpiaij
>>>   rows=10682560, cols=10682560
>>>   total: nonzeros=51691212800, allocated nonzeros=51691212800
>>>   total number of mallocs used during MatSetValues calls =0
>>>     not using I-node (on process 0) routines
>>> 
>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to the --with-debugging option on x86 even though the corresponding code is compiled with -O2? In other words, should I tell the user to have their PETSc lib recompiled, or would you recommend that I use another routine for assembling such a matrix?
>>> 
>>> Thanks,
>>> Pierre
> <AD-3D-320_7531028.o><AD-3D-1600_7513074.o>