[petsc-dev] MatAssembly, debug, and compile flags

Thu Mar 16 10:57:10 CDT 2017

Thanks Barry.
I actually tried the application myself with my optimized build + your
option. I'm attaching two logs for a strong scaling analysis. If someone
could spend a minute or two looking at the numbers, I'd be really
grateful:
1) MatAssembly still takes a rather long time IMHO. This is actually
the bottleneck of my application, especially on 1600 cores. The problem
is that I don't know whether the huge time (almost a 5x slow-down w.r.t.
the run on 320 cores) is due to MatMPIAIJSetPreallocationCSR (which I
had assumed was a no-op, but the run on 320 cores clearly shows it is
not) or to the option -pc_bjacobi_blocks 320, which also does one
MatAssembly.
2) The other bottleneck is MatMult, which itself calls VecScatter.
Since the structure of the matrix is rather dense, I'm guessing the
communication pattern should be similar to an all-to-all. Having had a
look at the thread "VecScatter scaling problem on KNL", would you also
suggest that I use -vecscatter_alltoall, or do you think this would not
be appropriate for MatMult?
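
To put a rough number on "rather dense", the figures from the Mat view
quoted further down (rows, cols, total nonzeros, 640 MPI processes) can
be turned into averages. This is only a sketch in plain C: the helper
names are made up for illustration and are not PETSc functions, and the
averages alone say nothing definite about which processes actually
exchange data in MatMult.

```c
#include <stdio.h>

/* Hypothetical helpers for illustration; they are not PETSc functions. */

/* Average number of stored nonzeros per matrix row. */
double avg_nnz_per_row(long long nnz, long long rows)
{
  return (double)nnz / (double)rows;
}

/* Average share of a global count (rows or nonzeros) per MPI process. */
double avg_per_process(long long count, int nprocs)
{
  return (double)count / (double)nprocs;
}

/* Fraction of the rows x cols entries that are actually stored. */
double fill_fraction(long long nnz, long long rows, long long cols)
{
  return (double)nnz / ((double)rows * (double)cols);
}

/* Print the averages for one matrix distributed over nprocs processes. */
void print_density_summary(long long rows, long long cols,
                           long long nnz, int nprocs)
{
  printf("nonzeros per row (avg):     %.2f\n", avg_nnz_per_row(nnz, rows));
  printf("rows per process (avg):     %.1f\n", avg_per_process(rows, nprocs));
  printf("nonzeros per process (avg): %.0f\n", avg_per_process(nnz, nprocs));
  printf("fill fraction:              %.3e\n", fill_fraction(nnz, rows, cols));
}
```

Calling print_density_summary(10682560LL, 10682560LL, 51691212800LL, 640)
from a main() gives roughly 4839 nonzeros per row and, on average,
16691.5 rows and 80767520 nonzeros per process, for a fill fraction of
about 4.5e-4.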

Thank you very much,
Pierre

On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
> I don't think the lack of --with-debugging=no is important here,
> though he/she should use --with-debugging=no for production runs.
>
>    I think the reason for the "funny" numbers is that
> MatAssemblyBegin and End in this case have explicit synchronization
> points, so some processes are waiting for other processes to get to
> the synchronization point. Thus it looks like some processes are
> spending a lot of time in the assembly routines when they are not
> really; they are just waiting.
>
>    You can remove the synchronization point by calling
>
>     MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);
>
> before calling MatMPIAIJSetPreallocationCSR()
>
>    Barry
>
>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet 
>> <Pierre.Jolivet at enseeiht.fr> wrote:
>>
>> Hello,
>> I have an application with a matrix with lots of nonzero entries 
>> (that are perfectly load balanced between processes and rows).
>> An end user is currently using a PETSc library compiled with the 
>> following flags (among others):
>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11" 
>> --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
>> Notice the lack of --with-debugging=no.
>> The matrix is assembled using MatMPIAIJSetPreallocationCSR and we 
>> end up with something like this in the -log_view output:
>> MatAssemblyBegin       2 1.0 1.2520e+00 2602.1 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0  2     0
>> MatAssemblyEnd         2 1.0 4.5104e+01 1.0 0.00e+00 0.0 8.2e+05 
>> 3.2e+04 4.6e+01 40  0 14  4  9  40  0 14  4  9     0
>>
>> For reference, here is what the matrix looks like (keep in mind it 
>> is well balanced):
>>  Mat Object:   640 MPI processes
>>    type: mpiaij
>>    rows=10682560, cols=10682560
>>    total: nonzeros=51691212800, allocated nonzeros=51691212800
>>    total number of mallocs used during MatSetValues calls =0
>>      not using I-node (on process 0) routines
>>
>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to the 
>> --with-debugging option on x86 even though the corresponding code is 
>> compiled with -O2? That is, should I tell the user to have their 
>> PETSc library recompiled, or would you recommend that I use another 
>> routine for assembling such a matrix?
>>
>> Thanks,
>> Pierre
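
The ordering Barry describes above (MatSetOption with
MAT_NO_OFF_PROC_ENTRIES before MatMPIAIJSetPreallocationCSR) could look
roughly like the sketch below. The matrix itself is an assumption made
for illustration only: each process owns two rows of a global identity
matrix, which has nothing to do with the application discussed in the
thread. The ierr/CHKERRQ error-checking style is the one in use around
the time of this thread; more recent PETSc releases favour PetscCall().

```c
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscErrorCode ierr;
  PetscMPIInt    rank;
  const PetscInt nlocal = 2;  /* illustrative: two locally owned rows */
  PetscInt       ia[3], ja[2];
  PetscScalar    va[2];

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

  /* Local CSR with one entry per row, on the diagonal.
     Row pointers are local; column indices are global. */
  ia[0] = 0; ia[1] = 1; ia[2] = 2;
  ja[0] = rank * nlocal;
  ja[1] = rank * nlocal + 1;
  va[0] = 1.0; va[1] = 1.0;

  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetSizes(A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE); CHKERRQ(ierr);
  ierr = MatSetType(A, MATMPIAIJ); CHKERRQ(ierr);

  /* Every process only provides entries for rows it owns, so tell PETSc
     that no off-process entries will be generated. This has to be set
     before the CSR routine, which also assembles the matrix. */
  ierr = MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE); CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocationCSR(A, ia, ja, va); CHKERRQ(ierr);

  ierr = MatView(A, PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr);
  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
```

The option is only valid when each process really does set entries in
its own rows exclusively, which is the case by construction when the
matrix is passed in as local CSR arrays.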
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AD-3D-320_7531028.o
Type: application/x-object
Size: 26023 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20170316/e31f04d8/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AD-3D-1600_7513074.o
Type: application/x-object
Size: 27004 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20170316/e31f04d8/attachment-0001.bin>