[petsc-dev] MatAssembly, debug, and compile flags
Barry Smith
bsmith at mcs.anl.gov
Thu Mar 16 15:37:17 CDT 2017
> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet <pierre.jolivet at enseeiht.fr> wrote:
>
> Thanks Barry.
> I actually tried the application myself with my optimized build + your option. I'm attaching two logs for a strong scaling analysis, if someone could spend a minute or two looking at the numbers I'd be really grateful:
> 1) MatAssembly still takes a rather long time IMHO. This is actually the bottleneck of my application. On 1600 cores especially (almost a 5x slow-down w.r.t. the run on 320 cores), I don't know whether the huge time is due to MatMPIAIJSetPreallocationCSR (which I had assumed was a no-op, but which clearly is not, judging from the run on 320 cores) or to the option -pc_bjacobi_blocks 320, which also does one MatAssembly.
There is one additional synchronization point in MatAssemblyEnd that has not been, and cannot be, removed: the construction of the VecScatter. I think that likely explains the huge amount of time there.
> 2) The other bottleneck is MatMult, which itself calls VecScatter. Since the structure of the matrix is rather dense, I'm guessing the communication pattern should be similar to an all-to-all. After having a look at the thread "VecScatter scaling problem on KNL", would you also suggest that I use -vecscatter_alltoall, or do you think this would not be appropriate for the MatMult?
Please run with
-vecscatter_view ::ascii_info
This will report the number and sizes of the messages needed in the VecScatter, which will help decide what to do next.
Barry
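To make concrete what the message count reported by that option means, here is a minimal sketch in plain C (not PETSc code). It assumes a row-block distribution in which rank r owns the global indices [ranges[r], ranges[r+1]), and it counts how many other ranks own vector entries referenced by the local rows, i.e. how many ranks this process would have to receive from during MatMult. The names owner_of and count_source_ranks and the use of long for indices are hypothetical, chosen only for illustration; the actual VecScatter setup in PETSc is more involved.

```c
#include <assert.h>

/* Hypothetical helper, not PETSc code: return the rank that owns global
   index col, where rank r owns [ranges[r], ranges[r+1]). */
int owner_of(const long *ranges, int nranks, long col)
{
    assert(nranks > 0);
    assert(col >= ranges[0] && col < ranges[nranks]);
    int lo = 0, hi = nranks;              /* the owner lies in [lo, hi) */
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (col >= ranges[mid]) lo = mid; else hi = mid;
    }
    return lo;
}

/* Count the distinct remote ranks whose vector entries are referenced by
   the local rows, given the local CSR arrays ia (row offsets) and ja
   (global column indices). On return, needed[r] is 1 for each remote rank
   this process must receive from, and 0 otherwise. */
int count_source_ranks(const long *ia, const long *ja, long nrows,
                       const long *ranges, int nranks, int myrank,
                       int *needed)
{
    int count = 0;
    for (int r = 0; r < nranks; ++r) needed[r] = 0;
    for (long k = ia[0]; k < ia[nrows]; ++k) {
        int r = owner_of(ranges, nranks, ja[k]);
        if (r != myrank && !needed[r]) {
            needed[r] = 1;
            ++count;
        }
    }
    return count;
}
```

For a nearly dense matrix this count approaches nranks - 1 on every process, which is the all-to-all-like pattern raised in point 2.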
>
> Thank you very much,
> Pierre
>
> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
>> I don't think the lack of the --with-debugging=no is important here.
>> Though he/she should use --with-debugging=no for production runs.
>>
>> I think the reason for the "funny" numbers is that
>> MatAssemblyBegin and End in this case have explicit synchronization
>> points, so some processes are waiting for other processes to get to
>> the synchronization point. Thus it looks like some processes are
>> spending a lot of time in the assembly routines when they are not
>> really; they are just waiting.
>>
>> You can remove the synchronization point by calling
>>
>> MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE); before
>> calling MatMPIAIJSetPreallocationCSR()
>>
>> Barry
>>
>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet <Pierre.Jolivet at enseeiht.fr> wrote:
>>>
>>> Hello,
>>> I have an application with a matrix with lots of nonzero entries (that are perfectly load balanced between processes and rows).
>>> An end user is currently using a PETSc library compiled with the following flags (among others):
>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11" --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
>>> Notice the lack of --with-debugging=no
>>> The matrix is assembled using MatMPIAIJSetPreallocationCSR and we end up with something like this in the -log_view output:
>>> MatAssemblyBegin 2 1.0 1.2520e+00 2602.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 0
>>> MatAssemblyEnd 2 1.0 4.5104e+01 1.0 0.00e+00 0.0 8.2e+05 3.2e+04 4.6e+01 40 0 14 4 9 40 0 14 4 9 0
>>>
>>> For reference, here is what the matrix looks like (keep in mind it is well balanced)
>>> Mat Object: 640 MPI processes
>>> type: mpiaij
>>> rows=10682560, cols=10682560
>>> total: nonzeros=51691212800, allocated nonzeros=51691212800
>>> total number of mallocs used during MatSetValues calls =0
>>> not using I-node (on process 0) routines
>>>
>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to the --with-debugging option on x86 even though the corresponding code is compiled with -O2? In other words, should I tell the user to have their PETSc library recompiled, or would you recommend that I use another routine for assembling such a matrix?
>>>
>>> Thanks,
>>> Pierre
> <AD-3D-320_7531028.o><AD-3D-1600_7513074.o>
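As an illustration of the waiting effect described in the quoted message of 6 March, here is a minimal sketch in plain C (not PETSc code). It models an event with a synchronization point: every process waits until the last one arrives and then does a small amount of real work. The ratio column that follows the time in -log_view is the maximum over the minimum time across processes, so a single late process is enough to produce a very large ratio. The function names and the arrival times in the usage note are made up for illustration; they are not taken from the attached logs.

```c
#include <assert.h>

/* Time each process spends inside a synchronizing event: it waits until
   the last process has arrived, then does `work` seconds of real work. */
void time_in_event(const double *arrival, int n, double work,
                   double *in_event)
{
    assert(n > 0);
    double last = arrival[0];
    for (int p = 1; p < n; ++p)
        if (arrival[p] > last) last = arrival[p];
    for (int p = 0; p < n; ++p)
        in_event[p] = (last - arrival[p]) + work;
}

/* Maximum over minimum of the per-process times. */
double max_min_ratio(const double *t, int n)
{
    assert(n > 0);
    double tmin = t[0], tmax = t[0];
    for (int p = 1; p < n; ++p) {
        if (t[p] < tmin) tmin = t[p];
        if (t[p] > tmax) tmax = t[p];
    }
    assert(tmin > 0.0);
    return tmax / tmin;
}
```

With made-up arrival times of 0.0, 1.0, 1.0 and 1.25 seconds and 0.0005 seconds of real work, the first process is charged about 1.25 seconds and the last one 0.0005 seconds, so the ratio is in the thousands even though nobody does more than 0.0005 seconds of actual assembly work.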