[petsc-dev] MatAssembly, debug, and compile flags

Stefano Zampini stefano.zampini at gmail.com
Fri Mar 17 14:32:05 CDT 2017


Pierre,

I remember I had a similar problem some years ago when working with
matrices with "process-dense" rows (i.e., when the off-diagonal part is
shared by many processes). I fixed the issue by changing the implementation
of PetscGatherMessageLengths from rendezvous to all-to-all.
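Roughly, the change amounts to something like the sketch below (illustrative
only, not the actual PETSc code): one collective exchanges all the message
lengths at once instead of a point-to-point handshake per neighbor.

  #include <mpi.h>

  /* Sketch: each rank fills sendlens[r] with the number of items it will
     send to rank r; a single MPI_Alltoall then tells every rank how much
     it will receive from everyone, with no rendezvous handshaking. */
  static int ExchangeMessageLengths(MPI_Comm comm, const int sendlens[], int recvlens[])
  {
    return MPI_Alltoall(sendlens, 1, MPI_INT, recvlens, 1, MPI_INT, comm);
  }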

Barry, if you have access to petsc-maint, the title of the thread is
"Problem with PetscGatherMessageLengths".

Hope this helps,
Stefano


2017-03-17 22:21 GMT+03:00 Barry Smith <bsmith at mcs.anl.gov>:

>
> > On Mar 17, 2017, at 4:04 AM, Pierre Jolivet <Pierre.Jolivet at enseeiht.fr>
> wrote:
> >
> > On Thu, 16 Mar 2017 15:37:17 -0500, Barry Smith wrote:
> >>> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet <
> pierre.jolivet at enseeiht.fr> wrote:
> >>>
> >>> Thanks Barry.
> >>> I actually tried the application myself with my optimized build + your
> option. I'm attaching two logs for a strong scaling analysis; if someone
> could spend a minute or two looking at the numbers, I'd be really grateful:
> >>> 1) MatAssembly still takes a rather long time IMHO. This is actually
> the bottleneck of my application. Especially on 1600 cores, the problem
> here is that I don't know if the huge time (almost a 5x slow-down w.r.t.
> the run on 320 cores) is due to MatMPIAIJSetPreallocationCSR (which I
> assumed beforehand was a no-op, but which is clearly not the case looking
> at the run on 320 cores) or to the option -pc_bjacobi_blocks 320, which
> also does one MatAssembly.
> >>
> >>    There is one additional synchronization point in
> >> MatAssemblyEnd that has not been (and cannot be) removed: the
> >> construction of the VecScatter. I think that likely explains the huge
> >> amount of time there.
>
>   This concerns me
>
>   MatAssemblyEnd         2 1.0 7.5767e+01 1.0 0.00e+00 0.0 5.1e+06 9.4e+03
> 1.6e+01 64  0100  8 14  64  0100  8 14     0
>
>    I am thinking this is all the communication needed to set up the
> scatter. Do you have access to any performance profilers like Intel
> speedshop to see what is going on during all this time?
>
>
>    -vecscatter_alltoall uses all-to-all communication in the scatters
> themselves, but it does not use all-to-all when setting up the scatter (that
> is, determining exactly what needs to be scattered each time). I think this
> is the problem. We need to add more scatter setup code to optimize this case.
>
>
>
> >>
> >>> 2) The other bottleneck is MatMult, which itself calls VecScatter.
> Since the structure of the matrix is rather dense, I'm guessing the
> communication pattern should be similar to an all-to-all. After having a
> look at the thread "VecScatter scaling problem on KNL", would you also
> suggest that I use -vecscatter_alltoall, or do you think this would not be
> appropriate for the MatMult?
> >>
> >>   Please run with
> >>
> >>   -vecscatter_view ::ascii_info
> >>
> >> this will give information about the number of messages and sizes
> >> needed in the VecScatter. To help decide what to do next.
> >
> > Here are two more logs: one with -vecscatter_view ::ascii_info, which I
> don't really know how to analyze (I have spotted, though, a couple of
> negative integers for the data counters; maybe you are using long instead
> of long long?), and the other with -vecscatter_alltoall. The latter option
> gives a 2x speed-up for MatMult, and for PCApply too (which is weird to me,
> because there should be no global communication with bjacobi and the
> diagonal blocks only span 5 processes each, so the speed-up seems rather
> huge for just doing the VecScatters that gather and scatter the
> RHS/solution for all 320 MUMPS instances).
>
>   OK, this is good: it confirms that the large amount of communication
> needed in the scatters was a major problem and that using the all-to-all
> helps. This is about all you can do about the scatter time.
>
>
>
>   Barry
>
> >
> > Thanks for your help,
> > Pierre
> >
> >>  Barry
> >>
> >>
> >>
> >>
> >>>
> >>> Thank you very much,
> >>> Pierre
> >>>
> >>> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
> >>>> I don't think the lack of --with-debugging=no is important here,
> >>>> though he/she should use --with-debugging=no for production runs.
> >>>>
> >>>>  I think the reason for the "funny" numbers is that
> >>>> MatAssemblyBegin and End in this case have explicit synchronization
> >>>> points, so some processes are waiting for other processes to reach the
> >>>> synchronization point; it thus looks like some processes are spending a
> >>>> lot of time in the assembly routines when really they are just waiting.
> >>>>
> >>>>  You can remove the synchronization point by calling
> >>>>
> >>>>    MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);
> >>>>
> >>>> before calling MatMPIAIJSetPreallocationCSR().
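> >>>>
> >>>>  A minimal sketch of that call sequence (variable names are
> >>>> illustrative, and the error checking follows the usual PETSc pattern):
> >>>>
> >>>>   #include <petscmat.h>
> >>>>
> >>>>   /* Assemble an MPIAIJ matrix from a local CSR triple (ia, ja, va).
> >>>>      MAT_NO_OFF_PROC_ENTRIES promises that no process generates entries
> >>>>      owned by another process, so the assembly synchronization can be
> >>>>      skipped. */
> >>>>   PetscErrorCode AssembleFromCSR(Mat A, const PetscInt ia[],
> >>>>                                  const PetscInt ja[], const PetscScalar va[])
> >>>>   {
> >>>>     PetscErrorCode ierr;
> >>>>
> >>>>     PetscFunctionBeginUser;
> >>>>     ierr = MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);
> >>>>     ierr = MatMPIAIJSetPreallocationCSR(A, ia, ja, va);CHKERRQ(ierr);
> >>>>     PetscFunctionReturn(0);
> >>>>   }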
> >>>>
> >>>>  Barry
> >>>>
> >>>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet <
> Pierre.Jolivet at enseeiht.fr> wrote:
> >>>>>
> >>>>> Hello,
> >>>>> I have an application with a matrix that has lots of nonzero entries
> (which are perfectly load balanced across processes and rows).
> >>>>> An end user is currently using a PETSc library compiled with the
> following flags (among others):
> >>>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11"
> --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
> >>>>> Notice the lack of --with-debugging=no
> >>>>> The matrix is assembled using MatMPIAIJSetPreallocationCSR and we
> end up with something like that in the -log_view:
> >>>>> MatAssemblyBegin       2 1.0 1.2520e+002602.1 0.00e+00 0.0 0.0e+00
> 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0 2     0
> >>>>> MatAssemblyEnd         2 1.0 4.5104e+01 1.0 0.00e+00 0.0 8.2e+05
> 3.2e+04 4.6e+01 40  0 14  4  9  40  0 14  4  9 0
> >>>>>
> >>>>> For reference, here is what the matrix looks like (keep in mind it
> is well balanced)
> >>>>> Mat Object:   640 MPI processes
> >>>>>  type: mpiaij
> >>>>>  rows=10682560, cols=10682560
> >>>>>  total: nonzeros=51691212800, allocated nonzeros=51691212800
> >>>>>  total number of mallocs used during MatSetValues calls =0
> >>>>>    not using I-node (on process 0) routines
> >>>>>
> >>>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to the
> --with-debugging option on x86 even though the corresponding code is
> compiled with -O2? I.e., should I tell the user to have their PETSc library
> recompiled, or would you recommend another routine for assembling such a
> matrix?
> >>>>>
> >>>>> Thanks,
> >>>>> Pierre
> >>> <AD-3D-320_7531028.o><AD-3D-1600_7513074.o>
> > <AD-3D-1600_7533982_info.o><AD-3D-1600_7533637_alltoall.o>
>
>


-- 
Stefano