[petsc-dev] MatAssembly, debug, and compile flags

Pierre Jolivet Pierre.Jolivet at enseeiht.fr
Mon Mar 20 09:22:08 CDT 2017


Hello Barry,
It looks like my vendor mpirun does not support OpenSpeedShop, and I
have been too lazy to recompile everything with IntelMPI.
However, I did some really basic profiling and it looks like you were
right: a lot of time is spent in VecScatterCreate_PtoS.
I switched to an MPI_Alltoallv implementation and here is the new
summary.

MatAssemblyEnd         2 1.0 4.3129e+01 1.0 0.00e+00 0.0 0.0e+00 
0.0e+00 1.8e+01 51  0  0  4 15  51  0  0  4 15     0

That's roughly 30 seconds faster, but I still find it rather slow.
I'll now try an MPI_Alltoall implementation with padding, because I know
for a fact that BullxMPI performs much worse for variable-sized
collectives than for uniform ones (and all my local dimensions are
almost the same, so the memory cost of padding will be negligible).
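
Roughly, I am thinking of something along these lines (just a sketch, not
tested; the buffer names are placeholders, and the receiver would still
need the true per-rank counts, exchanged beforehand, to ignore the padding):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Pad every block to a common size so that MPI_Alltoallv can be replaced
   by a uniform MPI_Alltoall. */
static int PaddedAlltoall(MPI_Comm comm,const int *sendcounts,const int *senddispls,
                          const double *sendbuf,double **recvbuf,int *maxcount)
{
  int     size,i,local_max = 0;
  double *padded;

  MPI_Comm_size(comm,&size);
  for (i = 0; i < size; i++) if (sendcounts[i] > local_max) local_max = sendcounts[i];
  /* one extra reduction so that all processes agree on the block size */
  MPI_Allreduce(&local_max,maxcount,1,MPI_INT,MPI_MAX,comm);

  padded   = (double*)calloc((size_t)size*(*maxcount),sizeof(double));
  *recvbuf = (double*)malloc((size_t)size*(*maxcount)*sizeof(double));
  /* copy each variable-sized block into its fixed-size slot, the tail stays zero */
  for (i = 0; i < size; i++)
    memcpy(padded+(size_t)i*(*maxcount),sendbuf+senddispls[i],
           (size_t)sendcounts[i]*sizeof(double));

  MPI_Alltoall(padded,*maxcount,MPI_DOUBLE,*recvbuf,*maxcount,MPI_DOUBLE,comm);
  free(padded);
  return 0;
}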

Thanks,
Pierre

On Fri, 17 Mar 2017 22:02:26 +0100, Pierre Jolivet wrote:
> The number of messages during the MatAssembly is effectively halved
> MatAssemblyEnd 2 1.0 7.2139e+01 1.0 0.00e+00 0.0 2.6e+06 1.9e+04
> 1.8e+01 62 0 99 8 15 62 0 99 8 15 0
> But that was only a few seconds faster (and this may even just be
> system noise).
> I’ll see what I can infer from the openspeedshop profiling, and
> might give another MPI implementation a try during the weekend (I’m
> using BullxMPI, based on an ancient OpenMPI, but maybe IntelMPI gives
> better results).
>
> Thanks anyway!
> Pierre
>
>> On Mar 17, 2017, at 9:23 PM, Pierre Jolivet wrote:
>>
>> Thank you for all your input. openspeedshop/2.1 is installed on my
>> cluster, but it appears something is wrong with the MPI wrapper, so
>> I’ll have to wait for an answer from support on Monday.
>> In the meantime I’ll try the patch from Stefano, which looks very
>> promising since it will replace 1599 sends and 1599 receives with a
>> single all-to-all.
>> Thanks again!
>> Pierre
>>
>>> On Mar 17, 2017, at 8:59 PM, Stefano Zampini wrote:
>>>
>>> 2017-03-17 22:52 GMT+03:00 Barry Smith :
>>>
>>>> Stefano,
>>>>
>>>> Thanks this is very helpful.
>>>>
>>>> ---------------------
>>>> Why not? Here is my naive implementation with AlltoAll, which
>>>> performs better in my case
>>>>
>>>> PetscErrorCode PetscGatherMessageLengths(MPI_Comm comm,PetscMPIInt nsends,
>>>>                                          PetscMPIInt nrecvs,const PetscMPIInt ilengths[],
>>>>                                          PetscMPIInt **onodes,PetscMPIInt **olengths)
>>>> {
>>>>   PetscErrorCode ierr;
>>>>   PetscMPIInt    size,i,j;
>>>>   PetscMPIInt    *all_lengths;
>>>>
>>>>   PetscFunctionBegin;
>>>>   ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
>>>>   ierr = PetscMalloc(size*sizeof(PetscMPIInt),&all_lengths);CHKERRQ(ierr);
>>>>   /* every process tells every other process how much it will send to it */
>>>>   ierr = MPI_Alltoall((void*)ilengths,1,MPI_INT,all_lengths,1,MPI_INT,comm);CHKERRQ(ierr);
>>>>   ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),olengths);CHKERRQ(ierr);
>>>>   ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),onodes);CHKERRQ(ierr);
>>>>   /* the rest of the snippet was cut off in the archive; a plausible
>>>>      completion keeps the ranks that actually send something: */
>>>>   for (i=0,j=0; i<size; i++) {
>>>>     if (all_lengths[i]) {
>>>>       (*onodes)[j]   = i;
>>>>       (*olengths)[j] = all_lengths[i];
>>>>       j++;
>>>>     }
>>>>   }
>>>>   ierr = PetscFree(all_lengths);CHKERRQ(ierr);
>>>>   PetscFunctionReturn(0);
>>>> }
>>>
>>> At that time I just fixed (1), not (2). My specific problem was
>>> not with timings per se, but with MPI (IntelMPI if I remember
>>> correctly) crashing when doing the rendez-vous with thousands of
>>> processes.
>>>
>>>> Don't go to sleep yet, I may have more questions :-)
>>>>
>>>> Barry
>>>>
>>>>> On Mar 17, 2017, at 2:32 PM, Stefano Zampini wrote:
>>>>>
>>>>> Pierre,
>>>>>
>>>>> I remember I had a similar problem some years ago when working
>>>>> with matrices with "process-dense" rows (i.e., when the
>>>>> off-diagonal part is shared by many processes). I fixed the
>>>>> issue by changing the implementation of
>>>>> PetscGatherMessageLengths from rendez-vous to all-to-all.
>>>>>
>>>>> Barry, if you had access to petsc-maint, the title of the
>>>>> thread is "Problem with PetscGatherMessageLengths".
>>>>>
>>>>> Hope this helps,
>>>>> Stefano
>>>>>
>>>>>
>>>>> 2017-03-17 22:21 GMT+03:00 Barry Smith :
>>>>>
>>>>> > On Mar 17, 2017, at 4:04 AM, Pierre Jolivet wrote:
>>>>> >
>>>>> > On Thu, 16 Mar 2017 15:37:17 -0500, Barry Smith wrote:
>>>>> >>> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet wrote:
>>>>> >>>
>>>>> >>> Thanks Barry.
>>>>> >>> I actually tried the application myself with my optimized
>>>>> >>> build + your option. I'm attaching two logs for a strong
>>>>> >>> scaling analysis; if someone could spend a minute or two
>>>>> >>> looking at the numbers I'd be really grateful:
>>>>> >>> 1) MatAssembly still takes a rather long time IMHO. This is
>>>>> >>> actually the bottleneck of my application. Especially on 1600
>>>>> >>> cores, the problem here is that I don't know if the huge time
>>>>> >>> (almost a 5x slow-down w.r.t. the run on 320 cores) is due to
>>>>> >>> MatMPIAIJSetPreallocationCSR (which I assumed beforehand was a
>>>>> >>> no-op, but which is clearly not the case looking at the run on
>>>>> >>> 320 cores) or the option -pc_bjacobi_blocks 320, which also
>>>>> >>> does one MatAssembly.
>>>>> >>
>>>>> >> There is one additional synchronization point in the
>>>>> >> MatAssemblyEnd that has not/cannot be removed. This is the
>>>>> >> construction of the VecScatter; I think that likely explains
>>>>> >> the huge amount of time there.
>>>>>
>>>>> This concerns me
>>>>>
>>>>> MatAssemblyEnd 2 1.0 7.5767e+01 1.0 0.00e+00 0.0 5.1e+06
>>>>> 9.4e+03 1.6e+01 64 0100 8 14 64 0100 8 14 0
>>>>>
>>>>> I am thinking this is all the communication needed to set up the
>>>>> scatter. Do you have access to any performance profilers, like
>>>>> OpenSpeedShop, to see what is going on during all this time?
>>>>>
>>>>>
>>>>> -vecscatter_alltoall uses alltoall communication in the scatters,
>>>>> but it does not use all-to-all in setting up the scatter (that is,
>>>>> determining exactly what needs to be scattered each time). I think
>>>>> this is the problem. We need to add more scatter setup code to
>>>>> optimize this case.
>>>>>
>>>>>
>>>>>
>>>>> >>
>>>>> >>> 2) The other bottleneck is MatMult, which itself calls
>>>>> >>> VecScatter. Since the structure of the matrix is rather dense,
>>>>> >>> I'm guessing the communication pattern should be similar to an
>>>>> >>> all-to-all. After having a look at the thread "VecScatter
>>>>> >>> scaling problem on KNL", would you also suggest using
>>>>> >>> -vecscatter_alltoall, or do you think this would not be
>>>>> >>> appropriate for the MatMult?
>>>>> >>
>>>>> >> Please run with
>>>>> >>
>>>>> >> -vecscatter_view ::ascii_info
>>>>> >>
>>>>> >> this will give information about the number and sizes of the
>>>>> >> messages needed in the VecScatter, to help decide what to do
>>>>> >> next.
>>>>> >
>>>>> > Here are two more logs. One with -vecscatter_view ::ascii_info
>>>>> > which I don't really know how to analyze (I've spotted though
>>>>> > that there are a couple of negative integers for the data
>>>>> > counters, maybe you are using long instead of long long?), the
>>>>> > other with -vecscatter_alltoall. The latter option gives a 2x
>>>>> > speed-up for the MatMult, and for the PCApply too (which is
>>>>> > weird to me because there should be no global communication
>>>>> > with bjacobi and the diagonal blocks are only of size "5
>>>>> > processes" so the speed-up seems rather huge for just doing
>>>>> > VecScatter for gathering and scattering the RHS/solution for
>>>>> > all 320 MUMPS instances).
>>>>>
>>>>> Ok, this is good; it confirms that the large amount of
>>>>> communication needed in the scatters was a major problem and
>>>>> that using the all-to-all helps. This is about all you can do
>>>>> about the scatter time.
>>>>>
>>>>>
>>>>>
>>>>> Barry
>>>>>
>>>>> >
>>>>> > Thanks for your help,
>>>>> > Pierre
>>>>> >
>>>>> >> Barry
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>>
>>>>> >>> Thank you very much,
>>>>> >>> Pierre
>>>>> >>>
>>>>> >>> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
>>>>> >>>> I don't think the lack of --with-debugging=no is important
>>>>> >>>> here, though he/she should use --with-debugging=no for
>>>>> >>>> production runs.
>>>>> >>>>
>>>>> >>>> I think the reason for the "funny" numbers is that
>>>>> >>>> MatAssemblyBegin and End in this case have explicit
>>>>> >>>> synchronization points, so some processes are waiting for
>>>>> >>>> other processes to get to the synchronization point; thus it
>>>>> >>>> looks like some processes are spending a lot of time in the
>>>>> >>>> assembly routines when they are not really, they are just
>>>>> >>>> waiting.
>>>>> >>>>
>>>>> >>>> You can remove the synchronization point by calling
>>>>> >>>>
>>>>> >>>> MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);
>>>>> >>>>
>>>>> >>>> before calling MatMPIAIJSetPreallocationCSR().
>>>>> >>>>
>>>>> >>>> Barry
>>>>> >>>>
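
For reference, a minimal sketch of how this suggestion fits into an assembly
based on MatMPIAIJSetPreallocationCSR; the sizes m and N and the CSR arrays
ia/ja/va are placeholders for whatever the application provides:

  Mat            A;
  PetscErrorCode ierr;

  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetSizes(A,m,PETSC_DECIDE,N,N);CHKERRQ(ierr);
  ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
  /* each process inserts only rows it owns, so the off-process
     synchronization during assembly can be skipped */
  ierr = MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE);CHKERRQ(ierr);
  /* copies the CSR data and assembles the matrix */
  ierr = MatMPIAIJSetPreallocationCSR(A,ia,ja,va);CHKERRQ(ierr);
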
>>>>> >>>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet wrote:
>>>>> >>>>>
>>>>> >>>>> Hello,
>>>>> >>>>> I have an application with a matrix with lots of nonzero
>>>>> >>>>> entries (that are perfectly load balanced between processes
>>>>> >>>>> and rows).
>>>>> >>>>> An end user is currently using a PETSc library compiled
>>>>> >>>>> with the following flags (among others):
>>>>> >>>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11"
>>>>> >>>>> --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
>>>>> >>>>> Notice the lack of --with-debugging=no
>>>>> >>>>> The matrix is assembled using MatMPIAIJSetPreallocationCSR
>>>>> >>>>> and we end up with something like this in the -log_view:
>>>>> >>>>> MatAssemblyBegin 2 1.0 1.2520e+002602.1 0.00e+00 0.0
>>>>> >>>>> 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 0
>>>>> >>>>> MatAssemblyEnd 2 1.0 4.5104e+01 1.0 0.00e+00 0.0
>>>>> >>>>> 8.2e+05 3.2e+04 4.6e+01 40 0 14 4 9 40 0 14 4 9 0
>>>>> >>>>>
>>>>> >>>>> For reference, here is what the matrix looks like (keep in
>>>>> >>>>> mind it is well balanced):
>>>>> >>>>> Mat Object: 640 MPI processes
>>>>> >>>>>   type: mpiaij
>>>>> >>>>>   rows=10682560, cols=10682560
>>>>> >>>>>   total: nonzeros=51691212800, allocated nonzeros=51691212800
>>>>> >>>>>   total number of mallocs used during MatSetValues calls =0
>>>>> >>>>>     not using I-node (on process 0) routines
>>>>> >>>>>
>>>>> >>>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to the
>>>>> >>>>> --with-debugging option on x86 even though the corresponding
>>>>> >>>>> code is compiled with -O2, i.e., should I tell the user to
>>>>> >>>>> have their PETSc lib recompiled, or would you recommend
>>>>> >>>>> using another routine for assembling such a matrix?
>>>>> >>>>>
>>>>> >>>>> Thanks,
>>>>> >>>>> Pierre
>>>>> >>>
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Stefano
>>>
>>> --
>>>
>>> Stefano
>
>
>



