[petsc-dev] MatAssembly, debug, and compile flags

Pierre Jolivet pierre.jolivet at enseeiht.fr
Sat Jun 24 16:44:54 CDT 2017


Hello Barry,
Sorry to bump up this old thread, but I’m still struggling with MatAssembly.
Just as a reminder, on 1280 processors, reference timings:
MatAssemblyBegin       2 1.0 1.4302e+006436.3 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
MatAssemblyEnd         2 1.0 5.0301e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01 31  0  0  3  7  31  0  0  3  9     0

I’ve turned VecScatterCreate_PtoS into a no-op (almost), so I’m rather satisfied with this part. The new timings were:
MatAssemblyBegin       2 1.0 1.6892e+007015.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  2   0  0  0  0  2     0
MatAssemblyEnd         2 1.0 2.9842e+01 1.1 0.00e+00 0.0 9.0e+03 7.9e+04 1.7e+01 15  0  0  0  9  15  0  0  0 11     0

I dug around in the code, and it turned out that a high percentage of the ~30 seconds was spent in MatSetUpMultiply_MPIAIJ.
I tried setting PETSC_USE_CTABLE to 0, and now I get:
MatAssemblyBegin       3 1.0 1.4183e+005960.8 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  2   0  0  0  0  2     0
MatAssemblyEnd         3 1.0 7.4979e+00 1.3 0.00e+00 0.0 9.0e+03 7.9e+04 1.7e+01  4  0 70  0 14   4  0 70  0 14     0

My follow-up questions are thus:
1) Any idea why using ctable performs so poorly here? Looking at ctable.c, it appears that ctable is an integer-specific hash table, so I would guess the number of collisions is rather low and it should be efficient.
2) Why is PETSC_USE_CTABLE a preprocessor variable rather than a runtime option? Is it wrong to set PETSC_USE_CTABLE to 0 for MatSetUpMultiply_MPIAIJ and to 1 everywhere else?

BTW, just as a reminder, my matrices have really weird sparsity patterns with extremely large bandwidths. As an example, the dimensions of the off-diagonal block B (of the MPIAIJ format) are 26871 x 21416009.
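To make questions 1) and 2) a bit more concrete, here is a simplified, self-contained sketch of the two column-mapping strategies that I understand PETSC_USE_CTABLE to toggle in MatSetUpMultiply_MPIAIJ. This is only my own illustration of the trade-off; the function names and the open-addressing table below are made up, not PETSc code:

#include <stdio.h>
#include <stdlib.h>

/* Illustration only (not the actual PETSc code): it contrasts the two
   strategies for mapping global column indices of the off-diagonal block B
   to local indices, which is what I understand PETSC_USE_CTABLE to toggle. */

/* PETSC_USE_CTABLE == 0: a dense lookup array of length N (global number of
   columns), i.e. O(N) memory per process but direct O(1) indexing. */
static long *dense_colmap(long N, const long *garray, long nB)
{
  long *colmap = calloc((size_t)N, sizeof(long)); /* 0 means "not in B" */
  for (long lid = 0; lid < nB; lid++) colmap[garray[lid]] = lid + 1;
  return colmap;
}

/* PETSC_USE_CTABLE == 1: an integer hash table keyed on the global index, so
   memory proportional to nB (distinct off-diagonal columns) instead of N, at
   the price of hashing and collision handling on every insert and lookup.
   Sketched here with a trivial open-addressing table. */
typedef struct { long *key; long *val; long nslots; } IntTable;

static IntTable *ctable_colmap(const long *garray, long nB)
{
  IntTable *t = malloc(sizeof(*t));
  t->nslots = 4*nB + 1;                       /* keep the load factor low */
  t->key    = calloc((size_t)t->nslots, sizeof(long));
  t->val    = malloc((size_t)t->nslots * sizeof(long));
  for (long lid = 0; lid < nB; lid++) {
    long k = garray[lid] + 1;                 /* shift keys so 0 means "empty" */
    long h = k % t->nslots;
    while (t->key[h] && t->key[h] != k) h = (h + 1) % t->nslots;
    t->key[h] = k;
    t->val[h] = lid + 1;
  }
  return t;
}

int main(void)
{
  long N  = 21416009;                         /* global number of columns of B */
  long garray[] = {3, 1000000, 21416008};     /* made-up off-diagonal columns  */
  long nB = 3;
  long *dense   = dense_colmap(N, garray, nB);
  IntTable *tab = ctable_colmap(garray, nB);
  printf("dense colmap: %ld entries (~%ld MB with 8-byte ints); ctable: %ld slots\n",
         N, N*8/(1024*1024), tab->nslots);
  free(dense); free(tab->key); free(tab->val); free(tab);
  return 0;
}

If I read the non-ctable path correctly, its main downside is the O(N) memory per process, which I presume is why the ctable variant exists at all; hence question 2) about mixing the two settings.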

Thanks in advance for your help,
Pierre

> On 20 Mar 2017, at 3:22 PM, Pierre Jolivet <pierre.jolivet at enseeiht.fr> wrote:
> 
> Hello Barry,
> It looks like my vendor's mpirun does not support OpenSpeedShop, and I have been too lazy to recompile everything with IntelMPI.
> However, I did some really basic profiling and it looks like you were right, a lot of time is spent in VecScatterCreate_PtoS.
> I switched to an MPI_Alltoallv implementation and here is the new summary.
> 
> MatAssemblyEnd         2 1.0 4.3129e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01 51  0  0  4 15  51  0  0  4 15     0
> 
> That's roughly 30 seconds faster, but I still find that rather slow. I'll now try an MPI_Alltoall implementation with padding, because I know for a fact that BullxMPI's performance for variable-sized collectives is much worse than for uniform collectives (and all my local dimensions are almost the same, so the memory cost of padding will be negligible).
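> For the record, the padded exchange I have in mind looks roughly like the sketch below. This is untested, and sendbuf/sendcounts/sdispls are placeholders for whatever buffers the MPI_Alltoallv version already uses, not actual PETSc variables:
> 
> #include <mpi.h>
> #include <stdlib.h>
> #include <string.h>
> 
> /* Pack each destination's data into a slot of uniform width (the global
>    maximum of the per-destination counts) so that a single uniform
>    MPI_Alltoall can replace MPI_Alltoallv(sendbuf, sendcounts, sdispls, ...).
>    Returns the padded receive buffer; *slotwidth is the uniform slot size. */
> static double *padded_alltoall(MPI_Comm comm, const double *sendbuf,
>                                const int *sendcounts, const int *sdispls,
>                                int *slotwidth)
> {
>   int     size, i, maxcount = 0;
>   double *padded_send, *padded_recv;
> 
>   MPI_Comm_size(comm, &size);
>   for (i = 0; i < size; i++) if (sendcounts[i] > maxcount) maxcount = sendcounts[i];
>   MPI_Allreduce(&maxcount, slotwidth, 1, MPI_INT, MPI_MAX, comm);
> 
>   padded_send = malloc((size_t)size * *slotwidth * sizeof(double));
>   padded_recv = malloc((size_t)size * *slotwidth * sizeof(double));
>   for (i = 0; i < size; i++) /* the tail of each slot is simply left unused */
>     memcpy(padded_send + (size_t)i * *slotwidth, sendbuf + sdispls[i],
>            (size_t)sendcounts[i] * sizeof(double));
> 
>   MPI_Alltoall(padded_send, *slotwidth, MPI_DOUBLE,
>                padded_recv, *slotwidth, MPI_DOUBLE, comm);
> 
>   free(padded_send);
>   /* the true per-source counts are still needed to strip the padding */
>   return padded_recv;
> }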
> 
> Thanks,
> Pierre
> 
> On Fri, 17 Mar 2017 22:02:26 +0100, Pierre Jolivet wrote:
>> The number of messages during the MatAssembly is effectively halved:
>> MatAssemblyEnd 2 1.0 7.2139e+01 1.0 0.00e+00 0.0 2.6e+06 1.9e+04 1.8e+01 62 0 99 8 15 62 0 99 8 15 0
>> But that was only a few seconds faster (and this may even just be
>> system noise).
>> I’ll see what I can infer from the openspeedshop profiling, and
>> might give another MPI implementation a try during the weekend (I’m
>> using BullxMPI, based on an ancient OpenMPI, but maybe IntelMPI gives
>> better results).
>> 
>> Thanks anyway!
>> Pierre
>> 
>>> On Mar 17, 2017, at 9:23 PM, Pierre Jolivet wrote:
>>> 
>>> Thank you for all your input. openspeedshop/2.1 is installed on my
>>> cluster but it appears something is wrong with the MPI wrapper so
>>> I’ll have to wait for the answer from the support on Monday.
>>> In the meantime I’ll try the patch from Stefano which looks very
>>> promising since it will replace 1599 sends and 1599 receives by a
>>> single all-to-all.
>>> Thanks again!
>>> Pierre
>>> 
>>>> On Mar 17, 2017, at 8:59 PM, Stefano Zampini wrote:
>>>> 
>>>> 2017-03-17 22:52 GMT+03:00 Barry Smith :
>>>> 
>>>>> Stefano,
>>>>> 
>>>>> Thanks this is very helpful.
>>>>> 
>>>>> ---------------------
>>>>> Why not? Here is my naive implementation with AlltoAll, which
>>>>> performs better in my case:
>>>>> 
>>>>> PetscErrorCode PetscGatherMessageLengths(MPI_Comm
>>>>> comm,PetscMPIInt nsends,PetscMPIInt nrecvs,const PetscMPIInt
>>>>> ilengths[],PetscMPIInt **onodes,PetscMPIInt **olengths)
>>>>> {
>>>>> PetscErrorCode ierr;
>>>>> PetscMPIInt size,i,j;
>>>>> PetscMPIInt *all_lengths;
>>>>> 
>>>>> PetscFunctionBegin;
>>>>> ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
>>>>> ierr = PetscMalloc(size*sizeof(PetscMPIInt),&all_lengths);CHKERRQ(ierr);
>>>>> ierr = MPI_Alltoall((void*)ilengths,1,MPI_INT,all_lengths,1,MPI_INT,comm);CHKERRQ(ierr);
>>>>> ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),olengths);CHKERRQ(ierr);
>>>>> ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),onodes);CHKERRQ(ierr);
>>>>> /* [the rest of the function was lost to the archive's formatting;
>>>>>    presumably it compresses the nonzero lengths into onodes/olengths,
>>>>>    roughly like this:] */
>>>>> for (i=0,j=0; i<size; i++) {
>>>>>   if (all_lengths[i]) {
>>>>>     (*onodes)[j]     = i;
>>>>>     (*olengths)[j++] = all_lengths[i];
>>>>>   }
>>>>> }
>>>>> ierr = PetscFree(all_lengths);CHKERRQ(ierr);
>>>>> PetscFunctionReturn(0);
>>>>> }
>>>> 
>>>> At that time I just fixed (1), not (2). My specific problem was
>>>> not with timings per se, but with MPI (IntelMPI if I remember
>>>> correctly) crashing when doing the rendez-vous with thousands of
>>>> processes.
>>>> 
>>>>> Don't go to sleep yet, I may have more questions :-)
>>>>> 
>>>>> Barry
>>>>> 
>>>>>> On Mar 17, 2017, at 2:32 PM, Stefano Zampini wrote:
>>>>>> 
>>>>>> Pierre,
>>>>>> 
>>>>>> I remember I had a similar problem some years ago when
>>>>> working with matrices with "process-dense" rows (i.e., when the
>>>>> off-diagonal part is shared by many processes). I fixed the
>>>>> issue by changing the implementation of
>>>>> PetscGatherMessageLengths, from rendez-vous to all-to-all.
>>>>>> 
>>>>>> Barry, if you had access to petsc-maint, the title of the
>>>>> thread is "Problem with PetscGatherMessageLengths".
>>>>>> 
>>>>>> Hope this helps,
>>>>>> Stefano
>>>>>> 
>>>>>> 
>>>>>> 2017-03-17 22:21 GMT+03:00 Barry Smith :
>>>>>> 
>>>>>> > On Mar 17, 2017, at 4:04 AM, Pierre Jolivet wrote:
>>>>>> >
>>>>>> > On Thu, 16 Mar 2017 15:37:17 -0500, Barry Smith wrote:
>>>>>> >>> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet wrote:
>>>>>> >>>
>>>>>> >>> Thanks Barry.
>>>>>> >>> I actually tried the application myself with my optimized
>>>>> build + your option. I'm attaching two logs for a strong scaling
>>>>> analysis, if someone could spend a minute or two looking at the
>>>>> numbers I'd be really grateful:
>>>>>> >>> 1) MatAssembly still takes a rather long time IMHO. This
>>>>> is actually the bottleneck of my application. Especially on 1600
>>>>> cores, the problem here is that I don't know if the huge time
>>>>> (almost a 5x slow-down w.r.t. the run on 320 cores) is due to
>>>>> MatMPIAIJSetPreallocationCSR (which I assumed beforehand was a
>>>>> no-op, but which is clearly not the case looking at the run on
>>>>> 320 cores) or the option -pc_bjacobi_blocks 320, which also
>>>>> does one MatAssembly.
>>>>>> >>
>>>>>> >> There is one additional synchronization point in the
>>>>>> >> MatAssemblyEnd that has not/cannot be removed. This is the
>>>>>> >> construction of the VecScatter; I think that likely
>>>>> explains the huge
>>>>>> >> amount of time there.
>>>>>> 
>>>>>> This concerns me
>>>>>> 
>>>>>> MatAssemblyEnd 2 1.0 7.5767e+01 1.0 0.00e+00 0.0 5.1e+06 9.4e+03 1.6e+01 64 0100 8 14 64 0100 8 14 0
>>>>>> 
>>>>>> I am thinking this is all the communication needed to set up
>>>>> the scatter. Do you have access to any performance profilers
>>>>> like Intel speedshop to see what is going on during all this
>>>>> time?
>>>>>> 
>>>>>> 
>>>>>> -vecscatter_alltoall uses all-to-all communication in the
>>>>> scatters, but it does not use all-to-all in setting up the
>>>>> scatter (that is, determining exactly what needs to be scattered
>>>>> each time). I think this is the problem. We need to add more
>>>>> scatter setup code to optimize this case.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> >>
>>>>>> >>> 2) The other bottleneck is MatMult, which itself calls
>>>>> VecScatter. Since the structure of the matrix is rather dense,
>>>>> I'm guessing the communication pattern should be similar to an
>>>>> all-to-all. After having a look at the thread "VecScatter
>>>>> scaling problem on KNL", would you also suggest me to use
>>>>> -vecscatter_alltoall, or do you think this would not be
>>>>> appropriate for the MatMult?
>>>>>> >>
>>>>>> >> Please run with
>>>>>> >>
>>>>>> >> -vecscatter_view ::ascii_info
>>>>>> >>
>>>>>> >> this will give information about the number of messages
>>>>> and sizes
>>>>>> >> needed in the VecScatter. To help decide what to do next.
>>>>>> >
>>>>>> > Here are two more logs. One with -vecscatter_view
>>>>> ::ascii_info which I don't really know how to analyze (I've
>>>>> spotted though that there are a couple of negative integers for
>>>>> the data counters, maybe you are using long instead of long
>>>>> long?), the other with -vecscatter_alltoall. The latter option
>>>>> gives a 2x speed-up for the MatMult, and for the PCApply too
>>>>> (which is weird to me because there should be no global
>>>>> communication with bjacobi and the diagonal blocks are only of
>>>>> size "5 processes" so the speed-up seems rather huge for just
>>>>> doing VecScatter for gathering and scattering the RHS/solution
>>>>> for all 320 MUMPS instances).
>>>>>> 
>>>>>> OK, this is good, it confirms that the large amount of
>>>>> communication needed in the scatters was a major problem and
>>>>> using the all-to-all helps. This is about all you can do about
>>>>> the scatter time.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Barry
>>>>>> 
>>>>>> >
>>>>>> > Thanks for your help,
>>>>>> > Pierre
>>>>>> >
>>>>>> >> Barry
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>>
>>>>>> >>> Thank you very much,
>>>>>> >>> Pierre
>>>>>> >>>
>>>>>> >>> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
>>>>>> >>>> I don't think the lack of --with-debugging=no is
>>>>> important here.
>>>>>> >>>> Though he/she should use --with-debugging=no for
>>>>> production runs.
>>>>>> >>>>
>>>>>> >>>> I think the reason for the "funny" numbers is that
>>>>>> >>>> MatAssemblyBegin and End in this case have explicit synchronization
>>>>>> >>>> points, so some processes are waiting for other processes to get to
>>>>>> >>>> the synchronization point. Thus it looks like some processes are
>>>>>> >>>> spending a lot of time in the assembly routines when they are not
>>>>>> >>>> really; they are just waiting.
>>>>>> >>>>
>>>>>> >>>> You can remove the synchronization point by calling
>>>>>> >>>>
>>>>>> >>>> MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);
>>>>>> >>>> before calling MatMPIAIJSetPreallocationCSR().
>>>>>> >>>>
>>>>>> >>>> Barry
>>>>>> >>>>
>>>>>> >>>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet wrote:
>>>>>> >>>>>
>>>>>> >>>>> Hello,
>>>>>> >>>>> I have an application with a matrix with lots of
>>>>> nonzero entries (that are perfectly load balanced between
>>>>> processes and rows).
>>>>>> >>>>> An end user is currently using a PETSc library compiled
>>>>> with the following flags (among others):
>>>>>> >>>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11" --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
>>>>>> >>>>> Notice the lack of --with-debugging=no
>>>>>> >>>>> The matrix is assembled using
>>>>> MatMPIAIJSetPreallocationCSR and we end up with something like
>>>>> this in the -log_view:
>>>>>> >>>>> MatAssemblyBegin 2 1.0 1.2520e+002602.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 0
>>>>>> >>>>> MatAssemblyEnd 2 1.0 4.5104e+01 1.0 0.00e+00 0.0 8.2e+05 3.2e+04 4.6e+01 40 0 14 4 9 40 0 14 4 9 0
>>>>>> >>>>>
>>>>>> >>>>> For reference, here is what the matrix looks like (keep
>>>>> in mind it is well balanced)
>>>>>> >>>>> Mat Object: 640 MPI processes
>>>>>> >>>>> type: mpiaij
>>>>>> >>>>> rows=10682560, cols=10682560
>>>>>> >>>>> total: nonzeros=51691212800, allocated nonzeros=51691212800
>>>>>> >>>>> total number of mallocs used during MatSetValues calls =0
>>>>>> >>>>> not using I-node (on process 0) routines
>>>>>> >>>>>
>>>>>> >>>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to
>>>>> the --with-debugging option on x86, even though the corresponding
>>>>> code is compiled with -O2? I.e., should I tell the user to have
>>>>> their PETSc lib recompiled, or would you recommend using another
>>>>> routine for assembling such a matrix?
>>>>>> >>>>>
>>>>>> >>>>> Thanks,
>>>>>> >>>>> Pierre
>>>>>> >>>
>>>>>> >
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Stefano
>>>> 
>>>> --
>>>> 
>>>> Stefano
>> 
> 



