[petsc-dev] MatAssembly, debug, and compile flags

Pierre Jolivet Pierre.Jolivet at enseeiht.fr
Sun Jun 25 16:13:56 CDT 2017


I'm using master, currently synced with commit 
b27f05d0a5e3da93e0fb56db3eb0b4fea45f004f

Thanks for your comments.
Pierre

On Sat, 24 Jun 2017 16:50:38 -0500, Barry Smith wrote:
>> On Jun 24, 2017, at 4:44 PM, Pierre Jolivet 
>> <pierre.jolivet at enseeiht.fr> wrote:
>>
>> Hello Barry,
>> Sorry to bump up this old thread, but I’m still struggling with 
>> MatAssembly.
>> Just as a reminder, on 1280 processors, reference timings:
>> MatAssemblyBegin       2 1.0 1.4302e+006436.3 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
>> MatAssemblyEnd         2 1.0 5.0301e+01 1.0 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 1.8e+01 31  0  0  3  7  31  0  0  3  9     0
>>
>> I’ve turned VecScatterCreate_PtoS into a no-op (almost), so I’m 
>> rather satisfied with this part. The new timings were:
>> MatAssemblyBegin       2 1.0 1.6892e+007015.1 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 3.0e+00  0  0  0  0  2   0  0  0  0  2     0
>> MatAssemblyEnd         2 1.0 2.9842e+01 1.1 0.00e+00 0.0 9.0e+03 
>> 7.9e+04 1.7e+01 15  0  0  0  9  15  0  0  0 11     0
>>
>> I dug around in the code, and it turned out that a high percentage
>> of the ~30 seconds was spent in MatSetUpMultiply_MPIAIJ.
>> I tried to set PETSC_USE_CTABLE to 0 and now I get:
>> MatAssemblyBegin       3 1.0 1.4183e+005960.8 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 3.0e+00  0  0  0  0  2   0  0  0  0  2     0
>> MatAssemblyEnd         3 1.0 7.4979e+00 1.3 0.00e+00 0.0 9.0e+03 
>> 7.9e+04 1.7e+01  4  0 70  0 14   4  0 70  0 14     0
>>
>> My follow-up questions are thus:
>> 1) any ideas why using ctable is a terrible idea here? Looking at 
>> ctable.c it looks like ctable is an integer-specific hash table, so 
>> I’m guessing the number of collisions is rather low and it should be 
>> efficient.
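>>
>> For reference, my rough (and possibly wrong) mental model of the two lookup
>> paths used to map a global column index of the off-diagonal block to a local
>> one is something like this; ta, colmap, gcol and lcol are stand-in names, not
>> the exact variables used in the PETSc sources:
>>
>> #include <petscctable.h>
>> static PetscErrorCode GlobalToLocal(PetscTable ta,const PetscInt *colmap,PetscInt gcol,PetscInt *lcol)
>> {
>>   PetscErrorCode ierr;
>>
>>   PetscFunctionBegin;
>> #if defined(PETSC_USE_CTABLE)
>>   ierr = PetscTableFind(ta,gcol+1,lcol);CHKERRQ(ierr); (*lcol)--; /* one hash lookup per entry */
>> #else
>>   *lcol = colmap[gcol]-1; /* direct load from an array as long as the global number of columns */
>> #endif
>>   PetscFunctionReturn(0);
>> }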
>
>     We don't understand this. We occasionally see really bad
> performance with ctable. It is usually just on a subset of processes.
> Have you tried with the master branch? We tried mucking with it to
> get better performance.
>
>
>> 2) why is PETSC_USE_CTABLE not a runtime option, and a preprocessor 
>> variable instead? Is it wrong to set PETSC_USE_CTABLE to 0 for 
>> MatSetUpMultiply_MPIAIJ, and 1 elsewhere?
>
>    Good question. It should be a runtime option attached to each object
> so it can be turned on and off as needed. We'd be happy to receive a
> pull request that fixes this, or you can post an issue at
> bitbucket.org/petsc/petsc
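>
>    Roughly what we have in mind is something like the following (untested
> sketch; -mat_use_ctable is a made-up option name, not an existing one):
>
>    PetscBool usectable = PETSC_TRUE,set = PETSC_FALSE;
>    ierr = PetscOptionsGetBool(NULL,NULL,"-mat_use_ctable",&usectable,&set);CHKERRQ(ierr);
>    if (usectable) {
>      /* build the compressed global-to-local column map with a PetscTable (current PETSC_USE_CTABLE path) */
>    } else {
>      /* build the full-length integer lookup array instead (current non-CTABLE path) */
>    }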
>
>
>>
>> BTW, just as a reminder, my matrices have really weird sparsity
>> patterns with extremely large bandwidths. Here is an example of the
>> dimensions of the off-diagonal block B (in the MPIAIJ format): 26871 x
>> 21416009.
>>
>> Thanks in advance for your help,
>> Pierre
>>
>>> On 20 Mar 2017, at 3:22 PM, Pierre Jolivet 
>>> <pierre.jolivet at enseeiht.fr> wrote:
>>>
>>> Hello Barry,
>>> It looks like my vendor mpirun does not support OpenSpeedShop, and
>>> I have been too lazy to recompile everything with IntelMPI.
>>> However, I did some really basic profiling and it looks like you
>>> were right: a lot of time is spent in VecScatterCreate_PtoS.
>>> I switched to a MPI_Alltoallv implementation and here is the new 
>>> summary.
>>>
>>> MatAssemblyEnd         2 1.0 4.3129e+01 1.0 0.00e+00 0.0 0.0e+00 
>>> 0.0e+00 1.8e+01 51  0  0  4 15  51  0  0  4 15     0
>>>
>>> That's roughly 30 seconds faster, but I still find that rather
>>> slow. I'll now try an MPI_Alltoall implementation with padding,
>>> because I know for a fact that BullxMPI's performance for
>>> variable-sized collectives is much worse than for uniform
>>> collectives (plus, all my local dimensions are almost the same, so the
>>> memory cost of padding will be negligible).
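>>>
>>> Roughly what I have in mind is the following (untested sketch; sendlens,
>>> sendbuf and recvbuf are placeholders, not actual PETSc variables):
>>>
>>> PetscMPIInt i,size,maxlen = 0;
>>> ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
>>> for (i=0; i<size; i++) maxlen = PetscMax(maxlen,sendlens[i]);
>>> ierr = MPI_Allreduce(MPI_IN_PLACE,&maxlen,1,MPI_INT,MPI_MAX,comm);CHKERRQ(ierr);
>>> /* every block is zero-padded to maxlen entries, so the uniform collective can be used */
>>> ierr = MPI_Alltoall(sendbuf,maxlen,MPIU_INT,recvbuf,maxlen,MPIU_INT,comm);CHKERRQ(ierr);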
>>>
>>> Thanks,
>>> Pierre
>>>
>>> On Fri, 17 Mar 2017 22:02:26 +0100, Pierre Jolivet wrote:
>>>> The number of messages during the MatAssembly is effectively
>>>> halved:
>>>> MatAssemblyEnd 2 1.0 7.2139e+01 1.0 0.00e+00 0.0 2.6e+06 1.9e+04
>>>> 1.8e+01 62 0 99 8 15 62 0 99 8 15 0
>>>> But that was only a few seconds faster (and this may even just be
>>>> system noise).
>>>> I’ll see what I can infer from the openspeedshop profiling, and
>>>> might give another MPI implementation a try during the weekend 
>>>> (I’m
>>>> using BullxMPI, based on an ancient OpenMPI, but maybe IntelMPI 
>>>> gives
>>>> better results).
>>>>
>>>> Thanks anyway!
>>>> Pierre
>>>>
>>>>> On Mar 17, 2017, at 9:23 PM, Pierre Jolivet wrote:
>>>>>
>>>>> Thank you for all your input. openspeedshop/2.1 is installed on my
>>>>> cluster, but it appears something is wrong with the MPI wrapper, so
>>>>> I’ll have to wait for an answer from support on Monday.
>>>>> In the meantime I’ll try the patch from Stefano, which looks very
>>>>> promising since it will replace 1599 sends and 1599 receives with a
>>>>> single all-to-all.
>>>>> Thanks again!
>>>>> Pierre
>>>>>
>>>>>> On Mar 17, 2017, at 8:59 PM, Stefano Zampini wrote:
>>>>>>
>>>>>> 2017-03-17 22:52 GMT+03:00 Barry Smith :
>>>>>>
>>>>>>> Stefano,
>>>>>>>
>>>>>>> Thanks this is very helpful.
>>>>>>>
>>>>>>> ---------------------
>>>>>>> Why not? Here is my naive implementation with MPI_Alltoall, which
>>>>>>> performs better in my case:
>>>>>>>
>>>>>>> PetscErrorCode PetscGatherMessageLengths(MPI_Comm
>>>>>>> comm,PetscMPIInt nsends,PetscMPIInt nrecvs,const PetscMPIInt
>>>>>>> ilengths[],PetscMPIInt **onodes,PetscMPIInt **olengths)
>>>>>>> {
>>>>>>> PetscErrorCode ierr;
>>>>>>> PetscMPIInt size,i,j;
>>>>>>> PetscMPIInt *all_lengths;
>>>>>>>
>>>>>>> PetscFunctionBegin;
>>>>>>> ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
>>>>>>> ierr = PetscMalloc(size*sizeof(PetscMPIInt),&all_lengths);CHKERRQ(ierr);
>>>>>>> ierr = MPI_Alltoall((void*)ilengths,1,MPI_INT,all_lengths,1,MPI_INT,comm);CHKERRQ(ierr);
>>>>>>> ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),olengths);CHKERRQ(ierr);
>>>>>>> ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),onodes);CHKERRQ(ierr);
>>>>>>> /* the archived message is cut off here; the remainder presumably just keeps
>>>>>>>    the nonzero lengths and the ranks they came from, i.e. something like: */
>>>>>>> for (i=0,j=0; i<size; i++) {
>>>>>>>   if (all_lengths[i]) {
>>>>>>>     (*olengths)[j] = all_lengths[i];
>>>>>>>     (*onodes)[j]   = i;
>>>>>>>     j++;
>>>>>>>   }
>>>>>>> }
>>>>>>> ierr = PetscFree(all_lengths);CHKERRQ(ierr);
>>>>>>> PetscFunctionReturn(0);
>>>>>>> }
>>>>>>
>>>>>> At that time I just fixed (1), not (2). My specific problem was
>>>>>> not with timings per se, but with MPI (IntelMPI if I remember
>>>>>> correctly) crashing when doing the rendez-vous with thousands of
>>>>>> processes.
>>>>>>
>>>>>>> Don't go to sleep yet, I may have more questions :-)
>>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>>> On Mar 17, 2017, at 2:32 PM, Stefano Zampini wrote:
>>>>>>>>
>>>>>>>> Pierre,
>>>>>>>>
>>>>>>>> I remember I had a similar problem some years ago when
>>>>>>>> working with matrices with "process-dense" rows (i.e., when the
>>>>>>>> off-diagonal part is shared by many processes). I fixed the
>>>>>>>> issue by changing the implementation of
>>>>>>>> PetscGatherMessageLengths from rendez-vous to all-to-all.
>>>>>>>>
>>>>>>>> Barry, if you have access to petsc-maint, the title of the
>>>>>>>> thread is "Problem with PetscGatherMessageLengths".
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>>
>>>>>>>> 2017-03-17 22:21 GMT+03:00 Barry Smith :
>>>>>>>>
>>>>>>>>> On Mar 17, 2017, at 4:04 AM, Pierre Jolivet wrote:
>>>>>>>>>
>>>>>>>>> On Thu, 16 Mar 2017 15:37:17 -0500, Barry Smith wrote:
>>>>>>>>>>> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks Barry.
>>>>>>>>>>> I actually tried the application myself with my optimized
>>>>>>>>>>> build + your option. I'm attaching two logs for a strong scaling
>>>>>>>>>>> analysis; if someone could spend a minute or two looking at the
>>>>>>>>>>> numbers I'd be really grateful:
>>>>>>>>>>> 1) MatAssembly still takes a rather long time IMHO. This
>>>>>>>>>>> is actually the bottleneck of my application. Especially on 1600
>>>>>>>>>>> cores, the problem is that I don't know whether the huge time
>>>>>>>>>>> (almost a 5x slow-down w.r.t. the run on 320 cores) is due to
>>>>>>>>>>> MatMPIAIJSetPreallocationCSR (which I assumed beforehand was a
>>>>>>>>>>> no-op, but which is clearly not the case looking at the run on
>>>>>>>>>>> 320 cores) or to the option -pc_bjacobi_blocks 320, which also
>>>>>>>>>>> does one MatAssembly.
>>>>>>>>>>
>>>>>>>>>> There is one additional synchronization point in the
>>>>>>>>>> MatAssemblyEnd that has not been (and cannot be) removed. This is the
>>>>>>>>>> construction of the VecScatter; I think that likely explains the huge
>>>>>>>>>> amount of time there.
>>>>>>>>
>>>>>>>> This concerns me
>>>>>>>>
>>>>>>>> MatAssemblyEnd 2 1.0 7.5767e+01 1.0 0.00e+00 0.0 5.1e+06
>>>>>>> 9.4e+03 1.6e+01 64 0100 8 14 64 0100 8 14 0
>>>>>>>>
>>>>>>>> I am thinking this is all the communication needed to set up
>>>>>>> the scatter. Do you have access to any performance profilers
>>>>>>> like Intel speedshop to see what is going on during all this
>>>>>>> time?
>>>>>>>>
>>>>>>>>
>>>>>>>> -vecscatter_alltoall uses all-to-all communication in the
>>>>>>>> scatters, but it does not use all-to-all in setting up the
>>>>>>>> scatter (that is, determining exactly what needs to be scattered
>>>>>>>> each time). I think this is the problem. We need to add more
>>>>>>>> scatter setup code to optimize this case.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 2) The other bottleneck is MatMult, which itself calls
>>>>>>> VecScatter. Since the structure of the matrix is rather dense,
>>>>>>> I'm guessing the communication pattern should be similar to an
>>>>>>> all-to-all. After having a look at the thread "VecScatter
>>>>>>> scaling problem on KNL", would you also suggest me to use
>>>>>>> -vecscatter_alltoall, or do you think this would not be
>>>>>>> appropriate for the MatMult?
>>>>>>>>>>
>>>>>>>>>> Please run with
>>>>>>>>>>
>>>>>>>>>> -vecscatter_view ::ascii_info
>>>>>>>>>>
>>>>>>>>>> this will give information about the number of messages
>>>>>>> and sizes
>>>>>>>>>> needed in the VecScatter. To help decide what to do next.
>>>>>>>>>
>>>>>>>>> Here are two more logs: one with -vecscatter_view
>>>>>>>>> ::ascii_info, which I don't really know how to analyze (I've
>>>>>>>>> spotted, though, that there are a couple of negative integers for
>>>>>>>>> the data counters; maybe you are using long instead of long
>>>>>>>>> long?), the other with -vecscatter_alltoall. The latter option
>>>>>>>>> gives a 2x speed-up for the MatMult, and for the PCApply too
>>>>>>>>> (which is weird to me because there should be no global
>>>>>>>>> communication with bjacobi and the diagonal blocks are only of
>>>>>>>>> size "5 processes", so the speed-up seems rather huge for just
>>>>>>>>> doing a VecScatter to gather and scatter the RHS/solution
>>>>>>>>> for all 320 MUMPS instances).
>>>>>>>>
>>>>>>>> Ok, this is good, it confirms that the large amount of
>>>>>>>> communication needed in the scatters was a major problem and
>>>>>>>> using the all-to-all helps. This is about all you can do about
>>>>>>>> the scatter time.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for your help,
>>>>>>>>> Pierre
>>>>>>>>>
>>>>>>>>>> Barry
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much,
>>>>>>>>>>> Pierre
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
>>>>>>>>>>>> I don't think the lack of --with-debugging=no is
>>>>>>>>>>>> important here,
>>>>>>>>>>>> though he/she should use --with-debugging=no for
>>>>>>>>>>>> production runs.
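>>>>>>>>>>>>
>>>>>>>>>>>> For a production build that would mean something along the lines of
>>>>>>>>>>>> (illustrative only; keep the rest of the existing configure options):
>>>>>>>>>>>>
>>>>>>>>>>>>   ./configure --with-debugging=no --COPTFLAGS=-O3 --CXXOPTFLAGS="-O3 -std=c++11" --FOPTFLAGS=-O3 ...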
>>>>>>>>>>>>
>>>>>>>>>>>> I think the reason for the "funny" numbers is that
>>>>>>>>>>>> MatAssemblyBegin and End in this case have explicit
>>>>>>>>>>>> synchronization points, so some processes are waiting for
>>>>>>>>>>>> other processes to get to the synchronization point; thus it
>>>>>>>>>>>> looks like some processes are spending a lot of time in the
>>>>>>>>>>>> assembly routines when really they are just waiting.
>>>>>>>>>>>>
>>>>>>>>>>>> You can remove the synchronization point by calling
>>>>>>>>>>>>
>>>>>>>>>>>>   MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);
>>>>>>>>>>>>
>>>>>>>>>>>> before calling MatMPIAIJSetPreallocationCSR().
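>>>>>>>>>>>>
>>>>>>>>>>>> i.e., a minimal sketch, where A, ia, ja and va stand for your MPIAIJ
>>>>>>>>>>>> matrix and the local CSR arrays:
>>>>>>>>>>>>
>>>>>>>>>>>>   ierr = MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE);CHKERRQ(ierr);
>>>>>>>>>>>>   ierr = MatMPIAIJSetPreallocationCSR(A,ia,ja,va);CHKERRQ(ierr); /* the assembly this triggers no longer needs the synchronization */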
>>>>>>>>>>>>
>>>>>>>>>>>> Barry
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>> I have an application with a matrix with lots of
>>>>>>>>>>>>> nonzero entries (that are perfectly load balanced between
>>>>>>>>>>>>> processes and rows).
>>>>>>>>>>>>> An end user is currently using a PETSc library compiled
>>>>>>>>>>>>> with the following flags (among others):
>>>>>>>>>>>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11"
>>>>>>>>>>>>> --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
>>>>>>>>>>>>> Notice the lack of --with-debugging=no.
>>>>>>>>>>>>> The matrix is assembled using
>>>>>>>>>>>>> MatMPIAIJSetPreallocationCSR and we end up with something like
>>>>>>>>>>>>> this in the -log_view output:
>>>>>>>>>>>>> MatAssemblyBegin 2 1.0 1.2520e+002602.1 0.00e+00 0.0
>>>>>>> 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 0
>>>>>>>>>>>>> MatAssemblyEnd 2 1.0 4.5104e+01 1.0 0.00e+00 0.0
>>>>>>> 8.2e+05 3.2e+04 4.6e+01 40 0 14 4 9 40 0 14 4 9 0
>>>>>>>>>>>>>
>>>>>>>>>>>>> For reference, here is what the matrix looks like (keep
>>>>>>> in mind it is well balanced)
>>>>>>>>>>>>> Mat Object: 640 MPI processes
>>>>>>>>>>>>> type: mpiaij
>>>>>>>>>>>>> rows=10682560, cols=10682560
>>>>>>>>>>>>> total: nonzeros=51691212800, allocated
>>>>>>> nonzeros=51691212800
>>>>>>>>>>>>> total number of mallocs used during MatSetValues calls
>>>>>>> =0
>>>>>>>>>>>>> not using I-node (on process 0) routines
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to
>>>>>>>>>>>>> the --with-debugging option on x86 even though the corresponding
>>>>>>>>>>>>> code is compiled with -O2? I.e., should I tell the user to have
>>>>>>>>>>>>> their PETSc lib recompiled, or would you recommend using
>>>>>>>>>>>>> another routine for assembling such a matrix?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Pierre
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Stefano
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Stefano
>>>>
>>>>
>>>>



