[petsc-dev] MatAssembly, debug, and compile flags

Barry Smith bsmith at mcs.anl.gov
Sat Jun 24 16:50:38 CDT 2017


> On Jun 24, 2017, at 4:44 PM, Pierre Jolivet <pierre.jolivet at enseeiht.fr> wrote:
> 
> Hello Barry,
> Sorry to bump up this old thread, but I’m still struggling with MatAssembly.
> Just as a reminder, on 1280 processors, reference timings:
> MatAssemblyBegin       2 1.0 1.4302e+006436.3 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  1  0  0  0  1   1  0  0  0  1     0
> MatAssemblyEnd         2 1.0 5.0301e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01 31  0  0  3  7  31  0  0  3  9     0
> 
> I’ve turned VecScatterCreate_PtoS into a no-op (almost), so I’m rather satisfied with this part. The new timings were:
> MatAssemblyBegin       2 1.0 1.6892e+007015.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  2   0  0  0  0  2     0
> MatAssemblyEnd         2 1.0 2.9842e+01 1.1 0.00e+00 0.0 9.0e+03 7.9e+04 1.7e+01 15  0  0  0  9  15  0  0  0 11     0
> 
> I dug around in the code, and it turned out that a high percentage of the ~30 seconds was spent in MatSetUpMultiply_MPIAIJ.
> I tried to set PETSC_USE_CTABLE to 0 and now I get:
> MatAssemblyBegin       3 1.0 1.4183e+005960.8 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  2   0  0  0  0  2     0
> MatAssemblyEnd         3 1.0 7.4979e+00 1.3 0.00e+00 0.0 9.0e+03 7.9e+04 1.7e+01  4  0 70  0 14   4  0 70  0 14     0
> 
> My follow-up questions are thus:
> 1) Any ideas why ctable is such a terrible choice here? Looking at ctable.c, it looks like ctable is an integer-specific hash table, so I’m guessing the number of collisions is rather low and it should be efficient.

    We don't fully understand this either. We occasionally see really bad performance with ctable, usually just on a subset of processes. Have you tried with the master branch? We have tried mucking with ctable there to get better performance.
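
For what it's worth, here is a minimal, self-contained sketch (plain C, not PETSc source; all names are made up) of the trade-off that PETSC_USE_CTABLE toggles when the global column indices of the off-diagonal block B are compressed to local indices: a dense lookup array gives one direct index per nonzero but costs O(N) memory and an O(N) sweep per process, whereas a hash table keeps memory proportional to the number of distinct columns but pays a probe per nonzero.

/* Self-contained sketch (plain C, not PETSc source) of the choice the PETSC_USE_CTABLE
 * flag makes when the global column indices of the off-diagonal block B are compressed
 * into contiguous local indices.
 *   cols[] : global column index of each nonzero of B owned by this process
 *   N      : global number of columns of the matrix                              */
#include <stdio.h>
#include <stdlib.h>

/* Variant 1 (PETSC_USE_CTABLE off): dense lookup array of length N.
 * One direct index per nonzero, but O(N) memory and an O(N) sweep per process.   */
static long compress_dense(const long *cols, long nnz, long N, long *local)
{
  long *lookup = calloc((size_t)N, sizeof(long)), nlocal = 0;
  if (!lookup) return -1;
  for (long k = 0; k < nnz; k++) lookup[cols[k]] = 1;
  for (long j = 0; j < N; j++) if (lookup[j]) lookup[j] = ++nlocal;  /* 1-based numbering */
  for (long k = 0; k < nnz; k++) local[k] = lookup[cols[k]] - 1;
  free(lookup);
  return nlocal;
}

/* Variant 2 (PETSC_USE_CTABLE on): hash only the distinct global columns.
 * Memory is O(#distinct columns), but every nonzero pays a probe, and unlucky or
 * clustered keys can make the probe sequences (and hence this loop) expensive.   */
static long compress_hash(const long *cols, long nnz, long *local)
{
  long cap = 4; while (cap < 2 * nnz) cap <<= 1;           /* power-of-two table size */
  long *key = malloc((size_t)cap * sizeof(long));
  long *val = malloc((size_t)cap * sizeof(long)), nlocal = 0;
  for (long i = 0; i < cap; i++) key[i] = -1;
  for (long k = 0; k < nnz; k++) {
    size_t h = ((size_t)cols[k] * 2654435761u) & (size_t)(cap - 1);
    while (key[h] != -1 && key[h] != cols[k]) h = (h + 1) & (size_t)(cap - 1);  /* linear probing */
    if (key[h] == -1) { key[h] = cols[k]; val[h] = nlocal++; }  /* numbered by first appearance */
    local[k] = val[h];
  }
  free(key); free(val);
  return nlocal;
}

int main(void)
{
  const long cols[] = {7, 3, 7, 1234567, 3, 99};  /* toy off-diagonal column indices */
  long loc[6], N = 21416009;                      /* N taken from Pierre's B block   */
  printf("dense: %ld distinct columns\n", compress_dense(cols, 6, N, loc));
  printf("hash : %ld distinct columns\n", compress_hash(cols, 6, loc));
  return 0;
}

With an off-diagonal block like the one described below (26871 x 21416009), the dense array costs memory proportional to the 21.4M global columns on every process, which is exactly what the hash table avoids; the open question in this thread is why its constant factors are so much worse here.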


> 2) Why is PETSC_USE_CTABLE a preprocessor variable rather than a runtime option? Is it wrong to set PETSC_USE_CTABLE to 0 for MatSetUpMultiply_MPIAIJ and to 1 elsewhere?

   Good question. It should be a runtime option attached to each object so it can be turned on and off as needed. We'd be happy to receive a pull request that fixes this; otherwise, please post an issue at bitbucket.com/petsc/petsc.
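
As an illustration of that suggestion, here is a hypothetical sketch of a per-object runtime switch; the option name -mat_use_ctable and the helper below do not exist in PETSc, and the real change would live inside Mat_MPIAIJ / MatSetUpMultiply_MPIAIJ rather than in user code.

/* Hypothetical sketch of a runtime replacement for the compile-time PETSC_USE_CTABLE.
 * Neither -mat_use_ctable nor this helper exists in PETSc; this only illustrates how
 * the choice could be queried per object through the options database.              */
#include <petscmat.h>

static PetscErrorCode MatGetUseCTable_Sketch(Mat A, PetscBool *use_ctable)
{
  const char     *prefix;
  PetscErrorCode ierr;

  PetscFunctionBegin;
#if defined(PETSC_USE_CTABLE)
  *use_ctable = PETSC_TRUE;       /* keep today's compile-time choice as the default */
#else
  *use_ctable = PETSC_FALSE;
#endif
  ierr = PetscObjectGetOptionsPrefix((PetscObject)A,&prefix);CHKERRQ(ierr);
  ierr = PetscOptionsGetBool(NULL,prefix,"-mat_use_ctable",use_ctable,NULL);CHKERRQ(ierr);
  /* MatSetUpMultiply_MPIAIJ could then branch on *use_ctable instead of on #if */
  PetscFunctionReturn(0);
}

Because the object's options prefix is respected, such a switch could be flipped for individual matrices (e.g., only the ones where the table hurts) rather than for the whole library.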


> 
> BTW, just as a reminder, my matrices have really weird sparsity patterns with extremely large bandwidths. For example, the off-diagonal block B (of the MPIAIJ format) has dimensions 26871 x 21416009.
> 
> Thanks in advance for your help,
> Pierre
> 
>> On 20 Mar 2017, at 3:22 PM, Pierre Jolivet <pierre.jolivet at enseeiht.fr> wrote:
>> 
>> Hello Barry,
>> It looks like my vendor mpirun does not support OpenSpeedShop, and I have been too lazy to recompile everything with IntelMPI.
>> However, I did some really basic profiling and it looks like you were right, a lot of time is spent in VecScatterCreate_PtoS.
>> I switched to an MPI_Alltoallv implementation and here is the new summary:
>> 
>> MatAssemblyEnd         2 1.0 4.3129e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01 51  0  0  4 15  51  0  0  4 15     0
>> 
>> That's roughly 30 seconds faster, but I still find that rather slow. I'll now try an MPI_Alltoall implementation with padding, because I know for a fact that BullxMPI performance for variable-sized collectives is much worse than for uniform collectives (plus, all my local dimensions are almost the same, so the memory cost of padding will be negligible).
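
A sketch of that padded exchange (the function and buffer names are assumptions, not Pierre's actual code): every per-destination block is padded to the global maximum count so that a single uniform MPI_Alltoall can replace MPI_Alltoallv.

/* Padded all-to-all: trade a variable-sized MPI_Alltoallv for a uniform MPI_Alltoall.
 * Assumes the per-rank counts and displacements are already known on both sides.    */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void alltoallv_padded(const double *sendbuf, const int *sendcounts, const int *sdispls,
                             double *recvbuf, const int *recvcounts, const int *rdispls,
                             MPI_Comm comm)
{
  int size, maxcount, mymax = 0;
  MPI_Comm_size(comm, &size);
  for (int p = 0; p < size; p++) if (sendcounts[p] > mymax) mymax = sendcounts[p];
  MPI_Allreduce(&mymax, &maxcount, 1, MPI_INT, MPI_MAX, comm);  /* global maximum count */

  /* pack into fixed-size slots of length maxcount; the padding is never read */
  double *spad = malloc((size_t)size * maxcount * sizeof(double));
  double *rpad = malloc((size_t)size * maxcount * sizeof(double));
  for (int p = 0; p < size; p++)
    memcpy(spad + (size_t)p * maxcount, sendbuf + sdispls[p], sendcounts[p] * sizeof(double));

  MPI_Alltoall(spad, maxcount, MPI_DOUBLE, rpad, maxcount, MPI_DOUBLE, comm);

  /* unpack only the meaningful prefix of each slot */
  for (int p = 0; p < size; p++)
    memcpy(recvbuf + rdispls[p], rpad + (size_t)p * maxcount, recvcounts[p] * sizeof(double));
  free(spad); free(rpad);
}

This only pays off when the counts are nearly uniform, since the exchanged volume grows to size times the maximum count, which is exactly the situation described above.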
>> 
>> Thanks,
>> Pierre
>> 
>> On Fri, 17 Mar 2017 22:02:26 +0100, Pierre Jolivet wrote:
>>> The number of messages during the MatAssembly is effectively halved:
>>> MatAssemblyEnd 2 1.0 7.2139e+01 1.0 0.00e+00 0.0 2.6e+06 1.9e+04 1.8e+01 62 0 99 8 15 62 0 99 8 15 0
>>> But that was only a few seconds faster (and this may even just be
>>> system noise).
>>> I’ll see what I can infer from the openspeedshop profiling, and
>>> might give another MPI implementation a try during the weekend (I’m
>>> using BullxMPI, based on an ancient OpenMPI, but maybe IntelMPI gives
>>> better results).
>>> 
>>> Thanks anyway!
>>> Pierre
>>> 
>>>> On Mar 17, 2017, at 9:23 PM, Pierre Jolivet wrote:
>>>> 
>>>> Thank you for all your input. openspeedshop/2.1 is installed on my
>>>> cluster, but it appears something is wrong with the MPI wrapper, so
>>>> I’ll have to wait for an answer from support on Monday.
>>>> In the meantime I’ll try the patch from Stefano, which looks very
>>>> promising since it will replace 1599 sends and 1599 receives with a
>>>> single all-to-all.
>>>> Thanks again!
>>>> Pierre
>>>> 
>>>>> On Mar 17, 2017, at 8:59 PM, Stefano Zampini wrote:
>>>>> 
>>>>> 2017-03-17 22:52 GMT+03:00 Barry Smith :
>>>>> 
>>>>>> Stefano,
>>>>>> 
>>>>>> Thanks this is very helpful.
>>>>>> 
>>>>>> ---------------------
>>>>>> Why not? Here is my naive implementation with MPI_Alltoall, which
>>>>>> performs better in my case:
>>>>>> 
>>>>>> PetscErrorCode PetscGatherMessageLengths(MPI_Comm comm,PetscMPIInt nsends,PetscMPIInt nrecvs,const PetscMPIInt ilengths[],PetscMPIInt **onodes,PetscMPIInt **olengths)
>>>>>> {
>>>>>>   PetscErrorCode ierr;
>>>>>>   PetscMPIInt    size,i,j;
>>>>>>   PetscMPIInt    *all_lengths;
>>>>>> 
>>>>>>   PetscFunctionBegin;
>>>>>>   ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
>>>>>>   ierr = PetscMalloc(size*sizeof(PetscMPIInt),&all_lengths);CHKERRQ(ierr);
>>>>>>   /* one length per destination from every process: a single MPI_Alltoall
>>>>>>      replaces the point-to-point rendez-vous of the stock implementation */
>>>>>>   ierr = MPI_Alltoall((void*)ilengths,1,MPI_INT,all_lengths,1,MPI_INT,comm);CHKERRQ(ierr);
>>>>>>   ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),olengths);CHKERRQ(ierr);
>>>>>>   ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),onodes);CHKERRQ(ierr);
>>>>>>   /* the rest of the function was cut off in the archive; the loop below is
>>>>>>      the natural completion: keep only the ranks that actually send something */
>>>>>>   for (i=0,j=0; i<size; i++) {
>>>>>>     if (all_lengths[i]) {
>>>>>>       (*onodes)[j]   = i;
>>>>>>       (*olengths)[j] = all_lengths[i];
>>>>>>       j++;
>>>>>>     }
>>>>>>   }
>>>>>>   ierr = PetscFree(all_lengths);CHKERRQ(ierr);
>>>>>>   PetscFunctionReturn(0);
>>>>>> }
>>>>> 
>>>>> At that time I just fixed (1), not (2). My specific problem was
>>>>> not with timings per se, but with MPI (IntelMPI if I remember
>>>>> correctly) crashing when doing the rendez-vous with thousands of
>>>>> processes.
>>>>> 
>>>>>> Don't go to sleep yet, I may have more questions :-)
>>>>>> 
>>>>>> Barry
>>>>>> 
>>>>>>> On Mar 17, 2017, at 2:32 PM, Stefano Zampini wrote:
>>>>>>> 
>>>>>>> Pierre,
>>>>>>> 
>>>>>>> I remember I had a similar problem some years ago when
>>>>>> working with matrices with "process-dense" rows (i.e., when the
>>>>>> off-diagonal part is shared by many processes). I fixed the
>>>>>> issue by changing the implementation of
>>>>>> PetscGatherMessageLengths, from rendez-vous to all-to-all.
>>>>>>> 
>>>>>>> Barry, if you had access to petsc-maint, the title of the
>>>>>> thread is "Problem with PetscGatherMessageLengths".
>>>>>>> 
>>>>>>> Hope this helps,
>>>>>>> Stefano
>>>>>>> 
>>>>>>> 
>>>>>>> 2017-03-17 22:21 GMT+03:00 Barry Smith :
>>>>>>> 
>>>>>>>> On Mar 17, 2017, at 4:04 AM, Pierre Jolivet wrote:
>>>>>>>> 
>>>>>>>> On Thu, 16 Mar 2017 15:37:17 -0500, Barry Smith wrote:
>>>>>>>>>> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet wrote:
>>>>>>>>>> 
>>>>>>>>>> Thanks Barry.
>>>>>>>>>> I actually tried the application myself with my optimized
>>>>>>>>>> build + your option. I'm attaching two logs for a strong scaling
>>>>>>>>>> analysis; if someone could spend a minute or two looking at the
>>>>>>>>>> numbers I'd be really grateful:
>>>>>>>>>> 1) MatAssembly still takes a rather long time IMHO. This is
>>>>>>>>>> actually the bottleneck of my application. Especially on 1600
>>>>>>>>>> cores, the problem here is that I don't know if the huge time
>>>>>>>>>> (almost a 5x slow-down w.r.t. the run on 320 cores) is due to
>>>>>>>>>> MatMPIAIJSetPreallocationCSR (which I assumed beforehand was a
>>>>>>>>>> no-op, but which is clearly not the case looking at the run on
>>>>>>>>>> 320 cores) or to the option -pc_bjacobi_blocks 320, which also
>>>>>>>>>> does one MatAssembly.
>>>>>>>>> 
>>>>>>>>> There is one additional synchronization point in the
>>>>>>>>> MatAssemblyEnd that has not/cannot be removed. This is the
>>>>>>>>> construction of the VecScatter; I think that likely explains
>>>>>>>>> the huge amount of time there.
>>>>>>> 
>>>>>>> This concerns me
>>>>>>> 
>>>>>>> MatAssemblyEnd 2 1.0 7.5767e+01 1.0 0.00e+00 0.0 5.1e+06 9.4e+03 1.6e+01 64 0100 8 14 64 0100 8 14 0
>>>>>>> 
>>>>>>> I am thinking this is all the communication needed to set up
>>>>>> the scatter. Do you have access to any performance profilers
>>>>>> like Intel speedshop to see what is going on during all this
>>>>>> time?
>>>>>>> 
>>>>>>> 
>>>>>>> -vecscatter_alltoall uses all-to-all communication in the
>>>>>>> scatters themselves, but it does not use all-to-all in setting up
>>>>>>> the scatter (that is, determining exactly what needs to be
>>>>>>> scattered each time). I think this is the problem. We need to add
>>>>>>> more scatter setup code to optimize this case.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 2) The other bottleneck is MatMult, which itself calls
>>>>>>>>>> VecScatter. Since the structure of the matrix is rather dense,
>>>>>>>>>> I'm guessing the communication pattern should be similar to an
>>>>>>>>>> all-to-all. After having a look at the thread "VecScatter
>>>>>>>>>> scaling problem on KNL", would you also suggest using
>>>>>>>>>> -vecscatter_alltoall, or do you think this would not be
>>>>>>>>>> appropriate for the MatMult?
>>>>>>>>> 
>>>>>>>>> Please run with
>>>>>>>>> 
>>>>>>>>> -vecscatter_view ::ascii_info
>>>>>>>>> 
>>>>>>>>> This will give information about the number and sizes of the
>>>>>>>>> messages needed in the VecScatter, to help decide what to do next.
>>>>>>>> 
>>>>>>>> Here are two more logs. One with -vecscatter_view
>>>>>> ::ascii_info which I don't really know how to analyze (I've
>>>>>> spotted though that there are a couple of negative integers for
>>>>>> the data counters, maybe you are using long instead of long
>>>>>> long?), the other with -vecscatter_alltoall. The latter option
>>>>>> gives a 2x speed-up for the MatMult, and for the PCApply too
>>>>>> (which is weird to me because there should be no global
>>>>>> communication with bjacobi and the diagonal blocks are only of
>>>>>> size "5 processes" so the speed-up seems rather huge for just
>>>>>> doing VecScatter for gathering and scattering the RHS/solution
>>>>>> for all 320 MUMPS instances).
>>>>>>> 
>>>>>>> OK, this is good. It confirms that the large amount of
>>>>>>> communication needed in the scatters was a major problem and that
>>>>>>> using the all-to-all helps. This is about all you can do about
>>>>>>> the scatter time.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Barry
>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks for your help,
>>>>>>>> Pierre
>>>>>>>> 
>>>>>>>>> Barry
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thank you very much,
>>>>>>>>>> Pierre
>>>>>>>>>> 
>>>>>>>>>> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
>>>>>>>>>>> I don't think the lack of --with-debugging=no is important here,
>>>>>>>>>>> though he/she should use --with-debugging=no for production runs.
>>>>>>>>>>> 
>>>>>>>>>>> I think the reason for the "funny" numbers is that
>>>>>>>>>>> MatAssemblyBegin and End in this case have explicit
>>>>>>>>>>> synchronization points, so some processes are waiting for other
>>>>>>>>>>> processes to get to the synchronization point. Thus it looks
>>>>>>>>>>> like some processes are spending a lot of time in the assembly
>>>>>>>>>>> routines when they are not really; they are just waiting.
>>>>>>>>>>> 
>>>>>>>>>>> You can remove the synchronization point by calling
>>>>>>>>>>> 
>>>>>>>>>>> MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE); before
>>>>>>>>>>> calling MatMPIAIJSetPreallocationCSR()
>>>>>>>>>>> 
>>>>>>>>>>> Barry
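
A minimal sketch of the call sequence described just above (the communicator, local sizes, and CSR array names are placeholders, not taken from the thread):

/* Flag that no off-process entries will ever be set, so MatAssemblyBegin/End can skip
 * the reduction that checks for stashed off-process values, then hand over the local
 * CSR arrays; MatMPIAIJSetPreallocationCSR copies them in and assembles the matrix.  */
#include <petscmat.h>

static PetscErrorCode AssembleFromCSR(MPI_Comm comm, PetscInt mlocal, PetscInt nlocal,
                                      const PetscInt *ia, const PetscInt *ja,
                                      const PetscScalar *va, Mat *A)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatCreate(comm,A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A,mlocal,nlocal,PETSC_DETERMINE,PETSC_DETERMINE);CHKERRQ(ierr);
  ierr = MatSetType(*A,MATMPIAIJ);CHKERRQ(ierr);
  /* every process only inserts rows it owns, so the off-process stash is never needed */
  ierr = MatSetOption(*A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocationCSR(*A,ia,ja,va);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Setting MAT_NO_OFF_PROC_ENTRIES is only safe when, as in this use case, every process inserts exclusively into rows it owns.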
>>>>>>>>>>> 
>>>>>>>>>>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> I have an application with a matrix with lots of
>>>>>> nonzero entries (that are perfectly load balanced between
>>>>>> processes and rows).
>>>>>>>>>>>> An end user is currently using a PETSc library compiled
>>>>>> with the following flags (among others):
>>>>>>>>>>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2
>>>>>> -std=c++11" --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
>>>>>>>>>>>> Notice the lack of --with-debugging=no
>>>>>>>>>>>> The matrix is assembled using
>>>>>> MatMPIAIJSetPreallocationCSR and we end up with something like
>>>>>> that in the -log_view:
>>>>>>>>>>>> MatAssemblyBegin 2 1.0 1.2520e+002602.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 0
>>>>>>>>>>>> MatAssemblyEnd 2 1.0 4.5104e+01 1.0 0.00e+00 0.0 8.2e+05 3.2e+04 4.6e+01 40 0 14 4 9 40 0 14 4 9 0
>>>>>>>>>>>> 
>>>>>>>>>>>> For reference, here is what the matrix looks like (keep
>>>>>> in mind it is well balanced)
>>>>>>>>>>>> Mat Object: 640 MPI processes
>>>>>>>>>>>> type: mpiaij
>>>>>>>>>>>> rows=10682560, cols=10682560
>>>>>>>>>>>> total: nonzeros=51691212800, allocated nonzeros=51691212800
>>>>>>>>>>>> total number of mallocs used during MatSetValues calls =0
>>>>>>>>>>>> not using I-node (on process 0) routines
>>>>>>>>>>>> 
>>>>>>>>>>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to
>>>>>>>>>>>> the --with-debugging option on x86 even though the corresponding
>>>>>>>>>>>> code is compiled with -O2? That is, should I tell the user to
>>>>>>>>>>>> have their PETSc lib recompiled, or would you recommend using
>>>>>>>>>>>> another routine for assembling such a matrix?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Pierre
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Stefano
>>>>> 
>>>>> --
>>>>> 
>>>>> Stefano
>>> 
>>> 
>>> 



