[petsc-dev] MatAssembly, debug, and compile flags

Pierre Jolivet pierre.jolivet at enseeiht.fr
Fri Mar 17 16:02:26 CDT 2017


The number of messages during the MatAssembly is effectively halved
MatAssemblyEnd         2 1.0 7.2139e+01 1.0 0.00e+00 0.0 2.6e+06 1.9e+04 1.8e+01 62  0 99  8 15  62  0 99  8 15     0
But that only made it a few seconds faster (and this may even just be system noise).
I’ll see what I can infer from the openspeedshop profiling, and might give another MPI implementation a try over the weekend (I’m using BullxMPI, which is based on an ancient OpenMPI, but maybe IntelMPI gives better results).

Thanks anyway!
Pierre

> On Mar 17, 2017, at 9:23 PM, Pierre Jolivet <pierre.jolivet at enseeiht.fr> wrote:
> 
> Thank you for all your input. openspeedshop/2.1 is installed on my cluster but it appears something is wrong with the MPI wrapper so I’ll have to wait for the answer from the support on Monday.
> In the meantime I’ll try the patch from Stefano, which looks very promising since it replaces 1599 sends and 1599 receives with a single all-to-all.
> Thanks again!
> Pierre
> 
>> On Mar 17, 2017, at 8:59 PM, Stefano Zampini <stefano.zampini at gmail.com <mailto:stefano.zampini at gmail.com>> wrote:
>> 
>> 
>> 
>> 2017-03-17 22:52 GMT+03:00 Barry Smith <bsmith at mcs.anl.gov <mailto:bsmith at mcs.anl.gov>>:
>> 
>>    Stefano,
>> 
>>     Thanks this is very helpful.
>> 
>> ---------------------
>> Why not? Here is my naive implementation with MPI_Alltoall, which performs better in my case:
>> 
>> PetscErrorCode  PetscGatherMessageLengths(MPI_Comm comm,PetscMPIInt nsends,PetscMPIInt nrecvs,const PetscMPIInt ilengths[],PetscMPIInt **onodes,PetscMPIInt **olengths)
>> {
>>   PetscErrorCode ierr;
>>   PetscMPIInt    size,i,j;
>>   PetscMPIInt    *all_lengths;
>> 
>>   PetscFunctionBegin;
>>   ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
>>   /* every rank tells every other rank how many items it will send */
>>   ierr = PetscMalloc(size*sizeof(PetscMPIInt),&all_lengths);CHKERRQ(ierr);
>>   ierr = MPI_Alltoall((void*)ilengths,1,MPI_INT,all_lengths,1,MPI_INT,comm);CHKERRQ(ierr);
>>   ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),olengths);CHKERRQ(ierr);
>>   ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),onodes);CHKERRQ(ierr);
>>   /* compact the dense length array into (sender rank, length) pairs */
>>   for (i=0,j=0; i<size; i++) {
>>     if (all_lengths[i]) {
>>       (*olengths)[j] = all_lengths[i];
>>       (*onodes)[j] = i;
>>       j++;
>>     }
>>   }
>>   if (j != nrecvs) SETERRQ2(comm,PETSC_ERR_PLIB,"Unexpected number of senders %d != %d",j,nrecvs);
>>   ierr = PetscFree(all_lengths);CHKERRQ(ierr);
>>   PetscFunctionReturn(0);
>> }
>> -----------------------
>> 
>>   However I think this is only half the answer. If I look at VecScatterCreate_PtoS() for example it has
>> 
>>   ierr = PetscGatherNumberOfMessages(comm,NULL,nprocs,&nrecvs);CHKERRQ(ierr);
>>   ierr = PetscGatherMessageLengths(comm,nsends,nrecvs,nprocs,&onodes1,&olengths1);CHKERRQ(ierr);
>>   ierr = PetscSortMPIIntWithArray(nrecvs,onodes1,olengths1);CHKERRQ(ierr);
>>   recvtotal = 0; for (i=0; i<nrecvs; i++) recvtotal += olengths1[i];
>> 
>>   /* post receives:   */
>>   ierr  = PetscMalloc3(recvtotal,&rvalues,nrecvs,&source,nrecvs,&recv_waits);CHKERRQ(ierr);
>>   count = 0;
>>   for (i=0; i<nrecvs; i++) {
>>     ierr   = MPI_Irecv((rvalues+count),olengths1[i],MPIU_INT,onodes1[i],tag,comm,recv_waits+i);CHKERRQ(ierr);
>>     count += olengths1[i];
>>   }
>> 
>>   /* do sends:
>>      1) starts[i] gives the starting index in svalues for stuff going to
>>      the ith processor
>>   */
>>   nxr = 0;
>>   for (i=0; i<nx; i++) {
>>     if (owner[i] != rank) nxr++;
>>   }
>>   ierr = PetscMalloc3(nxr,&svalues,nsends,&send_waits,size+1,&starts);CHKERRQ(ierr);
>> 
>>   starts[0]  = 0;
>>   for (i=1; i<size; i++) starts[i] = starts[i-1] + nprocs[i-1];
>>   for (i=0; i<nx; i++) {
>>     if (owner[i] != rank) svalues[starts[owner[i]]++] = bs*inidx[i];
>>   }
>>   starts[0] = 0;
>>   for (i=1; i<size+1; i++) starts[i] = starts[i-1] + nprocs[i-1];
>>   count = 0;
>>   for (i=0; i<size; i++) {
>>     if (nprocs[i]) {
>>       ierr = MPI_Isend(svalues+starts[i],nprocs[i],MPIU_INT,i,tag,comm,send_waits+count++);CHKERRQ(ierr);
>>     }
>>   }
>> 
>> 
>> So I need to (1) use your all-to-all PetscGatherMessageLengths(), but also (2) replace the sends and receives above with MPI_Alltoallv().
>> 
>> Is that correct? Did you also fix (2) or did fixing (1) help so much you didn't need to fix (2)?
>> 
>> 
>> At that time I just fixed (1), not (2). My specific problem was not with timings per se, but with MPI (IntelMPI, if I remember correctly) crashing during the rendezvous with thousands of processes.
>>  
>>  
>> Don't go to sleep yet, I may have more questions :-)
>> 
>> 
>> Barry
>> 
>> 
>> > On Mar 17, 2017, at 2:32 PM, Stefano Zampini <stefano.zampini at gmail.com <mailto:stefano.zampini at gmail.com>> wrote:
>> >
>> > Pierre,
>> >
>> > I remember I had a similar problem some years ago when working with matrices with "process-dense" rows (i.e., when the off-diagonal part is shared by many processes). I fixed the issue by changing the implementation of PetscGatherMessageLengths from rendezvous to all-to-all.
>> >
>> > Barry, if you had access to petsc-maint, the title of the thread is "Problem with PetscGatherMessageLengths".
>> >
>> > Hope this helps,
>> > Stefano
>> >
>> >
>> > 2017-03-17 22:21 GMT+03:00 Barry Smith <bsmith at mcs.anl.gov <mailto:bsmith at mcs.anl.gov>>:
>> >
>> > > On Mar 17, 2017, at 4:04 AM, Pierre Jolivet <Pierre.Jolivet at enseeiht.fr <mailto:Pierre.Jolivet at enseeiht.fr>> wrote:
>> > >
>> > > On Thu, 16 Mar 2017 15:37:17 -0500, Barry Smith wrote:
>> > >>> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet <pierre.jolivet at enseeiht.fr <mailto:pierre.jolivet at enseeiht.fr>> wrote:
>> > >>>
>> > >>> Thanks Barry.
>> > >>> I actually tried the application myself with my optimized build + your option. I'm attaching two logs for a strong scaling analysis, if someone could spend a minute or two looking at the numbers I'd be really grateful:
>> > >>> 1) MatAssembly still takes a rather long time IMHO; this is actually the bottleneck of my application. On 1600 cores in particular, I don't know if the huge time (almost a 5x slow-down w.r.t. the run on 320 cores) is due to MatMPIAIJSetPreallocationCSR (which I assumed beforehand was a no-op, but which is clearly not the case looking at the run on 320 cores) or to the option -pc_bjacobi_blocks 320, which also does one MatAssembly.
>> > >>
>> > >>    There is one additional synchronization point in the
>> > >> MatAssemblyEnd that has not/cannot be removed. This is the
>> > >> construction of the VecScatter; I think that likely explains the huge
>> > >> amount of time there.
>> >
>> >   This concerns me
>> >
>> >   MatAssemblyEnd         2 1.0 7.5767e+01 1.0 0.00e+00 0.0 5.1e+06 9.4e+03 1.6e+01 64  0100  8 14  64  0100  8 14     0
>> >
>> >    I am thinking this is all the communication needed to set up the scatter. Do you have access to any performance profilers like Intel speedshop to see what is going on during all this time?
>> >
>> >
>> >    -vecscatter_alltoall uses all-to-all communication in the scatters themselves, but it does not use all-to-all in setting up the scatter (that is, determining exactly what needs to be scattered each time). I think this is the problem. We need to add more scatter set-up code to optimize this case.
>> >
>> >
>> >
>> > >>
>> > >>> 2) The other bottleneck is MatMult, which itself calls VecScatter. Since the structure of the matrix is rather dense, I'm guessing the communication pattern should be similar to an all-to-all. After having a look at the thread "VecScatter scaling problem on KNL", would you also suggest me to use -vecscatter_alltoall, or do you think this would not be appropriate for the MatMult?
>> > >>
>> > >>   Please run with
>> > >>
>> > >>   -vecscatter_view ::ascii_info
>> > >>
>> > >> this will give information about the number of messages and sizes
>> > >> needed in the VecScatter. To help decide what to do next.
>> > >
>> > > Here are two more logs: one with -vecscatter_view ::ascii_info, which I don't really know how to analyze (I've spotted, though, a couple of negative integers for the data counters; maybe you are using long instead of long long?), and one with -vecscatter_alltoall. The latter option gives a 2x speed-up for the MatMult, and for the PCApply too (which is weird to me, because there should be no global communication with bjacobi and the diagonal blocks are only of size "5 processes", so the speed-up seems rather huge for just doing VecScatter to gather and scatter the RHS/solution for all 320 MUMPS instances).
>> >
>> >   ok, this is good. It confirms that the large amount of communication needed in the scatters was a major problem and that using the all-to-all helps. This is about all you can do about the scatter time.
>> >
>> >
>> >
>> >   Barry
>> >
>> > >
>> > > Thanks for your help,
>> > > Pierre
>> > >
>> > >>  Barry
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>>
>> > >>> Thank you very much,
>> > >>> Pierre
>> > >>>
>> > >>> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
>> > >>>> I don't think the lack of --with-debugging=no is important here,
>> > >>>> though he/she should use --with-debugging=no for production runs.
>> > >>>>
>> > >>>>  I think the reason for the "funny" numbers is that
>> > >>>> MatAssemblyBegin and End in this case have explicit synchronization
>> > >>>> points, so some processes are waiting for other processes to reach
>> > >>>> the synchronization point. It thus looks like some processes are
>> > >>>> spending a lot of time in the assembly routines when they are not
>> > >>>> really; they are just waiting.
>> > >>>>
>> > >>>>  You can remove the synchronization point by calling
>> > >>>>
>> > >>>>   MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE); before
>> > >>>> calling MatMPIAIJSetPreallocationCSR()
>> > >>>>
>> > >>>>  Barry
>> > >>>>
>> > >>>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet <Pierre.Jolivet at enseeiht.fr <mailto:Pierre.Jolivet at enseeiht.fr>> wrote:
>> > >>>>>
>> > >>>>> Hello,
>> > >>>>> I have an application with a matrix with lots of nonzero entries (that are perfectly load balanced between processes and rows).
>> > >>>>> An end user is currently using a PETSc library compiled with the following flags (among others):
>> > >>>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11" --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
>> > >>>>> Notice the lack of --with-debugging=no
>> > >>>>> The matrix is assembled using MatMPIAIJSetPreallocationCSR and we end up with something like that in the -log_view:
>> > >>>>> MatAssemblyBegin       2 1.0 1.2520e+002602.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0 2     0
>> > >>>>> MatAssemblyEnd         2 1.0 4.5104e+01 1.0 0.00e+00 0.0 8.2e+05 3.2e+04 4.6e+01 40  0 14  4  9  40  0 14  4  9 0
>> > >>>>>
>> > >>>>> For reference, here is what the matrix looks like (keep in mind it is well balanced)
>> > >>>>> Mat Object:   640 MPI processes
>> > >>>>>  type: mpiaij
>> > >>>>>  rows=10682560, cols=10682560
>> > >>>>>  total: nonzeros=51691212800, allocated nonzeros=51691212800
>> > >>>>>  total number of mallocs used during MatSetValues calls =0
>> > >>>>>    not using I-node (on process 0) routines
>> > >>>>>
>> > >>>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to the --with-debugging option on x86 even though the corresponding code is compiled with -O2? I.e., should I tell the user to have their PETSc lib recompiled, or would you recommend another routine for assembling such a matrix?
>> > >>>>>
>> > >>>>> Thanks,
>> > >>>>> Pierre
>> > >>> <AD-3D-320_7531028.o><AD-3D-1600_7513074.o>
>> > > <AD-3D-1600_7533982_info.o><AD-3D-1600_7533637_alltoall.o>
>> >
>> >
>> >
>> >
>> > --
>> > Stefano
>> 
>> 
>> 
>> 
>> -- 
>> Stefano
> 
