[petsc-dev] MatAssembly, debug, and compile flags

Stefano Zampini stefano.zampini at gmail.com
Fri Mar 17 14:59:19 CDT 2017


2017-03-17 22:52 GMT+03:00 Barry Smith <bsmith at mcs.anl.gov>:

>
>    Stefano,
>
>     Thanks this is very helpful.
>
> ---------------------
> Why not? Here is my naive implementation with AlltoAll, which performs
> better in my case
>
> PetscErrorCode  PetscGatherMessageLengths(MPI_Comm comm,PetscMPIInt nsends,PetscMPIInt nrecvs,
>                                           const PetscMPIInt ilengths[],PetscMPIInt **onodes,PetscMPIInt **olengths)
> {
>   PetscErrorCode ierr;
>   PetscMPIInt    size,i,j;
>   PetscMPIInt    *all_lengths;
>
>   PetscFunctionBegin;
>   ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
>   ierr = PetscMalloc(size*sizeof(PetscMPIInt),&all_lengths);CHKERRQ(ierr);
>   ierr = MPI_Alltoall((void*)ilengths,1,MPI_INT,all_lengths,1,MPI_INT,comm);CHKERRQ(ierr);
>   ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),olengths);CHKERRQ(ierr);
>   ierr = PetscMalloc(nrecvs*sizeof(PetscMPIInt),onodes);CHKERRQ(ierr);
>   for (i=0,j=0; i<size; i++) {
>     if (all_lengths[i]) {
>       (*olengths)[j] = all_lengths[i];
>       (*onodes)[j] = i;
>       j++;
>     }
>   }
>   if (j != nrecvs) SETERRQ2(comm,PETSC_ERR_PLIB,"Unexpected number of senders %d != %d",j,nrecvs);
>   ierr = PetscFree(all_lengths);CHKERRQ(ierr);
>   PetscFunctionReturn(0);
> }
> -----------------------
>
>   However I think this is only half the answer. If I look at
> VecScatterCreate_PtoS() for example it has
>
>   ierr = PetscGatherNumberOfMessages(comm,NULL,nprocs,&nrecvs);CHKERRQ(ierr);
>   ierr = PetscGatherMessageLengths(comm,nsends,nrecvs,nprocs,&onodes1,&olengths1);CHKERRQ(ierr);
>   ierr = PetscSortMPIIntWithArray(nrecvs,onodes1,olengths1);CHKERRQ(ierr);
>   recvtotal = 0; for (i=0; i<nrecvs; i++) recvtotal += olengths1[i];
>
>   /* post receives:   */
>   ierr  = PetscMalloc3(recvtotal,&rvalues,nrecvs,&source,nrecvs,&recv_waits);CHKERRQ(ierr);
>   count = 0;
>   for (i=0; i<nrecvs; i++) {
>     ierr   = MPI_Irecv((rvalues+count),olengths1[i],MPIU_INT,onodes1[i],tag,comm,recv_waits+i);CHKERRQ(ierr);
>     count += olengths1[i];
>   }
>
>   /* do sends:
>      1) starts[i] gives the starting index in svalues for stuff going to
>      the ith processor
>   */
>   nxr = 0;
>   for (i=0; i<nx; i++) {
>     if (owner[i] != rank) nxr++;
>   }
>   ierr = PetscMalloc3(nxr,&svalues,nsends,&send_waits,size+1,&starts);CHKERRQ(ierr);
>
>   starts[0]  = 0;
>   for (i=1; i<size; i++) starts[i] = starts[i-1] + nprocs[i-1];
>   for (i=0; i<nx; i++) {
>     if (owner[i] != rank) svalues[starts[owner[i]]++] = bs*inidx[i];
>   }
>   starts[0] = 0;
>   for (i=1; i<size+1; i++) starts[i] = starts[i-1] + nprocs[i-1];
>   count = 0;
>   for (i=0; i<size; i++) {
>     if (nprocs[i]) {
>       ierr = MPI_Isend(svalues+starts[i],nprocs[i],MPIU_INT,i,tag,comm,send_waits+count++);CHKERRQ(ierr);
>     }
>   }
>
>
> So I need to (1) use your alltoall PetscGatherMessageLengths() but also
> (2) replace the sends and receives above with MPI_Alltoallv().
>
> Is that correct? Did you also fix (2) or did fixing (1) help so much you
> didn't need to fix (2)?
>
>
At that time I just fixed (1), not (2). My specific problem was not with
timings per se, but with MPI (IntelMPI if I remember correctly) crashing
when doing the rendez-vous with thousands of processes.
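
For what it's worth, here is a rough, untested sketch of what (2) could look like
with a single MPI_Alltoallv. It assumes svalues is packed contiguously by
destination rank (exactly as the starts[] code above packs it) and that nprocs[]
holds the per-destination counts; the function name and the count/displacement
arrays are only illustrative, not existing PETSc API.

---------------------
PetscErrorCode SketchAlltoallvExchange(MPI_Comm comm,const PetscMPIInt nprocs[],const PetscInt svalues[],PetscInt **rvalues)
{
  PetscErrorCode ierr;
  PetscMPIInt    size,i;
  PetscMPIInt    *sendcnts,*recvcnts,*sdispls,*rdispls;
  PetscInt       recvtotal;

  PetscFunctionBegin;
  ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
  ierr = PetscMalloc4(size,&sendcnts,size,&recvcnts,size,&sdispls,size,&rdispls);CHKERRQ(ierr);
  /* every rank learns how many entries it will receive from every other rank */
  ierr = MPI_Alltoall((void*)nprocs,1,MPI_INT,recvcnts,1,MPI_INT,comm);CHKERRQ(ierr);
  sendcnts[0] = nprocs[0];
  sdispls[0]  = rdispls[0] = 0;
  for (i=1; i<size; i++) {
    sendcnts[i] = nprocs[i];
    sdispls[i]  = sdispls[i-1] + nprocs[i-1];
    rdispls[i]  = rdispls[i-1] + recvcnts[i-1];
  }
  recvtotal = rdispls[size-1] + recvcnts[size-1];
  ierr = PetscMalloc1(recvtotal,rvalues);CHKERRQ(ierr);
  /* one collective replaces the MPI_Irecv/MPI_Isend loops and the waits */
  ierr = MPI_Alltoallv((void*)svalues,sendcnts,sdispls,MPIU_INT,*rvalues,recvcnts,rdispls,MPIU_INT,comm);CHKERRQ(ierr);
  ierr = PetscFree4(sendcnts,recvcnts,sdispls,rdispls);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
-----------------------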



> Don't go to sleep yet, I may have more questions :-)
>
>
> Barry
>
>
> > On Mar 17, 2017, at 2:32 PM, Stefano Zampini <stefano.zampini at gmail.com>
> wrote:
> >
> > Pierre,
> >
> > I remember I had a similar problem some years ago when working with
> matrices with "process-dense" rows (i.e., when the off-diagonal part is
> shared by many processes). I fixed the issue by changing the implementation
> of PetscGatherMessageLengths from rendez-vous to all-to-all.
> >
> > Barry, if you have access to petsc-maint, the title of the thread is
> "Problem with PetscGatherMessageLengths".
> >
> > Hope this helps,
> > Stefano
> >
> >
> > 2017-03-17 22:21 GMT+03:00 Barry Smith <bsmith at mcs.anl.gov>:
> >
> > > On Mar 17, 2017, at 4:04 AM, Pierre Jolivet <
> Pierre.Jolivet at enseeiht.fr> wrote:
> > >
> > > On Thu, 16 Mar 2017 15:37:17 -0500, Barry Smith wrote:
> > >>> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet <
> pierre.jolivet at enseeiht.fr> wrote:
> > >>>
> > >>> Thanks Barry.
> > >>> I actually tried the application myself with my optimized build +
> your option. I'm attaching two logs for a strong scaling analysis; if
> someone could spend a minute or two looking at the numbers, I'd be really
> grateful:
> > >>> 1) MatAssembly still takes a rather long time IMHO. This is actually
> the bottleneck of my application. Especially on 1600 cores, the problem
> here is that I don't know if the huge time (almost a 5x slow-down w.r.t.
> the run on 320 cores) is due to MatMPIAIJSetPreallocationCSR (which I
> assumed beforehand was a no-op, but which is clearly not the case looking
> at the run on 320 cores) or to the option -pc_bjacobi_blocks 320 which
> also does one MatAssembly.
> > >>
> > >>    There is one additional synchronization point in the
> >> MatAssemblyEnd that has not been (and cannot be) removed. This is the
> > >> construction of the VecScatter; I think that likely explains the huge
> > >> amount of time there.
> >
> >   This concerns me
> >
> >   MatAssemblyEnd         2 1.0 7.5767e+01 1.0 0.00e+00 0.0 5.1e+06 9.4e+03 1.6e+01 64  0100  8 14  64  0100  8 14     0
> >
> >    I am thinking this is all the communication needed to set up the
> scatter. Do you have access to any performance profilers like Intel
> speedshop to see what is going on during all this time?
> >
> >
> >    -vecscatter_alltoall uses all-to-all communication in the scatters,
> but it does not use all-to-all in setting up the scatter (that is,
> determining exactly what needs to be scattered each time). I think this
> is the problem. We need to add more scatter set-up code to optimize this
> case.
> >
> >
> >
> > >>
> > >>> 2) The other bottleneck is MatMult, which itself calls VecScatter.
> Since the structure of the matrix is rather dense, I'm guessing the
> communication pattern should be similar to an all-to-all. After having a
> look at the thread "VecScatter scaling problem on KNL", would you also
> suggest I use -vecscatter_alltoall, or do you think this would not be
> appropriate for the MatMult?
> > >>
> > >>   Please run with
> > >>
> > >>   -vecscatter_view ::ascii_info
> > >>
> > >> this will give information about the number of messages and sizes
> > >> needed in the VecScatter. To help decide what to do next.
> > >
> > > Here are two more logs. One with -vecscatter_view ::ascii_info, which I
> don't really know how to analyze (I've spotted, though, that there are a
> couple of negative integers for the data counters, maybe you are using long
> instead of long long?), the other with -vecscatter_alltoall. The latter
> option gives a 2x speed-up for the MatMult, and for the PCApply too (which
> is weird to me because there should be no global communication with bjacobi
> and the diagonal blocks are only of size "5 processes", so the speed-up
> seems rather huge for just doing VecScatter for gathering and scattering
> the RHS/solution for all 320 MUMPS instances).
> >
> >   OK, this is good, it confirms that the large amount of communication
> needed in the scatters was a major problem and that using the all-to-all helps.
> This is about all you can do about the scatter time.
> >
> >
> >
> >   Barry
> >
> > >
> > > Thanks for your help,
> > > Pierre
> > >
> > >>  Barry
> > >>
> > >>
> > >>
> > >>
> > >>>
> > >>> Thank you very much,
> > >>> Pierre
> > >>>
> > >>> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:
> > >>>> I don't think the lack of --with-debugging=no is important here,
> > >>>> though they should use --with-debugging=no for production runs.
> > >>>>
> > >>>>  I think the reason for the "funny" numbers is that
> > >>>> MatAssemblyBegin and End in this case have explicit synchronization
> > >>>> points, so some processes are waiting for other processes to get to
> > >>>> the synchronization point. Thus it looks like some processes are
> > >>>> spending a lot of time in the assembly routines when they are not
> > >>>> really; they are just waiting.
> > >>>>
> > >>>>  You can remove the synchronization point by calling
> > >>>>
> > >>>>   MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);
> > >>>>
> > >>>> before calling MatMPIAIJSetPreallocationCSR().
> > >>>>
> > >>>>  Barry
> > >>>>
> > >>>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet <
> Pierre.Jolivet at enseeiht.fr> wrote:
> > >>>>>
> > >>>>> Hello,
> > >>>>> I have an application with a matrix with lots of nonzero entries
> (that are perfectly load balanced between processes and rows).
> > >>>>> An end user is currently using a PETSc library compiled with the
> following flags (among others):
> > >>>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11"
> --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3
> > >>>>> Notice the lack of --with-debugging=no
> > >>>>> The matrix is assembled using MatMPIAIJSetPreallocationCSR and we
> end up with something like this in the -log_view output:
> > >>>>> MatAssemblyBegin       2 1.0 1.2520e+002602.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0 2     0
> > >>>>> MatAssemblyEnd         2 1.0 4.5104e+01 1.0 0.00e+00 0.0 8.2e+05 3.2e+04 4.6e+01 40  0 14  4  9  40  0 14  4  9 0
> > >>>>>
> > >>>>> For reference, here is what the matrix looks like (keep in mind it
> is well balanced)
> > >>>>> Mat Object:   640 MPI processes
> > >>>>>  type: mpiaij
> > >>>>>  rows=10682560, cols=10682560
> > >>>>>  total: nonzeros=51691212800, allocated nonzeros=51691212800
> > >>>>>  total number of mallocs used during MatSetValues calls =0
> > >>>>>    not using I-node (on process 0) routines
> > >>>>>
> > >>>>> Are MatAssemblyBegin/MatAssemblyEnd highly sensitive to the
> --with-debugging option on x86 even though the corresponding code is
> compiled with -O2, i.e., should I tell the user to have their PETSc library
> recompiled, or would you recommend I use another routine for assembling
> such a matrix?
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Pierre
> > >>> <AD-3D-320_7531028.o><AD-3D-1600_7513074.o>
> > > <AD-3D-1600_7533982_info.o><AD-3D-1600_7533637_alltoall.o>
> >
> >
> >
> >
> > --
> > Stefano
>
>


-- 
Stefano

