<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">The number of messages during the MatAssembly is effectively halved<div class="">MatAssemblyEnd 2 1.0 7.2139e+01 1.0 0.00e+00 0.0 2.6e+06 1.9e+04 1.8e+01 62 0 99 8 15 62 0 99 8 15 0</div><div class="">But that was only a few second faster (and this may even only be system noise).</div><div class="">I’ll see what I can infer from the openspeedshop profiling, and might give another MPI implementation a try during the weekend (I’m using BullxMPI, based on an ancient OpenMPI, but maybe IntelMPI gives better results).</div><div class=""><br class=""></div><div class="">Thanks anyway!</div><div class="">Pierre</div><div class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Mar 17, 2017, at 9:23 PM, Pierre Jolivet <<a href="mailto:pierre.jolivet@enseeiht.fr" class="">pierre.jolivet@enseeiht.fr</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html charset=utf-8" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Thank you for all your input. openspeedshop/2.1 is installed on my cluster but it appears something is wrong with the MPI wrapper so I’ll have to wait for the answer from the support on Monday.<div class="">In the meantime I’ll try the patch from Stefano which looks very promising since it will replace 1599 sends and 1599 receives by a single all-to-all.</div><div class="">Thanks again!</div><div class="">Pierre</div><div class=""><br class=""><div class=""><blockquote type="cite" class=""><div class="">On Mar 17, 2017, at 8:59 PM, Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com" class="">stefano.zampini@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><br class="Apple-interchange-newline"><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><div class="gmail_quote" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">2017-03-17 22:52 GMT+03:00 Barry Smith<span class="Apple-converted-space"> </span><span dir="ltr" class=""><<a href="mailto:bsmith@mcs.anl.gov" target="_blank" class="">bsmith@mcs.anl.gov</a>></span>:<br class=""><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex;"><br class=""> Stefano,<br class=""><br class=""> <span class="Apple-converted-space"> </span>Thanks this is very helpful.<br class=""><br class="">---------------------<br class="">Why not? here is my naive implementation with AlltoAll, which perform better in my case<br class=""><br class="">PetscErrorCode PetscGatherMessageLengths(MPI_<wbr class="">Comm comm,PetscMPIInt nsends,PetscMPIInt nrecvs,const PetscMPIInt ilengths[],PetscMPIInt **onodes,PetscMPIInt **olengths)<br class="">{<br class=""> <span class="Apple-converted-space"> </span>PetscErrorCode ierr;<br class=""> <span class="Apple-converted-space"> </span>PetscMPIInt size,i,j;<br class=""> <span class="Apple-converted-space"> </span>PetscMPIInt *all_lengths;<br class=""><br class=""> <span class="Apple-converted-space"> </span>PetscFunctionBegin;<br class=""> <span class="Apple-converted-space"> </span>ierr = MPI_Comm_size(comm,&size);<wbr class="">CHKERRQ(ierr);<br class=""> <span class="Apple-converted-space"> </span>ierr = PetscMalloc(size*sizeof(<wbr class="">PetscMPIInt),&all_lengths);<wbr class="">CHKERRQ(ierr);<br class=""> <span class="Apple-converted-space"> </span>ierr = MPI_Alltoall((void*)ilengths,<wbr class="">1,MPI_INT,all_lengths,1,MPI_<wbr class="">INT,comm);CHKERRQ(ierr);<br class=""> <span class="Apple-converted-space"> </span>ierr = PetscMalloc(nrecvs*sizeof(<wbr class="">PetscMPIInt),olengths);<wbr class="">CHKERRQ(ierr);<br class=""> <span class="Apple-converted-space"> </span>ierr = PetscMalloc(nrecvs*sizeof(<wbr class="">PetscMPIInt),onodes);CHKERRQ(<wbr class="">ierr);<br class=""> <span class="Apple-converted-space"> </span>for (i=0,j=0; i<size; i++) {<br class=""> <span class="Apple-converted-space"> </span>if (all_lengths[i]) {<br class=""> <span class="Apple-converted-space"> </span>(*olengths)[j] = all_lengths[i];<br class=""> <span class="Apple-converted-space"> </span>(*onodes)[j] = i;<br class=""> <span class="Apple-converted-space"> </span>j++;<br class=""> <span class="Apple-converted-space"> </span>}<br class=""> <span class="Apple-converted-space"> </span>}<br class=""> <span class="Apple-converted-space"> </span>if (j != nrecvs) SETERRQ2(comm,PETSC_ERR_PLIB,"<wbr class="">Unexpected number of senders %d != %d",j,nrecvs);<br class=""> <span class="Apple-converted-space"> </span>ierr = PetscFree(all_lengths);<wbr class="">CHKERRQ(ierr);<br class=""> <span class="Apple-converted-space"> </span>PetscFunctionReturn(0);<br class="">}<br class="">-----------------------<br class=""><br class=""> <span class="Apple-converted-space"> </span>However I think this is only half the answer. If I look at VecScatterCreate_PtoS() for example it has<br class=""><br class=""> <span class="Apple-converted-space"> </span>ierr = PetscGatherNumberOfMessages(<wbr class="">comm,NULL,nprocs,&nrecvs);<wbr class="">CHKERRQ(ierr);<br class=""> <span class="Apple-converted-space"> </span>ierr = PetscGatherMessageLengths(<wbr class="">comm,nsends,nrecvs,nprocs,&<wbr class="">onodes1,&olengths1);CHKERRQ(<wbr class="">ierr);<br class=""> <span class="Apple-converted-space"> </span>ierr = PetscSortMPIIntWithArray(<wbr class="">nrecvs,onodes1,olengths1);<wbr class="">CHKERRQ(ierr);<br class=""> <span class="Apple-converted-space"> </span>recvtotal = 0; for (i=0; i<nrecvs; i++) recvtotal += olengths1[i];<br class=""><br class=""> <span class="Apple-converted-space"> </span>/* post receives: */<br class=""> <span class="Apple-converted-space"> </span>ierr = PetscMalloc3(recvtotal,&<wbr class="">rvalues,nrecvs,&source,nrecvs,<wbr class="">&recv_waits);CHKERRQ(ierr);<br class=""> <span class="Apple-converted-space"> </span>count = 0;<br class=""> <span class="Apple-converted-space"> </span>for (i=0; i<nrecvs; i++) {<br class=""> <span class="Apple-converted-space"> </span>ierr = MPI_Irecv((rvalues+count),<wbr class="">olengths1[i],MPIU_INT,onodes1[<wbr class="">i],tag,comm,recv_waits+i);<wbr class="">CHKERRQ(ierr);<br class=""> <span class="Apple-converted-space"> </span>count += olengths1[i];<br class=""> <span class="Apple-converted-space"> </span>}<br class=""><br class=""> <span class="Apple-converted-space"> </span>/* do sends:<br class=""> 1) starts[i] gives the starting index in svalues for stuff going to<br class=""> the ith processor<br class=""> <span class="Apple-converted-space"> </span>*/<br class=""> <span class="Apple-converted-space"> </span>nxr = 0;<br class=""> <span class="Apple-converted-space"> </span>for (i=0; i<nx; i++) {<br class=""> <span class="Apple-converted-space"> </span>if (owner[i] != rank) nxr++;<br class=""> <span class="Apple-converted-space"> </span>}<br class=""> <span class="Apple-converted-space"> </span>ierr = PetscMalloc3(nxr,&svalues,<wbr class="">nsends,&send_waits,size+1,&<wbr class="">starts);CHKERRQ(ierr);<br class=""><br class=""> <span class="Apple-converted-space"> </span>starts[0] = 0;<br class=""> <span class="Apple-converted-space"> </span>for (i=1; i<size; i++) starts[i] = starts[i-1] + nprocs[i-1];<br class=""> <span class="Apple-converted-space"> </span>for (i=0; i<nx; i++) {<br class=""> <span class="Apple-converted-space"> </span>if (owner[i] != rank) svalues[starts[owner[i]]++] = bs*inidx[i];<br class=""> <span class="Apple-converted-space"> </span>}<br class=""> <span class="Apple-converted-space"> </span>starts[0] = 0;<br class=""> <span class="Apple-converted-space"> </span>for (i=1; i<size+1; i++) starts[i] = starts[i-1] + nprocs[i-1];<br class=""> <span class="Apple-converted-space"> </span>count = 0;<br class=""> <span class="Apple-converted-space"> </span>for (i=0; i<size; i++) {<br class=""> <span class="Apple-converted-space"> </span>if (nprocs[i]) {<br class=""> <span class="Apple-converted-space"> </span>ierr = MPI_Isend(svalues+starts[i],<wbr class="">nprocs[i],MPIU_INT,i,tag,comm,<wbr class="">send_waits+count++);CHKERRQ(<wbr class="">ierr);<br class=""> <span class="Apple-converted-space"> </span>}<br class=""> <span class="Apple-converted-space"> </span>}<br class=""><br class=""><br class="">So I need to (1) use your alltoall PetscGatherMessageLengths() but also (2) replace the sends and receives above with alltoallv();<br class=""><br class="">Is that correct? Did you also fix (2) or did fixing (1) help so much you didn't need to fix (2)?<br class=""><br class=""></blockquote><div class=""><br class=""></div><div class="">At that time I just fixed (1), not (2). My specific problem was not with timings per se, but with MPI (IntelMPI if I remember correctly) crashing when doing the rendez-vous with thousands of processes.</div><div class=""> </div><div class=""> </div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex;">Don't go to sleep yet, I may have more questions :-)<br class=""><span class="HOEnZb"><font color="#888888" class=""><br class=""><br class="">Barry<br class=""></font></span><div class="HOEnZb"><div class="h5"><br class=""><br class="">> On Mar 17, 2017, at 2:32 PM, Stefano Zampini <<a href="mailto:stefano.zampini@gmail.com" class="">stefano.zampini@gmail.com</a>> wrote:<br class="">><br class="">> Pierre,<br class="">><br class="">> I remember I had a similar problem some years ago when working with matrices with "process-dense" rows (i.e., when the off-diagonal part is shared by many processes). I fixed the issue by changing the implementation of PetscGatherMessageLenghts, from rendez-vous to all-to-all.<br class="">><br class="">> Barry, if you had access to petsc-maint, the title of the thread is "Problem with PetscGatherMessageLengths".<br class="">><br class="">> Hope this helps,<br class="">> Stefano<br class="">><br class="">><br class="">> 2017-03-17 22:21 GMT+03:00 Barry Smith <<a href="mailto:bsmith@mcs.anl.gov" class="">bsmith@mcs.anl.gov</a>>:<br class="">><br class="">> > On Mar 17, 2017, at 4:04 AM, Pierre Jolivet <<a href="mailto:Pierre.Jolivet@enseeiht.fr" class="">Pierre.Jolivet@enseeiht.fr</a>> wrote:<br class="">> ><br class="">> > On Thu, 16 Mar 2017 15:37:17 -0500, Barry Smith wrote:<br class="">> >>> On Mar 16, 2017, at 10:57 AM, Pierre Jolivet <<a href="mailto:pierre.jolivet@enseeiht.fr" class="">pierre.jolivet@enseeiht.fr</a>> wrote:<br class="">> >>><br class="">> >>> Thanks Barry.<br class="">> >>> I actually tried the application myself with my optimized build + your option. I'm attaching two logs for a strong scaling analysis, if someone could spend a minute or two looking at the numbers I'd be really grateful:<br class="">> >>> 1) MatAssembly still takes a rather long time IMHO. This is actually the bottleneck of my application. Especially on 1600 cores, the problem here is that I don't know if the huge time (almost a 5x slow-down w.r.t. the run on 320 cores) is due to MatMPIAIJSetPreallocationCSR (which I assumed beforehand was a no-op, but which is clearly not the case looking at the run on 320 cores) or the the option -pc_bjacobi_blocks 320 which also does one MatAssembly.<br class="">> >><br class="">> >> There is one additional synchronization point in the<br class="">> >> MatAssemblyEnd that has not/cannot be removed. This is the<br class="">> >> construction of the VecScatter; I think that likely explains the huge<br class="">> >> amount of time there.<br class="">><br class="">> This concerns me<br class="">><br class="">> MatAssemblyEnd 2 1.0 7.5767e+01 1.0 0.00e+00 0.0 5.1e+06 9.4e+03 1.6e+01 64 0100 8 14 64 0100 8 14 0<br class="">><br class="">> I am thinking this is all the communication needed to set up the scatter. Do you have access to any performance profilers like Intel speedshop to see what is going on during all this time?<br class="">><br class="">><br class="">> -vecscatter_alltoall uses alltoall in communication in the scatters but it does not use all to all in setting up the scatter (that is determining exactly what needs to be scattered at each time). I think this is the problem. We need to add more scatter set up code to optimize this case.<br class="">><br class="">><br class="">><br class="">> >><br class="">> >>> 2) The other bottleneck is MatMult, which itself calls VecScatter. Since the structure of the matrix is rather dense, I'm guessing the communication pattern should be similar to an all-to-all. After having a look at the thread "VecScatter scaling problem on KNL", would you also suggest me to use -vecscatter_alltoall, or do you think this would not be appropriate for the MatMult?<br class="">> >><br class="">> >> Please run with<br class="">> >><br class="">> >> -vecscatter_view ::ascii_info<br class="">> >><br class="">> >> this will give information about the number of messages and sizes<br class="">> >> needed in the VecScatter. To help decide what to do next.<br class="">> ><br class="">> > Here are two more logs. One with -vecscatter_view ::ascii_info which I don't really know how to analyze (I've spotted though that there are a couple of negative integers for the data counters, maybe you are using long instead of long long?), the other with -vecscatter_alltoall. The latter option gives a 2x speed-up for the MatMult, and for the PCApply too (which is weird to me because there should be no global communication with bjacobi and the diagonal blocks are only of size "5 processes" so the speed-up seems rather huge for just doing VecScatter for gathering and scattering the RHS/solution for all 320 MUMPS instances).<br class="">><br class="">> ok, this is good, it confirms that the large amount of communication needed in the scatters were a major problem and using the all to all helps. This is about all you can do about the scatter time.<br class="">><br class="">><br class="">><br class="">> Barry<br class="">><br class="">> ><br class="">> > Thanks for your help,<br class="">> > Pierre<br class="">> ><br class="">> >> Barry<br class="">> >><br class="">> >><br class="">> >><br class="">> >><br class="">> >>><br class="">> >>> Thank you very much,<br class="">> >>> Pierre<br class="">> >>><br class="">> >>> On Mon, 6 Mar 2017 09:34:53 -0600, Barry Smith wrote:<br class="">> >>>> I don't think the lack of the --with-debugging=no is important here.<br class="">> >>>> Though he/she should use --with-debugging=no for production runs.<br class="">> >>>><br class="">> >>>> I think the reason for the "funny" numbers is that<br class="">> >>>> MatAssemblyBegin and End in this case have explicit synchronization<br class="">> >>>> points so some processes are waiting for other processes to get to the<br class="">> >>>> synchronization point thus it looks like some processes are spending a<br class="">> >>>> lot of time in the assembly routines when they are not really, they<br class="">> >>>> are just waiting.<br class="">> >>>><br class="">> >>>> You can remove the synchronization point by calling<br class="">> >>>><br class="">> >>>> MatSetOption(mat, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE); before<br class="">> >>>> calling MatMPIAIJSetPreallocationCSR()<br class="">> >>>><br class="">> >>>> Barry<br class="">> >>>><br class="">> >>>>> On Mar 6, 2017, at 8:59 AM, Pierre Jolivet <<a href="mailto:Pierre.Jolivet@enseeiht.fr" class="">Pierre.Jolivet@enseeiht.fr</a>> wrote:<br class="">> >>>>><br class="">> >>>>> Hello,<br class="">> >>>>> I have an application with a matrix with lots of nonzero entries (that are perfectly load balanced between processes and rows).<br class="">> >>>>> A end user is currently using a PETSc library compiled with the following flags (among others):<br class="">> >>>>> --CFLAGS=-O2 --COPTFLAGS=-O3 --CXXFLAGS="-O2 -std=c++11" --CXXOPTFLAGS=-O3 --FFLAGS=-O2 --FOPTFLAGS=-O3<br class="">> >>>>> Notice the lack of --with-debugging=no<br class="">> >>>>> The matrix is assembled using MatMPIAIJSetPreallocationCSR and we end up with something like that in the -log_view:<br class="">> >>>>> MatAssemblyBegin 2 1.0 1.2520e+002602.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 0<br class="">> >>>>> MatAssemblyEnd 2 1.0 4.5104e+01 1.0 0.00e+00 0.0 8.2e+05 3.2e+04 4.6e+01 40 0 14 4 9 40 0 14 4 9 0<br class="">> >>>>><br class="">> >>>>> For reference, here is what the matrix looks like (keep in mind it is well balanced)<br class="">> >>>>> Mat Object: 640 MPI processes<br class="">> >>>>> type: mpiaij<br class="">> >>>>> rows=10682560, cols=10682560<br class="">> >>>>> total: nonzeros=51691212800, allocated nonzeros=51691212800<br class="">> >>>>> total number of mallocs used during MatSetValues calls =0<br class="">> >>>>> not using I-node (on process 0) routines<br class="">> >>>>><br class="">> >>>>> Are MatAssemblyBegin/<wbr class="">MatAssemblyEnd highly sensitive to the --with-debugging option on x86 even though the corresponding code is compiled with -O2, i.e., should I tell the user to have its PETSc lib recompiled, or would you recommend me to use another routine for assembling such a matrix?<br class="">> >>>>><br class="">> >>>>> Thanks,<br class="">> >>>>> Pierre<br class="">> >>> <AD-3D-320_7531028.o><AD-3D-<wbr class="">1600_7513074.o><br class="">> > <AD-3D-1600_7533982_info.o><<wbr class="">AD-3D-1600_7533637_alltoall.o><br class="">><br class="">><br class="">><br class="">><br class="">> --<br class="">> Stefano<br class=""><br class=""></div></div></blockquote></div><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br clear="all" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><div style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br class=""></div><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">--<span class="Apple-converted-space"> </span></span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><div class="gmail_signature" data-smartmail="gmail_signature" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">Stefano</div></div></blockquote></div><br class=""></div></div></div></blockquote></div><br class=""></div></body></html>