[petsc-dev] VecScatter scaling problem on KNL

Barry Smith bsmith at mcs.anl.gov
Thu Mar 9 21:47:22 CST 2017


> On Mar 9, 2017, at 12:32 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
> 
> It would be really good to have details about the MPI performance issues suggested here before assigning blame to anybody.

   No one is assigning blame to anyone.

>  There are a number of good MPI profiling tools out there, many of which are supported on Cray machines.  In addition, one can use CrayPat to get hardware counters related to the network usage.
>  
>    Ok, in this situation VecScatter cannot detect that it is an all-to-all, so it will generate a message from each process to every other process.
> 
> This is almost certainly the problem, not the MPI implementation.

   Yes, but interestingly this problem has not appeared before on previous-generation machines with presumably even more cores, so there may have been some regression in the MPI implementations. Presumably the vendor would like to know about such a regression (even if they cannot, or decide not to, "fix" it).

>  If the application is doing all-to-all, it should call MPI_Alltoall(v), not a bunch of MPI_Scatter(v).

   True.  Because of the specific nature of Mark's communication pattern, with the "holes" for boundary conditions, PETSc's VecScatter could not detect the all-to-all nature automatically. In theory VecScatter could detect it (I have to think about it some more), but it would require additional communication when setting up the communication pattern to establish the all-to-all nature.
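
   To illustrate the kind of extra setup-time communication involved, here is a minimal C sketch (the function and variable names are made up for illustration and are not PETSc internals): each rank checks locally whether its scatter requests the entire source vector, and a single MPI_Allreduce then establishes whether that is true on every rank, in which case the scatter could be implemented with one MPI_Allgatherv instead of P*P point-to-point messages.

#include <mpi.h>

/* Hypothetical setup-time check (not PETSc code): does every rank request
   the entire source vector?  'nrecv' is the number of source entries this
   rank's scatter pulls (assumed free of duplicate indices); 'Nglobal' is
   the global length of the source vector. */
static int ScatterWantsEverythingEverywhere(MPI_Comm comm, int nrecv, int Nglobal)
{
  int mine = (nrecv == Nglobal);   /* purely local test */
  int all  = 0;
  /* the one extra collective needed at setup time */
  MPI_Allreduce(&mine, &all, 1, MPI_INT, MPI_LAND, comm);
  return all;                      /* if true, MPI_Allgatherv beats P*P messages */
}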


>  Just as you should not write DGEMM in terms of a loop around DGEMV or double loops around DAXPY or DDOT, you should not implement many-to-many as a loop over 1-to-many or many-to-1.
> 
> When you fail to properly express your communication semantics to the MPI library, you inhibit its ability to do intelligent communication scheduling or implement flow control.  If you know what you are doing, you can reinvent the wheel and roll your own collectives that do careful scheduling, but you will have to do this over and over again as your application and hardware change.
>  
> Given my past experience with Cray MPI (why do they even have their own MPI when Intel provides one; in fact why does Cray even exist when they just take other people's products and put their name on them)
> 
> Cray MPI and Intel MPI are both derived from MPICH, but both have _significant_ downstream optimizations, many of which pertain to the underlying networks on which they run, and some that are network-agnostic but may not be sufficiently general to be included in MPICH.  I am not aware of any inheritance from Intel MPI into Cray MPI.  Both Intel MPI and Cray MPI have their own optimizations related to KNL.  I believe Cray MPI also has MPI-IO optimizations for Lustre, but I don't know any of the details.
> 
> Among many other things, Cray MPI directly supports the Aries network, as well as previous generations of Cray interconnects.  These are highly nontrivial efforts.  Very recently, the OFI/libfabric effort has made it possible to run MPICH and Intel MPI on Aries systems as well, but neither of these is officially supported.  If you don't know what you are doing, you will end up running over TCP/IP and the performance will be terrible.  Open-MPI has Cray network support as well, but I don't have any experience using it.  In all of the results that I've seen, Cray MPI is the best overall for Cray networks.
> 
> To be more explicit, do not use an unsupported MPI library on a Cray machine in hopes of pinning the blame on Cray MPI.  Cray MPI is by far the most likely to have collective optimizations for the Aries dragonfly network, and if it doesn't perform as you wish, then do a proper root-cause analysis in case the application is pathological.  Once you determine the application is not performing a DDoS attack on the network, you should look up the various tuning knobs Cray MPI exposes via environment variables.  See slides 12 and 13 of https://www.alcf.anl.gov/files/ANL_MPI_on_KNL.pdf for some examples.
>  
> I am not totally surprised if Cray MPI chokes on this flood of messages.
> 
> I would bet that whatever pattern is causing the problem here is a problem with every MPI on every network.  Cray systems are very good for all-to-all communication patterns.
> 
> Best,
> 
> Jeff, who is speaking as a user of Cray machines and an MPI Forum nerd, not on behalf of any vendor
>  
> 
>    1) Test with Intel MPI; perhaps it handles this case in a scalable way.
> 
>     2) If Intel MPI also produces poor performance (though it is interesting that this was not a bottleneck for the code on other systems in the past), the easiest solution is to separate the operation into two parts: use VecScatterCreateToAll() to get all the data to all the processes, and then use another (purely sequential) VecScatter to move the data from this intermediate buffer into the final destination vector, which has the "extra" locations for the boundary conditions.
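
For concreteness, a rough C sketch of that two-stage approach might look like the following (illustrative only: the helper name and the ylocal/to_idx/nreal arguments stand in for the index data Mark already builds for ISCreateGeneral()).

#include <petscvec.h>

/* Hypothetical helper: pull the parallel vector 'xglobal' onto every rank
   with VecScatterCreateToAll(), then do a purely local scatter into
   'ylocal', a sequential vector with the extra boundary-condition slots.
   'to_idx' has 'nreal' entries (nreal = global size of xglobal) giving the
   destination location of each entry of the gathered copy. */
PetscErrorCode GatherThenLocalScatter(Vec xglobal, Vec ylocal, PetscInt nreal, const PetscInt to_idx[])
{
  PetscErrorCode ierr;
  VecScatter     toall, local;
  Vec            xseq;                 /* full copy of xglobal on this rank */
  IS             is_to;

  PetscFunctionBegin;
  /* Stage 1: one MPI_Allgatherv under the hood instead of P*P messages */
  ierr = VecScatterCreateToAll(xglobal, &toall, &xseq);CHKERRQ(ierr);
  ierr = VecScatterBegin(toall, xglobal, xseq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(toall, xglobal, xseq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);

  /* Stage 2: purely sequential scatter from the full copy into the vector
     with the boundary-condition "holes"; no further communication occurs */
  ierr = ISCreateGeneral(PETSC_COMM_SELF, nreal, to_idx, PETSC_COPY_VALUES, &is_to);CHKERRQ(ierr);
  ierr = VecScatterCreate(xseq, NULL, ylocal, is_to, &local);CHKERRQ(ierr);
  ierr = VecScatterBegin(local, xseq, ylocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(local, xseq, ylocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);

  ierr = VecScatterDestroy(&local);CHKERRQ(ierr);
  ierr = ISDestroy(&is_to);CHKERRQ(ierr);
  ierr = VecScatterDestroy(&toall);CHKERRQ(ierr);
  ierr = VecDestroy(&xseq);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Stage 1 is the only communicating step, and it is a single collective; stage 2 just permutes data already on the rank, so the boundary-condition holes cost no extra communication.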
> 
>   BTW: You know this already, but any implementation that requires storing the "entire" vector on each process is, by definition, not scalable and hence should not even be considered for funding by ECP or SciDAC.
> 
> 
>   Barry
> 
> 
> > On Mar 8, 2017, at 8:43 PM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> >>
> >>    Is the scatter created with VecScatterCreateToAll()? If so, internally the VecScatterBegin/End will use VecScatterBegin_MPI_ToAll() which then uses a MPI_Allgatherv() to do the communication.  You can check  in the debugger for this (on 2 processes) by just putting a break point in VecScatterBegin_MPI_ToAll() to confirm if it is called.
> >
> > Alas, no, I did not use VecScatterCreateToAll, and switching to
> > VecScatterCreateToAll will take some code changes.
> >
> > There are boundary conditions in the destination vector, and so we
> > scatter into a vector larger than the global size of the PETSc vector,
> > using a general IS, with code that looks like this:
> >
> >  call ISCreateGeneral(PETSC_COMM_SELF,nreal,petsc_xgc,PETSC_COPY_VALUES,is,ierr)
> >  call VecScatterCreate(this%xVec,PETSC_NULL_OBJECT,vec,is,this%from_petsc,ierr)
> > ! reverse scatter object
> >
> > If we want to make this change, then I could help a developer, or you
> > can get me set up with a (small) test problem and a branch and I can
> > do it at NERSC.
> >
> > Thanks,
> 
> 
> 
> 
> -- 
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/



