[petsc-dev] VecScatter scaling problem on KNL

Thu Mar 9 12:32:46 CST 2017

It would be really good to have details about the MPI performance issues
suggested here before assigning blame on anybody.  There are a number of
good MPI profiling tools out there, many of which are supported on Cray
machines.  In addition, one can use CrayPat to get hardware counters
related to the network usage.

>    Ok, in this situation VecScatter cannot detect that it is an all to all
> so will generate a message from each process to each other process.

This is almost certainly the problem, not the MPI implementation.  If the
application is doing all-to-all, it should call MPI_Alltoall(v), not a
bunch of MPI_Scatter(v).  Just as you should not write DGEMM in terms of a
loop around DGEMV or double loops around DAXPY or DDOT, you should not
implement many-to-many as a loop over 1-to-many or many-to-1.

When you fail to properly express your communication semantic to the MPI
library, you inhibit its ability to do intelligent communication scheduling
or implement flow control.  If you know what you are doing, you can
reinvent your own wheel and roll your own collectives that do careful
scheduling, but you will have to this over and over again as your
application and hardware changes.

> Given my past experience with Cray MPI (why do they even have their own
> MPI when Intel provides one; in fact why does Cray even exist when they
> just take other people's products and put their name on them)

Cray MPI and Intel MPI are both derived from MPICH, but both have
_significant_ downstream optimizations, many of which pertain to the
underlying networks on which they run, and some that are network-agnostic
but may not be sufficiently general to be included in MPICH.  I am not
aware of any inheritance from Intel MPI into Cray MPI.  Both Intel MPI and
Cray MPI have their own optimizations related to KNL.  I believe Cray MPI
also has MPI-IO optimizations for Lustre, but I don't know any of the
details.

Among many other things, Cray MPI directly supports the Aries network, as
well as previous generations of Cray interconnects.  These are highly
nontrivial efforts.  Very recently, the OFI/libfabric effort has made it
possible to run MPICH and Intel MPI on Aries systems as well, but neither
of these is officially supported.  If you don't know what you are doing,
you will end up running over TCP/IP and the performance will be terrible.
Open-MPI has Cray network support as well, but I don't have any experience
using it.  In all of the results that I've seen, Cray MPI is the best
overall for Cray networks.

To be more explicit, do not use an unsupported MPI library on a Cray
machine in hopes of pinning the blame on Cray MPI.  Cray MPI is the most
likely by far to have collective optimizations for the Aries dragonfly
network and if it doesn't perform as you wish, then do a proper root cause
analysis in case the application is pathological.  Once you determine the
application is not performing a DDOS attack on the network, you should look
up the various tuning nobs Cray MPI exposes via environment variables.  See
slides 12 and 13 of https://www.alcf.anl.gov/files/ANL_MPI_on_KNL.pdf for
some examples.

> I am not totally surprised if the Cray MPI chocks on this flood of
> messages.
>

I would bet that whatever pattern is causing the problem here is a problem
with every MPI on every network.  Cray systems are very good for all-to-all
communication patterns.

Best,

Jeff, who is speaking as a user of Cray machines and an MPI Forum nerd, not
on behalf of any vendor

>
>    1) Test with Intel MPI, perhaps they handle this case in a scalable way
>
>     2) If Intel MPI also produces poor performance then (interesting, how
> come on other systems in the past this wasn't a bottleneck for the code?)
> the easiest solution is to separate the operation into two parts. Use a
> VecScatterCreateToAll() to get all the data to all the processes and then
> use another (purely sequential) VecScatter to get the data from this
> intermediate buffer into the final vector that has the "extra" locations
> for the boundary conditions in the final destination vector.
>
>   BTW: You know this already, but any implementation that requires storing
> the "entire" vector on each process is, by definition, not scalable and
> hence should not even be considered for funding by ECP or SciDAC.
>
>
>   Barry
>
>
> > On Mar 8, 2017, at 8:43 PM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> >>
> >>    Is the scatter created with VecScatterCreateToAll()? If so,
> internally the VecScatterBegin/End will use VecScatterBegin_MPI_ToAll()
> which then uses a MPI_Allgatherv() to do the communication.  You can check
> in the debugger for this (on 2 processes) by just putting a break point in
> VecScatterBegin_MPI_ToAll() to confirm if it is called.
> >
> > Alas, not I did not use VecScatterCreateToAll and
> > VecScatterCreateToAll will take some code changes.
> >
> > There are boundary conditions in the destination vector, and so we
> > scatter into a larger vector the the global size of the PETSc vector,
> > using a general IS. With code that looks like this:
> >
> >  call ISCreateGeneral(PETSC_COMM_SELF,nreal,petsc_xgc,PETSC_
> COPY_VALUES,is,ierr)
> >  call VecScatterCreate(this%xVec,PETSC_NULL_OBJECT,vec,is,this%
> from_petsc,ierr)
> > ! reverse scatter object
> >
> > If we want to make this change then I could help a developer or you
> > can get me set up with a (small) test problem and a branch and I can
> > do it at NERSC.
> >
> > Thanks,
>
>

-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20170309/4da6fa27/attachment.html>