[petsc-users] possible performance issues with PETSc on Cray

Matthew Knepley knepley at gmail.com
Fri May 9 06:50:01 CDT 2014


On Fri, May 9, 2014 at 6:47 AM, Samar Khatiwala <spk at ldeo.columbia.edu> wrote:

> Hi Jed et al.,
>
> Just wanted to report back on the resolution of this issue. The computing
> support people at HLRN in Germany submitted a test case to CRAY re.
> performance on their XC30. CRAY has finally gotten back with a solution,
> which is to use the run-time option -vecscatter_alltoall. Apparently this
> is a known issue, and according to the HLRN folks, passing this
> command-line option to PETSc seems to work nicely.
>

What this does is replace the point-to-point communication (MPI_Send/MPI_Recv)
in the vector scatter with collective communication (MPI_Alltoall).
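
For concreteness, here is a minimal sketch (not code from this thread; the
vector sizes and the halo pattern are made up) of the kind of scatter the
option affects in the PETSc release discussed here (3.4.x). Build it against
PETSc as usual and run the same binary with or without the option, e.g.
"mpiexec -n 64 ./scatter_demo -vecscatter_alltoall".

/* scatter_demo.c: each rank pulls two entries owned by the next rank, the
 * sort of halo exchange a VecScatter performs inside MatMult.  By default
 * the scatter posts point-to-point MPI_Send/MPI_Recv; with
 * -vecscatter_alltoall it uses a single collective (MPI_Alltoall) instead. */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x, ghosts;
  IS             ix;
  VecScatter     scatter;
  PetscInt       nlocal = 4, nghost = 2, start, end, idx[2];
  PetscMPIInt    rank, size;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);

  ierr = VecCreateMPI(PETSC_COMM_WORLD, nlocal, PETSC_DETERMINE, &x);CHKERRQ(ierr);
  ierr = VecGetOwnershipRange(x, &start, &end);CHKERRQ(ierr);
  ierr = VecSet(x, (PetscScalar)rank);CHKERRQ(ierr);

  /* Global indices of the first two entries owned by the next rank
   * (wrapping around on the last rank). */
  idx[0] = end % (nlocal*size);
  idx[1] = (end + 1) % (nlocal*size);
  ierr = ISCreateGeneral(PETSC_COMM_SELF, nghost, idx, PETSC_COPY_VALUES, &ix);CHKERRQ(ierr);

  ierr = VecCreateSeq(PETSC_COMM_SELF, nghost, &ghosts);CHKERRQ(ierr);
  ierr = VecScatterCreate(x, ix, ghosts, NULL, &scatter);CHKERRQ(ierr);
  ierr = VecScatterBegin(scatter, x, ghosts, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(scatter, x, ghosts, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);

  ierr = VecScatterDestroy(&scatter);CHKERRQ(ierr);
  ierr = ISDestroy(&ix);CHKERRQ(ierr);
  ierr = VecDestroy(&ghosts);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}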

  Thanks,

     Matt


> Thanks again for your help.
>
> Samar
>
> On Apr 11, 2014, at 7:44 AM, Jed Brown <jed at jedbrown.org> wrote:
>
> > Samar Khatiwala <spk at ldeo.columbia.edu> writes:
> >
> >> Hello,
> >>
> >> This is a somewhat vague query, but a colleague and I have been
> >> running PETSc (3.4.3.0) on a Cray XC30 in Germany
> >> (https://www.hlrn.de/home/view/System3/WebHome), and the system
> >> administrators alerted us to some anomalies with our jobs that may or
> >> may not be related to PETSc, but I thought I'd ask here in case others
> >> have noticed something similar.
> >>
> >> First, there was a large variation in run-time for identical jobs,
> >> sometimes as much as 50%. We didn't really pick up on this, but other
> >> users complained to the IT people that their jobs were taking a
> >> performance hit with a similar variation in run-time. At that point,
> >> we're told, the IT folks started monitoring jobs and carrying out
> >> tests to see what was going on. They discovered that (1) this always
> >> happened when we were running our jobs and (2) the problem got worse
> >> with physical proximity to the nodes on which our jobs were running
> >> (what they described as a "strong interaction" between our jobs and
> >> others, presumably through the communication network).
> >
> > It sounds like you are strong scaling (smallish subdomains) so that your
> > application is sensitive to network latency.  I see significant
> > performance variability on XC-30 with this Full Multigrid solver that is
> > not using PETSc.
> >
> > http://59a2.org/files/hopper-vs-edison.3semilogx.png
> >
> > See the factor of 2 performance variability for the samples of the ~15M
> > element case.  This operation is limited by instruction issue rather
> > than bandwidth (indeed, it is several times faster than doing the same
> > operations with assembled matrices).  Here the variability is within the
> > same application performing repeated solves.  If you get a different
> > partition on a different run, you can see larger variation.
> >
> > If your matrices are large enough, your performance will be limited by
> > memory bandwidth.  (This is the typical case, but sufficiently small
> > matrices can fit in cache.)  I once encountered a batch system that did
> > not properly reset nodes between runs, leaving a partially-filled
> > ramdisk distributed asymmetrically across the memory busses.  This led
> > to 3x performance reduction on 4-socket nodes because much of the memory
> > demanded by the application would be faulted onto one memory bus.
> > Presumably your machine has a resource manager that would not allow such
> > things to happen.
>
>
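
As an aside on the first-touch effect described above: on a multi-socket node
each memory page is physically placed on the socket whose core first writes
it, so anything that skews first touch (a leftover ramdisk, a serial
initialization loop) funnels later traffic through one memory bus. A small
stand-alone illustration, not from the thread, in plain C with OpenMP and an
arbitrary array size:

/* numa_touch.c: first-touch placement demo.  Compile with e.g.
 * "cc -O2 -fopenmp numa_touch.c". */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const long n = 1L << 27;                 /* ~1 GB of doubles */
  double *a = malloc(n * sizeof(*a)), sum = 0.0;
  if (!a) return 1;

  /* Parallel first touch: each thread faults the pages of its own chunk
   * onto the memory attached to the socket it is running on. */
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < n; i++) a[i] = 1.0;

  /* Bandwidth-bound sweep; its speed reflects where the pages landed.
   * If the loop above were serial, every page would sit on one socket
   * and this loop would run at roughly single-bus bandwidth. */
  #pragma omp parallel for schedule(static) reduction(+:sum)
  for (long i = 0; i < n; i++) sum += 2.0*a[i];

  printf("sum = %g\n", sum);               /* keep the work observable */
  free(a);
  return 0;
}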


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener