[petsc-users] possible performance issues with PETSc on Cray
Samar Khatiwala
spk at ldeo.columbia.edu
Fri May 9 06:47:58 CDT 2014
Hi Jed et al.,
Just wanted to report back on the resolution of this issue. The computing support people at HLRN in Germany
submitted a test case to Cray regarding performance on their XC30. Cray has finally gotten back with a solution,
which is to use the run-time option -vecscatter_alltoall. Apparently this is a known issue, and according to the
HLRN folks, passing this command-line option to PETSc works nicely.
Thanks again for your help.
Samar
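[Editor's note: the option above is passed like any other PETSc runtime option; a minimal sketch, where the executable name and process count are placeholders, not from this thread:]

```shell
# Pass -vecscatter_alltoall on the command line like any PETSc runtime option
# (./app and the core count are placeholders):
aprun -n 512 ./app -vecscatter_alltoall

# Or set it for all runs in the session via PETSc's options environment variable:
export PETSC_OPTIONS="-vecscatter_alltoall"
aprun -n 512 ./app
```

Either form makes VecScatter use MPI_Alltoall-style communication instead of point-to-point messages, which is what Cray recommended here.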
On Apr 11, 2014, at 7:44 AM, Jed Brown <jed at jedbrown.org> wrote:
> Samar Khatiwala <spk at ldeo.columbia.edu> writes:
>
>> Hello,
>>
>> This is a somewhat vague query but I and a colleague have been running PETSc (3.4.3.0) on a Cray
>> XC30 in Germany (https://www.hlrn.de/home/view/System3/WebHome) and the system administrators
>> alerted us to some anomalies with our jobs that may or may not be related to PETSc but I thought I'd ask
>> here in case others have noticed something similar.
>>
>> First, there was a large variation in run-time for identical jobs, sometimes as much as 50%. We didn't
>> really pick up on this but other users complained to the IT people that their jobs were taking a performance
>> hit with a similar variation in run-time. At that point, we're told, the IT folks started monitoring jobs and
>> carrying out tests to see what was going on. They discovered that (1) this always happened when we were
>> running our jobs and (2) the problem got worse with physical proximity to the nodes on which our jobs were
>> running (what they described as a "strong interaction" between our jobs and others presumably through the
>> communication network).
>
> It sounds like you are strong scaling (smallish subdomains) so that your
> application is sensitive to network latency. I see significant
> performance variability on XC-30 with this Full Multigrid solver that is
> not using PETSc.
>
> http://59a2.org/files/hopper-vs-edison.3semilogx.png
>
> See the factor of 2 performance variability for the samples of the ~15M
> element case. This operation is limited by instruction issue rather
> than bandwidth (indeed, it is several times faster than doing the same
> operations with assembled matrices). Here the variability is within the
> same application performing repeated solves. If you get a different
> partition on a different run, you can see larger variation.
>
> If your matrices are large enough, your performance will be limited by
> memory bandwidth. (This is the typical case, but sufficiently small
> matrices can fit in cache.) I once encountered a batch system that did
> not properly reset nodes between runs, leaving a partially-filled
> ramdisk distributed asymmetrically across the memory busses. This led
> to 3x performance reduction on 4-socket nodes because much of the memory
> demanded by the application would be faulted onto one memory bus.
> Presumably your machine has a resource manager that would not allow such
> things to happen.