[petsc-users] possible performance issues with PETSc on Cray

Samar Khatiwala spk at ldeo.columbia.edu
Fri Apr 11 07:24:04 CDT 2014


Hi Jed,

Thanks for the quick reply. This is very helpful. You may well be right that my matrices are not large enough 
(~2.5e6 x 2.5e6, and I'm running on 360 cores = 15 nodes x 24 cores/node on this XC-30) and my runs are 
therefore sensitive to network latency. Would this, though, impact other people running jobs on nearby nodes? 
(I suppose it would if I'm passing too many messages because of the small size of the matrices.)
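
(For what it's worth, ~2.5e6 rows split across 360 ranks works out to only about 7,000 rows per process, which does sound small.) A minimal sketch of how one might confirm the per-rank partition, assuming an assembled AIJ matrix A and the PETSc 3.4 C API (the helper name below is just illustrative):

/* Report how many matrix rows/columns each MPI rank owns, to gauge
   whether the subdomains are small enough for latency to dominate. */
#include <petscmat.h>

PetscErrorCode ReportLocalSize(Mat A)
{
  PetscErrorCode ierr;
  PetscInt       mlocal, nlocal;
  PetscMPIInt    rank;

  PetscFunctionBeginUser;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MatGetLocalSize(A, &mlocal, &nlocal);CHKERRQ(ierr);
  /* e.g. 2.5e6 rows / 360 ranks is roughly 7,000 rows per rank */
  ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,
           "[%d] local rows %D, local cols %D\n", rank, mlocal, nlocal);CHKERRQ(ierr);
  /* PETSc 3.4 signature; newer releases also take a FILE* argument */
  ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}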

I'm going to pass your reply on to the system administrators. They will be able to understand the technical content 
better than I can.

Thanks again!

Best,

Samar

On Apr 11, 2014, at 7:44 AM, Jed Brown <jed at jedbrown.org> wrote:

> Samar Khatiwala <spk at ldeo.columbia.edu> writes:
> 
>> Hello,
>> 
>> This is a somewhat vague query, but a colleague and I have been running PETSc (3.4.3.0) on a Cray 
>> XC30 in Germany (https://www.hlrn.de/home/view/System3/WebHome) and the system administrators 
>> alerted us to some anomalies with our jobs that may or may not be related to PETSc but I thought I'd ask 
>> here in case others have noticed something similar.
>> 
>> First, there was a large variation in run-time for identical jobs, sometimes as much as 50%. We didn't 
>> really pick up on this but other users complained to the IT people that their jobs were taking a performance 
>> hit with a similar variation in run-time. At that point, we were told, the IT folks started monitoring jobs and 
>> carrying out tests to see what was going on. They discovered that (1) this always happened when we were 
>> running our jobs and (2) the problem got worse with physical proximity to the nodes on which our jobs were 
>> running (what they described as a "strong interaction" between our jobs and others presumably through the 
>> communication network).
> 
> It sounds like you are strong scaling (smallish subdomains) so that your
> application is sensitive to network latency.  I see significant
> performance variability on XC-30 with this Full Multigrid solver that is
> not using PETSc.
> 
> http://59a2.org/files/hopper-vs-edison.3semilogx.png
> 
> See the factor of 2 performance variability for the samples of the ~15M
> element case.  This operation is limited by instruction issue rather
> than bandwidth (indeed, it is several times faster than doing the same
> operations with assembled matrices).  Here the variability is within the
> same application performing repeated solves.  If you get a different
> partition on a different run, you can see larger variation.
> 
> If your matrices are large enough, your performance will be limited by
> memory bandwidth.  (This is the typical case, but sufficiently small
> matrices can fit in cache.)  I once encountered a batch system that did
> not properly reset nodes between runs, leaving a partially-filled
> ramdisk distributed asymmetrically across the memory busses.  This led
> to 3x performance reduction on 4-socket nodes because much of the memory
> demanded by the application would be faulted onto one memory bus.
> Presumably your machine has a resource manager that would not allow such
> things to happen.


