[petsc-users] possible performance issues with PETSc on Cray
Samar Khatiwala
spk at ldeo.columbia.edu
Fri May 9 06:47:58 CDT 2014
Hi Jed et al.,
Just wanted to report back on the resolution of this issue. The computing support people at HLRN in Germany
submitted a test case to Cray regarding performance on their XC30. Cray has finally gotten back with a solution,
which is to use the run-time option -vecscatter_alltoall. Apparently this is a known issue, and according to the
HLRN folks, passing this command-line option to PETSc works nicely.
Thanks again for your help.
Samar
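[Editor's note: the option above is passed like any other PETSc runtime option; a minimal sketch, where the executable name and process count are placeholders, not from this thread:]

```shell
# Pass -vecscatter_alltoall on the command line like any PETSc runtime option
# (./app and the core count are placeholders):
aprun -n 512 ./app -vecscatter_alltoall

# Or set it for all runs in the session via PETSc's options environment variable:
export PETSC_OPTIONS="-vecscatter_alltoall"
aprun -n 512 ./app
```

Either form makes VecScatter use MPI_Alltoall-style communication instead of point-to-point messages, which is what Cray recommended here.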
On Apr 11, 2014, at 7:44 AM, Jed Brown <jed at jedbrown.org> wrote:
> Samar Khatiwala <spk at ldeo.columbia.edu> writes:
>
>> Hello,
>>
>> This is a somewhat vague query but I and a colleague have been running PETSc (3.4.3.0) on a Cray
>> XC30 in Germany (https://www.hlrn.de/home/view/System3/WebHome) and the system administrators
>> alerted us to some anomalies with our jobs that may or may not be related to PETSc but I thought I'd ask
>> here in case others have noticed something similar.
>>
>> First, there was a large variation in run-time for identical jobs, sometimes as much as 50%. We didn't
>> really pick up on this but other users complained to the IT people that their jobs were taking a performance
>> hit with a similar variation in run-time. At that point, we're told, the IT folks started monitoring jobs and
>> carrying out tests to see what was going on. They discovered that (1) this always happened when we were
>> running our jobs and (2) the problem got worse with physical proximity to the nodes on which our jobs were
>> running (what they described as a "strong interaction" between our jobs and others presumably through the
>> communication network).
>
> It sounds like you are strong scaling (smallish subdomains) so that your
> application is sensitive to network latency. I see significant
> performance variability on XC-30 with this Full Multigrid solver that is
> not using PETSc.
>
> http://59a2.org/files/hopper-vs-edison.3semilogx.png
>
> See the factor of 2 performance variability for the samples of the ~15M
> element case. This operation is limited by instruction issue rather
> than bandwidth (indeed, it is several times faster than doing the same
> operations with assembled matrices). Here the variability is within the
> same application performing repeated solves. If you get a different
> partition on a different run, you can see larger variation.
>
> If your matrices are large enough, your performance will be limited by
> memory bandwidth. (This is the typical case, but sufficiently small
> matrices can fit in cache.) I once encountered a batch system that did
> not properly reset nodes between runs, leaving a partially-filled
> ramdisk distributed asymmetrically across the memory busses. This led
> to 3x performance reduction on 4-socket nodes because much of the memory
> demanded by the application would be faulted onto one memory bus.
> Presumably your machine has a resource manager that would not allow such
> things to happen.