<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, May 9, 2014 at 6:47 AM, Samar Khatiwala <span dir="ltr">&lt;<a href="mailto:spk@ldeo.columbia.edu" target="_blank">spk@ldeo.columbia.edu</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Jed et al.,<br>

<br>

Just wanted to report back on the resolution of this issue. The computing support people at HLRN in Germany<br>

submitted a test case to CRAY re. performance on their XC30. CRAY has finally gotten back with a solution,<br>

which is to use the run-time option  -vecscatter_alltoall. Apparently this is a known issue and according to the<br>

HLRN folks passing this command line option to PETSc seems to work nicely.<br></blockquote><div><br></div><div>What this does is replace point-to-point communication (MPI_Send/Recv) with collective communication (MPI_Alltoall).</div>
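<div><br></div><div>To illustrate the difference, here is a toy model in plain Python. It uses no MPI and is not PETSc's implementation; the function names and the ring-shaped communication pattern are made up for the example. It only shows that many individual sends and a single all-to-all exchange deliver the same data, while the collective replaces the separate messages with one call that every rank takes part in.</div>

```python
# Toy model of two ways to move the same data between "ranks".
# send[src][dst] holds what rank src wants to deliver to rank dst (None = nothing).
# All names here are hypothetical; this is a sketch, not MPI or PETSc code.

def ring_pattern(nranks):
    """Each rank sends one item to its left and right neighbour (periodic)."""
    send = [[None] * nranks for _ in range(nranks)]
    for src in range(nranks):
        for dst in ((src - 1) % nranks, (src + 1) % nranks):
            send[src][dst] = f"{src}->{dst}"
    return send

def exchange_point_to_point(send):
    """Model of Send/Recv: one message per non-empty (src, dst) pair."""
    n = len(send)
    recv = [[None] * n for _ in range(n)]
    messages = 0
    for src in range(n):
        for dst in range(n):
            if send[src][dst] is not None:
                recv[dst][src] = send[src][dst]  # dst receives from src
                messages += 1
    return recv, messages

def exchange_alltoall(send):
    """Model of an all-to-all: one collective call, effectively a transpose.
    Every rank participates, including pairs that have nothing to exchange."""
    n = len(send)
    recv = [[send[src][dst] for src in range(n)] for dst in range(n)]
    return recv, 1

if __name__ == "__main__":
    send = ring_pattern(4)
    p2p, n_messages = exchange_point_to_point(send)
    coll, n_calls = exchange_alltoall(send)
    print("same data delivered:", p2p == coll)
    print("point-to-point messages:", n_messages)
    print("collective calls:", n_calls)
```

<div>Whether the collective is actually faster depends on the machine and the MPI library; in the model it only changes how the exchange is expressed.</div>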

<div><br></div><div>  Thanks,</div><div><br></div><div>     Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Thanks again for your help.<br>

<span class="HOEnZb"><font color="#888888"><br>

Samar<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Apr 11, 2014, at 7:44 AM, Jed Brown &lt;<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>&gt; wrote:<br>

<br>

> Samar Khatiwala &lt;<a href="mailto:spk@ldeo.columbia.edu">spk@ldeo.columbia.edu</a>&gt; writes:<br>

><br>

>> Hello,<br>

>><br>

>> This is a somewhat vague query but I and a colleague have been running PETSc (3.4.3.0) on a Cray<br>

>> XC30 in Germany (<a href="https://www.hlrn.de/home/view/System3/WebHome" target="_blank">https://www.hlrn.de/home/view/System3/WebHome</a>) and the system administrators<br>

>> alerted us to some anomalies with our jobs that may or may not be related to PETSc but I thought I'd ask<br>

>> here in case others have noticed something similar.<br>

>><br>

>> First, there was a large variation in run-time for identical jobs, sometimes as much as 50%. We didn't<br>

>> really pick up on this but other users complained to the IT people that their jobs were taking a performance<br>

>> hit with a similar variation in run-time. At that point we're told the IT folks started monitoring jobs and<br>

>> carrying out tests to see what was going on. They discovered that (1) this always happened when we were<br>

>> running our jobs and (2) the problem got worse with physical proximity to the nodes on which our jobs were<br>

>> running (what they described as a "strong interaction" between our jobs and others presumably through the<br>

>> communication network).<br>

><br>

> It sounds like you are strong scaling (smallish subdomains) so that your<br>

> application is sensitive to network latency.  I see significant<br>

> performance variability on XC-30 with this Full Multigrid solver that is<br>

> not using PETSc.<br>

><br>

> <a href="http://59a2.org/files/hopper-vs-edison.3semilogx.png" target="_blank">http://59a2.org/files/hopper-vs-edison.3semilogx.png</a><br>

><br>

> See the factor of 2 performance variability for the samples of the ~15M<br>

> element case.  This operation is limited by instruction issue rather<br>

> than bandwidth (indeed, it is several times faster than doing the same<br>

> operations with assembled matrices).  Here the variability is within the<br>

> same application performing repeated solves.  If you get a different<br>

> partition on a different run, you can see larger variation.<br>

><br>

> If your matrices are large enough, your performance will be limited by<br>

> memory bandwidth.  (This is the typical case, but sufficiently small<br>

> matrices can fit in cache.)  I once encountered a batch system that did<br>

> not properly reset nodes between runs, leaving a partially-filled<br>

> ramdisk distributed asymmetrically across the memory busses.  This led<br>

> to 3x performance reduction on 4-socket nodes because much of the memory<br>

> demanded by the application would be faulted onto one memory bus.<br>

> Presumably your machine has a resource manager that would not allow such<br>

> things to happen.<br>

<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener

</div></div>