[Swift-devel] Progress on Swift RAM usage problem?

David Kelly davidkelly at uchicago.edu
Mon Feb 3 03:29:57 CST 2014


Hello,

I spent the weekend working on the popdiagts script. I looked around on
Geyser's filesystem and found some input files that I could use. Once I
had the data and got the 39 arguments correct, I was able to reproduce
the problem.

I see a result that looks very similar to the initial report:

Progress:  time: Mon, 03 Feb 2014 01:20:00 -0700  Active:1  Finished successfully:3

/glade/u/home/davkelly/swift-0.94/cog/modules/swift/dist/swift-svn/bin/swift: line 177: 31567 Killed                  java -Xmx8096M -XX:+HeapDumpOnOutOfMemoryError -Djava.endorsed.dirs=/glade/u/home/davkelly/swift-0.94/cog/modules/s...

To start, I ran Swift with the default heap size of 1G, and within a few
minutes Swift was killed. A heap plot of a failing run:

http://web.ci.uchicago.edu/~davidk/popdiagts-20140201-1458-i9hmaf0e.png

I tried bumping up the max heap size, but I ran into the same problem
within a few minutes. The amount of memory used never seems to get very
high. Here is a plot with 8G:

http://web.ci.uchicago.edu/~davidk/popdiagts-20140203-0059-g6a11m24.png

I used jmap to generate several heap dumps during the run. They are about
100MB compressed, 400MB uncompressed, located at:

http://web.ci.uchicago.edu/~davidk/heap1.gz
http://web.ci.uchicago.edu/~davidk/heap2.gz
http://web.ci.uchicago.edu/~davidk/heap3.gz
http://web.ci.uchicago.edu/~davidk/heap4.gz
http://web.ci.uchicago.edu/~davidk/heap5.gz
http://web.ci.uchicago.edu/~davidk/heap6.gz
http://web.ci.uchicago.edu/~davidk/heap7.gz
http://web.ci.uchicago.edu/~davidk/heap8.gz
http://web.ci.uchicago.edu/~davidk/heap9.gz
http://web.ci.uchicago.edu/~davidk/heap10.gz
http://web.ci.uchicago.edu/~davidk/heap11.gz

I used Eclipse Memory Analyzer to look at the heaps. You can view an html
histogram of the objects at:

http://web.ci.uchicago.edu/~davidk/heap-histogram/index.html

It's possible that there was a sudden spike in memory at the end that the
logs missed, but I don't think that's what's going on here.

As I was running the script, I opened top and saw the Swift CPU usage on
the Geyser head node get extremely high, up to 700%. I think it's getting
killed due to a kernel CPU throttle.

I went through the script line by line to narrow down where the problem
was, then whittled away at it until I had a small, readable, and
data-independent test script that shows the problem.

Here it is:
----
type file;
app (file out) createFile() {
   createFile @filename(out);
}

app (file out) createFileGivenArray (file fileArray[]) {
   createFile @filename(out);
}

file myArray[];
file myFile;

foreach f,i in [1:2] {
   myArray[i] = createFile();
}

myFile = createFileGivenArray(myArray);
-----

On Midway you'll see the CPU usage on the head node jump to about 200%
while the first app runs. If you repeat that pattern many times (like the
original script does) you'll see CPU usage go even higher.
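
For reference, a minimal sketch of what repeating the pattern might look
like. This is an untested illustration, not the original popdiagts code:
the outer loop, the count of 20, and the results array are assumptions
added here, and only the two app declarations come from the test script
above.

----
type file;

app (file out) createFile() {
   createFile @filename(out);
}

app (file out) createFileGivenArray (file fileArray[]) {
   createFile @filename(out);
}

// Hypothetical: one output file per repetition of the pattern
file results[];

// Hypothetical repetition count; 20 is arbitrary
foreach j in [1:20] {
   // A fresh array per iteration, filled by independent app calls
   file myArray[];
   foreach f,i in [1:2] {
      myArray[i] = createFile();
   }
   // Pass the whole array to a single app, as in the test script
   results[j] = createFileGivenArray(myArray);
}
----

Watching top on the head node while a script shaped like this runs would
show whether CPU usage grows with the number of repetitions.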

I've filed this as Bug 1195
(https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1195). The package to
reproduce this is at http://web.ci.uchicago.edu/~davidk/popdiag.tar.gz



On Tue, Jan 28, 2014 at 3:29 PM, Wilde, Michael J. <wilde at mcs.anl.gov> wrote:

> From: David Kelly [davidkelly at uchicago.edu]
> Sent: Tuesday, January 28, 2014 2:47 PM
> ...
>   I don't have too many updates on Sheri's problem. I was able to run the
> older standalone example I had on Geyser and did not see any issues with
> excessive amounts of resident memory being used.
>   ...
>
>
>  I think the failure was exceeding the Java heap size, not an RSS
> problem, right?
>
>     I think we might be better off shifting the way we approach this
> problem. It's difficult to run these apps, and to run them in the same way
> the users do. There's also a long delay getting responses. I think we'd be
> better off focusing on adding comprehensive memory tests to the test suite,
> measuring, plotting, and then documenting solutions/strategies into the
> user guide. It will take some time, but I think it's the best approach
> since everything would be under our own control, and it would provide
> solutions for all users.
>
>     That sounds good, while we are waiting for debugging info from users.
> But we should still strive to reproduce problems that users are
> encountering, and on giving them code updates with additional debugging
> hooks or possible remedies to test.
>
>  - Mike
>
>
> On Tue, Jan 28, 2014 at 12:36 PM, Wilde, Michael J. <wilde at mcs.anl.gov> wrote:
>
>>  Yadu, David, can you send updates on this to Swift devel, and let's talk
>> this afternoon at 3PM to discuss?
>>
>>  Thanks,
>>
>>    - Mike
>>
>>
>