[Swift-devel] Progress on Swift RAM usage problem?
Hereld, Mark
hereld at anl.gov
Fri Feb 7 10:44:32 CST 2014
nice plots!
m
On Feb 7, 2014, at 10:38 AM, David Kelly <davidkelly at uchicago.edu> wrote:
For those interested in this problem, here is the latest heap plot of Jason's long (and still running) Beagle job.
On Mon, Feb 3, 2014 at 3:29 AM, David Kelly <davidkelly at uchicago.edu> wrote:
Hello,
I've spent the weekend working on the popdiagts script. I looked around on Geyser's filesystem and was able to find some input files that I can use. Once I found the data and got the 39 arguments correct, I was able to reproduce the problem.
I see a result that looks very similar to the initial report:
Progress: time: Mon, 03 Feb 2014 01:20:00 -0700 Active:1 Finished successfully:3
/glade/u/home/davkelly/swift-0.94/cog/modules/swift/dist/swift-svn/bin/swift: line 177: 31567 Killed java -Xmx8096M -XX:+HeapDumpOnOutOfMemoryError -Djava.endorsed.dirs=/glade/u/home/davkelly/swift-0.94/cog/modules/s...
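The "Killed" text in the log line above is what the shell prints when a child process is terminated by SIGKILL, that is, the signal came from outside the JVM. A JVM that runs out of heap instead throws java.lang.OutOfMemoryError, and with -XX:+HeapDumpOnOutOfMemoryError it also writes a heap dump. A minimal Python sketch of how a wrapper could tell a signal death from an ordinary exit follows. The helper names describe_exit and run_and_kill are hypothetical, and the sleeping child is only a stand-in for the Swift JVM.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Classify a finished child process by its exit status (POSIX semantics)."""
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        return "terminated by signal " + signal.Signals(-returncode).name
    if returncode == 0:
        return "exited normally"
    return "exited with status %d" % returncode


def run_and_kill():
    """Start a stand-in child, send it SIGKILL, and return its exit status."""
    # Hypothetical stand-in for the JVM: a child that would sleep for 30 seconds.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    child.kill()  # sends SIGKILL on POSIX systems
    return child.wait()


if __name__ == "__main__":
    print(describe_exit(run_and_kill()))
```

A negative status here only says that some signal ended the process; it does not say who sent it or why.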
To start, I ran Swift with the default 1G heap size, and within a few minutes Swift was killed. A heap plot of a failing run:
http://web.ci.uchicago.edu/~davidk/popdiagts-20140201-1458-i9hmaf0e.png
I tried bumping up the max heap size, but I ran into the same problem within a few minutes. The amount of memory used never seems to get very high. Here is a plot with 8G:
http://web.ci.uchicago.edu/~davidk/popdiagts-20140203-0059-g6a11m24.png
I used jmap to generate several heap dumps during the run. They are about 100MB compressed, 400MB uncompressed, located at:
http://web.ci.uchicago.edu/~davidk/heap1.gz
http://web.ci.uchicago.edu/~davidk/heap2.gz
http://web.ci.uchicago.edu/~davidk/heap3.gz
http://web.ci.uchicago.edu/~davidk/heap4.gz
http://web.ci.uchicago.edu/~davidk/heap5.gz
http://web.ci.uchicago.edu/~davidk/heap6.gz
http://web.ci.uchicago.edu/~davidk/heap7.gz
http://web.ci.uchicago.edu/~davidk/heap8.gz
http://web.ci.uchicago.edu/~davidk/heap9.gz
http://web.ci.uchicago.edu/~davidk/heap10.gz
http://web.ci.uchicago.edu/~davidk/heap11.gz
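For anyone who wants to collect a similar series of dumps, here is a minimal Python sketch that calls jmap at a fixed interval. The exact jmap options used for the dumps above are not stated in this message; the sketch assumes the standard -dump:format=b,file=... form, and the function names, file prefix, and interval are hypothetical.

```python
import subprocess
import time


def jmap_dump_command(pid, out_path):
    """Build the jmap command line that writes a binary heap dump of a JVM."""
    return ["jmap", "-dump:format=b,file=%s" % out_path, str(pid)]


def dump_periodically(pid, count, interval_s, prefix="heap"):
    """Take `count` heap dumps of JVM `pid`, `interval_s` seconds apart."""
    paths = []
    for n in range(1, count + 1):
        path = "%s%d.hprof" % (prefix, n)
        # check=True raises if jmap fails, e.g. because the JVM already exited.
        subprocess.run(jmap_dump_command(pid, path), check=True)
        paths.append(path)
        if n < count:
            time.sleep(interval_s)
    return paths
```

Calling dump_periodically(pid, 11, 60) with the JVM's process ID would write heap1.hprof through heap11.hprof, one per minute, which can then be opened in a heap analyzer.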
I used Eclipse Memory Analyzer to look at the heaps. You can view an html histogram of the objects at:
http://web.ci.uchicago.edu/~davidk/heap-histogram/index.html
It's possible that there was a sudden spike in memory at the end that the logs missed, but I don't think that's what's going on here.
As I was running the script, I opened top and saw the Swift CPU usage on the Geyser head node get extremely high, up to 700%. I think it's getting killed due to a kernel CPU throttle.
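A figure above 100% in top means the process is keeping more than one core busy, so 700% corresponds to roughly seven cores. To log that number alongside the heap plots instead of watching top, a minimal Python sketch, assuming a Linux /proc filesystem, could sample the process's accumulated CPU ticks twice and convert the difference to a percentage. All function names here are hypothetical.

```python
import os


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so split after the last ')' and count from field 3 onward.
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime


def cpu_percent(ticks_before, ticks_after, elapsed_s, ticks_per_s):
    """Convert a tick delta over `elapsed_s` seconds to a top-style percentage."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * elapsed_s)


def read_cpu_ticks(pid):
    """Read the current CPU tick total for `pid` (all threads combined)."""
    with open("/proc/%d/stat" % pid) as f:
        return cpu_ticks_from_stat(f.read())


def ticks_per_second():
    """Clock ticks per second on this system, as used by /proc."""
    return os.sysconf("SC_CLK_TCK")
```

Sampling read_cpu_ticks(pid) at the start and end of each interval and passing both values to cpu_percent gives one data point per interval.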
I went through the script line by line until I could narrow down where the problem was. I whittled away at it until I had a small, readable, and data-independent test script that shows the problem.
Here it is:
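----
type file;

// Creates one output file; takes no inputs.
app (file out) createFile() {
    createFile @filename(out);
}

// Accepts a file array as input; the command itself ignores it.
app (file out) createFileGivenArray (file fileArray[]) {
    createFile @filename(out);
}

file myArray[];
file myFile;

foreach f,i in [1:2] {
    myArray[i] = createFile();
}

// Pass the whole array to a second app.
myFile = createFileGivenArray(myArray);
-----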
On Midway you'll see the CPU usage on the head node jump to about 200% while the first app runs. If you repeat that pattern many times (like the original script does) you'll see CPU usage go even higher.
I've filed this as Bug 1195 ( https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1195 ). The package to reproduce this is at http://web.ci.uchicago.edu/~davidk/popdiag.tar.gz.
On Tue, Jan 28, 2014 at 3:29 PM, Wilde, Michael J. <wilde at mcs.anl.gov> wrote:
From: David Kelly [davidkelly at uchicago.edu]
Sent: Tuesday, January 28, 2014 2:47 PM
...
I don't have too many updates on Sheri's problem. I was able to run the older standalone example I had on Geyser and did not see any issues with excessive amounts of resident memory being used.
...
I think the failure was exceeding the Java heap size, not an RSS problem, right?
I think we might be better off shifting the way we approach this problem. It's difficult to run these apps, and to run them in the same way the users do. There's also a long delay getting responses. I think we'd be better off focusing on adding comprehensive memory tests to the test suite, measuring, plotting, and then documenting solutions/strategies into the user guide. It will take some time, but I think it's the best approach since everything would be under our own control, and it would provide solutions for all users.
That sounds good while we are waiting for debugging info from users. But we should still strive to reproduce the problems that users are encountering, and to give them code updates with additional debugging hooks or possible remedies to test.
- Mike
On Tue, Jan 28, 2014 at 12:36 PM, Wilde, Michael J. <wilde at mcs.anl.gov> wrote:
Yadu, David, can you send updates on this to Swift devel, and let's talk this afternoon at 3PM to discuss?
Thanks,
- Mike
_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu<mailto:Swift-devel at ci.uchicago.edu>
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
-------------------------------------------------------
Mark Hereld <hereld at anl.gov>
Senior Fellow - Computation Institute
Experimental Systems Engineer - Mathematics and Computer Science
Visualization and Analysis Lead - Argonne Leadership Computing Facility
Argonne National Laboratory
The University of Chicago
Cell: 630.327.2088
Voice: 630.252.4170