[Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost
Michael Wilde
wilde at anl.gov
Thu Jul 31 09:13:02 CDT 2014
Some discussion and diagnosis of this incident have taken place off-list.
In a quick scan of the worker logs, I don't spot an obvious error that
would cause workers to exit.
Hopefully others on the Swift team can check those as well.
Jonathan, do you have stdout/err files from the PBS scheduler on blues,
in your runNNN log dirs?
If so, can you point us to them?
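While collecting those, it may help to check each log for the failure signatures quoted in the original message below. Here is a minimal sketch in Python; it is not part of Swift or the coaster code. The names `scan_text`, `scan_dir`, and the pattern labels are made up for illustration, and the patterns are copied from the messages quoted in this thread, so they would need adjusting if your logs word things differently.

```python
import re
import sys
from pathlib import Path

# Signatures taken from the messages quoted in this thread (hypothetical labels).
PATTERNS = {
    "large_page_commit_failure": re.compile(r"os::commit_memory\([^)]*\) failed"),
    "hs_err_large_page_count": re.compile(
        r"Large page allocation failures have occurred \d+ times"
    ),
    "worker_connection_lost": re.compile(r"Connection to worker lost"),
    "broken_pipe": re.compile(r"java\.io\.IOException: Broken pipe"),
}


def scan_text(text):
    """Return {label: number of matching lines} for the text of one log."""
    counts = {name: 0 for name in PATTERNS}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_dir(log_dir):
    """Scan every regular file under log_dir; return {path: counts} for files with hits."""
    hits = {}
    for path in sorted(Path(log_dir).rglob("*")):
        if path.is_file():
            counts = scan_text(path.read_text(errors="replace"))
            if any(counts.values()):
                hits[str(path)] = counts
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # The argument is whichever run log directory you want to inspect.
    for path, counts in scan_dir(sys.argv[1]).items():
        print(path, counts)
```

Run it with a log directory as the only argument. Files that show the large-page warning alongside broken-pipe errors would be the first ones worth sending.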
Thanks,
- Mike
On 7/29/14, 8:56 PM, Jonathan Ozik wrote:
> Hi all,
>
> I’m getting spurious errors in the jobs that I’m running on Blues. The stdout includes exceptions like:
> exception @ swift-int.k, line: 511
> Caused by: Block task failed: Connection to worker lost
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
>
> These seem to occur at different parts of the submitted jobs. Let me know if there’s a log file that you’d like to look at.
>
> In earlier attempts I was getting these warnings followed by broken pipe errors:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages
>
> Apparently that’s a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
> Area: hotspot/gc
> Synopsis: Crashes due to failure to allocate large pages.
>
> On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways:
>
> • Before the crash happens one or more lines similar to this will have been printed to the log:
> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed;
> error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages
> • If a hs_err file is generated it will contain a line similar to this:
> Large page allocation failures have occurred 3 times
> The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary.
>
> See 8007074 (not public).
>
> So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence.
>
> Jonathan
--
Michael Wilde
Mathematics and Computer Science, Argonne National Laboratory
Computation Institute, The University of Chicago