[Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost
Michael Wilde
wilde at anl.gov
Thu Jul 31 09:18:08 CDT 2014
I see this from PBS in your home dir:
blues$ cat 583937.bmgt1.lcrc.anl.gov.ER
Use of uninitialized value $s in concatenation (.) or string at
/home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220.
Use of uninitialized value $s in concatenation (.) or string at
/home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220.
blues$
That looks to me like a Swift bug in worker.pl.
We'll look into this angle.
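
To check whether other PBS stderr files show the same warning, something like the following could summarize them. This is only a sketch and not part of Swift: the function name summarize_perl_warnings and the regular expression are assumptions made for illustration, based on the standard format of Perl's "Use of uninitialized value" warning. The sample text is the output quoted above, including the mail line wrap.

```python
import re
from collections import Counter

# Matches Perl's warning "Use of uninitialized value $var in <op> at <file> line <N>."
# \s+ is used between tokens so that a wrapped line (as in the mail above) still matches.
WARNING_RE = re.compile(
    r"Use of uninitialized value (\$\w+)?\s*in\s+.*?\s+at\s+(\S+)\s+line\s+(\d+)\.",
    re.DOTALL,
)

def summarize_perl_warnings(text):
    """Count warnings per (variable, script path, line number). Hypothetical helper."""
    counts = Counter()
    for m in WARNING_RE.finditer(text):
        counts[(m.group(1), m.group(2), int(m.group(3)))] += 1
    return counts

# Sample input: the stderr content quoted in this message.
SAMPLE_ER = """\
Use of uninitialized value $s in concatenation (.) or string at
/home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220.
Use of uninitialized value $s in concatenation (.) or string at
/home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220.
"""

for (var, script, line), n in summarize_perl_warnings(SAMPLE_ER).items():
    print(f"{n}x {var} at {script}:{line}")
```

To run it over real files, read each *.ER file in the run directory and pass its contents to summarize_perl_warnings.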
Also, I'm curious why these files are not going into your run dir.
Perhaps that's because you're running an older trunk release rather
than 0.95, or it may be a separate 0.95 bug.
- Mike
On 7/31/14, 9:13 AM, Michael Wilde wrote:
> Some discussion and diagnosis of this incident has taken place off list.
>
> In a quick scan of the worker logs, I don't spot an obvious error that
> would cause workers to exit.
> Hopefully others on the Swift team can check those as well.
>
> Jonathan, do you have stdout/err files from the PBS scheduler on blues,
> in your runNNN log dirs?
>
> If so, can you point us to them?
>
> Thanks,
>
> - Mike
>
> On 7/29/14, 8:56 PM, Jonathan Ozik wrote:
>> Hi all,
>>
>> I’m getting spurious errors in the jobs that I’m running on Blues. The stdout includes exceptions like:
>> exception @ swift-int.k, line: 511
>> Caused by: Block task failed: Connection to worker lost
>> java.io.IOException: Broken pipe
>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
>> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
>> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
>>
>> These seem to occur at different parts of the submitted jobs. Let me know if there’s a log file that you’d like to look at.
>>
>> In earlier attempts I was getting these warnings followed by broken pipe errors:
>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages
>>
>> Apparently that’s a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
>> Area: hotspot/gc
>> Synopsis: Crashes due to failure to allocate large pages.
>>
>> On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways:
>>
>> • Before the crash happens one or more lines similar to this will have been printed to the log:
>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed;
>> error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages
>> • If a hs_err file is generated it will contain a line similar to this:
>> Large page allocation failures have occurred 3 times
>> The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary.
>>
>> See 8007074 (not public).
>>
>> So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence.
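
To tell whether the large-page warning is still appearing after adding the option, the two log signatures quoted from the release notes can be searched for directly. The following is a minimal sketch under stated assumptions: the function name find_large_page_failures and both regular expressions are illustrative, derived only from the example lines shown in this thread, and are not taken from any JDK or Swift tool.

```python
import re

# Signature 1: the os::commit_memory warning printed before a crash.
COMMIT_FAIL_RE = re.compile(
    r"os::commit_memory\((0x[0-9a-fA-F]+), (\d+), (\d+), (\d+)\) failed;\s*"
    r"error='([^']*)' \(errno=(\d+)\)"
)
# Signature 2: the summary line that may appear in an hs_err file.
HS_ERR_RE = re.compile(r"Large page allocation failures have occurred (\d+) times")

def find_large_page_failures(text):
    """Return (list of commit failures, hs_err failure count or None). Hypothetical helper."""
    failures = [
        {"address": m.group(1), "bytes": int(m.group(2)), "errno": int(m.group(6))}
        for m in COMMIT_FAIL_RE.finditer(text)
    ]
    hs = HS_ERR_RE.search(text)
    return failures, (int(hs.group(1)) if hs else None)

# Sample input: the warning line quoted earlier in this message.
SAMPLE_LOG = (
    "Java HotSpot(TM) 64-Bit Server VM warning: INFO: "
    "os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; "
    "error='Cannot allocate memory' (errno=12); "
    "Cannot allocate large pages, falling back to regular pages\n"
)

failures, hs_count = find_large_page_failures(SAMPLE_LOG)
for f in failures:
    print(f"commit of {f['bytes']} bytes at {f['address']} failed (errno={f['errno']})")
```

If a run's stdout no longer matches either pattern but the broken pipe errors continue, that would suggest the two problems are unrelated.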
>>
>> Jonathan
>>
--
Michael Wilde
Mathematics and Computer Science, Argonne National Laboratory
Computation Institute, The University of Chicago