[Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost

Ozik, Jonathan jozik at anl.gov
Thu Jul 31 10:41:39 CDT 2014


Thank you Mike.
Regarding the location of the .ER files: to reduce variables, the last few runs were done with 0.95-RC6.

Jonathan

On Jul 31, 2014, at 9:18 AM, Michael Wilde <wilde at anl.gov> wrote:

> I see this from PBS in your home dir:
> 
> blues$ cat 583937.bmgt1.lcrc.anl.gov.ER
> Use of uninitialized value $s in concatenation (.) or string at 
> /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220.
> Use of uninitialized value $s in concatenation (.) or string at 
> /home/ozik/.globus/coasters/cscript4312030037430783094.pl line 2220.
> blues$
> 
> That looks to me like a Swift bug in worker.pl
> 
> We'll look into this angle.
> 
> Also, I'm curious why these files are not going into your run dir (but
> perhaps that's because you're running an older trunk release, not 0.95?
> Or, that's a separate 0.95 bug).
> 
> - Mike
> 
> On 7/31/14, 9:13 AM, Michael Wilde wrote:
>> Some discussion and diagnosis of this incident has taken place off list.
>> 
>> In a quick scan of the worker logs, I don't spot an obvious error that
>> would cause workers to exit.
>> Hopefully others on the Swift team can check those as well.
>> 
>> Jonathan, do you have stdout/err files from the PBS scheduler on blues,
>> in your runNNN log dirs?
>> 
>> If so, can you point us to them?
>> 
>> Thanks,
>> 
>> - Mike
>> 
>> On 7/29/14, 8:56 PM, Jonathan Ozik wrote:
>>> Hi all,
>>> 
>>> I’m getting spurious errors in the jobs that I’m running on Blues. The stdout includes exceptions like:
>>> 	exception @ swift-int.k, line: 511
>>> Caused by: Block task failed: Connection to worker lost
>>> java.io.IOException: Broken pipe
>>> 	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>>> 	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>>> 	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>>> 	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>> 	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
>>> 	at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
>>> 	at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
>>> 
>>> These seem to occur at different parts of the submitted jobs. Let me know if there’s a log file that you’d like to look at.
>>> 
>>> In earlier attempts I was getting these warnings followed by broken pipe errors:
>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages
>>> 
>>> Apparently that’s a known precursor of crashes on Java 7 as described here (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
>>> Area: hotspot/gc
>>> Synopsis: Crashes due to failure to allocate large pages.
>>> 
>>> On Linux, failures when allocating large pages can lead to crashes. When running JDK 7u51 or later versions, the issue can be recognized in two ways:
>>> 
>>> 	• Before the crash happens one or more lines similar to this will have been printed to the log:
>>> 	  os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages
>>> 	• If a hs_err file is generated it will contain a line similar to this:
>>> 	  Large page allocation failures have occurred 3 times
>>> 
>>> The problem can be avoided by running with large page support turned off, for example by passing the "-XX:-UseLargePages" option to the java binary.
>>> 
>>> See 8007074 (not public).
>>> 
>>> So I added the -XX:-UseLargePages option in the invocations of Java code that I was responsible for. That seemed to get rid of the warning and the crashes for a while, but perhaps that was just a coincidence.
>>> 
>>> Jonathan
>>> 
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> 
> -- 
> Michael Wilde
> Mathematics and Computer Science          Computation Institute
> Argonne National Laboratory               The University of Chicago
> 



