[Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost

Jonathan Ozik jozik at uchicago.edu
Thu Jul 31 12:34:57 CDT 2014


Sure thing, it’s attached below.

Jonathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run014.log
Type: application/octet-stream
Size: 2887616 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140731/63a8f8f7/attachment.obj>
-------------- next part --------------

On Jul 31, 2014, at 12:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Hi Jonathan,
> 
> I can't see anything obvious in the worker logs, but they are pretty
> large. Can you also post the swift log from this run? It would make it
> easier to focus on the right time frame.
> 
> Mihael
> 
> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote:
>> Hi all,
>> 
>> I’m attaching the stdout and the worker logs below.
>> 
>> Thanks for looking at these!
>> 
>> Jonathan
>> 
>> 
>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik <jozik at uchicago.edu>
>> wrote:
>> 
>>> Woops, sorry about that. It’s running now and the logs are being
>>> generated. Once the run is done I’ll send you log files.
>>> 
>>> Thanks!
>>> 
>>> Jonathan
>>> 
>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>>> 
>>>> Right. This isn't your fault. We should, though, probably talk about
>>>> addressing the issue.
>>>> 
>>>> Mihael
>>>> 
>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote:
>>>>> Mihael, thanks for spotting that.  I added the comments to highlight the
>>>>> changes in email.
>>>>> 
>>>>> -Yadu
>>>>> 
>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote:
>>>>>> Hi Jonathan,
>>>>>> 
>>>>>> I suspect that the site config is considering the comment to be part of
>>>>>> the value of the workerLogLevel property. We could confirm this if you
>>>>>> send us the swift log from this particular run.
>>>>>> 
>>>>>> To fix it, you could try to remove everything after DEBUG (including all
>>>>>> horizontal white space). In other words:
>>>>>> 
>>>>>> ...
>>>>>> workerloglevel=DEBUG
>>>>>> workerlogdirectory=/home/$USER/
>>>>>> ...
>>>>>> 
>>>>>> Mihael
>>>>>> 
>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote:
>>>>>>> Hi Yadu,
>>>>>>> 
>>>>>>> 
>>>>>>> I’m getting errors indicating that DEBUG is an invalid worker logging
>>>>>>> level. I’m attaching the stdout below. Let me know if I’m doing
>>>>>>> something silly.
>>>>>>> 
>>>>>>> 
>>>>>>> Jonathan
>>>>>>> 
>>>>>>> 
>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji <yadunand at uchicago.edu>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Jonathan,
>>>>>>>> 
>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not see
>>>>>>>> anything unusual.
>>>>>>>> 
>>>>>>>> From your logs, it looks like workers are failing, so getting worker
>>>>>>>> logs would help.
>>>>>>>> Could you try running on Blues with the following swift.properties
>>>>>>>> and get us the worker*logs that would show up in the
>>>>>>>> workerlogdirectory?
>>>>>>>> 
>>>>>>>> site=blues
>>>>>>>> 
>>>>>>>> site.blues {
>>>>>>>>   jobManager=pbs
>>>>>>>>   jobQueue=shared
>>>>>>>>   maxJobs=4
>>>>>>>>   jobGranularity=1
>>>>>>>>   maxNodesPerJob=1
>>>>>>>>   tasksPerWorker=16
>>>>>>>>   taskThrottle=64
>>>>>>>>   initialScore=10000
>>>>>>>>   jobWalltime=00:48:00
>>>>>>>>   taskWalltime=00:40:00
>>>>>>>>   workerloglevel=DEBUG                # Adding debug for workers
>>>>>>>>   workerlogdirectory=/home/$USER/     # Logging directory on SFS
>>>>>>>>   workdir=$RUNDIRECTORY
>>>>>>>>   filesystem=local
>>>>>>>> }
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Yadu
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote:
>>>>>>>> 
>>>>>>>>> Hi Mike,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Sorry, I figured there was some busy-ness involved!
>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I didn’t
>>>>>>>>> get the same issue. That is, the model run completed successfully.
>>>>>>>>> For the Blues run, I used a trunk distribution (as of May 29,
>>>>>>>>> 2014). I’m including one of the log files below. I’m also
>>>>>>>>> including the swift.properties file that was used for the blues
>>>>>>>>> runs below.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thank you!
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Jonathan
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde <wilde at anl.gov> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Jonathan,
>>>>>>>>>> 
>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping!
>>>>>>>>>> 
>>>>>>>>>> I or one of the team will answer soon, on swift-user.
>>>>>>>>>> 
>>>>>>>>>> (But the first question is: which Swift release, and can you
>>>>>>>>>> point us to, or send, the full log file?)
>>>>>>>>>> 
>>>>>>>>>> Thanks and regards,
>>>>>>>>>> 
>>>>>>>>>> - Mike
>>>>>>>>>> 
>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Mike,
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I didn’t get a response yet so just wanted to make sure that
>>>>>>>>>>> the message came across.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Jonathan
>>>>>>>>>>> 
>>>>>>>>>>> Begin forwarded message:
>>>>>>>>>>> 
>>>>>>>>>>>> From: Jonathan Ozik <jozik at uchicago.edu>
>>>>>>>>>>>> 
>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511,
>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
>>>>>>>>>>>> 
>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT
>>>>>>>>>>>> 
>>>>>>>>>>>> To: Mihael Hategan <hategan at mcs.anl.gov>,
>>>>>>>>>>>> "swift-user at ci.uchicago.edu" <swift-user at ci.uchicago.edu>
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>> 
>>>>>>>>>>>> I’m getting spurious errors in the jobs that I’m running on
>>>>>>>>>>>> Blues. The stdout includes exceptions like:
>>>>>>>>>>>> exception @ swift-int.k, line: 511
>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
>>>>>>>>>>>> java.io.IOException: Broken pipe
>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>>>>>>>>>>>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>>>>>>>>>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>>>>>>>>>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
>>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
>>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
>>>>>>>>>>>> 
>>>>>>>>>>>> These seem to occur at different parts of the submitted
>>>>>>>>>>>> jobs. Let me know if there’s a log file that you’d like to
>>>>>>>>>>>> look at.
>>>>>>>>>>>> 
>>>>>>>>>>>> In earlier attempts I was getting these warnings followed by
>>>>>>>>>>>> broken pipe errors:
>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0)
>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
>>>>>>>>>>>> allocate large pages, falling back to regular pages
>>>>>>>>>>>> 
>>>>>>>>>>>> Apparently that’s a known precursor of crashes on Java 7 as
>>>>>>>>>>>> described here
>>>>>>>>>>>> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
>>>>>>>>>>>> Area: hotspot/gc
>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Linux, failures when allocating large pages can lead to
>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue
>>>>>>>>>>>> can be recognized in two ways:
>>>>>>>>>>>> 
>>>>>>>>>>>> • Before the crash happens one or more lines similar to this
>>>>>>>>>>>> will have been printed to the log:
>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0)
>>>>>>>>>>>> failed;
>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot allocate
>>>>>>>>>>>> large pages, falling back to regular pages
>>>>>>>>>>>> • If a hs_err file is generated it will contain a line
>>>>>>>>>>>> similar to this:
>>>>>>>>>>>> Large page allocation failures have occurred 3 times
>>>>>>>>>>>> The problem can be avoided by running with large page
>>>>>>>>>>>> support turned off, for example by passing the
>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary.
>>>>>>>>>>>> 
>>>>>>>>>>>> See 8007074 (not public).
>>>>>>>>>>>> 
>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the invocations
>>>>>>>>>>>> of Java code that I was responsible for. That seemed to get
>>>>>>>>>>>> rid of the warning and the crashes for a while, but perhaps
>>>>>>>>>>>> that was just a coincidence.
>>>>>>>>>>>> 
>>>>>>>>>>>> Jonathan
>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Swift-user mailing list
>>>>>>>>>>>> Swift-user at ci.uchicago.edu
>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> -- 
>>>>>>>>>> Michael Wilde
>>>>>>>>>> Mathematics and Computer Science          Computation Institute
>>>>>>>>>> Argonne National Laboratory               The University of Chicago
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
> 
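
Note on the workerloglevel error discussed in the thread: the suspicion was that the trailing "# ..." comment ended up as part of the property value. The sketch below only illustrates that failure mode and is not Swift's actual site-config parser. The function names and the set of accepted levels are assumptions made for the example.

```python
# Hypothetical, simplified key=value parsing; NOT Swift's real config code.
VALID_LEVELS = {"NONE", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"}  # assumed set


def parse_naive(line):
    """Split on the first '=' and keep everything after it as the value."""
    key, _, value = line.partition("=")
    return key.strip(), value.strip()


def parse_stripping_comments(line):
    """Drop anything from the first '#' onward, then parse as above."""
    return parse_naive(line.split("#", 1)[0])


line = "workerloglevel=DEBUG      # Adding debug for workers"

_, naive_value = parse_naive(line)
_, clean_value = parse_stripping_comments(line)

# The naive value still carries the comment, so it fails the level check.
print(repr(naive_value), naive_value in VALID_LEVELS)
# With the comment removed, only the level name remains.
print(repr(clean_value), clean_value in VALID_LEVELS)
```

If the parser behaves like parse_naive, removing everything after DEBUG in swift.properties, as suggested in the thread, has the same effect as the second function.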

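Note on the large-page workaround quoted from the JDK 7u51 release notes: the option has to be passed to each java invocation, before -jar. A minimal sketch of building such a command line follows. The jar name, the heap flag, and the helper function are hypothetical; only "-XX:-UseLargePages" comes from the thread.

```python
import shlex


def java_command(jar_path, extra_jvm_flags=()):
    """Build an argv list for java with large-page support turned off.

    JVM options must come before -jar; anything after the jar path is
    passed to the application rather than to the JVM.
    """
    return ["java", "-XX:-UseLargePages", *extra_jvm_flags, "-jar", jar_path]


# Hypothetical jar and heap size, for illustration only.
cmd = java_command("model.jar", ["-Xmx2g"])
print(shlex.join(cmd))
```

This only affects JVMs launched through such a command. Other Java processes started elsewhere in the workflow would need the flag set separately.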

