[Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost
Jonathan Ozik
jozik at uchicago.edu
Thu Jul 31 12:34:57 CDT 2014
Sure thing, it’s attached below.
Jonathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run014.log
Type: application/octet-stream
Size: 2887616 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140731/63a8f8f7/attachment.obj>
-------------- next part --------------
On Jul 31, 2014, at 12:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> Hi Jonathan,
>
> I can't see anything obvious in the worker logs, but they are pretty
> large. Can you also post the swift log from this run? It would make it
> easier to focus on the right time frame.
>
> Mihael
>
> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote:
>> Hi all,
>>
>> I’m attaching the stdout and the worker logs below.
>>
>> Thanks for looking at these!
>>
>> Jonathan
>>
>>
>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik <jozik at uchicago.edu>
>> wrote:
>>
>>> Woops, sorry about that. It’s running now and the logs are being
>>> generated. Once the run is done I’ll send you log files.
>>>
>>> Thanks!
>>>
>>> Jonathan
>>>
>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan <hategan at mcs.anl.gov>
>>> wrote:
>>>
>>>> Right. This isn't your fault. We should, though, probably talk about
>>>> addressing the issue.
>>>>
>>>> Mihael
>>>>
>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote:
>>>>> Mihael, thanks for spotting that. I added the comments to highlight the
>>>>> changes in email.
>>>>>
>>>>> -Yadu
>>>>>
>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote:
>>>>>> Hi Jonathan,
>>>>>>
>>>>>> I suspect that the site config is considering the comment to be part of
>>>>>> the value of the workerLogLevel property. We could confirm this if you
>>>>>> send us the swift log from this particular run.
>>>>>>
>>>>>> To fix it, you could try to remove everything after DEBUG (including all
>>>>>> horizontal white space). In other words:
>>>>>>
>>>>>> ...
>>>>>> workerloglevel=DEBUG
>>>>>> workerlogdirectory=/home/$USER/
>>>>>> ...
>>>>>>
>>>>>> Mihael
>>>>>>
>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote:
>>>>>>> Hi Yadu,
>>>>>>>
>>>>>>>
>>>>>>> I’m getting errors indicating that DEBUG is an invalid worker logging
>>>>>>> level. I’m attaching the stdout below. Let me know if I’m doing
>>>>>>> something silly.
>>>>>>>
>>>>>>>
>>>>>>> Jonathan
>>>>>>>
>>>>>>>
>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji <yadunand at uchicago.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Jonathan,
>>>>>>>>
>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not see
>>>>>>>> anything unusual.
>>>>>>>>
>>>>>>>> From your logs, it looks like workers are failing, so getting worker
>>>>>>>> logs would help.
>>>>>>>> Could you try running on Blues with the following swift.properties
>>>>>>>> and get us the worker*logs that would show up in the
>>>>>>>> workerlogdirectory?
>>>>>>>>
>>>>>>>> site=blues
>>>>>>>>
>>>>>>>> site.blues {
>>>>>>>> jobManager=pbs
>>>>>>>> jobQueue=shared
>>>>>>>> maxJobs=4
>>>>>>>> jobGranularity=1
>>>>>>>> maxNodesPerJob=1
>>>>>>>> tasksPerWorker=16
>>>>>>>> taskThrottle=64
>>>>>>>> initialScore=10000
>>>>>>>> jobWalltime=00:48:00
>>>>>>>> taskWalltime=00:40:00
>>>>>>>> workerloglevel=DEBUG # Adding debug for workers
>>>>>>>> workerlogdirectory=/home/$USER/ # Logging directory on SFS
>>>>>>>> workdir=$RUNDIRECTORY
>>>>>>>> filesystem=local
>>>>>>>> }
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yadu
>>>>>>>>
>>>>>>>>
>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote:
>>>>>>>>
>>>>>>>>> Hi Mike,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sorry, I figured there was some busy-ness involved!
>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I didn’t
>>>>>>>>> get the same issue. That is, the model run completed successfully.
>>>>>>>>> For the Blues run, I used a trunk distribution (as of May 29,
>>>>>>>>> 2014). I’m including one of the log files below. I’m also
>>>>>>>>> including the swift.properties file that was used for the blues
>>>>>>>>> runs below.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank you!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Jonathan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde <wilde at anl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jonathan,
>>>>>>>>>>
>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping!
>>>>>>>>>>
>>>>>>>>>> I or one of the team will answer soon, on swift-user.
>>>>>>>>>>
>>>>>>>>>> (But the first question is: which Swift release, and can you
>>>>>>>>>> point us to, or send, the full log file?)
>>>>>>>>>>
>>>>>>>>>> Thanks and regards,
>>>>>>>>>>
>>>>>>>>>> - Mike
>>>>>>>>>>
>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Mike,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I didn’t get a response yet so just wanted to make sure that
>>>>>>>>>>> the message came across.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Jonathan
>>>>>>>>>>>
>>>>>>>>>>> Begin forwarded message:
>>>>>>>>>>>
>>>>>>>>>>>> From: Jonathan Ozik <jozik at uchicago.edu>
>>>>>>>>>>>>
>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511,
>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
>>>>>>>>>>>>
>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT
>>>>>>>>>>>>
>>>>>>>>>>>> To: Mihael Hategan <hategan at mcs.anl.gov>,
>>>>>>>>>>>> "swift-user at ci.uchicago.edu" <swift-user at ci.uchicago.edu>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I’m getting spurious errors in the jobs that I’m running on
>>>>>>>>>>>> Blues. The stdout includes exceptions like:
>>>>>>>>>>>> exception @ swift-int.k, line: 511
>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
>>>>>>>>>>>> java.io.IOException: Broken pipe
>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>>>>>>>>>>>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>>>>>>>>>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>>>>>>>>>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
>>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
>>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
>>>>>>>>>>>>
>>>>>>>>>>>> These seem to occur at different parts of the submitted
>>>>>>>>>>>> jobs. Let me know if there’s a log file that you’d like to
>>>>>>>>>>>> look at.
>>>>>>>>>>>>
>>>>>>>>>>>> In earlier attempts I was getting these warnings followed by
>>>>>>>>>>>> broken pipe errors:
>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0)
>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
>>>>>>>>>>>> allocate large pages, falling back to regular pages
>>>>>>>>>>>>
>>>>>>>>>>>> Apparently that’s a known precursor of crashes on Java 7 as
>>>>>>>>>>>> described here
>>>>>>>>>>>> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
>>>>>>>>>>>> Area: hotspot/gc
>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages.
>>>>>>>>>>>>
>>>>>>>>>>>> On Linux, failures when allocating large pages can lead to
>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue
>>>>>>>>>>>> can be recognized in two ways:
>>>>>>>>>>>>
>>>>>>>>>>>> • Before the crash happens one or more lines similar to this
>>>>>>>>>>>> will have been printed to the log:
>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0)
>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
>>>>>>>>>>>> allocate large pages, falling back to regular pages
>>>>>>>>>>>> • If a hs_err file is generated it will contain a line
>>>>>>>>>>>> similar to this:
>>>>>>>>>>>> Large page allocation failures have occurred 3 times
>>>>>>>>>>>> The problem can be avoided by running with large page
>>>>>>>>>>>> support turned off, for example by passing the
>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary.
>>>>>>>>>>>>
>>>>>>>>>>>> See 8007074 (not public).
>>>>>>>>>>>>
>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the invocations
>>>>>>>>>>>> of Java code that I was responsible for. That seemed to get
>>>>>>>>>>>> rid of the warning and the crashes for a while, but perhaps
>>>>>>>>>>>> that was just a coincidence.
>>>>>>>>>>>>
>>>>>>>>>>>> Jonathan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Michael Wilde
>>>>>>>>>> Mathematics and Computer Science    Computation Institute
>>>>>>>>>> Argonne National Laboratory         The University of Chicago
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>