[Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost

Yadu Nand yadudoc1729 at gmail.com
Thu Jul 31 13:26:02 CDT 2014


Here's a link to the scripts folder, tarred up:
http://users.rcc.uchicago.edu/~yadunand/pbs_ozik_blues.tar.gz

A couple files couldn't be copied due to permission issues.

-Yadu


On Thu, Jul 31, 2014 at 1:18 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Ok, so the workers die while the jobs are running and not much else is
> happening.
> My money is on the apps eating up all RAM and the kernel killing the
> worker.
>
> The question is how we check whether this is true or not. Ideas?
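>
> (One way to check, for what it's worth: the kernel OOM killer logs to the
> kernel ring buffer, so on an affected compute node something like the
> command below should show whether a worker or app process was killed for
> memory. This is only a sketch and assumes dmesg is readable on the Blues
> compute nodes; the PBS output files asked for below may also show the
> shell reporting the worker as "Killed".
>
>   dmesg | grep -i -E 'out of memory|killed process'
> )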
>
> Yadu, can you do me a favor and package all the PBS output files from
> this run?
>
> Jonathan, can you see if you get the same errors with tasksPerWorker=8?
>
> Mihael
>
> On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote:
> > Sure thing, it’s attached below.
> >
> > Jonathan
> >
> >
> > On Jul 31, 2014, at 12:09 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> >
> > > Hi Jonathan,
> > >
> > > I can't see anything obvious in the worker logs, but they are pretty
> > > large. Can you also post the swift log from this run? It would make it
> > > easier to focus on the right time frame.
> > >
> > > Mihael
> > >
> > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote:
> > >> Hi all,
> > >>
> > >> I’m attaching the stdout and the worker logs below.
> > >>
> > >> Thanks for looking at these!
> > >>
> > >> Jonathan
> > >>
> > >>
> > >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik <jozik at uchicago.edu>
> > >> wrote:
> > >>
> > >>> Whoops, sorry about that. It’s running now and the logs are being
> > >>> generated. Once the run is done I’ll send you the log files.
> > >>>
> > >>> Thanks!
> > >>>
> > >>> Jonathan
> > >>>
> > >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > >>>
> > >>>> Right. This isn't your fault. We should, though, probably talk about
> > >>>> addressing the issue.
> > >>>>
> > >>>> Mihael
> > >>>>
> > >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote:
> > >>>>> Mihael, thanks for spotting that. I added the comments to highlight the
> > >>>>> changes in the email.
> > >>>>>
> > >>>>> -Yadu
> > >>>>>
> > >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote:
> > >>>>>> Hi Jonathan,
> > >>>>>>
> > >>>>>> I suspect that the site config is considering the comment to be part of
> > >>>>>> the value of the workerLogLevel property. We could confirm this if you
> > >>>>>> send us the swift log from this particular run.
> > >>>>>>
> > >>>>>> To fix it, you could try to remove everything after DEBUG (including all
> > >>>>>> horizontal white space). In other words:
> > >>>>>>
> > >>>>>> ...
> > >>>>>> workerloglevel=DEBUG
> > >>>>>> workerlogdirectory=/home/$USER/
> > >>>>>> ...
> > >>>>>>
> > >>>>>> Mihael
> > >>>>>>
> > >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote:
> > >>>>>>> Hi Yadu,
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> I’m getting errors indicating that DEBUG is an invalid worker logging
> > >>>>>>> level. I’m attaching the stdout below. Let me know if I’m doing
> > >>>>>>> something silly.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Jonathan
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Jonathan,
> > >>>>>>>>
> > >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not see
> > >>>>>>>> anything unusual.
> > >>>>>>>>
> > >>>>>>>> From your logs, it looks like workers are failing, so getting worker
> > >>>>>>>> logs would help. Could you try running on Blues with the following
> > >>>>>>>> swift.properties and get us the worker* logs that would show up in the
> > >>>>>>>> workerlogdirectory?
> > >>>>>>>>
> > >>>>>>>> site=blues
> > >>>>>>>>
> > >>>>>>>> site.blues {
> > >>>>>>>>   jobManager=pbs
> > >>>>>>>>   jobQueue=shared
> > >>>>>>>>   maxJobs=4
> > >>>>>>>>   jobGranularity=1
> > >>>>>>>>   maxNodesPerJob=1
> > >>>>>>>>   tasksPerWorker=16
> > >>>>>>>>   taskThrottle=64
> > >>>>>>>>   initialScore=10000
> > >>>>>>>>   jobWalltime=00:48:00
> > >>>>>>>>   taskWalltime=00:40:00
> > >>>>>>>>   workerloglevel=DEBUG                    # Adding debug for workers
> > >>>>>>>>   workerlogdirectory=/home/$USER/         # Logging directory on SFS
> > >>>>>>>>   workdir=$RUNDIRECTORY
> > >>>>>>>>   filesystem=local
> > >>>>>>>> }
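> > >>>>>>>>
> > >>>>>>>> (As Mihael notes further up the thread, the trailing "# ..." comments
> > >>>>>>>> on the two worker-log lines above end up being read as part of the
> > >>>>>>>> property values, so the form that actually works is simply:
> > >>>>>>>>
> > >>>>>>>>   workerloglevel=DEBUG
> > >>>>>>>>   workerlogdirectory=/home/$USER/
> > >>>>>>>> )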
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Yadu
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Mike,
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Sorry, I figured there was some busy-ness involved!
> > >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I didn’t
> > >>>>>>>>> get the same issue. That is, the model run completed successfully.
> > >>>>>>>>> For the Blues run, I used a trunk distribution (as of May 29,
> > >>>>>>>>> 2014). I’m including one of the log files below. I’m also
> > >>>>>>>>> including the swift.properties file that was used for the Blues
> > >>>>>>>>> runs below.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Thank you!
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Jonathan
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde <wilde at anl.gov> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi Jonathan,
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping!
> > >>>>>>>>>>
> > >>>>>>>>>> I or one of the team will answer soon, on swift-user.
> > >>>>>>>>>>
> > >>>>>>>>>> (But the first question is: which Swift release, and can you
> > >>>>>>>>>> point us to, or send, the full log file?)
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks and regards,
> > >>>>>>>>>>
> > >>>>>>>>>> - Mike
> > >>>>>>>>>>
> > >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi Mike,
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> I didn’t get a response yet, so I just wanted to make sure that
> > >>>>>>>>>>> the message came across.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Jonathan
> > >>>>>>>>>>>
> > >>>>>>>>>>> Begin forwarded message:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> From: Jonathan Ozik <jozik at uchicago.edu>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511,
> > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> To: Mihael Hategan <hategan at mcs.anl.gov>,
> > >>>>>>>>>>>> "swift-user at ci.uchicago.edu" <swift-user at ci.uchicago.edu>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I’m getting spurious errors in the jobs that I’m running on
> > >>>>>>>>>>>> Blues. The stdout includes exceptions like:
> > >>>>>>>>>>>> exception @ swift-int.k, line: 511
> > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
> > >>>>>>>>>>>> java.io.IOException: Broken pipe
> > >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> > >>>>>>>>>>>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> > >>>>>>>>>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> > >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> > >>>>>>>>>>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
> > >>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
> > >>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> These seem to occur at different parts of the submitted
> > >>>>>>>>>>>> jobs. Let me know if there’s a log file that you’d like to
> > >>>>>>>>>>>> look at.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In earlier attempts I was getting these warnings followed by
> > >>>>>>>>>>>> broken pipe errors:
> > >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
> > >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0)
> > >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
> > >>>>>>>>>>>> allocate large pages, falling back to regular pages
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Apparently that’s a known precursor of crashes on Java 7 as
> > >>>>>>>>>>>> described here
> > >>>>>>>>>>>> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
> > >>>>>>>>>>>> Area: hotspot/gc
> > >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Linux, failures when allocating large pages can lead to
> > >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue
> > >>>>>>>>>>>> can be recognized in two ways:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> • Before the crash happens one or more lines similar to this
> > >>>>>>>>>>>> will have been printed to the log:
> > >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0)
> > >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
> > >>>>>>>>>>>> allocate large pages, falling back to regular pages
> > >>>>>>>>>>>> • If a hs_err file is generated it will contain a line
> > >>>>>>>>>>>> similar to this:
> > >>>>>>>>>>>> Large page allocation failures have occurred 3 times
> > >>>>>>>>>>>> The problem can be avoided by running with large page
> > >>>>>>>>>>>> support turned off, for example by passing the
> > >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> See 8007074 (not public).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So I added the -XX:-UseLargePages option to the invocations
> > >>>>>>>>>>>> of the Java code that I was responsible for. That seemed to get
> > >>>>>>>>>>>> rid of the warning and the crashes for a while, but perhaps
> > >>>>>>>>>>>> that was just a coincidence.
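> > >>>>>>>>>>>>
> > >>>>>>>>>>>> (For reference, a minimal sketch of what that flag looks like in
> > >>>>>>>>>>>> an app wrapper script; the wrapper and jar names here are
> > >>>>>>>>>>>> hypothetical, not the actual model invocation:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> #!/bin/bash
> > >>>>>>>>>>>> # Run the model JVM with large-page support turned off, per the
> > >>>>>>>>>>>> # JDK 7u51 release note, so large-page allocation failures cannot
> > >>>>>>>>>>>> # occur in the first place.
> > >>>>>>>>>>>> exec java -XX:-UseLargePages -jar model.jar "$@"
> > >>>>>>>>>>>> )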
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Jonathan
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> _______________________________________________
> > >>>>>>>>>>>> Swift-user mailing list
> > >>>>>>>>>>>> Swift-user at ci.uchicago.edu
> > >>>>>>>>>>>>
> > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> Michael Wilde
> > >>>>>>>>>> Mathematics and Computer Science          Computation Institute
> > >>>>>>>>>> Argonne National Laboratory               The University of Chicago
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >
> > >
> >
> >
>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>



-- 
Yadu Nand B