[Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost

David Kelly davidkelly at uchicago.edu
Thu Jul 31 13:55:33 CDT 2014


Is each invocation of the Java app creating a large number of threads? I
ran into an issue like that (on another cluster) where I was hitting the
maximum number of processes per node, and the scheduler ended up killing my
jobs.
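
If it helps, a quick way to check is to have the app report its own thread
usage. Here is a minimal sketch (the class name is made up); fold the two
calls into your app near the end of a run, and compare the peak count times
the number of concurrent app invocations against the per-node process limit
(ulimit -u on a compute node, since each Java thread counts as a task there):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    public class ThreadCountProbe {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // Threads alive in this JVM right now.
            System.err.println("live threads: " + threads.getThreadCount());
            // High-water mark since JVM start (or since the last
            // resetPeakThreadCount() call).
            System.err.println("peak threads: " + threads.getPeakThreadCount());
        }
    }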


On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik <jozik at uchicago.edu> wrote:

> Okay, I’ve launched a new job with tasksPerWorker=8. This one is running on
> the serial queue rather than the shared queue used for the other runs; just
> wanted to note that for comparison. Each of the java processes is launched
> with -Xmx1536m. I believe that Blues advertises each node as having access
> to 64GB (http://www.lcrc.anl.gov/about/Blues), so at least at first glance
> memory doesn’t seem like it could be the issue.
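>
> Back-of-the-envelope, with tasksPerWorker=8: 8 x 1.5 GB of heap is only
> about 12 GB of a 64 GB node. One caveat (my assumption, worth checking):
> a JVM’s actual footprint (permanent generation, thread stacks, native
> buffers) can run well past its -Xmx, so the heap arithmetic alone may
> understate real memory use.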
>
> Jonathan
>
> On Jul 31, 2014, at 1:18 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>
> > Ok, so the workers die while the jobs are running and not much else is
> > happening.
> > My money is on the apps eating up all RAM and the kernel killing the
> > worker.
> >
> > The question is how we check whether this is true or not. Ideas?
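> >
> > One low-tech check, as a sketch (class name made up): have each app
> > invocation print its heap ceiling and a snapshot of /proc/meminfo at
> > startup, then grep the worker logs afterwards. If MemFree collapses just
> > before a worker drops, that points at the kernel's OOM killer (dmesg on
> > the node right after a failure should confirm it).
> >
> > import java.io.BufferedReader;
> > import java.io.FileReader;
> > import java.io.IOException;
> >
> > public class MemProbe {
> >     public static void main(String[] args) throws IOException {
> >         // Heap ceiling of this JVM (reflects -Xmx).
> >         System.err.println("max heap MB: "
> >                 + Runtime.getRuntime().maxMemory() / (1024 * 1024));
> >         // Node-wide view: the first few lines of /proc/meminfo
> >         // (MemTotal, MemFree, ...). Linux only.
> >         BufferedReader r = new BufferedReader(new FileReader("/proc/meminfo"));
> >         for (int i = 0; i < 4; i++) {
> >             System.err.println(r.readLine());
> >         }
> >         r.close();
> >     }
> > }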
> >
> > Yadu, can you do me a favor and package all the PBS output files from
> > this run?
> >
> > Jonathan, can you see if you get the same errors with tasksPerWorker=8?
> >
> > Mihael
> >
> > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote:
> >> Sure thing, it’s attached below.
> >>
> >> Jonathan
> >>
> >>
> >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >>
> >>> Hi Jonathan,
> >>>
> >>> I can't see anything obvious in the worker logs, but they are pretty
> >>> large. Can you also post the swift log from this run? It would make it
> >>> easier to focus on the right time frame.
> >>>
> >>> Mihael
> >>>
> >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote:
> >>>> Hi all,
> >>>>
> >>>> I’m attaching the stdout and the worker logs below.
> >>>>
> >>>> Thanks for looking at these!
> >>>>
> >>>> Jonathan
> >>>>
> >>>>
> >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik <jozik at uchicago.edu> wrote:
> >>>>
> >>>>> Woops, sorry about that. It’s running now and the logs are being
> >>>>> generated. Once the run is done I’ll send you log files.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Jonathan
> >>>>>
> >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >>>>>
> >>>>>> Right. This isn't your fault. We should, though, probably talk about
> >>>>>> addressing the issue.
> >>>>>>
> >>>>>> Mihael
> >>>>>>
> >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote:
> >>>>>>> Mihael, thanks for spotting that. I added the comments to highlight
> >>>>>>> the changes in the email.
> >>>>>>>
> >>>>>>> -Yadu
> >>>>>>>
> >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote:
> >>>>>>>> Hi Jonathan,
> >>>>>>>>
> >>>>>>>> I suspect that the site config parser is treating the comment as
> >>>>>>>> part of the value of the workerLogLevel property. We could confirm
> >>>>>>>> this if you send us the swift log from this particular run.
> >>>>>>>>
> >>>>>>>> To fix it, you could try to remove everything after DEBUG
> >>>>>>>> (including all horizontal white space). In other words:
> >>>>>>>>
> >>>>>>>> ...
> >>>>>>>> workerloglevel=DEBUG
> >>>>>>>> workerlogdirectory=/home/$USER/
> >>>>>>>> ...
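> >>>>>>>>
> >>>>>>>> If you want to keep the notes, my guess is that putting each
> >>>>>>>> comment on its own line, above the property, would also work:
> >>>>>>>>
> >>>>>>>> # Adding debug for workers
> >>>>>>>> workerloglevel=DEBUG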
> >>>>>>>>
> >>>>>>>> Mihael
> >>>>>>>>
> >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote:
> >>>>>>>>> Hi Yadu,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I’m getting errors indicating that DEBUG is an invalid worker
> >>>>>>>>> logging level. I’m attaching the stdout below. Let me know if I’m
> >>>>>>>>> doing something silly.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Jonathan
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Jonathan,
> >>>>>>>>>>
> >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not
> >>>>>>>>>> see anything unusual.
> >>>>>>>>>>
> >>>>>>>>>> From your logs, it looks like workers are failing, so getting
> >>>>>>>>>> worker logs would help. Could you try running on Blues with the
> >>>>>>>>>> following swift.properties and get us the worker* logs that
> >>>>>>>>>> would show up in the workerlogdirectory?
> >>>>>>>>>>
> >>>>>>>>>> site=blues
> >>>>>>>>>>
> >>>>>>>>>> site.blues {
> >>>>>>>>>>  jobManager=pbs
> >>>>>>>>>>  jobQueue=shared
> >>>>>>>>>>  maxJobs=4
> >>>>>>>>>>  jobGranularity=1
> >>>>>>>>>>  maxNodesPerJob=1
> >>>>>>>>>>  tasksPerWorker=16
> >>>>>>>>>>  taskThrottle=64
> >>>>>>>>>>  initialScore=10000
> >>>>>>>>>>  jobWalltime=00:48:00
> >>>>>>>>>>  taskWalltime=00:40:00
> >>>>>>>>>>  workerloglevel=DEBUG              # Adding debug for workers
> >>>>>>>>>>  workerlogdirectory=/home/$USER/   # Logging directory on SFS
> >>>>>>>>>>  workdir=$RUNDIRECTORY
> >>>>>>>>>>  filesystem=local
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Yadu
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Mike,
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry, I figured there was some busy-ness involved!
> >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I
> >>>>>>>>>>> didn’t get the same issue. That is, the model run completed
> >>>>>>>>>>> successfully. For the Blues run, I used a trunk distribution
> >>>>>>>>>>> (as of May 29, 2014). I’m including one of the log files below.
> >>>>>>>>>>> I’m also including the swift.properties file that was used for
> >>>>>>>>>>> the Blues runs below.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you!
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Jonathan
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde <wilde at anl.gov> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Jonathan,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping!
> >>>>>>>>>>>>
> >>>>>>>>>>>> I or one of the team will answer soon, on swift-user.
> >>>>>>>>>>>>
> >>>>>>>>>>>> (But the first question is: which Swift release, and can you
> >>>>>>>>>>>> point us to, or send, the full log file?)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks and regards,
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Mike
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Mike,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I haven’t gotten a response yet, so I just wanted to make sure
> >>>>>>>>>>>>> that the message came across.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Jonathan
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Begin forwarded message:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> From: Jonathan Ozik <jozik at uchicago.edu>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511,
> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> To: Mihael Hategan <hategan at mcs.anl.gov>,
> >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" <swift-user at ci.uchicago.edu>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I’m getting spurious errors in the jobs that I’m running on
> >>>>>>>>>>>>>> Blues. The stdout includes exceptions like:
> >>>>>>>>>>>>>> exception @ swift-int.k, line: 511
> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
> >>>>>>>>>>>>>> java.io.IOException: Broken pipe
> >>>>>>>>>>>>>>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> >>>>>>>>>>>>>>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> >>>>>>>>>>>>>>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> >>>>>>>>>>>>>>   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> >>>>>>>>>>>>>>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
> >>>>>>>>>>>>>>   at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
> >>>>>>>>>>>>>>   at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> These seem to occur at different parts of the submitted
> >>>>>>>>>>>>>> jobs. Let me know if there’s a log file that you’d like to
> >>>>>>>>>>>>>> look at.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed by
> >>>>>>>>>>>>>> broken pipe errors:
> >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
> >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0)
> >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
> >>>>>>>>>>>>>> allocate large pages, falling back to regular pages
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apparently that’s a known precursor of crashes on Java 7, as
> >>>>>>>>>>>>>> described here
> >>>>>>>>>>>>>> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
> >>>>>>>>>>>>>> Area: hotspot/gc
> >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead to
> >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue
> >>>>>>>>>>>>>> can be recognized in two ways:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> • Before the crash happens, one or more lines similar to this
> >>>>>>>>>>>>>> will have been printed to the log:
> >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0)
> >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
> >>>>>>>>>>>>>> allocate large pages, falling back to regular pages
> >>>>>>>>>>>>>> • If a hs_err file is generated it will contain a line
> >>>>>>>>>>>>>> similar to this:
> >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times
> >>>>>>>>>>>>>> The problem can be avoided by running with large page
> >>>>>>>>>>>>>> support turned off, for example by passing the
> >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> See 8007074 (not public).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option to the invocations
> >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to get
> >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but perhaps
> >>>>>>>>>>>>>> that was just a coincidence.
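> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For reference, the shape of the invocation I’m using (the jar
> >>>>>>>>>>>>>> name here is a placeholder):
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>   java -XX:-UseLargePages -Xmx1536m -jar model.jar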
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jonathan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Michael Wilde
> >>>>>>>>>>>> Mathematics and Computer Science          Computation Institute
> >>>>>>>>>>>> Argonne National Laboratory               The University of Chicago
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
>