[Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost
Michael Wilde
wilde at anl.gov
Thu Jul 31 15:18:53 CDT 2014
It's odd that no errors like OOM or OOT would be logged to the stdout of the
PBS job.
Jonathan, can you check with the Blues Sysadmins if they have any other
error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes on which
they ran?
Thanks,
- Mike
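
For checking the node logs, here is a rough sketch of how dmesg/syslog-style text could be scanned for signs of the kernel's OOM killer. The function name, the sample lines, and the exact patterns are assumptions for illustration only; the wording of these kernel messages varies between kernel versions.

```python
import re

# Phrases the Linux OOM killer commonly logs. The wording differs across
# kernel versions, so this list is an assumption, not a complete catalogue.
OOM_PATTERNS = re.compile(
    r"(invoked oom-killer|Out of memory: Kill(ed)? process|Killed process \d+)",
    re.IGNORECASE,
)

def find_oom_lines(log_text):
    """Return the log lines that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERNS.search(line)]

if __name__ == "__main__":
    # Hypothetical sample text, not taken from the Blues nodes.
    sample = "\n".join([
        "[1234.5] java invoked oom-killer: gfp_mask=0x201da, order=0",
        "[1234.6] Out of memory: Kill process 4321 (java) score 901",
        "[1234.7] Killed process 4321 (java) total-vm:2097152kB",
        "[1300.0] eth0: link up",
    ])
    for line in find_oom_lines(sample):
        print(line)
```

If nothing matches in the logs for the nodes in question, that would argue against the OOM theory, though it would not rule out the scheduler killing the job for other reasons.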
On 7/31/14, 1:55 PM, David Kelly wrote:
> Is each invocation of the Java app creating a large number of threads?
> I ran into an issue like that (on another cluster) where I was hitting
> the maximum number of processes per node, and the scheduler ended up
> killing my jobs.
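
A minimal sketch for checking this, assuming a Linux node: it reads the per-user process limit for the current session and pulls the thread count out of the text of /proc/<pid>/status. The function names are made up for illustration, and a scheduler may enforce its own limits separately from the rlimit shown here.

```python
import resource

def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC values for this session.

    On Linux this limit counts threads as well as processes.
    """
    return resource.getrlimit(resource.RLIMIT_NPROC)

def thread_count(status_text):
    """Extract the 'Threads:' value from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None  # field not present

if __name__ == "__main__":
    print("nproc (soft, hard):", nproc_limits())
    # Inspect the current process; substitute the PID of a Java app to check it.
    with open("/proc/self/status") as f:
        print("threads in this process:", thread_count(f.read()))
```

Comparing the thread count of one Java process, multiplied by tasksPerWorker, against the soft limit would show how close a node gets to it.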
>
>
> On Thu, Jul 31, 2014 at 1:42 PM, Jonathan Ozik <jozik at uchicago.edu> wrote:
>
> Okay, I've launched a new job with tasksPerWorker=8. This is
> running on the sSerial queue rather than the shared queue used for
> the other runs; I just wanted to note that for comparison. Each of
> the Java processes is launched with -Xmx1536m. I believe that
> Blues advertises each node as having access to 64GB
> (http://www.lcrc.anl.gov/about/Blues), so at least at first glance
> memory doesn't seem like it could be the issue.
>
> Jonathan
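
The arithmetic behind that estimate can be sketched as follows, assuming one Java process per task and counting only the -Xmx heap ceiling. A JVM also uses memory outside the heap (thread stacks, native allocations), so the real footprint is higher than these figures. The helper name is hypothetical.

```python
def heap_budget_gb(tasks_per_worker, xmx_mb):
    """Upper bound on combined Java heap per node, in GB (1 GB = 1024 MB)."""
    return tasks_per_worker * xmx_mb / 1024

NODE_MEMORY_GB = 64  # figure quoted in the thread for a Blues node

if __name__ == "__main__":
    # 16 is the tasksPerWorker value from the earlier runs, 8 from the new one.
    for tasks in (8, 16):
        total = heap_budget_gb(tasks, 1536)
        print(f"tasksPerWorker={tasks}: {total:.0f} GB heap of {NODE_MEMORY_GB} GB")
```

That gives 12 GB for 8 tasks and 24 GB for 16 tasks, both well under 64 GB, provided the job actually has the whole node's memory available to it.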
>
> On Jul 31, 2014, at 1:18 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>
> > Ok, so the workers die while the jobs are running and not much else is
> > happening.
> > My money is on the apps eating up all RAM and the kernel killing the
> > worker.
> >
> > The question is how we check whether this is true or not. Ideas?
> >
> > Yadu, can you do me a favor and package all the PBS output files from
> > this run?
> >
> > Jonathan, can you see if you get the same errors with tasksPerWorker=8?
> >
> > Mihael
> >
> > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote:
> >> Sure thing, it's attached below.
> >>
> >> Jonathan
> >>
> >>
> >> On Jul 31, 2014, at 12:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >>
> >>> Hi Jonathan,
> >>>
> >>> I can't see anything obvious in the worker logs, but they are pretty
> >>> large. Can you also post the swift log from this run? It would make it
> >>> easier to focus on the right time frame.
> >>>
> >>> Mihael
> >>>
> >>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote:
> >>>> Hi all,
> >>>>
> >>>> I'm attaching the stdout and the worker logs below.
> >>>>
> >>>> Thanks for looking at these!
> >>>>
> >>>> Jonathan
> >>>>
> >>>>
> >>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik <jozik at uchicago.edu> wrote:
> >>>>
> >>>>> Whoops, sorry about that. It's running now and the logs are being
> >>>>> generated. Once the run is done I'll send you the log files.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Jonathan
> >>>>>
> >>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >>>>>
> >>>>>> Right. This isn't your fault. We should, though, probably talk about
> >>>>>> addressing the issue.
> >>>>>>
> >>>>>> Mihael
> >>>>>>
> >>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote:
> >>>>>>> Mihael, thanks for spotting that. I added the comments to highlight the
> >>>>>>> changes in email.
> >>>>>>>
> >>>>>>> -Yadu
> >>>>>>>
> >>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote:
> >>>>>>>> Hi Jonathan,
> >>>>>>>>
> >>>>>>>> I suspect that the site config is considering the comment to be part of
> >>>>>>>> the value of the workerLogLevel property. We could confirm this if you
> >>>>>>>> send us the swift log from this particular run.
> >>>>>>>>
> >>>>>>>> To fix it, you could try to remove everything after DEBUG (including all
> >>>>>>>> horizontal white space). In other words:
> >>>>>>>>
> >>>>>>>> ...
> >>>>>>>> workerloglevel=DEBUG
> >>>>>>>> workerlogdirectory=/home/$USER/
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>> Mihael
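
To illustrate the suspicion above: a parser that splits each line on the first '=' and does not treat '#' specially would carry the trailing comment into the value. The two functions below are hypothetical and are not Swift's actual configuration code; they only show how the value would differ.

```python
def parse_naive(line):
    """Split on the first '=' and keep everything after it as the value."""
    key, value = line.split("=", 1)
    return key.strip(), value.strip()

def parse_stripping_comments(line):
    """Same as parse_naive, but drop a trailing '# ...' comment first."""
    return parse_naive(line.split("#", 1)[0])

if __name__ == "__main__":
    line = "workerloglevel=DEBUG # Adding debug for workers"
    print(parse_naive(line))               # value still contains the comment
    print(parse_stripping_comments(line))  # value is just DEBUG
```

With the naive version the value is "DEBUG # Adding debug for workers", which would not match any valid log level and would be consistent with the "invalid worker logging level" error.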
> >>>>>>>>
> >>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote:
> >>>>>>>>> Hi Yadu,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I'm getting errors indicating that DEBUG is an invalid worker logging
> >>>>>>>>> level. I'm attaching the stdout below. Let me know if I'm doing
> >>>>>>>>> something silly.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Jonathan
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Jonathan,
> >>>>>>>>>>
> >>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not see
> >>>>>>>>>> anything unusual.
> >>>>>>>>>>
> >>>>>>>>>> From your logs, it looks like workers are failing, so getting worker
> >>>>>>>>>> logs would help.
> >>>>>>>>>> Could you try running on Blues with the following swift.properties
> >>>>>>>>>> and get us the worker*logs that would show up in the
> >>>>>>>>>> workerlogdirectory?
> >>>>>>>>>>
> >>>>>>>>>> site=blues
> >>>>>>>>>>
> >>>>>>>>>> site.blues {
> >>>>>>>>>> jobManager=pbs
> >>>>>>>>>> jobQueue=shared
> >>>>>>>>>> maxJobs=4
> >>>>>>>>>> jobGranularity=1
> >>>>>>>>>> maxNodesPerJob=1
> >>>>>>>>>> tasksPerWorker=16
> >>>>>>>>>> taskThrottle=64
> >>>>>>>>>> initialScore=10000
> >>>>>>>>>> jobWalltime=00:48:00
> >>>>>>>>>> taskWalltime=00:40:00
> >>>>>>>>>> workerloglevel=DEBUG # Adding debug for workers
> >>>>>>>>>> workerlogdirectory=/home/$USER/ # Logging directory on SFS
> >>>>>>>>>> workdir=$RUNDIRECTORY
> >>>>>>>>>> filesystem=local
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Yadu
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Mike,
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry, I figured there was some busy-ness involved!
> >>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I didn't
> >>>>>>>>>>> get the same issue. That is, the model run completed successfully.
> >>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May 29,
> >>>>>>>>>>> 2014). I'm including one of the log files below. I'm also
> >>>>>>>>>>> including the swift.properties file that was used for the blues
> >>>>>>>>>>> runs below.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you!
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Jonathan
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde <wilde at anl.gov> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Jonathan,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping!
> >>>>>>>>>>>>
> >>>>>>>>>>>> I or one of the team will answer soon, on swift-user.
> >>>>>>>>>>>>
> >>>>>>>>>>>> (But the first question is: which Swift release, and can you
> >>>>>>>>>>>> point us to, or send, the full log file?)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks and regards,
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Mike
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Mike,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I didn't get a response yet so just wanted to make sure that
> >>>>>>>>>>>>> the message came across.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Jonathan
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Begin forwarded message:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> From: Jonathan Ozik <jozik at uchicago.edu>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511,
> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> To: Mihael Hategan <hategan at mcs.anl.gov>,
> >>>>>>>>>>>>>> "swift-user at ci.uchicago.edu" <swift-user at ci.uchicago.edu>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm getting spurious errors in the jobs that I'm running on
> >>>>>>>>>>>>>> Blues. The stdout includes exceptions like:
> >>>>>>>>>>>>>> exception @ swift-int.k, line: 511
> >>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
> >>>>>>>>>>>>>> java.io.IOException: Broken pipe
> >>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> >>>>>>>>>>>>>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> >>>>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> >>>>>>>>>>>>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
> >>>>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
> >>>>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> These seem to occur at different parts of the submitted
> >>>>>>>>>>>>>> jobs. Let me know if there's a log file that you'd like to
> >>>>>>>>>>>>>> look at.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed by
> >>>>>>>>>>>>>> broken pipe errors:
> >>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
> >>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0)
> >>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
> >>>>>>>>>>>>>> allocate large pages, falling back to regular pages
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apparently that's a known precursor of crashes on Java 7 as
> >>>>>>>>>>>>>> described here
> >>>>>>>>>>>>>> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
> >>>>>>>>>>>>>> Area: hotspot/gc
> >>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead to
> >>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue
> >>>>>>>>>>>>>> can be recognized in two ways:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> . Before the crash happens one or more lines similar to this
> >>>>>>>>>>>>>> will have been printed to the log:
> >>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0)
> >>>>>>>>>>>>>> failed;
> >>>>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot allocate
> >>>>>>>>>>>>>> large pages, falling back to regular pages
> >>>>>>>>>>>>>> . If a hs_err file is generated it will contain a line
> >>>>>>>>>>>>>> similar to this:
> >>>>>>>>>>>>>> Large page allocation failures have occurred 3 times
> >>>>>>>>>>>>>> The problem can be avoided by running with large page
> >>>>>>>>>>>>>> support turned off, for example by passing the
> >>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> See 8007074 (not public).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the invocations
> >>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to get
> >>>>>>>>>>>>>> rid of the warning and the crashes for a while, but perhaps
> >>>>>>>>>>>>>> that was just a coincidence.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jonathan
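
A small sketch for sizing the failed allocation in the HotSpot warning quoted above. It assumes the second and third arguments of os::commit_memory are the requested size and the alignment in bytes, which matches the 2 MB large-page size here but is my reading of the message rather than something stated in the thread. The function name is hypothetical.

```python
import re

# Matches: os::commit_memory(<addr>, <size>, <alignment>, <flag>) failed
COMMIT_FAIL = re.compile(
    r"os::commit_memory\((0x[0-9a-fA-F]+),\s*(\d+),\s*(\d+),\s*(\d+)\)\s*failed"
)

def commit_failures(text):
    """Return (request_mib, alignment_mib) for each failed commit in text."""
    mib = 1024 * 1024
    return [
        (int(m.group(2)) / mib, int(m.group(3)) / mib)
        for m in COMMIT_FAIL.finditer(text)
    ]

if __name__ == "__main__":
    warning = (
        "Java HotSpot(TM) 64-Bit Server VM warning: INFO: "
        "os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0) "
        "failed; error='Cannot allocate memory' (errno=12); Cannot "
        "allocate large pages, falling back to regular pages"
    )
    print(commit_failures(warning))
```

Under that reading, the failed request was 672 MiB with 2 MiB alignment, which is well below the -Xmx1536m heap ceiling mentioned elsewhere in the thread.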
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>>> Swift-user mailing list
> >>>>>>>>>>>>>> Swift-user at ci.uchicago.edu
> >>>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Michael Wilde
> >>>>>>>>>>>> Mathematics and Computer Science Computation Institute
> >>>>>>>>>>>> Argonne National Laboratory The University of Chicago
--
Michael Wilde
Mathematics and Computer Science Computation Institute
Argonne National Laboratory The University of Chicago