[Swift-user] exception @ swift-int.k, line: 511, Caused by: Block task failed: Connection to worker lost

Mihael Hategan hategan at mcs.anl.gov
Thu Jul 31 13:36:27 CDT 2014


Thanks Yadu!

I see nothing interesting in those logs. Again, the absence of any kind*
of problem logged by the worker points to some abrupt termination of the
process, which is most likely explained by an OOM.

Mihael

(*) uninitialized variable concatenation warning aside
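
One rough way to test the OOM theory would be to scan the kernel log
(for example the output of dmesg on the compute node) for OOM-killer
messages from around the time the worker disappeared. The sketch below is
only an illustration: the function name find_oom_kills is made up, and the
message patterns are assumptions about how the Linux kernel usually words
these lines, which varies between kernel versions.

```python
import re

# Assumed wording of Linux OOM-killer messages; this differs across kernel
# versions, so adjust the patterns to what the node's kernel actually logs.
OOM_PATTERNS = [
    re.compile(r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
    re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"),
]


def find_oom_kills(log_text):
    """Return unique (pid, process_name) pairs for OOM kills found in log_text."""
    kills = []
    seen = set()
    for line in log_text.splitlines():
        for pattern in OOM_PATTERNS:
            match = pattern.search(line)
            if not match:
                continue
            entry = (int(match.group("pid")), match.group("name"))
            # The kernel often logs two lines per kill; report each kill once.
            if entry not in seen:
                seen.add(entry)
                kills.append(entry)
            break
    return kills
```

Saving the dmesg output from a node to a file and passing the file's
contents to find_oom_kills would list any processes the kernel killed. An
empty result does not rule out an OOM, since the kernel log may have
rotated or may not be readable by regular users.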

On Thu, 2014-07-31 at 13:26 -0500, Yadu Nand wrote:
> Here's a link to the scripts folder, tarred up.
> http://users.rcc.uchicago.edu/~yadunand/pbs_ozik_blues.tar.gz
> 
> A couple files couldn't be copied due to permission issues.
> 
> -Yadu
> 
> 
> On Thu, Jul 31, 2014 at 1:18 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > Ok, so the workers die while the jobs are running and not much else is
> > happening.
> > My money is on the apps eating up all RAM and the kernel killing the
> > worker.
> >
> > The question is how we check whether this is true or not. Ideas?
> >
> > Yadu, can you do me a favor and package all the PBS output files from
> > this run?
> >
> > Jonathan, can you see if you get the same errors with tasksPerWorker=8?
> >
> > Mihael
> >
> > On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote:
> > > Sure thing, it’s attached below.
> > >
> > > Jonathan
> > >
> > >
> > > On Jul 31, 2014, at 12:09 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > > wrote:
> > >
> > > > Hi Jonathan,
> > > >
> > > > I can't see anything obvious in the worker logs, but they are pretty
> > > > > large. Can you also post the swift log from this run? It would make it
> > > > > easier to focus on the right time frame.
> > > >
> > > > Mihael
> > > >
> > > > On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote:
> > > >> Hi all,
> > > >>
> > > >> I’m attaching the stdout and the worker logs below.
> > > >>
> > > >> Thanks for looking at these!
> > > >>
> > > >> Jonathan
> > > >>
> > > >>
> > > >> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik <jozik at uchicago.edu>
> > > >> wrote:
> > > >>
> > > > >>> Whoops, sorry about that. It’s running now and the logs are being
> > > > >>> generated. Once the run is done I’ll send you the log files.
> > > >>>
> > > >>> Thanks!
> > > >>>
> > > >>> Jonathan
> > > >>>
> > > >>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan <hategan at mcs.anl.gov>
> > > >> wrote:
> > > >>>
> > > > >>>> Right. This isn't your fault. We should, though, probably talk about
> > > > >>>> addressing the issue.
> > > >>>>
> > > >>>> Mihael
> > > >>>>
> > > >>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote:
> > > > >>>>> Mihael, thanks for spotting that. I added the comments to highlight the
> > > > >>>>> changes in email.
> > > >>>>>
> > > >>>>> -Yadu
> > > >>>>>
> > > >>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote:
> > > >>>>>> Hi Jonathan,
> > > >>>>>>
> > > > >>>>>> I suspect that the site config is considering the comment to be part of
> > > > >>>>>> the value of the workerLogLevel property. We could confirm this if you
> > > > >>>>>> send us the swift log from this particular run.
> > > > >>>>>>
> > > > >>>>>> To fix it, you could try to remove everything after DEBUG (including all
> > > > >>>>>> horizontal white space). In other words:
> > > >>>>>>
> > > >>>>>> ...
> > > >>>>>> workerloglevel=DEBUG
> > > >>>>>> workerlogdirectory=/home/$USER/
> > > >>>>>> ...
> > > >>>>>>
> > > >>>>>> Mihael
> > > >>>>>>
> > > >>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote:
> > > >>>>>>> Hi Yadu,
> > > >>>>>>>
> > > >>>>>>>
> > > > >>>>>>> I’m getting errors indicating that DEBUG is an invalid worker logging
> > > > >>>>>>> level. I’m attaching the stdout below. Let me know if I’m doing
> > > > >>>>>>> something silly.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Jonathan
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji
> > > >> <yadunand at uchicago.edu>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Jonathan,
> > > >>>>>>>>
> > > > >>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do not see
> > > > >>>>>>>> anything unusual.
> > > > >>>>>>>>
> > > > >>>>>>>> From your logs, it looks like workers are failing, so getting worker
> > > > >>>>>>>> logs would help.
> > > > >>>>>>>> Could you try running on Blues with the following swift.properties
> > > > >>>>>>>> and get us the worker*logs that would show up in the
> > > > >>>>>>>> workerlogdirectory?
> > > >>>>>>>>
> > > >>>>>>>> site=blues
> > > >>>>>>>>
> > > >>>>>>>> site.blues {
> > > >>>>>>>>   jobManager=pbs
> > > >>>>>>>>   jobQueue=shared
> > > >>>>>>>>   maxJobs=4
> > > >>>>>>>>   jobGranularity=1
> > > >>>>>>>>   maxNodesPerJob=1
> > > >>>>>>>>   tasksPerWorker=16
> > > >>>>>>>>   taskThrottle=64
> > > >>>>>>>>   initialScore=10000
> > > >>>>>>>>   jobWalltime=00:48:00
> > > >>>>>>>>   taskWalltime=00:40:00
> > > > >>>>>>>>   workerloglevel=DEBUG                                  # Adding debug for workers
> > > > >>>>>>>>   workerlogdirectory=/home/$USER/                # Logging directory on SFS
> > > >>>>>>>>   workdir=$RUNDIRECTORY
> > > >>>>>>>>   filesystem=local
> > > >>>>>>>> }
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Yadu
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Mike,
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> Sorry, I figured there was some busy-ness involved!
> > > > >>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I didn’t
> > > > >>>>>>>>> get the same issue. That is, the model run completed successfully.
> > > > >>>>>>>>> For the Blues run, I used a trunk distribution (as of May 29,
> > > > >>>>>>>>> 2014). I’m including one of the log files below. I’m also
> > > > >>>>>>>>> including the swift.properties file that was used for the Blues
> > > > >>>>>>>>> runs below.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> Thank you!
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> Jonathan
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde <wilde at anl.gov>
> > > >> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi Jonathan,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping!
> > > >>>>>>>>>>
> > > >>>>>>>>>> I or one of the team will answer soon, on swift-user.
> > > >>>>>>>>>>
> > > > >>>>>>>>>> (But the first question is: which Swift release, and can you
> > > > >>>>>>>>>> point us to, or send, the full log file?)
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks and regards,
> > > >>>>>>>>>>
> > > >>>>>>>>>> - Mike
> > > >>>>>>>>>>
> > > >>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi Mike,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > > >>>>>>>>>>> I didn’t get a response yet so just wanted to make sure that
> > > > >>>>>>>>>>> the message came across.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Jonathan
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Begin forwarded message:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> From: Jonathan Ozik <jozik at uchicago.edu>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511,
> > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> To: Mihael Hategan <hategan at mcs.anl.gov>,
> > > >>>>>>>>>>>> "swift-user at ci.uchicago.edu" <swift-user at ci.uchicago.edu>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> I’m getting spurious errors in the jobs that I’m running on
> > > > >>>>>>>>>>>> Blues. The stdout includes exceptions like:
> > > >>>>>>>>>>>> exception @ swift-int.k, line: 511
> > > >>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
> > > >>>>>>>>>>>> java.io.IOException: Broken pipe
> > > > >>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> > > > >>>>>>>>>>>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> > > > >>>>>>>>>>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> > > > >>>>>>>>>>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> > > > >>>>>>>>>>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
> > > > >>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
> > > > >>>>>>>>>>>> at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> These seem to occur at different parts of the submitted
> > > > >>>>>>>>>>>> jobs. Let me know if there’s a log file that you’d like to
> > > > >>>>>>>>>>>> look at.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> In earlier attempts I was getting these warnings followed by
> > > > >>>>>>>>>>>> broken pipe errors:
> > > >>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
> > > > >>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0)
> > > > >>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
> > > >>>>>>>>>>>> allocate large pages, falling back to regular pages
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Apparently that’s a known precursor of crashes on Java 7, as
> > > > >>>>>>>>>>>> described here
> > > > >>>>>>>>>>>> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
> > > >>>>>>>>>>>> Area: hotspot/gc
> > > >>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages.
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Linux, failures when allocating large pages can lead to
> > > > >>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue
> > > > >>>>>>>>>>>> can be recognized in two ways:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> • Before the crash happens one or more lines similar to this
> > > > >>>>>>>>>>>> will have been printed to the log:
> > > > >>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0)
> > > > >>>>>>>>>>>> failed;
> > > > >>>>>>>>>>>> error='Cannot allocate memory' (errno=12); Cannot allocate
> > > > >>>>>>>>>>>> large pages, falling back to regular pages
> > > >>>>>>>>>>>> • If a hs_err file is generated it will contain a line
> > > >>>>>>>>>>>> similar to this:
> > > >>>>>>>>>>>> Large page allocation failures have occurred 3 times
> > > >>>>>>>>>>>> The problem can be avoided by running with large page
> > > >>>>>>>>>>>> support turned off, for example by passing the
> > > >>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> See 8007074 (not public).
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> So I added the -XX:-UseLargePages option in the invocations
> > > > >>>>>>>>>>>> of Java code that I was responsible for. That seemed to get
> > > > >>>>>>>>>>>> rid of the warning and the crashes for a while, but perhaps
> > > > >>>>>>>>>>>> that was just a coincidence.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Jonathan
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> _______________________________________________
> > > >>>>>>>>>>>> Swift-user mailing list
> > > >>>>>>>>>>>> Swift-user at ci.uchicago.edu
> > > >>>>>>>>>>>>
> > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>> Michael Wilde
> > > > >>>>>>>>>> Mathematics and Computer Science          Computation Institute
> > > > >>>>>>>>>> Argonne National Laboratory               The University of Chicago
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > > >>
> > > >
> > > >
> > >
> > >
> >
> >
> >
> 
> 
> 