[Swift-devel] Coaster Task Submission Stalling

Tim Armstrong tim.g.armstrong at gmail.com
Thu Sep 11 13:54:21 CDT 2014


Yeah, local workers; this was started with start-coaster-service with the
conf file:

export WORKER_MODE=local
export IPADDR=127.0.0.1
export SERVICE_PORT=53363
export JOBSPERNODE=4
export LOGDIR=$(pwd)


- Tim

On Thu, Sep 11, 2014 at 1:44 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Passive workers?
>
> On Thu, 2014-09-11 at 13:26 -0500, Tim Armstrong wrote:
> > Oops, forgot about that
> >
> >
> >
> > On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > > The coaster logging was broken, and that brokenness caused it to print
> > > everything on stdout. That got fixed, so the actual log is now
> > > in ./cps*.log.
> > >
> > > So I probably need that log.
> > >
> > > Mihael
> > >
> > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote:
> > > > I meant the github master, but it turns out that I had the wrong Swift
> > > > on my path.  Apologies for the confusion.
> > > >
> > > > I've rerun with the current one.
> > > >
> > > > I'm getting a null pointer exception on line 226 of
> > > > BlockQueueProcessor.java.  Adding some printfs revealed that settings
> > > > was null.
> > > >
> > > > Log attached.
> > > >
> > > > - Tim
> > > >
> > > > Job: Job(id:0 600.000s)
> > > > Settings: null
> > > > java.lang.NullPointerException
> > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228)
> > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210)
> > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204)
> > > >     at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103)
> > > >     at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96)
> > > >     at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112)
> > > >     at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589)
> > > >     at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175)
> > > >     at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90)
> > > >
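A minimal, self-contained sketch of the failure mode described above. The
class and method names mirror the stack trace, but the bodies are assumptions
rather than the actual cog/coaster source: if the settings field is only
populated when the queue processor is started, a job enqueued before that
point reaches checkJob() with settings still null.

    class SettingsSketch {
        int maxtime = 600; // seconds; illustrative only
    }

    class JobSketch {
        final String id;
        JobSketch(String id) { this.id = id; }
    }

    public class BlockQueueProcessorSketch {
        private SettingsSketch settings; // null until the processor is started

        public void start(SettingsSketch s) {
            this.settings = s;
        }

        // enqueue() -> checkJob(), as in the trace above
        public void enqueue(JobSketch job) {
            checkJob(job);
        }

        private void checkJob(JobSketch job) {
            // A guard like this turns the bare NullPointerException into an
            // explicit "not initialized" error, which is easier to diagnose.
            if (settings == null) {
                throw new IllegalStateException("settings is null for job " + job.id);
            }
            System.out.println("maxtime=" + settings.maxtime + " for job " + job.id);
        }

        public static void main(String[] args) {
            BlockQueueProcessorSketch p = new BlockQueueProcessorSketch();
            // Submitting before start() reproduces the symptom from the log:
            p.enqueue(new JobSketch("0"));
        }
    }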
> > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong <tim.g.armstrong at gmail.com> wrote:
> > > >
> > > > > I thought I was running the latest trunk; I'll rebuild and see if I
> > > > > can reproduce the issue.
> > > > >
> > > > >  - Tim
> > > > >
> > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >
> > > > >> The method "getMetaChannel()" has been removed. Where did you get
> > > > >> the code from?
> > > > >>
> > > > >> Mihael
> > > > >>
> > > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote:
> > > > >> > I'm seeing failures when running Swift/T tests with
> > > > >> > start-coaster-service.sh.
> > > > >> >
> > > > >> > E.g. the turbine test coaster-exec-1.  I can provide instructions for
> > > > >> > running the test if needed (roughly, you need to build Swift/T with
> > > > >> > coaster support enabled, then make tests/coaster-exec-1.result in the
> > > > >> > turbine directory).  The github swift-t release is up to date if you
> > > > >> > want to use that.
> > > > >> >
> > > > >> > Full log is attached, stack trace excerpt is below.
> > > > >> >
> > > > >> > - Tim
> > > > >> >
> > > > >> > 2014-09-11 12:11:13,708-0500 INFO  BlockQueueProcessor Starting... id=0911-1112130
> > > > >> > Using threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242]
> > > > >> > 2014-09-11 12:11:13,708-0500 INFO  AbstractStreamCoasterChannel Using threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242]
> > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null @id://null-nullS
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452)
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
> > > > >> >     at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41)
> > > > >> >     at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135)
> > > > >> >     at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77)
> > > > >> >     at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88)
> > > > >> >     at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527)
> > > > >> >     at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173)
> > > > >> >     at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70)
> > > > >> > provider=local
> > > > >> > 2014-09-11 12:11:13,930-0500 INFO  ExecutionTaskHandler provider=local
> > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null @id://null-nullS
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452)
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432)
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472)
> > > > >> >     at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80)
> > > > >> >     at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88)
> > > > >> >     at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527)
> > > > >> >     at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173)
> > > > >> >     at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70)
> > > > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job description
> > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job description
> > > > >> >     at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84)
> > > > >> >     at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88)
> > > > >> >     at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527)
> > > > >> >     at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173)
> > > > >> >     at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70)
> > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid channel: null @id://null-nullS
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452)
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432)
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472)
> > > > >> >     at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80)
> > > > >> >     ... 4 more
> > > > >> > 2014-09-11 12:11:13,937-0500 INFO  RequestReply Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job description
> > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job description
> > > > >> >     at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84)
> > > > >> >     at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88)
> > > > >> >     at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527)
> > > > >> >     at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173)
> > > > >> >     at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70)
> > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid channel: null @id://null-nullS
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452)
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432)
> > > > >> >     at org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472)
> > > > >> >     at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80)
> > > > >> >     ... 4 more
> > > > >> >
> > > > >> >
> > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong <tim.g.armstrong at gmail.com> wrote:
> > > > >> >
> > > > >> > > This all sounds great.
> > > > >> > >
> > > > >> > > Just to check that I've understood correctly, from the client's
> > > > >> > > point of view:
> > > > >> > > * The per-client settings behave the same if -shared is not provided.
> > > > >> > > * Per-client settings are ignored if -shared is provided.
> > > > >> > >
> > > > >> > > I had one question:
> > > > >> > > * Do automatically allocated workers work with per-client settings?
> > > > >> > > I understand there were some issues related to sharing workers
> > > > >> > > between clients.  Was the solution to have separate worker pools, or
> > > > >> > > is this just not supported?
> > > > >> > >
> > > > >> > > - Tim
> > > > >> > >
> > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> > >
> > > > >> > >> So...
> > > > >> > >>
> > > > >> > >> There were bugs. Lots of bugs.
> > > > >> > >> I did some work over the weekend to fix some of these and clean up
> > > > >> > >> the coaster code. Here's a summary:
> > > > >> > >>
> > > > >> > >> - there was some stuff in the low level coaster code to deal with
> > > > >> > >> persisting coaster channels over multiple connections with various
> > > > >> > >> options, like periodic connections, client or server initiated
> > > > >> > >> connections, buffering of commands, etc. None of this was used by
> > > > >> > >> Swift, and the code was pretty messy. I removed that.
> > > > >> > >> - there were some issues with multiple clients:
> > > > >> > >>   * improper shutdown of relevant workers when a client disconnected
> > > > >> > >>   * the worker task dispatcher was a singleton and had a reference
> > > > >> > >> to one block allocator, whereas multiple clients involved multiple
> > > > >> > >> allocators.
> > > > >> > >> - there were a bunch of locking issues in the C client that valgrind
> > > > >> > >> caught
> > > > >> > >> - the idea of remote job ids was a bit hard to work with. This
> > > > >> > >> remote id was the job id that the service assigned to a job. This is
> > > > >> > >> necessary because two different clients can submit jobs with the
> > > > >> > >> same id. The remote id would be communicated to the client as the
> > > > >> > >> reply to the submit request. However, it was entirely possible for a
> > > > >> > >> notification about job status to be sent to the client before the
> > > > >> > >> submit reply was. Since notifications were sent using the remote id,
> > > > >> > >> the client would have no idea what job the notifications belonged
> > > > >> > >> to. Now, the server might need a unique job id, but there is no
> > > > >> > >> reason why it cannot use the client id when communicating the status
> > > > >> > >> to a client. So that's there now (see the sketch after this list).
> > > > >> > >> - the way the C client was working, its jobs ended up not going to
> > > > >> > >> the workers, but to the local queue. The service settings now allow
> > > > >> > >> specifying the provider/jobManager/url to be used to start blocks,
> > > > >> > >> and jobs are routed appropriately if they do not have the batch job
> > > > >> > >> flag set.
> > > > >> > >>
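A rough sketch of the remote-id/client-id idea from the list above, using
illustrative names rather than the actual coaster service code: the service
still assigns its own unique id internally, but notifications back to a client
are keyed by the id that client chose, so a status update is meaningful even
if it arrives before the submit reply does.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class JobIdMappingSketch {
        private final AtomicLong nextRemoteId = new AtomicLong();
        // remote (service-assigned) id -> client-assigned id
        private final Map<Long, String> remoteToClientId = new ConcurrentHashMap<>();

        // Called when a submit request arrives; returns the service-internal id
        // that goes back in the submit reply.
        public long submit(String clientJobId) {
            long remoteId = nextRemoteId.incrementAndGet();
            remoteToClientId.put(remoteId, clientJobId);
            return remoteId;
        }

        // Called when the job changes state; the notification is keyed by the
        // client's own id, so the client can always match it to a job, even if
        // the submit reply carrying remoteId has not reached it yet.
        // The "JOBSTATUS" wording here is illustrative, not the wire protocol.
        public String notification(long remoteId, String state) {
            String clientJobId = remoteToClientId.get(remoteId);
            return "JOBSTATUS " + clientJobId + " " + state;
        }

        public static void main(String[] args) {
            JobIdMappingSketch svc = new JobIdMappingSketch();
            long rid = svc.submit("client-job-42");
            System.out.println(svc.notification(rid, "ACTIVE"));
        }
    }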
> > > > >> > >> I also added a shared service mode. We discussed this before.
> > > > >> > >> Basically you start the coaster service with "-shared
> > > > >> > >> <service.properties>" and all the settings are read from that file.
> > > > >> > >> In this case, all clients share the same worker pool, and client
> > > > >> > >> settings are ignored.
> > > > >> > >>
> > > > >> > >> The C client now has a multi-job testing tool which can submit many
> > > > >> > >> jobs with the desired level of concurrency.
> > > > >> > >>
> > > > >> > >> I have tested the C client with both shared and non-shared mode,
> > > > >> > >> with various levels of jobs being sent, with either one or two
> > > > >> > >> concurrent clients.
> > > > >> > >>
> > > > >> > >> I haven't tested manual workers.
> > > > >> > >>
> > > > >> > >> I've also decided that during normal operation (i.e. client
> > > > >> > >> connects, submits jobs, shuts down gracefully), there should be no
> > > > >> > >> exceptions in the coaster log. I think we should stick to that
> > > > >> > >> principle. This was the case last I tested, and we should consider
> > > > >> > >> any deviation from that to be a problem. Of course, there are some
> > > > >> > >> things for which there is no graceful shut down, such as ctrl+C-ing
> > > > >> > >> a manual worker. Exceptions are fine in that case.
> > > > >> > >>
> > > > >> > >> So anyway, let's start from here.
> > > > >> > >>
> > > > >> > >> Mihael
> > > > >> > >>
> > > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote:
> > > > >> > >> > Thanks, let me know if there's anything I can help do.
> > > > >> > >> >
> > > > >> > >> > - Tim
> > > > >> > >> >
> > > > >> > >> >
> > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> > >> >
> > > > >> > >> > > Thanks. It also seems that there is an older bug in there in
> > > > >> > >> > > which the client connection is not properly accounted for and
> > > > >> > >> > > things start failing two minutes after the client connects
> > > > >> > >> > > (which is also probably why you didn't see this in runs with
> > > > >> > >> > > many short client connections). I'm not sure why the fix for
> > > > >> > >> > > that bug isn't in the trunk code.
> > > > >> > >> > >
> > > > >> > >> > > In any event, I'll set up a client submission loop and fix all
> > > > >> > >> > > these things.
> > > > >> > >> > >
> > > > >> > >> > > Mihael
> > > > >> > >> > >
> > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote:
> > > > >> > >> > > > Ok, here it is with the additional debug messages.  Source
> > > > >> > >> > > > code change is in commit 890c41f2ba701b10264553471590096d6f94c278.
> > > > >> > >> > > >
> > > > >> > >> > > > Warning: the tarball will expand to several gigabytes of logs.
> > > > >> > >> > > >
> > > > >> > >> > > > I had to do multiple client runs to trigger it.  It seems like
> > > > >> > >> > > > the problem might be triggered by abnormal termination of the
> > > > >> > >> > > > client.  The first 18 runs went fine; the problem only started
> > > > >> > >> > > > when I ctrl-c-ed the swift/t run #19 before the run #20 that
> > > > >> > >> > > > exhibited delays.
> > > > >> > >> > > >
> > > > >> > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz
> > > > >> > >> > > >
> > > > >> > >> > > > - Tim
> > > > >> > >> > > >
> > > > >> > >> > > >
> > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong <tim.g.armstrong at gmail.com> wrote:
> > > > >> > >> > > >
> > > > >> > >> > > > > It's here:
> > > > >> > >> > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz
> > > > >> > >> > > > >
> > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ client
> > > > >> > >> > > > > and see if I can recreate the scenario.
> > > > >> > >> > > > >
> > > > >> > >> > > > > - Tim
> > > > >> > >> > > > >
> > > > >> > >> > > > >
> > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> > >> > > > >
> > > > >> > >> > > > >> Ok, so that's legit.
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> It does look like shut down workers are not properly
> > > > >> > >> > > > >> accounted for in some places (and I believe Yadu submitted
> > > > >> > >> > > > >> a bug for this). However, I do not see the dead time you
> > > > >> > >> > > > >> mention in either of the last two sets of logs. It looks
> > > > >> > >> > > > >> like each client instance submits a continuous stream of
> > > > >> > >> > > > >> jobs.
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> So let's get back to the initial log. Can I have the full
> > > > >> > >> > > > >> service log?  I'm trying to track what happened with the
> > > > >> > >> > > > >> jobs submitted before the first big pause.
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus()
> > > > >> > >> > > > >> (or friends) would probably help a lot here.
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> Mihael
> > > > >> > >> > > > >>
> > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote:
> > > > >> > >> > > > >> > Should be here:
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> > >> > > > >> >
> > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331.  The log
> > > > >> > >> > > > >> > > looks funny at the end.
> > > > >> > >> > > > >> > >
> > > > >> > >> > > > >> > > Can you git pull and re-run? The worker is getting some
> > > > >> > >> > > > >> > > command at the end there and doing nothing about it and
> > > > >> > >> > > > >> > > I wonder why.
> > > > >> > >> > > > >> > >
> > > > >> > >> > > > >> > > Mihael
> > > > >> > >> > > > >> > >
> > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote:
> > > > >> > >> > > > >> > > > Ok, now I have some worker logs:
> > > > >> > >> > > > >> > > >
> > > > >> > >> > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz
> > > > >> > >> > > > >> > > >
> > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker logs that
> > > > >> > >> > > > >> > > > would indicate why the connection was broken.
> > > > >> > >> > > > >> > > >
> > > > >> > >> > > > >> > > > - Tim
> > > > >> > >> > > > >> > > >
> > > > >> > >> > > > >> > > >
> > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong <tim.g.armstrong at gmail.com> wrote:
> > > > >> > >> > > > >> > > >
> > > > >> > >> > > > >> > > > > This is all running locally on my laptop, so I
> > > > >> > >> > > > >> > > > > think we can rule out 1).
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster service
> > > > >> > >> > > > >> > > > > gets into after a few client sessions: generally
> > > > >> > >> > > > >> > > > > the first coaster run works fine, then after a few
> > > > >> > >> > > > >> > > > > runs the problem occurs more frequently.
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > > > > I'm going to try and get worker logs; in the
> > > > >> > >> > > > >> > > > > meantime I've got some jstacks (attached).
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if needed:
> > > > >> > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > > > >> Ah, makes sense.
> > > > >> > >> > > > >> > > > >>
> > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live
> > > > >> > >> > > > >> > > > >> connection is guaranteed to have some communication
> > > > >> > >> > > > >> > > > >> for any 2 minute time window, partially due to
> > > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no
> > > > >> > >> > > > >> > > > >> packets flow for the duration of 2 minutes, the
> > > > >> > >> > > > >> > > > >> connection is assumed broken and all jobs that were
> > > > >> > >> > > > >> > > > >> submitted to the respective workers are considered
> > > > >> > >> > > > >> > > > >> failed.  So there seems to be an issue with the
> > > > >> > >> > > > >> > > > >> connections to some of the workers, and it takes 2
> > > > >> > >> > > > >> > > > >> minutes to detect them.
> > > > >> > >> > > > >> > > > >>
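A small, self-contained sketch of the detection window described above, with
illustrative names rather than the actual cog channel code: incoming packets
(including the 1 minute heartbeats) refresh a timestamp, and a connection with
no traffic for 2 minutes is declared broken, at which point the jobs on that
worker are failed.

    import java.util.concurrent.TimeUnit;

    public class ChannelTimeoutSketch {
        static final long HEARTBEAT_MS = TimeUnit.MINUTES.toMillis(1); // heartbeat interval
        static final long TIMEOUT_MS = TimeUnit.MINUTES.toMillis(2);   // dead-connection window

        private volatile long lastPacketTime = System.currentTimeMillis();

        // Any received packet, including a heartbeat, refreshes the timestamp.
        public void packetReceived() {
            lastPacketTime = System.currentTimeMillis();
        }

        // Checked periodically by a watchdog; once this returns true, the
        // connection is treated as broken and jobs on that worker are failed.
        public boolean isBroken() {
            return System.currentTimeMillis() - lastPacketTime > TIMEOUT_MS;
        }

        public static void main(String[] args) {
            ChannelTimeoutSketch channel = new ChannelTimeoutSketch();
            // With heartbeats flowing every HEARTBEAT_MS, isBroken() never trips.
            // If a worker dies without closing its socket, nothing refreshes the
            // timestamp, so the first visible symptom appears a full 2 minutes
            // later, which matches the stalls being reported in this thread.
            System.out.println("broken? " + channel.isBroken());
        }
    }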
> > > > >> > >> > > > >> > > > >> Since the service seems to be alive (although a
> > > > >> > >> > > > >> > > > >> jstack on the service when things seem to hang
> > > > >> > >> > > > >> > > > >> might help), this leaves two possibilities:
> > > > >> > >> > > > >> > > > >> 1 - some genuine network problem
> > > > >> > >> > > > >> > > > >> 2 - the worker died without properly closing TCP
> > > > >> > >> > > > >> > > > >> connections
> > > > >> > >> > > > >> > > > >>
> > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging
> > > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to
> > > > >> > >> > > > >> > > > >> see if anything shows up.
> > > > >> > >> > > > >> > > > >>
> > > > >> > >> > > > >> > > > >> Mihael
> > > > >> > >> > > > >> > > > >>
> > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote:
> > > > >> > >> > > > >> > > > >> > Here are client and service logs, with part of
> > > > >> > >> > > > >> > > > >> > the service log edited down to be a reasonable
> > > > >> > >> > > > >> > > > >> > size (I have the full thing if needed, but it
> > > > >> > >> > > > >> > > > >> > was over a gigabyte).
> > > > >> > >> > > > >> > > > >> >
> > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards.
> > > > >> > >> > > > >> > > > >> > The client submits 4 jobs (its limit), but they
> > > > >> > >> > > > >> > > > >> > don't complete until 19:51:32 or so (I can see
> > > > >> > >> > > > >> > > > >> > that one task completed based on ncompleted=1 in
> > > > >> > >> > > > >> > > > >> > the check_tasks log message).  It looks like
> > > > >> > >> > > > >> > > > >> > something has happened with broken pipes and
> > > > >> > >> > > > >> > > > >> > workers being lost, but I'm not sure what the
> > > > >> > >> > > > >> > > > >> > ultimate cause of that is likely to be.
> > > > >> > >> > > > >> > > > >> >
> > > > >> > >> > > > >> > > > >> > - Tim
> > > > >> > >> > > > >> > > > >> >
> > > > >> > >> > > > >> > > > >> >
> > > > >> > >> > > > >> > > > >> >
> > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> > >> > > > >> > > > >> >
> > > > >> > >> > > > >> > > > >> > > Hi Tim,
> > > > >> > >> > > > >> > > > >> > >
> > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure
> Java.
> > > > >> > >> > > > >> > > > >> > >
> > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs?
> > > > >> > >> > > > >> > > > >> > >
> > > > >> > >> > > > >> > > > >> > > Mihael
> > > > >> > >> > > > >> > > > >> > >
> > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:
> > > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that
> > > > >> > >> > > > >> > > > >> > > > submits tasks to Coasters through the C++
> > > > >> > >> > > > >> > > > >> > > > client and I'm seeing some odd behaviour
> > > > >> > >> > > > >> > > > >> > > > where task submission/execution is stalling
> > > > >> > >> > > > >> > > > >> > > > for ~2 minute periods.  For example, I'm
> > > > >> > >> > > > >> > > > >> > > > seeing submit log messages like "submitting
> > > > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378:
> > > > >> > >> > > > >> > > > >> > > > /bin/hostname" in bursts of several seconds
> > > > >> > >> > > > >> > > > >> > > > with a gap of roughly 2 minutes in between,
> > > > >> > >> > > > >> > > > >> > > > e.g. I'm seeing bursts with the following
> > > > >> > >> > > > >> > > > >> > > > intervals in my logs.
> > > > >> > >> > > > >> > > > >> > > >
> > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391
> > > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076
> > > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770
> > > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296
> > > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602
> > > > >> > >> > > > >> > > > >> > > >
> > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the
> > > > >> > >> > > > >> > > > >> > > > coaster service side: the C client is just
> > > > >> > >> > > > >> > > > >> > > > waiting for a response.
> > > > >> > >> > > > >> > > > >> > > >
> > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted through
> > > > >> > >> > > > >> > > > >> > > > the local job manager, so I wouldn't expect
> > > > >> > >> > > > >> > > > >> > > > any delays there.  The tasks are also just
> > > > >> > >> > > > >> > > > >> > > > "/bin/hostname", so they should return
> > > > >> > >> > > > >> > > > >> > > > immediately.
> > > > >> > >> > > > >> > > > >> > > >
> > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into this on
> > > > >> > >> > > > >> > > > >> > > > my own, but the 2 minute delay seems like a
> > > > >> > >> > > > >> > > > >> > > > big clue: does anyone have an idea what could
> > > > >> > >> > > > >> > > > >> > > > cause stalls in task submission of 2 minute
> > > > >> > >> > > > >> > > > >> > > > duration?
> > > > >> > >> > > > >> > > > >> > > >
> > > > >> > >> > > > >> > > > >> > > > Cheers,
> > > > >> > >> > > > >> > > > >> > > > Tim
> > > > >> > >> > > > >> > > > >> > >
> > > > >> > >> > > > >> > > > >> > >
> > > > >> > >> > > > >> > > > >> > >
> > > > >> > >> > > > >> > > > >>
> > > > >> > >> > > > >> > > > >>
> > > > >> > >> > > > >> > > > >>
> > > > >> > >> > > > >> > > > >
> > > > >> > >> > > > >> > >
> > > > >> > >> > > > >> > >
> > > > >> > >> > > > >> > >
> > > > >> > >> > > > >>
> > > > >> > >> > > > >>
> > > > >> > >> > > > >>
> > > > >> > >> > > > >
> > > > >> > >> > >
> > > > >> > >> > >
> > > > >> > >> > >
> > > > >> > >>
> > > > >> > >>
> > > > >> > >>
> > > > >> > >
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > >
> > >
> > >
>
>
>

