[Swift-devel] Coaster Task Submission Stalling

Mihael Hategan hategan at mcs.anl.gov
Mon Sep 8 14:38:21 CDT 2014


So...

There were bugs. Lots of bugs.
I did some work over the weekend to fix some of these and clean up the
coaster code. Here's a summary:

- there was some code in the low-level coaster layer for persisting
coaster channels across multiple connections, with various options:
periodic connections, client- or server-initiated connections, buffering
of commands, etc. None of this was used by Swift, and the code was pretty
messy, so I removed it.
- there were some issues with multiple clients:
  * the workers belonging to a client were not properly shut down when
that client disconnected
  * the worker task dispatcher was a singleton and held a reference to a
single block allocator, whereas multiple clients involve multiple
allocators.
- there were a bunch of locking issues in the C client that valgrind
caught
- the idea of remote job ids was a bit hard to work with. The remote id
was the id that the service assigned to a job; this is necessary because
two different clients can submit jobs with the same id. The remote id was
communicated to the client in the reply to the submit request. However,
it was entirely possible for a job status notification to be sent to the
client before the submit reply was, and since notifications used the
remote id, the client had no idea which job they belonged to. The server
may still need a unique internal job id, but there is no reason it cannot
use the client's id when communicating status back to that client. That
is what happens now (see the sketch after this list).
- the way the C client was working, its jobs ended up going not to the
workers but to the local queue. The service settings now allow specifying
the provider/jobManager/url to be used to start blocks, and jobs are
routed appropriately if they do not have the batch job flag set.
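
To make the id change concrete, here is a minimal sketch of the kind of
bookkeeping involved. The names are hypothetical and do not correspond to
the actual coaster sources; the point is only that the service can keep a
unique internal key while always notifying a client with the id that the
client itself submitted:

  // Hypothetical sketch only -- names do not match the coaster code base.
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  public class JobIdRegistry {
      // internal unique key -> (client, id used by that client)
      private final Map<String, Entry> jobs =
          new ConcurrentHashMap<String, Entry>();

      private static class Entry {
          final String clientId;    // identifies the client connection
          final String clientJobId; // id the client submitted the job with

          Entry(String clientId, String clientJobId) {
              this.clientId = clientId;
              this.clientJobId = clientJobId;
          }
      }

      // Called on submit. The internal key is unique across clients even
      // if two clients happen to use the same job id.
      public String register(String clientId, String clientJobId) {
          String internalId = clientId + ":" + clientJobId;
          jobs.put(internalId, new Entry(clientId, clientJobId));
          return internalId;
      }

      // Called when sending a status notification. The client always sees
      // the id it already knows, so a notification that arrives before
      // the submit reply is still meaningful.
      public String notificationId(String internalId) {
          return jobs.get(internalId).clientJobId;
      }
  }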

I also added a shared service mode. We discussed this before. Basically
you start the coaster service with "-shared <service.properties>" and
all the settings are read from that file. In this case, all clients
share the same worker pool, and client settings are ignored.
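
As a rough illustration of those semantics (again hypothetical code, not
the actual service classes): in shared mode the settings are loaded once
from the given file and whatever the client sends is ignored, while in
non-shared mode the client-supplied value is used:

  // Hypothetical sketch only -- not the actual coaster settings code.
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.util.Properties;

  public class ServiceSettings {
      // non-null only when the service was started with -shared <file>
      private final Properties shared;

      public ServiceSettings(String sharedPropertiesFile) throws IOException {
          if (sharedPropertiesFile != null) {
              shared = new Properties();
              FileInputStream in = new FileInputStream(sharedPropertiesFile);
              try {
                  shared.load(in);
              } finally {
                  in.close();
              }
          } else {
              shared = null;
          }
      }

      // In shared mode the file value is authoritative and per-client
      // settings are ignored; otherwise the client-supplied value is used.
      public String get(String key, String clientSuppliedValue) {
          return shared != null ? shared.getProperty(key) : clientSuppliedValue;
      }
  }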

The C client now has a multi-job testing tool which can submit many jobs
with the desired level of concurrency.

I have tested the C client in both shared and non-shared mode, with
various numbers of jobs being submitted, and with either one or two
concurrent clients.

I haven't tested manual workers.

I've also decided that during normal operation (i.e. a client connects,
submits jobs, and shuts down gracefully) there should be no exceptions in
the coaster log, and I think we should stick to that principle. This was
the case the last time I tested, and we should consider any deviation
from it to be a problem. Of course, there are some things for which there
is no graceful shutdown, such as ctrl+C-ing a manual worker; exceptions
are fine in that case.

So anyway, let's start from here.

Mihael

On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote:
> Thanks, let me know if there's anything I can help do.
> 
> - Tim
> 
> 
> On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > Thanks. It also seems that there is an older bug in there in which the
> > client connection is not properly accounted for and things start failing
> > two minutes after the client connects (which is also probably why you
> > didn't see this in runs with many short client connections). I'm not
> > sure why the fix for that bug isn't in the trunk code.
> >
> > In any event, I'll set up a client submission loop and fix all these
> > things.
> >
> > Mihael
> >
> > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote:
> > > Ok, here it is with the additional debug messages.  Source code change is
> > > in commit 890c41f2ba701b10264553471590096d6f94c278.
> > >
> > > Warning: the tarball will expand to several gigabytes of logs.
> > >
> > > I had to do multiple client runs to trigger it.  It seems like the
> > > problem might be triggered by abnormal termination of the client.
> > > First 18 runs went fine, problem only started when I ctrl-c-ed the
> > > swift/t run #19 before the run #20 that exhibited delays.
> > >
> > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz
> > >
> > > - Tim
> > >
> > >
> > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong
> > > <tim.g.armstrong at gmail.com> wrote:
> > >
> > > > It's here:
> > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz .
> > > >
> > > > I'll add some extra debug messages in the coaster C++ client and
> > > > see if I can recreate the scenario.
> > > >
> > > > - Tim
> > > >
> > > >
> > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > > > wrote:
> > > >
> > > >> Ok, so that's legit.
> > > >>
> > > >> It does look like shut down workers are not properly accounted for
> > > >> in some places (and I believe Yadu submitted a bug for this).
> > > >> However, I do not see the dead time you mention in either of the
> > > >> last two sets of logs. It looks like each client instance submits a
> > > >> continuous stream of jobs.
> > > >>
> > > >> So let's get back to the initial log. Can I have the full service log?
> > > >> I'm trying to track what happened with the jobs submitted before the
> > > >> first big pause.
> > > >>
> > > >> Also, a log message in CoasterClient::updateJobStatus() (or friends)
> > > >> would probably help a lot here.
> > > >>
> > > >> Mihael
> > > >>
> > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote:
> > > >> > Should be here:
> > > >> >
> > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan
> > > >> > <hategan at mcs.anl.gov> wrote:
> > > >> >
> > > >> > > The first worker "failing" is 0904-20022331. The log looks
> > > >> > > funny at the end.
> > > >> > >
> > > >> > > Can you git pull and re-run? The worker is getting some command
> > > >> > > at the end there and doing nothing about it and I wonder why.
> > > >> > >
> > > >> > > Mihael
> > > >> > >
> > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote:
> > > >> > > > Ok, now I have some worker logs:
> > > >> > > >
> > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz
> > > >> > > >
> > > >> > > > There's nothing obvious I see in the worker logs that would
> > > >> > > > indicate why the connection was broken.
> > > >> > > >
> > > >> > > > - Tim
> > > >> > > >
> > > >> > > >
> > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong
> > > >> > > > <tim.g.armstrong at gmail.com> wrote:
> > > >> > > >
> > > >> > > > > This is all running locally on my laptop, so I think we can
> > > >> > > > > rule out 1).
> > > >> > > > >
> > > >> > > > > It also seems like it's a state the coaster service gets
> > > >> > > > > into after a few client sessions: generally the first
> > > >> > > > > coaster run works fine, then after a few runs the problem
> > > >> > > > > occurs more frequently.
> > > >> > > > >
> > > >> > > > > I'm going to try and get worker logs; in the meantime I've
> > > >> > > > > got some jstacks (attached).
> > > >> > > > >
> > > >> > > > > Matching service logs (largish) are here if needed:
> > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan
> > > >> > > > > <hategan at mcs.anl.gov> wrote:
> > > >> > > > >
> > > >> > > > >> Ah, makes sense.
> > > >> > > > >>
> > > >> > > > >> 2 minutes is the channel timeout. Each live connection is
> > > >> > > > >> guaranteed to have some communication for any 2 minute time
> > > >> > > > >> window, partially due to periodic heartbeats (sent every 1
> > > >> > > > >> minute). If no packets flow for the duration of 2 minutes,
> > > >> > > > >> the connection is assumed broken and all jobs that were
> > > >> > > > >> submitted to the respective workers are considered failed.
> > > >> > > > >> So there seems to be an issue with the connections to some
> > > >> > > > >> of the workers, and it takes 2 minutes to detect them.
> > > >> > > > >>
> > > >> > > > >> Since the service seems to be alive (although a jstack on
> > > >> > > > >> the service when things seem to hang might help), this
> > > >> > > > >> leaves two possibilities:
> > > >> > > > >> 1 - some genuine network problem
> > > >> > > > >> 2 - the worker died without properly closing TCP connections
> > > >> > > > >>
> > > >> > > > >> If (2), you could enable worker logging
> > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if
> > > >> > > > >> anything shows up.
> > > >> > > > >>
> > > >> > > > >> Mihael
> > > >> > > > >>
> > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote:
> > > >> > > > >> > Here are client and service logs, with part of service
> > > >> > > > >> > log edited down to be a reasonable size (I have the full
> > > >> > > > >> > thing if needed, but it was over a gigabyte).
> > > >> > > > >> >
> > > >> > > > >> > One relevant section is from 19:49:35 onwards.  The
> > > >> > > > >> > client submits 4 jobs (its limit), but they don't
> > > >> > > > >> > complete until 19:51:32 or so (I can see that one task
> > > >> > > > >> > completed based on ncompleted=1 in the check_tasks log
> > > >> > > > >> > message).  It looks like something has happened with
> > > >> > > > >> > broken pipes and workers being lost, but I'm not sure
> > > >> > > > >> > what the ultimate cause of that is likely to be.
> > > >> > > > >> >
> > > >> > > > >> > - Tim
> > > >> > > > >> >
> > > >> > > > >> >
> > > >> > > > >> >
> > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan
> > > >> > > > >> > <hategan at mcs.anl.gov> wrote:
> > > >> > > > >> >
> > > >> > > > >> > > Hi Tim,
> > > >> > > > >> > >
> > > >> > > > >> > > I've never seen this before with pure Java.
> > > >> > > > >> > >
> > > >> > > > >> > > Do you have logs from these runs?
> > > >> > > > >> > >
> > > >> > > > >> > > Mihael
> > > >> > > > >> > >
> > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:
> > > >> > > > >> > > > I'm running a test Swift/T script that submits tasks
> > > >> > > > >> > > > to Coasters through the C++ client and I'm seeing
> > > >> > > > >> > > > some odd behaviour where task submission/execution is
> > > >> > > > >> > > > stalling for ~2 minute periods.  For example, I'm
> > > >> > > > >> > > > seeing submit log messages like "submitting
> > > >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname"
> > > >> > > > >> > > > in bursts of several seconds with a gap of roughly 2
> > > >> > > > >> > > > minutes in between, e.g. I'm seeing bursts with the
> > > >> > > > >> > > > following intervals in my logs.
> > > >> > > > >> > > >
> > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391
> > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076
> > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770
> > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296
> > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602
> > > >> > > > >> > > >
> > > >> > > > >> > > > From what I can tell, the delay is on the coaster
> > > >> > > > >> > > > service side: the C client is just waiting for a
> > > >> > > > >> > > > response.
> > > >> > > > >> > > >
> > > >> > > > >> > > > The jobs are just being submitted through the local
> > > >> > > > >> > > > job manager, so I wouldn't expect any delays there.
> > > >> > > > >> > > > The tasks are also just "/bin/hostname", so should
> > > >> > > > >> > > > return immediately.
> > > >> > > > >> > > >
> > > >> > > > >> > > > I'm going to continue digging into this on my own,
> > > >> > > > >> > > > but the 2 minute delay seems like a big clue: does
> > > >> > > > >> > > > anyone have an idea what could cause stalls in task
> > > >> > > > >> > > > submission of 2 minute duration?
> > > >> > > > >> > > >
> > > >> > > > >> > > > Cheers,
> > > >> > > > >> > > > Tim
> > > >> > > > >> > >
> > > >> > > > >> > >
> > > >> > > > >> > >
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >>
> > > >>
> > > >>
> > > >
> >
> >
> >




