[Swift-devel] Coaster Task Submission Stalling

Tim Armstrong tim.g.armstrong at gmail.com
Thu Sep 11 10:30:05 CDT 2014


This all sounds great.

Just to check that I've understood correctly, from the client's point of
view:
* Per-client settings behave the same as before if -shared is not provided.
* Per-client settings are ignored if -shared is provided.

I had one question:
* Do automatically allocated workers work with per-client settings?  I
understand there were some issues related to sharing workers between
clients.  Was the solution to have separate worker pools, or is this just
not supported?

- Tim

On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> So...
>
> There were bugs. Lots of bugs.
> I did some work over the weekend to fix some of these and clean up the
> coaster code. Here's a summary:
>
> - there was some stuff in the low-level coaster code to deal with
> persisting coaster channels over multiple connections with various
> options, like periodic connections, client- or server-initiated
> connections, buffering of commands, etc. None of this was used by Swift,
> and the code was pretty messy. I removed that.
> - there were some issues with multiple clients:
>   * improper shutdown of relevant workers when a client disconnected
>   * the worker task dispatcher was a singleton and had a reference to
> one block allocator, whereas multiple clients involved multiple
> allocators.
> - there were a bunch of locking issues in the C client that valgrind
> caught
> - the idea of remote job ids was a bit hard to work with. This remote id
> was the job id that the service assigned to a job. This is necessary
> because two different clients can submit jobs with the same id. The
> remote id would be communicated to the client as the reply to the submit
> request. However, it was entirely possible for a notification about job
> status to be sent to the client before the submit reply was. Since
> notifications were sent using the remote-id, the client would have no
> idea what job the notifications belonged to. Now, the server might need
> a unique job id, but there is no reason why it cannot use the client id
> when communicating the status to a client. So that's there now (a small
> sketch of the idea follows this list).
> - the way the C client was working, its jobs ended up not going to the
> workers, but to the local queue. The service settings now allow specifying
> the provider/jobManager/url to be used to start blocks, and jobs are
> routed appropriately if they do not have the batch job flag set.
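>
> For illustration, here is a minimal sketch of the client-id/remote-id
> bookkeeping described above. This is not the actual coaster service code;
> the class and method names are made up.
>
>     // JobIdMapSketch.java -- illustrative only, not the coaster service code.
>     import java.util.Map;
>     import java.util.concurrent.ConcurrentHashMap;
>
>     public class JobIdMapSketch {
>         // The service still needs its own unique id per job, since two
>         // clients can submit jobs that carry the same client-side id.
>         private final Map<String, String> remoteToClientId =
>                 new ConcurrentHashMap<String, String>();
>
>         // On submit: remember which client-side id a server-assigned id maps to.
>         public void registerJob(String clientJobId, String remoteJobId) {
>             remoteToClientId.put(remoteJobId, clientJobId);
>         }
>
>         // On a status change: translate back to the client's id, so the
>         // notification makes sense even if it arrives before the submit reply.
>         public String idForNotification(String remoteJobId) {
>             String clientId = remoteToClientId.get(remoteJobId);
>             return clientId != null ? clientId : remoteJobId;
>         }
>     }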
>
> I also added a shared service mode. We discussed this before. Basically
> you start the coaster service with "-shared <service.properties>" and
> all the settings are read from that file. In this case, all clients
> share the same worker pool, and client settings are ignored.
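>
> As a rough sketch of that precedence (not the actual service code; the key
> names below are only guesses based on the provider/jobManager/url settings
> mentioned above):
>
>     // SharedSettingsSketch.java -- illustrative only.
>     import java.io.FileInputStream;
>     import java.io.IOException;
>     import java.util.Properties;
>
>     public class SharedSettingsSketch {
>         public static void main(String[] args) throws IOException {
>             // "-shared <service.properties>": everything comes from the file
>             // and per-client settings are ignored.
>             boolean shared = args.length == 2 && args[0].equals("-shared");
>             Properties settings = new Properties();
>             if (shared) {
>                 try (FileInputStream in = new FileInputStream(args[1])) {
>                     settings.load(in);
>                 }
>             }
>             // In shared mode these decide how blocks are started; otherwise
>             // the equivalent values would come from each client.
>             System.out.println("provider   = " + settings.getProperty("provider"));
>             System.out.println("jobManager = " + settings.getProperty("jobManager"));
>             System.out.println("url        = " + settings.getProperty("url"));
>         }
>     }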
>
> The C client now has a multi-job testing tool which can submit many jobs
> with the desired level of concurrency.
>
> I have tested the C client in both shared and non-shared mode, with
> various numbers of jobs being sent, and with either one or two concurrent
> clients.
>
> I haven't tested manual workers.
>
> I've also decided that during normal operation (i.e. client connects,
> submits jobs, shuts down gracefully), there should be no exceptions in
> the coaster log. I think we should stick to that principle. This was the
> case last I tested, and we should consider any deviation from that to be
> a problem. Of course, there are some things for which there is no
> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are
> fine in that case.
>
> So anyway, let's start from here.
>
> Mihael
>
> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote:
> > Thanks, let me know if there's anything I can help do.
> >
> > - Tim
> >
> >
> > > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > > Thanks. It also seems that there is an older bug in there in which the
> > > client connection is not properly accounted for and things start failing
> > > two minutes after the client connects (which is also probably why you
> > > didn't see this in runs with many short client connections). I'm not
> > > sure why the fix for that bug isn't in the trunk code.
> > >
> > > In any event, I'll set up a client submission loop and fix all these
> > > things.
> > >
> > > Mihael
> > >
> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote:
> > > > Ok, here it is with the additional debug messages.  Source code change
> > > > is in commit 890c41f2ba701b10264553471590096d6f94c278.
> > > >
> > > > Warning: the tarball will expand to several gigabytes of logs.
> > > >
> > > > I had to do multiple client runs to trigger it.  It seems like the
> > > > problem might be triggered by abnormal termination of the client.  First
> > > > 18 runs went fine, problem only started when I ctrl-c-ed the swift/t run
> > > > #19 before the run #20 that exhibited delays.
> > > >
> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz
> > > >
> > > > - Tim
> > > >
> > > >
> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong <tim.g.armstrong at gmail.com> wrote:
> > > >
> > > > > It's here:
> > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz .
> > > > >
> > > > > I'll add some extra debug messages in the coaster C++ client and see
> > > > > if I can recreate the scenario.
> > > > >
> > > > > - Tim
> > > > >
> > > > >
> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >
> > > > >> Ok, so that's legit.
> > > > >>
> > > > >> It does look like shut-down workers are not properly accounted for in
> > > > >> some places (and I believe Yadu submitted a bug for this). However, I
> > > > >> do not see the dead time you mention in either of the last two sets of
> > > > >> logs. It looks like each client instance submits a continuous stream
> > > > >> of jobs.
> > > > >>
> > > > >> So let's get back to the initial log. Can I have the full service
> > > > >> log? I'm trying to track what happened with the jobs submitted before
> > > > >> the first big pause.
> > > > >>
> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or friends)
> > > > >> would probably help a lot here.
> > > > >>
> > > > >> Mihael
> > > > >>
> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote:
> > > > >> > Should be here:
> > > > >> >
> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> >
> > > > >> > > The first worker "failing" is 0904-20022331. The log looks funny
> > > > >> > > at the end.
> > > > >> > >
> > > > >> > > Can you git pull and re-run? The worker is getting some command at
> > > > >> > > the end there and doing nothing about it and I wonder why.
> > > > >> > >
> > > > >> > > Mihael
> > > > >> > >
> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote:
> > > > >> > > > Ok, now I have some worker logs:
> > > > >> > > >
> > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz
> > > > >> > > >
> > > > >> > > > There's nothing obvious I see in the worker logs that would
> > > > >> > > > indicate why the connection was broken.
> > > > >> > > >
> > > > >> > > > - Tim
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong <tim.g.armstrong at gmail.com> wrote:
> > > > >> > > >
> > > > >> > > > > This is all running locally on my laptop, so I think we can
> > > > >> > > > > rule out 1).
> > > > >> > > > >
> > > > >> > > > > It also seems like it's a state the coaster service gets into
> > > > >> > > > > after a few client sessions: generally the first coaster run
> > > > >> > > > > works fine, then after a few runs the problem occurs more
> > > > >> > > > > frequently.
> > > > >> > > > >
> > > > >> > > > > I'm going to try and get worker logs; in the meantime I've
> > > > >> > > > > got some jstacks (attached).
> > > > >> > > > >
> > > > >> > > > > Matching service logs (largish) are here if needed:
> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> > > > >
> > > > >> > > > >> Ah, makes sense.
> > > > >> > > > >>
> > > > >> > > > >> 2 minutes is the channel timeout. Each live connection is
> > > > >> > > > >> guaranteed to have some communication in any 2-minute time
> > > > >> > > > >> window, partially due to periodic heartbeats (sent every 1
> > > > >> > > > >> minute). If no packets flow for the duration of 2 minutes,
> > > > >> > > > >> the connection is assumed broken and all jobs that were
> > > > >> > > > >> submitted to the respective workers are considered failed. So
> > > > >> > > > >> there seems to be an issue with the connections to some of
> > > > >> > > > >> the workers, and it takes 2 minutes to detect them.
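> > > > >> > > > >>
> > > > >> > > > >> To make the timing concrete, here is a rough sketch of that
> > > > >> > > > >> kind of watchdog logic (illustrative only, not the actual
> > > > >> > > > >> channel code; the names are made up):
> > > > >> > > > >>
> > > > >> > > > >>     // ChannelWatchdogSketch.java -- illustrative only.
> > > > >> > > > >>     public class ChannelWatchdogSketch {
> > > > >> > > > >>         static final long HEARTBEAT_MS = 60_000;  // sent every minute
> > > > >> > > > >>         static final long TIMEOUT_MS = 120_000;   // dead after 2 minutes
> > > > >> > > > >>
> > > > >> > > > >>         private volatile long lastPacketTime = System.currentTimeMillis();
> > > > >> > > > >>
> > > > >> > > > >>         // Any packet, including a heartbeat, refreshes the timestamp.
> > > > >> > > > >>         void packetReceived() {
> > > > >> > > > >>             lastPacketTime = System.currentTimeMillis();
> > > > >> > > > >>         }
> > > > >> > > > >>
> > > > >> > > > >>         // Checked periodically; if nothing arrived for 2 minutes the
> > > > >> > > > >>         // connection is assumed broken and its jobs are failed.
> > > > >> > > > >>         boolean isDead() {
> > > > >> > > > >>             return System.currentTimeMillis() - lastPacketTime > TIMEOUT_MS;
> > > > >> > > > >>         }
> > > > >> > > > >>     }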
> > > > >> > > > >>
> > > > >> > > > >> Since the service seems to be alive (although a jstack on
> > > > >> > > > >> the service when things seem to hang might help), this leaves
> > > > >> > > > >> two possibilities:
> > > > >> > > > >> 1 - some genuine network problem
> > > > >> > > > >> 2 - the worker died without properly closing TCP connections
> > > > >> > > > >>
> > > > >> > > > >> If (2), you could enable worker logging
> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if
> > > > >> > > > >> anything shows up.
> > > > >> > > > >>
> > > > >> > > > >> Mihael
> > > > >> > > > >>
> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote:
> > > > >> > > > >> > Here are client and service logs, with part of the service
> > > > >> > > > >> > log edited down to a reasonable size (I have the full thing
> > > > >> > > > >> > if needed, but it was over a gigabyte).
> > > > >> > > > >> >
> > > > >> > > > >> > One relevant section is from 19:49:35 onwards.  The client
> > > > >> > > > >> > submits 4 jobs (its limit), but they don't complete until
> > > > >> > > > >> > 19:51:32 or so (I can see that one task completed based on
> > > > >> > > > >> > ncompleted=1 in the check_tasks log message). It looks like
> > > > >> > > > >> > something has happened with broken pipes and workers being
> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of that is
> > > > >> > > > >> > likely to be.
> > > > >> > > > >> >
> > > > >> > > > >> > - Tim
> > > > >> > > > >> >
> > > > >> > > > >> >
> > > > >> > > > >> >
> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> > > > >> >
> > > > >> > > > >> > > Hi Tim,
> > > > >> > > > >> > >
> > > > >> > > > >> > > I've never seen this before with pure Java.
> > > > >> > > > >> > >
> > > > >> > > > >> > > Do you have logs from these runs?
> > > > >> > > > >> > >
> > > > >> > > > >> > > Mihael
> > > > >> > > > >> > >
> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:
> > > > >> > > > >> > > > I'm running a test Swift/T script that submits tasks
> > > > >> > > > >> > > > to Coasters through the C++ client and I'm seeing some
> > > > >> > > > >> > > > odd behaviour where task submission/execution is
> > > > >> > > > >> > > > stalling for ~2 minute periods.  For example, I'm
> > > > >> > > > >> > > > seeing submit log messages like "submitting
> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in
> > > > >> > > > >> > > > bursts of several seconds with a gap of roughly 2
> > > > >> > > > >> > > > minutes in between, e.g. I'm seeing bursts with the
> > > > >> > > > >> > > > following intervals in my logs.
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391
> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076
> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770
> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296
> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > From what I can tell, the delay is on the coaster
> > > > >> > > > >> > > > service side: the C client is just waiting for a
> > > > >> > > > >> > > > response.
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > The jobs are just being submitted through the local
> > > > >> > > > >> > > > job manager, so I wouldn't expect any delays there.
> > > > >> > > > >> > > > The tasks are also just "/bin/hostname", so they
> > > > >> > > > >> > > > should return immediately.
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > I'm going to continue digging into this on my own,
> > > > >> > > > >> > > > but the 2 minute delay seems like a big clue: does
> > > > >> > > > >> > > > anyone have an idea what could cause stalls in task
> > > > >> > > > >> > > > submission of 2 minute duration?
> > > > >> > > > >> > > >
> > > > >> > > > >> > > > Cheers,
> > > > >> > > > >> > > > Tim
> > > > >> > > > >> > >
> > > > >> > > > >> > >
> > > > >> > > > >> > >
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > >
> > >
> > >
>
>
>