[Swift-devel] Coaster Task Submission Stalling

Tim Armstrong tim.g.armstrong at gmail.com
Fri Sep 5 08:55:04 CDT 2014


It's here:
http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz .

I'll add some extra debug messages in the coaster C++ client and see if I
can recreate the scenario.
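
Roughly what I have in mind, as a sketch (the real logger and member names
in the client differ, so treat these as placeholders):

    #include <ctime>
    #include <iostream>
    #include <string>

    // Timestamp every status notification as it arrives from the
    // service, so a silent 2-minute gap shows up directly in the
    // client output. Something like this would be called at the top
    // of CoasterClient::updateJobStatus() (and friends).
    static void logJobStatus(const std::string& jobId, int statusCode) {
        char ts[32];
        std::time_t now = std::time(nullptr);
        std::strftime(ts, sizeof(ts), "%H:%M:%S", std::localtime(&now));
        std::cerr << ts << " updateJobStatus: job=" << jobId
                  << " status=" << statusCode << "\n";
    }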

- Tim


On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Ok, so that's legit.
>
> It does look like shut down workers are not properly accounted for in
> some places (and I believe Yadu submitted a bug for this). However, I do
> not see the dead time you mention in either of the last two sets of
> logs. It looks like each client instance submits a continuous stream of
> jobs.
>
> So let's get back to the initial log. Can I have the full service log?
> I'm trying to track what happened with the jobs submitted before the
> first big pause.
>
> Also, a log message in CoasterClient::updateJobStatus() (or friends)
> would probably help a lot here.
>
> Mihael
>
> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote:
> > Should be here:
> >
> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz
> >
> >
> >
> >
> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > > The first worker "failing" is 0904-20022331. The log looks funny at the
> > > end.
> > >
> > > Can you git pull and re-run? The worker is getting some command at the
> > > end there and doing nothing about it and I wonder why.
> > >
> > > Mihael
> > >
> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote:
> > > > Ok, now I have some worker logs:
> > > >
> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz
> > > >
> > > > There's nothing obvious I see in the worker logs that would indicate
> > > > why the connection was broken.
> > > >
> > > > - Tim
> > > >
> > > >
> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong <tim.g.armstrong at gmail.com> wrote:
> > > >
> > > > > This is all running locally on my laptop, so I think we can rule
> > > > > out 1).
> > > > >
> > > > > It also seems like it's a state the coaster service gets into after
> > > > > a few client sessions: generally the first coaster run works fine,
> > > > > then after a few runs the problem occurs more frequently.
> > > > >
> > > > > I'm going to try and get worker logs; in the meantime I've got some
> > > > > jstacks (attached).
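> > > > >
> > > > > (The jstacks are just the output of the standard JDK jstack tool run
> > > > > against the service JVM while submission appeared stalled, roughly:
> > > > >
> > > > >     jstack <service-pid> > service-stack.txt
> > > > >
> > > > > with <service-pid> standing in for the coaster service's process id.)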
> > > > >
> > > > > Matching service logs (largish) are here if needed:
> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz
> > > > >
> > > > >
> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >
> > > > >> Ah, makes sense.
> > > > >>
> > > > >> 2 minutes is the channel timeout. Each live connection is guaranteed
> > > > >> to have some communication in any 2-minute time window, partially
> > > > >> due to periodic heartbeats (sent every minute). If no packets flow
> > > > >> for the duration of 2 minutes, the connection is assumed broken and
> > > > >> all jobs that were submitted to the respective workers are considered
> > > > >> failed. So there seems to be an issue with the connections to some of
> > > > >> the workers, and it takes 2 minutes to detect them.
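> > > > >>
> > > > >> Concretely, the rule is roughly the following (an illustrative
> > > > >> sketch of the logic, not the actual service code):
> > > > >>
> > > > >>     // One heartbeat per minute means a healthy channel misses at
> > > > >>     // most two heartbeats before the timeout fires.
> > > > >>     const long HEARTBEAT_INTERVAL_S = 60;
> > > > >>     const long CHANNEL_TIMEOUT_S = 2 * HEARTBEAT_INTERVAL_S;
> > > > >>
> > > > >>     // lastReceived is updated on every inbound packet (data or
> > > > >>     // heartbeat); when this returns true, the channel is declared
> > > > >>     // broken and all jobs on its workers are failed.
> > > > >>     bool channelTimedOut(long lastReceived, long now) {
> > > > >>         return now - lastReceived > CHANNEL_TIMEOUT_S;
> > > > >>     }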
> > > > >>
> > > > >> Since the service seems to be alive (although a jstack on the
> > > > >> service when things seem to hang might help), this leaves two
> > > > >> possibilities:
> > > > >> 1 - some genuine network problem
> > > > >> 2 - the worker died without properly closing TCP connections
> > > > >>
> > > > >> If (2), you could enable worker logging
> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything
> > > > >> shows up.
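> > > > >>
> > > > >> From the C++ client that would be something along these lines
> > > > >> (assuming a Settings class with a string key/value setter; check
> > > > >> the headers for the exact method name):
> > > > >>
> > > > >>     coaster::Settings settings;
> > > > >>     // Key as above; the value is the logging level the worker
> > > > >>     // script understands.
> > > > >>     settings.set(Settings::Key::WORKER_LOGGING_LEVEL, "DEBUG");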
> > > > >>
> > > > >> Mihael
> > > > >>
> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote:
> > > > >> > Here are client and service logs, with part of the service log
> > > > >> > edited down to be a reasonable size (I have the full thing if
> > > > >> > needed, but it was over a gigabyte).
> > > > >> >
> > > > >> > One relevant section is from 19:49:35 onwards. The client submits
> > > > >> > 4 jobs (its limit), but they don't complete until 19:51:32 or so
> > > > >> > (I can see that one task completed based on ncompleted=1 in the
> > > > >> > check_tasks log message). It looks like something has happened
> > > > >> > with broken pipes and workers being lost, but I'm not sure what
> > > > >> > the ultimate cause of that is likely to be.
> > > > >> >
> > > > >> > - Tim
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >> >
> > > > >> > > Hi Tim,
> > > > >> > >
> > > > >> > > I've never seen this before with pure Java.
> > > > >> > >
> > > > >> > > Do you have logs from these runs?
> > > > >> > >
> > > > >> > > Mihael
> > > > >> > >
> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:
> > > > >> > > > I'm running a test Swift/T script that submits tasks to
> > > > >> > > > Coasters through the C++ client, and I'm seeing some odd
> > > > >> > > > behaviour where task submission/execution is stalling for
> > > > >> > > > ~2 minute periods. For example, I'm seeing submit log messages
> > > > >> > > > like "submitting urn:133-1409778135377-1409778135378:
> > > > >> > > > /bin/hostname" in bursts of several seconds with a gap of
> > > > >> > > > roughly 2 minutes in between, e.g. I'm seeing bursts with the
> > > > >> > > > following intervals in my logs:
> > > > >> > > >
> > > > >> > > > 16:07:04,603 to 16:07:10,391
> > > > >> > > > 16:09:07,377 to 16:09:13,076
> > > > >> > > > 16:11:10,005 to 16:11:16,770
> > > > >> > > > 16:13:13,291 to 16:13:19,296
> > > > >> > > > 16:15:16,000 to 16:15:21,602
> > > > >> > > >
> > > > >> > > > From what I can tell, the delay is on the coaster service
> > > > >> > > > side: the C client is just waiting for a response.
> > > > >> > > >
> > > > >> > > > The jobs are just being submitted through the local job
> > > > >> > > > manager, so I wouldn't expect any delays there. The tasks are
> > > > >> > > > also just "/bin/hostname", so they should return immediately.
> > > > >> > > >
> > > > >> > > > I'm going to continue digging into this on my own, but the
> > > > >> > > > 2-minute delay seems like a big clue: does anyone have an idea
> > > > >> > > > what could cause stalls in task submission of 2-minute
> > > > >> > > > duration?
> > > > >> > > >
> > > > >> > > > Cheers,
> > > > >> > > > Tim
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > >
> > >
> > >
>
>
>