[Swift-devel] Coaster Task Submission Stalling

Tim Armstrong tim.g.armstrong at gmail.com
Fri Sep 5 12:13:02 CDT 2014


Ok, here it is with the additional debug messages.  The source code change
is in commit 890c41f2ba701b10264553471590096d6f94c278.

Warning: the tarball will expand to several gigabytes of logs.

I had to do multiple client runs to trigger it.  It seems like the problem
might be triggered by abnormal termination of the client.  The first 18 runs
went fine; the problem only started when I ctrl-c-ed Swift/T run #19, before
run #20, which exhibited the delays.

http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz

- Tim


On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong <tim.g.armstrong at gmail.com>
wrote:

> It's here:
> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz .
>
> I'll add some extra debug messages in the coaster C++ client and see if I
> can recreate the scenario.
>
> - Tim
>
>
> On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
>
>> Ok, so that's legit.
>>
>> It does look like shut down workers are not properly accounted for in
>> some places (and I believe Yadu submitted a bug for this). However, I do
>> not see the dead time you mention in either of the last two sets of
>> logs. It looks like each client instance submits a continuous stream of
>> jobs.
>>
>> So let's get back to the initial log. Can I have the full service log?
>> I'm trying to track what happened with the jobs submitted before the
>> first big pause.
>>
>> Also, a log message in CoasterClient::updateJobStatus() (or friends)
>> would probably help a lot here.
>>
>> Mihael
>>
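
For concreteness, here is a rough, hypothetical sketch of the kind of trace
message suggested above. It is not the coaster C++ client's actual code: the
real CoasterClient::updateJobStatus() signature and the client's logging
facility will differ, so the names below are placeholders.

// Hypothetical sketch only: prints a timestamped line for every job status
// update so client-side stalls can be correlated with the service log.
#include <ctime>
#include <iostream>
#include <string>

static void logStatusUpdate(const std::string &jobId, const std::string &status) {
    std::time_t now = std::time(nullptr);
    char buf[32];
    std::strftime(buf, sizeof(buf), "%H:%M:%S", std::localtime(&now));
    std::cout << buf << " updateJobStatus: job=" << jobId
              << " status=" << status << std::endl;
}

int main() {
    // Job id format taken from the submit messages quoted further down.
    logStatusUpdate("urn:133-1409778135377-1409778135378", "COMPLETED");
    return 0;
}
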
>> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote:
>> > Should be here:
>> >
>> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz
>> >
>> >
>> >
>> >
>> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan <hategan at mcs.anl.gov>
>> > wrote:
>> >
>> > > The first worker "failing" is 0904-20022331. The log looks funny at
>> > > the end.
>> > >
>> > > Can you git pull and re-run? The worker is getting some command at the
>> > > end there and doing nothing about it and I wonder why.
>> > >
>> > > Mihael
>> > >
>> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote:
>> > > > Ok, now I have some worker logs:
>> > > >
>> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz
>> > > >
>> > > > There's nothing obvious I see in the worker logs that would
>> > > > indicate why the connection was broken.
>> > > >
>> > > > - Tim
>> > > >
>> > > >
>> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong <tim.g.armstrong at gmail.com>
>> > > > wrote:
>> > > >
>> > > > > This is all running locally on my laptop, so I think we can rule
>> > > > > out 1).
>> > > > >
>> > > > > It also seems like it's a state the coaster service gets into
>> > > > > after a few client sessions: generally the first coaster run
>> > > > > works fine, then after a few runs the problem occurs more
>> > > > > frequently.
>> > > > >
>> > > > > I'm going to try and get worker logs; in the meantime, I've got
>> > > > > some jstacks (attached).
>> > > > >
>> > > > > Matching service logs (largish) are here if needed:
>> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz
>> > > > >
>> > > > >
>> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan <hategan at mcs.anl.gov>
>> > > > > wrote:
>> > > > >
>> > > > >> Ah, makes sense.
>> > > > >>
>> > > > >> 2 minutes is the channel timeout. Each live connection is
>> > > > >> guaranteed to have some communication for any 2 minute time
>> > > > >> window, partially due to periodic heartbeats (sent every 1
>> > > > >> minute). If no packets flow for the duration of 2 minutes, the
>> > > > >> connection is assumed broken and all jobs that were submitted
>> > > > >> to the respective workers are considered failed. So there seems
>> > > > >> to be an issue with the connections to some of the workers, and
>> > > > >> it takes 2 minutes to detect them.
>> > > > >>
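
To make the timeout rule above concrete, here is a minimal, self-contained
sketch (not the actual coaster service code; the names and data structures
are invented for illustration): every packet, including the 1-minute
heartbeats, refreshes a per-channel timestamp, and a watchdog treats any
channel that has been silent for more than 2 minutes as broken and fails the
jobs submitted to that worker.

// Illustrative sketch of the 2-minute dead-channel rule; not coaster code.
#include <chrono>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Channel {
    Clock::time_point lastPacket;  // refreshed on every packet, incl. heartbeats
};

static const std::chrono::seconds HEARTBEAT_INTERVAL(60);
static const std::chrono::seconds CHANNEL_TIMEOUT(120);

// Run periodically by a watchdog: any channel with no traffic for over
// CHANNEL_TIMEOUT is reported, and its jobs would then be marked failed.
std::vector<std::string> findBrokenChannels(const std::map<std::string, Channel> &channels) {
    std::vector<std::string> broken;
    const Clock::time_point now = Clock::now();
    for (const auto &entry : channels) {
        if (now - entry.second.lastPacket > CHANNEL_TIMEOUT) {
            broken.push_back(entry.first);
        }
    }
    return broken;
}

int main() {
    std::map<std::string, Channel> channels;
    // One worker that stopped sending anything 2.5 minutes ago, one healthy one.
    channels["worker-0904-20022331"].lastPacket = Clock::now() - std::chrono::seconds(150);
    channels["worker-healthy"].lastPacket = Clock::now() - HEARTBEAT_INTERVAL;
    for (const std::string &id : findBrokenChannels(channels)) {
        std::cout << "channel timed out, failing its jobs: " << id << std::endl;
    }
    return 0;
}

Under such a rule, a worker that dies without closing its TCP connection only
gets noticed after the full 2-minute window, which would match the roughly
2-minute stalls reported at the start of this thread.
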
>> > > > >> Since the service seems to be alive (although a jstack on the
>> > > > >> service when things seem to hang might help), this leaves two
>> > > > >> possibilities:
>> > > > >> 1 - some genuine network problem
>> > > > >> 2 - the worker died without properly closing TCP connections
>> > > > >>
>> > > > >> If (2), you could enable worker logging
>> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if
>> > > > >> anything shows up.
>> > > > >>
>> > > > >> Mihael
>> > > > >>
>> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote:
>> > > > >> > Here are client and service logs, with part of the service
>> > > > >> > log edited down to be a reasonable size (I have the full thing
>> > > > >> > if needed, but it was over a gigabyte).
>> > > > >> >
>> > > > >> > One relevant section is from 19:49:35 onwards.  The client
>> > > > >> > submits 4 jobs (its limit), but they don't complete until
>> > > > >> > 19:51:32 or so (I can see that one task completed based on
>> > > > >> > ncompleted=1 in the check_tasks log message).  It looks like
>> > > > >> > something has happened with broken pipes and workers being
>> > > > >> > lost, but I'm not sure what the ultimate cause of that is
>> > > > >> > likely to be.
>> > > > >> >
>> > > > >> > - Tim
>> > > > >> >
>> > > > >> >
>> > > > >> >
>> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <hategan at mcs.anl.gov>
>> > > > >> > wrote:
>> > > > >> >
>> > > > >> > > Hi Tim,
>> > > > >> > >
>> > > > >> > > I've never seen this before with pure Java.
>> > > > >> > >
>> > > > >> > > Do you have logs from these runs?
>> > > > >> > >
>> > > > >> > > Mihael
>> > > > >> > >
>> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:
>> > > > >> > > > I'm running a test Swift/T script that submits tasks to
>> > > > >> > > > Coasters through the C++ client, and I'm seeing some odd
>> > > > >> > > > behaviour where task submission/execution is stalling for
>> > > > >> > > > ~2 minute periods.  For example, I'm seeing submit log
>> > > > >> > > > messages like "submitting
>> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in
>> > > > >> > > > bursts of several seconds, with a gap of roughly 2 minutes
>> > > > >> > > > in between, e.g. I'm seeing bursts with the following
>> > > > >> > > > intervals in my logs:
>> > > > >> > > >
>> > > > >> > > > 16:07:04,603 to 16:07:10,391
>> > > > >> > > > 16:09:07,377 to 16:09:13,076
>> > > > >> > > > 16:11:10,005 to 16:11:16,770
>> > > > >> > > > 16:13:13,291 to 16:13:19,296
>> > > > >> > > > 16:15:16,000 to 16:15:21,602
>> > > > >> > > >
>> > > > >> > > > From what I can tell, the delay is on the coaster service
>> > > > >> > > > side: the C client is just waiting for a response.
>> > > > >> > > >
>> > > > >> > > > The jobs are just being submitted through the local job
>> > > > >> > > > manager, so I wouldn't expect any delays there.  The tasks
>> > > > >> > > > are also just "/bin/hostname", so should return immediately.
>> > > > >> > > >
>> > > > >> > > > I'm going to continue digging into this on my own, but the
>> > > > >> > > > 2 minute delay seems like a big clue: does anyone have an
>> > > > >> > > > idea what could cause stalls in task submission of 2 minute
>> > > > >> > > > duration?
>> > > > >> > > >
>> > > > >> > > > Cheers,
>> > > > >> > > > Tim
>> > > > >> > >
>> > > > >> > >
>> > > > >> > >
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >
>> > >
>> > >
>> > >
>>
>>
>>
>