[Swift-devel] Coaster Task Submission Stalling

Tim Armstrong tim.g.armstrong at gmail.com
Thu Sep 4 13:11:04 CDT 2014


This is all running locally on my laptop, so I think we can rule out (1), a
genuine network problem.

It also seems like it's a state the coaster service gets into after a few
client sessions: generally the first coaster run works fine, then after a
few runs the problem occurs more frequently.

I'm going to try to get worker logs; in the meantime I've got some jstacks
(attached).

Matching service logs (largish) are here if needed:
http://people.cs.uchicago.edu/~tga/service.out.gz
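
For reference, here is how I read the channel timeout Mihael describes
below, as a minimal Python sketch. This is not the actual coaster service
code: the class, function, and field names and the polling structure are
my own assumptions, and only the 1 minute heartbeat and 2 minute timeout
come from his description.

```python
HEARTBEAT_INTERVAL_S = 60   # per the description: heartbeats every 1 minute (informational)
CHANNEL_TIMEOUT_S = 120     # per the description: 2 minutes of silence => broken


class WorkerChannel:
    """Hypothetical model of one service-to-worker connection."""

    def __init__(self, worker_id, now):
        self.worker_id = worker_id
        self.last_packet_time = now   # any packet counts, heartbeat or not
        self.running_jobs = set()     # job ids submitted over this channel

    def on_packet(self, now):
        # Called whenever anything arrives on the connection.
        self.last_packet_time = now

    def is_timed_out(self, now):
        # No traffic for the full timeout window: assume the connection is dead.
        return now - self.last_packet_time >= CHANNEL_TIMEOUT_S


def reap_dead_channels(channels, now):
    """Return the job ids considered failed because their channel went silent."""
    failed = []
    for ch in channels:
        if ch.is_timed_out(now):
            failed.extend(sorted(ch.running_jobs))
            ch.running_jobs.clear()
    return failed
```

In this model a worker that dies without closing its TCP connection just
stops triggering on_packet, so its jobs stay in running_jobs until the
timeout expires. If the real service behaves similarly, that would be
consistent with the ~2 minute stalls.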


On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Ah, makes sense.
>
> 2 minutes is the channel timeout. Each live connection is guaranteed to
> have some communication for any 2 minute time window, partially due to
> periodic heartbeats (sent every 1 minute). If no packets flow for the
> duration of 2 minutes, the connection is assumed broken and all jobs
> that were submitted to the respective workers are considered failed. So
> there seems to be an issue with the connections to some of the workers,
> and it takes 2 minutes to detect them.
>
> Since the service seems to be alive (although a jstack on the service
> when things seem to hang might help), this leaves two possibilities:
> 1 - some genuine network problem
> 2 - the worker died without properly closing TCP connections
>
> If (2), you could enable worker logging
> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows
> up.
>
> Mihael
>
> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote:
> > Here are client and service logs, with part of service log edited down to
> > be a reasonable size (I have the full thing if needed, but it was over a
> > gigabyte).
> >
> > One relevant section is from 19:49:35 onwards.  The client submits 4 jobs
> > (its limit), but they don't complete until 19:51:32 or so (I can see that
> > one task completed based on ncompleted=1 in the check_tasks log message).
> > It looks like something has happened with broken pipes and workers being
> > lost, but I'm not sure what the ultimate cause of that is likely to be.
> >
> > - Tim
> >
> >
> >
> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > > Hi Tim,
> > >
> > > I've never seen this before with pure Java.
> > >
> > > Do you have logs from these runs?
> > >
> > > Mihael
> > >
> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:
> > > > I'm running a test Swift/T script that submits tasks to Coasters through
> > > > the C++ client and I'm seeing some odd behaviour where task
> > > > submission/execution is stalling for ~2 minute periods.  For example, I'm
> > > > seeing submit log messages like "submitting
> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several
> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing bursts
> > > > with the following intervals in my logs.
> > > >
> > > > 16:07:04,603 to 16:07:10,391
> > > > 16:09:07,377 to 16:09:13,076
> > > > 16:11:10,005 to 16:11:16,770
> > > > 16:13:13,291 to 16:13:19,296
> > > > 16:15:16,000 to 16:15:21,602
> > > >
> > > > From what I can tell, the delay is on the coaster service side: the C
> > > > client is just waiting for a response.
> > > >
> > > > The jobs are just being submitted through the local job manager, so I
> > > > wouldn't expect any delays there.  The tasks are also just "/bin/hostname",
> > > > so should return immediately.
> > > >
> > > > I'm going to continue digging into this on my own, but the 2 minute delay
> > > > seems like a big clue: does anyone have an idea what could cause stalls in
> > > > task submission of 2 minute duration?
> > > >
> > > > Cheers,
> > > > Tim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hostnames-run1.out
Type: application/octet-stream
Size: 310493 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20140904/fa47749a/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hostnames-run2.out
Type: application/octet-stream
Size: 4461088 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20140904/fa47749a/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jstack.out
Type: application/octet-stream
Size: 113681 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20140904/fa47749a/attachment-0002.obj>

