[Swift-devel] Coaster Task Submission Stalling

Mihael Hategan hategan at mcs.anl.gov
Wed Sep 3 22:35:22 CDT 2014


Ah, makes sense.

2 minutes is the channel timeout. Each live connection is guaranteed to
have some communication for any 2 minute time window, partially due to
periodic heartbeats (sent every 1 minute). If no packets flow for the
duration of 2 minutes, the connection is assumed broken and all jobs
that were submitted to the respective workers are considered failed. So
there seems to be an issue with the connections to some of the workers,
and it takes 2 minutes to detect them.

Since the service seems to be alive (although a jstack on the service
when thing seem to hang might help), this leaves two possibilities:
1 - some genuine network problem
2 - the worker died without properly closing TCP connections

If (2), you could enable worker logging
(Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows
up.

Mihael

On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote:
> Here are client and service logs, with part of service log edited down to
> be a reasonable size (I have the full thing if needed, but it was over a
> gigabyte).
> 
> One relevant section is from 19:49:35 onwards.  The client submits 4 jobs
> (its limit), but they don't complete until 19:51:32 or so (I can see that
> one task completed based on ncompleted=1 in the check_tasks log message).
> It looks like something has happened with broken pipes and workers being
> lost, but I'm not sure what the ultimate cause of that is likely to be.
> 
> - Tim
> 
> 
> 
> On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > Hi Tim,
> >
> > I've never seen this before with pure Java.
> >
> > Do you have logs from these runs?
> >
> > Mihael
> >
> > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:
> > > I'm running a test Swift/T script that submit tasks to Coasters through
> > the
> > > C++ client and I'm seeing some odd behaviour where task
> > > submission/execution is stalling for ~2 minute periods.  For example, I'm
> > > seeing submit log messages like "submitting
> > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several
> > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing
> > bursts
> > > with the following intervals in my logs.
> > >
> > > 16:07:04,603 to 16:07:10,391
> > > 16:09:07,377 to 16:09:13,076
> > > 16:11:10,005 to 16:11:16,770
> > > 16:13:13,291 to 16:13:19,296
> > > 16:15:16,000 to 16:15:21,602
> > >
> > > From what I can tell, the delay is on the coaster service side: the C
> > > client is just waiting for a response.
> > >
> > > The jobs are just being submitted through the local job manager, so I
> > > wouldn't expect any delays there.  The tasks are also just
> > "/bin/hostname",
> > > so should return immediately.
> > >
> > > I'm going to continue digging into this on my own, but the 2 minute delay
> > > seems like a big clue: does anyone have an idea what could cause stalls
> > in
> > > task submission of 2 minute duration?
> > >
> > > Cheers,
> > > Tim
> >
> >
> >





More information about the Swift-devel mailing list