[Swift-devel] Coaster Task Submission Stalling

Tim Armstrong tim.g.armstrong at gmail.com
Thu Sep 4 14:35:29 CDT 2014


Ok, now I have some worker logs:

http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz

I don't see anything obvious in the worker logs that would indicate why
the connection was broken.

- Tim


On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong <tim.g.armstrong at gmail.com>
wrote:

> This is all running locally on my laptop, so I think we can rule out 1).
>
> It also seems like it's a state the coaster service gets into after a few
> client sessions: generally the first coaster run works fine, then after a
> few runs the problem occurs more frequently.
>
> I'm going to try to get worker logs; in the meantime I've got some
> jstacks (attached).
>
> Matching service logs (largish) are here if needed:
> http://people.cs.uchicago.edu/~tga/service.out.gz
>
>
> On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
>
>> Ah, makes sense.
>>
>> 2 minutes is the channel timeout. Each live connection is guaranteed to
>> have some communication for any 2 minute time window, partially due to
>> periodic heartbeats (sent every 1 minute). If no packets flow for the
>> duration of 2 minutes, the connection is assumed broken and all jobs
>> that were submitted to the respective workers are considered failed. So
>> there seems to be an issue with the connections to some of the workers,
>> and it takes 2 minutes to detect them.
>>
>> Since the service seems to be alive (although a jstack on the service
>> when things seem to hang might help), this leaves two possibilities:
>> 1 - some genuine network problem
>> 2 - the worker died without properly closing TCP connections
>>
>> If (2), you could enable worker logging
>> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows
>> up.
>>
>> Mihael
>>
>> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote:
>> > Here are client and service logs, with part of the service log edited
>> > down to be a reasonable size (I have the full thing if needed, but it
>> > was over a gigabyte).
>> >
>> > One relevant section is from 19:49:35 onwards.  The client submits 4 jobs
>> > (its limit), but they don't complete until 19:51:32 or so (I can see that
>> > one task completed based on ncompleted=1 in the check_tasks log message).
>> > It looks like something has happened with broken pipes and workers being
>> > lost, but I'm not sure what the ultimate cause of that is likely to be.
>> >
>> > - Tim
>> >
>> >
>> >
>> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <hategan at mcs.anl.gov>
>> > wrote:
>> >
>> > > Hi Tim,
>> > >
>> > > I've never seen this before with pure Java.
>> > >
>> > > Do you have logs from these runs?
>> > >
>> > > Mihael
>> > >
>> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:
>> > > > I'm running a test Swift/T script that submits tasks to Coasters through
>> > > > the C++ client and I'm seeing some odd behaviour where task
>> > > > submission/execution is stalling for ~2 minute periods.  For example, I'm
>> > > > seeing submit log messages like "submitting
>> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several
>> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing bursts
>> > > > with the following intervals in my logs:
>> > > >
>> > > > 16:07:04,603 to 16:07:10,391
>> > > > 16:09:07,377 to 16:09:13,076
>> > > > 16:11:10,005 to 16:11:16,770
>> > > > 16:13:13,291 to 16:13:19,296
>> > > > 16:15:16,000 to 16:15:21,602
>> > > >
>> > > > From what I can tell, the delay is on the coaster service side: the C
>> > > > client is just waiting for a response.
>> > > >
>> > > > The jobs are just being submitted through the local job manager, so I
>> > > > wouldn't expect any delays there.  The tasks are also just "/bin/hostname",
>> > > > so should return immediately.
>> > > >
>> > > > I'm going to continue digging into this on my own, but the 2 minute delay
>> > > > seems like a big clue: does anyone have an idea what could cause stalls in
>> > > > task submission of 2 minute duration?
>> > > >
>> > > > Cheers,
>> > > > Tim
>> > >
>> > >
>> > >
>>
>>
>>
>
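
For reference, the burst timestamps from the original report can be
checked against the 2-minute channel timeout Mihael describes. The
following is only a rough sketch in Python: the timestamps are copied
from the thread, but the names (BURSTS, CHANNEL_TIMEOUT_S, parse,
burst_gaps) are made up for illustration and are not part of Swift or
the coaster code. It also assumes all timestamps fall on the same day.

```python
from datetime import datetime

# Burst intervals (start, end) copied from the original report.
# Log format is HH:MM:SS,mmm with a comma before the milliseconds.
BURSTS = [
    ("16:07:04,603", "16:07:10,391"),
    ("16:09:07,377", "16:09:13,076"),
    ("16:11:10,005", "16:11:16,770"),
    ("16:13:13,291", "16:13:19,296"),
    ("16:15:16,000", "16:15:21,602"),
]

# The 2-minute channel timeout mentioned in the thread, in seconds.
CHANNEL_TIMEOUT_S = 120.0


def parse(ts):
    """Parse an HH:MM:SS,mmm timestamp (assumes a single day)."""
    return datetime.strptime(ts, "%H:%M:%S,%f")


def burst_gaps(bursts):
    """Return a list of (start_to_start, idle) gaps in seconds
    between each pair of consecutive bursts."""
    gaps = []
    for (start_a, end_a), (start_b, _end_b) in zip(bursts, bursts[1:]):
        period = (parse(start_b) - parse(start_a)).total_seconds()
        idle = (parse(start_b) - parse(end_a)).total_seconds()
        gaps.append((period, idle))
    return gaps


if __name__ == "__main__":
    for period, idle in burst_gaps(BURSTS):
        print(f"start-to-start {period:8.3f}s  idle {idle:8.3f}s  "
              f"(timeout {CHANNEL_TIMEOUT_S:.0f}s)")
```

With the numbers above, every start-to-start gap comes out slightly
over 120 seconds, which is consistent with (but does not by itself
prove) the stalls being tied to the channel timeout.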