<div dir="ltr"><div><div>It's here: <a href="http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz">http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz</a> .<br><br></div>I'll add some extra debug messages in the coaster C++ client and see if I can recreate the scenario.<br><br></div>- Tim<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov" target="_blank">hategan@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Ok, so that's legit.<br>

<br>

It does look like shut down workers are not properly accounted for in<br>

some places (and I believe Yadu submitted a bug for this). However, I do<br>

not see the dead time you mention in either of the last two sets of<br>

logs. It looks like each client instance submits a continous stream of<br>

jobs.<br>

<br>

So let's get back to the initial log. Can I have the full service log?<br>

I'm trying to track what happened with the jobs submitted before the<br>

first big pause.<br>

<br>

Also, a log message in CoasterClient::updateJobStatus() (or friends)<br>

would probably help a lot here.<br>

<span class="HOEnZb"><font color="#888888"><br>

Mihael<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote:<br>

> Should be here:<br>

><br>

> <a href="http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz" target="_blank">http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz</a><br>

><br>

><br>

><br>

><br>

> On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>> wrote:<br>

><br>

> > The first worker "failing" is 0904-20022331. The log looks funny at the<br>

> > end.<br>

> ><br>

> > Can you git pull and re-run? The worker is getting some command at the<br>

> > end there and doing nothing about it and I wonder why.<br>

> ><br>

> > Mihael<br>

> ><br>

> > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote:<br>

> > > Ok, now I have some worker logs:<br>

> > ><br>

> > > <a href="http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz" target="_blank">http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz</a><br>

> > ><br>

> > > There's nothing obvious I see in the worker logs that would indicate why<br>

> > > the connection was broken.<br>

> > ><br>

> > > - Tim<br>

> > ><br>

> > ><br>

> > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong <<a href="mailto:tim.g.armstrong@gmail.com">tim.g.armstrong@gmail.com</a><br>

> > ><br>

> > > wrote:<br>

> > ><br>

> > > > This is all running locally on my laptop, so I think we can rule out<br>

> > 1).<br>

> > > ><br>

> > > > It also seems like it's a state the coaster service gets into after a<br>

> > few<br>

> > > > client sessions: generally the first coaster run works fine, then<br>

> > after a<br>

> > > > few runs the problem occurs more frequently.<br>

> > > ><br>

> > > > I'm going to try and get worker logs, in the meantime i've got some<br>

> > > > jstacks (attached).<br>

> > > ><br>

> > > > Matching service logs (largish) are here if needed:<br>

> > > > <a href="http://people.cs.uchicago.edu/~tga/service.out.gz" target="_blank">http://people.cs.uchicago.edu/~tga/service.out.gz</a><br>

> > > ><br>

> > > ><br>

> > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>

> > > > wrote:<br>

> > > ><br>

> > > >> Ah, makes sense.<br>

> > > >><br>

> > > >> 2 minutes is the channel timeout. Each live connection is guaranteed<br>

> > to<br>

> > > >> have some communication for any 2 minute time window, partially due to<br>

> > > >> periodic heartbeats (sent every 1 minute). If no packets flow for the<br>

> > > >> duration of 2 minutes, the connection is assumed broken and all jobs<br>

> > > >> that were submitted to the respective workers are considered failed.<br>

> > So<br>

> > > >> there seems to be an issue with the connections to some of the<br>

> > workers,<br>

> > > >> and it takes 2 minutes to detect them.<br>

> > > >><br>

> > > >> Since the service seems to be alive (although a jstack on the service<br>

> > > >> when thing seem to hang might help), this leaves two possibilities:<br>

> > > >> 1 - some genuine network problem<br>

> > > >> 2 - the worker died without properly closing TCP connections<br>

> > > >><br>

> > > >> If (2), you could enable worker logging<br>

> > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything<br>

> > shows<br>

> > > >> up.<br>

> > > >><br>

> > > >> Mihael<br>

> > > >><br>

> > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote:<br>

> > > >> > Here are client and service logs, with part of service log edited<br>

> > down<br>

> > > >> to<br>

> > > >> > be a reasonable size (I have the full thing if needed, but it was<br>

> > over a<br>

> > > >> > gigabyte).<br>

> > > >> ><br>

> > > >> > One relevant section is from 19:49:35 onwards.  The client submits 4<br>

> > > >> jobs<br>

> > > >> > (its limit), but they don't complete until 19:51:32 or so (I can see<br>

> > > >> that<br>

> > > >> > one task completed based on ncompleted=1 in the check_tasks log<br>

> > > >> message).<br>

> > > >> > It looks like something has happened with broken pipes and workers<br>

> > being<br>

> > > >> > lost, but I'm not sure what the ultimate cause of that is likely to<br>

> > be.<br>

> > > >> ><br>

> > > >> > - Tim<br>

> > > >> ><br>

> > > >> ><br>

> > > >> ><br>

> > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a><br>

> > ><br>

> > > >> wrote:<br>

> > > >> ><br>

> > > >> > > Hi Tim,<br>

> > > >> > ><br>

> > > >> > > I've never seen this before with pure Java.<br>

> > > >> > ><br>

> > > >> > > Do you have logs from these runs?<br>

> > > >> > ><br>

> > > >> > > Mihael<br>

> > > >> > ><br>

> > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:<br>

> > > >> > > > I'm running a test Swift/T script that submit tasks to Coasters<br>

> > > >> through<br>

> > > >> > > the<br>

> > > >> > > > C++ client and I'm seeing some odd behaviour where task<br>

> > > >> > > > submission/execution is stalling for ~2 minute periods.  For<br>

> > > >> example, I'm<br>

> > > >> > > > seeing submit log messages like "submitting<br>

> > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of<br>

> > > >> several<br>

> > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm<br>

> > seeing<br>

> > > >> > > bursts<br>

> > > >> > > > with the following intervals in my logs.<br>

> > > >> > > ><br>

> > > >> > > > 16:07:04,603 to 16:07:10,391<br>

> > > >> > > > 16:09:07,377 to 16:09:13,076<br>

> > > >> > > > 16:11:10,005 to 16:11:16,770<br>

> > > >> > > > 16:13:13,291 to 16:13:19,296<br>

> > > >> > > > 16:15:16,000 to 16:15:21,602<br>

> > > >> > > ><br>

> > > >> > > > From what I can tell, the delay is on the coaster service side:<br>

> > the<br>

> > > >> C<br>

> > > >> > > > client is just waiting for a response.<br>

> > > >> > > ><br>

> > > >> > > > The jobs are just being submitted through the local job<br>

> > manager, so<br>

> > > >> I<br>

> > > >> > > > wouldn't expect any delays there.  The tasks are also just<br>

> > > >> > > "/bin/hostname",<br>

> > > >> > > > so should return immediately.<br>

> > > >> > > ><br>

> > > >> > > > I'm going to continue digging into this on my own, but the 2<br>

> > minute<br>

> > > >> delay<br>

> > > >> > > > seems like a big clue: does anyone have an idea what could cause<br>

> > > >> stalls<br>

> > > >> > > in<br>

> > > >> > > > task submission of 2 minute duration?<br>

> > > >> > > ><br>

> > > >> > > > Cheers,<br>

> > > >> > > > Tim<br>

> > > >> > ><br>

> > > >> > ><br>

> > > >> > ><br>

> > > >><br>

> > > >><br>

> > > >><br>

> > > ><br>

> ><br>

> ><br>

> ><br>

<br>

<br>

</div></div></blockquote></div><br></div>