<div dir="ltr"><div>Here are the client and service logs. I trimmed part of the service log down to a reasonable size (I have the full thing if needed, but it was over a gigabyte).<br><br>One relevant section is from 19:49:35 onwards. The client submits 4 jobs
(its limit), but they don't complete until 19:51:32 or so (I can see that one task completed based on ncompleted=1 in the check_tasks log message).
The log shows broken pipes and workers being lost, but I'm not sure what the ultimate cause is likely to be.<br><br></div>- Tim<br><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov" target="_blank">hategan@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Tim,<br>
<br>
I've never seen this before with pure Java.<br>
<br>
Do you have logs from these runs?<br>
<span class="HOEnZb"><font color="#888888"><br>
Mihael<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote:<br>
> I'm running a test Swift/T script that submits tasks to Coasters through the<br>
> C++ client and I'm seeing some odd behaviour where task<br>
> submission/execution stalls for ~2 minute periods. For example, submit log<br>
> messages like "submitting<br>
> urn:133-1409778135377-1409778135378: /bin/hostname" appear in bursts of several<br>
> seconds with a gap of roughly 2 minutes in between. The bursts cover the<br>
> following intervals in my logs:<br>
><br>
> 16:07:04,603 to 16:07:10,391<br>
> 16:09:07,377 to 16:09:13,076<br>
> 16:11:10,005 to 16:11:16,770<br>
> 16:13:13,291 to 16:13:19,296<br>
> 16:15:16,000 to 16:15:21,602<br>
><br>
> From what I can tell, the delay is on the coaster service side: the C<br>
> client is just waiting for a response.<br>
><br>
> The jobs are just being submitted through the local job manager, so I<br>
> wouldn't expect any delays there. The tasks are also just "/bin/hostname",<br>
> so they should return immediately.<br>
><br>
> I'm going to continue digging into this on my own, but the 2 minute delay<br>
> seems like a big clue: does anyone have an idea what could cause 2-minute<br>
> stalls in task submission?<br>
><br>
> Cheers,<br>
> Tim<br>
<br>
<br>
</div></div></blockquote></div><br></div>