<div dir="ltr">I think there might be an issue with the sites.xml formatting there too. It looks like sites.xml has multiple &lt;config&gt; and &lt;/config&gt; tags.</div><div class="gmail_extra"><br><br><div class="gmail_quote">
On Tue, Jan 7, 2014 at 6:46 PM, Mihael Hategan <span dir="ltr">&lt;<a href="mailto:hategan@mcs.anl.gov" target="_blank">hategan@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Lorenzo and his group were running some Lustre-intensive jobs, so Lustre<br>
was rather unresponsive. If this happened in the past day or two, I<br>
would try again.<br>
<br>
If not, then a jstack on the java process (both swift and coaster<br>
service if separate) might shed some light on the issue.<br>
<span class="HOEnZb"><font color="#888888"><br>
Mihael<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
On Tue, 2014-01-07 at 18:40 -0600, Michael Wilde wrote:<br>
> Hi Mihael and/or David,<br>
><br>
> Can you look at this run on beagle and provide a diagnosis?<br>
><br>
> -rw-r--r-- 1 mattshax ci-users 33307656 Jan 7 18:20<br>
> /lustre/beagle/mattshax/swifthome.20140107/sweep8-20140107-1812-obopd7ad.log<br>
><br>
> It's an EnergyPlus run by Matthew of SOM.<br>
><br>
> The progress ticker shows:<br>
><br>
> login1$ grep -i progresstick *ad.log<br>
> 2014-01-07 18:12:26,570-0600 INFO RuntimeStats$ProgressTicker<br>
> 2014-01-07 18:12:33,600-0600 INFO RuntimeStats$ProgressTicker Initializing:3<br>
> 2014-01-07 18:12:34,605-0600 INFO RuntimeStats$ProgressTicker Initializing:7297 Selecting site:1803<br>
> 2014-01-07 18:12:38,556-0600 INFO RuntimeStats$ProgressTicker Selecting site:9097 Submitting:3<br>
> 2014-01-07 18:12:43,585-0600 INFO RuntimeStats$ProgressTicker Submitting:9099 Submitted:1<br>
> 2014-01-07 18:12:44,580-0600 INFO RuntimeStats$ProgressTicker Submitting:7635 Submitted:1465<br>
> 2014-01-07 18:12:45,580-0600 INFO RuntimeStats$ProgressTicker Submitting:1014 Submitted:8086<br>
> 2014-01-07 18:12:56,570-0600 INFO RuntimeStats$ProgressTicker Submitted:9100<br>
> 2014-01-07 18:13:26,571-0600 INFO RuntimeStats$ProgressTicker Submitted:9100<br>
> ...<br>
> 2014-01-07 18:19:26,573-0600 INFO RuntimeStats$ProgressTicker Submitted:9100<br>
> 2014-01-07 18:19:56,573-0600 INFO RuntimeStats$ProgressTicker Submitted:9100<br>
> (at which time it was killed)<br>
><br>
> Beagle had abundant (300+) free nodes, and many PBS jobs started for the run. It seems, though, that workers started timing out around 18:14. I can't tell whether any workers were getting any work started.<br>
><br>
> This has happened several times (on 0.94.1). I will try to get this app moved to 0.95RC as soon as possible, but for now, Matthew is making good progress with the scripts as-is (modulo these timeout situations).<br>
><br>
> He thought, from earlier debugging, that the timeouts were due to actual app failures (e.g., caused by bad app config files), but I can't see how that could be happening.<br>
><br>
> Any assessment or diagnosis of this situation would be appreciated.<br>
><br>
> Thanks,<br>
><br>
> - Mike<br>
><br>
<br>
<br>
</div></div><div class="HOEnZb"><div class="h5">
</div></div></blockquote></div><br></div>
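<div>The stall in the quoted progress ticker output (the count sits at Submitted:9100 from 18:12:56 until the run was killed) could also be picked out of a log automatically. The following is only a minimal sketch in Python, assuming the ticker line layout shown in the quoted grep output; <code>parse_ticker</code> and <code>longest_stall</code> are hypothetical helper names and not part of Swift.</div>

```python
import re
from datetime import datetime

# Assumed line layout, taken from the quoted grep output:
#   2014-01-07 18:12:56,570-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
TICKER_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\S*\s+INFO\s+"
    r"RuntimeStats\$ProgressTicker\s*(.*)$"
)


def parse_ticker(line):
    """Return (timestamp, {state: count}) for a ticker line, else None."""
    m = TICKER_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    counts = {}
    # State names may contain spaces, e.g. "Selecting site:1803".
    for state, n in re.findall(r"([A-Za-z ]+?):(\d+)", m.group(2)):
        counts[state.strip()] = int(n)
    return when, counts


def longest_stall(lines):
    """Return (seconds, start_time, counts) for the longest run of
    consecutive ticker lines whose state counts did not change."""
    best = None
    run_start = run_end = prev = None
    for line in lines:
        parsed = parse_ticker(line)
        if parsed is None:
            continue  # skip non-ticker lines such as "..."
        when, counts = parsed
        if counts and counts == prev:
            run_end = when  # same counts as before: the run continues
        else:
            run_start = run_end = when  # counts changed: start a new run
            prev = counts
        seconds = (run_end - run_start).total_seconds()
        if best is None or seconds > best[0]:
            best = (seconds, run_start, counts)
    return best


if __name__ == "__main__":
    sample = [
        "2014-01-07 18:12:45,580-0600 INFO RuntimeStats$ProgressTicker Submitting:1014 Submitted:8086",
        "2014-01-07 18:12:56,570-0600 INFO RuntimeStats$ProgressTicker Submitted:9100",
        "2014-01-07 18:13:26,571-0600 INFO RuntimeStats$ProgressTicker Submitted:9100",
        "...",
        "2014-01-07 18:19:56,573-0600 INFO RuntimeStats$ProgressTicker Submitted:9100",
    ]
    print(longest_stall(sample))
```

<div>A long run with every job in Submitted and nothing Active only shows that no work was being picked up; it does not say why, so a jstack dump or the worker logs would still be needed to tell a slow filesystem from a configuration problem.</div>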