[Swift-devel] Please look at hung run on Beagle
Mihael Hategan
hategan at mcs.anl.gov
Tue Jan 7 18:46:27 CST 2014
Lorenzo and his group were running some Lustre-intensive jobs, so Lustre
was rather unresponsive. If this happened in the past day or two, I
would try again.
If not, then a jstack on the Java process (both Swift and the coaster
service, if separate) might shed some light on the issue.
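
For reference, a minimal sketch of capturing such a thread dump with Python. It assumes the JDK's jstack is on the PATH and that you run it as the user who owns the JVM. The PID and the output file name are placeholders, and the helper names are made up for illustration.

```python
import subprocess


def jstack_command(pid, with_locks=False):
    """Build the argv for dumping the threads of JVM `pid`.

    `-l` asks jstack for the long listing with extra lock information.
    """
    cmd = ["jstack"]
    if with_locks:
        cmd.append("-l")
    cmd.append(str(pid))
    return cmd


def dump_threads(pid, out_path, with_locks=False):
    """Write the thread dump to `out_path` and return jstack's exit code."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            jstack_command(pid, with_locks),
            stdout=out,
            stderr=subprocess.STDOUT,
        )
    return result.returncode


# Hypothetical usage (PID 12345 is a placeholder for the Swift or
# coaster service JVM; find the real one with ps or jps):
#   dump_threads(12345, "swift-jstack.txt", with_locks=True)
```

Taking two or three dumps a few seconds apart usually makes it easier to tell a thread that is truly stuck from one that is merely slow.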
Mihael
On Tue, 2014-01-07 at 18:40 -0600, Michael Wilde wrote:
> Hi Mihael and/or David,
>
> Can you look at this run on beagle and provide a diagnosis?
>
> -rw-r--r-- 1 mattshax ci-users 33307656 Jan 7 18:20
> /lustre/beagle/mattshax/swifthome.20140107/sweep8-20140107-1812-obopd7ad.log
>
> It's an EnergyPlus run by Matthew of SOM.
>
> The progress ticker shows:
>
> login1$ grep -i progresstick *ad.log
> 2014-01-07 18:12:26,570-0600 INFO RuntimeStats$ProgressTicker
> 2014-01-07 18:12:33,600-0600 INFO RuntimeStats$ProgressTicker Initializing:3
> 2014-01-07 18:12:34,605-0600 INFO RuntimeStats$ProgressTicker Initializing:7297 Selecting site:1803
> 2014-01-07 18:12:38,556-0600 INFO RuntimeStats$ProgressTicker Selecting site:9097 Submitting:3
> 2014-01-07 18:12:43,585-0600 INFO RuntimeStats$ProgressTicker Submitting:9099 Submitted:1
> 2014-01-07 18:12:44,580-0600 INFO RuntimeStats$ProgressTicker Submitting:7635 Submitted:1465
> 2014-01-07 18:12:45,580-0600 INFO RuntimeStats$ProgressTicker Submitting:1014 Submitted:8086
> 2014-01-07 18:12:56,570-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
> 2014-01-07 18:13:26,571-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
> ...
> 2014-01-07 18:19:26,573-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
> 2014-01-07 18:19:56,573-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
> (at which time it was killed)
>
> Beagle had abundant (300+) free nodes, and many PBS jobs started for the run. It seems, though, that workers started timing out around 18:14. I can't tell whether any workers were getting any work started.
>
> This has happened several times (on 0.94.1). I will try to get this app moved to 0.95RC as soon as possible, but for now, Matthew is making good progress with the scripts as-is (modulo these timeout situations).
>
> He thought, from earlier debugging, that the timeouts were due to actual app failures (e.g., caused by bad app config files), but I can't see how that could be happening.
>
> Any assessment or diagnosis of this situation would be appreciated.
>
> Thanks,
>
> - Mike
>
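
The stall visible in the progress ticker above can also be measured from the log. The following is a rough sketch in Python, assuming the ticker lines keep the format shown in the grep output. The function names are hypothetical and nothing here is part of Swift itself.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2014-01-07 18:12:56,570-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
TICK = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{4}) INFO\s+"
    r"RuntimeStats\$ProgressTicker\s*(.*)$"
)
# State names may contain spaces, e.g. "Selecting site:9097".
STATE = re.compile(r"([A-Za-z][A-Za-z ]*):(\d+)")


def parse_tick(line):
    """Return (timestamp, {state: count}) for a ticker line, else None."""
    m = TICK.search(line)
    if m is None:
        return None
    when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f%z")
    counts = {name.strip(): int(n) for name, n in STATE.findall(m.group(2))}
    return when, counts


def trailing_stall(lines):
    """Return (final counts, seconds they stayed unchanged at the end)."""
    ticks = [t for t in map(parse_tick, lines) if t is not None]
    if not ticks:
        return None
    last_when, last_counts = ticks[-1]
    start = last_when
    # Walk backwards while the reported state is identical to the last one.
    for when, counts in reversed(ticks):
        if counts != last_counts:
            break
        start = when
    return last_counts, (last_when - start).total_seconds()


# Hypothetical usage on a Swift log file:
#   with open("run.log") as f:
#       print(trailing_stall(f))
```

A long trailing stall with every job in Submitted and none in Active is the pattern described in this report.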