[Swift-devel] Please look at hung run on Beagle

Mihael Hategan hategan at mcs.anl.gov
Tue Jan 7 18:46:27 CST 2014


Lorenzo and his group were running some Lustre-intensive jobs, so Lustre
was rather unresponsive. If this happened in the past day or two, I
would try again.

If not, then a jstack on the Java process (both Swift and the coaster
service, if they run separately) might shed some light on the issue.
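
Before taking thread dumps, it may help to confirm from the log exactly
when the progress ticker stopped changing. The following is only a rough
sketch: the line format is inferred from the ProgressTicker lines quoted
below, and the function names (parse_ticks, stalled_since) are made up
for illustration, not part of Swift.

```python
import re
from datetime import datetime

# Assumed format, inferred from the quoted log lines, e.g.:
# 2014-01-07 18:12:43,585-0600 INFO  RuntimeStats$ProgressTicker   Submitting:9099  Submitted:1
TICK = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}[+-]\d{4}\s+INFO\s+"
    r"RuntimeStats\$ProgressTicker\s*(.*)$"
)
# A state name (which may contain spaces, e.g. "Selecting site") followed by a count.
STATE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_ticks(lines):
    """Yield (timestamp, {state: count}) for each ProgressTicker line."""
    for line in lines:
        m = TICK.match(line.strip())
        if not m:
            continue  # not a ticker line
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        counts = {name.strip(): int(n) for name, n in STATE.findall(m.group(2))}
        yield when, counts


def stalled_since(lines):
    """Return (start, end, counts) for the final run of identical ticks, or None."""
    ticks = list(parse_ticks(lines))
    if not ticks:
        return None
    end, last = ticks[-1]
    start = end
    # Walk backwards until the state counts differ from the last tick.
    for when, counts in reversed(ticks):
        if counts != last:
            break
        start = when
    return start, end, last
```

Calling stalled_since(open(path)) on the run's log should give the time
window over which the counts were frozen, which can then be compared
against worker timeouts in the coaster log.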

Mihael

On Tue, 2014-01-07 at 18:40 -0600, Michael Wilde wrote:
> Hi Mihael and/or David,
> 
> Can you look at this run on beagle and provide a diagnosis?
> 
>   -rw-r--r-- 1 mattshax ci-users 33307656 Jan  7 18:20 /lustre/beagle/mattshax/swifthome.20140107/sweep8-20140107-1812-obopd7ad.log
> 
> It's an EnergyPlus run by Matthew of SOM.
> 
> The progress ticker shows:
> 
> login1$ grep -i progresstick *ad.log
> 2014-01-07 18:12:26,570-0600 INFO  RuntimeStats$ProgressTicker 
> 2014-01-07 18:12:33,600-0600 INFO  RuntimeStats$ProgressTicker   Initializing:3
> 2014-01-07 18:12:34,605-0600 INFO  RuntimeStats$ProgressTicker   Initializing:7297  Selecting site:1803
> 2014-01-07 18:12:38,556-0600 INFO  RuntimeStats$ProgressTicker   Selecting site:9097  Submitting:3
> 2014-01-07 18:12:43,585-0600 INFO  RuntimeStats$ProgressTicker   Submitting:9099  Submitted:1
> 2014-01-07 18:12:44,580-0600 INFO  RuntimeStats$ProgressTicker   Submitting:7635  Submitted:1465
> 2014-01-07 18:12:45,580-0600 INFO  RuntimeStats$ProgressTicker   Submitting:1014  Submitted:8086
> 2014-01-07 18:12:56,570-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
> 2014-01-07 18:13:26,571-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
> ...
> 2014-01-07 18:19:26,573-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
> 2014-01-07 18:19:56,573-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
> (at which time it was killed)
> 
> Beagle had abundant (300+) free nodes, and many PBS jobs started for the run. It seems, though, that workers started timing out around 18:14. I can't tell whether any workers were getting any work started.
> 
> This has happened several times (on 0.94.1). I will try to get this app moved to 0.95RC as soon as possible, but for now, Matthew is making good progress with the scripts as-is (modulo these timeout situations).
> 
> He thought, from earlier debugging, that the timeouts were due to actual app failures (e.g., caused by bad app config files), but I can't see how that could be happening.
> 
> Any assessment or diagnosis of this situation would be appreciated.
> 
> Thanks,
> 
> - Mike
> 




