[Swift-devel] Please look at hung run on Beagle
Michael Wilde
wilde at mcs.anl.gov
Tue Jan 7 18:40:39 CST 2014
Hi Mihael and/or David,
Can you look at this run on beagle and provide a diagnosis?
-rw-r--r-- 1 mattshax ci-users 33307656 Jan 7 18:20 /lustre/beagle/mattshax/swifthome.20140107/sweep8-20140107-1812-obopd7ad.log
It's an EnergyPlus run by Matthew of SOM.
The progress ticker shows:
login1$ grep -i progresstick *ad.log
2014-01-07 18:12:26,570-0600 INFO RuntimeStats$ProgressTicker
2014-01-07 18:12:33,600-0600 INFO RuntimeStats$ProgressTicker Initializing:3
2014-01-07 18:12:34,605-0600 INFO RuntimeStats$ProgressTicker Initializing:7297 Selecting site:1803
2014-01-07 18:12:38,556-0600 INFO RuntimeStats$ProgressTicker Selecting site:9097 Submitting:3
2014-01-07 18:12:43,585-0600 INFO RuntimeStats$ProgressTicker Submitting:9099 Submitted:1
2014-01-07 18:12:44,580-0600 INFO RuntimeStats$ProgressTicker Submitting:7635 Submitted:1465
2014-01-07 18:12:45,580-0600 INFO RuntimeStats$ProgressTicker Submitting:1014 Submitted:8086
2014-01-07 18:12:56,570-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
2014-01-07 18:13:26,571-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
...
2014-01-07 18:19:26,573-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
2014-01-07 18:19:56,573-0600 INFO RuntimeStats$ProgressTicker Submitted:9100
(at which time it was killed)
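To make stalls like this easier to spot, here is a rough Python sketch that reads ProgressTicker lines in the format shown above and reports how long the final set of state counts stayed unchanged. The function names (parse_tick, find_stall) are made up for illustration and are not part of Swift, and the regular expressions assume the log layout seen in this excerpt only.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt above:
# 2014-01-07 18:12:43,585-0600 INFO RuntimeStats$ProgressTicker Submitting:9099 Submitted:1
TICK_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}[+-]\d{4}\s+INFO\s+"
    r"RuntimeStats\$ProgressTicker\s*(.*)$"
)
# State names may contain spaces (e.g. "Selecting site"), so match lazily up to ":<digits>".
STATE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_tick(line):
    """Return (timestamp, {state: count}) for a ProgressTicker line, else None."""
    m = TICK_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    counts = {name.strip(): int(n) for name, n in STATE_RE.findall(m.group(2))}
    return when, counts


def find_stall(lines):
    """Return (start, end, counts) for the trailing run of identical ticks, or None."""
    ticks = [t for t in map(parse_tick, lines) if t is not None]
    if not ticks:
        return None
    end, last = ticks[-1]
    start = end
    # Walk backwards until the state counts differ from the final tick.
    for when, counts in reversed(ticks):
        if counts != last:
            break
        start = when
    return start, end, last
```

Run against the ticker lines quoted above, this reports that the counts stayed at Submitted:9100 from 18:12:56 to 18:19:56, i.e. about seven minutes with no job leaving the Submitted state.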
Beagle had abundant (300+) free nodes, and many PBS jobs started for the run. It seems, though, that workers started timing out around 18:14. I can't tell whether any workers were getting any work started.
This has happened several times (on 0.94.1). I will try to get this app moved to 0.95RC as soon as possible, but for now, Matthew is making good progress with the scripts as-is (modulo these timeout situations).
He thought, from earlier debugging, that the timeouts were due to actual app failures (e.g., caused by bad app config files), but I can't see how that could be happening.
Any assessment or diagnosis of this situation would be appreciated.
Thanks,
- Mike
--
Michael Wilde
Mathematics and Computer Science | Computation Institute
Argonne National Laboratory | The University of Chicago