[Swift-devel] Please look at hung run on Beagle

Michael Wilde wilde at mcs.anl.gov
Tue Jan 7 18:40:39 CST 2014


Hi Mihael and/or David,

Can you look at this run on Beagle and provide a diagnosis?

  -rw-r--r-- 1 mattshax ci-users 33307656 Jan  7 18:20
  /lustre/beagle/mattshax/swifthome.20140107/sweep8-20140107-1812-obopd7ad.log

It's an EnergyPlus run by Matthew of SOM.

The progress ticker shows:

login1$ grep -i progresstick *ad.log
2014-01-07 18:12:26,570-0600 INFO  RuntimeStats$ProgressTicker 
2014-01-07 18:12:33,600-0600 INFO  RuntimeStats$ProgressTicker   Initializing:3
2014-01-07 18:12:34,605-0600 INFO  RuntimeStats$ProgressTicker   Initializing:7297  Selecting site:1803
2014-01-07 18:12:38,556-0600 INFO  RuntimeStats$ProgressTicker   Selecting site:9097  Submitting:3
2014-01-07 18:12:43,585-0600 INFO  RuntimeStats$ProgressTicker   Submitting:9099  Submitted:1
2014-01-07 18:12:44,580-0600 INFO  RuntimeStats$ProgressTicker   Submitting:7635  Submitted:1465
2014-01-07 18:12:45,580-0600 INFO  RuntimeStats$ProgressTicker   Submitting:1014  Submitted:8086
2014-01-07 18:12:56,570-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
2014-01-07 18:13:26,571-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
...
2014-01-07 18:19:26,573-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
2014-01-07 18:19:56,573-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
(at which time it was killed)
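
For anyone who wants to check a log for the same pattern, here is a minimal Python sketch (not part of Swift) that pulls the ProgressTicker lines out of a log and reports how long the final state counts stayed unchanged. It assumes the line format shown in the grep output above; the names parse_ticks and stalled_since are hypothetical, and the timezone offset and milliseconds are ignored for simplicity.

```python
import re
from datetime import datetime

# Assumed line format, taken from the grep output above, e.g.:
# 2014-01-07 18:12:56,570-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
TICK_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}[-+]\d{4}\s+INFO\s+"
    r"RuntimeStats\$ProgressTicker\s*(.*)$"
)
# State names may contain spaces ("Selecting site"), so match lazily up to the colon.
STATE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_ticks(lines):
    """Return a list of (timestamp, {state: count}) for each ticker line."""
    ticks = []
    for line in lines:
        m = TICK_RE.match(line.strip())
        if not m:
            continue  # not a ticker line (e.g. other log records or "...")
        # Milliseconds and the UTC offset are dropped (simplifying assumption).
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        states = {name.strip(): int(n) for name, n in STATE_RE.findall(m.group(2))}
        ticks.append((when, states))
    return ticks


def stalled_since(ticks):
    """Return (start, seconds, states) for the trailing run of identical ticks.

    Returns None when there are no ticks at all.
    """
    if not ticks:
        return None
    last_when, last_states = ticks[-1]
    start = last_when
    # Walk backwards until the state counts differ from the final tick.
    for when, states in reversed(ticks):
        if states != last_states:
            break
        start = when
    return start, (last_when - start).total_seconds(), last_states
```

Feeding it the log would look like `stalled_since(parse_ticks(open(path)))`, where `path` is the log file. A long trailing run in which every job sits in Submitted, as here, only shows that nothing moved to an active state; it does not say why.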

Beagle had abundant (300+) free nodes, and many PBS jobs started for the run. It seems, though, that workers started timing out around 18:14. I can't tell whether any workers were getting any work started.

This has happened several times (on 0.94.1).  I will try to get this app moved to 0.95RC as soon as possible, but for now, Matthew is making good progress with the scripts as-is (modulo these timeout situations).

He thought, from earlier debugging, that the timeouts were due to actual app failures (e.g., caused by bad app config files), but I can't see how that could be happening.

Any assessment or diagnosis of this situation would be appreciated.

Thanks,

- Mike

-- 
Michael Wilde
Mathematics and Computer Science | Computation Institute
Argonne National Laboratory      | The University of Chicago
