[Swift-devel] Please look at hung run on Beagle
    Michael Wilde 
    wilde at mcs.anl.gov
       
    Tue Jan  7 18:40:39 CST 2014
    
    
  
Hi Mihael and/or David,
Can you look at this run on beagle and provide a diagnosis?
  -rw-r--r-- 1 mattshax ci-users 33307656 Jan  7 18:20
  /lustre/beagle/mattshax/swifthome.20140107/sweep8-20140107-1812-obopd7ad.log
It's an EnergyPlus run by Matthew of SOM.
The progress ticker shows:
login1$ grep -i progresstick *ad.log
2014-01-07 18:12:26,570-0600 INFO  RuntimeStats$ProgressTicker 
2014-01-07 18:12:33,600-0600 INFO  RuntimeStats$ProgressTicker   Initializing:3
2014-01-07 18:12:34,605-0600 INFO  RuntimeStats$ProgressTicker   Initializing:7297  Selecting site:1803
2014-01-07 18:12:38,556-0600 INFO  RuntimeStats$ProgressTicker   Selecting site:9097  Submitting:3
2014-01-07 18:12:43,585-0600 INFO  RuntimeStats$ProgressTicker   Submitting:9099  Submitted:1
2014-01-07 18:12:44,580-0600 INFO  RuntimeStats$ProgressTicker   Submitting:7635  Submitted:1465
2014-01-07 18:12:45,580-0600 INFO  RuntimeStats$ProgressTicker   Submitting:1014  Submitted:8086
2014-01-07 18:12:56,570-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
2014-01-07 18:13:26,571-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
...
2014-01-07 18:19:26,573-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
2014-01-07 18:19:56,573-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
(at which time it was killed)
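A minimal sketch, assuming only the ticker line format shown in the grep output above, of how one might measure how long the counts sat unchanged before the kill. The function names (parse_ticks, trailing_stall) are hypothetical helpers for illustration, not part of Swift:

```python
import re
from datetime import datetime

# Matches lines such as:
# 2014-01-07 18:12:56,570-0600 INFO  RuntimeStats$ProgressTicker   Submitted:9100
TICK = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}[+-]\d{4}\s+INFO\s+"
    r"RuntimeStats\$ProgressTicker\s*(.*)$"
)
# A state name (may contain spaces, e.g. "Selecting site") followed by a count.
STATE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_ticks(lines):
    """Return a list of (timestamp, {state: count}) for each ticker line."""
    ticks = []
    for line in lines:
        m = TICK.match(line.strip())
        if not m:
            continue  # not a ticker line (e.g. the "..." elision)
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        states = {name.strip(): int(n) for name, n in STATE.findall(m.group(2))}
        ticks.append((when, states))
    return ticks


def trailing_stall(ticks):
    """Return (start, end, states) for the final run of identical counts."""
    if not ticks:
        return None
    end, last_states = ticks[-1]
    start = end
    for when, states in reversed(ticks):
        if states != last_states:
            break
        start = when
    return start, end, last_states


if __name__ == "__main__":
    import sys
    result = trailing_stall(parse_ticks(sys.stdin))
    if result:
        start, end, states = result
        print(f"counts unchanged from {start} to {end} ({end - start}): {states}")
```

It could be fed the same grep output, for example: grep -i progresstick *ad.log | python stall.py (the script name is arbitrary). It only reports how long the last set of counts persisted; it says nothing about why the jobs never left the Submitted state.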
Beagle had abundant (300+) free nodes, and many PBS jobs started for the run. It seems, though, that workers started timing out around 18:14. I can't tell whether any workers managed to start any work.
This has happened several times (on 0.94.1).  I will try to get this app moved to 0.95RC as soon as possible, but for now, Matthew is making good progress with the scripts as-is (modulo these timeout situations).
He thought, from earlier debugging, that the timeouts were due to actual app failures (e.g., caused by bad app config files), but I can't see how that could be happening.
Any assessment or diagnosis of this situation would be appreciated.
Thanks,
- Mike
-- 
Michael Wilde
Mathematics and Computer Science | Computation Institute
Argonne National Laboratory      | The University of Chicago
    
    