[Swift-devel] Please look at hung run on Beagle

David Kelly davidk at ci.uchicago.edu
Tue Jan 7 19:11:00 CST 2014


I think there might be an issue with the sites.xml formatting there too. It
looks like sites.xml has multiple <config> and </config> tags.
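
A quick way to check this is to count the opening and closing tags. The sketch below assumes the file is expected to contain a single <config> element, which is an assumption and has not been confirmed against the file itself. The function name count_config_tags is hypothetical, and grep -o is a GNU/BSD extension rather than strict POSIX.

```shell
# Count opening and closing <config> tags in a Swift sites file.
# count_config_tags is a hypothetical helper name, not part of Swift.
count_config_tags() {
    file=$1
    # grep -o prints every match on its own line, so wc -l counts
    # occurrences rather than matching lines. The pattern '<config[ >]'
    # matches "<config>" and "<config attr=...", but not "</config>".
    opens=$(grep -o '<config[ >]' "$file" | wc -l | tr -d ' ')
    closes=$(grep -o '</config>' "$file" | wc -l | tr -d ' ')
    echo "open=$opens close=$closes"
}
```

Calling `count_config_tags sites.xml` prints both counts on one line. Counts above 1, or counts that differ from each other, would be worth a closer look at the file.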


On Tue, Jan 7, 2014 at 6:46 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Lorenzo and his group were running some lustre intensive jobs, so lustre
> was rather unresponsive. If this happened in the past day or two, I
> would try again.
>
> If not, then a jstack on the java process (both swift and coaster
> service if separate) might shed some light on the issue.
>
> Mihael
>
> On Tue, 2014-01-07 at 18:40 -0600, Michael Wilde wrote:
> > Hi Mihael and/or David,
> >
> > Can you look at this run on beagle and provide a diagnosis?
> >
> >   -rw-r--r-- 1 mattshax ci-users 33307656 Jan  7 18:20 /lustre/beagle/mattshax/swifthome.20140107/sweep8-20140107-1812-obopd7ad.log
> >
> > It's an EnergyPlus run by Matthew of SOM.
> >
> > The progress ticker shows:
> >
> > login1$ grep -i progresstick *ad.log
> > 2014-01-07 18:12:26,570-0600 INFO  RuntimeStats$ProgressTicker
> > 2014-01-07 18:12:33,600-0600 INFO  RuntimeStats$ProgressTicker Initializing:3
> > 2014-01-07 18:12:34,605-0600 INFO  RuntimeStats$ProgressTicker Initializing:7297  Selecting site:1803
> > 2014-01-07 18:12:38,556-0600 INFO  RuntimeStats$ProgressTicker Selecting site:9097  Submitting:3
> > 2014-01-07 18:12:43,585-0600 INFO  RuntimeStats$ProgressTicker Submitting:9099  Submitted:1
> > 2014-01-07 18:12:44,580-0600 INFO  RuntimeStats$ProgressTicker Submitting:7635  Submitted:1465
> > 2014-01-07 18:12:45,580-0600 INFO  RuntimeStats$ProgressTicker Submitting:1014  Submitted:8086
> > 2014-01-07 18:12:56,570-0600 INFO  RuntimeStats$ProgressTicker Submitted:9100
> > 2014-01-07 18:13:26,571-0600 INFO  RuntimeStats$ProgressTicker Submitted:9100
> > ...
> > 2014-01-07 18:19:26,573-0600 INFO  RuntimeStats$ProgressTicker Submitted:9100
> > 2014-01-07 18:19:56,573-0600 INFO  RuntimeStats$ProgressTicker Submitted:9100
> > (at which time it was killed)
> >
> > Beagle had abundant (300+) free nodes, and many PBS jobs started for the
> > run. It seems, though, that workers started timing out around 18:14. I
> > can't tell whether any workers were getting any work started.
> >
> > This has happened several times (on 0.94.1). I will try to get this app
> > moved to 0.95RC as soon as possible, but for now, Matthew is making good
> > progress with the scripts as-is (modulo these timeout situations).
> >
> > He thought, from earlier debugging, that the timeouts were due to actual
> > app failures (e.g., caused by bad app config files), but I can't see how
> > that could be happening.
> >
> > Any assessment or diagnosis of this situation would be appreciated.
> >
> > Thanks,
> >
> > - Mike
> >
>
>
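
The progress-ticker excerpt quoted above can be summarized mechanically to show when the run stopped making progress. This is only a sketch: it assumes each ticker entry sits on a single line in the actual log (date, time, level, logger name, then the state counts), and the function name ticker_stall is hypothetical.

```shell
# Report the final run of identical ProgressTicker states in a Swift log.
# Assumed line layout (not verified against the real log file):
#   DATE TIME INFO RuntimeStats$ProgressTicker STATE...
# ticker_stall is a hypothetical helper name.
ticker_stall() {
    grep 'RuntimeStats\$ProgressTicker' "$1" | awk '
        {
            ts = $1 " " $2
            # Rebuild the state text from field 5 onward, single-spaced.
            state = ""
            for (i = 5; i <= NF; i++) state = state (i > 5 ? " " : "") $i
            # A new state starts a new run; remember when it began.
            if (NR == 1 || state != prev) { since = ts; n = 0; prev = state }
            n++
        }
        END {
            if (NR > 0)
                printf "%s seen %d times since %s\n", prev, n, since
        }
    '
}
```

Running `ticker_stall` on the run log prints the last state, how many consecutive ticker entries reported it, and the timestamp of the first of those entries. That timestamp can then be compared with the time the workers started timing out.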