[Swift-devel] 100K job script hangs at 30K jobs

Mihael Hategan hategan at mcs.anl.gov
Sat Aug 6 21:29:48 CDT 2011


The problem was dying workers combined with the system not noticing
that they had died, so zombie jobs would slowly fill the throttle
(which was set to 10 in this case). I backported the dead worker
detection code from trunk. Combined with retries, this should take care
of the problem, but it may be worth looking into why the workers were
dying.
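
To illustrate the failure mode and the fix, here is a minimal sketch in
Python. It is not the coaster code: the names (Throttle,
reap_dead_workers), the heartbeat-timeout approach to detecting a dead
worker, and the numbers in the demo are all assumptions made for the
example. The idea is only that a job on a dead worker never reports
completion, so its throttle slot must be released explicitly and the job
handed back for a retry.

```python
class Throttle:
    """Hypothetical throttle: at most `limit` jobs may be submitted at once."""

    def __init__(self, limit):
        self.limit = limit
        self.running = {}  # job_id -> worker_id currently holding a slot

    def has_free_slot(self):
        return len(self.running) < self.limit

    def start(self, job_id, worker_id):
        if not self.has_free_slot():
            raise RuntimeError("throttle full")
        self.running[job_id] = worker_id

    def finish(self, job_id):
        self.running.pop(job_id, None)


def reap_dead_workers(throttle, last_heartbeat, now, timeout, retry_queue):
    """Free slots held by jobs on silent workers and queue those jobs for retry.

    Assumption: a worker counts as dead when its last heartbeat is older
    than `timeout` seconds. Returns the set of workers declared dead.
    """
    dead = {w for w, t in last_heartbeat.items() if now - t > timeout}
    for job_id, worker_id in list(throttle.running.items()):
        if worker_id in dead:
            throttle.finish(job_id)      # release the zombie job's slot
            retry_queue.append(job_id)   # let the retry logic resubmit it
    for w in dead:
        del last_heartbeat[w]
    return dead


# Demo: throttle of 10, seven jobs on a worker that went silent at t=0.
throttle = Throttle(limit=10)
heartbeats = {"w1": 0.0, "w2": 95.0}
for i in range(10):
    throttle.start(f"job{i}", "w1" if i < 7 else "w2")

retries = []
dead = reap_dead_workers(throttle, heartbeats, now=100.0, timeout=30.0,
                         retry_queue=retries)
```

Without the reaping step, the seven jobs on w1 would hold their slots
forever and the throttle would stay full, which is how a run ends up
sitting at "Submitted:10" with nothing finishing.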

On Sat, 2011-08-06 at 13:34 -0500, Michael Wilde wrote:
> Mihael,
> 
> A later catsn test, started this morning, hung at 30K of 100K catsn jobs.
> 
> Swift was still printing progress but not progressing beyond:
> 
> Progress:  time: Sat, 06 Aug 2011 13:29:08 -0500  Selecting site:1014  Submitted:10  Finished successfully:30329
> 
> I had stopped it earlier in the morning, then resumed it to get a jstack.
> 
> Logs and stack traces of both the swift and coaster service JVMs are in:
>   /home/wilde/swiftgrid/test.swift-workers/logs.07
> 
> - Mike




