[Swift-devel] Please look at hung run on Beagle

Michael Wilde wilde at mcs.anl.gov
Tue Jan 7 21:11:38 CST 2014


> Lorenzo and his group were running some lustre intensive jobs, so
> lustre was rather unresponsive.

Indeed, lustre responsiveness has been an issue for the SOM jobs, mainly by stretching their walltimes much longer than expected. But we're working around that.

> If this happened in the past day or two, I
> would try again.
> 
> If not, then a jstack on the java process (both swift and coaster
> service if separate) might shed some light on the issue.

Matthew just found the problem. He'd been trying to fit tasks with a maxwalltime of 59:00 into 1-hour maxtime slots. After the 1-minute deduction, the tasks apparently no longer fit. This is an old problem that I think we have a ticket on. It results in what is essentially a livelock: the coaster workers repeatedly time out because they get no work, but the scheduler can never give them work because the only available tasks don't fit in the slots.

Did we discuss how we should remedy this? It seems we should generate a message stating that no sufficiently large slots exist, which would immediately tell the user why the run is not progressing.
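
To make the suggestion concrete, here is a rough sketch of such a check. It is not the coaster scheduler's code: the function names, the use of seconds, and the assumption that the 1-minute deduction comes off the slot's maxtime are all hypothetical. The strict comparison is likewise an assumption, chosen only because it is one way a 59:00 task could fail to fit in a 1-hour slot after the deduction.

```python
RESERVE_SECONDS = 60  # the 1-minute deduction mentioned above


def usable_seconds(slot_maxtime_s, reserve_s=RESERVE_SECONDS):
    """Time left in a slot once the reserve is deducted (assumed to come off the slot)."""
    return max(0, slot_maxtime_s - reserve_s)


def fits(task_walltime_s, slot_maxtime_s, reserve_s=RESERVE_SECONDS):
    """Assumption: a task must be strictly shorter than the usable slot time."""
    return task_walltime_s < usable_seconds(slot_maxtime_s, reserve_s)


def unschedulable_warning(task_walltimes_s, slot_maxtimes_s, reserve_s=RESERVE_SECONDS):
    """Return a warning string if no queued task fits any slot, else None."""
    if not task_walltimes_s:
        return None  # nothing queued, so nothing to warn about
    largest = max((usable_seconds(s, reserve_s) for s in slot_maxtimes_s), default=0)
    shortest = min(task_walltimes_s)
    if shortest < largest:
        return None  # at least one task fits at least one slot
    return (
        f"No slot is large enough for any queued task: shortest task needs "
        f"{shortest}s, largest slot offers {largest}s after a {reserve_s}s deduction"
    )


if __name__ == "__main__":
    # The case from this thread: 59:00 tasks and 1-hour slots.
    print(unschedulable_warning([59 * 60], [60 * 60]))
```

Running a check like this once at submission time, or whenever the workers time out with tasks still queued, would surface the mismatch right away instead of leaving the run to spin.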

- Mike


