I suspect that in the most recent run, a bunch of long jobs that failed earlier because they hit their walltime limits have been accumulating for retry, and they are now burning a lot of time by running out of walltime again on every attempt.