[Swift-devel] Walltime exceeded error
Jonathan Monette
jonmon at mcs.anl.gov
Sun Feb 19 18:08:10 CST 2012
Hello,
I have spent the better part of today trying to reproduce the maxwalltime issue we have been witnessing. My most recent run, located at /home/jonmon/PADS/Swift/tests/catsnsleep, does not produce the issue.
In fact, it shows that the workers shut down and the restart takes over. It also shows 120 failed jobs, but I believe that is because those jobs exceeded their retry limit.
The run where the issue was witnessed was on Beagle and is located at /home/jonmon/public_html/Swift/bugs/SciColSim/run002. There is a log file in that directory that you should be able to view to see the issue, and perhaps to clarify why the execution just hung and made no progress. We thought that the job would be killed and then retried once it exceeded the wall time we provided. It looks like the job was killed but was not restarted. The script is very complicated, but it does produce the issue when run long enough.
Maybe Mihael can provide some insight into what was going on in the code when it hung on Beagle. The hang checker never kicked in, so Swift thought it was making progress when in fact it was not. Perhaps this issue is Beagle-specific (I am not sure what that would mean). I am going to try a run of the same scale on PADS and see if it completes, although it may take longer since PADS does not have the computing power that Beagle does.