[Swift-devel] Walltime exceeded error
Michael Wilde
wilde at mcs.anl.gov
Mon Feb 20 09:35:33 CST 2012
Jon, can you try another run on PADS with these changes:
- 1 slot instead of 192 to keep the log much smaller
- n=20 instead of 1000 (ditto)
- t=70 to make sure that the app() runtime exceeds the specified maxwalltime by a clear margin
- local:pbs instead of ssh:pbs to stay closer to the config where the problem occurred
- Beagle if possible (one node in the scalability or development queue), with the same Java as used in the failing case
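
For reference, the behavior this test is meant to exercise (an app that runs past its wall time gets killed and then retried) can be sketched roughly as below. This is only an illustration in plain Python, not how Swift or the workers actually implement it; run_with_walltime, sleep_cmd, and their parameters are hypothetical names made up for the sketch.

```python
import subprocess
import sys


def run_with_walltime(cmd, max_walltime_s, retries):
    """Run cmd, kill it if it exceeds max_walltime_s, and retry.

    Hypothetical helper: returns the attempt number that succeeded,
    or raises RuntimeError once the first try plus all retries fail.
    """
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        try:
            subprocess.run(cmd, timeout=max_walltime_s, check=True)
            return attempt
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before re-raising.
            print(f"attempt {attempt}: walltime exceeded, job killed")
    raise RuntimeError("retries exhausted")


def sleep_cmd(seconds):
    """Stand-in for the app(): a child process that just sleeps."""
    return [sys.executable, "-c", f"import time; time.sleep({seconds})"]
```

Calling run_with_walltime(sleep_cmd(70), 60, 2) would mimic a t=70 job against a 60-second limit (the 60 is an assumed value): every attempt is killed and the retries run out, which is the kind of failure expected here rather than a hang.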
Mike
----- Original Message -----
> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Sunday, February 19, 2012 6:08:10 PM
> Subject: [Swift-devel] Walltime exceeded error
> Hello,
> I have spent the better part of today trying to reproduce
> this maxwalltime issue we have been witnessing. My most recent
> run is at /home/jonmon/PADS/Swift/tests/catsnsleep
>
> This run does not produce the issue. In fact, it shows that the
> workers shut down and the restart takes over. It also shows that
> 120 jobs failed, but I believe that is because the retries were
> exceeded on those jobs.
>
> The run in question where this was being witnessed was on Beagle and
> is located at /home/jonmon/public_html/Swift/bugs/SciColSim/run002.
> There is a log file in that directory that you should be able to view
> and see the issue and perhaps clarify why the execution just hung and
> made no progress. We thought that the job would be killed and then
> retried once its run time exceeded the wall time we provided. It looks like
> the job was killed but was not restarted. This script is very
> complicated, but it does reproduce the issue when run long enough.
>
> Maybe Mihael can provide some insight into what was going on in the code
> when it hung on Beagle. The hang checker never kicked in, so
> Swift thought it was making progress when in fact it
> was not. Perhaps this issue is Beagle-specific (not sure what that
> means). I am going to try a run of the same scale on PADS and see if
> it completes (although it may take longer, as PADS does not have the
> computing power that Beagle does).
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory