[Swift-devel] Walltime exceeded error

Jonathan jonmon at mcs.anl.gov
Mon Feb 20 10:11:24 CST 2012


Yes.  I will. 



On Feb 20, 2012, at 9:35, Michael Wilde <wilde at mcs.anl.gov> wrote:

> Jon, can you try another run on PADS with these changes:
> 
> - 1 slot instead of 192 to keep the log much smaller
> - n=20 instead of 1000 (ditto)
> - t=70 to make sure that the app() runtime exceeds the specified maxwalltime by a sufficient margin
> - local:pbs instead of ssh:pbs to stay closer to the config where the problem occurred
> - Beagle if possible (one node in the scalability or development queue), with the same Java as used in the failing case
> 
> Mike
> 
> ----- Original Message -----
>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>> To: "Swift Devel" <swift-devel at ci.uchicago.edu>
>> Sent: Sunday, February 19, 2012 6:08:10 PM
>> Subject: [Swift-devel] Walltime exceeded error
>> Hello,
>> I have spent the better part of today trying to reproduce the
>> maxwalltime issue we have been seeing. My most recent run is at
>> /home/jonmon/PADS/Swift/tests/catsnsleep
>> 
>> This run does not reproduce the issue. In fact, it shows that the
>> workers shut down and the restart takes over. It also shows that
>> 120 jobs failed, but I believe that is because the retries were
>> exceeded on those jobs.
>> 
>> The run where the issue was seen was on Beagle and is located at
>> /home/jonmon/public_html/Swift/bugs/SciColSim/run002.
>> There is a log file in that directory that you should be able to view
>> to see the issue and perhaps clarify why the execution just hung and
>> made no progress. We thought that the job would be killed and then
>> retried once it exceeded the wall time we provided. It looks like
>> the job was killed but was not restarted. This script is very
>> complicated but does produce the issue when run long enough.
>> 
>> Maybe Mihael can provide some insight into what was going on in the
>> code when the run hung on Beagle. The hang checker never kicked in, so
>> Swift thought it was making progress when in fact it
>> was not. Perhaps this issue is Beagle-specific (not sure what that
>> means). I am going to try a run of the same scale on PADS and see if
>> it completes (although it may take longer, as PADS does not have the
>> computing power that Beagle does).
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 


