[Swift-user] Problems with PBS on Beagle for swift.0.94-2012.1102

Lorenzo Pesce lpesce at uchicago.edu
Wed Dec 5 19:40:32 CST 2012


On Dec 5, 2012, at 5:53 PM, Michael Wilde wrote:

> Hi Lorenzo,
> 
> The "513" error comes from the coaster worker, and means that the app() call exceeded its maxwalltime.  This maxwalltime is specified first in the sites file (which applies to all apps on the site) and can be overridden for a given app by specifying maxwalltime on the app's entry in the tc file.
> 
> So, yes, if you specified maxwalltime as 16:40:00 in the sites file and did *not* override that in the tc file, then that is how long the app() call should be allowed to run on the compute node before the coaster worker kills it and returns a 513 error.

Yes, I understand that and the values match. 

However, I have been told that the first of the two settings below (maxTime) is the time the coasters request in the PBS submission, while maxwalltime is used to plan how to run the apps: it is the time each app is expected to take inside the coasters.

    <profile namespace="globus" key="maxTime">168000</profile>
    <profile namespace="globus" key="maxwalltime">2:0:00</profile>

This is how it seemed to work until version 0.94. Has it changed? What do these two settings mean now?

Have I missed something?
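
To compare the two settings above in common units, here is a minimal sketch. It assumes maxTime is given in seconds and maxwalltime as H:MM:SS, which is what the values in this thread suggest. The helper names are hypothetical and not part of Swift.

```python
def hms_to_seconds(text):
    """Convert an H:MM:SS string such as '2:0:00' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total):
    """Convert a number of seconds to an H:MM:SS string."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Values copied from the two profile lines above.
max_time_s = 168000                         # maxTime, assumed to be seconds
max_walltime_s = hms_to_seconds("2:0:00")   # maxwalltime, assumed H:MM:SS

print(seconds_to_hms(max_time_s))      # 46:40:00
print(max_walltime_s)                  # 7200
print(max_time_s // max_walltime_s)    # 23 whole app slots per coaster block
```

Under these assumptions, a 168000-second block is 46 hours 40 minutes, long enough for 23 back-to-back app runs of 2 hours each.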

> 
> From your logs, assuming the app started shortly after Swift started:
> 
> Started at: Tue, 04 Dec 2012 15:05:14
> Got 513 at: Wed, 05 Dec 2012 07:44:53
> 
> 24:00 - 15:05 = 8:55
>              + 7:44
>               15:99 = 16:39
> 
> which is what you asked for, minus the 1 min that Swift subtracts for overhead.
> 
> - Mike
> 
> 
> 
> 
> 
> ----- Original Message -----
>> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
>> To: swift-user at ci.uchicago.edu
>> Cc: "Jason J. Pitt" <pittjj at uchicago.edu>
>> Sent: Wednesday, December 5, 2012 9:18:39 AM
>> Subject: [Swift-user] Problems with PBS on Beagle for swift.0.94-2012.1102
>> From the logs:
>> 
>> 
>> Swift trunk swift-r6003 cog-r3497
>> RunID: 20121204-1504-llt6yhhb
>> Progress: time: Tue, 04 Dec 2012 15:04:44 +0000
>> Progress: time: Tue, 04 Dec 2012 15:04:53 +0000 Submitted:1 Active:1
>> Progress: time: Tue, 04 Dec 2012 15:05:14 +0000 Active:2
>> 
>> 
>> <snip>
>> 
>> 
>> 
>> Progress: time: Wed, 05 Dec 2012 07:44:53 +0000 Active:1 Failed but can retry:1
>> Execution failed:
>> Walltime exceeded
>> org.globus.cog.abstraction.impl.common.execution.JobException: Walltime exceeded (exit code: 513)
>> at org.globus.cog.abstraction.coaster.service.local.JobStatusHandler.requestComplete(JobStatusHandler.java:38)
>> at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:91)
>> at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:505)
>> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.stepNIO(AbstractStreamKarajanChannel.java:238)
>> at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.loop(NIOMultiplexer.java:97)
>> at org.globus.cog.karajan.workflow.service.channels.NIOMultiplexer.run(NIOMultiplexer.java:56)
>> (exit code: 513)
>> 
>> 
>> 
>> 
>> By my reckoning it is 15 hours or 900 minutes or 54000 seconds
>> 
>> 
>> From my sites file:
>> 
>> 
>> 
>> <profile namespace="globus" key="maxTime">172800</profile>
>> 
>> <profile namespace="globus" key="maxwalltime">16:40:00</profile>
>> 
>> 
>> Isn't maxwalltime the time expected from a single run app as opposed
>> to the time allocated to the coaster?
>> 
>> 
>> Has something changed in the way swift works?
>> 
>> 
>> Lorenzo
>> 
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
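
The elapsed-time arithmetic in Mike's reply can be checked with a short script. This is only a sketch: the two timestamps are the ones quoted from the log, the 16:40:00 limit is the maxwalltime from the sites file, and the 1-minute overhead is taken from Mike's description rather than from the Swift source.

```python
from datetime import datetime, timedelta

# Timestamps quoted from the log in this thread.
started = datetime(2012, 12, 4, 15, 5, 14)   # Tue, 04 Dec 2012 15:05:14
failed = datetime(2012, 12, 5, 7, 44, 53)    # Wed, 05 Dec 2012 07:44:53

elapsed = failed - started

# maxwalltime from the sites file, and the overhead described in the reply.
requested = timedelta(hours=16, minutes=40)
overhead = timedelta(minutes=1)   # assumption: taken from the reply above
effective_limit = requested - overhead

print(elapsed)                    # 16:39:39
print(effective_limit)            # 16:39:00
print(elapsed - effective_limit)  # 0:00:39
```

With these inputs the app ran about 39 seconds past the effective limit of 16:39:00, which is consistent with the worker killing it for exceeding maxwalltime.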



