[Swift-user] Help with resuming swift job

Lorenzo Pesce lpesce at uchicago.edu
Fri Apr 20 11:43:37 CDT 2012


It is me again ;-)

I seem to have let too many Beagle users run Swift... ;-)

Several different people here are running from the same filesystem (the project filesystem; the person who sent the first batch is on a plane right now). Because of some quirks in Beagle's group permissions (namely, they don't work right as far as I can tell), we sometimes need to change file permissions by hand.

I changed the file permission and sent Swift out with -resume:

Swift 0.93 swift-r5483 cog-r3339

RunID: 20120420-1622-6siopoz8
Execution failed:
        Could not aquire exclusive lock on log file: /lustre/beagle/GCNet/grasping/44neurons_SR0_200ms_5Windows_v1/causal_test-20120420-1433-k1eh3b2a.0.rlog

Is there a lock file that needs to be changed?
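
In case it helps with diagnosing this, here is a minimal sketch of the check I can run on the .rlog file from the account doing the resume. It assumes (and this is only my guess, not something Swift has confirmed) that the lock failure comes from that user lacking write access to the log file or its directory. The helper name describe_access is made up for illustration; it only reports ownership and access bits and does not touch any Swift lock.

```python
import os
import stat
import sys


def describe_access(path):
    """Report ownership, mode, and the current user's access to `path`.

    Hypothetical helper for illustration only; it does not inspect or
    release any lock held by Swift.
    """
    st = os.stat(path)
    parent = os.path.dirname(os.path.abspath(path))
    return {
        "path": path,
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "mode": stat.filemode(st.st_mode),           # e.g. "-rw-r--r--"
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
        # Creating or replacing files needs write + search on the directory.
        "dir_writable": os.access(parent, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Usage: python check_rlog.py /path/to/run.0.rlog [more files...]
    for p in sys.argv[1:]:
        for key, value in describe_access(p).items():
            print(f"{key}: {value}")
        print()
```

If "writable" or "dir_writable" comes back False for the user doing the resume, the permission change did not take effect the way I thought it did.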


The motivation for this is that some of the files failed for two predictable reasons: they took too long and/or they ran out of memory. There are also other issues, but they are not relevant at this point because they aren't solved (they need a new optimization, which we did not have time to implement). Usually we send the first batch with short times and fewer nodes, which does 98% of the work, and then resume the remainder by hacking the sites.xml file (we welcome better strategies, which most of you have hinted at in the past). This was an attempt to rerun after a crash.

BTW, my post-mortem investigation seems to suggest that one of the users actually killed the Swift script by mistake, wrongly changed a privilege in flight, or something like that, as opposed to the script failing or running out of time. Short of torture, it does not seem he will confess more than this.

Thanks a million as usual.

Lorenzo
 

