[Swift-user] Help with resuming swift job

Lorenzo Pesce lpesce at uchicago.edu
Mon Apr 23 15:24:16 CDT 2012


So far the problem could not be reproduced. I suspect there was some crossfire in the folder, with more than one user touching the run directory at the same time.
My apologies for raising a false alarm.
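
For anyone who hits the same lock error, here is a minimal sketch (not part of Swift) of a check that could be run before -resume. It only tests whether the restart log named in the error message exists and is readable and writable by the current user. The function name check_restart_log is made up for illustration, and the check says nothing about whether another process currently holds the lock.

```python
import os
import sys


def check_restart_log(path):
    """Return a list of human-readable problems with the restart log at `path`.

    Hypothetical helper: an empty list means the file exists and the
    current user can read and write it. It does not detect a lock held
    by another running process.
    """
    problems = []
    if not os.path.exists(path):
        problems.append("missing: " + path)
        return problems
    if not os.access(path, os.R_OK | os.W_OK):
        problems.append("not readable and writable by the current user: " + path)
    return problems


if __name__ == "__main__":
    # Usage: python check_rlog.py /path/to/run.0.rlog [more .rlog files...]
    for rlog in sys.argv[1:]:
        for problem in check_restart_log(rlog):
            print(problem)
```

If the script prints nothing for a given .rlog path, the file is at least present and accessible, so the cause of the lock failure is probably elsewhere (for example another Swift process still running in the same directory).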

On Apr 21, 2012, at 5:32 PM, Michael Wilde wrote:

> Hi Lorenzo,
> 
> I did a quick check into this problem. As far as I can tell, the .rlog restart file needed to resume this run does not exist. Did it get removed manually, or do you think Swift removed it? (Swift may remove it when the run completes successfully, but I need to check on that.)
> 
> - Mike
> 
> ----- Original Message -----
>> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
>> To: swift-user at ci.uchicago.edu
>> Sent: Friday, April 20, 2012 11:43:37 AM
>> Subject: [Swift-user] Help with resuming swift job
>> It is me again ;-)
>> 
>> I seem to have let too many beagle users run Swift.... ;-)
>> 
>> Here we have a few different people running from the same
>> filesystem (the project file system; in this case the person who
>> sent the first batch is on a plane right now). Because of some quirks
>> in Beagle's group permissions (namely, they don't work right as far
>> as I can tell), we sometimes need to change file permissions:
>> 
>> I changed the file permissions,
>> Swift 0.93 swift-r5483 cog-r3339
>> 
>> and sent swift out with -resume
>> 
>> RunID: 20120420-1622-6siopoz8
>> Execution failed:
>> Could not aquire exclusive lock on log file:
>> /lustre/beagle/GCNet/grasping/44neurons_SR0_200ms_5Windows_v1/causal_test-20120420-1433-k1eh3b2a.0.rlog
>> 
>> Is there a lock file that needs to be changed?
>> 
>> 
>> The motivation for this is that some of the files failed for two
>> predictable reasons: they took too long and/or they blew the memory.
>> There are also other issues, but they are not relevant at this point
>> because they aren't solved (they need a new optimization, which we
>> did not have time to implement). Usually we send the first batch with
>> short times and fewer nodes, which does 98% of the work, and then
>> resume the remainder by hacking the sites.xml file (we welcome better
>> strategies, which most of you have hinted at in the past). This was an
>> attempt to rerun after a crash.
>> 
>> BTW, my post-mortem investigation seems to suggest that one of the
>> users actually killed the swift script by mistake, wrongly changed a
>> privilege in flight, or something like that, as opposed to the script
>> failing or running out of time. Short of torture, it does not seem he
>> will confess more than this.
>> 
>> Thanks a million as usual.
>> 
>> Lorenzo
>> 
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 



