[Swift-user] Help with resuming swift job

Lorenzo Pesce lpesce at uchicago.edu
Sat Apr 21 18:04:54 CDT 2012


I plan to investigate more carefully what exactly happened, because too many people were changing things at once.
I will try to reproduce the error, but that might be difficult or impossible at this point because I have since modified the script to avoid conflicts.
In this directory, sequential Swift runs were made with identically named tc, cf and sites files, which would have been no problem if people
had not run them at the same time or in a chaotic way. Since concurrent runs do seem to be a possibility, we modified the script so that each
run uses differently named files.
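
For anyone in the same situation, here is a rough sketch of the renaming idea. It is not our actual script: the helper name make_run_configs, the tag format and the example file names are made up for illustration, and it only copies the shared config files to uniquely named per-run copies so that two concurrent runs never share a tc, cf or sites file.

```python
import os
import shutil
import uuid


def make_run_configs(src_files, run_dir, run_tag=None):
    """Copy shared config files to uniquely named per-run copies.

    Hypothetical helper: run_tag defaults to a short random hex string,
    so two users starting runs at the same time get different file names.
    Returns a dict mapping each original base name to its per-run copy.
    """
    tag = run_tag or uuid.uuid4().hex[:8]
    os.makedirs(run_dir, exist_ok=True)
    copies = {}
    for src in src_files:
        base = os.path.basename(src)
        # Prefix the tag so the copy keeps its original extension.
        dst = os.path.join(run_dir, f"{tag}.{base}")
        shutil.copy2(src, dst)
        copies[base] = dst
    return copies
```

The returned paths would then be passed to the swift command line in place of the shared files; how exactly depends on the wrapper script.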

I will let you know ASAP if this happens again.

On Apr 21, 2012, at 5:32 PM, Michael Wilde wrote:

> Hi Lorenzo,
> 
> I did a quick check into this problem. As far as I can tell, the .rlog restart file needed to resume this run does not exist. Did it get removed manually, or do you think it got removed by Swift? (Swift may remove it when the run completes successfully, but I need to check on that.)
> 
> - Mike
> 
> ----- Original Message -----
>> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
>> To: swift-user at ci.uchicago.edu
>> Sent: Friday, April 20, 2012 11:43:37 AM
>> Subject: [Swift-user] Help with resuming swift job
>> It is me again ;-)
>> 
>> I seem to have let too many beagle users run Swift.... ;-)
>> 
>> Here we have a few different people running from the same
>> filesystem (the project filesystem; in this case the person who sent
>> the first batch is on a plane right now). Because of some quirks in
>> Beagle's group permissions (namely, they don't work right as far as I
>> can tell), we sometimes need to change file permissions.
>> 
>> I changed the file permission,
>> Swift 0.93 swift-r5483 cog-r3339
>> 
>> and sent swift out with -resume
>> 
>> RunID: 20120420-1622-6siopoz8
>> Execution failed:
>> Could not aquire exclusive lock on log file:
>> /lustre/beagle/GCNet/grasping/44neurons_SR0_200ms_5Windows_v1/causal_test-20120420-1433-k1eh3b2a.0.rlog
>> 
>> Is there a lock file that needs to be changed?
>> 
>> 
>> The motivation for this is that some of the files failed for two
>> predictable reasons: they took too long and/or they ran out of memory.
>> There are also other issues, but they are not relevant at this point
>> because they aren't solved (they need a new optimization, which we did
>> not have time to implement). Usually we send the first batch with
>> short times and fewer nodes, which does 98% of the work, and resume the
>> remainder by hacking the sites.xml file (we welcome better strategies,
>> which most of you have hinted at in the past). This was an attempt to
>> rerun after a crash.
>> 
>> BTW, my post-mortem investigation seems to suggest that one of the
>> users actually killed the swift script by mistake, wrongly changed a
>> privilege in flight, or something like that, as opposed to the script
>> failing or running out of time. Short of torture, it does not seem he
>> will confess more than this.
>> 
>> Thanks a million as usual.
>> 
>> Lorenzo
>> 
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 



