[Swift-user] Help with resuming swift job

Michael Wilde wilde at mcs.anl.gov
Sat Apr 21 17:32:37 CDT 2012


Hi Lorenzo,

I took a quick look at this problem. As far as I can tell, the .rlog restart file needed to resume this run does not exist. Did it get removed manually, or do you think Swift removed it? (Swift may remove it when a run completes successfully, but I need to check on that.)
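
In case it is useful, here is a minimal Python sketch for checking whether the restart log is still present, who owns it, and whether the current user can write to it. The path is copied from the error message in your mail; the function name describe_restart_log is made up for illustration and is not part of Swift.

```python
import os
import stat


def describe_restart_log(path):
    """Report whether a restart log exists, who owns it, and its permissions.

    Hypothetical helper for illustration only; not part of Swift.
    """
    if not os.path.exists(path):
        return {"exists": False}
    info = os.stat(path)
    return {
        "exists": True,
        "owner_uid": info.st_uid,                     # numeric uid of the file owner
        "mode": stat.filemode(info.st_mode),          # e.g. "-rw-r--r--"
        "writable_by_me": os.access(path, os.W_OK),   # can the current user write to it?
    }


if __name__ == "__main__":
    # Path taken from the "exclusive lock" error message in the original mail.
    rlog = ("/lustre/beagle/GCNet/grasping/44neurons_SR0_200ms_5Windows_v1/"
            "causal_test-20120420-1433-k1eh3b2a.0.rlog")
    print(describe_restart_log(rlog))
```

If the file turns out to exist but is owned by another user and is not writable by you, that would point to a permissions issue rather than a missing restart log.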

- Mike

----- Original Message -----
> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
> To: swift-user at ci.uchicago.edu
> Sent: Friday, April 20, 2012 11:43:37 AM
> Subject: [Swift-user] Help with resuming swift job
> It is me again ;-)
> 
> I seem to have let too many beagle users run Swift.... ;-)
> 
> Here we have a few different people running from the same
> filesystem (the project file system; in this case the person who sent
> the first batch is on a plane right now). Because of some quirks in
> Beagle's group permissions (namely, they don't work right as far as I
> can tell), we sometimes need to change file permissions:
> 
> I changed the file permission,
> Swift 0.93 swift-r5483 cog-r3339
> 
> and sent swift out with -resume
> 
> RunID: 20120420-1622-6siopoz8
> Execution failed:
> Could not aquire exclusive lock on log file:
> /lustre/beagle/GCNet/grasping/44neurons_SR0_200ms_5Windows_v1/causal_test-20120420-1433-k1eh3b2a.0.rlog
> 
> Is there a lock file that needs to be changed?
> 
> 
> The motivation for this is that some of the files failed for two
> predictable reasons: they took too long and/or they blew the memory.
> There are also other issues, but they are not relevant at this point
> because they aren't solved (they need a new optimization, which we did
> not have time to implement). Usually we send the first batch, with
> short times and fewer nodes, which does 98% of the work, and then resume
> the remainder by hacking the sites.xml file (we welcome better strategies,
> which most of you have hinted at in the past). This was an attempt to
> rerun after a crash.
> 
> BTW, my post-mortem investigation seems to suggest that one of the
> users actually killed the swift script by mistake, wrongly changed a
> privilege in flight, or something like that, as opposed to the script
> failing or running out of time. Short of torture, it does not seem he
> will confess more than this.
> 
> Thanks a million as usual.
> 
> Lorenzo
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



