[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory

Mihael Hategan hategan at mcs.anl.gov
Mon Aug 8 20:58:24 CDT 2011


On Mon, 2011-08-08 at 20:39 -0500, Michael Wilde wrote: 
> I'm now running Swift svn swift-r4965 cog-r3225
> 
> A 100K-catsn script ran to completion.
> 
> Then a 500K-catsn script terminated at ~ 15K jobs with the error below.
> 
> Logs are in /home/wilde/swiftgrid/test.swift-workers

Judging from the error message, your workers are dying for unknown
reasons. I see only two applications that failed (and they have distinct
arguments), so I'm guessing you turned off retries. At a failure rate of
roughly 2 in 15K jobs, setting retries to at least 1 would dramatically
reduce the odds of the failure happening twice for the same app.
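
As a rough sketch of the arithmetic behind that estimate, the snippet
below assumes that every attempt fails independently with the same
probability (2 in 15K). That assumption may not hold if failures cluster
on a particular worker or site. The function names are made up for
illustration and are not part of Swift.

```python
# Rough estimate only: assumes each attempt fails independently with the
# same probability. Real failures may be correlated (e.g. one bad worker).

def failure_after_retries(p_attempt, retries):
    """Probability that a job fails on its first attempt and on every retry."""
    return p_attempt ** (retries + 1)


def expected_failed_jobs(n_jobs, p_attempt, retries):
    """Expected number of jobs that still fail after all retries are used."""
    return n_jobs * failure_after_retries(p_attempt, retries)


p = 2 / 15000   # observed: 2 failed apps in roughly 15K jobs
n = 500_000     # size of the 500K-catsn run

for r in (0, 1, 2):
    print(
        f"retries={r}: per-job failure {failure_after_retries(p, r):.3g}, "
        f"expected failed jobs out of {n}: {expected_failed_jobs(n, p, r):.3g}"
    )
```

Under these assumptions, a 500K-job run would expect about 67 failed
jobs with no retries, and well under one failed job with a single retry.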

Do you know where swork:14 and swork:29 ran? (It may be useful to name
workers based on their site.)

Also, if you want to troubleshoot the workers, worker logging may help.

Mihael





