[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory
Mihael Hategan
hategan at mcs.anl.gov
Mon Aug 8 20:58:24 CDT 2011
On Mon, 2011-08-08 at 20:39 -0500, Michael Wilde wrote:
> I'm now running Swift svn swift-r4965 cog-r3225
>
> A 100K-catsn script ran to completion.
>
> Then a 500K-catsn script terminated at ~ 15K jobs with the error below.
>
> Logs are in /home/wilde/swiftgrid/test.swift-workers
Judging from the error message, your workers are dying for unknown
reasons. I see only two applications that failed (and they have distinct
arguments), so I'm guessing you turned off retries. With a failure
probability of about 2 in 15K per job, setting retries to at least 1
would dramatically reduce the odds of the same app failing twice.
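As a rough sketch of that arithmetic (not a measurement): the snippet
below takes the 2-in-15K figure as the per-attempt failure probability
and assumes attempts fail independently. That assumption may not hold
if a retry lands on the same dying worker. The function name and the
500K total are illustrative, taken from the run size mentioned above.

```python
# Observed figures from this thread: about 2 failed apps in ~15K jobs.
failures = 2
jobs_run = 15_000
total_jobs = 500_000  # size of the 500K-catsn run

# Per-attempt failure probability (assumption: attempts are independent).
p = failures / jobs_run


def p_all_attempts_fail(p, retries):
    """Probability that a job fails on its first attempt and on every retry."""
    return p ** (retries + 1)


for retries in (0, 1, 2):
    p_job = p_all_attempts_fail(p, retries)
    expected = total_jobs * p_job
    print(f"retries={retries}: P(job fails)={p_job:.3g}, "
          f"expected failed jobs={expected:.3g}")
```

Under these assumptions, a 500K-job run would expect roughly 67 failed
jobs with no retries, and well under one with a single retry.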
Do you know where swork:14 and swork:29 ran? (It may be useful to name
workers based on their site.)
Also, if you want to troubleshoot the workers, worker logging may help.
Mihael