[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory

Mihael Hategan hategan at mcs.anl.gov
Mon Aug 8 21:58:59 CDT 2011


On Mon, 2011-08-08 at 21:54 -0500, Michael Wilde wrote:
> > Judging from the error message, your workers are dying for unknown
> > reasons. I see only two applications that failed (and they have
> > distinct arguments), so I'm guessing you turned off retries. At 2/15K
> > failure probability, if you set retries to at least 1, you would get
> > a dramatic decrease in the odds that the failure will happen twice
> > for the same app.
> 
> Good idea, will do.
> 
> So I just realized what's happening here.  Workers can fail (i.e., you
> said you tested killing them) and Swift will keep running, *but* the
> apps that were running on failed workers receive failures and need to
> get retried through the normal retry mechanism, as if the apps
> themselves had failed, correct? That just dawned on me.

Yep.
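
To put rough numbers on the retry argument quoted above, here is a small
sketch. It assumes that "2/15K" means 2 failures out of about 15,000 app
invocations, and that a retried app fails independently with the same
probability; neither assumption is confirmed in this thread, and the
function names are made up for illustration.

```python
# Assumption: "2/15K" = 2 failures observed in roughly 15,000 app runs.
N_APPS = 15000
P_FAIL = 2 / N_APPS  # per-attempt failure probability


def prob_app_fails(p, retries):
    """Probability that one app fails on its first attempt and on every retry.

    Assumes attempts fail independently with the same probability p.
    """
    return p ** (retries + 1)


def expected_failed_apps(p, n_apps, retries):
    """Expected number of apps (out of n_apps) that exhaust all attempts."""
    return n_apps * prob_app_fails(p, retries)


if __name__ == "__main__":
    for retries in (0, 1, 2):
        expected = expected_failed_apps(P_FAIL, N_APPS, retries)
        print(f"retries={retries}: expected failed apps = {expected:.3g}")
```

Under those assumptions, a single retry takes the expected number of
permanently failed apps from about 2 down to well under one per run. If
worker deaths are correlated (for example, a bad site killing every worker
it starts), the real improvement would be smaller.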

[...]
> 
> I saw two apps fail because the site didn't set OSG_WM_TMP (where I
> place the logs). I thought that in those two cases the worker never
> started, but perhaps those two failures are related to these two app
> failures.

In your case there are actual TCP connections, so the workers must
have started.




