[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory

Mon Aug 8 23:14:22 CDT 2011

OK, with retry on, the same run has now passed 250K jobs and successfully retried 2 failures. It's running at about 100 jobs/sec across about 38 workers over 22 sites.

Once this tests out I'll increase the number of workers.
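
As a rough check on the retry reasoning quoted below, here is a minimal sketch of the arithmetic. It assumes each attempt fails independently with probability 2/15K (the figure from the quoted message), which may not hold if failures cluster on particular sites or workers. The function names are hypothetical, and this is plain Python, not Swift configuration.

```python
def prob_all_attempts_fail(p_fail, retries):
    """Probability that one app fails its first attempt and every retry.

    Assumes attempts are independent with the same failure probability.
    """
    return p_fail ** (retries + 1)


def expected_permanent_failures(jobs, p_fail, retries):
    """Expected number of apps that still fail after all retries are used."""
    return jobs * prob_all_attempts_fail(p_fail, retries)


if __name__ == "__main__":
    p_fail = 2 / 15000   # observed: 2 failures in roughly 15K jobs
    jobs = 250_000       # size of the run mentioned above

    for retries in (0, 1, 2):
        expected = expected_permanent_failures(jobs, p_fail, retries)
        print(f"retries={retries}: expected permanent failures = {expected:.6g}")
```

Under these assumptions, a single retry squares the per-app failure probability, which is why the expected number of permanently failed apps drops so sharply.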

- Mike

----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, August 8, 2011 9:58:59 PM
> Subject: Re: New 0.93 problem: <jobname>.error No such file or directory
> On Mon, 2011-08-08 at 21:54 -0500, Michael Wilde wrote:
> > > Judging from the error message, your workers are dying for unknown
> > > reasons. I see only two applications that failed (and they have
> > > distinct arguments), so I'm guessing you turned off retries. At
> > > 2/15K failure probability, if you set retries to at least 1, you
> > > would get a dramatic decrease in the odds that the failure will
> > > happen twice for the same app.
> >
> > Good idea, will do.
> >
> > So I just realized what's happening here. Workers can fail (i.e., you
> > tested killing them, you said) and Swift will keep running, *but* the
> > apps that were running on failed workers receive failures and need to
> > get retried through normal retry, as if the apps themselves had
> > failed, correct? That just dawned on me.
> 
> Yep.
> 
> [...]
> >
> > I saw two apps fail because the site didn't set OSG_WM_TMP (where I
> > place the logs). I thought that in those two cases the worker never
> > started, but perhaps those two failures are related to these two app
> > failures.
> 
> In your case there is an actual TCP connection, so the workers must
> have started.

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory