[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory

Michael Wilde wilde at mcs.anl.gov
Mon Aug 8 21:54:35 CDT 2011


> Judging from the error message, your workers are dying for unknown
> reasons. I see only two applications that failed (and they have
> distinct arguments), so I'm guessing you turned off retries. At 2/15K
> failure probability, if you set retries to at least 1, you would get a
> dramatic decrease in the odds that the failure will happen twice for
> the same app.

Good idea, will do.
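
A rough sketch of the arithmetic behind that suggestion, assuming that failures are independent across attempts and reading "2/15K" as 2 failures in roughly 15,000 app runs. Both are assumptions, and the function name is only illustrative:

```python
def permanent_failure_stats(n_apps, p_fail, retries):
    """Estimate how often an app fails on every attempt.

    Assumes each attempt fails independently with probability p_fail,
    which may not hold if failures cluster on a bad site or worker.
    """
    # An app fails permanently only if the first attempt and all retries fail.
    p_permanent = p_fail ** (retries + 1)
    # Expected number of permanently failed apps in the whole run.
    expected = n_apps * p_permanent
    # Probability that at least one app fails permanently.
    p_any = 1.0 - (1.0 - p_permanent) ** n_apps
    return expected, p_any


n_apps = 15000          # assumed reading of "15K"
p_fail = 2 / n_apps     # two observed failures

for retries in (0, 1):
    expected, p_any = permanent_failure_stats(n_apps, p_fail, retries)
    print(f"retries={retries}: expected failures={expected:.6f}, "
          f"P(at least one)={p_any:.6f}")
```

Under those assumptions, one retry takes the expected number of permanently failed apps from about 2 down to about 0.0003. If the failures are correlated (for example, a retry landing on the same bad worker), the improvement would be smaller.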

So I just realized what's happening here. Workers can fail (i.e., you said you tested killing them) and Swift will keep running, *but* the apps that were running on the failed workers receive failures and need to be retried through the normal retry mechanism, as if the apps themselves had failed. Is that correct? That just dawned on me.

> Do you know where swork:14 and swork:29 ran? (it may be useful to name
> workers based on their site).

Good idea, will do.

> Also, if you want to troubleshoot the workers, worker logging may
> help.

I have worker logging on; I'm not sure why I'm not (yet) getting the logs back. My Condor jobs are set up to transfer the worker log back after the workers exit. I'll try to get these logs.

I saw two apps fail because the site didn't set OSG_WM_TMP (where I place the logs). I thought that in those two cases the worker never started, but perhaps those two failures are related to these two app failures.

More digging.

- Mike

> 
> Mihael

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



