[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory

Michael Wilde wilde at mcs.anl.gov
Tue Aug 9 07:16:39 CDT 2011


I stopped this run and started a larger one: 5M catsn jobs to a pool of 300-400 workers (the count varies over time). It had finished 2.2M jobs and was still running, albeit slowly, when I ended it.

The job rate ramped up quickly as the external QueueN script obtained workers. After about 15 minutes it had obtained 80 workers and seemed to be running at several hundred tasks per second. I had moved all the test clients, IO, and logging to local hard disk on communicado for speed. I set a retry count of 5 and turned on lazy failure mode.
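
(For reference, the retry count and lazy failure mode correspond to swift.properties settings along these lines; I'm writing the property names from memory, so check them against the 0.93 user guide before copying:)

# swift.properties (illustrative; verify names against the 0.93 docs)
execution.retries=5
lazy.errors=true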

After about 6 hours, the test had passed 2.2M jobs and was still progressing, but it seemed to have slowed drastically from its earlier rate, dropping below a few jobs per second. Possibly it ate through its throttle due to failed/hung workers.

The throttle was 300 jobs, and it seemed to have about 400 running workers (the QueueN algorithm was grabbing more workers than the artificial "demand" of 250 that I had set).
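
(The 300-job throttle is the karajan jobThrottle profile in the sites file. An entry along these lines should give roughly 300 concurrent jobs, since the limit works out to about jobThrottle * 100 + 1; the exact value I used may differ:)

<profile namespace="karajan" key="jobThrottle">2.99</profile>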

I then killed the run and captured all the logs, including jstacks and a trace of top output taken every minute, mainly because I wanted to free up the workers and study the run before continuing.

I see roughly three worker-failure scenarios in the Condor logs (a small bash illustration of the here-document warning in scenario 1 follows the list):

1) _swiftwrap.staging: line 331: warning: here-document at line 303 delimited by end-of-file (wanted `$STDERR')

2) com$ cat 2.err
Send failed: Transport endpoint is not connected at ./worker.pl line 384.
com$ cat 2.out
OSG_WN_TMP=/state/partition1/tmp
=== contact: http://communicado.ci.uchicago.edu:56323
=== name:    Firefly
Running in dir /grid_home/engage/gram_scratch_7Xkg2fpMUc
=== cwd:     /grid_home/engage/gram_scratch_7Xkg2fpMUc
=== logdir:  /state/partition1/tmp/Firefly.workerdir.Q18464
===============================================
=== exit: worker.pl exited with code=107
=== worker log - last 1000 lines:

==> /state/partition1/tmp/Firefly.workerdir.Q18464/worker-Firefly.log <==
1312882398.535 INFO  - Firefly Logging started: Tue Aug  9 04:33:18 2011
1312882398.535 INFO  - Running on node c1511.local
1312882398.535 INFO  - Connecting (0)...
1312882398.566 INFO  - Connected
1312882398.604 INFO  000101 Registration successful. ID=000101
1312890065.197 WARN  000101 Send failed: Transport endpoint is not connected
com$ 

3) A third scenario occurred only once or twice, and I still need to hunt it down.
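
Re scenario 1: that warning is generic bash behavior when a here-document's closing delimiter line never appears before end-of-file. A minimal reproduction (not the actual _swiftwrap.staging code, just an illustration) would be:

#!/bin/bash
# Bash takes the literal word after << as the here-document delimiter,
# so it expects a later line containing only $STDERR. If the script (or
# a generated section of it) ends before that line appears, bash warns:
#   warning: here-document at line N delimited by end-of-file (wanted `$STDERR')
cat <<$STDERR
...captured stderr text...
(script ends here; the closing $STDERR line is missing)

In other words, bash hit the end of _swiftwrap.staging without ever seeing the line that was supposed to close that here-document.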

----

I see 1234 messages containing "worker lost", like:
2011-08-09 01:50:03,438-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-fld0l6ek - Application exception: Task failed: Connection to worker lost

Since 1234 is far greater than the throttle of 300, Swift seems to be recovering and running past that problem.
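
A rough back-of-envelope, assuming the worker-lost failures are independent and spread evenly over the jobs: 1234 failed attempts out of roughly 2.2M jobs is a per-attempt failure rate of about 5.6e-4. With a retry count of 5 (up to 6 attempts per app), the chance of any given app exhausting all its attempts on this error alone is about (5.6e-4)^6, on the order of 3e-20, so effectively none of the 5M apps should fail outright from worker loss by itself.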

I'll investigate more, but since it's working so well, I first need to get the application users who are waiting on this up and running. I wonder whether these issues will also show up in the local stress testing on the MCS hosts that Alberto and Ketan are working on.

- Mike


----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, August 8, 2011 11:14:22 PM
> Subject: Re: New 0.93 problem: <jobname>.error No such file or directory
> OK, with retry on, the same run has now passed 250K jobs and retried
> 2 failures successfully. It's running at about 100 jobs/sec to about 38
> workers over 22 sites.
> 
> Once this tests out I'll increase the number of workers.
> 
> - Mike
> 
> 
> ----- Original Message -----
> > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Monday, August 8, 2011 9:58:59 PM
> > Subject: Re: New 0.93 problem: <jobname>.error No such file or
> > directory
> > On Mon, 2011-08-08 at 21:54 -0500, Michael Wilde wrote:
> > > > Judging from the error message, your workers are dying for
> > > > unknown reasons. I see only two applications that failed (and
> > > > they have distinct arguments), so I'm guessing you turned off
> > > > retries. At 2/15K failure probability, if you set retries to at
> > > > least 1, you would get a dramatic decrease in the odds that the
> > > > failure will happen twice for the same app.
> > >
> > > Good idea, will do.
> > >
> > > So I just realized what's happening here. Workers can fail (i.e.,
> > > you tested killing them, you said) and Swift will keep running,
> > > *but* the apps that were running on failed workers receive failures
> > > and need to get retried through normal retry, as if the apps
> > > themselves had failed, correct? That just dawned on me.
> >
> > Yep.
> >
> > [...]
> > >
> > > I saw two apps fail because the site didn't set OSG_WN_TMP (where
> > > I place the logs). I thought that in those two cases the worker
> > > never started, but perhaps those two failures are related to these
> > > two app failures.
> >
> > In your case there are actual TCP connections, so the workers must
> > have started.
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



