[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory

Michael Wilde wilde at mcs.anl.gov
Tue Aug 9 07:25:41 CDT 2011


Forgot to mention two things:

- the logs are on communicado in the local dir /scratch/local/wilde/swift/test.swift-workers/logs.14

- this is a really cool milestone: 2.2M jobs and counting from one swift script to OSG; about 20 mins into the run it was pushing 138 jobs/sec over one arbitrary 10-minute period that I looked at.

Nice work, Mihael!

- Mike

----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, August 9, 2011 7:16:39 AM
> Subject: Re: [Swift-devel] New 0.93 problem: <jobname>.error No such file or directory
> I stopped this run and started a larger one: 5M catsn jobs to a pool
> of 300-400 workers (varying over time). It had finished 2.2M and was
> still running, albeit slowly, when I ended it.
> 
> The job rate ramped up quickly as the external QueueN script obtained
> workers. After about 15 mins had obtained 80 workers and seemed to be
> running at several hundred tasks per second. I had moved all the test
> clients, IO, and logging to local hard disk on communicado for speed.
> I set a retry count of 5, and turned on lazy failure mode.
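> 
> For reference, this is roughly what I set in swift.properties for
> this test (property names from memory, so double-check them against
> the 0.93 template):
> 
> execution.retries=5
> lazy.errors=true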
> 
> After about 6 hours, the test had passed 2.2M jobs and was still
> progressing, but seemed to have slowed drastically from its earlier
> rate, to below a few jobs per second. Possibly it had eaten through
> its throttle due to failed/hung workers.
> 
> The throttle was 300 jobs, and there seemed to be about 400 running
> workers (the QueueN algorithm was grabbing more workers than the
> artificial "demand" of 250 that I had set).
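> 
> The 300-job throttle is the karajan jobThrottle profile in my
> sites.xml, roughly:
> 
> <profile namespace="karajan" key="jobThrottle">3.00</profile>
> 
> where, if I remember the formula right, jobThrottle*100 is
> approximately the cap on concurrent jobs for the pool.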
> 
> I then killed the run and captured all the logs, including jstacks
> and a trace of top output every minute, mainly because I wanted to
> free up the workers and study the run before continuing.
> 
> I see about 3 worker failure scenarios in the Condor logs:
> 
> 1) _swiftwrap.staging: line 331: warning: here-document at line 303
> delimited by end-of-file (wanted `$STDERR')
> 
> 2) com$ cat 2.err
> Send failed: Transport endpoint is not connected at ./worker.pl line
> 384.
> com$ cat 2.out
> OSG_WN_TMP=/state/partition1/tmp
> === contact: http://communicado.ci.uchicago.edu:56323
> === name: Firefly
> Running in dir /grid_home/engage/gram_scratch_7Xkg2fpMUc
> === cwd: /grid_home/engage/gram_scratch_7Xkg2fpMUc
> === logdir: /state/partition1/tmp/Firefly.workerdir.Q18464
> ===============================================
> === exit: worker.pl exited with code=107
> === worker log - last 1000 lines:
> 
> ==> /state/partition1/tmp/Firefly.workerdir.Q18464/worker-Firefly.log <==
> 1312882398.535 INFO - Firefly Logging started: Tue Aug 9 04:33:18 2011
> 1312882398.535 INFO - Running on node c1511.local
> 1312882398.535 INFO - Connecting (0)...
> 1312882398.566 INFO - Connected
> 1312882398.604 INFO 000101 Registration successful. ID=000101
> 1312890065.197 WARN 000101 Send failed: Transport endpoint is not
> connected
> com$
> 
> 3) A third scenario that only occurred once or twice; I still need to
> hunt it down.
> 
> ----
> 
> I see 1234 messages containing "worker lost", like:
> 2011-08-09 01:50:03,438-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> jobid=cat-fld0l6ek - Application exception: Task failed: Connection
> to worker lost
> 
> 1234 is far greater than the throttle of 300, so Swift seems to be
> running past that problem.
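> 
> (That count is just a grep -c for "worker lost" over the Swift run
> log, so treat the exact number as approximate.)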
> 
> I'll investigate more, but since it's working so well I first need to
> get the application users going who are waiting on this. I wonder
> whether these issues will also show up in the local stress testing on
> the MCS hosts that Alberto and Ketan are working on.
> 
> - Mike
> 
> 
> ----- Original Message -----
> > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > To: "Mihael Hategan" <hategan at mcs.anl.gov>
> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Monday, August 8, 2011 11:14:22 PM
> > Subject: Re: New 0.93 problem: <jobname>.error No such file or
> > directory
> > OK, with retry on, the same run has now passed 250K jobs, and
> > retried 2 failures successfully. It's running at about 100 jobs/sec
> > to about 38 workers over 22 sites.
> >
> > Once this tests out I'll increase the number of workers.
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Monday, August 8, 2011 9:58:59 PM
> > > Subject: Re: New 0.93 problem: <jobname>.error No such file or
> > > directory
> > > On Mon, 2011-08-08 at 21:54 -0500, Michael Wilde wrote:
> > > > > Judging from the error message, your workers are dying for
> > > > > unknown reasons. I see only two applications that failed (and
> > > > > they have distinct arguments), so I'm guessing you turned off
> > > > > retries. At 2/15K failure probability, if you set retries to
> > > > > at least 1, you would get a dramatic decrease in the odds that
> > > > > the failure will happen twice for the same app.
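> > > >
> > > > (Sanity-checking that arithmetic: p is roughly 2/15000, about
> > > > 1.3e-4 per attempt, so with one retry the chance of the same app
> > > > failing on both attempts is about p^2, on the order of 1e-8,
> > > > assuming the failures are independent.)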
> > > >
> > > > Good idea, will do.
> > > >
> > > > So I just realized what's happening here. Workers can fail
> > > > (i.e. you tested killing them, you said) and Swift will keep
> > > > running, *but* the apps that were running on failed workers
> > > > receive failures and need to get retried through normal retry,
> > > > as if the apps themselves had failed, correct? That just dawned
> > > > on me.
> > >
> > > Yep.
> > >
> > > [...]
> > > >
> > > > I saw two apps fail because the site didn't set OSG_WN_TMP
> > > > (where I place the logs). I thought that in those two cases the
> > > > worker never started, but perhaps those two worker failures are
> > > > related to these two app failures.
> > >
> > > In your case there are actual TCP connections, so the workers
> > > must have started.
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



