[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory

Michael Wilde wilde at mcs.anl.gov
Mon Aug 8 16:29:02 CDT 2011


Mihael,

I ran one test to 100K jobs; it ran fine.

Second test failed after ~15K jobs with the following error:

catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or directory)
(partial traceback below).

Is this related to your change to the handling of the status file?

I was seeing the same error sporadically on shorter tests last night but have not yet had a chance to investigate.

The full log for this error is catsn-20110808-1558-6tm450a1.log in
/home/wilde/swiftgrid/test.swift-workers/logs.10

- Mike


2011-08-08 16:01:27,952-0500 INFO  GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-14624-1-1-1312837151244) is /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.14625.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.14625.out -k -cdmfile  -status provider -a data.txt
2011-08-08 16:01:27,960-0500 INFO  ExecutionContext Detailed exception:
Exception in cat:
Arguments: [data.txt]
Host: localhost
Directory: catsn-20110808-1558-6tm450a1/jobs/z/cat-ze1806ek
- - -

Caused by: /autonfs/home/wilde/swiftgrid/test.swift-workers/./catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or directory)

        at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29)
        at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27)
        at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214)
        at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
        at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28)


----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Sent: Sunday, August 7, 2011 3:12:47 PM
> Subject: Re: 100K job script hangs at 30K jobs
> Ok. I ran 65k jobs with a script that randomly killed and added
> workers.
> It finished fine, but it needs testing in more environments.
> 
> On Sun, 2011-08-07 at 09:39 -0500, Michael Wilde wrote:
> > I'll try to trap that next chance I get, and try to ship back worker
> > logs.
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Saturday, August 6, 2011 9:29:48 PM
> > > Subject: Re: 100K job script hangs at 30K jobs
> > > So this problem was caused by dying workers combined with the system
> > > not noticing it, so zombie jobs would slowly fill the throttle (which
> > > was set to 10 in this case). I backported the dead worker detection
> > > code from trunk. Combined with retries, this should take care of the
> > > problem, but it may be worth looking into why the workers were dying.
> > >
> > > On Sat, 2011-08-06 at 13:34 -0500, Michael Wilde wrote:
> > > > Mihael,
> > > >
> > > > A later catsn test, started this morning, hung at 30K of 100K catsn
> > > > jobs.
> > > >
> > > > Swift was still printing progress but not progressing beyond:
> > > >
> > > > Progress: time: Sat, 06 Aug 2011 13:29:08 -0500 Selecting site:1014
> > > > Submitted:10 Finished successfully:30329
> > > >
> > > > I had stopped it earlier in the morning, then resumed it to get a
> > > > jstack.
> > > >
> > > > Logs and stack traces of both the swift and coaster service JVMs
> > > > are in:
> > > >   /home/wilde/swiftgrid/test.swift-workers/logs.07
> > > >
> > > > - Mike
> >
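
To make the failure mode above concrete, here is a minimal, hypothetical
sketch of the throttle stall Mihael describes. The class and method names are
invented and this is not the actual coaster/Karajan code; it only illustrates
how a fixed throttle of 10 slots hangs once zombie jobs stop releasing them:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Semaphore;

    public class ThrottleSketch {
        private final Semaphore slots = new Semaphore(10);        // throttle of 10
        private final Map<String, Long> running = new ConcurrentHashMap<>(); // jobId -> last heartbeat

        public void submit(String jobId) throws InterruptedException {
            slots.acquire();                                       // blocks while all 10 slots are held
            running.put(jobId, System.currentTimeMillis());
        }

        public void heartbeat(String jobId) {
            running.replace(jobId, System.currentTimeMillis());
        }

        public void jobCompleted(String jobId) {
            if (running.remove(jobId) != null) {
                slots.release();                                   // normal path: slot freed on completion
            }
        }

        // Without something like this, a job whose worker died unnoticed never
        // leaves 'running', its slot is never released, and once all 10 slots
        // are held by zombies the run sits at "Submitted:10" forever.
        // Detecting the dead worker and retrying the job frees the slot again.
        public void reapZombies(long timeoutMillis) {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Long> e : running.entrySet()) {
                if (now - e.getValue() > timeoutMillis && running.remove(e.getKey()) != null) {
                    slots.release();                               // free the zombie's slot for a retry
                }
            }
        }
    }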

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



