[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory
Michael Wilde
wilde at mcs.anl.gov
Mon Aug 8 16:29:02 CDT 2011
Mihael,
I ran one test to 100K jobs; it ran fine.
A second test failed after ~15K jobs with the following error:
catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or directory)
(partial stack trace below).
Is this related to your change to the handling of the status file?
I was seeing the same error sporadically on shorter tests last night but
have not yet had a chance to investigate.
The full log for this error is catsn-20110808-1558-6tm450a1.log in
/home/wilde/swiftgrid/test.swift-workers/logs.10
- Mike
2011-08-08 16:01:27,952-0500 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-14624-1-1-1312837151244) is /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.14625.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.14625.out -k -cdmfile -status provider -a data.txt
2011-08-08 16:01:27,960-0500 INFO ExecutionContext Detailed exception:
Exception in cat:
Arguments: [data.txt]
Host: localhost
Directory: catsn-20110808-1558-6tm450a1/jobs/z/cat-ze1806ek
- - -
Caused by: /autonfs/home/wilde/swiftgrid/test.swift-workers/./catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or directory)
at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29)
at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27)
at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194)
at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214)
at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28)
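My guess is a create/read race: the submit side tries to open the .error
status file before the wrapper has created it, and the open throws. In
case it helps pin this down, here is a minimal sketch of a tolerant read;
the class, waitForStatusFile, and the timeout are all hypothetical, not
the actual Karajan code:

import java.io.File;
import java.io.FileNotFoundException;

// Hypothetical sketch: poll for a status file instead of assuming it
// already exists, so a slow writer does not surface as the
// FileNotFoundException above.
public class StatusFilePoller {

    // Waits up to timeoutMs for the status file to appear. Returns the
    // file if it shows up; otherwise throws, so the caller can tell
    // "job failed" apart from "file not written yet".
    public static File waitForStatusFile(File statusFile, long timeoutMs)
            throws FileNotFoundException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (statusFile.exists()) {
                return statusFile;
            }
            Thread.sleep(100); // back off briefly before re-checking
        }
        throw new FileNotFoundException(statusFile.getPath()
                + " (No such file or directory after " + timeoutMs + " ms)");
    }

    public static void main(String[] args) throws Exception {
        File f = new File("catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error");
        System.out.println("Found: " + waitForStatusFile(f, 5000));
    }
}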
----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Sent: Sunday, August 7, 2011 3:12:47 PM
> Subject: Re: 100K job script hangs at 30K jobs
> Ok. I ran 65k jobs with a script that randomly killed and added
> workers.
> It finished fine, but it needs testing on more environments.
>
> On Sun, 2011-08-07 at 09:39 -0500, Michael Wilde wrote:
> > I'll try to trap that next chance I get, and try to ship back worker
> > logs.
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Saturday, August 6, 2011 9:29:48 PM
> > > Subject: Re: 100K job script hangs at 30K jobs
> > > So this problem was caused by dying workers, combined with the
> > > system not noticing it, so zombie jobs would slowly fill the
> > > throttle (which was set to 10 in this case). I backported the
> > > dead worker detection code from trunk. Combined with retries,
> > > this should take care of the problem, but it may be worth looking
> > > into why the workers were dying.
> > >
> > > On Sat, 2011-08-06 at 13:34 -0500, Michael Wilde wrote:
> > > > Mihael,
> > > >
> > > > A later catsn test, started this morning, hung at 30K of 100K
> > > > catsn jobs.
> > > >
> > > > Swift was still printing progress but not progressing beyond:
> > > >
> > > > Progress: time: Sat, 06 Aug 2011 13:29:08 -0500 Selecting site:1014 Submitted:10 Finished successfully:30329
> > > >
> > > > I had stopped it earlier in the morning, then resumed it to
> > > > get a jstack.
> > > >
> > > > Logs and stack traces of both the swift and coaster service
> > > > JVMs are in:
> > > > /home/wilde/swiftgrid/test.swift-workers/logs.07
> > > >
> > > > - Mike
> >
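For the archives, here is roughly how I understand the backported dead
worker detection to behave: when a worker misses enough heartbeats, its
in-flight jobs are failed and resubmitted, so they stop pinning throttle
slots. This is only a sketch of the idea; the class, the method names,
and the missed-heartbeat threshold below are made up, not the actual
trunk code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the behavior described above: without dead
// worker detection, jobs on a crashed worker stay "Submitted" forever
// and pin the throttle (10 slots in our case). With it, missed
// heartbeats free the slots and retries finish the work.
public class DeadWorkerDetector {
    private static final long HEARTBEAT_INTERVAL_MS = 30_000;
    private static final int MAX_MISSED = 2; // assumed threshold

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a worker checks in.
    public void heartbeat(String workerId) {
        lastHeartbeat.put(workerId, System.currentTimeMillis());
    }

    // Periodic sweep: declare silent workers dead and requeue their jobs.
    public void sweep(JobQueue queue) {
        long cutoff = System.currentTimeMillis() - MAX_MISSED * HEARTBEAT_INTERVAL_MS;
        lastHeartbeat.forEach((workerId, seen) -> {
            if (seen < cutoff) {
                lastHeartbeat.remove(workerId);
                // Resubmit the zombie jobs so they no longer fill the
                // throttle; combined with retries the run can complete.
                queue.resubmitJobsOf(workerId);
            }
        });
    }

    // Minimal stand-in for the scheduler's job queue.
    interface JobQueue {
        void resubmitJobsOf(String workerId);
    }
}

The key point is that a zombie job now frees its throttle slot instead
of holding it forever.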
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory