[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory
Mihael Hategan
hategan at mcs.anl.gov
Mon Aug 8 17:23:15 CDT 2011
On Mon, 2011-08-08 at 16:29 -0500, Michael Wilde wrote:
> catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or directory)
> (partial traceback below).
>
> Is this related to your change on handling of the status file?
Yes, but I thought I fixed it. Make sure you have at least swift r4963.
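
To verify, you can check the revision of your checkout or build. This is a
generic example, not taken from this thread; the paths and the exact -version
banner are assumptions about a standard SVN checkout:

    # from the source tree the build came from
    cd cog/modules/swift && svn info | grep Revision

    # or from an installed build; the version banner includes the swift-r<rev> number
    swift -version
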
>
> I was seeing the same error sporadically in shorter tests last night but have not yet had a chance to investigate.
>
> The full log for this error is catsn-20110808-1558-6tm450a1.log in
> /home/wilde/swiftgrid/test.swift-workers/logs.10
>
> - Mike
>
>
> 2011-08-08 16:01:27,952-0500 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-14624-1-1-1312837151244) is /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.14625.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.14625.out -k -cdmfile -status provider -a data.txt
> 2011-08-08 16:01:27,960-0500 INFO ExecutionContext Detailed exception:
> Exception in cat:
> Arguments: [data.txt]
> Host: localhost
> Directory: catsn-20110808-1558-6tm450a1/jobs/z/cat-ze1806ek
> - - -
>
> Caused by: /autonfs/home/wilde/swiftgrid/test.swift-workers/./catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or directory)
>
> at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29)
> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27)
> at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194)
> at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214)
> at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28)
>
>
> ----- Original Message -----
> > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Sent: Sunday, August 7, 2011 3:12:47 PM
> > Subject: Re: 100K job script hangs at 30K jobs
> > Ok. I ran 65k jobs with a script that randomly killed and added
> > workers. It finished fine, but it needs testing in more environments.
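
For anyone wanting to reproduce that kind of test, here is a minimal
hypothetical sketch of a worker-churn loop. It is not the script used above;
WORKER_CMD, the worker.pl arguments, and the timings are placeholders to adapt
to your own coaster setup:

    #!/bin/bash
    # Hypothetical worker-churn test: repeatedly start a worker, wait a random
    # interval, then kill one running worker at random.
    WORKER_CMD="perl worker.pl http://SERVICE-HOST:PORT churn-test /tmp/worker-logs"

    for i in $(seq 1 100); do
        $WORKER_CMD &                              # start a replacement worker
        sleep $((RANDOM % 60 + 10))                # wait 10-69 seconds
        pids=( $(pgrep -f worker.pl) )             # list running workers
        if [ ${#pids[@]} -gt 1 ]; then
            kill "${pids[RANDOM % ${#pids[@]}]}"   # kill one at random
        fi
    done
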
> >
> > On Sun, 2011-08-07 at 09:39 -0500, Michael Wilde wrote:
> > > I'll try to trap that next chance I get, and try to ship back worker
> > > logs.
> > >
> > > ----- Original Message -----
> > > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > Sent: Saturday, August 6, 2011 9:29:48 PM
> > > > Subject: Re: 100K job script hangs at 30K jobs
> > > > So this problem was caused by workers dying without the system
> > > > noticing, so zombie jobs would slowly fill the throttle (which was
> > > > set to 10 in this case). I backported the dead-worker detection
> > > > code from trunk. Combined with retries, this should take care of
> > > > the problem, but it may be worth looking into why the workers were
> > > > dying.
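
For context on the throttle mentioned above: in a typical 0.93 setup the
per-site limit on concurrent jobs comes from the karajan jobThrottle profile
in sites.xml (roughly value*100+1 jobs, so a value near 0.09 gives 10); that
is an assumption about this run, not something read from the posted logs. A
quick way to check which limits were in effect:

    # paths are assumptions; point these at the files used for the run
    grep -n -i 'jobThrottle' sites.xml
    grep -n 'foreach.max.threads' swift.properties
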
> > > >
> > > > On Sat, 2011-08-06 at 13:34 -0500, Michael Wilde wrote:
> > > > > Mihael,
> > > > >
> > > > > A later catsn test, started this morning, hung at 30K of 100K
> > > > > catsn jobs.
> > > > >
> > > > > Swift was still printing progress but not progressing beyond:
> > > > >
> > > > > Progress: time: Sat, 06 Aug 2011 13:29:08 -0500 Selecting site:1014 Submitted:10 Finished successfully:30329
> > > > >
> > > > > I had stopped it earlier in the morning, then resumed it to get
> > > > > a
> > > > > jstack.
> > > > >
> > > > > Logs and stack traces of both the swift and coaster service JVMs
> > > > > are
> > > > > in:
> > > > > /home/wilde/swiftgrid/test.swift-workers/logs.07
> > > > >
> > > > > - Mike
> > >
>