[Swift-devel] New 0.93 problem: <jobname>.error No such file or directory

Michael Wilde wilde at mcs.anl.gov
Mon Aug 8 20:39:34 CDT 2011


I'm now running Swift svn swift-r4965 cog-r3225.

A 100K-catsn script ran to completion.

Then a 500K-catsn script terminated at ~15K jobs with the error below.

Logs are in /home/wilde/swiftgrid/test.swift-workers
The failing run's log is *pe.log

- Mike


2011-08-08 18:37:59,452-0500 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=cat-1fkb66ek thread=0-3-29294-1-1 host=localhost replicationGroup=8shb66ek
2011-08-08 18:37:59,452-0500 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=cat-2fkb66ek thread=0-3-29296-1-1 host=localhost replicationGroup=9shb66ek
2011-08-08 18:37:59,452-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-eakb66ek - Application exception: Task failed: Connection to worker lost
java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)


2011-08-08 18:37:59,452-0500 INFO  GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-29290-1-1-1312846318323) is /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.29291.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.29291.out -k -cdmfile  -status provider -a data.txt
2011-08-08 18:37:59,452-0500 INFO  vdl:execute START thread=0-3-30899-1 tr=cat
2011-08-08 18:37:59,455-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-oakb66ek - Application exception: Task failed: Connection to worker lost
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)
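
For context, both "Connection to worker lost" failures above come from the channel's sender thread hitting a SocketException while writing to the worker's socket. Roughly, the pattern looks like the sketch below (the class, interface, and method names are made up for illustration; this is not the actual AbstractStreamKarajanChannel code):

import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketException;
import java.util.concurrent.BlockingQueue;

// Illustrative only: a blocking sender loop that translates write failures
// ("Connection reset", "Broken pipe") into a single "connection lost" event
// so the tasks bound to that worker can be failed and retried.
class WorkerSender implements Runnable {
    private final Socket socket;
    private final BlockingQueue<byte[]> outbox;   // serialized commands queued for the worker
    private final WorkerListener listener;        // hypothetical callback interface

    WorkerSender(Socket socket, BlockingQueue<byte[]> outbox, WorkerListener listener) {
        this.socket = socket;
        this.outbox = outbox;
        this.listener = listener;
    }

    public void run() {
        try {
            OutputStream out = socket.getOutputStream();
            while (!Thread.currentThread().isInterrupted()) {
                out.write(outbox.take());   // blocks until there is something to send
                out.flush();
            }
        } catch (SocketException e) {
            // Both "Connection reset" and "Broken pipe" mean the worker side went away.
            listener.connectionLost(e);
        } catch (IOException | InterruptedException e) {
            listener.connectionLost(e);
        }
    }

    interface WorkerListener {
        void connectionLost(Exception cause);
    }
}

Either way the exceptions collapse into the same "worker lost" condition, so the interesting question is still why the workers are going away in the first place.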




----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, August 8, 2011 5:23:15 PM
> Subject: Re: New 0.93 problem: <jobname>.error No such file or directory
> On Mon, 2011-08-08 at 16:29 -0500, Michael Wilde wrote:
> > catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error (No such file or
> > directory)
> > (partial traceback below).
> >
> > Is this related to your change on handling of the status file?
> 
> Yes, but I thought I fixed it. Make sure you have at least swift
> r4963.
> 
> >
> > I was seeing the same error on sporadic, shorter tests last night
> > but did not yet have a chance to investigate.
> >
> > The full log for this error is catsn-20110808-1558-6tm450a1.log in
> > /home/wilde/swiftgrid/test.swift-workers/logs.10
> >
> > - Mike
> >
> >
> > 2011-08-08 16:01:27,952-0500 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-14624-1-1-1312837151244) is /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.14625.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.14625.out -k -cdmfile -status provider -a data.txt
> > 2011-08-08 16:01:27,960-0500 INFO ExecutionContext Detailed exception:
> > Exception in cat:
> > Arguments: [data.txt]
> > Host: localhost
> > Directory: catsn-20110808-1558-6tm450a1/jobs/z/cat-ze1806ek
> > - - -
> >
> > Caused by:
> > /autonfs/home/wilde/swiftgrid/test.swift-workers/./catsn-20110808-1558-6tm450a1.d/cat-ze1806ek.error
> > (No such file or directory)
> >
> >         at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29)
> >         at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27)
> >         at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194)
> >         at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214)
> >         at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
> >         at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28)
> >
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > Sent: Sunday, August 7, 2011 3:12:47 PM
> > > Subject: Re: 100K job script hangs at 30K jobs
> > > Ok. I ran 65k jobs with a script that randomly killed and added
> > > workers.
> > > It finished fine, but it needs testing on more environments.
> > >
> > > On Sun, 2011-08-07 at 09:39 -0500, Michael Wilde wrote:
> > > > I'll try to trap that next chance I get, and try to ship back
> > > > worker
> > > > logs.
> > > >
> > > > ----- Original Message -----
> > > > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > > Sent: Saturday, August 6, 2011 9:29:48 PM
> > > > > Subject: Re: 100K job script hangs at 30K jobs
> > > > > So this problem was the problem of dying workers combined with
> > > > > the
> > > > > system not noticing it and so zombie jobs would slowly fill
> > > > > the
> > > > > throttle
> > > > > (which was set to 10 in this case). I backported the dead
> > > > > worker
> > > > > detection code from trunk. Combined with retries, this should
> > > > > take
> > > > > care
> > > > > of the problem, but it may be worth looking into why the
> > > > > workers
> > > > > were
> > > > > dying.
> > > > >
> > > > > On Sat, 2011-08-06 at 13:34 -0500, Michael Wilde wrote:
> > > > > > Mihael,
> > > > > >
> > > > > > A later catsn test, started this morning, hung at 30K or
> > > > > > 100K
> > > > > > catsn
> > > > > > jobs.
> > > > > >
> > > > > > Swift was still printing progress but not progressing
> > > > > > beyond:
> > > > > >
> > > > > > Progress: time: Sat, 06 Aug 2011 13:29:08 -0500 Selecting
> > > > > > site:1014
> > > > > > Submitted:10 Finished successfully:30329
> > > > > >
> > > > > > I had stopped it earlier in the morning, then resumed it to
> > > > > > get
> > > > > > a
> > > > > > jstack.
> > > > > >
> > > > > > Logs and stack traces of both the swift and coaster service
> > > > > > JVMs
> > > > > > are
> > > > > > in:
> > > > > >   /home/wilde/swiftgrid/test.swift-workers/logs.07
> > > > > >
> > > > > > - Mike
> > > >
> >
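
On the <jobname>.error failure in the quoted thread above: the "Caused by: ... cat-ze1806ek.error (No such file or directory)" line suggests the service reads the per-job .error file after the job reports status and fails hard when the wrapper never created it (for instance because the worker died first). The behavior one would expect is roughly the sketch below; this is illustrative only, not what r4963 actually does, and the class and method names are invented:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Illustrative only: tolerate a missing <jobname>.error file instead of
// propagating a "No such file or directory" exception up through Karajan.
class JobStatusFiles {
    /** Returns the contents of <jobdir>/<jobname>.error, or "" if the file was never written. */
    static String readErrorFile(File jobDir, String jobName) throws IOException {
        File errFile = new File(jobDir, jobName + ".error");
        if (!errFile.exists()) {
            return "";   // no error text produced; treat as empty rather than fatal
        }
        return new String(Files.readAllBytes(errFile.toPath()), StandardCharsets.UTF_8);
    }
}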
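And on the earlier hang: the dead-worker detection Mihael backported is, as I understand it, essentially a heartbeat timeout. A stripped-down sketch of the idea (not the trunk code; the names and the 120s timeout are invented here):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: workers that have not been heard from within the timeout
// are declared dead; their jobs are then failed so retries can resubmit them
// and their throttle slots are released instead of being held by zombie jobs.
class WorkerMonitor {
    private static final long TIMEOUT_MS = 120_000;   // assumed heartbeat timeout

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    /** Called whenever any message (including a heartbeat) arrives from a worker. */
    void touch(String workerId) {
        lastHeartbeat.put(workerId, System.currentTimeMillis());
    }

    /** Invoked periodically; returns the ids of workers presumed dead. */
    List<String> reap() {
        long now = System.currentTimeMillis();
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                dead.add(e.getKey());
            }
        }
        for (String id : dead) {
            lastHeartbeat.remove(id);
            // the caller would fail the jobs bound to this worker here
        }
        return dead;
    }
}

With the throttle set to 10, a handful of zombie jobs is enough to fill it completely, which matches the stuck "Submitted:10" progress line quoted above.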

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



