[Swift-devel] persistent coasters and data staging

Fri Oct 7 15:52:46 CDT 2011

The exception I'm seeing is:

Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
        at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29)
        at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27)
        at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214)
        at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
        at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28)
        at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29)
        at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20)
        at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197)
        at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227)
        at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104)
        at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

The run stops after about 10 minutes. The scripts and logs are in /autonfs/gpfs-pads/projects/CI-CCR000013/davidk/coaster-stress-tests
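
For anyone trying to reproduce this, the input set (100 files of exactly 10 megabytes each, as described in the quoted message below) could be generated with something like the following. This is a minimal sketch, not the script used in the test: the directory name "input", the data.NNNN.in file naming, the function name, and the reading of "10 megabytes" as 10 * 1024 * 1024 bytes are all assumptions.

```python
import os

FILE_COUNT = 100                 # number of input files, as in the test description
FILE_SIZE = 10 * 1024 * 1024     # assumption: "10 megabytes" taken as 10 MiB


def make_test_files(directory, count=FILE_COUNT, size=FILE_SIZE):
    """Create `count` files of exactly `size` bytes each in `directory`.

    File names (data.0000.in, data.0001.in, ...) are hypothetical and
    should be adapted to whatever the catsn-style script's mapper expects.
    """
    os.makedirs(directory, exist_ok=True)
    chunk = b"x" * (1024 * 1024)  # write in 1 MiB pieces to keep memory use low
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"data.{i:04d}.in")
        remaining = size
        with open(path, "wb") as f:
            while remaining > 0:
                n = min(remaining, len(chunk))
                f.write(chunk[:n])
                remaining -= n
        paths.append(path)
    return paths


if __name__ == "__main__":
    make_test_files("input")
```

Running the file as a script writes about 1 GB into ./input; pass a smaller count or size to make_test_files for a quick check.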

----- Original Message -----
> From: "David Kelly" <davidk at ci.uchicago.edu>
> To: "Mihael Hategan" <hategan at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>, "Michael Wilde" <wilde at mcs.anl.gov>
> Sent: Friday, October 7, 2011 3:51:00 PM
> Subject: Re: [Swift-devel] persistent coasters and data staging
> I wrote a test to try to replicate this issue. I am running the
> coaster service on bridled. My workers are running on the MCS servers.
> I am using a modified catsn script with 100 files, each exactly 10
> megabytes. After about 10 minutes of this, I get a failure in the
> logs:
> 
> 
> ----- Original Message -----
> > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Tuesday, October 4, 2011 1:35:04 PM
> > Subject: Re: [Swift-devel] persistent coasters and data staging
> > On Tue, 2011-10-04 at 12:23 -0500, Michael Wilde wrote:
> > > Mihael, Ketan, David,
> > >
> > > Ketan and I reviewed progress yesterday on ExTENCI applications and
> > > decided that for the moment Ketan will focus on the
> > > coaster-server-per-site+GridFTP configuration.
> > >
> > > David, I'd like you to take over the testing and troubleshooting of
> > > the configuration related to this email thread: single coaster
> > > server for all OSG sites, using provider staging.
> > >
> > > It seems like the next action was for Ketan to send Mihael the
> > > requested service log. I'm not sure if that was done, or if so,
> > > what it revealed.
> > >
> > > Also, in reviewing this email thread, it wasn't clear to me:
> > > Mihael, are you applying the fixes for this problem in trunk or
> > > the 0.93 branch? I believe that Ketan has been testing with the
> > > 0.93 branch.
> >
> > I was dealing with the 0.93 branch.
> >
> > >
> > > The other thing that was not clear to me, Mihael, was whether you
> > > have been able to replicate the problems that Ketan is
> > > experiencing in talking to OSG sites in your own test setups, or
> > > if we're in a mode of sending you symptoms that you can't
> > > replicate and validate the fixes for.
> >
> > I was able to replicate the original problem and fix a large chunk
> > of it.
> >
> > Ketan seems to be running into a different problem, but I suspect
> > it's a configuration issue of some sort.
> >
> > [...]
> >
> > Mihael