[Swift-devel] persistent coasters and data staging

Ketan Maheshwari ketancmaheshwari at gmail.com
Mon Oct 3 09:25:22 CDT 2011


Mihael,

On Sun, Oct 2, 2011 at 9:40 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> On Sun, 2011-10-02 at 21:27 -0500, Ketan Maheshwari wrote:
> > Mihael,
> >
> >
> > So far, I've been using the proxy mode:
> >
> >
> > <profile namespace="swift" key="stagingMethod">proxy</profile>
> >
> >
> > I just tried using the non-proxy (file/local) mode:
> >
> >
> > <filesystem provider="local" url="none" />
>
> <profile namespace="swift" key="stagingMethod">file</profile>
>

Thanks. However, when using the above file mode, Swift does not seem to be
progressing. On stdout, I see intermittent "Active: 1" lines, but they
disappear and the tasks go back to submitted status.

This goes on for about 20 minutes, after which the run does start, but with a
high number of failures carrying the following message:

Caused by: Task failed: null
org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:235)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:257)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227)
    at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:125)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:245)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:203)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:189)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:159)
    at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:98)

On the workers' stdout, I see that 59 workers are running:
"*** demandThread: swiftDemand=20 paddedDemand=24 totalRunning=59"

In the worker logs, I do not see any errors, except for one worker which
says:

"Failed to register (timeout)"

The log for this run is:
http://www.ci.uchicago.edu/~ketan/catsn-20111003-0901-nd7ta1bb.log

The data size for this run is 10MB per task.
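
For context, a minimal persistent-coasters pool entry combining the
provider-staging settings discussed above would look roughly like the
following sketch; the service URL and work directory are placeholders rather
than my exact settings:

  <pool handle="persistent-coasters">
    <execution provider="coaster-persistent" url="http://localhost:50010" jobmanager="local:local"/>
    <profile namespace="globus" key="workerManager">passive</profile>
    <profile namespace="swift" key="stagingMethod">file</profile>
    <filesystem provider="local" url="none"/>
    <workdirectory>/tmp/swiftwork</workdirectory>
  </pool>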

Regards,
Ketan



> And that is not related to the heartbeat error, which I'm not sure why
> you're getting.
>
> As for the errors you get in proxy mode, are you sure your workers are
> fine?
>
>
>


-- 
Ketan

