[Swift-devel] persistent coasters and data staging
Mihael Hategan
hategan at mcs.anl.gov
Mon Oct 3 14:23:17 CDT 2011
Are you running with a standalone coaster service? If yes, can you also
post the service log?
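
By "standalone" I mean a coaster service started on its own and pointed at
from sites.xml through the coaster-persistent provider. A minimal sketch of
the execution entry is below; the host:port is a placeholder for wherever
your service is listening, and jobmanager="local:local" assumes the service
starts its workers locally:

  <!-- sketch only: url and jobmanager values are placeholders -->
  <execution provider="coaster-persistent"
             url="service-host:port" jobmanager="local:local"/>

If that is your setup, the service keeps its own log, separate from the
Swift client log, and that is the one I'd like to see.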
On Mon, 2011-10-03 at 09:25 -0500, Ketan Maheshwari wrote:
> Mihael,
>
>
> On Sun, Oct 2, 2011 at 9:40 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Sun, 2011-10-02 at 21:27 -0500, Ketan Maheshwari wrote:
> > Mihael,
> >
> >
>
> > So far, I've been using the proxy mode:
> >
> >
> > <profile namespace="swift" key="stagingMethod">proxy</profile>
> >
> >
> > I just tried using the non-proxy (file/local) mode:
> >
> >
> > <filesystem provider="local" url="none" />
>
>
> <profile namespace="swift" key="stagingMethod">file</profile>
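
For reference, the local filesystem element and the stagingMethod profile
sit inside the same pool as the coaster execution entry, roughly as in the
sketch below; the handle, the host:port and the work directory are
placeholders, not values I expect you to have:

  <pool handle="persistent-coasters">
    <execution provider="coaster-persistent"
               url="service-host:port" jobmanager="local:local"/>
    <filesystem provider="local" url="none"/>
    <profile namespace="swift" key="stagingMethod">file</profile>
    <workdirectory>/path/to/swift.work</workdirectory>
  </pool>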
>
>
> Thanks, however, on using the above file mode, Swift does not seem to be
> progressing. On stdout, I see intermittent "Active: 1" lines, but they
> disappear and the jobs go back to Submitted status:
>
>
> This happens for about 20 minutes, after which the run starts but with a
> high number of failures and the following message:
>
>
> Caused by: Task failed: null
> org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:235)
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:257)
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:125)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:245)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:203)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:189)
>     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:159)
>     at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:98)
>
>
> On the workers' stdout, I see that 59 workers are running:
> "*** demandThread: swiftDemand=20 paddedDemand=24 totalRunning=59"
>
>
> In the worker logs, I do not see any errors except for one worker,
> which says:
>
>
> "Failed to register (timeout)"
>
>
> The log for this run is:
> http://www.ci.uchicago.edu/~ketan/catsn-20111003-0901-nd7ta1bb.log
>
>
> The data size for this run is 10MB per task.
>
>
> Regards,
> Ketan
>
>
>
>
>
> And that is not related to the heartbeat error, which I'm not sure why
> you're getting.
>
> As for the errors you get in proxy mode, are you sure your workers are
> fine?
>
>
>
>
>
>
> --
> Ketan
>
>
>