[Swift-devel] persistent coasters and data staging

Tue Oct 4 12:23:35 CDT 2011

Mihael, Ketan, David,

Ketan and I reviewed progress yesterday on ExTENCI applications, and decided that for the moment Ketan will focus on the coaster-server-per-site+GridFTP configuration.

David, I'd like you to take over the testing and troubleshooting of the configuration related to this email thread: single coaster server for all OSG sites, using provider staging.

It seems like the next action was for Ketan to send Mihael the requested service log. I'm not sure if that was done or, if so, what it revealed.

Also, in reviewing this email thread, it wasn't clear to me: Mihael, are you applying the fixes for this problem on trunk or on the 0.93 branch? I believe that Ketan has been testing with the 0.93 branch.

The other thing that was not clear to me, Mihael, was whether you have been able to replicate, in your own test setups, the problems that Ketan is experiencing in talking to OSG sites, or whether we're in a mode of sending you symptoms that you can't replicate or validate fixes for.

In order to get sufficient test coverage into the stress-test branch of the test suite, for the symptoms we've been seeing here, could you provide details on what you have been able to re-create, and how?

David, can you pick up this problem, work to replicate it as reproducible test-suite cases, then test the fixes, and then test on OSG?

We can discuss in more detail what that would entail. I was hopeful that we could recreate the OSG symptoms in a more controlled environment, between a CI lab machine and the MCS compute servers.

Thanks,

- Mike

----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, October 3, 2011 2:23:17 PM
> Subject: Re: [Swift-devel] persistent coasters and data staging
> Are you running with a standalone coaster service? If yes, can you also
> post the service log?
> 
> On Mon, 2011-10-03 at 09:25 -0500, Ketan Maheshwari wrote:
> > Mihael,
> >
> >
> > On Sun, Oct 2, 2011 at 9:40 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> >         On Sun, 2011-10-02 at 21:27 -0500, Ketan Maheshwari wrote:
> >         > Mihael,
> >         >
> >         >
> >
> >         > So far, I've been using the proxy mode:
> >         >
> >         >
> >         > <profile namespace="swift" key="stagingMethod">proxy</profile>
> >         >
> >         >
> >         > I just tried using the non-proxy (file/local) mode:
> >         >
> >         >
> >         > <filesystem provider="local" url="none" />
> >
> >
> >         <profile namespace="swift" key="stagingMethod">file</profile>
> >
> >
> > Thanks, however, on using the above file mode, Swift does not seem to
> > be progressing. On stdout, I see intermittent "Active: 1" lines but
> > they disappear and get back to submitted status:
> >
> >
> > This happens for about 20 minutes, after which the run starts but
> > with a high number of failures, with the following message:
> >
> >
> > Caused by: Task failed: null
> > org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:235)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:257)
> >     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:125)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:245)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:203)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:189)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:159)
> >     at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:98)
> >
> >
> > On the workers' stdout, I see that 59 workers are running:
> > "*** demandThread: swiftDemand=20 paddedDemand=24 totalRunning=59"
> >
> >
> > In the worker logs, I do not see any errors except for one worker,
> > which says:
> >
> >
> > "Failed to register (timeout)"
> >
> >
> > The log for this run is:
> > http://www.ci.uchicago.edu/~ketan/catsn-20111003-0901-nd7ta1bb.log
> >
> >
> > The data size for this run is 10MB per task.
> >
> >
> > Regards,
> > Ketan
> >
> >
> >
> >
> >
> >         And that is not related to the heartbeat error, which I'm not
> >         sure why you're getting.
> >
> >         As for the errors you get in proxy mode, are you sure your
> >         workers are fine?
> >
> >
> >
> >
> >
> >
> > --
> > Ketan
> >
> >
> >
> 
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory