[Swift-devel] persistent coasters and data staging
Ketan Maheshwari
ketancmaheshwari at gmail.com
Sun Oct 2 21:27:20 CDT 2011
Mihael,
So far, I've been using the proxy mode:
<profile namespace="swift" key="stagingMethod">proxy</profile>
I just tried using the non-proxy (file/local) mode:
<filesystem provider="local" url="none" />
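For reference, the relevant part of my sites.xml looks roughly like this (the
pool handle, service URL, jobmanager, and work directory below are placeholders
rather than the exact values from this run):

<pool handle="osg-coasters">
  <!-- persistent coaster service started separately; URL and jobmanager are placeholders -->
  <execution provider="coaster-persistent" url="http://localhost:50100"
             jobmanager="local:local"/>
  <!-- proxy mode: data is proxied from the client through the coaster service to the workers -->
  <profile namespace="swift" key="stagingMethod">proxy</profile>
  <!-- for the non-proxy (file/local) attempt, the staging profile above is replaced by: -->
  <!-- <filesystem provider="local" url="none"/> -->
  <workdirectory>/tmp/swiftwork</workdirectory>
</pool>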
With the file/local mode, the run doesn't progress. I get the following
timeout messages interspersed with the stdout status messages:
Command(2, HEARTBEAT): handling reply timeout;
sendReqTime=111002-211133.264, sendTime=111002-211255.655,
now=111002-212055.740
Command(2, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.TimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleTimeout(Command.java:253)
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:122)
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:116)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Progress: time: Sun, 02 Oct 2011 21:21:25 -0500 Submitting:100
On the other hand, in proxy mode I did not get any timeouts; however, 7 out of
100 jobs failed with the following errors:
The following errors have occurred:
1. Task failed: null
org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:235)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:257)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227)
        at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:125)
        at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:245)
        at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:203)
        at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:189)
        at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:159)
        at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:98)
(5 times)
2. Task failed: Connection to worker lost
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)
3. Task failed: Connection to worker lost
java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)
Regards,
Ketan
On Sun, Oct 2, 2011 at 4:38 AM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> I might have spoken a bit too soon there. There's still a timeout, but
> it occurs at higher loads during stageout. That's with proxy mode, so
> local (file) mode (i.e. what you should be using on OSG with the service
> running on the client node) may not necessarily show the same problem.
>
> On Sat, 2011-10-01 at 17:19 -0700, Mihael Hategan wrote:
> > This should be fixed now in cog r3293.
> >
> > There were two deadlocks. One that hung stage-ins and one that applied
> > to stageouts. These were only apparent when all the I/O buffers got
> > used, so only with relatively large staging activity.
> >
> > Please test.
> >
> > Mihael
> >
> > On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> > > Hi Mihael,
> > >
> > >
> > > I tested this fix. It seems that the timeout issue for large-ish data
> > > and throttle > ~30 persists. I am not sure if this is a data-staging
> > > timeout, though.
> > >
> > >
> > > The setup that fails is as follows:
> > >
> > >
> > > persistent coasters; resource = workers running on OSG
> > > data size = 8 MB, 100 data items
> > > foreach throttle = 40 = job throttle (rough sites.xml sketch below)
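> > > In sites.xml terms, the throttle part of this setup is roughly the
> > > following (indicative values, not copied verbatim from the failing run):
> > >
> > > <profile namespace="karajan" key="jobThrottle">0.39</profile>
> > > <profile namespace="karajan" key="initialScore">10000</profile>
> > >
> > > (With the jobThrottle*100+1 rule, 0.39 allows about 40 concurrent jobs;
> > > foreach.max.threads in swift.properties is set to match.)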
> > >
> > >
> > > The standard output intermittently shows some activity and then goes
> > > back to no activity, without any progress on tasks.
> > >
> > >
> > > Please find the log and stdout/stderr here:
> > > http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err
> > > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
> > >
> > >
> > > When I tested with small data (1 MB, 2 MB, 4 MB), it did work. The 4 MB
> > > case displayed fat-tail behavior, though: ~94 tasks completed steadily
> > > and quickly while the last 5-6 tasks took disproportionately long. The
> > > throttle in these cases was <= 30.
> > >
> > >
> > >
> > >
> > > Regards,
> > > Ketan
> > >
> > > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > > wrote:
> > > Try now please (cog r3262).
> > >
> > > On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
> > >
> > >
> > > > Mihael,
> > > >
> > > >
> > > > I tried with the new worker.pl, running a 100-task, 10 MB-per-task run
> > > > with throttle set at 100.
> > > >
> > > >
> > > > However, it seems to have failed with the same symptoms of timeout
> > > > error 521:
> > > >
> > > >
> > > > Caused by: null
> > > > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> > > > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53  Active:1  Failed:46
> > > > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53  Active:1  Failed:46
> > > > Exception in cat:
> > > > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> > > > Host: grid
> > > > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> > > > - - -
> > > >
> > > >
> > > > Caused by: null
> > > > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> > > > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52  Active:1  Failed:47
> > > > Exception in cat:
> > > > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> > > > Host: grid
> > > > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> > > >
> > > >
> > > > I had about 107 workers running at the time of these failures.
> > > >
> > > >
> > > > I started seeing the failure messages about 20 minutes into this run.
> > > >
> > > >
> > > > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> > > >
> > > >
> > > > Regards,
> > > > Ketan
> > > >
> > > >
> > > >
> > > > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> > > >
> > > > > After some discussion with Mike, our conclusion from these runs was
> > > > > that the parallel data transfers are causing timeouts from
> > > > > worker.pl. Further, we were undecided whether the timeout threshold
> > > > > is set too aggressively, how it is determined, and whether a change
> > > > > in that value could resolve the issue.
> > > >
> > > >
> > > > Something like that. Worker.pl would use the time when a file transfer
> > > > started to determine timeouts. This is undesirable. The purpose of
> > > > timeouts is to determine whether the other side has stopped properly
> > > > following the flow of things. It follows that any kind of activity
> > > > should reset the timeout... timer.
> > > >
> > > > I updated the worker code to deal with the issue in a proper way. But
> > > > now I need your help. This is Perl code, and it needs testing.
> > > >
> > > > So can you re-run, first with some simple test that uses coaster
> > > > staging (just to make sure I didn't mess something up), and then the
> > > > version of your tests that was most likely to fail?
> > > >
> > > > --
> > > > Ketan
> > > >
> > > >
> > > >
> > >
> > > --
> > > Ketan
> > >
> > >
> > >
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>
>
--
Ketan