[Swift-devel] persistent coasters and data staging

Ketan Maheshwari ketancmaheshwari at gmail.com
Sun Oct 2 21:27:20 CDT 2011


Mihael,

So far, I've been using the proxy mode:

<profile namespace="swift" key="stagingMethod">proxy</profile>

I just tried using the non-proxy (file/local) mode:

<filesystem provider="local" url="none" />
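(For context, the two variants sit in the sites.xml pool entry roughly as in
the sketch below. This is only a sketch: the handle, service URL/port, and
work directory are placeholders, not the values from my actual run.)

<config>
  <pool handle="persistent-coasters">
    <!-- placeholder URL/port of the already-running coaster service -->
    <execution provider="coaster-persistent"
               url="http://localhost:50000" jobmanager="local:local"/>
    <!-- proxy mode (what I had been using): -->
    <!-- <profile namespace="swift" key="stagingMethod">proxy</profile> -->
    <!-- non-proxy (file/local) mode, as tried above: -->
    <filesystem provider="local" url="none"/>
    <workdirectory>/tmp/swiftwork</workdirectory>
  </pool>
</config>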

The run doesn't progress. I get the following timeout messages interspersed
with the stdout status messages:
Command(2, HEARTBEAT): handling reply timeout;
sendReqTime=111002-211133.264, sendTime=111002-211255.655,
now=111002-212055.740
Command(2, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.TimeoutException
at org.globus.cog.karajan.workflow.service.commands.Command.handleTimeout(Command.java:253)
at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:122)
at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:116)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
Progress:  time: Sun, 02 Oct 2011 21:21:25 -0500  Submitting:100

On the other hand, while trying the proxy mode, I did not get any timeouts;
however, 7 out of 100 jobs failed with the following errors:

The following errors have occurred:
1. Task failed: null
org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:235)
at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:257)
at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227)
at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:125)
at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:245)
at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:203)
at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:189)
at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:159)
at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:98)
(5 times)
2. Task failed: Connection to worker lost
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
 at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)
3. Task failed: Connection to worker lost
java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
 at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)


Regards,
Ketan


On Sun, Oct 2, 2011 at 4:38 AM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> I might have spoken a bit too soon there. There's still a timeout, but
> it occurs at higher loads during stageout. That's with proxy mode, so
> local (file) mode (i.e. what you should be using on OSG with the service
> running on the client node) may not necessarily show the same problem.
>
> On Sat, 2011-10-01 at 17:19 -0700, Mihael Hategan wrote:
> > This should be fixed now in cog r3293.
> >
> > There were two deadlocks: one that hung stage-ins and one that applied
> > to stageouts. These were only apparent when all the I/O buffers got
> > used, so only with relatively large staging activity.
> >
> > Please test.
> >
> > Mihael
> >
> > On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> > > Hi Mihael,
> > >
> > >
> > > I tested this fix. It seems that the timeout issue for large-ish data
> > > and throttle > ~30 persists. I am not sure if this is a data staging
> > > timeout, though.
> > >
> > >
> > > The setup that fails is as follows:
> > >
> > >
> > > persistent coasters, resource = workers running on OSG
> > > data size = 8 MB, 100 data items
> > > foreach throttle = 40 = jobThrottle
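(For concreteness: my assumption is that a foreach throttle of 40 corresponds
to the settings below, using the usual "concurrent jobs per site = jobThrottle
x 100 + 1" rule. These lines are illustrative, not taken from the run.)

# swift.properties (assumed)
foreach.max.threads=40

# sites.xml (assumed; 0.40 allows roughly 41 concurrent jobs)
<profile namespace="karajan" key="jobThrottle">0.40</profile>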
> > >
> > >
> > > The standard output intermittently shows some activity and then goes
> > > back to showing no activity, without any progress on tasks.
> > >
> > >
> > > Please find the log and stdouterr here:
> > > http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err,
> > > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
> > >
> > >
> > > When I tested with small data (1 MB, 2 MB, 4 MB), it did work. The 4 MB
> > > case displayed fat-tail behavior, though: ~94 tasks completed steadily
> > > and quickly while the last 5-6 tasks took disproportionately long. The
> > > throttle in these cases was <= 30.
> > >
> > >
> > >
> > >
> > > Regards,
> > > Ketan
> > >
> > > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > > wrote:
> > >         Try now please (cog r3262).
> > >
> > >         On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
> > >
> > >
> > >         > Mihael,
> > >         >
> > >         >
> > >         > I tried with the new worker.pl, running a 100-task run with
> > >         > 10 MB per task and the throttle set at 100.
> > >         >
> > >         >
> > >         > However, it seems to have failed with the same timeout
> > >         > symptoms, error 521:
> > >         >
> > >         >
> > >         > Caused by: null
> > >         > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
> > >         > Job failed with an exit code of 521
> > >         > Progress:  time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53  Active:1  Failed:46
> > >         > Progress:  time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53  Active:1  Failed:46
> > >         > Exception in cat:
> > >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> > >         > Host: grid
> > >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> > >         > - - -
> > >         >
> > >         >
> > >         > Caused by: null
> > >         > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
> > >         > Job failed with an exit code of 521
> > >         > Progress:  time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52  Active:1  Failed:47
> > >         > Exception in cat:
> > >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> > >         > Host: grid
> > >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> > >         >
> > >         >
> > >         > I had about 107 workers running at the time of these failures.
> > >         >
> > >         >
> > >         > I started seeing the failure messages after about 20 minutes
> > >         > into this run.
> > >         >
> > >         >
> > >         > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> > >         >
> > >         >
> > >         > Regards,
> > >         > Ketan
> > >         >
> > >         >
> > >         >
> > >         > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > >         >         On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> > >         >
> > >         >         > After some discussion with Mike, our conclusion from
> > >         >         > these runs was that the parallel data transfers are
> > >         >         > causing timeouts from worker.pl. Further, we were
> > >         >         > undecided whether the timeout threshold is set too
> > >         >         > aggressively, how the timeouts are determined, and
> > >         >         > whether a change in that value could resolve the issue.
> > >         >
> > >         >
> > >         >         Something like that. Worker.pl would use the time when a
> > >         >         file transfer started to determine timeouts. This is
> > >         >         undesirable. The purpose of timeouts is to determine
> > >         >         whether the other side has stopped properly following the
> > >         >         flow of things. It follows that any kind of activity
> > >         >         should reset the timeout... timer.
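(Schematically, the pattern described above looks something like the Perl
sketch below. This is only an illustration of the general idea, not the
actual worker.pl code; the 120-second limit is a made-up placeholder.)

use strict;
use warnings;

my $TIMEOUT  = 120;                # allowed seconds of inactivity (placeholder)
my $deadline = time() + $TIMEOUT;  # initial deadline

# Called from the send/receive paths: ANY traffic pushes the deadline
# forward, instead of measuring from the moment a transfer started.
sub resetTimeout {
    $deadline = time() + $TIMEOUT;
}

# Polled periodically from the main loop: only a genuinely idle peer
# trips the timeout.
sub timedOut {
    return time() > $deadline;
}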
> > >         >
> > >         >         I updated the worker code to deal with the issue in a
> > >         >         proper way. But now I need your help. This is Perl code,
> > >         >         and it needs testing.
> > >         >
> > >         >         So can you re-run, first with some simple test that uses
> > >         >         coaster staging (just to make sure I didn't mess something
> > >         >         up), and then the version of your tests that was most
> > >         >         likely to fail?
> > >         >
> > >         >
> > >         >
> > >         >
> > >         >
> > >         > --
> > >         > Ketan
> > >         >
> > >         >
> > >         >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Ketan
> > >
> > >
> > >
> >
> >
>
>
>


-- 
Ketan