Mihael,

So far, I've been using the proxy mode:

<profile namespace="swift" key="stagingMethod">proxy</profile>

I just tried using the non-proxy (file/local) mode:

<filesystem provider="local" url="none" />
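For reference, the two setups differ only in that staging-related entry; everything else in the sites file stays the same. A rough sketch of the pool definition I am using is below (the handle, service URL, and work directory are placeholders rather than the exact values from these runs, and the execution line will depend on how the persistent coaster service is started):

<config>
  <pool handle="osg-coasters">
    <!-- persistent coaster service; the URL below is a placeholder -->
    <execution provider="coaster-persistent" url="http://localhost:1984" jobmanager="local:local"/>

    <!-- proxy mode: data is staged through the coaster service -->
    <profile namespace="swift" key="stagingMethod">proxy</profile>

    <!-- non-proxy (file/local) mode: swap the profile above for this entry -->
    <!-- <filesystem provider="local" url="none"/> -->

    <workdirectory>/home/ketan/swift.workdir</workdirectory>
  </pool>
</config>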
With the local mode, the run doesn't progress. I get the following timeout messages interspersed with the stdout status messages:

Command(2, HEARTBEAT): handling reply timeout; sendReqTime=111002-211133.264, sendTime=111002-211255.655, now=111002-212055.740
Command(2, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.TimeoutException
    at org.globus.cog.karajan.workflow.service.commands.Command.handleTimeout(Command.java:253)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:131)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.checkTimeouts(AbstractKarajanChannel.java:122)
    at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel$1.run(AbstractKarajanChannel.java:116)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
Progress:  time: Sun, 02 Oct 2011 21:21:25 -0500  Submitting:100

On the other hand, while trying the proxy mode, I did not get any timeouts; however, 7 out of 100 jobs failed with the following errors:
The following errors have occurred:

1. Task failed: null
org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:235)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:257)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:227)
    at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:125)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:245)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:203)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:189)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:159)
    at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:98) (5 times)

2. Task failed: Connection to worker lost
java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)

3. Task failed: Connection to worker lost
java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:305)
    at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:251)
Regards,
Ketan

On Sun, Oct 2, 2011 at 4:38 AM, Mihael Hategan <hategan@mcs.anl.gov> wrote:
I might have spoken a bit too soon there. There's still a timeout, but
it occurs at higher loads during stageout. That's with proxy mode, so
local (file) mode (i.e. what you should be using on OSG with the service
running on the client node) may not necessarily show the same problem.

On Sat, 2011-10-01 at 17:19 -0700, Mihael Hategan wrote:
> This should be fixed now in cog r3293.
>
> There were two deadlocks. One that hung stage-ins and one that applied
> to stageouts. These were only apparent when all the I/O buffers got
> used, so only with relatively large staging activity.
>
> Please test.
>
> Mihael
>
> > On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> > Hi Mihael,
> >
> > I tested this fix. It seems that the timeout issue for large-ish data
> > and throttle > ~30 persists. I am not sure if this is a data staging
> > timeout, though.
> >
> > The setup that fails is as follows:
> >
> > persistent coasters, resource = workers running on OSG
> > data size = 8MB, 100 data items
> > foreach throttle = 40 = jobthrottle
> >
> > The standard output intermittently shows some activity and then goes
> > back to no activity, without any progress on tasks.
> >
> > Please find the log and stdout/stderr here:
> > http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err,
> > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
> >
> > When I tested with small data, 1MB, 2MB, 4MB, it did work. 4MB
> > displayed a fat-tail behavior, though: ~94 tasks completed steadily
> > and quickly while the last 5-6 tasks took disproportionately long.
> > The throttle in these cases was <= 30.
> >
> > Regards,
> > Ketan
> >
> > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan <hategan@mcs.anl.gov> wrote:
> >         Try now please (cog r3262).
> >
> >         On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
> >
> >         > Mihael,
> >         >
> >         > I tried with the new worker.pl, running a 100-task, 10MB-per-task
> >         > run with throttle set at 100.
> >         >
> >         > However, it seems to have failed with the same symptoms of timeout
> >         > error 521:
> >         >
> >         > Caused by: null
> >         > Caused by:
> >         > org.globus.cog.abstraction.impl.common.execution.JobException: Job
> >         > failed with an exit code of 521
> >         > Progress:  time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53
> >         >  Active:1  Failed:46
> >         > Progress:  time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53
> >         >  Active:1  Failed:46
> >         > Exception in cat:
> >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> >         > Host: grid
> >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> >         > - - -
> >         >
> >         > Caused by: null
> >         > Caused by:
> >         > org.globus.cog.abstraction.impl.common.execution.JobException: Job
> >         > failed with an exit code of 521
> >         > Progress:  time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52
> >         >  Active:1  Failed:47
> >         > Exception in cat:
> >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> >         > Host: grid
> >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> >         >
> >         > I had about 107 workers running at the time of these failures.
> >         >
> >         > I started seeing the failure messages after about 20 minutes into
> >         > this run.
> >         >
> >         > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> >         >
> >         > Regards,
> >         > Ketan
> >         >
> >         >
> >         > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan@mcs.anl.gov> wrote:
> >         >         On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> >         >
> >         >         > After some discussion with Mike, our conclusion from these
> >         >         > runs was that the parallel data transfers are causing
> >         >         > timeouts from worker.pl; further, we were undecided whether
> >         >         > the timeout threshold is set too aggressively, how the
> >         >         > timeouts are determined, and whether a change in that value
> >         >         > could resolve the issue.
> >         >
> >         >         Something like that. Worker.pl would use the time when a
> >         >         file transfer started to determine timeouts. This is
> >         >         undesirable. The purpose of timeouts is to determine
> >         >         whether the other side has stopped properly following the
> >         >         flow of things. It follows that any kind of activity
> >         >         should reset the timeout... timer.
> >         >
> >         >         I updated the worker code to deal with the issue in a
> >         >         proper way. But now I need your help. This is Perl code,
> >         >         and it needs testing.
> >         >
> >         >         So can you re-run, first with some simple test that uses
> >         >         coaster staging (just to make sure I didn't mess something
> >         >         up), and then the version of your tests that was most
> >         >         likely to fail?
> >         >
> >         > --
> >         > Ketan
> >         >
> >
> > --
> > Ketan
> >
>
</div></div><div><div></div><div>> _______________________________________________<br>
> Swift-devel mailing list<br>
> <a href="mailto:Swift-devel@ci.uchicago.edu" target="_blank">Swift-devel@ci.uchicago.edu</a><br>
> <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel</a><br>
<br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ketan<br><br><br>