Mihael,<div><br></div><div>I tried the new <a href="http://worker.pl">worker.pl</a>, running 100 tasks at 10 MB per task with the throttle set to 100.</div><div><br></div><div>However, it seems to have failed with the same timeout symptoms, error 521:</div>
<div><br></div><div><div>Caused by: null</div><div>Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521</div><div>Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 Submitted:53 Active:1 Failed:46</div>
<div>Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 Submitted:53 Active:1 Failed:46</div><div>Exception in cat:</div><div>Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]</div><div>Host: grid</div><div>Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk</div>
<div>- - -</div><div><br></div><div>Caused by: null</div><div>Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521</div><div>Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 Submitted:52 Active:1 Failed:47</div>
<div>Exception in cat:</div><div>Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]</div><div>Host: grid</div><div>Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk</div></div><div><br></div><div>I had about 107 workers running at the time of these failures.</div>
<div><br></div><div>I started seeing the failure messages about 20 minutes into this run.</div><div><br></div><div>The logs are in <a href="http://www.ci.uchicago.edu/~ketan/pack.tgz">http://www.ci.uchicago.edu/~ketan/pack.tgz</a><br>
<br></div><div>Regards,</div><div>Ketan</div><div><br></div><div><br><div class="gmail_quote">On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:<br>
<br>
> After some discussion with Mike, our conclusion from these runs was<br>
> that the parallel data transfers are causing timeouts in<br>
> <a href="http://worker.pl" target="_blank">worker.pl</a>. We were undecided on whether the timeout threshold<br>
> is set too aggressively, how it is determined, and whether changing<br>
> that value could resolve the issue.<br>
<br>
</div>Something like that. Worker.pl used the time when a file transfer<br>
started to determine timeouts. This is undesirable. The purpose of<br>
timeouts is to determine whether the other side has stopped<br>
properly following the flow of things. It follows that any kind of<br>
activity should reset the timeout... timer.<br>
<br>
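As a rough illustration of that idea (a Python sketch, not the actual Perl in worker.pl; the ActivityTimeout class, its method names, and the injectable clock are all hypothetical), a timer that is reset by any activity rather than fixed at transfer start might look like this:<br>

```python
import time


class ActivityTimeout:
    """Hypothetical helper: expires only after `limit` seconds of silence."""

    def __init__(self, limit, clock=time.monotonic):
        self.limit = limit          # allowed idle time, in seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_activity = clock()

    def touch(self):
        # Any sign of life (for example, a received data chunk) resets the timer.
        self.last_activity = self.clock()

    def expired(self):
        # Measured from the last activity, not from when the transfer started.
        return self.clock() - self.last_activity >= self.limit
```

With this shape, a long transfer that keeps delivering chunks never trips the timeout, as long as touch() is called for each chunk; only a peer that goes quiet for the full limit does.<br>
<br>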
I updated the worker code to deal with the issue in a proper way. But<br>
now I need your help. This is Perl code, and it needs testing.<br>
<br>
So can you re-run, first with some simple test that uses coaster staging<br>
(just to make sure I didn't mess something up), and then the version of<br>
your tests that was most likely to fail?<br>
<br>
</blockquote></div>
</div>