Mihael,<div><br></div><div>The experiments and logs I sent you above are not from the SCEC workflow. These are just the catsn scripts. The logs also don't show anything related to an invalid path.<br><br><br>The var_str invalid path issue still persists, though, and I am trying to debug it, but that is a completely different issue.</div>

<div><br></div><div>Regards,</div><div>Ketan</div><div><br></div><div><div class="gmail_quote">On Thu, Sep 22, 2011 at 1:57 PM, Mihael Hategan <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">What I see in the log is the error about the invalid path, which, as I<br>

mentioned before, is an issue of var_str seemingly being empty (you may<br>

want to trace its value though to confirm). I don't see anything about a<br>

stagein/out issue.<br>

<font color="#888888"><br>

Mihael<br>

</font><div><div></div><div class="h5"><br>

On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:<br>

> Hi Mihael,<br>

><br>

><br>

> I tested this fix. It seems that the timeout issue for large-ish data<br>

> and throttle > ~30 persists. I am not sure if this is data staging<br>

> timeout though.<br>

><br>

><br>

> The setup that fails is as follows:<br>

><br>

><br>

> persistent coasters, resource= workers running on OSG<br>

> data size=8MB, 100 data items.<br>

> foreach throttle=40=jobthrottle.<br>

><br>

><br>

> The standard output intermittently shows some activity and<br>

> then goes back to no activity, without any progress on tasks.<br>

><br>

><br>

> Please find the log and stdouterr<br>

> here: <a href="http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err" target="_blank">http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err</a>,<br>

>  <a href="http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log" target="_blank">http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log</a><br>

><br>

><br>

> When I tested with small data, 1MB, 2MB, 4MB, it did work. 4MB<br>

> displayed fat-tail behavior though: ~94 tasks completed steadily<br>

> and quickly, while the last 5-6 tasks took disproportionately long.<br>

> The throttle in these cases was <= 30.<br>

><br>

><br>

><br>

><br>

> Regards,<br>

> Ketan<br>

><br>

> On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>

> wrote:<br>

>         Try now please (cog r3262).<br>

><br>

>         On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:<br>

><br>

><br>

>         > Mihael,<br>

>         ><br>

>         ><br>

>         > I tried the new <a href="http://worker.pl" target="_blank">worker.pl</a> with a 100-task run, 10MB per task,<br>

>         > with the throttle set at 100.<br>

>         ><br>

>         ><br>

>         > However, it seems to have failed with the same symptoms of<br>

>         timeout<br>

>         > error 521:<br>

>         ><br>

>         ><br>

>         > Caused by: null<br>

>         > Caused by:<br>

>         ><br>

>         org.globus.cog.abstraction.impl.common.execution.JobException:<br>

>         Job<br>

>         > failed with an exit code of 521<br>

>         > Progress:  time: Mon, 12 Sep 2011 15:45:31 -0500<br>

>          Submitted:53<br>

>         >  Active:1  Failed:46<br>

>         > Progress:  time: Mon, 12 Sep 2011 15:45:34 -0500<br>

>          Submitted:53<br>

>         >  Active:1  Failed:46<br>

>         > Exception in cat:<br>

>         > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]<br>

>         > Host: grid<br>

>         > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk<br>

>         > - - -<br>

>         ><br>

>         ><br>

>         > Caused by: null<br>

>         > Caused by:<br>

>         ><br>

>         org.globus.cog.abstraction.impl.common.execution.JobException:<br>

>         Job<br>

>         > failed with an exit code of 521<br>

>         > Progress:  time: Mon, 12 Sep 2011 15:45:45 -0500<br>

>          Submitted:52<br>

>         >  Active:1  Failed:47<br>

>         > Exception in cat:<br>

>         > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]<br>

>         > Host: grid<br>

>         > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk<br>

>         ><br>

>         ><br>

>         > I had about 107 workers running at the time of these<br>

>         failures.<br>

>         ><br>

>         ><br>

>         > I started seeing the failure messages after about 20 minutes<br>

>         into this<br>

>         > run.<br>

>         ><br>

>         ><br>

>         > The logs are in <a href="http://www.ci.uchicago.edu/~ketan/pack.tgz" target="_blank">http://www.ci.uchicago.edu/~ketan/pack.tgz</a><br>

>         ><br>

>         ><br>

>         > Regards,<br>

>         > Ketan<br>

>         ><br>

>         ><br>

>         ><br>

>         > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan<br>

>         <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>

>         > wrote:<br>

>         >         On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari<br>

>         wrote:<br>

>         ><br>

>         >         > After some discussion with Mike, our conclusion from these<br>

>         >         > runs was that the parallel data transfers are causing<br>

>         >         > timeouts in <a href="http://worker.pl" target="_blank">worker.pl</a>. Further, we were undecided whether the<br>

>         >         > timeout threshold is set too aggressively, how it is<br>

>         >         > determined, and whether a change in that value could<br>

>         >         > resolve the issue.<br>

>         ><br>

>         ><br>

>         >         Something like that. Worker.pl would use the time<br>

>         when a file<br>

>         >         transfer<br>

>         >         started to determine timeouts. This is undesirable.<br>

>         The<br>

>         >         purpose of<br>

>         >         timeouts is to determine whether the other side has<br>

>         stopped<br>

>         >         from<br>

>         >         properly following the flow of things. It follows<br>

>         that any<br>

>         >         kind of<br>

>         >         activity should reset the timeout... timer.<br>

>         ><br>

>         >         I updated the worker code to deal with the issue in<br>

>         a proper<br>

>         >         way. But<br>

>         >         now I need your help. This is perl code, and it<br>

>         needs testing.<br>

>         ><br>

>         >         So can you re-run, first with some simple test that<br>

>         uses<br>

>         >         coaster staging<br>

>         >         (just to make sure I didn't mess something up), and<br>

>         then the<br>

>         >         version of<br>

>         >         your tests that was most likely to fail?<br>

>         ><br>

>         ><br>

>         ><br>

>         ><br>

>         ><br>

>         > --<br>

>         > Ketan<br>

>         ><br>

>         ><br>

>         ><br>

><br>

><br>

><br>

><br>

><br>

><br>

><br>

> --<br>

> Ketan<br>

><br>

><br>

><br>

<br>

<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ketan<br><br><br>

</div>