[Swift-devel] persistent coasters and data staging

Mon Sep 12 16:21:40 CDT 2011

Ok. I see the same problem in the service code. I'm working on a fix.

On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
> Mihael,
> 
> 
> I tried the new worker.pl with a run of 100 tasks, 10 MB per task,
> with the throttle set to 100.
> 
> 
> However, it appears to have failed with the same symptom, the timeout
> error 521:
> 
> 
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> Progress:  time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53  Active:1  Failed:46
> Progress:  time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53  Active:1  Failed:46
> Exception in cat:
> Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> Host: grid
> Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> - - -
> 
> 
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> Progress:  time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52  Active:1  Failed:47
> Exception in cat:
> Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> Host: grid
> Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> 
> 
> I had about 107 workers running at the time of these failures.
> 
> 
> I started seeing the failure messages about 20 minutes into this run.
> 
> 
> The logs are at http://www.ci.uchicago.edu/~ketan/pack.tgz
> 
> 
> Regards,
> Ketan
> 
> 
> 
> On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
>         On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
>         
>         > After some discussion with Mike, our conclusion from these runs
>         > was that the parallel data transfers are causing timeouts in
>         > worker.pl. We were undecided whether the timeout threshold is
>         > set too aggressively, how it is determined, and whether
>         > changing that value could resolve the issue.
>         
>         
>         Something like that. Worker.pl would use the time when a file
>         transfer started to determine timeouts. This is undesirable. The
>         purpose of timeouts is to determine whether the other side has
>         stopped properly following the flow of things. It follows that
>         any kind of activity should reset the timeout... timer.
>         
>         I updated the worker code to deal with the issue properly, but
>         now I need your help. This is Perl code, and it needs testing.
>         
>         So can you re-run, first with a simple test that uses coaster
>         staging (just to make sure I didn't mess something up), and then
>         with the version of your tests that was most likely to fail?
>         
> 
> 
> 
> 
> -- 
> Ketan
> 
> 
>
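
The point made in the quoted message, that any kind of activity should reset the timeout timer, can be illustrated with a small sketch. It is written in Python rather than the Perl of worker.pl, and it is not the actual worker code: the class TransferTimer, its methods, the simulate helper, and the numbers used are all hypothetical, chosen only for illustration. It contrasts a timeout measured from the start of a transfer with one measured from the last observed activity.

```python
class TransferTimer:
    """Tracks timeout state for one file transfer (hypothetical helper)."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.started = now        # when the transfer began
        self.last_activity = now  # when data was last seen from the peer

    def on_activity(self, now):
        # Any sign of life from the other side resets the timer.
        self.last_activity = now

    def expired_since_start(self, now):
        # Start-based check: measures total transfer time, so a slow
        # but healthy transfer eventually trips it.
        return now - self.started > self.timeout_s

    def expired_since_activity(self, now):
        # Activity-based check: measures silence only.
        return now - self.last_activity > self.timeout_s


def simulate(chunk_times, timeout_s):
    """Feed chunk arrival times (in seconds) to a timer.

    Returns a pair (start_based, activity_based) saying whether each
    policy would have reported a timeout at any chunk arrival.
    """
    timer = TransferTimer(timeout_s, now=0.0)
    start_based = False
    activity_based = False
    for t in chunk_times:
        start_based = start_based or timer.expired_since_start(t)
        activity_based = activity_based or timer.expired_since_activity(t)
        timer.on_activity(t)
    return start_based, activity_based
```

With a chunk arriving every 30 seconds and a 60-second limit, simulate([30, 60, 90, 120, 150], 60) reports a timeout under the start-based policy but not under the activity-based one, which is the behaviour a slow but live parallel transfer needs. A transfer that really goes silent, such as simulate([10, 100], 60), is flagged by both.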