[Swift-devel] persistent coasters and data staging

Ketan Maheshwari ketancmaheshwari at gmail.com
Mon Sep 12 15:56:29 CDT 2011


Mihael,

I tried the new worker.pl with a run of 100 tasks, 10 MB per task, and the
throttle set to 100.

However, it seems to have failed with the same symptom, the timeout error
521:

Caused by: null
Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
Job failed with an exit code of 521
Progress:  time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53  Active:1  Failed:46
Progress:  time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53  Active:1  Failed:46
Exception in cat:
Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
Host: grid
Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
- - -

Caused by: null
Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
Job failed with an exit code of 521
Progress:  time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52  Active:1  Failed:47
Exception in cat:
Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
Host: grid
Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk

I had about 107 workers running at the time of these failures.

I started seeing the failure messages about 20 minutes into the run.

The logs are at http://www.ci.uchicago.edu/~ketan/pack.tgz

Regards,
Ketan


On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
>
> > After some discussion with Mike, our conclusion from these runs was
> > that the parallel data transfers are causing timeouts in worker.pl.
> > We were undecided whether the timeout threshold is set too
> > aggressively, how it is determined, and whether changing that value
> > could resolve the issue.
>
> Something like that. Worker.pl used the time at which a file transfer
> started to determine timeouts. This is undesirable. The purpose of
> timeouts is to determine whether the other side has stopped
> properly following the flow of things. It follows that any kind of
> activity should reset the timeout... timer.
>
> I updated the worker code to deal with the issue in a proper way. But
> now I need your help. This is perl code, and it needs testing.
>
> So can you re-run, first with some simple test that uses coaster staging
> (just to make sure I didn't mess something up), and then the version of
> your tests that was most likely to fail?
>
>
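
The difference between the two timeout policies described in the quoted
message can be sketched as follows. This is only a minimal Python
illustration, not the actual worker.pl code: the class names, the
simulate() helper and the numbers used are all hypothetical. It compares a
timeout measured from the start of a transfer with one that is reset by any
activity.

```python
import time


class StartTimeout:
    """Timeout measured from the start of the transfer; never reset."""

    def __init__(self, limit_seconds, clock=time.monotonic):
        self.limit = limit_seconds
        self.clock = clock
        self.started = clock()

    def expired(self):
        return self.clock() - self.started > self.limit


class ActivityTimeout:
    """Idle timeout: any activity resets the timer (hypothetical helper)."""

    def __init__(self, limit_seconds, clock=time.monotonic):
        self.limit = limit_seconds
        self.clock = clock
        self.last_activity = clock()

    def touch(self):
        # Call on every sign of life, e.g. each data chunk received.
        self.last_activity = self.clock()

    def expired(self):
        return self.clock() - self.last_activity > self.limit


def simulate(chunk_times, limit):
    """Replay chunk arrival times (in seconds) against both policies.

    Returns (start_policy_fired, activity_policy_fired).
    """
    now = [0.0]                      # fake clock so the demo is deterministic
    clock = lambda: now[0]
    by_start = StartTimeout(limit, clock)
    by_activity = ActivityTimeout(limit, clock)
    start_fired = activity_fired = False
    for t in chunk_times:
        now[0] = t
        # Check both timers before recording the new chunk as activity.
        start_fired = start_fired or by_start.expired()
        activity_fired = activity_fired or by_activity.expired()
        by_activity.touch()
    return start_fired, activity_fired


# A slow but steady transfer: one chunk every 10 s, limit of 25 s.
print(simulate([10, 20, 30, 40, 50], limit=25))
```

With a chunk arriving every 10 seconds and a 25 second limit, the
start-based policy fires once the transfer has run longer than the limit,
even though data is still flowing, while the activity-based policy fires
only if the other side goes quiet for longer than the limit.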


-- 
Ketan