[Swift-devel] persistent coasters and data staging
Mihael Hategan
hategan at mcs.anl.gov
Sat Oct 1 19:19:58 CDT 2011
This should be fixed now in cog r3293.
There were two deadlocks: one hung stage-ins and the other affected
stage-outs. Both showed up only when all the I/O buffers were in use,
i.e., only with relatively heavy staging activity.
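(For illustration only -- a tiny Perl sketch of that failure class, not the
actual cog code; the pool size, the two threads and the semaphores below are
all assumptions made up for the demo:)

    #!/usr/bin/perl
    # A fixed pool of I/O buffers shared by both staging directions.  Once
    # the stage-in side holds every buffer while waiting on the stage-out
    # side, and the stage-out side cannot start without a buffer of its own,
    # neither can proceed -- which is why the hang only appears under heavy
    # staging, when the pool actually empties.
    use strict;
    use warnings;
    use threads;
    use Thread::Semaphore;

    my $POOL_SIZE = 2;                              # deliberately tiny pool
    my $buffers   = Thread::Semaphore->new($POOL_SIZE);
    my $drained   = Thread::Semaphore->new(0);      # posted when a stage-out drains data

    my $stage_in = threads->create(sub {
        $buffers->down($POOL_SIZE);   # stage-ins grab every free buffer ...
        $drained->down();             # ... then wait for a stage-out to make room
        $buffers->up($POOL_SIZE);
    });

    sleep 1;                          # let the stage-in side win the race

    my $stage_out = threads->create(sub {
        $buffers->down();             # also needs a buffer: blocks forever
        $drained->up();               # never reached -> circular wait
        $buffers->up();
    });

    # With a big enough pool this finishes; once the pool empties, it never does.
    $_->join() for ($stage_in, $stage_out);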
Please test.
Mihael
On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> Hi Mihael,
>
>
> I tested this fix. It seems that the timeout issue for large-ish data
> and throttle > ~30 persists. I am not sure if this is a data staging
> timeout, though.
>
>
> The setup that fails is as follows:
>
>
> persistent coasters; resource = workers running on OSG
> data size = 8 MB, 100 data items
> foreach throttle = 40 = jobthrottle
>
>
> The standard output intermittently shows some activity and then goes
> back to no activity, without any progress on tasks.
>
>
> Please find the log and stdout/stderr
> here: http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err,
> http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
>
>
> When I tested with small data (1 MB, 2 MB, 4 MB), it did work. The 4 MB
> case displayed fat-tail behavior, though: ~94 tasks completed steadily
> and quickly while the last 5-6 tasks took disproportionately long. The
> throttle in these cases was <= 30.
>
>
>
>
> Regards,
> Ketan
>
> On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
> Try now please (cog r3262).
>
> On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
>
>
> > Mihael,
> >
> >
> > I tried with the new worker.pl, running a 100-task run, 10 MB per task,
> > with the throttle set at 100.
> >
> >
> > However, it seems to have failed with the same symptoms of timeout
> > error 521:
> >
> >
> > Caused by: null
> > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53 Active:1 Failed:46
> > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53 Active:1 Failed:46
> > Exception in cat:
> > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> > Host: grid
> > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> > - - -
> >
> >
> > Caused by: null
> > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52 Active:1 Failed:47
> > Exception in cat:
> > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> > Host: grid
> > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> >
> >
> > I had about 107 workers running at the time of these failures.
> >
> >
> > I started seeing the failure messages about 20 minutes into this run.
> >
> >
> > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> >
> >
> > Regards,
> > Ketan
> >
> >
> >
> > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> >
> > > After some discussion with Mike, our conclusion from these runs was
> > > that the parallel data transfers are causing timeouts from worker.pl.
> > > Further, we were undecided on whether the timeout threshold is set too
> > > aggressively, how the timeouts are determined, and whether a change in
> > > that value could resolve the issue.
> >
> >
> > Something like that. Worker.pl would use the time when a file transfer
> > started to determine timeouts. This is undesirable. The purpose of
> > timeouts is to determine whether the other side has stopped properly
> > following the flow of things. It follows that any kind of activity
> > should reset the timeout... timer.
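(For illustration only -- a minimal sketch of that idea, not the actual
worker.pl code; the function names and the 60-second threshold below are
assumptions:)

    #!/usr/bin/perl
    # Time out on *inactivity*, not on elapsed time since the transfer began.
    use strict;
    use warnings;

    my $TIMEOUT = 60;        # seconds of allowed silence (assumed value)
    my %last_activity;       # per-transfer timestamp of the most recent event

    # Call this whenever anything happens on a transfer -- a chunk sent or
    # received, an acknowledgement, a heartbeat.  Any activity resets the clock.
    sub touch {
        my ($transfer_id) = @_;
        $last_activity{$transfer_id} = time();
    }

    # Called periodically from the main loop; only transfers that have been
    # completely silent for $TIMEOUT seconds are declared dead.
    sub check_timeouts {
        my $now = time();
        for my $id (keys %last_activity) {
            next if $now - $last_activity{$id} <= $TIMEOUT;
            warn "transfer $id timed out after ${TIMEOUT}s of inactivity\n";
            delete $last_activity{$id};
            # ... abort the transfer / fail the job here ...
        }
    }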
> >
> > I updated the worker code to deal with the issue in a proper way. But
> > now I need your help. This is Perl code, and it needs testing.
> >
> > So can you re-run, first with some simple test that uses coaster staging
> > (just to make sure I didn't mess something up), and then the version of
> > your tests that was most likely to fail?
> >
> >
> >
> >
> >
> > --
> > Ketan
> >
> >
> >
>
>
>
>
>
>
>
> --
> Ketan
>
>
>