[Swift-devel] persistent coasters and data staging

Mihael Hategan hategan at mcs.anl.gov
Sun Oct 2 04:38:30 CDT 2011


I might have spoken a bit too soon there. There's still a timeout, but
it occurs at higher loads during stageout. That's with proxy mode, so
local (file) mode (i.e. what you should be using on OSG with the service
running on the client node) may not necessarily show the same problem.

On Sat, 2011-10-01 at 17:19 -0700, Mihael Hategan wrote:
> This should be fixed now in cog r3293.
> 
> There were two deadlocks. One that hung stage-ins and one that applied
> to stageouts. These were only apparent when all the I/O buffers got
> used, so only with relatively large staging activity.
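>
As an aside, the deadlock class described above is easiest to picture against a
bounded pool of I/O buffers: once every buffer is checked out, any code path
that needs a buffer before it can free one stalls everything. A minimal sketch
in Java, purely illustrative and not the actual cog/coaster code:

    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Illustration only: a fixed pool of I/O buffers. Under heavy staging
    // every buffer can be in use at once; if the path that would return one
    // is itself waiting for a buffer, stage-ins/outs hang.
    class BufferPool {
        private final BlockingQueue<ByteBuffer> free;

        BufferPool(int count, int size) {
            free = new ArrayBlockingQueue<>(count);
            for (int i = 0; i < count; i++) {
                free.add(ByteBuffer.allocate(size));
            }
        }

        ByteBuffer acquire() throws InterruptedException {
            return free.take();            // blocks when the pool is exhausted
        }

        void release(ByteBuffer b) {
            b.clear();
            free.offer(b);                 // always give the buffer back
        }
    }

The discipline that avoids the hang is to pair acquire() and release() in
try/finally so a failed transfer still returns its buffer, and to never block
waiting for a second buffer while holding one.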
> 
> Please test.
> 
> Mihael
> 
> On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> > Hi Mihael,
> > 
> > 
> > I tested this fix. It seems that the timeout issue for large-ish data
> > and throttle > ~30 persists. I am not sure whether this is a data
> > staging timeout, though.
> > 
> > 
> > The setup that fails is as follows:
> > 
> > 
> > persistent coasters; resource: workers running on OSG
> > data size: 8 MB, 100 data items
> > foreach throttle = jobthrottle = 40
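> >
For context, a hedged sketch of the kind of configuration this setup implies.
Property and profile names vary between Swift releases, and the URL and paths
here are placeholders, so treat the exact keys as assumptions rather than the
configuration actually used in this test:

    # swift.properties (assumed)
    use.provider.staging=true
    foreach.max.threads=40

    <!-- sites.xml (assumed): persistent coaster service, workers on OSG -->
    <pool handle="grid">
      <execution provider="coaster-persistent" url="http://localhost:50000"
                 jobmanager="local:local"/>
      <!-- selects the proxy vs. local (file) staging mode mentioned at the
           top of this thread -->
      <profile namespace="swift" key="stagingMethod">proxy</profile>
      <!-- a jobThrottle around 0.40 allows on the order of 40 concurrent jobs -->
      <profile namespace="karajan" key="jobThrottle">0.40</profile>
      <profile namespace="karajan" key="initialScore">10000</profile>
      <workdirectory>/tmp/swift.work</workdirectory>
    </pool>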
> > 
> > 
> > The standard output intermittently shows some activity and then goes
> > back to no activity, without any progress on tasks.
> > 
> > 
> > Please find the log and stdout/stderr
> > here: http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err,
> >  http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
> > 
> > 
> > When I tested with small data (1 MB, 2 MB, 4 MB), it did work. The 4 MB
> > case displayed fat-tail behavior, though: ~94 tasks completed steadily
> > and quickly, while the last 5-6 tasks took disproportionately long.
> > The throttle in these cases was <= 30.
> > 
> > 
> > 
> > 
> > Regards,
> > Ketan
> > 
> > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> >         Try now please (cog r3262).
> >         
> >         On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
> >         
> >         
> >         > Mihael,
> >         >
> >         >
> >         > I tried with the new worker.pl, running a 100-task, 10 MB-per-task
> >         > run with throttle set at 100.
> >         >
> >         >
> >         > However, it seems to have failed with the same symptoms of the
> >         > timeout error 521:
> >         >
> >         >
> >         > Caused by: null
> >         > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
> >         > Job failed with an exit code of 521
> >         > Progress:  time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53  Active:1  Failed:46
> >         > Progress:  time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53  Active:1  Failed:46
> >         > Exception in cat:
> >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> >         > Host: grid
> >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> >         > - - -
> >         >
> >         >
> >         > Caused by: null
> >         > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
> >         > Job failed with an exit code of 521
> >         > Progress:  time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52  Active:1  Failed:47
> >         > Exception in cat:
> >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> >         > Host: grid
> >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> >         >
> >         >
> >         > I had about 107 workers running at the time of these failures.
> >         >
> >         >
> >         > I started seeing the failure messages after about 20 minutes into
> >         > this run.
> >         >
> >         >
> >         > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> >         >
> >         >
> >         > Regards,
> >         > Ketan
> >         >
> >         >
> >         >
> >         > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan at mcs.anl.gov>
> >         > wrote:
> >         >         On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> >         >
> >         >         > After some discussion with Mike, our conclusion from
> >         >         > these runs was that the parallel data transfers are
> >         >         > causing timeouts from worker.pl. Further, we were
> >         >         > undecided whether the timeout threshold is set too
> >         >         > aggressively, how the timeouts are determined, and
> >         >         > whether a change in that value could resolve the issue.
> >         >
> >         >
> >         >         Something like that. Worker.pl would use the time when a
> >         >         file transfer started to determine timeouts. This is
> >         >         undesirable. The purpose of timeouts is to determine
> >         >         whether the other side has stopped properly following the
> >         >         flow of things. It follows that any kind of activity
> >         >         should reset the timeout... timer.
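> >         >
A minimal sketch of the inactivity-based timeout described here, written in
Java for brevity (worker.pl itself is Perl); this illustrates the idea and is
not the actual worker code:

    import java.util.concurrent.TimeUnit;

    // Illustration only: measure the timeout from the last observed activity
    // on the channel, not from the moment the transfer started.
    class InactivityTimer {
        private final long limitNanos;
        private volatile long lastActivity = System.nanoTime();

        InactivityTimer(long limit, TimeUnit unit) {
            this.limitNanos = unit.toNanos(limit);
        }

        // Call after every successful read or write, however small.
        void touch() {
            lastActivity = System.nanoTime();
        }

        // True only if nothing has moved for longer than the limit.
        boolean expired() {
            return System.nanoTime() - lastActivity > limitNanos;
        }
    }

Every partial read or write would call touch(), so a slow but still-moving
transfer never trips the timeout; only genuine silence does.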
> >         >
> >         >         I updated the worker code to deal with the issue in a
> >         >         proper way. But now I need your help. This is perl code,
> >         >         and it needs testing.
> >         >
> >         >         So can you re-run, first with some simple test that uses
> >         >         coaster staging (just to make sure I didn't mess
> >         >         something up), and then the version of your tests that
> >         >         was most likely to fail?
> >         >
> >         >
> >         >
> >         >
> >         >
> >         > --
> >         > Ketan
> >         >
> >         >
> >         >
> >         
> >         
> >         
> > 
> > 
> > 
> > 
> > -- 
> > Ketan
> > 
> > 
> > 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
