[Swift-devel] persistent coasters and data staging

Mihael Hategan hategan at mcs.anl.gov
Tue Sep 27 00:51:40 CDT 2011


So it might be that the client runs out of buffers, which delays some
transfers and in turn causes the timeouts.

I need to confirm this, but a quick fix may be to disable timeouts for
file transfers. The alternative would be to send some periodic "still
queued" message. I'll give this some thought, but suggestions are
welcome.
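The "reset the timer on any activity" fix discussed downthread can be sketched as follows. This is a purely illustrative Python sketch; the class and method names are invented here and are not worker.pl's actual code:

```python
import time

class ActivityTimeout:
    """Deadline that is pushed back by any observed activity,
    rather than measured from when a file transfer started.
    Hypothetical sketch; not the coaster protocol's real API."""

    def __init__(self, limit_s):
        self.limit_s = limit_s
        self.last_activity = time.monotonic()

    def touch(self):
        # Call on ANY traffic: a data chunk, an ack, or a periodic
        # "still queued" heartbeat while the transfer waits in a queue.
        self.last_activity = time.monotonic()

    def expired(self):
        # Fires only after limit_s seconds with no activity at all.
        return time.monotonic() - self.last_activity > self.limit_s
```

A transfer loop would call touch() for every chunk sent or received, so a slow but progressing transfer never times out, while a genuinely stalled peer still does.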

Mihael

On Thu, 2011-09-22 at 12:12 -0700, Mihael Hategan wrote:
> Ah, yes. Sorry. I was looking at the wrong log.
> 
> On Thu, 2011-09-22 at 14:07 -0500, Ketan Maheshwari wrote:
> > Mihael,
> > 
> > 
> > The experiments and logs I sent you above are not from the SCEC
> > workflow. These are just the catsn scripts. The logs also don't show
> > anything related to an invalid path as such.
> > 
> > 
> > The var_str invalid path issue still persists, though, and I am trying
> > to debug it, but that is a completely different issue.
> > 
> > 
> > Regards,
> > Ketan
> > 
> > 
> > On Thu, Sep 22, 2011 at 1:57 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> >         What I see in the log is the error about the invalid path,
> >         which, as I mentioned before, is an issue of var_str seemingly
> >         being empty (you may want to trace its value, though, to
> >         confirm). I don't see anything about a stagein/out issue.
> >         
> >         Mihael
> >         
> >         
> >         On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> >         > Hi Mihael,
> >         >
> >         >
> >         > I tested this fix. It seems that the timeout issue for
> >         > large-ish data and throttle > ~30 persists. I am not sure if
> >         > this is a data staging timeout, though.
> >         >
> >         >
> >         > The setup that fails is as follows:
> >         >
> >         >
> >         > persistent coasters, resource = workers running on OSG
> >         > data size = 8MB, 100 data items
> >         > foreach throttle = jobthrottle = 40
> >         >
> >         >
> >         > The standard output intermittently shows some activity and
> >         > then goes back to no activity, without any progress on tasks.
> >         >
> >         >
> >         > Please find the log and stdouterr here:
> >         > http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err,
> >         > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
> >         >
> >         >
> >         > When I tested with small data, 1MB, 2MB, 4MB, it did work.
> >         > 4MB displayed a fat-tail behavior, though: ~94 tasks
> >         > completed steadily and quickly while the last 5-6 tasks took
> >         > disproportionately long. The throttle in these cases was
> >         > <= 30.
> >         >
> >         >
> >         >
> >         >
> >         > Regards,
> >         > Ketan
> >         >
> >         > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan
> >         > <hategan at mcs.anl.gov> wrote:
> >         >         Try now please (cog r3262).
> >         >
> >         >         On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari
> >         >         wrote:
> >         >
> >         >
> >         >         > Mihael,
> >         >         >
> >         >         >
> >         >         > I tried with the new worker.pl, running a 100-task,
> >         >         > 10MB-per-task run with throttle set at 100.
> >         >         >
> >         >         >
> >         >         > However, it seems to have failed with the same
> >         >         > symptoms of timeout error 521:
> >         >         >
> >         >         >
> >         >         > Caused by: null
> >         >         > Caused by:
> >         >         > org.globus.cog.abstraction.impl.common.execution.JobException:
> >         >         > Job failed with an exit code of 521
> >         >         > Progress:  time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53  Active:1  Failed:46
> >         >         > Progress:  time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53  Active:1  Failed:46
> >         >         > Exception in cat:
> >         >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> >         >         > Host: grid
> >         >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> >         >         > - - -
> >         >         >
> >         >         >
> >         >         > Caused by: null
> >         >         > Caused by:
> >         >         > org.globus.cog.abstraction.impl.common.execution.JobException:
> >         >         > Job failed with an exit code of 521
> >         >         > Progress:  time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52  Active:1  Failed:47
> >         >         > Exception in cat:
> >         >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> >         >         > Host: grid
> >         >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> >         >         >
> >         >         >
> >         >         > I had about 107 workers running at the time of these
> >         >         > failures.
> >         >         >
> >         >         >
> >         >         > I started seeing the failure messages after about 20
> >         >         > minutes into this run.
> >         >         >
> >         >         >
> >         >         > The logs are in
> >         >         > http://www.ci.uchicago.edu/~ketan/pack.tgz
> >         >         >
> >         >         >
> >         >         > Regards,
> >         >         > Ketan
> >         >         >
> >         >         >
> >         >         >
> >         >         > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan
> >         >         > <hategan at mcs.anl.gov> wrote:
> >         >         >         On Mon, 2011-09-12 at 11:58 -0500, Ketan
> >         >         >         Maheshwari wrote:
> >         >         >
> >         >         >         > After some discussion with Mike, our
> >         >         >         > conclusion from these runs was that the
> >         >         >         > parallel data transfers are causing
> >         >         >         > timeouts from worker.pl. Further, we were
> >         >         >         > undecided whether the timeout threshold is
> >         >         >         > set too aggressively, how the timeouts are
> >         >         >         > determined, and whether a change in that
> >         >         >         > value could resolve the issue.
> >         >         >
> >         >         >
> >         >         >         Something like that. Worker.pl would use the
> >         >         >         time when a file transfer started to
> >         >         >         determine timeouts. This is undesirable. The
> >         >         >         purpose of timeouts is to determine whether
> >         >         >         the other side has stopped properly
> >         >         >         following the flow of things. It follows
> >         >         >         that any kind of activity should reset the
> >         >         >         timeout... timer.
> >         >         >
> >         >         >         I updated the worker code to deal with the
> >         >         >         issue in a proper way. But now I need your
> >         >         >         help. This is Perl code, and it needs
> >         >         >         testing.
> >         >         >
> >         >         >         So can you re-run, first with some simple
> >         >         >         test that uses coaster staging (just to make
> >         >         >         sure I didn't mess something up), and then
> >         >         >         the version of your tests that was most
> >         >         >         likely to fail?
> >         >         >
> >         >         >
> >         >         >
> >         >         >
> >         >         >
> >         >         > --
> >         >         > Ketan
> >         >         >
> >         >         >
> >         >         >
> >         >
> >         >
> >         >
> >         >
> >         >
> >         >
> >         >
> >         > --
> >         > Ketan
> >         >
> >         >
> >         >
> >         
> >         
> >         
> > 
> > 
> > 
> > 
> > -- 
> > Ketan
> > 
> > 
> > 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
