[Swift-devel] persistent coasters and data staging

Thu Sep 22 14:07:53 CDT 2011

Mihael,

The experiments and logs I sent you above are not from the SCEC workflow.
These are just the catsn scripts. The logs also don't show anything
related to an invalid path as such.

The var_str invalid path issue still persists, though, and I am trying to
debug it, but that is a completely separate problem.

Regards,
Ketan

On Thu, Sep 22, 2011 at 1:57 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> What I see in the log is the error about the invalid path, which, as I
> mentioned before, is an issue of var_str seemingly being empty (you may
> want to trace its value though to confirm). I don't see anything about a
> stagein/out issue.
>
> Mihael
>
> On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> > Hi Mihael,
> >
> >
> > I tested this fix. It seems that the timeout issue for large-ish data
> > and throttle > ~30 persists. I am not sure whether this is a data
> > staging timeout, though.
> >
> >
> > The setup that fails is as follows:
> >
> >
> > persistent coasters, resource= workers running on OSG
> > data size=8MB, 100 data items.
> > foreach throttle=40=jobthrottle.
> >
> >
> > The standard output intermittently shows some activity and then
> > goes quiet again, without any progress on tasks.
> >
> >
> > Please find the log and stdouterr here:
> > http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err,
> > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
> >
> >
> > When I tested with small data (1MB, 2MB, 4MB), it did work. The 4MB
> > run displayed fat-tail behavior, though: ~94 tasks completed steadily
> > and quickly, while the last 5-6 tasks took disproportionately long.
> > The throttle in these cases was <= 30.
> >
> >
> >
> >
> > Regards,
> > Ketan
> >
> > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> >         Try now please (cog r3262).
> >
> >         On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
> >
> >
> >         > Mihael,
> >         >
> >         >
> >         > I tried with the new worker.pl, running 100 tasks at 10MB per
> >         > task with the throttle set to 100.
> >         >
> >         >
> >         > However, it seems to have failed with the same symptoms of
> >         > timeout error 521:
> >         >
> >         >
> >         > Caused by: null
> >         > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> >         > Progress:  time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53  Active:1  Failed:46
> >         > Progress:  time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53  Active:1  Failed:46
> >         > Exception in cat:
> >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> >         > Host: grid
> >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> >         > - - -
> >         >
> >         >
> >         > Caused by: null
> >         > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> >         > Progress:  time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52  Active:1  Failed:47
> >         > Exception in cat:
> >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> >         > Host: grid
> >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> >         >
> >         >
> >         > I had about 107 workers running at the time of these failures.
> >         >
> >         >
> >         > I started seeing the failure messages about 20 minutes into
> >         > this run.
> >         >
> >         >
> >         > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> >         >
> >         >
> >         > Regards,
> >         > Ketan
> >         >
> >         >
> >         >
> >         > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >         >         On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> >         >
> >         >         > After some discussion with Mike, our conclusion from
> >         >         > these runs was that the parallel data transfers are
> >         >         > causing timeouts in worker.pl. Further, we were
> >         >         > undecided on whether the timeout threshold is set too
> >         >         > aggressively, how it is determined, and whether
> >         >         > changing that value could resolve the issue.
> >         >
> >         >
> >         >         Something like that. Worker.pl would use the time when a
> >         >         file transfer started to determine timeouts. This is
> >         >         undesirable. The purpose of timeouts is to determine
> >         >         whether the other side has stopped properly following
> >         >         the flow of things. It follows that any kind of activity
> >         >         should reset the timeout... timer.
> >         >
> >         >         I updated the worker code to deal with the issue in a
> >         >         proper way. But now I need your help. This is Perl code,
> >         >         and it needs testing.
> >         >
> >         >         So can you re-run, first with some simple test that uses
> >         >         coaster staging (just to make sure I didn't mess
> >         >         something up), and then the version of your tests that
> >         >         was most likely to fail?
> >         >
> >         >
> >         >
> >         >
> >         >
> >         > --
> >         > Ketan
> >         >
> >         >
> >         >
> >
> >
> >
> >
> >
> >
> >
> > --
> > Ketan
> >
> >
> >
>
>
>

-- 
Ketan
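
The timeout behaviour Mihael describes in the quoted thread (any activity
resets the timer, instead of measuring from the moment the file transfer
started) can be sketched roughly as below. This is only an illustrative
Python sketch, not the actual worker.pl code, which is Perl and is not shown
in this thread. The names ActivityTimeout, touch, expired and
transfer_survives are hypothetical, and the chunk arrival times and the
25-second limit are made-up values.

```python
import time


class ActivityTimeout:
    """Idle timeout: expires only after `limit` seconds without activity."""

    def __init__(self, limit, clock=time.monotonic):
        self.limit = limit
        self.clock = clock              # injectable so the sketch can be simulated
        self.last_activity = clock()

    def touch(self):
        # Call on any sign of life (e.g. a received chunk) to restart the timer.
        self.last_activity = self.clock()

    def expired(self):
        return self.clock() - self.last_activity > self.limit


def transfer_survives(chunk_times, limit, reset_on_activity):
    """Simulate a transfer whose chunks arrive at the given times (seconds).

    Returns True if the timeout never fires before the last chunk arrives.
    With reset_on_activity=False the timer effectively measures from the
    start of the transfer, which is the behaviour described as undesirable.
    """
    now = [0.0]                          # simulated clock, hypothetical
    timer = ActivityTimeout(limit, clock=lambda: now[0])
    for t in chunk_times:
        now[0] = t
        if timer.expired():
            return False
        if reset_on_activity:
            timer.touch()
    return True
```

With chunks arriving every 10 seconds up to 50 seconds and a 25-second limit,
the start-based timer gives up on a transfer that is slow but still moving,
while the activity-based timer lets it finish and still catches a transfer
that stalls for longer than the limit.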