[Swift-devel] persistent coasters and data staging
Mihael Hategan
hategan at mcs.anl.gov
Thu Sep 22 14:12:10 CDT 2011
Ah, yes. Sorry. I was looking at the wrong log.
On Thu, 2011-09-22 at 14:07 -0500, Ketan Maheshwari wrote:
> Mihael,
>
>
> The experiments and logs I sent you above are not from the SCEC
> workflow. These are just the catsn scripts. The logs also don't show
> anything related to an invalid path as such.
>
>
> The var_str invalid path issue still persists though and I am trying
> to debug it, but that is a completely different one.
>
>
> Regards,
> Ketan
>
>
> On Thu, Sep 22, 2011 at 1:57 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
> What I see in the log is the error about the invalid path, which, as I
> mentioned before, is an issue of var_str seemingly being empty (you may
> want to trace its value though to confirm). I don't see anything about a
> stagein/out issue.
>
> Mihael
>
>
> On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> > Hi Mihael,
> >
> >
> > I tested this fix. It seems that the timeout issue for large-ish data
> > and throttle > ~30 persists. I am not sure if this is data staging
> > timeout though.
> >
> >
> > The setup that fails is as follows:
> >
> >
> > persistent coasters, resource = workers running on OSG
> > data size = 8MB, 100 data items
> > foreach throttle = jobthrottle = 40
> >
> >
> > The standard output intermittently shows some activity and then goes
> > back to no activity, without any progress on tasks.
> >
> >
> > Please find the log and stdouterr here:
> > http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err,
> > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
> >
> >
> > When I tested with small data (1MB, 2MB, 4MB), it did work. The 4MB
> > runs displayed fat-tail behavior though: ~94 tasks completed steadily
> > and quickly, while the last 5-6 tasks took disproportionately long.
> > The throttle in these cases was <= 30.
> >
> >
> >
> >
> > Regards,
> > Ketan
> >
> > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > Try now please (cog r3262).
> >
> > On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
> >
> >
> > > Mihael,
> > >
> > >
> > > I tried with the new worker.pl, running a 100-task run at 10MB per
> > > task with throttle set at 100.
> > >
> > >
> > > However, it seems to have failed with the same symptoms of timeout
> > > error 521:
> > >
> > >
> > > Caused by: null
> > > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> > > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 Submitted:53 Active:1 Failed:46
> > > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 Submitted:53 Active:1 Failed:46
> > > Exception in cat:
> > > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> > > Host: grid
> > > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> > > - - -
> > >
> > >
> > > Caused by: null
> > > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> > > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 Submitted:52 Active:1 Failed:47
> > > Exception in cat:
> > > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> > > Host: grid
> > > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> > >
> > >
> > > I had about 107 workers running at the time of these failures.
> > >
> > >
> > > I started seeing the failure messages about 20 minutes into this run.
> > >
> > >
> > > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> > >
> > >
> > > Regards,
> > > Ketan
> > >
> > >
> > >
> > > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> > >
> > > > After some discussion with Mike, our conclusion from these runs was
> > > > that the parallel data transfers are causing timeouts from
> > > > worker.pl. Further, we were undecided on whether the timeout
> > > > threshold is set too aggressively, how it is determined, and
> > > > whether a change in that value could resolve the issue.
> > >
> > >
> > > Something like that. Worker.pl would use the time when a file transfer
> > > started to determine timeouts. This is undesirable. The purpose of
> > > timeouts is to determine whether the other side has stopped properly
> > > following the flow of things. It follows that any kind of activity
> > > should reset the timeout... timer.
> > >
> > > I updated the worker code to deal with the issue in a proper way. But
> > > now I need your help. This is Perl code, and it needs testing.
> > >
> > > So can you re-run, first with some simple test that uses coaster staging
> > > (just to make sure I didn't mess something up), and then the version of
> > > your tests that was most likely to fail?
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Ketan
> > >
> > >
> > >
> >
> >
> >
> >
> >
> >
> >
> > --
> > Ketan
> >
> >
> >
>
>
>
>
>
>
>
> --
> Ketan
>
>
>
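For reference, the timeout idea from the quoted 12 Sep message (measure from the last activity, not from the start of the transfer) can be sketched roughly as below. This is not the worker.pl code, which is Perl and is not shown in this thread. It is a minimal Python illustration, and the names InactivityTimeout, FakeClock and simulate_transfer, as well as the 60-second limit and the chunk timings, are made up for the example.

```python
import time


class InactivityTimeout:
    """Hypothetical timer: expires only after `limit` seconds with no activity."""

    def __init__(self, limit_seconds, clock=time.monotonic):
        self.limit = limit_seconds
        self.clock = clock
        self.last_activity = clock()

    def touch(self):
        """Record activity, e.g. a chunk of file data arrived."""
        self.last_activity = self.clock()

    def expired(self):
        return self.clock() - self.last_activity > self.limit


class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def simulate_transfer(chunk_gaps, limit_seconds, reset_on_activity):
    """Return True if every chunk arrives before the timer expires.

    chunk_gaps: seconds that pass before each chunk arrives.
    reset_on_activity=False mimics timing from the start of the transfer.
    """
    clock = FakeClock()
    timer = InactivityTimeout(limit_seconds, clock)
    for gap in chunk_gaps:
        clock.advance(gap)
        if timer.expired():
            return False
        if reset_on_activity:
            timer.touch()
    return True


# A slow but steadily progressing transfer: one chunk every 30 s, limit 60 s.
slow_but_alive = [30, 30, 30, 30, 30]
print(simulate_transfer(slow_but_alive, 60, reset_on_activity=False))
print(simulate_transfer(slow_but_alive, 60, reset_on_activity=True))
```

With the timer measured from the start, the slow transfer is reported as timed out even though data keeps arriving. With the timer reset on each chunk, only a genuine stall (a gap longer than the limit) trips it.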