[Swift-devel] persistent coasters and data staging

Mihael Hategan hategan at mcs.anl.gov
Thu Sep 22 14:12:10 CDT 2011


Ah, yes. Sorry. I was looking at the wrong log.

On Thu, 2011-09-22 at 14:07 -0500, Ketan Maheshwari wrote:
> Mihael,
> 
> 
> The experiments and logs I sent you above are not from the SCEC
> workflow; these are just the catsn scripts. The logs also don't show
> anything related to an invalid path.
> 
> 
> The var_str invalid-path issue still persists and I am trying to
> debug it, but that is a completely separate problem.
> 
> 
> Regards,
> Ketan
> 
> 
> On Thu, Sep 22, 2011 at 1:57 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>         What I see in the log is the error about the invalid path, which,
>         as I mentioned before, is an issue of var_str seemingly being empty
>         (you may want to trace its value to confirm). I don't see anything
>         about a stage-in/out issue.
>         
>         Mihael
>         
>         
>         On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
>         > Hi Mihael,
>         >
>         >
>         > I tested this fix. It seems that the timeout issue for large-ish
>         > data and throttle > ~30 persists. I am not sure whether this is a
>         > data-staging timeout, though.
>         >
>         >
>         > The setup that fails is as follows:
>         >
>         > persistent coasters; resource = workers running on OSG
>         > data size = 8 MB, 100 data items
>         > foreach throttle = 40 = jobthrottle (i.e., both set to 40; see the
>         > sketch below)
>         >
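>         > For reference, a minimal sketch of how these throttle settings
>         > would typically be expressed, assuming the usual swift.properties
>         > and sites.xml conventions (the values here are illustrative, not
>         > copied from my actual configuration):
>         >
>         >     # swift.properties: cap the number of concurrent foreach iterations
>         >     foreach.max.threads=40
>         >
>         >     # sites.xml: allows jobThrottle*100+1 concurrent jobs, so 0.39 -> 40
>         >     <profile namespace="karajan" key="jobThrottle">0.39</profile>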
>         >
>         > The standard output intermittently shows some activity and then
>         > goes back to no activity, without any progress on tasks.
>         >
>         >
>         > Please find the log and stdout/stderr here:
>         > http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err
>         > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log
>         >
>         >
>         > When I tested with small data (1 MB, 2 MB, 4 MB), it did work.
>         > The 4 MB case displayed fat-tail behavior, though: ~94 tasks
>         > completed steadily and quickly while the last 5-6 tasks took
>         > disproportionately long. The throttle in these cases was <= 30.
>         >
>         > Regards,
>         > Ketan
>         >
>         > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>         >         Try now please (cog r3262).
>         >
>         >         On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
>         >
>         >
>         >         > Mihael,
>         >         >
>         >         >
>         >         > I tried with the new worker.pl, running a 100-task run
>         >         > with 10 MB per task and the throttle set to 100.
>         >         >
>         >         >
>         >         > However, it seems to have failed with the same symptoms
>         >         > of timeout error 521:
>         >         >
>         >         > Caused by: null
>         >         > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
>         >         > Progress:  time: Mon, 12 Sep 2011 15:45:31 -0500  Submitted:53  Active:1  Failed:46
>         >         > Progress:  time: Mon, 12 Sep 2011 15:45:34 -0500  Submitted:53  Active:1  Failed:46
>         >         > Exception in cat:
>         >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
>         >         > Host: grid
>         >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
>         >         > - - -
>         >         >
>         >         > Caused by: null
>         >         > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
>         >         > Progress:  time: Mon, 12 Sep 2011 15:45:45 -0500  Submitted:52  Active:1  Failed:47
>         >         > Exception in cat:
>         >         > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
>         >         > Host: grid
>         >         > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
>         >         >
>         >         >
>         >         > I had about 107 workers running at the time of these
>         >         > failures.
>         >         >
>         >         > I started seeing the failure messages about 20 minutes
>         >         > into this run.
>         >         >
>         >         >
>         >         > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
>         >         >
>         >         >
>         >         > Regards,
>         >         > Ketan
>         >         >
>         >         >
>         >         >
>         >         > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>         >         >         On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
>         >         >
>         >         >         > After some discussion with Mike, our conclusion from
>         >         >         > these runs was that the parallel data transfers are
>         >         >         > causing timeouts in worker.pl. Further, we were unsure
>         >         >         > whether the timeout threshold is set too aggressively,
>         >         >         > how the timeouts are determined, and whether a change
>         >         >         > in that value could resolve the issue.
>         >         >
>         >         >
>         >         >         Something like that. Worker.pl would use the time when
>         >         >         a file transfer started to determine timeouts. This is
>         >         >         undesirable. The purpose of timeouts is to determine
>         >         >         whether the other side has stopped properly following
>         >         >         the flow of things. It follows that any kind of
>         >         >         activity should reset the timeout... timer.
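>         >         >
>         >         >         A minimal sketch of that idea in Perl (not the actual
>         >         >         worker.pl code; the variable and sub names here are
>         >         >         illustrative):
>         >         >
>         >         >             # Track the time of the most recent activity,
>         >         >             # instead of the time a transfer started.
>         >         >             my $TIMEOUT = 60;           # allowed seconds of inactivity
>         >         >             my $last_activity = time();
>         >         >
>         >         >             # Call this from every read/write on the channel.
>         >         >             sub touch_activity {
>         >         >                 $last_activity = time();
>         >         >             }
>         >         >
>         >         >             # Time out only if *nothing* has happened for $TIMEOUT.
>         >         >             sub check_timeout {
>         >         >                 if (time() - $last_activity > $TIMEOUT) {
>         >         >                     die "peer inactive for more than $TIMEOUT seconds\n";
>         >         >                 }
>         >         >             }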
>         >         >
>         >         >         I updated the worker code to deal with the issue in a
>         >         >         proper way. But now I need your help. This is Perl
>         >         >         code, and it needs testing.
>         >         >
>         >         >         So can you re-run, first with some simple test that
>         >         >         uses coaster staging (just to make sure I didn't mess
>         >         >         something up), and then the version of your tests that
>         >         >         was most likely to fail?
>         >         >
>         >         > --
>         >         > Ketan
>         >         >
>         > --
>         > Ketan
>
> -- 
> Ketan