[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG

Mihael Hategan hategan at mcs.anl.gov
Fri Apr 15 14:22:00 CDT 2011


Sadly, the gridftp client does not use NIO.

But besides that, I don't know what the correct solution should be.
Perhaps there should be a limit on the number of transfer threads to any
single site, and then the global transfer throttle could be larger.
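
Roughly what I have in mind (an untested sketch; the class and method
names are illustrative, not the actual provider code): a per-site cap
layered under a larger global throttle, e.g.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

// Sketch only: cap concurrent transfers per site while allowing a
// larger global transfer throttle.
public class TransferThrottle {
    private final Semaphore global;      // e.g. 16 or 32 overall
    private final int perSiteLimit;      // e.g. 2-4 per site
    private final ConcurrentMap<String, Semaphore> perSite =
        new ConcurrentHashMap<String, Semaphore>();

    public TransferThrottle(int globalLimit, int perSiteLimit) {
        this.global = new Semaphore(globalLimit);
        this.perSiteLimit = perSiteLimit;
    }

    private Semaphore siteSem(String site) {
        Semaphore s = perSite.get(site);
        if (s == null) {
            perSite.putIfAbsent(site, new Semaphore(perSiteLimit));
            s = perSite.get(site);
        }
        return s;
    }

    // Block until both a global and a per-site slot are free.
    public void acquire(String site) throws InterruptedException {
        global.acquire();
        boolean ok = false;
        try {
            siteSem(site).acquire();
            ok = true;
        } finally {
            if (!ok) {
                global.release();   // don't leak the global slot
            }
        }
    }

    // Release both slots when the transfer finishes or is aborted.
    public void release(String site) {
        siteSem(site).release();
        global.release();
    }
}

That way a couple of wedged servers can only pin down their own slots
instead of the whole transfer pool.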

On Fri, 2011-04-15 at 08:31 -0500, Michael Wilde wrote:
> I proposed the following in bugzilla (Dan, are you getting these? If so I won't forward any more and will assume that, when interested, you'll read the bugzilla discussions...)
> 
> ----- Forwarded Message -----
> From: bugzilla-daemon at mcs.anl.gov
> To: wilde at mcs.anl.gov
> Sent: Friday, April 15, 2011 8:28:36 AM
> Subject: [Bug 357] Script hangs in staging on OSG
> 
> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=357
> 
> --- Comment #6 from Michael Wilde <wilde at mcs.anl.gov>  2011-04-15 08:28:35 ---
> This problem may be explained by the following:
> 
> - each site requires some large number of files transferred in (60+) and some
> small number out (<4?)
> 
> - some sites may hang on transfers, especially small and/or overloaded sites
> 
> - we have only 4 transfer threads here
> 
> - if all the transfer threads are hung on requests (e.g. socket operations) that
> hang, then all Swift data transfer after that point hangs.  Ideally these
> operations should be run with a timer that lets the operation be aborted
> and the transfer thread returned to the pool (a rough sketch of this follows
> below, after the proposed solutions).  Even better, all socket operations
> should be select-driven and non-blocking. (I thought they were..)
> 
> - Theory: one or more small overloaded sites - e.g. UMiss in the example of the
> first log filed in this ticket - are hanging all the transfer threads
> 
> ==> Proposed temporary solution: (a) use more transfer threads: 16 or 32?; (b)
> possibly batch up the small files into a single tarball so that we use fewer
> threads per site and thus hung sites tie up fewer threads; (c) avoid sites where
> we are seeing hangs; (d) create a script to analyze a current run's log and
> spot any hanging IO requests, identifying the files and sites involved, and use
> this to spot and remove hanging sites; (e) Mihael to improve Swift's
> robustness in this area by timing out hung requests and causing the
> appropriate higher level of recovery to kick in.
> 
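> To illustrate the timeout idea in the point above and in (e): even without
> moving everything to NIO, a plain read timeout on the socket turns a hung
> read into an exception that the transfer code can catch, so the thread goes
> back into the pool instead of blocking forever. A rough sketch (illustrative
> only, not the actual GridFTP client code):
> 
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.OutputStream;
> import java.net.InetSocketAddress;
> import java.net.Socket;
> import java.net.SocketTimeoutException;
> 
> public class TimedRead {
>     // Copy a socket's input to 'out'; a stalled read aborts instead of
>     // hanging the transfer thread forever.
>     public static void copyWithTimeout(String host, int port, OutputStream out)
>             throws IOException {
>         Socket s = new Socket();
>         try {
>             s.connect(new InetSocketAddress(host, port), 60 * 1000); // 60 s to connect
>             s.setSoTimeout(5 * 60 * 1000);   // give up if no data for 5 minutes
>             InputStream in = s.getInputStream();
>             byte[] buf = new byte[64 * 1024];
>             int n;
>             while ((n = in.read(buf)) != -1) {   // each successful read resets the timer
>                 out.write(buf, 0, n);
>             }
>         } catch (SocketTimeoutException e) {
>             // Transfer stalled: hand the thread back and let the higher-level
>             // retry/recovery logic decide what to do with this file/site.
>             throw new IOException("transfer stalled from " + host);
>         } finally {
>             s.close();
>         }
>     }
> }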
> 
> ====
> 
> Some messages from the related email thread on this bug are pasted below:
> 
> ----- Forwarded Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Allan Espinosa" <aespinosa at cs.uchicago.edu>, "Daniel Katz"
> <dsk at ci.uchicago.edu>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Thursday, April 14, 2011 8:51:16 PM
> Subject: Re: Please review and advise on: Bug 357 - Script hangs in staging on
> OSG
> 
> Well, that barely qualifies as hung, unless the gridftp servers themselves
> are hung, which they may be.
> 
> I would suggest upping the transfer throttle in this case, maybe to 16;
> 4 may be cutting it too close.
> 
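> That is a one-line change in the run's Swift properties (the exact file
> depends on how the run is configured), something like:
> 
> # was throttle.transfers=4
> throttle.transfers=16
> 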
> On Thu, 2011-04-14 at 19:45 -0500, Michael Wilde wrote:
> > So you have 4 transfer threads and all 4 are waiting here:
> > 
> >     at java.net.SocketInputStream.socketRead0(Native Method)
> >     at java.net.SocketInputStream.read(SocketInputStream.java:129)
> > 
> > (from throttle.transfers=4)
> > 
> > Is this workflow hung, and if so, how are you determining that?  Do you have another log plot of stagein and stageout?
> > 
> > - Mike
> > 
> > 
> 
> ----- Forwarded Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Allan Espinosa" <aespinosa at cs.uchicago.edu>, "Daniel Katz"
> <dsk at ci.uchicago.edu>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Thursday, April 14, 2011 8:53:10 PM
> Subject: Re: Please review and advise on: Bug 357 - Script hangs in staging on
> OSG
> 
> On Thu, 2011-04-14 at 20:11 -0500, Michael Wilde wrote:
> > Allan, what I meant was: do you have any evidence that this current run is hung (either in a similar manner to the one we looked at closely this morning, or in a different manner)?
> > 
> > In this morning's log, you could tell from plots of stagein and stageout events that many of these events were not completing after something triggered an error.
> > 
> > Do you have similar plots or evidence of hangs regarding this run and its log?
> > 
> > I don't know from browsing the traces if one would *naturally* expect
> > the transfer threads to all be waiting on input sockets most of the
> > time, or if seeing all 4 threads waiting on sockets is indicative of
> > data transfer being totally hung.
> 
> If nothing else happens in the log, then probably so. But the same could
> happen for very large files (or very slow servers).
> 
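> One way to tell the two apart is to time out on inactivity rather than on
> total duration: a transfer that is merely slow or large keeps making
> progress, while a hung one stops. Roughly (sketch only; the real code would
> hook this into the transfer loop):
> 
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicLong;
> 
> // Sketch: abort a transfer only when it stops making progress, so big
> // files on slow servers are left alone.
> public class StallDetector {
>     private final AtomicLong bytesSoFar = new AtomicLong();
>     private long lastCount = -1;   // only touched by the watchdog thread
> 
>     // The transfer loop calls this after every successful read/write.
>     public void progressed(int n) {
>         bytesSoFar.addAndGet(n);
>     }
> 
>     // Check every 'intervalSeconds'; run 'abort' if the counter has not moved.
>     public void watch(final Runnable abort, long intervalSeconds) {
>         final ScheduledExecutorService timer =
>             Executors.newSingleThreadScheduledExecutor();
>         timer.scheduleAtFixedRate(new Runnable() {
>             public void run() {
>                 long now = bytesSoFar.get();
>                 if (now == lastCount) {
>                     abort.run();        // e.g. close the socket to unblock the read
>                     timer.shutdown();
>                 } else {
>                     lastCount = now;
>                 }
>             }
>         }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
>     }
> }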
> 
> 
> -- 
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 





More information about the Swift-devel mailing list