[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG

Fri Apr 15 08:31:31 CDT 2011

I proposed the following in bugzilla (Dan, are you getting these? If so, I won't forward any more and will assume that, when interested, you'll read the bugzilla discussions...)

----- Forwarded Message -----
From: bugzilla-daemon at mcs.anl.gov
To: wilde at mcs.anl.gov
Sent: Friday, April 15, 2011 8:28:36 AM
Subject: [Bug 357] Script hangs in staging on OSG

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=357

--- Comment #6 from Michael Wilde <wilde at mcs.anl.gov>  2011-04-15 08:28:35 ---
This problem may be explained by the following:

- each site requires a large number of files transferred in (60+) and a
small number out (<4?)

- some sites may hang on transfers, especially small and/or overloaded sites

- we have only 4 transfer threads here

- if all the transfer threads are blocked on requests (e.g. socket operations)
that hang, then all Swift data transfer after that point hangs.  Ideally these
operations should be run with a timer that enables the operation to be aborted
and the transfer thread returned to use.  Even better, all socket operations
should be select-driven and non-blocking. (I thought they were..)

- Theory: one or more small overloaded sites - e.g. UMiss in the example of the
first log filed in this ticket - are hanging all the transfer threads
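
The timer idea in the fourth point above can be sketched roughly as follows. This is only an illustration in Python, not Swift's actual Java transfer code: a plain TCP socket stands in for a gridftp connection, and the names start_silent_server and read_with_timeout are hypothetical. It shows a blocking read that gives up after a deadline, so the calling thread is returned to use instead of waiting forever.

```python
import socket
import threading


def start_silent_server():
    """Start a local server that accepts connections but never sends data.

    Hypothetical stand-in for a hung remote transfer server.
    Returns (listener, port).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
    listener.listen(8)
    held = []                          # keep accepted sockets open and silent

    def accept_loop():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:            # listener was closed
                return
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener, listener.getsockname()[1]


def read_with_timeout(host, port, timeout_s, nbytes=4096):
    """Read up to nbytes from host:port; return None if the read times out."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.settimeout(timeout_s)     # bounds the recv() call below
        try:
            return sock.recv(nbytes)
        except socket.timeout:
            return None                # caller can retry or flag the site


if __name__ == "__main__":
    _listener, port = start_silent_server()
    # The server never replies, so the read is abandoned after 0.5 seconds.
    print(read_with_timeout("127.0.0.1", port, 0.5))
```

Without the settimeout() call the recv() would block indefinitely, which is the situation the thread dumps in this ticket appear to show. What the caller does after a timeout (retry, pick another site, fail the job) is a separate policy decision that this sketch leaves open.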

==> Proposed temporary solution: (a) use more transfer threads: 16 or 32?; (b)
possibly batch up the small files into a single tarball so that we use fewer
threads per site and thus hung sites hang fewer threads; (c) avoid sites where
we are seeing hangs; (d) create a script to analyze a current run's log and
spot any hanging IO requests, identifying the files and sites involved. Use
this to spot and remove hanging sites; (e) Mihael to improve Swift's
robustness in this area by timing out hung requests and causing the
appropriate higher level of recovery to kick in.
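
As a rough illustration of option (b), the following Python sketch packs several small input files into one gzip-compressed tarball, so that a site needs one transfer instead of many. The function names (bundle_files, list_bundle, demo) and the file names are hypothetical; how such a bundle would be unpacked on the worker side and wired into a Swift script is not shown and would need to be worked out separately.

```python
import os
import tarfile
import tempfile


def bundle_files(paths, tar_path):
    """Pack the given files into a single gzip-compressed tarball."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for path in paths:
            # Store only the base name so the archive unpacks flat.
            tar.add(path, arcname=os.path.basename(path))


def list_bundle(tar_path):
    """Return the sorted member names stored in the tarball."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(tar.getnames())


def demo():
    """Create three small files, bundle them, and list the bundle."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = []
        for i in range(3):
            path = os.path.join(workdir, "input_%d.dat" % i)
            with open(path, "w") as f:
                f.write("payload %d\n" % i)
            paths.append(path)
        tar_path = os.path.join(workdir, "inputs.tar.gz")
        bundle_files(paths, tar_path)
        return list_bundle(tar_path)


if __name__ == "__main__":
    print(demo())
```

Because only base names are stored, two inputs with the same file name in different directories would collide in the archive; a real version would need to keep enough of the path to tell them apart.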

====

Some messages from the related email thread on this bug are pasted below:

----- Forwarded Message -----
From: "Mihael Hategan" <hategan at mcs.anl.gov>
To: "Michael Wilde" <wilde at mcs.anl.gov>
Cc: "Allan Espinosa" <aespinosa at cs.uchicago.edu>, "Daniel Katz"
<dsk at ci.uchicago.edu>, "Swift Devel" <swift-devel at ci.uchicago.edu>
Sent: Thursday, April 14, 2011 8:51:16 PM
Subject: Re: Please review and advise on: Bug 357 - Script hangs in staging on
OSG

Well, that's barely hung unless the gridftp servers are hung, which may
be.

I would suggest upping the transfer throttle in this case. 4 may be
cutting it too close. Maybe to 16.

On Thu, 2011-04-14 at 19:45 -0500, Michael Wilde wrote:
> So you have 4 transfer threads and all 4 are waiting here:
> 
> at java.net.SocketInputStream.socketRead0(Native Method)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:129)
> 
> (from throttle.transfers=4)
> 
> Is this workflow hung, and if so, how are you determining that?  Do you have another log plot of stagein and out?
> 
> - Mike
> 
> 

----- Forwarded Message -----
From: "Mihael Hategan" <hategan at mcs.anl.gov>
To: "Michael Wilde" <wilde at mcs.anl.gov>
Cc: "Allan Espinosa" <aespinosa at cs.uchicago.edu>, "Daniel Katz"
<dsk at ci.uchicago.edu>, "Swift Devel" <swift-devel at ci.uchicago.edu>
Sent: Thursday, April 14, 2011 8:53:10 PM
Subject: Re: Please review and advise on: Bug 357 - Script hangs in staging on
OSG

On Thu, 2011-04-14 at 20:11 -0500, Michael Wilde wrote:
> Allan, what I meant was: do you have any evidence that this current run is hung (either in a similar manner to the one we looked at closely this morning, or in a different manner)?
> 
> In this morning's log, you could tell from plots of stagein and stageout events that many of these events were not completing after something triggered an error.
> 
> Do you have similar plots or evidence of hangs regarding this run and its log?
> 
> I don't know from browsing the traces if one would *naturally* expect
> the transfer threads to all be waiting on input sockets most of the
> time, or if seeing all 4 threads waiting on sockets is indicative of
> data transfer being totally hung.

If nothing else happens in the log, then probably so. But the same could
happen for very large files (or very slow servers).

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory