[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG

Michael Wilde wilde at mcs.anl.gov
Thu Apr 14 20:20:10 CDT 2011


In other words, does it show the behavior of the attached plot?

- Mike
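
A rough way to check the question raised further down the thread (are the transfer threads sitting in socketRead0 in every jstack dump, or only in some?) is to compare the top stack frame of each thread across two dumps taken a few minutes apart. The sketch below is only an illustration: it assumes the usual jstack text layout (a quoted thread name on the header line, followed by "at ..." frames and a blank line between threads), and the helper names are made up for this example. A thread that shows up in both dumps may still be making slow progress, so this narrows the search rather than proving a hang.

```python
import re

# Frame reported in the traces quoted below.
SOCKET_READ = "java.net.SocketInputStream.socketRead0"


def top_frames(jstack_text):
    """Map thread name -> first 'at ...' frame found in a jstack dump.

    Assumes each thread starts with a header line beginning with the
    quoted thread name and that threads are separated by blank lines.
    """
    frames = {}
    current = None
    for line in jstack_text.splitlines():
        header = re.match(r'^"([^"]+)"', line)
        if header:
            current = header.group(1)
            continue
        stripped = line.strip()
        if not stripped:
            current = None  # blank line ends the current thread block
            continue
        if current is not None and current not in frames and stripped.startswith("at "):
            frames[current] = stripped[3:]
    return frames


def blocked_on_socket_read(jstack_text):
    """Names of threads whose top frame is the native socket read."""
    return {
        name
        for name, frame in top_frames(jstack_text).items()
        if frame.startswith(SOCKET_READ)
    }


def stuck_in_both(earlier_dump, later_dump):
    """Threads sitting in socketRead0 in both dumps (candidates only)."""
    return blocked_on_socket_read(earlier_dump) & blocked_on_socket_read(later_dump)
```

If all 4 transfer threads come back from stuck_in_both() for dumps several minutes apart, and the stagein/stageout plots show no completions in that window, that would support the "hung" reading; if the set changes between dumps, the transfers are probably just slow.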


----- Original Message -----
> Allan, what I meant was: do you have any evidence that this current
> run is hung (either in a similar manner to the one we looked at
> closely this morning, or in a different manner)?
> 
> In this morning's log, you could tell from plots of stagein and
> stageout events that many of these events were not completing after
> something triggered an error.
> 
> Do you have similar plots or evidence of hangs regarding this run and
> its log?
> 
> I don't know from browsing the traces whether one would *naturally*
> expect the transfer threads to all be waiting on input sockets most of
> the time, or whether seeing all 4 threads waiting on sockets indicates
> that data transfer is totally hung.
> 
> Mihael, I assume you can tell much more from these traces?
> 
> - Mike
> 
> 
> ----- Original Message -----
> > Right now the log only shows messages about
> > AbstractKarajanStreamChannel. I set the org.globus.ftp package's
> > logging level to DEBUG, so entries should show up if any
> > transfers are being made.
> >
> > -Allan
> >
> > 2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
> > > So you have 4 transfer threads and all 4 are waiting here:
> > >
> > >        at java.net.SocketInputStream.socketRead0(Native Method)
> > >        at java.net.SocketInputStream.read(SocketInputStream.java:129)
> > >
> > > (from throttle.transfers=4)
> > >
> > > Is this workflow hung, and if so, how are you determining that? Do
> > > you have another log plot of stagein and stageout?
> > >
> > > - Mike
> > >
> > >
> > > ----- Original Message -----
> > > >> Fresh traces (jstack and log) are in
> > > >> /home/aespinosa/workflows/cybershake/archive-runs/transfer-logging.
> > > >> The swift log is a snapshot of the workflow, which is still
> > > >> running.
> > >>
> > >> -Allan
> > >>
> > >> 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> > > >> > One immediate question that I have is what's up with the
> > > >> > "deadline passed" messages?
> > > >> >
> > > >> > That happens when jobs run for at least twice their advertised
> > > >> > walltime and for some reason the site doesn't seem to cancel
> > > >> > them. This may be indicative of notifications getting lost.
> > > >> >
> > > >> > As for the transfers, I don't see all transfers hanging after
> > > >> > that. I mean, there are transfers that complete OK. Things do
> > > >> > seem to slow down quite a bit, though, so that looks like a
> > > >> > problem.
> > > >> >
> > > >> > Let's see what's in the stack traces. In the meantime, I will
> > > >> > see what it takes to get transfer progress messages.
> > >> >
> > >> > Mihael
> > >> >
> > >> >
> > >> > On Thu, 2011-04-14 at 17:28 -0500, Michael Wilde wrote:
> > >> >> bri$ pwd
> > >> >> /home/aespinosa/workflows/cybershake/archive-runs/test
> > >> >> bri$ ls -lt
> > >> >> total 1844128
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 0 Apr 14 14:21 max-duration.tmp
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:20 start-time.tmp
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 1433206 Apr 14 14:20 stagein.event
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve2.tmp
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve.tmp
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:19 t.inf
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 2263727 Apr 14 12:51 stagein.transition
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 8998897 Apr 14 12:31 stagein.log
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 92059 Apr 14 12:05 dostageout.event
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 97442 Apr 14 11:51 dostagein.event
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 2998 Apr 13 17:38 dostagein.sorted-start.png
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 3080 Apr 13 17:38 dostageout.sorted-start.png
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 3255 Apr 8 16:05 execute2-total.png
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 1533974 Apr 8 14:46 postproc-20110407-1438-i90jepr3.0.rlog
> > > >> >> -rw-r--r-- 1 aespinosa ci-users 1868896768 Apr 8 14:46 postproc-20110407-1438-i90jepr3.log
> > > >> >> drwxr-xr-x 2 aespinosa ci-users 32768 Apr 7 14:39 postproc-20110407-1438-i90jepr3.d/
> > >> >> bri$
> > >> >>
> > > >> >> The directory is archive-runs, not archive-run.
> > > >> >>
> > > >> >> Also see bridled: /tmp/mw1
> > >> >>
> > >> >> ----- Original Message -----
> > > >> >> > [hategan at bridled tmp]$ cd ~aespinosa/workflows/cybershake/archive-run/test/
> > > >> >> > -bash: cd: /home/aespinosa/workflows/cybershake/archive-run/test/: No such file or directory
> > >> >> >
> > >> >> > On Thu, 2011-04-14 at 17:21 -0500, Allan Espinosa wrote:
> > >> >> > > ~aespinosa/workflows/cybershake/archive-run/test/postproc*.log
> > >> >> > >
> > >> >> > > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> > >> >> > > > On Thu, 2011-04-14 at 15:57 -0500, Michael Wilde wrote:
> > > >> >> > > >> While Allan continues to debug this, can you take a
> > > >> >> > > >> look at the (huge) log?
> > >> >> > > >
> > >> >> > > > Where is this log?
> > >> >> > > >
> > >> >> > > >
> > >
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory


-------------- next part --------------
A non-text attachment was scrubbed...
Name: dostagein.sorted-start.png
Type: image/png
Size: 2998 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110414/1d9b24af/attachment.png>
