[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG

Michael Wilde wilde at mcs.anl.gov
Thu Apr 14 20:11:18 CDT 2011


Allan, what I meant was: do you have any evidence that this current run is hung (either in a similar manner to the one we looked at closely this morning, or in a different manner)?

In this morning's log, you could tell from the plots of stagein and stageout events that many of these events were not completing after something triggered an error.

Do you have similar plots or evidence of hangs regarding this run and its log?

I don't know from browsing the traces whether one would *naturally* expect the transfer threads to all be waiting on input sockets most of the time, or whether seeing all 4 threads waiting on sockets indicates that data transfer is totally hung.
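
One way to narrow this down, sketched here under assumptions rather than as an established procedure: take two jstack dumps some time apart and list the threads whose topmost frame is java.net.SocketInputStream.socketRead0 in both. The function names below are hypothetical, and the parser assumes the usual jstack text layout (a quoted thread name on the header line, followed by "at ..." frame lines). A thread that shows up in both dumps is only suggestive of a hang, since a healthy transfer can also be caught mid-read twice.

```python
import re

# Frame the quoted trace shows the transfer threads waiting in.
SOCKET_READ = "java.net.SocketInputStream.socketRead0"


def blocked_threads(dump_text, frame=SOCKET_READ):
    """Return names of threads whose topmost stack frame starts with `frame`.

    Assumes standard jstack text output: each thread begins with a line
    like '"name" ...' and its frames follow as indented 'at ...' lines.
    """
    blocked = set()
    name = None         # thread currently being parsed
    seen_frame = False  # True once that thread's first 'at' line was seen
    for line in dump_text.splitlines():
        header = re.match(r'^"([^"]+)"', line)
        if header:
            name = header.group(1)
            seen_frame = False
            continue
        stripped = line.strip()
        if name and not seen_frame and stripped.startswith("at "):
            seen_frame = True  # only the top frame is inspected
            if stripped[3:].startswith(frame):
                blocked.add(name)
    return blocked


def stuck_in_both(first_dump, second_dump, frame=SOCKET_READ):
    """Threads sitting in `frame` in both dumps (candidates for a hang)."""
    return blocked_threads(first_dump, frame) & blocked_threads(second_dump, frame)
```

To use it, read the text of two dumps taken a few minutes apart and pass both strings to stuck_in_both(). If all 4 transfer threads come back, that is worth comparing against the stagein/stageout plots.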

Mihael, I assume you can tell much more from these traces?

- Mike


----- Original Message -----
> Right now the log only gives out messages about
> AbstractKarajanStreamChannel. I set the org.globus.ftp package's
> logging level to DEBUG, so entries should be reflected if there are
> transfers being made.
> 
> -Allan
> 
> 2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
> > So you have 4 transfer threads and all 4 are waiting here:
> >
> > at java.net.SocketInputStream.socketRead0(Native Method)
> > at java.net.SocketInputStream.read(SocketInputStream.java:129)
> >
> > (from throttle.transfers=4)
> >
> > Is this workflow hung, and if so, how are you determining that? Do
> > you have another log plot of stagein and out?
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> >> Fresh traces (jstack and log) in
> >> /home/aespinosa/workflows/cybershake/archive-runs/transfer-logging
> >> .
> >> The swift log is a snapshot of the workflow that is still running.
> >>
> >> -Allan
> >>
> >> 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> >> > One immediate question that I have is what's up with the deadline
> >> > passed messages?
> >> >
> >> > That happens when jobs run for at least twice their advertised walltime
> >> > and for some reason the site doesn't seem to cancel them. This may be
> >> > indicative of notifications getting lost.
> >> >
> >> > As for the transfers, I don't see all transfers hanging after that. I
> >> > mean there are transfers that complete ok. Though things do seem to slow
> >> > down quite a bit, so that looks like a problem.
> >> >
> >> > Let's see what's in the stack traces. In the meantime, I will see what it
> >> > takes to get transfer progress messages.
> >> >
> >> > Mihael
> >> >
> >> >
> >> > On Thu, 2011-04-14 at 17:28 -0500, Michael Wilde wrote:
> >> >> bri$ pwd
> >> >> /home/aespinosa/workflows/cybershake/archive-runs/test
> >> >> bri$ ls -lt
> >> >> total 1844128
> >> >> -rw-r--r-- 1 aespinosa ci-users 0 Apr 14 14:21 max-duration.tmp
> >> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:20 start-time.tmp
> >> >> -rw-r--r-- 1 aespinosa ci-users 1433206 Apr 14 14:20 stagein.event
> >> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve2.tmp
> >> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve.tmp
> >> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:19 t.inf
> >> >> -rw-r--r-- 1 aespinosa ci-users 2263727 Apr 14 12:51 stagein.transition
> >> >> -rw-r--r-- 1 aespinosa ci-users 8998897 Apr 14 12:31 stagein.log
> >> >> -rw-r--r-- 1 aespinosa ci-users 92059 Apr 14 12:05 dostageout.event
> >> >> -rw-r--r-- 1 aespinosa ci-users 97442 Apr 14 11:51 dostagein.event
> >> >> -rw-r--r-- 1 aespinosa ci-users 2998 Apr 13 17:38 dostagein.sorted-start.png
> >> >> -rw-r--r-- 1 aespinosa ci-users 3080 Apr 13 17:38 dostageout.sorted-start.png
> >> >> -rw-r--r-- 1 aespinosa ci-users 3255 Apr 8 16:05 execute2-total.png
> >> >> -rw-r--r-- 1 aespinosa ci-users 1533974 Apr 8 14:46 postproc-20110407-1438-i90jepr3.0.rlog
> >> >> -rw-r--r-- 1 aespinosa ci-users 1868896768 Apr 8 14:46 postproc-20110407-1438-i90jepr3.log
> >> >> drwxr-xr-x 2 aespinosa ci-users 32768 Apr 7 14:39 postproc-20110407-1438-i90jepr3.d/
> >> >> bri$
> >> >> bri$
> >> >>
> >> >> The directory is archive-runs, not archive-run.
> >> >>
> >> >> Also see bridled: /tmp/mw1
> >> >>
> >> >> ----- Original Message -----
> >> >> > [hategan at bridled tmp]$ cd ~aespinosa/workflows/cybershake/archive-run/test/
> >> >> > -bash: cd: /home/aespinosa/workflows/cybershake/archive-run/test/: No such file or directory
> >> >> >
> >> >> > On Thu, 2011-04-14 at 17:21 -0500, Allan Espinosa wrote:
> >> >> > > ~aespinosa/workflows/cybershake/archive-run/test/postproc*.log
> >> >> > >
> >> >> > > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> >> >> > > > On Thu, 2011-04-14 at 15:57 -0500, Michael Wilde wrote:
> >> >> > > >> While Allan continues to debug this, can you take a look at the
> >> >> > > >> (huge) log?
> >> >> > > >
> >> >> > > > Where is this log?
> >> >> > > >
> >> >> > > >
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



