[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG

Michael Wilde wilde at mcs.anl.gov
Thu Apr 14 19:45:45 CDT 2011


So you have 4 transfer threads and all 4 are waiting here:

	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:129)

(The count of 4 comes from throttle.transfers=4.)

Is this workflow hung, and if so, how are you determining that?  Do you have another log plot of stagein and out?
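
One way to answer the "is it hung" question more systematically would be to take several jstack dumps a few minutes apart and check whether the same threads stay parked in socketRead0. A minimal sketch in Python, assuming the usual jstack layout (a quoted thread name on the header line, followed by indented "at ..." frames). The function name and the sample thread names are made up for illustration and do not come from the actual dumps:

```python
def blocked_in_socket_read(dump_text):
    """Return names of threads whose top frame is SocketInputStream.socketRead0."""
    blocked = []
    name = None          # thread whose frames are currently being read
    seen_frame = False   # True once that thread's top frame has been seen
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # jstack thread header, e.g.: "name" prio=10 tid=... nid=...
            name = line.split('"')[1]
            seen_frame = False
        elif name is not None and not seen_frame and line.strip().startswith("at "):
            # Only the first "at" line (the top of the stack) matters here.
            seen_frame = True
            if "java.net.SocketInputStream.socketRead0" in line:
                blocked.append(name)
    return blocked


# Hypothetical two-thread dump, used only to exercise the function.
SAMPLE = '''"Transfer-1" prio=10 tid=0x1 nid=0x2 runnable [0x3]
   java.lang.Thread.State: RUNNABLE
\tat java.net.SocketInputStream.socketRead0(Native Method)
\tat java.net.SocketInputStream.read(SocketInputStream.java:129)

"Worker-2" prio=10 tid=0x4 nid=0x5 waiting on condition [0x6]
   java.lang.Thread.State: WAITING (parking)
\tat sun.misc.Unsafe.park(Native Method)
'''

if __name__ == "__main__":
    print(blocked_in_socket_read(SAMPLE))
```

If the same names come back from every dump over several minutes, the transfers are probably stuck rather than just slow. If the set keeps changing, they are still making progress.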

- Mike


----- Original Message -----
> Fresh traces (jstack and log) in
> /home/aespinosa/workflows/cybershake/archive-runs/transfer-logging .
> The swift log is a snapshot of the workflow that is still running.
> 
> -Allan
> 
> 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> > One immediate question that I have is what's up with the deadline
> > passed messages?
> >
> > That happens when jobs run for at least twice their advertised
> > walltime and for some reason the site doesn't seem to cancel them.
> > This may be indicative of notifications getting lost.
> >
> > As for the transfers, I don't see all transfers hanging after that.
> > I mean there are transfers that complete ok. Though things do seem
> > to slow down quite a bit, so that looks like a problem.
> >
> > Let's see what's in the stack traces. In the meantime, I will see
> > what it takes to get transfer progress messages.
> >
> > Mihael
> >
> >
> > On Thu, 2011-04-14 at 17:28 -0500, Michael Wilde wrote:
> >> bri$ pwd
> >> /home/aespinosa/workflows/cybershake/archive-runs/test
> >> bri$ ls -lt
> >> total 1844128
> >> -rw-r--r-- 1 aespinosa ci-users 0 Apr 14 14:21 max-duration.tmp
> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:20 start-time.tmp
> >> -rw-r--r-- 1 aespinosa ci-users 1433206 Apr 14 14:20 stagein.event
> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve2.tmp
> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve.tmp
> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:19 t.inf
> >> -rw-r--r-- 1 aespinosa ci-users 2263727 Apr 14 12:51 stagein.transition
> >> -rw-r--r-- 1 aespinosa ci-users 8998897 Apr 14 12:31 stagein.log
> >> -rw-r--r-- 1 aespinosa ci-users 92059 Apr 14 12:05 dostageout.event
> >> -rw-r--r-- 1 aespinosa ci-users 97442 Apr 14 11:51 dostagein.event
> >> -rw-r--r-- 1 aespinosa ci-users 2998 Apr 13 17:38 dostagein.sorted-start.png
> >> -rw-r--r-- 1 aespinosa ci-users 3080 Apr 13 17:38 dostageout.sorted-start.png
> >> -rw-r--r-- 1 aespinosa ci-users 3255 Apr 8 16:05 execute2-total.png
> >> -rw-r--r-- 1 aespinosa ci-users 1533974 Apr 8 14:46 postproc-20110407-1438-i90jepr3.0.rlog
> >> -rw-r--r-- 1 aespinosa ci-users 1868896768 Apr 8 14:46 postproc-20110407-1438-i90jepr3.log
> >> drwxr-xr-x 2 aespinosa ci-users 32768 Apr 7 14:39 postproc-20110407-1438-i90jepr3.d/
> >> bri$
> >>
> >> The directory is archive-runs, not archive-run.
> >>
> >> Also see bridled: /tmp/mw1
> >>
> >> ----- Original Message -----
> >> > [hategan at bridled tmp]$ cd ~aespinosa/workflows/cybershake/archive-run/test/
> >> > -bash: cd: /home/aespinosa/workflows/cybershake/archive-run/test/: No such file or directory
> >> >
> >> > On Thu, 2011-04-14 at 17:21 -0500, Allan Espinosa wrote:
> >> > > ~aespinosa/workflows/cybershake/archive-run/test/postproc*.log
> >> > >
> >> > > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> >> > > > On Thu, 2011-04-14 at 15:57 -0500, Michael Wilde wrote:
> >> > > >> While Allan continues to debug this, can you take a look at
> >> > > >> the (huge) log?
> >> > > >
> >> > > > Where is this log?
> >> > > >
> >> > > >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



