[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG

Mihael Hategan hategan at mcs.anl.gov
Thu Apr 14 20:51:16 CDT 2011


Well, that's barely hung unless the gridftp servers are hung, which may
be the case.
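
One rough way to check how many threads are sitting in a socket read is to
scan the jstack dump for threads whose top frame is socketRead0, as in the
trace quoted below. This is only a sketch: it assumes the usual HotSpot
jstack text layout (each thread starts with its name in double quotes,
followed by "at ..." frames), and the function names are made up for
illustration.

```python
import re

# Header line of a thread in jstack output starts with the quoted thread name,
# e.g.  "Thread-12" daemon prio=10 tid=0x... nid=0x... runnable [0x...]
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')

# Frame we are looking for (taken from the quoted stack trace).
SOCKET_READ = "java.net.SocketInputStream.socketRead0("


def top_frames(jstack_text):
    """Map each thread name to its first 'at ...' frame (the top of its stack)."""
    frames = {}
    current = None
    for line in jstack_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        stripped = line.strip()
        if current is not None and stripped.startswith("at "):
            # setdefault keeps only the first frame seen for this thread
            frames.setdefault(current, stripped[3:])
    return frames


def blocked_in_socket_read(jstack_text):
    """Return the sorted names of threads whose top frame is socketRead0."""
    return sorted(
        name
        for name, frame in top_frames(jstack_text).items()
        if frame.startswith(SOCKET_READ)
    )
```

Comparing the result across two dumps taken a few minutes apart would show
whether the same threads stay in the read or whether they are moving on to
other transfers.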

I would suggest upping the transfer throttle in this case. 4 may be
cutting it too close. Maybe to 16.
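
For reference, the change would look roughly like this, assuming the
throttle is set in swift.properties as the quoted throttle.transfers=4
suggests. The value 16 is just the suggestion above, not a tested setting.

```properties
# Allow up to 16 concurrent file transfers instead of 4
throttle.transfers=16
```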

On Thu, 2011-04-14 at 19:45 -0500, Michael Wilde wrote:
> So you have 4 transfer threads and all 4 are waiting here:
> 
> at java.net.SocketInputStream.socketRead0(Native Method)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:129)
> 
> (from throttle.transfers=4)
> 
> Is this workflow hung, and if so, how are you determining that?  Do you have another log plot of stagein and out?
> 
> - Mike
> 
> 
> ----- Original Message -----
> > Fresh traces (jstack and log) in
> > /home/aespinosa/workflows/cybershake/archive-runs/transfer-logging .
> > The swift log is a snapshot of the workflow that is still running.
> > 
> > -Allan
> > 
> > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> > > One immediate question that I have is what's up with the deadline passed
> > > messages?
> > >
> > > That happens when jobs run for at least twice their advertised walltime
> > > and for some reason the site doesn't seem to cancel them. This may be
> > > indicative of notifications getting lost.
> > >
> > > As for the transfers, I don't see all transfers hanging after that. I
> > > mean there are transfers that complete ok. Though things do seem to slow
> > > down quite a bit, so that looks like a problem.
> > >
> > > Let's see what's in the stack traces. In the meantime, I will see what it
> > > takes to get transfer progress messages.
> > >
> > > Mihael
> > >
> > >
> > > On Thu, 2011-04-14 at 17:28 -0500, Michael Wilde wrote:
> > > >> bri$ pwd
> > > >> /home/aespinosa/workflows/cybershake/archive-runs/test
> > > >> bri$ ls -lt
> > > >> total 1844128
> > > >> -rw-r--r-- 1 aespinosa ci-users 0 Apr 14 14:21 max-duration.tmp
> > > >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:20 start-time.tmp
> > > >> -rw-r--r-- 1 aespinosa ci-users 1433206 Apr 14 14:20 stagein.event
> > > >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve2.tmp
> > > >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve.tmp
> > > >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:19 t.inf
> > > >> -rw-r--r-- 1 aespinosa ci-users 2263727 Apr 14 12:51 stagein.transition
> > > >> -rw-r--r-- 1 aespinosa ci-users 8998897 Apr 14 12:31 stagein.log
> > > >> -rw-r--r-- 1 aespinosa ci-users 92059 Apr 14 12:05 dostageout.event
> > > >> -rw-r--r-- 1 aespinosa ci-users 97442 Apr 14 11:51 dostagein.event
> > > >> -rw-r--r-- 1 aespinosa ci-users 2998 Apr 13 17:38 dostagein.sorted-start.png
> > > >> -rw-r--r-- 1 aespinosa ci-users 3080 Apr 13 17:38 dostageout.sorted-start.png
> > > >> -rw-r--r-- 1 aespinosa ci-users 3255 Apr 8 16:05 execute2-total.png
> > > >> -rw-r--r-- 1 aespinosa ci-users 1533974 Apr 8 14:46 postproc-20110407-1438-i90jepr3.0.rlog
> > > >> -rw-r--r-- 1 aespinosa ci-users 1868896768 Apr 8 14:46 postproc-20110407-1438-i90jepr3.log
> > > >> drwxr-xr-x 2 aespinosa ci-users 32768 Apr 7 14:39 postproc-20110407-1438-i90jepr3.d/
> > > >> bri$
> > >>
> > > >> The directory is archive-runs, not archive-run.
> > > >>
> > > >> Also see bridled: /tmp/mw1
> > >>
> > >> ----- Original Message -----
> > > >> > [hategan at bridled tmp]$ cd ~aespinosa/workflows/cybershake/archive-run/test/
> > > >> > -bash: cd: /home/aespinosa/workflows/cybershake/archive-run/test/: No such file or directory
> > >> >
> > >> > On Thu, 2011-04-14 at 17:21 -0500, Allan Espinosa wrote:
> > >> > > ~aespinosa/workflows/cybershake/archive-run/test/postproc*.log
> > >> > >
> > >> > > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> > >> > > > On Thu, 2011-04-14 at 15:57 -0500, Michael Wilde wrote:
> > > >> > > >> While Allan continues to debug this, can you take a look at the
> > > >> > > >> (huge) log?
> > >> > > >
> > >> > > > Where is this log?
> > >> > > >
> > >> > > >
> 




