[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG

Allan Espinosa aespinosa at cs.uchicago.edu
Thu Apr 14 19:49:05 CDT 2011


Right now the log only shows messages from
AbstractKarajanStreamChannel.  I set the logging level of the
org.globus.ftp package to DEBUG, so entries should appear in the log
if any transfers are being made.
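
A minimal sketch of that setting, assuming the log4j 1.x properties
syntax and assuming the file is etc/log4j.properties in the Swift
install (the location may differ on your setup):

```properties
# Assumed location: etc/log4j.properties in the Swift installation.
# Raises only the org.globus.ftp logger (and its children) to DEBUG.
log4j.logger.org.globus.ftp=DEBUG
```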

-Allan

2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
> So you have 4 transfer threads and all 4 are waiting here:
>
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.read(SocketInputStream.java:129)
>
> (from throttle.transfers=4)
>
> Is this workflow hung, and if so, how are you determining that?  Do you have another log plot of stagein and out?
>
> - Mike
>
>
> ----- Original Message -----
>> Fresh traces (jstack and log) in
>> /home/aespinosa/workflows/cybershake/archive-runs/transfer-logging .
>> The swift log is a snapshot of the workflow that is still running.
>>
>> -Allan
>>
>> 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
>> > One immediate question that I have is what's up with the deadline
>> > passed messages?
>> >
>> > That happens when jobs run for at least twice their advertised
>> > walltime and for some reason the site doesn't seem to cancel them.
>> > This may be indicative of notifications getting lost.
>> >
>> > As for the transfers, I don't see all transfers hanging after that.
>> > I mean there are transfers that complete ok. Though things do seem
>> > to slow down quite a bit, so that looks like a problem.
>> >
>> > Let's see what's in the stack traces. In the meantime, I will see
>> > what it takes to get transfer progress messages.
>> >
>> > Mihael
>> >
>> >
>> > On Thu, 2011-04-14 at 17:28 -0500, Michael Wilde wrote:
>> >> bri$ pwd
>> >> /home/aespinosa/workflows/cybershake/archive-runs/test
>> >> bri$ ls -lt
>> >> total 1844128
>> >> -rw-r--r-- 1 aespinosa ci-users 0 Apr 14 14:21 max-duration.tmp
>> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:20 start-time.tmp
>> >> -rw-r--r-- 1 aespinosa ci-users 1433206 Apr 14 14:20 stagein.event
>> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve2.tmp
>> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve.tmp
>> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:19 t.inf
>> >> -rw-r--r-- 1 aespinosa ci-users 2263727 Apr 14 12:51 stagein.transition
>> >> -rw-r--r-- 1 aespinosa ci-users 8998897 Apr 14 12:31 stagein.log
>> >> -rw-r--r-- 1 aespinosa ci-users 92059 Apr 14 12:05 dostageout.event
>> >> -rw-r--r-- 1 aespinosa ci-users 97442 Apr 14 11:51 dostagein.event
>> >> -rw-r--r-- 1 aespinosa ci-users 2998 Apr 13 17:38 dostagein.sorted-start.png
>> >> -rw-r--r-- 1 aespinosa ci-users 3080 Apr 13 17:38 dostageout.sorted-start.png
>> >> -rw-r--r-- 1 aespinosa ci-users 3255 Apr 8 16:05 execute2-total.png
>> >> -rw-r--r-- 1 aespinosa ci-users 1533974 Apr 8 14:46 postproc-20110407-1438-i90jepr3.0.rlog
>> >> -rw-r--r-- 1 aespinosa ci-users 1868896768 Apr 8 14:46 postproc-20110407-1438-i90jepr3.log
>> >> drwxr-xr-x 2 aespinosa ci-users 32768 Apr 7 14:39 postproc-20110407-1438-i90jepr3.d/
>> >> bri$
>> >>
>> >> The directory is archive-runs, not archive-run.
>> >>
>> >> Also see bridled: /tmp/mw1
>> >>
>> >> ----- Original Message -----
>> >> > [hategan at bridled tmp]$ cd ~aespinosa/workflows/cybershake/archive-run/test/
>> >> > -bash: cd: /home/aespinosa/workflows/cybershake/archive-run/test/: No such file or directory
>> >> >
>> >> > On Thu, 2011-04-14 at 17:21 -0500, Allan Espinosa wrote:
>> >> > > ~aespinosa/workflows/cybershake/archive-run/test/postproc*.log
>> >> > >
>> >> > > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
>> >> > > > On Thu, 2011-04-14 at 15:57 -0500, Michael Wilde wrote:
>> >> > > >> While Allan continues to debug this, can you take a look at
>> >> > > >> the (huge) log?
>> >> > > >
>> >> > > > Where is this log?
>> >> > > >
>> >> > > >
>


