[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG

Mihael Hategan hategan at mcs.anl.gov
Fri Apr 15 14:24:38 CDT 2011


Individual transfers are retried up to three times; the job fails only if
all three attempts fail.
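
The logic is roughly the following (a minimal sketch with illustrative
names, not the actual Swift/CoG code):

    // Retry a transfer up to three times; fail only when every attempt fails.
    // Transfer and TransferException are illustrative placeholder types.
    static final int MAX_ATTEMPTS = 3;

    void transferWithRetry(Transfer t) throws TransferException {
        TransferException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                t.run();      // attempt the transfer
                return;       // success: no more retries needed
            } catch (TransferException e) {
                last = e;     // remember the failure and try again
            }
        }
        throw last;           // all three attempts failed: the job fails
    }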

On Fri, 2011-04-15 at 12:14 -0500, Allan Espinosa wrote:
> Hi Mike,
> 
> here's the plot of the stage-in events from yesterday's run, showing
> "almost" the same behavior.  I attached the raw data and an SVG
> version of the plot as well.  (The plots were produced with R instead
> of the regular swift-log-processing tools.)
> 
> Also, looking at the transfers that did not finish (the ones with
> "START" in the last column of the raw data), I tested two GridFTP
> servers today:
> 
> Ranger to top.ucr.edu
> $ globus-url-copy -vb
> gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/218_260.txt.variation-s0000-h0000
> gsiftp://top.ucr.edu/data/down/osg_data/engage/scec/testfile
> Source: gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/
> Dest:   gsiftp://top.ucr.edu/data/down/osg_data/engage/scec/
>   218_260.txt.variation-s0000-h0000  ->  testfile
> 
> So this site looks like it's working fine.  Here's the last
> activity involving top.ucr.edu in the log:
> 2011-04-14 17:50:20,975-0500 DEBUG Reply read 1st line
> 2011-04-14 17:50:20,976-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_START srcname=PeakVals_TEST_219_258_32.bsa srcdir=postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 srchost=UCR-HEP__top.ucr.edu destdir=/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 desthost=gridftp.ranger.tacc.teragrid.org provider=gsiftp
> 2011-04-14 17:50:20,978-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_START srcname=Seismogram_TEST_219_258_35.grm srcdir=postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 srchost=UCR-HEP__top.ucr.edu destdir=/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 desthost=gridftp.ranger.tacc.teragrid.org provider=gsiftp
> 
> ----comment----
> I am not sure whether these gsiftp ABOR calls are associated with the
> top.ucr.edu site.  (The 350 replies are the server asking for the RNTO
> half of an FTP rename; the client then sends ABOR.)
> ----comment----
> 
> 2011-04-14 17:50:21,006-0500 DEBUG Reply 1st line: 350 OK. Send RNTO with destination name.
> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel received: 350 OK. Send RNTO with destination name.
> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel sending: ABOR
> 
> 2011-04-14 17:50:21,006-0500 DEBUG Reply read 1st line
> 2011-04-14 17:50:21,006-0500 DEBUG Reply 1st line: 350 OK. Send RNTO with destination name.
> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel received: 350 OK. Send RNTO with destination name.
> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel sending: ABOR
> 
> 2011-04-14 17:50:21,006-0500 DEBUG Reply read 1st line
> 2011-04-14 17:50:21,068-0500 DEBUG Reply 1st line: 226 Abort successful
> 
> 
> 
> [aespinosa at communicado shared]$  MB/sec avg         2.39 MB/sec inst
> 
> Ranger to Clemson:
> $ globus-url-copy -vb
> gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/218_260.txt.variation-s0000-h0000
> gsiftp://osg-gw.clemson.edu/common1/osg/data/engage/scec/testfile
> Source: gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/
> Dest:   gsiftp://osg-gw.clemson.edu/common1/osg/data/engage/scec/
>   218_260.txt.variation-s0000-h0000  ->  testfile
> 
> 
> error: globus_xio: Unable to connect to osg-gw.clemson.edu:2811
> globus_xio: System error in connect: Connection refused
> globus_xio: A system call failed: Connection refused
> 
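> (A quick GSI-independent way to confirm the control port is unreachable,
> using plain netcat:
> 
>   $ nc -zv osg-gw.clemson.edu 2811
> 
> which should report the same refused connection.)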
> 
> Looks like the Swift log gave an exception for the Clemson resource:
> 2011-04-14 17:17:16,198-0500 DEBUG Reply end reached
> 2011-04-14 17:17:16,198-0500 DEBUG FTPControlChannel Control channel received: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> 500-globus_l_gfs_file_open failed.
> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> 500-globus_xio_register_open failed.
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> 500-Unable to open file
> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> 500-System error in open: No such file or directory
> 500-globus_xio: A system call failed: No such file or directory
> 500 End.
> 2011-04-14 17:17:16,198-0500 DEBUG Local first reply bad: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> 500-globus_l_gfs_file_open failed.
> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> 500-globus_xio_register_open failed.
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> 500-Unable to open file
> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> 500-System error in open: No such file or directory
> 500-globus_xio: A system call failed: No such file or directory
> 500 End.
> 2011-04-14 17:17:16,198-0500 DEBUG Local category: 5
> 2011-04-14 17:17:16,199-0500 DEBUG FTPControlChannel slept 200
> 2011-04-14 17:17:16,202-0500 DEBUG TransferState intercepted exception
> org.globus.ftp.exception.ServerException: Server refused performing
> the request. Custom message:  (error code 1) [Nested exception
> message:  Custom message: Unexpected reply: 500-Command failed. :
> globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> 500-globus_l_gfs_file_open failed.
> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> 500-globus_xio_register_open failed.
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> 500-Unable to open file
> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> 500-System error in open: No such file or directory
> 500-globus_xio: A system call failed: No such file or directory
> 500 End.].  Nested exception is
> org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom
> message: Unexpected reply: 500-Command failed. :
> globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> 500-globus_l_gfs_file_open failed.
> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> 500-globus_xio_register_open failed.
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> 500-Unable to open file
> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> 500-System error in open: No such file or directory
> 500-globus_xio: A system call failed: No such file or directory
> 500 End.
>         at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:195)
>         at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
>         at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1479)
>         at org.globus.ftp.FTPClient.transfer(FTPClient.java:1378)
>         at org.globus.io.urlcopy.UrlCopy.thirdPartyTransfer(UrlCopy.java:739)
>         at org.globus.io.urlcopy.UrlCopy.copy(UrlCopy.java:493)
>         at org.globus.io.urlcopy.UrlCopy.run(UrlCopy.java:445)
>         at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doThirdPartyTransfer(DelegatedFileTransferHandler.java:571)
>         at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:465)
> 
> 
> I expected the Swift terminal session to report a "Failed" job, but
> right now the run just shows:
> 
> Progress:  Stage in:79  Stage out:232  Finished successfully:130
> Progress:  Stage in:79  Stage out:232  Finished successfully:130
> Progress:  Stage in:79  Stage out:232  Finished successfully:130
> Progress:  Stage in:79  Stage out:232  Finished successfully:130
> Progress:  Stage in:79  Stage out:232  Finished successfully:130
> 
> Btw, I have lazy.errors=true set.
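> 
> (For reference, the relevant lines in the Swift properties file look
> like this; throttle.transfers is the setting Mike refers to below:
> 
>   # keep going on errors instead of failing the run immediately
>   lazy.errors=true
>   # allow at most 4 concurrent file transfers
>   throttle.transfers=4
> )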
> 
> 2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
> > Allan, what I meant was: do you have any evidence that this current run is hung (either in a similar manner to the one we looked at closely this morning, or in a different manner)?
> >
> > In this morning's log, you could tell from plots of stage-in and stage-out events that many of these events were not completing after something triggered an error.
> >
> > Do you have similar plots or evidence of hangs regarding this run and its log?
> >
> > I don't know from browsing the traces if one would *naturally* expect the transfer threads to all be waiting on input sockets most of the time, or if seeing all 4 threads waiting on sockets is indicative of data transfer being totally hung.
> >
> > Mihael, I assume you can tell much more from these traces?
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> >> Right now the log only gives messages about
> >> AbstractKarajanStreamChannel. I set the org.globus.ftp package's
> >> logging level to DEBUG, so entries should show up if there are
> >> transfers being made.
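> >>
> >> (For reference, this is typically a one-line change in the log4j
> >> configuration that ships with Swift; the exact file location may vary
> >> by installation:
> >>
> >>   log4j.logger.org.globus.ftp=DEBUG
> >> )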
> >>
> >> -Allan
> >>
> >> 2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
> >> > So you have 4 transfer threads and all 4 are waiting here:
> >> >
> >> > at java.net.SocketInputStream.socketRead0(Native Method)
> >> > at java.net.SocketInputStream.read(SocketInputStream.java:129)
> >> >
> >> > (from throttle.transfers=4)
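> >> >
> >> > (One way to count these on a live JVM, assuming you know the Swift
> >> > process id:
> >> >
> >> >   $ jstack <pid> | grep -B 2 socketRead0
> >> >
> >> > which shows each thread currently blocked in that native read.)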
> >> >
> >> > Is this workflow hung, and if so, how are you determining that? Do
> >> > you have another log plot of stagein and out?
> >> >
> >> > - Mike
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> Fresh traces (jstack and log) in
> >> >> /home/aespinosa/workflows/cybershake/archive-runs/transfer-logging.
> >> >> The swift log is a snapshot of the workflow that is still running.
> >> >>
> >> >> -Allan
> >> >>
> >> >> 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> >> >> > One immediate question I have: what's up with the "deadline
> >> >> > passed" messages?
> >> >> >
> >> >> > Those happen when jobs run for at least twice their advertised
> >> >> > walltime and, for some reason, the site doesn't seem to cancel
> >> >> > them. This may be indicative of notifications getting lost.
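> >> >> >
> >> >> > (Roughly, in illustrative terms, not the actual code:
> >> >> >
> >> >> >   // "deadline passed": the job is still running at twice its
> >> >> >   // advertised walltime; names here are hypothetical.
> >> >> >   if (now >= startTime + 2 * advertisedWalltime)
> >> >> >       log.warn("deadline passed");
> >> >> >
> >> >> > so a job advertised at one hour trips it after two hours.)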
> >> >> >
> >> >> > As for the transfers, I don't see all transfers hanging after
> >> >> > that. I mean, there are transfers that complete OK. Though things
> >> >> > do seem to slow down quite a bit, so that looks like a problem.
> >> >> >
> >> >> > Let's see what's in the stack traces. In the meantime, I will see
> >> >> > what it takes to get transfer progress messages.
> >> >> >
> >> >> > Mihael
> >> >> >
> >> >> >
> >> >> > On Thu, 2011-04-14 at 17:28 -0500, Michael Wilde wrote:
> >> >> >> bri$ pwd
> >> >> >> /home/aespinosa/workflows/cybershake/archive-runs/test
> >> >> >> bri$ ls -lt
> >> >> >> total 1844128
> >> >> >> -rw-r--r-- 1 aespinosa ci-users          0 Apr 14 14:21 max-duration.tmp
> >> >> >> -rw-r--r-- 1 aespinosa ci-users         15 Apr 14 14:20 start-time.tmp
> >> >> >> -rw-r--r-- 1 aespinosa ci-users    1433206 Apr 14 14:20 stagein.event
> >> >> >> -rw-r--r-- 1 aespinosa ci-users    2372737 Apr 14 14:19 sort-preserve2.tmp
> >> >> >> -rw-r--r-- 1 aespinosa ci-users    2372737 Apr 14 14:19 sort-preserve.tmp
> >> >> >> -rw-r--r-- 1 aespinosa ci-users         15 Apr 14 14:19 t.inf
> >> >> >> -rw-r--r-- 1 aespinosa ci-users    2263727 Apr 14 12:51 stagein.transition
> >> >> >> -rw-r--r-- 1 aespinosa ci-users    8998897 Apr 14 12:31 stagein.log
> >> >> >> -rw-r--r-- 1 aespinosa ci-users      92059 Apr 14 12:05 dostageout.event
> >> >> >> -rw-r--r-- 1 aespinosa ci-users      97442 Apr 14 11:51 dostagein.event
> >> >> >> -rw-r--r-- 1 aespinosa ci-users       2998 Apr 13 17:38 dostagein.sorted-start.png
> >> >> >> -rw-r--r-- 1 aespinosa ci-users       3080 Apr 13 17:38 dostageout.sorted-start.png
> >> >> >> -rw-r--r-- 1 aespinosa ci-users       3255 Apr  8 16:05 execute2-total.png
> >> >> >> -rw-r--r-- 1 aespinosa ci-users    1533974 Apr  8 14:46 postproc-20110407-1438-i90jepr3.0.rlog
> >> >> >> -rw-r--r-- 1 aespinosa ci-users 1868896768 Apr  8 14:46 postproc-20110407-1438-i90jepr3.log
> >> >> >> drwxr-xr-x 2 aespinosa ci-users      32768 Apr  7 14:39 postproc-20110407-1438-i90jepr3.d/
> >> >> >> bri$
> >> >> >>
> >> >> >> "runs", not "run": the directory is archive-runs, not archive-run.
> >> >> >>
> >> >> >> Also see bridled: /tmp/mw1
> >> >> >>
> >> >> >> ----- Original Message -----
> >> >> >> > [hategan at bridled tmp]$ cd ~aespinosa/workflows/cybershake/archive-run/test/
> >> >> >> > -bash: cd: /home/aespinosa/workflows/cybershake/archive-run/test/: No such file or directory
> >> >> >> >
> >> >> >> > On Thu, 2011-04-14 at 17:21 -0500, Allan Espinosa wrote:
> >> >> >> > > ~aespinosa/workflows/cybershake/archive-run/test/postproc*.log
> >> >> >> > >
> >> >> >> > > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> >> >> >> > > > On Thu, 2011-04-14 at 15:57 -0500, Michael Wilde wrote:
> >> >> >> > > >> While Allan continues to debug this, can you take a look at the (huge) log?
> >> >> >> > > >
> >> >> >> > > > Where is this log?
> >> >> >> > > >
> >> >> >> > > >
> >> >
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> >
> >