[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG
Mihael Hategan
hategan at mcs.anl.gov
Fri Apr 15 14:46:46 CDT 2011
It would, if all three tries of the transfer failed and a subsequent rerun of
the job hasn't yet completed.
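
For illustration only, the retry-then-fail behavior discussed below can be sketched roughly as follows (plain Java with hypothetical names; this is not the actual Swift/CoG code):

    import java.util.concurrent.Callable;

    public class RetryingTransfer {
        // Try a single file transfer up to maxTries times; the exception only
        // propagates (and the job only fails) if every attempt fails.
        static void transferWithRetries(Callable<Void> transfer, int maxTries) throws Exception {
            Exception last = null;
            for (int attempt = 1; attempt <= maxTries; attempt++) {
                try {
                    transfer.call();   // e.g. one third-party gsiftp copy
                    return;            // success: stop retrying
                } catch (Exception e) {
                    last = e;          // remember the failure and try again
                }
            }
            throw last;                // all attempts failed
        }

        public static void main(String[] args) throws Exception {
            transferWithRetries(() -> {
                System.out.println("pretend to copy one file via gsiftp");
                return null;
            }, 3);
        }
    }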
On Fri, 2011-04-15 at 14:28 -0500, Allan Espinosa wrote:
> But then it should be registered in the "Failed jobs" counter, right?
>
> 2011/4/15 Mihael Hategan <hategan at mcs.anl.gov>:
> > Individual transfers are retried up to three times. The job fails only if
> > all three attempts fail.
> >
> > On Fri, 2011-04-15 at 12:14 -0500, Allan Espinosa wrote:
> >> Hi Mike,
> >>
> >> Here's the plot of the stagein events from yesterday's run, showing
> >> "almost" the same behavior. I attached the raw data and an SVG
> >> version of the plot as well. (The plots were produced with R instead of
> >> the regular swift-log-processing tools.)
> >>
> >> Also, looking at the transfers that did not finish (the ones with
> >> "START" in the last column of the raw data), I tested out two gridftp
> >> servers today:
> >>
> >> Ranger to top.ucr.edu
> >> $ globus-url-copy -vb
> >> gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/218_260.txt.variation-s0000-h0000
> >> gsiftp://top.ucr.edu/data/down/osg_data/engage/scec/testfile
> >> Source: gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/
> >> Dest: gsiftp://top.ucr.edu/data/down/osg_data/engage/scec/
> >> 218_260.txt.variation-s0000-h0000 -> testfile
> >>
> >> So this site looks like it's working alright. Here's the last
> >> activity for top.ucr.edu in the log:
> >> 2011-04-14 17:50:20,975-0500 DEBUG Reply read 1st line
> >> 2011-04-14 17:50:20,976-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_START srcname=PeakVals_TEST_219_258_32.bsa srcdir=postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 srchost=UCR-HEP__top.ucr.edu destdir=/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 desthost=gridftp.ranger.tacc.teragrid.org provider=gsiftp
> >> 2011-04-14 17:50:20,978-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_START srcname=Seismogram_TEST_219_258_35.grm srcdir=postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 srchost=UCR-HEP__top.ucr.edu destdir=/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 desthost=gridftp.ranger.tacc.teragrid.org provider=gsiftp
> >>
> >> ----comment-----
> >> I am not sure whether these gsiftp ABORT calls are associated with the top.ucr.edu site
> >> ---comment------
> >>
> >> 2011-04-14 17:50:21,006-0500 DEBUG Reply 1st line: 350 OK. Send RNTO with destination name.
> >> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel received: 350 OK. Send RNTO with destination name.
> >> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel sending: ABOR
> >>
> >> 2011-04-14 17:50:21,006-0500 DEBUG Reply read 1st line
> >> 2011-04-14 17:50:21,006-0500 DEBUG Reply 1st line: 350 OK. Send RNTO with destination name.
> >> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel received: 350 OK. Send RNTO with destination name.
> >> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel sending: ABOR
> >>
> >> 2011-04-14 17:50:21,006-0500 DEBUG Reply read 1st line
> >> 2011-04-14 17:50:21,068-0500 DEBUG Reply 1st line: 226 Abort successful
> >>
> >>
> >>
> >> [aespinosa at communicado shared]$ MB/sec avg 2.39 MB/sec inst
> >>
> >> Ranger to Clemson:
> >> $ globus-url-copy -vb
> >> gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/218_260.txt.variation-s0000-h0000
> >> gsiftp://osg-gw.clemson.edu/common1/osg/data/engage/scec/testfile
> >> Source: gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/
> >> Dest: gsiftp://osg-gw.clemson.edu/common1/osg/data/engage/scec/
> >> 218_260.txt.variation-s0000-h0000 -> testfile
> >>
> >>
> >> error: globus_xio: Unable to connect to osg-gw.clemson.edu:2811
> >> globus_xio: System error in connect: Connection refused
> >> globus_xio: A system call failed: Connection refused
> >>
> >>
> >> Looks like the Swift log recorded an exception for the Clemson resource:
> >> 2011-04-14 17:17:16,198-0500 DEBUG Reply end reached
> >> 2011-04-14 17:17:16,198-0500 DEBUG FTPControlChannel Control channel received: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> >> 500-globus_l_gfs_file_open failed.
> >> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> >> 500-globus_xio_register_open failed.
> >> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> >> 500-Unable to open file
> >> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
> >> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> >> 500-System error in open: No such file or directory
> >> 500-globus_xio: A system call failed: No such file or directory
> >> 500 End.
> >> 2011-04-14 17:17:16,198-0500 DEBUG Local first reply bad: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> >> 500-globus_l_gfs_file_open failed.
> >> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> >> 500-globus_xio_register_open failed.
> >> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> >> 500-Unable to open file
> >> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
> >> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> >> 500-System error in open: No such file or directory
> >> 500-globus_xio: A system call failed: No such file or directory
> >> 500 End.
> >> 2011-04-14 17:17:16,198-0500 DEBUG Local category: 5
> >> 2011-04-14 17:17:16,199-0500 DEBUG FTPControlChannel slept 200
> >> 2011-04-14 17:17:16,202-0500 DEBUG TransferState intercepted exception
> >> org.globus.ftp.exception.ServerException: Server refused performing
> >> the request. Custom message: (error code 1) [Nested exception
> >> message: Custom message: Unexpected reply: 500-Command failed. :
> >> globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> >> 500-globus_l_gfs_file_open failed.
> >> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> >> 500-globus_xio_register_open failed.
> >> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> >> 500-Unable to open file
> >> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
> >> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> >> 500-System error in open: No such file or directory
> >> 500-globus_xio: A system call failed: No such file or directory
> >> 500 End.]. Nested exception is
> >> org.globus.ftp.exception.UnexpectedReplyCodeException: Custom
> >> message: Unexpected reply: 500-Command failed. :
> >> globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> >> 500-globus_l_gfs_file_open failed.
> >> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> >> 500-globus_xio_register_open failed.
> >> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> >> 500-Unable to open file
> >> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
> >> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> >> 500-System error in open: No such file or directory
> >> 500-globus_xio: A system call failed: No such file or directory
> >> 500 End.
> >> at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:195)
> >> at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
> >> at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1479)
> >> at org.globus.ftp.FTPClient.transfer(FTPClient.java:1378)
> >> at org.globus.io.urlcopy.UrlCopy.thirdPartyTransfer(UrlCopy.java:739)
> >> at org.globus.io.urlcopy.UrlCopy.copy(UrlCopy.java:493)
> >> at org.globus.io.urlcopy.UrlCopy.run(UrlCopy.java:445)
> >> at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doThirdPartyTransfer(DelegatedFileTransferHandler.java:571)
> >> at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:465)
> >>
> >>
> >> I expected the Swift terminal session to report a "Failed" job, but as
> >> it runs now it looks like this:
> >>
> >> Progress: Stage in:79 Stage out:232 Finished successfully:130
> >> Progress: Stage in:79 Stage out:232 Finished successfully:130
> >> Progress: Stage in:79 Stage out:232 Finished successfully:130
> >> Progress: Stage in:79 Stage out:232 Finished successfully:130
> >> Progress: Stage in:79 Stage out:232 Finished successfully:130
> >>
> >> Btw, I have lazy.errors=true
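
(For reference, the two Swift settings relevant to the behavior above, with the values stated in this thread: lazy.errors=true defers error handling so the run keeps going instead of aborting, and throttle.transfers caps the number of concurrent file transfers. A sketch of the corresponding swift.properties lines; the rest of the file is omitted:)

    # excerpt from the run's swift.properties (values as quoted in this thread)
    lazy.errors=true
    throttle.transfers=4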
> >>
> >> 2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
> >> > Allan, what I meant was: do you have any evidence that this current run is hung (either in a similar manner to the one we looked at closely this morning, or in a different manner)?
> >> >
> >> > In this morning's log, you could tell from plots of stagein and stageout events that many of these events were not completing after something triggered an error.
> >> >
> >> > Do you have similar plots or evidence of hangs regarding this run and its log?
> >> >
> >> > I don't know from browsing the traces if one would *naturally* expect the transfer threads to all be waiting on input sockets most of the time, or if seeing all 4 threads waiting on sockets is indicative of data transfer being totally hung.
> >> >
> >> > Mihael, I assume you can tell much more from these traces?
> >> >
> >> > - Mike
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> Right now the log only gives out messages about
> >> >> AbstractKarajanStreamChannel. I set the org.globus.ftp package's
> >> >> logging level to DEBUG, so entries should show up if any
> >> >> transfers are being made.
> >> >>
> >> >> -Allan
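
(A note on the logging change described above: raising a package's log level is typically done in Swift's log4j configuration; the exact file name and location vary by install, so treat this as a sketch rather than the definitive path:)

    # e.g. in the log4j.properties shipped with the Swift distribution (location is an assumption)
    log4j.logger.org.globus.ftp=DEBUG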
> >> >>
> >> >> 2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
> >> >> > So you have 4 transfer threads and all 4 are waiting here:
> >> >> >
> >> >> > at java.net.SocketInputStream.socketRead0(Native Method)
> >> >> > at java.net.SocketInputStream.read(SocketInputStream.java:129)
> >> >> >
> >> >> > (from throttle.transfers=4)
> >> >> >
> >> >> > Is this workflow hung, and if so, how are you determining that? Do
> >> >> > you have another log plot of stagein and out?
> >> >> >
> >> >> > - Mike
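
(Aside from attaching jstack to the running JVM, the same check can be sketched in-process with standard JDK calls; this is purely illustrative and not something Swift itself does, and it only inspects the JVM it runs in:)

    import java.util.Map;

    public class FindBlockedTransfers {
        public static void main(String[] args) {
            // Walk all live threads in this JVM and flag any that are currently
            // blocked inside a socket read, the pattern seen in the traces above.
            Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
            for (Map.Entry<Thread, StackTraceElement[]> e : dump.entrySet()) {
                for (StackTraceElement frame : e.getValue()) {
                    if (frame.getClassName().equals("java.net.SocketInputStream")
                            && frame.getMethodName().startsWith("socketRead")) {
                        System.out.println(e.getKey().getName() + " waiting in " + frame);
                    }
                }
            }
        }
    }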
> >> >> >
> >> >> >
> >> >> > ----- Original Message -----
> >> >> >> Fresh traces (jstack and log) in
> >> >> >> /home/aespinosa/workflows/cybershake/archive-runs/transfer-logging.
> >> >> >> The swift log is a snapshot of the workflow that is still running.
> >> >> >>
> >> >> >> -Allan
> >> >> >>
> >> >> >> 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> >> >> >> > One immediate question that I have is what's up with the
> >> >> >> > "deadline passed" messages?
> >> >> >> >
> >> >> >> > That happens when jobs run for at least twice their advertised walltime
> >> >> >> > and for some reason the site doesn't seem to cancel them. This may be
> >> >> >> > indicative of notifications getting lost.
> >> >> >> >
> >> >> >> > As for the transfers, I don't see all transfers hanging after that. I
> >> >> >> > mean there are transfers that complete ok. Though things do seem to
> >> >> >> > slow down quite a bit, so that looks like a problem.
> >> >> >> >
> >> >> >> > Let's see what's in the stack traces. In the meantime, I will see what
> >> >> >> > it takes to get transfer progress messages.
> >> >> >> >
> >> >> >> > Mihael
> >> >> >> >
> >> >> >> >
> >> >> >> > On Thu, 2011-04-14 at 17:28 -0500, Michael Wilde wrote:
> >> >> >> >> bri$ pwd
> >> >> >> >> /home/aespinosa/workflows/cybershake/archive-runs/test
> >> >> >> >> bri$ ls -lt
> >> >> >> >> total 1844128
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 0 Apr 14 14:21 max-duration.tmp
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:20 start-time.tmp
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 1433206 Apr 14 14:20 stagein.event
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve2.tmp
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 2372737 Apr 14 14:19 sort-preserve.tmp
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 15 Apr 14 14:19 t.inf
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 2263727 Apr 14 12:51 stagein.transition
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 8998897 Apr 14 12:31 stagein.log
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 92059 Apr 14 12:05 dostageout.event
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 97442 Apr 14 11:51 dostagein.event
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 2998 Apr 13 17:38 dostagein.sorted-start.png
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 3080 Apr 13 17:38 dostageout.sorted-start.png
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 3255 Apr 8 16:05 execute2-total.png
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 1533974 Apr 8 14:46 postproc-20110407-1438-i90jepr3.0.rlog
> >> >> >> >> -rw-r--r-- 1 aespinosa ci-users 1868896768 Apr 8 14:46 postproc-20110407-1438-i90jepr3.log
> >> >> >> >> drwxr-xr-x 2 aespinosa ci-users 32768 Apr 7 14:39 postproc-20110407-1438-i90jepr3.d/
> >> >> >> >> bri$
> >> >> >> >>
> >> >> >> >> runs, not run
> >> >> >> >>
> >> >> >> >> Also see bridled: /tmp/mw1
> >> >> >> >>
> >> >> >> >> ----- Original Message -----
> >> >> >> >> > [hategan at bridled tmp]$ cd ~aespinosa/workflows/cybershake/archive-run/test/
> >> >> >> >> > -bash: cd: /home/aespinosa/workflows/cybershake/archive-run/test/: No such file or directory
> >> >> >> >> >
> >> >> >> >> > On Thu, 2011-04-14 at 17:21 -0500, Allan Espinosa wrote:
> >> >> >> >> > > ~aespinosa/workflows/cybershake/archive-run/test/postproc*.log
> >> >> >> >> > >
> >> >> >> >> > > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
> >> >> >> >> > > > On Thu, 2011-04-14 at 15:57 -0500, Michael Wilde wrote:
> >> >> >> >> > > >> While Allan continues to debug this, can you take a look at the (huge) log?
> >> >> >> >> > > >
> >> >> >> >> > > > Where is this log?
> >> >> >> >> > > >
> >> >> >> >> > > >
> >> >> >