[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG

Allan Espinosa aespinosa at cs.uchicago.edu
Fri Apr 15 14:28:45 CDT 2011


But then it should be registered in the "Failed jobs" counter, right?
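
To make sure we are counting the same thing, here is my mental model of
the retry accounting as a rough sketch (hypothetical Python, not the
actual cog/Swift source):

    MAX_ATTEMPTS = 3  # "retried up to three times", per Mihael below

    def run_transfer(transfer, counters):
        # A retry by itself never touches the progress ticker; only
        # exhausting all attempts should surface a failure.
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                transfer()  # the actual gsiftp transfer
                counters["Finished successfully"] += 1
                return
            except Exception:
                if attempt == MAX_ATTEMPTS:
                    counters["Failed jobs"] += 1  # what I expected to see
                    raise

If that is roughly what happens, the Clemson errors below should have
shown up in the ticker after the third failed attempt.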

2011/4/15 Mihael Hategan <hategan at mcs.anl.gov>:
> Individual transfers are retried up to three times. Only if all three
> fail does the job fail.
>
> On Fri, 2011-04-15 at 12:14 -0500, Allan Espinosa wrote:
>> Hi Mike,
>>
>> here's the plot of the stage-in events from yesterday's run, showing
>> almost the same behavior.  I attached the raw data and an SVG version
>> of the plot as well.  (The plots were produced with R instead of the
>> regular swift-log-processing tools.)
>>
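>> The extraction behind the plot is basically: pull out the stage-in/out
>> START events and flag the ones that never see a matching END. Here is a
>> rough Python sketch of that step; the event tags are extrapolated from
>> the FILE_STAGE_OUT_START entries below, so treat the regexes as
>> assumptions:
>>
>>     import re, sys
>>
>>     START = re.compile(r'FILE_STAGE_(IN|OUT)_START.*?srcname=(\S+)')
>>     END = re.compile(r'FILE_STAGE_(IN|OUT)_END.*?srcname=(\S+)')
>>
>>     pending = {}  # srcname -> the START line that opened the transfer
>>     for line in open(sys.argv[1]):
>>         m = START.search(line)
>>         if m:
>>             pending[m.group(2)] = line.rstrip()
>>             continue
>>         m = END.search(line)
>>         if m:
>>             pending.pop(m.group(2), None)
>>
>>     # whatever is left started but never finished
>>     for name in sorted(pending):
>>         print(pending[name])
>>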
>> Also, looking at the transfers that did not finish (the ones with
>> "START" in the last column of the raw data), I tested two GridFTP
>> servers today:
>>
>> Ranger to top.ucr.edu
>> $ globus-url-copy -vb
>> gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/218_260.txt.variation-s0000-h0000
>> gsiftp://top.ucr.edu/data/down/osg_data/engage/scec/testfile
>> Source: gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/
>> Dest:   gsiftp://top.ucr.edu/data/down/osg_data/engage/scec/
>>   218_260.txt.variation-s0000-h0000  ->  testfile
>>
>> So this site looks like it's working all right.  Here's the last
>> activity involving top.ucr.edu in the log:
>> 2011-04-14 17:50:20,975-0500 DEBUG Reply read 1st line
>> 2011-04-14 17:50:20,976-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_START srcname=PeakVals_TEST_219_258_32.bsa srcdir=postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 srchost=UCR-HEP__top.ucr.edu destdir=/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 desthost=gridftp.ranger.tacc.teragrid.org provider=gsiftp
>> 2011-04-14 17:50:20,978-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_START srcname=Seismogram_TEST_219_258_35.grm srcdir=postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 srchost=UCR-HEP__top.ucr.edu destdir=/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 desthost=gridftp.ranger.tacc.teragrid.org provider=gsiftp
>>
>> ----comment-----
>> I am not sure whether these gsiftp ABOR calls correspond to the top.ucr.edu site.
>> ----comment-----
>>
>> 2011-04-14 17:50:21,006-0500 DEBUG Reply 1st line: 350 OK. Send RNTO with destination name.
>> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel received: 350 OK. Send RNTO with destination name.
>> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel sending: ABOR
>>
>> 2011-04-14 17:50:21,006-0500 DEBUG Reply read 1st line
>> 2011-04-14 17:50:21,006-0500 DEBUG Reply 1st line: 350 OK. Send RNTO with destination name.
>> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel received: 350 OK. Send RNTO with destination name.
>> 2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel sending: ABOR
>>
>> 2011-04-14 17:50:21,006-0500 DEBUG Reply read 1st line
>> 2011-04-14 17:50:21,068-0500 DEBUG Reply 1st line: 226 Abort successful
>>
>>
>>
>> [aespinosa at communicado shared]$  MB/sec avg         2.39 MB/sec inst
>>
>> Ranger to Clemson:
>> $ globus-url-copy -vb
>> gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/218_260.txt.variation-s0000-h0000
>> gsiftp://osg-gw.clemson.edu/common1/osg/data/engage/scec/testfile
>> Source: gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/
>> Dest:   gsiftp://osg-gw.clemson.edu/common1/osg/data/engage/scec/
>>   218_260.txt.variation-s0000-h0000  ->  testfile
>>
>>
>> error: globus_xio: Unable to connect to osg-gw.clemson.edu:2811
>> globus_xio: System error in connect: Connection refused
>> globus_xio: A system call failed: Connection refused
>>
>>
>> Looks like the Swift log gave an exception for the Clemson resource:
>> 2011-04-14 17:17:16,198-0500 DEBUG Reply end reached
>> 2011-04-14 17:17:16,198-0500 DEBUG FTPControlChannel Control channel received: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
>> 500-globus_l_gfs_file_open failed.
>> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
>> 500-globus_xio_register_open failed.
>> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
>> 500-Unable to open file
>> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
>> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
>> 500-System error in open: No such file or directory
>> 500-globus_xio: A system call failed: No such file or directory
>> 500 End.
>> 2011-04-14 17:17:16,198-0500 DEBUG Local first reply bad: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
>> 500-globus_l_gfs_file_open failed.
>> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
>> 500-globus_xio_register_open failed.
>> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
>> 500-Unable to open file
>> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
>> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
>> 500-System error in open: No such file or directory
>> 500-globus_xio: A system call failed: No such file or directory
>> 500 End.
>> 2011-04-14 17:17:16,198-0500 DEBUG Local category: 5
>> 2011-04-14 17:17:16,199-0500 DEBUG FTPControlChannel slept 200
>> 2011-04-14 17:17:16,202-0500 DEBUG TransferState intercepted exception
>> org.globus.ftp.exception.ServerException: Server refused performing
>> the request. Custom message:  (error code 1) [Nested exception
>> message:  Custom message: Unexpected reply: 500-Command failed. :
>> globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
>> 500-globus_l_gfs_file_open failed.
>> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
>> 500-globus_xio_register_open failed.
>> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
>> 500-Unable to open file
>> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
>> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
>> 500-System error in open: No such file or directory
>> 500-globus_xio: A system call failed: No such file or directory
>> 500 End.].  Nested exception is
>> org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom
>> message: Unexpected reply: 500-Command failed. :
>> globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
>> 500-globus_l_gfs_file_open failed.
>> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
>> 500-globus_xio_register_open failed.
>> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
>> 500-Unable to open file
>> /common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
>> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
>> 500-System error in open: No such file or directory
>> 500-globus_xio: A system call failed: No such file or directory
>> 500 End.
>>         at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:195)
>>         at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
>>         at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1479)
>>         at org.globus.ftp.FTPClient.transfer(FTPClient.java:1378)
>>         at org.globus.io.urlcopy.UrlCopy.thirdPartyTransfer(UrlCopy.java:739)
>>         at org.globus.io.urlcopy.UrlCopy.copy(UrlCopy.java:493)
>>         at org.globus.io.urlcopy.UrlCopy.run(UrlCopy.java:445)
>>         at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doThirdPartyTransfer(DelegatedFileTransferHandler.java:571)
>>         at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:465)
>>
>>
>> I expected the Swift terminal session to report a "Failed" job, but as
>> it runs now it looks like this:
>>
>> Progress:  Stage in:79  Stage out:232  Finished successfully:130
>> Progress:  Stage in:79  Stage out:232  Finished successfully:130
>> Progress:  Stage in:79  Stage out:232  Finished successfully:130
>> Progress:  Stage in:79  Stage out:232  Finished successfully:130
>> Progress:  Stage in:79  Stage out:232  Finished successfully:130
>>
>> Btw, I have lazy.errors=true set.
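>>
>> For reference, the relevant lines of my swift.properties are roughly the
>> following (a sketch; lazy.errors and throttle.transfers are the values
>> already mentioned in this thread, and the rest of the file is omitted):
>>
>>     lazy.errors=true
>>     throttle.transfers=4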
>>
>> 2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
>> > Allan, what I meant was: do you have any evidence that this current run is hung (either in a manner similar to the one we looked at closely this morning, or in a different manner)?
>> >
>> > In this morning's log, you could tell from plots of stage-in and stage-out events that many of these events were not completing after something triggered an error.
>> >
>> > Do you have similar plots or evidence of hangs regarding this run and its log?
>> >
>> > I don't know from browsing the traces whether one would *naturally* expect the transfer threads to all be waiting on input sockets most of the time, or whether seeing all 4 threads waiting on sockets indicates that data transfer is totally hung.
>> >
>> > Mihael, I assume you can tell much more from these traces?
>> >
>> > - Mike
>> >
>> >
>> > ----- Original Message -----
>> >> Right now the log only gives messages about
>> >> AbstractKarajanStreamChannel. I set the org.globus.ftp package's
>> >> logging level to DEBUG, so entries should show up if any transfers
>> >> are being made.
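>> >>
>> >> Concretely, that was a one-line change in Swift's log4j configuration
>> >> (the exact file path depends on the install, so treat this as a
>> >> sketch):
>> >>
>> >>     log4j.logger.org.globus.ftp=DEBUG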
>> >>
>> >> -Allan
>> >>
>> >> 2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
>> >> > So you have 4 transfer threads and all 4 are waiting here:
>> >> >
>> >> >         at java.net.SocketInputStream.socketRead0(Native Method)
>> >> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>> >> >
>> >> > (from throttle.transfers=4)
>> >> >
>> >> > Is this workflow hung, and if so, how are you determining that? Do
>> >> > you have another log plot of stage-in and stage-out events?
>> >> >
>> >> > - Mike
>> >> >
>> >> >
>> >> > ----- Original Message -----
>> >> >> Fresh traces (jstack and log) are in
>> >> >> /home/aespinosa/workflows/cybershake/archive-runs/transfer-logging.
>> >> >> The Swift log is a snapshot of the workflow, which is still running.
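>> >> >>
>> >> >> The thread dumps came from the stock JDK tools, roughly:
>> >> >>
>> >> >>     jps -l               # find the Swift JVM's pid (or use ps)
>> >> >>     jstack <pid> > jstack.out
>> >> >>
>> >> >> with <pid> replaced by the actual process id.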
>> >> >>
>> >> >> -Allan
>> >> >>
>> >> >> 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
>> >> >> > One immediate question that I have is: what's up with the
>> >> >> > "deadline passed" messages?
>> >> >> >
>> >> >> > That happens when jobs run for at least twice their advertised
>> >> >> > walltime and for some reason the site doesn't seem to cancel
>> >> >> > them. This may be indicative of notifications getting lost.
>> >> >> >
>> >> >> > As for the transfers, I don't see all transfers hanging after
>> >> >> > that. I mean there are transfers that complete ok. Though things
>> >> >> > do seem to slow down quite a bit, so that looks like a problem.
>> >> >> >
>> >> >> > Let's see what's in the stack traces. In the meantime, I will see
>> >> >> > what it takes to get transfer progress messages.
>> >> >> >
>> >> >> > Mihael
>> >> >> >
>> >> >> >
>> >> >> > On Thu, 2011-04-14 at 17:28 -0500, Michael Wilde wrote:
>> >> >> >> bri$ pwd
>> >> >> >> /home/aespinosa/workflows/cybershake/archive-runs/test
>> >> >> >> bri$ ls -lt
>> >> >> >> total 1844128
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users          0 Apr 14 14:21 max-duration.tmp
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users         15 Apr 14 14:20 start-time.tmp
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users    1433206 Apr 14 14:20 stagein.event
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users    2372737 Apr 14 14:19 sort-preserve2.tmp
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users    2372737 Apr 14 14:19 sort-preserve.tmp
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users         15 Apr 14 14:19 t.inf
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users    2263727 Apr 14 12:51 stagein.transition
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users    8998897 Apr 14 12:31 stagein.log
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users      92059 Apr 14 12:05 dostageout.event
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users      97442 Apr 14 11:51 dostagein.event
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users       2998 Apr 13 17:38 dostagein.sorted-start.png
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users       3080 Apr 13 17:38 dostageout.sorted-start.png
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users       3255 Apr  8 16:05 execute2-total.png
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users    1533974 Apr  8 14:46 postproc-20110407-1438-i90jepr3.0.rlog
>> >> >> >> -rw-r--r-- 1 aespinosa ci-users 1868896768 Apr  8 14:46 postproc-20110407-1438-i90jepr3.log
>> >> >> >> drwxr-xr-x 2 aespinosa ci-users      32768 Apr  7 14:39 postproc-20110407-1438-i90jepr3.d/
>> >> >> >> bri$
>> >> >> >>
>> >> >> >> runs, not run (the directory is archive-runs, not archive-run).
>> >> >> >>
>> >> >> >> Also see bridled: /tmp/mw1
>> >> >> >>
>> >> >> >> ----- Original Message -----
>> >> >> >> > [hategan at bridled tmp]$ cd ~aespinosa/workflows/cybershake/archive-run/test/
>> >> >> >> > -bash: cd: /home/aespinosa/workflows/cybershake/archive-run/test/: No such file or directory
>> >> >> >> >
>> >> >> >> > On Thu, 2011-04-14 at 17:21 -0500, Allan Espinosa wrote:
>> >> >> >> > > ~aespinosa/workflows/cybershake/archive-run/test/postproc*.log
>> >> >> >> > >
>> >> >> >> > > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
>> >> >> >> > > > On Thu, 2011-04-14 at 15:57 -0500, Michael Wilde wrote:
>> >> >> >> > > >> While Allan continues to debug this, can you take a look
>> >> >> >> > > >> at the (huge) log?
>> >> >> >> > > >
>> >> >> >> > > > Where is this log?
>> >> >> >> > > >
>> >> >> >> > > >
>> >> >



More information about the Swift-devel mailing list