[Swift-devel] Re: Please review and advise on: Bug 357 - Script hangs in staging on OSG
Allan Espinosa
aespinosa at cs.uchicago.edu
Fri Apr 15 12:14:29 CDT 2011
Hi Mike,
here's the plot of the stage-in events from yesterday's run, showing
"almost" the same behavior. I attached the raw data and an SVG
version of the plot as well. (The plots were produced with R instead
of the regular swift-log-processing tools.)
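The extraction behind the plot and the raw data boils down to pairing
START and END entries per file. Here is a minimal sketch of the idea
in Java (the actual processing was done in R; the FILE_STAGE_IN_START
/ FILE_STAGE_IN_END tags and the srcname= key are assumptions
patterned on the FILE_STAGE_OUT_START entries quoted below):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class StageinEvents {
    public static void main(String[] args) throws Exception {
        // srcname -> timestamp of its FILE_STAGE_IN_START entry
        Map<String, String> open = new LinkedHashMap<String, String>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains("FILE_STAGE_IN_START")) {
                // keep the timestamp prefix, e.g. "2011-04-14 17:50:20,975"
                open.put(srcname(line), line.substring(0, 23));
            } else if (line.contains("FILE_STAGE_IN_END")) {
                open.remove(srcname(line));  // completed transfer
            }
        }
        in.close();
        // Anything left never saw an END event; these are the transfers
        // that show up with "START" in the last column of the raw data.
        for (Map.Entry<String, String> e : open.entrySet())
            System.out.println(e.getValue() + " " + e.getKey() + " START");
    }

    // Pull the srcname=... value out of a key=value log line.
    private static String srcname(String line) {
        int i = line.indexOf("srcname=") + 8;
        int j = line.indexOf(' ', i);
        return j < 0 ? line.substring(i) : line.substring(i, j);
    }
}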
Also, looking at the transfers that did not finish (the ones with
"START" in the last column of the raw data), I tested two GridFTP
servers today:
Ranger to top.ucr.edu
$ globus-url-copy -vb
gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/218_260.txt.variation-s0000-h0000
gsiftp://top.ucr.edu/data/down/osg_data/engage/scec/testfile
Source: gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/
Dest: gsiftp://top.ucr.edu/data/down/osg_data/engage/scec/
218_260.txt.variation-s0000-h0000 -> testfile
So this site looks like it's working all right. Here's the last
activity involving top.ucr.edu in the log:
2011-04-14 17:50:20,975-0500 DEBUG Reply read 1st line
2011-04-14 17:50:20,976-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_START srcname=PeakVals_TEST_219_258_32.bsa srcdir=postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 srchost=UCR-HEP__top.ucr.edu destdir=/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 desthost=gridftp.ranger.tacc.teragrid.org provider=gsiftp
2011-04-14 17:50:20,978-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_START srcname=Seismogram_TEST_219_258_35.grm srcdir=postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 srchost=UCR-HEP__top.ucr.edu destdir=/scratch/01035/tg802895/science/cybershake/Results/TEST/219/258 desthost=gridftp.ranger.tacc.teragrid.org provider=gsiftp
----comment-----
I am not sure whether these gsiftp ABOR calls are associated with the top.ucr.edu site.
----comment-----
2011-04-14 17:50:21,006-0500 DEBUG Reply 1st line: 350 OK. Send RNTO with destination name.
2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel received: 350 OK. Send RNTO with destination name.
2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel sending: ABOR
2011-04-14 17:50:21,006-0500 DEBUG Reply read 1st line
2011-04-14 17:50:21,006-0500 DEBUG Reply 1st line: 350 OK. Send RNTO with destination name.
2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel received: 350 OK. Send RNTO with destination name.
2011-04-14 17:50:21,006-0500 DEBUG FTPControlChannel Control channel sending: ABOR
2011-04-14 17:50:21,006-0500 DEBUG Reply read 1st line
2011-04-14 17:50:21,068-0500 DEBUG Reply 1st line: 226 Abort successful
[aespinosa at communicado shared]$ MB/sec avg 2.39 MB/sec inst
Ranger to Clemson:
$ globus-url-copy -vb
gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/218_260.txt.variation-s0000-h0000
gsiftp://osg-gw.clemson.edu/common1/osg/data/engage/scec/testfile
Source: gsiftp://gridftp.ranger.tacc.teragrid.org/scratch/projects/tg/tera3d/CyberShake2007/ruptures/RuptureVariations_35_V2_3/218/260/
Dest: gsiftp://osg-gw.clemson.edu/common1/osg/data/engage/scec/
218_260.txt.variation-s0000-h0000 -> testfile
error: globus_xio: Unable to connect to osg-gw.clemson.edu:2811
globus_xio: System error in connect: Connection refused
globus_xio: A system call failed: Connection refused
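A plain TCP probe is enough to reproduce this particular failure mode
without globus-url-copy. A hypothetical quick check (the host and the
standard GridFTP control port 2811 are taken from the error above):

import java.net.InetSocketAddress;
import java.net.Socket;

public class ProbeGridFtp {
    public static void main(String[] args) throws Exception {
        // "Connection refused" here matches the globus_xio error;
        // a timeout instead would point at a firewall drop.
        Socket s = new Socket();
        try {
            s.connect(new InetSocketAddress("osg-gw.clemson.edu", 2811), 5000);
            System.out.println("control port open");
        } finally {
            s.close();
        }
    }
}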
It looks like the Swift log recorded an exception for the Clemson resource:
2011-04-14 17:17:16,198-0500 DEBUG Reply end reached
2011-04-14 17:17:16,198-0500 DEBUG FTPControlChannel Control channel received: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
500-globus_l_gfs_file_open failed.
500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
500-globus_xio_register_open failed.
500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
500-Unable to open file
/common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
500-System error in open: No such file or directory
500-globus_xio: A system call failed: No such file or directory
500 End.
2011-04-14 17:17:16,198-0500 DEBUG Local first reply bad: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
500-globus_l_gfs_file_open failed.
500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
500-globus_xio_register_open failed.
500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
500-Unable to open file
/common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
500-System error in open: No such file or directory
500-globus_xio: A system call failed: No such file or directory
500 End.
2011-04-14 17:17:16,198-0500 DEBUG Local category: 5
2011-04-14 17:17:16,199-0500 DEBUG FTPControlChannel slept 200
2011-04-14 17:17:16,202-0500 DEBUG TransferState intercepted exception
org.globus.ftp.exception.ServerException: Server refused performing
the request. Custom message: (error code 1) [Nested exception
message: Custom message: Unexpected reply: 500-Command failed. :
globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
500-globus_l_gfs_file_open failed.
500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
500-globus_xio_register_open failed.
500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
500-Unable to open file
/common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
500-System error in open: No such file or directory
500-globus_xio: A system call failed: No such file or directory
500 End.]. Nested exception is
org.globus.ftp.exception.UnexpectedReplyCodeException: Custom
message: Unexpected reply: 500-Command failed. :
globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
500-globus_l_gfs_file_open failed.
500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
500-globus_xio_register_open failed.
500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
500-Unable to open file
/common1/osg/data/engage/scec/swift_scratch/postproc-20110414-1604-bzxryead/shared/scratch/01035/tg802895/science/cybershake/Results/TEST/219/245/PeakVals_TEST_219_245_126.bsa
500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
500-System error in open: No such file or directory
500-globus_xio: A system call failed: No such file or directory
500 End.
at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:195)
at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1479)
at org.globus.ftp.FTPClient.transfer(FTPClient.java:1378)
at org.globus.io.urlcopy.UrlCopy.thirdPartyTransfer(UrlCopy.java:739)
at org.globus.io.urlcopy.UrlCopy.copy(UrlCopy.java:493)
at org.globus.io.urlcopy.UrlCopy.run(UrlCopy.java:445)
at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doThirdPartyTransfer(DelegatedFileTransferHandler.java:571)
at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:465)
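So the server-side open() fails because the destination directory
under swift_scratch does not exist on the Clemson side. Here is a
hedged sketch of a guard one could run with the jglobus client before
retrying (EnsureRemoteDir is hypothetical, not Swift's actual
stage-out code, and a real version should inspect the reply code
instead of swallowing every ServerException):

import org.globus.ftp.GridFTPClient;
import org.globus.ftp.exception.ServerException;

public class EnsureRemoteDir {
    // Walk the path and issue MKD for each component; a directory that
    // already exists makes the server reply 550, which we ignore here.
    public static void ensureDir(GridFTPClient client, String path) throws Exception {
        StringBuilder prefix = new StringBuilder();
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;
            prefix.append('/').append(part);
            try {
                client.makeDir(prefix.toString());
            } catch (ServerException probablyExists) {
                // most likely "550 ... exists"; see caveat above
            }
        }
    }
}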
I expected the Swift terminal session to report a "Failed" job, but
as the run stands now it just shows:
Progress: Stage in:79 Stage out:232 Finished successfully:130
Progress: Stage in:79 Stage out:232 Finished successfully:130
Progress: Stage in:79 Stage out:232 Finished successfully:130
Progress: Stage in:79 Stage out:232 Finished successfully:130
Progress: Stage in:79 Stage out:232 Finished successfully:130
By the way, I have lazy.errors=true.
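For reference, the relevant swift.properties settings on this run, as
far as this thread has mentioned them (everything else is at its
default):

lazy.errors=true
throttle.transfers=4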
2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
> Allan, what I meant was: do you have any evidence that this current run is hung (either in a manner similar to the one we looked at closely this morning, or in a different one)?
>
> In this morning's log, you could tell from plots of stage-in and stage-out events that many of these events were not completing after something triggered an error.
>
> Do you have similar plots or evidence of hangs regarding this run and its log?
>
> I don't know from browsing the traces whether one would *naturally* expect the transfer threads to all be waiting on input sockets most of the time, or whether seeing all 4 threads waiting on sockets indicates that data transfer is totally hung.
>
> Mihael, I assume you can tell much more from these traces?
>
> - Mike
>
>
> ----- Original Message -----
>> Right now the logs only give out messages about
>> AbstractKarajanStreamChannel. I set the org.globus.ftp package's
>> logging level to DEBUG, so entries should show up if there are
>> transfers being made.
>>
>> -Allan
>>
>> 2011/4/14 Michael Wilde <wilde at mcs.anl.gov>:
>> > So you have 4 transfer threads and all 4 are waiting here:
>> >
>> > at java.net.SocketInputStream.socketRead0(Native Method)
>> > at
>> > java.net.SocketInputStream.read(SocketInputStream.java:129)
>> >
>> > (from throttle.transfers=4)
>> >
>> > Is this workflow hung, and if so, how are you determining that? Do
>> > you have another log plot of stage-in and stage-out?
>> >
>> > - Mike
>> >
>> >
>> > ----- Original Message -----
>> >> Fresh traces (jstack and log) in
>> >> /home/aespinosa/workflows/cybershake/archive-runs/transfer-logging
>> >> .
>> >> The swift log is a snapshot of the workflow that is still running.
>> >>
>> >> -Allan
>> >>
>> >> 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
>> >> > One immediate question that I have is: what's up with the
>> >> > "deadline passed" messages?
>> >> >
>> >> > That happens when jobs run for at least twice their advertised
>> >> > walltime and for some reason the site doesn't seem to cancel
>> >> > them. This may be indicative of notifications getting lost.
>> >> >
>> >> > As for the transfers, I don't see all transfers hanging after
>> >> > that. I mean there are transfers that complete OK. Though things
>> >> > do seem to slow down quite a bit, so that looks like a problem.
>> >> >
>> >> > Let's see what's in the stack traces. In the meantime, I will
>> >> > see what it takes to get transfer progress messages.
>> >> >
>> >> > Mihael
>> >> >
>> >> >
>> >> > On Thu, 2011-04-14 at 17:28 -0500, Michael Wilde wrote:
>> >> >> bri$ pwd
>> >> >> /home/aespinosa/workflows/cybershake/archive-runs/test
>> >> >> bri$ ls -lt
>> >> >> total 1844128
>> >> >> -rw-r--r-- 1 aespinosa ci-users          0 Apr 14 14:21 max-duration.tmp
>> >> >> -rw-r--r-- 1 aespinosa ci-users         15 Apr 14 14:20 start-time.tmp
>> >> >> -rw-r--r-- 1 aespinosa ci-users    1433206 Apr 14 14:20 stagein.event
>> >> >> -rw-r--r-- 1 aespinosa ci-users    2372737 Apr 14 14:19 sort-preserve2.tmp
>> >> >> -rw-r--r-- 1 aespinosa ci-users    2372737 Apr 14 14:19 sort-preserve.tmp
>> >> >> -rw-r--r-- 1 aespinosa ci-users         15 Apr 14 14:19 t.inf
>> >> >> -rw-r--r-- 1 aespinosa ci-users    2263727 Apr 14 12:51 stagein.transition
>> >> >> -rw-r--r-- 1 aespinosa ci-users    8998897 Apr 14 12:31 stagein.log
>> >> >> -rw-r--r-- 1 aespinosa ci-users      92059 Apr 14 12:05 dostageout.event
>> >> >> -rw-r--r-- 1 aespinosa ci-users      97442 Apr 14 11:51 dostagein.event
>> >> >> -rw-r--r-- 1 aespinosa ci-users       2998 Apr 13 17:38 dostagein.sorted-start.png
>> >> >> -rw-r--r-- 1 aespinosa ci-users       3080 Apr 13 17:38 dostageout.sorted-start.png
>> >> >> -rw-r--r-- 1 aespinosa ci-users       3255 Apr  8 16:05 execute2-total.png
>> >> >> -rw-r--r-- 1 aespinosa ci-users    1533974 Apr  8 14:46 postproc-20110407-1438-i90jepr3.0.rlog
>> >> >> -rw-r--r-- 1 aespinosa ci-users 1868896768 Apr  8 14:46 postproc-20110407-1438-i90jepr3.log
>> >> >> drwxr-xr-x 2 aespinosa ci-users      32768 Apr  7 14:39 postproc-20110407-1438-i90jepr3.d/
>> >> >> bri$
>> >> >>
>> >> >> It's archive-runs, not archive-run.
>> >> >>
>> >> >> Also see bridled: /tmp/mw1
>> >> >>
>> >> >> ----- Original Message -----
>> >> >> > [hategan at bridled tmp]$ cd ~aespinosa/workflows/cybershake/archive-run/test/
>> >> >> > -bash: cd: /home/aespinosa/workflows/cybershake/archive-run/test/: No such file or directory
>> >> >> >
>> >> >> > On Thu, 2011-04-14 at 17:21 -0500, Allan Espinosa wrote:
>> >> >> > > ~aespinosa/workflows/cybershake/archive-run/test/postproc*.log
>> >> >> > >
>> >> >> > > 2011/4/14 Mihael Hategan <hategan at mcs.anl.gov>:
>> >> >> > > > On Thu, 2011-04-14 at 15:57 -0500, Michael Wilde wrote:
>> >> >> > > >> While Allan continues to debug this, can you take a
>> >> >> > > >> look at the (huge) log?
>> >> >> > > >
>> >> >> > > > Where is this log?
>> >> >> > > >
>> >> >> > > >
>> >
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>
>
--
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stageinfile.png
Type: image/png
Size: 3788 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110415/ea2354e4/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stageinfile.svg.bz2
Type: application/x-bzip2
Size: 15786 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110415/ea2354e4/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stageinfile.event.bz2
Type: application/x-bzip2
Size: 26018 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110415/ea2354e4/attachment-0001.bin>