[Swift-devel] [Bug 357] Script hangs in staging on OSG

bugzilla-daemon at mcs.anl.gov bugzilla-daemon at mcs.anl.gov
Thu Apr 14 16:54:31 CDT 2011


https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=357





--- Comment #2 from Michael Wilde <wilde at mcs.anl.gov>  2011-04-14 16:54:30 ---
Log analysis shows:

The first 10 transfers that hung were:

bri$ grep START  stagein.event  | sort -n | head
1302218175.538 73795.3809998035 TEST_218_241_subfx.sgt-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
1302218175.538 73795.3809998035 TEST_218_241_subfy.sgt-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
1302218217.45 73753.4689998627 TEST_218_239_subfy.sgt-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
1302218489.313 73481.6059999466 TEST_218_258_subfy.sgt-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
1302218489.317 73481.6019999981 218_258.txt.variation-s0003-h0006-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
1302218489.321 73481.5979998112 218_258.txt.variation-s0003-h0005-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
1302218489.325 73481.5939998627 218_258.txt.variation-s0003-h0004-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
1302218489.329 73481.5899999142 218_258.txt.variation-s0003-h0003-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
1302218489.333 73481.5859999657 218_258.txt.variation-s0004-h0003-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
1302218489.341 73481.5779998302 218_258.txt.variation-s0003-h0008-gridftp.pads.ci.uchicago.edu-UMissHEP__umiss001.hep.olemiss.edu START
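
For cross-referencing with the Swift log: the first column of stagein.event
appears to be a Unix epoch time. A minimal Python sketch of the conversion,
assuming the log was written at UTC-5 (the "-0500" in the log lines); the
helper name epoch_to_log_stamp is made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the Swift log uses a fixed UTC-5 offset, as suggested by the
# "-0500" suffix on its timestamps.
LOG_TZ = timezone(timedelta(hours=-5))

def epoch_to_log_stamp(epoch_text):
    """Convert an epoch string such as '1302218175.538' to the log's format."""
    seconds, _, fraction = epoch_text.partition(".")
    # Pad or truncate the fractional part to exactly three digits (ms).
    millis = int((fraction + "000")[:3])
    moment = datetime.fromtimestamp(int(seconds), tz=LOG_TZ)
    return (moment.strftime("%Y-%m-%d %H:%M:%S")
            + ",%03d" % millis
            + moment.strftime("%z"))

print(epoch_to_log_stamp("1302218175.538"))
# prints 2011-04-07 18:16:15,538-0500
```

That stamp matches the FILE_STAGE_IN_START for TEST_218_241_subfx.sgt to
UMissHEP in the log excerpt further down.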


Events involved in the *first* file whose transfer hung (to OLEMiss) were:

(Note: it *looks* to me as if the first transfer of this file to OLEMiss was
clobbered when the job was killed by the replication timer. After that point,
transfers started hanging, so replication is a suspect in this scenario.)

ed *pr3.log
1
2011-04-07 14:39:05,314-0500 DEBUG Loader Max heap: 3817799680
/TEST_218_241_subfx.sgt
2011-04-07 18:12:16,776-0500 DEBUG vdl:execute2 JOB_START
jobid=extract-91r7kd8k tr=extract arguments=[stat=TEST, extract_sgt=1,
slon=-118.286, slat=34.0192,
rupmodfile=gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/218/241/218_241.txt.variation-s0011-h0005,
sgt_xfile=gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fx_644.sgt,
sgt_yfile=gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fy_644.sgt,
extract_sgt_xfile=gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfx.sgt,
extract_sgt_yfile=gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfy.sgt]
tmpdir=postproc-20110407-1438-i90jepr3/jobs/9/extract-91r7kd8k host=PADS
/
2011-04-07 18:12:16,777-0500 INFO  Execute Submit: in:
postproc-20110407-1438-i90jepr3 command: /bin/bash shared/_swiftwrap
extract-91r7kd8k -jobdir 9 -scratch  -e
/gpfs/pads/swift/aespinosa/science/cybershake/apps/JBSim3d/bin/jbsim3d -out
stdout.txt -err stderr.txt -i -d
gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST|gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/218/241|gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
-if
/gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fy_644.sgt|/gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fx_644.sgt|/gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/218/241/218_241.txt.variation-s0011-h0005
-of
gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfy.sgt|gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfx.sgt
-k  -cdmfile  -status provider -a stat=TEST extract_sgt=1 slon=-118.286
slat=34.0192
rupmodfile=gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/218/241/218_241.txt.variation-s0011-h0005
sgt_xfile=gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fx_644.sgt
sgt_yfile=gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fy_644.sgt
extract_sgt_xfile=gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfx.sgt
extract_sgt_yfile=gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfy.sgt
/
2011-04-07 18:12:16,777-0500 INFO  GridExec TASK_DEFINITION:
Task(type=JOB_SUBMISSION, identity=urn:0-13-155-6-1-1302205287944) is /bin/bash
shared/_swiftwrap extract-91r7kd8k -jobdir 9 -scratch  -e
/gpfs/pads/swift/aespinosa/science/cybershake/apps/JBSim3d/bin/jbsim3d -out
stdout.txt -err stderr.txt -i -d
gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST|gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/218/241|gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
-if
/gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fy_644.sgt|/gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fx_644.sgt|/gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/218/241/218_241.txt.variation-s0011-h0005
-of
gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfy.sgt|gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfx.sgt
-k  -cdmfile  -status provider -a stat=TEST extract_sgt=1 slon=-118.286
slat=34.0192
rupmodfile=gpfs/pads/swift/aespinosa/science/cybershake/RuptureVariations/218/241/218_241.txt.variation-s0011-h0005
sgt_xfile=gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fx_644.sgt
sgt_yfile=gpfs/pads/swift/aespinosa/science/cybershake/SgtFiles/TEST/TEST_fy_644.sgt
extract_sgt_xfile=gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfx.sgt
extract_sgt_yfile=gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfy.sgt
/
2011-04-07 18:13:25,080-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_START
srcname=TEST_218_241_subfx.sgt
srcdir=postproc-20110407-1438-i90jepr3/shared/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
srchost=PADS
destdir=/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
desthost=gridftp.pads.ci.uchicago.edu provider=gsiftp
/
2011-04-07 18:16:13,723-0500 DEBUG vdl:dostageout FILE_STAGE_OUT_END
srcname=TEST_218_241_subfx.sgt
srcdir=postproc-20110407-1438-i90jepr3/shared/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
srchost=PADS
destdir=/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
desthost=gridftp.pads.ci.uchicago.edu provider=gsiftp
/
2011-04-07 18:16:15,538-0500 DEBUG vdl:dostagein CDM:
gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfx.sgt
: DEFAULT
/
2011-04-07 18:16:15,538-0500 DEBUG vdl:dostageinfile FILE_STAGE_IN_START
file=TEST_218_241_subfx.sgt srchost=gridftp.pads.ci.uchicago.edu
srcdir=/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
srcname=TEST_218_241_subfx.sgt desthost=UMissHEP__umiss001.hep.olemiss.edu
destdir=postproc-20110407-1438-i90jepr3/shared/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
provider=gsiftp policy=DEFAULT
/
2011-04-07 18:16:15,924-0500 DEBUG vdl:dostagein CDM:
gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241/TEST_218_241_subfx.sgt
: DEFAULT
/
2011-04-07 18:16:15,924-0500 DEBUG vdl:dostageinfile FILE_STAGE_IN_START
file=TEST_218_241_subfx.sgt srchost=gridftp.pads.ci.uchicago.edu
srcdir=/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
srcname=TEST_218_241_subfx.sgt desthost=Nebraska__red.unl.edu
destdir=postproc-20110407-1438-i90jepr3/shared/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
provider=gsiftp policy=DEFAULT
/
2011-04-07 18:28:54,689-0500 DEBUG vdl:dostageinfile FILE_STAGE_IN_END
file=TEST_218_241_subfx.sgt srchost=gridftp.pads.ci.uchicago.edu
srcdir=/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
srcname=TEST_218_241_subfx.sgt desthost=Nebraska__red.unl.edu
destdir=postproc-20110407-1438-i90jepr3/shared/gpfs/pads/swift/aespinosa/science/cybershake/Results/TEST/218/241
provider=gsiftp
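
In the excerpt above, the stage-in to Nebraska has both a START and an END,
while the stage-in to UMissHEP has only a START. A rough Python sketch of how
such unmatched pairs could be pulled out of a log automatically. The regular
expression is based only on the field order seen in these lines (file= ...
desthost=), not on any Swift documentation, and the function name and the
abbreviated sample text are illustrative:

```python
import re

# Field order (file= first, desthost= later) is taken from the excerpt above;
# DOTALL lets a record span wrapped lines, as in this email.
EVENT = re.compile(
    r"FILE_STAGE_IN_(START|END)\s+file=(\S+).*?desthost=(\S+)", re.DOTALL)

def unfinished_stageins(log_text):
    """Return (file, desthost) pairs that have a START but no matching END."""
    pending = []
    for kind, name, host in EVENT.findall(log_text):
        key = (name, host)
        if kind == "START":
            pending.append(key)
        elif key in pending:
            pending.remove(key)  # the END cancels one earlier START
    return pending

# Abbreviated copy of the three stage-in records shown above.
sample = """\
DEBUG vdl:dostageinfile FILE_STAGE_IN_START file=TEST_218_241_subfx.sgt
srcname=TEST_218_241_subfx.sgt desthost=UMissHEP__umiss001.hep.olemiss.edu
DEBUG vdl:dostageinfile FILE_STAGE_IN_START file=TEST_218_241_subfx.sgt
srcname=TEST_218_241_subfx.sgt desthost=Nebraska__red.unl.edu
DEBUG vdl:dostageinfile FILE_STAGE_IN_END file=TEST_218_241_subfx.sgt
srcname=TEST_218_241_subfx.sgt desthost=Nebraska__red.unl.edu
"""

print(unfinished_stageins(sample))
# prints [('TEST_218_241_subfx.sgt', 'UMissHEP__umiss001.hep.olemiss.edu')]
```

A START without an END only shows that no END was logged; it does not by
itself say why the transfer stalled.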
