[Swift-user] Re: workflow 'pauses' after a few thousand jobs

Allan Espinosa aespinosa at cs.uchicago.edu
Mon Jun 14 13:59:26 CDT 2010


The log was too large to attach.  It is currently located at
http://www.ci.uchicago.edu/~aespinosa/swift/debug/postproc-prod_pads2.log.bz2

-Allan

2010/6/14 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> I attached the coaster monitoring screenshot and the logfile.  Below is the snippet of the logfile from the point where the workflow "hangs":
>
> 2010-06-13 16:53:55,077-0500 DEBUG vdl:dostagein CDM: file://localhost/LGU/LGU_fy_664.sgt : DIRECT
> 2010-06-13 16:53:55,077-0500 DEBUG vdl:dostageinfile FILE_STAGE_IN_START file=124_400.txt.variation-s0013-h0006 srchost=localhost srcdir=124/400 srcname=124_400.txt.variation-s0013-h0006 desthost=FIREFLY destdir=postproc-LGU-firefly_coast1/shared/124/400 provider=file policy=DIRECT
> 2010-06-13 16:53:55,077-0500 DEBUG vdl:dostageinfile FILE_STAGE_IN_SKIP file=124_400.txt.variation-s0013-h0006 policy=DIRECT
> 2010-06-13 16:53:55,077-0500 DEBUG vdl:dostageinfile FILE_STAGE_IN_END file=124_400.txt.variation-s0013-h0006 srchost=localhost srcdir=124/400 srcname=124_400.txt.variation-s0013-h0006 desthost=FIREFLY destdir=postproc-LGU-firefly_coast1/shared/124/400 provider=file
> 2010-06-13 16:53:55,077-0500 INFO  vdl:dostagein END jobid=extract-6qrl9dtj - Staging in finished
> 2010-06-13 16:53:55,077-0500 DEBUG vdl:execute2 JOB_START jobid=extract-6qrl9dtj tr=extract arguments=[stat=LGU, extract_sgt=1, slon=-119.06587, slat=34.10819, rupmodfile=124/400/124_400.txt.variation-s0013-h0006, sgt_xfile=LGU/LGU_fx_664.sgt, sgt_yfile=LGU/LGU_fy_664.sgt, extract_sgt_xfile=panfs/panasas/CMS/data/engage-aespinosa/swift/LGU/124/400/LGU_124_400_subfx.sgt, extract_sgt_yfile=panfs/panasas/CMS/data/engage-aespinosa/swift/LGU/124/400/LGU_124_400_subfy.sgt] tmpdir=postproc-LGU-firefly_coast1/jobs/6/extract-6qrl9dtj host=FIREFLY
> 2010-06-13 16:53:55,078-0500 INFO  Execute jobid=extract-6qrl9dtj task=Task(type=JOB_SUBMISSION, identity=urn:0-12-846-6-1-1276453892896)
> 2010-06-13 16:53:55,079-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-12-846-6-1-1276453892896) setting status to Submitting
> 2010-06-13 16:53:55,080-0500 INFO  Command Sending Command(14283, SUBMITJOB) on GSSSChannel-11535022945(1)
> 2010-06-13 16:53:55,186-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REPL: Command(14283, SUBMITJOB)
> 2010-06-13 16:53:55,186-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-12-846-6-1-1276453892896) setting status to Submitted
> 2010-06-13 16:53:55,186-0500 INFO  JobSubmissionTaskHandler Submitted task Task(type=JOB_SUBMISSION, identity=urn:0-12-846-6-1-1276453892896). Job id: urn:1276453892896-1276453772841-1276453772842
>
> (At this point there are several hundred jobs queued)
>
> 2010-06-13 16:53:55,186-0500 INFO  AbstractKarajanChannel Unregistering Command(14283, SUBMITJOB)
> 2010-06-13 16:53:58,179-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-06-13 16:54:04,831-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
> 2010-06-13 16:54:04,834-0500 INFO  BQPStatusHandler Process BQP status update 1
> 2010-06-13 16:54:04,834-0500 INFO  BQPStatusHandler Process BQP status update 2
> 2010-06-13 16:54:08,201-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-06-13 16:54:14,893-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
> 2010-06-13 16:54:14,894-0500 INFO  AbstractStreamKarajanChannel Sender 16493645 queue size: 0
> 2010-06-13 16:54:14,897-0500 INFO  BQPStatusHandler Process BQP status update 1
> 2010-06-13 16:54:14,897-0500 INFO  BQPStatusHandler Process BQP status update 2
> 2010-06-13 16:54:18,221-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-06-13 16:54:25,019-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
> 2010-06-13 16:54:25,019-0500 INFO  AbstractStreamKarajanChannel Sender 16493645 queue size: 0
> 2010-06-13 16:54:25,022-0500 INFO  BQPStatusHandler Process BQP status update 1
> 2010-06-13 16:54:25,022-0500 INFO  BQPStatusHandler Process BQP status update 2
> 2010-06-13 16:54:28,243-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-06-13 16:54:35,131-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
> 2010-06-13 16:54:35,131-0500 INFO  AbstractStreamKarajanChannel Sender 16493645 queue size: 0
> 2010-06-13 16:54:35,134-0500 INFO  BQPStatusHandler Process BQP status update 1
> 2010-06-13 16:54:35,134-0500 INFO  BQPStatusHandler Process BQP status update 2
> 2010-06-13 16:54:38,245-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-06-13 16:54:45,208-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
> ...
> ...
> ...
> queue size: 0
> 2010-06-14 13:48:07,067-0500 INFO  BQPStatusHandler Process BQP status update 1
> 2010-06-14 13:48:07,067-0500 INFO  BQPStatusHandler Process BQP status update 2
> 2010-06-14 13:48:09,623-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
> 2010-06-14 13:48:14,968-0500 INFO  Command Sending Command(10504, SHUTDOWNSERVICE) on GSSSChannel-1651098589(3)
> 2010-06-14 13:48:14,968-0500 INFO  AbstractStreamKarajanChannel Sender 15595149 queue size: 0
> 2010-06-14 13:48:15,089-0500 INFO  AbstractKarajanChannel GSSSChannel-1651098589(3) REPL: Command(10504, SHUTDOWNSERVICE)
> 2010-06-14 13:48:15,089-0500 INFO  AbstractKarajanChannel Unregistering Command(10504, SHUTDOWNSERVICE)
> -- I shut down the workflow via Ctrl-C --
>
> sites.xml:
>  <pool handle="FIREFLY">
>    <execution provider="coaster" url="ff-grid.unl.edu" jobmanager="gt2:gt2:pbs"
>      />
>
>    <profile namespace="globus" key="maxTime">86400</profile>
>    <profile namespace="globus" key="maxNodes">5300</profile>
>    <profile namespace="globus" key="spread">0.8</profile>
>    <profile namespace="globus" key="slots">10</profile>
>    <profile namespace="globus" key="remoteMonitorEnabled">true</profile>
>
>    <profile namespace="karajan" key="initialScore">1500.0</profile>
>    <profile namespace="karajan" key="jobThrottle">53.00</profile>
>
>    <gridftp  url="gsiftp://ff-grid2.unl.edu"/>
>    <workdirectory>/panfs/panasas/CMS/data/engage-aespinosa/swift</workdirectory>
>  </pool>
>
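
For anyone digging through the full log, here is a rough sketch of one way to find where the run goes quiet. It skips the periodic status chatter and reports the longest gap between the remaining lines. The list of "heartbeat" markers is an assumption taken only from the excerpt above, and longest_quiet_gap is a hypothetical helper, not part of Swift.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the excerpt, e.g. "2010-06-13 16:53:55,077-0500".
# Milliseconds and the UTC offset are matched but ignored.
TS = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}[+-]\d{4}")

# Substrings treated as periodic chatter rather than job progress.
# This list is an assumption based only on the lines quoted in this thread.
HEARTBEAT = ("BQPSTATUS", "BQPStatusHandler", "No streams", "queue size")


def longest_quiet_gap(lines):
    """Return (gap, start, end) for the longest interval between two
    consecutive non-heartbeat log lines, or None if fewer than two exist."""
    prev = None
    best = None
    for line in lines:
        line = line.lstrip("> ")  # tolerate mail-quoted log lines
        m = TS.match(line)
        if not m or any(tok in line for tok in HEARTBEAT):
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if prev is not None and (best is None or ts - prev > best[0]):
            best = (ts - prev, prev, ts)
        prev = ts
    return best
```

To run it on the downloaded log, pass it the open file, for example the object returned by bz2.open("postproc-prod_pads2.log.bz2", "rt"). On the excerpt above it should point at the interval between the last SUBMITJOB line and the SHUTDOWNSERVICE command.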
