[Swift-user] workflow 'pauses' after a few thousand jobs

Allan Espinosa aespinosa at cs.uchicago.edu
Mon Jun 14 13:56:01 CDT 2010


I attached the coaster monitoring screenshot and the logfile. Below is a snippet of the logfile from the point where the workflow "hangs":

2010-06-13 16:53:55,077-0500 DEBUG vdl:dostagein CDM: file://localhost/LGU/LGU_fy_664.sgt : DIRECT
2010-06-13 16:53:55,077-0500 DEBUG vdl:dostageinfile FILE_STAGE_IN_START file=124_400.txt.variation-s0013-h0006 srchost=localhost srcdir=124/400 srcname=124_400.txt.variation-s0013-h0006 desthost=FIREFLY destdir=postproc-LGU-firefly_coast1/shared/124/400 provider=file policy=DIRECT
2010-06-13 16:53:55,077-0500 DEBUG vdl:dostageinfile FILE_STAGE_IN_SKIP file=124_400.txt.variation-s0013-h0006 policy=DIRECT
2010-06-13 16:53:55,077-0500 DEBUG vdl:dostageinfile FILE_STAGE_IN_END file=124_400.txt.variation-s0013-h0006 srchost=localhost srcdir=124/400 srcname=124_400.txt.variation-s0013-h0006 desthost=FIREFLY destdir=postproc-LGU-firefly_coast1/shared/124/400 provider=file
2010-06-13 16:53:55,077-0500 INFO  vdl:dostagein END jobid=extract-6qrl9dtj - Staging in finished
2010-06-13 16:53:55,077-0500 DEBUG vdl:execute2 JOB_START jobid=extract-6qrl9dtj tr=extract arguments=[stat=LGU, extract_sgt=1, slon=-119.06587, slat=34.10819, rupmodfile=124/400/124_400.txt.variation-s0013-h0006, sgt_xfile=LGU/LGU_fx_664.sgt, sgt_yfile=LGU/LGU_fy_664.sgt, extract_sgt_xfile=panfs/panasas/CMS/data/engage-aespinosa/swift/LGU/124/400/LGU_124_400_subfx.sgt, extract_sgt_yfile=panfs/panasas/CMS/data/engage-aespinosa/swift/LGU/124/400/LGU_124_400_subfy.sgt] tmpdir=postproc-LGU-firefly_coast1/jobs/6/extract-6qrl9dtj host=FIREFLY
2010-06-13 16:53:55,078-0500 INFO  Execute jobid=extract-6qrl9dtj task=Task(type=JOB_SUBMISSION, identity=urn:0-12-846-6-1-1276453892896)
2010-06-13 16:53:55,079-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-12-846-6-1-1276453892896) setting status to Submitting
2010-06-13 16:53:55,080-0500 INFO  Command Sending Command(14283, SUBMITJOB) on GSSSChannel-11535022945(1)
2010-06-13 16:53:55,186-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REPL: Command(14283, SUBMITJOB)
2010-06-13 16:53:55,186-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-12-846-6-1-1276453892896) setting status to Submitted
2010-06-13 16:53:55,186-0500 INFO  JobSubmissionTaskHandler Submitted task Task(type=JOB_SUBMISSION, identity=urn:0-12-846-6-1-1276453892896). Job id: urn:1276453892896-1276453772841-1276453772842

(At this point there are several hundred jobs queued)

2010-06-13 16:53:55,186-0500 INFO  AbstractKarajanChannel Unregistering Command(14283, SUBMITJOB)
2010-06-13 16:53:58,179-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-06-13 16:54:04,831-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
2010-06-13 16:54:04,834-0500 INFO  BQPStatusHandler Process BQP status update 1
2010-06-13 16:54:04,834-0500 INFO  BQPStatusHandler Process BQP status update 2
2010-06-13 16:54:08,201-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-06-13 16:54:14,893-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
2010-06-13 16:54:14,894-0500 INFO  AbstractStreamKarajanChannel Sender 16493645 queue size: 0
2010-06-13 16:54:14,897-0500 INFO  BQPStatusHandler Process BQP status update 1
2010-06-13 16:54:14,897-0500 INFO  BQPStatusHandler Process BQP status update 2
2010-06-13 16:54:18,221-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-06-13 16:54:25,019-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
2010-06-13 16:54:25,019-0500 INFO  AbstractStreamKarajanChannel Sender 16493645 queue size: 0
2010-06-13 16:54:25,022-0500 INFO  BQPStatusHandler Process BQP status update 1
2010-06-13 16:54:25,022-0500 INFO  BQPStatusHandler Process BQP status update 2
2010-06-13 16:54:28,243-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-06-13 16:54:35,131-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
2010-06-13 16:54:35,131-0500 INFO  AbstractStreamKarajanChannel Sender 16493645 queue size: 0
2010-06-13 16:54:35,134-0500 INFO  BQPStatusHandler Process BQP status update 1
2010-06-13 16:54:35,134-0500 INFO  BQPStatusHandler Process BQP status update 2
2010-06-13 16:54:38,245-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-06-13 16:54:45,208-0500 INFO  AbstractKarajanChannel GSSSChannel-11535022945(1) REQ: Handler(BQPSTATUS)
...
...
...
queue size: 0
2010-06-14 13:48:07,067-0500 INFO  BQPStatusHandler Process BQP status update 1
2010-06-14 13:48:07,067-0500 INFO  BQPStatusHandler Process BQP status update 2
2010-06-14 13:48:09,623-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-06-14 13:48:14,968-0500 INFO  Command Sending Command(10504, SHUTDOWNSERVICE) on GSSSChannel-1651098589(3)
2010-06-14 13:48:14,968-0500 INFO  AbstractStreamKarajanChannel Sender 15595149 queue size: 0
2010-06-14 13:48:15,089-0500 INFO  AbstractKarajanChannel GSSSChannel-1651098589(3) REPL: Command(10504, SHUTDOWNSERVICE)
2010-06-14 13:48:15,089-0500 INFO  AbstractKarajanChannel Unregistering Command(10504, SHUTDOWNSERVICE)
--- I shut down the workflow via Ctrl-C ---
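
To put a number on the stall, here is a minimal sketch that measures how long the log keeps running after the last task status change. It assumes every log line begins with a timestamp in the format shown above and that "setting status to" is a reasonable marker of real job activity; the function names are made up for illustration and are not part of Swift.

```python
from datetime import datetime

# Timestamp layout as it appears in the excerpt: 2010-06-13 16:53:55,077-0500
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f%z"
TS_LENGTH = 28  # number of characters occupied by the timestamp


def parse_ts(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def idle_gap(lines, marker="setting status to"):
    """Time between the last line containing `marker` and the last
    timestamped line. Returns None if either is missing.

    `marker` is an assumption about what counts as job activity.
    """
    last_activity = None
    last_seen = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip elided or non-log lines
        last_seen = ts
        if marker in line:
            last_activity = ts
    if last_activity is None or last_seen is None:
        return None
    return last_seen - last_activity
```

Passing an open log file to `idle_gap` returns a `timedelta`. For the excerpt above, the last status change is at 2010-06-13 16:53:55,186 and the last line is at 2010-06-14 13:48:15,089, so the gap is a little under 21 hours.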

sites.xml:
  <pool handle="FIREFLY">
    <execution provider="coaster" url="ff-grid.unl.edu" jobmanager="gt2:gt2:pbs"
      />

    <profile namespace="globus" key="maxTime">86400</profile>
    <profile namespace="globus" key="maxNodes">5300</profile>
    <profile namespace="globus" key="spread">0.8</profile>
    <profile namespace="globus" key="slots">10</profile>
    <profile namespace="globus" key="remoteMonitorEnabled">true</profile>

    <profile namespace="karajan" key="initialScore">1500.0</profile>
    <profile namespace="karajan" key="jobThrottle">53.00</profile>

    <gridftp  url="gsiftp://ff-grid2.unl.edu"/>
    <workdirectory>/panfs/panasas/CMS/data/engage-aespinosa/swift</workdirectory>
  </pool>
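
For reference, a rough sketch of how the jobThrottle value above relates to the number of concurrent jobs. It assumes the rule of thumb from the Swift user guide, as I understand it, that a site may run up to jobThrottle * 100 + 1 jobs at once; the function name is made up for illustration.

```python
def max_concurrent_jobs(job_throttle):
    """Approximate concurrent job limit for a site.

    Assumption: limit = jobThrottle * 100 + 1 (rule of thumb, not
    taken from this thread). round() guards against floating-point
    error in values such as 2.55.
    """
    return int(round(job_throttle * 100)) + 1


# jobThrottle from the sites.xml entry above
firefly_limit = max_concurrent_jobs(53.00)
```

Under that assumption, the limit for FIREFLY is 5301 jobs, which lines up with maxNodes=5300.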


-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: postproc-LGU-firefly_coast1.png
Type: image/png
Size: 7438 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100614/04074d50/attachment.png>

