[Swift-devel] coaster workers not receiving enough jobs
Allan Espinosa
aespinosa at cs.uchicago.edu
Thu Jul 23 11:40:15 CDT 2009
I tried 0660-many.swift with 200 5min sleep jobs using local:local
mode (since queue on ranger and teraport takes a while to finish).
The session spawned 192 workers. Swift reports at most 36 active
processes at a time (which it finished successfully). After that
workers reach idle time exceptions. Logs and stuff are in
~aespinosa/workflows/coaster_debug/run1/
sites.xml:
<config>
<pool handle="localhost">
<gridftp url="local://localhost" />
<execution provider="coaster" jobmanager="local:local" url="localhost" />
<workdirectory
>/home/aespinosa/workflows/coaster_debug/workdir</workdirectory>
<profile namespace="karajan" key="initialScore">10000</profile>
<profile namespace="karajan" key="jobThrottle">1.98</profile>
<profile namespace="globus" key="workersPerNode">1</profile>
<profile namespace="globus" key="maxwalltime">00:05:00</profile>
<profile namespace="globus" key="maxtime">3600</profile>
</pool>
</config>
swift session:
Swift svn swift-r3011 cog-r2439
RunID: locallog
Progress:
Progress: Selecting site:198 Initializing site shared directory:1 Stage in:1
Progress: Selecting site:1 Submitting:198 Submitted:1
Progress: Selecting site:1 Submitted:198 Active:1
Progress: Selecting site:1 Submitted:192 Active:7
Progress: Selecting site:1 Submitted:188 Active:11
Progress: Selecting site:1 Submitted:181 Active:18
Progress: Selecting site:1 Submitted:178 Active:21
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:36
Progress: Selecting site:1 Submitted:163 Active:35 Checking status:1
Progress: Submitted:156 Active:35 Checking status:1 Finished successfully:8
Progress: Submitted:149 Active:34 Checking status:1 Finished successfully:16
Progress: Submitted:144 Active:35 Checking status:1 Finished successfully:20
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:134 Active:30 Finished successfully:36
Progress: Submitted:133 Active:31 Finished successfully:36
Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost
Failed to transfer wrapper log from 066-many-locallog/info/l on localhost
Failed to transfer wrapper log from 066-many-locallog/info/k on localhost
Failed to transfer wrapper log from 066-many-locallog/info/n on localhost
Failed to transfer wrapper log from 066-many-locallog/info/o on localhost
Failed to transfer wrapper log from 066-many-locallog/info/q on localhost
ailed to transfer wrapper log from 066-many-locallog/info/c on localhost
Failed to transfer wrapper log from 066-many-locallog/info/m on localhost
Failed to transfer wrapper log from 066-many-locallog/info/i on localhost
Failed to transfer wrapper log from 066-many-locallog/info/p on localhost
Failed to transfer wrapper log from 066-many-locallog/info/a on localhost
Progress: Stage in:11 Submitting:34 Submitted:113 Active:6
Finished successfully:36
Progress: Submitted:157 Active:7 Finished successfully:36
Failed to transfer wrapper log from 066-many-locallog/info/t on localhost
Failed to transfer wrapper log from 066-many-locallog/info/u on localhost
Failed to transfer wrapper log from 066-many-locallog/info/v on localhost
Failed to transfer wrapper log from 066-many-locallog/info/x on localhost
Failed to transfer wrapper log from 066-many-locallog/info/r on localhost
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
Progress: Submitted:163 Active:1 Finished successfully:36
...
... (not yet finished)
$grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail
65 2009-07-23 11:08:10,065-0500 DEBUG TaskImpl
Task(type=JOB_SUBMISSION,
identity=urn:1248364974288-1248364979260-1248364979261) setting status
to Active workerid=000055
66 2009-07-23 11:08:10,090-0500 DEBUG TaskImpl
Task(type=JOB_SUBMISSION,
identity=urn:1248364974280-1248364979248-1248364979249) setting status
to Active workerid=000051
$ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail
61 worker-0723-021156-000001.log:1248365290 000054 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
62 worker-0723-021156-000001.log:1248365290 000050 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
63 worker-0723-021156-000001.log:1248365290 000053 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
64 worker-0723-021156-000001.log:1248365290 000052 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
65 worker-0723-021156-000001.log:1248365290 000051 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
66 worker-0723-021156-000001.log:1248365290 000055 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
it corresponds correctly with the swift session (more or less) since
we had 30+ completed jobs.
Some lines in coasters.log i find intersting:
2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1248364974290-1248364979263-1248364979264) setting status
to Submitted
2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1248364974290-1248364979263-1248364979264) setting status
to Active
2009-07-23 11:12:06,065-0500 INFO Command Sending Command(106,
JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
2009-07-23 11:12:06,065-0500 INFO Command Command(106, JOBSTATUS)
CMD: Command(106, JOBSTATUS)
2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1248364974290-1248364979263-1248364979264) setting status
to Failed Block ta
sk failed: 0723-021156-000001Block task ended prematurely
Statement unlikely to be reached at
/home/aespinosa/.globus/coasters/cscript15423.pl line 580.
(Maybe you meant system() when you said exec()?)
2009-07-23 11:12:06,065-0500 INFO Command Sending Command(107,
JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
2009-07-23 11:12:06,065-0500 INFO Command Command(107, JOBSTATUS)
CMD: Command(107, JOBSTATUS)
-Allan
More information about the Swift-devel
mailing list