[Swift-devel] Re: coaster workers not receiving enough jobs
Mihael Hategan
hategan at mcs.anl.gov
Tue Jul 28 16:12:22 CDT 2009
Right. I put in some quick code to prevent another spin. That likely has
some bugs that causes this problem.
On Tue, 2009-07-28 at 16:06 -0500, Allan Espinosa wrote:
> Now i tried changing 5mins to 5secs. based on the worker ids with the
> pull() calls it looks like all the jobs were successfully assigned to
> all workers throughout the blocks.
>
> [aespinosa at communicado coasters]$ grep pull coasters.log | grep -v
> Later | awk '{print $5}' | sort -u | cat -n | tail
> 65 0728-590322-000001:58
> 66 0728-590322-000001:59
> 67 0728-590322-000001:6
> 68 0728-590322-000001:60
> 69 0728-590322-000001:61
> 70 0728-590322-000001:62
> 71 0728-590322-000001:63
> 72 0728-590322-000001:7
> 73 0728-590322-000001:8
> 74 0728-590322-000001:9
>
>
> My guess is that at long jobs (5 mins), pull() timeouts while waiting
> and will only get assigned much later on. But this doesn't happen
> because of some timeout mechanisms too (i think).
>
> 2009/7/23 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> > I tried 0660-many.swift with 200 5min sleep jobs using local:local
> > mode (since queue on ranger and teraport takes a while to finish).
> > The session spawned 192 workers. Swift reports at most 36 active
> > processes at a time (which it finished successfully). After that
> > workers reach idle time exceptions. Logs and stuff are in
> > ~aespinosa/workflows/coaster_debug/run1/
> >
> > sites.xml:
> > <config>
> >
> > <pool handle="localhost">
> > <gridftp url="local://localhost" />
> > <execution provider="coaster" jobmanager="local:local" url="localhost" />
> > <workdirectory
> >>/home/aespinosa/workflows/coaster_debug/workdir</workdirectory>
> >
> > <profile namespace="karajan" key="initialScore">10000</profile>
> > <profile namespace="karajan" key="jobThrottle">1.98</profile>
> >
> > <profile namespace="globus" key="workersPerNode">1</profile>
> > <profile namespace="globus" key="maxwalltime">00:05:00</profile>
> > <profile namespace="globus" key="maxtime">3600</profile>
> > </pool>
> >
> > </config>
> >
> > swift session:
> > Swift svn swift-r3011 cog-r2439
> >
> > RunID: locallog
> > Progress:
> > Progress: Selecting site:198 Initializing site shared directory:1 Stage in:1
> > Progress: Selecting site:1 Submitting:198 Submitted:1
> > Progress: Selecting site:1 Submitted:198 Active:1
> > Progress: Selecting site:1 Submitted:192 Active:7
> > Progress: Selecting site:1 Submitted:188 Active:11
> > Progress: Selecting site:1 Submitted:181 Active:18
> > Progress: Selecting site:1 Submitted:178 Active:21
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> >
> >
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:36
> > Progress: Selecting site:1 Submitted:163 Active:35 Checking status:1
> > Progress: Submitted:156 Active:35 Checking status:1 Finished successfully:8
> > Progress: Submitted:149 Active:34 Checking status:1 Finished successfully:16
> > Progress: Submitted:144 Active:35 Checking status:1 Finished successfully:20
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:134 Active:30 Finished successfully:36
> > Progress: Submitted:133 Active:31 Finished successfully:36
> > Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/l on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/k on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/n on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/o on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/q on localhost
> > ailed to transfer wrapper log from 066-many-locallog/info/c on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/m on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/i on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/p on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/a on localhost
> > Progress: Stage in:11 Submitting:34 Submitted:113 Active:6
> > Finished successfully:36
> > Progress: Submitted:157 Active:7 Finished successfully:36
> > Failed to transfer wrapper log from 066-many-locallog/info/t on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/u on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/v on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/x on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/r on localhost
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > Progress: Submitted:163 Active:1 Finished successfully:36
> > ...
> > ... (not yet finished)
> >
> > $grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail
> > 65 2009-07-23 11:08:10,065-0500 DEBUG TaskImpl
> > Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974288-1248364979260-1248364979261) setting status
> > to Active workerid=000055
> > 66 2009-07-23 11:08:10,090-0500 DEBUG TaskImpl
> > Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974280-1248364979248-1248364979249) setting status
> > to Active workerid=000051
> > $ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail
> > 61 worker-0723-021156-000001.log:1248365290 000054 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 62 worker-0723-021156-000001.log:1248365290 000050 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 63 worker-0723-021156-000001.log:1248365290 000053 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 64 worker-0723-021156-000001.log:1248365290 000052 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 65 worker-0723-021156-000001.log:1248365290 000051 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> > 66 worker-0723-021156-000001.log:1248365290 000055 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> >
> >
> > it corresponds correctly with the swift session (more or less) since
> > we had 30+ completed jobs.
> >
> > Some lines in coasters.log i find intersting:
> > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974290-1248364979263-1248364979264) setting status
> > to Submitted
> > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974290-1248364979263-1248364979264) setting status
> > to Active
> > 2009-07-23 11:12:06,065-0500 INFO Command Sending Command(106,
> > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> > 2009-07-23 11:12:06,065-0500 INFO Command Command(106, JOBSTATUS)
> > CMD: Command(106, JOBSTATUS)
> > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974290-1248364979263-1248364979264) setting status
> > to Failed Block ta
> > sk failed: 0723-021156-000001Block task ended prematurely
> >
> > Statement unlikely to be reached at
> > /home/aespinosa/.globus/coasters/cscript15423.pl line 580.
> > (Maybe you meant system() when you said exec()?)
> >
> >
> > 2009-07-23 11:12:06,065-0500 INFO Command Sending Command(107,
> > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> > 2009-07-23 11:12:06,065-0500 INFO Command Command(107, JOBSTATUS)
> > CMD: Command(107, JOBSTATUS)
> >
> >
> > -Allan
> >
>
>
>
More information about the Swift-devel
mailing list