[Swift-devel] Re: coaster workers not receiving enough jobs
Allan Espinosa
aespinosa at cs.uchicago.edu
Tue Jul 28 16:06:48 CDT 2009
Now i tried changing 5mins to 5secs. based on the worker ids with the
pull() calls it looks like all the jobs were successfully assigned to
all workers throughout the blocks.
[aespinosa at communicado coasters]$ grep pull coasters.log | grep -v
Later | awk '{print $5}' | sort -u | cat -n | tail
65 0728-590322-000001:58
66 0728-590322-000001:59
67 0728-590322-000001:6
68 0728-590322-000001:60
69 0728-590322-000001:61
70 0728-590322-000001:62
71 0728-590322-000001:63
72 0728-590322-000001:7
73 0728-590322-000001:8
74 0728-590322-000001:9
My guess is that at long jobs (5 mins), pull() timeouts while waiting
and will only get assigned much later on. But this doesn't happen
because of some timeout mechanisms too (i think).
2009/7/23 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> I tried 0660-many.swift with 200 5min sleep jobs using local:local
> mode (since queue on ranger and teraport takes a while to finish).
> The session spawned 192 workers. Swift reports at most 36 active
> processes at a time (which it finished successfully). After that
> workers reach idle time exceptions. Logs and stuff are in
> ~aespinosa/workflows/coaster_debug/run1/
>
> sites.xml:
> <config>
>
> <pool handle="localhost">
> <gridftp url="local://localhost" />
> <execution provider="coaster" jobmanager="local:local" url="localhost" />
> <workdirectory
>>/home/aespinosa/workflows/coaster_debug/workdir</workdirectory>
>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <profile namespace="karajan" key="jobThrottle">1.98</profile>
>
> <profile namespace="globus" key="workersPerNode">1</profile>
> <profile namespace="globus" key="maxwalltime">00:05:00</profile>
> <profile namespace="globus" key="maxtime">3600</profile>
> </pool>
>
> </config>
>
> swift session:
> Swift svn swift-r3011 cog-r2439
>
> RunID: locallog
> Progress:
> Progress: Selecting site:198 Initializing site shared directory:1 Stage in:1
> Progress: Selecting site:1 Submitting:198 Submitted:1
> Progress: Selecting site:1 Submitted:198 Active:1
> Progress: Selecting site:1 Submitted:192 Active:7
> Progress: Selecting site:1 Submitted:188 Active:11
> Progress: Selecting site:1 Submitted:181 Active:18
> Progress: Selecting site:1 Submitted:178 Active:21
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
>
>
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:36
> Progress: Selecting site:1 Submitted:163 Active:35 Checking status:1
> Progress: Submitted:156 Active:35 Checking status:1 Finished successfully:8
> Progress: Submitted:149 Active:34 Checking status:1 Finished successfully:16
> Progress: Submitted:144 Active:35 Checking status:1 Finished successfully:20
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:134 Active:30 Finished successfully:36
> Progress: Submitted:133 Active:31 Finished successfully:36
> Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/l on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/k on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/n on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/o on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/q on localhost
> ailed to transfer wrapper log from 066-many-locallog/info/c on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/m on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/i on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/p on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/a on localhost
> Progress: Stage in:11 Submitting:34 Submitted:113 Active:6
> Finished successfully:36
> Progress: Submitted:157 Active:7 Finished successfully:36
> Failed to transfer wrapper log from 066-many-locallog/info/t on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/u on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/v on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/x on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/r on localhost
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> Progress: Submitted:163 Active:1 Finished successfully:36
> ...
> ... (not yet finished)
>
> $grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail
> 65 2009-07-23 11:08:10,065-0500 DEBUG TaskImpl
> Task(type=JOB_SUBMISSION,
> identity=urn:1248364974288-1248364979260-1248364979261) setting status
> to Active workerid=000055
> 66 2009-07-23 11:08:10,090-0500 DEBUG TaskImpl
> Task(type=JOB_SUBMISSION,
> identity=urn:1248364974280-1248364979248-1248364979249) setting status
> to Active workerid=000051
> $ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail
> 61 worker-0723-021156-000001.log:1248365290 000054 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
> 62 worker-0723-021156-000001.log:1248365290 000050 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
> 63 worker-0723-021156-000001.log:1248365290 000053 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
> 64 worker-0723-021156-000001.log:1248365290 000052 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
> 65 worker-0723-021156-000001.log:1248365290 000051 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
> 66 worker-0723-021156-000001.log:1248365290 000055 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>
>
> it corresponds correctly with the swift session (more or less) since
> we had 30+ completed jobs.
>
> Some lines in coasters.log i find intersting:
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Submitted
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Active
> 2009-07-23 11:12:06,065-0500 INFO Command Sending Command(106,
> JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> 2009-07-23 11:12:06,065-0500 INFO Command Command(106, JOBSTATUS)
> CMD: Command(106, JOBSTATUS)
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Failed Block ta
> sk failed: 0723-021156-000001Block task ended prematurely
>
> Statement unlikely to be reached at
> /home/aespinosa/.globus/coasters/cscript15423.pl line 580.
> (Maybe you meant system() when you said exec()?)
>
>
> 2009-07-23 11:12:06,065-0500 INFO Command Sending Command(107,
> JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> 2009-07-23 11:12:06,065-0500 INFO Command Command(107, JOBSTATUS)
> CMD: Command(107, JOBSTATUS)
>
>
> -Allan
>
--
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
More information about the Swift-devel
mailing list