[Swift-devel] Re: coaster workers not receiving enough jobs

Allan Espinosa aespinosa at cs.uchicago.edu
Tue Jul 28 16:06:48 CDT 2009


Now i tried changing 5mins to 5secs.  based on the worker ids with the
pull() calls it looks like all the jobs were successfully assigned to
all workers throughout the blocks.

[aespinosa at communicado coasters]$ grep pull coasters.log | grep -v
Later | awk '{print $5}' | sort -u | cat -n | tail
    65  0728-590322-000001:58
    66  0728-590322-000001:59
    67  0728-590322-000001:6
    68  0728-590322-000001:60
    69  0728-590322-000001:61
    70  0728-590322-000001:62
    71  0728-590322-000001:63
    72  0728-590322-000001:7
    73  0728-590322-000001:8
    74  0728-590322-000001:9


My guess is that at long jobs (5 mins), pull() timeouts while waiting
and will only get assigned much later on.  But this doesn't happen
because of some timeout mechanisms too (i think).

2009/7/23 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> I tried 0660-many.swift with 200 5min sleep jobs using local:local
> mode (since queue on ranger and teraport takes a while to finish).
> The session spawned 192 workers.  Swift reports at most 36 active
> processes at a time (which it finished successfully).  After that
> workers reach idle time exceptions.   Logs and stuff are in
> ~aespinosa/workflows/coaster_debug/run1/
>
> sites.xml:
> <config>
>
>  <pool handle="localhost">
>    <gridftp  url="local://localhost" />
>    <execution provider="coaster" jobmanager="local:local" url="localhost" />
>    <workdirectory
>>/home/aespinosa/workflows/coaster_debug/workdir</workdirectory>
>
>                <profile namespace="karajan" key="initialScore">10000</profile>
>                <profile namespace="karajan" key="jobThrottle">1.98</profile>
>
>    <profile namespace="globus" key="workersPerNode">1</profile>
>    <profile namespace="globus" key="maxwalltime">00:05:00</profile>
>    <profile namespace="globus" key="maxtime">3600</profile>
>  </pool>
>
> </config>
>
> swift session:
> Swift svn swift-r3011 cog-r2439
>
> RunID: locallog
> Progress:
> Progress:  Selecting site:198  Initializing site shared directory:1  Stage in:1
> Progress:  Selecting site:1  Submitting:198  Submitted:1
> Progress:  Selecting site:1  Submitted:198  Active:1
> Progress:  Selecting site:1  Submitted:192  Active:7
> Progress:  Selecting site:1  Submitted:188  Active:11
> Progress:  Selecting site:1  Submitted:181  Active:18
> Progress:  Selecting site:1  Submitted:178  Active:21
> Progress:  Selecting site:1  Submitted:163  Active:36
> Progress:  Selecting site:1  Submitted:163  Active:36
>
>
> Progress:  Selecting site:1  Submitted:163  Active:36
> Progress:  Selecting site:1  Submitted:163  Active:36
> Progress:  Selecting site:1  Submitted:163  Active:36
> Progress:  Selecting site:1  Submitted:163  Active:36
> Progress:  Selecting site:1  Submitted:163  Active:36
> Progress:  Selecting site:1  Submitted:163  Active:36
> Progress:  Selecting site:1  Submitted:163  Active:36
> Progress:  Selecting site:1  Submitted:163  Active:36
> Progress:  Selecting site:1  Submitted:163  Active:35  Checking status:1
> Progress:  Submitted:156  Active:35  Checking status:1  Finished successfully:8
> Progress:  Submitted:149  Active:34  Checking status:1  Finished successfully:16
> Progress:  Submitted:144  Active:35  Checking status:1  Finished successfully:20
> Progress:  Submitted:134  Active:30  Finished successfully:36
> Progress:  Submitted:134  Active:30  Finished successfully:36
> Progress:  Submitted:134  Active:30  Finished successfully:36
> Progress:  Submitted:134  Active:30  Finished successfully:36
> Progress:  Submitted:134  Active:30  Finished successfully:36
> Progress:  Submitted:134  Active:30  Finished successfully:36
> Progress:  Submitted:134  Active:30  Finished successfully:36
> Progress:  Submitted:134  Active:30  Finished successfully:36
> Progress:  Submitted:133  Active:31  Finished successfully:36
> Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/l on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/k on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/n on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/o on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/q on localhost
> ailed to transfer wrapper log from 066-many-locallog/info/c on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/m on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/i on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/p on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/a on localhost
> Progress:  Stage in:11  Submitting:34  Submitted:113  Active:6
> Finished successfully:36
> Progress:  Submitted:157  Active:7  Finished successfully:36
> Failed to transfer wrapper log from 066-many-locallog/info/t on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/u on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/v on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/x on localhost
> Failed to transfer wrapper log from 066-many-locallog/info/r on localhost
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> Progress:  Submitted:163  Active:1  Finished successfully:36
> ...
> ... (not yet finished)
>
> $grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail
>    65  2009-07-23 11:08:10,065-0500 DEBUG TaskImpl
> Task(type=JOB_SUBMISSION,
> identity=urn:1248364974288-1248364979260-1248364979261) setting status
> to Active workerid=000055
>    66  2009-07-23 11:08:10,090-0500 DEBUG TaskImpl
> Task(type=JOB_SUBMISSION,
> identity=urn:1248364974280-1248364979248-1248364979249) setting status
> to Active workerid=000051
> $ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail
>   61  worker-0723-021156-000001.log:1248365290 000054 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>    62  worker-0723-021156-000001.log:1248365290 000050 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>    63  worker-0723-021156-000001.log:1248365290 000053 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>    64  worker-0723-021156-000001.log:1248365290 000052 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>    65  worker-0723-021156-000001.log:1248365290 000051 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>    66  worker-0723-021156-000001.log:1248365290 000055 < len=9,
> actuallen=9, tag=1, flags=0, SUBMITJOB
>
>
> it corresponds correctly with the swift session (more or less) since
> we had 30+ completed jobs.
>
> Some lines in coasters.log i find intersting:
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Submitted
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Active
> 2009-07-23 11:12:06,065-0500 INFO  Command Sending Command(106,
> JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> 2009-07-23 11:12:06,065-0500 INFO  Command Command(106, JOBSTATUS)
> CMD: Command(106, JOBSTATUS)
> 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:1248364974290-1248364979263-1248364979264) setting status
> to Failed Block ta
> sk failed: 0723-021156-000001Block task ended prematurely
>
> Statement unlikely to be reached at
> /home/aespinosa/.globus/coasters/cscript15423.pl line 580.
>        (Maybe you meant system() when you said exec()?)
>
>
> 2009-07-23 11:12:06,065-0500 INFO  Command Sending Command(107,
> JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> 2009-07-23 11:12:06,065-0500 INFO  Command Command(107, JOBSTATUS)
> CMD: Command(107, JOBSTATUS)
>
>
> -Allan
>



-- 
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>



More information about the Swift-devel mailing list