[Swift-devel] Re: coaster workers not receiving enough jobs

Mihael Hategan hategan at mcs.anl.gov
Tue Jul 28 16:12:22 CDT 2009


Right. I put in some quick code to prevent another spin. That likely has
some bugs that causes this problem.

On Tue, 2009-07-28 at 16:06 -0500, Allan Espinosa wrote:
> Now i tried changing 5mins to 5secs.  based on the worker ids with the
> pull() calls it looks like all the jobs were successfully assigned to
> all workers throughout the blocks.
> 
> [aespinosa at communicado coasters]$ grep pull coasters.log | grep -v
> Later | awk '{print $5}' | sort -u | cat -n | tail
>     65  0728-590322-000001:58
>     66  0728-590322-000001:59
>     67  0728-590322-000001:6
>     68  0728-590322-000001:60
>     69  0728-590322-000001:61
>     70  0728-590322-000001:62
>     71  0728-590322-000001:63
>     72  0728-590322-000001:7
>     73  0728-590322-000001:8
>     74  0728-590322-000001:9
> 
> 
> My guess is that at long jobs (5 mins), pull() timeouts while waiting
> and will only get assigned much later on.  But this doesn't happen
> because of some timeout mechanisms too (i think).
> 
> 2009/7/23 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> > I tried 0660-many.swift with 200 5min sleep jobs using local:local
> > mode (since queue on ranger and teraport takes a while to finish).
> > The session spawned 192 workers.  Swift reports at most 36 active
> > processes at a time (which it finished successfully).  After that
> > workers reach idle time exceptions.   Logs and stuff are in
> > ~aespinosa/workflows/coaster_debug/run1/
> >
> > sites.xml:
> > <config>
> >
> >  <pool handle="localhost">
> >    <gridftp  url="local://localhost" />
> >    <execution provider="coaster" jobmanager="local:local" url="localhost" />
> >    <workdirectory
> >>/home/aespinosa/workflows/coaster_debug/workdir</workdirectory>
> >
> >                <profile namespace="karajan" key="initialScore">10000</profile>
> >                <profile namespace="karajan" key="jobThrottle">1.98</profile>
> >
> >    <profile namespace="globus" key="workersPerNode">1</profile>
> >    <profile namespace="globus" key="maxwalltime">00:05:00</profile>
> >    <profile namespace="globus" key="maxtime">3600</profile>
> >  </pool>
> >
> > </config>
> >
> > swift session:
> > Swift svn swift-r3011 cog-r2439
> >
> > RunID: locallog
> > Progress:
> > Progress:  Selecting site:198  Initializing site shared directory:1  Stage in:1
> > Progress:  Selecting site:1  Submitting:198  Submitted:1
> > Progress:  Selecting site:1  Submitted:198  Active:1
> > Progress:  Selecting site:1  Submitted:192  Active:7
> > Progress:  Selecting site:1  Submitted:188  Active:11
> > Progress:  Selecting site:1  Submitted:181  Active:18
> > Progress:  Selecting site:1  Submitted:178  Active:21
> > Progress:  Selecting site:1  Submitted:163  Active:36
> > Progress:  Selecting site:1  Submitted:163  Active:36
> >
> >
> > Progress:  Selecting site:1  Submitted:163  Active:36
> > Progress:  Selecting site:1  Submitted:163  Active:36
> > Progress:  Selecting site:1  Submitted:163  Active:36
> > Progress:  Selecting site:1  Submitted:163  Active:36
> > Progress:  Selecting site:1  Submitted:163  Active:36
> > Progress:  Selecting site:1  Submitted:163  Active:36
> > Progress:  Selecting site:1  Submitted:163  Active:36
> > Progress:  Selecting site:1  Submitted:163  Active:36
> > Progress:  Selecting site:1  Submitted:163  Active:35  Checking status:1
> > Progress:  Submitted:156  Active:35  Checking status:1  Finished successfully:8
> > Progress:  Submitted:149  Active:34  Checking status:1  Finished successfully:16
> > Progress:  Submitted:144  Active:35  Checking status:1  Finished successfully:20
> > Progress:  Submitted:134  Active:30  Finished successfully:36
> > Progress:  Submitted:134  Active:30  Finished successfully:36
> > Progress:  Submitted:134  Active:30  Finished successfully:36
> > Progress:  Submitted:134  Active:30  Finished successfully:36
> > Progress:  Submitted:134  Active:30  Finished successfully:36
> > Progress:  Submitted:134  Active:30  Finished successfully:36
> > Progress:  Submitted:134  Active:30  Finished successfully:36
> > Progress:  Submitted:134  Active:30  Finished successfully:36
> > Progress:  Submitted:133  Active:31  Finished successfully:36
> > Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/l on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/k on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/n on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/o on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/q on localhost
> > ailed to transfer wrapper log from 066-many-locallog/info/c on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/m on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/i on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/p on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/a on localhost
> > Progress:  Stage in:11  Submitting:34  Submitted:113  Active:6
> > Finished successfully:36
> > Progress:  Submitted:157  Active:7  Finished successfully:36
> > Failed to transfer wrapper log from 066-many-locallog/info/t on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/u on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/v on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/x on localhost
> > Failed to transfer wrapper log from 066-many-locallog/info/r on localhost
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > Progress:  Submitted:163  Active:1  Finished successfully:36
> > ...
> > ... (not yet finished)
> >
> > $grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail
> >    65  2009-07-23 11:08:10,065-0500 DEBUG TaskImpl
> > Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974288-1248364979260-1248364979261) setting status
> > to Active workerid=000055
> >    66  2009-07-23 11:08:10,090-0500 DEBUG TaskImpl
> > Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974280-1248364979248-1248364979249) setting status
> > to Active workerid=000051
> > $ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail
> >   61  worker-0723-021156-000001.log:1248365290 000054 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> >    62  worker-0723-021156-000001.log:1248365290 000050 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> >    63  worker-0723-021156-000001.log:1248365290 000053 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> >    64  worker-0723-021156-000001.log:1248365290 000052 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> >    65  worker-0723-021156-000001.log:1248365290 000051 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> >    66  worker-0723-021156-000001.log:1248365290 000055 < len=9,
> > actuallen=9, tag=1, flags=0, SUBMITJOB
> >
> >
> > it corresponds correctly with the swift session (more or less) since
> > we had 30+ completed jobs.
> >
> > Some lines in coasters.log i find intersting:
> > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974290-1248364979263-1248364979264) setting status
> > to Submitted
> > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974290-1248364979263-1248364979264) setting status
> > to Active
> > 2009-07-23 11:12:06,065-0500 INFO  Command Sending Command(106,
> > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> > 2009-07-23 11:12:06,065-0500 INFO  Command Command(106, JOBSTATUS)
> > CMD: Command(106, JOBSTATUS)
> > 2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> > identity=urn:1248364974290-1248364979263-1248364979264) setting status
> > to Failed Block ta
> > sk failed: 0723-021156-000001Block task ended prematurely
> >
> > Statement unlikely to be reached at
> > /home/aespinosa/.globus/coasters/cscript15423.pl line 580.
> >        (Maybe you meant system() when you said exec()?)
> >
> >
> > 2009-07-23 11:12:06,065-0500 INFO  Command Sending Command(107,
> > JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
> > 2009-07-23 11:12:06,065-0500 INFO  Command Command(107, JOBSTATUS)
> > CMD: Command(107, JOBSTATUS)
> >
> >
> > -Allan
> >
> 
> 
> 




More information about the Swift-devel mailing list