[Swift-devel] coaster workers not receiving enough jobs

Allan Espinosa aespinosa at cs.uchicago.edu
Thu Jul 23 11:40:15 CDT 2009


I tried 0660-many.swift with 200 5min sleep jobs using local:local
mode (since queue on ranger and teraport takes a while to finish).
The session spawned 192 workers.  Swift reports at most 36 active
processes at a time (which it finished successfully).  After that
workers reach idle time exceptions.   Logs and stuff are in
~aespinosa/workflows/coaster_debug/run1/

sites.xml:
<config>

  <pool handle="localhost">
    <gridftp  url="local://localhost" />
    <execution provider="coaster" jobmanager="local:local" url="localhost" />
    <workdirectory
>/home/aespinosa/workflows/coaster_debug/workdir</workdirectory>

                <profile namespace="karajan" key="initialScore">10000</profile>
                <profile namespace="karajan" key="jobThrottle">1.98</profile>

    <profile namespace="globus" key="workersPerNode">1</profile>
    <profile namespace="globus" key="maxwalltime">00:05:00</profile>
    <profile namespace="globus" key="maxtime">3600</profile>
  </pool>

</config>

swift session:
Swift svn swift-r3011 cog-r2439

RunID: locallog
Progress:
Progress:  Selecting site:198  Initializing site shared directory:1  Stage in:1
Progress:  Selecting site:1  Submitting:198  Submitted:1
Progress:  Selecting site:1  Submitted:198  Active:1
Progress:  Selecting site:1  Submitted:192  Active:7
Progress:  Selecting site:1  Submitted:188  Active:11
Progress:  Selecting site:1  Submitted:181  Active:18
Progress:  Selecting site:1  Submitted:178  Active:21
Progress:  Selecting site:1  Submitted:163  Active:36
Progress:  Selecting site:1  Submitted:163  Active:36


Progress:  Selecting site:1  Submitted:163  Active:36
Progress:  Selecting site:1  Submitted:163  Active:36
Progress:  Selecting site:1  Submitted:163  Active:36
Progress:  Selecting site:1  Submitted:163  Active:36
Progress:  Selecting site:1  Submitted:163  Active:36
Progress:  Selecting site:1  Submitted:163  Active:36
Progress:  Selecting site:1  Submitted:163  Active:36
Progress:  Selecting site:1  Submitted:163  Active:36
Progress:  Selecting site:1  Submitted:163  Active:35  Checking status:1
Progress:  Submitted:156  Active:35  Checking status:1  Finished successfully:8
Progress:  Submitted:149  Active:34  Checking status:1  Finished successfully:16
Progress:  Submitted:144  Active:35  Checking status:1  Finished successfully:20
Progress:  Submitted:134  Active:30  Finished successfully:36
Progress:  Submitted:134  Active:30  Finished successfully:36
Progress:  Submitted:134  Active:30  Finished successfully:36
Progress:  Submitted:134  Active:30  Finished successfully:36
Progress:  Submitted:134  Active:30  Finished successfully:36
Progress:  Submitted:134  Active:30  Finished successfully:36
Progress:  Submitted:134  Active:30  Finished successfully:36
Progress:  Submitted:134  Active:30  Finished successfully:36
Progress:  Submitted:133  Active:31  Finished successfully:36
Failed to transfer wrapper log from 066-many-locallog/info/0 on localhost
Failed to transfer wrapper log from 066-many-locallog/info/l on localhost
Failed to transfer wrapper log from 066-many-locallog/info/k on localhost
Failed to transfer wrapper log from 066-many-locallog/info/n on localhost
Failed to transfer wrapper log from 066-many-locallog/info/o on localhost
Failed to transfer wrapper log from 066-many-locallog/info/q on localhost
ailed to transfer wrapper log from 066-many-locallog/info/c on localhost
Failed to transfer wrapper log from 066-many-locallog/info/m on localhost
Failed to transfer wrapper log from 066-many-locallog/info/i on localhost
Failed to transfer wrapper log from 066-many-locallog/info/p on localhost
Failed to transfer wrapper log from 066-many-locallog/info/a on localhost
Progress:  Stage in:11  Submitting:34  Submitted:113  Active:6
Finished successfully:36
Progress:  Submitted:157  Active:7  Finished successfully:36
Failed to transfer wrapper log from 066-many-locallog/info/t on localhost
Failed to transfer wrapper log from 066-many-locallog/info/u on localhost
Failed to transfer wrapper log from 066-many-locallog/info/v on localhost
Failed to transfer wrapper log from 066-many-locallog/info/x on localhost
Failed to transfer wrapper log from 066-many-locallog/info/r on localhost
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
Progress:  Submitted:163  Active:1  Finished successfully:36
...
... (not yet finished)

$grep JOB_SUBMISSION coasters.log | grep Active | grep workerid | cat -n | tail
    65  2009-07-23 11:08:10,065-0500 DEBUG TaskImpl
Task(type=JOB_SUBMISSION,
identity=urn:1248364974288-1248364979260-1248364979261) setting status
to Active workerid=000055
    66  2009-07-23 11:08:10,090-0500 DEBUG TaskImpl
Task(type=JOB_SUBMISSION,
identity=urn:1248364974280-1248364979248-1248364979249) setting status
to Active workerid=000051
$ grep -a SUBMITJOB worker-0723-021156-00000* | grep Cmd | cat -n | tail
   61  worker-0723-021156-000001.log:1248365290 000054 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
    62  worker-0723-021156-000001.log:1248365290 000050 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
    63  worker-0723-021156-000001.log:1248365290 000053 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
    64  worker-0723-021156-000001.log:1248365290 000052 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
    65  worker-0723-021156-000001.log:1248365290 000051 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB
    66  worker-0723-021156-000001.log:1248365290 000055 < len=9,
actuallen=9, tag=1, flags=0, SUBMITJOB


it corresponds correctly with the swift session (more or less) since
we had 30+ completed jobs.

Some lines in coasters.log i find intersting:
2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1248364974290-1248364979263-1248364979264) setting status
to Submitted
2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1248364974290-1248364979263-1248364979264) setting status
to Active
2009-07-23 11:12:06,065-0500 INFO  Command Sending Command(106,
JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
2009-07-23 11:12:06,065-0500 INFO  Command Command(106, JOBSTATUS)
CMD: Command(106, JOBSTATUS)
2009-07-23 11:12:06,065-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:1248364974290-1248364979263-1248364979264) setting status
to Failed Block ta
sk failed: 0723-021156-000001Block task ended prematurely

Statement unlikely to be reached at
/home/aespinosa/.globus/coasters/cscript15423.pl line 580.
        (Maybe you meant system() when you said exec()?)


2009-07-23 11:12:06,065-0500 INFO  Command Sending Command(107,
JOBSTATUS) on GSSCChannel-https://128.135.125.17:50000(1)
2009-07-23 11:12:06,065-0500 INFO  Command Command(107, JOBSTATUS)
CMD: Command(107, JOBSTATUS)


-Allan



More information about the Swift-devel mailing list