[Swift-devel] Straggler jobs in swift/falkon workflow
Michael Wilde
wilde at mcs.anl.gov
Wed Sep 5 17:08:22 CDT 2007
(My immediate project is to run many "angle" jobs in parallel: first
1000, then 10K, 100K, etc. with good performance and low latency.)
In the latest run, run15 - 1000 parallel jobs - I set the data xfer
throttles back to the defaults of 4 and 8 for .transfers and
.file.operations respectively, and left all other throttles off (wide open).
In this run, all 1000 jobs finished successfully (first time for me! ;)
but the last two jobs "straggled" in about 10 minutes after all the others.
I'm still trying to debug this and get a better handle on the timings and
any error retries from the detailed log.
But here's what I get in terms of job completion times (for a 14:23
start time):
#Finished  Time
       64  14:26
      105  14:27
       71  14:28
       57  14:29
       65  14:30
       58  14:31
       68  14:32
       50  14:33
       52  14:34
       61  14:35
       62  14:36
       59  14:37
       65  14:38
       63  14:39
       60  14:40
       35  14:41
        3  14:43
        1  14:52
        1  14:53
As you can see, the last two jobs finish about 10 minutes later than all
the others. For the most part I get about 60 job completions a minute once
the workflow cranks up. (This rate is about half of what I saw with
the transfer/data throttles wide open.)
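A quick way to check these numbers is to run the table through a short script. The sketch below is only an illustration: the rows are copied from the table above, while the helper names and the 5-minute gap threshold are assumptions picked for this example, not anything Swift or Falkon provides.

```python
from datetime import datetime, timedelta

# (jobs finished, minute) pairs copied from the completion table in this message.
COMPLETIONS = [
    (64, "14:26"), (105, "14:27"), (71, "14:28"), (57, "14:29"),
    (65, "14:30"), (58, "14:31"), (68, "14:32"), (50, "14:33"),
    (52, "14:34"), (61, "14:35"), (62, "14:36"), (59, "14:37"),
    (65, "14:38"), (63, "14:39"), (60, "14:40"), (35, "14:41"),
    (3, "14:43"), (1, "14:52"), (1, "14:53"),
]


def parse_minute(hhmm):
    """Parse an HH:MM string into a datetime (the date part is irrelevant here)."""
    return datetime.strptime(hhmm, "%H:%M")


def find_gaps(rows, min_gap_minutes=5):
    """Return (previous, next, gap_minutes) for consecutive rows that are
    at least min_gap_minutes apart. The threshold is an arbitrary choice."""
    gaps = []
    for (_, prev), (_, cur) in zip(rows, rows[1:]):
        delta = (parse_minute(cur) - parse_minute(prev)) / timedelta(minutes=1)
        if delta >= min_gap_minutes:
            gaps.append((prev, cur, int(delta)))
    return gaps


total_jobs = sum(n for n, _ in COMPLETIONS)

# Mean completions per minute over the busy stretch, 14:26 through 14:41.
busy = [n for n, t in COMPLETIONS if "14:26" <= t <= "14:41"]
mean_rate = sum(busy) / len(busy)

if __name__ == "__main__":
    print("total jobs:", total_jobs)
    print("mean completions/minute (14:26-14:41): %.1f" % mean_rate)
    for prev, cur, minutes in find_gaps(COMPLETIONS):
        print("gap of %d minutes between %s and %s" % (minutes, prev, cur))
```

With the rows above, the counts add up to 1000, the busy stretch averages a little over 62 completions a minute, and the only gap of 5 minutes or more is the one between 14:43 and 14:52.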
I'm looking for errors related to these last two jobs but haven't found
any yet.
Any thoughts as to what might cause the stragglers?
Also, I'd like to plot some of the job progress rates from the log file
- but before I do, does anyone already have tools that do this?
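In case nothing exists yet, here is a minimal sketch of the binning step such a tool would need. It assumes the job completion timestamps have already been extracted from the log as Python datetime objects; the log parsing itself is left out because the log format is not described here. The function names and the sample timestamps are made up for illustration and are not taken from the run.

```python
from collections import Counter
from datetime import datetime


def completions_per_minute(timestamps):
    """Count completion events per wall-clock minute.

    timestamps: iterable of datetime objects, one per finished job.
    Returns a list of (minute, count) pairs sorted by time.
    """
    counts = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return sorted(counts.items())


def text_histogram(per_minute, scale=1):
    """Render (minute, count) pairs as a plain-text bar chart,
    drawing one '#' per `scale` completions."""
    lines = []
    for minute, n in per_minute:
        lines.append(f"{minute:%H:%M} {n:4d} {'#' * (n // scale)}")
    return "\n".join(lines)


# Made-up sample timestamps, only to show the call pattern.
SAMPLE = [
    datetime(2007, 9, 5, 14, 26, 1),
    datetime(2007, 9, 5, 14, 26, 15),
    datetime(2007, 9, 5, 14, 26, 40),
    datetime(2007, 9, 5, 14, 27, 5),
]

if __name__ == "__main__":
    print(text_histogram(completions_per_minute(SAMPLE)))
```

The (minute, count) pairs can also be handed to any plotting package instead of the text histogram.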
Thanks,
Mike