[Swift-devel] Straggler jobs in swift/falkon workflow

Michael Wilde wilde at mcs.anl.gov
Wed Sep 5 17:08:22 CDT 2007


(My immediate project is to run many "angle" jobs in parallel: first 
1000, then 10K, 100K, etc. with good performance and low latency.)


In the latest run (run15, 1000 parallel jobs), I set the data transfer 
throttles back to their defaults of 4 and 8 for .transfers and 
.file.operations respectively, and left all other throttles off (wide open).

In this run, all 1000 jobs finished successfully (first time for me! ;) 
but the last two jobs "straggled" in about 10 minutes after all the others. 
I'm still trying to debug this and get a better handle on the timings and 
any error retries from the detailed log.

But here's what I get in terms of job completion times (for a 14:23 
start time):

#Finished Time
      64 14:26
     105 14:27
      71 14:28
      57 14:29
      65 14:30
      58 14:31
      68 14:32
      50 14:33
      52 14:34
      61 14:35
      62 14:36
      59 14:37
      65 14:38
      63 14:39
      60 14:40
      35 14:41
       3 14:43
       1 14:52
       1 14:53

As you can see, the last two jobs finish 10 minutes later than all the 
others. For the most part I get about 60 job completions a minute once 
the workflow cranks up. (This rate is about half of what I saw with 
transfer/data throttles wide open).
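
A minimal sketch of how the figures above can be checked from the table. 
The function names, the 14:28 to 14:40 "steady state" window, and the 
5-minute gap threshold are illustrative assumptions, not anything Swift or 
Falkon provides.

```python
from datetime import datetime

# (jobs finished, minute) pairs copied from the table above.
COMPLETIONS = [
    (64, "14:26"), (105, "14:27"), (71, "14:28"), (57, "14:29"),
    (65, "14:30"), (58, "14:31"), (68, "14:32"), (50, "14:33"),
    (52, "14:34"), (61, "14:35"), (62, "14:36"), (59, "14:37"),
    (65, "14:38"), (63, "14:39"), (60, "14:40"), (35, "14:41"),
    (3, "14:43"), (1, "14:52"), (1, "14:53"),
]


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    t = datetime.strptime(hhmm, "%H:%M")
    return t.hour * 60 + t.minute


def mean_rate(rows, start, end):
    """Mean completions per minute over the inclusive window [start, end]."""
    lo, hi = to_minutes(start), to_minutes(end)
    total = sum(n for n, t in rows if lo <= to_minutes(t) <= hi)
    return total / (hi - lo + 1)


def stragglers(rows, max_gap=5):
    """Rows that follow the first idle gap longer than max_gap minutes.

    max_gap is an arbitrary threshold chosen for this illustration.
    """
    for i in range(1, len(rows)):
        gap = to_minutes(rows[i][1]) - to_minutes(rows[i - 1][1])
        if gap > max_gap:
            return rows[i:]
    return []


if __name__ == "__main__":
    print(sum(n for n, _ in COMPLETIONS))                      # 1000
    print(round(mean_rate(COMPLETIONS, "14:28", "14:40"), 1))  # 60.8
    print(stragglers(COMPLETIONS))  # [(1, '14:52'), (1, '14:53')]
```

The window divides by elapsed minutes rather than by the number of table 
rows, so a minute with no completions (such as 14:42) would still count 
against the rate.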

I'm looking for errors related to these last two jobs but haven't found 
any yet.

Any thoughts as to what might cause the stragglers?

Also, I'd like to plot some of the job progress rates from the log file 
- but before I do, does anyone already have tools that do this?
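
A minimal sketch of the kind of tally such a tool would produce, under 
stated assumptions: each relevant log line contains an HH:MM:SS timestamp, 
job completions can be recognized by some marker substring (passed in by 
the caller, since the actual log wording is not given here), and the run 
stays within a single day. The function names are hypothetical.

```python
import re
import sys
from collections import Counter

# Assumption: a timestamp of the form HH:MM:SS appears somewhere in the line.
TIME_RE = re.compile(r"(?<!\d)(\d{2}:\d{2}):\d{2}(?!\d)")


def completions_per_minute(lines, marker):
    """Count lines containing `marker`, grouped by their HH:MM timestamp."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        match = TIME_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(counts.items())


def render(rows, width=50):
    """Render (minute, count) rows as a plain-text bar chart."""
    peak = max((n for _, n in rows), default=0)
    out = []
    for minute, n in rows:
        # Scale bars to the busiest minute; always show at least one mark.
        bar = "#" * max(1, round(n * width / peak))
        out.append(f"{minute} {n:5d} {bar}")
    return "\n".join(out)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python tally.py <logfile> <completion-marker>
    log_path, completion_marker = sys.argv[1], sys.argv[2]
    with open(log_path) as log:
        print(render(completions_per_minute(log, completion_marker)))
```

The output has the same shape as the table above (count per minute), with 
a bar per row so that gaps and stragglers stand out at a glance.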

Thanks,

Mike



