[Swift-devel] Straggler jobs in swift/falkon workflow

Ioan Raicu iraicu at cs.uchicago.edu
Wed Sep 5 17:27:06 CDT 2007


Hi,
If you have the Falkon service logs, I bet both straggler jobs were on 
the same node (one on each worker). If that is the case, I bet it's a 
hardware issue with that node.  Are these runs deterministic?  If you 
were to re-run it, should the job completion times be the same on the 
same number of processors?  If yes, and if there was a bad node, then 
repeating the experiment would yield a similar distribution of job 
completion times, with the same node having two stragglers. 

About tools to plot the progress, you can use
falkon/service/run.www-graphs.sh 51000 60

which starts a set of scripts that plot the progress with ploticus 
every 60 sec, and starts a web server so you can view the graphs remotely 
on port 51000.  You might have to recompile ploticus depending on which 
machine you are on (IA32, IA64, etc).
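
For a quick look without the plotting scripts, something like the 
following Python sketch may be enough. It is not part of Falkon or Swift; 
it simply takes the (count, minute) pairs from Mike's table below and 
reports any completions that arrive after a long quiet gap. The function 
names and the 5-minute gap threshold are assumptions chosen for 
illustration.

```python
from datetime import datetime

# (jobs finished, minute) pairs copied from the table in Mike's message.
COMPLETIONS = [
    (64, "14:26"), (105, "14:27"), (71, "14:28"), (57, "14:29"),
    (65, "14:30"), (58, "14:31"), (68, "14:32"), (50, "14:33"),
    (52, "14:34"), (61, "14:35"), (62, "14:36"), (59, "14:37"),
    (65, "14:38"), (63, "14:39"), (60, "14:40"), (35, "14:41"),
    (3, "14:43"), (1, "14:52"), (1, "14:53"),
]


def minutes_between(earlier, later):
    """Whole minutes from one HH:MM stamp to a later one on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return int(delta.total_seconds() // 60)


def find_stragglers(completions, gap_minutes=5):
    """Return every entry after the first gap longer than gap_minutes.

    gap_minutes is an arbitrary threshold (an assumption, not a Falkon
    setting); tune it to the job length of the workflow.
    """
    for i in range(1, len(completions)):
        gap = minutes_between(completions[i - 1][1], completions[i][1])
        if gap > gap_minutes:
            return completions[i:]
    return []


if __name__ == "__main__":
    total = sum(count for count, _ in COMPLETIONS)
    print("total jobs:", total)
    for count, stamp in find_stragglers(COMPLETIONS):
        print("straggler(s):", count, "at", stamp)
```

On the numbers from run15 this counts 1000 jobs in total and flags the 
two single completions at 14:52 and 14:53. Matching those two jobs to a 
node would still require the Falkon service logs.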

Ioan


Michael Wilde wrote:
> (My immediate project is to run many "angle" jobs in parallel: first 
> 1000, then 10K, 100K, etc. with good performance and low latency.)
>
>
> In the latest run, run15 - 1000 parallel jobs - I set the data xfer 
> throttles back to the defaults of 4 and 8 for .transfers and 
> .file.operations respectively, and left all other throttles off (wide 
> open).
>
> In this run, all 1000 jobs finished successfully (first time for me! 
> ;) but the last two jobs "straggled" in 12 minutes after all the 
> others. I'm still trying to debug this and get a better handle on the 
> timings and any error retries, from the detailed log.
>
> But here's what I get in terms of job completion times (for a 14:23 
> start time):
>
> #Finished Time
>      64 14:26
>     105 14:27
>      71 14:28
>      57 14:29
>      65 14:30
>      58 14:31
>      68 14:32
>      50 14:33
>      52 14:34
>      61 14:35
>      62 14:36
>      59 14:37
>      65 14:38
>      63 14:39
>      60 14:40
>      35 14:41
>       3 14:43
>       1 14:52
>       1 14:53
>
> As you can see, the last two jobs finish 10 minutes later than all the 
> others. For the most part I get about 60 job completions a minute once 
> the workflow cranks up. (This rate is about half of what I saw with 
> transfer/data throttles wide open).
>
> I'm looking for errors related to these last two jobs but haven't found 
> any yet.
>
> Any thoughts as to what might cause the stragglers?
>
> Also, I'd like to plot some of the job progress rates from the log 
> file - but before I do, does anyone already have tools that do this?
>
> Thanks,
>
> Mike
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================