[Swift-devel] Re: 2 rack run with swift

Thu Jul 24 09:24:32 CDT 2008

Hi,
I did a similar run through Falkon only, and got:
Number of Tasks: 16384
Task Duration: 30 sec
Average Task Execution Time (from Client point of view): 31.851 sec
Number of CPUs: 8192
Startup: 5.185 sec
Execute: 80.656 sec
Ideal time: 60 sec

Swift took some 600 seconds, and had an average per-task run time of 
240.97 sec.  Zhao, was Swift patched with Ben's 3 patches from 
April/May?  I am curious what would happen if we threw 256-second tasks 
through Swift at the same 2-rack scale?
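
For reference, the arithmetic behind these numbers can be sketched in a few 
lines of Python. This is only a rough model, not anything Falkon or Swift 
computes itself: it assumes tasks run in full "waves" across all CPUs with no 
dispatch overhead, takes Swift's wall time as a round 600 sec, and reads the 
DOCK figure quoted below as 138000 jobs in 40 minutes. The function names are 
made up for illustration.

```python
import math

def ideal_makespan(num_tasks, num_cpus, task_sec):
    """Lower bound on wall time, assuming tasks run in full waves on all CPUs."""
    waves = math.ceil(num_tasks / num_cpus)
    return waves * task_sec

def efficiency(ideal_sec, measured_sec):
    """Fraction of the measured wall time that the ideal schedule would need."""
    return ideal_sec / measured_sec

def throughput(num_jobs, elapsed_sec):
    """Completed jobs per second over the whole run."""
    return num_jobs / elapsed_sec

# Figures from this thread; 600 sec and 138000 jobs are rounded (assumptions).
ideal = ideal_makespan(16384, 8192, 30)   # two waves of 30 sec tasks
falkon_eff = efficiency(ideal, 80.656)    # Falkon-only "Execute" time
swift_eff = efficiency(ideal, 600)        # Swift over Falkon, approximate
swift_rate = throughput(16384, 600)       # jobs/sec for the Swift run
dock_rate = throughput(138000, 40 * 60)   # jobs/sec for the DOCK run

print(f"ideal makespan:    {ideal} sec")
print(f"Falkon efficiency: {falkon_eff:.1%}")
print(f"Swift efficiency:  {swift_eff:.1%}")
print(f"Swift rate: {swift_rate:.1f} jobs/sec, DOCK rate: {dock_rate:.1f} jobs/sec")
```

Under these assumptions the ideal comes out to the 60 sec listed above, and 
the Swift run's job rate lands a little under half of the DOCK run's rate.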

Ioan

Michael Wilde wrote:
> Thanks, Zhao,
>
> This is a great initial snapshot of performance on the new BG/P Falkon 
> server mechanism (1 server per pset).
>
> It's also the largest Swift run to date that I know of in terms of "sites" 
> (32) and processors used (8192).
>
> From a quick scan of the plots, it seems like we have some tuning to do:
>
> The ideal time for this run would be 120 seconds. It took 600 seconds. 
> That's in fact "not bad at all" for a first attempt at this scale, and 
> very reasonable if the job length were longer. 16K jobs in 10 minutes 
> is pretty good. The nearest real-world Falkon-only run I can compare 
> to is the 15Kx9 DOCK run, which did 138K jobs in 40 minutes. This run 
> performed at somewhat under half that rate.
>
> I suspect that the main bottleneck this run is hitting is the creation of 
> job directories on the BG/P. As we learned in the past few months of 
> Falkon-only runs, creation of filesystem objects on GPFS is very 
> expensive, and creation of two objects within the same parent 
> directory by more than one host is extremely expensive in locking contention.
>
> I *think* the plots bear this out, but need more assessment.
>
> I'd like to start by writing down a detailed description of the 
> runtime file environment and management logic (i.e. job setup by Swift 
> and file management by wrapper.sh).  Then look to see which of the 
> options Ben provided when we last did this, in March, were properly 
> enabled. (Some may still be un-applied test patches). Then turn on 
> some of the timing metrics in wrapper.sh to see where time is spent.
>
> I also see that job distribution among servers is pretty good - 
> ranging from 490 to 600 jobs, but for the most part staying within 10 
> jobs of the ideal, 512.
>
> I can't work on this today till our Swift report is done, but can then 
> turn to it.  Ben, once you're done with the SA Grid School, we could 
> use your help on this. Mihael, as well, if you're interested and able 
> to help.
>
> For now, I think we know a few steps we can take to measure and 
> improve things.
>
> - Mike
>
>
> On 7/24/08 1:19 AM, Zhao Zhang wrote:
>> Hi, All
>>
>> I just made a Swift run of 16384 sleep_30 tasks on 2 racks, which is 
>> 8192 cores. The log is at
>> http://www.ci.uchicago.edu/~zzhang/report-sleep-20080724-0030-3zbv20j6/
>>
>> Tomorrow, I will try to make a mars run with Swift.
>>
>> zhao
>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================