[Swift-devel] Re: 2 rack run with swift
Ioan Raicu
iraicu at cs.uchicago.edu
Thu Jul 24 09:24:32 CDT 2008
Hi,
I did a similar run through Falkon only, and got:
Number of Tasks: 16384
Task Duration: 30 sec
Average Task Execution Time (from Client point of view): 31.851 sec
Number of CPUs: 8192
Startup: 5.185 sec
Execute: 80.656 sec
Ideal time: 60 sec
Swift took some 600 seconds and had an average per-task run time of
240.97 sec. Zhao, was Swift patched with Ben's 3 patches from
April/May? I am curious what would happen if we threw 256-second tasks
through Swift at the same 2-rack scale.
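
For reference, here is a rough sketch of the arithmetic behind these numbers.
The function names are made up for illustration. I am assuming tasks run in
full waves across all CPUs, that the Falkon "Execute" time is the wall time
to compare against, and that the "138K jobs" in the DOCK run quoted below
means roughly 138,000.

```python
import math

def ideal_makespan(tasks: int, cpus: int, task_sec: float) -> float:
    """Lower bound on wall time: tasks run in ceil(tasks / cpus) full waves."""
    return math.ceil(tasks / cpus) * task_sec

def efficiency(ideal_sec: float, actual_sec: float) -> float:
    """Fraction of the ideal achieved (1.0 would be perfect)."""
    return ideal_sec / actual_sec

TASKS, CPUS, TASK_SEC = 16384, 8192, 30.0

ideal = ideal_makespan(TASKS, CPUS, TASK_SEC)

# Measured wall times reported in this thread.
falkon_eff = efficiency(ideal, 80.656)   # Falkon-only "Execute" time
swift_eff = efficiency(ideal, 600.0)     # Swift run, roughly 600 sec

# Throughput in jobs per second.
swift_rate = TASKS / 600.0
dock_rate = 138_000 / (40 * 60)          # assumption: "138K" ~ 138,000 jobs

# Even split of tasks over the 32 Falkon servers ("sites").
jobs_per_server = TASKS / 32

print(f"ideal makespan:    {ideal:.1f} sec")
print(f"Falkon efficiency: {falkon_eff:.1%}")
print(f"Swift efficiency:  {swift_eff:.1%}")
print(f"Swift rate: {swift_rate:.1f} jobs/sec, DOCK rate: {dock_rate:.1f} jobs/sec")
print(f"ideal jobs per server: {jobs_per_server:.0f}")
```

Changing TASK_SEC to 256 gives the ideal time for the longer-task experiment,
but it says nothing about how Swift's per-task overhead would scale.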
Ioan
Michael Wilde wrote:
> Thanks, Zhao,
>
> This is a great initial snapshot of performance on the new BG/P Falkon
> server mechanism (1 server per pset).
>
> It's also the largest Swift run to date I know of in terms of "sites"
> (32) and processors used (8192).
>
> From a quick scan of the plots, it seems like we have some tuning to do:
>
> The ideal time for this run would be 120 seconds. It took 600 seconds.
> That's in fact "not bad at all" for a first attempt at this scale, and
> very reasonable if the job length were longer. 16K jobs in 10 minutes
> is pretty good. The nearest real-world Falkon-only run I can compare
> to is the 15Kx9 DOCK run, which did 138K jobs in 40 minutes. This run
> performed at somewhat under half that rate.
>
> I suspect that the main bottleneck this is hitting is creation of job
> directories on the BG/P. As we learned in the past few months of
> Falkon-only runs, creation of filesystem objects on GPFS is very
> expensive, and creation of two objects within the same parent
> directory by more than one host is extremely expensive in locking contention.
>
> I *think* the plots bear this out, but need more assessment.
>
> I'd like to start by writing down a detailed description of the
> runtime file environment and management logic (i.e. job setup by Swift
> and file management by wrapper.sh). Then look to see which of the
> options Ben provided when we last did this, in March, were properly
> enabled. (Some may still be un-applied test patches). Then turn on
> some of the timing metrics in wrapper.sh to see where time is spent.
>
> I also see that job distribution among servers is pretty good -
> ranging from 490 to 600 jobs, but for the most part staying within 10
> jobs of the ideal, 512.
>
> I can't work on this today till our Swift report is done, but can then
> turn to it. Ben, once you're done with the SA Grid School, we could
> use your help on this. Mihael, as well, if you're interested and able
> to help.
>
> For now, I think we know a few steps we can take to measure and
> improve things.
>
> - Mike
>
>
> On 7/24/08 1:19 AM, Zhao Zhang wrote:
>> Hi, All
>>
>> I just made a swift run of 16384 sleep_30 tasks on 2 racks, which are
>> 8192 cores. The log is at
>> http://www.ci.uchicago.edu/~zzhang/report-sleep-20080724-0030-3zbv20j6/
>>
>> Tomorrow, I will try to make a mars run with swift.
>>
>> zhao
>
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================