[Swift-devel] Re: 2 rack run with swift
Michael Wilde
wilde at mcs.anl.gov
Thu Jul 24 07:57:17 CDT 2008
Thanks, Zhao,
This is a great initial snapshot of performance on the new BG/P Falkon
server mechanism (1 server per pset).
It's also the largest Swift run to date that I know of in terms of "sites"
(32) and processors used (8192).
From a quick scan of the plots, it seems like we have some tuning to do:
The ideal time for this run would be 120 seconds. It took 600 seconds.
That's in fact "not bad at all" for a first attempt at this scale, and
very reasonable if the job length were longer. 16K jobs in 10 minutes is
pretty good. The nearest real-world Falkon-only run I can compare to is
the 15Kx9 DOCK run, which did 138K jobs in 40 minutes. This run
performed at somewhat under half that rate.
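
As a quick check of that comparison, here is a minimal sketch that uses only the figures quoted above (16K jobs in about 10 minutes for this run, 138K jobs in 40 minutes for the DOCK run). The variable and function names are just for illustration.

```python
# Rough throughput comparison using only the numbers quoted in this message.
swift_jobs = 16 * 1024      # 16K sleep_30 jobs in this Swift run
swift_minutes = 10          # roughly 600 seconds
dock_jobs = 138_000         # Falkon-only DOCK run, "138K jobs"
dock_minutes = 40

def jobs_per_minute(jobs, minutes):
    """Average completion rate over the whole run."""
    return jobs / minutes

swift_rate = jobs_per_minute(swift_jobs, swift_minutes)
dock_rate = jobs_per_minute(dock_jobs, dock_minutes)
ratio = swift_rate / dock_rate

print(f"Swift run: {swift_rate:.0f} jobs/min")
print(f"DOCK run:  {dock_rate:.0f} jobs/min")
print(f"Swift/DOCK ratio: {ratio:.2f}")
```

These are whole-run averages, so they include ramp-up and tail effects in both runs.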
I suspect that the main bottleneck this is hitting is creation of job
directories on the BG/P. As we learned in the past few months of
Falkon-only runs, creation of filesystem objects on GPFS is very
expensive, and creation of two objects within the same parent directory
by more than one host is extremely expensive because of locking contention.
I *think* the plots bear this out, but need more assessment.
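
One way to assess this separately from Swift would be a small directory-creation timing test. The sketch below is only an assumption about how such a test might look: the function names and directory layouts are hypothetical and are not what Swift or wrapper.sh actually do. It times creating job directories directly in one shared parent versus inside a per-host subdirectory. To say anything about GPFS locking, the root would have to be on GPFS and the script would have to run from several hosts at once; the demo here just uses a local temporary directory.

```python
import os
import socket
import tempfile
import time

def time_mkdirs(parent, names):
    """Create one directory per name under parent; return elapsed seconds."""
    start = time.perf_counter()
    for name in names:
        os.mkdir(os.path.join(parent, name))
    return time.perf_counter() - start

def shared_layout(root, host, n):
    """Hypothetical layout A: every host creates job dirs in the same parent."""
    return time_mkdirs(root, [f"{host}-job{i:05d}" for i in range(n)])

def per_host_layout(root, host, n):
    """Hypothetical layout B: each host creates job dirs under its own parent."""
    parent = os.path.join(root, host)
    os.makedirs(parent, exist_ok=True)
    return time_mkdirs(parent, [f"job{i:05d}" for i in range(n)])

# Local demo only; point the two roots at a GPFS path for a real measurement.
host = socket.gethostname()
with tempfile.TemporaryDirectory() as root_a, tempfile.TemporaryDirectory() as root_b:
    t_shared = shared_layout(root_a, host, 200)
    t_per_host = per_host_layout(root_b, host, 200)
    print(f"shared parent:   {t_shared:.4f} s for 200 mkdirs")
    print(f"per-host parent: {t_per_host:.4f} s for 200 mkdirs")
```

On a single host the two timings should be close; any difference due to contention would only show up with concurrent hosts on the shared filesystem.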
I'd like to start by writing down a detailed description of the runtime
file environment and management logic (i.e. job setup by swift and file
management by wrapper.sh). Then look to see which of the options Ben
provided when we last did this, in March, were properly enabled. (Some
may still be un-applied test patches). Then turn on some of the timing
metrics in wrapper.sh to see where time is spent.
I also see that job distribution among servers is pretty good - ranging
from 490 to 600 jobs, but for the most part staying within 10 jobs of
the ideal, 512.
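
For reference, the ideal of 512 and the size of the extremes follow directly from the run parameters. A minimal sketch, using only the totals and the min/max quoted above (the per-server counts themselves are in the plots, not reproduced here):

```python
# Ideal per-server load and how far the quoted extremes sit from it.
total_jobs = 16384
servers = 32                       # one Falkon server per pset, 32 "sites"
ideal = total_jobs / servers       # 512.0 jobs per server

def deviation(count, ideal):
    """Return (absolute, percent) deviation of a server's job count from ideal."""
    diff = count - ideal
    return diff, 100.0 * diff / ideal

for label, count in (("min", 490), ("max", 600)):
    diff, pct = deviation(count, ideal)
    print(f"{label}: {count} jobs, {diff:+.0f} from ideal ({pct:+.1f}%)")
```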
I can't work on this today till our Swift report is done, but can then
turn to it. Ben, once you're done with the SA Grid School, we could use
your help on this. Mihael, as well, if you're interested and able to help.
For now, I think we know a few steps we can take to measure and improve
things.
- Mike
On 7/24/08 1:19 AM, Zhao Zhang wrote:
> Hi, All
>
> I just made a swift run of 16384 sleep_30 tasks on 2 racks, which are
> 8192 cores. The log is at
> http://www.ci.uchicago.edu/~zzhang/report-sleep-20080724-0030-3zbv20j6/
>
> Tomorrow, I will try to make a mars run with swift.
>
> zhao