[Swift-devel] Re: Another performance comparison of DOCK

Sun Apr 13 18:31:42 CDT 2008


Michael Wilde wrote:
> It's not clear to me what's best here, for 3 reasons:
>
> 1) We should set file.transfers and file.operations to a value that 
> prevents Swift from adversely impacting performance on shared resources.
>
> Since Swift must run on the login node and hits the shared cluster 
> networks, we should test carefully.
>
> 2) It's not clear to me how many concurrent operations the login hosts 
> can sustain before topping out, and how this number depends on file 
> size. Do you know this from the GPFS benchmarks? 
1 node can sustain about 71 file reads/sec (1B files), and 512 nodes can 
sustain 362 reads/sec.  10KB files are similar: 66/sec and 315/sec.  
Read+write is 31/sec (1 node) and 79/sec (512 nodes) for 1B files, and 
31/sec and 81/sec for 10KB files.  Does this explain anything?  How fast 
were stage-ins going?  Does a stage-in mean copying a file from one place 
to another?  If so, we are looking at the 1 node performance of 10KB 
read+write, which is 31 ops/sec.  Would all the stage-in be happening 
during the 500 second idle time?  If yes, then that is about 24 
files/sec, which is awfully close to the 31 files/sec from our 
benchmark.  If not, then it's pure coincidence.
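
For what it's worth, the comparison above can be sketched in a few lines 
of Python.  The rates are the read+write benchmark numbers quoted above; 
the 500 second window and the 24 files/sec figure are the ones from this 
thread.  The file count is only what those two numbers imply (an 
assumption, not a measured count), and the helper names are made up for 
illustration.

```python
# Read+write rates (ops/sec) from the GPFS benchmark numbers quoted above,
# keyed by (file size, number of nodes).
READ_WRITE_OPS_PER_SEC = {
    ("1B", 1): 31,
    ("1B", 512): 79,
    ("10KB", 1): 31,
    ("10KB", 512): 81,
}


def staging_time_sec(num_files, ops_per_sec):
    """Seconds needed to copy num_files at a sustained ops_per_sec rate."""
    return num_files / ops_per_sec


def fraction_of_benchmark(observed_rate, benchmark_rate):
    """How close an observed rate is to a benchmark rate (1.0 = equal)."""
    return observed_rate / benchmark_rate


IDLE_WINDOW_SEC = 500   # idle period mentioned in this thread
OBSERVED_RATE = 24      # files/sec, if all stage-in happened in that window

# ASSUMPTION: the file count is not stated in this thread; this is just
# what 24 files/sec over 500 seconds would imply.
implied_files = OBSERVED_RATE * IDLE_WINDOW_SEC

one_node_rate = READ_WRITE_OPS_PER_SEC[("10KB", 1)]
ratio = fraction_of_benchmark(OBSERVED_RATE, one_node_rate)
best_case_sec = staging_time_sec(implied_files, one_node_rate)

print(f"implied files staged: {implied_files}")
print(f"observed / benchmark: {ratio:.2f}")
print(f"time at benchmark rate: {best_case_sec:.0f} s")
```

If the stage-ins really do run from a single login node, the observed 
rate would be roughly three quarters of what the benchmark says one node 
can do for 10KB read+write, which is why the two numbers look related.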
> And did you measure impact on system response during those benchmarks?
No.

Ioan
>
> I think that the overall system would top out well before 2000 
> concurrent transfers, but I could be wrong. Going much higher than the 
> point where concurrency increases the data rate, it seems, would cause 
> the rate to drop due to contention and context switching.
>
> 3) If I/O operations are fast compared to the job length and 
> completion rate, you don't have to set these values to the same as the 
> max number of input files that can be demanded at once.
>
> I think we want to set the I/O operation concurrency to a value that 
> achieves the highest operation rate we can sustain while keeping 
> overall system performance at some acceptable level (tbd).
>
> So first we need to find the concurrency level that maximizes ops/sec, 
> (which may be filesize dependent) and then possibly back that off to 
> reduce system impact.
>
> It seems to me that finding the right I/O concurrency setting is 
> complex and non-obvious, and I'm interested in what Ben and Mihael 
> suggest here.
>
> - Mike
>
> On 4/13/08 3:51 PM, Ioan Raicu wrote:
>> But we have 2X input files as opposed to number of jobs and CPUs.  We 
>> have 2048 CPUs, shouldn't we set all file I/O operations to at least 
>> 4096... and that means that files won't be ready for the next jobs 
>> once the first ones start completing... so we should really set 
>> things to twice that, so 8192 is the number I'd set on all file 
>> operations for this app on 2K CPUs.
>> Ioan
>>
>> Zhao Zhang wrote:
>>> Hi, Mike
>>>
>>> Michael Wilde wrote:
>>>> Ben, your analysis sounds very good. Some notes below, including 
>>>> questions for Zhao.
>>>>
>>>> On 4/13/08 2:57 PM, Ben Clifford wrote:
>>>>>
>>>>>> Ben, can you point me to the graphs for this run? (Zhao's 
>>>>>> *99cy0z4g.log)
>>>>>
>>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
>>>>>
>>>>>> Once stage-ins start to complete, are the corresponding jobs 
>>>>>> initiated quickly, or is Swift doing mostly stage-ins for some 
>>>>>> period?
>>>>>
>>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
>>>>> falkon) pretty much right as the corresponding stagein completes. 
>>>>> I have no deeper information about when the worker actually starts 
>>>>> to run.
>>>>>
>>>>>> Zhao indicated he saw data indicating there was about a 700 second
>>>>>> lag from workflow start time till the first Falkon jobs started,
>>>>>> if I understood correctly. Do the graphs confirm this or say
>>>>>> something different?
>>>>>
>>>>> There is a period of about 500s or so until stuff starts to 
>>>>> happen; I haven't looked at it. That is before stage-ins start 
>>>>> too, though, which means that I think this...
>>>>>
>>>>>> If the 700-second delay figure is true, and stage-in was eliminated
>>>>>> by copying input files right to the /tmp workdir rather than first
>>>>>> to /shared, then we'd have:
>>>>>>
>>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>>>>>
>>>>> calculation is not meaningful.
>>>>>
>>>>> I have not looked at what is going on during that 500s startup 
>>>>> time, but I plan to.
>>>>
>>>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper 
>>>> logging problem a few weeks ago. Could that cause such a delay, 
>>>> Ben? It would be very obvious in the swift log.
>>> The version is Swift svn swift-r1780 cog-r1956
>>>>
>>>>>
>>>>>> I assume we're paying the same staging price on the output side?
>>>>>
>>>>> not really - the output stageouts go very fast, and also because 
>>>>> job ending is staggered, they don't happen all at once.
>>>>>
>>>>> This is the same with most of the large runs I've seen (of any 
>>>>> application) - stageout tends not to be a problem (or at least, 
>>>>> nowhere near the problems of stagein).
>>>>>
>>>>> All stageins happen over a period t=400 to t=1100 fairly smoothly. 
>>>>> There's rate limiting still on file operations (100 max) and file 
>>>>> transfers (2000 max) which is being hit still.
>>>>
>>>> I thought Zhao set file operations throttle to 2000 as well.  
>>>> Sounds like we can test with the latter higher, and find out what's 
>>>> limiting the former.
>>>>
>>>> Zhao, what are your settings for property throttle.file.operations?
>>>> I assume you have throttle.transfers set to 2000.
>>>>
>>>> If it's set right, any chance that Swift or Karajan is limiting it 
>>>> somewhere?
>>> 2000 for sure,
>>> throttle.submit=off
>>> throttle.host.submit=off
>>> throttle.score.job.factor=off
>>> throttle.transfers=2000
>>> throttle.file.operation=2000
>>>>>
>>>>> I think there are two directions to proceed in here that make sense 
>>>>> for actual use on single clusters running falkon (rather than 
>>>>> trying to cut out stuff randomly to push up numbers):
>>>>>
>>>>>  i) use some of the data placement features in falkon, rather than
>>>>>     Swift's relatively simple data management that was designed
>>>>>     more for running on the grid.
>>>>
>>>> Long term: we should consider how the Coaster implementation could 
>>>> eventually do a similar data placement approach. In the meantime 
>>>> (mid term) examining what interface changes are needed for Falkon 
>>>> data placement might help prepare for that. Need to discuss if that 
>>>> would be a good step or not.
>>>>
>>>>>
>>>>>  ii) do stage-ins using symlinks rather than file copying. this makes
>>>>>      sense when everything is living in a single filesystem, which
>>>>>      again is not what Swift's data management was originally
>>>>>      optimised for.
>>>>
>>>> I assume you mean symlinks from shared/ back to the user's input 
>>>> files?
>>>>
>>>> That sounds worth testing: find out if symlink creation is fast on 
>>>> NFS and GPFS.
>>>>
>>>> Is another approach to copy direct from the user's files to the 
>>>> /tmp workdir (ie wrapper.sh pulls the data in)? Measurement will 
>>>> tell if symlinks alone get adequate performance. Symlinks do seem 
>>>> an easier first step.
>>>>
>>>>> I think option ii) is substantially easier to implement (on the 
>>>>> order of days) and is generally useful in the single-cluster, 
>>>>> local-source-data situation that appears to be what people want to 
>>>>> do for running on the BG/P and scicortex (that is, pretty much 
>>>>> ignoring anything grid-like at all).
>>>>
>>>> Grid-like might mean pulling data to the /tmp workdir directly by 
>>>> the wrapper - but that seems like a harder step, and would need 
>>>> measurement and prototyping of such code before attempting. Data 
>>>> transfer clients that the wrapper script can count on might be an 
>>>> obstacle.
>>>>
>>>>>
>>>>> Option i) is much harder (on the order of months), needing a very 
>>>>> different interface between Swift and Falkon than exists at the 
>>>>> moment.
>>>>>
>>>>>
>>>>>
>>>>
>>
>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================