[Swift-devel] Re: Another performance comparison of DOCK

Michael Wilde wilde at mcs.anl.gov
Sun Apr 13 17:50:01 CDT 2008


It's not clear to me what's best here, for three reasons:

1) We should set file.transfers and file.operations to values that 
prevent Swift from adversely impacting performance on shared resources.

Since Swift must run on the login node and its traffic goes over the 
shared cluster network, we should test carefully.
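
For concreteness, a conservative starting point might look something like 
this in swift.properties (the numbers below are just placeholders for 
discussion, not measured recommendations):

   throttle.transfers=200
   throttle.file.operations=200

and we'd then raise or lower them based on the kind of measurements 
discussed under point 2.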

2) It's not clear to me how many concurrent operations the login hosts 
can sustain before topping out, or how this number depends on file 
size. Do you know this from the GPFS benchmarks? And did you measure 
the impact on system response during those benchmarks?

I think the overall system would top out well before 2000 concurrent 
transfers, but I could be wrong. Pushing concurrency much past the point 
where it stops increasing the aggregate data rate would, it seems, cause 
the rate to drop due to contention and context switching.

3) If I/O operations are fast compared to the job length and completion 
rate, you don't have to set these values as high as the maximum number of 
input files that can be demanded at once.
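
For example (with made-up numbers, purely for illustration): if each job 
needs 2 input files, runs for roughly 600 seconds, and a single copy takes 
about a second, then 2048 CPUs finish about 2048/600 = 3.4 jobs/sec in 
steady state and demand only ~7 file operations/sec - a throttle far below 
4096 would keep up, once the initial ramp-up wave has been staged.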

I think we want to set the I/O operation concurrency to the value that 
achieves the highest operation rate we can sustain while keeping overall 
system responsiveness at some acceptable level (TBD).

So first we need to find the concurrency level that maximizes ops/sec 
(which may be file-size dependent), and then possibly back that off to 
reduce system impact.
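
As a rough sketch of the kind of measurement I mean (this is hypothetical 
code, not something we have; the paths, file sizes, and concurrency levels 
are placeholders, and running it will itself load the shared filesystem, 
so it should be done with care):

#!/usr/bin/env python
# Sketch: measure file-copy throughput on a shared filesystem at several
# concurrency levels, to see where the aggregate rate stops improving.
# All paths, sizes, and levels below are illustrative placeholders.
import os, shutil, tempfile, time
from multiprocessing import Pool

SRC_DIR = tempfile.mkdtemp(prefix="iobench-src-")  # stand-in for user input dir
DST_DIR = tempfile.mkdtemp(prefix="iobench-dst-")  # stand-in for shared/ workdir
FILE_SIZE = 64 * 1024        # bytes per test file; DOCK inputs are small
FILES_PER_LEVEL = 500        # copies performed at each concurrency level
LEVELS = [8, 32, 64, 128, 256, 512]

def make_input(i):
    with open(os.path.join(SRC_DIR, "in%05d" % i), "wb") as f:
        f.write(os.urandom(FILE_SIZE))

def copy_one(i):
    shutil.copy(os.path.join(SRC_DIR, "in%05d" % i),
                os.path.join(DST_DIR, "out%05d" % i))

if __name__ == "__main__":
    for i in range(FILES_PER_LEVEL):
        make_input(i)
    for level in LEVELS:
        start = time.time()
        pool = Pool(level)
        pool.map(copy_one, range(FILES_PER_LEVEL))
        pool.close(); pool.join()
        elapsed = time.time() - start
        print("concurrency %4d: %.1f copies/sec"
              % (level, FILES_PER_LEVEL / elapsed))
    shutil.rmtree(SRC_DIR); shutil.rmtree(DST_DIR)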

It seems to me that finding the right I/O concurrency setting is complex 
and non-obvious, and I'm interested in what Ben and Mihael suggest here.

- Mike

On 4/13/08 3:51 PM, Ioan Raicu wrote:
> But we have 2X as many input files as jobs and CPUs.  With 2048 CPUs, 
> shouldn't we set all file I/O throttles to at least 4096?  Even then, 
> files won't be ready for the next jobs once the first ones start 
> completing, so we should really set things to twice that: 8192 is the 
> number I'd set on all file operations for this app on 2K CPUs.
> Ioan
> 
> Zhao Zhang wrote:
>> Hi, Mike
>>
>> Michael Wilde wrote:
>>> Ben, your analysis sounds very good. Some notes below, including 
>>> questions for Zhao.
>>>
>>> On 4/13/08 2:57 PM, Ben Clifford wrote:
>>>>
>>>>> Ben, can you point me to the graphs for this run? (Zhao's 
>>>>> *99cy0z4g.log)
>>>>
>>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
>>>>
>>>>> Once stage-ins start to complete, are the corresponding jobs 
>>>>> initiated quickly, or is Swift doing mostly stage-ins for some period?
>>>>
>>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
>>>> falkon) pretty much right as the corresponding stagein completes. I 
>>>> have no deeper information about when the worker actually starts to 
>>>> run.
>>>>
>>>>> Zhao indicated he saw data showing about a 700-second lag from 
>>>>> workflow start time until the first Falkon jobs started, if I 
>>>>> understood correctly. Do the graphs confirm this or say something 
>>>>> different?
>>>>
>>>> There is a period of about 500s or so until stuff starts to happen; 
>>>> I haven't looked at it. That is before stage-ins start too, though, 
>>>> which means that I think this...
>>>>
>>>>> If the 700-second delay figure is true, and stage-in was eliminated 
>>>>> by copying input files right to the /tmp workdir rather than first 
>>>>> to /shared, then we'd have:
>>>>>
>>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
>>>>
>>>> calculation is not meaningful.
>>>>
>>>> I have not looked at what is going on during that 500s startup time, 
>>>> but I plan to.
>>>
>>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper logging 
>>> problem a few weeks ago. Could that cause such a delay, Ben? It would 
>>> be very obvious in the swift log.
>> The version is Swift svn swift-r1780 cog-r1956
>>>
>>>>
>>>>> I assume we're paying the same staging price on the output side?
>>>>
>>>> Not really - the output stage-outs go very fast, and because job 
>>>> completion is staggered, they don't all happen at once.
>>>>
>>>> This is the same with most of the large runs I've seen (of any 
>>>> application) - stage-out tends not to be a problem (or at least, 
>>>> nowhere near the problem that stage-in is).
>>>>
>>>> All stage-ins happen over the period from t=400 to t=1100 fairly 
>>>> smoothly. There is still rate limiting on file operations (100 max) 
>>>> and file transfers (2000 max), and it is still being hit.
>>>
>>> I thought Zhao set the file operations throttle to 2000 as well.  
>>> Sounds like we can test with the transfer throttle set higher, and 
>>> find out what's limiting the file operations.
>>>
>>> Zhao, what are your settings for property throttle.file.operations?
>>> I assume you have throttle.transfers set to 2000.
>>>
>>> If it's set right, is there any chance that Swift or Karajan is 
>>> limiting it somewhere?
>> 2000 for sure,
>> throttle.submit=off
>> throttle.host.submit=off
>> throttle.score.job.factor=off
>> throttle.transfers=2000
>> throttle.file.operation=2000
>>>>
>>>> I think there are two directions to proceed in here that make sense 
>>>> for actual use on single clusters running falkon (rather than trying 
>>>> to cut out stuff randomly to push up numbers):
>>>>
>>>>  i) use some of the data placement features in falkon, rather than
>>>>     Swift's relatively simple data management that was designed more
>>>>     for running on the grid.
>>>
>>> Long term: we should consider how the Coaster implementation could 
>>> eventually do a similar data placement approach. In the meantime (mid 
>>> term), examining what interface changes are needed for Falkon data 
>>> placement might help prepare for that. We need to discuss whether 
>>> that would be a good step or not.
>>>
>>>>
>>>>  ii) do stage-ins using symlinks rather than file copying. This makes
>>>>      sense when everything is living in a single filesystem, which
>>>>      again is not what Swift's data management was originally
>>>>      optimised for.
>>>
>>> I assume you mean symlinks from shared/ back to the user's input files?
>>>
>>> That sounds worth testing: find out if symlink creation is fast on 
>>> NFS and GPFS.
>>>
>>> Is another approach to copy directly from the user's files to the /tmp 
>>> workdir (i.e., wrapper.sh pulls the data in)? Measurement will tell 
>>> whether symlinks alone get adequate performance. Symlinks do seem an 
>>> easier first step.
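
A quick way to put a first number on that might be something like the 
following (hypothetical sketch; the target directory and counts are 
placeholders, and it would need to be run on both the NFS and GPFS mounts 
we care about):

#!/usr/bin/env python
# Sketch: time symlink creation vs. small-file copying in one directory,
# to see whether symlink-based stage-in is obviously cheaper on NFS/GPFS.
# The directory argument and the count are illustrative placeholders.
import os, shutil, sys, tempfile, time

target = sys.argv[1] if len(sys.argv) > 1 else tempfile.mkdtemp()
count = 1000

src = os.path.join(target, "source.dat")
with open(src, "wb") as f:
    f.write(os.urandom(16 * 1024))   # small input file, like a DOCK input

t0 = time.time()
for i in range(count):
    os.symlink(src, os.path.join(target, "link%05d" % i))
t_link = time.time() - t0

t0 = time.time()
for i in range(count):
    shutil.copy(src, os.path.join(target, "copy%05d" % i))
t_copy = time.time() - t0

print("%d symlinks: %.2fs (%.0f/sec)" % (count, t_link, count / t_link))
print("%d copies:   %.2fs (%.0f/sec)" % (count, t_copy, count / t_copy))
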
>>>
>>>> I think option ii) is substantially easier to implement (on the 
>>>> order of days) and is generally useful in the single-cluster, 
>>>> local-source-data situation that appears to be what people want 
>>>> for running on the BG/P and SiCortex (that is, pretty much 
>>>> ignoring anything grid-like at all).
>>>
>>> Grid-like might mean having the wrapper pull data directly into the 
>>> /tmp workdir - but that seems like a harder step, and would need 
>>> measurement and prototyping of such code before attempting it. 
>>> Finding data transfer clients that the wrapper script can count on 
>>> might be an obstacle.
>>>
>>>>
>>>> Option i) is much harder (on the order of months), needing a very 
>>>> different interface between Swift and Falkon than exists at the moment.
>>>>
>>>>
>>>>
>>>
> 


