[Swift-devel] Re: Another performance comparison of DOCK

Mihael Hategan hategan at mcs.anl.gov
Sun Apr 13 17:06:11 CDT 2008


On Sun, 2008-04-13 at 17:50 -0500, Michael Wilde wrote:
> It's not clear to me what's best here, for three reasons:
> 
> 1) We should set throttle.transfers and throttle.file.operations to 
> values that prevent Swift from adversely impacting performance on 
> shared resources.
> 
> Since Swift must run on the login node and hits the shared cluster 
> networks, we should test carefully.
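> 
> For reference, those knobs are the swift.properties entries below; the 
> values here are placeholders, since the safe numbers are exactly what 
> we need to measure:
> 
>    # limit on concurrent file transfers (e.g. GridFTP)
>    throttle.transfers=100
>    # limit on concurrent filesystem operations (copies, links, mkdirs)
>    throttle.file.operations=100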
> 
> 2) It's not clear to me how many concurrent operations the login hosts 
> can sustain before topping out, and how this number depends on file 
> size. Do you know this from the GPFS benchmarks? And did you measure 
> impact on system response during those benchmarks?
> 
> I think the overall system would top out well before 2000 concurrent 
> transfers, but I could be wrong. Pushing much past the point where 
> added concurrency still increases the data rate would, it seems, cause 
> the rate to drop due to contention and context switching.

The number is probably in the 10-100 range. With 2000, it's somewhat
likely that the transfers are long done before all the GridFTP
connections can be started.

Mihael

> 
> 3) If I/O operations are fast compared to the job length and completion 
> rate, you don't have to set these values as high as the maximum number 
> of input files that can be demanded at once.
> 
> I think we want to set the I/O operation concurrency to a value that 
> achieves the highest operation rate we can sustain while keeping overall 
> system performance at some acceptable level (TBD).
> 
> So first we need to find the concurrency level that maximizes ops/sec 
> (which may be file-size dependent), and then possibly back it off to 
> reduce system impact.
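> 
> A minimal sketch of the kind of sweep I mean - a hypothetical, untested 
> script, with placeholder paths and concurrency levels:
> 
>    import shutil, time
>    from concurrent.futures import ThreadPoolExecutor
> 
>    SRC = "/home/user/sample-input.dat"    # a typical input file
>    DEST = "/gpfs/shared/throttle-test"    # shared-filesystem scratch dir
> 
>    def copy_one(i):
>        shutil.copy(SRC, "%s/copy-%d" % (DEST, i))
> 
>    # time n concurrent copies for increasing n; ops/sec should rise,
>    # flatten, then fall once contention and context switching set in
>    for n in (10, 50, 100, 500, 1000, 2000):
>        start = time.time()
>        with ThreadPoolExecutor(max_workers=n) as pool:
>            list(pool.map(copy_one, range(n)))
>        print("%5d concurrent: %6.1f ops/sec" % (n, n / (time.time() - start)))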
> 
> It seems to me that finding the right I/O concurrency setting is complex 
> and non-obvious, and I'm interested in what Ben and Mihael suggest here.
> 
> - Mike
> 
> On 4/13/08 3:51 PM, Ioan Raicu wrote:
> > But we have 2X as many input files as jobs and CPUs.  With 2048 CPUs, 
> > shouldn't we set all file I/O throttles to at least 4096?  Even then, 
> > files won't be ready for the next jobs once the first ones start 
> > completing, so we should really double that again: 8192 is the number 
> > I'd set on all file operations for this app on 2K CPUs.
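> > 
> > Spelled out, using the 2X-files-per-job figure from above:
> > 
> >    2048 CPUs x 2 input files per job   = 4096 concurrent stage-ins
> >    x 2 again to prefetch the next wave = 8192
> > 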
> > Ioan
> > 
> > Zhao Zhang wrote:
> >> Hi, Mike
> >>
> >> Michael Wilde wrote:
> >>> Ben, your analysis sounds very good. Some notes below, including 
> >>> questions for Zhao.
> >>>
> >>> On 4/13/08 2:57 PM, Ben Clifford wrote:
> >>>>
> >>>>> Ben, can you point me to the graphs for this run? (Zhao's 
> >>>>> *99cy0z4g.log)
> >>>>
> >>>> http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g
> >>>>
> >>>>> Once stage-ins start to complete, are the corresponding jobs 
> >>>>> initiated quickly, or is Swift doing mostly stage-ins for some period?
> >>>>
> >>>> In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to 
> >>>> Falkon) pretty much right as the corresponding stage-in completes. I 
> >>>> have no deeper information about when the worker actually starts to 
> >>>> run.
> >>>>
> >>>>> Zhao indicated he saw data showing about a 700-second lag from 
> >>>>> workflow start until the first Falkon jobs started, if I understood 
> >>>>> him correctly. Do the graphs confirm this or say something 
> >>>>> different?
> >>>>
> >>>> There is a period of about 500s or so until stuff starts to happen; 
> >>>> I haven't looked at it yet. That is before stage-ins start too, 
> >>>> though, which means that I think this...
> >>>>
> >>>>> If the 700-second delay figure is true, and stage-in was eliminated 
> >>>>> by copying input files right to the /tmp workdir rather than first 
> >>>>> to /shared, then we'd have:
> >>>>>
> >>>>> 1190260 / ( 1290 * 2048 ) = .45 efficiency
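> >>>>>
> >>>>> (Reading the formula: 1190260 is presumably the total useful job 
> >>>>> CPU-seconds, and 1290 s x 2048 CPUs = 2,641,920 available 
> >>>>> CPU-seconds, so the ratio is 1190260 / 2641920 = 0.45.)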
> >>>>
> >>>> calculation is not meaningful.
> >>>>
> >>>> I have not looked at what is going on during that 500s startup time, 
> >>>> but I plan to.
> >>>
> >>> Zhao, what SVN rev is your Swift at?  Ben fixed an N^2 mapper logging 
> >>> problem a few weeks ago. Could that cause such a delay, Ben? It would 
> >>> be very obvious in the Swift log.
> >> The version is Swift svn swift-r1780 cog-r1956
> >>>
> >>>>
> >>>>> I assume we're paying the same staging price on the output side?
> >>>>
> >>>> not really - the output stage-outs go very fast, and because job 
> >>>> endings are staggered, they don't happen all at once.
> >>>>
> >>>> This is the same with most of the large runs I've seen (of any 
> >>>> application) - stage-out tends not to be a problem (or at least, 
> >>>> nowhere near the problem that stage-in is).
> >>>>
> >>>> All stage-ins happen over a period t=400 to t=1100 fairly smoothly. 
> >>>> The rate limits on file operations (100 max) and file transfers 
> >>>> (2000 max) are still being hit.
> >>>
> >>> I thought Zhao set the file operations throttle to 2000 as well.  
> >>> Sounds like we can test with transfers set higher, and find out 
> >>> what's limiting file operations to 100.
> >>>
> >>> Zhao, what are your settings for property throttle.file.operations?
> >>> I assume you have throttle.transfers set to 2000.
> >>>
> >>> If it's set right, any chance that Swift or Karajan is limiting it 
> >>> somewhere?
> >> 2000 for sure:
> >> throttle.submit=off
> >> throttle.host.submit=off
> >> throttle.score.job.factor=off
> >> throttle.transfers=2000
> >> throttle.file.operation=2000
> >>>>
> >>>> I think there are two directions to proceed in here that make sense 
> >>>> for actual use on single clusters running Falkon (rather than trying 
> >>>> to cut out stuff randomly to push up numbers):
> >>>>
> >>>>  i) use some of the data placement features in Falkon, rather than 
> >>>>     Swift's relatively simple data management that was designed more 
> >>>>     for running on the grid.
> >>>
> >>> Long term: we should consider how the Coaster implementation could 
> >>> eventually do a similar data placement approach. In the meantime 
> >>> (mid-term), examining what interface changes are needed for Falkon 
> >>> data placement might help prepare for that. We need to discuss 
> >>> whether that would be a good step or not.
> >>>
> >>>>
> >>>>  ii) do stage-ins using symlinks rather than file copying. This 
> >>>>      makes sense when everything is living in a single filesystem, 
> >>>>      which again is not what Swift's data management was originally 
> >>>>      optimised for.
> >>>
> >>> I assume you mean symlinks from shared/ back to the user's input files?
> >>>
> >>> That sounds worth testing: find out if symlink creation is fast on 
> >>> NFS and GPFS.
> >>>
> >>> Another approach would be to copy directly from the user's files to 
> >>> the /tmp workdir (i.e., wrapper.sh pulls the data in). Measurement 
> >>> will tell whether symlinks alone give adequate performance; symlinks 
> >>> do seem the easier first step.
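> >>>
> >>> A minimal sketch of what symlink stage-in amounts to - a hypothetical 
> >>> helper, not Swift's actual code, with illustrative paths:
> >>>
> >>>    import os
> >>>
> >>>    def stage_in(user_file, run_shared_dir):
> >>>        # one metadata operation per file instead of a full data copy
> >>>        dest = os.path.join(run_shared_dir, os.path.basename(user_file))
> >>>        os.symlink(os.path.abspath(user_file), dest)
> >>>
> >>> That is a single metadata op per file rather than a data copy, which 
> >>> is why measuring symlink creation rates on NFS and GPFS first seems 
> >>> worthwhile.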
> >>>
> >>>> I think option ii) is substantially easier to implement (on the 
> >>>> order of days) and is generally useful in the single-cluster, 
> >>>> local-source-data situation that appears to be what people want for 
> >>>> running on the BG/P and SiCortex (that is, pretty much ignoring 
> >>>> anything grid-like at all).
> >>>
> >>> Grid-like might mean the wrapper pulling data directly into the /tmp 
> >>> workdir - but that seems like a harder step, and would need 
> >>> measurement and prototyping before attempting it. The availability of 
> >>> data transfer clients that the wrapper script can count on might be 
> >>> an obstacle.
> >>>
> >>>>
> >>>> Option i) is much harder (on the order of months), needing a very 
> >>>> different interface between Swift and Falkon than exists at the moment.