[Swift-devel] Re: Another performance comparison of DOCK

Ben Clifford benc at hawaga.org.uk
Sun Apr 13 14:57:06 CDT 2008



> Ben, can you point me to the graphs for this run? (Zhao's *99cy0z4g.log)

http://www.ci.uchicago.edu/~benc/report-dock2-20080412-1609-99cy0z4g

> Once stage-ins start to complete, are the corresponding jobs initiated 
> quickly, or is Swift doing mostly stage-ins for some period?

In the run dock2-20080412-1609-99cy0z4g, jobs are submitted (to Falkon) 
pretty much as soon as the corresponding stagein completes. I have no 
deeper information about when the worker actually starts to run.

> Zhao indicated he saw data indicating there was about a 700 second lag from
> workflow start time till the first Falkon jobs started, if I understood
> correctly. Do the graphs confirm this or say something different?

There is a period of about 500s before anything starts to happen; I 
haven't looked into it yet. That period comes before the stage-ins start 
too, though, which means that I think this...

> If the 700-second delay figure is true, and stage-in was eliminated by copying
> input files right to the /tmp workdir rather than first to /shared, then we'd
> have:
> 
> 1190260 / ( 1290 * 2048 ) = .45 efficiency

calculation is not meaningful.

I have not looked at what is going on during that 500s startup time, but I 
plan to.
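
For reference, the arithmetic in the quoted figure can be reproduced as 
below. This is only a sketch: the meaning I attach to each operand 
(useful CPU-seconds, elapsed seconds, processor count) is an assumption, 
as the thread does not spell it out, and the variable names are made up 
for illustration.

```python
# Reproduce the arithmetic of the quoted efficiency figure.
# ASSUMPTION: the interpretation of each operand below is not stated
# in the thread; only the three numbers themselves come from it.
busy_cpu_seconds = 1190260  # assumed: total CPU-seconds of useful work
wall_seconds = 1290         # assumed: elapsed seconds counted for the run
processors = 2048           # assumed: number of processors available


def efficiency(busy, wall, procs):
    """Fraction of the available processor time spent on useful work."""
    return busy / (wall * procs)


print(round(efficiency(busy_cpu_seconds, wall_seconds, processors), 2))
```

The result is only as meaningful as the elapsed time fed into it, which 
is the point at issue with the startup period above.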

> I assume we're paying the same staging price on the output side?

Not really - the stageouts go very fast, and because job completion is 
staggered, they don't all happen at once.

This is the same with most of the large runs I've seen (of any 
application) - stageout tends not to be a problem (or at least, nowhere 
near as much of a problem as stagein).

All stageins happen fairly smoothly over the period t=400 to t=1100. 
There is still rate limiting on file operations (100 max) and file 
transfers (2000 max), and those limits are still being hit.
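
To illustrate what a cap on concurrent operations does, here is a 
minimal sketch of a semaphore-based throttle. It is not Swift's 
implementation; only the two limits (100 and 2000) come from the run 
above. The Throttle class, the stage_in function, the input names and 
the pool size are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Cap the number of operations in flight and record the peak seen."""

    def __init__(self, limit):
        self._sem = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def run(self, fn, *args):
        # Callers block here once `limit` operations are already running.
        with self._sem:
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self.active -= 1


# The two limits are the ones quoted above; the rest is illustrative.
file_ops = Throttle(100)
transfers = Throttle(2000)


def stage_in(name):
    file_ops.run(lambda: None)          # stand-in for a file operation
    return transfers.run(lambda: name)  # stand-in for the file transfer


with ThreadPoolExecutor(max_workers=256) as pool:
    staged = list(pool.map(stage_in, [f"input-{i}" for i in range(1000)]))

print(len(staged), file_ops.peak <= 100, transfers.peak <= 2000)
```

When more stage-ins are ready than a limit allows, the extra ones wait 
at the semaphore, which is what "hitting the limit" amounts to in this 
sketch.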

I think there are two directions to proceed in here that make sense for 
actual use on single clusters running Falkon (rather than trying to cut 
out stuff randomly to push up numbers):

 i) Use some of the data placement features in Falkon, rather than Swift's
    relatively simple data management that was designed more for running
    on the grid.

 ii) Do stage-ins using symlinks rather than file copying. This makes
     sense when everything is living in a single filesystem, which again
     is not what Swift's data management was originally optimised for.

I think option ii) is substantially easier to implement (on the order of 
days) and is generally useful in the single-cluster, local-source-data 
situation that appears to be what people want to do for running on the 
BG/P and SiCortex (that is, pretty much ignoring anything grid-like at 
all).

Option i) is much harder (on the order of months), needing a very 
different interface between Swift and Falkon than exists at the moment.
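
As a rough illustration of option ii), the difference between a 
symlink stage-in and a copy stage-in can be sketched as follows. This is 
not Swift code; the stage_in function, the directory layout and the file 
name are hypothetical, and it assumes source and work directory are 
visible in the same filesystem namespace.

```python
import os
import shutil
import tempfile


def stage_in(source, workdir, use_symlink=True):
    """Place `source` into `workdir` by symlink or copy; return the path."""
    os.makedirs(workdir, exist_ok=True)
    dest = os.path.join(workdir, os.path.basename(source))
    if use_symlink:
        # No data is moved: only a directory entry pointing at the source.
        os.symlink(os.path.abspath(source), dest)
    else:
        # Traditional stage-in: the full file contents are duplicated.
        shutil.copy2(source, dest)
    return dest


# Demo in a throwaway directory (hypothetical layout and file name).
root = tempfile.mkdtemp()
src = os.path.join(root, "shared", "ligand.mol2")
os.makedirs(os.path.dirname(src))
with open(src, "w") as f:
    f.write("example input\n")

linked = stage_in(src, os.path.join(root, "job-0001"))
copied = stage_in(src, os.path.join(root, "job-0002"), use_symlink=False)

print(os.path.islink(linked), os.path.islink(copied))
```

The symlink path costs one metadata operation per input regardless of 
file size, which is why it only makes sense in the single-filesystem 
case.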


