[Swift-devel] notes on how swift implements file input and output

Fri Dec 5 12:46:31 CST 2008

On Thu, 4 Dec 2008, Ian Foster wrote:

> a) Am I correct in assuming that Swift currently will not run on a site that
> does not support a shared file system?

yes.

> b) Can we build on this document to introduce means by which we could make use
> of methods such as bulk transfer of many input files, collective I/O as on
> BG/P, etc.?

I wrote this to motivate discussion for the meeting we had yesterday 
involving myself, mike wilde, hategan, ioan, zhao and allan espinosa which 
was pretty much centered on that topic.

We looked at three specific cases of swift + something else:

 * swift + gLite
 * swift + falkon data diffusion
 * swift + mike/zhao/allan's collective IO work

In all three cases, the modifications necessary to core swift seem fairly 
simple.

In the gLite and falkon data diffusion cases, it seems straightforward to 
change the abstractions a bit so that, while there is still a concept of a 
site shared filesystem, there is no requirement that this be posix 
accessible; instead gLite or falkon data diffusion specific mechanisms can 
be used to move data from the site shared filesystem to the appropriate 
worker node.
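
As a rough illustration of that change in abstraction (not Swift's actual 
code; the names MOVERS, register, posix_copy and to_node are made up here), 
the site-to-node step could be a pluggable "mover" chosen per site. Only a 
posix mover is shown; a gLite or data diffusion mover would be registered 
the same way using its own transfer mechanism, which is not sketched:

```python
import os
import shutil
import tempfile
from typing import Callable, Dict

# A mover takes (path on site storage, destination directory on the node)
# and returns the node-local path. This signature is an assumption made
# for illustration only.
Mover = Callable[[str, str], str]

MOVERS: Dict[str, Mover] = {}


def register(name: str, mover: Mover) -> None:
    """Make a mover available under a site-configurable name."""
    MOVERS[name] = mover


def posix_copy(site_path: str, node_dir: str) -> str:
    """Mover for sites whose shared filesystem is posix-mounted on the node."""
    os.makedirs(node_dir, exist_ok=True)
    dest = os.path.join(node_dir, os.path.basename(site_path))
    shutil.copyfile(site_path, dest)
    return dest


register("posix", posix_copy)


def to_node(mechanism: str, site_path: str, node_dir: str) -> str:
    """Dispatch to whichever mover the site is configured with."""
    return MOVERS[mechanism](site_path, node_dir)
```

The point of the sketch is only that the caller of to_node does not need to 
know whether the site storage is posix accessible.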

In the collective IO work, the new filesystem there exposes itself through 
posix anyway; getting that working with swift seems mostly to be 
integration work rather than new coding.

That being said, the discussion above was mostly about the mechanics of 
plugging the pieces together. The more interesting and harder part 
is likely to be performance characterisation and improvement; for 
collective IO and data diffusion, the goal seems to be improvement over 
traditional file systems rather than merely whether it works.

> c) What are the pros and cons of copying all input and output files twice,
> once to the site, and once to the node. Is this ever a source of overhead?

They're not always copied to the node. In the present case, it is an 
option whether to copy input files entirely to a worker node or to access 
them directly off the site shared filesystem. It's been seen through 
experiment that it can be faster in some cases to copy a file using some 
specialised posix data transfer tool like /bin/cp and then have local 
access to it; conversely, if the input file is large and only small 
parts of it are accessed randomly, then keeping it on the shared file 
system may be a better approach.
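
A minimal sketch of that option, assuming a per-job directory on the worker 
node (this is not Swift's actual wrapper code; stage_input and its 
arguments are hypothetical names):

```python
import os
import shutil
import tempfile


def stage_input(shared_path, job_dir, copy_to_node):
    """Make shared_path available inside job_dir; return the job-local path."""
    os.makedirs(job_dir, exist_ok=True)
    local_path = os.path.join(job_dir, os.path.basename(shared_path))
    if copy_to_node:
        # Pay for one sequential read up front; later reads hit local disk.
        shutil.copyfile(shared_path, local_path)
    else:
        # Leave the data on the shared filesystem; only a link is created,
        # so random reads of small parts never transfer the whole file.
        os.symlink(os.path.abspath(shared_path), local_path)
    return local_path


def demo():
    """Stage the same file both ways and report which results are links."""
    with tempfile.TemporaryDirectory() as root:
        shared = os.path.join(root, "shared", "input.dat")
        os.makedirs(os.path.dirname(shared))
        with open(shared, "wb") as f:
            f.write(b"example input")
        copied = stage_input(shared, os.path.join(root, "job1"), True)
        linked = stage_input(shared, os.path.join(root, "job2"), False)
        return os.path.islink(copied), os.path.islink(linked)
```

Either way the application sees the file under the same job-local name, so 
the choice can be made per site or per file without changing the job.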

Having a site-shared filesystem as part of the abstraction gives a fairly 
straightforward way to handle site-side data caching so that input files 
do not have to be staged in multiple times for multiple jobs; it also 
gives a pretty portable way to get data to a worker node from its stagein 
location that is closely aligned with how traditional grid sites are 
configured.
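
The caching point can be sketched as follows, under the assumption of one 
cache directory per site on the shared filesystem (SiteCache and ensure are 
hypothetical names, and keying the cache by file basename is a 
simplification):

```python
import os
import shutil
import tempfile


class SiteCache:
    """Hypothetical per-site cache directory on the site shared filesystem."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.transfers = 0  # number of real stage-ins performed
        os.makedirs(cache_dir, exist_ok=True)

    def ensure(self, source_path):
        """Stage source_path into the cache unless it is already there."""
        cached = os.path.join(self.cache_dir, os.path.basename(source_path))
        if not os.path.exists(cached):
            shutil.copyfile(source_path, cached)
            self.transfers += 1
        return cached


def demo():
    """Two jobs ask for the same input; count how often it is staged in."""
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "submit_host_input.dat")
        with open(source, "wb") as f:
            f.write(b"shared input")
        cache = SiteCache(os.path.join(root, "site_cache"))
        first = cache.ensure(source)   # job 1: triggers the stage-in
        second = cache.ensure(source)  # job 2: reuses the cached copy
        return first == second, cache.transfers
```

Here the second job finds the file already on the site, so only one 
transfer to the site is made.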

There are pages more that could be written comparing different approaches 
to doing this...

--