[Swift-devel] Clustering and Temp Dirs with Swift

Michael Wilde wilde at mcs.anl.gov
Sun Oct 28 17:15:27 CDT 2007


Regarding this problem of avoiding large directories:

One part of this is taking all the Swift by-product files that are
generated on a per-job basis within a workflow (which Ben has started to
list below) and naming them in a way that spreads them across a
directory tree.

One step seems to be naming files in a way that makes this split easier.

What I'd like to suggest is that we set all the UUID patterns that we
use in Swift from "no-touch-em" properties that we can experiment with.
These can set both the pattern (e.g. nnnnnn or aaaaaa) and whether it's
sequential vs. random, etc.
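
As a rough illustration of what such a property-driven ID source could
look like, here is a minimal Python sketch. The function name, the
parameter names and the reading of 'n' as a digit and 'a' as a lowercase
letter are assumptions made for this example, not existing Swift
properties or code.

```python
import itertools
import random
import string

# Assumed meaning of the pattern characters: 'n' = digit, 'a' = letter.
ALPHABETS = {"n": string.digits, "a": string.ascii_lowercase}


def make_id_generator(pattern="nnnnnn", sequential=True, seed=None):
    """Yield fixed-width IDs for a uniform pattern (all 'n' or all 'a')."""
    if not pattern or len(set(pattern)) != 1 or pattern[0] not in ALPHABETS:
        raise ValueError("pattern must be all 'n' or all 'a'")
    alphabet = ALPHABETS[pattern[0]]
    width = len(pattern)
    if sequential:
        # itertools.product varies the rightmost position fastest,
        # like an odometer: 000000, 000001, 000002, ...
        for chars in itertools.product(alphabet, repeat=width):
            yield "".join(chars)
    else:
        # Random IDs can repeat; a real implementation would have to
        # check for collisions.
        rng = random.Random(seed)
        while True:
            yield "".join(rng.choice(alphabet) for _ in range(width))
```

Switching between make_id_generator("nnnnnn") and
make_id_generator("aaaaaa", sequential=False) would then be a one-line
configuration change to experiment with.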

This makes me ask what naming strategy we use for jobs and kickstart 
records:
angle4-mtlivaji-kickstart.xml
angle4-ntlivaji-kickstart.xml
angle4-otlivaji-kickstart.xml
angle4-ptlivaji-kickstart.xml
angle4-qtlivaji-kickstart.xml

Why are these jobnames differing in the leftmost character of the uuid 
instead of the rightmost? I never paid attention to this till I started 
thinking about the dir hashing Ben suggests. I think that most hashes, 
unless the file names are random, need to be aware of which end of the 
name is varying fastest.
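
To make the "which end varies fastest" point concrete, here is a small
Python sketch. It does not reproduce how Swift actually generates these
names (that is not known here); it only assumes a base-26 counter
written either least-significant letter first or last, and counts how
many directories a two-character prefix split would use for 100
consecutive jobs.

```python
import string

LETTERS = string.ascii_lowercase


def encode(n, width=8, fastest="right"):
    """Encode n in base 26 at fixed width; `fastest` is the end that changes first."""
    digits = []
    for _ in range(width):
        n, r = divmod(n, 26)
        digits.append(LETTERS[r])  # collected least-significant first
    if fastest == "left":
        return "".join(digits)
    return "".join(reversed(digits))


def prefix_dirs(ids, k=2):
    """Directories used when the first k characters name the directory."""
    return {job_id[:k] for job_id in ids}


left = [encode(n, fastest="left") for n in range(100)]
right = [encode(n, fastest="right") for n in range(100)]
print(len(prefix_dirs(left)), len(prefix_dirs(right)))  # prints "100 1"
```

With the leftmost character varying fastest, a prefix split scatters
consecutive jobs over as many directories as there are jobs; with the
rightmost varying fastest, the same split keeps them together in one
directory until the prefix rolls over.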

If these were numeric patterns, it would be easy to, e.g., put 100 files
per dir by taking, say, the leftmost 6 characters and making that a
dirname within which the rightmost 2 chars would vary:

tlivaj/angle4-tlivajim-kickstart.xml
tlivaj/angle4-tlivajin-kickstart.xml
tlivaj/angle4-tlivajio-kickstart.xml
tlivaj/angle4-tlivajip-kickstart.xml
tlivaj/angle4-tlivajiq-kickstart.xml

but easier on my eyes would be:
000000/angle4-00000001-kickstart.xml
000000/angle4-00000002-kickstart.xml
...
000000/angle4-00000099-kickstart.xml
...
000020/angle4-00002076-kickstart.xml
etc.

This makes splitting based on powers of 10 (or 26 or 36) trivial. Other 
splits can be done with mod() functions.
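
A minimal Python sketch of both kinds of split, assuming zero-padded
sequential job numbers as in the listing above. The function names and
the choice of 64 directories for the mod() variant are made up for this
example.

```python
def kickstart_path(prefix, n, width=8, tail=2):
    """Power-of-10 split: the directory is the ID minus its last `tail` digits."""
    job_id = f"{n:0{width}d}"
    return f"{job_id[:-tail]}/{prefix}-{job_id}-kickstart.xml"


def mod_split_path(prefix, n, ndirs=64, width=8):
    """Arbitrary split: the directory is the job number modulo `ndirs`."""
    job_id = f"{n:0{width}d}"
    dir_width = len(str(ndirs - 1))  # pad so directory names sort evenly
    return f"{n % ndirs:0{dir_width}d}/{prefix}-{job_id}-kickstart.xml"


print(kickstart_path("angle4", 2076))  # 000020/angle4-00002076-kickstart.xml
print(mod_split_path("angle4", 2076))  # 28/angle4-00002076-kickstart.xml
```

The prefix variant keeps consecutive jobs in the same directory (100 per
directory with tail=2), while the mod() variant spreads consecutive jobs
round-robin over a fixed number of directories.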

Can we start heading in this or some similar direction?

We need to coordinate a plan for this, I suspect, to make Andrew's 
workflows perform acceptably.

- Mike



On 10/27/07 2:08 PM, Ben Clifford wrote:
> 
> On Sat, 27 Oct 2007, Mihael Hategan wrote:
> 
>> Quickly before I leave the house:
>> Perhaps we could try copying to local FS instead of linking from shared
>> dir and hence running the jobs on the local FS.
> 
> Maybe. I'd be suspicious that doesn't reduce access to the directory too 
> much.
> 
> I think the directories where there are lots of files being read/written 
> by lots of hosts are:
> 
> the top directory (one job directory per job)
> the info directory
> the kickstart directory
> the file cache
> 
> In the case where directories get too many files in them because of 
> directory size constraints, it's common to split that directory into many 
> smaller directories (e.g. how squid caching or git object storage works). 
> E.g., given a file fubar.txt, store it in fu/fubar.txt, with 'fu' being some 
> short hash of the filename (with the hash here being 'extract the first 
> two characters').
> 
> Pretty much I think Andrew wanted to do that for his data files anyway, 
> which would then reflect in the layout of the data cache directory 
> structure.
> 
> For job directories, it may not be too hard to split the big directories 
> into smaller ones. There will still be write-lock conflicts, but this 
> might mean the contention for each directory's write-lock is lower.
> 
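
The fu/fubar.txt scheme from the quoted text can be sketched in a few
lines of Python. The second function is an assumed variation, not
something proposed in this thread: it takes the directory from a digest
of the name rather than from the name itself, loosely in the spirit of
git's two-hex-character object directories, so the spread does not
depend on which end of the name varies.

```python
import hashlib


def first_chars_path(filename, k=2):
    """'Hash' = first k characters of the name: fubar.txt -> fu/fubar.txt."""
    return f"{filename[:k]}/{filename}"


def digest_path(filename, k=2):
    """Directory = first k hex characters of the SHA-1 of the file name."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    return f"{digest[:k]}/{filename}"


print(first_chars_path("fubar.txt"))  # fu/fubar.txt
print(digest_path("angle4-mtlivaji-kickstart.xml"))
```

The first form keeps the layout readable and predictable; the second
gives up to 256 directories with k=2 and an even spread regardless of
the naming pattern, at the cost of having to recompute the digest to
find a file.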


