[Swift-devel] Use case and examples needed to avoid large directories

Michael Wilde wilde at mcs.anl.gov
Fri Sep 28 18:32:09 CDT 2007


Andrew Jamieson reviewed the needs of his application with me today and 
we noted the following:

When run under VDS, a showstopper problem for TeraPort, which is running 
GPFS, was that too many files needed to be created in a single output 
directory. The observed behavior was that once parallel jobs had placed 
more than about 200 files in a single output directory, file creation 
became so slow that the overall workflow speed was 
badly impacted. I don't know if that's GPFS in general or TeraPort in 
particular, but in the old VDS days we saw the same behavior for GADU 
workflows on Jazz, at the same low threshold of files per directory.

I mentioned the large-number-of-files-per-directory problem to Mihael. 
He says it is "already solved": if you break your input data up across 
multiple directories, the temp directories on the execution nodes that 
hold that data will have the same structure.
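
As a rough illustration of what breaking the input data up could look 
like, here is a minimal sketch in Python. It is not Swift mapper code, 
and it says nothing about how Swift itself lays out directories. The 
names sharded_path, root, and files_per_dir are hypothetical, and the 
default of 200 is simply the threshold observed above.

```python
from pathlib import PurePosixPath


def sharded_path(root, index, name, files_per_dir=200):
    """Return root/<bucket>/<name> for the index-th file.

    Consecutive indices are grouped into numbered bucket directories so
    that no single directory receives more than files_per_dir files.
    All names here are hypothetical, chosen for illustration only.
    """
    if files_per_dir < 1:
        raise ValueError("files_per_dir must be at least 1")
    bucket = index // files_per_dir  # 0 for files 0..199, 1 for 200..399, ...
    return PurePosixPath(root) / f"{bucket:04d}" / name


# Example: lay out 1000 input files across bucket directories.
paths = [sharded_path("input", i, f"f{i:05d}.dat") for i in range(1000)]
```

With this layout, 1000 files land in five directories of 200 files 
each, instead of one directory of 1000.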

I'd like to ask about this in a bit more detail.

Do we still need some "magic" in the mapper to make sure that 
intermediate and output files are similarly structured?

Is there a description anywhere in the Swift docs of how data caching, 
file naming, temporary dir creation, and data transfer are handled in 
Swift, and how properties and mappers affect things?  Ben, as you work 
on the mapper text and tutorial examples, is this a good section to 
document that in?

- Mike



