[Swift-devel] lots of very small files vs gridftp

Ben Clifford benc at hawaga.org.uk
Thu Sep 25 08:10:04 CDT 2008


Here are some notes about lots of very small files vs gridftp:

The CNARI workflow that skenny works on has lots of very small files, 
where "very" in this context means smaller than the GridFTP Lots Of Small 
Files work is designed to handle.

In the CNARI runs that we've been making recently, there are 65535 input 
files, each of roughly a kilobyte, and a corresponding number of output 
files, each of around 10 bytes.

The limiting factor in these runs at the moment is staging of files - from 
UC to Ranger, throughput appears to be around 7 files per second, which is 
quite poor.

Buzz and I did some informal measuring of very small file transfer using 
gt4.2 globus-url-copy between communicado and the UC/ANL TeraGrid site to 
get a feel for what to expect.

To transfer 1000 files:

   # concurrent connections  |   duration of copy (seconds, multiple runs)
                       16          7, 16, 16
                        4         14, 14, 14
                        2         26, 25
                        1         48, 52

Assuming (perhaps incorrectly) that 65k files would take 65 x that, then 
transferring 65k files would take 455 ( = 7 * 65) seconds using the best 
result above.
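
The extrapolation can be reproduced for every row of the table above with a 
short Python sketch. It assumes, as the estimate itself does (perhaps 
incorrectly), that transfer time scales linearly with file count; the name 
extrapolate and the constants are illustrative only.

```python
# Measured durations (seconds) for copying 1000 files, keyed by the number
# of concurrent connections, copied from the table above.
measurements = {
    16: [7, 16, 16],
    4: [14, 14, 14],
    2: [26, 25],
    1: [48, 52],
}

FILES_PER_RUN = 1000
TOTAL_FILES = 65_000  # "65k", the rounded figure used in the estimate


def extrapolate(duration_s, files_per_run=FILES_PER_RUN, total_files=TOTAL_FILES):
    """Linearly scale one measured duration to the full file count (assumption)."""
    return duration_s * total_files / files_per_run


for connections, durations in sorted(measurements.items(), reverse=True):
    best = min(durations)            # most optimistic run for this setting
    rate = FILES_PER_RUN / best      # files per second in that run
    print(f"{connections:>2} connections: best {best:>2} s per 1000 files, "
          f"{rate:6.1f} files/s, ~{extrapolate(best):.0f} s for 65k files")
```

Using the best run for each setting gives the most optimistic figure; the 
two 16-second runs at 16 connections would extrapolate to more than twice 
the 455 seconds quoted.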

To transfer a single 65 MB file between the two sites takes 9 seconds.

So from a raw transfer perspective, sending the data as a single GridFTP 
transfer rather than as separate files is far faster: 9 seconds versus an 
estimated 455.

However, there is some (possibly large) file system overhead at both ends, 
as 65000 file opens can take some time. Tarring up 65000 files of 1k each 
took around 60 seconds when Buzz tried it.
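
As a rough illustration of the bundling step, here is a minimal Python 
sketch that packs every file in one directory into a single uncompressed 
tar archive, which could then be moved as one GridFTP transfer. This is not 
what Buzz ran (presumably the tar command); the function name 
bundle_directory and the paths in the usage note are hypothetical.

```python
import tarfile
from pathlib import Path


def bundle_directory(input_dir, archive_path):
    """Pack each regular file directly inside input_dir into one tar archive.

    The archive is uncompressed (mode "w"). Members are stored under their
    bare file names. Returns the number of files added.
    """
    count = 0
    with tarfile.open(archive_path, "w") as archive:
        for path in sorted(Path(input_dir).iterdir()):
            if path.is_file():
                # Each add() opens and reads one input file, so the per-file
                # filesystem overhead is still paid once per file here.
                archive.add(path, arcname=path.name)
                count += 1
    return count
```

For example, bundle_directory("inputs", "inputs.tar") would produce one 
archive from a hypothetical inputs directory. The per-file opens still 
happen on the sending side, and again when unpacking on the receiving side, 
so bundling moves the overhead rather than removing it.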

I also haven't investigated the Ranger filesystem performance. I'm hoping 
to get some wrapper logs from a run today to see what is happening there. 
The remote filesystem on Ranger is Lustre, which I have minimal experience 
with; however, the input files for the CNARI runs are laid out in a way 
that would almost certainly cause trouble if the shared space were GPFS 
(in that they are all in a single directory). Results of investigating 
this should be available in a day or so.
