[Swift-devel] lots of very small files vs gridftp
Ben Clifford
benc at hawaga.org.uk
Thu Sep 25 08:10:04 CDT 2008
Here are some notes about lots of very small files vs gridftp:
The CNARI workflow that skenny works on has lots of very small files,
where "very" in this context means smaller than the GridFTP Lots Of Small
Files work likes to handle.
In the CNARI runs that we've been making recently, there are 65535 input
files, each roughly a kilobyte, and a corresponding number of output
files each of around 10 bytes.
The limiting factor in these runs at the moment is staging of files - from
UC to Ranger, throughput appears to be around 7 files per second, which is
quite poor.
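
To put that rate in perspective, here is a rough back-of-envelope sketch in
Python. It assumes the 7 files per second rate stays constant for the whole
run and counts only the input files; the function name is just for
illustration.

```python
# Figures taken from the message; the calculation itself is only an estimate
# that assumes a constant staging rate for the whole run.
NUM_FILES = 65535          # input files in a CNARI run
FILES_PER_SECOND = 7       # observed UC -> Ranger staging throughput

def staging_seconds(num_files, files_per_second):
    """Time to stage num_files at a constant rate, in seconds."""
    return num_files / files_per_second

seconds = staging_seconds(NUM_FILES, FILES_PER_SECOND)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
```

Under those assumptions the inputs alone take a bit over two and a half
hours to stage, before any output files come back.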
Buzz and I did some informal measuring of very small file transfer using
gt4.2 globus-url-copy between communicado and the UC/ANL TeraGrid site to
get a feel for what to expect.
To transfer 1000 files:

  # concurrent connections | duration of copy (seconds, multiple runs)
  -------------------------+------------------------------------------
                        16 | 7, 16, 16
                         4 | 14, 14, 14
                         2 | 26, 25
                         1 | 48, 52
Assuming (perhaps incorrectly) that 65k files would take 65 x that, then
transferring 65k files would take 455 ( = 7 * 65) seconds using the best
result above.
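
The same extrapolation can be applied to every row of the measurements, not
just the best one. This is a minimal sketch that assumes, as above, that time
scales linearly with file count (which may well be wrong); the names are made
up for illustration.

```python
# Durations (seconds) to copy 1000 files, keyed by number of concurrent
# connections, as reported in the measurements above.
MEASURED = {16: [7, 16, 16], 4: [14, 14, 14], 2: [26, 25], 1: [48, 52]}
SCALE = 65  # 65k files / 1000 files, assuming linear scaling (may not hold)

def extrapolate(measured, scale):
    """Return {connections: (fastest_run * scale, slowest_run * scale)}."""
    return {n: (min(runs) * scale, max(runs) * scale)
            for n, runs in measured.items()}

estimates = extrapolate(MEASURED, SCALE)
for n, (best, worst) in estimates.items():
    print(f"{n:>2} connections: {best} to {worst} s")
```

Note that the 455 second figure comes from the single 7 second run at 16
connections; the other two runs at that setting took 16 seconds, which would
scale to 1040 seconds.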
Transferring a single 65 MB file between the two sites takes 9 s.
So from a raw transfer perspective, sending the data as a single GridFTP
transfer rather than as separate files is roughly 50 times faster (9 s
versus 455 s).
However, there is some (possibly large) file system overhead at both ends,
as 65000 file opens can take some time. Tarring up 65000 files of 1k each
took around 60 seconds when Buzz tried it.
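
For anyone wanting to try the bundling step themselves, here is a minimal
sketch using Python's standard tarfile module. It is not what Buzz ran; the
file names, sizes, and function names are invented for illustration, and the
demo uses a small file count so that it finishes quickly.

```python
import os
import tarfile
import tempfile

def bundle_directory(src_dir, tar_path):
    """Pack every regular file in src_dir into one uncompressed tar archive."""
    count = 0
    with tarfile.open(tar_path, "w") as tar:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):
                tar.add(path, arcname=name)  # store a flat name, no directory prefix
                count += 1
    return count

def demo(num_files=100):
    """Create num_files dummy 1 KB files (hypothetical names), tar them, verify."""
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "inputs")
        os.mkdir(src)
        for i in range(num_files):
            with open(os.path.join(src, f"input_{i:05d}.dat"), "wb") as f:
                f.write(b"x" * 1024)
        tar_path = os.path.join(work, "inputs.tar")
        packed = bundle_directory(src, tar_path)
        with tarfile.open(tar_path) as tar:
            names = tar.getnames()
        return packed, len(names)

if __name__ == "__main__":
    print(demo())
```

Wrapping the bundle_directory call in a timer on the real input directory
would show how much of the 60 seconds is file system overhead on a given
machine.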
I also haven't investigated the Ranger filesystem performance. I'm hoping
to get some wrapper logs from a run today to see what is happening there.
The remote filesystem on Ranger is Lustre, with which I have minimal
experience; however, the input files for the CNARI runs are laid out in a way
that would almost definitely cause trouble if the shared space was GPFS
(in that they are all in a single directory). Results of investigating
this should be available in a day or so.