[Swift-devel] Swift and BGP

Mihael Hategan hategan at mcs.anl.gov
Mon Oct 26 11:56:49 CDT 2009


I've been playing with Swift on the BGP the past few days.

My observation is that with the current Swift and a reasonable mapping
(such as the one done by the concurrent mapper), filesystem slowness does
not seem to be caused by contended access to the same file/directory.
Instead, it's the sheer number of network filesystem requests, which come
from a few sources:

- bash: every time bash forks a process, it closes the current script
file; after the process is done, it re-opens the script, seeks to the
position it left off at, and reads the next command. A typical script
forks a few processes, and our wrapper script is invoked from a
networked FS.
- the wrapper info file: about 30 requests per run
- swift-specific fs access (what the wrapper is meant to do)
- application fs requests

At 100 jobs/s, the wrapper alone causes about 10000 requests/s to the FS
server (on the order of 100 filesystem operations per invocation).

I suspect that what Allan observed, the moving of the output files being
slow, is a coincidence. I did a run which showed that for jobs towards
the start of the run, the operations towards the end of the wrapper
execution are slow, while for jobs towards the end of the run, the first
part of the wrapper runs slower. This is likely due to ramp-up and
ramp-down. I wanted to plot that, but the BGP is down today, so it will
have to wait.

The solution is to have things on the node-local FS. Ben already added
some code to do that. I changed that a bit and also moved the info file
to the scratch FS (unless the user requests that the info stay on NFS in
order to get progressive results for debugging purposes). A scratch
directory different from the work directory is used whenever the user
specifies <scratch>dir</scratch> in sites.xml.
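For example, a site entry roughly like the one below (the handle, paths,
and execution settings are made-up placeholders for illustration; the
<scratch> element is the point here) keeps the shared work directory
where it is but sends per-job scratch files to the node-local
filesystem:

  <pool handle="bgp">
    <execution provider="coaster" jobmanager="local:cobalt"/>
    <filesystem provider="local"/>
    <!-- shared work directory, still on the networked FS -->
    <workdirectory>/path/to/shared/swiftwork</workdirectory>
    <!-- node-local scratch; wrapper and info files go here instead -->
    <scratch>/path/to/node/local/scratch</scratch>
  </pool>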

Another thing is using provider job status notifications instead of
status files when using coasters or Falkon.

With coasters, a scratch FS, and provider status, I empirically
determined that an average throughput of 100 jobs/s is something that
the system (Swift + coasters) can sustain well, provided that Swift
tries to keep the number of jobs submitted to the coaster service at
about twice the number of workers. I tested this with 6000 workers and
60-second jobs. I will post the plots shortly.

So here's how one would go about this on intrepid (a sites.xml sketch
follows the list):
- determine the maximum number of workers (avg-exec-time * 100, since
the sustainable rate is about 100 jobs/s)
- set nodeGranularity to 512 nodes, with 4 workers per node. Also set
maxWorkers to 512 so that only 512-node blocks are requested. For some
reason 512-node partitions start almost instantly (even if you have 6 of
them), while 1024-node partitions you have to wait for.
- set the total number of blocks (the "slots" parameter) to
no-of-workers/2048 (each block is 512 nodes * 4 workers = 2048 workers)
- set jobThrottle to 2*no-of-workers/100, which keeps about twice the
number of workers submitted, as above
- make sure you also have foreach.max.threads set to 2*no-of-workers
(though that depends on the structure of the program)
- run on login6. There is no point in using the normal login machines
since they have a limit of 1024 file descriptors per process.
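As a concrete sketch, assuming 6000 workers and 60-second jobs (the
profile keys below simply reuse the parameter names from the list; the
exact key names and namespaces may differ between Swift versions, and
the paths are placeholders):

  <pool handle="intrepid">
    <execution provider="coaster" jobmanager="local:cobalt"/>
    <filesystem provider="local"/>
    <!-- 6000 workers ~ 100 jobs/s * 60 s average execution time -->
    <profile namespace="globus" key="workersPerNode">4</profile>
    <profile namespace="globus" key="nodeGranularity">512</profile>
    <profile namespace="globus" key="maxWorkers">512</profile>
    <!-- slots = 6000 / (512 nodes * 4 workers) ~ 3 blocks -->
    <profile namespace="globus" key="slots">3</profile>
    <!-- jobThrottle = 2 * 6000 / 100 = 120 -->
    <profile namespace="karajan" key="jobThrottle">120</profile>
    <workdirectory>/path/to/shared/swiftwork</workdirectory>
    <scratch>/path/to/node/local/scratch</scratch>
  </pool>

foreach.max.threads (2 * 6000 = 12000 in this case) would go in the
Swift properties rather than in sites.xml.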

I will actually code an XML element for sites.xml to capture this
without that much pain.

There is eventually a hard limit of (a bit less than) 65536 workers, I
think. This is because each TCP connection from a worker requires a
local port on the coaster service side, and there's a limit of 2^16 on
those. This could eventually be addressed by having proxies on the I/O
nodes or something.

On intrepid (as opposed to surveyor) the default queue won't accept
1-node jobs, so the cleanup job at the end of a run will fail with a
nice display of a stack trace, and you will have to clean up the work
directory manually.



