[Swift-devel] Re: metrics

Ben Clifford benc at hawaga.org.uk
Tue Apr 28 07:31:22 CDT 2009


(I added swift-devel to this because others might read and comment.)

On Fri, 24 Apr 2009, Jon Roelofs wrote:

> What kind of hardware is this typically run on?  About how many machines
> would there be?  Are there multicore machines? If so, would swift treat a
> quad core as if it were 4 sites? I have the idea that swift is targeted
> somewhere in-between a dedicated supercomputer and something like what
> Berkeley is doing with BOINC, but more on the dedicated side of things.  Is
> that a fair assumption?

There is quite a lot of variation in what Swift gets run on.

The traditional target platform is something like the Open Science Grid, 
which has multiple sites, all of which vary significantly from each other 
in characteristics (in terms of node count and network connectivity).

In there, each site consists of multiple compute cores (either in the same 
machine or in a single cluster with a shared file system). The sites are 
generally dedicated to this use, but not to a particular user; so when 
Swift wants to run jobs on those sites, those jobs usually must sit in a 
queue for some period before execution, sometimes for a very long time 
(for example, I've seen one cluster pretty much fully loaded for 2 weeks 
solid, whilst other sites will run jobs within seconds).

From a site selection and scheduling perspective, the difficulty is 
mostly in choosing the right site for a job (what does "the right site" 
mean? how do we determine that? what happens when we choose the wrong 
site?)

On TeraGrid, where individual sites are quite large, users tend to run 
only on one site at once, so site selection issues are less important. 
However, part of the motivation to run on only one site is because of 
deficiencies in site selection (for example, the data affinity stuff that 
we have talked about is intended to solve a problem which causes a lot of 
data transfer).

On TeraGrid people are commonly running with the coaster execution 
provider, which allocates nodes separately from submitting jobs to them. 
So there is the possibility (not used at the moment) to do some scheduling 
based on the knowledge that those nodes will likely be dedicated to a 
particular Swift run for some duration.

A third mode of Swift usage is on something like the Blue Gene/P. There, 
the machine has enough cores that when used with Swift it is regarded as 
several sites, each site being some chunk of the machine. So then site 
selection issues start coming back into play again. On BG/P people run 
through Falkon, which, similar to coasters, separates out allocation of 
compute nodes from assigning tasks to those nodes.

In all of the above, it is often the case that the expected duration of a 
job is not known; so it is hard to do very tight planning ahead of time.

The various execution systems and sites have very different performance 
characteristics, in terms of how many jobs can be sent to a site at once, 
and how much time overhead each job has.
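Because job durations are unknown and per-site overheads vary so much, 
one common way to cope is to adapt the number of jobs in flight from 
observed outcomes rather than plan ahead. A minimal sketch, assuming an 
AIMD-style (additive-increase, multiplicative-decrease) per-site cap; 
the class and its parameters are hypothetical, not Swift's actual 
throttling code:

```python
class SiteThrottle:
    """Cap on concurrent jobs at one site, adapted from feedback.

    Grow the limit gently while jobs succeed; cut it sharply when a
    job fails, since failures often indicate an overloaded or broken
    site rather than a problem with one job.
    """
    def __init__(self, initial=2, maximum=256):
        self.limit = initial
        self.maximum = maximum

    def on_success(self):
        # Additive increase, bounded above.
        self.limit = min(self.limit + 1, self.maximum)

    def on_failure(self):
        # Multiplicative decrease, never below one job.
        self.limit = max(1, self.limit // 2)

t = SiteThrottle()
for _ in range(10):
    t.on_success()
print(t.limit)   # 12
t.on_failure()
print(t.limit)   # 6
```

This sidesteps the need for accurate duration estimates: a slow or 
heavily queued site simply produces feedback more slowly and so never 
ramps up.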

> Is there a way to specify that certain jobs need sites with specific 
> hardware/software requirements?  For example, maybe one of the apps 
> needs hardware that can run CUDA code, so it doesn't make sense to send 
> this job to a site without a GPGPU.

The tc.data file lists, for each site, which applications are available 
there. At the moment that mostly reflects whether an application is 
installed at a site, rather than hardware requirements.
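For reference, tc.data is a whitespace-separated table, one line per 
(site, application) pair. The site names and paths below are made up for 
illustration:

```
# site     transformation  path                  type       platform        profiles
localhost  echo            /bin/echo             INSTALLED  INTEL32::LINUX  null
sitea      myapp           /usr/local/bin/myapp  INSTALLED  INTEL32::LINUX  null
```

A CUDA-style hardware constraint could in principle be expressed by only 
listing the application against GPU-equipped sites, though that conflates 
"installed" with "able to run".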
