[Swift-devel] several alternatives to design the data management system for Swift on SuperComputers

Ben Clifford benc at hawaga.org.uk
Mon Dec 1 16:45:55 CST 2008


On Mon, 1 Dec 2008, Zhao Zhang wrote:

> Scenario 1: Say a computation has 2 stages, and the 2nd stage takes
> the output of the 1st stage as its input data.
> 
> Data flow in the current Swift system: the 1st stage writes its output
> to GPFS, and Swift knows that this output is the input for the 2nd
> stage. Swift then sends the 2nd-stage task to a worker on a CN.
> 
> Desired data flow: the 1st stage knows that its output will be used as
> the input for the next stage, so the data is not copied back to GPFS;
> the 2nd-stage task then arrives and consumes it in place.
> 
> Key issue: the 2nd-stage task has no idea where the 1st-stage output
> data is.
> 
> Design alternatives:
> 1. Data-aware task scheduling:
>    Both Swift and Falkon need to be data aware. Swift should know
>    where the output of the 1st stage is, i.e. which pset (which Falkon
>    service), and the Falkon service should know which CN has the data
>    for the 2nd-stage computation.

Swift *is* data aware. However, it models things at the site level, not 
at a worker-node level. This is already true at the moment:

> Swift should know where the output of the 1st stage is, i.e. which
> pset (which Falkon service).

There was talk before of having some data affinity in the Swift 
scheduler, which would mean that jobs would prefer (but perhaps not be 
guaranteed) to run on a site which already had their input data. I 
don't know if anyone did any coding towards this - I haven't seen an 
implementation.

In the pset = site case, which is how BG/P is being used at the moment, 
this would at least tend to keep execution on the same pset as the data 
it depends on.

At the moment, Falkon doesn't know about the input and output files of 
Swift jobs, so it can't act on that information to influence its 
scheduling.
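
To make the affinity idea concrete, here is a rough Python sketch of 
what a site-level preference could look like in the site-selection 
step. Nothing like this exists in Swift today; the names (affinity 
scores, cached file sets, the bonus weight) are all invented for 
illustration.

    # Hypothetical sketch only -- none of these names exist in Swift.
    # A "site" is a name plus the set of files known to be staged there;
    # a "job" is a name plus the set of files it reads.

    def affinity_score(site, job, base_score, bonus_per_file=1.0):
        # Count how many of the job's inputs are already at the site and
        # add a bonus on top of whatever load-based score already exists.
        present = len(job["inputs"] & site["cached"])
        return base_score + bonus_per_file * present

    def pick_site(sites, job, base_scores):
        # Highest score wins: affinity is a preference, not a guarantee,
        # so a lightly loaded site without the data can still be chosen.
        return max(sites, key=lambda s: affinity_score(
            s, job, base_scores[s["name"]]))

    sites = [{"name": "pset-1", "cached": {"stage1.out"}},
             {"name": "pset-2", "cached": set()}]
    job = {"name": "stage2", "inputs": {"stage1.out"}}
    print(pick_site(sites, job, {"pset-1": 0.5, "pset-2": 0.6})["name"])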


> 2. Swift batches jobs vertically
>    Before sending out any jobs, Swift knows that the two stages have a
>    data dependency, so it sends them out as one batched job to a
>    single worker.

VDS had some clustering capability like this. It seems quite interesting 
to think about.
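
To illustrate what batching the two stages vertically could look like 
at the worker end, here is a hypothetical Python sketch of a wrapper 
that runs both stages on one node and keeps the intermediate file on 
local scratch; the command names and paths are invented.

    # Hypothetical sketch of a vertically clustered job. Both stages run
    # on the same worker; the intermediate file never touches shared
    # storage and only the final result goes back to GPFS. Command names
    # and paths are invented for illustration.
    import os, shutil, subprocess, tempfile

    def run_clustered(stage1_cmd, stage2_cmd, final_name, gpfs_dir):
        workdir = tempfile.mkdtemp(prefix="cluster-")  # node-local scratch
        intermediate = os.path.join(workdir, "stage1.out")
        final_local = os.path.join(workdir, final_name)
        try:
            # Stage 1 writes its output locally; it is never staged out.
            subprocess.check_call(stage1_cmd + ["-o", intermediate])
            # Stage 2 reads the intermediate file straight from local disk.
            subprocess.check_call(stage2_cmd + ["-i", intermediate,
                                                "-o", final_local])
            # Only the final output is copied back to shared storage.
            shutil.copy(final_local, os.path.join(gpfs_dir, final_name))
        finally:
            shutil.rmtree(workdir, ignore_errors=True)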

In the multilevel scheduling case, where Swift schedules jobs between 
sites and Falkon schedules jobs within a site, having Falkon do some 
kind of data-affinity scheduling within the site would also perhaps be 
interesting. Clustering jobs ahead of time can perhaps reduce 
performance (according to the claims that running through a resource 
provisioner is better than clustering ahead of time), so doing the 
clustering dynamically might be interesting.
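
For Falkon to do that, it would need some record of which worker 
produced which file. Here is a hypothetical Python sketch of that 
bookkeeping - as far as I know nothing like this exists in Falkon, and 
the class and method names are invented.

    # Hypothetical sketch of worker-level affinity inside one pset. The
    # dispatcher remembers which CN produced each file and routes a task
    # to a node that already holds one of its inputs, falling back to an
    # arbitrary node otherwise.
    import random

    class AffinityDispatcher:
        def __init__(self, nodes):
            self.nodes = list(nodes)
            self.file_location = {}          # filename -> node id

        def record_output(self, filename, node):
            self.file_location[filename] = node

        def dispatch(self, task_inputs):
            for f in task_inputs:
                if f in self.file_location:
                    return self.file_location[f]
            # No input is cached anywhere: pick any node (random here,
            # standing in for whatever load balancing already happens).
            return random.choice(self.nodes)

    d = AffinityDispatcher(["cn0", "cn1", "cn2"])
    d.record_output("stage1.out", "cn1")
    print(d.dispatch(["stage1.out"]))        # -> cn1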

The difference between 1 and 2 above seems similar to the clustering vs. 
provisioning distinction.

> 3. Collective IO
>    Build a shared file system which can be accessed by all CNs.
>    Instead of writing output data to GPFS, workers copy intermediate
>    output data to this shared ram-disk and retrieve it from the IFS.
> 
>    Several concerns:
>    a) reliability of the torus network --- we need to test this more.
>    b) performance of the torus network --- can this really perform
>       better than GPFS? If not, at what scale could the torus perform
>       better than GPFS?

As phrased above, this seems a little strange:

"rather than use a shared file system, lets build a shared file system and 
use it."

Do you mean building a general-purpose POSIX shared file system? If so, 
that seems quite hard, and it puts you directly in competition with 
PVFS and GPFS - a competition which you are pretty much guaranteed to 
lose.

It may be that you mean something quite different - your concerns about 
the torus network seem unrelated to writing a POSIX fs, so I think that 
may be the case (or maybe you are over-specialising your concerns).
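
If what you mean is the narrower thing - routing only intermediate 
files through a ram-disk while final results still go to GPFS - then 
the per-file decision is small. A hypothetical Python sketch, with 
invented paths (/dev/shm stands in for whatever ram-disk the IFS would 
expose):

    # Hypothetical sketch: pick a destination for a stage's output based
    # on whether its consumer is known to run within the same pset. The
    # paths are invented; /dev/shm stands in for the IFS ram-disk and
    # GPFS_DIR for the usual shared scratch area.
    import os, shutil

    GPFS_DIR = "/gpfs/scratch/run"      # invented path
    IFS_DIR = "/dev/shm/ifs"            # invented path

    def stage_out(local_path, consumed_within_pset):
        # Intermediate data whose consumer stays in the pset goes to the
        # ram-disk; everything else goes to GPFS so that it survives the
        # run and is visible from every site.
        dest_dir = IFS_DIR if consumed_within_pset else GPFS_DIR
        os.makedirs(dest_dir, exist_ok=True)
        dest = os.path.join(dest_dir, os.path.basename(local_path))
        shutil.copy(local_path, dest)
        return dest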

-- 


