[Swift-devel] several alternatives to design the data management system for Swift on SuperComputers

Mon Dec 1 16:15:54 CST 2008

Hi, All

The following alternatives summarize a talk between Mike and Zhao. We are
trying to optimize data IO performance for Swift on supercomputers,
including BGP, Ranger, and possibly Jaguar. The goal is to eliminate all
unnecessary data IO between stages of computation.

Scenario 1: Say a computation has 2 stages, and the 2nd stage takes the
output of the 1st stage as its input data.

Data Flow in the current Swift system: the 1st stage writes its output data
to GPFS, and Swift knows that this output data is the input for the 2nd
stage. Swift then sends the 2nd stage task to a worker on a CN.

Desired Data Flow: the 1st stage knows that its output data will be used as
the input for the next stage, so the data is not copied back to GPFS. The
2nd stage task then arrives and consumes this data.

Key Issue: the 2nd stage task does not know where the 1st stage output
data is.
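
To make the key issue concrete, here is a toy model in Python (only for
illustration, not Swift or Falkon code). The names run_stage1, run_stage2
and the "out1" file are made up, and the CN-local store and GPFS are
modeled as plain dictionaries.

```python
def run_stage1(cn, local_store, gpfs, keep_local):
    """Run stage 1 on `cn` and store its output (hypothetical model)."""
    output = "stage1-output"
    if keep_local:
        local_store[cn]["out1"] = output  # desired flow: data stays on the CN
    else:
        gpfs["out1"] = output             # current flow: data goes to GPFS


def run_stage2(cn, local_store, gpfs):
    """Return where stage 2, running on `cn`, finds its input."""
    if "out1" in local_store[cn]:
        return "local"
    if "out1" in gpfs:
        return "gpfs"
    return "missing"


def demo():
    """Run the three cases and report where stage 2 found its input."""
    results = {}

    # Current flow: stage 1 on cn0 writes to GPFS, stage 2 may run anywhere.
    local_store, gpfs = {"cn0": {}, "cn1": {}}, {}
    run_stage1("cn0", local_store, gpfs, keep_local=False)
    results["current"] = run_stage2("cn1", local_store, gpfs)

    # Desired flow, and stage 2 happens to land on the same CN.
    local_store, gpfs = {"cn0": {}, "cn1": {}}, {}
    run_stage1("cn0", local_store, gpfs, keep_local=True)
    results["desired_same_cn"] = run_stage2("cn0", local_store, gpfs)

    # Desired flow, but stage 2 lands on another CN: the key issue.
    results["desired_other_cn"] = run_stage2("cn1", local_store, gpfs)
    return results
```

In this model the desired flow only works when the 2nd stage task lands on
the CN that holds the data; on any other CN the input is simply missing.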

Design Alternatives:
1. Data-aware task scheduling:
    Both Swift and Falkon need to be data aware. Swift should know where
    the output of the 1st stage is, that is, in which pset, or put another
    way, under which Falkon service. The Falkon service should in turn know
    which CN has the data for the 2nd stage computation.

2. Swift batches jobs vertically
    Before sending out any jobs, Swift knows that the 2 stage jobs have a
    data dependency, so it sends them out as 1 batched job to each worker.

3. Collective IO
   Build a shared file system that can be accessed by all CNs. Instead of
   writing output data to GPFS, workers copy intermediate output data to
   this shared ram-disk and retrieve the data from IFS.

   Several Concerns:
   a) reliability of the torus network: we need to test this more.
   b) performance of the torus network: could it really perform better
       than GPFS? If not, at what scale could the torus perform better
       than GPFS?

4. Half-Collective IO
   All workers write data to IFS, and the data is periodically copied
   back to GPFS. In this case, we only optimize the output phase and leave
   the input phase as is.
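
A rough sketch of alternatives 1 and 2, again in Python only for
illustration. LocationTable, pick_worker and batch_vertically are made-up
names, not Swift or Falkon APIs, and the sketch assumes each intermediate
file has exactly one producer and at most one consumer.

```python
class LocationTable:
    """Hypothetical record of which CN holds each intermediate file."""

    def __init__(self):
        self._where = {}

    def record(self, filename, cn):
        self._where[filename] = cn

    def lookup(self, filename):
        return self._where.get(filename)  # None if the location is unknown


def pick_worker(task_input, table, idle_workers):
    """Alternative 1: prefer the idle CN that already holds the input."""
    holder = table.lookup(task_input)
    if holder in idle_workers:
        return holder
    # No data-local choice; fall back to any idle CN (assumed non-empty).
    return idle_workers[0]


def batch_vertically(tasks):
    """Alternative 2: merge each producer/consumer chain into one batch.

    `tasks` is a list of (name, input_file, output_file) tuples.
    Returns a list of batches, each a list of task names in run order.
    """
    produced = {out for _, _, out in tasks}
    consumer_of = {inp: (name, inp, out) for name, inp, out in tasks}
    batches = []
    for name, inp, out in tasks:
        if inp in produced:
            continue  # not a chain head; reached while following a chain
        chain = [name]
        nxt = consumer_of.get(out)
        while nxt is not None:
            chain.append(nxt[0])
            nxt = consumer_of.get(nxt[2])
        batches.append(chain)
    return batches


def example():
    """Small usage example for both helpers."""
    table = LocationTable()
    table.record("mid_a", "cn3")
    local_choice = pick_worker("mid_a", table, ["cn1", "cn3"])
    fallback_choice = pick_worker("mid_a", table, ["cn1", "cn2"])
    tasks = [
        ("s1a", "in_a", "mid_a"), ("s2a", "mid_a", "out_a"),
        ("s1b", "in_b", "mid_b"), ("s2b", "mid_b", "out_b"),
    ]
    return local_choice, fallback_choice, batch_vertically(tasks)
```

With batching, the location table is not needed for the batched pair,
since both stages run on the same worker; with data-aware scheduling, the
table has to be kept up to date by whoever dispatches the tasks.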

Any other ideas? Thanks so much.

best wishes
zhangzhao