[Swift-devel] several alternatives to design the data management system for Swift on SuperComputers

Ioan Raicu iraicu at cs.uchicago.edu
Mon Dec 1 16:33:33 CST 2008


Hi,
I can see option (1) working as long as there is 1 Swift client and 1
Falkon service. For example, our current deployment on the BG/P would
not work, as we have 1 Swift to *many* Falkon services. Now, even the
1-1 Swift-Falkon ratio won't work today, as the Falkon provider is not
data-aware yet... but it could be updated, maybe with a few days of
coding and testing; the harder part (IMO) will be making sure that the
Swift data management doesn't interfere with the Falkon data
management, and vice versa.
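
To make the data-aware part concrete, here is a rough sketch (Python)
of the dispatch decision I have in mind for the Falkon provider; the
file-to-node map, the API, and the node names are purely hypothetical:

    # Hypothetical data-aware dispatch: prefer an idle CN that already
    # holds one of the task's input files on its local ram-disk.
    def pick_node(task_inputs, location_map, idle_nodes):
        for f in task_inputs:
            node = location_map.get(f)       # CN that produced/cached f
            if node is not None and node in idle_nodes:
                return node                  # locality hit: no GPFS read
        return next(iter(idle_nodes))        # miss: any node, pull from GPFS

    # Example: stage 1 left "sim.out" on node "cn042".
    location_map = {"sim.out": "cn042"}
    idle_nodes = {"cn007", "cn042", "cn113"}
    print(pick_node(["sim.out"], location_map, idle_nodes))   # -> cn042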

Options (3) and (4) we have discussed before. The trick with these is
making things general and transparent enough that they work, and work
well. Getting the aggregate Torus throughput to exceed 8GB/s shouldn't
be that hard, probably with only a fraction of the machine (several
racks). Any word on the latest numbers for the improved GPFS, which is
supposed to increase the number of servers from 8 or 16 up to 100+?
With linear scalability, that would mean 80GB/s, the peak of the SAN
throughputs I saw a while back in some slides from an ALCF talk. For us
to get 80GB/s using CIO, we'd need 2MB/s per node. I bet we can easily
achieve that, but it would probably be at the larger scales of 10s of
racks. I recall getting 100MB/s+ per node, right? That would give us a
theoretical upper bound of 4000GB/s, so in theory, there is plenty of
room between 80GB/s and 4000GB/s. I bet in practice we'd only get a
small fraction of that 4000GB/s, but it would be interesting to see how
much we can really get without thinking about the network topology, and
also how far we can get if we do take the network topology into
account.
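
Just to show where those numbers come from, here is a quick
back-of-the-envelope sketch (Python); the 40-rack / 40960-node
full-machine scale is my assumption, the per-node rates are the ones
quoted above:

    nodes = 40 * 1024              # assumed full BG/P scale (40 racks)
    gpfs_agg = 80e9                # 80 GB/s aggregate for the upgraded GPFS
    print(gpfs_agg / nodes / 1e6)  # ~1.95 MB/s per node needed to reach it
    per_node = 100e6               # ~100 MB/s per node to local disk / torus
    print(nodes * per_node / 1e9)  # ~4096 GB/s theoretical aggregate bound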

Option (2) I hadn't thought of before, but it only works if an output
file is needed as input by only 1 job. What do you do if you have 1
output file needed as input by N jobs? Do you replicate the first job N
times, just so you can get the output file in N locations? Or do you
group the jobs into 1+N jobs, where the N jobs execute in serial order
on 1 processor/node? This might be worth investigating, but I think
you'll either be restricting the natural parallelism or repeating work
just to avoid data management.
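
For the 1+N grouping, something like this toy sketch (Python, with
made-up job records) is what I have in mind: the producer and its N
consumers get bundled into one batched job that runs serially on one
node, so the intermediate file never goes to GPFS, at the cost of the
consumers' parallelism:

    # Toy illustration of the 1+N grouping; job records are made up.
    def bundle(producer, consumers):
        # Run the producer, then each consumer, serially on one node.
        return {"type": "batch",
                "steps": [producer] + consumers}   # serial on 1 CPU/node

    stage1 = {"app": "sim", "out": "sim.out"}
    stage2 = [{"app": "analyze", "in": "sim.out", "out": "a%d.out" % i}
              for i in range(4)]                   # N = 4 consumers
    job = bundle(stage1, stage2)
    print(len(job["steps"]))                       # 5 steps, 1 node, no GPFS I/O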

Ioan

Zhao Zhang wrote:
> Hi, All
>
> The following alternatives are a summary of a discussion between Mike
> and Zhao. We are trying to optimize the data I/O performance for Swift
> on supercomputers, including the BG/P, Ranger, and possibly Jaguar. We
> are trying to eliminate all unnecessary data I/O between the stages of
> a computation.
>
> Scenario 1: Say a computation has 2 stages, where the 2nd stage takes
> the output of the 1st stage as its input data.
>
> Data flow in the current Swift system: the 1st stage writes its output
> data to GPFS, and Swift knows this output is the input for the 2nd
> stage. Swift then sends the 2nd-stage task to a worker on a CN.
>
> Desired data flow: the 1st stage knows its output will be used as the
> input for the next stage, so the data is not copied back to GPFS; the
> 2nd-stage task then arrives and consumes the data in place.
>
> Key issue: the 2nd-stage task has no idea where the 1st-stage output
> data is.
>
> Design Alternatives:
> 1. Data-aware task scheduling:
>    Both Swift and Falkon need to be data-aware. Swift should know
>    where the output of the 1st stage is, which means which pset, or
>    say which Falkon service. And the Falkon service should know which
>    CN has the data for the 2nd-stage computation.
>
> 2. Swift batches jobs vertically
>    Before sending out any jobs, Swift knows that the 2 stages have a
>    data dependency, so it sends them out as 1 batched job to each
>    worker.
>
> 3. Collective I/O
>   Build a shared file system that can be accessed by all CNs; instead
>   of writing output data to GPFS, workers copy intermediate output
>   data to this shared ram-disk, and later stages retrieve the data
>   from the IFS.
>
>   Several concerns:
>   a) reliability of the torus network --- we need to test this more.
>   b) performance of the torus network --- could this really perform
>      better than GPFS? If not, at what scale could the torus perform
>      better than GPFS?
>
> 4. Half-collective I/O
>   All workers write data to the IFS, and the data is periodically
>   copied back to GPFS. In this case, we only optimize the output phase
>   and leave the input phase as is.
>
> Any other ideas? Thanks so much.
>
> best wishes
> zhangzhao
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================

