[Swift-devel] several alternatives to design the data management system for Swift on SuperComputers
Ioan Raicu
iraicu at cs.uchicago.edu
Mon Dec 1 16:33:33 CST 2008
Hi,
I can see option (1) working as long as there is 1 Swift client and 1
Falkon service. For example, our current deployment on the BG/P would
not work, as we have 1 Swift to *many* Falkon services. Now, even the
1-1 Swift-Falkon ratio won't work today, as the Falkon provider is not
data-aware yet... but it could be updated, maybe with a few days of coding and
testing; the harder part (IMO) will be making sure that the Swift data
management doesn't interfere with the Falkon data management, and vice
versa.
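
To make the data-aware path a bit more concrete (this also maps to
alternative 1 in Zhao's list below), here is a rough sketch of the two
lookups that would have to exist; the Service stub, the locality tables,
and the node_hint parameter are all made up for illustration and are not
existing Swift or Falkon APIs:

    # Hypothetical sketch of data-aware dispatch for 1 Swift -> N Falkon services.
    # Swift would need a file -> Falkon service (pset) map, and each Falkon
    # service would need a file -> compute node map. None of this exists today.

    class Service:
        """Stand-in for a Falkon service proxy."""
        def __init__(self, name):
            self.name = name
        def submit(self, task, node_hint=None):
            print("submit %s to %s (node hint: %s)" % (task, self.name, node_hint))

    file_to_service = {"stage1.out": "pset-17"}   # Swift-side locality table
    file_to_node = {"stage1.out": "cn-0423"}      # Falkon-side locality table

    def dispatch(task, inputs, services, default="pset-0"):
        """Send a task to the service/node that already holds one of its inputs."""
        for f in inputs:
            if f in file_to_service:
                return services[file_to_service[f]].submit(task, file_to_node.get(f))
        return services[default].submit(task)     # no locality info: schedule as today

    services = {name: Service(name) for name in ("pset-0", "pset-17")}
    dispatch("stage2", ["stage1.out"], services)  # -> submit stage2 to pset-17 ...
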
Options (3) and (4) we have discussed before. The trick with these is
making things general and transparent enough that they work, and work
well. Getting the aggregate Torus throughput to exceed 8GB/s shouldn't be
that hard, probably with just a fraction of the machine (several racks).
Any word on the latest numbers for the improved GPFS, which is supposed
to increase the number of servers from 8 or 16 up to 100+? With linear
scalability, that would mean 80GB/s, the peak of the SAN throughputs I
saw a while back in some slides from an ALCF talk. For us to get 80GB/s
using CIO, we'd need 2MB/s per node. I bet we can easily achieve that,
but it would probably be at the larger scales of 10s of racks. I recall
us getting 100MB/s+ per node, right? That would give us a theoretical
upper bound of 4000GB/s, so in theory there is plenty of room between
80GB/s and 4000GB/s. I bet in practice we'd only get a small fraction
of that 4000GB/s, but it would be interesting to see how much we can
really get without thinking about the network topology, and how far we
can get if we do take the network topology into consideration.
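
Just to spell out the arithmetic behind those numbers (assuming the full
machine at 40 racks x 1024 nodes; the per-node rates are only the figures
quoted above, not new measurements):

    # Back-of-the-envelope numbers for the 80GB/s and 4000GB/s bounds above.
    nodes = 40 * 1024                       # full BG/P: 40 racks x 1024 nodes = 40,960

    gpfs_peak_gb = 80.0                     # upgraded GPFS / SAN peak from the ALCF slides
    print(gpfs_peak_gb * 1024 / nodes)      # ~2.0 MB/s per node needed to match 80GB/s

    per_node_mb = 100.0                     # ~100MB/s+ per node seen in earlier CIO tests
    print(per_node_mb * nodes / 1024)       # ~4000 GB/s theoretical aggregate upper bound
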
Option (2) I haven't thought of before, but it only works if an output
file is needed as input by only 1 downstream job. What do you do if you
have 1 output file needed as input by N jobs? Do you replicate the first
job N times, just so you can get the output file in N locations? Or do
you group the jobs into 1+N jobs, where the N jobs execute in serial
order on 1 processor/node? This might be worth investigating, but I
think you'll either be restricting the natural parallelism or repeating
work just to avoid data management.
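
To make the two choices concrete, here is a rough sketch of the trade-off;
the job names and the two grouping strategies are purely illustrative, not
anything Swift does today:

    # Option (2) with a 1 -> N dependency: either bundle the producer and all N
    # consumers into one serial batch, or replicate the producer N times so the
    # consumers can run in parallel. Both strategies below are hypothetical.

    def group_serially(producer, consumers):
        """One batch per dependency chain: avoids moving the intermediate file,
        but the N consumers now run one after another on a single node."""
        return [[producer] + consumers]

    def replicate_producer(producer, consumers):
        """One batch per consumer: keeps the N consumers parallel, but redoes
        the producer's work N times."""
        return [[producer, c] for c in consumers]

    print(group_serially("stage1", ["stage2a", "stage2b", "stage2c"]))
    # [['stage1', 'stage2a', 'stage2b', 'stage2c']]
    print(replicate_producer("stage1", ["stage2a", "stage2b", "stage2c"]))
    # [['stage1', 'stage2a'], ['stage1', 'stage2b'], ['stage1', 'stage2c']]
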
Ioan
Zhao Zhang wrote:
> Hi, All
>
> The following alternatives are a summary of a talk between Mike and
> Zhao. We are trying to optimize the data IO performance for Swift on
> supercomputers, including the BG/P, Ranger, and possibly Jaguar. We are
> trying to eliminate all unnecessary data IO between stages of
> computation.
>
> Scenario 1: Say a computation has 2 stages, where the 2nd stage takes
> the output from the 1st stage as its input data.
>
> Data flow in the current Swift system: the 1st stage writes the output
> data to GPFS, and Swift knows this output data is the input for the 2nd
> stage. Swift then sends the 2nd stage task to a worker on a CN.
>
> Desired data flow: the 1st stage of computation knows its output will
> be used as the input for the next stage, so the data is not copied back
> to GPFS; the 2nd stage task then arrives and consumes this data.
>
> Key issue: the 2nd stage task has no idea where the 1st stage output
> data is.
>
> Design Alternatives:
> 1. Data-aware task scheduling:
>    Both Swift and Falkon need to be data-aware. Swift should know
> where the output of the 1st stage is, i.e., which pset, or in other
> words, which Falkon service. And the Falkon service should know which
> CN has the data for the 2nd stage computation.
>
> 2. Swift batches jobs vertically
>    Before sending out any jobs, Swift knows that these 2 stages have a
> data dependency, so it sends them out as 1 batched job to each worker.
>
> 3. Collective IO
>    Build a shared file system that can be accessed by all CNs. Instead
> of writing output data to GPFS, workers copy intermediate output data
> to this shared ram-disk (the IFS), and later tasks retrieve the data
> from the IFS. (A rough sketch of 3 and 4 combined appears below, after
> alternative 4.)
>
> Several concerns:
> a) reliability of the Torus network --- we need to test this more.
> b) performance of the Torus network --- could this really perform
> better than GPFS? If not, at what scale would the Torus perform better
> than GPFS?
>
> 4. Half-collective IO
>    All workers write data to the IFS, and the data is periodically
> copied back to GPFS. In this case, we only optimize the output phase
> and leave the input phase as is.
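>
> A rough sketch of what 3 and 4 might look like from the worker side:
> intermediate output goes to a shared ram-disk (the IFS) instead of GPFS,
> and a background process periodically drains it to GPFS. The paths, the
> drain interval, and the helper names are all made up for illustration:
>
>     # Sketch of alternatives 3/4: write intermediate output to a shared IFS
>     # (e.g. a ram-disk mount) and periodically copy it back to GPFS.
>     # All paths and the interval are hypothetical.
>     import os, shutil, time
>
>     IFS_DIR  = "/dev/shm/ifs"           # assumed shared ram-disk mount
>     GPFS_DIR = "/gpfs/home/run42/out"   # assumed final output location on GPFS
>
>     def write_intermediate(name, data):
>         """Worker side: write stage output to the IFS instead of GPFS."""
>         with open(os.path.join(IFS_DIR, name), "wb") as f:
>             f.write(data)
>
>     def drain_to_gpfs(interval=60):
>         """Alternative 4: periodically copy new IFS files back to GPFS."""
>         while True:
>             for name in os.listdir(IFS_DIR):
>                 dst = os.path.join(GPFS_DIR, name)
>                 if not os.path.exists(dst):
>                     shutil.copy2(os.path.join(IFS_DIR, name), dst)
>             time.sleep(interval)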
>
> Any other ideas? Thanks so much.
>
> best wishes
> zhangzhao
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================