[Swift-devel] Several alternatives for designing the data management system for Swift on supercomputers

Ioan Raicu iraicu at cs.uchicago.edu
Mon Dec 1 16:52:54 CST 2008



Mihael Hategan wrote:
> On Mon, 2008-12-01 at 16:15 -0600, Zhao Zhang wrote:
>
>   
>> Desired Data Flow: the 1st stage of computation knows its output data will 
>> be used as the input for the next stage, so the data is not copied back 
>> to GPFS; the 2nd-stage task then arrives and consumes this data in place.
>>     
>
> This assumes a sequential workflow (t1 -> t2 ->... -> tn). For anything
> more complex, this becomes a nasty scheduling problem. For example:
>
> (t1, t2) -> t3
>
> Whose output, t1's or t2's, should not be copied back?
>
>   
>> Key Issue: the 2nd-stage task has no idea where the 1st-stage output 
>> data is.
>>     
>
> I beg to disagree. Swift provides the mechanism to record where data is.
> The key issue is that queuing systems don't allow control over the exact
> nodes that tasks go to.
>   
Well, Falkon with data diffusion gives you that level of control :)
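For concreteness, a minimal sketch of the data-aware dispatch idea (Python, with hypothetical names; this is an illustration of the idea, not Falkon's actual data-diffusion code): the scheduler records which node produced each output and, when an idle node already holds a downstream task's input, routes the task there instead of staging the file from GPFS.

# Minimal sketch of data-aware dispatch (hypothetical names; not Falkon's
# actual data-diffusion implementation). The index maps logical file names
# to the compute node that produced them, so a downstream task can be
# routed to the node that already holds its input, with GPFS staging as
# the fallback.

location_index = {}   # logical file name -> node id that holds it locally

def record_output(filename, node):
    """Called when a task finishes: remember where its output lives."""
    location_index[filename] = node

def pick_node(task_inputs, idle_nodes):
    """Prefer an idle node that already holds one of the task's inputs."""
    for f in task_inputs:
        node = location_index.get(f)
        if node in idle_nodes:
            return node, False           # local hit: no GPFS read needed
    return next(iter(idle_nodes)), True  # miss: stage input from GPFS

# Example: t1 ran on node "cn-042"; t3 consumes its output.
record_output("t1.out", "cn-042")
node, staged = pick_node(["t1.out"], idle_nodes={"cn-042", "cn-107"})
print(node, staged)   # cn-042 False
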
> Another key issue is that you may not even want to do so, because that
> node may be better used running a different task (scheduling problem
> again).
>
>   
>> Design Alternatives:
>> 1. Data-aware task scheduling:
>>     Both Swift and Falkon need to be data aware. Swift should know where 
>>     the output of the 1st stage is, i.e., which pset, or in other words 
>>     which Falkon service. And the Falkon service should know which CN 
>>     holds the data for the 2nd-stage computation.
>>
>> 2. Swift batches jobs vertically
>>     Before sending out any jobs, Swift knows that these 2-stage jobs have 
>>     a data dependency, and thus sends them out as 1 batched job to each 
>>     worker.
>>
>> 3. Collective I/O
>>    Build a shared file system (an intermediate file system, IFS) that can 
>>    be accessed by all CNs. Instead of writing output data to GPFS, workers 
>>    copy intermediate output data to this shared ram-disk and later 
>>    retrieve it from the IFS.
>>     
>
> That seems awfully close to implementing a distributed filesystem, which
> I think is a fairly bad idea. If you're trying to avoid GPFS contention,
> then avoid it by carefully sticking your data in different directories.
> And do keep in mind that most operating systems cache filesystem data in
> memory, so a read after write of a reasonably small file will be very
> fast with any filesystem.
>   
I don't think you realize how expensive GPFS access is at 100K-CPU scale.  
Simple operations that should take milliseconds take tens of seconds to 
complete, maybe more.  For example, GPFS locking on writes to a single 
directory can take thousands of seconds at only 16K-CPU scale.  The idea of 
creating these islands of shared file systems, each localized to a small 
portion of the total number of workers, seems like a viable way to let more 
data-intensive applications scale.  The problem is how to express the 
collective I/O (CIO) so that it works well, reliably, and transparently.  
We also need more measurements to see how much performance we gain for the 
effort we are throwing at the problem.
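
For concreteness, a rough sketch of the pattern the CIO idea targets (Python; 
hypothetical paths and command names; assumes a writable node-local ram-disk 
at /dev/shm): the 1st-stage output stays on the node, the 2nd stage consumes 
it in place, and only the final product, written into a per-task 
subdirectory, touches GPFS.

# Rough sketch of the pattern the collective-I/O idea targets (hypothetical
# paths and commands; assumes a writable node-local ram-disk at /dev/shm).
# Stage-1 output stays local and is consumed in place by stage 2; only the
# final result is copied to GPFS, so the shared file system sees one write
# per task instead of one per stage.

import os
import shutil
import subprocess
import tempfile

GPFS_OUTPUT_DIR = "/gpfs/home/user/results"   # assumed final destination

def run_two_stage(task_id, stage1_cmd, stage2_cmd):
    # Intermediate files live on the node-local ram-disk, not on GPFS.
    scratch = tempfile.mkdtemp(prefix="job-%s-" % task_id, dir="/dev/shm")
    inter = os.path.join(scratch, "stage1.out")
    final = os.path.join(scratch, "stage2.out")

    # Stage 1 writes only to the local ram-disk.
    subprocess.run(stage1_cmd + ["--output", inter], check=True)
    # Stage 2 reads the intermediate data directly from the ram-disk.
    subprocess.run(stage2_cmd + ["--input", inter, "--output", final],
                   check=True)

    # Only the final product is copied to GPFS, into a per-task
    # subdirectory, which also sidesteps lock contention on a single
    # shared output directory.
    target = os.path.join(GPFS_OUTPUT_DIR, str(task_id))
    os.makedirs(target, exist_ok=True)
    shutil.copy(final, os.path.join(target, "result.out"))
    shutil.rmtree(scratch)

Whether something like this actually beats a careful directory layout on 
GPFS alone is exactly the kind of measurement we still need to do.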

Ioan

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
