[Swift-devel] coaster status summary

Mihael Hategan hategan at mcs.anl.gov
Sun Apr 6 04:17:22 CDT 2008


> >
> > Of course it's unreliable unless you deal with the reliability issues as
> > outlined above.
> >   
> I did deal with them: duplicates, out-of-order delivery, retries, timeouts,
> etc. Yet I still couldn't get a 100% reliable implementation,

Of course you couldn't. It's impossible.

> and I gave up. In theory, UDP should work, provided that you deal with all
> the reliability issues you outlined. I am just pointing out that after lots
> of debugging, I gave in and swapped UDP for TCP to avoid the occasional
> unexplained lost message. I am positive it was a bug in my code, so perhaps
> you'll have better luck!
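
For illustration only, here is the general shape of that bookkeeping as a
minimal sketch in Java. This is not the coaster protocol: the "seq:payload" /
"ACK:seq" message format, the timeout, and the retry count are all made up.
It also shows why 100% is out of reach, since the acks themselves can be lost
and the sender can never be certain the other side acted on a message.

// Hypothetical sketch of sequence numbers, retransmission on timeout, and
// duplicate suppression over UDP. Not the coaster code; constants and the
// message format are illustrative only.
import java.net.*;
import java.util.*;

public class ReliableUdpSketch {
    static final int MAX_RETRIES = 5;
    static final int ACK_TIMEOUT_MS = 500;

    // Sender: prefix each payload with a sequence number and retransmit until
    // the matching ack arrives or we give up.
    static boolean send(DatagramSocket sock, InetSocketAddress dest, long seq,
                        String payload) throws Exception {
        byte[] data = (seq + ":" + payload).getBytes("UTF-8");
        DatagramPacket out = new DatagramPacket(data, data.length, dest);
        sock.setSoTimeout(ACK_TIMEOUT_MS);
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            sock.send(out);
            byte[] buf = new byte[64];
            DatagramPacket in = new DatagramPacket(buf, buf.length);
            try {
                sock.receive(in);
                String ack = new String(in.getData(), 0, in.getLength(), "UTF-8");
                if (ack.equals("ACK:" + seq)) {
                    return true;   // delivered, as far as the sender can tell
                }
                // stale or mismatched ack: fall through and retry
            } catch (SocketTimeoutException e) {
                // no ack in time: retransmit
            }
        }
        return false;              // gave up; the job has to be retried elsewhere
    }

    // Receiver: ack every packet, but only act on sequence numbers not seen
    // before, so duplicates caused by retransmission are ignored.
    static final Set<Long> seen = new HashSet<Long>();

    static void receive(DatagramSocket sock, DatagramPacket pkt) throws Exception {
        String msg = new String(pkt.getData(), 0, pkt.getLength(), "UTF-8");
        long seq = Long.parseLong(msg.substring(0, msg.indexOf(':')));
        byte[] ack = ("ACK:" + seq).getBytes("UTF-8");
        sock.send(new DatagramPacket(ack, ack.length, pkt.getSocketAddress()));
        if (seen.add(seq)) {
            // first time this message is seen: hand it to the job handler
        }
        // otherwise it is a retransmitted duplicate: drop it
    }
}

The receiver acks everything but only acts on new sequence numbers, which
covers duplicates; handling out-of-order delivery and bounding the "seen" set
would need more state than this sketch carries.
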
> >   
> >> Is the 180 tasks/sec the overall throughput measured from Swift's
> >> point of view, including overhead of wrapper.sh?  Or is that a
> >> micro-benchmark measuring just the coaster performance?  
> >>     
> >
> > It's at the provider level. No wrapper.sh.
> >   
> OK, great!
> 
> Ioan
> >   
> >> Ioan
> >>
> >>
> >> Mihael Hategan wrote: 
> >>     
> >>> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
> >>>
> >>>> Mihael, this is great progress - very exciting.
> >>>> Some questions (don't need answers right away):
> >>>>
> >>>> How would the end user use it? Manually start a service?
> >>>> Is the service a separate process, or in the swift jvm?
> >>>>
> >>> I thought the lines below answered some of these.
> >>>
> >>> A user would specify the coaster provider in sites.xml. The provider
> >>> would then automatically deploy a service on the target machine without
> >>> the user having to do so. Given that the service runs on a different
> >>> machine than the client, they can't be in the same JVM.
> >>>
> >>>
> >>>> How are the number of workers set or adjusted?
> >>>>
> >>> Currently, workers are requested as needed, up to a maximum. This is
> >>> preliminary, hence "Better allocation strategy for workers" in the to-do
> >>> list below.
> >>>
> >>>
> >>>> Does a service manage workers on one cluster or many?
> >>>>
> >>> One service per cluster.
> >>>
> >>>
> >>>> At 180 jobs/sec with 10 workers, what were the CPU loads on Swift, the
> >>>> worker, and the service?
> >>>>
> >>> I vaguely recall them being below 50%, for some reason I don't
> >>> understand.
> >>>
> >>>
> >>>> Do you want to try this on the workflows we're running on Falkon on the 
> >>>> BGP and SiCortex?
> >>>>
> >>> Let me repeat "prototype" and "more testing". In no way do I want to do
> >>> preliminary testing with an application that is shaky on an architecture
> >>> that is also shaky.
> >>>
> >>> Mihael
> >>>
> >>>
> >>>> I'm eager to try it when you feel it's ready for others to test.
> >>>>
> >>>> Nice work!
> >>>>
> >>>> - Mike
> >>>>
> >>>>
> >>>>
> >>>> On 4/4/08 4:39 AM, Mihael Hategan wrote:
> >>>>
> >>>>> I've been asked for a summary of the status of the coaster prototype, so
> >>>>> here it is:
> >>>>> - It's a prototype, so bugs are plentiful
> >>>>> - It's self-deployed (you don't need to start a service on the target
> >>>>> cluster)
> >>>>> - You can also use it with a service that you start on the target
> >>>>> cluster yourself
> >>>>> - There is a worker written in Perl
> >>>>> - It uses encryption between client and coaster service
> >>>>> - It uses UDP between the service and the workers (this may prove to be
> >>>>> a better or worse choice than TCP)
> >>>>> - A preliminary test done locally shows an amortized throughput of
> >>>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10
> >>>>> workers. Pretty picture attached (total time vs. # of jobs)
> >>>>>
> >>>>> To do:
> >>>>> - The scheduling algorithm in the service needs a bit more work
> >>>>> - When worker messages are lost, some jobs may get lost (i.e. needs more
> >>>>> fault tolerance)
> >>>>> - Start testing it on actual clusters
> >>>>> - Do some memory consumption benchmarks
> >>>>> - Better allocation strategy for workers
> >>>>>
> >>>>> Mihael
> >>>>>
> >>>
> >> -- 
> >> ===================================================
> >> Ioan Raicu
> >> Ph.D. Candidate
> >> ===================================================
> >> Distributed Systems Laboratory
> >> Computer Science Department
> >> University of Chicago
> >> 1100 E. 58th Street, Ryerson Hall
> >> Chicago, IL 60637
> >> ===================================================
> >> Email: iraicu at cs.uchicago.edu
> >> Web:   http://www.cs.uchicago.edu/~iraicu
> >> http://dev.globus.org/wiki/Incubator/Falkon
> >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> >> ===================================================
> >> ===================================================
> >>
> >
> 
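
P.S. On the 180 jobs/s figure quoted above: since the measurement is at the
provider level, the amortized throughput is just the number of jobs divided by
the total elapsed time, i.e. the inverse slope of the "total time vs. # of
jobs" plot (at 180 jobs/s with 10 workers, each worker slot turns a job around
in roughly 10/180, about 56 ms, on average). Purely as an illustration of that
arithmetic, and not the actual coaster test harness, the shape of such a
micro-benchmark is roughly:

// Hypothetical micro-benchmark shape: submit N trivial jobs (stand-ins for
// /bin/date) to a fixed pool of 10 "workers" and report amortized throughput
// as N divided by the total elapsed time. Job count and pool are illustrative.
import java.util.concurrent.*;

public class ThroughputSketch {
    public static void main(String[] args) throws Exception {
        final int workers = 10;      // matches the 10 workers mentioned above
        final int jobs = 10000;      // illustrative job count
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        final CountDownLatch done = new CountDownLatch(jobs);

        long start = System.nanoTime();
        for (int i = 0; i < jobs; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    // a real test would launch the job on a worker here
                    done.countDown();
                }
            });
        }
        done.await();
        double seconds = (System.nanoTime() - start) / 1e9;

        // The slope of "total time vs. # of jobs" is seconds per job;
        // amortized throughput is its inverse.
        System.out.printf("%d jobs in %.2f s -> %.1f jobs/s%n",
                          jobs, seconds, jobs / seconds);
        pool.shutdown();
    }
}
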



