[Swift-devel] coaster status summary

Mon Apr 7 13:16:25 CDT 2008

That said, when I switched to TCP, most of my problems went away;
evidently TCP's error recovery mechanisms are more robust than what I
implemented.  The moral of the story, from my experience: offer UDP as
an option for potentially better performance and scalability, but keep
TCP as a configurable option for potentially better reliability and
robustness.
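
For reference, the bookkeeping discussed in the quoted thread below
(sequence numbers, duplicate suppression, in-order delivery, and
retransmission on timeout) can be sketched roughly as follows. This is
only a minimal illustration, not the coaster or Falkon code: the class
names ReliableReceiver and ReliableSender, the timeout and max_retries
parameters, and the convention of passing the clock in as `now` are all
assumptions made for the example, and no real sockets are involved.

```python
from dataclasses import dataclass, field


@dataclass
class ReliableReceiver:
    """Hypothetical receiver: drops duplicates, releases payloads in order."""
    next_seq: int = 0                              # next sequence number to deliver
    pending: dict = field(default_factory=dict)    # out-of-order payloads by seq

    def on_datagram(self, seq, payload):
        """Return (seq to acknowledge, payloads that became deliverable)."""
        # Buffer only messages that are new; anything older than next_seq or
        # already buffered is a duplicate and is ignored (but still acked,
        # so the sender stops retransmitting it).
        if seq >= self.next_seq and seq not in self.pending:
            self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return seq, delivered


@dataclass
class ReliableSender:
    """Hypothetical sender: tracks unacknowledged messages and retries them."""
    timeout: float = 0.5                           # seconds before a resend (assumed)
    max_retries: int = 5                           # total attempts allowed (assumed)
    next_seq: int = 0
    unacked: dict = field(default_factory=dict)    # seq -> [payload, last_sent, attempts]

    def send(self, payload, now):
        """Assign a sequence number and remember the message until it is acked."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = [payload, now, 1]
        return seq, payload

    def on_ack(self, seq):
        """Forget a message once acknowledged; late or repeated acks are harmless."""
        self.unacked.pop(seq, None)

    def due_retransmits(self, now):
        """Return (seq, payload) pairs whose timeout expired; give up past the budget."""
        out = []
        for seq, entry in sorted(self.unacked.items()):
            payload, last_sent, attempts = entry
            if now - last_sent >= self.timeout:
                if attempts >= self.max_retries:
                    raise TimeoutError(
                        f"message {seq} unacknowledged after {attempts} attempts")
                entry[1] = now
                entry[2] = attempts + 1
                out.append((seq, payload))
        return out
```

Note that even with all of this in place the sender can only give up
after its retry budget is spent; it cannot guarantee delivery, which is
consistent with the point made in the thread below.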

Ioan

Mihael Hategan wrote:
>>> Of course it's unreliable unless you deal with the reliability issues as
>>> outlined above.
>> I did deal with them (duplicates, out-of-order delivery, retries,
>> timeouts, etc.), yet I still couldn't get a 100% reliable implementation,
>
> Of course you couldn't. It's impossible.
>
>>  and I 
>> gave up... in theory, UDP should work given that you deal with all the 
>> reliability issues you outlined.  I am just pointing out that after lots 
>> of debugging, I gave in and swapped UDP for TCP to avoid the unexplained 
>> lost message once in a while.  I am positive it was a bug in my code, so 
>> perhaps you'll have better luck!
>>>> Is the 180 tasks/sec the overall throughput measured from Swift's
>>>> point of view, including overhead of wrapper.sh?  Or is that a
>>>> micro-benchmark measuring just the coaster performance?  
>>> It's at the provider level. No wrapper.sh.
>> OK, great!
>>
>> Ioan
>>>> Ioan
>>>>
>>>>
>>>> Mihael Hategan wrote: 
>>>>> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
>>>>>> Mihael, this is great progress - very exciting.
>>>>>> Some questions (don't need answers right away):
>>>>>>
>>>>>> How would the end user use it? Manually start a service?
>>>>>> Is the service a separate process, or in the swift jvm?
>>>>> I thought the lines below answered some of these.
>>>>>
>>>>> A user would specify the coaster provider in sites.xml. The provider
>>>>> will then automatically deploy a service on the target machine without
>>>>> the user having to do so. Given that the service is on a different
>>>>> machine than the client, they can't be in the same JVM.
>>>>>
>>>>>> How are the number of workers set or adjusted?
>>>>> Currently, as many workers as needed are requested, up to a maximum. This
>>>>> is preliminary, hence "Better allocation strategy for workers".
>>>>>
>>>>>> Does a service manage workers on one cluster or many?
>>>>> One service per cluster.
>>>>>
>>>>>> At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
>>>>>> worker and service?
>>>>> I faintly recall them being at less than 50% for some reason I don't
>>>>> understand.
>>>>>
>>>>>> Do you want to try this on the workflows we're running on Falkon on the 
>>>>>> BGP and SiCortex?
>>>>> Let me repeat "prototype" and "more testing". In no way do I want to do
>>>>> preliminary testing with an application that is shaky on an architecture
>>>>> that is also shaky.
>>>>>
>>>>> Mihael
>>>>>
>>>>>> I'm eager to try it when you feel it's ready for others to test.
>>>>>>
>>>>>> Nice work!
>>>>>>
>>>>>> - Mike
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 4/4/08 4:39 AM, Mihael Hategan wrote:
>>>>>>> I've been asked for a summary of the status of the coaster prototype, so
>>>>>>> here it is:
>>>>>>> - It's a prototype so bugs are plenty
>>>>>>> - It's self deployed (you don't need to start a service on the target
>>>>>>> cluster)
>>>>>>> - You can also use it while starting a service on the target cluster
>>>>>>> - There is a worker written in Perl
>>>>>>> - It uses encryption between client and coaster service
>>>>>>> - It uses UDP between the service and the workers (this may prove to be a
>>>>>>> better or worse choice than TCP)
>>>>>>> - A preliminary test done locally shows an amortized throughput of
>>>>>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10
>>>>>>> workers. Pretty picture attached (total time vs. # of jobs)
>>>>>>>
>>>>>>> To do:
>>>>>>> - The scheduling algorithm in the service needs a bit more work
>>>>>>> - When worker messages are lost, some jobs may get lost (i.e. needs more
>>>>>>> fault tolerance)
>>>>>>> - Start testing it on actual clusters
>>>>>>> - Do some memory consumption benchmarks
>>>>>>> - Better allocation strategy for workers
>>>>>>>
>>>>>>> Mihael
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Swift-devel mailing list
>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>> -- 
>>>> ===================================================
>>>> Ioan Raicu
>>>> Ph.D. Candidate
>>>> ===================================================
>>>> Distributed Systems Laboratory
>>>> Computer Science Department
>>>> University of Chicago
>>>> 1100 E. 58th Street, Ryerson Hall
>>>> Chicago, IL 60637
>>>> ===================================================
>>>> Email: iraicu at cs.uchicago.edu
>>>> Web:   http://www.cs.uchicago.edu/~iraicu
>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>> ===================================================
>>>> ===================================================
>>>>

