[Swift-devel] coaster status summary
Ioan Raicu
iraicu at cs.uchicago.edu
Mon Apr 7 13:16:25 CDT 2008
That said, when I switched to TCP, most of my problems went away;
TCP's error recovery mechanisms are evidently more robust than what I
implemented. The moral of the story, from my experience: offer UDP as
an option for potentially better performance and scalability, but keep
TCP as a configurable option for potentially better reliability and
robustness.
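
For illustration only, here is a minimal sketch of the kind of error
recovery discussed in this thread (acks, retries, duplicate suppression),
run against a simulated lossy link rather than a real UDP socket. It is
not the coaster code or my Falkon code; LossyChannel, Receiver and
send_reliably are made-up names, and the "timeout" is modelled simply as
a dropped message. Note that with a bounded retry count the sender can
still give up, so this is not a 100% guarantee either.

```python
import random


class LossyChannel:
    """Simulated datagram link that drops messages with a fixed probability."""

    def __init__(self, loss_rate, seed=0):
        self.loss_rate = loss_rate
        self.rng = random.Random(seed)

    def deliver(self, message):
        # Return the message, or None if the simulated network dropped it.
        return None if self.rng.random() < self.loss_rate else message


class Receiver:
    """Acknowledges every sequence number but hands each payload over once."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def on_datagram(self, seq, payload):
        if seq not in self.seen:      # duplicate suppression
            self.seen.add(seq)
            self.delivered.append(payload)
        return seq                    # always ack, so a lost ack can be repaired


def send_reliably(payloads, channel, receiver, max_retries=50):
    """Stop-and-wait: resend each message until its ack arrives."""
    attempts = 0
    for seq, payload in enumerate(payloads):
        for _ in range(max_retries):
            attempts += 1
            datagram = channel.deliver((seq, payload))
            if datagram is None:
                continue              # request lost; a real sender would time out here
            ack = channel.deliver(receiver.on_datagram(*datagram))
            if ack == seq:
                break                 # acknowledged, move on to the next message
        else:
            raise RuntimeError(f"gave up on message {seq}")
    return attempts
```

Stop-and-wait keeps messages in order for free, at the cost of one round
trip per message; a windowed scheme would be faster but needs reordering
logic on the receiver, which is roughly what TCP already provides.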
Ioan
Mihael Hategan wrote:
>>> Of course it's unreliable unless you deal with the reliability issues as
>>> outlined above.
>>>
>>>
>> I did deal with them (duplicates, out-of-order delivery, retries,
>> timeouts, etc.), yet I still couldn't get a 100% reliable implementation,
>>
>
> Of course you couldn't. It's impossible.
>
>
>> and I
>> gave up... in theory, UDP should work given that you deal with all the
>> reliability issues you outlined. I am just pointing out that after lots
>> of debugging, I gave in and swapped UDP for TCP to avoid the unexplained
>> lost message once in a while. I am positive it was a bug in my code, so
>> perhaps you'll have better luck!
>>
>>>
>>>
>>>> Is the 180 tasks/sec the overall throughput measured from Swift's
>>>> point of view, including overhead of wrapper.sh? Or is that a
>>>> micro-benchmark measuring just the coaster performance?
>>>>
>>>>
>>> It's at the provider level. No wrapper.sh.
>>>
>>>
>> OK, great!
>>
>> Ioan
>>
>>>
>>>
>>>> Ioan
>>>>
>>>>
>>>> Mihael Hategan wrote:
>>>>
>>>>
>>>>> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Mihael, this is great progress - very exciting.
>>>>>> Some questions (don't need answers right away):
>>>>>>
>>>>>> How would the end user use it? Manually start a service?
>>>>>> Is the service a separate process, or in the swift jvm?
>>>>>>
>>>>>>
>>>>>>
>>>>> I thought the lines below answered some of these.
>>>>>
>>>>> A user would specify the coaster provider in sites.xml. The provider
>>>>> will then automatically deploy a service on the target machine without
>>>>> the user having to do so. Given that the service is on a different
>>>>> machine than the client, they can't be in the same JVM.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> How are the number of workers set or adjusted?
>>>>>>
>>>>>>
>>>>>>
>>>>> Currently as many workers are requested as needed, up to a maximum.
>>>>> This is preliminary, hence "Better allocation strategy for workers".
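
One way to read "as much as needed, up to a maximum" is the sketch
below. It is only a guess at the policy, not the actual coaster service
logic: it assumes one job per worker, and the function and parameter
names are hypothetical.

```python
def workers_to_request(queued_jobs, active_workers, max_workers):
    """Return how many extra workers to ask for, never exceeding the cap."""
    if min(queued_jobs, active_workers, max_workers) < 0:
        raise ValueError("counts must be non-negative")
    shortfall = max(0, queued_jobs - active_workers)   # jobs with no worker yet
    headroom = max(0, max_workers - active_workers)    # room left under the cap
    return min(shortfall, headroom)
```

For example, with 25 queued jobs, 4 active workers and a cap of 10, this
asks for 6 more workers; with 3 queued jobs and 4 active workers it asks
for none.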
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Does a service manage workers on one cluster or many?
>>>>>>
>>>>>>
>>>>>>
>>>>> One service per cluster.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> At 180 jobs/sec with 10 workers, what were the CPU loads on swift,
>>>>>> worker and service?
>>>>>>
>>>>>>
>>>>>>
>>>>> I faintly recall them being at less than 50% for some reason I don't
>>>>> understand.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Do you want to try this on the workflows we're running on Falkon on the
>>>>>> BGP and SiCortex?
>>>>>>
>>>>>>
>>>>>>
>>>>> Let me repeat "prototype" and "more testing". In no way do I want to do
>>>>> preliminary testing with an application that is shaky on an architecture
>>>>> that is also shaky.
>>>>>
>>>>> Mihael
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I'm eager to try it when you feel it's ready for others to test.
>>>>>>
>>>>>> Nice work!
>>>>>>
>>>>>> - Mike
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 4/4/08 4:39 AM, Mihael Hategan wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> I've been asked for a summary of the status of the coaster prototype, so
>>>>>>> here it is:
>>>>>>> - It's a prototype so bugs are plenty
>>>>>>> - It's self deployed (you don't need to start a service on the target
>>>>>>> cluster)
>>>>>>> - You can also use it while starting a service on the target cluster
>>>>>>> - There is a worker written in Perl
>>>>>>> - It uses encryption between client and coaster service
>>>>>>> - It uses UDP between the service and the workers (this may prove to be
>>>>>>> better or worse choice than TCP)
>>>>>>> - A preliminary test done locally shows an amortized throughput of
>>>>>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10
>>>>>>> workers. Pretty picture attached (total time vs. # of jobs)
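
To put the 180 jobs/s figure in per-worker terms, here is a small
back-of-the-envelope calculation. It assumes the load is spread evenly
over the 10 workers, which the summary above does not state; the function
names are made up for this example.

```python
def per_worker_rate(jobs_per_second, workers):
    """Average jobs per second handled by each worker, assuming an even split."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return jobs_per_second / workers


def mean_job_slot_ms(jobs_per_second, workers):
    """Average wall-clock time in milliseconds each worker has per job."""
    return 1000.0 / per_worker_rate(jobs_per_second, workers)


rate = per_worker_rate(180, 10)     # 18 jobs/s per worker
slot = mean_job_slot_ms(180, 10)    # roughly 55.6 ms per job per worker
```

That per-job slot covers dispatch, running /bin/date, and reporting back,
so it is an upper bound on the overhead per job, not a measurement of it.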
>>>>>>>
>>>>>>> To do:
>>>>>>> - The scheduling algorithm in the service needs a bit more work
>>>>>>> - When worker messages are lost, some jobs may get lost (i.e. needs more
>>>>>>> fault tolerance)
>>>>>>> - Start testing it on actual clusters
>>>>>>> - Do some memory consumption benchmarks
>>>>>>> - Better allocation strategy for workers
>>>>>>>
>>>>>>> Mihael
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Swift-devel mailing list
>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>
>>>>>>>
>>>>>>>
>>>> --
>>>> ===================================================
>>>> Ioan Raicu
>>>> Ph.D. Candidate
>>>> ===================================================
>>>> Distributed Systems Laboratory
>>>> Computer Science Department
>>>> University of Chicago
>>>> 1100 E. 58th Street, Ryerson Hall
>>>> Chicago, IL 60637
>>>> ===================================================
>>>> Email: iraicu at cs.uchicago.edu
>>>> Web: http://www.cs.uchicago.edu/~iraicu
>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>> ===================================================
>>>> ===================================================
>>>>
>>>>
>>>>
>>>
>>>
>
>
>