[Swift-devel] coaster status summary

Ioan Raicu iraicu at cs.uchicago.edu
Fri Apr 4 19:02:44 CDT 2008


You say that you use UDP on the workers.  This might be more 
lightweight, but it might also pose practical issues. 

Some of those are:
- it might not work well on any network other than a LAN
- it won't be friendly to firewalls or NATs, regardless of whether the 
service pushes jobs or the workers pull them; the logic is that you need 
two-way communication, and with UDP (a connectionless protocol) it's like 
having a server socket and a client socket on both ends of the 
communication at the same time.  This might not matter if the service 
and the worker are on the same LAN with no NATs or firewalls in the 
middle, but it would matter on a machine such as the BG/P, as there is 
a NAT between the login nodes and the compute nodes.  In essence, for 
this to work on the BG/P, you'll need to avoid having server-side 
sockets on the compute nodes (workers), and you'll probably only be able 
to do that via a connection-oriented protocol (i.e. TCP); see the sketch 
after this list.  Is switching to TCP a relatively straightforward 
option?  If not, it might be worth doing, to make the implementation 
more flexible
- losing messages and recovering from them will likely be harder than 
anticipated.  I have a UDP version of the notification engine that Falkon 
uses, and after much debugging I gave up and switched over to TCP.  It 
worked most of the time, but the occasional lost message (1 in thousands, 
maybe even rarer) made Falkon unreliable, and hence I stopped using 
it. 
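
To make the TCP alternative concrete, here's a minimal sketch of my own 
(in Java, with made-up host, port, and message framing) of the pattern 
that works through a NAT: the worker dials out to the service once, and 
job dispatch and result delivery both flow over that single 
worker-initiated connection, so no server socket is ever needed on the 
compute node:

    import java.io.*;
    import java.net.Socket;

    // Hypothetical worker-side channel: one OUTBOUND TCP connection,
    // reused in both directions.  The NAT only ever sees the worker
    // connecting out, never the service connecting in.
    public class WorkerChannel {
        public static void main(String[] args) throws IOException {
            try (Socket s = new Socket("login1.example.org", 50000)) {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                PrintWriter out = new PrintWriter(s.getOutputStream(), true);

                out.println("REGISTER worker-1");        // announce ourselves
                String job;
                while ((job = in.readLine()) != null) {  // service pushes jobs
                    // ... run the job here ...
                    out.println("DONE " + job);          // result goes back on
                }                                        // the same connection
            }
        }
    }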

Is the 180 tasks/sec the overall throughput measured from Swift's point 
of view, including the overhead of wrapper.sh?  Or is it a 
micro-benchmark measuring just the coaster performance? 
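
(For scale: 180 jobs/s across 10 workers is ~18 jobs/s per worker, i.e. 
roughly 55 ms per job, so any per-job wrapper overhead would show up 
directly in that number.)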

Ioan


Mihael Hategan wrote:
> On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
>   
>> Mihael, this is great progress - very exciting.
>> Some questions (don't need answers right away):
>>
>> How would the end user use it? Manually start a service?
>> Is the service a separate process, or in the swift jvm?
>>     
>
> I thought the lines below answered some of these.
>
> A user would specify the coaster provider in sites.xml. The provider
> will then automatically deploy a service on the target machine without
> the user having to do so. Given that the service is on a different
> machine than the client, they can't be in the same JVM.
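>
> As a rough illustration (attribute names from memory; treat this as a
> sketch rather than the exact syntax), a sites.xml entry might look
> something like:
>
>     <pool handle="mycluster">
>       <execution provider="coaster" url="head.example.org"
>                  jobmanager="ssh:pbs"/>
>       <workdirectory>/home/user/swiftwork</workdirectory>
>     </pool>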
>
>   
>> How are the number of workers set or adjusted?
>>     
>
> Currently, workers are requested as needed, up to a maximum.  This is
> preliminary, hence "Better allocation strategy for workers" below.
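>
> As a baseline, the current logic amounts to something like this
> (hypothetical names, Java sketch):
>
>     class WorkerAllocator {
>         // Request enough workers to cover the queued jobs, capped at a
>         // configured maximum; a non-positive result means request nothing.
>         static int workersToRequest(int queuedJobs, int activeWorkers,
>                                     int maxWorkers) {
>             int needed = Math.min(queuedJobs, maxWorkers);
>             return Math.max(0, needed - activeWorkers);
>         }
>     }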
>
>   
>> Does a service manage workers on one cluster or many?
>>     
>
> One service per cluster.
>
>   
>> At 180 jobs/sec with 10 workers, what were the CPU loads on Swift, 
>> the workers, and the service?
>>     
>
> I faintly recall them being at less than 50% for some reason I don't
> understand.
>
>   
>> Do you want to try this on the workflows we're running on Falkon on the 
>> BG/P and SiCortex?
>>     
>
> Let me repeat "prototype" and "more testing". In no way do I want to do
> preliminary testing with an application that is shaky on an architecture
> that is also shaky.
>
> Mihael
>
>   
>> I'm eager to try it when you feel it's ready for others to test.
>>
>> Nice work!
>>
>> - Mike
>>
>>
>>
>> On 4/4/08 4:39 AM, Mihael Hategan wrote:
>>     
>>> I've been asked for a summary of the status of the coaster prototype, so
>>> here it is:
>>> - It's a prototype, so expect plenty of bugs
>>> - It's self deployed (you don't need to start a service on the target
>>> cluster)
>>> - You can also use it while starting a service on the target cluster
>>> - There is a worker written in Perl
>>> - It uses encryption between client and coaster service
>>> - It uses UDP between the service and the workers (this may prove to be
>>> a better or worse choice than TCP)
>>> - A preliminary test done locally shows an amortized throughput of
>>> around 180 jobs/s (/bin/date). This was done with encryption and with 10
>>> workers. Pretty picture attached (total time vs. # of jobs)
>>>
>>> To do:
>>> - The scheduling algorithm in the service needs a bit more work
>>> - When worker messages are lost, some jobs may get lost (i.e. it needs
>>> more fault tolerance; see the sketch after this list)
>>> - Start testing it on actual clusters
>>> - Do some memory consumption benchmarks
>>> - Better allocation strategy for workers
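>>>
>>> As a sketch of what "more fault tolerance" could mean (hypothetical
>>> names and framing; per-message acknowledgments over UDP, resending
>>> until the worker acks, so a dropped datagram delays a job instead of
>>> losing it):
>>>
>>>     import java.net.*;
>>>     import java.nio.charset.StandardCharsets;
>>>
>>>     public class ReliableSend {
>>>         // Send "<seq> <payload>" and wait for "ACK <seq>", retrying
>>>         // a few times before declaring the worker unreachable.
>>>         static void sendWithRetry(DatagramSocket sock, InetAddress worker,
>>>                                   int port, int seq, String payload)
>>>                 throws Exception {
>>>             byte[] msg = (seq + " " + payload)
>>>                     .getBytes(StandardCharsets.UTF_8);
>>>             DatagramPacket ack = new DatagramPacket(new byte[64], 64);
>>>             sock.setSoTimeout(500);            // ms to wait for an ack
>>>             for (int attempt = 0; attempt < 5; attempt++) {
>>>                 sock.send(new DatagramPacket(msg, msg.length,
>>>                                              worker, port));
>>>                 try {
>>>                     sock.receive(ack);
>>>                     String reply = new String(ack.getData(), 0,
>>>                             ack.getLength(), StandardCharsets.UTF_8);
>>>                     if (reply.equals("ACK " + seq)) return;  // delivered
>>>                 } catch (SocketTimeoutException e) {
>>>                     // timed out; loop around and resend
>>>                 }
>>>             }
>>>             throw new Exception("no ack after 5 attempts; retry the job");
>>>         }
>>>     }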
>>>
>>> Mihael
>>>
>>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================

