[Swift-devel] coaster status summary

Mihael Hategan hategan at mcs.anl.gov
Sat Apr 5 04:45:54 CDT 2008


On Fri, 2008-04-04 at 19:02 -0500, Ioan Raicu wrote:
> You say that you use UDP on the workers.  This might be more
> lightweight, but might also pose practical issues.

Of course. That is the trade-off.

> 
> Some of those are:
> - might not work well on any network other than a LAN

It works exactly as it's supposed to: no guarantee of uniqueness, no
guarantee of order, no guarantee of integrity, and no guarantee of
reliability. One has to drop duplicates, do checksums, re-order, have
time-outs.
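
To make the receiving side of that concrete, here is a minimal sketch in Python (chosen only for brevity). The framing (a CRC32 followed by a sequence number) and the names `encode` and `ReliableReceiver` are assumptions made up for illustration; this is not the coaster wire format. It covers checksums, duplicate dropping and re-ordering; time-outs belong to the sender and are not shown here.

```python
import struct
import zlib


def encode(seq, payload):
    """Frame a payload as: CRC32 (4 bytes) | sequence number (4 bytes) | payload."""
    body = struct.pack("!I", seq) + payload
    return struct.pack("!I", zlib.crc32(body)) + body


class ReliableReceiver:
    """Hypothetical receiver: delivers payloads in order, at most once each."""

    def __init__(self):
        self.next_seq = 0   # next sequence number to hand to the application
        self.pending = {}   # out-of-order payloads, keyed by sequence number

    def receive(self, datagram):
        """Process one datagram; return the payloads that became deliverable."""
        if len(datagram) < 8:
            return []                          # too short to hold the header
        (crc,) = struct.unpack_from("!I", datagram, 0)
        body = datagram[4:]
        if zlib.crc32(body) != crc:
            return []                          # corrupted: drop
        (seq,) = struct.unpack_from("!I", body, 0)
        if seq < self.next_seq or seq in self.pending:
            return []                          # duplicate: drop
        self.pending[seq] = body[4:]
        delivered = []
        # Release every payload that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

Feeding it message 1 before message 0 returns nothing for the first call and both payloads, in order, for the second. A real implementation would also bound the size of `pending`.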

> - won't be friendly to firewalls or NATs, no matter whether the service
> pushes jobs or workers pull jobs; the logic is that you need two-way
> communication, and using UDP (being a connectionless protocol), it's
> like having a server socket and a client socket on both ends of the
> communication at the same time.

Precisely so. In Java you can use one UDP socket as both client and
server. Perl seems to be nastier as it won't let you send and receive on
the same socket (at least in the implementation I've seen).
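
For illustration, the same idea sketched in Python rather than Java: each endpoint owns exactly one UDP socket and uses it for both sending and receiving. The message contents and the `demo` function are hypothetical and have nothing to do with the actual coaster protocol; both endpoints sit on the loopback interface so the example is self-contained.

```python
import socket


def make_endpoint():
    """Create one UDP socket bound to a free loopback port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    sock.settimeout(2.0)          # never block forever on a lost datagram
    return sock


def demo():
    """Exchange one message in each direction using a single socket per side."""
    service = make_endpoint()
    worker = make_endpoint()
    try:
        # The worker sends from its one socket...
        worker.sendto(b"ready", service.getsockname())
        msg, worker_addr = service.recvfrom(1024)
        # ...and the service answers to the address the datagram came from,
        # again using the same socket it just received on.
        service.sendto(b"job: /bin/date", worker_addr)
        reply, _ = worker.recvfrom(1024)
        return msg, reply
    finally:
        service.close()
        worker.close()
```

Calling `demo()` returns the two payloads as received on each side.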

>   This might not matter if the service and the worker are on the same
> LAN with no NATs or firewalls in the middle, but, it would matter on a
> machine such as the BG/P, as there is a NAT in between the login nodes
> and the compute nodes.

That's odd. Do you have anything to back that up?

>   In essence, for this to work on the BG/P, you'll need to avoid
> having server side sockets on the compute nodes (workers), and you'll
> probably only be able to do that via a connection oriented protocol
> (i.e. TCP).  Is switching to TCP a relatively straightforward option?
> If not, it might be worth implementing to make the implementation more
> flexible.
> - losing messages and recovering from them will likely be harder than
> anticipated; I have a UDP version of the notification engine that
> Falkon uses, and after much debugging, I gave up and switched over to
> TCP.  It worked most of the time, but the occasional lost message (1
> in 1000s, maybe even more rare) made Falkon unreliable, and hence I
> stopped using it.

Of course it's unreliable unless you deal with the reliability issues as
outlined above.
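
A rough sketch of the sender's half of that, assuming a simple acknowledge-and-retransmit scheme. The names `send_reliably` and `LossyChannel` are invented for this example, the channel is a toy in-memory stand-in for a socket, and none of this is taken from the coaster code.

```python
def send_reliably(send, wait_for_ack, seq, payload, retries=5, timeout=0.5):
    """Resend until acknowledged; return the number of attempts used."""
    for attempt in range(1, retries + 1):
        send(seq, payload)
        if wait_for_ack(seq, timeout):
            return attempt
    raise TimeoutError(f"no ack for message {seq} after {retries} attempts")


class LossyChannel:
    """Toy channel that silently drops the first `drops` messages."""

    def __init__(self, drops):
        self.drops = drops
        self.delivered = []   # what the far end actually received
        self.acked = set()    # sequence numbers the far end acknowledged

    def send(self, seq, payload):
        if self.drops > 0:
            self.drops -= 1   # simulate a lost datagram
            return
        self.delivered.append((seq, payload))
        self.acked.add(seq)

    def wait_for_ack(self, seq, timeout):
        # A real implementation would block on the socket for up to `timeout`
        # seconds; the toy channel answers immediately and ignores it.
        return seq in self.acked
```

With `LossyChannel(2)`, `send_reliably(ch.send, ch.wait_for_ack, 7, b"job")` succeeds on the third attempt. If an acknowledgement rather than the message itself is lost, the retransmission arrives as a duplicate, which is why the receiver has to drop duplicates.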

> 
> Is the 180 tasks/sec the overall throughput measured from Swift's
> point of view, including overhead of wrapper.sh?  Or is that a
> micro-benchmark measuring just the coaster performance?  

It's at the provider level. No wrapper.sh.

> 
> Ioan
> 
> 
> Mihael Hategan wrote: 
> > On Fri, 2008-04-04 at 06:59 -0500, Michael Wilde wrote:
> >   
> > > Mihael, this is great progress - very exciting.
> > > Some questions (don't need answers right away):
> > > 
> > > How would the end user use it? Manually start a service?
> > > Is the service a separate process, or in the swift jvm?
> > >     
> > 
> > I thought the lines below answered some of these.
> > 
> > A user would specify the coaster provider in sites.xml. The provider
> > will then automatically deploy a service on the target machine without
> > the user having to do so. Given that the service is on a different
> > machine than the client, they can't be in the same JVM.
> > 
> >   
> > > How are the number of workers set or adjusted?
> > >     
> > 
> > Currently as many workers are requested as needed, up to a maximum. This
> > is preliminary, hence "Better allocation strategy for workers".
> > 
> >   
> > > Does a service manage workers on one cluster or many?
> > >     
> > 
> > One service per cluster.
> > 
> >   
> > > At 180 jobs/sec with 10 workers, what were the CPU loads on swift, 
> > > worker and service?
> > >     
> > 
> > I faintly recall them being at less than 50% for some reason I don't
> > understand.
> > 
> >   
> > > Do you want to try this on the workflows we're running on Falkon on the 
> > > BGP and SiCortex?
> > >     
> > 
> > Let me repeat "prototype" and "more testing". In no way do I want to do
> > preliminary testing with an application that is shaky on an architecture
> > that is also shaky.
> > 
> > Mihael
> > 
> >   
> > > I'm eager to try it when you feel it's ready for others to test.
> > > 
> > > Nice work!
> > > 
> > > - Mike
> > > 
> > > 
> > > 
> > > On 4/4/08 4:39 AM, Mihael Hategan wrote:
> > >     
> > > > I've been asked for a summary of the status of the coaster prototype, so
> > > > here it is:
> > > > - It's a prototype so bugs are plenty
> > > > - It's self deployed (you don't need to start a service on the target
> > > > cluster)
> > > > - You can also use it while starting a service on the target cluster
> > > > - There is a worker written in Perl
> > > > - It uses encryption between client and coaster service
> > > > - It uses UDP between the service and the workers (this may prove to be
> > > > a better or worse choice than TCP)
> > > > - A preliminary test done locally shows an amortized throughput of
> > > > around 180 jobs/s (/bin/date). This was done with encryption and with 10
> > > > workers. Pretty picture attached (total time vs. # of jobs)
> > > > 
> > > > To do:
> > > > - The scheduling algorithm in the service needs a bit more work
> > > > - When worker messages are lost, some jobs may get lost (i.e. needs more
> > > > fault tolerance)
> > > > - Start testing it on actual clusters
> > > > - Do some memory consumption benchmarks
> > > > - Better allocation strategy for workers
> > > > 
> > > > Mihael
> > > > 
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >       
> > 
> >   
> 
> -- 
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
> 



