[Swift-devel] Clustering and Temp Dirs with Swift

Fri Oct 26 18:39:45 CDT 2007

On Fri, 2007-10-26 at 15:11 -0500, Ioan Raicu wrote:
> I am not sure what configuration exists on TP, but on the TeraGrid 
> ANL/UC cluster, with 8 servers behind GPFS, the wrapper script 
> performance (create dir, create symbolic links, remove directory... all 
> on GPFS) is anywhere between 20~40 / sec, depending on how many nodes 
> you have doing this concurrently.  The throughput increases first as you 
> add nodes, but then decreases down to about 20/sec with 20~30+ nodes.  
> What this means is that even if you bundle jobs up, you will not get 
> anything better than this, throughput wise, regardless of how short the 
> jobs are.  Now, if TP has fewer than 8 servers, it's likely that the 
> throughput it can sustain is even lower,

Perhaps in terms of bytes/s. But I wouldn't be so sure that this applies
to other file operations (creating directories, links and so on).
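
One way to settle this would be to measure it directly. Below is a minimal
sketch in Python, not the actual Swift wrapper script: it only mimics the
cycle Ioan describes (create dir, create symbolic link, remove directory) and
reports cycles per second. The function name, the "job-NNNNNN" naming and the
default of 200 cycles are assumptions made up for illustration. Run it with
the base directory on the shared file system, ideally from several nodes at
once, and compare against local disk.

```python
import os
import shutil
import tempfile
import time


def wrapper_cycles_per_sec(base_dir, cycles=200):
    """Time `cycles` rounds of: create job dir, symlink an input, remove dir."""
    # A single small "input" file that every job dir links to.
    target = os.path.join(base_dir, "input.dat")
    with open(target, "w") as f:
        f.write("x")

    start = time.perf_counter()
    for i in range(cycles):
        job_dir = os.path.join(base_dir, "job-%06d" % i)
        os.mkdir(job_dir)
        os.symlink(target, os.path.join(job_dir, "input.dat"))
        shutil.rmtree(job_dir)
    elapsed = time.perf_counter() - start

    os.remove(target)
    return cycles / elapsed


if __name__ == "__main__":
    # Hypothetical default: a fresh temp dir on local disk. Replace `base`
    # with a directory on the shared file system to test that instead.
    base = tempfile.mkdtemp(prefix="fsbench-")
    try:
        print("%.1f cycles/sec" % wrapper_cycles_per_sec(base))
    finally:
        shutil.rmtree(base, ignore_errors=True)
```

This measures metadata operations only, so it says nothing about bytes/s.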

>  and if you push it over the 
> edge, even to the point of thrashing where the throughput can be 
> extremely small.   I don't have any suggestions for how you can get 
> around this, with the exception of making your job sizes larger on 
> average, and hence have fewer jobs over the same period of time.
> 
> Ioan
> 
> Andrew Robert Jamieson wrote:
> > I am kind of at a standstill on getting anything done on TP right 
> > now with this problem. Are there any suggestions to overcome this for 
> > the time being?
> >
> > On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
> >
> >> Hello all,
> >>
> >>  I am encountering the following problem on Teraport.  I submit a 
> >> clustered swift WF which should amount to something on the order of 
> >> 850x3 individual jobs total. I have clustered the jobs because they 
> >> are very fast (somewhere around 20 sec to 1 min long).  When I submit 
> >> the WF on TP things start out fantastic, I get 10s of output files in 
> >> a matter of seconds and nodes would start and finish clustered 
> >> batches in a matter of minutes or less. However, after waiting about 
> >> 3-5 mins, when clustered jobs begin to line up in the queue and 
> >> more start running at the same time, things start to slow down to a 
> >> trickle in terms of output.
> >>
> >> One thing I noticed is when I try a simple ls on TP in the swift temp 
> >> running directory where the temp job dirs are created and destroyed, 
> >> it takes a very long time.  And when it is done only five or so things 
> >> are in the dir. (this is the dir with "info  kickstart  shared  
> >> status wrapper.log" in it).  What I think is happening is that TP's 
> >> filesystem can't handle this extremely rapid creation/destruction of 
> >> directories in that shared location. From what I have been told these 
> >> temp dirs come and go as long as the job runs successfully.
> >>
> >> What I am wondering is if there is any way to move that dir to the 
> >> local node tmp directory rather than the shared file system while it 
> >> is running, and if something fails then have it sent to the 
> >> appropriate place.
> >>
> >> Or, could another layer of temp dir wrapping be applied, labeled 
> >> perhaps by clustered job grouping rather than by individual job 
> >> (since there are thousands being computed at once)? These dirs 
> >> would then only be generated/deleted every 5 or 10 mins (if 
> >> clustered properly on my part) instead of one event every 
> >> millisecond or what have you.
> >>
> >> I don't know which solution is feasible or if any are at all, but 
> >> this seems to be a major problem for my WFs.  In general it is never 
> >> good to have a million things coming and going on a shared file 
> >> system in one place, from my experience at least.
> >>
> >>
> >> Thanks,
> >> Andrew
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>
> >
> 
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================
> 
>
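
As for Andrew's first idea (run in the local node tmp directory and only send
things to the shared file system if something fails), here is a rough sketch
of what that could look like. It is written in Python for illustration only
and is not how the Swift wrapper works: run_in_local_scratch, the
stdout.txt/stderr.txt file names and the shared "failed" directory are all
hypothetical. It also ignores staging of inputs and outputs, which a real
wrapper would still have to do.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_in_local_scratch(cmd, job_id, shared_failed_dir, local_tmp=None):
    """Run cmd in a node-local temp dir; keep the dir only if the job fails.

    Returns the command's exit code.
    """
    # local_tmp=None lets tempfile choose the node's default temp location.
    work = tempfile.mkdtemp(prefix=job_id + "-", dir=local_tmp)
    try:
        with open(os.path.join(work, "stdout.txt"), "wb") as out, \
             open(os.path.join(work, "stderr.txt"), "wb") as err:
            rc = subprocess.call(cmd, cwd=work, stdout=out, stderr=err)
        if rc != 0:
            # The shared file system is touched once, and only on failure.
            os.makedirs(shared_failed_dir, exist_ok=True)
            shutil.copytree(work, os.path.join(shared_failed_dir, job_id))
        return rc
    finally:
        # The local job dir is always removed, success or failure.
        shutil.rmtree(work, ignore_errors=True)


if __name__ == "__main__":
    # Hypothetical usage: a job that fails with exit code 3.
    shared = tempfile.mkdtemp(prefix="shared-")
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_in_local_scratch(failing, "job-000001", shared))
    shutil.rmtree(shared, ignore_errors=True)
```

With something like this, a successful job never creates or deletes a
directory on the shared file system, so the per-job mkdir/rmdir load moves to
local disk.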