[Swift-devel] Clustering and Temp Dirs with Swift

Mihael Hategan hategan at mcs.anl.gov
Fri Oct 26 23:48:38 CDT 2007


On Fri, 2007-10-26 at 23:35 -0500, Ioan Raicu wrote:
> But the scenario is probably more parallelizable, in theory.  There is
> some common path, say /shared/common/path, and then you have x
> directories that you want to create in that path, say dir1, dir2, ... ,
> dirx.  If the metadata is distributed over the 8 I/O servers, then
> creating these x directories should be load balanced across the 8 I/O
> servers.  If the metadata is centralized, they will all hit the same
> server.  In the end, it doesn't really matter.  What matters is that it
> limits the job granularity you can really have, as the cost of the
> mkdir and rmdir can quickly outpace the cost of the computation and of
> staging data in and out.  It would be great to have some alternatives
> for workflows that need more throughput than GPFS can handle.
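
A minimal sketch of the kind of metadata microbenchmark being discussed
here, assuming a hypothetical shared parent directory (/shared/common/path
is taken from the message; the script, its arguments, and the counts are
illustrative, not part of Swift). Running one copy per node against the
same parent and summing the reported rates would show whether directory
creation scales with the number of I/O servers or serializes on one of
them.

    # mkdir/rmdir throughput probe -- illustrative only; run one copy per
    # node against the same shared parent directory and sum the rates.
    import os
    import sys
    import time
    import uuid

    # Assumed shared-FS path; the parent directory must already exist.
    parent = sys.argv[1] if len(sys.argv) > 1 else "/shared/common/path"
    count = int(sys.argv[2]) if len(sys.argv) > 2 else 200

    tag = uuid.uuid4().hex[:8]          # keep names unique across nodes
    start = time.time()
    for i in range(count):
        d = os.path.join(parent, "dir-%s-%d" % (tag, i))
        os.mkdir(d)                     # metadata create on the shared FS
        os.rmdir(d)                     # metadata remove on the shared FS
    elapsed = time.time() - start

    # Each iteration is one mkdir plus one rmdir, i.e. two metadata ops.
    print("%.1f dir create+remove pairs/sec" % (count / elapsed))
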

It's hard to ensure correctness in distributed systems. And for some
specific problems, it is impossible. Leslie Lamport's page with papers
is a rich source of seemingly trivial issues that are actually hard.

What should raise a flag is that a bunch of relatively competitive teams
haven't been able to make it very good, given that they had quite a bit
of time. Unless you gain more knowledge of the topic, it's somewhat
likely that you have missed some aspect of the problem. You may have
something, but I for one am incapable of assessing whether you actually
do (or not).

You may also want to consider trade-offs between the performance of
small operations and the performance of big operations. Perhaps, if such
a trade-off is necessary, the GPFS designers biased things toward big
operations.

Mihael

> 
> Ioan
> 
> Mihael Hategan wrote: 
> > On Fri, 2007-10-26 at 23:02 -0500, Ioan Raicu wrote:
> >   
> > > If it doesn't apply to metadata operations, such as directories, then
> > > it means that metadata changes in the file system are rather
> > > centralized (maybe this explains the relatively poor performance for
> > > creating and removing directories).
> > >     
> > 
> > On GPFS, according to my understanding of their documentation, exactly
> > one node controls access to one file at any given time. If, for all
> > observable aspects of the implementation, a directory is a file with a
> > bunch of metadata for the files it contains, then doing things in a
> > directory from multiple places is similar to accessing the same file
> > from multiple places.
> > 
> > Unless I'm blatantly wrong. Probably some complications of that model
> > exist even if I'm not.
> > 
> >   
> > > I would be curious to see how well a solution that moves data to
> > > local disk prior to processing would work, to avoid working from the
> > > shared file system (including the creation and removal of the scratch
> > > temp directory on GPFS).
> > > 
> > > Ioan  
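
A minimal sketch of that local-disk approach, with hypothetical paths and
a hypothetical job command (this is not Swift's actual wrapper script,
only an illustration): the per-job directory is created and removed on
node-local /tmp, and the shared file system is touched only to stage
files in and out, plus to preserve the sandbox when a job fails.

    # Illustrative local-scratch sandbox -- hypothetical paths and
    # command, not the real Swift wrapper.
    import os
    import shutil
    import subprocess
    import tempfile

    SHARED = "/shared/common/path/run01"   # assumed shared-FS run directory
    LOCAL_TMP = "/tmp"                     # node-local scratch

    def run_job(job_id, inputs, command):
        # Per-job directory lives on local disk, so its mkdir/rmdir never
        # hit GPFS.
        sandbox = tempfile.mkdtemp(prefix="job-%s-" % job_id, dir=LOCAL_TMP)
        try:
            for f in inputs:               # stage inputs in from the shared FS
                shutil.copy(os.path.join(SHARED, "shared", f), sandbox)
            rc = subprocess.call(command, cwd=sandbox)
            if rc == 0:
                # Stage outputs back; only file copies touch the shared FS.
                for f in os.listdir(sandbox):
                    if f not in inputs:
                        shutil.copy(os.path.join(sandbox, f),
                                    os.path.join(SHARED, "shared", f))
            else:
                # On failure, preserve the whole sandbox on GPFS for debugging.
                shutil.copytree(sandbox,
                                os.path.join(SHARED, "failed", "job-%s" % job_id))
            return rc
        finally:
            shutil.rmtree(sandbox, ignore_errors=True)   # local cleanup only
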
> > > 
> > > Mihael Hategan wrote: 
> > >     
> > > > On Fri, 2007-10-26 at 15:11 -0500, Ioan Raicu wrote:
> > > >   
> > > >       
> > > > > I am not sure what configuration exists on TP, but on the TeraGrid 
> > > > > ANL/UC cluster, with 8 servers behind GPFS, the wrapper script 
> > > > > performance (create dir, create symbolic links, remove directory... all 
> > > > > on GPFS) is anywhere between 20-40/sec, depending on how many nodes 
> > > > > you have doing this concurrently.  The throughput first increases as you 
> > > > > add nodes, but then decreases to about 20/sec with 20-30+ nodes.  
> > > > > What this means is that even if you bundle jobs up, you will not get 
> > > > > anything better than this, throughput-wise, regardless of how short the 
> > > > > jobs are.  Now, if TP has fewer than 8 servers, it's likely that the 
> > > > > throughput it can sustain is even lower,
> > > > >     
> > > > >         
> > > > Perhaps in terms of bytes/s. But I wouldn't be so sure that this applies
> > > > to other file operations.
> > > > 
> > > >   
> > > >       
> > > > > and if you push it over the 
> > > > > edge, it can even reach the point of thrashing, where the throughput 
> > > > > becomes extremely small.   I don't have any suggestions for how you can 
> > > > > get around this, other than making your job sizes larger on average, 
> > > > > and hence having fewer jobs over the same period of time.
> > > > > 
> > > > > Ioan
> > > > > 
> > > > > Andrew Robert Jamieson wrote:
> > > > >     
> > > > >         
> > > > > > I am kind of at a standstill for getting anything done on TP right 
> > > > > > now with this problem. Are there any suggestions to overcome this for 
> > > > > > the time being?
> > > > > > 
> > > > > > On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
> > > > > > 
> > > > > >       
> > > > > >           
> > > > > > > Hello all,
> > > > > > > 
> > > > > > >  I am encountering the following problem on Teraport.  I submit a 
> > > > > > > clustered Swift WF which should amount to something on the order of 
> > > > > > > 850x3 individual jobs total. I have clustered the jobs because they 
> > > > > > > are very fast (somewhere around 20 sec to 1 min long).  When I submit 
> > > > > > > the WF on TP, things start out fantastic: I get tens of output files 
> > > > > > > in a matter of seconds, and nodes start and finish clustered 
> > > > > > > batches in a matter of minutes or less. However, after about 
> > > > > > > 3-5 mins, when clustered jobs begin to line up in the queue and 
> > > > > > > more start running at the same time, things slow down to a 
> > > > > > > trickle in terms of output.
> > > > > > > 
> > > > > > > One thing I noticed is that when I try a simple ls on TP in the Swift 
> > > > > > > temp running directory where the temp job dirs are created and 
> > > > > > > destroyed, it takes a very long time.  And when it is done, only five 
> > > > > > > or so things are in the dir (this is the dir with "info  kickstart  
> > > > > > > shared  status  wrapper.log" in it).  What I think is happening is 
> > > > > > > that TP's filesystem can't handle this extremely rapid 
> > > > > > > creation/destruction of directories in that shared location. From 
> > > > > > > what I have been told, these temp dirs come and go as long as the job 
> > > > > > > runs successfully.
> > > > > > > 
> > > > > > > What I am wondering is if there is any way to move that dir to the 
> > > > > > > local node's tmp directory, not the shared file system, while the job 
> > > > > > > is running, and if something fails, then have it sent to the 
> > > > > > > appropriate place.
> > > > > > > 
> > > > > > > Or, another layer of temp dir wrapping could be applied, labeled 
> > > > > > > perhaps with respect to the clustered job grouping and not simply the 
> > > > > > > individual jobs (since there are thousands being computed at once). 
> > > > > > > That way these directories would only be generated/deleted every 5 or 
> > > > > > > 10 mins (if clustered properly on my part) instead of one event every 
> > > > > > > millisecond or what have you.
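
A minimal sketch of that per-batch layering, again with hypothetical
paths and commands rather than Swift's real clustering code: the shared
file system sees one mkdir when a clustered batch starts and one cleanup
when it ends, while the per-job directories come and go on local scratch.

    # Illustrative per-batch directory layout -- one shared-FS mkdir/rmdir
    # per clustered batch instead of one per job; hypothetical paths.
    import os
    import shutil
    import subprocess
    import tempfile

    SHARED = "/shared/common/path/run01"   # assumed shared-FS run directory
    LOCAL_TMP = "/tmp"                     # node-local scratch

    def run_batch(batch_id, jobs):
        """jobs: list of (job_id, command) pairs clustered into one batch."""
        # One directory on the shared FS for the whole batch (one metadata op).
        batch_dir = os.path.join(SHARED, "batch-%s" % batch_id)
        os.mkdir(batch_dir)
        failures = []
        for job_id, command in jobs:
            # Per-job sandboxes stay on local disk; GPFS is untouched per job.
            sandbox = tempfile.mkdtemp(prefix="job-%s-" % job_id, dir=LOCAL_TMP)
            try:
                rc = subprocess.call(command, cwd=sandbox)
                if rc != 0:
                    failures.append(job_id)
                    # Keep only failed sandboxes on the shared FS for debugging.
                    shutil.copytree(sandbox,
                                    os.path.join(batch_dir, "job-%s" % job_id))
            finally:
                shutil.rmtree(sandbox, ignore_errors=True)
        if not failures:
            os.rmdir(batch_dir)            # second and last metadata op per batch
        return failures
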
> > > > > > > 
> > > > > > > I don't know which solution is feasible or if any are at all, but 
> > > > > > > this seems to be a major problem for my WFs.  In general it is never 
> > > > > > > good to have a million things coming and going on a shared file 
> > > > > > > system in one place, from my experience at least.
> > > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Andrew
> > > > 
> > > >       
> > 
> > 
> >   
> 
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================
> ============================================



