[Swift-devel] Clustering and Temp Dirs with Swift
Mihael Hategan
hategan at mcs.anl.gov
Fri Oct 26 23:48:38 CDT 2007
On Fri, 2007-10-26 at 23:35 -0500, Ioan Raicu wrote:
> But the scenario is probably more parallelizable, in theory. There is
> some common path, say /shared/common/path, and then you have x
> directories that you want to create in that path, say dir1, dir2, ... ,
> dirx. If the meta-data information is distributed over the 8 I/O
> servers, then creating these x directories should be load balanced
> across the 8 I/O servers. If the meta-data is centralized, they will
> all hit the same server. In the end, it doesn't really matter. What
> matters is that it limits the job granularity you can really have, as
> the cost of the mkdir and rmdir can quickly outpace the cost of
> computation and data staging in and out. It would be great to have
> some alternatives, for workflows that need more throughput than GPFS
> can handle.
It's hard to ensure correctness in distributed systems, and for some
specific problems it is impossible. Leslie Lamport's page of papers is a
rich source of seemingly trivial issues that turn out to be hard.
What should raise a flag is that a number of fairly competitive teams
haven't managed to make this perform well, even though they have had
quite a bit of time. Unless you dig deeper into the topic, it's somewhat
likely that you're missing some aspect of the problem. You may have
something, but I for one am not able to assess whether you actually
do (or not).
You may also want to consider trade-offs between the performance of small
operations and the performance of big operations. Perhaps, if such a
trade-off is necessary, the GPFS designers biased things toward big operations.
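For what it's worth, the small-operation side of that trade-off is easy to
probe. The sketch below just hammers one shared directory with mkdir/rmdir
pairs at increasing concurrency and reports the sustained rate; the mount
point, directory counts, and worker counts are all made up, and you would run
one copy per node to approximate a multi-node client load:

    #!/usr/bin/env python
    # Sketch only: time mkdir+rmdir pairs against a shared directory at
    # increasing concurrency. The path and counts below are hypothetical;
    # run one copy per node to approximate load from many client nodes.
    import os
    import time
    from multiprocessing import Pool

    SHARED_PATH = "/shared/common/path"   # hypothetical GPFS-backed directory
    DIRS_PER_WORKER = 100

    def churn(worker_id):
        # Each worker creates and removes its own set of scratch directories.
        for i in range(DIRS_PER_WORKER):
            d = os.path.join(SHARED_PATH, "dir-%d-%d" % (worker_id, i))
            os.mkdir(d)
            os.rmdir(d)
        return DIRS_PER_WORKER

    if __name__ == "__main__":
        for workers in (1, 2, 4, 8, 16, 32):
            start = time.time()
            with Pool(workers) as pool:
                total = sum(pool.map(churn, range(workers)))
            rate = total / (time.time() - start)
            print("%2d workers: %.1f mkdir+rmdir pairs/sec" % (workers, rate))

If the curve peaks and then falls off the way the numbers further down in this
thread suggest, that is the granularity ceiling, independent of how the jobs
are bundled.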
Mihael
>
> Ioan
>
> Mihael Hategan wrote:
> > On Fri, 2007-10-26 at 23:02 -0500, Ioan Raicu wrote:
> >
> > > If it doesn't apply to meta-data operations, such as directories, then
> > > it means that meta-data changes in the file system are rather
> > > centralized (maybe this explains the relatively poor performance for
> > > creating and removing directories).
> > >
> >
> > On GPFS, according to my understanding of their documentation, exactly
> > one node controls access to one file at any given time. If, for all
> > observable aspects of the implementation, a directory is a file with a
> > bunch of metadata for the files it contains, then doing things in a
> > directory from multiple places is similar to accessing the same file
> > from multiple places.
> >
> > Unless I'm blatantly wrong. Probably some complications of that model
> > exist even if I'm not.
> >
> >
> > > I would be curious to see how well a solution works that moves data
> > > to local disk prior to processing, so as to avoid working from the
> > > shared file system altogether (including the creation and removal of
> > > the scratch temp directory on GPFS).
> > >
> > > Ioan
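For what that could look like in practice, here is a minimal sketch of the
local-scratch idea: build the per-job sandbox on node-local /tmp, touch the
shared file system only to copy inputs in and results out, and clean up
locally. This is not Swift's actual wrapper; the paths, the job interface,
and the copy-back policy are all hypothetical.

    #!/usr/bin/env python
    # Sketch of staging through node-local scratch instead of creating and
    # removing the per-job sandbox on GPFS. Paths and the job interface are
    # hypothetical; this is not the real Swift wrapper script.
    import os
    import shutil
    import subprocess
    import tempfile

    SHARED_WORKDIR = "/gpfs/scratch/workflow"   # hypothetical shared run dir

    def run_job(job_id, input_files, command):
        # The per-job sandbox lives on local disk, so GPFS never sees its
        # creation, its links/copies, or its removal.
        local_dir = tempfile.mkdtemp(prefix="job-%s-" % job_id, dir="/tmp")
        try:
            # Stage inputs from the shared file system onto local disk.
            for f in input_files:
                shutil.copy(os.path.join(SHARED_WORKDIR, f), local_dir)
            # Run the application entirely against local disk.
            result = subprocess.run(command, cwd=local_dir)
            # Only the results (and, on failure, the logs) go back to GPFS.
            dest = os.path.join(SHARED_WORKDIR, "output", str(job_id))
            shutil.copytree(local_dir, dest)
            return result.returncode
        finally:
            shutil.rmtree(local_dir, ignore_errors=True)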
> > >
> > > Mihael Hategan wrote:
> > >
> > > > On Fri, 2007-10-26 at 15:11 -0500, Ioan Raicu wrote:
> > > >
> > > >
> > > > > I am not sure what configuration exists on TP, but on the TeraGrid
> > > > > ANL/UC cluster, with 8 servers behind GPFS, the wrapper script
> > > > > performance (create dir, create symbolic links, remove directory... all
> > > > > on GPFS) is anywhere between 20-40/sec, depending on how many nodes
> > > > > you have doing this concurrently. The throughput increases at first as
> > > > > you add nodes, but then decreases to about 20/sec with 20-30+ nodes.
> > > > > What this means is that even if you bundle jobs up, you will not get
> > > > > anything better than this, throughput-wise, regardless of how short the
> > > > > jobs are. Now, if TP has fewer than 8 servers, it's likely that the
> > > > > throughput it can sustain is even lower,
> > > > >
> > > > >
> > > > Perhaps in terms of bytes/s. But I wouldn't be so sure that this applies
> > > > to other file system operations.
> > > >
> > > >
> > > >
> > > > > and if you push it over the
> > > > > edge, it can reach the point of thrashing, where the throughput becomes
> > > > > extremely small. I don't have any suggestions for how you can get
> > > > > around this, other than making your job sizes larger on
> > > > > average, and hence having fewer jobs over the same period of time.
> > > > >
> > > > > Ioan
> > > > >
> > > > > Andrew Robert Jamieson wrote:
> > > > >
> > > > >
> > > > > > I am kind of at a standstill for getting anything done on TP right
> > > > > > now with this problem. Are there any suggestions for overcoming this
> > > > > > for the time being?
> > > > > >
> > > > > > On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > I am encountering the following problem on Teraport. I submit a
> > > > > > > clustered Swift WF which should amount to something on the order of
> > > > > > > 850x3 individual jobs total. I have clustered the jobs because they
> > > > > > > are very fast (somewhere around 20 sec to 1 min long). When I submit
> > > > > > > the WF on TP, things start out fantastic: I get tens of output files
> > > > > > > in a matter of seconds, and nodes start and finish clustered
> > > > > > > batches in a matter of minutes or less. However, after waiting about
> > > > > > > 3-5 mins, when clustered jobs begin to line up in the queue and
> > > > > > > more start running at the same time, output slows down to a
> > > > > > > trickle.
> > > > > > >
> > > > > > > One thing I noticed is that when I try a simple ls on TP in the Swift
> > > > > > > temp running directory, where the temp job dirs are created and
> > > > > > > destroyed, it takes a very long time. And when it is done, only five or
> > > > > > > so things are in the dir (this is the dir with "info kickstart shared
> > > > > > > status wrapper.log" in it). What I think is happening is that TP's
> > > > > > > filesystem can't handle this extremely rapid creation/destruction of
> > > > > > > directories in that shared location. From what I have been told, these
> > > > > > > temp dirs come and go as long as the job runs successfully.
> > > > > > >
> > > > > > > What I am wondering is if there is any way to put that dir on the
> > > > > > > local node's tmp directory rather than the shared file system while
> > > > > > > the job is running, and, if something fails, have its contents sent to
> > > > > > > the appropriate place.
> > > > > > >
> > > > > > > Or, whether another layer of temp dir wrapping could be applied,
> > > > > > > labeled perhaps with respect to the clustered job grouping and not
> > > > > > > simply the individual jobs (since there are thousands being computed
> > > > > > > at once), so that these things would only be generated/deleted every
> > > > > > > 5 or 10 minutes (if clustered properly on my part) instead of one
> > > > > > > event every millisecond or what have you.
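One way to read that suggestion, sketched below: give each clustered batch a
single scratch directory on the shared file system and put the per-job
sandboxes under it, tearing everything down once per batch rather than once
per job, so GPFS sees one create/remove cycle every few minutes instead of a
constant stream. The names, the callable job interface, and the shared path
are all hypothetical.

    #!/usr/bin/env python
    # Sketch of per-cluster (rather than per-job) temp dir handling. The
    # shared path and the callable job interface are made up for illustration.
    import os
    import shutil

    SHARED_WORKDIR = "/gpfs/scratch/workflow"   # hypothetical shared run dir

    def run_cluster(cluster_id, jobs):
        # One scratch directory per clustered batch on the shared file system.
        cluster_dir = os.path.join(SHARED_WORKDIR, "cluster-%s" % cluster_id)
        os.makedirs(cluster_dir)
        try:
            for i, job in enumerate(jobs):
                # Per-job sandboxes are subdirectories of the batch directory;
                # they are not individually removed as each job finishes.
                job_dir = os.path.join(cluster_dir, "job-%d" % i)
                os.mkdir(job_dir)
                job(job_dir)          # run the job with job_dir as its sandbox
        finally:
            # The whole batch is torn down in one pass when the cluster ends.
            shutil.rmtree(cluster_dir, ignore_errors=True)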
> > > > > > >
> > > > > > > I don't know which of these solutions is feasible, or if any are,
> > > > > > > but this seems to be a major problem for my WFs. In general, it is
> > > > > > > never good to have a million things coming and going in one place on
> > > > > > > a shared file system, in my experience at least.
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Andrew
>
> --
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dsl.cs.uchicago.edu/
> ============================================
> ============================================