[Swift-devel] Clustering and Temp Dirs with Swift
Mihael Hategan
hategan at mcs.anl.gov
Fri Oct 26 18:39:45 CDT 2007
On Fri, 2007-10-26 at 15:11 -0500, Ioan Raicu wrote:
> I am not sure what configuration exists on TP, but on the TeraGrid
> ANL/UC cluster, with 8 servers behind GPFS, the wrapper script
> performance (create dir, create symbolic links, remove directory... all
> on GPFS) is anywhere between 20~40 / sec, depending on how many nodes
> you have doing this concurrently. The throughput increases first as you
> add nodes, but then decreases down to about 20/sec with 20~30+ nodes.
> What this means is that even if you bundle jobs up, you will not get
> anything better than this, throughput wise, regardless of how short the
> jobs are. Now, if TP has fewer than 8 servers, it's likely that the
> throughput it can sustain is even lower,
Perhaps in terms of bytes/s. But I wouldn't be so sure that this applies
to other kinds of file operations.
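
One way to settle this would be to measure it directly. Below is a
minimal sketch, not the actual Swift wrapper script, that repeats the
sequence Ioan describes (create dir, create symbolic links, remove
directory) and reports how many job directories per second a given
file system sustains. The function name, the "job-NNNNNN" naming, the
dummy "input.dat" target and the three links per job are all
assumptions made for illustration.

```python
import os
import shutil
import tempfile
import time


def wrapper_like_ops(base_dir, n_jobs, links_per_job=3):
    """Create a job dir, add symlinks, then remove it, n_jobs times.

    Returns the number of completed job directories per second.
    """
    # Dummy symlink target (hypothetical name); it only needs to exist.
    target = os.path.join(base_dir, "input.dat")
    with open(target, "w") as f:
        f.write("x")

    start = time.perf_counter()
    for i in range(n_jobs):
        job_dir = os.path.join(base_dir, "job-%06d" % i)
        os.mkdir(job_dir)
        for k in range(links_per_job):
            os.symlink(target, os.path.join(job_dir, "in%d" % k))
        shutil.rmtree(job_dir)
    elapsed = time.perf_counter() - start

    os.remove(target)
    return n_jobs / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # The default temp dir is only a stand-in; to test the shared file
    # system, create the base directory there instead.
    base = tempfile.mkdtemp(prefix="fsbench-")
    try:
        rate = wrapper_like_ops(base, n_jobs=200)
        print("%.1f job dirs/sec" % rate)
    finally:
        shutil.rmtree(base, ignore_errors=True)
```

Running it from one node and then from several nodes at once against
the same shared directory would show whether the rate drops with
concurrency in the way described above.
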
> and if you push it over the
> edge, even to the point of thrashing where the throughput can be
> extremely small. I don't have any suggestions for how you can get
> around this, with the exception of making your job sizes larger on
> average, and hence have fewer jobs over the same period of time.
>
> Ioan
>
> Andrew Robert Jamieson wrote:
> > I am kind of at a standstill for getting anything done on TP right
> > now with this problem. Are there any suggestions to overcome this for
> > the time being?
> >
> > On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
> >
> >> Hello all,
> >>
> >> I am encountering the following problem on Teraport. I submit a
> >> clustered swift WF which should amount to something on the order of
> >> 850x3 individual jobs total. I have clustered the jobs because they
> >> are very fast (somewhere around 20 sec to 1 min long). When I submit
> >> the WF on TP things start out fantastic, I get 10s of output files in
> >> a matter of seconds and nodes would start and finish clustered
> >> batches in a matter of minutes or less. However, after waiting about
> >> 3-5 mins, when clustered jobs begin to line up in the queue and
> >> more start running at the same time, things start to slow down to a
> >> trickle in terms of output.
> >>
> >> One thing I noticed is when I try a simple ls on TP in the swift temp
> >> running directory where the temp job dirs are created and destroyed,
> >> it takes a very long time. And when it is done only five or so things
> >> are in the dir. (this is the dir with "info kickstart shared
> >> status wrapper.log" in it). What I think is happening is that TP's
> >> filesystem can't handle this extremely rapid creation/destruction of
> >> directories in that shared location. From what I have been told these
> >> temp dirs come and go as long as the job runs successfully.
> >>
> >> What I am wondering is if there is any way to move that dir to the
> >> local node tmp directory, not the shared file system, while it is
> >> running and if something fails then have it sent to the appropriate
> >> place.
> >>
> >> Or, if another layer of temp dir wrapping could be applied,
> >> labeled perhaps with respect to the clustered job grouping and not
> >> simply the individual jobs (since there are thousands being computed
> >> at once).
> >> That way these things would only be generated/deleted every 5 mins or 10
> >> mins (if clustered properly on my part) instead of one event every
> >> millisecond or what have you.
> >>
> >> I don't know which solution is feasible or if any are at all, but
> >> this seems to be a major problem for my WFs. In general it is never
> >> good to have a million things coming and going on a shared file
> >> system in one place, from my experience at least.
> >>
> >>
> >> Thanks,
> >> Andrew
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>
>
> --
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dsl.cs.uchicago.edu/
> ============================================
>
>
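
Andrew's first suggestion (run each job in a node-local tmp directory
and only send it to the shared location if something fails) could be
sketched roughly as follows. This is an illustration under stated
assumptions, not how the Swift wrapper currently behaves: the function
name, its parameters and the use of a "wrapper.log" file inside the
job dir are hypothetical, and staging of outputs from successful jobs
is left out.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_in_local_scratch(cmd, job_id, shared_failed_dir, local_tmp=None):
    """Run cmd in a node-local temp dir; keep the dir only on failure.

    Returns the command's exit code.
    """
    # local_tmp=None lets tempfile pick the node's default temp location.
    work = tempfile.mkdtemp(prefix=job_id + "-", dir=local_tmp)
    try:
        # Hypothetical per-job log, written on local disk.
        with open(os.path.join(work, "wrapper.log"), "w") as log:
            rc = subprocess.call(cmd, cwd=work, stdout=log,
                                 stderr=subprocess.STDOUT)
        if rc != 0:
            # The shared file system is touched only for failed jobs.
            os.makedirs(shared_failed_dir, exist_ok=True)
            shutil.copytree(work, os.path.join(shared_failed_dir, job_id))
        return rc
    finally:
        # The local job dir is always removed, success or failure.
        shutil.rmtree(work, ignore_errors=True)


if __name__ == "__main__":
    # Stand-in for a shared directory; a failing job leaves a copy there.
    shared = tempfile.mkdtemp(prefix="shared-failed-")
    code = run_in_local_scratch(
        [sys.executable, "-c", "import sys; sys.exit(1)"], "job-0001", shared)
    print("exit code:", code, "kept:", os.listdir(shared))
```

With this arrangement the shared directory sees no create/remove
traffic for jobs that succeed, which is the case Andrew describes as
the common one.
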