[Swift-devel] Clustering and Temp Dirs with Swift
Mihael Hategan
hategan at mcs.anl.gov
Fri Oct 26 15:04:34 CDT 2007
From my live discussion with Andrew, I think we concluded that the
reasonable way of proceeding is to reduce things happening on the shared
filesystem. That may mean:
- making sure the temporary job directory is created on a local
filesystem
- making seq.sh log to individual files (perhaps in info), like the
wrapper. This may reduce contention.
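A rough sketch of both ideas together, in Python for illustration only. This is not how the Swift wrapper or seq.sh is implemented; the function name, the "failed" directory, and the "<job_id>-info" file name are assumptions made up for the example. Each job runs in a scratch directory on the node-local disk, writes its own log file under info, and touches the shared filesystem again only if it fails.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_job_locally(job_id, command, shared_dir, local_tmp=None):
    """Run one job in a node-local scratch directory (hypothetical helper).

    Only the per-job log and, on failure, a copy of the scratch
    directory are written to the shared filesystem.
    """
    shared_dir = Path(shared_dir)
    info_dir = shared_dir / "info"
    info_dir.mkdir(parents=True, exist_ok=True)
    # One log file per job, so jobs do not contend on a single shared log.
    log_path = info_dir / f"{job_id}-info"

    # With dir=None, mkdtemp uses the default temp location (TMPDIR or /tmp),
    # which is assumed here to be on the node-local disk.
    job_dir = Path(tempfile.mkdtemp(prefix=f"{job_id}-", dir=local_tmp))
    try:
        with open(log_path, "w") as log:
            result = subprocess.run(
                command, cwd=job_dir, stdout=log, stderr=subprocess.STDOUT
            )
        if result.returncode != 0:
            # Keep the evidence: copy the scratch dir back only on failure.
            failed_dir = shared_dir / "failed" / job_id
            shutil.copytree(job_dir, failed_dir, dirs_exist_ok=True)
        return result.returncode
    finally:
        # The local scratch directory is always removed.
        shutil.rmtree(job_dir, ignore_errors=True)
```

With something like this, a successful job costs the shared filesystem one log file instead of a directory create/delete pair. Staging of input and output files is left out of the sketch.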
Mihael
On Fri, 2007-10-26 at 14:58 -0500, Andrew Robert Jamieson wrote:
> I am kind of at a standstill for getting anything done on TP right now
> with this problem. Are there any suggestions to overcome this for the time
> being?
>
> On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
>
> > Hello all,
> >
> > I am encountering the following problem on Teraport. I submit a clustered
> > swift WF which should amount to something on the order of 850x3 individual
> > jobs total. I have clustered the jobs because they are very fast (somewhere
> > around 20 sec to 1 min long). When I submit the WF on TP things start out
> > fantastic, I get 10s of output files in a matter of seconds and nodes would
> > start and finish clustered batches in a matter of minutes or less. However,
> > after waiting about 3-5 mins, when clustered jobs begin to line up in the
> > queue and more start running at the same time, things start to slow down to a
> > trickle in terms of output.
> >
> > One thing I noticed is when I try a simple ls on TP in the swift temp running
> > directory where the temp job dirs are created and destroyed, it takes a very
> > long time. And when it is done only five or so things are in the dir. (this
> > is the dir with "info kickstart shared status wrapper.log" in it). What I
> > think is happening is that TP's filesystem can't handle this extremely rapid
> > creation/destruction of directories in that shared location. From what I have
> > been told these temp dirs come and go as long as the job runs successfully.
> >
> > What I am wondering is if there is any way to move that dir to the local node
> > tmp directory, not the shared file system, while it is running and, if
> > something fails, then have it sent to the appropriate place.
> >
> > Or, another layer of temp dir wrapping could be applied, labeled perhaps
> > with respect to the clustered job grouping and not simply the individual
> > jobs (since there are thousands being computed at once), so that these
> > things would only be generated/deleted every 5 or 10 mins (if clustered
> > properly on my part) instead of one event every millisecond or what have
> > you.
> >
> > I don't know which solution is feasible or if any are at all, but this seems
> > to be a major problem for my WFs. In general it is never good to have a
> > million things coming and going on a shared file system in one place, from my
> > experience at least.
> >
> >
> > Thanks,
> > Andrew
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >