[Swift-devel] Clustering and Temp Dirs with Swift

Mihael Hategan hategan at mcs.anl.gov
Fri Oct 26 15:04:34 CDT 2007


From my live discussion with Andrew, I think we concluded that the
reasonable way of proceeding is to reduce things happening on the shared
filesystem. That may mean:

- making sure the temporary job directory is created on a local
filesystem
- making seq.sh log to individual files (perhaps in info), like the
wrapper. This may reduce contention.
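
To make the two points above concrete, here is a rough sketch in Python
rather than in the actual wrapper scripts. Everything in it is an
assumption for illustration: the function name run_job_locally, the
"<jobid>-info" log naming, and the failed/ directory are made up and are
not Swift's real wrapper or seq.sh interface. The job directory is created
on node-local disk, each job writes its own log file under info/, and the
job directory is copied back to the shared filesystem only if the job
fails.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_job_locally(job_id, command, shared_root, local_root=None):
    """Run one job in a node-local scratch directory.

    Hypothetical helper: the names and directory layout are illustrative
    assumptions, not Swift's actual wrapper interface.
    """
    shared_root = Path(shared_root)
    info_dir = shared_root / "info"
    info_dir.mkdir(parents=True, exist_ok=True)
    # One log file per job, so jobs do not contend for a single shared log.
    log_path = info_dir / f"{job_id}-info"

    # The job directory lives on local disk (the system temp dir by default).
    job_dir = Path(tempfile.mkdtemp(prefix=f"{job_id}-", dir=local_root))
    try:
        # Append mode keeps our own writes and the child's output in order.
        with open(log_path, "a") as log:
            log.write(f"jobdir={job_dir}\n")
            log.flush()
            result = subprocess.run(
                command, cwd=job_dir, stdout=log, stderr=subprocess.STDOUT
            )
            log.write(f"exit={result.returncode}\n")
        if result.returncode != 0:
            # Only failed jobs touch the shared filesystem with their files.
            failed_dir = shared_root / "failed" / job_id
            failed_dir.parent.mkdir(parents=True, exist_ok=True)
            shutil.copytree(job_dir, failed_dir)
        return result.returncode
    finally:
        # Successful or not, the local scratch directory is cleaned up.
        shutil.rmtree(job_dir, ignore_errors=True)
```

With this layout a successful job leaves only its small log file on the
shared filesystem; the directory creation and deletion happen on the
node's own disk.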

Mihael

On Fri, 2007-10-26 at 14:58 -0500, Andrew Robert Jamieson wrote:
> I am kind of at a standstill for getting anything done on TP right now 
> with this problem. Are there any suggestions to overcome this for the time 
> being?
> 
> On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
> 
> > Hello all,
> >
> >  I am encountering the following problem on Teraport.  I submit a clustered 
> > swift WF which should amount to something on the order of 850x3 individual 
> > jobs total. I have clustered the jobs because they are very fast (somewhere 
> > around 20 sec to 1 min long).  When I submit the WF on TP things start out 
> > fantastic, I get 10s of output files in a matter of seconds and nodes would 
> > start and finish clustered batches in a matter of minutes or less. However, 
> > after waiting about 3-5 mins, when clustered jobs begin to line up in the 
> > queue and more start running at the same time, things start to slow down to a 
> > trickle in terms of output.
> >
> > One thing I noticed is when I try a simple ls on TP in the swift temp running 
> > directory where the temp job dirs are created and destroyed, it takes a very 
> > long time.  And when it is done only five or so things are in the dir. (this 
> > is the dir with "info  kickstart  shared  status wrapper.log" in it).  What I 
> > think is happening is that TP's filesystem can't handle this extremely rapid 
> > creation/destruction of directories in that shared location. From what I have 
> > been told these temp dirs come and go as long as the job runs successfully.
> >
> > What I am wondering is if there is any way to move that dir to the local node 
> > tmp directory, not the shared file system, while it is running and if something 
> > fails then have it sent to the appropriate place.
> >
> > Or, if another layer of temp dir wrapping could be applied, labeled 
> > perhaps with respect to the clustered job grouping and not simply the 
> > individual jobs (since there are thousands being computed at once).
> > That way these things would only be generated/deleted every 5 mins or 10 mins (if 
> > clustered properly on my part) instead of one event every millisecond or 
> > what have you.
> >
> > I don't know which solution is feasible or if any are at all, but this seems 
> > to be a major problem for my WFs.  In general it is never good to have a 
> > million things coming and going on a shared file system in one place, from my 
> > experience at least.
> >
> >
> > Thanks,
> > Andrew
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> 
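
For the other idea in the quoted message (one extra layer of temp
directory per clustered batch instead of per individual job), a minimal
sketch in Python might look like the following. The names cluster_dir and
run_cluster and the "cluster-<id>" / "job-<n>" layout are hypothetical,
chosen only for illustration, and are not how Swift's clustering is
actually implemented. The point is that the top-level run directory sees
one create and one delete per cluster, while the per-job directories come
and go inside the cluster's own directory.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def cluster_dir(run_root, cluster_id):
    """Create one directory per cluster; remove it once when the cluster ends."""
    path = Path(run_root) / f"cluster-{cluster_id}"
    path.mkdir(parents=True)
    try:
        yield path
    finally:
        # A single recursive delete per cluster, not one per job.
        shutil.rmtree(path, ignore_errors=True)


def run_cluster(run_root, cluster_id, jobs):
    """Run each job (a callable taking its job directory) inside the cluster dir."""
    results = []
    with cluster_dir(run_root, cluster_id) as cdir:
        for index, job in enumerate(jobs):
            job_dir = cdir / f"job-{index}"
            job_dir.mkdir()
            results.append(job(job_dir))
    return results
```

This only moves the churn out of the single shared run directory; the job
directories are still created on whatever filesystem run_root lives on, so
it would combine naturally with putting run_root on node-local disk.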



