[Swift-devel] Clustering and Temp Dirs with Swift

Ben Clifford benc at hawaga.org.uk
Fri Oct 26 19:37:51 CDT 2007


The most recent run logs I've seen show that things were progressing 
with a small number of job failures. However, one job failed three times 
(as happens sometimes: perhaps indicative of a problem with that job, 
perhaps purely statistical because you have a lot of jobs and the execute 
hosts aren't perfect), and because of that third failure the workflow was 
aborted.
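
To put a rough number on the "statistical" case, here is a small sketch. 
It assumes every attempt fails independently with the same probability, 
which is a simplification and not something taken from the logs. The 
failure rates below are made-up illustration values, and the function 
name is hypothetical.

```python
def prob_any_job_exhausts_retries(n_jobs, p_fail, attempts=3):
    """Chance that at least one of n_jobs fails `attempts` times in a row.

    Assumes independent attempts with an identical per-attempt failure
    probability p_fail (an idealisation, not a measured property).
    """
    p_job_aborts = p_fail ** attempts          # one job fails every attempt
    return 1.0 - (1.0 - p_job_aborts) ** n_jobs


n_jobs = 850 * 3  # rough job count mentioned in the original report
for p_fail in (0.01, 0.05, 0.10):  # hypothetical per-attempt failure rates
    p = prob_any_job_exhausts_retries(n_jobs, p_fail)
    print(f"p_fail={p_fail:.2f}: P(some job fails 3 times) = {p:.3f}")
```

The point is only that with thousands of jobs, even a modest per-attempt 
failure rate can make a triple failure somewhere in the run fairly likely.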

I discussed with you on IM the possibility of running with 
lazy.errors=true, which will cause the workflow to keep running for longer 
when such a failure occurs.

The output rate issue is interesting. I'll try to get some better 
statistics on that. Jobs that finish don't immediately put their output 
in your run directory, and this interacts with jobs that have not yet 
been run in a slightly surprising way. Hopefully I can graph this 
better soon.

The charts at 
http://www.ci.uchicago.edu/~benc/report-Windowlicker-20071025-2116-ue28hhtc/ 
suggest that there are plenty of jobs finishing.

Here are some questions (that I think can be answered by logs, but not 
with the graphs I have now):

  i) how fast are jobs finishing executing?

  ii) how fast are jobs *completely* finishing (which I think is what you 
are expecting) which includes staging out files from the compute site to 
the submit site?
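
As a sketch of how the two rates could be compared once the times are 
extracted from the logs: the function below counts events per fixed time 
window. The timestamps are invented sample values in seconds, not real 
log data, and the names are hypothetical. Extracting the real times from 
the Swift logs is not shown.

```python
from collections import Counter


def completions_per_bucket(timestamps, bucket_seconds=60):
    """Count events per fixed-width time bucket, keyed by bucket start time."""
    counts = Counter(int(t // bucket_seconds) * bucket_seconds
                     for t in timestamps)
    return dict(sorted(counts.items()))


# Hypothetical sample data (seconds since workflow start).
execute_done = [5, 20, 42, 61, 75, 130]         # (i) job finished executing
stageout_done = [50, 95, 140, 200, 260, 330]    # (ii) output staged to submit site

print("executed per minute:  ", completions_per_bucket(execute_done))
print("staged out per minute:", completions_per_bucket(stageout_done))
```

If the second series lags further and further behind the first, the 
slowdown is in stage-out rather than in execution.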

I'll have some more plots of this in 12h or so.

On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:

> I am kind of at a standstill for getting anything done on TP right now with
> this problem. Are there any suggestions to overcome this for the time being?
> 
> On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
> 
> > Hello all,
> > 
> >  I am encountering the following problem on Teraport.  I submit a clustered
> > swift WF which should amount to something on the order of 850x3 individual
> > jobs total. I have clustered the jobs because they are very fast (somewhere
> > around 20 sec to 1 min long).  When I submit the WF on TP things start out
> > fantastic, I get 10s of output files in a matter of seconds and nodes would
> > start and finish clustered batches in a matter of minutes or less. However,
> > after waiting about 3-5 mins, when clustered jobs begin to line up in
> > the queue and more start running at the same time, things start to slow down
> > to a trickle in terms of output.
> > 
> > One thing I noticed is when I try a simple ls on TP in the swift temp
> > running directory where the temp job dirs are created and destroyed, it takes
> > a very long time.  And when it is done only five or so things are in the
> > dir. (this is the dir with "info  kickstart  shared  status wrapper.log" in
> > it).  What I think is happening is that TP's filesystem can't handle this
> > extremely rapid creation/destruction of directories in that shared location.
> > From what I have been told these temp dirs come and go as long as the job
> > runs successfully.
> > 
> > What I am wondering is if there is any way to move that dir to the local node
> > tmp directory, not the shared file system, while it is running, and if
> > something fails then have it sent to the appropriate place.
> > 
> > Or, if another layer of temp dir wrapping could be applied, labeled
> > perhaps with respect to the clustered job grouping and not simply the
> > individual jobs (since there are thousands being computed at once).
> > That way these things would only be generated/deleted every 5 mins or 10 mins
> > (if clustered properly on my part) instead of one event every millisecond
> > or what have you.
> > 
> > I don't know which solution is feasible or if any are at all, but this seems
> > to be a major problem for my WFs.  In general it is never good to have a
> > million things coming and going on a shared file system in one place, from
> > my experience at least.
> > 
> > 
> > Thanks,
> > Andrew
