[Swift-devel] Clustering and Temp Dirs with Swift

andrewj at uchicago.edu
Fri Oct 26 21:19:03 CDT 2007


The problem is that once we get more than 5 nodes running the
clustered jobs, we go from about 50 output files being spit out per
minute to roughly 1 per minute. If you look at your graphs you can
see that the jobs take longer and longer as time progresses. In fact
this entire workflow should have been completable by a single node in
something like 2 hours max; in my case I was running on 20+ nodes and
the WF stretched to nearly 3 hours, which is when the error finally
occurred, and by then it was largely irrelevant. This should not be
happening.
 
I think Ioan correctly described the problem. This seems to be a
well-understood phenomenon for GPFS, regardless of what my WFs are
doing. And the fact is that my massive WFs will absolutely have more
than 20 jobs/sec completing, or whatever the limit is.
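This is also why I suggested below moving the per-job sandbox onto
node-local disk. As a rough sketch of the idea (this is NOT Swift's
actual wrapper; SWIFT_RUNDIR is a made-up name for the shared run
directory), something like:

    #!/usr/bin/env python
    # Sketch only: run the job in a node-local scratch dir so GPFS
    # never sees the per-job mkdir/rmdir churn, and only copy the
    # sandbox back to the shared run dir if the job fails.
    import os, shutil, subprocess, sys, tempfile

    shared_rundir = os.environ["SWIFT_RUNDIR"]  # hypothetical env var

    jobdir = tempfile.mkdtemp(prefix="swiftjob.", dir="/tmp")
    try:
        rc = subprocess.call(sys.argv[1:], cwd=jobdir)  # the real job
        if rc != 0:
            # Failure: stage the whole sandbox back to GPFS so the
            # usual debugging info is still there.
            shutil.copytree(jobdir,
                            os.path.join(shared_rundir, "failed",
                                         os.path.basename(jobdir)))
    finally:
        shutil.rmtree(jobdir, ignore_errors=True)
    sys.exit(rc)

On success nothing per-job ever touches the shared directory, so the
metadata load on GPFS would scale with the number of failures rather
than with the total job count.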


I will try with kickstart on and lazy errors on, but I suspect the
same thing will happen.
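For reference, the swift.properties settings I mean are, if I
remember the property names right (double-check them against the
user guide), something like:

    lazy.errors=true
    kickstart.enabled=true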


>
> The most recent run logs I've seen of this show that things were
> progressing with a small number of job failures; however, one job
> failed three times (as happens sometimes, perhaps indicative of a
> problem with that job, perhaps statistically/stochastically because
> you have a lot of jobs and the execute hosts aren't perfect), and
> because of that three-times failure the workflow was aborted.
>
> I discussed with you on IM the possibility of running with
> lazy.errors=true, which will cause the workflow to run for longer
> in the case of such a problem.
>
> The output rate stuff is interesting. I'll try to get some better
> statistics on that. It is the case that jobs finishing don't
> immediately put their output in your run directory. This interacts
> with jobs that have not yet been run in a slightly surprising way.
> Hopefully I can graph this better soon.
>
> The charts at
> http://www.ci.uchicago.edu/~benc/report-Windowlicker-20071025-2116-ue28hhtc/
> suggest that there are plenty of jobs finishing.
>
> Here are some questions (that I think can be answered by logs, but
> not with the graphs I have now):
>
>   i) how fast are jobs finishing executing?
>
>   ii) how fast are jobs *completely* finishing (which I think is
> what you are expecting), which includes staging out files from the
> compute site to the submit site?
>
> I'll have some more plots of this in 12h or so.
>
> On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
>
>> I am kind of at a standstill for getting anything done on TP right
>> now with this problem. Are there any suggestions to overcome this
>> for the time being?
>>
>> On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
>> 
>> > Hello all,
>> >
>> > I am encountering the following problem on Teraport. I submit a
>> > clustered Swift WF which should amount to something on the order
>> > of 850x3 individual jobs total. I have clustered the jobs because
>> > they are very fast (somewhere around 20 sec to 1 min long). When
>> > I submit the WF on TP, things start out fantastic: I get 10s of
>> > output files in a matter of seconds, and nodes start and finish
>> > clustered batches in a matter of minutes or less. However, after
>> > about 3-5 mins, when clustered jobs begin to line up in the queue
>> > and more start running at the same time, things slow down to a
>> > trickle in terms of output.
>> > 
>> > One thing I noticed is that when I try a simple ls on TP in the
>> > Swift temp running directory, where the temp job dirs are created
>> > and destroyed, it takes a very long time, and when it finishes
>> > there are only five or so things in the dir (this is the dir with
>> > "info  kickstart  shared  status  wrapper.log" in it). What I
>> > think is happening is that TP's filesystem can't handle this
>> > extremely rapid creation/destruction of directories in that
>> > shared location. From what I have been told, these temp dirs come
>> > and go as long as the job runs successfully.
>> > 
>> > What I am wondering is if there is any way to move that dir to
>> > the local node's tmp directory, not the shared file system, while
>> > the job is running, and if something fails then have it sent to
>> > the appropriate place.
>> > 
>> > Or, another layer of temp dir wrapping could be applied, labeled
>> > with respect to the clustered job grouping and not the individual
>> > jobs (since there are thousands being computed at once). Then
>> > these dirs would only be generated/deleted every 5 or 10 minutes
>> > (if clustered properly on my part) instead of one event every
>> > millisecond or what have you. For example, with clusters of 50
>> > jobs, the 850x3 = 2550 per-job dirs would collapse to about 51
>> > per-cluster dirs.
>> > 
>> > I don't know which solution is feasible, or if any are at all,
>> > but this seems to be a major problem for my WFs. In general it is
>> > never good to have a million things coming and going in one place
>> > on a shared file system, from my experience at least.
>> > 
>> > 
>> > Thanks,
>> > Andrew
>> > _______________________________________________
>> > Swift-devel mailing list
>> > Swift-devel at ci.uchicago.edu
>> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>> > 
>> 
>> 


