[Swift-devel] Clustering and Temp Dirs with Swift

Michael Wilde wilde at mcs.anl.gov
Sat Oct 27 13:43:34 CDT 2007


Many good points have been raised on this thread. I've read through it 
once; I probably need to do another pass.

I also spoke to Mihael in person about this yesterday afternoon, and now 
want to try to organize our efforts on it (as I think we all realize 
it's a clear barrier to performance).  Plus, Andrew is still pushing for 
results by Nov 1 for an NIH grant resubmission.

I suspect that my Angle workflow on the UC TeraGrid was having similar 
problems: lots of jobs finishing, but data coming back very slowly.
(Btw, I really appreciate everyone's efforts on this, and I *do* realize 
that it's a weekend.)

What I understand to be happening now is:
- Ben is doing more measurements
- Mihael was going to try to rework remote jobs to read from and write 
to local disk
- Mihael suggested we instrument wrapper.sh to record times, possibly by 
inserting date/time commands at every step (a rough sketch of what I 
mean follows this list). I'd also like an option to retrieve these logs 
from the remote side, but we can do that as a separate utility script 
for now.
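
To make that concrete, here is a very rough sketch of the kind of 
instrumentation I have in mind; the step names and the log location are 
placeholders, not the actual wrapper.sh variables:

  # sketch only -- step names and log location are placeholders
  TIMELOG=info/timing.$$.log

  logstep() {
      # one line per step: epoch-seconds timestamp, then the step name
      echo "$(date +%s) $1" >> $TIMELOG
  }

  logstep wrapper-start
  logstep stagein-start
  # ... existing stage-in code ...
  logstep stagein-end
  logstep app-start
  # ... existing application invocation ...
  logstep app-end
  logstep stageout-start
  # ... existing stage-out code ...
  logstep stageout-end

Pulling these per-job logs back to the submit side could then be the 
separate utility script mentioned above.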

I have long wanted to document the current data management logic; I 
think seeing it in writing would help us pinpoint likely points of 
contention.  I didn't take notes when Mihael answered my questions on 
this yesterday, but I would like to go back and recapture that.  It would 
also help us design new options for data handling conventions.

Mihael: are you doing anything on the rework to local I/O at the moment? 
Knowing your plan would help guide what others should do next.

Ben: is the log_processing code changing as we speak, and is it sensible 
for me and others to try running your latest versions? Or should we just 
send you logfiles?

I think that, in general, moving as much I/O as possible (including 
metadata I/O) from shared disk to local disk is a good thing. If it's 
easy to move almost *all* metadata access there, that is low-hanging 
fruit, and we can just compare before-and-after times.  However, if this 
is hard, then it's better to get more measurements, find *which* I/O 
operations are causing problems, and go after the worst offenders first.
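
To sketch what "moving to local disk" might look like inside wrapper.sh 
(all the variable names here are illustrative guesses, not the real 
wrapper.sh ones): create a per-job scratch dir on node-local disk, stage 
inputs in from the shared directory once, run the application there, and 
touch the shared filesystem again only to copy outputs and logs back at 
the end.

  # rough sketch -- SITEDIR/INPUTFILES/OUTPUTFILES/APP are illustrative names
  SHAREDDIR=$SITEDIR/shared                # existing shared-filesystem data dir
  JOBDIR=${TMPDIR:-/tmp}/swiftjob.$$       # per-job scratch on node-local disk

  mkdir -p "$JOBDIR"
  cd "$JOBDIR"

  # stage inputs in from shared disk once, up front
  for f in $INPUTFILES; do
      cp "$SHAREDDIR/$f" .
  done

  # run the application entirely against local disk
  $APP $APPARGS > stdout.txt 2> stderr.txt
  rc=$?

  # only now touch the shared filesystem again: copy outputs back,
  # then remove the local scratch dir
  for f in $OUTPUTFILES; do
      cp "$f" "$SHAREDDIR/"
  done
  cd /
  rm -rf "$JOBDIR"
  exit $rc

Whether something like this can live entirely in wrapper.sh is exactly 
the question I raise below.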

Question: do people feel that a move to local disk could be done 
*entirely* in wrapper.sh, or is it known that other parts of Swift would 
have to change as well?

For the moment, until I hear comments on the questions above, I will 
work on Angle, see if I hit the same problems (I expect I will), and 
try to start a simple text doc on the data management mechanism that 
will at least help *me* better understand what's going on.

- Mike




On 10/26/07 7:37 PM, Ben Clifford wrote:
> The most recent run logs I've seen of this show things progressing with a 
> small number of job failures. However, one job failed three times (as 
> happens sometimes, perhaps indicating a problem with that job, perhaps 
> statistically/stochastically because you have a lot of jobs and the execute 
> hosts aren't perfect), and because of those three failures the workflow was 
> aborted.
> 
> I discussed with you on IM the possibility of running with 
> lazy.errors=true, which will let the workflow keep running longer when 
> such a problem occurs.
> 
> The output rate stuff is interesting. I'll try to get some better 
> statistics on that. Jobs that finish don't immediately put their output in 
> your run directory, and this interacts with jobs that have not yet run in a 
> slightly surprising way. Hopefully I can graph this better soon.
> 
> The charts at 
> http://www.ci.uchicago.edu/~benc/report-Windowlicker-20071025-2116-ue28hhtc/ 
> suggest that there are plenty of jobs finishing.
> 
> Here are some questions (that I think can be answered by logs, but not 
> with the graphs I have now):
> 
>   i) how fast are jobs finishing execution?
> 
>   ii) how fast are jobs *completely* finishing (which I think is what you 
> are expecting), including staging output files from the compute site back 
> to the submit site?
> 
> I'll have some more plots of this in 12h or so.
> 
> On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
> 
>> I am kind of at a standstill for getting anything done on TP right now with
>> this problem. Are there any suggestions for overcoming this in the meantime?
>>
>> On Fri, 26 Oct 2007, Andrew Robert Jamieson wrote:
>>
>>> Hello all,
>>>
>>> I am encountering the following problem on Teraport.  I submit a clustered
>>> Swift WF which should amount to something on the order of 850x3 individual
>>> jobs total.  I have clustered the jobs because they are very fast (somewhere
>>> around 20 sec to 1 min long).  When I submit the WF on TP, things start out
>>> fantastic: I get tens of output files in a matter of seconds, and nodes
>>> start and finish clustered batches in a matter of minutes or less. However,
>>> after about 3-5 mins, when clustered jobs begin to line up in the queue and
>>> more start running at the same time, output slows to a trickle.
>>>
>>> One thing I noticed is that when I try a simple ls on TP in the swift temp
>>> running directory where the temp job dirs are created and destroyed, it takes
>>> a very long time, and when it is done only five or so things are in the
>>> dir (this is the dir with "info  kickstart  shared  status  wrapper.log" in
>>> it).  What I think is happening is that TP's filesystem can't handle this
>>> extremely rapid creation/destruction of directories in that shared location.
>>> From what I have been told, these temp dirs come and go as long as the job
>>> runs successfully.
>>>
>>> What I am wondering is if there is any way to move that dir to the local
>>> node's tmp directory, not the shared file system, while the job is running,
>>> and only if something fails have it sent to the appropriate place.
>>>
>>> Or, another layer of temp dir wrapping could be applied, labeled perhaps
>>> with respect to the clustered job grouping rather than the individual jobs
>>> (since there are thousands being computed at once), so that these dirs
>>> would only be created/deleted every 5 or 10 minutes (if clustered properly
>>> on my part) instead of one event every millisecond or so.
>>>
>>> I don't know which of these solutions is feasible, or whether any is, but
>>> this seems to be a major problem for my WFs.  In general it is never good
>>> to have a million things coming and going in one place on a shared file
>>> system, in my experience at least.
>>>
>>>
>>> Thanks,
>>> Andrew
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>
> 
> 


