[Swift-devel] Clustering and Temp Dirs with Swift
Michael Wilde
wilde at mcs.anl.gov
Sat Oct 27 14:14:42 CDT 2007
On 10/27/07 1:50 PM, Ben Clifford wrote:
>
> On Sat, 27 Oct 2007, Michael Wilde wrote:
>
>> I suspect that my angle workflow on UC teragrid was having similar problems:
>> lots of jobs finishing but data coming back very slowly.
>> (Btw I really appreciate everyone's efforts on this and I *do* realize that it's
>> a weekend)
>
> Is this the one that looks like you were hitting the maximum-of-4-at-once
> limit on file transfers?
Yes. I don't have the data at hand, but I thought that I had achieved
better performance in early runs (about 4 weeks prior). One reason I
suspect that the throttle itself may not be the problem is that in
older tests I had the throttles opened much wider, and this was causing
transfer and data management failures. So I narrowed them back to the
default values, and the workflow went very fast (seeming to have no data
transfer bottleneck). I need to gather more data to know what's really
happening.
One additional unexplained item is that in the run you analyzed with a
4-wide transfer throttle, I was still getting a lot of I/O errors in the
log, which I don't think have been explained yet.
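
To make the "maximum-of-4-at-once" behaviour concrete, here is a minimal
sketch of a transfer throttle built on a counting semaphore. It is not
Swift's actual implementation: the class name TransferThrottle, the run()
method, and the fake_transfer() stand-in are all hypothetical names chosen
for illustration, and the limit of 4 is only the figure mentioned above.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class TransferThrottle:
    """Allow at most `limit` transfers to run at the same time (hypothetical helper)."""

    def __init__(self, limit=4):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0  # transfers currently in flight
        self.peak = 0    # highest concurrency observed so far

    def run(self, transfer, *args):
        # Block here until one of the `limit` slots is free.
        with self._slots:
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return transfer(*args)
            finally:
                with self._lock:
                    self.active -= 1


def fake_transfer(name):
    """Stand-in for a real file transfer; it only sleeps briefly."""
    time.sleep(0.01)
    return name


if __name__ == "__main__":
    throttle = TransferThrottle(limit=4)
    names = [f"file{i}.dat" for i in range(40)]
    # 16 worker threads compete for the 4 transfer slots.
    with ThreadPoolExecutor(max_workers=16) as pool:
        done = list(pool.map(lambda n: throttle.run(fake_transfer, n), names))
    print(len(done), "transfers finished; peak concurrency was", throttle.peak)
```

Recording the peak alongside the limit makes it easy to check, when trying
wider throttle settings, that the configured width is the one actually in
effect.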
>
>> Ben: is the log_processing code changing as we speak, and is it sensible for
>> me and others to try to run your latest versions? Or just send you logfiles?
>
> It always changes. But you can svn update whenever you want.
>
> If you put a log file (and associated kickstart records) in the usual
> repository then it's easy enough for me to run the code on it.
>
>> Question: do people feel that a move to local disk could be done
>> *entirely* in wrapper.sh, or is it known that other parts of swift would
>> have to change as well?
>
> I think that there won't be a trivial solution to this problem. At
> present, the model is quite strongly tied to a site-shared filesystem (as
> VDS was before).
>
> In the past, we've discussed informally different ways of moving data
> round between submit-side storage locations, site-wide storage locations
> and worker-local storage. I think this is another use case for that; but I
> think the general conclusion that that's a non-trivial thing to do is
> still valid.
I agree that it sounds non-trivial. But it sounded on Friday as though
Mihael was about to start work on it. That's what I'd like to
discuss on the list.
Also, a point to consider that has not been discussed much in this
thread: it seems from anecdotal evidence that having too many entries in
any single dir, *especially* on GPFS, causes very bad performance.
Addressing this may be much easier to do than moving shared files and
dirs to local disk. For GPFS it seems like >100 entries per dir performs
badly.
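
One way the entries-per-directory problem could be approached is to spread
files over hashed subdirectories. The sketch below is only an illustration
under assumptions, not anything Swift does today: the function name
bucketed_path, the fanout of 64, and the two directory levels are all
hypothetical choices, picked so that every intermediate directory stays
under the roughly 100 entries mentioned above.

```python
import hashlib
from collections import Counter
from pathlib import PurePosixPath


def bucketed_path(root, filename, fanout=64, levels=2):
    """Map filename to root/NNN/NNN/filename using a hash of the name.

    Each intermediate directory holds at most `fanout` subdirectories, and
    files are spread over fanout**levels leaf directories.
    """
    h = int(hashlib.sha1(filename.encode("utf-8")).hexdigest(), 16)
    parts = []
    for _ in range(levels):
        # Peel off one bucket number per directory level.
        h, bucket = divmod(h, fanout)
        parts.append(f"{bucket:03d}")
    return str(PurePosixPath(root, *parts, filename))


if __name__ == "__main__":
    # Hypothetical file names, used only to show how the load is spread.
    paths = [bucketed_path("/shared/run1", f"job{i}.out") for i in range(20000)]
    per_dir = Counter(str(PurePosixPath(p).parent) for p in paths)
    print("leaf directories used:", len(per_dir))
    print("largest leaf directory:", max(per_dir.values()), "entries")
```

Because the bucket is derived from the file name alone, any component that
knows the name can recompute the path without a lookup table, which is why
a hash is used here rather than a running counter.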
>
>> For the moment, until I hear comments on the questions above, I will
>> work on Angle, see if I get the same problems (I should see the same)
>> and try to start a simple text doc on the data management mechanism that
>> will at least help *me* better understand what's going on.
>
> For angle, a first thing to try is increasing the transfer throttle.
>
> If there's lock contention there, it may be that this will decrease, rather
> than increase, the performance.
Agreed. Will test a few cases and report back; will probably take me
till tomorrow to get some results but I'll send reports as I progress.
>