[Swift-user] Data transfer error

Bronevetsky, Greg bronevetsky1 at llnl.gov
Fri May 30 15:14:31 CDT 2014


The issues I'm running into seem more related to metadata operations, since in Lustre the metadata server is not distributed. With 10 or 20 nodes I was generating thousands of file opens per second, which Lustre cannot handle. Even when I use node-local storage as scratch I still get timeouts. Is there a way to track just the metadata operations?


	The only true way of avoiding the shared FS is with provider staging enabled, and having both the swift run directory and the workdirectory on local disk.

Does this mean that I'd only be able to do single-node runs or is there a way to shuttle data between the node-local storage of different nodes?

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com


-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov] 
Sent: Friday, May 30, 2014 12:06 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error

On Fri, 2014-05-30 at 16:24 +0000, Bronevetsky, Greg wrote:
> I just ran a test where I varied <profile namespace="globus"
> key="maxwalltime"> between 1 and 10 minutes. At 1 it gave me errors 
> and for larger values it did not. So, assuming that this is the true 
> root cause, how can I resolve it? I can use node-local storage as my 
> <workdirectory>. However, when I run my real workload, I'm still 
> getting errors even if I use node-local storage.

Assuming you used <scratch>/local/disk</scratch>, there is still some load on the shared filesystem since swift still needs to copy data from it to the scratch directory and back.

The only true way of avoiding the shared FS is with provider staging enabled, and having both the swift run directory and the workdirectory on local disk.

> I'm still following up with our file systems folks but the key issue
> appears to be the large number of metadata operations that are sent
> to the shared file system (Lustre or NFS here). Is there a way to
> reduce that or at least measure it so that I can tell our admins 
> exactly the throughput I need?

This is hard to quantify. It is possible to measure the rate of I/O requests using strace, and recent versions of swift have some flags that allow you to strace the worker and its sub-processes.
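
As a rough illustration of the strace approach, here is a minimal Python sketch that counts metadata-related system calls per second from a trace. It assumes the trace was captured with strace's -f (follow child processes) and -ttt (epoch timestamps) options, for example "strace -f -ttt -o trace.txt <command>". The function name, the file name trace.txt, and the set of syscalls treated as "metadata operations" are illustrative assumptions, not anything Swift provides.

```python
import re
from collections import Counter

# Assumption: these syscalls are treated as "metadata operations".
# Adjust the set to match what your file system admins care about.
METADATA_CALLS = {
    "open", "openat", "creat", "stat", "lstat", "fstat",
    "access", "unlink", "mkdir", "rmdir", "rename",
}

# Matches lines such as:
#   12345 1401480871.123456 open("/path", O_RDONLY) = 3
# The leading pid column is optional (it appears with -f and -o).
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\d+)\.\d+\s+(\w+)\(")


def metadata_ops_per_second(lines):
    """Return a Counter mapping epoch second -> number of metadata syscalls."""
    per_second = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) in METADATA_CALLS:
            per_second[int(match.group(1))] += 1
    return per_second


if __name__ == "__main__":
    import sys

    # Usage: python count_metadata_ops.py trace.txt
    with open(sys.argv[1]) as trace:
        counts = metadata_ops_per_second(trace)
    for second in sorted(counts):
        print(second, counts[second])
```

The peak and average of the per-second counts give a number to hand to the file system admins. Note that this counts calls issued by the traced processes on one node, regardless of which file system each path lives on, so filtering by path prefix may be needed to isolate the shared FS.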

The actual bandwidth, I don't know. Perhaps iotop or something like it, but I have never personally used it to measure disk bandwidth with swift apps on shared FSs.

Mihael


