[Swift-user] Data transfer error

Bronevetsky, Greg bronevetsky1 at llnl.gov
Fri May 30 11:24:20 CDT 2014


I just ran a test where I varied <profile namespace="globus" key="maxwalltime"> between 1 and 10 minutes. At 1 minute I got errors; at larger values I did not. So, assuming that this is the true root cause, how can I resolve it?

I can use node-local storage as my <workdirectory>. However, when I run my real workload, I still get errors even with node-local storage. I'm still following up with our file systems folks, but the key issue appears to be the large number of metadata operations sent to the shared file system (Lustre or NFS here). Is there a way to reduce that load, or at least measure it, so that I can tell our admins exactly the throughput I need?
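
One rough way to put a number on metadata throughput, offered as a sketch and not as anything Swift itself provides: time a batch of file creates, stats, and unlinks in a directory on the file system in question. The function name metadata_ops_per_second, the probe_ file prefix, and the default count are all hypothetical choices for illustration. This measures what a single client can achieve, not what the workload actually issues, and client-side caching (for example NFS attribute caching) may inflate the stat figure.

```python
import os
import tempfile
import time


def metadata_ops_per_second(directory, count=1000):
    """Time create, stat, and unlink of `count` empty files under `directory`.

    Returns a dict mapping operation name to operations per second.
    """
    rates = {}
    # A scratch subdirectory keeps the probe files away from real data
    # and is removed automatically when the block exits.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        paths = [os.path.join(scratch, f"probe_{i}") for i in range(count)]

        start = time.perf_counter()
        for path in paths:
            with open(path, "w"):
                pass  # creating an empty file is a pure metadata operation
        rates["create"] = count / (time.perf_counter() - start)

        start = time.perf_counter()
        for path in paths:
            os.stat(path)
        rates["stat"] = count / (time.perf_counter() - start)

        start = time.perf_counter()
        for path in paths:
            os.unlink(path)
        rates["unlink"] = count / (time.perf_counter() - start)
    return rates


# Placeholder target: replace with a directory on the shared file system.
target = tempfile.gettempdir()
for op, rate in metadata_ops_per_second(target, count=200).items():
    print(f"{op}: {rate:.0f} ops/s")
```

Running the same probe against node-local storage and against the shared mount, ideally from several nodes at once, gives comparable figures to hand to the admins.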

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com


-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov] 
Sent: Thursday, May 29, 2014 5:56 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error

Hi Greg,

I think the swift log (the *.log file in the directory where you invoke swift) would contain all the relevant information here.

In any event, it is also there in the worker log:

2014/05/29 15:44:08.741 DEBUG 000000 Checking jobs status (12 active)
2014/05/29 15:44:08.741 DEBUG 000000 1401402532522 Checking pid 24049
2014/05/29 15:44:08.741 DEBUG 000000 1401402532522 walltime exceeded
(start: 1401403208.70666, now: 1401403448.74189, maxwalltime: 240); killing ...
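
The kill message above can be checked mechanically. The following is a minimal sketch that assumes the message always has the "start: ..., now: ..., maxwalltime: ..." shape seen in this excerpt; the function name walltime_overrun is hypothetical and not part of Swift.

```python
import re

# Matches the parenthesised part of the worker's "walltime exceeded" message,
# as it appears in the excerpt above (assumed format).
WALLTIME_RE = re.compile(
    r"start:\s*(?P<start>[\d.]+),\s*now:\s*(?P<now>[\d.]+),"
    r"\s*maxwalltime:\s*(?P<limit>\d+)"
)


def walltime_overrun(message):
    """Return (elapsed_seconds, limit_seconds) from a kill message, or None."""
    match = WALLTIME_RE.search(message)
    if match is None:
        return None
    elapsed = float(match.group("now")) - float(match.group("start"))
    return elapsed, int(match.group("limit"))


sample = (
    "walltime exceeded (start: 1401403208.70666, now: 1401403448.74189, "
    "maxwalltime: 240); killing ..."
)
elapsed, limit = walltime_overrun(sample)
print(f"elapsed {elapsed:.2f}s against a limit of {limit}s")
```

For the excerpt, the elapsed time comes out at roughly 240.04 seconds against the 240-second limit, so the job was killed just after crossing maxwalltime.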

I believe that what is happening is that as you increase the load, I/O operations on shared disks become slower and slower, to the extent that the app walltimes exceed both what you see in small runs and what you have as maxwalltime in sites.xml. The fact that it happens on both Lustre and NFS seems to support this theory.

This can be checked by slowly increasing the maxwalltime in sites.xml.
If that is not accompanied by a corresponding increase in the scale at which you can run without failure, then we should probably look somewhere else.

Mihael


On Thu, 2014-05-29 at 22:48 +0000, Bronevetsky, Greg wrote:
> I've finally managed to create a reproducer for my problem. [...]


