[Swift-user] Data transfer error

Mihael Hategan hategan at mcs.anl.gov
Thu May 29 19:55:37 CDT 2014


Hi Greg,

I think the Swift log (the *.log file in the directory where you invoke
swift) would contain all the relevant information here.

In any event, it is also there in the worker log:

2014/05/29 15:44:08.741 DEBUG 000000 Checking jobs status (12 active)
2014/05/29 15:44:08.741 DEBUG 000000 1401402532522 Checking pid 24049
2014/05/29 15:44:08.741 DEBUG 000000 1401402532522 walltime exceeded (start: 1401403208.70666, now: 1401403448.74189, maxwalltime: 240); killing
...
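
To make the arithmetic in that log line explicit, here is a minimal Python
sketch that pulls out the three numbers and compares the elapsed time with the
limit. The regular expression and the function name walltime_overrun are
hypothetical, written only to match this one line, and I am assuming that
start, now and maxwalltime are all in seconds.

```python
import re

# The "walltime exceeded" worker log entry quoted above, joined onto one line.
LINE = (
    "2014/05/29 15:44:08.741 DEBUG 000000 1401402532522 walltime exceeded "
    "(start: 1401403208.70666, now: 1401403448.74189, maxwalltime: 240); killing"
)

# Hypothetical pattern: it only targets the format of the line shown here.
PATTERN = re.compile(r"start: ([\d.]+), now: ([\d.]+), maxwalltime: (\d+)")


def walltime_overrun(line):
    """Return (elapsed, limit), assumed to be seconds, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    start = float(match.group(1))
    now = float(match.group(2))
    limit = int(match.group(3))
    return now - start, limit


result = walltime_overrun(LINE)
if result is not None:
    elapsed, limit = result
    print(f"elapsed {elapsed:.2f}s, limit {limit}s, over by {elapsed - limit:.2f}s")
```

Under those assumptions the elapsed time comes out just over the 240 limit,
which is why the worker reports the walltime as exceeded.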

I believe that what is happening is that as you increase the load, I/O
operations on shared disks become slower and slower to the extent that
the app walltimes become greater than what you get in small runs and
what you have as maxwalltime in sites.xml. The fact that it happens on
both Lustre and NFS seems to support this theory.

This can be checked by slowly increasing the maxwalltime in sites.xml.
If it is not accompanied by a corresponding increase in the scale at
which you can run without failure, then we should probably look
somewhere else.

Mihael


On Thu, 2014-05-29 at 22:48 +0000, Bronevetsky, Greg wrote:
> I've finally managed to create a reproducer for my problem. [...]



