[Swift-user] Data transfer error

Mihael Hategan hategan at mcs.anl.gov
Fri May 23 15:23:13 CDT 2014


On Fri, 2014-05-23 at 19:32 +0000, Bronevetsky, Greg wrote:
> I've now had a little more experience with this and have gotten a
> partial workaround. Whatever the underlying cause, it seems to happen
> a lot less when I disable my mechanisms to avoid re-executing tasks
> that I've already completed. Right now my guess for the root cause is
> that I'm hitting the Lustre meta-data servers too hard and they're
> throwing back occasional errors.

That sounds plausible.

> Specifically, I just got yelled at by our admins for performing
> thousands of file opens per second.

:)

> 
> I just did a small run and got some failures. e.g.:
> 	Progress:  time: Fri, 23 May 2014 12:25:54 -0700  Selecting site:2723  Submitted:216  Active:119  Stage out:16  Finished successfully:58  Failed but can retry:144
> 
> However, when I looked at the log files generated with
> workerLoggingLevel set to DEBUG, as well as the stdout and stderr of
> the SLURM scripts, I didn't find any failures or errors. What should
> I be looking for?

Those are probably Swift-level errors, and the details would be in the
Swift log (or on stdout once the run finishes).
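
As a rough sketch of how to keep an eye on this during a run: the
Progress line quoted above already carries the retry count, so a small
script can pull the counters out of it. The regular expression and the
name parse_progress below are assumptions based only on the single line
quoted in this thread, not on any documented Swift output format.

```python
import re

# Assumed pattern (inferred from the one quoted line): a label made of
# letters and spaces, a colon, then an integer counter.
_COUNTER = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)(?=\s|$)")


def parse_progress(line):
    """Return a dict of the integer counters found in one Progress line."""
    return {m.group(1).strip(): int(m.group(2)) for m in _COUNTER.finditer(line)}


# The Progress line quoted earlier in this message.
sample = (
    "Progress:  time: Fri, 23 May 2014 12:25:54 -0700  Selecting site:2723  "
    "Submitted:216  Active:119  Stage out:16  Finished successfully:58  "
    "Failed but can retry:144"
)

counts = parse_progress(sample)
print(counts["Failed but can retry"])  # prints 144
```

A "Failed but can retry" count that keeps growing while the worker logs
stay clean would be consistent with the failures happening at the Swift
level rather than inside the jobs themselves.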

Mihael



