[Swift-user] Data transfer error

Mihael Hategan hategan at mcs.anl.gov
Tue May 27 18:46:04 CDT 2014


On Tue, 2014-05-27 at 22:56 +0000, Bronevetsky, Greg wrote:

[...]
> Progress:  time: Tue, 27 May 2014 15:39:42 -0700  Selecting site:1268  Stage in:30  Submitted:328  Active:31  Finished successfully:3  Failed but can retry:344

I would really suggest disabling lazy errors and execution retries until
you get things to run.
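
For reference, a sketch of what that could look like in swift.properties.
This assumes the property names from the 0.9x-era swift.properties format
(lazy.errors, execution.retries); please check the user guide for your
release, since the names may differ there.

```
# Assumed swift.properties entries; verify against your Swift version.
# Report the first application error immediately instead of deferring it.
lazy.errors=false
# Do not resubmit failed jobs while debugging.
execution.retries=0
```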

[...]
> 2014/05/27 15:34:56.648 INFO  000000 1401230034461 Staging out /p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl/wrapper.error (mode = 2).

Right. That means something went wrong while running the app on the
compute node. wrapper.error is the file used to send the exact error back.

[...]
> _swiftwrap.staging: line 45: echo: write error: No space left on device
> 
> However, I can’t see how this could be since each node has 16GB of RAM (available as RAM or ramdisk). Is there a way to look into this further?

Swift doesn't offer much in that direction. The wrapper logs should
contain some diagnostic information for failing jobs, but if the jobs
fail for lack of disk space, I can't see how the wrapper log could be
written either.

What I would suggest is wrapping your app in a script that checks disk
usage (df, ls), then running multiple apps on a single node, and
hopefully catching a glimpse of what the problem is before all scratch
space is exhausted.
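
A minimal sketch of such a wrapper, assuming Python 3 is available on the
compute nodes. The names disk_report and run_with_report are hypothetical,
and the watched paths (the job's working directory and the system temp
directory) are guesses; substitute whatever scratch or ramdisk location
your site actually uses. It only covers the df part (free space per
filesystem), not a listing of what is filling the disk.

```python
#!/usr/bin/env python3
"""Hypothetical diagnostic wrapper: report disk usage, then run the real app."""
import shutil
import subprocess
import sys
import tempfile

MIB = 1024 * 1024


def disk_report(paths):
    """Return one line per path with total/used/free space in MiB."""
    lines = []
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError as exc:
            # Missing or unreadable path: record that instead of failing.
            lines.append(f"{path}: unavailable ({exc})")
            continue
        lines.append(
            f"{path}: total={usage.total // MIB}MiB "
            f"used={usage.used // MIB}MiB free={usage.free // MIB}MiB"
        )
    return lines


def run_with_report(cmd, paths, log=sys.stderr):
    """Log disk usage before and after running cmd; return cmd's exit code."""
    for line in disk_report(paths):
        print("before", line, file=log)
    code = subprocess.call(cmd)
    for line in disk_report(paths):
        print("after", line, file=log)
    return code


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed paths to watch: current working directory and the temp dir.
    watched = [".", tempfile.gettempdir()]
    sys.exit(run_with_report(sys.argv[1:], watched))
```

The app entry would then invoke this script with the real executable and
its arguments, so the report ends up on the job's stderr. Writing to
stderr rather than to a file is deliberate, since a file on the full
filesystem may not be writable either.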

I think it would be a nice idea to add some node status (mem/disk/cpu)
monitors to the swift monitoring interfaces.

Mihael



