[Swift-user] Data transfer error
Bronevetsky, Greg
bronevetsky1 at llnl.gov
Tue May 27 17:56:09 CDT 2014
Mihael, I've been struggling with the runs for the past few days. I've managed to push some of them through, but the majority hit so many errors that they appear to stall out. Below is an example of Swift's stdout:
RunID: 20140527-1533-6dt18a4b
Progress: time: Tue, 27 May 2014 15:33:52 -0700
Progress: time: Tue, 27 May 2014 15:33:54 -0700 Stage in:1 Submitted:2
Progress: time: Tue, 27 May 2014 15:33:55 -0700 Active:2 Finished successfully:1
Progress: time: Tue, 27 May 2014 15:33:56 -0700 Initializing:40 Active:2 Finished successfully:1
Progress: time: Tue, 27 May 2014 15:33:57 -0700 Initializing:352 Selecting site:241 Active:2 Finished successfully:1
Progress: time: Tue, 27 May 2014 15:33:58 -0700 Selecting site:1674 Submitting:326 Active:2 Finished successfully:1
Progress: time: Tue, 27 May 2014 15:34:00 -0700 Selecting site:1601 Stage in:2 Submitted:397 Active:2 Finished successfully:1
...
Progress: time: Tue, 27 May 2014 15:39:42 -0700 Selecting site:1268 Stage in:30 Submitted:328 Active:31 Finished successfully:3 Failed but can retry:344
Digging through the logs, I've found the following mention of an error in one of my worker logs (attached):
2014/05/27 15:34:56.648 INFO 000000 1401230034461 Staging out /p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl/wrapper.error (mode = 2).
…
2014/05/27 15:34:56.659 INFO 000000 1401230034458 Job dir total 17
drwx------ 2 bronevet bronevet 7168 May 27 15:34 .
drwx------ 5 bronevet bronevet 7168 May 27 15:34 ..
-rw------- 1 bronevet bronevet 6078 May 27 15:34 _swiftwrap.staging
-rw------- 1 bronevet bronevet 96 May 27 15:34 out.expID_0
-rw------- 1 bronevet bronevet 1199 May 27 15:34 out.solver_bicg.precond_diag.mtx_nasa1824.mt_0.fm_0.lm_0.ap_-1.5515515515515510e-01.am_-2.1171171171171173e+00.psp_-3.0330330330330328e+00.psm_-2.2472472472472473e+00.cprob_1e-10.block_0
-rw------- 1 bronevet bronevet 0 May 27 15:34 stderr.txt
-rw------- 1 bronevet bronevet 103 May 27 15:34 wrapper.error
-rw------- 1 bronevet bronevet 32501 May 27 15:34 wrapper.log
Also, I saw some SLURM stdout files that said I'm out of space:
cat /g/g15/bronevet/.globus/scripts/Slurm1575966868019932992.submit.stdout
env: write error: No space left on device
df: write error: No space left on device
cat: write error: No space left on device
cat: write error: No space left on device
_swiftwrap.staging: line 45: echo: write error: No space left on device
…
env: write error: No space left on device
df: write error: No space left on device
cat: write error: No space left on device
cat: write error: No space left on device
_swiftwrap.staging: line 45: echo: write error: No space left on device
However, I can’t see how this could be since each node has 16GB of RAM (available as RAM or ramdisk). Is there a way to look into this further?
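One thing I could try from inside the SLURM script is a quick space/inode check right before _swiftwrap.staging runs. This is just a sketch; the /tmp path is my guess at where the ramdisk is mounted, and the lscratche path is the Swift work directory from the worker log above:

  # Report free space and free inodes on the candidate filesystems;
  # "No space left on device" can also mean inode or quota exhaustion.
  for d in /tmp /p/lscratche/bronevet/swift/work; do
      echo "== $d =="
      df -h "$d"
      df -i "$d"
  done
  # Show what the Swift job directories are actually consuming.
  du -sh /p/lscratche/bronevet/swift/work/psuadeExperiments-* 2>/dev/null

That would at least tell me which filesystem is actually filling up.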
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com
-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Friday, May 23, 2014 1:23 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error
On Fri, 2014-05-23 at 19:32 +0000, Bronevetsky, Greg wrote:
> I've now had a little more experience with this and have gotten a
> partial workaround. Whatever the underlying cause, it seems to happen
> a lot less when I disable my mechanisms to avoid re-executing tasks
> that I've already completed. Right now my guess for the root cause is
> that I'm hitting the Lustre meta-data servers too hard and they're
> throwing back occasional errors.
That sounds plausible.
> Specifically, I just got yelled at by our admins about performing
> thousands of file openings per second.
:)
>
> I just did a small run and got some failures. e.g.:
> Progress: time: Fri, 23 May 2014 12:25:54 -0700 Selecting site:2723
> Submitted:216 Active:119 Stage out:16 Finished successfully:58
> Failed but can retry:144
>
> However, when I looked at the log files generated when I set
> workerLoggingLevel to DEBUG as well as the stdout and stderr of the
> SLURM scripts I didn't find any failures or errors. What should I be
> looking for?
Those are probably swift-level errors, and the details would be in the swift log (or on stdout once the run finished).
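For example, something along these lines (just a sketch; substitute the actual swift log file name for your run) should pull out the failure details:

  # List and count the exception/failure lines recorded in the swift log.
  grep -iE "exception|failed|error" yourscript-runid.log | sort | uniq -c | sort -rn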
Mihael