[Swift-user] Data transfer error

Bronevetsky, Greg bronevetsky1 at llnl.gov
Tue May 27 17:56:09 CDT 2014


Mihael, I've been struggling with the runs for the past few days. I've managed to push some of them through, but most of them hit so many errors that they appear to stall out. Below is an example of the stdout output from Swift:

RunID: 20140527-1533-6dt18a4b

Progress:  time: Tue, 27 May 2014 15:33:52 -0700
Progress:  time: Tue, 27 May 2014 15:33:54 -0700  Stage in:1  Submitted:2
Progress:  time: Tue, 27 May 2014 15:33:55 -0700  Active:2  Finished successfully:1
Progress:  time: Tue, 27 May 2014 15:33:56 -0700  Initializing:40  Active:2  Finished successfully:1
Progress:  time: Tue, 27 May 2014 15:33:57 -0700  Initializing:352  Selecting site:241  Active:2  Finished successfully:1
Progress:  time: Tue, 27 May 2014 15:33:58 -0700  Selecting site:1674  Submitting:326  Active:2  Finished successfully:1
Progress:  time: Tue, 27 May 2014 15:34:00 -0700  Selecting site:1601  Stage in:2  Submitted:397  Active:2  Finished successfully:1
...
Progress:  time: Tue, 27 May 2014 15:39:42 -0700  Selecting site:1268  Stage in:30  Submitted:328  Active:31  Finished successfully:3  Failed but can retry:344



Digging through the logs, I've found the following mention of an error in one of my worker logs (attached):

2014/05/27 15:34:56.648 INFO  000000 1401230034461 Staging out /p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl/wrapper.error (mode = 2).
…
2014/05/27 15:34:56.659 INFO  000000 1401230034458 Job dir total 17
drwx------ 2 bronevet bronevet  7168 May 27 15:34 .
drwx------ 5 bronevet bronevet  7168 May 27 15:34 ..
-rw------- 1 bronevet bronevet  6078 May 27 15:34 _swiftwrap.staging
-rw------- 1 bronevet bronevet    96 May 27 15:34 out.expID_0
-rw------- 1 bronevet bronevet  1199 May 27 15:34 out.solver_bicg.precond_diag.mtx_nasa1824.mt_0.fm_0.lm_0.ap_-1.5515515515515510e-01.am_-2.1171171171171173e+00.psp_-3.0330330330330328e+00.psm_-2.2472472472472473e+00.cprob_1e-10.block_0
-rw------- 1 bronevet bronevet     0 May 27 15:34 stderr.txt
-rw------- 1 bronevet bronevet   103 May 27 15:34 wrapper.error
-rw------- 1 bronevet bronevet 32501 May 27 15:34 wrapper.log
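
For reference, a quick scan over these artifacts might look something like this (the paths are copied from the log excerpt above and the worker log is the one attached; the grep patterns are only a guess at what an error line would contain):

  JOBDIR=/p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl
  # wrapper.error is only 103 bytes, so print it whole
  cat "$JOBDIR/wrapper.error"
  # then look for anything suspicious in the wrapper and worker logs
  grep -i -E 'error|fail|exception' "$JOBDIR/wrapper.log" | tail -n 20
  grep -i -E 'error|fail|exception' worker-0527-3303530-000000.log | tail -n 20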



Also, I saw some SLURM stdout files that said I’m out of space:

cat /g/g15/bronevet/.globus/scripts/Slurm1575966868019932992.submit.stdout
env: write error: No space left on device
df: write error: No space left on device
cat: write error: No space left on device
cat: write error: No space left on device
_swiftwrap.staging: line 45: echo: write error: No space left on device
…
env: write error: No space left on device
df: write error: No space left on device
cat: write error: No space left on device
cat: write error: No space left on device
_swiftwrap.staging: line 45: echo: write error: No space left on device



However, I can’t see how this could be the case, since each node has 16GB of RAM (available as RAM or ramdisk). Is there a way to look into this further?
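
The best idea I have so far is to check, from inside one of the allocations, which filesystem is actually reporting the write errors, with something like the following (the /tmp and /dev/shm paths are only assumptions about where the node-local ramdisk is mounted, not something I have confirmed):

  # report block and inode usage; running out of inodes also shows up as
  # "No space left on device"
  srun -N 1 bash -c '
    hostname
    df -h /tmp /dev/shm /p/lscratche/bronevet/swift/work
    df -i /tmp /dev/shm /p/lscratche/bronevet/swift/work
  '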



Greg Bronevetsky

Lawrence Livermore National Lab

(925) 424-5756

bronevetsky at llnl.gov

http://greg.bronevetsky.com





-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Friday, May 23, 2014 1:23 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error



On Fri, 2014-05-23 at 19:32 +0000, Bronevetsky, Greg wrote:

> I've now had a little more experience with this and have gotten a
> partial workaround. Whatever the underlying cause, it seems to happen
> a lot less when I disable my mechanisms to avoid re-executing tasks
> that I've already completed. Right now my guess for the root cause is
> that I'm hitting the Lustre meta-data servers too hard and they're
> throwing back occasional errors.

That sounds plausible.

> Specifically, I just got yelled at by our admins about performing
> thousands of file openings per second.

:)

> I just did a small run and got some failures, e.g.:
>             Progress:  time: Fri, 23 May 2014 12:25:54 -0700  Selecting site:2723
> Submitted:216  Active:119  Stage out:16  Finished successfully:58
> Failed but can retry:144
>
> However, when I looked at the log files generated when I set
> workerLoggingLevel to DEBUG as well as the stdout and stderr of the
> SLURM scripts I didn't find any failures or errors. What should I be
> looking for?

Those are probably swift-level errors, and the details would be in the swift log (or on stdout once the run finished).

Mihael


-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker-0527-3303530-000000.log
Type: application/octet-stream
Size: 3326458 bytes
Desc: worker-0527-3303530-000000.log
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140527/dd64a1f7/attachment.obj>
