[Swift-user] Data transfer error
Bronevetsky, Greg
bronevetsky1 at llnl.gov
Wed May 28 17:54:16 CDT 2014
Mihael, I ran a few more experiments in which I ran a workflow on a single cluster node while monitoring its memory use. I didn't see any sign of it running out of memory: at all times /proc/meminfo reported 22GB out of 24GB free. I've now begun a more focused analysis using a simple script that captures the high-level structure of my real script. It first generates a bunch of files, producing additional temporary files and directories along with the main output file. These files are then reduced using a reduction tree based on the example you sent me. I have not yet gotten the simple script to fail in the same way as the main script, but I've noticed a few oddities.
First, although my sites file has <profile namespace="swift" key="stagingMethod">file</profile> and my cf file has use.provider.staging=true, I see that all the intermediate files produced by my tasks are written to the global file system specified in the sites file as <workdirectory>/p/lscratche/bronevet/swift_work</workdirectory>. How do I force Swift to use node-local storage for this data?
Second, when I run as many processes on the one node as there are cores, the script runs but keeps stalling. As you can see below, it processes tasks in batches of 12. However, after a few batches the job is aborted (~6 mins into a 30 min allocation), even though the node appears healthy and does not run out of memory, and Swift then submits a new job to the batch queue. Why does this happen?
Swift 0.94.1 swift-r7114 cog-r3803
RunID: 20140528-1504-1ndui8r0
Progress: time: Wed, 28 May 2014 15:04:36 -0700
Progress: time: Wed, 28 May 2014 15:04:37 -0700 Initializing:258
Progress: time: Wed, 28 May 2014 15:04:38 -0700 Initializing:698 Selecting site:589
Progress: time: Wed, 28 May 2014 15:04:45 -0700 Selecting site:4408 Submitting:3
Progress: time: Wed, 28 May 2014 15:05:06 -0700 Selecting site:4010 Submitted:401
...
Progress: time: Wed, 28 May 2014 15:15:06 -0700 Selecting site:4010 Submitted:401
Progress: time: Wed, 28 May 2014 15:15:13 -0700 Selecting site:4010 Stage in:1 Submitted:400
Progress: time: Wed, 28 May 2014 15:15:21 -0700 Selecting site:4010 Stage in:9 Submitted:392
Progress: time: Wed, 28 May 2014 15:15:22 -0700 Selecting site:4010 Stage in:5 Submitted:389 Active:7
Progress: time: Wed, 28 May 2014 15:15:36 -0700 Selecting site:4010 Submitted:389 Active:12
Progress: time: Wed, 28 May 2014 15:16:06 -0700 Selecting site:4010 Submitted:389 Active:12
Progress: time: Wed, 28 May 2014 15:16:36 -0700 Selecting site:4010 Submitted:389 Active:12
Progress: time: Wed, 28 May 2014 15:17:06 -0700 Selecting site:4010 Submitted:389 Active:12
Progress: time: Wed, 28 May 2014 15:17:36 -0700 Selecting site:4010 Submitted:389 Active:12
Progress: time: Wed, 28 May 2014 15:18:06 -0700 Selecting site:4010 Submitted:389 Active:12
Progress: time: Wed, 28 May 2014 15:18:12 -0700 Selecting site:4010 Submitted:389 Active:11 Stage out:1
Progress: time: Wed, 28 May 2014 15:18:15 -0700 Selecting site:4010 Submitted:389 Active:2 Stage out:10
Progress: time: Wed, 28 May 2014 15:18:23 -0700 Selecting site:4010 Submitted:389 Active:1 Stage out:11
Progress: time: Wed, 28 May 2014 15:18:25 -0700 Selecting site:3998 Stage in:1 Submitted:400 Finished successfully:12
Progress: time: Wed, 28 May 2014 15:18:26 -0700 Selecting site:3998 Stage in:9 Submitted:392 Finished successfully:12
Progress: time: Wed, 28 May 2014 15:18:36 -0700 Selecting site:3998 Submitted:389 Active:12 Finished successfully:12
Progress: time: Wed, 28 May 2014 15:19:06 -0700 Selecting site:3998 Submitted:389 Active:12 Finished successfully:12
Progress: time: Wed, 28 May 2014 15:19:36 -0700 Selecting site:3998 Submitted:389 Active:12 Finished successfully:12
Progress: time: Wed, 28 May 2014 15:20:06 -0700 Selecting site:3998 Submitted:389 Active:12 Finished successfully:12
Progress: time: Wed, 28 May 2014 15:20:18 -0700 Selecting site:3998 Submitted:389 Active:11 Stage out:1 Finished successfully:12
Progress: time: Wed, 28 May 2014 15:20:19 -0700 Selecting site:3998 Submitted:389 Active:7 Stage out:5 Finished successfully:12
Progress: time: Wed, 28 May 2014 15:20:21 -0700 Selecting site:3998 Submitted:389 Stage out:11 Finished successfully:13
Progress: time: Wed, 28 May 2014 15:20:36 -0700 Selecting site:3986 Submitted:401 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:21:06 -0700 Selecting site:3986 Submitted:401 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:21:36 -0700 Selecting site:3986 Submitted:401 Finished successfully:24
… Batch allocation released, new request submitted …
Progress: time: Wed, 28 May 2014 15:22:06 -0700 Selecting site:3986 Submitted:401 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:22:36 -0700 Selecting site:3986 Submitted:401 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:23:06 -0700 Selecting site:3986 Submitted:401 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:23:18 -0700 Selecting site:3986 Stage in:1 Submitted:400 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:23:20 -0700 Selecting site:3986 Stage in:4 Submitted:397 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:23:24 -0700 Selecting site:3986 Stage in:5 Submitted:396 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:23:36 -0700 Selecting site:3986 Submitted:389 Active:12 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:24:06 -0700 Selecting site:3986 Submitted:389 Active:12 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:24:36 -0700 Selecting site:3986 Submitted:389 Active:12 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:25:06 -0700 Selecting site:3986 Submitted:389 Active:12 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:25:16 -0700 Selecting site:3986 Submitted:389 Active:11 Stage out:1 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:25:26 -0700 Selecting site:3986 Submitted:389 Active:7 Stage out:5 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:25:27 -0700 Selecting site:3986 Submitted:389 Stage out:12 Finished successfully:24
Progress: time: Wed, 28 May 2014 15:25:29 -0700 Selecting site:3975 Stage in:2 Submitted:398 Stage out:1 Finished successfully:35
Progress: time: Wed, 28 May 2014 15:25:32 -0700 Selecting site:3975 Stage in:3 Submitted:397 Stage out:1 Finished successfully:35
Progress: time: Wed, 28 May 2014 15:25:34 -0700 Selecting site:3975 Stage in:4 Submitted:396 Stage out:1 Finished successfully:35
Progress: time: Wed, 28 May 2014 15:25:35 -0700 Selecting site:3974 Stage in:1 Submitting:1 Submitted:389 Active:10 Finished successfully:36
Progress: time: Wed, 28 May 2014 15:25:36 -0700 Selecting site:3974 Submitted:389 Active:12 Finished successfully:36
Progress: time: Wed, 28 May 2014 15:26:06 -0700 Selecting site:3974 Submitted:389 Active:12 Finished successfully:36
Progress: time: Wed, 28 May 2014 15:26:36 -0700 Selecting site:3974 Submitted:389 Active:12 Finished successfully:36
Progress: time: Wed, 28 May 2014 15:27:01 -0700 Selecting site:3974 Submitted:389 Active:11 Stage out:1 Finished successfully:36
Progress: time: Wed, 28 May 2014 15:27:03 -0700 Selecting site:3974 Submitted:389 Active:2 Stage out:10 Finished successfully:36
Progress: time: Wed, 28 May 2014 15:27:04 -0700 Selecting site:3974 Submitted:389 Active:1 Stage out:11 Finished successfully:36
Progress: time: Wed, 28 May 2014 15:27:06 -0700 Selecting site:3974 Submitted:389 Stage out:12 Finished successfully:36
Progress: time: Wed, 28 May 2014 15:27:08 -0700 Selecting site:3974 Submitted:389 Stage out:11 Finished successfully:37
Progress: time: Wed, 28 May 2014 15:27:36 -0700 Selecting site:3962 Submitted:389 Active:12 Finished successfully:48
Progress: time: Wed, 28 May 2014 15:28:06 -0700 Selecting site:3962 Submitted:389 Active:12 Finished successfully:48
Progress: time: Wed, 28 May 2014 15:28:36 -0700 Selecting site:3962 Submitted:389 Active:12 Finished successfully:48
Progress: time: Wed, 28 May 2014 15:28:47 -0700 Selecting site:3962 Submitted:389 Active:11 Stage out:1 Finished successfully:48
Progress: time: Wed, 28 May 2014 15:29:01 -0700 Selecting site:3962 Submitted:389 Active:8 Stage out:4 Finished successfully:48
Progress: time: Wed, 28 May 2014 15:29:02 -0700 Selecting site:3955 Submitting:1 Submitted:395 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:29:06 -0700 Selecting site:3950 Submitted:401 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:29:36 -0700 Selecting site:3950 Submitted:401 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:30:06 -0700 Selecting site:3950 Submitted:401 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:30:36 -0700 Selecting site:3950 Submitted:401 Finished successfully:60
… Batch allocation released, new request submitted …
Progress: time: Wed, 28 May 2014 15:31:06 -0700 Selecting site:3950 Submitted:401 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:31:36 -0700 Selecting site:3950 Submitted:401 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:31:56 -0700 Selecting site:3950 Stage in:1 Submitted:400 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:32:06 -0700 Selecting site:3950 Submitted:389 Active:12 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:32:36 -0700 Selecting site:3950 Submitted:389 Active:12 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:33:06 -0700 Selecting site:3950 Submitted:389 Active:12 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:33:27 -0700 Selecting site:3950 Submitted:389 Active:11 Stage out:1 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:33:28 -0700 Selecting site:3950 Submitted:389 Active:3 Stage out:9 Finished successfully:60
Progress: time: Wed, 28 May 2014 15:33:31 -0700 Selecting site:3938 Stage in:1 Submitted:400 Finished successfully:72
Progress: time: Wed, 28 May 2014 15:33:36 -0700 Selecting site:3938 Stage in:2 Submitted:399 Finished successfully:72
Progress: time: Wed, 28 May 2014 15:34:06 -0700 Selecting site:3938 Submitted:389 Active:12 Finished successfully:72
Progress: time: Wed, 28 May 2014 15:34:36 -0700 Selecting site:3938 Submitted:389 Active:12 Finished successfully:72
Progress: time: Wed, 28 May 2014 15:35:06 -0700 Selecting site:3938 Submitted:389 Active:12 Finished successfully:72
… Batch allocation released, new request submitted …
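To make the stalls in a log like the one above easier to quantify, here is a minimal Python sketch that parses the "Progress:" lines and reports stretches in which no task is in the Stage in, Active, or Stage out state. The function names and the definition of "idle" are assumptions made for illustration; none of this is part of Swift, and time spent waiting in the batch queue is counted as idle too.

```python
import re
from email.utils import parsedate_to_datetime

# Timestamp in the form "Wed, 28 May 2014 15:20:36 -0700", then the state counts.
LINE_RE = re.compile(
    r"Progress:\s+time:\s+"
    r"(\w{3}, \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4})(.*)$"
)
# State names may contain spaces, e.g. "Selecting site:3998".
STATE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")
# Assumption: a sample is "busy" if any task is in one of these states.
BUSY_STATES = ("Stage in", "Active", "Stage out")


def parse_progress(line):
    """Return (timestamp, {state: count}) or None for non-progress lines."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    when = parsedate_to_datetime(match.group(1))
    counts = {name: int(n) for name, n in STATE_RE.findall(match.group(2))}
    return when, counts


def idle_periods(lines):
    """Return (start, end, seconds) for each run of consecutive idle samples."""
    periods = []
    start = last = None
    for line in lines:
        parsed = parse_progress(line)
        if parsed is None:
            continue  # annotations and other non-progress output
        when, counts = parsed
        busy = any(counts.get(state, 0) for state in BUSY_STATES)
        if not busy:
            if start is None:
                start = when
            last = when
        elif start is not None:
            periods.append((start, last, (last - start).total_seconds()))
            start = None
    if start is not None:
        periods.append((start, last, (last - start).total_seconds()))
    return periods
```

Feeding the saved console output to idle_periods(open(path)) gives one tuple per idle stretch, which can then be compared against the batch scheduler's own record of when each allocation started and ended.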
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com
swift -sites.file /g/g15/bronevet/apps/swift-0.94.1/etc/sites.pbatch.sierra.tmp.xml -tc.file ~/code/tmp/sight/apps/linsolve/iml/tc.data -config /g/g15/bronevet/apps/swift-0.94.1/etc/cf.tmp -lazy.errors false ~/code/tmp/sight/apps/linsolve/iml/psuadeExperiments.swift -precond=diag -matrix=nasa1824 -modelType=singleModel -resume psuadeExperiments-20140528-0007-gx1v1o8g.0.rlog
Swift 0.94.1 swift-r7114 cog-r3803
RunID: 20140528-1001-vnifrjtf
Progress: time: Wed, 28 May 2014 10:01:24 -0700
Tree combinedProgress: time: Wed, 28 May 2014 10:01:54 -0700 Finished in previous run:12
Progress: time: Wed, 28 May 2014 10:02:15 -0700 Initializing:2 Finished in previous run:12
Progress: time: Wed, 28 May 2014 10:02:16 -0700 Initializing:1542 Finished in previous run:39
Progress: time: Wed, 28 May 2014 10:02:17 -0700 Initializing:2627 Selecting site:1532 Finished in previous run:93
Progress: time: Wed, 28 May 2014 10:02:18 -0700 Initializing:10156 Selecting site:3667 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:02:19 -0700 Initializing:3421 Selecting site:12328 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:02:21 -0700 Selecting site:15746 Submitting:3 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:02:23 -0700 Selecting site:15348 Submitting:384 Submitted:17 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:02:24 -0700 Selecting site:15348 Submitted:401 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:02:45 -0700 Selecting site:15348 Stage in:1 Submitted:400 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:02:46 -0700 Selecting site:15348 Stage in:2 Submitted:399 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:02:52 -0700 Selecting site:15348 Stage in:6 Submitted:395 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:02:53 -0700 Selecting site:15348 Stage in:10 Submitted:391 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:03:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:03:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:04:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:04:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:05:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:05:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:06:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:06:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:07:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:07:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:08:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:08:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:09:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:09:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:10:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:10:54 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:11:24 -0700 Selecting site:15348 Submitted:389 Active:12 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:11:27 -0700 Selecting site:15348 Submitted:388 Active:13 Finished in previous run:269
Progress: time: Wed, 28 May 2014 10:11:28 -0700 Selecting site:15324 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:24
Progress: time: Wed, 28 May 2014 10:11:54 -0700 Selecting site:15324 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:24
Progress: time: Wed, 28 May 2014 10:12:12 -0700 Selecting site:15324 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:24
Progress: time: Wed, 28 May 2014 10:12:13 -0700 Selecting site:15300 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:48
Progress: time: Wed, 28 May 2014 10:12:24 -0700 Selecting site:15300 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:48
Progress: time: Wed, 28 May 2014 10:12:54 -0700 Selecting site:15300 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:48
Progress: time: Wed, 28 May 2014 10:12:58 -0700 Selecting site:15300 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:48
Progress: time: Wed, 28 May 2014 10:12:59 -0700 Selecting site:15276 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:72
Progress: time: Wed, 28 May 2014 10:13:24 -0700 Selecting site:15276 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:72
Progress: time: Wed, 28 May 2014 10:13:43 -0700 Selecting site:15276 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:72
Progress: time: Wed, 28 May 2014 10:13:44 -0700 Selecting site:15252 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:96
Progress: time: Wed, 28 May 2014 10:13:54 -0700 Selecting site:15252 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:96
Progress: time: Wed, 28 May 2014 10:14:24 -0700 Selecting site:15252 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:96
Progress: time: Wed, 28 May 2014 10:14:28 -0700 Selecting site:15252 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:96
Progress: time: Wed, 28 May 2014 10:14:29 -0700 Selecting site:15228 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:120
Progress: time: Wed, 28 May 2014 10:14:54 -0700 Selecting site:15228 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:120
Progress: time: Wed, 28 May 2014 10:15:14 -0700 Selecting site:15228 Submitted:388 Active:13 Finished in previous run:269 Failed but can retry:120
Progress: time: Wed, 28 May 2014 10:15:15 -0700 Selecting site:15204 Submitting:23 Submitted:366 Active:12 Finished in previous run:269 Failed but can retry:144
Progress: time: Wed, 28 May 2014 10:15:24 -0700 Selecting site:15204 Submitted:389 Active:12 Finished in previous run:269 Failed but can retry:144
-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Tuesday, May 27, 2014 4:46 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error
On Tue, 2014-05-27 at 22:56 +0000, Bronevetsky, Greg wrote:
[...]
> Progress: time: Tue, 27 May 2014 15:39:42 -0700 Selecting site:1268
> Stage in:30 Submitted:328 Active:31 Finished successfully:3 Failed
> but can retry:344
I would really suggest disabling lazy errors and execution retries until you get things to run.
[...]
> 2014/05/27 15:34:56.648 INFO 000000 1401230034461 Staging out /p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl/wrapper.error (mode = 2).
Right. It means something went wrong running the app on the compute node. That's a file that is used to send back the exact error.
[...]
> _swiftwrap.staging: line 45: echo: write error: No space left on
> device
>
> However, I can’t see how this could be since each node has 16GB of RAM (available as RAM or ramdisk). Is there a way to look into this further?
Swift doesn't have much in that direction. The wrapper logs should contain some diagnostic information for failing jobs, but if they fail due to lack of disk space, I can't see how the wrapper log can be written to.
What I would suggest is wrapping your app in a script that looks into disk issues (df, ls), and running multiple apps on a single node and hopefully catching a glimpse of what the problem is before all scratch space is exhausted.
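A minimal sketch of such a wrapper, written in Python rather than as a df/ls shell script, might look like the following. The function names and the directories checked ("/tmp" and the current directory) are assumptions for illustration and would need to be replaced with the node's actual scratch or ramdisk paths. The report goes to stderr, so it is only useful if stderr can still be captured when the disk fills up.

```python
import shutil
import subprocess
import sys


def disk_report(paths):
    """Return one human-readable free-space line per path."""
    lines = []
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError as exc:
            lines.append(f"{path}: unavailable ({exc})")
            continue
        free_mib = usage.free // 2**20
        total_mib = usage.total // 2**20
        lines.append(f"{path}: {free_mib} MiB free of {total_mib} MiB")
    return lines


def run_with_disk_report(argv, paths, log=sys.stderr):
    """Log free space, run the app, log free space again, return its exit code."""
    for line in disk_report(paths):
        print(f"[before] {line}", file=log)
    result = subprocess.run(argv)
    for line in disk_report(paths):
        print(f"[after rc={result.returncode}] {line}", file=log)
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python diskwrap.py <app> [args...]
    # "/tmp" and "." are placeholder paths, not taken from the sites file.
    sys.exit(run_with_disk_report(sys.argv[1:], ["/tmp", "."]))
```

The wrapper passes the app's exit code through unchanged, so pointing the tc.data entry at it should not alter how failures are reported.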
I think it would be a nice idea to add some node status (mem/disk/cpu) monitors to the swift monitoring interfaces.
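As a rough illustration of the memory side of such a node status check, the sketch below parses the text of /proc/meminfo and computes the fraction of MemTotal reported as MemFree. The function names are hypothetical and this is not an existing Swift feature; it only shows the kind of sample a monitor could record alongside each job.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their integer values (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def free_fraction(fields):
    """Fraction of MemTotal reported as MemFree (ignores buffers and cache)."""
    return fields["MemFree"] / fields["MemTotal"]


def sample_once(path="/proc/meminfo"):
    """Read the file once and return (fields, free fraction). Linux only."""
    with open(path) as handle:
        fields = parse_meminfo(handle.read())
    return fields, free_fraction(fields)
```

Calling sample_once() periodically from a wrapper around the app would give a per-node trace to line up against the times at which jobs fail.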
Mihael