[Swift-user] Data transfer error

Bronevetsky, Greg bronevetsky1 at llnl.gov
Wed May 28 17:54:16 CDT 2014


Mihael, I ran a few more experiments in which I ran a workflow on a single cluster node while monitoring its memory use. I saw no sign of the node running out of memory: at all times /proc/meminfo reported 22GB out of 24GB free. I've now begun a more focused analysis using a simple script that captures the high-level structure of my real script. It first generates a bunch of files, producing additional temporary files and directories along with the main output file. These files are then reduced using a reduction tree based on the example you sent me. I have not yet gotten the simple script to fail in the same way as the main script, but I've noticed a few oddities.



First, although my sites file has <profile namespace="swift" key="stagingMethod">file</profile> and my cf file has use.provider.staging=true, I see that all the intermediate files produced by my tasks are written to the global file system specified in the sites file as <workdirectory>/p/lscratche/bronevet/swift_work</workdirectory>. How do I force Swift to use node-local storage for this data?



Second, when I run as many processes on the one node as there are cores, the script runs but it keeps stalling. As you can see below, it processes tasks in batches of 12. However, after a few batches the job is aborted (~6 mins into a 30 min allocation), even though the node appears healthy and does not run out of memory, and Swift then submits a new job to the batch queue. Why does this happen?

Swift 0.94.1 swift-r7114 cog-r3803

RunID: 20140528-1504-1ndui8r0

Progress:  time: Wed, 28 May 2014 15:04:36 -0700

Progress:  time: Wed, 28 May 2014 15:04:37 -0700  Initializing:258

Progress:  time: Wed, 28 May 2014 15:04:38 -0700  Initializing:698  Selecting site:589

Progress:  time: Wed, 28 May 2014 15:04:45 -0700  Selecting site:4408  Submitting:3

Progress:  time: Wed, 28 May 2014 15:05:06 -0700  Selecting site:4010  Submitted:401

...

Progress:  time: Wed, 28 May 2014 15:15:06 -0700  Selecting site:4010  Submitted:401

Progress:  time: Wed, 28 May 2014 15:15:13 -0700  Selecting site:4010  Stage in:1  Submitted:400

Progress:  time: Wed, 28 May 2014 15:15:21 -0700  Selecting site:4010  Stage in:9  Submitted:392

Progress:  time: Wed, 28 May 2014 15:15:22 -0700  Selecting site:4010  Stage in:5  Submitted:389  Active:7

Progress:  time: Wed, 28 May 2014 15:15:36 -0700  Selecting site:4010  Submitted:389  Active:12

Progress:  time: Wed, 28 May 2014 15:16:06 -0700  Selecting site:4010  Submitted:389  Active:12

Progress:  time: Wed, 28 May 2014 15:16:36 -0700  Selecting site:4010  Submitted:389  Active:12

Progress:  time: Wed, 28 May 2014 15:17:06 -0700  Selecting site:4010  Submitted:389  Active:12

Progress:  time: Wed, 28 May 2014 15:17:36 -0700  Selecting site:4010  Submitted:389  Active:12

Progress:  time: Wed, 28 May 2014 15:18:06 -0700  Selecting site:4010  Submitted:389  Active:12

Progress:  time: Wed, 28 May 2014 15:18:12 -0700  Selecting site:4010  Submitted:389  Active:11  Stage out:1

Progress:  time: Wed, 28 May 2014 15:18:15 -0700  Selecting site:4010  Submitted:389  Active:2  Stage out:10

Progress:  time: Wed, 28 May 2014 15:18:23 -0700  Selecting site:4010  Submitted:389  Active:1  Stage out:11

Progress:  time: Wed, 28 May 2014 15:18:25 -0700  Selecting site:3998  Stage in:1  Submitted:400  Finished successfully:12

Progress:  time: Wed, 28 May 2014 15:18:26 -0700  Selecting site:3998  Stage in:9  Submitted:392  Finished successfully:12

Progress:  time: Wed, 28 May 2014 15:18:36 -0700  Selecting site:3998  Submitted:389  Active:12  Finished successfully:12

Progress:  time: Wed, 28 May 2014 15:19:06 -0700  Selecting site:3998  Submitted:389  Active:12  Finished successfully:12

Progress:  time: Wed, 28 May 2014 15:19:36 -0700  Selecting site:3998  Submitted:389  Active:12  Finished successfully:12

Progress:  time: Wed, 28 May 2014 15:20:06 -0700  Selecting site:3998  Submitted:389  Active:12  Finished successfully:12

Progress:  time: Wed, 28 May 2014 15:20:18 -0700  Selecting site:3998  Submitted:389  Active:11  Stage out:1  Finished successfully:12

Progress:  time: Wed, 28 May 2014 15:20:19 -0700  Selecting site:3998  Submitted:389  Active:7  Stage out:5  Finished successfully:12

Progress:  time: Wed, 28 May 2014 15:20:21 -0700  Selecting site:3998  Submitted:389  Stage out:11  Finished successfully:13

Progress:  time: Wed, 28 May 2014 15:20:36 -0700  Selecting site:3986  Submitted:401  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:21:06 -0700  Selecting site:3986  Submitted:401  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:21:36 -0700  Selecting site:3986  Submitted:401  Finished successfully:24

… Batch allocation released, new request submitted …

Progress:  time: Wed, 28 May 2014 15:22:06 -0700  Selecting site:3986  Submitted:401  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:22:36 -0700  Selecting site:3986  Submitted:401  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:23:06 -0700  Selecting site:3986  Submitted:401  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:23:18 -0700  Selecting site:3986  Stage in:1  Submitted:400  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:23:20 -0700  Selecting site:3986  Stage in:4  Submitted:397  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:23:24 -0700  Selecting site:3986  Stage in:5  Submitted:396  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:23:36 -0700  Selecting site:3986  Submitted:389  Active:12  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:24:06 -0700  Selecting site:3986  Submitted:389  Active:12  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:24:36 -0700  Selecting site:3986  Submitted:389  Active:12  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:25:06 -0700  Selecting site:3986  Submitted:389  Active:12  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:25:16 -0700  Selecting site:3986  Submitted:389  Active:11  Stage out:1  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:25:26 -0700  Selecting site:3986  Submitted:389  Active:7  Stage out:5  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:25:27 -0700  Selecting site:3986  Submitted:389  Stage out:12  Finished successfully:24

Progress:  time: Wed, 28 May 2014 15:25:29 -0700  Selecting site:3975  Stage in:2  Submitted:398  Stage out:1  Finished successfully:35

Progress:  time: Wed, 28 May 2014 15:25:32 -0700  Selecting site:3975  Stage in:3  Submitted:397  Stage out:1  Finished successfully:35

Progress:  time: Wed, 28 May 2014 15:25:34 -0700  Selecting site:3975  Stage in:4  Submitted:396  Stage out:1  Finished successfully:35

Progress:  time: Wed, 28 May 2014 15:25:35 -0700  Selecting site:3974  Stage in:1  Submitting:1  Submitted:389  Active:10  Finished successfully:36

Progress:  time: Wed, 28 May 2014 15:25:36 -0700  Selecting site:3974  Submitted:389  Active:12  Finished successfully:36

Progress:  time: Wed, 28 May 2014 15:26:06 -0700  Selecting site:3974  Submitted:389  Active:12  Finished successfully:36

Progress:  time: Wed, 28 May 2014 15:26:36 -0700  Selecting site:3974  Submitted:389  Active:12  Finished successfully:36

Progress:  time: Wed, 28 May 2014 15:27:01 -0700  Selecting site:3974  Submitted:389  Active:11  Stage out:1  Finished successfully:36

Progress:  time: Wed, 28 May 2014 15:27:03 -0700  Selecting site:3974  Submitted:389  Active:2  Stage out:10  Finished successfully:36

Progress:  time: Wed, 28 May 2014 15:27:04 -0700  Selecting site:3974  Submitted:389  Active:1  Stage out:11  Finished successfully:36

Progress:  time: Wed, 28 May 2014 15:27:06 -0700  Selecting site:3974  Submitted:389  Stage out:12  Finished successfully:36

Progress:  time: Wed, 28 May 2014 15:27:08 -0700  Selecting site:3974  Submitted:389  Stage out:11  Finished successfully:37

Progress:  time: Wed, 28 May 2014 15:27:36 -0700  Selecting site:3962  Submitted:389  Active:12  Finished successfully:48

Progress:  time: Wed, 28 May 2014 15:28:06 -0700  Selecting site:3962  Submitted:389  Active:12  Finished successfully:48

Progress:  time: Wed, 28 May 2014 15:28:36 -0700  Selecting site:3962  Submitted:389  Active:12  Finished successfully:48

Progress:  time: Wed, 28 May 2014 15:28:47 -0700  Selecting site:3962  Submitted:389  Active:11  Stage out:1  Finished successfully:48

Progress:  time: Wed, 28 May 2014 15:29:01 -0700  Selecting site:3962  Submitted:389  Active:8  Stage out:4  Finished successfully:48

Progress:  time: Wed, 28 May 2014 15:29:02 -0700  Selecting site:3955  Submitting:1  Submitted:395  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:29:06 -0700  Selecting site:3950  Submitted:401  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:29:36 -0700  Selecting site:3950  Submitted:401  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:30:06 -0700  Selecting site:3950  Submitted:401  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:30:36 -0700  Selecting site:3950  Submitted:401  Finished successfully:60

… Batch allocation released, new request submitted …

Progress:  time: Wed, 28 May 2014 15:31:06 -0700  Selecting site:3950  Submitted:401  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:31:36 -0700  Selecting site:3950  Submitted:401  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:31:56 -0700  Selecting site:3950  Stage in:1  Submitted:400  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:32:06 -0700  Selecting site:3950  Submitted:389  Active:12  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:32:36 -0700  Selecting site:3950  Submitted:389  Active:12  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:33:06 -0700  Selecting site:3950  Submitted:389  Active:12  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:33:27 -0700  Selecting site:3950  Submitted:389  Active:11  Stage out:1  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:33:28 -0700  Selecting site:3950  Submitted:389  Active:3  Stage out:9  Finished successfully:60

Progress:  time: Wed, 28 May 2014 15:33:31 -0700  Selecting site:3938  Stage in:1  Submitted:400  Finished successfully:72

Progress:  time: Wed, 28 May 2014 15:33:36 -0700  Selecting site:3938  Stage in:2  Submitted:399  Finished successfully:72

Progress:  time: Wed, 28 May 2014 15:34:06 -0700  Selecting site:3938  Submitted:389  Active:12  Finished successfully:72

Progress:  time: Wed, 28 May 2014 15:34:36 -0700  Selecting site:3938  Submitted:389  Active:12  Finished successfully:72

Progress:  time: Wed, 28 May 2014 15:35:06 -0700  Selecting site:3938  Submitted:389  Active:12  Finished successfully:72

… Batch allocation released, new request submitted …
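
To make the batch pattern in a log like the one above easier to see, the Progress lines can be reduced to the moments when the "Finished successfully" count changes. This is only a sketch, assuming a POSIX shell and awk and assuming the Progress line layout shown above. The function name progress_summary is hypothetical and is not part of Swift.

```sh
#!/bin/sh
# progress_summary (hypothetical helper, not part of Swift):
# reads Swift "Progress:" lines from the given files (or stdin) and prints
# the wall-clock time, the cumulative "Finished successfully" count, and the
# increase since the previous change, one line per change.
progress_summary() {
    awk '
        BEGIN { last = 0 }
        /^Progress:/ {
            n = 0
            # "Finished successfully:" is 22 characters long.
            if (match($0, /Finished successfully:[0-9]+/)) {
                n = substr($0, RSTART + 22, RLENGTH - 22) + 0
            }
            if (n != last) {
                # Field 7 is the HH:MM:SS part of the timestamp.
                print $7, n, "(+" (n - last) ")"
                last = n
            }
        }
    ' "$@"
}
```

Saved console output could then be passed as a file argument, e.g. progress_summary swift.out, where swift.out is a placeholder name. Long gaps between successive times would mark the stalls.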





Greg Bronevetsky

Lawrence Livermore National Lab

(925) 424-5756

bronevetsky at llnl.gov

http://greg.bronevetsky.com





swift -sites.file /g/g15/bronevet/apps/swift-0.94.1/etc/sites.pbatch.sierra.tmp.xml -tc.file ~/code/tmp/sight/apps/linsolve/iml/tc.data -config /g/g15/bronevet/apps/swift-0.94.1/etc/cf.tmp -lazy.errors false ~/code/tmp/sight/apps/linsolve/iml/psuadeExperiments.swift -precond=diag -matrix=nasa1824 -modelType=singleModel -resume psuadeExperiments-20140528-0007-gx1v1o8g.0.rlog

Swift 0.94.1 swift-r7114 cog-r3803



RunID: 20140528-1001-vnifrjtf

Progress:  time: Wed, 28 May 2014 10:01:24 -0700

Tree combinedProgress:  time: Wed, 28 May 2014 10:01:54 -0700  Finished in previous run:12

Progress:  time: Wed, 28 May 2014 10:02:15 -0700  Initializing:2  Finished in previous run:12

Progress:  time: Wed, 28 May 2014 10:02:16 -0700  Initializing:1542  Finished in previous run:39

Progress:  time: Wed, 28 May 2014 10:02:17 -0700  Initializing:2627  Selecting site:1532  Finished in previous run:93

Progress:  time: Wed, 28 May 2014 10:02:18 -0700  Initializing:10156  Selecting site:3667  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:02:19 -0700  Initializing:3421  Selecting site:12328  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:02:21 -0700  Selecting site:15746  Submitting:3  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:02:23 -0700  Selecting site:15348  Submitting:384  Submitted:17  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:02:24 -0700  Selecting site:15348  Submitted:401  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:02:45 -0700  Selecting site:15348  Stage in:1  Submitted:400  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:02:46 -0700  Selecting site:15348  Stage in:2  Submitted:399  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:02:52 -0700  Selecting site:15348  Stage in:6  Submitted:395  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:02:53 -0700  Selecting site:15348  Stage in:10  Submitted:391  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:03:24 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:03:54 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:04:24 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:04:54 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:05:24 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:05:54 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:06:24 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:06:54 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:07:24 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:07:54 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:08:24 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:08:54 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:09:24 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:09:54 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:10:24 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:10:54 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:11:24 -0700  Selecting site:15348  Submitted:389  Active:12  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:11:27 -0700  Selecting site:15348  Submitted:388  Active:13  Finished in previous run:269

Progress:  time: Wed, 28 May 2014 10:11:28 -0700  Selecting site:15324  Submitting:23  Submitted:366  Active:12  Finished in previous run:269  Failed but can retry:24

Progress:  time: Wed, 28 May 2014 10:11:54 -0700  Selecting site:15324  Submitted:389  Active:12  Finished in previous run:269  Failed but can retry:24

Progress:  time: Wed, 28 May 2014 10:12:12 -0700  Selecting site:15324  Submitted:388  Active:13  Finished in previous run:269  Failed but can retry:24

Progress:  time: Wed, 28 May 2014 10:12:13 -0700  Selecting site:15300  Submitting:23  Submitted:366  Active:12  Finished in previous run:269  Failed but can retry:48

Progress:  time: Wed, 28 May 2014 10:12:24 -0700  Selecting site:15300  Submitted:389  Active:12  Finished in previous run:269  Failed but can retry:48

Progress:  time: Wed, 28 May 2014 10:12:54 -0700  Selecting site:15300  Submitted:389  Active:12  Finished in previous run:269  Failed but can retry:48

Progress:  time: Wed, 28 May 2014 10:12:58 -0700  Selecting site:15300  Submitted:388  Active:13  Finished in previous run:269  Failed but can retry:48

Progress:  time: Wed, 28 May 2014 10:12:59 -0700  Selecting site:15276  Submitting:23  Submitted:366  Active:12  Finished in previous run:269  Failed but can retry:72

Progress:  time: Wed, 28 May 2014 10:13:24 -0700  Selecting site:15276  Submitted:389  Active:12  Finished in previous run:269  Failed but can retry:72

Progress:  time: Wed, 28 May 2014 10:13:43 -0700  Selecting site:15276  Submitted:388  Active:13  Finished in previous run:269  Failed but can retry:72

Progress:  time: Wed, 28 May 2014 10:13:44 -0700  Selecting site:15252  Submitting:23  Submitted:366  Active:12  Finished in previous run:269  Failed but can retry:96

Progress:  time: Wed, 28 May 2014 10:13:54 -0700  Selecting site:15252  Submitted:389  Active:12  Finished in previous run:269  Failed but can retry:96

Progress:  time: Wed, 28 May 2014 10:14:24 -0700  Selecting site:15252  Submitted:389  Active:12  Finished in previous run:269  Failed but can retry:96

Progress:  time: Wed, 28 May 2014 10:14:28 -0700  Selecting site:15252  Submitted:388  Active:13  Finished in previous run:269  Failed but can retry:96

Progress:  time: Wed, 28 May 2014 10:14:29 -0700  Selecting site:15228  Submitting:23  Submitted:366  Active:12  Finished in previous run:269  Failed but can retry:120

Progress:  time: Wed, 28 May 2014 10:14:54 -0700  Selecting site:15228  Submitted:389  Active:12  Finished in previous run:269  Failed but can retry:120

Progress:  time: Wed, 28 May 2014 10:15:14 -0700  Selecting site:15228  Submitted:388  Active:13  Finished in previous run:269  Failed but can retry:120

Progress:  time: Wed, 28 May 2014 10:15:15 -0700  Selecting site:15204  Submitting:23  Submitted:366  Active:12  Finished in previous run:269  Failed but can retry:144

Progress:  time: Wed, 28 May 2014 10:15:24 -0700  Selecting site:15204  Submitted:389  Active:12  Finished in previous run:269  Failed but can retry:144



-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Tuesday, May 27, 2014 4:46 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error



On Tue, 2014-05-27 at 22:56 +0000, Bronevetsky, Greg wrote:



[...]

> Progress:  time: Tue, 27 May 2014 15:39:42 -0700  Selecting site:1268

> Stage in:30  Submitted:328  Active:31  Finished successfully:3  Failed

> but can retry:344



I would really suggest disabling lazy errors and execution retries until you get things to run.



[...]

> 2014/05/27 15:34:56.648 INFO  000000 1401230034461 Staging out /p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl/wrapper.error (mode = 2).



Right. It means something went wrong running the app on the compute node. That's a file that is used to send back the exact error.



[...]

> _swiftwrap.staging: line 45: echo: write error: No space left on

> device

>

> However, I can’t see how this could be since each node has 16GB of RAM (available as RAM or ramdisk). Is there a way to look into this further?



Swift doesn't have much in that direction. The wrapper logs should contain some diagnostic information for failing jobs, but if they fail due to lack of disk space, I can't see how the wrapper log can be written to.



What I would suggest is wrapping your app in a script that looks into disk issues (df, ls), and running multiple apps on a single node and hopefully catching a glimpse of what the problem is before all scratch space is exhausted.
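
A minimal sketch of such a wrapper, assuming a POSIX shell on the compute node. The script name diskwrap.sh, the DISKWRAP_DIRS variable and its default directories are hypothetical and are not part of Swift. How the wrapper is registered in place of the app is left out here.

```sh
#!/bin/sh
# diskwrap.sh (hypothetical name): run an app with disk snapshots taken
# before and after it, so that space problems show up in the job's stderr.
# Usage: diskwrap.sh <app> [args...]
# DISKWRAP_DIRS (assumed variable): space-separated directories to inspect.

snapshot() {
    # $1 is a label such as "before" or "after".
    echo "=== disk snapshot ($1) on $(hostname) at $(date) ===" >&2
    for d in ${DISKWRAP_DIRS:-/tmp .}; do
        # Free space on the file system holding $d, sent to stderr.
        df -k "$d" >&2 2>/dev/null || echo "df failed for $d" >&2
        # First few directory entries, to spot unexpected leftovers.
        ls -la "$d" 2>/dev/null | head -n 20 >&2
    done
}

run_with_snapshots() {
    snapshot before
    "$@"
    status=$?
    snapshot after
    # Preserve the app's exit code so failures are still reported.
    return "$status"
}

# Only run when a command was given on the command line.
if [ "$#" -gt 0 ]; then
    run_with_snapshots "$@"
    exit $?
fi
```

The snapshots go to stderr so the app's stdout is left unchanged, and the app's exit status is passed through. An invocation might look like DISKWRAP_DIRS="/tmp ." ./diskwrap.sh ./myapp arg1, where myapp is a placeholder.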



I think it would be a nice idea to add some node status (mem/disk/cpu) monitors to the swift monitoring interfaces.



Mihael

