[Swift-user] Data transfer error

Bronevetsky, Greg bronevetsky1 at llnl.gov
Thu May 29 17:48:31 CDT 2014


I've finally managed to create a reproducer for my problem. I've attached the problematic Swift script and the app script that it calls. The Swift script is a 2-level reduction tree with radix 40: it iterates 40 times, and each iteration performs 40 inner iterations in which it calls the app to generate an output file and then merges these files. The files produced by all the inner iterations are subsequently merged to produce a single file. The app creates 10 temporary directories with 10 temporary files in each directory, but the only output it emits to Swift is what it writes on stdout.
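(In case the attachments get scrubbed by the archive: the app's behavior, as described above, is roughly the following. This is an illustrative sketch rather than the attached fileGenTest.py verbatim, so the names and the argument handling are placeholders.)

    #!/usr/bin/env python
    # Illustrative sketch of the app script: create 10 temporary
    # directories with 10 temporary files each, then write the only
    # output Swift sees to stdout (which Swift redirects to the
    # task's output file, e.g. file.1.1).
    import os
    import sys
    import tempfile

    def main():
        tag = sys.argv[1] if len(sys.argv) > 1 else "0"
        for d in range(10):
            dpath = tempfile.mkdtemp(prefix="fileGenTest.%s.%d." % (tag, d))
            for f in range(10):
                with open(os.path.join(dpath, "tmp.%d" % f), "w") as fh:
                    fh.write("temporary data %s %d %d\n" % (tag, d, f))
        # the only "real" output the workflow consumes
        sys.stdout.write("output for task %s\n" % tag)

    if __name__ == "__main__":
        main()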

Below is the screen output emitted by Swift when the workflow is executed on a single 12-core node, launched from a directory on our Lustre scratch file system and also using that file system as work storage (<workdirectory> in the sites file). As you can see, after it finishes processing each task Swift tries to stage it out, and on some of the tasks it encounters an error, which causes them to go into the "Failed but can retry" status. If I reduce the workload below 40x40 tasks and 10x10 files this does not happen, so the issue looks to be related to the amount of stress I put on the file system. I can cause the same behavior on our NFS file system if I increase the size of the workload.

I've attached the logs that the run produced in my .globus/coasters and .globus/scripts directories. The stderr.txt files in my jobs/*/* directories were empty and the wrapper.log files contained pretty similar text such as:
	checking for paramfile
	no paramfile: using command line arguments
	Progress  2014-05-29 15:29:52.801545566-0700  LOG_START
	_____________________________________________________________________________
	        Wrapper (_swiftwrap.staging)
	_____________________________________________________________________________
	/g/g15/bronevet/apps/swift-0.94.1/examples/test/fileGenTest.py -out file.1.1 -err stderr.txt -i -d -if -of file.1.1 -k -cdmfile -status provider -a 1
	PWD=/p/lscratche/bronevet/swift_work/testSwiftErrors-20140529-1528-cv85utlb/jobs/o/fileGenTest-o7b89crl
	EXEC=/g/g15/bronevet/apps/swift-0.94.1/examples/test/fileGenTest.py
	STDIN=
	STDOUT=file.1.1
	STDERR=stderr.txt
	DIRS=
	INF=
	OUTF=file.1.1
	KICKSTART=
	ARGS=1
	ARGC=1
	Progress  2014-05-29 15:29:52.806613446-0700  CREATE_INPUTDIR
	Progress  2014-05-29 15:29:52.809312133-0700  EXECUTE

Please let me know if you need any additional info.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

Swift 0.94.1 swift-r7114 cog-r3803

RunID: 20140529-1528-cv85utlb
Progress:  time: Thu, 29 May 2014 15:28:51 -0700
Progress:  time: Thu, 29 May 2014 15:28:53 -0700  Selecting site:50  Submitting:400  Submitted:1
Progress:  time: Thu, 29 May 2014 15:29:21 -0700  Selecting site:50  Submitted:401
Progress:  time: Thu, 29 May 2014 15:29:51 -0700  Selecting site:50  Stage in:1  Submitted:400
Progress:  time: Thu, 29 May 2014 15:29:52 -0700  Selecting site:50  Stage in:8  Submitted:389  Active:4
Progress:  time: Thu, 29 May 2014 15:30:21 -0700  Selecting site:50  Submitted:389  Active:12
Progress:  time: Thu, 29 May 2014 15:30:51 -0700  Selecting site:50  Submitted:389  Active:12
Progress:  time: Thu, 29 May 2014 15:31:21 -0700  Selecting site:50  Submitted:389  Active:12
Progress:  time: Thu, 29 May 2014 15:31:51 -0700  Selecting site:50  Submitted:389  Active:12
Progress:  time: Thu, 29 May 2014 15:32:21 -0700  Selecting site:50  Submitted:389  Active:12
Progress:  time: Thu, 29 May 2014 15:32:51 -0700  Selecting site:50  Submitted:389  Active:12
Progress:  time: Thu, 29 May 2014 15:33:21 -0700  Selecting site:50  Submitted:389  Active:12
Progress:  time: Thu, 29 May 2014 15:33:51 -0700  Selecting site:50  Submitted:389  Active:12
Progress:  time: Thu, 29 May 2014 15:34:21 -0700  Selecting site:50  Submitted:389  Active:12
Progress:  time: Thu, 29 May 2014 15:34:28 -0700  Selecting site:50  Submitted:389  Active:11  Stage out:1
Progress:  time: Thu, 29 May 2014 15:34:51 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:35:21 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:35:51 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:36:21 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:36:51 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:37:21 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:37:51 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:38:21 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:38:51 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:39:21 -0700  Selecting site:39  Submitted:389  Active:12  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:39:25 -0700  Selecting site:39  Submitted:389  Active:11  Stage out:1  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:39:27 -0700  Selecting site:39  Submitted:389  Active:9  Stage out:3  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:39:51 -0700  Selecting site:39  Submitted:389  Active:8  Stage out:4  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:39:55 -0700  Selecting site:39  Submitted:389  Active:7  Stage out:5  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:39:57 -0700  Selecting site:39  Submitted:389  Active:4  Stage out:8  Failed but can retry:11
Progress:  time: Thu, 29 May 2014 15:39:58 -0700  Selecting site:29  Stage in:3  Submitted:398  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:40:04 -0700  Selecting site:29  Stage in:5  Submitted:396  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:40:08 -0700  Selecting site:29  Stage in:8  Submitted:393  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:40:21 -0700  Selecting site:29  Submitted:389  Active:12  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:40:51 -0700  Selecting site:29  Submitted:389  Active:12  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:41:21 -0700  Selecting site:29  Submitted:389  Active:12  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:41:51 -0700  Selecting site:29  Submitted:389  Active:12  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:42:21 -0700  Selecting site:29  Submitted:389  Active:12  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:42:51 -0700  Selecting site:29  Submitted:389  Active:12  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:43:21 -0700  Selecting site:29  Submitted:389  Active:12  Failed but can retry:21
Progress:  time: Thu, 29 May 2014 15:43:51 -0700  Selecting site:29  Submitted:389  Active:12  Failed but can retry:21

-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov] 
Sent: Wednesday, May 28, 2014 4:46 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error

On Wed, 2014-05-28 at 22:54 +0000, Bronevetsky, Greg wrote:
> Mihael, I ran a few more experiments where I ran a workflow on a 
> single cluster node while monitoring its memory use but I didn't see 
> any issues with it running out of memory since at all times 
> /proc/meminfo reported 22GB out of 24GB free.

The error you were getting previously seemed to indicate that you were running out of *disk* space somewhere, probably on the ramdisk.

So maybe the output of 'df' would be better than /proc/meminfo.

>  I've now begun a more focused analysis where I have a simple script 
> that captures the high-level structure of my real script. It first 
> generates a bunch of files, producing additional temporary files and 
> the directories along with the main output file. These files are then 
> reduced using a reduction tree based on the example you sent me. I 
> have not yet gotten the simple script to fail in the same way as the 
> main script but I've noticed a few oddities.
> 
> First, although my sites file has <profile namespace="swift"
> key="stagingMethod">file</profile> and my cf file has 
> use.provider.staging=true, I see that all the intermediate files 
> produced by my tasks are written to the global file system specified 
> in the sites file as 
> <workdirectory>/p/lscratche/bronevet/swift_work</workdirectory>. How 
> do I force Swift to use node-local storage for this data?

You would have to change <workdirectory> to a node-local location.
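
For example (a minimal sketch, not a complete pool definition: the pool handle and the /tmp path are placeholders, and your execution provider and coaster settings would stay as they are):

    <pool handle="onenode">
      <!-- placeholder node-local path; point this at whatever local
           scratch space the compute nodes actually have -->
      <workdirectory>/tmp/bronevet/swift_work</workdirectory>
      <!-- staging settings carried over from your current sites file -->
      <profile namespace="swift" key="stagingMethod">file</profile>
      <!-- execution provider, queue settings, etc. unchanged -->
    </pool>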

> 
> Second, when I run as many processes on the one node as there are 
> cores, the script runs but it keeps stalling. As you can see below, it 
> processes tasks in batches of 12. However, after a few batches the job 
> is aborted (~6 mins into a 30 min allocation) even though the node 
> appears healthy and does not run out of memory and Swift submits a new 
> job into the batch queue. Why does this happen?

Are you specifying a max walltime for the apps?

If not, swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that.
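
For example, if you are using a classic tc file, an entry along these lines declares a 20-minute walltime for the app (the site handle and the value are placeholders, and the exact profile-column syntax may differ between Swift versions, so check the userguide for 0.94):

    # hypothetical tc entry: site, app name, path, install status,
    # platform, then a globus profile setting maxwalltime (hh:mm:ss)
    onenode  fileGenTest  /g/g15/bronevet/apps/swift-0.94.1/examples/test/fileGenTest.py  INSTALLED  INTEL32::LINUX  globus::maxwalltime="00:20:00"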

Mihael

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fileGenTest.py
Type: application/octet-stream
Size: 285 bytes
Desc: fileGenTest.py
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140529/16654c57/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testSwiftErrors.swift
Type: application/octet-stream
Size: 658 bytes
Desc: testSwiftErrors.swift
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140529/16654c57/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker-0529-2803530-000000.log.bz2
Type: application/octet-stream
Size: 313974 bytes
Desc: worker-0529-2803530-000000.log.bz2
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140529/16654c57/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PBS1386040007442248971.submit.exitcode
Type: application/octet-stream
Size: 4 bytes
Desc: PBS1386040007442248971.submit.exitcode
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140529/16654c57/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PBS1386040007442248971.submit.stderr
Type: application/octet-stream
Size: 115 bytes
Desc: PBS1386040007442248971.submit.stderr
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140529/16654c57/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PBS1386040007442248971.submit.stdout
Type: application/octet-stream
Size: 69 bytes
Desc: PBS1386040007442248971.submit.stdout
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140529/16654c57/attachment-0005.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PBS1386040007442248971.submit
Type: application/octet-stream
Size: 797 bytes
Desc: PBS1386040007442248971.submit
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140529/16654c57/attachment-0006.obj>

