[Swift-user] Data transfer error
Bronevetsky, Greg
bronevetsky1 at llnl.gov
Wed May 28 18:48:12 CDT 2014
> Are you specifying a max walltime for the apps?
> If not, Swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that.
My sites file has the following bounds:
<profile namespace="globus" key="maxtime">1800</profile>
<profile namespace="globus" key="maxwalltime">00:24:00</profile>
The job typically ended 5-10 min after it started.
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com
-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Wednesday, May 28, 2014 4:46 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error
On Wed, 2014-05-28 at 22:54 +0000, Bronevetsky, Greg wrote:
> Mihael, I ran a few more experiments where I ran a workflow on a
> single cluster node while monitoring its memory use but I didn't see
> any issues with it running out of memory since at all times
> /proc/meminfo reported 22GB out of 24GB free.
The error you were getting previously seemed to indicate that you were running out of *disk* space somewhere, probably on the ramdisk.
So the output of 'df' on the worker node would probably tell you more than /proc/meminfo.
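If it is easier to log this from a script than to run 'df' by hand, a minimal Python sketch of the same check could look like the following. The mount points /dev/shm and /tmp are only assumptions for illustration; substitute whatever ramdisk or scratch path the worker actually writes to.

```python
import shutil


def free_space_report(path: str) -> str:
    """Return a one-line summary of free vs. total space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    gib = 1024 ** 3
    return f"{path}: {usage.free / gib:.1f} GiB free of {usage.total / gib:.1f} GiB"


if __name__ == "__main__":
    # Hypothetical locations; a ramdisk is often mounted at /dev/shm on Linux.
    for mount in ("/dev/shm", "/tmp"):
        try:
            print(free_space_report(mount))
        except OSError as err:
            print(f"{mount}: could not query ({err})")
```

Running this periodically while the workflow executes would show whether a filesystem fills up even though /proc/meminfo still reports plenty of free memory.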
> I've now begun a more focused analysis where I have a simple script
> that captures the high-level structure of my real script. It first
> generates a bunch of files, producing additional temporary files and
> the directories along with the main output file. These files are then
> reduced using a reduction tree based on the example you sent me. I
> have not yet gotten the simple script to fail in the same way as the
> main script but I've noticed a few oddities.
>
> First, although my sites file has <profile namespace="swift"
> key="stagingMethod">file</profile> and my cf file has
> use.provider.staging=true, I see that all the intermediate files
> produced by my tasks are written to the global file system specified
> in the sites file as
> <workdirectory>/p/lscratche/bronevet/swift_work</workdirectory>. How
> do I force Swift to use node-local storage for this data?
You would have to change <workdirectory> to a node-local location.
>
> Second, when I run as many processes on the one node as there are
> cores, the script runs but it keeps stalling. As you can see below, it
> processes tasks in batches of 12. However, after a few batches the job
> is aborted (~6 mins into a 30 min allocation) even though the node
> appears healthy and does not run out of memory and Swift submits a new
> job into the batch queue. Why does this happen?
Are you specifying a max walltime for the apps?
If not, Swift assumes 10 minutes. If the first few batches take 21 minutes, and the worker has 30 minutes allocated, it won't be able to fit any other jobs after that.
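The arithmetic behind that can be sketched in a few lines of Python. This is a simplified model and not Swift's actual scheduler: it assumes a job is only started on a worker if the time left in the worker's allocation is at least the job's declared walltime. The function names are hypothetical.

```python
def parse_walltime(hms: str) -> int:
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def job_fits(allocation_s: int, elapsed_s: int, job_walltime_s: int) -> bool:
    """Simplified model: a job fits only if the remaining allocation covers its walltime."""
    remaining_s = allocation_s - elapsed_s
    return remaining_s >= job_walltime_s


if __name__ == "__main__":
    # The example above: 30 min allocation, 21 min used, 10 min assumed per job.
    print(job_fits(30 * 60, 21 * 60, 10 * 60))

    # With maxtime=1800 and maxwalltime=00:24:00, checked 7 minutes in.
    print(job_fits(1800, 7 * 60, parse_walltime("00:24:00")))
```

Under this model both calls report that no further job fits: 9 minutes remain against a 10 minute walltime in the first case, and 23 minutes remain against a 24 minute walltime in the second.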
Mihael