[Swift-user] Deleting no longer necessary anonymous files in _concurrent

Justin M Wozniak wozniak at mcs.anl.gov
Fri Sep 3 13:50:57 CDT 2010


First off, I definitely recognize the importance of doing this 
efficiently.  I also realize you may be thinking of certain "make" 
functionality that does something like this.

We are currently working on improvements to Swift's data access 
mechanisms.  Many applications create temporary intermediate data, so we 
are definitely looking at this.

From Swift's perspective, there are two aspects: the garbage collection or
automated delete, and the data placement between job executions.  The
garbage collection is something that could be handled in a few ways (cache
management) or simply by tearing down the intermediate storage system.
The data placement involves using an intermediate storage system that is 
at the compute site, preventing full stage out to the client, and ensuring 
that this storage system is accessible to both the producer and consumer 
of the pipeline data.  (Swift assumes that there is one permanent 
filesystem, the one from which it is run, and uses staging for everything 
else.  A given pair of jobs could execute at separate sites with 
different filesystems.)
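
To make the garbage-collection aspect concrete, here is a minimal Python
sketch of one possible policy: delete an intermediate file once every job
that reads it has finished.  It is only an illustration and does not
describe how Swift is implemented; the class name IntermediateFileTracker
and its methods are hypothetical.

```python
import os
import tempfile


class IntermediateFileTracker:
    """Deletes an intermediate file once all of its consumers are done.

    Hypothetical helper for illustration only; not part of Swift.
    """

    def __init__(self):
        # Maps file path -> number of consumers that have not finished yet.
        self._pending = {}

    def register(self, path, consumers):
        """Record that `consumers` jobs will read `path`."""
        if consumers < 1:
            raise ValueError("a tracked file needs at least one consumer")
        self._pending[path] = consumers

    def consumer_done(self, path):
        """Mark one consumer as finished; delete the file after the last one.

        Returns True if the file was deleted by this call.
        """
        self._pending[path] -= 1
        if self._pending[path] > 0:
            return False
        del self._pending[path]
        if os.path.exists(path):
            os.remove(path)
        return True


def demo():
    """Create a temporary file with two consumers and release both."""
    tracker = IntermediateFileTracker()
    fd, path = tempfile.mkstemp(suffix="-array")
    os.close(fd)
    tracker.register(path, consumers=2)
    first = tracker.consumer_done(path)    # one consumer still pending
    exists_after_first = os.path.exists(path)
    second = tracker.consumer_done(path)   # last consumer: file is removed
    return first, exists_after_first, second, os.path.exists(path)
```

The file survives the first call to consumer_done() and is removed by the
second, so its lifetime is tied to its readers rather than to the end of
the run.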

There is "beta" functionality in the Swift trunk to directly utilize a 
local filesystem (that at least two applications are using).  If there is 
a "scratch" filesystem that you can use, I can direct you to that.  We are 
also productizing the ability to set up a temporary storage system for use 
by Swift, but that is not available yet.
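
For reference, the disk-space figures in the quoted message below can be
reproduced with a short calculation.  This is a rough Python sketch using
only the numbers given there (about 30 TB of input, temporary space of 4x
the input, a _concurrent directory of 2x the input); the function and
variable names are made up for illustration, and the message does not say
whether the 2x is included in the 4x.

```python
def extra_disk_tb(input_tb, factor):
    """Return the extra disk space in TB for a given multiple of the input."""
    return input_tb * factor


INPUT_TB = 30          # "~30 TB of data" from the quoted message
TEMP_FACTOR = 4        # temporary space of 4x the input data
CONCURRENT_FACTOR = 2  # _concurrent directory of 2x the input data

# Total temporary space needed for the run.
total_extra = extra_disk_tb(INPUT_TB, TEMP_FACTOR)

# Space taken by the _concurrent directory alone.
concurrent_only = extra_disk_tb(INPUT_TB, CONCURRENT_FACTOR)

print(f"total temporary space: {total_extra} TB")
print(f"_concurrent directory: {concurrent_only} TB")
```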

On Wed, 1 Sep 2010, John Dennis wrote:

> Justin,
>
> 	I am a little confused by your response that cleaning up temporary
> files is not the responsibility of the Swift language.  We did not create
> the file 'wgt_files-935f5705-27ed-4a99-9420-441269bba3a0-36-4-0-array';
> Swift did.  I certainly have no use for it.  It was created as part of
> the parallelization process.  Consider the following bit of pseudo-Swift
> code:
>
> foreach years {
> 	file wgt_files[];
> 	foreach month {
> 		wgt_files[] = DoSomething();
> 	} 
> }
>
> 	The 'wgt_files' array is only in scope within the 'foreach years' loop.
> Once all iterations of the 'foreach years' loop have completed, I would
> expect 'wgt_files' to be deleted, since a variable/file should be deleted
> once it goes out of scope.  Isn't this really an issue of garbage
> collection for the Swift language?
>
> 	While I do see how we could use external variables to manage this all
> ourselves, that would significantly complicate the source code and remove
> much of the simplicity and elegance that Swift provides.
>
> 	Matthew and I are concerned about this because of its impact on disk
> usage.  For example, our Swift script requires temporary space of 4x the
> size of the input data.  Our generated data is tiny, while the
> _concurrent directory is 2x the size of the input data.  Now we want to
> execute the Swift script on ~30 TB of data.  So just enabling parallel
> execution with Swift would require an extra 120 TB of disk space.  I
> realize that parallel execution will consume more disk space, but this
> seems excessive.
>
> Thanks,
> John Dennis
> 
>
>
> On Aug 30, 2010, at 3:54 PM, Justin M Wozniak wrote:
>
>> Hi Matthew
>> 	Deleting files is out of the scope of the Swift language.  You can of 
>> course remove them yourself in your scripts, and as long as Swift does not 
>> try to stage them out you should be fine.
>> 	You may want to look at external variables as another way to approach 
>> this (manual 2.5).  Using external variables you can manage the files in 
>> your scripts while maintaining the Swift progress model.
>> 	Justin
>> 
>> On Fri, 27 Aug 2010, Matthew Woitaszek wrote:
>>> Good afternoon,
>>> 
>>> I'm working with a script that creates arrays of intermediate files
>>> using the anonymous concurrent mapper, such as:
>>> 
>>> file wgt_file[];
>>> 
>>> As I expect, all of these files get generated in the remote swift
>>> temporary directory and are then returned to the _concurrent directory
>>> on the host executing Swift. However, in this particular application,
>>> they're then immediately consumed by a subsequent procedure and never
>>> needed again.
>>> 
>>> Is there a way to configure Swift or the file mapper declaration to
>>> delete these files after the remaining script "consumes" them? (That
>>> is, after all procedures relying on them as inputs have been
>>> executed?) Or can (should?) that be done manually?
>>> 
>>> More speculatively, is there a way to keep files like these on the
>>> execution host and not even bring them back to _concurrent? (With loss
>>> of generality, I'm executing on a single site, and don't really ever
>>> need the file locally, for restarts or staging to another site.)
>>> 
>>> Any advice about managing copies of large intermediate data files in
>>> the Swift execution context would be appreciated!
>>> 
>>> Matthew
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>> 
>> 
>> -- 
>> Justin M Wozniak
>

-- 
Justin M Wozniak


