[Swift-user] Deleting no longer necessary anonymous files in _concurrent
Matthew Woitaszek
matthew.woitaszek at gmail.com
Tue Sep 7 13:22:42 CDT 2010
Hi Justin,
Thanks for your reply -- I'd definitely like to learn more about the
alternate staging/scratch options.
> There is "beta" functionality in the Swift trunk to directly utilize a local
> filesystem (that at least two applications are using). If there is a
> "scratch" filesystem that you can use, I can direct you to that.
By this, do you mean something like a node-local scratch system,
where files could be staged directly from _concurrent to a node
instead of a "site", or is it something else?
If node-local, I fear that might be a step backwards for our
application. In our case, the staging time vs. capacity tradeoff is
becoming quite problematic. On one hand, I really only want to keep
one copy of everything (_concurrent), but limiting the amount of
storage on a site might increase staging, which negates the
parallelism, so I'm back to preferring a big site cache to minimize
that.
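
To put rough numbers on the capacity side of that tradeoff, here is a
back-of-the-envelope sketch in Python using the figures from John's
message quoted below (temporary space of about 4x the input, with
_concurrent at about 2x, for ~30 TB of input). The function name is
hypothetical, and the split between _concurrent and the site temporary
directory is my assumption, not something Swift reports.

```python
def scratch_footprint_tb(input_tb, temp_factor=4.0, concurrent_factor=2.0):
    """Estimate temporary disk usage for a Swift run, in TB.

    temp_factor and concurrent_factor are the multiples of the input size
    quoted in this thread; they are observations from one script, not
    Swift constants.
    """
    total_temp = temp_factor * input_tb
    concurrent = concurrent_factor * input_tb
    # Assumption: whatever is not in _concurrent sits in the site
    # temporary directory.
    site_cache = total_temp - concurrent
    return {
        "total_temp": total_temp,
        "concurrent": concurrent,
        "site_cache": site_cache,
    }


if __name__ == "__main__":
    for name, tb in scratch_footprint_tb(30.0).items():
        print(f"{name}: {tb:.0f} TB")
```

For 30 TB of input this gives 120 TB of temporary space in total, which
matches John's figure, with 60 TB of that in _concurrent.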
Is there a way to get tasks to read/write directly out of _concurrent
without the staging to the remote site at all? I suspect the answer is
"no" due to your description of _concurrent's importance as the
permanent file system and its use in staging to site file systems. But
in our case, we're coincidentally at one site, so the big GPFS scratch
file system area ends up holding both _concurrent as well as the swift
site temporary directory in different paths.
> The
> data placement involves using an intermediate storage system that is at the
> compute site, preventing full stage out to the client, and ensuring that
> this storage system is accessible to both the producer and consumer of the
> pipeline data.
This sounds like a feature that John and I would sign up for. :-)
I see the new use.provider.staging option in the trunk, and "sfs" is
very tempting...
(Also, thanks for your thoughts on garbage collection; I'll stick with
the possibilities in the staging arena for now!)
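
For reference, the manual cleanup you suggested earlier might look
roughly like the Python sketch below. It assumes that anonymous files
for an array variable are named '<variable>-<id>-...-array', which I am
inferring only from the single 'wgt_files-...-array' name in John's
message; the helper name and its arguments are hypothetical. It would
only be safe to run once every procedure that reads the files has
finished and Swift no longer needs to stage them.

```python
import glob
import os


def remove_anonymous_files(concurrent_dir, variable_name, dry_run=True):
    """Delete files in concurrent_dir whose names start with '<variable_name>-'.

    Returns the sorted list of matching paths. With dry_run=True nothing
    is deleted, so the matches can be checked first.
    """
    # Escape both parts so that only the trailing '*' acts as a wildcard.
    pattern = os.path.join(
        glob.escape(concurrent_dir), glob.escape(variable_name) + "-*"
    )
    matches = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if not dry_run:
        for path in matches:
            os.remove(path)
    return matches
```

Calling remove_anonymous_files(path_to_concurrent, "wgt_files") lists the
candidates without touching them; passing dry_run=False actually removes
them.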
Thanks for your time,
Matthew
On Fri, Sep 3, 2010 at 12:50 PM, Justin M Wozniak <wozniak at mcs.anl.gov> wrote:
>
> First off, I definitely recognize the importance of doing this efficiently.
> I also realize you may be thinking of certain "make" functionality that
> does something like this.
>
> We are currently working on improvements to Swift's data access mechanisms.
> Many applications create temporary intermediate data, so we are definitely
> looking at this.
>
> From Swift's perspective, there are two aspects: the garbage collection or
> automated delete and the data placement between job executions. The garbage
> collection is something that could be handled in a few ways (cache
> management) or simply by tearing down the intermediate storage system. The
> data placement involves using an intermediate storage system that is at the
> compute site, preventing full stage out to the client, and ensuring that
> this storage system is accessible to both the producer and consumer of the
> pipeline data. (Swift assumes that there is one permanent filesystem, the
> one from which it is run, and uses staging for everything else. A given
> pair of jobs could execute at separate sites with different filesystems.)
>
> There is "beta" functionality in the Swift trunk to directly utilize a local
> filesystem (that at least two applications are using). If there is a
> "scratch" filesystem that you can use, I can direct you to that. We are
> also productizing the ability to set up a temporary storage system for use
> by Swift, but that is not available yet.
>
> On Wed, 1 Sep 2010, John Dennis wrote:
>
>> Justin,
>>
>> I am a little confused by your response that cleaning up temporary
>> files is not the responsibility of the Swift language. We did not
>> create the file
>> 'wgt_files-935f5705-27ed-4a99-9420-441269bba3a0-36-4-0-array' Swift did. I
>> certainly have no use for it. It was created
>> as part of the parallelization process. Consider the following bit of
>> pseudo swift code
>>
>> foreach years {
>>     file wgt_files[];
>>     foreach month {
>>         wgt_files[] = DoSomething();
>>     }
>> }
>>
>> The 'wgt_files' is only in scope within the 'foreach years' loop.
>> Once all iterations of the 'foreach years' loop have completed,
>> I would expect 'wgt_files' to be deleted, as any variable/file should be
>> once it goes out of scope. Isn't this really an issue of garbage collection
>> for the Swift language?
>>
>> While I do see how we could use external variables to manage
>> this all ourselves, that would significantly complicate the
>> source code and remove much of the simple and elegant solution that Swift
>> provides.
>>
>> Matthew and I are concerned about this because of the impact this
>> has on disk usage. For example our Swift script
>> requires temporary space of size 4x the input data. Our generated data is
>> tiny, while the size of the _concurrent directory
>> is 2x the size of the input data. Now we want to execute the Swift script
>> on ~30 TB of data. So just to enable parallel execution
>> with Swift would require an extra 120TB of disk space. I realize that
>> parallel execution will consume more disk space but this seems
>> excessive.
>>
>> Thanks,
>> John Dennis
>>
>>
>>
>> On Aug 30, 2010, at 3:54 PM, Justin M Wozniak wrote:
>>
>>> Hi Matthew
>>> Deleting files is out of the scope of the Swift language. You can
>>> of course remove them yourself in your scripts, and as long as Swift does
>>> not try to stage them out you should be fine.
>>> You may want to look at external variables as another way to
>>> approach this (manual 2.5). Using external variables you can manage the
>>> files in your scripts while maintaining the Swift progress model.
>>> Justin
>>>
>>> On Fri, 27 Aug 2010, Matthew Woitaszek wrote:
>>>>
>>>> Good afternoon,
>>>>
>>>> I'm working with a script that creates arrays of intermediate files
>>>> using the anonymous concurrent mapper, such as:
>>>>
>>>> file wgt_file[];
>>>>
>>>> As I expect, all of these files get generated in the remote swift
>>>> temporary directory and are then returned to the _concurrent directory
>>>> on the host executing Swift. However, in this particular application,
>>>> they're then immediately consumed by a subsequent procedure and never
>>>> needed again.
>>>>
>>>> Is there a way to configure Swift or the file mapper declaration to
>>>> delete these files after the remaining script "consumes" them? (That
>>>> is, after all procedures relying on them as inputs have been
>>>> executed?) Or can (should?) that be done manually?
>>>>
>>>> More speculatively, is there a way to keep files like these on the
>>>> execution host and not even bring them back to _concurrent? (With loss
>>>> of generality, I'm executing on a single site, and don't really ever
>>>> need the file locally, for restarts or staging to another site.)
>>>>
>>>> Any advice about managing copies of large intermediate data files in
>>>> the Swift execution context would be appreciated!
>>>>
>>>> Matthew
>>>> _______________________________________________
>>>> Swift-user mailing list
>>>> Swift-user at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>>>
>>>
>>> --
>>> Justin M Wozniak
>>
>
> --
> Justin M Wozniak
>