[Swift-devel] swift with input and output sandboxes instead of a shared site filesystem
Michael Wilde
wilde at mcs.anl.gov
Wed Apr 22 07:42:58 CDT 2009
This sounds good, Ben. I've only skimmed it so far; I need to read it
properly and will comment later.
We should compare it to the experiments on collective I/O (currently
named "many-task data management") that Allan, Zhao, Ian, and I are doing.
The approach involves some things that would apply in the two
environments you mention as well (a rough sketch follows these lists):
- pull input files to the worker node FS rather than push them multihop
- batch output files up into tarballs and expand them back on their
target filesystem (on the submit host for now)
It also involves some things that apply more to large clusters, but may
be generalizable to generic grid environments:
- broadcast common files used by many jobs from the submit host to the
worker nodes
- use "intermediate" filesystems striped across cluster nodes, rather
than local filesystems on cluster nodes, where this is more efficient or
is needed
- have the worker nodes selectively access files from local,
intermediate, or global storage, depending on where the submit host
workflow decided to place them
- keep a catalog in the cluster of what files are on what local host,
and a protocol to transfer them from where they were produced to where
they need to be consumed (this feature is like data diffusion, but needs
more thought and experience to determine how useful it is; not many
workflows need it).
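As a concrete (if oversimplified) illustration of the first two
techniques - pulling inputs and batching outputs - the worker-side
logic amounts to something like the following. This is only a sketch:
globus-url-copy is just one plausible transport, and every path and
URL here is invented.

    # on the worker node: pull inputs straight to local disk, rather
    # than having the submit host push them through intermediate hops
    globus-url-copy gsiftp://storage.example.org/run42/in.dat \
        file://$PWD/in.dat

    # run the application
    ./app in.dat out.dat

    # batch the outputs into one tarball so they travel back in a
    # single transfer, to be expanded on the target filesystem
    tar czf outputs.tar.gz out.dat
    globus-url-copy file://$PWD/outputs.tar.gz \
        gsiftp://submit.example.org/run42/outputs.tar.gz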
So far this is being discussed among me, Ioan, and Ian, but we'd
like to get you, Mihael, and anyone else interested to join in; Kamil
Iskra and Justin Wozniak from the MCS ZeptoOS and Radix groups are
involved as well.
We should use this thread to discuss your I/O strategy below first,
before we bring in the MTDM experiments, but one common thread seems to
be making the changes in Swift data management that allow us to
explore these new data management modes.
If you recall, we've had discussions in the past about having something
like "pluggable data management strategies" that would allow a given script
to be executed in different environments with different strategies,
either set globally or set per site.
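For example - and this is purely hypothetical, since no such profile
key exists today - a per-site choice might look like this in sites.xml:

    <pool handle="sagrid-site">
      <!-- hypothetical key: pick the tarball/sandbox data management
           strategy for this site only -->
      <profile namespace="swift" key="dataManagement">sandbox</profile>
    </pool>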
I'm offline a lot until Monday with a proposal deadline, and hope to
comment and rejoin the discussion by then or shortly after.
- Mike
On 4/22/09 7:12 AM, Ben Clifford wrote:
> I implemented an extremely poor quality prototype to try to get my head
> around some of the execution semantics when running through:
>
> i) the gLite workload management system (hereafter, WMS) as used in the
> South African National Grid (my short-term interest) and in EGEE (my
> longer-term interest)
>
> ii) condor used as an LRM to manage nodes which do not have a shared
> filesystem, without any "grid stuff" involved
>
> In the case of the WMS, it is a goal to have the WMS perform site
> selection, rather than submitting clients (such as Swift). I don't
> particularly agree with this, but there it is.
>
> In the case of condor-with-no-shared-fs, one of the basic requirements of
> a Swift site is violated - that of an easily accessible shared file
> system.
>
> Both the WMS and condor provide an alternative to Swift's file management
> model, and their two approaches look similar.
>
> In a job submission, one specifies the files to be staged into an
> arbitrary working directory before execution, and the files to be staged
> out after execution.
>
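> For condor, for example, that boils down to a submit file along these
> lines (a sketch only; the file names are made up):
>
>     # job.submit - hypothetical example
>     executable              = run.sh
>     should_transfer_files   = YES
>     when_to_transfer_output = ON_EXIT
>     transfer_input_files    = input.tar.gz
>     transfer_output_files   = output.tar.gz
>     log                     = job.log
>     queue
>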
> My prototype is intended to get practical experience interfacing Swift to
> a job submission with those semantics.
>
> What I have done in my implementation is rip out almost the entirety of
> the execute2/site file cache/wrapper.sh layers, and replace it with a
> callout to a user-specified shell script. The shell script is passed the
> submit-side paths of input files and of output files, and the command line.
>
> The shell script is then entirely responsible for causing the job to run
> somewhere and for doing appropriate input and output staging.
>
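> To make that concrete, here is a skeleton with an invented calling
> convention (the real prototype's interface may differ), just to show
> the shape of the contract:
>
>     #!/bin/bash
>     # hypothetical interface:
>     #   $1 = file listing submit-side input paths
>     #   $2 = file listing submit-side output paths
>     #   remaining arguments = the command line to run
>     inputlist=$1; outputlist=$2; shift 2
>
>     # the script may stage and run however it likes; trivially,
>     # a local run in a scratch directory:
>     workdir=$(mktemp -d)
>     while read -r f; do cp "$f" "$workdir/"; done < "$inputlist"
>     (cd "$workdir" && "$@")
>     while read -r f; do
>         cp "$workdir/$(basename "$f")" "$f"
>     done < "$outputlist"
>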
> Into this shell interface, I then have two scripts, one for sagrid/glite
> and one for condor-with-no-shared-fs.
>
> They are similar to each other, differing only in the syntax of the
> submission commands and files.
>
> These scripts create a single input tarball, create a job submission file,
> submit it with the appropriate submit command, hang around polling for
> status until the job is finished, and unpack an output tarball. Tarballs are
> used rather than explicitly listing each input and output file for two reasons:
> i) if an output file is missing (perhaps due to application failure) I
> would like the job submission to still return what it has (most especially
> remote log files). As long as a tarball is made with *something*, this
> works. ii) condor (and perhaps WMS) apparently cannot handle directory
> hierarchies in their stagein/stageout parameters.
>
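> In outline, the condor variant of the script does something like this
> (file names illustrative; condor_wait stands in for the status-polling
> loop):
>
>     # create the single input sandbox from the stagein listing
>     tar czf input.tar.gz -T input-files.txt
>
>     # submit, then wait for the job to leave the queue
>     condor_submit job.submit
>     condor_wait job.log
>
>     # expand the output sandbox; this succeeds even if the
>     # application failed, as long as *something* was tarred up
>     tar xzf output.tar.gz
>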
> I have tested on the SAgrid testing environment (for WMS) and this works
> (although quite slowly, as the WMS reports job status changes quite
> slowly); and on a condor installation on gwynn.bsd.uchicago.edu (this has
> a shared filesystem, so is not a totally satisfactory test). I also sent
> this to Mats to test in his environment (as a project he has was my
> immediate motivation for the condor side of this).
>
> This prototype approach loses a huge chunk of Swift execution-side
> functionality such as replication, clustering, and coasters (deliberately - I
> was targeting getting SwiftScript programs running, rather than getting a
> decent integration with the interesting execution stuff we have made).
>
> As such, it is entirely inappropriate for production (or even most
> experimental) use.
>
> However, it has given me another perspective on submitting jobs to the
> above two environments.
>
> For condor:
>
> The zipped input/output sandbox approach seems to work nicely.
>
> To mould this into something more in tune with what is in Swift now is, I
> think, not crazy hard - the input and output staging parts of execute2
> would need to change into something that creates/unpacks a tarball and
> appropriately modifies the job description so that when it is run by the
> existing execution mechanism, the tarballs get carried along. (to test if
> you bothered reading this, if you paste me the random string H14n$=N:t)Z
> you get a free beer)
>
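> On the remote side, carrying the tarballs along implies a wrapper
> roughly like this (purely illustrative - this is not what wrapper.sh
> does today):
>
>     #!/bin/bash
>     # unpack the input sandbox into the scratch directory we were given
>     tar xzf input.tar.gz
>
>     # run the real application; don't abort on failure, because we
>     # still want to ship back whatever it produced (logs especially)
>     "$@"; rc=$?
>
>     # package everything that exists, so stageout never comes up empty
>     tar czf output.tar.gz --exclude=input.tar.gz --exclude=output.tar.gz .
>     exit $rc
>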
> As specified above, that approach does not work with clustering or with
> coasters, though both could be modified to support it (for
> example, clustering could be made to merge all stagein and stageout
> listings for jobs; and coasters could be given a different interface to
> the existing coaster file transfer mechanism). It might be that coasters
> and clusters are not particularly desired in this environment, though.
>
> For glite execution - the big loss here I think is coasters, because it's a
> very spread-out grid environment. So with this approach, applications
> which work well without coasters will probably work well; but applications
> which are reliant on coasters for their performance will work as dismally
> as when run without coasters in any other grid environment. I can think of
> various modifications, similar to those mentioned in the condor section
> above, to try to make them work through this submission system, but it
> might be that a totally different approach to my above implementation is
> warranted for coaster based execution on glite, with more explicit
> specification of which sites to run on, rather than allowing the WMS any
> choice, and only running on sites which do have a shared filesystem
> available.
>
> I think in the short term, my interest is in getting this stuff more
> closely integrated without focusing too much on coasters and clusters.
>
> Comments?
>