[Swift-devel] Re: Provenance use case

Thu Mar 26 16:05:36 CDT 2009

On Wed, 25 Mar 2009, Michael Wilde wrote:

> - record all their runs so we can track the (hopeful) growth in their usage of
> swift

so that sounds like you want number of jobs and duration of jobs?

> - collect all in one place and report runs and usage by system, user, etc.

By 'system', do you mean site, as defined in sites.xml?

Is user determined by the submit-side unix user?

What does 'etc.' cover?

> - track all their runs so they can readily find all their generated data

Can you elaborate on this? Do you want a list of every data file
generated, with details of what was run to create it, or something
different from that?

> - exactly what (svn) code rev was run: unlike users of canned apps, these guys
> run their own code and are constantly changing it

This could be determined for every invocation remotely and collected. That 
might be expensive. Less reliably it could be measured once per site, but 
that would not detect someone changing the software in the middle of a 
run.
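
A rough sketch of the per-invocation variant, in Python. It assumes the
application code sits in an svn working copy on the execution side and
that the standard svnversion tool is on the PATH. The function names and
the returned dict layout are made up for illustration; nothing like this
exists in Swift today.

```python
import re
import subprocess

# svnversion prints e.g. "4168", "4123:4168" (mixed revisions) or "4168M"
# (locally modified); M, S and P are its trailing status flags.
_SVNVERSION = re.compile(r"^(\d+)(?::(\d+))?([MSP]*)$")


def parse_svnversion(text):
    """Turn svnversion output into a small record, or None if unversioned."""
    m = _SVNVERSION.match(text.strip())
    if m is None:
        return None
    low = int(m.group(1))
    high = int(m.group(2)) if m.group(2) else low
    return {"low": low, "high": high, "modified": "M" in m.group(3)}


def code_revision(working_copy):
    """Hypothetical hook: ask svnversion about one working copy.

    Calling this for every invocation is the expensive option; calling it
    once per site is cheaper but misses changes made mid-run.
    """
    out = subprocess.run(
        ["svnversion", working_copy],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_svnversion(out)
```

The "modified" flag matters for users who keep changing their own code:
a bare revision number does not describe a working copy with uncommitted
edits.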

> - record in an annotation database some science characteristics about each
> run: both on the individual output files and on a set of simulations (called a
> "round"). These attributes are gleaned or computed by examining output,
> possibly doing some computations on it, and in many cases averaging and finding
> ranges and other stats from a round.
> (A round is 100 to 2000 runs of their simulation program. Their simulation has
> a notion of "goodness" - i.e. how close to a known protein structure did the
> simulation get. That's a key attribute they compute and track.)

Most of that sounds like simple database work that does not need Swift to
be involved. Where it does tie into Swift is in getting SQL keys to relate
records in the provenance database to this annotation data.

The following could be useful there: a globally unique URI to identify a
particular run (something like the run-id, packaged as a URI).
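
As a minimal sketch of what "packaged as a URI" could look like, in
Python. The URI layout, the example.org authority and the run-id value
are all placeholders I made up; no such naming scheme has been agreed.

```python
from urllib.parse import quote


def run_uri(run_id, authority="example.org"):
    """Package a run-id as a URI under a hypothetical namespace.

    The run-id is percent-encoded so that any odd characters in it cannot
    change the structure of the URI.
    """
    return "http://%s/swift/run/%s" % (authority, quote(run_id, safe=""))


# Two runs of the same script get distinct URIs because their run-ids differ.
first = run_uri("first-20090326-1605-aaaa")
second = run_uri("first-20090326-1710-bbbb")
```

The point is only that the run-id already distinguishes runs on one
submit host; the authority part is what would make it unique globally.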

For data, there are two different ways to label it. One is a unique
dataset identifier that is different per run (that is, you run first.swift
twice and the output dataset has a different URI in each run, even though
it has the same filename, hello.txt, in both runs). The second is the
filename, which is easy to look at but doesn't take into account that
files are mutable on the filesystem.

For storing annotations about data, presumably you would want to use one
of those two. The filename you can see easily without interacting with
Swift. If you want to use the dataset ID, then you would probably need
some way to look those dataset IDs up (e.g. feed in a filename and a run
ID and be told the dataset ID).
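
A sketch of that lookup, and of hanging an annotation off the result,
using Python's sqlite3 module. The table names, column names and ID
values are hypothetical and are not the actual provenance database
schema; the "goodness" attribute is just the example from the quoted
text.

```python
import sqlite3

# Hypothetical schema: illustrative names only.
SCHEMA = """
CREATE TABLE dataset (
    dataset_id TEXT PRIMARY KEY,
    run_id     TEXT NOT NULL,
    filename   TEXT NOT NULL
);
CREATE TABLE annotation (
    dataset_id TEXT NOT NULL REFERENCES dataset(dataset_id),
    name       TEXT NOT NULL,
    value      TEXT NOT NULL
);
"""


def lookup_dataset_id(conn, run_id, filename):
    """Feed in a run ID and a filename, get back the dataset ID (or None)."""
    row = conn.execute(
        "SELECT dataset_id FROM dataset WHERE run_id = ? AND filename = ?",
        (run_id, filename),
    ).fetchone()
    return row[0] if row else None


def annotate(conn, dataset_id, name, value):
    """Attach one name/value annotation to a dataset."""
    conn.execute(
        "INSERT INTO annotation VALUES (?, ?, ?)",
        (dataset_id, name, str(value)),
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # The same filename produced by two different runs gets two dataset IDs.
    conn.executemany(
        "INSERT INTO dataset VALUES (?, ?, ?)",
        [("ds-0001", "run-A", "hello.txt"), ("ds-0002", "run-B", "hello.txt")],
    )
    annotate(conn, lookup_dataset_id(conn, "run-B", "hello.txt"), "goodness", 0.87)
    return conn
```

Keying annotations on the dataset ID rather than the filename means an
annotation stays attached to the output of one particular run even if
hello.txt is later overwritten.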

> I think some way to track this is a real provenance requirement.

yes

> It's easy to do in arbitrary ad-hoc ways, but rather harder to implement
> a single uniform way to grab such information at the time that a binary
> executable is chosen. There's also the major issue of info available on
> the submit host where tc.data resides vs that which can be captured at
> runtime, e.g. in wrapper.sh.  And how to make it low-overhead, etc.

yes

--