[Swift-devel] Provenance use case

Michael Wilde wilde at mcs.anl.gov
Wed Mar 25 13:55:42 CDT 2009


As the OOPS guys are close to running, they would like, and would benefit
from, provenance recording.

Their (and our) requirements are something like the list below. I'm sending
this out now because it's fresh in my mind after talking to Glen, but I
probably can't get deep into a discussion of it for the next month or so.

I hope it's of some use in steering and setting priorities for provenance
work.

I'd like to do the same for other groups, focusing on CNARI where 
provenance is a funded, committed deliverable.

- Mike



Swift project needs:

- record all their runs so we can track the (hopeful) growth in their
usage of Swift

- collect all in one place and report runs and usage by system, user, etc.

OOPS User needs:

- track all their runs so they can readily find all their generated data
- know what input parameters were used, both at the Swift level and in the
config file settings that were passed in
- exactly what (svn) code rev was run: unlike users of canned apps, 
these guys run their own code and are constantly changing it
- how fast did each run go as a function of code rev, input args, and 
target system type
- record in an annotation database some science characteristics about 
each run: both on the individual output files and on a set of 
simulations (called a "round"). These attributes are gleaned or
computed by examining output, possibly doing some computations on it, and
in many cases averaging and finding ranges and other stats from a round.
(A round is 100 to 2000 runs of their simulation program. Their
simulation has a notion of "goodness", i.e. how close to a known protein
structure did the simulation get. That's a key attribute they compute and
track.)
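
As a rough sketch of the per-round summary described above, assuming each
run in a round yields a single numeric goodness score: the function name
and the dictionary keys are made up for illustration and are not taken
from the oops code or any existing annotation schema.

```python
from statistics import mean


def summarize_round(scores):
    """Summarize the per-run goodness scores of one round.

    `scores` is a list of numbers, one per simulation run (hypothetical
    input shape; the real oops output format is not assumed here).
    """
    if not scores:
        raise ValueError("a round needs at least one run")
    return {
        "runs": len(scores),   # how many runs made up the round
        "mean": mean(scores),  # average goodness over the round
        "min": min(scores),    # range, low end
        "max": max(scores),    # range, high end
    }
```

A record like this, keyed by a round identifier, is the sort of thing
that could be stored in the annotation database alongside the per-file
attributes.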

A note: the svn code rev tracking issue raises interesting needs.

Presumably the oops Swift script will change little across oops code
revs, but you want some kind of traceability from:

the app() proc name
the tc.data entry name
the tc.data entry path
what svn rev that path was pointing to or symlinked to at the moment of 
execution

How this is managed will vary from group to group, but we can set and 
suggest some standards that make tracking practical.

In the case of oops, the SVN rev is placed in a REVISION file near the 
top of the dist tree (and could be placed anywhere where provenance 
recording might be able to pick it up from).
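
One way provenance recording might pick this up, purely as a sketch: start
from the executable path given in tc.data and walk up a few directory
levels looking for the REVISION file. The function name, the search depth
and the assumption that the file holds just the rev string are all
illustrative, not how oops or Swift actually do it.

```python
import os


def find_revision(exe_path, filename="REVISION", max_levels=5):
    """Look for a REVISION file at or above the executable's directory.

    Returns the stripped file contents, or None if nothing is found
    within `max_levels` directories (an arbitrary illustrative limit).
    """
    # Resolve symlinks first so we search the tree that is really in use.
    directory = os.path.dirname(os.path.realpath(exe_path))
    for _ in range(max_levels):
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            with open(candidate) as f:
                return f.read().strip()
        parent = os.path.dirname(directory)
        if parent == directory:  # reached the filesystem root
            break
        directory = parent
    return None
```

This has to run where the dist tree is visible, which may be the
execution site rather than the submit host.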

In recent oops testing, we make every src dir on every site an svn
checkout, and do svn update to bring the tree up to date.  Hence the
path in tc.data doesn't change as the code evolves.

An earlier strategy we tried was to generate src distros on a central
host like communicado, put the svn rev in the distro's tarball name and
top-level dir, extract it to that dir on each target site, and build the
code.  Then a symlink was adjusted to point to the "latest" rev, and thus
the value of the symlink contained the svn rev.
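
With the symlink scheme, the rev could be recovered by reading the link
target at execution time. The sketch below assumes a directory naming
convention like "oops-r1234", which is hypothetical; the actual names
used for the distro dirs are not given here, so the pattern would need
adjusting.

```python
import os
import re


def revision_from_symlink(link_path, pattern=r"r(\d+)"):
    """Return the svn rev embedded in a symlink target's name, or None.

    `pattern` encodes an assumed naming convention (e.g. "oops-r1234")
    and must contain one capture group for the rev number.
    """
    if not os.path.islink(link_path):
        return None
    # readlink gives the link's value as stored, without resolving further.
    target = os.readlink(link_path)
    name = os.path.basename(target.rstrip("/"))
    match = re.search(pattern, name)
    return match.group(1) if match else None
```

Reading the link value at the moment of execution matters, since the
link may be repointed to a newer rev between runs.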

I think some way to track this is a real provenance requirement. It's
easy to do in arbitrary ad-hoc ways, but rather harder to implement a
single uniform way to grab such information at the time that a binary
executable is chosen.  There's also the major issue of info available on
the submit host where tc.data resides vs. that which can be captured at
runtime, e.g. in wrapper.sh.  And how to make it low-overhead, etc.






