[Swift-devel] Provenance use case
Michael Wilde
wilde at mcs.anl.gov
Wed Mar 25 13:55:42 CDT 2009
As the OOPS guys are close to running, they would like and would benefit
from provenance recording.
Their (and our) requirements are something like below. I'm sending this
out now because it's fresh in my mind after talking to Glen, but I probably
can't get deep into a discussion of it for the next month or so.
I hope it's of some use in steering and setting priorities for provenance
work.
I'd like to do the same for other groups, focusing on CNARI where
provenance is a funded, committed deliverable.
- Mike
Swift project needs:
- record all their runs so we can track the (hopeful) growth in their
usage of Swift
- collect all in one place and report runs and usage by system, user, etc.
OOPS User needs:
- track all their runs so they can readily find all their generated data
- know what input parameters were used, both at the Swift level and in
the config file settings that were passed in
- exactly what (svn) code rev was run: unlike users of canned apps,
these guys run their own code and are constantly changing it
- how fast did each run go as a function of code rev, input args, and
target system type
- record in an annotation database some science characteristics about
each run: both on the individual output files and on a set of
simulations (called a "round"). These attributes are gleaned or
computed by examining output, possibly doing some computations on it, and
in many cases averaging and finding ranges and other stats from a round.
(A round is 100 to 2000 runs of their simulation program. Their
simulation has a notion of "goodness", i.e. how close to a known protein
structure did the simulation get. That's a key attribute they compute and
track.)
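As a rough illustration of the per-round statistics described above, here is a minimal Python sketch. The function name summarize_round and the idea that each run yields a single numeric goodness score are assumptions made for the example, not a description of how OOPS actually computes or stores these values.

```python
from statistics import mean


def summarize_round(goodness_scores):
    """Return run count, mean, min, and max for one round's goodness values.

    Hypothetical helper: assumes one numeric goodness score per run.
    """
    scores = list(goodness_scores)
    if not scores:
        raise ValueError("a round needs at least one run")
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
    }
```

A dictionary like this could be stored as one annotation row per round, next to the per-file attributes.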
A note: the svn code rev tracking issue raises interesting needs.
Presumably the oops swift script will change little across oops code
revs, but you want some kind of traceability from:
the app() proc name
the tc.data entry name
the tc.data entry path
what svn rev that path was pointing to or symlinked to at the moment of
execution
How this is managed will vary from group to group, but we can set and
suggest some standards that make tracking practical.
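One possible shape for such a standard is a small record that carries the four items listed above for every app invocation. This is only a sketch in Python; the class name AppProvenance and its field names are hypothetical and do not correspond to anything in Swift today.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class AppProvenance:
    """Hypothetical per-invocation record linking a Swift app() call to a code rev."""

    app_proc: str           # the app() proc name in the Swift script
    tc_entry: str           # the tc.data entry name it resolved to
    tc_path: str            # the executable path from that tc.data entry
    svn_rev: Optional[str]  # svn rev behind that path at execution time, if known


def to_row(record):
    """Flatten a record into a plain dict, e.g. for a database insert."""
    return asdict(record)
```

Leaving svn_rev optional lets a group that has no revision convention yet still record the first three fields.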
In the case of oops, the SVN rev is placed in a REVISION file near the
top of the dist tree (and could be placed anywhere where provenance
recording might be able to pick it up from).
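A provenance recorder could pick up such a file by walking upward from the executable's path until it finds one. The Python sketch below assumes the file is literally named REVISION and contains just the revision string; the function name find_revision is made up for the example.

```python
from pathlib import Path


def find_revision(app_path, filename="REVISION"):
    """Search app_path's directory and its ancestors for a revision file.

    Returns the stripped file contents, or None if no such file is found.
    Assumes (hypothetically) that the file holds only the svn rev.
    """
    start = Path(app_path).resolve()
    # Begin at the path itself if it is a directory, else at its parent.
    candidates = [start, *start.parents] if start.is_dir() else list(start.parents)
    for directory in candidates:
        candidate = directory / filename
        if candidate.is_file():
            return candidate.read_text().strip()
    return None
```

The nearest REVISION file wins, so a file near the top of the dist tree covers every binary below it.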
In recent oops testing, we make every src dir on every site an svn
checkout, and do svn update to bring the tree up to date. Hence the
path in tc.data doesn't change as the code evolves.
An earlier strategy we tried was to generate src distros on a
central host like communicado, put the svn rev in the distro's tarball
name and top-level dir, extract it to that dir on each target
site, and build the code. Then a symlink was adjusted to point to the
"latest" rev, and thus the value of the symlink contained the svn rev.
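With that layout, the rev can be recovered by reading the symlink rather than following it. The Python sketch below assumes a directory naming convention like "oops-r1234", which is invented here for illustration; the actual distro names are not given above, so the pattern would need adjusting.

```python
import os
import re


def revision_from_symlink(link_path, pattern=r"-r(\d+)$"):
    """Extract an svn rev from the name of the directory a symlink points to.

    The default pattern assumes a hypothetical "<name>-r<rev>" convention.
    Returns the rev as a string, or None if the target name does not match.
    """
    target = os.readlink(link_path)  # raises OSError if link_path is not a symlink
    name = os.path.basename(target.rstrip("/"))
    match = re.search(pattern, name)
    return match.group(1) if match else None
```

Because os.readlink returns the link's value as stored, this captures what "latest" pointed to at the moment it is called, which is the moment that matters for provenance.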
I think some way to track this is a real provenance requirement. It's
easy to do in arbitrary ad-hoc ways, but rather harder to implement a
single uniform way to grab such information at the time that a binary
executable is chosen. There's also the major issue of info available on
the submit host where tc.data resides vs. that which can be captured at
runtime, e.g. in wrapper.sh. And how to make it low-overhead, etc.