[Swift-devel] Re: oops provenance

Wed Apr 15 11:44:55 CDT 2009

On 4/15/09 10:58 AM, Ben Clifford wrote:
> If you want to look at the provenance db as it is now, read sections 2 and 
> 3 of this page:
> 
> http://www.ci.uchicago.edu/~benc/provenance.html#owndb
> 
> I recommend if you try this at home to use sqlite3, not postgres.
> 
>> I think a starting point for oops provenance is this: For every run, you want
>> to know:
> 
> many of these are straightforward to add, and i will look at doing so 
> after pc3 stuff
> 
>> - an ID for the run
> 
> this exists now

yes, "but". It's long and hard to manage. We have experience now with 
both Falkon and Swift in giving runs simple short IDs, and that has 
worked well. It's so much easier to talk about oops run 0042 than run 
*imqvgr8. The long ID is also useful but should be more hidden and internal.

How we do this should tie in with where we go with swift run management 
conventions.
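
To make the short-ID idea concrete, here is a minimal sketch in Python with sqlite3. It is only an illustration under assumptions: the run_alias table, its columns, and the short_run_id function are made-up names, not part of the existing provenance db or of how Falkon or Swift assign IDs today. The long ID stays in the table as the internal key, and users only see the short sequence number.

```python
import sqlite3

# Hypothetical alias table: maps the long internal run ID to a small integer.
ALIAS_SCHEMA = """
CREATE TABLE IF NOT EXISTS run_alias (
    seq     INTEGER PRIMARY KEY,   -- short, human-friendly sequence number
    long_id TEXT UNIQUE NOT NULL   -- long run ID, kept internal
)
"""

def short_run_id(conn, long_id):
    """Return a zero-padded short ID for long_id, allocating one if needed."""
    conn.execute(ALIAS_SCHEMA)
    row = conn.execute(
        "SELECT seq FROM run_alias WHERE long_id = ?", (long_id,)
    ).fetchone()
    if row is None:
        # First time this run is seen: the next integer key becomes its short ID.
        cur = conn.execute(
            "INSERT INTO run_alias (long_id) VALUES (?)", (long_id,)
        )
        conn.commit()
        seq = cur.lastrowid
    else:
        seq = row[0]
    return f"{seq:04d}"
```

Asking twice for the same long ID returns the same short ID, so "oops run 0042" would stay stable across queries. How the counter is scoped (per user, per application, per site) is exactly the run management convention question above and is left open here.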

> 
>> - analyzed scores of the run output
> 
> not sure what that is - is this application specific output?

yes
> 
>> - what version of oops was used
> 
> the extrainfo stuff I implemented previously for the oops app may be used 
> here. I've heard no feedback about it actually being used, though.

right, that is the solution. we need to test it.
> 
>> - what version of the oops.swift script was used
> 
> For all the version stuff, you need to figure out what version semantics 
> you want (eg md5sum of swift script, which gives fine grained version 
> distinction but no order; user specified version numbering which is pretty 
> much guaranteed to be wrong but you might think you want that, and also 
> gives ordering; ... there are lots of schemes ...)

hmmm - all those sound good - can we have them all? ;)

seriously, though - a few thoughts on this:

- i lean to close integration with svn on versions, ie use svn to 
version code, including swift scripts, and use svn revision IDs as well 
as software release numbers to define versions of code. Ie, oops rev 
0428 or oops release 1.2.4, depending on what you were running.

- i can now see the merits and use of the old vdl constructs 
namespace::name:version, and would like to explore how to use and 
integrate that into Swift.

- I think the md5sum etc stuff is useful, and also good for research into 
"airtight" provenance, but less immediately needed by users. And when 
added, it seems like the kind of thing that's nice to have always running 
in the background, to resolve thorny provenance questions, but should 
seldom be visible to the end user.
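
A rough sketch of how two of these schemes could sit side by side, in Python: a parser for the namespace::name:version form, and an md5 digest of the script text as the background fingerprint. The names VersionedName, parse_versioned_name and script_digest are hypothetical, and the parsing rule (split on the first "::" and the last ":") is my assumption about the syntax, not the VDL definition.

```python
import hashlib
from typing import NamedTuple

class VersionedName(NamedTuple):
    namespace: str
    name: str
    version: str

def parse_versioned_name(text):
    """Split 'namespace::name:version' into its three parts."""
    namespace, sep, rest = text.partition("::")
    if not sep or not namespace:
        raise ValueError(f"missing namespace in {text!r}")
    # The version is whatever follows the last ':' after the namespace.
    name, sep, version = rest.rpartition(":")
    if not sep or not name or not version:
        raise ValueError(f"missing name or version in {text!r}")
    return VersionedName(namespace, name, version)

def script_digest(script_bytes):
    """Fine-grained fingerprint of a script: distinguishes versions, gives no order."""
    return hashlib.md5(script_bytes).hexdigest()
```

For example, parse_versioned_name("oops::rama:1.2.4") gives the user-visible, ordered version, while script_digest() of the oops.swift contents could be recorded quietly with each run to settle later questions about what actually ran.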

>> Given this in a database, you could also compare structure scores for one
>> version of code or one algorithm vs another
> 
> This is more application specific data?

I think it's a join of app-specific and swift-maintained. Eg, in the 
current oops.swift script, the user can specify via cmd line arg which 
of 2 oops algorithms to use ("classic" or "rama"). So I could easily see 
a parameter sweep that says: for each protein in plist, do the full 
sweep for both algorithms, and give me tables, plots etc that let me 
compare them. So far, that is more "application" than provenance. But 
now, do the same thing but compare rama 1.2.4 with rama 1.2.6. Depending 
on how that's expressed, it could utilize provenance info. Especially if 
the question was asked "retrospectively" on the provenance data, as 
opposed to set up in advance as a comparative workflow. Ie, look at the 
runtime-per-simulation of each of the last 3 rama versions.
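
As an illustration of such a retrospective question, here is a sketch of the runtime-per-simulation query in Python with sqlite3. The runs table and its columns are assumptions made up for this example. They are not the schema of the current provenance db, which would need the version and per-run counts recorded first.

```python
import sqlite3

# Hypothetical table joining swift-maintained run info with app-level facts.
RUNS_SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    algorithm       TEXT NOT NULL,    -- e.g. 'classic' or 'rama'
    version         TEXT NOT NULL,    -- e.g. '1.2.4'
    simulations     INTEGER NOT NULL, -- simulations performed in the run
    runtime_seconds REAL NOT NULL     -- total runtime of the run
)
"""

def runtime_per_simulation(conn, algorithm):
    """Return {version: mean seconds per simulation} for one algorithm."""
    rows = conn.execute(
        "SELECT version, SUM(runtime_seconds) * 1.0 / SUM(simulations) "
        "FROM runs WHERE algorithm = ? "
        "GROUP BY version ORDER BY version",
        (algorithm,),
    ).fetchall()
    return dict(rows)
```

Note that ORDER BY on a text version column sorts as strings, which is one more reason the version semantics need deciding before queries like "the last 3 versions" can be answered reliably.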

> Are you looking here to have data output from a run end up in a database?

Yes, that's being considered, as an application thing, in addition to and 
separate from the provenance data.

I will send the OOPS paper to swift and try to get it posted on the swift 
web soon. It's got some nice stats in sec 5 on #runs, that would be great 
to derive on a running basis from collected provenance data.

- Mike