[Swift-devel] Re: Provenance use case

Glen Hocky hockyg at uchicago.edu
Thu Mar 26 16:18:53 CDT 2009


>
>> > - exactly what (svn) code rev was run: unlike users of canned apps, these guys
>> > run their own code and are constantly changing it
>>     
>
> This could be determined for every invocation remotely and collected. That 
> might be expensive. Less reliably it could be measured once per site, but 
> that would not detect someone changing the software in the middle of a 
> run.
>   
A suggestion for this: what if we wrote a script that echoes any 
relevant data (e.g. svn-code-rev) in whatever format you think best? If 
this is installed along with the rest of the code, we could just stick 
it into the tc.data file and write an app wrapper for it. Would that 
make it easier to incorporate this information into a provenance 
collection feature of Swift?
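
To make the idea concrete, here is a rough sketch of what such a script 
might look like. It assumes the code lives in an svn working copy and 
that the standard `svnversion` command is on the path at the site; the 
function names and the key=value output format are only placeholders, 
not anything Swift defines today.

```python
import subprocess
import sys


def svn_revision(code_dir, run=subprocess.run):
    """Return the revision string printed by `svnversion`, or "unknown".

    `run` is injectable so the function can be exercised without svn.
    """
    try:
        result = run(
            ["svnversion", code_dir],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # svn is missing or the command failed; report that rather than crash.
        return "unknown"
    return result.stdout.strip() or "unknown"


def format_provenance(fields):
    """Render the collected fields as sorted key=value lines (assumed format)."""
    return "\n".join(f"{key}={value}" for key, value in sorted(fields.items()))


if __name__ == "__main__":
    # The code directory is the first argument, defaulting to the current one.
    code_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    print(format_provenance({"svn-code-rev": svn_revision(code_dir)}))
```

The script would still need its own tc.data entry and app wrapper, as 
described above, and running it once per invocation carries the cost 
Ben mentions.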



Ben Clifford wrote:
> On Wed, 25 Mar 2009, Michael Wilde wrote:
>
>   
>> - record all their runs so we can track the (hopeful) growth in their usage of
>> swift
>>     
>
> so that sounds like you want number of jobs and duration of jobs?
>
>   
>> - collect all in one place and report runs and usage by system, user, etc.
>>     
>
> by 'system', you mean site as in what is defined in sites.xml?
>
> user is determinable by submit-side unix user?
>
> what is etc?
>
>   
>> - track all their runs so they can readily find all their generated data
>>     
>
> elaborate on this - you want to get a list of every data file generated 
> with details of what was run to create it? or different to that?
>
>
>   
>> - exactly what (svn) code rev was run: unlike users of canned apps, these guys
>> run their own code and are constantly changing it
>>     
>
> This could be determined for every invocation remotely and collected. That 
> might be expensive. Less reliably it could be measured once per site, but 
> that would not detect someone changing the software in the middle of a 
> run.
>
>
>   
>> - record in an annotation database some science characteristics about each
>> run: both on the individual output files and on a set of simulations (called a
>> "round"). These attributes are gleaned or computed by examining output,
>> possibly doing some computations on it, and in many cases averaging and finding
>> ranges and other stats from a round.
>> (A round is 100 to 2000 runs of their simulation program. Their simulation has
>> a notion of "goodness" - i.e. how close to a known protein structure did the
>> simulation get. That's a key attribute they compute and track.)
>>     
>
> most of that sounds like simple database work that does not need swift to 
> be involved. where it does tie into Swift is getting SQL keys to relate 
> stuff in the provenance database to this annotation data.
>
> The following could be useful there: a globally unique URI to identify a 
> particular run (something like the run-id, packaged as a URI)
>
> For data, there are two different ways you can label. One is a unique 
> dataset identifier that is different per-run (that is, you run first.swift 
> twice and the output dataset has a different URI in each run, even though 
> it has the same filename, hello.txt, in both runs); and second is the 
> filename, which is easy to look at but doesn't take into account that 
> files are mutable on the filesystem.
>
> For storing annotations about data, presumably you would want to use one 
> of those two. Filename you can see easily without interacting with Swift. 
> If you want to use the dataset ID, then probably you would need some way 
> to give you those dataset IDs (e.g. feed in a filename and a run ID and get 
> told the dataset ID).
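
The lookup described here could be sketched roughly as below. This is 
only an illustration under assumptions: the `urn:x-swift-run:` prefix, 
the `DatasetIndex` class and the example run-ids are all hypothetical, 
and a real version would query the provenance database rather than an 
in-memory dictionary.

```python
import uuid


def run_uri(run_id):
    """Package a run-id as a URI (the URN prefix is a made-up placeholder)."""
    return f"urn:x-swift-run:{run_id}"


class DatasetIndex:
    """Maps (run ID, filename) pairs to per-run dataset identifiers."""

    def __init__(self):
        self._ids = {}

    def register(self, run_id, filename):
        """Assign a fresh dataset ID the first time a run produces a file."""
        key = (run_id, filename)
        if key not in self._ids:
            self._ids[key] = f"urn:uuid:{uuid.uuid4()}"
        return self._ids[key]

    def lookup(self, run_id, filename):
        """Feed in a run ID and a filename, get told the dataset ID."""
        return self._ids[(run_id, filename)]


index = DatasetIndex()
first = index.register("first-run-1", "hello.txt")
second = index.register("first-run-2", "hello.txt")
# Same filename, different runs: the dataset IDs differ.
print(first != second)
```

Annotations keyed on the dataset ID would then stay attached to the 
right run even after hello.txt is overwritten on the filesystem.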
>
>   
>> I think some way to track this is a real provenance requirement.
>>     
>
> yes
>
>   
>> It's easy to do in arbitrary ad-hoc ways, but rather harder to implement 
>> a single uniform way to grab such information at the time that a binary 
>> executable is chosen. There's also the major issue of info available on 
>> the submit host where tc.data resides vs. that which can be captured at 
>> runtime, e.g. in wrapper.sh. And how to make it low-overhead, etc.
>>     
>
> yes
>
>   



