[Swift-devel] Re: More questions on Provenance

Tue Jul 28 11:12:07 CDT 2009

Thanks Ben,

This is very helpful. I wish I could hunt you down.
Interesting to know about the recent OPM work.
We have defined network nodes in our model to represent those transfers explicitly.
I did not know about OPM.

Thanks

Ben Clifford wrote:
> Hi Tanu. I'm long gone. But here are a few brief comments. I added 
> swift-devel.
>
> On Mon, 27 Jul 2009, Tanu Malik wrote:
>
>   
>> 1. How do you model the provenance for transfers across the network?
>> In that case the input is some file, the process is the file transfer
>> process, and the output would be on another machine. The output will have
>> to be created manually, and it records either the success or the failure
>> of the transfer.
>>     
>
> Provenance is recorded at a more abstract level than the one at which 
> file transfers exist. A procedure takes input files which are described 
> by URLs relative to the submit-side run directory and produces output 
> files described in the same way.
>
> The mechanisms for moving those files around to runtime sites as needed 
> and managing the cache of them are internal to the procedure execution 
> and are not exposed as explicit activity.
>
> Information is logged about such transfers though so if desired it might be 
> possible to make another level of description about what happened there 
> (one of the interesting things with ongoing OPM work is how to describe 
> the same activity at multiple levels like this).
>
>   
>> 2. Also you mention something about the number of runs in your 
>> presentation: "extra records ∝ depth of graph x number of runs". What 
>> does the number of runs correspond to, and how is that modeled in the DB?
>>     
>
> This is about constructing an explicit transitive closure of the 
> procedure/dataset graph.
>
> If you have an explicit graph A->B, B->C then constructing the closure 
> means you need to add A->C as an edge. That's what I mean by roughly 
> proportional to depth of graph - the deeper the graph, the more edges you 
> need to add.
>
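
A minimal Python sketch of the closure construction described above, added
for illustration only. It assumes the graph is held as a plain list of
(source, target) pairs; the names transitive_closure and extra_edges are
hypothetical and are not part of Swift or its provenance database.

```python
def transitive_closure(edges):
    """Return every (src, dst) pair where dst is reachable from src."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, set()).add(dst)
    closure = set()
    for start in succ:
        stack, seen = list(succ[start]), set()
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                closure.add((start, node))
                stack.extend(succ.get(node, ()))
    return closure


def extra_edges(depth):
    """Count the edges the closure adds to a simple chain with `depth` edges."""
    chain = [(i, i + 1) for i in range(depth)]
    return len(transitive_closure(chain) - set(chain))


# The example from the message: A->B and B->C require A->C to be added.
edges = [("A", "B"), ("B", "C")]
extra = transitive_closure(edges) - set(edges)
print(extra)  # {('A', 'C')}

# Deeper chains need more added edges (an assumed, simplified graph shape).
for depth in (2, 4, 8):
    print(depth, extra_edges(depth))
```

For a simple chain the added edges per explicit edge grow with the depth,
which is the sense in which deeper graphs cost more records.
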
> In the most recent implementation, each invocation of Swift is a subgraph 
> disconnected from the subgraphs of all other invocations of Swift. So (if 
> you make the often invalid but also often valid assumption that each 
> invocation of Swift generates roughly the same size provenance output), 
> the size of the combined graph is roughly proportional to the number of 
> runs.
>
> If further work were done to identify datasets across the graphs of 
> different runs with one another (using some identity relation such as 
> same filename or something else), then generating a transitive closure 
> could produce graphs that grow faster than the number of runs.
>
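
A small Python sketch of the difference between keeping each run as its own
subgraph and identifying datasets across runs by filename. The run names,
file names, and the dataset-to-dataset edge representation are invented for
illustration; the actual Swift provenance schema is not shown here.

```python
def transitive_closure(edges):
    """Return every (src, dst) pair where dst is reachable from src."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, set()).add(dst)
    closure = set()
    for start in succ:
        stack, seen = list(succ[start]), set()
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                closure.add((start, node))
                stack.extend(succ.get(node, ()))
    return closure


# Hypothetical runs: run2 consumes a file that run1 produced.
runs = {
    "run1": [("in.dat", "mid.dat"), ("mid.dat", "out.dat")],
    "run2": [("out.dat", "plot.png")],
}

# Case 1: every node is tagged with its run, so the subgraphs stay disconnected.
separate = [((run, a), (run, b)) for run, es in runs.items() for a, b in es]

# Case 2: datasets with the same filename are treated as the same node.
merged = [(a, b) for es in runs.values() for a, b in es]

print(len(transitive_closure(separate)))  # 4: the sum of the per-run closures
print(len(transitive_closure(merged)))    # 6: paths now cross run boundaries
```

With disconnected subgraphs the closure is just the sum over runs; once
nodes are shared, paths that span runs add further edges.
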
>   
>> I was also wondering if we could chat on the phone, or if I could come 
>> up again, to discuss a possible collaboration on this project and 
>> present some of our new results.
>>     
>
> Nothing involving me except by very occasional email or if you hunt me 
> down in person and ply me with alcohol.
>
> --