[Swift-devel] Re: More questions on Provenance

Tue Jul 28 08:35:19 CDT 2009

Hi Tanu. I'm long gone. But here are a few brief comments. I added 
swift-devel.

On Mon, 27 Jul 2009, Tanu Malik wrote:

> 1. How do you model the provenance for across the network transfers?
> In that case the input is some file, the process is the file transfer process
> and the
> output would be on another machine. The output will have to be created
> manually
> which either mentions the success of the transfer or failure.

Provenance is recorded at a more abstract level than the one at which 
file transfers exist. A procedure takes input files which are described 
by URLs relative to the submit-side run directory and produces output 
files described the same way.

The mechanics of moving those files to the runtime sites as needed, and 
of managing the cache of them, happen inside the procedure execution and 
are not exposed as explicit activity.

Information about such transfers is logged, though, so if desired it might 
be possible to make another level of description about what happened there 
(one of the interesting things with ongoing OPM work is how to describe 
the same activity at multiple levels like this).

> 2. Also you mention something about the number of runs in your 
> presentation. "extra records – depth of graph x number of runs". What 
> does the number of runs correspond to and how is that modeled in the DB.

This is about constructing an explicit transitive closure of the 
procedure/dataset graph.

If you have an explicit graph A->B, B->C then constructing the closure 
means you need to add A->C as an edge. That's what I mean by roughly 
proportional to depth of graph: the deeper the graph, the more edges you 
need to add.
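
As a rough illustration of the closure construction described above, here 
is a minimal Python sketch. It is not taken from the Swift provenance code; 
the function names (transitive_closure, chain) and the node labels are 
made up for the example, and the graph is assumed to be a simple set of 
(source, destination) pairs.

```python
from collections import defaultdict

def transitive_closure(edges):
    """Return every (src, dst) pair where dst is reachable from src."""
    succ = defaultdict(set)
    for a, b in edges:
        succ[a].add(b)
    closure = set()
    for start in list(succ):
        # Depth-first walk from each node that has outgoing edges.
        stack = list(succ[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(succ.get(node, ()))
    return closure

def chain(n):
    """Hypothetical linear graph with n edges: 0->1->...->n."""
    return {(i, i + 1) for i in range(n)}

explicit = {("A", "B"), ("B", "C")}
added = transitive_closure(explicit) - explicit
print(sorted(added))  # the one extra edge, A->C

# Extra edges needed for deeper and deeper chains.
for depth in (2, 3, 4):
    g = chain(depth)
    print(depth, len(transitive_closure(g)) - len(g))
```

For a plain chain the number of added edges grows quickly with depth; how 
it grows for a real procedure/dataset graph depends on its shape.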

In the most recent implementation, each invocation of Swift is a subgraph 
disconnected from the subgraphs of all other invocations of Swift. So (if 
you make the often invalid but also often valid assumption that each 
invocation of Swift generates roughly the same size provenance output), 
the size of the combined graph is roughly proportional to the number of 
runs.

If further work were done to identify datasets across the graphs of 
different runs as being the same (using some identity relation such as 
same filename or something else), then generating a transitive closure 
could produce graphs that grow faster than in proportion to the number of 
runs.
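
To make the difference concrete, here is a small Python sketch under some 
loud assumptions: each run is a single hypothetical procedure that reads 
file f<i> and writes file f<i+1>, and "same filename" is the identity 
relation. None of these names come from Swift itself; this only 
illustrates the counting argument.

```python
from collections import defaultdict

def closure_size(edges):
    """Number of (src, dst) pairs in the transitive closure of edges."""
    succ = defaultdict(set)
    for a, b in edges:
        succ[a].add(b)
    total = 0
    for start in list(succ):
        stack, seen = list(succ[start]), set()
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(succ.get(node, ()))
        total += len(seen)
    return total

def make_runs(k):
    """Hypothetical runs: run i turns file f<i> into file f<i+1>."""
    return [[(f"f{i}", f"f{i + 1}")] for i in range(k)]

def disjoint(run_graphs):
    """Keep runs separate by tagging every dataset with its run number."""
    return {((r, a), (r, b)) for r, g in enumerate(run_graphs) for a, b in g}

def merged_by_filename(run_graphs):
    """Treat datasets with the same filename as one node across runs."""
    return {(a, b) for g in run_graphs for a, b in g}

for k in (1, 2, 4, 8):
    rs = make_runs(k)
    print(k, closure_size(disjoint(rs)), closure_size(merged_by_filename(rs)))
```

With the runs kept disjoint the closure size equals the number of runs in 
this toy setup; once the filenames are merged, the runs link into one 
chain and the closure grows faster than that.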

> I was also wondering if we can chat on the phone or I come up again to 
> discuss a possible collaboration on this project and present some of our 
> new results.

Nothing involving me except by very occasional email or if you hunt me 
down in person and ply me with alcohol.

--