[Swift-devel] Re: More questions on Provenance
Ben Clifford
benc at hawaga.org.uk
Tue Jul 28 08:35:19 CDT 2009
Hi Tanu. I'm long gone. But here are a few brief comments. I added
swift-devel.
On Mon, 27 Jul 2009, Tanu Malik wrote:
> 1. How do you model the provenance for across the network transfers?
> In that case the input is some file, the process is the file transfer process
> and the
> output would be on another machine. The output will have to be created
> manually
> which either mentions the success of the transfer or failure.
Provenance is recorded at a more abstract level than the one at which
file transfers exist. A procedure takes input files which
are described by URLs relative to the submit-side run directory and
produces output files described by the same.
The mechanisms for moving those files around to runtime sites as needed,
and for managing the cache of them, are internal to the procedure
execution and are not exposed as explicit activity.
Information is logged about such transfers, though, so if desired it might
be possible to make another level of description about what happened there
(one of the interesting things with ongoing OPM work is how to describe
the same activity at multiple levels like this).
> 2. Also you mention something about the number of runs in your
> presentation. "extra records depth of graph x number of runs". What
> does the number of runs correspond to and how is that modeled in the DB.
This is about constructing an explicit transitive closure of the
procedure/dataset graph.
If you have an explicit graph A->B, B->C then constructing the closure
means you need to add A->C as an edge. That's what I mean by roughly
proportional to depth of graph - the deeper the graph, the more edges you
need to add.
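
The closure step described above can be sketched in a few lines of Python.
This is only an illustration under assumptions, not the code Swift uses: the
function names `transitive_closure` and `chain`, and the representation of the
graph as a set of (source, destination) pairs, are made up for this example.

```python
def transitive_closure(edges):
    """Return the transitive closure of a directed graph.

    `edges` is a set of (source, destination) pairs (an assumed
    representation, chosen only for this sketch).
    """
    closure = set(edges)
    while True:
        # Join every edge a->b with every edge b->d to get a->d.
        new_edges = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_edges <= closure:
            return closure  # nothing new was derived, so we are done
        closure |= new_edges


def chain(depth):
    """Hypothetical linear graph 0->1->...->depth, with `depth` edges."""
    return {(i, i + 1) for i in range(depth)}


explicit = {("A", "B"), ("B", "C")}
closed = transitive_closure(explicit)
extra = closed - explicit  # the edges the closure had to add

# Extra edges added for chains of increasing depth.
extra_by_depth = {d: len(transitive_closure(chain(d))) - d for d in (1, 2, 3, 4)}
```

For the A->B, B->C example the only added edge is A->C; for the hypothetical
chains, the number of added edges goes up as the chain gets deeper.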
In the most recent implementation, each invocation of Swift is a subgraph
disconnected from the subgraphs of all other invocations of Swift. So (if
you make the often invalid but also often valid assumption that each
invocation of Swift generates roughly the same size provenance output),
size of the graph put together is roughly proportional to the number of
runs.
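
A small sketch of why disconnected per-run subgraphs give a total that scales
with the number of runs. Everything here is assumed for illustration: the
helper names, the tiny three-node graph standing in for one run, and the
tagging of nodes with a run id to keep runs disconnected.

```python
def transitive_closure(edges):
    """Transitive closure of a set of (source, destination) pairs."""
    closure = set(edges)
    while True:
        new_edges = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_edges <= closure:
            return closure
        closure |= new_edges


def run_graph(run_id):
    """Hypothetical provenance graph for one run: in -> proc -> out.

    Every node is tagged with the run id, so graphs from different
    runs share no nodes and stay disconnected.
    """
    base = [("in", "proc"), ("proc", "out")]
    return {((run_id, a), (run_id, b)) for a, b in base}


def combined_closure_size(num_runs):
    """Number of edges in the closure of all runs put together."""
    edges = set()
    for run_id in range(num_runs):
        edges |= run_graph(run_id)
    return len(transitive_closure(edges))


sizes = [combined_closure_size(n) for n in (1, 2, 3, 4)]
```

With each assumed run contributing three closure edges, `sizes` comes out as
3, 6, 9, 12: the closure never adds an edge between runs, so the total is the
per-run size times the number of runs.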
If further work were done to identify datasets across the graphs of
different runs (using some identity relation such as same filename or
something else), then generating a transitive closure would possibly
produce graphs whose size grows faster than in proportion to the number
of runs.
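
To make the contrast concrete, here is a sketch under stated assumptions: each
hypothetical run `i` has one procedure `p<i>` that reads file `f<i>` and writes
file `f<i+1>`, and "same filename" is used as the identity relation. None of
these names or this workload shape come from Swift itself.

```python
def transitive_closure(edges):
    """Transitive closure of a set of (source, destination) pairs."""
    closure = set(edges)
    while True:
        new_edges = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_edges <= closure:
            return closure
        closure |= new_edges


def run_edges(run, merge_by_filename):
    """Edges for one hypothetical run: f<run> -> p<run> -> f<run+1>."""
    def dataset(name):
        # Merged: a dataset is identified by filename alone.
        # Unmerged: it is private to the run that mentions it.
        return name if merge_by_filename else (run, name)

    proc = f"p{run}"
    return {(dataset(f"f{run}"), proc), (proc, dataset(f"f{run + 1}"))}


def closure_size(num_runs, merge_by_filename):
    """Number of closure edges over all runs taken together."""
    edges = set()
    for run in range(num_runs):
        edges |= run_edges(run, merge_by_filename)
    return len(transitive_closure(edges))


separate = [closure_size(n, merge_by_filename=False) for n in (1, 2, 3)]
merged = [closure_size(n, merge_by_filename=True) for n in (1, 2, 3)]
```

In this made-up workload the unmerged sizes are 3, 6, 9, while the merged
sizes are 3, 10, 21, because identifying the shared files links the runs into
one long chain. How much real runs would overlap depends on the workload.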
> I was also wondering if we can chat on the phone or I come up again to
> discuss a possible collaboration on this project and present some of our
> new results.
Nothing involving me except by very occasional email or if you hunt me
down in person and ply me with alcohol.