[Swift-commit] r2704 - provenancedb

Tue Mar 17 10:18:43 CDT 2009

Author: benc
Date: 2009-03-17 10:18:41 -0500 (Tue, 17 Mar 2009)
New Revision: 2704

Modified:
   provenancedb/provenance.xml
Log:
some opm notes

Modified: provenancedb/provenance.xml
===================================================================

--- provenancedb/provenance.xml	2009-03-17 15:17:15 UTC (rev 2703)
+++ provenancedb/provenance.xml	2009-03-17 15:18:41 UTC (rev 2704)
@@ -1882,10 +1882,70 @@
 <section><title>OPM links</title>
 <para><ulink url="http://twiki.ipaw.info/bin/view/Challenge/OPM">Open Provenance Model at ipaw.info</ulink></para>
 </section>
+
+<section><title>Swift-specific OPM considerations</title>
+
+<para>
+Non-strictness: Swift sometimes constructs collections lazily. This leads to
+the Swift notion of an array being closed, meaning that we know no more
+contents will be created (somewhat like knowing we have reached the end of
+a list). An array may never be closed during a run, yet we may still have
+sufficient provenance information to answer useful queries. For example, if
+we specify a list [1:100000] and refer only to the 5th element of that
+array, we probably never generate most of the DSHandles. An explicit
+representation of that array in terms of datasets then cannot be expressed,
+though a higher-level representation in terms of its constructor
+parameters can be made (?).
+</para>
+
+<para>
+Aliasing: (this is related to some similar ambiguity in other parts of
+Swift, to do with dataset roots, which is not provenance related). It is
+possible to construct arrays by explicitly listing their members:
+<programlisting>
+int i = 8;
+int j = 100;
+int a[] = [i,j];
+int k = a[0];
+// here, k = 8 (Swift arrays are indexed from 0)
+</programlisting>
+The dataset contained in <literal>i</literal> is an artifact (a literal, so
+an input artifact that has no creating process). The array
+<literal>a</literal> is an artifact created by the explicit array construction
+syntax <literal>[memberlist]</literal> (which is an OPM process). If we
+then model the array accessor syntax <literal>a[0]</literal> as an OPM
+process, what artifact does it return: the same one or a different one?
+In OPM, we want it to return a different artifact, but in Swift we want this
+to be the same dataset. (Perhaps explaining this with <literal>int</literal>
+type variables is not the best way; using file-mapped data might be better.)
+TODO: what are the reasons we want files to have a single dataset
+representation in Swift? Dependency ordering, definitely. Cache management?
+Does this lead to a stronger notion of aliasing in Swift?
+</para>
+
+<para>
+Provenance of array indices: It seems fairly natural to represent arrays as OPM
+collections, with array element extraction being a process. However, in OPM,
+the index of an array is indicated with a role (with suggestions that it might
+be a simple number or an XPath expression). In Swift arrays, the index is a
+number, but that number is a Swift dataset in its own right, with its own
+provenance. By recording only an integer in the role, we lose provenance
+information about where that integer came from. It would be nice to be able
+to represent that (even if it is not standardised in OPM). I think that needs
+either reification of roles, so that they can be described, or treatment of
+[] as being like any other binary operator (which is what happens inside
+Swift), where the LHS and RHS are artifacts and the role is not used for
+identifying the member. The latter would also be an argument for treating
+array element extraction more like a plain binary operator inside the Swift
+compiler and runtime.
+</para>
+
 </section>
 
+</section>
 
 
+
 <section><title>stuff</title>
 <para>
 TODO transcribe info from the pile of papers I wrote. esp query analysis
@@ -2569,5 +2629,36 @@
 </section>
 
 
+<section><title>Representation of dataset containment and procedure execution in r2681 and how it could change.</title>
+
+<para>
+Representation of processes that transform one dataset into another dataset
+at present only occurs for <literal>app</literal> procedures, in logging of
+<literal>vdl:execute</literal> invocations, in lines like this:
+<screen>
+2009-03-12 12:20:29,772+0100 INFO  vdl:parameterlog PARAM thread=0-10-1 direction=input variable=s provenanceid=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090312-1220-md2mfc24:720000000033
+</screen>
+and dataset containment is represented at closing of the containing DSHandle by this:
+<screen>
+2009-03-12 12:20:30,205+0100 INFO  AbstractDataNode CONTAINMENT parent=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090312-1220-md2mfc24:720000000020 child=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090312-1220-md2mfc24:720000000086
+2009-03-12 12:20:30,205+0100 INFO  AbstractDataNode ROOTPATH dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090312-1220-md2mfc24:720000000086 path=[2]
+</screen>
+</para>
+
+<para>
+This representation does not represent the relationship between datasets when
+they are related by @functions or operators. Nor does it represent causal
+relationships between collections and their members - instead it represents
+containment.
+</para>
+
+<para>
+Adding representation of operators (including array construction) and of
+@function invocations would give substantially more information about
+the provenance of many more datasets.
+</para>
+
+</section>
+
 </article>
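
Note on the CONTAINMENT and ROOTPATH log lines quoted in the r2681 section of
this diff: the sketch below shows one way such lines might be turned into a
parent-to-children map for later provenance queries. It is not part of the
commit and is not how provenancedb imports logs. The function name
parse_containment and the regular expressions are assumptions, inferred only
from the two sample lines, and the short dataset ids are hypothetical
stand-ins for the full tag: URIs.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the two sample log lines only:
#   ... CONTAINMENT parent=<id> child=<id>
#   ... ROOTPATH dataset=<id> path=<path>
# Ids are matched lazily up to the next key so that ids containing spaces
# (as in the archived form of the log) still parse.
CONTAINMENT_RE = re.compile(r"CONTAINMENT parent=(?P<parent>.+?) child=(?P<child>.+)$")
ROOTPATH_RE = re.compile(r"ROOTPATH dataset=(?P<dataset>.+?) path=(?P<path>\S+)$")


def parse_containment(lines):
    """Return (children, paths).

    children maps a parent dataset id to the list of its child ids;
    paths maps a dataset id to its path within the root dataset.
    """
    children = defaultdict(list)
    paths = {}
    for raw in lines:
        line = raw.strip()
        m = CONTAINMENT_RE.search(line)
        if m:
            children[m.group("parent")].append(m.group("child"))
            continue
        m = ROOTPATH_RE.search(line)
        if m:
            paths[m.group("dataset")] = m.group("path")
    return dict(children), paths


# Hypothetical short ids stand in for the full tag: URIs.
sample = [
    "2009-03-12 12:20:30,205+0100 INFO  AbstractDataNode CONTAINMENT parent=dataset:20 child=dataset:86",
    "2009-03-12 12:20:30,205+0100 INFO  AbstractDataNode ROOTPATH dataset=dataset:86 path=[2]",
]
children, paths = parse_containment(sample)
print(children)
print(paths)
```

As the diff itself points out, a map like this records containment only; it
says nothing about which operator or @function produced a dataset.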