[Swift-commit] r3274 - text/swift_pc3_fgcs

noreply at svn.ci.uchicago.edu
Tue Apr 6 16:53:15 CDT 2010


Author: lgadelha
Date: 2010-04-06 16:53:14 -0500 (Tue, 06 Apr 2010)
New Revision: 3274

Added:
   text/swift_pc3_fgcs/sortProgGraph.odg
   text/swift_pc3_fgcs/sortProgGraph.png
Modified:
   text/swift_pc3_fgcs/swift_pc3_fgcs.tex
Log:
Updated "Data Model" section.


Added: text/swift_pc3_fgcs/sortProgGraph.odg
===================================================================
(Binary files differ)


Property changes on: text/swift_pc3_fgcs/sortProgGraph.odg
___________________________________________________________________
Name: svn:mime-type
   + application/octet-stream

Added: text/swift_pc3_fgcs/sortProgGraph.png
===================================================================
(Binary files differ)


Property changes on: text/swift_pc3_fgcs/sortProgGraph.png
___________________________________________________________________
Name: svn:mime-type
   + application/octet-stream

Modified: text/swift_pc3_fgcs/swift_pc3_fgcs.tex
===================================================================
--- text/swift_pc3_fgcs/swift_pc3_fgcs.tex	2010-04-06 03:26:43 UTC (rev 3273)
+++ text/swift_pc3_fgcs/swift_pc3_fgcs.tex	2010-04-06 21:53:14 UTC (rev 3274)
@@ -44,9 +44,9 @@
 \usepackage{upquote}
 %% if you use PostScript figures in your article
 %% use the graphics package for simple commands
-%% \usepackage{graphics}
+%%\usepackage{graphics}
 %% or use the graphicx package for more complicated commands
-%% \usepackage{graphicx}
+%%\usepackage{graphicx}
 %% or use the epsfig package if you prefer to use the old commands
 %% \usepackage{epsfig}
 
@@ -65,39 +65,10 @@
 
 \begin{frontmatter}
 
-%% Title, authors and addresses
 
-%% use the tnoteref command within \title for footnotes;
-%% use the tnotetext command for theassociated footnote;
-%% use the fnref command within \author or \address for footnotes;
-%% use the fntext command for theassociated footnote;
-%% use the corref command within \author for corresponding author footnotes;
-%% use the cortext command for theassociated footnote;
-%% use the ead command for the email address,
-%% and the form \ead[url] for the home page:
-%% \title{Title\tnoteref{label1}}
-%% \tnotetext[label1]{}
-%% \author{Name\corref{cor1}\fnref{label2}}
-%% \ead{email address}
-%% \ead[url]{home page}
-%% \fntext[label2]{}
-%% \cortext[cor1]{}
-%% \address{Address\fnref{label3}}
-%% \fntext[label3]{}
-
 \title{Provenance Management in Swift}
 
-%% use optional labels to link authors explicitly to addresses:
-%% \author[label1,label2]{}
-%% \address[label1]{}
-%% \address[label2]{}
 
-%\author{Ben Clifford}
-%\author{Luiz M. R. Gadelha Jr.}
-%\author{Marta Mattoso}
-%\author{Michael Wilde}
-%\author{Ian Foster}
-
 \author[no]{Ben Clifford}
 \ead{benc at hawaga.org.uk}
 \author[coppe]{Luiz M. R. Gadelha Jr.}
@@ -136,7 +107,7 @@
 
 \section{Introduction}
 
-The automation of large scale computational scientific experiments can be accomplished through the use of workflow management systems, parallel scripting tools, and related systems that allow the definition of the activities, input and output data, and data dependencies of such experiments. The manual analysis of the data resulting from their execution is not feasible, due to the usually large amount of information. Provenance systems can be used to facilitate this task, since they gather details about the design \cite{FrSi06} \cite{DeGa09} and execution of these experiments, such as data artifacts consumed and produced by their activities. They also make it easier to reproduce an experiment for the purpose of verification. 
+The automation of large-scale computational scientific experiments can be accomplished through the use of workflow management systems \cite{DeGa09}, parallel scripting tools \cite{WiFo09}, and related systems that allow the definition of the activities, input and output data, and data dependencies of such experiments. Manual analysis of the data resulting from their execution is usually not feasible, due to the large amount of information involved. Provenance systems can be used to facilitate this task, since they gather details about the design \cite{FrSi06} and execution of these experiments, such as the data artifacts consumed and produced by their activities. They also make it easier to reproduce an experiment for the purpose of verification.
 
 The Open Provenance Model (OPM) \cite{opm1.1} is an ongoing effort to standardize the representation of provenance information. OPM defines the entities {\em artifact}, {\em process}, and {\em agent} and the relations {\em used} (between an artifact and a process), {\em wasGeneratedBy} (between a process and an artifact), {\em wasControlledBy} (between an agent and a process), {\em wasTriggeredBy} (between two processes), and {\em wasDerivedFrom} (between two artifacts).
 
@@ -146,30 +117,20 @@
 
 \section{Data Model} \label{datamodel}
 
-In Swift, data is represented by strongly-typed single-assignment variables. Data types can be {\em atomic} or {\em composite}. Atomic types are given by {\em primitive} types, such as integers or strings, or {\em mapped} types. Mapped types are used for representing and accessing data stored in local or remote files. {\em Composite} types are given by structures and arrays. In the Swift runtime, data is represented by a {\em dataset handle}. It may have as attributes a value, a filename, a child dataset handle (when it is a structure or an array), or a parent dataset handle (when it is contained in a structure or an array). Swift processes are given by invocations of external programs, functions, and operators. Dataset handles are produced and consumed by Swift processes.
+In Swift, data is represented by strongly-typed single-assignment variables. Data types can be {\em atomic} or {\em composite}. Atomic types are given by {\em primitive} types, such as integers or strings, or {\em mapped} types. Mapped types are used for representing and accessing data stored in local or remote files. {\em Composite} types are given by structures and arrays. In the Swift runtime, data is represented by a {\em dataset handle}. It may have as attributes a value, a file name, a child dataset handle (when it is a structure or an array), or a parent dataset handle (when it is contained in a structure or an array). Swift processes are given by invocations of external programs, functions, and operators. Dataset handles are produced and consumed by Swift processes.
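+
+For illustration only, the attributes of a dataset handle described above could be pictured as the columns of a single relation; the actual database layout used by Swift is discussed below and in \cite{ClGaMa09}, and the table and column names used here are purely illustrative:
+\begin{lstlisting}[frame=lines]
+-- Illustrative sketch only; it does not reflect the actual schema.
+CREATE TABLE dataset_handle (
+  id        VARCHAR(256) PRIMARY KEY, -- URI identifying the dataset handle
+  value     VARCHAR(256),             -- value, for primitive types
+  filename  VARCHAR(256),             -- file name, for mapped types
+  parent_id VARCHAR(256)              -- containing structure or array, if any
+);
+\end{lstlisting}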
 
-% brief intro to Swift with sortProg?
-
 In the Swift provenance model, dataset handles and processes are recorded, as are the relations between them (either a process consuming a dataset handle as input, or a process producing a dataset handle as output). Each dataset handle and process is uniquely identified in time and space by a URI. This information is stored persistently in a relational database; we have also experimented with other database layouts \cite{ClGaMa09}. The two key relational tables used to store the structure of the provenance graph are {\tt processes}, which stores brief information about processes (see table \ref{processes_table}), and {\tt dataset\_usage}, which stores produced and consumed relationships between processes and dataset handles (see table \ref{dataset_usage_table}). Other tables \cite{ClGaMa09} are used to record details about each process and dataset, and other relationships such as containment.
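+
+A simplified relational sketch of these two tables is given below; the exact column names and types may differ from the actual schema:
+\begin{lstlisting}[frame=lines]
+-- Simplified sketch of the two key relations.
+CREATE TABLE processes (
+  id   VARCHAR(256) PRIMARY KEY, -- URI identifying the process
+  type VARCHAR(32)               -- execution, compound procedure,
+                                 -- function, or operator
+);
+
+CREATE TABLE dataset_usage (
+  process_id VARCHAR(256), -- URI of the process end of the relationship
+  dataset_id VARCHAR(256), -- URI of the dataset handle end
+  direction  VARCHAR(3),   -- In (consumed) or Out (produced)
+  param_name VARCHAR(128)  -- name of the parameter (illustrative name)
+);
+\end{lstlisting}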
 
-
-
-Consider the Swiftscript program in listing \ref{sortprog}, which first describes a procedure ({\tt sortProg}, which calls the external executable {\tt sort}); then declares references to two files ({\tt f}, a reference to {\tt inputfile}, and {\tt g}, a reference to {\tt outputfile}); and finally calls the procedure {\tt sortProg}. When this program is run, provenance records are generated as follows: a process record is generated for the initial call to the {\tt sortProg(f)} procedure; a process record is generated for the {\tt @i} inside {\tt sortProg}, representing the evaluation of the {\tt @filename} function that Swift uses to determine the physical file name corresponding to the reference {\tt f}; a process record is generated for the {\tt @o} inside {\tt sortProg}, again representing the evaluation of the {\tt @filename} function, this time for the reference {\tt g}.
-
-Dataset handles are recorded for: the string {\tt "inputfile"}; the string {\tt "outputfile"}; file variable {\tt f}; the file variable {\tt g}; the filename of {\tt i}; the filename of {\tt o}.	
-
-Input/output relations are recorded as: {\tt sortProg(f)}  takes {\tt f} as an input; {\tt sortProg(f)} produces {\tt g} as an output; the {\tt @filename} function takes {\tt f} as an input; the {\tt @filename} function takes {\tt g} as an input; the {\tt @filename} produces the filename of {\tt i} as an output.
-
 \begin{table}
 \begin{center} 
-\caption{Database table {\tt processes}.\label{processes_table}}
-\begin{tabular}{ | l | p{11cm} |  }
+\caption{Database relation {\tt processes}.\label{processes_table}}
+\begin{tabular}{ | l | p{10cm} |  }
 \hline	
   {\bf Attribute} & {\bf Definition}\\
 \hline
-  {\tt id}    & the URI identifying the process\\
+  {\tt id}        & the URI identifying the process\\
 \hline
-  {\tt type} & the type of the process: execution, compound procedure, function, operator\\
+  {\tt type}      & the type of the process: execution, compound procedure, function, operator\\
 \hline  
 \end{tabular}
 \end{center} 
@@ -177,10 +138,10 @@
 
 \begin{table}
 \begin{center} 
-\caption{Database table {\tt dataset\_usage}.\label{dataset_usage_table}}
+\caption{Database relation {\tt dataset\_usage}.\label{dataset_usage_table}}
 \begin{tabular}{ | l | p{9.8cm} |  }
 \hline	
-  {\bf Attribute} & {\bf Definition}\\
+  {\bf Attribute}   & {\bf Definition}\\
 \hline
   {\tt process\_id} & a URI identifying the process end of the relationship\\
 \hline
@@ -211,7 +172,6 @@
 %\item[(V)] the filename of {\tt o}.	
 %\end{enumerate}
 
-
 %Input/output relations are recorded as:
 
 %\begin{itemize}
@@ -222,9 +182,6 @@
 %\item (B) produces (U) as an output.
 %\end{itemize}
 
-%%\footnotesize
-
-%\begin{verbatim}
 \lstset{basicstyle=\tt \footnotesize}
 \begin{lstlisting}[float,caption={\tt sortProg} Swift program.,frame=lines,label=sortprog]
 app (file o) sortProg(file i) {
@@ -234,11 +191,39 @@
 file g <"outputfile">;
 g = sortProg(f);
 \end{lstlisting}
-%\end{verbatim}
-%\normalsize
 
-The Swift provenance model is close to OPM, but there are some differences. Dataset handles correspond closely with OPM artifacts as immutable representations of data. However they do not correspond exactly. An OPM artifact has unique provenance. However, a dataset handle can have multiple provenance descriptions. For example, in the SwiftScript program displayed in listing \ref{multi}, the expression {\tt c[0]} evaluates to the dataset handle corresponding to the variable {\tt a}. That dataset handle has a provenance trace indicating it was assigned from the constant value {\tt 7}. However, that dataset handle now has additional provenance indicating that it was output by applying the array access operator {\tt []} to the array {\tt c} and the numerical value {\tt 0}.
+Consider the SwiftScript program in listing \ref{sortprog}, which first describes a procedure ({\tt sortProg}, which calls the external executable {\tt sort}); then declares references to two files ({\tt f}, a reference to {\tt inputfile}, and {\tt g}, a reference to {\tt outputfile}); and finally calls the procedure {\tt sortProg}. 
+When this program is run, provenance records are generated as follows: 
+  a process record is generated for the initial call to the {\tt sortProg(f)} procedure; 
+  a process record is generated for the {\tt @i} inside {\tt sortProg}, representing the evaluation of the {\tt @filename} function that Swift uses to determine the physical file name corresponding to the reference {\tt f}; 
+  a process record is generated for the {\tt @o} inside {\tt sortProg}, again representing the evaluation of the {\tt @filename} function, this time for the reference {\tt g}. 
+Dataset handles are recorded for: 
+  the string {\tt "inputfile"}; 
+  the string {\tt "outputfile"}; 
+  the file variable {\tt f}; 
+  the file variable {\tt g}; 
+  the file name of {\tt i}; 
+  the file name of {\tt o}.	 
+Input and output relations are recorded as: 
+  {\tt sortProg(f)} takes {\tt f} as an input; 
+  {\tt sortProg(f)} produces {\tt g} as an output; 
+  the evaluation of {\tt @i} takes {\tt f} as an input; 
+  the evaluation of {\tt @i} produces the file name of {\tt i} as an output; 
+  the evaluation of {\tt @o} takes {\tt g} as an input; 
+  the evaluation of {\tt @o} produces the file name of {\tt o} as an output.
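+
+Using abbreviated identifiers in place of the URIs actually recorded, and the illustrative column layout sketched earlier for {\tt dataset\_usage}, these input and output relations would appear roughly as follows:
+\begin{lstlisting}[frame=lines]
+-- Sketch only: identifiers abbreviated, parameter names illustrative.
+INSERT INTO dataset_usage VALUES ('sortProg', 'f',          'In',  'i');
+INSERT INTO dataset_usage VALUES ('sortProg', 'g',          'Out', 'o');
+INSERT INTO dataset_usage VALUES ('@i',       'f',          'In',  'i');
+INSERT INTO dataset_usage VALUES ('@i',       'filename-i', 'Out', 'result');
+INSERT INTO dataset_usage VALUES ('@o',       'g',          'In',  'o');
+INSERT INTO dataset_usage VALUES ('@o',       'filename-o', 'Out', 'result');
+\end{lstlisting}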
 
+The Swift provenance model is close to OPM, but there are some differences. Dataset handles correspond closely to OPM artifacts as immutable representations of data, but the correspondence is not exact: an OPM artifact has unique provenance, whereas a dataset handle can have multiple provenance descriptions. For example, given the SwiftScript program displayed in listing \ref{multi}, the expression {\tt c[0]} evaluates to the dataset handle corresponding to the variable {\tt a}. That dataset handle has a provenance trace indicating that it was assigned from the constant value {\tt 7}. However, it now has additional provenance indicating that it was output by applying the array access operator {\tt []} to the array {\tt c} and the numerical value {\tt 0}. In OPM, the artifact resulting from evaluating {\tt c[0]} is distinct from the artifact resulting from evaluating {\tt a}, although they may be annotated with an {\em isIdenticalTo} arc \cite{OPMcollections}. The OPM entity {\em agent} is currently not represented in Swift's provenance model.
+
+Except for {\em wasControlledBy}, the dependency relationships defined in OPM can be derived from the {\tt dataset\_usage} database relation, which explicitly stores the {\em used} and {\em wasGeneratedBy} relationships. For instance, the provenance database for {\tt sortProg} contains the tuples $\langle \text{{\tt sortProg}}, \text{{\tt f}}, \text{In}, \text{{\tt i}} \rangle$ and $\langle \text{{\tt sortProg}}, \text{{\tt g}}, \text{Out}, \text{{\tt o}} \rangle$. In OPM, this is equivalent to saying $\text{{\tt f}} \xleftarrow{\text{used({\tt i})}} \text{{\tt sortProg}}$ and $\text{{\tt sortProg}} \xleftarrow{\text{wasGeneratedBy({\tt o})}} \text{{\tt g}}$, respectively. Figure \ref{sortProgGraph} shows an OPM graph containing the relationships stored in the provenance database for the {\tt sortProg} example.
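+
+The {\em used} and {\em wasGeneratedBy} arcs can thus be enumerated directly from {\tt dataset\_usage}; a sketch, using the illustrative column names from the examples above, is:
+\begin{lstlisting}[frame=lines]
+-- used(R): artifacts consumed by each process
+SELECT dataset_id, param_name, process_id
+  FROM dataset_usage WHERE direction = 'In';
+-- wasGeneratedBy(R): artifacts produced by each process
+SELECT process_id, param_name, dataset_id
+  FROM dataset_usage WHERE direction = 'Out';
+\end{lstlisting}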
+
+\begin{figure*}
+\caption{Provenance graph of {\tt sortProg}.\label{sortProgGraph}}
+\begin{center}
+\includegraphics[width=13.5cm]{sortProgGraph}
+\end{center}
+\end{figure*}
+
 \begin{lstlisting}[float,caption=Multiple provenance descriptions for a dataset.,frame=lines, label=multi]
 int a = 7;
 int b = 10;
@@ -246,14 +231,6 @@
 \end{lstlisting}
 \normalsize
 
-
-
-
-In OPM, the artifact resulting from evaluating {\tt c[0]} is distinct from the artifact resulting from evaluating {\tt a}, although they may be annotated with an isIdenticalTo arc \cite{OPMcollections}.
-
-Except for {\em wasControlledBy}, the dependency relationships defined in OPM can be derived from the {\tt dataset\_usage} database relation. {\em used} and {\em wasGeneratedBy} are explicitly stored in the relation. For instance, if the tuple $\langle P_{id}, D_{id}, {\text I}, R \rangle$ is in the {\tt dataset\_usage} relation then it is equivalent to say $D_{id} \xleftarrow{\text{used(R)}} P_{id}$ in OPM. If we had 'O' instead of 'I' as the value for attribute {\tt direction} it would be equivalent to 
-$P_{id} \xleftarrow{\text{wasGeneratedBy(R)}} D_{id}$ in OPM.
-
 One of the main concerns with using a relational model for representing provenance is the need for querying over the transitive relation expressed in the {\tt dataset\_usage} table. For example, after executing the SwiftScript code in listing \ref{transit}, it might be desirable to find all dataset handles that lead to {\tt c}: that is, {\tt a} and {\tt b}. However, simple SQL queries over the {\tt dataset\_usage} relation can only go back one step, leading to the answer {\tt b} but not to the answer {\tt a}. To address this problem, we generate a transitive closure table using an incremental evaluation system \cite{SQLTRANS}. This approach makes it straightforward to query over transitive relations using natural SQL syntax, at the expense of a larger database size and a longer import time.
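+
+As an illustration, assuming the transitive closure is materialized in a relation {\tt trans(before, after)} (the actual table and column names may differ), all predecessors of the dataset handle bound to {\tt c} in listing \ref{transit} can be retrieved with a single query:
+\begin{lstlisting}[frame=lines]
+-- Sketch: :c stands for the URI identifying the handle bound to c;
+-- filtering the result down to dataset handles is omitted for brevity.
+SELECT before FROM trans WHERE after = :c;
+\end{lstlisting}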
 
 \begin{lstlisting}[float,caption=Transitivity of provenance relationships.,frame=lines,label=transit]
@@ -291,11 +268,11 @@
 
 In our first attempt to implement LoadWorkflow in Swift, we found the use of the {\tt foreach} loop problematic because the database routines are internal to the Java implementation and, therefore, Swift has no control over them. Since Swift tries to parallelize the {\tt foreach} iterations, it ended up incorrectly parallelizing the database operations; it was necessary to serialize the execution of the workflow to avoid this problem. Most of the PC3 queries are for row-level database provenance. A workaround for gathering provenance about database operations was implemented by modifying the application database so that, for every row inserted or modified, an entry containing the execution identifier of the Swift process that performed the corresponding database operation is also inserted. 
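+
+For example, when a load step inserts a row into the {\tt Image} table, the modified application database also records which Swift process performed the insert; a sketch, with illustrative identifiers and column names, is:
+\begin{lstlisting}[frame=lines]
+-- Application row inserted by the load step (columns abbreviated).
+INSERT INTO IMAGE (IMAGEID) VALUES ('img-001');
+-- Accompanying annotation: the Swift process that inserted the row.
+INSERT INTO P2IMAGEPROV (IMAGEID, PROCESS_ID) VALUES ('img-001', 'execute-42');
+\end{lstlisting}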
 
-{\em Core Query 1}. The first query asks, for a given application database row, which CSV files contributed to it. The strategy used to answer this query is to determine input CSV files that precede, in the transitivity table, the process that inserted the row. This query can be answered by first obtaining the Swift process identifier of the process that inserted the row from the annotations included in the application database. Finally, we query for filenames of datasets that contain CSV inputs in the set of predecessors of the process that inserted the row.
+{\em Core Query 1}. The first query asks, for a given application database row, which CSV files contributed to it. The strategy used to answer this query is to determine the input CSV files that precede, in the transitivity table, the process that inserted the row. The query is answered by first obtaining, from the annotations included in the application database, the identifier of the Swift process that inserted the row, and then querying for the file names of the input CSV datasets among the predecessors of that process.
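+
+A sketch of this query, assuming the transitive closure relation {\tt trans(before, after)} introduced above and a hypothetical relation {\tt dataset\_filenames(dataset\_id, filename)} recording the file name of each mapped dataset (the actual table names may differ), is:
+\begin{lstlisting}[frame=lines]
+-- :p is the identifier of the process that inserted the row, obtained
+-- from the annotations in the application database.
+SELECT df.filename
+  FROM trans t, dataset_filenames df
+  WHERE t.after = :p
+    AND t.before = df.dataset_id
+    AND df.filename LIKE '%.csv';
+\end{lstlisting}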
 
-{\em Core Query 2}. This query asks if the range check (IsMatchColumnRanges) was performed in a particular table, given that a user found values that were not expected in it. This is implemented with the SQL query: 
+{\em Core Query 2}. This query asks whether the range check (IsMatchColumnRanges) was performed on a particular table, given that a user found unexpected values in it. This is implemented with the SQL query displayed in listing \ref{qc2}, which returns the input parameter XML for all IsMatchColumnRanges calls. Since these are XML values, it is necessary to examine the resulting XML to determine whether the check was invoked for the specific table. Obtaining an actual yes/no answer therefore requires awkward cross-format joining, although a {\tt LIKE} clause could be used to examine the value.
 
-\begin{lstlisting}[float,caption=A floating example,frame=lines]
+\begin{lstlisting}[float,caption=Core query 2.,frame=lines, label=qc2]
 > select dataset_values.value
   from
     processes, invocation_procedure_names, dataset_usage, 
@@ -310,8 +287,8 @@
     dataset_usage.dataset_id = dataset_values.dataset_id;
 \end{lstlisting}
 
-This returns the input parameter XML for all IsMatchColumnRanges calls. These are XML values, and it is necessary to examine the resulting XML to determine if it was invoked for the specific table. There is unpleasant cross-format joining necessary here to get an actual yes/no result properly, although we probably could use a {\tt LIKE} clause to examine the value.
 
+
 {\em Core Query 3}. The third core query asks which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value. This uses the additional annotations made, which only store which process originally inserted a row, not which processes have modified it. To some extent, rows are thus treated like artifacts (though not first-order artifacts in the provenance database), and we can only answer questions about the provenance of rows, not of the individual fields within those rows. That is sufficient for this query, though. First, find the row that contains the interesting value and extract its {\tt IMAGEID}. Then, find the process that created the {\tt IMAGEID} by querying the Derby database table {\tt P2IMAGEPROV}; this gives the identifier of the process that created the row. Finally, query the transitive closure table for all predecessors of that process (as in the first core query), which produces all processes and artifacts that preceded this row creation. Our answer differs from the sample answer because we have sequenced access to the database, rather than regarding each row as a proper first-order artifact. The entire database state at a particular time is a successor to all previous database-accessing operations, so any process that led to any database access before the row in question is regarded as a necessary operation. This is undesirable in some respects, but desirable in others. For example, a row insert only works because previous database operations that inserted other rows did not insert a conflicting primary key, so there is a data dependency between the different operations even though they operate on different rows. 
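+
+A sketch of these steps, again assuming the {\tt trans} relation and illustrative column names for {\tt P2IMAGEPROV}, is:
+\begin{lstlisting}[frame=lines]
+-- Step 1: the process that inserted the row with the extracted IMAGEID.
+SELECT PROCESS_ID FROM P2IMAGEPROV WHERE IMAGEID = :imageid;
+-- Step 2: all processes and artifacts preceding that process.
+SELECT before FROM trans WHERE after = :pid;
+\end{lstlisting}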
 
 {\em Optional Query 1}. The computation halts due to failing an IsMatchTableColumnRanges check. How many tables were loaded successfully before the computation halted due to the failed check? The answer was given by querying how many load processes are known to the database (over all recorded computations), which can be restricted to a particular computation.
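+
+A sketch of this count, assuming that {\tt invocation\_procedure\_names} relates each process identifier to the name of the procedure invoked and that load steps can be recognized by that name (column names and the name pattern are illustrative), is:
+\begin{lstlisting}[frame=lines]
+-- Number of load invocations recorded over all computations;
+-- restricting to one computation is omitted here.
+SELECT count(*)
+  FROM processes p, invocation_procedure_names n
+  WHERE p.id = n.execute_id
+    AND n.procedure_name LIKE '%Load%';
+\end{lstlisting}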



