[Swift-commit] r3275 - text/swift_pc3_fgcs

Wed Apr 7 16:03:50 CDT 2010

Author: lgadelha
Date: 2010-04-07 16:03:49 -0500 (Wed, 07 Apr 2010)
New Revision: 3275

Modified:
   text/swift_pc3_fgcs/swift_pc3_fgcs.tex
Log:
Update to Data Model section


Modified: text/swift_pc3_fgcs/swift_pc3_fgcs.tex
===================================================================

--- text/swift_pc3_fgcs/swift_pc3_fgcs.tex	2010-04-06 21:53:14 UTC (rev 3274)
+++ text/swift_pc3_fgcs/swift_pc3_fgcs.tex	2010-04-07 21:03:49 UTC (rev 3275)
@@ -111,7 +111,7 @@
 
 The Open Provenance Model (OPM) \cite{opm1.1} is an ongoing effort to standardize the representation of provenance information. OPM defines the entities {\em artifact}, {\em process}, and {\em agent} and the relations {\em used} (between an artifact and a process), {\em wasGeneratedBy} (between a process and an artifact), {\em wasControlledBy} (between an agent and a process), {\em wasTriggeredBy} (between two processes), and {\em wasDerivedFrom} (between two artifacts).
 
-The Swift parallel scripting system \cite{swift} \cite{WiFo09} is a successor of the Virtual Data System (VDS) \cite{chimera} \cite{ZhWiFo06} \cite{ClFo08}. It allows the specification, management and execution of large-scale scientific workflows on parallel and distributed environments. The SwiftScript language is used for high-level specification of computations, it has features such as data types, data mappers, conditional and repetition flow controls, and sub-workflow composition. Its data model and type system are derived from XDTM \cite{xdtm}, which allows for the definition of abstract data types and objects without referring to their physical representation. If some dataset does not reside in main memory, its materialization is done through the use of data mappers. Procedures perform logical operations on input data, without modifying them. Swiftscript also allows procedures to be composed to define more complex computations. By analyzing the inputs and outputs of these procedures, the system determines data dependencies between them. This information is used to execute procedures that have no mutual data dependencies in parallel. It supports common execution managers for clustered systems and for grid environments, such as Falkon \cite{falkon}, which provides high job execution throughput. Swift logs a variety of information about each computation. This information can be exported to a relational database that uses a data model similar to OPM.
+The Swift parallel scripting system \cite{swift} \cite{WiFo09} is a successor of the Virtual Data System (VDS) \cite{chimera} \cite{ZhWiFo06} \cite{ClFo08}. It allows the specification, management and execution of large-scale scientific workflows on parallel and distributed environments. The SwiftScript language is used for high-level specification of computations, it has features such as data types, data mappers, conditional and repetition flow controls, and sub-workflow composition. Its data model and type system are derived from XDTM \cite{xdtm}, which allows for the definition of abstract data types and objects without referring to their physical representation. If some dataset does not reside in main memory, its materialization is done through the use of data mappers. Procedures perform logical operations on input data, without modifying them. SwiftScript also allows procedures to be composed to define more complex computations. By analyzing the inputs and outputs of these procedures, the system determines data dependencies between them. This information is used to execute procedures that have no mutual data dependencies in parallel. It supports common execution managers for clustered systems and for grid environments, such as Falkon \cite{falkon}, which provides high job execution throughput. Swift logs a variety of information about each computation. This information can be exported to a relational database that uses a data model similar to OPM.
 
 The objective of this paper is to present the local and remote provenance recording and analysis capabilities of Swift. In the sections that follow, we demonstrate the provenance capabilities of the Swift system and evaluate its interoperability with other systems through the use of OPM. We describe the provenance data model of the Swift system and compare it to OPM. We also describe activities performed within the Third Provenance Challenge (PC3) which consisted of implementing a specific scientific workflow (LoadWorkflow), performing provenance queries, and exchanging provenance information with other systems.
 
@@ -119,7 +119,7 @@
 
 In Swift, data is represented by strongly-typed single-assignment variables. Data types can be {\em atomic} or {\em composite}. Atomic types are given by {\em primitive} types, such as integers or strings, or {\em mapped} types. Mapped types are used for representing and accessing data stored in local or remote files. {\em Composite} types are given by structures and arrays. In the Swift runtime, data is represented by a {\em dataset handle}. It may have as attributes a value, a file name, a child dataset handle (when it is a structure or an array), or a parent dataset handle (when it is contained in a structure or an array). Swift processes are given by invocations of external programs, functions, and operators. Dataset handles are produced and consumed by Swift processes.
 
-In the Swift provenance model, dataset handles and processes are recorded, as are the relations between them (either a process consuming a dataset handle as input, or a process producing a dataset handle as output). Each dataset handle and process is uniquely identified in time and space by a URI. This information is stored persistently in a relational database; we have also experimented with other database layouts \cite{ClGaMa09}. The two key relational tables used to store the structure of the provenance graph are {\tt processes}, that stores brief information about processes (see table \ref{processes_table}), and {\tt dataset\_usage}, that stores produced and consumed relationships between processes and dataset handles (see table \ref{dataset_usage_table}). Other tables  \cite{ClGaMa09} are used to record details about each process and dataset, and other relationships such as containment.
+In the Swift provenance model, dataset handles and processes are recorded, as are the relations between them (either a process consuming a dataset handle as input, or a process producing a dataset handle as output). Each dataset handle and process is uniquely identified in time and space by a URI. This information is stored persistently in a relational database; we have also experimented with other database layouts \cite{ClGaMa09}. The two key relational tables used to store the structure of the provenance graph are {\tt processes}, that stores brief information about processes (see table \ref{processes_table}), and {\tt dataset\_usage}, that stores produced and consumed relationships between processes and dataset handles (see table \ref{dataset_usage_table}). Other tables  (see \cite{ClGaMa09} for details) are used to record details about each process and dataset, and other relationships such as containment.
 
 \begin{table}
 \begin{center} 
@@ -192,7 +192,7 @@
 g = sortProg(f);
 \end{lstlisting}
 
-Consider the Swiftscript program in listing \ref{sortprog}, which first describes a procedure ({\tt sortProg}, which calls the external executable {\tt sort}); then declares references to two files, ({\tt f}, a reference to {\tt inputfile}, and {\tt g}, a reference to {\tt outputfile}); and finally calls the procedure {\tt sortProg}. 
+Consider the SwiftScript program in listing \ref{sortprog}, which first describes a procedure ({\tt sortProg}, which calls the external executable {\tt sort}); then declares references to two files, ({\tt f}, a reference to {\tt inputfile}, and {\tt g}, a reference to {\tt outputfile}); and finally calls the procedure {\tt sortProg}. 
 When this program is run, provenance records are generated as follows: 
   a process record is generated for the initial call to the {\tt sortProg(f)} procedure; 
   a process record is generated for the {\tt @i} inside {\tt sortProg}, representing the evaluation of the {\tt @filename} function that Swift uses to determine the physical file name corresponding to the reference {\tt f}; 
@@ -214,10 +214,10 @@
 
 The Swift provenance model is close to OPM, but there are some differences. Dataset handles correspond closely with OPM artifacts as immutable representations of data. However they do not correspond exactly. An OPM artifact has unique provenance. However, a dataset handle can have multiple provenance descriptions. For example, given the SwiftScript program displayed in listing \ref{multi}, the expression {\tt c[0]} evaluates to the dataset handle corresponding to the variable {\tt a}. That dataset handle has a provenance trace indicating it was assigned from the constant value {\tt 7}. However, that dataset handle now has additional provenance indicating that it was output by applying the array access operator {\tt []} to the array {\tt c} and the numerical value {\tt 0}. In OPM, the artifact resulting from evaluating {\tt c[0]} is distinct from the artifact resulting from evaluating {\tt a}, although they may be annotated with an {\em isIdenticalTo} arc \cite{OPMcollections}. The OPM entity agent is currently not represented in Swift's  provenance model.
 
-Except for {\em wasControlledBy}, the dependency relationships defined in OPM can be derived from the {\tt dataset\_usage} database relation. It explicitly stores the {\em used} and {\em wasGeneratedBy} relationships. For instance, the provenance database for {\tt sortProg} contains the tuples $\langle \text{{\tt sortProg}}, \text{{\tt f}}, \text{In}, \text{{\tt i}} \rangle$ and $\langle \text{{\tt sortProg}}, \text{{\tt g}}, \text{Out}, \text{{\tt o}} \rangle$. In OPM, this is equivalent to say $\text{{\tt f}} \xleftarrow{\text{used(\text{{\tt i}})}} \text{{\tt sortProg}}$ and $\text{{\tt sortProg}} \xleftarrow{\text{wasGeneratedBy(\text{{\tt o}})}} \text{{\tt g}}$ respectively. Figure \ref{sortProgGraph} shows an OPM graph containing the relationships stored in the provenance database for the {\tt sortProg} example.
+Except for {\em wasControlledBy}, the dependency relationships defined in OPM can be derived from the {\tt dataset\_usage} database relation. It explicitly stores the {\em used} and {\em wasGeneratedBy} relationships. For instance, the provenance database for {\tt sortProg} contains the tuples $\langle \text{{\tt sortProg}}, \text{{\tt f}}, \text{In}, \text{{\tt i}} \rangle$ and $\langle \text{{\tt sortProg}}, \text{{\tt g}}, \text{Out}, \text{{\tt o}} \rangle$. In OPM, this is equivalent to say $\text{{\tt f}} \xleftarrow{\text{used(\text{{\tt i}})}} \text{{\tt sortProg}}$ and $\text{{\tt sortProg}} \xleftarrow{\text{wasGeneratedBy(\text{{\tt o}})}} \text{{\tt g}}$ respectively. The {\em wasTriggeredBy} and {\em wasDerivedFrom} dependency relationships can be inferred from {\tt dataset\_usage}; in the {\tt sortProg} example we have $\text{{\tt f}} \xleftarrow{\text{wasDerivedFrom}} \text{{\tt g}}$. Figure \ref{sortProgGraph} shows the relationships stored in Swift's provenance database for the {\tt sortProg} example using OPM notation.
 
 \begin{figure*}
-\caption{Provenance graph of {\tt sortProg}.\label{sortProgGraph}}
+\caption{Provenance relationships of {\tt sortProg}.\label{sortProgGraph}}
 \begin{center}
 \includegraphics[width=13.5cm]{sortProgGraph}
 \end{center}
@@ -287,8 +287,6 @@
     dataset_usage.dataset_id = dataset_values.dataset_id;
 \end{lstlisting}
 
-
-
 {\em Core Query 3}. The third core query asks which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value. This uses the additional annotations made, that only store which process originally inserted a row, not which processes have modified a row. So to some extent, rows are regarded a bit like artifacts (though not first order artifacts in the provenance database); and we can only answer questions about the provenance of rows, not the individual fields within those rows. That is sufficient for this query, though. First find the row that contains the interesting value and extract its {\tt IMAGEID}. Then find the process that created the {\tt IMAGEID} by querying the Derby database table {\tt P2IMAGEPROV}. This gives the process ID for the process that created the row. Now query the transitive closure table for all predecessors for that process (as in the first core query). This will produce all processes and artifacts that preceded this row creation. Our answer differs from the sample answer because we have sequenced access to the database, rather than regarding each row as a proper first-order artifact. The entire database state at a particular time is a successor to all previous database accessing operations, so any process which led to any database access before the row in question is regarded as a necessary operation. This is undesirable in some respects, but desirable in others. For example, a row insert only works because previous database operations which inserted other rows did not insert a conflicting primary key - so there is data dependency between the different operations even though they operate on different rows. 
 
 {\em Optional Query 1}. The computation halts due to failing an IsMatchTable-ColumnRanges check. How many tables were loaded successfully before the computation halted due to the failed check? The answer was given by querying how many load processes are known to the database (over all recorded computation), which can be restricted to a particular computation.
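
The revised Data Model text states that wasDerivedFrom and wasTriggeredBy can be inferred from the dataset_usage relation. The following is a minimal sketch of that inference, not the paper's implementation. It assumes a four-column table (process_id, dataset_id, direction, param_name) modelled on the tuples quoted in the text; only dataset_id appears in the paper's own queries, so the other column names, the helper name derive_opm_edges, and the use of an in-memory SQLite database are illustrative assumptions. It also assumes that every output of a process depends on every one of its inputs, which is stronger than what OPM itself requires.

```python
import sqlite3


def derive_opm_edges(rows):
    """Infer OPM-style edges from (process_id, dataset_id, direction, param_name) rows.

    Returns two sorted lists of pairs:
      derived   -- (output_dataset, input_dataset): output wasDerivedFrom input
      triggered -- (consumer_process, producer_process): consumer wasTriggeredBy producer
    """
    con = sqlite3.connect(":memory:")
    # Hypothetical schema; only dataset_id is confirmed by the paper's queries.
    con.execute(
        "CREATE TABLE dataset_usage ("
        "process_id TEXT, dataset_id TEXT, direction TEXT, param_name TEXT)"
    )
    con.executemany("INSERT INTO dataset_usage VALUES (?, ?, ?, ?)", rows)

    # An output is taken to derive from every input of the same process.
    derived = con.execute(
        """
        SELECT DISTINCT o.dataset_id, i.dataset_id
        FROM dataset_usage o
        JOIN dataset_usage i ON o.process_id = i.process_id
        WHERE o.direction = 'Out' AND i.direction = 'In'
        ORDER BY 1, 2
        """
    ).fetchall()

    # A process is triggered by whichever process produced one of its inputs.
    triggered = con.execute(
        """
        SELECT DISTINCT c.process_id, p.process_id
        FROM dataset_usage c
        JOIN dataset_usage p ON c.dataset_id = p.dataset_id
        WHERE c.direction = 'In' AND p.direction = 'Out'
          AND c.process_id <> p.process_id
        ORDER BY 1, 2
        """
    ).fetchall()

    con.close()
    return derived, triggered


# The two tuples quoted in the text for the sortProg example.
sort_rows = [("sortProg", "f", "In", "i"), ("sortProg", "g", "Out", "o")]
derived, triggered = derive_opm_edges(sort_rows)
print(derived)    # [('g', 'f')]
print(triggered)  # []
```

For the sortProg tuples this yields the single derivation edge between g and f given in the text, and no wasTriggeredBy edge, since only one process appears in those two tuples.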