[Swift-commit] r5955 - provenancedb

lgadelha at ci.uchicago.edu
Tue Oct 9 14:04:07 CDT 2012


Author: lgadelha
Date: 2012-10-09 14:04:07 -0500 (Tue, 09 Oct 2012)
New Revision: 5955

Added:
   provenancedb/provdb.mwb
Modified:
   provenancedb/README.asciidoc
Log:


Modified: provenancedb/README.asciidoc
===================================================================
--- provenancedb/README.asciidoc	2012-10-02 00:42:34 UTC (rev 5954)
+++ provenancedb/README.asciidoc	2012-10-09 19:04:07 UTC (rev 5955)
@@ -4,7 +4,7 @@
 
 Swift can be configured to gather and store provenance information about script executions. The following tools are available:
 
-. A set of scripts for extracting provenance information from Swift's log files. The extracted data is imported into a relational database, currently PotgreSQL, where it can queried.
+. A set of scripts for extracting provenance information from Swift's log files. The extracted data is imported into a relational database, currently PostgreSQL, where it can be queried.
 
 . A query interface for provenance with a built-in query language called SPQL (Swift Provenance Query Language). SPQL is similar to SQL except for not having +FROM+-clauses and join expressions on the +WHERE+-clause, which are automatically computed for the user. A number of functions and stored procedures that abstract common provenance query patterns are available in both SPQL and SQL.
 	
@@ -14,36 +14,35 @@
 
 - Gathering of hierarchical relationships between data sets. 
 
-- Gathering of versioned information of the specifications of many-task scientific computations and of their component applications. 
+- Gathering of script source code used in each execution. 
 
 - Allows users to enrich their provenance records with annotations. 
 
-- Gathering of runtime information about component application executions. 
+- Gathering of runtime information about application executions. 
 
 - Provides a usable and useful query interface for provenance information. 
 
-A UML diagram of this provenance model is presented in figure <<provdb_schema>>. We simplify the UML notation to abbreviate the information that each annotated entity set (script run, function call, and variable) has one annotation entity set per data type. We define entities that correspond to the Open Provenance Model (OPM) notions of artifact, process, and artifact usage (either being consumed or produced by a process). These are augmented with entities used to represent many-task scientific computations, and to allow for entity annotations. Such annotations, which can be added post-execution, represent information about provenance entities  such as object version tags and scientific parameters. 
+A UML diagram of this provenance model is presented in figure <<provdb_schema>>. We simplify the UML notation to abbreviate the information that each annotated entity set (script run, function call, and variable) has one annotation entity set per data type. We define entities that correspond to the Open Provenance Model (OPM) notions of artifact, process, and artifact usage (either being consumed or produced by a process). Annotations, which can be added post-execution, represent information about provenance entities such as object version tags and scientific parameters. 
 
 [[provdb_schema]]
-image::provdb-uml.svg["Swift provenance database schema",width=1000]
+image::provdb.svg["Swift provenance database schema",width=1280]
 
-+script_run+: refers to the execution (successful or unsuccessful) of an entire many-task scientific computation, which in Swift is specified as the execution of a complete parallel script from start to finish. 
++script+: contains the script source code used and its hash value.
 
-+function_call+: records calls to Swift functions. These calls take as input data sets, such as values stored in primitive variables or files referenced by mapped variables; perform some computation specified in the respective function declaration; and produce data sets as output. In Swift, function calls can represent invocations of external applications, built-in functions, and operators; each function call is associated with the script run that invoked it.
++script_run+: refers to the execution (successful or unsuccessful) of a script, with attributes such as start time, source code filename, and Swift's version. 
 
-+app_fun_call+: represents an invocation of one component application of a many-task scientific computation. In Swift, it is generated by an invocation to an external application. External applications are listed in an application catalog along with the computational resources on which they can be executed.
++function_call+: records calls to functions within a script execution. These calls take as input data sets, such as values stored in primitive variables or files referenced by mapped variables; perform some computation specified in the respective function declaration; and produce data sets as output. In Swift, function calls can represent invocations of external applications, built-in functions, and operators; each function call is associated with the script run that invoked it.
 
++app_fun_call+: represents an invocation of an application function (_app function_). In Swift, it is generated by an invocation to an external application. External applications are listed in an application catalog along with the computational resources on which they can be executed.
+
 +application_execution+: represents execution attempts of an external application. Each application function call triggers one or more execution attempts, where one (or, in the case of retries or replication, several) particular computational resource(s) will be selected to actually execute the application.
 
 +runtime_info+: contains information associated with an application execution, such as resource consumption.
 
-+variable+: represents data sets that were assigned to variables in a Swift script. Variable types can be atomic or composite. Atomic types are primitive types, such as integers and strings, recorded in the relation Primitive, or mapped types, recorded in the relation Mapped. Mapped types are used for declaring and accessing data that is stored in files. Composite types are given by structures and arrays. Containment relationships define a hierarchy where each variable may have child variables (when it is a structure or an array), or a parent variable (when it is a member of a collection).  A variable may have as attributes a value, when it is a primitive variable; or a filename, when it is a mapped file.
++dataset+: represents data sets that were assigned to variables in a Swift script.
 
 +annot+: is a key-value pair associated with either a +variable+, +function_call+, or +script_run+. These annotations are used to store context-specific information about the entities of the provenance data model. Examples include scientific-domain parameters, object versions, and user identities. Annotations can also be used to associate a set of runs of a script related to a particular event or study, which we refer to as a campaign.   The +dataset_in+ and +dataset_out+ relationships between +function_call+ and +variable+ define a lineage graph that can be traversed to determine ancestors or descendants of a particular entity. Process dependency and data dependency graphs are derived with transitive queries over these relationships. 
 
-The provenance model presented here is a significant refinement of a previous one used by Swift in the Third Provenance Challenge, which was shown to be similar to OPM. This similarity is retained in the current version of the model, which adds support for annotations and runtime information on component application executions. +function_call+ corresponds to OPM processes, and +variable+ corresponds to OPM artifacts as immutable data objects. The OPM entity agent controls OPM processes, e.g. starting or terminating them. While we do not explicitly define an entity type for such agents, this information can be stored in the annotation tables of the +function_call+ or +script_run+ entities. To record which users controlled each script or function call execution, one can gather the associated POSIX userids, when executions are local, or the distinguished name of network security credentials, when executions cross security domains, and store them as annotations for the 
-respective entity. This is equivalent to establishing an OPM _wasControlledBy_ relationship. The dependency relationships, _used_ and _wasGeneratedBy_, as defined in OPM, correspond to our +dataset\_in+ and +dataset\_out+ relationships, respectively. Our data model has additional entity sets to capture behavior that is specific to parallel and distributed systems, to distinguish, for instance, between application invocations and execution attempts. We currently do not directly support the OPM concept of _account_, which can describe the same computation using different levels of abstraction. However, one could use annotations to associate one or more such accounts with an execution entity. Based on the mapping to OPM described here, Swift Provenance Database provides tools for exporting the provenance database into OPM provenance graph interchange format, which provides interoperability with other OPM-compatible provenance systems.  
-
 == Design and Implementation of Swift Provenance Database
 
 In this section, we describe the design and implementation of the Swift Provenance Database provenance query framework. It consists of a set of tools used for extracting provenance information from Swift log files, and a query interface. While its log extractor is specific to Swift, the remainder of the system, including the query interface, is applicable to any parallel functional data flow execution model. The Swift Provenance Database system design is influenced by our survey of provenance queries in many-task computing, where a set of query patterns was identified. The _multiple-step relationships_ (R^*) pattern is implemented by queries that follow the transitive closure of basic provenance relationships, such as data containment hierarchies, and data derivation and consumption. The _run correlation_ (RCr) pattern is implemented by queries for correlating attributes from multiple script runs, such as annotation values or the values of function call parameters.
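
The _multiple-step relationships_ pattern described in the README text above (transitive queries over the +dataset_in+ and +dataset_out+ relationships) can be illustrated with a recursive query. The following is only a minimal sketch, not the project's actual implementation: the column names `function_call_id` and `dataset_id`, the helper names `ancestors` and `build_demo_db`, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3


def ancestors(conn, dataset_id):
    """Return the ids of every dataset that dataset_id was transitively derived from."""
    rows = conn.execute(
        """
        WITH RECURSIVE anc(id) AS (
            -- Base case: inputs of the function call that produced the dataset.
            SELECT i.dataset_id
            FROM dataset_out o
            JOIN dataset_in i ON i.function_call_id = o.function_call_id
            WHERE o.dataset_id = ?
            UNION
            -- Recursive case: inputs of the calls that produced known ancestors.
            SELECT i.dataset_id
            FROM anc a
            JOIN dataset_out o ON o.dataset_id = a.id
            JOIN dataset_in i ON i.function_call_id = o.function_call_id
        )
        SELECT id FROM anc ORDER BY id
        """,
        (dataset_id,),
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db():
    """Create a tiny, hypothetical lineage graph: a -> f1 -> b, (b, x) -> f2 -> c."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset_in (function_call_id TEXT, dataset_id TEXT);
        CREATE TABLE dataset_out (function_call_id TEXT, dataset_id TEXT);
        INSERT INTO dataset_in VALUES ('f1', 'a'), ('f2', 'b'), ('f2', 'x');
        INSERT INTO dataset_out VALUES ('f1', 'b'), ('f2', 'c');
        """
    )
    return conn
```

Calling `ancestors(build_demo_db(), 'c')` walks from `c` back through `f2` and `f1`. `UNION` (rather than `UNION ALL`) discards duplicate ancestors, which also keeps the recursion finite if the same dataset is reachable along several paths. Descendants could be computed the same way by swapping the roles of the two tables.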

Added: provenancedb/provdb.mwb
===================================================================
(Binary files differ)


Property changes on: provenancedb/provdb.mwb
___________________________________________________________________
Added: svn:mime-type
   + application/octet-stream



