[Swift-commit] r5993 - provenancedb

Fri Oct 26 12:10:18 CDT 2012

Author: lgadelha
Date: 2012-10-26 12:10:18 -0500 (Fri, 26 Oct 2012)
New Revision: 5993

Modified:
   provenancedb/README.asciidoc
   provenancedb/provdb-uml.dia
Log:


Modified: provenancedb/README.asciidoc
===================================================================

--- provenancedb/README.asciidoc	2012-10-25 21:18:39 UTC (rev 5992)
+++ provenancedb/README.asciidoc	2012-10-26 17:10:18 UTC (rev 5993)
@@ -25,7 +25,7 @@
 A UML diagram of this provenance model is presented in figure <<provdb_schema>>. We simplify the UML notation to abbreviate the information that each annotated entity set (script run, function call, and variable) has one annotation entity set per data type. We define entities that correspond to the Open Provenance Model (OPM) notions of artifact, process, and artifact usage (either being consumed or produced by a process). Annotations, which can be added post-execution, represent information about provenance entities such as object version tags and scientific parameters. 
 
 [[provdb_schema]]
-image::provdb.svg["Swift provenance database schema",width=1280]
+image::provdb.svg["Swift provenance database schema",width=1150]
 
 +script+: contains the script source code used and its hash value.
 
@@ -41,29 +41,28 @@
 
 +dataset+: represents data sets that were assigned to variables in a Swift script.
 
-+annot+: is a key-value pair associated with either a +variable+, +function_call+, or +script_run+. These annotations are used to store context-specific information about the entities of the provenance data model. Examples include scientific-domain parameters, object versions, and user identities. Annotations can also be used to associate a set of runs of a script related to a particular event or study, which we refer to as a campaign.   The +dataset_in+ and +dataset_out+ relationships between +function_call+ and +variable+ define a lineage graph that can be traversed to determine ancestors or descendants of a particular entity. Process dependency and data dependency graphs are derived with transitive queries over these relationships. 
++annot+: is a key-value pair associated with either a +variable+, +function_call+, or +script_run+. The annotations are free-form and can be used, for instance, to record scientific-domain parameters, object versions, and user identities.
 
+The +dataset_in+ and +dataset_out+ relationships between +function_call+ and +variable+ define a lineage graph that can be traversed to determine ancestors or descendants of a particular entity. Process dependency and data dependency graphs are derived with transitive queries over these relationships. 
+
 == Design and Implementation of Swift Provenance Database
 
-In this section, we describe the design and implementation of the Swift Provenance Database provenance query framework. It consists of a set of tools used for extracting provenance information from Swift log files, and a query interface. While its log extractor is specific to Swift, the remainder of the system, including the query interface, is applicable to any parallel functional data flow execution model. The Swift Provenance Database system design is influenced by our survey of provenance queries in many-task computing, where a set of query patterns was identified. The _multiple-step relationships_ (R^*) pattern is implemented by queries that follow the transitive closure of basic provenance relationships, such as data containment hierarchies, and data derivation and consumption. The _run correlation_ (RCr) pattern is implemented by queries for correlating attributes from multiple script runs, such as annotation values or the values of function call parameters.
+The Swift Provenance Database design is influenced by our survey of provenance queries in many-task computing. Built-in functions and stored procedures can be used to follow the transitive closure of basic provenance relationships, such as data containment hierarchies, and data derivation and consumption; or for correlating attributes from multiple script runs, such as annotation values or the values of function call parameters.
 
 
 === Provenance Gathering and Storage
 
-Swift can be configured to add both prospective and retrospective provenance information to the log file it creates to track the behavior of each script run.  The provenance extraction mechanism processes these log files, filters the entries that contain provenance data, and exports this information to a relational SQL database. Each application execution is launched by a wrapper script that sets up the execution environment. We modified these scripts to also gather runtime information, such as memory consumption and processor load.  Additionally, one can define a script that generates annotations in the form of key-value pairs, to be executed immediately before the actual application. These annotations can be exported to the provenance database and associated with the respective application execution. Swift Provenance Database processes the data logged by each wrapper to extract both the runtime information and the annotations, storing them in the provenance database. Additional annotations can be generated per script run 
+Swift can be configured to add both prospective and retrospective provenance information to its log files.  The provenance extraction mechanism processes these log files, filters the entries that contain provenance data, and exports this information to a relational database. Each application execution is started by a wrapper script that sets up the execution environment. We modified these scripts to also gather runtime information, such as memory consumption and processor load.  Additionally, one can define a script that generates annotations in the form of key-value pairs, to be executed immediately before the actual application. These annotations can be exported to the provenance database and associated with the respective application execution. The data logged by each wrapper is processed to extract both the runtime information and the annotations, storing them in the provenance database. Additional annotations can be generated per script run 
 using _ad-hoc_ annotator scripts. In addition to retrospective provenance, Swift Provenance Database keeps prospective provenance by recording the Swift script source code, the application catalog, and the site catalog used in each script run. 
 
-
-
-
 === Query Interface
 
-During the Third Provenance Challenge, we observed that expressing provenance queries in SQL is often cumbersome. For example, such queries require extensive use of complex relational joins, for instance, which are beyond the level of complexity that most domain scientists are willing, or have the time, to master and write. Such usability barriers are increasingly being seen as a critical issue in database management systems. Jagadish et al. propose that ease of use should be a requirement as important as functionality and performance. They observe that, even though general-purpose query languages such as SQL and XQuery allow for the design of powerful queries, they require detailed knowledge of the database schema and rather complex programming to express queries in terms of such schemas. Since databases are often normalized, data is spread through different relations requiring even more extensive use of database join operations when designing 
-queries. Some of the approaches used to improve usability are forms-based query interfaces, visual query builders, and schema summarization.
+During the Third Provenance Challenge, we observed that expressing provenance queries in SQL is often cumbersome. For example, such queries require extensive use of complex relational joins which are beyond the level of complexity that most domain scientists are willing, or have the time, to master and write. Swift Provenance Query Language, SPQL for short, was designed to allow for easier design of provenance queries for common query patterns than can be accomplished with general purpose query languages, such as SQL. 
+In the query interface, every SPQL query is translated into a SQL query that is processed by the underlying relational database. While the syntax of SPQL is by design similar to SQL, it does not require detailed knowledge of the underlying database schema for designing queries, but rather only of the entities in a simpler, higher-level abstract provenance schema, and their respective attributes.  
 
-In this section, we describe our structured provenance query language, SPQL for short, a component of Swift Provenance Database. It was designed to meet the requirements listed in section \ref{provmodel} and to allow for easier formation of provenance queries for the patterns identified  than can be accomplished with general purpose query languages, such as SQL. SPQL supports exploratory queries, where the user seeks information through a sequence of queries that progressively refine previous outputs, instead of having to compose many subqueries into a single complex query, as it is often the case with SQL. Even though our current implementation uses a relational database as the underlying data model for storing provenance information, it should not be dependent on it, we plan to evaluate alternative underlying data models such as graph databases, Datalog, and distributed column-oriented stores. Therefore, in the current implementation, every SPQL query is 
-translated into a SQL query that is processed by the underlying relational database. While the syntax of SPQL is by design similar to SQL, it does not require detailed knowledge of the underlying database schema for designing queries, but rather only of the entities in a simpler, higher-level abstract provenance schema, and their respective attributes. 
+//SPQL supports exploratory queries, where the user seeks information through a sequence of queries that progressively refine previous outputs, instead of having to compose many subqueries into a single complex query, as it is often the case with SQL.
 
+
 The basic building block of a SPQL query consists of a selection query with the following format:
 
 --------------------------------------
@@ -73,13 +72,15 @@
 (order by         orderByClause)))
 --------------------------------------
 
-This syntax is very similar to a selection query in SQL, with a critical usability benefit: hide the complexity of designing extensive join expressions. One does not need to provide all tables of the from clause. Instead, only the entity name is given and the translator reconstructs the underlying entity that was broken apart to produce the normalized schema. As in the relational data model, every query or built-in function results in a table, to preserve the power of SQL in querying results of another query. Selection queries can be composed using the usual set operations: union, intersect, and difference. A +select+ clause is a list with elements of the form +<entity set name>(.<attribute name>)+ or +<built-in function name>(.<return attribute name>)+. If attribute names are omitted, the query returns all the existing attributes of the entity set. SPQL supports the same aggregation, grouping, set 
-operation and ordering constructs provided by SQL.
+Where the optional parts are within parentheses. As in the relational data model, every query or built-in function results in a table, to preserve the power of SQL in querying results of another query. Selection queries can be composed using the usual set operations: union, intersection, and difference. A +select+ clause is a list with elements of the form +<entity set name>(.<attribute name>)+ or +<built-in function name>(.<return attribute name>)+. If attribute names are omitted, the query returns all the existing attributes of the entity set. SPQL supports the same aggregation, grouping, set operation and ordering constructs provided by SQL. 
 
-To simplify the schema that the user needs to understand to design queries, we used database views to define the  higher-level schema presentation shown in Figure. This abstract, OPM-compliant provenance schema, is a simplified view of the physical database schema detailed in section. It groups information related to a provenance entity set in a single relation. The annotation entity set shown is the union of the annotation entity sets of the underlying database, presented in Figure. To avoid defining one annotation table per data type, we use dynamic expression evaluation in the SPQL to SQL translator to determine the required type-specific annotation table of the underlying provenance database.
+To simplify the schema that the user needs to understand to design queries, we used database views to define the  higher-level schema presentation shown in <<provdb_schema_summary>>. This abstract, OPM-compliant provenance schema, is a simplified view of the physical database schema detailed in section. It groups information related to a provenance entity set in a single relation. The annotation entity set shown is the union of the annotation entity sets of the underlying database, presented in Figure. To avoid defining one annotation table per data type, we use dynamic expression evaluation in the SPQL to SQL translator to determine the required type-specific annotation table of the underlying provenance database.
 
-Most of the query patterns identified in \cite{gadelha_provenance_2011-1} are relatively straightforward to express in a relational query language such as SQL, except for the R^* and RCr patterns, which require either recursion or extensive use of relational joins. To abstract queries that match these patterns, we included  in SPQL the following built-in functions to make these common provenance queries easier to express: 
+[[provdb_schema_summary]]
+image::provdb-uml-summary.svg["Summary of Swift provenance database schema",width=800]
 
+To simplify query design, we included  in SPQL the following built-in functions to make these common provenance queries easier to express: 
+
 - +ancestors(object_id})+ returns a table with a single column containing the identifiers of variables and function calls that precede a particular node in a provenance graph stored in the database. 
 
 - +data_dependencies(variable_id})+, related to the previous built-in function, returns the identifiers of variables upon which +variable_id+ depends.
@@ -107,7 +108,7 @@
 --------------------------------------
 
 To further simplify query specification, SPQL uses a generic mechanism for computing the {\em from} clauses and the join expressions of the +where+ clause for the target SQL query. The SPQL to SQL query translator first scans all the entities present in the SPQL query. A shortest path containing all these entities is computed in the graph defined by the schema of the provenance database. All the entities present in this shortest path are listed in the +from+ clause of the target SQL query. The join expressions of the +where+ clause of the target query are computed using the edges of the shortest path, where each edge derives an expression that equates the attributes involved in the foreign key constraint of the entities that define the edge. While this automated join computation facilitates query design, it does somewhat reduce the expressivity of SPQL, as one is not able to perform other types of joins, such as self-joins, explicitly. However, many such queries can be expressed using subqueries, 
-which are supported by SPQL. While some of the expressive power of SQL is thus lost, we show in the sections that follow that SPQL is able to express, with far less effort and complexity, most important and useful queries that provenance query patterns require.  As a quick taste, this SPQL query returns the identifiers of the script runs that either produced or consumed the file +nr+:
+which are supported by SPQL. While some of the expressive power of SQL is thus lost, we show in the sections that follow that SPQL is able to express, with far less effort and complexity, most important and useful queries that provenance query patterns require. For example, this SPQL query returns the value of the parameter +proteinId+ per script run that consumed a file named +nr+:
 
 --------------------------------------
 select  compare_run(parameter='proteinId').run_id  where  file.name='nr';
@@ -164,6 +165,7 @@
 === Swift Configuration
 
 To enable the generation of provenance information in Swift's log files the option +provenance.log+ should be set to true in +etc/swift.properties+:
+
 --------------------------------------
 provenance.log=true
 --------------------------------------
@@ -205,33 +207,30 @@
 SELECT script_filename, swift_version, cog_version, final_state, start_time, duration
 FROM   script_run;
 
-  script_filename   | swift_version | cog_version | final_state |         start_time         | duration 
---------------------+---------------+-------------+-------------+----------------------------+----------
- modis.swift        | 5746          | 3371        | FAIL        | 2012-09-19 17:26:19.221-03 |    2.168
- modis-vortex.swift | 5746          | 3371        | FAIL        | 2012-09-19 17:28:24.809-03 |  180.542
- modis-vortex.swift | 5746          | 3371        | FAIL        | 2012-09-19 17:31:55.706-03 |  312.249
+ script_filename | swift_version | cog_version | final_state |         start_time         | duration 
+-----------------+---------------+-------------+-------------+----------------------------+----------
+ modis.swift     | 5483          | 3339        | SUCCESS     | 2012-10-26 11:46:51.282-02 |  100.724
+ modis.swift     | 5483          | 3339        | SUCCESS     | 2012-10-26 11:44:59.909-02 |   85.050
 --------------------------------------
 
 
 
 --------------------------------------
-select * from ancestors('dataset:20120919-1731-06svjllb:720000000654');
+select * from ancestors('dataset:20121026-1146-jng6bir4:720000001604');
 
                          ancestors                         
-+++-----------------------------------------------------------+++
- modis-vortex-20120919-1731-6fa0kk03:0
- modis-vortex-20120919-1731-6fa0kk03:0-6
- dataset:20120919-1731-06svjllb:720000000335
- dataset:20120919-1731-06svjllb:720000000653
- dataset:20120919-1731-06svjllb:720000000007
- modis-vortex-20120919-1731-6fa0kk03:06svjllb:720000000335
- dataset:20120919-1731-06svjllb:720000000336
- dataset:20120919-1731-06svjllb:720000000337
+---------------------------------------------------------+-
+ modis-20121026-1146-yp9rbbx5:0
+ modis-20121026-1146-yp9rbbx5:0-6
+ dataset:20121026-1146-jng6bir4:720000000335
+ dataset:20121026-1146-jng6bir4:720000000653
+ dataset:20121026-1146-jng6bir4:720000000007
+ modis-20121026-1146-yp9rbbx5:jng6bir4:720000000335
+ dataset:20121026-1146-jng6bir4:720000000336
  ...
- dataset:20120919-1731-06svjllb:720000000042
- dataset:20120919-1731-06svjllb:720000000229
- dataset:20120919-1731-06svjllb:720000000006
-(958 rows)
+ dataset:20121026-1146-jng6bir4:720000000107
+ dataset:20121026-1146-jng6bir4:720000000015
+ dataset:20121026-1146-jng6bir4:720000000006
 --------------------------------------
 
 

Modified: provenancedb/provdb-uml.dia
===================================================================
(Binary files differ)
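
Reviewer sketch (not part of the commit): the README text in this diff says that +dataset_in+ and +dataset_out+ define a lineage graph and that ancestors are found with transitive queries over these relationships. The following is a minimal illustration of that idea, assuming a toy schema in an in-memory SQLite database. The column names (`function_call_id`, `dataset_id`), the sample identifiers, and the recursive query are assumptions made for this example. They are not the actual provenancedb schema or the stored procedure behind `ancestors()`.

```python
import sqlite3


def build_demo_db():
    """Create a toy lineage graph: d1 -> call1 -> d2 -> call2 -> d3.

    Table and column names are hypothetical stand-ins for the
    dataset_in / dataset_out relationships described in the README.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dataset_in  (function_call_id TEXT, dataset_id TEXT);
        CREATE TABLE dataset_out (function_call_id TEXT, dataset_id TEXT);
    """)
    # call1 consumes d1 and produces d2; call2 consumes d2 and produces d3.
    conn.executemany("INSERT INTO dataset_in VALUES (?, ?)",
                     [("call1", "d1"), ("call2", "d2")])
    conn.executemany("INSERT INTO dataset_out VALUES (?, ?)",
                     [("call1", "d2"), ("call2", "d3")])
    return conn


# Recursive common table expression: 'edge' merges both relationships into
# (parent, child) pairs, and 'anc' follows them backwards from the start node.
# UNION (rather than UNION ALL) removes duplicates, so the recursion stops
# once no new ancestor is found.
ANCESTORS_SQL = """
WITH RECURSIVE
  edge(parent, child) AS (
    SELECT dataset_id, function_call_id FROM dataset_in
    UNION
    SELECT function_call_id, dataset_id FROM dataset_out
  ),
  anc(id) AS (
    SELECT parent FROM edge WHERE child = :node
    UNION
    SELECT edge.parent FROM edge JOIN anc ON edge.child = anc.id
  )
SELECT id FROM anc ORDER BY id
"""


def ancestors(conn, node):
    """Return the sorted identifiers of all datasets and calls preceding node."""
    return [row[0] for row in conn.execute(ANCESTORS_SQL, {"node": node})]


if __name__ == "__main__":
    demo = build_demo_db()
    print(ancestors(demo, "d3"))
```

As in the `ancestors(...)` output shown in the diff, the result mixes function call and dataset identifiers in a single column, because both kinds of node lie on the path back from the queried dataset.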