[Swift-commit] r5951 - provenancedb

Wed Sep 26 14:36:41 CDT 2012

Author: lgadelha
Date: 2012-09-26 14:36:41 -0500 (Wed, 26 Sep 2012)
New Revision: 5951

Modified:
   provenancedb/README.asciidoc
Log:


Modified: provenancedb/README.asciidoc
===================================================================

--- provenancedb/README.asciidoc	2012-09-26 16:57:04 UTC (rev 5950)
+++ provenancedb/README.asciidoc	2012-09-26 19:36:41 UTC (rev 5951)
@@ -1,33 +1,29 @@
-= MTCProv - a practical provenance query framework for many-task computing =
+= Swift Provenance Database =
 
 == Introduction ==
 
-MTCProv is a provenance management tool integrated to Swift given by the following components:
+Swift can be configured to gather and store provenance information about script executions. The following tools are available:
 
-. A set of scripts for extracting provenance information from Swift's log files. The extracted data is imported into a relational database, currently PotgreSQL. where it can queried.
+. A set of scripts for extracting provenance information from Swift's log files. The extracted data is imported into a relational database, currently PostgreSQL, where it can be queried.
 
-. A query interface for provenance with a built-in query language called SPQL (Structured Provenance Query Language). SPQL is similar to SQL except for not having +FROM+-clauses and join expressions on the +WHERE+-clause, which are automatically computed for the user. A number of functions and stored procedures that abstract common provenance query patterns are available both in SPQL and SQL.
+. A query interface for provenance with a built-in query language called SPQL (Swift Provenance Query Language). SPQL is similar to SQL except for not having +FROM+-clauses and join expressions on the +WHERE+-clause, which are automatically computed for the user. A number of functions and stored procedures that abstract common provenance query patterns are available both in SPQL and SQL.
 	
-In this section, we present the MTCProv data model for representing provenance of many-task scientific computations. This MTC provenance model is a compatible extension of the Open Provenance Model, in the sense that it is possible to export the data stored by MTCProv to an OPM-compliant graph. It addresses the characteristics of many-task computing, where concurrent component tasks are submitted to parallel and distributed computational resources.  Such resources are subject to failures, and are usually under high demand for executing tasks and transferring data. Science-level performance information, which describes the behavior of an experiment from the  point of view of the scientific domain, is critical for the management of such experiments (for instance, by determining how accurate the outcome of a scientific simulation was, and whether accuracy varies between execution environments). Recording the resource-level performance of such workloads can also assist scientists in managing the life cycle of their computational experiments. In designing MTCProv, we interacted with Swift users from multiple scientific domains, including protein science, and earth sciences, and social network analysis, to support them in designing, executing and analyzing their scientific computations with Swift. From these engagements, we identified the following requirements for MTCProv:
+These tools address the characteristics of many-task computing, where concurrent component tasks are submitted to parallel and distributed computational resources. Such resources are subject to failures, and are usually under high demand for executing tasks and transferring data. Science-level performance information, which describes the behavior of an experiment from the point of view of the scientific domain, is critical for the management of such experiments (for instance, by determining how accurate the outcome of a scientific simulation was, and whether accuracy varies between execution environments). Recording the resource-level performance of such workloads can also assist scientists in managing the life cycle of their computational experiments. Features:
 
-- Gather producer-consumer relationships between data sets and processes. These relationships form the core of provenance information. They enable typical provenance queries to be performed, such as determining all processes and data sets that were involved in the production of a particular data set. This in general requires traversing a graph defined by these relationships. Users should be able, for instance, to check the usage of a given file by different many-task application runs.
+- Gathering of producer-consumer relationships between data sets and processes. 
 
-- Gather hierarchical relationships between data sets. Swift supports hierarchical data sets, such as arrays and structures. For instance, a user can map input files stored in a given directory to an array, and later process these files in parallel using a {\tt foreach construct. Fine-grained recording of data set usage details should be supported, so that a user can trace, for instance, that an array was passed to a procedure, and that an individual array member was used by some sub-procedure. This is usually achieved by recording constructors and accessors of arrays as processes in a provenance trace.
+- Gathering of hierarchical relationships between data sets. 
 
-- Gather versioned information of the specifications of many-task scientific computations and of their component applications. As the specifications of many-task computations (e.g., Swift scripts), and their component applications (e.g., Swift leaf functions) can evolve over time, scientists can benefit from keeping track of which version they are using in a given run. In some cases, the scientist also acts as the developer of a component application, which can result in frequent component application version updates during a workflow lifecycle.
+- Gathering of versioned information of the specifications of many-task scientific computations and of their component applications. 
 
-- Allow users to enrich their provenance records with annotations. Annotations are usually specified as key-value pairs that can be used, for instance, to record resource-level and science-level performance, like input and output scientific parameters, and usage statistics from computational resources. Annotations are useful for extending the provenance data model when required information is not captured in the standard system data flow. For instance, many scientific applications use textual configuration files to specify the parameters of a simulation. Yet automated provenance management systems usually record only the consumption of the configuration file by the scientific application, but preserve no information about its content (which is what the scientist really needs to know).
+- Enrichment of provenance records with user-supplied annotations.
 
-- Gather runtime information about component application executions. Systems like Swift support many diverse parallel and distributed environments. Depending on the available applications at each site and on job scheduling heuristics, a computational task can be executed on a local host, a high performance computing cluster, or a remote grid or cloud site. Scientists often can benefit from having access to details about these executions, such as where each job was executed, the amount of time a job had to wait on a scheduler's queue, the duration of its actual execution, its memory and processor consumption, and the volume and rate of file system and/or network IO operations.
+- Gathering of runtime information about component application executions. 
 
-- Provide a usable and useful query interface for provenance information. While the relational model is ideal for many aspects to store provenance relationships, it is often cumbersome to write SQL queries that require joining the many relations required to implement the OPM. For provenance to become a standard part of the e-Science methodology, it must be easy for scientists to formulate queries and interpret their outputs. Queries can often be of exploratory nature, where provenance information is analyzed in many steps that refine previous query outputs. Improving usability is usually achievable through the specification of a provenance query language, or by making available a set of stored procedures that abstract common provenance queries. \emph{In this work we propose a query interface which both extends and simplifies standard SQL by automating join specifications and abstracting common provenance query patterns into built-in functions.
+- A usable query interface for provenance information.
 
-Some of these requirements have been previously identified by Miles et al. However few works to date emphasize usability and applications of provenance represented by requirements  through , which are the main objectives of this work. As a first step toward meeting these requirements, we propose a data model for the provenance of many-task scientific computations. A UML diagram of this provenance model is presented in Figure . We simplify the UML notation to abbreviate the information that each annotated entity set (script run, function call, and variable) has one annotation entity set per data type. We define entities that correspond to the OPM notions of artifact, process, and artifact usage (either being consumed or produced by a process). These are augmented with entities used to represent many-task scientific computations, and to allow for entity annotations. Such annotations, which can be added post-execution, represent 
-information about provenance entities  such as object version tags and scientific parameters. 
+A UML diagram of this provenance model is presented in Figure . We simplify the UML notation to abbreviate the information that each annotated entity set (script run, function call, and variable) has one annotation entity set per data type. We define entities that correspond to the Open Provenance Model (OPM) notions of artifact, process, and artifact usage (either being consumed or produced by a process). These are augmented with entities used to represent many-task scientific computations, and to allow for entity annotations. Such annotations, which can be added post-execution, represent information about provenance entities  such as object version tags and scientific parameters. 
 
-Some of these requirements have been previously identified by Miles et al. However few works to date emphasize usability and applications of provenance represented by requirements  through , which are the main objectives of this work. As a first step toward meeting these requirements, we propose a data model for the provenance of many-task scientific computations. A UML diagram of this provenance model is presented in Figure . We simplify the UML notation to abbreviate the information that each annotated entity set (script run, function call, and variable) has one annotation entity set per data type. We define entities that correspond to the OPM notions of artifact, process, and artifact usage (either being consumed or produced by a process). These are augmented with entities used to represent many-task scientific computations, and to allow for entity annotations. Such annotations, which can be added post-execution, represent 
-information about provenance entities  such as object version tags and scientific parameters. 
-
 +script_run+: refers to the execution (successful or unsuccessful) of an entire many-task scientific computation, which in Swift is specified as the execution of a complete parallel script from start to finish. 
 
 +function_call+: records calls to Swift functions. These calls take as input data sets, such as values stored in primitive variables or files referenced by mapped variables; perform some computation specified in the respective function declaration; and produce data sets as output. In Swift, function calls can represent invocations of external applications, built-in functions, and operators; each function call is associated with the script run that invoked it.
@@ -43,38 +39,27 @@
 +annot+: is a key-value pair associated with either a +variable+, +function_call+, or +script_run+. These annotations are used to store context-specific information about the entities of the provenance data model. Examples include scientific-domain parameters, object versions, and user identities. Annotations can also be used to associate a set of runs of a script related to a particular event or study, which we refer to as a campaign.   The +dataset_in+ and +dataset_out+ relationships between +function_call+ and +variable+ define a lineage graph that can be traversed to determine ancestors or descendants of a particular entity. Process dependency and data dependency graphs are derived with transitive queries over these relationships. 
 
 The provenance model presented here is a significant refinement of a previous one used by Swift in the Third Provenance Challenge, which was shown to be similar to OPM. This similarity is retained in the current version of the model, which adds support for annotations and runtime information on component application executions. +function_call+ corresponds to OPM processes, and +variable+ corresponds to OPM artifacts as immutable data objects. The OPM entity agent controls OPM processes, e.g. starting or terminating them. While we do not explicitly define an entity type for such agents, this information can be stored in the annotation tables of the +function_call+ or +script_run+ entities. To record which users controlled each script or function call execution, one can gather the associated POSIX userids, when executions are local, or the distinguished name of network security credentials, when executions cross security domains, and store them as annotations for the 
-respective entity. This is equivalent to establishing an OPM _wasControlledBy_ relationship. The dependency relationships, _used_ and _wasGeneratedBy_, as defined in OPM, correspond to our +dataset\_in+ and +dataset\_out+ relationships, respectively. Our data model has additional entity sets to capture behavior that is specific to parallel and distributed systems, to distinguish, for instance, between application invocations and execution attempts. We currently do not directly support the OPM concept of _account_, which can describe the same computation using different levels of abstraction. However, one could use annotations to associate one or more such accounts with an execution entity. Based on the mapping to OPM described here, MTCProv provides tools for exporting the provenance database into OPM provenance graph interchange format, which provides interoperability with other OPM-compatible provenance systems.  
+respective entity. This is equivalent to establishing an OPM _wasControlledBy_ relationship. The dependency relationships, _used_ and _wasGeneratedBy_, as defined in OPM, correspond to our +dataset\_in+ and +dataset\_out+ relationships, respectively. Our data model has additional entity sets to capture behavior that is specific to parallel and distributed systems, to distinguish, for instance, between application invocations and execution attempts. We currently do not directly support the OPM concept of _account_, which can describe the same computation using different levels of abstraction. However, one could use annotations to associate one or more such accounts with an execution entity. Based on the mapping to OPM described here, Swift Provenance Database provides tools for exporting the provenance database into OPM provenance graph interchange format, which provides interoperability with other OPM-compatible provenance systems.  
 
-== Design and Implementation of MTCProv
+== Design and Implementation of Swift Provenance Database
 
-In this section, we describe the design and implementation of the MTCProv provenance query framework. It consists of a set of tools used for extracting provenance information from Swift log files, and a query interface. While its log extractor is specific to Swift, the remainder of the system, including the query interface, is applicable to any parallel functional data flow execution model. The MTCProv system design is influenced by our survey of provenance queries in many-task computing, where a set of query patterns was identified. The _multiple-step relationships_ (R^*) pattern is implemented by queries that follow the transitive closure of basic provenance relationships, such as data containment hierarchies, and data derivation and consumption. The _run correlation_ (RCr) pattern is implemented by queries for correlating attributes from multiple script runs, such as annotation values or the values of function call parameters.
+In this section, we describe the design and implementation of the Swift Provenance Database query framework. It consists of a set of tools used for extracting provenance information from Swift log files, and a query interface. While its log extractor is specific to Swift, the remainder of the system, including the query interface, is applicable to any parallel functional data flow execution model. The Swift Provenance Database system design is influenced by our survey of provenance queries in many-task computing, where a set of query patterns was identified. The _multiple-step relationships_ (R^*) pattern is implemented by queries that follow the transitive closure of basic provenance relationships, such as data containment hierarchies, and data derivation and consumption. The _run correlation_ (RCr) pattern is implemented by queries for correlating attributes from multiple script runs, such as annotation values or the values of function call parameters.
 
 
 === Provenance Gathering and Storage
 
-Swift can be configured to add both prospective and retrospective provenance information to the log file it creates to track the behavior of each script run.  The provenance extraction mechanism processes these log files, filters the entries that contain provenance data, and exports this information to a relational SQL database. Each application execution is launched by a wrapper script that sets up the execution environment. We modified these scripts to also gather runtime information, such as memory consumption and processor load.  Additionally, one can define a script that generates annotations in the form of key-value pairs, to be executed immediately before the actual application. These annotations can be exported to the provenance database and associated with the respective application execution. MTCProv processes the data logged by each wrapper to extract both the runtime information and the annotations, storing them in the provenance database. Additional annotations can be generated per script run 
-using _ad-hoc_ annotator scripts. In addition to retrospective provenance, MTCProv keeps prospective provenance by recording the Swift script source code, the application catalog, and the site catalog used in each script run. 
+Swift can be configured to add both prospective and retrospective provenance information to the log file it creates to track the behavior of each script run.  The provenance extraction mechanism processes these log files, filters the entries that contain provenance data, and exports this information to a relational SQL database. Each application execution is launched by a wrapper script that sets up the execution environment. We modified these scripts to also gather runtime information, such as memory consumption and processor load.  Additionally, one can define a script that generates annotations in the form of key-value pairs, to be executed immediately before the actual application. These annotations can be exported to the provenance database and associated with the respective application execution. Swift Provenance Database processes the data logged by each wrapper to extract both the runtime information and the annotations, storing them in the provenance database. Additional annotations can be generated per script run 
+using _ad-hoc_ annotator scripts. In addition to retrospective provenance, Swift Provenance Database keeps prospective provenance by recording the Swift script source code, the application catalog, and the site catalog used in each script run. 
 
-Provenance information is frequently stored in a relational database. RDBMS's are well known for their reliability, performance and consistency properties. Some shortcomings of the relational data model for managing provenance are mentioned by Zhao et al., such as its use of fixed schemas, and weak support for recursive queries. Despite using a fixed schema, our data model allows for key-value annotations, which gives it some flexibility to store information not explicitly defined in the schema. The SQL:1999 standard, which is supported by many relational database management systems, has native constructs for performing recursive queries. Ordonez proposed recursive query optimizations that can enable transitive closure computation in linear time complexity on binary trees, and quadratic time complexity on sparse graphs. Relationship transitive closures, which are required by recursive R^* pattern queries, are well supported by graph-based data 
-models, however  many interesting queries require aggregation of entity attributes. Aggregate queries on a graph can support grouping node or edge attributes that are along a path. However, some provenance queries require the grouping of node attributes that are sparsely spread across a graph. These require potentially costly graph traversals, whereas in the relational data model, well-supported aggregate operations can implement such operations efficiently.  
 
-To keep track of file usage across script runs, we record its hash function value as an alternative identifier. This enables traversing the provenance graphs of different script runs by detecting file reuse. Persistent unique identifiers for files could be provided by a data curation system with support for object versioning. However, due to the heterogeneity of parallel and distributed environments, one cannot assume the availability of such systems. 
 
 
-
-It is based on runs of a sample iterative script using a different number of iterations on each run. 
-The workflow specified in the script has a structure that is commonly found in many Swift applications. In this case, the log file size tends to grow about 55\% in size when provenance gathering is enabled, but once they are exported to the provenance database they are no longer needed by MTCProv. 
-The database size scales linearly with the number of iterations in the script. The initial set up of the database, which contains information such as integrity constraints, indices, and schema, take some space in addition to the data. As the database grows, it tends to consume less space than the log file. For instance, a 10,000 iteration run of the sample script produces a 55MB log file, while the total size of a provenance database containing only this run is 42MB. In addition, successful relational parallel database techniques can partition the provenance 
-database and obtain high performance query processing. The average total time to complete a script run shows a negligible impact when provenance gathering is enabled. In a few cases the script was executed faster with provenance gathering enabled, which indicates that other factors, such as the task execution scheduling heuristics used by Swift and operating system noise, have a higher impact in the total execution time.
-
-
-
 === Query Interface
 
 During the Third Provenance Challenge, we observed that expressing provenance queries in SQL is often cumbersome. Such queries require extensive use of complex relational joins, which are beyond the level of complexity that most domain scientists are willing, or have the time, to master and write. Such usability barriers are increasingly being seen as a critical issue in database management systems. Jagadish et al. propose that ease of use should be a requirement as important as functionality and performance. They observe that, even though general-purpose query languages such as SQL and XQuery allow for the design of powerful queries, they require detailed knowledge of the database schema and rather complex programming to express queries in terms of such schemas. Since databases are often normalized, data is spread through different relations requiring even more extensive use of database join operations when designing 
 queries. Some of the approaches used to improve usability are forms-based query interfaces, visual query builders, and schema summarization.
 
-In this section, we describe our structured provenance query language, SPQL for short, a component of MTCProv. It was designed to meet the requirements listed in section \ref{provmodel} and to allow for easier formation of provenance queries for the patterns identified  than can be accomplished with general purpose query languages, such as SQL. SPQL supports exploratory queries, where the user seeks information through a sequence of queries that progressively refine previous outputs, instead of having to compose many subqueries into a single complex query, as it is often the case with SQL. Even though our current implementation uses a relational database as the underlying data model for storing provenance information, it should not be dependent on it, we plan to evaluate alternative underlying data models such as graph databases, Datalog, and distributed column-oriented stores. Therefore, in the current implementation, every SPQL query is 
+In this section, we describe the Swift Provenance Query Language, SPQL for short, a component of Swift Provenance Database. It was designed to meet the requirements listed in the Introduction and to allow for easier formulation of provenance queries for the identified query patterns than can be accomplished with general-purpose query languages, such as SQL. SPQL supports exploratory queries, where the user seeks information through a sequence of queries that progressively refine previous outputs, instead of having to compose many subqueries into a single complex query, as is often the case with SQL. Even though our current implementation uses a relational database as the underlying data model for storing provenance information, it should not depend on it; we plan to evaluate alternative underlying data models such as graph databases, Datalog, and distributed column-oriented stores. Therefore, in the current implementation, every SPQL query is 
 translated into a SQL query that is processed by the underlying relational database. While the syntax of SPQL is by design similar to SQL, it does not require detailed knowledge of the underlying database schema for designing queries, but rather only of the entities in a simpler, higher-level abstract provenance schema, and their respective attributes. 
 
 The basic building block of a SPQL query consists of a selection query with the following format:
@@ -126,7 +111,7 @@
 select  compare_run(parameter='proteinId').run_id  where  file.name='nr';
 --------------------------------------
 
-This SPQL query is translated by MTCProv to the following SQL query:
+This SPQL query is translated by Swift Provenance Database to the following SQL query:
 
 --------------------------------------
 select compare_run1.run_id
@@ -142,7 +127,7 @@
 
 == Tutorial ==
 
-MTCProv is a set of scripts, SQL functions and stored procedures, and a query interface. It extracts provenance information from Swift's log files into a relational database. The tools are downloadable through SVN with the command:
+Swift Provenance Database is a set of scripts, SQL functions and stored procedures, and a query interface. It extracts provenance information from Swift's log files into a relational database. The tools are downloadable through SVN with the command:
 
 --------------------------------------
 svn co https://svn.ci.uchicago.edu/svn/vdl2/provenancedb
@@ -150,7 +135,7 @@
 
 === Database Configuration
 
-MTCProv depends on PostgreSQL, version 9.0 or later, due to the use of _Common Table Expressions_ for computing transitive closures of data derivation relationships, supported only on these versions. The file +prov-init.sql+ contains the database schema, and the file +pql_functions.sql+ contain the function and stored procedure definitions. If the user has not created a provenance database yet, this can be done with the following commands (one may need to add "+-U+ _username_" and "+-h+ _hostname_" before the database name "+provdb+", depending on the database server configuration):
+Swift Provenance Database depends on PostgreSQL, version 9.0 or later, due to the use of _Common Table Expressions_ for computing transitive closures of data derivation relationships, supported only on these versions. The file +prov-init.sql+ contains the database schema, and the file +pql_functions.sql+ contains the function and stored procedure definitions. If the user has not created a provenance database yet, this can be done with the following commands (one may need to add "+-U+ _username_" and "+-h+ _hostname_" before the database name "+provdb+", depending on the database server configuration):
 
 --------------------------------------
 createdb provdb
@@ -158,7 +143,7 @@
 psql -f pql-functions.sql provdb
 --------------------------------------
 
-=== MTCProv Configuration
+=== Swift Provenance Database Configuration
 
 The file +etc/provenance.config+ should be edited to define the database configuration. The location of the directory containing the log files should be defined in the variable +LOGREPO+. For instance: