[Swift-commit] r5945 - in provenancedb: . swift_mod

lgadelha at ci.uchicago.edu lgadelha at ci.uchicago.edu
Tue Sep 25 08:24:01 CDT 2012


Author: lgadelha
Date: 2012-09-25 08:24:00 -0500 (Tue, 25 Sep 2012)
New Revision: 5945

Added:
   provenancedb/README.asciidoc
Modified:
   provenancedb/README
   provenancedb/info-to-runtime
   provenancedb/prov-to-sql.sh
   provenancedb/swift_mod/_swiftwrap_runtime_snapshots
Log:
Added documentation in asciidoc. 


Modified: provenancedb/README
===================================================================
--- provenancedb/README	2012-09-24 20:27:28 UTC (rev 5944)
+++ provenancedb/README	2012-09-25 13:24:00 UTC (rev 5945)
@@ -1,3 +1,29 @@
+MTCProv - a practical provenance query framework for many-task computing
+
+- Introduction
+
+MTCProv is a provenance management tool integrated with Swift, consisting of the following components:
+
+1) A set of scripts for extracting provenance information from Swift's log files. The extracted data is imported into a relational database, currently PostgreSQL, where it can be queried.
+
+2) A query interface for provenance with a built-in query language called SPQL (Structured Provenance Query Language). SPQL is similar to SQL, except that it has no FROM clauses or join expressions in the WHERE clause; these are computed automatically for the user. A number of functions and stored procedures that abstract common provenance query patterns are available in both SPQL and SQL.
+	
+In this section, we present the MTCProv data model for representing provenance of many-task scientific computations. This MTC provenance model is a compatible extension of the Open Provenance Model, in the sense that it is possible to export the data stored by MTCProv to an OPM-compliant graph. It addresses the characteristics of many-task computing, where concurrent component tasks are submitted to parallel and distributed computational resources. Such resources are subject to failures, and are usually under high demand for executing tasks and transferring data. Science-level performance information, which describes the behavior of an experiment from the point of view of the scientific domain, is critical for the management of such experiments (for instance, by determining how accurate the outcome of a scientific simulation was, and whether accuracy varies between execution environments). Recording the resource-level performance of such workloads can also assist scientists in managing the life cycle of their computational experiments. In designing MTCProv, we interacted with Swift users from multiple scientific domains, including protein science, earth sciences, and social network analysis, to support them in designing, executing, and analyzing their scientific computations with Swift. From these engagements, we identified the following requirements for MTCProv:
+
+* Gather producer-consumer relationships between data sets and processes. These relationships form the core of provenance information. They enable typical provenance queries to be performed, such as determining all processes and data sets that were involved in the production of a particular data set. This in general requires traversing a graph defined by these relationships. Users should be able, for instance, to check the usage of a given file by different many-task application runs.
+
+* Gather hierarchical relationships between data sets. Swift supports hierarchical data sets, such as arrays and structures. For instance, a user can map input files stored in a given directory to an array, and later process these files in parallel using a foreach construct. Fine-grained recording of data set usage details should be supported, so that a user can trace, for instance, that an array was passed to a procedure, and that an individual array member was used by some sub-procedure. This is usually achieved by recording constructors and accessors of arrays as processes in a provenance trace.
+
+* Gather versioned information of the specifications of many-task scientific computations and of their component applications. As the specifications of many-task computations (e.g., Swift scripts), and their component applications (e.g., Swift leaf functions) can evolve over time, scientists can benefit from keeping track of which version they are using in a given run. In some cases, the scientist also acts as the developer of a component application, which can result in frequent component application version updates during a workflow lifecycle.
+
+* Allow users to enrich their provenance records with annotations. Annotations are usually specified as key-value pairs that can be used, for instance, to record resource-level and science-level performance, like input and output scientific parameters, and usage statistics from computational resources. Annotations are useful for extending the provenance data model when required information is not captured in the standard system data flow. For instance, many scientific applications use textual configuration files to specify the parameters of a simulation. Yet automated provenance management systems usually record only the consumption of the configuration file by the scientific application, but preserve no information about its content (which is what the scientist really needs to know).
+
+* Gather runtime information about component application executions. Systems like Swift support many diverse parallel and distributed environments. Depending on the available applications at each site and on job scheduling heuristics, a computational task can be executed on a local host, a high performance computing cluster, or a remote grid or cloud site. Scientists often can benefit from having access to details about these executions, such as where each job was executed, the amount of time a job had to wait on a scheduler's queue, the duration of its actual execution, its memory and processor consumption, and the volume and rate of file system and/or network IO operations.
+
+* Provide a usable and useful query interface for provenance information. While the relational model is ideal in many respects for storing provenance relationships, it is often cumbersome to write SQL queries that require joining the many relations required to implement the OPM. For provenance to become a standard part of the e-Science methodology, it must be easy for scientists to formulate queries and interpret their outputs. Queries can often be of an exploratory nature, where provenance information is analyzed in many steps that refine previous query outputs. Improving usability is usually achievable through the specification of a provenance query language, or by making available a set of stored procedures that abstract common provenance queries. In this work we propose a query interface which both extends and simplifies standard SQL by automating join specifications and abstracting common provenance query patterns into built-in functions.
+
+Tutorial
+
 The file etc/provenance.config should be edited to define the local configuration. The location of the directory containing the log files should be defined in the variable LOGREPO. For instance:
 
 export LOGREPO=~/swift-logs/

Added: provenancedb/README.asciidoc
===================================================================
--- provenancedb/README.asciidoc	                        (rev 0)
+++ provenancedb/README.asciidoc	2012-09-25 13:24:00 UTC (rev 5945)
@@ -0,0 +1,252 @@
+= MTCProv - a practical provenance query framework for many-task computing =
+
+== Introduction ==
+
+MTCProv is a provenance management tool integrated with Swift, consisting of the following components:
+
+. A set of scripts for extracting provenance information from Swift's log files. The extracted data is imported into a relational database, currently PostgreSQL, where it can be queried.
+
+. A query interface for provenance with a built-in query language called SPQL (Structured Provenance Query Language). SPQL is similar to SQL, except that it has no +FROM+ clauses or join expressions in the +WHERE+ clause; these are computed automatically for the user. A number of functions and stored procedures that abstract common provenance query patterns are available in both SPQL and SQL.
+	
+In this section, we present the MTCProv data model for representing provenance of many-task scientific computations. This MTC provenance model is a compatible extension of the Open Provenance Model, in the sense that it is possible to export the data stored by MTCProv to an OPM-compliant graph. It addresses the characteristics of many-task computing, where concurrent component tasks are submitted to parallel and distributed computational resources. Such resources are subject to failures, and are usually under high demand for executing tasks and transferring data. Science-level performance information, which describes the behavior of an experiment from the point of view of the scientific domain, is critical for the management of such experiments (for instance, by determining how accurate the outcome of a scientific simulation was, and whether accuracy varies between execution environments). Recording the resource-level performance of such workloads can also assist scientists in managing the life cycle of their computational experiments. In designing MTCProv, we interacted with Swift users from multiple scientific domains, including protein science, earth sciences, and social network analysis, to support them in designing, executing, and analyzing their scientific computations with Swift. From these engagements, we identified the following requirements for MTCProv:
+
+- Gather producer-consumer relationships between data sets and processes. These relationships form the core of provenance information. They enable typical provenance queries to be performed, such as determining all processes and data sets that were involved in the production of a particular data set. This in general requires traversing a graph defined by these relationships. Users should be able, for instance, to check the usage of a given file by different many-task application runs.
+
+- Gather hierarchical relationships between data sets. Swift supports hierarchical data sets, such as arrays and structures. For instance, a user can map input files stored in a given directory to an array, and later process these files in parallel using a +foreach+ construct. Fine-grained recording of data set usage details should be supported, so that a user can trace, for instance, that an array was passed to a procedure, and that an individual array member was used by some sub-procedure. This is usually achieved by recording constructors and accessors of arrays as processes in a provenance trace.
+
+- Gather versioned information of the specifications of many-task scientific computations and of their component applications. As the specifications of many-task computations (e.g., Swift scripts), and their component applications (e.g., Swift leaf functions) can evolve over time, scientists can benefit from keeping track of which version they are using in a given run. In some cases, the scientist also acts as the developer of a component application, which can result in frequent component application version updates during a workflow lifecycle.
+
+- Allow users to enrich their provenance records with annotations. Annotations are usually specified as key-value pairs that can be used, for instance, to record resource-level and science-level performance, like input and output scientific parameters, and usage statistics from computational resources. Annotations are useful for extending the provenance data model when required information is not captured in the standard system data flow. For instance, many scientific applications use textual configuration files to specify the parameters of a simulation. Yet automated provenance management systems usually record only the consumption of the configuration file by the scientific application, but preserve no information about its content (which is what the scientist really needs to know).
+
+- Gather runtime information about component application executions. Systems like Swift support many diverse parallel and distributed environments. Depending on the available applications at each site and on job scheduling heuristics, a computational task can be executed on a local host, a high performance computing cluster, or a remote grid or cloud site. Scientists often can benefit from having access to details about these executions, such as where each job was executed, the amount of time a job had to wait on a scheduler's queue, the duration of its actual execution, its memory and processor consumption, and the volume and rate of file system and/or network IO operations.
+
+- Provide a usable and useful query interface for provenance information. While the relational model is ideal in many respects for storing provenance relationships, it is often cumbersome to write SQL queries that require joining the many relations required to implement the OPM. For provenance to become a standard part of the e-Science methodology, it must be easy for scientists to formulate queries and interpret their outputs. Queries can often be of an exploratory nature, where provenance information is analyzed in many steps that refine previous query outputs. Improving usability is usually achievable through the specification of a provenance query language, or by making available a set of stored procedures that abstract common provenance queries. _In this work we propose a query interface which both extends and simplifies standard SQL by automating join specifications and abstracting common provenance query patterns into built-in functions._
+
+Some of these requirements have been previously identified by Miles et al. However, few works to date emphasize usability and applications of provenance, which are the main objectives of this work. As a first step toward meeting these requirements, we propose a data model for the provenance of many-task scientific computations. We simplify the UML notation to indicate that each annotated entity set (script run, function call, and variable) has one annotation entity set per data type. We define entities that correspond to the OPM notions of artifact, process, and artifact usage (either being consumed or produced by a process). These are augmented with entities used to represent many-task scientific computations, and to allow for entity annotations. Such annotations, which can be added post-execution, represent information about provenance entities, such as object version tags and scientific parameters.
+
++script_run+: refers to the execution (successful or unsuccessful) of an entire many-task scientific computation, which in Swift is specified as the execution of a complete parallel script from start to finish. 
+
++function_call+: records calls to Swift functions. These calls take as input data sets, such as values stored in primitive variables or files referenced by mapped variables; perform some computation specified in the respective function declaration; and produce data sets as output. In Swift, function calls can represent invocations of external applications, built-in functions, and operators; each function call is associated with the script run that invoked it.
+
++app_fun_call+: represents an invocation of one component application of a many-task scientific computation. In Swift, it is generated by an invocation to an external application. External applications are listed in an application catalog along with the computational resources on which they can be executed.
+
++application_execution+: represents execution attempts of an external application. Each application function call triggers one or more execution attempts, where one (or, in the case of retries or replication, several) particular computational resource(s) will be selected to actually execute the application.
+
++runtime_info+: contains information associated with an application execution, such as resource consumption.
+
++variable+: represents data sets that were assigned to variables in a Swift script. Variable types can be atomic or composite. Atomic types are primitive types, such as integers and strings, recorded in the relation Primitive, or mapped types, recorded in the relation Mapped. Mapped types are used for declaring and accessing data that is stored in files. Composite types are given by structures and arrays. Containment relationships define a hierarchy where each variable may have child variables (when it is a structure or an array), or a parent variable (when it is a member of a collection).  A variable may have as attributes a value, when it is a primitive variable; or a filename, when it is a mapped file.
+
++annot+: is a key-value pair associated with either a +variable+, +function_call+, or +script_run+. These annotations are used to store context-specific information about the entities of the provenance data model. Examples include scientific-domain parameters, object versions, and user identities. Annotations can also be used to associate a set of runs of a script related to a particular event or study, which we refer to as a campaign.   The +dataset_in+ and +dataset_out+ relationships between +function_call+ and +variable+ define a lineage graph that can be traversed to determine ancestors or descendants of a particular entity. Process dependency and data dependency graphs are derived with transitive queries over these relationships. 
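+
+As an illustrative sketch (the column names used here are hypothetical, not the exact physical schema), the runs belonging to a campaign could be retrieved by joining +script_run+ with its annotation table:
+
+--------------------------------------
+-- Hypothetical SQL sketch: script runs annotated with
+-- campaign = 'summer-2012' (table/column names illustrative).
+SELECT s.id
+FROM   script_run s, annot a
+WHERE  a.script_run_id = s.id
+AND    a.name  = 'campaign'
+AND    a.value = 'summer-2012';
+--------------------------------------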
+
+The provenance model presented here is a significant refinement of a previous one used by Swift in the Third Provenance Challenge, which was shown to be similar to OPM. This similarity is retained in the current version of the model, which adds support for annotations and runtime information on component application executions. +function_call+ corresponds to OPM processes, and +variable+ corresponds to OPM artifacts as immutable data objects. The OPM entity agent controls OPM processes, e.g. starting or terminating them. While we do not explicitly define an entity type for such agents, this information can be stored in the annotation tables of the +function_call+ or +script_run+ entities. To record which users controlled each script or function call execution, one can gather the associated POSIX userids, when executions are local, or the distinguished name of network security credentials, when executions cross security domains, and store them as annotations for the 
+respective entity. This is equivalent to establishing an OPM _wasControlledBy_ relationship. The dependency relationships, _used_ and _wasGeneratedBy_, as defined in OPM, correspond to our +dataset_in+ and +dataset_out+ relationships, respectively. Our data model has additional entity sets to capture behavior that is specific to parallel and distributed systems, to distinguish, for instance, between application invocations and execution attempts. We currently do not directly support the OPM concept of _account_, which can describe the same computation using different levels of abstraction. However, one could use annotations to associate one or more such accounts with an execution entity. Based on the mapping to OPM described here, MTCProv provides tools for exporting the provenance database into the OPM provenance graph interchange format, which provides interoperability with other OPM-compatible provenance systems.
+
+== Design and Implementation of MTCProv ==
+
+In this section, we describe the design and implementation of the MTCProv provenance query framework. It consists of a set of tools used for extracting provenance information from Swift log files, and a query interface. While its log extractor is specific to Swift, the remainder of the system, including the query interface, is applicable to any parallel functional data flow execution model. The MTCProv system design is influenced by our survey of provenance queries in many-task computing, where a set of query patterns was identified. The _multiple-step relationships_ (R^*) pattern is implemented by queries that follow the transitive closure of basic provenance relationships, such as data containment hierarchies, and data derivation and consumption. The _run correlation_ (RCr) pattern is implemented by queries for correlating attributes from multiple script runs, such as annotation values or the values of function call parameters.
+
+
+=== Provenance Gathering and Storage
+
+Swift can be configured to add both prospective and retrospective provenance information to the log file it creates to track the behavior of each script run. The provenance extraction mechanism processes these log files, filters the entries that contain provenance data, and exports this information to a relational SQL database. Each application execution is launched by a wrapper script that sets up the execution environment. We modified these scripts to also gather runtime information, such as memory consumption and processor load. Additionally, one can define a script that generates annotations in the form of key-value pairs, to be executed immediately before the actual application. These annotations can be exported to the provenance database and associated with the respective application execution. MTCProv processes the data logged by each wrapper to extract both the runtime information and the annotations, storing them in the provenance database. Additional annotations can be generated per script run using _ad-hoc_ annotator scripts. In addition to retrospective provenance, MTCProv keeps prospective provenance by recording the Swift script source code, the application catalog, and the site catalog used in each script run.
+
+Provenance information is frequently stored in a relational database. RDBMSs are well known for their reliability, performance, and consistency properties. Some shortcomings of the relational data model for managing provenance are mentioned by Zhao et al., such as its use of fixed schemas and weak support for recursive queries. Despite using a fixed schema, our data model allows for key-value annotations, which gives it some flexibility to store information not explicitly defined in the schema. The SQL:1999 standard, which is supported by many relational database management systems, has native constructs for performing recursive queries. Ordonez proposed recursive query optimizations that can enable transitive closure computation in linear time on binary trees and in quadratic time on sparse graphs. Relationship transitive closures, which are required by recursive R^* pattern queries, are well supported by graph-based data models; however, many interesting queries require aggregation of entity attributes. Aggregate queries on a graph can support grouping node or edge attributes that lie along a path. However, some provenance queries require the grouping of node attributes that are sparsely spread across a graph. These require potentially costly graph traversals, whereas in the relational data model, well-supported aggregate operations can implement such operations efficiently.
+
+To keep track of file usage across script runs, we record each file's hash value as an alternative identifier. This enables traversing the provenance graphs of different script runs by detecting file reuse. Persistent unique identifiers for files could be provided by a data curation system with support for object versioning. However, due to the heterogeneity of parallel and distributed environments, one cannot assume the availability of such systems.
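+
+For illustration, assuming the hash is stored in a +hash+ column of the +variable+ relation (the actual column names may differ), file reuse across runs can be detected with a self-join on the hash value:
+
+--------------------------------------
+-- Hypothetical SQL sketch: pairs of script runs that used
+-- a file with identical content (same hash value).
+SELECT DISTINCT v1.run_id, v2.run_id
+FROM   variable v1, variable v2
+WHERE  v1.hash = v2.hash
+AND    v1.run_id < v2.run_id;
+--------------------------------------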
+
+
+
+Our evaluation of the overhead of provenance gathering is based on runs of a sample iterative script, using a different number of iterations on each run.
+The workflow specified in the script has a structure that is commonly found in many Swift applications. In this case, the log file grows by about 55% in size when provenance gathering is enabled, but once the logs are exported to the provenance database they are no longer needed by MTCProv.
+The database size scales linearly with the number of iterations in the script. The initial set-up of the database, which contains information such as integrity constraints, indices, and the schema, takes some space in addition to the data. As the database grows, it tends to consume less space than the log files. For instance, a 10,000-iteration run of the sample script produces a 55MB log file, while the total size of a provenance database containing only this run is 42MB. In addition, established relational parallel database techniques can partition the provenance
+database to obtain high-performance query processing. The average total time to complete a script run shows a negligible impact when provenance gathering is enabled. In a few cases the script executed faster with provenance gathering enabled, which indicates that other factors, such as the task execution scheduling heuristics used by Swift and operating system noise, have a higher impact on the total execution time.
+
+
+
+=== Query Interface
+
+During the Third Provenance Challenge, we observed that expressing provenance queries in SQL is often cumbersome. Such queries require extensive use of complex relational joins, which are beyond the level of complexity that most domain scientists are willing, or have the time, to master and write. Such usability barriers are increasingly being seen as a critical issue in database management systems. Jagadish et al. propose that ease of use should be a requirement as important as functionality and performance. They observe that, even though general-purpose query languages such as SQL and XQuery allow for the design of powerful queries, they require detailed knowledge of the database schema and rather complex programming to express queries in terms of such schemas. Since databases are often normalized, data is spread across different relations, requiring even more extensive use of database join operations when designing queries. Some of the approaches used to improve usability are forms-based query interfaces, visual query builders, and schema summarization.
+
+In this section, we describe our structured provenance query language, SPQL for short, a component of MTCProv. It was designed to meet the requirements listed in the introduction and to allow for easier formulation of provenance queries for the identified patterns than can be accomplished with general-purpose query languages, such as SQL. SPQL supports exploratory queries, where the user seeks information through a sequence of queries that progressively refine previous outputs, instead of having to compose many subqueries into a single complex query, as is often the case with SQL. Even though our current implementation uses a relational database as the underlying data model for storing provenance information, SPQL itself does not depend on it; we plan to evaluate alternative underlying data models such as graph databases, Datalog, and distributed column-oriented stores. In the current implementation, every SPQL query is
+translated into an SQL query that is processed by the underlying relational database. While the syntax of SPQL is by design similar to SQL, it does not require detailed knowledge of the underlying database schema for designing queries, but rather only of the entities in a simpler, higher-level abstract provenance schema, and their respective attributes.
+
+The basic building block of a SPQL query consists of a selection query with the following format:
+
+--------------------------------------
+select (distinct) selectClause
+(where            whereClause 
+(group by         groupByClause
+(order by         orderByClause)))
+--------------------------------------
+
+This syntax is very similar to a selection query in SQL, with a critical usability benefit: it hides the complexity of designing extensive join expressions. One does not need to provide the tables of a +from+ clause. Instead, only the entity name is given, and the translator reconstructs the underlying entity that was broken apart to produce the normalized schema. As in the relational data model, every query or built-in function results in a table, which preserves the power of SQL to query the results of another query. Selection queries can be composed using the usual set operations: union, intersection, and difference. A +select+ clause is a list with elements of the form +<entity set name>(.<attribute name>)+ or +<built-in function name>(.<return attribute name>)+. If attribute names are omitted, the query returns all the existing attributes of the entity set. SPQL supports the same aggregation, grouping, set
+operation, and ordering constructs provided by SQL.
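+
+For example, consider a SPQL query over two entity sets of the abstract schema, with the join left implicit. The SQL shown in comments is a sketch of what the translator might emit (table and column names in the translation are illustrative), not its exact output:
+
+--------------------------------------
+-- SPQL: no from clause and no join expression are written
+select function_call.id, variable.filename
+where  variable.filename = 'nr';
+
+-- A possible SQL translation (joins computed automatically):
+--   SELECT function_call.id, variable.filename
+--   FROM   function_call, dataset_in, variable
+--   WHERE  dataset_in.function_call_id = function_call.id
+--   AND    dataset_in.variable_id = variable.id
+--   AND    variable.filename = 'nr';
+--------------------------------------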
+
+To simplify the schema that the user needs to understand in order to design queries, we used database views to define a higher-level schema presentation. This abstract, OPM-compliant provenance schema is a simplified view of the physical database schema detailed above. It groups the information related to a provenance entity set in a single relation. The annotation entity set shown is the union of the annotation entity sets of the underlying database. To avoid defining one annotation table per data type, we use dynamic expression evaluation in the SPQL-to-SQL translator to determine the required type-specific annotation table of the underlying provenance database.
+
+Most of the query patterns identified in our survey of provenance queries are relatively straightforward to express in a relational query language such as SQL, except for the R^* and RCr patterns, which require either recursion or extensive use of relational joins. To abstract queries that match these patterns, we included in SPQL the following built-in functions to make these common provenance queries easier to express:
+
+- +ancestors(object_id)+ returns a table with a single column containing the identifiers of variables and function calls that precede a particular node in a provenance graph stored in the database.
+
+- +data_dependencies(variable_id)+, related to the previous built-in function, returns the identifiers of variables upon which +variable_id+ depends.
+
+- +function_call_dependencies(function_call_id)+ returns the identifiers of function calls upon which +function_call_id+ depends.
+
+- +compare_run(list of <function_parameter=string | annotation_key=string>)+ shows how process parameters or annotation values vary across the script runs stored in the database.
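+
+For instance, the +ancestors+ built-in can be used directly in a selection; the dataset identifier below is a made-up example, not one from a real run:
+
+--------------------------------------
+-- SPQL: all provenance ancestors of a given dataset
+select ancestors('dataset:run42:result.out');
+--------------------------------------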
+
+The underlying SQL implementation of the +ancestors+ built-in function, below, uses recursive common table expressions, which are supported in the SQL:1999 standard. It uses the +prov_graph+ database view, which is derived from the +dataset_in+ and +dataset_out+ tables, resulting in a table containing the edges of the provenance graph.
+
+--------------------------------------
+CREATE FUNCTION ancestors(varchar) RETURNS SETOF varchar AS $$
+WITH RECURSIVE anc(ancestor,descendant) AS
+  (
+       SELECT parent AS ancestor, child AS descendant
+       FROM   prov_graph
+       WHERE  child=$1
+     UNION
+       SELECT prov_graph.parent AS ancestor,
+              anc.descendant AS descendant
+       FROM   anc, prov_graph
+       WHERE  anc.ancestor=prov_graph.child
+  )
+  SELECT ancestor FROM anc
+$$ LANGUAGE SQL;
+--------------------------------------
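As an illustration only (not part of MTCProv), the transitive closure that this recursive CTE computes can be sketched in Python over a toy in-memory edge list:

```python
# Illustrative only: the fixpoint computed by the ancestors() recursive CTE,
# applied to an in-memory list of (parent, child) edges instead of prov_graph.
def ancestors(edges, node):
    """Return every node from which `node` is reachable via parent->child edges."""
    result = set()
    frontier = {node}
    while frontier:
        # Collect parents of the current frontier that we have not seen yet;
        # stop when no new ancestors are discovered (the fixpoint).
        frontier = {p for (p, c) in edges if c in frontier} - result
        result |= frontier
    return result

edges = [("a", "b"), ("b", "c"), ("d", "c")]
print(sorted(ancestors(edges, "c")))  # -> ['a', 'b', 'd']
```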
+
+To further simplify query specification, SPQL uses a generic mechanism for computing the +from+ clauses and the join expressions of the +where+ clause of the target SQL query. The SPQL to SQL query translator first scans all the entities present in the SPQL query. A shortest path containing all these entities is then computed in the graph defined by the schema of the provenance database. All the entities on this shortest path are listed in the +from+ clause of the target SQL query. The join expressions of the +where+ clause are computed from the edges of the shortest path: each edge yields an expression that equates the attributes involved in the foreign key constraint between the two entities that define the edge. While this automated join computation facilitates query design, it does somewhat reduce the expressivity of SPQL, since one cannot perform other types of joins, such as self-joins, explicitly. However, many such queries can be expressed using subqueries,
+which are supported by SPQL. While some of the expressive power of SQL is thus lost, we show in the sections that follow that SPQL can express, with far less effort and complexity, the most important and useful queries that the provenance query patterns require. As a quick taste, this SPQL query returns the identifiers of the script runs that either produced or consumed the file +nr+:
+
+--------------------------------------
+select  compare_run(parameter='proteinId').run_id  where  file.name='nr';
+--------------------------------------
+
+This SPQL query is translated by MTCProv to the following SQL query:
+
+--------------------------------------
+select compare_run1.run_id
+from   (select run_id, j1.value AS proteinId
+        from   compare_run_by_param('proteinId') as j1) as compare_run1,
+       run, proc, ds_use, ds, file
+where  compare_run1.run_id=run.id and ds_use.proc_id=proc.id and
+       ds_use.ds_id=ds.id and ds.id=file.id and
+       run.id=proc.run_id and file.name='nr';
+--------------------------------------
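As an illustration of the join computation described above, here is a minimal Python sketch, not the actual translator code. A breadth-first search over a hypothetical, simplified schema graph yields the table list for the +from+ clause and the join expressions for the +where+ clause; the table and column names below mirror the example query but are assumptions:

```python
from collections import deque

# Hypothetical, simplified schema graph: each edge maps a pair of tables to
# the join condition derived from their foreign key constraint.
SCHEMA_EDGES = {
    ("run", "proc"): "run.id=proc.run_id",
    ("proc", "ds_use"): "ds_use.proc_id=proc.id",
    ("ds_use", "ds"): "ds_use.ds_id=ds.id",
    ("ds", "file"): "ds.id=file.id",
}

def neighbors(table):
    for (a, b) in SCHEMA_EDGES:
        if a == table:
            yield b
        elif b == table:
            yield a

def join_path(source, target):
    """Breadth-first search for the shortest path of tables linking source to target."""
    prev = {source: None}
    queue = deque([source])
    while queue:
        t = queue.popleft()
        if t == target:
            path = []
            while t is not None:
                path.append(t)
                t = prev[t]
            return list(reversed(path))
        for n in neighbors(t):
            if n not in prev:
                prev[n] = t
                queue.append(n)
    raise ValueError("tables are not connected in the schema graph")

def from_and_joins(entities):
    """Derive the FROM table list and WHERE join expressions for two entities."""
    path = join_path(entities[0], entities[1])
    joins = [SCHEMA_EDGES.get((a, b)) or SCHEMA_EDGES[(b, a)]
             for a, b in zip(path, path[1:])]
    return path, joins

tables, joins = from_and_joins(["run", "file"])
print("FROM " + ", ".join(tables))
print("WHERE " + " AND ".join(joins))
```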
+
+Further queries are illustrated by example in the next section. We note here that the SPQL query interface also lets the user submit standard SQL statements to query the database.
+
+== Tutorial ==
+
+MTCProv is a set of scripts, SQL functions and stored procedures, and a query interface. It extracts provenance information from Swift's log files into a relational database. The tools can be downloaded via SVN with the command:
+
+--------------------------------------
+svn co https://svn.ci.uchicago.edu/svn/vdl2/provenancedb
+--------------------------------------
+
+=== Database Configuration
+
+MTCProv requires PostgreSQL version 9.0 or later, because it uses _Common Table Expressions_, supported only in these versions, to compute transitive closures of data derivation relationships. The file +prov-init.sql+ contains the database schema, and the file +pql-functions.sql+ contains the function and stored procedure definitions. If a provenance database has not been created yet, this can be done with the following commands (one may need to add "+-U+ _username_" and "+-h+ _hostname_" before the database name "+provdb+", depending on the database server configuration):
+
+--------------------------------------
+createdb provdb
+psql -f prov-init.sql provdb
+psql -f pql-functions.sql provdb
+--------------------------------------
+
+=== MTCProv Configuration
+
+The file +etc/provenance.config+ should be edited to define the database configuration. The location of the directory containing the log files should be defined in the variable +LOGREPO+. For instance:
+
+--------------------------------------
+export LOGREPO=~/swift-logs/
+--------------------------------------
+
+The command used for connecting to the database should be defined in the variable +SQLCMD+. For example, to connect to CI's PostgreSQL database:
+
+--------------------------------------
+export SQLCMD="psql -h db.ci.uchicago.edu -U provdb provdb"
+--------------------------------------
+
+The script +./swift-prov-import-all-logs+ will import provenance information from the log files in +$LOGREPO+ into the database. The command line option +-rebuild+ will initialize the database before importing provenance information. 
+
+=== Swift Configuration
+
+To enable the generation of provenance information in Swift's log files, the option +provenance.log+ should be set to +true+ in +etc/swift.properties+:
+--------------------------------------
+provenance.log=true
+--------------------------------------
+
+If Swift's SVN revision is 3417 or greater, the following options should be set in +etc/log4j.properties+:
+
+--------------------------------------
+log4j.logger.swift=DEBUG
+log4j.logger.org.griphyn.vdl.karajan.lib=DEBUG
+--------------------------------------
+
+==== Enriching Provenance Data with Runtime Resource Consumption Statistics
+
+A modified version of +_swiftwrap+ can be used to gather additional information on runtime resource consumption, such as processor, memory, I/O, and swap usage. One should back up the original +_swiftwrap+ script and replace it with the modified one:
+
+--------------------------------------
+cp $SWIFT_HOME/libexec/_swiftwrap $SWIFT_HOME/libexec/_swiftwrap-backup
+cp swift_mod/_swiftwrap_runtime_snapshots $SWIFT_HOME/libexec/_swiftwrap
+--------------------------------------
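The modified wrapper logs records of the form +RUNTIME_INFO=timestamp:...,cpu_usage:...+, which +prov-to-sql.sh+ later splits into the columns of the +rt_info+ table. As an illustration only (MTCProv itself does this parsing with awk in shell), here is a minimal Python sketch of splitting such a record into key/value pairs; the sample values are made up:

```python
# A sample RUNTIME_INFO record as emitted by the modified wrapper
# (field names from _swiftwrap_runtime_snapshots; values are invented).
line = ("RUNTIME_INFO=timestamp:1348582000,cpu_usage:12.5,"
        "max_phys_mem:20480,max_virtual_mem:40960,"
        "io_read_bytes:1024,io_write_bytes:2048")

def parse_runtime_info(line):
    """Split a RUNTIME_INFO record into a {key: value} dict of strings."""
    # Drop the RUNTIME_INFO= prefix, then split on commas and colons,
    # mirroring the awk -F "," / awk -F ":" pipeline in prov-to-sql.sh.
    _, _, payload = line.partition("RUNTIME_INFO=")
    return dict(field.split(":", 1) for field in payload.split(","))

info = parse_runtime_info(line)
print(info["cpu_usage"], info["io_write_bytes"])  # -> 12.5 2048
```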
+
+=== Example: MODIS
+
+Run the MODIS example script, then import the resulting provenance into the database:
+
+--------------------------------------
+swift modis.swift
+swift-prov-import-all-logs
+--------------------------------------
+
+Connect to the provenance database:
+
+--------------------------------------
+psql provdb
+--------------------------------------
+
+List the runs that were imported into the database:
+
+--------------------------------------
+SELECT script_filename, swift_version, cog_version, final_state, start_time, duration
+FROM   script_run;
+
+  script_filename   | swift_version | cog_version | final_state |         start_time         | duration 
+--------------------+---------------+-------------+-------------+----------------------------+----------
+ modis.swift        | 5746          | 3371        | FAIL        | 2012-09-19 17:26:19.221-03 |    2.168
+ modis-vortex.swift | 5746          | 3371        | FAIL        | 2012-09-19 17:28:24.809-03 |  180.542
+ modis-vortex.swift | 5746          | 3371        | FAIL        | 2012-09-19 17:31:55.706-03 |  312.249
+--------------------------------------
+
+List the ancestors of a given dataset:
+
+--------------------------------------
+select * from ancestors('dataset:20120919-1731-06svjllb:720000000654');
+
+                         ancestors                         
+-----------------------------------------------------------
+ modis-vortex-20120919-1731-6fa0kk03:0
+ modis-vortex-20120919-1731-6fa0kk03:0-6
+ dataset:20120919-1731-06svjllb:720000000335
+ dataset:20120919-1731-06svjllb:720000000653
+ dataset:20120919-1731-06svjllb:720000000007
+ modis-vortex-20120919-1731-6fa0kk03:06svjllb:720000000335
+ dataset:20120919-1731-06svjllb:720000000336
+ dataset:20120919-1731-06svjllb:720000000337
+ ...
+ dataset:20120919-1731-06svjllb:720000000042
+ dataset:20120919-1731-06svjllb:720000000229
+ dataset:20120919-1731-06svjllb:720000000006
+(958 rows)
+--------------------------------------
+
+
+
+

Modified: provenancedb/info-to-runtime
===================================================================
--- provenancedb/info-to-runtime	2012-09-24 20:27:28 UTC (rev 5944)
+++ provenancedb/info-to-runtime	2012-09-25 13:24:00 UTC (rev 5945)
@@ -10,7 +10,7 @@
     
     if [ "X$record" != "X" ] && [ -f $record ] ; then
 	
-	grep '^RUNTIME_AGGR=' $record | sed "s/^RUNTIME_AGGR=\(.*\)$/$globalid \1/"
+	grep '^RUNTIME_INFO=' $record | sed "s/^RUNTIME_INFO=\(.*\)$/$globalid \1/"
 	
     else
 	echo no wrapper log for $id >&2

Modified: provenancedb/prov-to-sql.sh
===================================================================
--- provenancedb/prov-to-sql.sh	2012-09-24 20:27:28 UTC (rev 5944)
+++ provenancedb/prov-to-sql.sh	2012-09-25 13:24:00 UTC (rev 5945)
@@ -129,10 +129,18 @@
 echo "    - Wrapper log resource consumption info."
 if [ -f runtime.txt ]; then
     while read execute2_id runtime; do
-	for key in $(echo maxrss walltime systime usertime cpu fsin fsout timesswapped socketrecv socketsent majorpagefaults minorpagefaults contextswitchesinv contextswitchesvol); do
-	    value=$(echo $runtime | awk -F "," '{print $1}' | awk -F ":" '{print $2}')
-	    echo "INSERT INTO annot_app_exec_num VALUES ('$execute2_id','$key',$value)"  >>  /tmp/$RUNID.sql
-	done
+	timestamp=$(echo $runtime | awk -F "," '{print $1}' | awk -F ":" '{print $2}')
+	cpu_usage=$(echo $runtime | awk -F "," '{print $2}' | awk -F ":" '{print $2}')
+	max_phys_mem=$(echo $runtime | awk -F "," '{print $3}' | awk -F ":" '{print $2}')
+	max_virtual_mem=$(echo $runtime | awk -F "," '{print $4}' | awk -F ":" '{print $2}')
+	io_read_bytes=$(echo $runtime | awk -F "," '{print $5}' | awk -F ":" '{print $2}')
+	io_write_bytes=$(echo $runtime | awk -F "," '{print $6}' | awk -F ":" '{print $2}')
+	echo "INSERT INTO rt_info (app_exec_id, timestamp, cpu_usage, max_phys_mem, max_virt_mem, io_read, io_write) VALUES ('$execute2_id', $timestamp, $cpu_usage, $max_phys_mem, $max_virtual_mem, $io_read_bytes, $io_write_bytes);"  >> /tmp/$RUNID.sql
+
+#	for key in $(echo maxrss walltime systime usertime cpu fsin fsout timesswapped socketrecv socketsent majorpagefaults minorpagefaults contextswitchesinv contextswitchesvol); do
+#	    value=$(echo $runtime | awk -F "," '{print $1}' | awk -F ":" '{print $2}')
+#	    echo "INSERT INTO annot_app_exec_num VALUES ('$execute2_id','$key',$value)"  >>  /tmp/$RUNID.sql
+#	done
     done < runtime.txt
 fi
 

Modified: provenancedb/swift_mod/_swiftwrap_runtime_snapshots
===================================================================
--- provenancedb/swift_mod/_swiftwrap_runtime_snapshots	2012-09-24 20:27:28 UTC (rev 5944)
+++ provenancedb/swift_mod/_swiftwrap_runtime_snapshots	2012-09-25 13:24:00 UTC (rev 5945)
@@ -54,9 +54,9 @@
 		fi
 		CPU_USAGE=$(echo $PSLINE | awk '{print $3}')
 		log "RUNTIME_INFO=timestamp:$STEP_DATE,cpu_usage:$CPU_USAGE,max_phys_mem:$MAX_PHYS_MEM,max_virtual_mem:$MAX_VIRTUAL_MEM,io_read_bytes:$READ_BYTES,io_write_bytes:$WRITE_BYTES"
-		if [ "$SAMPLING_INTERVAL" -lt 60 ]; then
-			let "SAMPLING_INTERVAL=$SAMPLING_INTERVAL+1"
-		fi
+		#if [ "$SAMPLING_INTERVAL" -lt 60 ]; then
+		#	let "SAMPLING_INTERVAL=$SAMPLING_INTERVAL+1"
+		#fi
 	done
 	wait $EXEC_PID
 }



