[Swift-commit] r3287 - text/swift_pc3_fgcs

Tue Apr 20 09:57:59 CDT 2010

Author: lgadelha
Date: 2010-04-20 09:57:59 -0500 (Tue, 20 Apr 2010)
New Revision: 3287

Modified:
   text/swift_pc3_fgcs/swift_pc3_fgcs.tex
Log:


Modified: text/swift_pc3_fgcs/swift_pc3_fgcs.tex
===================================================================

--- text/swift_pc3_fgcs/swift_pc3_fgcs.tex	2010-04-20 05:09:24 UTC (rev 3286)
+++ text/swift_pc3_fgcs/swift_pc3_fgcs.tex	2010-04-20 14:57:59 UTC (rev 3287)
@@ -68,15 +68,15 @@
 \title{Provenance Management in Swift}
 
 \author{Ben Clifford}
-%\ead{benc at hawaga.org.uk}
+\ead{benc at hawaga.org.uk}
 \author[coppe]{Luiz Gadelha Jr.}
 \ead{gadelha at cos.ufrj.br}
 \author[coppe]{Marta Mattoso}
-%\ead{marta at cos.ufrj.br}
+\ead{marta at cos.ufrj.br}
 \author[uc,anl]{Michael Wilde}
-%\ead{wilde at mcs.anl.gov}
+\ead{wilde at mcs.anl.gov}
 \author[uc,anl]{Ian Foster}
-%\ead{foster at mcs.anl.gov}
+\ead{foster at mcs.anl.gov}
 %\address[no]{No affiliation}
 \address[coppe]{PESC/COPPE, Federal University of Rio de Janeiro, Brazil}
 \address[uc]{Computation Institute, University of Chicago, USA}
@@ -214,14 +214,11 @@
 
 In our initial attempts to implement LoadWorkflow, we found the use of the parallel {\tt foreach} loop problematic because the database routines executed by the external application procedures are opaque to Swift. Due to dependencies between iterations of the loop, these routines were being incorrectly executed in parallel. It was necessary to serialize the loop execution to keep the database consistent. For the same reason, since most of the PC3 queries are for row-level database provenance, we had to implement a workaround for gathering this provenance by modifying the application database so that for every row inserted, an entry containing the execution identifier of the Swift process that performed this insertion is recorded on a separate annotation table. A detailed description of the LoadWorkflow implementation in SwiftScript, and the SQL queries to the provenance database can be found in \cite{ClGaMa09}. Core query 1, for instance, consists of determining, for a given application database row, which CSV files contributed to it. The strategy used to answer this query is to determine input CSV files that precede, in the transitivity table, the process that inserted the row. This query can be answered by first obtaining the identifier of the Swift process that inserted the row from the annotations included in the application database. Then, we query for file names of datasets that contain CSV inputs in the set of predecessors of the process that inserted the row. 
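
The strategy for core query 1 can be sketched as a single join over three tables. This is only an illustration under stated assumptions: the table and column names (row_annotations, trans, dataset_filenames, before_id, after_id) are hypothetical and are not taken from the Swift provenance schema, and SQLite stands in for whatever database system is actually used.

```python
import sqlite3


def csv_inputs_for_row(conn, row_id):
    """Return names of CSV files that precede the process that inserted row_id."""
    query = """
        SELECT DISTINCT f.filename
        FROM row_annotations AS a                -- row -> inserting process
        JOIN trans AS t ON t.after_id = a.process_id      -- all predecessors
        JOIN dataset_filenames AS f ON f.dataset_id = t.before_id
        WHERE a.row_id = ? AND f.filename LIKE '%.csv'
        ORDER BY f.filename
    """
    return [name for (name,) in conn.execute(query, (row_id,))]


def build_demo_db():
    """Create a tiny in-memory database using the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE row_annotations (row_id TEXT, process_id TEXT);
        CREATE TABLE trans (before_id TEXT, after_id TEXT);
        CREATE TABLE dataset_filenames (dataset_id TEXT, filename TEXT);
        INSERT INTO row_annotations VALUES ('row-1', 'proc-load');
        -- transitive closure: every (predecessor, successor) pair is stored
        INSERT INTO trans VALUES ('ds-csv', 'proc-read');
        INSERT INTO trans VALUES ('proc-read', 'proc-load');
        INSERT INTO trans VALUES ('ds-csv', 'proc-load');
        INSERT INTO trans VALUES ('ds-log', 'proc-load');
        INSERT INTO dataset_filenames VALUES ('ds-csv', 'detections.csv');
        INSERT INTO dataset_filenames VALUES ('ds-log', 'run.log');
    """)
    return conn
```

Because the closure table already stores indirect predecessors, the query needs no recursion; the cost is paid when the table is maintained.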
 
-%Core query 2 asks if the range check workflow component (IsMatchColumnRanges) was performed in a particular table of the application database, given that a user found values that were not expected in it. This is implemented by querying for input parameters for all IsMatchColumnRanges calls. These are XML values, and it is necessary to examine the resulting XML to determine if it was invoked for the specific table. There is unpleasant cross-format joining necessary here to get an actual yes/no result properly, although we could use a {\tt LIKE} clause to examine the value.
 
-%{\em Core Query 3}. The query asks which operation executions were strictly necessary for an application database table (Image) to contain a particular (non-computed) value. This uses the additional annotations made, that only store which process originally inserted a row, not which processes have modified a row. So to some extent, rows are regarded a bit like artifacts (though not first order artifacts in the provenance database); and we can only answer questions about the provenance of rows, not the individual fields within those rows. That is sufficient for this query, though. First find the row that contains the interesting value and extract its identifier ({\tt IMAGEID}). Then find the process that created the row by querying the annotations. This gives the process identifier for the process that created the row. Now query the transitive closure table for all predecessors for that process. This will produce all processes and artifacts that preceded this row creation. 
-
 The OPM output for a LoadWorkflow run in Swift was generated by a script that maps Swift's provenance data model to OPM's XML schema. Since OPM and Swift's provenance database use similar data models, it is fairly straightforward to build a tool to import data from an OPM graph into the Swift provenance database. However, we observed that the OPM outputs from the various participating teams, including Swift, carry many details of the LoadWorkflow implementation that are system specific, such as auxiliary tasks that are not necessarily related to the workflow. To answer the same queries, it would be necessary to perform some manual interpretation of the imported OPM graph in order to identify the relevant processes and artifacts.
 
 
-A number of other forms were briefly experimented with during development. The two most developed and interesting models were XML and Prolog. XML provides a semi-structured tree form for data. A benefit of this approach is that new data can be added to the database without needing an explicit schema to be known to the database. In addition, when used with a query language such as XPath, certain transitive queries become straightforward with the use of the {\tt //} operator of XPath. Representing the data as Prolog tuples is a different representation than a traditional database, but provides a query interface that can express interesting queries flexibly.
+Swift's provenance data model is not dependent on a particular database system. A number of other forms were briefly experimented with during development. The two most developed and interesting models were XML and Prolog. XML provides a semi-structured tree form for data. A benefit of this approach is that new data can be added to the database without needing an explicit schema to be known to the database. In addition, when used with a query language such as XPath, certain transitive queries become straightforward with the use of the {\tt //} operator of XPath. Representing the data as Prolog tuples is a different representation than a traditional database, but provides a query interface that can express interesting queries flexibly. 
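
As a rough illustration of the XPath point, the sketch below assumes a hypothetical XML layout in which each dataset element nests the process that produced it, and each process nests its inputs. The element and attribute names are invented for the example. Python's xml.etree.ElementTree supports only a subset of XPath, but that subset includes the descendant step written {\tt .//}.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: a dataset contains the process that produced it,
# and a process contains the datasets it consumed.
PROVENANCE_XML = """
<dataset name="result">
  <process name="merge">
    <dataset name="partial">
      <process name="load">
        <dataset name="input.csv"/>
      </process>
    </dataset>
  </process>
</dataset>
"""


def ancestor_datasets(xml_text):
    """Return names of all datasets the root dataset transitively depends on."""
    root = ET.fromstring(xml_text)
    # './/dataset' matches dataset elements at any depth below the root,
    # so no explicit recursion or closure table is needed.
    return [d.get("name") for d in root.findall(".//dataset")]
```

The single path expression replaces the closure table used in the relational form, at the price of fixing one nesting direction in the document.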
 
 PC3 provided an opportunity to use OPM in practice. This also enabled us to evaluate OPM and compare it to Swift's provenance data model.
 OPM originally did not specify a naming mechanism for globally identifying artifacts outside of an OPM graph. In Swift, dataset handles are given a URI; OPM now has an annotation for this purpose \cite{opm1.1}. 
@@ -236,7 +233,7 @@
 int c[] = [a, b];
 \end{lstlisting}
 
-The Swift entry made a minor proposal  \cite{pc} to change the XML schema to better reflect the perceived intentions of the OPM authors. It was apparent that the present representation of hierarchical processes in OPM is insufficiently rich for some groups and that it would be useful to represent hierarchy of individual processes and their containing processes more directly. In Swift this is given by two categories: at the highest level, SwiftScript language constructs, such as procedures and functions; below that, the mechanics of Swift's execution, such as moving files to and from computational resources, and interactions with job execution. Swift provenance work so far has concentrated in the high-level representation, treating all of the low-level behavior as opaque and exposing neither processes nor artifacts. An OPM modification proposal for this is forthcoming. In Swift, this information is often available through the Karajan \cite{karajan} thread identifier which closely maps to the Swift process execution hierarchy: a Swift process contains another Swift process if its Karajan thread identifier is a prefix of the second processes Karajan thread identifier. The Swift provenance database stores values of dataset handles when those values exist in-memory (for example, when a dataset handle represents and integer or a string). There was some desire in the PC3 workshop for a standard way to represent this.
+The Swift entry made a minor proposal \cite{pc} to change the XML schema to better reflect the perceived intentions of the OPM authors. It was apparent that the present representation of hierarchical processes in OPM is insufficiently rich for some groups and that it would be useful to represent the hierarchy of individual processes and their containing processes more directly. In Swift this is given by two categories: at the highest level, SwiftScript language constructs, such as procedures and functions; below that, the mechanics of Swift's execution, such as moving files to and from computational resources, and interactions with job execution. Swift provenance work so far has concentrated on the high-level representation, treating all of the low-level behavior as opaque and exposing neither processes nor artifacts. An OPM modification proposal for this is forthcoming. In Swift, this information is often available through the Karajan \cite{karajan} execution engine thread identifier, which closely maps to the Swift process execution hierarchy: a Swift process contains another Swift process if its Karajan thread identifier is a prefix of the second process's Karajan thread identifier. The Swift provenance database stores values of dataset handles when those values exist in-memory (for example, when a dataset handle represents an integer or a string). There was some desire in the PC3 workshop for a standard way to represent this.
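
The prefix rule for process containment can be sketched as follows. The sketch assumes, purely for illustration, that a thread identifier is a sequence of components joined by hyphens (for example 0-1-3); the function name and the separator are hypothetical. Comparing whole components rather than raw characters avoids treating 0-1 as a prefix of 0-10, and the check is strict, so a process is not reported as containing itself.

```python
def contains(outer_thread, inner_thread, sep="-"):
    """Return True if outer_thread is a proper component-wise prefix of inner_thread."""
    outer = outer_thread.split(sep)
    inner = inner_thread.split(sep)
    # A containing process has a strictly shorter identifier whose
    # components all match the start of the contained process's identifier.
    return len(outer) < len(inner) and inner[:len(outer)] == outer
```

With identifiers of this assumed shape, contains("0-1", "0-1-3") holds, while contains("0-1", "0-10") does not.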
 
 \section{Related Work}
 
@@ -255,10 +252,124 @@
 
 {\em Provenance query system}. It was clear from PC3 that although it is possible to express the provenance queries in SQL it is not always practical to do so, due to its poor transitivity support. One future objective is to make the provenance query system, which should include a specialized provenance query language, capable of being readily queried by scientists to let them do better science through validation, collaboration, and discovery. 
 
-%\footnotesize
 
 \bibliographystyle{plain}
-\bibliography{ref}
+\begin{thebibliography}{10}
 
+\bibitem{pc}
+{Provenance Challenge Wiki}.
+\newblock http://twiki.ipaw.info, 2009.
+
+\bibitem{karma}
+B.~Cao, B.~Plale, G.~Subramanian, E.~Robertson, and Y.~Simmhan.
+\newblock {Provenance Information Model of Karma Version 3}.
+\newblock In {\em Proc. IEEE Congress on Services}, pages 348--351, 2009.
+
+\bibitem{ClFo08}
+B.~Clifford, I.~Foster, J.~Voeckler, M.~Wilde, and Y.~Zhao.
+\newblock Tracking provenance in a virtual data grid.
+\newblock {\em Concurrency and Computation: Practice and Experience},
+  20(5):565--575, 2008.
+
+\bibitem{ClGaMa09}
+B.~Clifford, L.~Gadelha, M.~Mattoso, M.~Wilde, and I.~Foster.
+\newblock {Tracking Provenance in Swift}.
+\newblock Technical Report ANL/MCS-P1703-1209, Argonne National Laboratory,
+  2009.
+
+\bibitem{CrCa09}
+S.~da~Cruz, M.~Campos, and M.~Mattoso.
+\newblock {Towards a Taxonomy of Provenance in Scientific Workflow Management
+  Systems}.
+\newblock In {\em Proc. IEEE Congress on Services, Part I, (SERVICES I 2009)},
+  pages 259--266, 2009.
+
+\bibitem{DeGa09}
+E.~Deelman, D.~Gannon, M.~Shields, and I.~Taylor.
+\newblock {Workflows in e-Science: An overview of workflow system features and
+  capabilities}.
+\newblock {\em Future Generation Computer Systems}, 25(5):528--540, 2009.
+
+\bibitem{SQLTRANS}
+G.~Dong, L.~Libkin, J.~Su, and L.~Wong.
+\newblock {Maintaining Transitive Closure of Graphs in SQL}.
+\newblock {\em Intl. Journal of Information Technology}, 5, 1999.
+
+\bibitem{chimera}
+I.~Foster, J.~Vockler, M.~Wilde, and Y.~Zhao.
+\newblock {Chimera: A Virtual Data System for Representing, Querying and
+  Automating Data Derivation}.
+\newblock In {\em Proc. 14th International Conference on Scientific and
+  Statistical Database Management (SSDBM'02)}, pages 37--46, 2002.
+
+\bibitem{FrSi06}
+J.~Freire, C.~Silva, S.~Callahan, E.~Santos, C.~Scheidegger, and H.~Vo.
+\newblock {Managing Rapidly-Evolving Scientific Workflows}.
+\newblock In {\em International Provenance and Annotation Workshop (IPAW
+  2006)}, volume 4145 of {\em LNCS}, pages 10--18, 2006.
+
+\bibitem{OPMcollections}
+P.~Groth, S.~Miles, P.~Missier, and L.~Moreau.
+\newblock {A Proposal for Handling Collections in the Open Provenance Model}.
+\newblock
+  http://mailman.ecs.soton.ac.uk/pipermail/provenance-challenge-ipaw-info/2009%
+-June/000120.html, 2009.
+
+\bibitem{karajan}
+G.~Laszewski, M.~Hategan, and D.~Kodeboyina.
+\newblock {Java CoG Kit Workflow}.
+\newblock In I.~Taylor, E.~Deelman, D.~Gannon, and M.~Shields, editors, {\em
+  Workflows for e-Science}, pages 340--356. Springer, 2007.
+
+\bibitem{opm1.1}
+L.~Moreau, B.~Clifford, J.~Freire, Y.~Gil, P.~Groth, J.~Futrelle,
+  N.~Kwasnikowska, S.~Miles, P.~Missier, J.~Myers, Y.~Simmhan, E.~Stephan, and
+  J.~Van den Bussche.
+\newblock {The Open Provenance Model - Core Specification (v1.1)}.
+\newblock {\em Future Generation Computer Systems}, 2009 (Submitted).
+
+\bibitem{xdtm}
+L.~Moreau, Y.~Zhao, I.~Foster, J.~Voeckler, and M.~Wilde.
+\newblock {XDTM: XML Dataset Typing and Mapping for Specifying Datasets}.
+\newblock European Grid Conference (EGC 2005), 2005.
+
+\bibitem{tupelo}
+J.~Myers, J.~Futrelle, J.~Plutchak, P.~Bajcsy, J.~Kastner, L.~Marini,
+  R.~Kooper, R.~McGrath, T.~McLaren, A.~Rodr\'{\i}guez, and Y.~Liu.
+\newblock {Embedding Data within Knowledge Spaces}.
+\newblock {\em CoRR}, abs/0902.0744, 2009.
+
+\bibitem{falkon}
+I.~Raicu, Y.~Zhao, C.~Dumitrescu, I.~Foster, and M.~Wilde.
+\newblock {Falkon: A Fast and Lightweight Task Execution Framework}.
+\newblock In {\em Proc. ACM/IEEE Conference on High Performance Networking and
+  Computing (Supercomputing 2007)}, 2007.
+
+\bibitem{SiPlGa05}
+Y.~Simmhan, B.~Plale, and D.~Gannon.
+\newblock {A Survey of Data Provenance in e-Science}.
+\newblock {\em SIGMOD Record}, 34(3):31--36, 2005.
+
+\bibitem{WiFo09}
+M.~Wilde, I.~Foster, K.~Iskra, P.~Beckman, A.~Espinosa, M.~Hategan,
+  B.~Clifford, and I.~Raicu.
+\newblock {Parallel Scripting for Applications at the Petascale and Beyond}.
+\newblock {\em IEEE Computer}, 42(11):50--60, November 2009.
+
+\bibitem{swift}
+Y.~Zhao, M.~Hategan, B.~Clifford, I.~Foster, G.~Laszewski, I.~Raicu,
+  T.~Stef-Praun, and M.~Wilde.
+\newblock {Swift: Fast, Reliable, Loosely Coupled Parallel Computation}.
+\newblock In {\em Proc. 1st IEEE International Workshop on Scientific Workflows
+  (SWF 2007)}, pages 199--206, 2007.
+
+\bibitem{ZhWiFo06}
+Y.~Zhao, M.~Wilde, and I.~Foster.
+\newblock {Applying the Virtual Data Provenance Model}.
+\newblock In {\em International Provenance and Annotation Workshop (IPAW
+  2006)}, volume 4145 of {\em LNCS}, pages 148--161. Springer, 2006.
+
+\end{thebibliography}
+
 \end{document}
 \endinput