[Swift-commit] r4135 - text/parco10submission
noreply at svn.ci.uchicago.edu
Tue Feb 22 10:37:27 CST 2011
Author: wilde
Date: 2011-02-22 10:37:27 -0600 (Tue, 22 Feb 2011)
New Revision: 4135
Modified:
text/parco10submission/paper.bib
text/parco10submission/paper.tex
Log:
Made corrections requested by reviewer:
x = fixed as suggested, o = was already fixed (not sure what version was reviewed)
NOTED: needs more work
> p. 12: "It is important to understood"
Not in the svn version - what version was this reviewer reading?
x > p. 12: becomes"in scope" [space missing]
x > p. 12: "The invocation of mappers... are" ['are' should be 'is']
x > p. 13: My dictionary says 'de facto' not de-facto
x > p. 13: "underly" should be "underlie" [underly is not a word]
o > p. 14: "automatically selection"
Not in svn version
o > p. 14: semntics
Not in svn version
x > p. 17: applicatoin
x > p. 19: "discoveries...continues to drive" wrong [should be 'continue']
already fixed but changed to match suggestion.
o > p. 19: My dictionary says self-contained, not "self contained"
already fixed.
x > p. 20: line -9: maybe "find N files"? [doesn't make sense to me as is]
x > p. 22: line 11: geos[] . should be geos[]. (extraneous space)
x > p. 22: "for statement" should be "for the statement"
x > p. 22: line -8 "of every input tiles" should be "of every input tile"
already fixed, but improved.
x > p. 25: 7*27*10=1890 not 1690 (also see p. 26 where 1670 is mentioned)
o > p. 25: "These structure" [should be structures]
x > p. 25: "selectable energy function to used by" [should be 'to be used
> by']
x > Fig. 3: acronym SEM is undefined (also 2nd paragraph of section 5.2)
NOTED: WE WANT TO CHANGE THIS FIG
> Fig. 4: lower half: color of "Processors" differs left vs. right; why?
NOTED: TO REPLACE
o > p. 32: "rational" should be "rationale"
NOTED: Needs further improvement.
o > p. 32: "goals to providing" should be "goals of providing"
x > p. 33: line -12 "environment" should be "environments"
x NOTED > p. 34: "on single parallel system" - another typo
x NOTED: > p. 35: "thousands of small accesses to the filesystem" might not
> only be a bottleneck, it may significantly degrade
> responsiveness for every process on the system
x > p. 38: "J. H G" -> J. H. G." ?
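For the record, the p. 25/p. 26 job-count correction is simple arithmetic; a minimal check (the 90-run total is illustrative, combining the per-run count with the "total of 90 runs" figure quoted in the paper text):

```python
# Job count per run for the glass simulation campaign:
# 7 radii x 27 centers x 10 models (x 1 subconfiguration, nsub = 1).
radii, centers, models, nsub = 7, 27, 10, 1
jobs_per_run = radii * centers * models * nsub
print(jobs_per_run)  # 1890 -- not the 1690/1670 figures the reviewer flagged

runs = 90  # three model systems, 90 runs total per the paper
print(jobs_per_run * runs)  # 170100 jobs across the full campaign
```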
Modified: text/parco10submission/paper.bib
===================================================================
--- text/parco10submission/paper.bib 2011-02-22 02:12:12 UTC (rev 4134)
+++ text/parco10submission/paper.bib 2011-02-22 16:37:27 UTC (rev 4135)
@@ -195,15 +195,30 @@
pages = {39--59}
}
-@article{Futures,
+@article{OLDFutures,
title = {{The Incremental Garbage Collection of Processes}},
- author = {H G Baker, Jr. and C Hewitt },
+ author = {H. G. Baker, Jr. and C. Hewitt},
journal = {{Proceedings of Symposium on AI and Programming Languages, ACM SIGPLAN Notices}},
volume = {12(8)},
year = 1977,
pages = {55--59}
}
+@inproceedings{Futures,
+ author = {Baker,Jr., Henry C. and Hewitt, Carl},
+ title = {The incremental garbage collection of processes},
+ booktitle = {Proceedings of the 1977 symposium on Artificial intelligence and programming languages},
+ year = {1977},
+ pages = {55--59},
+ numpages = {5},
+ url = {http://doi.acm.org/10.1145/800228.806932},
+ doi = {http://doi.acm.org/10.1145/800228.806932},
+ acmid = {806932},
+ publisher = {ACM},
+ address = {New York, NY, USA},
+ keywords = {Eager evaluation, Garbage collection, Lazy evaluation, Multiprocessing systems, Processor scheduling},
+}
+
@misc{LONIPIPELINE,
title="LONI Pipeline http://pipeline.loni.ucla.edu/"
}
Modified: text/parco10submission/paper.tex
===================================================================
--- text/parco10submission/paper.tex 2011-02-22 02:12:12 UTC (rev 4134)
+++ text/parco10submission/paper.tex 2011-02-22 16:37:27 UTC (rev 4135)
@@ -43,7 +43,7 @@
\author[ci,mcs]{Michael Wilde\corref{cor1}} \ead{wilde@mcs.anl.gov}
\author[ci]{Mihael Hategan}
\author[mcs]{Justin M. Wozniak}
-\author[alcf]{Ben Clifford}
+\author[astro]{Ben Clifford}
\author[ci]{Daniel S. Katz}
\author[ci,mcs,cs]{Ian Foster}
@@ -51,8 +51,8 @@
\address[ci]{Computation Institute, University of Chicago and Argonne National Laboratory}
\address[mcs]{Mathematics and Computer Science Division, Argonne National Laboratory}
-\address[alcf]{Argonne Leadership Computing Facility, Argonne National Laboratory }
\address[cs]{Department of Computer Science, University of Chicago }
+\address[astro]{Department of Astronomy and Astrophysics, University of Chicago}
\begin{abstract}
@@ -823,7 +823,7 @@
Swift mappers can operate on files stored on the local machine in the directory where the {\tt swift} command is executing, or they can map any files accessible to the local machine, using absolute pathnames. Custom mappers (and some of the built-in mappers) can also map variables to files specified by URIs for access from remote servers via protocols such as GridFTP or HTTP, as described in section \ref{Execution}. Mappers can interact with structure fields and array elements in a simple and useful manner.
New mappers can be added to Swift either as Java classes or as simple, external executable scripts or programs coded in any language.
-Mappers can operate both as input mappers (which map files to be processed as application inputs) and as output mappers (which specify the names of files to be produced by applications). It is important to understand that mapping a variable is a different operation from setting the value of a variable. Variables of mapped-file type are mapped (conceptually) when the variable becomes``in scope,'' but they are set when a statement assigns them a value. The invocation of mappers (including external executable mappers) are completely synchronized with the Swift parallel execution model.
+Mappers can operate both as input mappers (which map files to be processed as application inputs) and as output mappers (which specify the names of files to be produced by applications). It is important to understand that mapping a variable is a different operation from setting the value of a variable. Variables of mapped-file type are mapped (conceptually) when the variable becomes ``in scope,'' but they are set when a statement assigns them a value. Mapper invocations (and invocations of external mapper executables) are completely synchronized with the Swift parallel execution model.
This ability to abstract the processing of files by programs as if they were in-memory objects and to process them with an implicitly parallel programming model is Swift's most valuable and noteworthy contribution.
@@ -903,7 +903,7 @@
portability), that they will run in any particular order with respect to other
application invocations in a script (except those implied by data
dependency), or that their working directories will or will not be
-cleaned up after execution. In addition, applications should strive to avoid side effects that could limit both their location independence and the determinism (either actual or de-facto) of the overall results of Swift scripts that call them.
+cleaned up after execution. In addition, applications should strive to avoid side effects that could limit both their location independence and the determinism (either actual or de facto) of the overall results of Swift scripts that call them.
Consider the following \verb|app| declaration for the \verb|rotate|
function:
@@ -942,7 +942,7 @@
\section{The Swift runtime environment}
\label{Execution}
-The Swift runtime environment comprises a set of services providing the parallel, distributed, and reliable execution that underly the simple Swift language model. A notable contribution of Swift is the extent to which the language model has been kept simple by factoring the complexity of these issues out of the language and implementing them in the runtime environment. Notable features of this environment include the following:
+The Swift runtime environment comprises a set of services providing the parallel, distributed, and reliable execution that underlie the simple Swift language model. A notable contribution of Swift is the extent to which the language model has been kept simple by factoring the complexity of these issues out of the language and implementing them in the runtime environment. Notable features of this environment include the following:
\begin{itemize}
@@ -1078,7 +1078,7 @@
provides a natural way to deal both with many transient errors, such as
temporary network loss, and with many changes in site state.
-Some errors are more permanent; for example, an applicatoin
+Some errors are more permanent; for example, an application
program may have a bug that causes it to always fail when given a
particular set of inputs. In this case, Swift's retry mechanism will
not help; each job will be tried a number of times, and each
@@ -1172,8 +1172,8 @@
time and with respect to themselves at other times. The load that a
particular site will bear varies over time, and sites can fail in
unusual ways. Swift's site scoring mechanism deals well with this situation in
-the majority of cases. However, discovery of new and unusual failure
-modes continues to drive the implementation of increasingly robust fault tolerance
+the majority of cases. However, discoveries of new and unusual failure
+modes continue to drive the implementation of increasingly robust fault tolerance
mechanisms.
When running jobs on dynamically discovered sites, it is likely that
@@ -1185,7 +1185,7 @@
files are described as mapped input files in the same way as input
data files and are passed as a parameter to the application
executable. Swift's existing input file management then
-stages-in the application files once per site per run.
+stages in the application files once per site per run.
\section{Applications}
\label{Applications}
@@ -1228,9 +1228,9 @@
5 million 1-byte pixels (5.7 MB), covering 2400
x 2400 250-meter squares, based on a specific map projection.
-The Swift script analyzes the dataset to find the files with the N
-largest total area of any requested sets of land-cover typesr It then produces a new dataset with viewable
-color images of those closest-matching data tiles.
+The Swift script analyzes the dataset to select the top N files ranked by
+total area of specified sets of land-cover types. It then produces a new dataset with viewable
+color images of those selected data tiles.
(A color rendering step is required, since the input datasets are not viewable images; their pixel
values are land-use codes.) A typical invocation of this script would be ``\emph{Find the top 12 urban tiles}'' or ``\emph{Find the 16 tiles with the most forest and grassland.}'' Since this script is used for tutorial purposes, the application programs it calls are simple shell scripts that use fast, generic image-processing applications to process the MODIS data. Thus, the example executes quickly while serving as a realistic tutorial script for much more compute-intensive satellite data-processing applications.
@@ -1258,15 +1258,15 @@
At lines 55--57 the script performs its first computation using a {\tt foreach} loop to invoke {\tt getLandUse} in parallel on each file mapped to the elements of {\tt geos[]}. Since 317 files were mapped (in lines 47--48), the loop will submit 317 instances of the application in parallel to the execution provider. These will execute with a degree of parallelism subject to available resources.
%\katznote{DONE: is this strictly true? Do you want to say that it will enable 317 instances to be runnable in parallel, but the number that are actually run in parallel depends on the hardware available to Swift, or something like that?}
-At lines 52--53 the result of each computation is placed in a file mapped to the array {\tt land} and named by the regular expression translation based on the file names mapped to {\tt geos[]} .
+At lines 52--53 the result of each computation is placed in a file mapped to the array {\tt land} and named by the regular expression translation based on the file names mapped to {\tt geos[]}.
Thus the landuse histogram for file {\tt h00v08.tif} would be written into file {\tt h00v08.landuse.freq} and would be considered by Swift to be of type {\tt landuse}.
Once all the land usage histograms have been computed, the script can then execute {\tt analyzeLandUse} at line 63 to find the requested number of highest tiles (files) with a specific land cover combination. The Swift runtime system uses futures to ensure that this analysis function is not invoked until all of its input files have been computed and transported to the computation site chosen to run the analysis program. All of these steps take place automatically, using the relatively simple and location-independent Swift expressions shown. The output files to be used for the result are specified in the declarations at lines 61--62.
% \katznote{DONE: should these lines have a space inserted before the ``<'' to match the previous lines? Same question for 67-68... }
-To visualize the results, the application function {\tt markMap} invoked at line 68 will generate an image of a world map using the MODIS projection system and indicate the selected tiles matching the analysis criteria. Since this statement depends on the output of the analysis ({\tt topSelected}), it will wait for statement at line 63 to complete before commencing.
+To visualize the results, the application function {\tt markMap} invoked at line 68 will generate an image of a world map using the MODIS projection system and indicate the selected tiles matching the analysis criteria. Since this statement depends on the output of the analysis ({\tt topSelected}), it will wait for the statement at line 63 to complete before commencing.
-For additional visualization, the script assembles a full map of all the input tiles, placed in their proper grid location on the MODIS world map projection, and again marking the selected tiles. Since this operation needs true-color images of every input tiles these are computed---again in parallel---with 317 jobs generated by the foreach statement at lines 76--78. The power of Swift's implicit parallelization is shown vividly here: since the {\tt colorMODIS} call at line 77 depends only on the input array {\tt geos}, these 317 application invocations are submitted in parallel with the initial 317 parallel executions of the {\tt getLandUse} application at line 56. The script concludes at line 83 by assembling a montage of all the colored tiles and writing this image file to a web-accessible directory for viewing.
+For additional visualization, the script assembles a full map of all the input tiles, placed in their proper grid location on the MODIS world map projection, and with the selected tiles marked. Since this operation needs true-color images of every input tile, these are computed---again in parallel---with 317 jobs generated by the foreach statement at lines 76--78. The power of Swift's implicit parallelization is shown vividly here: since the {\tt colorMODIS} call at line 77 depends only on the input array {\tt geos}, these 317 application invocations are submitted in parallel with the initial 317 parallel executions of the {\tt getLandUse} application at line 56. The script concludes at line 83 by assembling a montage of all the colored tiles and writing this image file to a web-accessible directory for viewing.
\pagebreak
~
@@ -1304,7 +1304,7 @@
describing which particles are in the cavity.
Each script run covers a simulation space of 7 radii by 27 centers by
-10 models, requiring 1,690 jobs per run. Three model systems are
+10 models, requiring 1,890 jobs per run. Three model systems are
investigated for a total of 90 runs. Swift mappers enable metadata describing
these aspects to be encoded in the data files of the simulation campaigns to assist in
managing the large volume of file data.
@@ -1320,11 +1320,11 @@
The single application called by this script is the {\tt glassRun} program wrapped in the app function at lines 21--29. Note that rather than defining main program logic in ``open'' (top-level) code, the script places all the program logic in the function {\tt GlassRun}, invoked by the single statement at line 80. This approach enables the simulation script to be defined in a library that can be imported into other Swift scripts to perform entire campaigns or campaign subsets.
The {\tt GlassRun} function starts by extracting a large set of science parameters from the Swift command line at lines 33--48 using the {\tt @arg()} function. It uses the built-in function {\tt readData} at lines 42--43 to read prepared lists of molecular radii and centroids from parameter files to define the primary physical dimensions of the simulation space.
-A selectable energy function to used by the simulation application is specified as a parameter at line 48.
+A selectable energy function to be used by the simulation application is specified as a parameter at line 48.
At lines 57 and 61, the script leverages Swift's flexible dynamic arrays to create a 3D array for input and a 4D array of structures for outputs. These data structures, whose leaf elements consist entirely of mapped files, are set by using the external mappers specified for the input array at lines 57--59 and for the output array of structures at lines 61--63. Note that many of the science parameters are passed to the mappers and are used by the input mapper to locate files within the large, multilevel directory structure of the campaign and by the output mapper to create new directory and file naming conventions for the campaign outputs. The mappers apply the common, useful practice of using scientific metadata to determine directory and file names.
-The entire body of the {\tt GlassRun} is a four-level nesting of \verb|foreach| statements at lines 65--77. These loops perform a parallel parameter sweep over all combinations of radius, centroid, model, and job number within the simulation space. A single run of the script immediately expands to an independent parallel invocation of the simulation application for each point in the space: 1,670 jobs for the minimum case of a 7 x 27 x 10 x 1 space. Note that the {\tt if} statement at line 69 causes the simulation execution to be skipped if it has already been performed, as determined by a ``\verb|NULL|'' file name returned by the mapper for the output of a given job in the simulation space. In the current campaign the fourth dimension ({\tt nsub}) of the simulation space is fixed at one. This value could be increased to define subconfigurations that would perform better Monte Carlo averaging, with a multiplicative increase in the number of jobs. This is currently set to one
because there are ample starting configurations, but if this was not the case (as in earlier campaigns) the script could run repeated simulations with different random seeds.
+The entire body of the {\tt GlassRun} is a four-level nesting of \verb|foreach| statements at lines 65--77. These loops perform a parallel parameter sweep over all combinations of radius, centroid, model, and job number within the simulation space. A single run of the script immediately expands to an independent parallel invocation of the simulation application for each point in the space: 1,890 jobs for the minimum case of a 7 x 27 x 10 x 1 space. Note that the {\tt if} statement at line 69 causes the simulation execution to be skipped if it has already been performed, as determined by a ``\verb|NULL|'' file name returned by the mapper for the output of a given job in the simulation space. In the current campaign the fourth dimension ({\tt nsub}) of the simulation space is fixed at one. This value could be increased to define subconfigurations that would perform better Monte Carlo averaging, with a multiplicative increase in the number of jobs. This is currently set to one
because there are ample starting configurations, but if this was not the case (as in earlier campaigns) the script could run repeated simulations with different random seeds.
The advantages of managing a simulation campaign in this manner are borne out well by Hocky's experience: the expression of the campaign is a well-structured, high-level script, devoid of details about file naming, synchronization of parallel tasks, location and state of remote computing resources, or explicit data transfer. Hocky was able to leverage local cluster resources on many occasions, but at any time he could count on his script's acquiring on the order of 1,000 compute cores from 6 to 18 sites of the Open Science Grid. When executing on the OSG, he leveraged Swift's capability to replicate jobs that were waiting in queues at more congested sites, and automatically sent them to sites where resources were available and jobs were being processed at better rates. All these actions would have represented a huge distraction from his primary scientific simulation campaign if he had been required to use or to script lower-level abstractions where parallelism and remote distribution were the manual responsibility of the programmer.
@@ -1425,7 +1425,7 @@
Previously published measurements of Swift performance on several scientific applications provide evidence that its parallel distributed programming model can be implemented with sufficient scalability and efficiency to make it a practical tool for large-scale parallel application scripting.
-The performance of Swift submitting jobs over the wide-area network from the University of Chicago to the teraGrid Ranger cluster at TACC is shown in Figure~\ref{SEMplots} (from \cite{CNARI_2009}). The figure plots an SEM workload of 131,072 jobs for four brain regions and two experimental conditions. This workflow completed in approximately 3 hours. The logs from the {\tt swift\_plot\_log} utility show the high degree of concurrent overlap between job execution and input and output file staging to remote computing resources.
+The performance of Swift submitting jobs over the wide-area network from the University of Chicago to the TeraGrid Ranger cluster at TACC is shown in Figure~\ref{SEMplots} (from \cite{CNARI_2009}). The figure plots a structural equation modeling (SEM) workload of 131,072 jobs for four brain regions and two experimental conditions. This workflow completed in approximately 3 hours. The logs from the {\tt swift\_plot\_log} utility show the high degree of concurrent overlap between job execution and input and output file staging to remote computing resources.
The workflows were developed on and submitted (to Ranger) from a single-core Linux workstation at the University of Chicago running an Intel Xeon 3.20 GHz CPU. Data staging was performed by using the Globus GridFTP protocol, and job execution was performed over the Globus GRAM~2 protocol.
During the third hour of the workflow, Swift achieved very high utilization of the 2,048 allocated processor cores and a steady rate of input and output transfers. The first two hours of the run were more bursty, because of fluctuating grid conditions and data server loads.
@@ -1506,7 +1506,7 @@
%% tasks. We use the term workflow as defined by (Taylor
%% et. al. 2006). So we often call a Swift script a workflow.
-% Ousterhout in (Ousterhout 1998) eloquently laid out the rational and
+% Ousterhout in (Ousterhout 1998) eloquently laid out the rationale and
% motivation for scripting languages. As the creator of Tcl [ref], he
% described here the difference between programming and scripting, and
% the place of each in the scheme of applying computers to solving
@@ -1517,6 +1517,10 @@
the scheme of applying computers to solving problems have previously been
presented~\cite{Scripting_1998}.
+%%% ^^^^^^ IMPROVE THIS; don't separate scripting from programming....^^^^
+
+%%% Review the text below VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
+
Coordination languages and systems such as Linda~\cite{LINDA},
Strand~\cite{STRAND_1989}, and PCN~\cite{PCN_1993} allow composition of
distributed or parallel components, but they usually require the components
@@ -1551,8 +1555,8 @@
programming tool for the specification and execution of large parallel
computations on large quantities of data and facilitating the
utilization of large distributed resources. However, the two
-differ in many aspects. The
-MapReduce programming model supports key-value pairs as
+differ in many aspects.
+The MapReduce programming model supports key-value pairs as
input or output datasets and two types of computation functions,
map and reduce; Swift provides a type system and allows the
definition of complex data structures and arbitrary computational
@@ -1625,8 +1629,10 @@
Walker et al.~\cite{DataFlowShell} have recently developed extensions to
BASH that allow a user to define a dataflow graph, including the concepts
-of fork, join, cycles, and key-value aggregation, but execute on single parallel system or cluster.
+of fork, join, cycles, and key-value aggregation, but which execute on single parallel systems or clusters.
+%%% No data types or mapping ^^^; Check latest state of this re the "executes on" point.
+
A few groups have been working on parallel and distributed versions of make~\cite{GXPmake, makeflow}. These tools use the concept of ``virtual data,'' where the user defines the processing by which data is created and then calls for the final data product. The make-like tools determine what processing is needed to get from the existing files to the final product, which includes
running processing tasks. If this is run on a distributed system, data movement also must
be handled by the tools. In comparison, Swift is a language, which may be slightly
@@ -1675,6 +1681,11 @@
are developing a set of collective data management
primitives ~\cite{CDM_2009}.
+% ^^^ reviewer points out:
+% > p. 35: "thousands of small accesses to the filesystem" might not
+% > only be a bottleneck, it may significantly degrade
+% > responsiveness for every process on the system
+
\subsection{Provenance}
\label{Provenance}
More information about the Swift-commit mailing list