[Swift-commit] r3897 - text/parco10submission

Fri Jan 7 13:37:40 CST 2011

Author: wozniak
Date: 2011-01-07 13:37:40 -0600 (Fri, 07 Jan 2011)
New Revision: 3897

Modified:
   text/parco10submission/paper.tex
Log:
Include multicore plot


Modified: text/parco10submission/paper.tex
===================================================================

--- text/parco10submission/paper.tex	2011-01-07 19:37:06 UTC (rev 3896)
+++ text/parco10submission/paper.tex	2011-01-07 19:37:40 UTC (rev 3897)
@@ -80,16 +80,16 @@
 
 \begin{abstract}
 
-Scientists, engineers and statisticians must often execute domain-specific application programs 
-many times on large collections of file-based data.  This  activity requires complex orchestration and data management 
+Scientists, engineers and statisticians must often execute domain-specific application programs
+many times on large collections of file-based data.  This  activity requires complex orchestration and data management
 as data is passed to, from, and among application invocations. Distributed and
 parallel computing resources can accelerate
-such processing, but their use increases programming complexity yet further. 
+such processing, but their use increases programming complexity yet further.
 The Swift parallel scripting language reduces these complexities by (a) making file system structures accessible via language constructs,
 and (b)
 allowing
 ordinary application programs to be composed into powerful parallel scripts that can efficiently utilize parallel and distributed resources.
-We present Swift's implicitly parallel and deterministic programming model, 
+We present Swift's implicitly parallel and deterministic programming model,
 which applies external applications to file collections using a functional style of scripting that abstracts and simplifies distributed parallel execution.
 
 %IAN: Re above--it seems important to me to make point (a).
@@ -136,14 +136,14 @@
 %new approach.
 
 Many parallel applications involve a single
-message-passing parallel program: a model supported well by the Message Passing Interface (MPI). 
+message-passing parallel program: a model supported well by the Message Passing Interface (MPI).
 However, many others require the coupling or
 orchestration of large numbers of application invocations: either many
 invocations of the same program, or many invocations of sequences and
-patterns of several programs. The execution of these 
+patterns of several programs. The execution of these
 %In this model, existing applications are similar to
 %functions in programming, and users typically need to execute many of
-%them. 
+%them.
 Scaling up requires the distribution of such workloads among
 many computers or clusters.
 %and hence a ``grid'' approach.
@@ -387,7 +387,7 @@
 Variables are used in Swift to name the local variables, arguments, and returns of a function. Every Swift variable is assigned a concrete data type, based on a very simple type model (with no concepts of inheritance, abstraction, etc.). The outermost function in a Swift script (akin to ``main'' in C) is only unique in that the variables in its environment can be declared ``global'' to make them accessible to every other function in the script.
 
 Swift data elements (atomic variables and array elements) are \emph{single-assignment}---
-they can be assigned at most one value during execution---and behave as futures. 
+they can be assigned at most one value during execution---and behave as futures.
 This semantic provides the
 basis for Swift's model of parallel function evaluation and chaining.
 While Swift collection types (arrays and structures) are not
@@ -411,7 +411,7 @@
 indices, but are sparse.
 Both types of collections can contain members of atomic or collection types. Structures contain a finite number of elements. Arrays contain a varying number of elements. Structures and arrays can both recursively reference other structures and arrays in addition to atomic values. Arrays can be nested to provide multi-dimensional indexing.
 
-Due to the dynamic, highly parallel nature of Swift, its arrays have no notion of size. Array elements can be set as a script's execution progresses. The number of elements set increases monotonically. An array is considered ``closed'' when no further statements that set an element of the array can be executed. This state is recognized at run time by information obtained from compile-time analysis of the script's call graph. 
+Due to the dynamic, highly parallel nature of Swift, its arrays have no notion of size. Array elements can be set as a script's execution progresses. The number of elements set increases monotonically. An array is considered ``closed'' when no further statements that set an element of the array can be executed. This state is recognized at run time by information obtained from compile-time analysis of the script's call graph.
 
 %Also, since all data elements have single-assignment semantics, no garbage collection issues arise. \katznote{does this follow? garbage collection removed variables that are no longer needed - I don't see how single assignment helps here.}
 %\mihaelnote{I think we should not mention the garbage collection issue. In fact, we don't and we should implement
@@ -422,7 +422,7 @@
 are associated with a \emph{mapper}, which defines (often through a dynamic lookup process) the
 data files that are to be mapped to the variable. Array and structure elements that are declared to be mapped files are similarly mapped.
 
-Mapped type and collection 
+Mapped type and collection
 type variable declarations can be annotated with a
 \emph{mapping} descriptor that specifies the file(s) that are to be mapped to the Swift data element(s).
 
@@ -1189,7 +1189,7 @@
 \section{Applications}
 \label{Applications}
 
-Swift has been used by applications in 
+Swift has been used by applications in
 \mikenote{List here from CDI, IEEE, etc}
 
 This section describes two complete Swift scripts (representative of two diverse disciplines) in more detail.
@@ -1224,84 +1224,84 @@
      1	type file;
      2	type imagefile;
      3	type landuse;
-     4	
+     4
      5	# Define application program interfaces
-     6	
+     6
      7	app (landuse output) getLandUse (imagefile input, int sortfield)
      8	{
      9	  getlanduse @input sortfield stdout=@output ;
     10	}
-    11	
+    11
     12	app (file output, file tilelist) analyzeLandUse
     13	    (landuse input[], string usetype, int maxnum)
     14	{
     15	  analyzelanduse @output @tilelist usetype maxnum @filenames(input);
     16	}
-    17	
+    17
     18	app (imagefile output) colorMODIS (imagefile input)
     19	{
     20	  colormodis @input @output;
     21	}
-    22	
+    22
     23	app (imagefile output) assemble
     24	    (file selected, imagefile image[], string webdir)
     25	{
     26	  assemble @output @selected @filename(image[0]) webdir;
     27	}
-    28	
-    29	app (imagefile grid) markMap (file tilelist) 
+    28
+    29	app (imagefile grid) markMap (file tilelist)
     30	{
     31	  markmap @tilelist @grid;
     32	}
-    33	
+    33
     34	# Constants and command line arguments
-    35	
+    35
     36	int nFiles =      @toint(@arg("nfiles","1000"));
     37	int nSelect =     @toint(@arg("nselect","12"));
     38	string landType = @arg("landtype","urban");
     39	string runID =    @arg("runid","modis-run");
     40	string MODISdir=  @arg("modisdir","/home/wilde/bigdata/data/modis/2002");
     41	string webDir =   @arg("webdir","/home/wilde/public_html/geo/");
-    42	
+    42
     43	string suffix=".tif";
-    44	
+    44
     45	# Input Dataset
-    46	
+    46
     47	imagefile geos[] <ext; exec="modis.mapper",
     48	  location=MODISdir, suffix=".tif", n=nFiles >; # site=site
-    49	
+    49
     50	# Compute the land use summary of each MODIS tile
-    51	
+    51
     52	landuse land[] <structured_regexp_mapper; source=geos, match="(h..v..)",
     53	  transform=@strcat(runID,"/\\1.landuse.byfreq")>;
-    54	
+    54
     55	foreach g,i in geos {
     56	    land[i] = getLandUse(g,1);
     57	}
-    58	
+    58
     59	# Find the top N tiles (by total area of selected landuse types)
-    60	
+    60
     61	file topSelected<"topselected.txt">;
     62	file selectedTiles<"selectedtiles.txt">;
     63	(topSelected, selectedTiles) = analyzeLandUse(land, landType, nSelect);
-    64	
+    64
     65	# Mark the top N tiles on a sinusoidal gridded map
-    66	
+    66
     67	imagefile gridMap<"markedGrid.gif">;
     68	gridMap = markMap(topSelected);
-    69	
+    69
     70	# Create multi-color images for all tiles
-    71	
+    71
     72	imagefile colorImage[] <structured_regexp_mapper;
-    73	          source=geos, match="(h..v..)", 
+    73	          source=geos, match="(h..v..)",
     74	          transform="landuse/\\1.color.png">;
-    75	
+    75
     76	foreach g, i in geos {
     77	  colorImage[i] = colorMODIS(g);
     78	}
-    79	
+    79
     80	# Assemble a montage of the top selected areas
-    81	
+    81
     82	imagefile montage <single_file_mapper; file=@strcat(runID,"/","map.png") >; # @arg
     83	montage = assemble(selectedTiles,colorImage,webDir);
 
@@ -1315,7 +1315,7 @@
 
 Rather than run ~500K-1.5M steps per job (which a priori I didn't know how many I would run anyway), I ran 100K at a time, hence the repetitions of runs. But I would say the campaign started more like in October. If all the jobs are on pads then it'll be more obvious.
 
-As this simulation was a lengthy campaign (from about October through December 2010), Hocky chose to leverage Swift ``external'' mappers to determine what work remained during various restarts. His mappers assumed an application run was complete if all the returned ``.final'' files existed.  In the case of script restarts, results that already existed were not recomputed. The Swift restart mechanism was also tested and worked fine, but required tracking which workflow was being restarted. Occasionally missing files caused the restart to fail; Hocky's ad-hoc restart via mappers worked exceedingly well (and perhaps suggests a new approach for the integrated restart mechanism). 
+As this simulation was a lengthy campaign (from about October through December 2010), Hocky chose to leverage Swift ``external'' mappers to determine what work remained during various restarts. His mappers assumed an application run was complete if all the returned ``.final'' files existed.  In the case of script restarts, results that already existed were not recomputed. The Swift restart mechanism was also tested and worked fine, but required tracking which workflow was being restarted. Occasionally missing files caused the restart to fail; Hocky's ad-hoc restart via mappers worked exceedingly well (and perhaps suggests a new approach for the integrated restart mechanism).
 
 A high-level description of the glass simulation campaign is as follows:
 
@@ -1327,7 +1327,7 @@
 
 about 1-2 hours per job
 
-Approximate OSG usage is over 100K CPU hours, with about 100K tasks of 1-2 hours completed. The application has been successfully run on about 18 OSG sites (with the majority of runs completed on about 6 primary  sites). 
+Approximate OSG usage is over 100K CPU hours, with about 100K tasks of 1-2 hours completed. The application has been successfully run on about 18 OSG sites (with the majority of runs completed on about 6 primary  sites).
 
 This project would be completely unwieldy and much harder to organize without using Swift.
 
@@ -1344,11 +1344,11 @@
      1	type Arc;
      2	type Restart;
      3	type Log;
-     4	
+     4
      5	type GlassIn{
      6	  Restart startfile;
      7	}
-     8	
+     8
      9	type GlassOut{
     10	  Arc arcfile;
     11	  Restart restartfile;
@@ -1356,7 +1356,7 @@
     13	  Restart final;
     14	  Log logfile;
     15	}
-    16	
+    16
     17	app (GlassOut o) glassCavityRun(
     18	  GlassIn i, string rad, string temp, string steps,
     19	  string volume, string fraca, string energyfunction,
@@ -1368,7 +1368,7 @@
     25	    "--cradius" rad "--ccoord" centerstring arctimestring
     26	    stdout=@filename(o.logfile);
     27	}
-    28	
+    28
     29	CreateGlassSystem()
     30	{
     31	  string temp=@arg("temp","2.0");
@@ -1393,16 +1393,16 @@
     50	    arctimestring="";
     51	  }
     52	  string energyfunction=@arg("energyfunction","softsphereratiosmooth");
-    53	
+    53
     54	  GlassIn modelIn[][][] <ext;exec="GlassCavityOutArray.map",
     55	    rlist=rlist, clist=clist, steps=ceqsteps, n=nmodels,
     56	    esteps=esteps, temp=temp, volume=volume,
-    57	    e=energyfunction, natoms=natoms, i="true">; 
+    57	    e=energyfunction, natoms=natoms, i="true">;
     58	  GlassOut modelOut[][][][] <ext; exec="GlassCavityContinueOutArray.map",
     59	    n=nmodels, nsub=nsub, rlist=rlist, clist=clist,
     60	    ceqsteps=ceqsteps, esteps=esteps, steps=steps, temp=temp, volume=volume,
     61	    e=energyfunction, natoms=natoms>;
-    62	
+    62
     63	  foreach rad,rindex in radii {
     64	    foreach centerstring,cindex in centers {
     65	      foreach model in [0:nmodels-1] {
@@ -1420,8 +1420,8 @@
     77	    }
     78	  }
     79	}
-    80	
-    81	
+    80
+    81
     82	CreateGlassSystem();
 \end{Verbatim}
 %\end{verbatim}
@@ -1437,7 +1437,16 @@
   \end{center}
 \end{figure*}
 
+\begin{figure*}[htbp]
+  \begin{center}
+    \includegraphics[scale=0.70]{plots/multicore}
+    \caption{System utilization for variable length tasks
+              at varying concurrency}
+    \label{PlotMulticore}
+  \end{center}
+\end{figure*}
 
+
 \section{Related Work}
 \label{Related}
 
@@ -1491,7 +1500,7 @@
 programming tool for the specification and execution of large parallel
 computations on large quantities of data, and facilitating the
 utilization of large distributed resources. However, the two also
-differ in many aspects.  The 
+differ in many aspects.  The
 MapReduce programming model supports key-value pairs as
 input or output datasets and two types of computation functions,
 map and reduce; Swift provides a type system and allows the