[Swift-commit] r3903 - text/parco10submission

noreply at svn.ci.uchicago.edu
Sat Jan 8 08:44:34 CST 2011


Author: wilde
Date: 2011-01-08 08:44:33 -0600 (Sat, 08 Jan 2011)
New Revision: 3903

Modified:
   text/parco10submission/paper.bib
   text/parco10submission/paper.tex
Log:
Completed description of the MODIS example and started polishing the Glass example. Added app intro summary and included references.

Modified: text/parco10submission/paper.bib
===================================================================
--- text/parco10submission/paper.bib	2011-01-08 00:12:42 UTC (rev 3902)
+++ text/parco10submission/paper.bib	2011-01-08 14:44:33 UTC (rev 3903)
@@ -26,6 +26,124 @@
   url = {http://people.cs.uchicago.edu/~iraicu/publications/2008_NOVA08_book-chapter_Swift.pdf},
 }
 
+@article{PTMap_2010,
+ title = {{The first global screening of protein substrates bearing protein-bound 3,4-Dihydroxyphenylalanine in Escherichia coli and human mitochondria}},
+ author = {S Lee and Y Chen and H Luo and A A Wu and M Wilde and P T Schumacker and Y Zhao},
+ journal = {{Journal of Proteome Research}},
+ volume = {9(11)},
+ year = 2010,
+ pages = {5705-14}
+}
+
+@article{PTMap_DUP_2009,
+ title = {{PTMap--a sequence alignment software for unrestricted, accurate, and full-spectrum identification of post-translational modification sites}},
+ author = {Y Chen and W Chen and M H Cobb and Y Zhao},
+ journal = {{Proc Natl Acad Sci USA}},
+ volume = {106(3)},
+ year = 2009,
+ pages = {761-6}
+}
+
+@article{Boker_2010,
+   author = {S Boker and M Neale and H Maes and M Wilde and M Spiegel and T Brick and J Spies and R Estabrook and S Kenny and T Bates and P Mehta and J Fox},
+   title = {OpenMx: An Open Source Extended Structural Equation Modeling Framework},
+   journal = {Psychometrika},
+   volume = {In press},
+   year = {2010}
+}
+
+@techreport{Fedorov_2009,
+   author = {A Fedorov and B Clifford and S K Warfield and R Kikinis and N Chrisochoides},
+   title = {Non-Rigid Registration for Image-Guided Neurosurgery on the TeraGrid: A Case Study},
+   institution = {College of William and Mary},
+   number = {WM-CS-2009-05},
+   year = {2009}
+}
+
+@techreport{ProteinFolding_2009,
+	Author = {G Hocky and M Wilde and J DeBartolo and M Hategan and I Foster and T R Sosnick and K F Freed},
+	Institution = {Argonne National Laboratory},
+	Month = {April},
+	Number = {ANL/MCS-P1612-0409},
+	Title = {Towards petascale ab initio protein folding through parallel scripting},
+	Year = {2009},
+}
+
+@article{SPEED_2010,
+	author={Joe DeBartolo and Glen Hocky and Michael Wilde and Jinbo Xu and Karl F. Freed and Tobin R. Sosnick},
+	title={Protein structure prediction enhanced with evolutionary diversity: SPEED},
+	journal={Protein Science},
+	volume={19},
+	number={3},
+	pages={520--534},
+	year={2010},
+}
+
+@inproceedings{MoralHazard_2007,
+	author={Stef-Praun, T. and Madeira, G. and Foster, I. and Townsend, R.},
+	title={Accelerating solution of a moral hazard problem with Swift},
+	booktitle={e-Social Science 2007},
+	year={2007},
+	address={Indianapolis, IN.}
+}
+
+@article{CNARI_2009,
+	AUTHOR={S Kenny and M Andric and S Boker and M Neale and M Wilde and S L Small},
+	TITLE={Parallel workflows for data-driven structural equation modeling in functional neuroimaging},
+	JOURNAL={Frontiers in Neuroinformatics},
+	VOLUME={3},
+	YEAR={2009},
+	URL={www.frontiersin.org/neuroscience/neuroinformatics/paper/10.3389/neuro.11/034.2009/html/},
+	DOI={10.3389/neuro.11/034.2009},
+	ISSN={1662-5196}
+}
+
+
+@article{CNARI_2008,
+	title = {Improving the analysis, storage and sharing of neuroimaging data using relational databases and distributed computing},
+	journal = "NeuroImage",
+	volume = "39",
+	number = "2",
+	pages = "693 - 706",
+	year = "2008",
+	issn = "1053-8119",
+	doi = "10.1016/j.neuroimage.2007.09.021",
+	url = "http://www.sciencedirect.com/science/article/B6WNP-4PPW72Y-1/2/ac536a08f82f82ad9ce940ac235d8a55",
+	author = "Uri Hasson and Jeremy I. Skipper and Michael J. Wilde and Howard C. Nusbaum and Steven L. Small"
+}
+
+@article{CNARI_DUP_2007,
+	author = {T Stef-Praun and B Clifford and I Foster and U Hasson and M Hategan and S L Small and M Wilde and Y Zhao},
+	issn = {0926-9630},
+	journal = {Studies in health technology and informatics},
+	pages = {207--216},
+	title = {Accelerating medical research using the swift workflow system},
+	url = {http://view.ncbi.nlm.nih.gov/pubmed/17476063},
+	volume = {126},
+	year = {2007}
+}
+
+@article{PetascaleScripting_2009,
+	author = {M Wilde and I Foster and K Iskra and P Beckman and Z Zhang and A Espinosa and M Hategan and B Clifford and I Raicu},
+	title = {Parallel Scripting for Applications at the Petascale and Beyond},
+	journal = {Computer},
+	volume = {42},
+	number = {11},
+	year = {2009},
+	issn = {0018-9162},
+	pages = {50--60},
+	doi = {10.1109/MC.2009.365},
+	publisher = {IEEE Computer Society Press},
+	address = {Los Alamitos, CA, USA},
+ }
+
 @inproceedings{SWIFTIWSW2007,
   author = {Yong Zhao and Mihael Hategan and B Clifford and I Foster and G von Laszewski and I Raicu and T Stef-Praun and M Wilde},
   title = {{Swift: Fast, Reliable, Loosely Coupled Parallel Computation}},

Modified: text/parco10submission/paper.tex
===================================================================
--- text/parco10submission/paper.tex	2011-01-08 00:12:42 UTC (rev 3902)
+++ text/parco10submission/paper.tex	2011-01-08 14:44:33 UTC (rev 3903)
@@ -1185,14 +1185,28 @@
 executable. Swift's existing input file management then
 stages-in the application files once per site per run.
 
-\pagebreak
 \section{Applications}
 \label{Applications}
 
-Swift has been used by applications in
-\mikenote{List here from CDI, IEEE, etc}
+By providing a minimal language for the rapid
+composition of existing executable programs and scripts into
+a logical unit, Swift has become a valuable tool for
+small to moderate-sized scientific projects.
 
-This section describes two complete Swift scripts (representative of two diverse disciplines) in more detail.
+Swift has been used to perform
+computational
+biochemical investigations, such as protein structure prediction \cite{PetascaleScripting_2009, ProteinFolding_2009, SPEED_2010} and
+molecular dynamics of protein-ligand docking~\cite{Falkon_2008}, protein-RNA docking, and searching mass-spectrometry data for post-translational protein
+modifications \cite{PTMap_2009, PTMap_2010, PetascaleScripting_2009};
+modeling the interactions of climate, energy,
+and economics \cite{MoralHazard_2007, PetascaleScripting_2009}; post-processing and analysis of climate model results;
+exploring the language functions of the human brain \cite{CNARI_2007, CNARI_2008, CNARI_2009};
+creating general statistical frameworks for structural equation
+modeling \cite{Boker_2010};
+and performing image processing for research in
+image-guided planning for neurosurgery \cite{Fedorov_2009}.
+
+This section describes two representative Swift scripts (from two diverse disciplines) in more detail.
 The first script is a tutorial example (used in a class on data intensive computing at the University of Chicago) which performs a simple analysis of satellite land-use imagery. The second script is taken (with minor changes to fit better on the page) directly from work done using Swift for an investigation into the molecular structure of glassy materials in the field of theoretical chemistry. In both examples, the intent is to show a complete and realistic Swift script, annotated to better understand the nature of the Swift programming model and to provide a glimpse of real Swift usage.
 
 \subsection{Satellite image data processing.}
@@ -1215,7 +1229,7 @@
 largest total area of any requested set of land-cover types, and then produces a new dataset with viewable
 color images of those closest-matching data tiles.
 (The input datasets are not viewable images, as their pixel
-values are land-use codes. Thus a color rendering step is required). A typical invocation of this script would be ``\emph{find the top 12 urban tiles}'' or ``\emph{find the 16 tiles with the most forest and grassland}''.
+values are land-use codes. Thus a color rendering step is required.) A typical invocation of this script would be ``\emph{find the top 12 urban tiles}'' or ``\emph{find the 16 tiles with the most forest and grassland}''. As this script is used for tutorial purposes, the application programs it calls are simple shell scripts that use fast, generic image processing tools to process the MODIS data. Thus the example executes quickly while remaining representative of much more compute-intensive satellite data processing applications.
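The top-N tile selection described above can be sketched outside Swift. Below is a minimal Python analogy; the tile names, land-use codes, and pixel counts are hypothetical illustrations, not actual MODIS data:

```python
import heapq

# Hypothetical per-tile histograms: land-use code -> pixel count.
# (Real MODIS land-cover products use their own code tables.)
histograms = {
    "h00v08": {13: 52000, 1: 1000},   # mostly "urban" (code 13 here)
    "h01v09": {13: 7000,  1: 90000},  # mostly "forest" (code 1 here)
    "h02v10": {13: 30000, 1: 40000},
}

def top_tiles(histograms, codes, n):
    """Return the n tiles whose summed pixel count over the
    requested land-cover codes is largest."""
    area = lambda tile: sum(histograms[tile].get(c, 0) for c in codes)
    return heapq.nlargest(n, histograms, key=area)

urban = top_tiles(histograms, {13}, 2)  # e.g. "find the top 2 urban tiles"
```

In the actual script this selection runs as one analysis job over the per-tile land-use histograms produced by the parallel stage that precedes it.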
 \\
 \\
 The script is structured as follows:
@@ -1247,7 +1261,7 @@
 
 To visualize the results, the application function {\tt markMap} invoked at line 68 will generate an image of a world map using the MODIS projection system and indicate the selected tiles matching the analysis criteria. Since this statement depends on the output of the analysis, it will wait for the statement at line 63 to complete before commencing.
 
-For additional visualization, the script assembles a full map of all the input tiles, placed in their proper grid location on the MODIS world map projection, and again marking the selected tiles. Since this operation needs true-color images of every input tiles these are computed -- again in parallel -- with 317 jobs invoked by the foreach statement at line 76-78. The power of Swift's implicit parallelization is very vividly shown here: since the colorMODIS call at line 77 depends only on the input array geos, these 317 application invocations.
+For additional visualization, the script assembles a full map of all the input tiles, placed in their proper grid locations on the MODIS world map projection, again marking the selected tiles. Since this operation needs a true-color image of every input tile, these are computed -- again in parallel -- with 317 jobs invoked by the foreach statement at lines 76-78. The power of Swift's implicit parallelization is vividly shown here: since the {\tt colorMODIS} call at line 77 depends only on the input array {\tt geos}, these 317 application invocations are executed in parallel with the initial 317 parallel executions of the {\tt getLandUse} application at line 56.  The script concludes at line 83 by assembling a montage of all the colored tiles and writing this image file to a web-accessible directory for viewing.
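Swift derives this ordering automatically from dataflow. A rough Python analogy with explicit futures (the function and variable names below are invented for illustration) shows why the two foreach loops can overlap: neither loop's tasks depend on the other's results, only on the shared input list.

```python
from concurrent.futures import ThreadPoolExecutor

def get_land_use(tile):      # stand-in for the getLandUse application
    return ("landuse", tile)

def color_modis(tile):       # stand-in for the colorMODIS application
    return ("color", tile)

tiles = [f"tile{i:03d}" for i in range(317)]

with ThreadPoolExecutor(max_workers=16) as pool:
    # Both loops depend only on the input tile list, not on each other,
    # so all 634 tasks can be in flight concurrently.
    land_use = [pool.submit(get_land_use, t) for t in tiles]
    colored  = [pool.submit(color_modis, t) for t in tiles]
    results  = [f.result() for f in land_use + colored]
```

The key difference is that in Swift the programmer writes ordinary array-returning calls and the runtime infers this future-based schedule, rather than constructing it by hand.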
 
 \pagebreak
 Swift example 1: MODIS satellite image processing script
@@ -1279,13 +1293,8 @@
     25	{
     26	  assemble @output @selected @filename(img[0]) webdir;
     27	}
-<<<<<<< .mine
     28	
     29	app (image grid) markMap (file tilelist) 
-=======
-    28
-    29	app (imagefile grid) markMap (file tilelist)
->>>>>>> .r3901
     30	{
     31	  markmap @tilelist @grid;
     32	}
@@ -1344,34 +1353,25 @@
 \end{Verbatim}
 %\end{verbatim}
 
-\pagebreak
 \subsection{Simulation of glassy dynamics and thermodynamics.}
 
-Recent study of the glass transition in model systems has focused on calculating from theory or simulation what is known as the "Mosaic length". Glen Hocky of the Reichman Lab at Columbia applied a new cavity method for measuring this length scale, where particles are simulated by molecular dynamics or Monte Carlo methods within cavities having amorphous boundary conditions. Various correlation functions are calculated at the interior of cavities of varying sizes and averaged over many independent simulations to determine a thermodynamic length. Hocky's simulations this method to investigate the differences between three different systems which all have the same "structure" but differ in other subtle ways to see if it is in fact this thermodynamic length that is there difference between the models.
+A recent study of the glass transition in model systems has focused on calculating from theory or simulation what is known as the ``Mosaic length''.
 
-Rather than run ~500K-1.5M steps per jobs (which a priori i didn't know how many i would run anyway) i ran 100K at a time. hence the repetitions of runs. But i would say the campaign started more like in october. if all the jobs are on pads then it'll be more obvious.
+Glen Hocky of the Reichman Group at Columbia applied a new cavity method for measuring this length scale, in which particles are simulated by molecular dynamics or Monte Carlo methods within cavities having amorphous boundary conditions. Various correlation functions are calculated at the interior of cavities of varying sizes and averaged over many independent simulations to determine a thermodynamic length. Hocky is using simulations based on this method to investigate the differences between three glass systems which all have the same structure but differ in other subtle ways, to determine whether this thermodynamic length accounts for the variations between the models.
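The averaging over independent simulations can be illustrated with a tiny sketch (pure Python, with hypothetical numbers; the real analysis operates on full correlation functions produced by the Monte Carlo runs):

```python
def average_correlations(runs):
    """Pointwise average of correlation functions (equal-length
    lists of samples) taken from independent simulations."""
    n = len(runs)
    return [sum(values) / n for values in zip(*runs)]

# Two hypothetical correlation functions sampled at three distances:
runs = [[1.0, 0.5, 0.2],
        [0.8, 0.7, 0.0]]
mean_corr = average_correlations(runs)
```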
 
-As this simulation was a lengthy campaign (from about October through December 2010) Hocky chose to leverage Swift ``external'' mappers to determine what work remained during various restarts. His mappers assumed an application run was complete if all the returned ".final" files existed.  In the case of script restarts, results that already existed were not computed. The swift restart mechanism was also tested and worked fine, but required tracking which workflow was being restarted. Occasionally missing files caused the restart to fail; Hocky's ad-hoc restart via mappers worked exceedingly well (and perhaps suggests a new approach for the integrated restart mechanism).
+Hocky's application code performs 100,000 Monte Carlo steps in about 1-2 hours. Ten jobs are used to generate the 1M simulation steps needed for each configuration. The input data to each simulation is a file of about 150KB representing initial glass structures. Each simulation returns three new structures of 150KB each, a 50KB log file, and a 4KB file describing which particles are in the cavity.
 
-A high-level description of the glass simulation campaign is as follows:
+Each run covers a space of 7 radii by 27 centers by 10 models, requiring 1690 jobs per run. Three methods are simulated (``kalj'', ``kawka'', and ``pedersenipl'') for a total of 90 runs. Swift mappers enable metadata describing these parameters to be encoded in the campaign's data files, assisting in the management of the large volume of file data.
 
-loops are: 7 radii x 27 centers x 10 models x 1 job = 1690 jobs per run
+As the simulation campaigns are quite lengthy (the first ran from October through December 2010), Hocky chose to leverage Swift ``external'' mappers to determine what simulations need to be performed at any point in the campaign. His input mappers assume that an application run is complete if all of its returned ``.final'' files exist. Thus, when a script is restarted, results that already exist are not recomputed.
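The mapper's completeness test can be sketched as follows. This is only an analogy with a hypothetical directory layout and made-up output file names; a Swift external mapper is in fact a separate executable that emits path mappings for the script's variables.

```python
import os

# Hypothetical names of the per-run outputs the mapper checks for.
EXPECTED = ("struct1.final", "struct2.final", "struct3.final")

def run_is_complete(run_dir, expected=EXPECTED):
    """Treat a run as done only if every expected .final output exists."""
    return all(os.path.exists(os.path.join(run_dir, f)) for f in expected)

def runs_to_do(run_dirs):
    # On restart, only runs missing some .final output are re-executed.
    return [d for d in run_dirs if not run_is_complete(d)]
```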
 
-3 methods: kalj (16) kawka(37) pedersenipl (37) for total of 90 runs
+Roughly 152,000 jobs were defined by all the {\tt run*.sh} scripts of the campaign. Some runs were performed on other resources, including the UChicago PADS cluster and TeraGrid; the only change necessary to run on OSG was configuring the OSG sites to run the science application.
 
-roughly 152,000 jobs defined by all the run*.sh scripts
-
-about 1-2 hours per job
-
 Approximate OSG usage was over 100K CPU hours, with about 100K tasks of 1-2 hours each completed. The application has been successfully run on about 18 OSG sites (with the majority of runs completed on about 6 primary sites).
 
+Investigations of more advanced techniques are underway, and the fact that the entire campaign can be driven by location-independent Swift scripts will enable Hocky to re-execute the entire campaign reliably and with relative ease.
 Without Swift, this project would have been unwieldy and much harder to organize.
 
-Some runs were done on other resources including UChicago TeraGrid and the only change/addition necessary to run on OSG was configuring the OSG sites to run the science application.
-
-Is currently investigating whether slightly more advanced techniques will be necessary, in which case I may need to run approximately the same amount of simulations again.
-
-
 \pagebreak
 Swift example 2: Monte-Carlo simulation of quantum glass structures
 



