From iraicu at cs.uchicago.edu Sat Jan 16 08:46:58 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 16 Jan 2010 08:46:58 -0600 Subject: [Swift-user] CFP: 19th ACM International Symposium on High Performance Distributed Computing Message-ID: <4B51D162.2040805@cs.uchicago.edu> The 19th ACM International Symposium on High Performance Distributed Computing (HPDC 2010) is now accepting submissions of research papers. Authors are invited to submit full papers of at most 12 pages or short papers of at most 4 pages. Details about formatting requirements and the submission process are given at http://hpdc2010.eecs.northwestern.edu/submitpaper.html. The deadline for registering an abstract has been extended to Friday, January 22, 2010, and the deadline for the complete paper remains Friday, January 22, 2010. The detailed call for papers follows, and can also be seen online at http://hpdc2010.eecs.northwestern.edu. ======================================================================= ACM HPDC 2010 Call For Papers 19th ACM International Symposium on High Performance Distributed Computing Chicago, Illinois June 21-25, 2010 http://hpdc2010.eecs.northwestern.edu/ The ACM International Symposium on High Performance Distributed Computing (HPDC) is the premier venue for presenting the latest research on the design, implementation, evaluation, and use of parallel and distributed systems for high performance and high end computing. The 19th installment of HPDC will take place in the heart of Chicago, Illinois, the third largest city in the United States and a major technological and cultural capital. The conference will be held on June 23-25 (Wednesday through Friday) with eight affiliated workshops occurring on June 21-22 (Monday and Tuesday). The Open Grid Forum will be co-located as well, on June 20-22 (Sunday through Tuesday). Submissions are welcomed on all forms of high performance distributed computing, including grids, clouds, clusters, service-oriented computing, utility computing, peer-to-peer systems, and global computing ensembles. New scholarly research showing empirical and reproducible results in architectures, systems, and networks is strongly encouraged, as are experience reports of applications and deployments that can provide insights for future high performance distributed computing research. All papers will be rigorously reviewed by a distinguished program committee, with a strong focus on the combination of rigorous scientific results and likely high impact within high performance distributed computing. Research papers must clearly demonstrate research contributions and novelty, while experience reports must clearly describe lessons learned and demonstrate impact. Topics of interest include (but are not limited to) the following, in the context of high performance distributed computing and high end computing: * Systems * Architectures * Algorithms * Networking * Programming languages and environments * Data management * I/O and file systems * Virtualization * Resource management, scheduling, and load-balancing * Performance modeling, simulation, and prediction * Fault tolerance, reliability and availability * Security, configuration, policy, and management issues * Multicore issues and opportunities * Models and use cases for utility, grid, and cloud computing Both full papers and short papers (for poster presentation and/or demonstrations) may be submitted. 
IMPORTANT DATES Paper Abstract submissions: January 15, 2010 Paper submissions: January 22, 2010 Author notification: March 30, 2010 Final manuscripts: April 23, 2010 SUBMISSIONS Authors are invited to submit full papers of at most 12 pages or short papers of at most 4 pages. The page limits include all figures and references. Papers should be formatted in the ACM proceedings style (e.g., http://www.acm.org/sigs/publications/proceedings-templates). Reviewing is single-blind. Papers must be self-contained and provide the technical substance required for the program committee to evaluate the paper's contribution, including how it differs from prior work. All papers will be reviewed and judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference. Submitted papers must be original work that has not appeared in and is not under consideration for another conference or a journal. PUBLICATION Accepted full and short papers will appear in the conference proceedings. WORKSHOPS Eight workshops have been selected for co-location with HPDC, on June 21st and 22nd, 2010. Please visit the workshop web page at http://hpdc2010.eecs.northwestern.edu/workshops.html for more information. The workshops include: * Emerging Computational Methods for the Life Sciences * LSAP: Large-Scale System and Application Performance * MDQCS: Managing Data Quality for Collaborative Science * ScienceCloud: Scientific Cloud Computing * CLADE: Challenges of Large Applications in Distributed Environments * DIDC: Data Intensive Distributed Computing * MAPREDUCE: MapReduce and its Applications * VTDC: Virtualization Technologies for Distributed Computing OPEN GRID FORUM Open Grid Form (ogf.org ) will have a co-located meeting with HPDC. Please visit the web site for more information. GENERAL CO-CHAIRS Kate Keahey, Argonne National Labs Salim Hariri, University of Arizona STEERING COMMITTEE Salim Hariri, Univ. of Arizona (Chair) Andrew A. Chien, Intel / UCSD Henri Bal, Vrije University Franck Cappello, INRIA Jack Dongarra, Univ. of Tennessee Ian Foster, ANL& Univ. of Chicago Andrew Grimshaw, Univ. of Virginia Carl Kesselman, USC/ISI Dieter Kranzlmueller, Ludwig-Maximilians-Univ. Muenchen Miron Livny, Univ. of Wisconsin Manish Parashar, Rutgers University Karsten Schwan, Georgia Tech David Walker, Univ. 
of Cardiff Rich Wolski, UCSB PROGRAM CHAIR Peter Dinda, Northwestern University PROGRAM COMMITTEE Kento Aida, NII and Tokyo Institute of Technology Ron Brightwell, Sandia National Labs Fabian Bustamante, Northwestern University Henri Bal, Vrije Universiteit Frank Cappello, INRIA Claris Castillo, IBM Research Henri Casanova, University of Hawaii Abhishek Chandra, University of Minnesota Chris Colohan, Google Brian Cooper, Yahoo Research Wu-chun Feng, Virginia Tech Renato Ferreira, Universidade Federal de Minas Gerais Jose Fortes, University of Florida Ian Foster, University of Chicago / Argonne Geoffrey Fox, Indiana University Michael Gerndt, TU-Munich Andrew Grimshaw, University of Virginia Thilo Kielmann, Vrije Universiteit Zhiling Lan, IIT John Lange, Northwestern University Arthur Maccabe, Oak Ridge National Labs Satoshi Matsuoka, Toyota Institute of Technology Jose Moreira, IBM Research Klara Nahrstedt, UIUC Dushyanth Narayanan, Microsoft Research Manish Parashar, Rutgers University Ioan Raicu, Northwestern University Morris Riedel, Juelich Supercomputing Centre Matei Ripeanu, UBC Joel Saltz, Emory University Karsten Schwan, Georgia Tech Thomas Stricker, Google Jaspal Subhlok, University of Houston Martin Swany, University of Delaware Michela Taufer, University of Delaware Valerie Taylor, TAMU Douglas Thain, University of Notre Dame Jon Weissman, University of Minnesota Rich Wolski, UCSB and Eucalyptus Systems Dongyan Xu, Purdue University Ken Yocum, UCSD WORKSHOP CHAIR Douglas Thain, University of Notre Dame PUBLICITY CO-CHAIRS Martin Swany, U. Delaware Morris Riedel, Julich Supercomputing Centre Renato Ferreira, Universidade Federal de Minas Gerais Kento Aida, NII and Tokyo Institute of Technology LOCAL ARRANGEMENTS CHAIR Zhiling Lan, IIT STUDENT ACTIVITIES CO-CHAIRS John Lange, Northwestern University Ioan Raicu, Northwestern University -- ================================================================= Ioan Raicu, Ph.D. NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= -- ================================================================= Ioan Raicu, Ph.D. NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iraicu at cs.uchicago.edu Mon Jan 18 08:27:07 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 18 Jan 2010 08:27:07 -0600 Subject: [Swift-user] CFP: IEEE 2010 Fourth International Workshop on Scientific Workflows (SWF 2010) Message-ID: <4B546FBB.601@cs.uchicago.edu> Call for Papers IEEE 2010 Fourth International Workshop on Scientific Workflows (SWF 2010) http://www.cs.wayne.edu/~shiyong/swf Miami, Florida, U.S.A., one day between July 5-10, 2010 In conjunction with IEEE ICWS/SCC/CLOUD/SERVICES 2010 Description Scientific workflows have become an increasingly popular paradigm for scientists to formalize and structure complex scientific processes to enable and accelerate many significant scientific discoveries. A scientific workflow is a formal specification of a scientific process, which represents, streamlines, and automates the analytical and computational steps that a scientist needs to go through from dataset selection and integration, computation and analysis, to final data product presentation and visualization. The importance of scientific workflows has been recognized by NSF since 2006 and was reemphasized recently in a Science article titled "Beyond the Data Deluge" (Science, Vol. 323, no. 5919, pp. 1297-1298, 2009), which concluded, "In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies." The goal of SWF 2010 is to provide a forum for researchers and practitioners to present their recent research results and best practices of scientific workflows, and to identify the emerging trends, opportunities, problems, and challenges in this area. Authors are invited to submit regular papers (8 pages) and short papers (4 pages) that show original unpublished research results in all areas of scientific workflows. Topics of interest are listed below; however, submissions on all aspects of scientific workflows are welcome. Accepted SWF 2010 papers will be included in the proceedings of IEEE SERVICES 2010, which will be published by IEEE Computer Society Press. 
Topics o Scientific workflow provenance management and analytics o Scientific workflow data, metadata, service, and task management o Scientific workflow architectures, models, and languages o Scientific workflow monitoring, debugging, and failure handling o Streaming data processing in scientific workflows o Pipelined, data, workflow, and task parallelism in scientific workflows o Service, Grid, or Cloud-based scientific workflows o Data, metadata, compute, user-interaction, or visualization-intensive scientific workflows o Scientific workflow composition o Security issues in scientific workflows o Data integration and service integration in scientific workflows o Scientific workflow mapping, optimization, and scheduling o Scientific workflow modeling, simulation, analysis, and verification o Scalability, reliability, extensibility, agility, and interoperability o Scientific workflow applications Important dates Paper Submission March 17, 2010 Decision Notification (Electronic) April 17, 2010 Camera-Ready Submission & Pre-registration April 30, 2010 Workshop chairs: Shiyong Lu, Wayne State University Calton Pu, Georgia Tech Liqiang Wang, University of Wyoming Publication chairs (pending) Ilkay Altintas, San Diego Supercomputer Center Yogesh Simmhan, Microsoft Research Ioan Raicu, Northwestern University Publicity chair Jamil Alhiyafi, Wayne State University Program committee Ilkay Altintas, San Diego Supercomputer Center, USA Roger Barga, Microsoft Research, USA Adam Barker, University of Oxford, UK Shawn Bowers, UC Davis Genome Center, USA Artem Chebotko, University of Texas at Pan American, USA Ian Gorton, PNNL Paul Groth, VU University Amsterdam Marta L. Queirós Mattoso, Federal University of Rio de Janeiro, Brazil Luc Moreau, University of Southampton, UK Ioan Raicu, Northwestern University, USA Yogesh Simmhan, Microsoft Corporation, USA Chung-Wei Hang, North Carolina State University, USA Ian Taylor, Cardiff University, UK Jianwu Wang, San Diego Supercomputer Center Wei Tan, ANL Ping Yang, Binghamton University, USA Ustun Yildiz, UC Davis Yong Zhao, Microsoft Corporation, USA Zhiming Zhao, University of Amsterdam, the Netherlands For any questions, please send e-mails to Shiyong Lu at shiyong at wayne.edu . -- ================================================================= Ioan Raicu, Ph.D. NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Mon Jan 18 15:59:57 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 18 Jan 2010 15:59:57 -0600 Subject: [Swift-user] CFP: 1st ACM Workshop on Scientific Cloud Computing (ScienceCloud) 2010 Message-ID: <4B54D9DD.7000103@cs.uchicago.edu> Call for Papers --------------------------------------------------------------------------------------- 1st ACM Workshop on Scientific Cloud Computing (ScienceCloud) 2010 http://dsl.cs.uchicago.edu/ScienceCloud2010/ --------------------------------------------------------------------------------------- June 21st, 2010 Chicago, Illinois, USA Co-located with the ACM High Performance Distributed Computing Conference (HPDC) 2010 ======================================================================================= Workshop Overview The advent of computation can be compared, in terms of the breadth and depth of its impact on research and scholarship, to the invention of writing and the development of modern mathematics. Scientific Computing has already begun to change how science is done, enabling scientific breakthroughs through new kinds of experiments that would have been impossible only a decade ago. Today's science is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. Support for data intensive computing is critical to advancing modern science, as storage systems have seen the gap between their capacity and their bandwidth widen more than 10-fold over the last decade. There is an emerging need for advanced techniques to manipulate, visualize and interpret large datasets. Scientific Computing is the key to many domains' "holy grail" of new knowledge, and comes in many shapes and forms: high-performance computing (HPC), which is heavily focused on compute-intensive applications; high-throughput computing (HTC), which focuses on using many computing resources over long periods of time to accomplish its computational tasks; many-task computing (MTC), which aims to bridge the gap between HPC and HTC by focusing on using many resources over short periods of time; and data-intensive computing, which is heavily focused on data distribution and on harnessing data locality by scheduling computations close to the data. The 1st Workshop on Scientific Cloud Computing (ScienceCloud) will provide the scientific community with a dedicated forum for discussing new research, development, and deployment efforts in running these kinds of scientific computing workloads on Cloud Computing infrastructures. 
The ScienceCloud workshop will focus on the use of cloud-based technologies to meet new compute intensive and data intensive scientific challenges that are not well served by the current supercomputers, grids or commercial clouds. What architectural changes to the current cloud frameworks (hardware, operating systems, networking and/or programming models) are needed to support science? Dynamic information derived from remote instruments and coupled simulation and sensor ensembles are both important new science pathways and tremendous challenges for current HPC/HTC/MTC technologies. How can cloud technologies enable these new scientific approaches? How are scientists using clouds? Are there scientific HPC/HTC/MTC workloads that are suitable candidates to take advantage of emerging cloud computing resources with high efficiency? What benefits exist by adopting the cloud model, over clusters, grids, or supercomputers? What factors are limiting clouds use or would make them more usable/efficient? This workshop encourages interaction and cross-pollination between those developing applications, algorithms, software, hardware and networking, emphasizing scientific computing for such cloud platforms. We believe the workshop will be an excellent place to help the community define the current state, determine future goals, and define architectures and services for future science clouds. Topics of Interest --------------------------------------------------------------------------------------- We invite the submission of original work that is related to the topics below. The papers can be either short (5 pages) position papers, or long (10 pages) research papers. Topics of interest include (in the context of Cloud Computing): * scientific computing applications o case studies on cloud computing o case studies comparing clouds, cluster, grids, and/or supercomputers o performance evaluation * performance evaluation o real systems o cloud computing benchmarks o reliability of large systems * programming models and tools o map-reduce and its generalizations o many-task computing middleware and applications o integrating parallel programming frameworks with storage clouds o message passing interface (MPI) o service-oriented science applications * storage cloud architectures and implementations o distributed file systems o content distribution systems for large data o data caching frameworks and techniques o data management within and across data centers o data-aware scheduling o data-intensive computing applications o eventual-consistency storage usage and management * compute resource management o dynamic resource provisioning o scheduling o techniques to manage many-core resources and/or GPUs * high-performance computing o high-performance I/O systems o interconnect and network interface architectures for HPC o multi-gigabit wide-area networking o scientific computing tradeoffs between clusters/grids/supercomputers and clouds o parallel file systems in dynamic environments * models, frameworks and systems for cloud security o implementation of access control and scalable isolation Paper Submission and Publication --------------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, (including all text, figures, and references) as per ACM 8.5 x 11 manuscript guidelines 
(http://www.acm.org/publications/instructions_for_proceedings_volumes); document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/ScienceCloud2010/ before the deadline of February 22nd, 2010 at 11:59PM PST; the final 10 page papers in PDF format will be due on March 1st, 2010 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library. Notifications of the paper decisions will be sent out by April 1st, 2010. Selected excellent work will be invited to submit extended versions of the workshop paper to a special issue journal. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please visit http://dsl.cs.uchicago.edu/ScienceCloud2010/. Important Dates --------------------------------------------------------------------------------------- * Abstract Due: February 22nd, 2010 * Papers Due: March 1st, 2010 * Notification of Acceptance: April 1st, 2010 * Workshop Date: June 21st, 2010 Committee Members --------------------------------------------------------------------------------------- Workshop Chairs * Pete Beckman, University of Chicago & Argonne National Laboratory * Ian Foster, University of Chicago & Argonne National Laboratory * Ioan Raicu, Northwestern University Steering Committee * Jeff Broughton, Lawrence Berkeley National Lab., USA * Alok Choudhary, Northwestern University, USA * Dennis Gannon, Microsoft Research, USA * Robert Grossman, University of Illinois at Chicago, USA * Kate Keahey, Nimbus, University of Chicago, Argonne National Laboratory, USA * Ed Lazowska, University of Washington, USA * Ignacio Llorente, Open Nebula, Universidad Complutense de Madrid, Spain * David E. Martin, Argonne National Laboratory, Northwestern University, USA * Gabriel Mateescu, Linkoping University, Sweden * David O'Hallaron, Carnegie Mellon University, Intel Labs, USA * Rich Wolski, Eucalyptus, University of California, Santa Barbara, USA * Kathy Yelick, University of California at Berkeley, Lawrence Berkeley National Lab., USA Technical Committee * David Abramson, Monash University, Australia * Roger Barga, Microsoft Research, USA * Roy Campbell, University of Illinois at Urbana Champaign, USA * Henri Casanova, University of Hawaii at Manoa, USA * Brian Cooper, Yahoo! Research, USA * Peter Dinda, Northwestern University, USA * Jack Dongara, University of Tennessee, USA * Geoffrey Fox, Indiana University, USA * Adriana Iamnitchi, University of South Florida, USA * Alexandru Iosup, Delft University of Technology, Netherlands * James Hamilton, Amazon Web Services, USA * Tevfik Kosar, Louisiana State University, USA * Shiyong Lu, Wayne State University, USA * Ruben S. 
Montero, Universidad Complutense de Madrid, Spain * Reagan Moore, University of North Carolina, Chapel Hill, USA * Jose Moreira, IBM Research, USA * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory * Matei Ripeanu, University of British Columbia, Canada * Larry Rudolph, VMware, USA * Marc Snir, University of Illinois at Urbana Champaign, USA * Xian-He Sun, Illinois Institute of Technology, USA * Hakim Weatherspoon, Cornell University, USA * Mike Wilde, University of Chicago & Argonne National Laboratory, USA * Alec Wolman, Microsoft Research, USA * Yong Zhao, Microsoft, USA -- ================================================================= Ioan Raicu, Ph.D. NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= From sardjito.antonius at gmail.com Mon Jan 18 16:58:07 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Mon, 18 Jan 2010 16:58:07 -0600 Subject: [Swift-user] could not initialized shared directory on pbs Message-ID: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> Hi, I tried to execute a Swift script on the PADS cluster using this command: swift -tc.file tc -sites.file pbs.xml modis.swift but encountered the error below: [antonius at login2 work]$ swift -tc.file tc -sites.file pbs.xml modis.swift Swift svn swift-r3202 cog-r2682 RunID: 20100118-1636-roba879f Progress: Execution failed: Could not initialize shared directory on pbs Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Failed to create directory: /home/wilde/swiftwork/modis-20100118-1636-roba879f/shared Any suggestion on what might have caused the "Could not initialize shared directory on pbs" error? -Antonius -------------- next part -------------- An HTML attachment was scrubbed... 
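For reference, the directory Swift failed to create here comes from the <workdirectory> element of the site's pool entry in the sites file (pbs.xml in this run). A minimal sketch of what such an entry roughly looks like is given below; the handle, provider, and path are illustrative placeholders, not the actual contents of this pbs.xml:

    <!-- illustrative entry only; handle, provider, and path are placeholders -->
    <pool handle="pbs">
      <execution provider="pbs"/>
      <workdirectory>/home/youruser/swiftwork</workdirectory>
    </pool>

The "Failed to create directory" error is what appears when that path points somewhere the submitting user cannot write, which is what the reply below addresses.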
URL: From hategan at mcs.anl.gov Mon Jan 18 17:21:48 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 Jan 2010 17:21:48 -0600 Subject: [Swift-user] could not initialized shared directory on pbs In-Reply-To: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> References: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> Message-ID: <1263856908.9254.2.camel@localhost> On Mon, 2010-01-18 at 16:58 -0600, Antonius Sardjito wrote: > Execution failed: > Could not initialize shared directory on pbs > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: Failed to > create > directory: /home/wilde/swiftwork/modis-20100118-1636-roba879f/shared Right. You need to edit pbs.xml and change the work directory to something you have write permissions to. I added a note on the wiki assignment page (http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftTutorialForBigData#Doing_Swift_Assignment_1) Mihael From hategan at mcs.anl.gov Mon Jan 18 19:44:19 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 Jan 2010 19:44:19 -0600 Subject: [Swift-user] could not initialized shared directory on pbs In-Reply-To: <110a6b261001181553g4a4e1113re5ca30a0ec164897@mail.gmail.com> References: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> <1263856908.9254.2.camel@localhost> <110a6b261001181553g4a4e1113re5ca30a0ec164897@mail.gmail.com> Message-ID: <1263865459.10474.12.camel@localhost> It may be a good idea to CC the list so that if others run into the same problem, they can see what the solution is. On Mon, 2010-01-18 at 17:53 -0600, Antonius Sardjito wrote: > I tried editing the file but when I run it I encountered a long erros > at the top there was: > [...] > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job: Cannot run program "qsub": java.io.IOException: > error=2, No such file or directory [...] Qsub is the standard PBS/Torque (queuing system) submit command. You should generally have it in your path. If not, edit your ~/.soft file and make sure the following lines are there: +maui +torque Then save, run "resoft" and try running qsub. Pay attention to errors that may appear when running "resoft". > > Also the Modis folder is still inaccessible so.. so far I could on > test what I've done with the data in the sample folder only. That directory is a symbolic link. If you follow it, you'll see that it points to /gpfs/pads/projects/see/data/raw/mcd12q1/2002/lct1. Looking at /etc/fstab, it appears that /gpfs/pads is a mount point, and it doesn't seem to be mounted. I would send an email to support at ci.uchicago.edu asking for things to be restored to their proper state (e.g. "please mount /gpfs/pads on the pads login nodes"). From hategan at mcs.anl.gov Mon Jan 18 19:54:17 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 Jan 2010 19:54:17 -0600 Subject: [Swift-user] could not initialized shared directory on pbs In-Reply-To: <1263865459.10474.12.camel@localhost> References: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> <1263856908.9254.2.camel@localhost> <110a6b261001181553g4a4e1113re5ca30a0ec164897@mail.gmail.com> <1263865459.10474.12.camel@localhost> Message-ID: <1263866057.11003.1.camel@localhost> On Mon, 2010-01-18 at 19:44 -0600, Mihael Hategan wrote: > > > > Also the Modis folder is still inaccessible so.. so far I could on > > test what I've done with the data in the sample folder only. > > That directory is a symbolic link. 
If you follow it, you'll see that it > points to /gpfs/pads/projects/see/data/raw/mcd12q1/2002/lct1. Looking > at /etc/fstab, it appears that /gpfs/pads is a mount point, and it > doesn't seem to be mounted. > > I would send an email to support at ci.uchicago.edu asking for things to be > restored to their proper state (e.g. "please mount /gpfs/pads on the > pads login nodes"). > Scrap that. Use the support address that the assignment page mentions (pads-support at ci.uchicago.edu). Mihael From fedorov at bwh.harvard.edu Tue Jan 19 09:38:45 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Tue, 19 Jan 2010 10:38:45 -0500 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <1256055802.24685.13.camel@localhost> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> Message-ID: <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> Hi Mihael, I've been playing with this following your suggestions, but I can't get it to work. Here's my site description: /u/ac/fedorov/scratch-global/scratch 2.55 10000 10 false 0.1 2 10 My maxWalltime for the job is 2, and I have 100 of them. When I run the script, I see one job in the queue, with 10 nodes and 22 minutes walltime. However, when the script is executing, it appears the jobs are being scheduled one at a time. I have the current checkout of the cog/swift trunk: Swift svn swift-r3202 cog-r2682. I attach the coaster.log file for your reference. Can you help me understand what I am doing wrong? Also, I was trying to look in the code that does allocation, and it seems that the code responsible for determining the block size for allocation is in modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockQueueProcessor.java. Is this correct? And what is the piece of code that decides how to schedule jobs within the allocated block? I would appreciate any help. Thank you. -- Andriy Fedorov, Ph.D. Research Fellow Brigham and Women's Hospital Harvard Medical School 75 Francis Street Boston, MA 02115 USA fedorov at bwh.harvard.edu On Tue, Oct 20, 2009 at 11:23, Mihael Hategan wrote: > On Tue, 2009-10-20 at 12:04 -0400, Andriy Fedorov wrote: >> On Tue, Oct 20, 2009 at 11:55, Mihael Hategan wrote: >> > You need a more recent version of the code. >> > >> >> Mihael, I actually updated svn for both cog and swift yesterday prior >> to running the tests. Here's what swift reports I have right now: >> >> Swift svn swift-r3170 cog-r2529 > > Given that even when you have granularity=10 you still see 2 jobs, I > suspect you are using swift site throttling parameters that force that. > I would set the jobThrottle higher and possibly the initial score > higher. > > For troubleshooting, what you could do is, on the remote side, say cat > ~/.globus/coasters/coasters.log|grep "BlockQueueProcessor">bqp.log and > post that. Also, you could set the remoteMonitorEnabled profile to > "true" to get visual feedback of what's happening. > > The allocation time is 18 minutes because the new stuff doesn't > overallocate using a fixed multiplier (though you can force it to do > so). For small jobs (walltime = 1s) the multiplier is set by > lowOverallocation (10.0 by default) while for large jobs (walltime -> > +inf) the multiplier is 1, with an exponential decay in-between. 
> > If you want to always have blocks being 10 times the job walltime, you > can set highOverallocation to 10. > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: bqp.log Type: text/x-log Size: 369066 bytes Desc: not available URL: From hategan at mcs.anl.gov Tue Jan 19 11:43:26 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 11:43:26 -0600 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> Message-ID: <1263923006.18474.8.camel@localhost> On Tue, 2010-01-19 at 10:38 -0500, Andriy Fedorov wrote: > Hi Mihael, > > I've been playing with this following your suggestions, but I can't > get it to work. > > Here's my site description: > > > > url="grid-abe.ncsa.teragrid.org"/> > /u/ac/fedorov/scratch-global/scratch > 2.55 > 10000 > 10 > false > 0.1 > 2 > 10 > > > My maxWalltime for the job is 2, and I have 100 of them. When I run > the script, I see one job in the queue, with 10 nodes and 22 minutes > walltime. However, when the script is executing, it appears the jobs > are being scheduled one at a time. I have the current checkout of the > cog/swift trunk: Swift svn swift-r3202 cog-r2682. I attach the > coaster.log file for your reference. It doesn't look like Swift is sending more than one job at a time. It may be helpful to understand what the swift part is doing (i.e. swift log, the swift script, etc.). > > Can you help me understand what I am doing wrong? > > Also, I was trying to look in the code that does allocation, and it > seems that the code responsible for determining the block size for > allocation is in > modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockQueueProcessor.java. > Is this correct? And what is the piece of code that decides how to > schedule jobs within the allocated block? Each worker will slurp jobs fitting (walltime < worker_remaining_walltime) from the coaster queue if that's not empty. So there isn't much in the way of scheduling at that point. From fedorov at bwh.harvard.edu Tue Jan 19 12:01:15 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Tue, 19 Jan 2010 13:01:15 -0500 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <1263923006.18474.8.camel@localhost> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> <1263923006.18474.8.camel@localhost> Message-ID: <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> Mihael, The script is very simple: iterate cnt { doStuff } until (cnt<100); I thought this is a parallel construct. Was I wrong? -- Andriy Fedorov, Ph.D. Research Fellow Brigham and Women's Hospital Harvard Medical School 75 Francis Street Boston, MA 02115 USA fedorov at bwh.harvard.edu On Tue, Jan 19, 2010 at 12:43, Mihael Hategan wrote: > On Tue, 2010-01-19 at 10:38 -0500, Andriy Fedorov wrote: >> Hi Mihael, >> >> I've been playing with this following your suggestions, but I can't >> get it to work. 
>> >> Here's my site description: >> >> >> ? >> ? > url="grid-abe.ncsa.teragrid.org"/> >> ? /u/ac/fedorov/scratch-global/scratch >> ? 2.55 >> ? 10000 >> ? 10 >> ? false >> ? 0.1 >> ? 2 >> ? 10 >> >> >> My maxWalltime for the job is 2, and I have 100 of them. When I run >> the script, I see one job in the queue, with 10 nodes and 22 minutes >> walltime. However, when the script is executing, it appears the jobs >> are being scheduled one at a time. I have the current checkout of the >> cog/swift trunk: Swift svn swift-r3202 cog-r2682. I attach the >> coaster.log file for your reference. > > It doesn't look like Swift is sending more than one job at a time. It > may be helpful to understand what the swift part is doing (i.e. swift > log, the swift script, etc.). > >> >> Can you help me understand what I am doing wrong? >> >> Also, I was trying to look in the code that does allocation, and it >> seems that the code responsible for determining the block size for >> allocation is in >> modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockQueueProcessor.java. >> Is this correct? And what is the piece of code that decides how to >> schedule jobs within the allocated block? > > Each worker will slurp jobs fitting (walltime < > worker_remaining_walltime) from the coaster queue if that's not empty. > So there isn't much in the way of scheduling at that point. > > > From wilde at mcs.anl.gov Tue Jan 19 12:22:34 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 12:22:34 -0600 Subject: [Swift-user] Problem staging out from PBS? Message-ID: <4B55F86A.9080102@mcs.anl.gov> I'm getting the messages below in email from PBS on pads.ci.uchicago.edu. Are the messages: "Unable to copy file 908.svc.pads.ci.uchicago.edu.OU to wilde at login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5898136727049685871.submit.stdout" due to Swift errors (ie, somehow PBS cant write to my scripts/ directory) or due to some problem in PBS? The prior PBS error of being unable to write to /gpfs/scratch seems to have gone away. - Mike -------- Original Message -------- Subject: PBS JOB 908.svc.pads.ci.uchicago.edu Date: Tue, 19 Jan 2010 12:16:37 -0600 (CST) From: adm at ci.uchicago.edu (root) To: wilde at ci.uchicago.edu PBS Job Id: 908.svc.pads.ci.uchicago.edu Job Name: null Exec host: c19.pads.ci.uchicago.edu/0 An error has occurred processing your job, see below. 
request to copy stageout files failed on node 'c19.pads.ci.uchicago.edu/0' for job 908.svc.pads.ci.uchicago.edu Unable to copy file 908.svc.pads.ci.uchicago.edu.OU to wilde at login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5898136727049685871.submit.stdout >>> error from copy LD_LIBRARY_PATH= >>> end error output Unable to copy file 908.svc.pads.ci.uchicago.edu.ER to wilde at login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5898136727049685871.submit.stderr >>> error from copy LD_LIBRARY_PATH= >>> end error output From fedorov at bwh.harvard.edu Tue Jan 19 12:46:58 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Tue, 19 Jan 2010 13:46:58 -0500 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> <1263923006.18474.8.camel@localhost> <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> Message-ID: <82f536811001191046y6f5e2509l5963be8229380b8c@mail.gmail.com> On Tue, Jan 19, 2010 at 13:01, Andriy Fedorov wrote: > Mihael, > > The script is very simple: > > iterate cnt { > ?doStuff > } until (cnt<100); > > I thought this is a parallel construct. Was I wrong? > Yes, apparently I was wrong. I need this instead: "foreach i in [1:100] { }". Apologize for not trying this before asking for help... > -- > Andriy Fedorov, Ph.D. > > Research Fellow > Brigham and Women's Hospital > Harvard Medical School > 75 Francis Street > Boston, MA 02115 USA > fedorov at bwh.harvard.edu > > > > On Tue, Jan 19, 2010 at 12:43, Mihael Hategan wrote: >> On Tue, 2010-01-19 at 10:38 -0500, Andriy Fedorov wrote: >>> Hi Mihael, >>> >>> I've been playing with this following your suggestions, but I can't >>> get it to work. >>> >>> Here's my site description: >>> >>> >>> ? >>> ? >> url="grid-abe.ncsa.teragrid.org"/> >>> ? /u/ac/fedorov/scratch-global/scratch >>> ? 2.55 >>> ? 10000 >>> ? 10 >>> ? false >>> ? 0.1 >>> ? 2 >>> ? 10 >>> >>> >>> My maxWalltime for the job is 2, and I have 100 of them. When I run >>> the script, I see one job in the queue, with 10 nodes and 22 minutes >>> walltime. However, when the script is executing, it appears the jobs >>> are being scheduled one at a time. I have the current checkout of the >>> cog/swift trunk: Swift svn swift-r3202 cog-r2682. I attach the >>> coaster.log file for your reference. >> >> It doesn't look like Swift is sending more than one job at a time. It >> may be helpful to understand what the swift part is doing (i.e. swift >> log, the swift script, etc.). >> >>> >>> Can you help me understand what I am doing wrong? >>> >>> Also, I was trying to look in the code that does allocation, and it >>> seems that the code responsible for determining the block size for >>> allocation is in >>> modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockQueueProcessor.java. >>> Is this correct? And what is the piece of code that decides how to >>> schedule jobs within the allocated block? >> >> Each worker will slurp jobs fitting (walltime < >> worker_remaining_walltime) from the coaster queue if that's not empty. >> So there isn't much in the way of scheduling at that point. 
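To make the fix Andriy describes concrete, here is a minimal sketch of the foreach form in SwiftScript; the app name, its arguments, and the tc entry it would need are placeholders for illustration, not taken from his actual script:

    type file;

    // "dostuff" is a placeholder executable, assumed to be listed in tc
    app (file o) doStuff (int i) {
      dostuff i stdout=@filename(o);
    }

    foreach i in [1:100] {
      file o;          // left unmapped; Swift chooses a temporary name
      o = doStuff(i);
    }

Unlike iterate, whose iterations are chained one after another, the iterations of a foreach have no dependencies on each other, so all 100 doStuff calls become runnable at once and can fill the coaster block in parallel.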
>> >> >> > From hategan at mcs.anl.gov Tue Jan 19 13:01:57 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:01:57 -0600 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> <1263923006.18474.8.camel@localhost> <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> Message-ID: <1263927717.21382.19.camel@localhost> On Tue, 2010-01-19 at 13:01 -0500, Andriy Fedorov wrote: > Mihael, > > The script is very simple: > > iterate cnt { > doStuff > } until (cnt<100); > > I thought this is a parallel construct. Was I wrong? Short answer: yes. Long answer: I (and judging by the existence if iterate also "we") don't understand clearly why the normal parallel foreach couldn't be used to implement convergence conditions.The plain solution: a[0] = initialValue; foreach v, k in a { if (a[k] < epsilon) { a[k + 1] = f(a[k]); } else { //nothing } } should work in my view, since it expresses a convergence problem correctly, without an explicitly sequential operation. But I suppose that posed some implementation difficulties that were deemed not worth solving, hence "iterate". I suspect the problem was that of detecting that certain branches of an if can lead to no more data being put in the array, which seems difficult to analyze at compile-time or figure out at run-time. Mihael From hategan at mcs.anl.gov Tue Jan 19 13:04:15 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:04:15 -0600 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <82f536811001191046y6f5e2509l5963be8229380b8c@mail.gmail.com> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> <1263923006.18474.8.camel@localhost> <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> <82f536811001191046y6f5e2509l5963be8229380b8c@mail.gmail.com> Message-ID: <1263927855.21382.22.camel@localhost> On Tue, 2010-01-19 at 13:46 -0500, Andriy Fedorov wrote: > On Tue, Jan 19, 2010 at 13:01, Andriy Fedorov wrote: > > Mihael, > > > > The script is very simple: > > > > iterate cnt { > > doStuff > > } until (cnt<100); > > > > I thought this is a parallel construct. Was I wrong? > > > > Yes, apparently I was wrong. I need this instead: "foreach i in > [1:100] { }". Apologize for not trying this before asking for help... > That's ok. Why "iterate" is there and what it's doing always seems like a good question, and maybe if people keep asking, I/we'll have a good answer. From wilde at mcs.anl.gov Tue Jan 19 13:26:36 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 13:26:36 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism Message-ID: <4B56076C.6090507@mcs.anl.gov> Im running a script on PADS that emits 20 jobs in parallel with a foreach(). I set coasters to use 8 workers per node, and my throttle to allow 64 jobs to run in parallel, so I would expect *at least* 8 jobs to be running in parallel. 
But what I see is: - 3 PBS worker jobs start - 2 of these have a single core (c19/0 and c19/1) - 1 of these has 18 *nodes* - all 20 jobs show up as submitted or active, but never more than *3* active (note that 1 job is a setup job ad completes right away). Below is info on this run. Any idea why coaster provider is behaving this way? - Mike pool entry is: 00:05:00 1800 8 .63 10000 $rundir Running on login2, I see: /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs Running from host with compute-node reachable address of 172.5.86.6 Running in /home/wilde/protests/run.loops.1498 protlib2 home is /home/wilde/protlib2 Swift svn swift-r3202 cog-r2682 RunID: 20100119-1309-l72sbpg8 Progress: Progress: Checking status:1 Progress: Selecting site:18 Initializing site shared directory:1 Stage in:1 Finished successfully:1 Progress: Submitting:19 Submitted:1 Finished successfully:1 Progress: Submitted:19 Active:1 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:2 Checking status:1 Finished successfully:1 Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2 Progress: Submitted:15 Active:3 Finished successfully:3 ...and this keeps up - the script is progressing but only 3 jobs are running at a time. (Each takes about 5 minutes) PBS shows: login2$ qstat -n svc.pads.ci.uchicago.edu: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 912.svc.pads.ci. wilde extended null 14877 1 -- -- 00:29 R -- c19 913.svc.pads.ci. wilde extended null -- 18 -- -- 00:29 R -- c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40 914.svc.pads.ci. 
wilde extended null 15135 1 -- -- 00:29 R -- c19 login2$ qstat -f Job Id: 912.svc.pads.ci.uchicago.edu Job_Name = null Job_Owner = wilde at login2.pads.ci.uchicago.edu resources_used.cput = 00:00:58 resources_used.mem = 165768kb resources_used.vmem = 757612kb resources_used.walltime = 00:01:14 job_state = R queue = extended server = svc.pads.ci.uchicago.edu Checkpoint = u ctime = Tue Jan 19 13:09:16 2010 Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58 66754363410172037.submit.stderr exec_host = c19.pads.ci.uchicago.edu/0 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = n mtime = Tue Jan 19 13:09:18 2010 Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 866754363410172037.submit.stdout Priority = 0 qtime = Tue Jan 19 13:09:16 2010 Rerunable = True Resource_List.nodect = 1 Resource_List.nodes = 1 Resource_List.walltime = 00:29:00 session_id = 14877 Shell_Path_List = /bin/sh Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, PBS_SERVER=login2.pads.ci.uchicago.edu, PBS_O_HOST=login2.pads.ci.uchicago.edu, PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, PBS_O_QUEUE=extended etime = Tue Jan 19 13:09:16 2010 submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit start_time = Tue Jan 19 13:09:17 2010 start_count = 1 Job Id: 913.svc.pads.ci.uchicago.edu Job_Name = null Job_Owner = wilde at login2.pads.ci.uchicago.edu resources_used.cput = 00:00:36 resources_used.mem = 166452kb resources_used.vmem = 765732kb resources_used.walltime = 00:00:51 job_state = R queue = extended server = svc.pads.ci.uchicago.edu Checkpoint = u ctime = Tue Jan 19 13:09:16 2010 Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89 90749016166185054.submit.stderr exec_host = c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2 2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed u/0 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = n mtime = Tue Jan 19 13:09:55 2010 Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8 990749016166185054.submit.stdout Priority = 0 qtime = Tue Jan 19 13:09:16 2010 Rerunable = True Resource_List.nodect = 18 Resource_List.nodes = 18 Resource_List.walltime = 00:29:00 session_id = 13956 Shell_Path_List = 
/bin/sh Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, PBS_SERVER=login2.pads.ci.uchicago.edu, PBS_O_HOST=login2.pads.ci.uchicago.edu, PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, PBS_O_QUEUE=extended etime = Tue Jan 19 13:09:16 2010 submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit start_time = Tue Jan 19 13:09:18 2010 start_count = 1 Job Id: 914.svc.pads.ci.uchicago.edu Job_Name = null Job_Owner = wilde at login2.pads.ci.uchicago.edu resources_used.cput = 00:00:58 resources_used.mem = 165760kb resources_used.vmem = 757612kb resources_used.walltime = 00:01:11 job_state = R queue = extended server = svc.pads.ci.uchicago.edu Checkpoint = u ctime = Tue Jan 19 13:09:18 2010 Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54 46269528052212820.submit.stderr exec_host = c19.pads.ci.uchicago.edu/1 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = n mtime = Tue Jan 19 13:09:20 2010 Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 446269528052212820.submit.stdout Priority = 0 qtime = Tue Jan 19 13:09:18 2010 Rerunable = True Resource_List.nodect = 1 Resource_List.nodes = 1 Resource_List.walltime = 00:29:00 session_id = 15135 Shell_Path_List = /bin/sh Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 
1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, PBS_SERVER=login2.pads.ci.uchicago.edu, PBS_O_HOST=login2.pads.ci.uchicago.edu, PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, PBS_O_QUEUE=extended etime = Tue Jan 19 13:09:18 2010 submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit start_time = Tue Jan 19 13:09:20 2010 start_count = 1 login2$ ------------------------------------------------------------------------------------------------------- From hategan at mcs.anl.gov Tue Jan 19 13:32:30 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:32:30 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <4B56076C.6090507@mcs.anl.gov> References: <4B56076C.6090507@mcs.anl.gov> Message-ID: <1263929550.22225.5.camel@localhost> Maybe PBS is lying about that 18 node job. The coaster or worker logs on pads/~/.globus/coasters could shed some light on this. On Tue, 2010-01-19 at 13:26 -0600, Michael Wilde wrote: > Im running a script on PADS that emits 20 jobs in parallel with a foreach(). > > I set coasters to use 8 workers per node, and my throttle to allow 64 > jobs to run in parallel, so I would expect *at least* 8 jobs to be > running in parallel. But what I see is: > > - 3 PBS worker jobs start > - 2 of these have a single core (c19/0 and c19/1) > - 1 of these has 18 *nodes* > - all 20 jobs show up as submitted or active, but never more than *3* > active (note that 1 job is a setup job ad completes right away). > > Below is info on this run. > > Any idea why coaster provider is behaving this way? 
> > - Mike > > pool entry is: > > > 00:05:00 > 1800 > > 8 > .63 > 10000 > > $rundir > > > Running on login2, I see: > > /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs > Running from host with compute-node reachable address of 172.5.86.6 > Running in /home/wilde/protests/run.loops.1498 > protlib2 home is /home/wilde/protlib2 > Swift svn swift-r3202 cog-r2682 > > RunID: 20100119-1309-l72sbpg8 > Progress: > Progress: Checking status:1 > Progress: Selecting site:18 Initializing site shared directory:1 > Stage in:1 Finished successfully:1 > Progress: Submitting:19 Submitted:1 Finished successfully:1 > Progress: Submitted:19 Active:1 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:2 Checking status:1 Finished > successfully:1 > Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2 > Progress: Submitted:15 Active:3 Finished successfully:3 > > ...and this keeps up - the script is progressing but only 3 jobs are > running at a time. (Each takes about 5 minutes) > > PBS shows: > > login2$ qstat -n > > svc.pads.ci.uchicago.edu: > > Req'd Req'd Elap > Job ID Username Queue Jobname SessID NDS TSK > Memory Time S Time > -------------------- -------- -------- ---------------- ------ ----- --- > ------ ----- - ----- > 912.svc.pads.ci. wilde extended null 14877 1 -- > -- 00:29 R -- > c19 > 913.svc.pads.ci. wilde extended null -- 18 -- > -- 00:29 R -- > c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40 > 914.svc.pads.ci. 
wilde extended null 15135 1 -- > -- 00:29 R -- > c19 > login2$ qstat -f > Job Id: 912.svc.pads.ci.uchicago.edu > Job_Name = null > Job_Owner = wilde at login2.pads.ci.uchicago.edu > resources_used.cput = 00:00:58 > resources_used.mem = 165768kb > resources_used.vmem = 757612kb > resources_used.walltime = 00:01:14 > job_state = R > queue = extended > server = svc.pads.ci.uchicago.edu > Checkpoint = u > ctime = Tue Jan 19 13:09:16 2010 > Error_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58 > 66754363410172037.submit.stderr > exec_host = c19.pads.ci.uchicago.edu/0 > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Tue Jan 19 13:09:18 2010 > Output_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 > 866754363410172037.submit.stdout > Priority = 0 > qtime = Tue Jan 19 13:09:16 2010 > Rerunable = True > Resource_List.nodect = 1 > Resource_List.nodes = 1 > Resource_List.walltime = 00:29:00 > session_id = 14877 > Shell_Path_List = /bin/sh > Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, > > PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s > > oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. > > 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin > > :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar > > e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. > > 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. > > 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma > > ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: > > /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- > > r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ > > swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- > svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, > PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, > PBS_SERVER=login2.pads.ci.uchicago.edu, > PBS_O_HOST=login2.pads.ci.uchicago.edu, > PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, > PBS_O_QUEUE=extended > etime = Tue Jan 19 13:09:16 2010 > submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit > start_time = Tue Jan 19 13:09:17 2010 > start_count = 1 > > Job Id: 913.svc.pads.ci.uchicago.edu > Job_Name = null > Job_Owner = wilde at login2.pads.ci.uchicago.edu > resources_used.cput = 00:00:36 > resources_used.mem = 166452kb > resources_used.vmem = 765732kb > resources_used.walltime = 00:00:51 > job_state = R > queue = extended > server = svc.pads.ci.uchicago.edu > Checkpoint = u > ctime = Tue Jan 19 13:09:16 2010 > Error_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89 > 90749016166185054.submit.stderr > exec_host = > c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads > > .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu > > /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u > > chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2 > > 2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica > > go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad > > s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed > u/0 > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Tue Jan 19 13:09:55 2010 > Output_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8 > 
990749016166185054.submit.stdout > Priority = 0 > qtime = Tue Jan 19 13:09:16 2010 > Rerunable = True > Resource_List.nodect = 18 > Resource_List.nodes = 18 > Resource_List.walltime = 00:29:00 > session_id = 13956 > Shell_Path_List = /bin/sh > Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, > > PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s > > oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. > > 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin > > :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar > > e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. > > 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. > > 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma > > ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: > > /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- > > r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ > > swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- > svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, > PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, > PBS_SERVER=login2.pads.ci.uchicago.edu, > PBS_O_HOST=login2.pads.ci.uchicago.edu, > PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, > PBS_O_QUEUE=extended > etime = Tue Jan 19 13:09:16 2010 > submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit > start_time = Tue Jan 19 13:09:18 2010 > start_count = 1 > > Job Id: 914.svc.pads.ci.uchicago.edu > Job_Name = null > Job_Owner = wilde at login2.pads.ci.uchicago.edu > resources_used.cput = 00:00:58 > resources_used.mem = 165760kb > resources_used.vmem = 757612kb > resources_used.walltime = 00:01:11 > job_state = R > queue = extended > server = svc.pads.ci.uchicago.edu > Checkpoint = u > ctime = Tue Jan 19 13:09:18 2010 > Error_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54 > 46269528052212820.submit.stderr > exec_host = c19.pads.ci.uchicago.edu/1 > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Tue Jan 19 13:09:20 2010 > Output_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 > 446269528052212820.submit.stdout > Priority = 0 > qtime = Tue Jan 19 13:09:18 2010 > Rerunable = True > Resource_List.nodect = 1 > Resource_List.nodes = 1 > Resource_List.walltime = 00:29:00 > session_id = 15135 > Shell_Path_List = /bin/sh > Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, > > PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s > > oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. > > 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin > > :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar > > e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. > > 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 
> > 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma > > ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: > > /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- > > r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ > > swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- > svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, > PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, > PBS_SERVER=login2.pads.ci.uchicago.edu, > PBS_O_HOST=login2.pads.ci.uchicago.edu, > PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, > PBS_O_QUEUE=extended > etime = Tue Jan 19 13:09:18 2010 > submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit > start_time = Tue Jan 19 13:09:20 2010 > start_count = 1 > > login2$ > ------------------------------------------------------------------------------------------------------- > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Tue Jan 19 13:38:44 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 13:38:44 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <1263929550.22225.5.camel@localhost> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> Message-ID: <4B560A44.8050002@mcs.anl.gov> On 1/19/10 1:32 PM, Mihael Hategan wrote: > Maybe PBS is lying about that 18 node job. I would be surprised if thats the case. But even if it had *1* node you would think it would run at least 8 jobs in parallel. Im confused why it has started three jobs, two with only one core and one with 18 nodes. But the 18 node job just hit its wall time limit; now coasters seems to have started a 10 node job: login2$ qstat -n svc.pads.ci.uchicago.edu: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 912.svc.pads.ci. wilde extended null 14877 1 -- -- 00:29 R 00:25 c19 915.svc.pads.ci. wilde extended null 9028 1 -- -- 00:29 R -- c38 916.svc.pads.ci. wilde extended null -- 10 -- -- 00:29 R -- c45+c44+c06+c07+c08+c10+c12+c14+c17+c22 login2$ The coaster or worker logs on > pads/~/.globus/coasters could shed some light on this. I'll look and make these readable by you. - Mike > On Tue, 2010-01-19 at 13:26 -0600, Michael Wilde wrote: >> Im running a script on PADS that emits 20 jobs in parallel with a foreach(). >> >> I set coasters to use 8 workers per node, and my throttle to allow 64 >> jobs to run in parallel, so I would expect *at least* 8 jobs to be >> running in parallel. But what I see is: >> >> - 3 PBS worker jobs start >> - 2 of these have a single core (c19/0 and c19/1) >> - 1 of these has 18 *nodes* >> - all 20 jobs show up as submitted or active, but never more than *3* >> active (note that 1 job is a setup job ad completes right away). >> >> Below is info on this run. >> >> Any idea why coaster provider is behaving this way? 
>> >> - Mike >> >> pool entry is: >> >> >> 00:05:00 >> 1800 >> >> 8 >> .63 >> 10000 >> >> $rundir >> >> >> Running on login2, I see: >> >> /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs >> Running from host with compute-node reachable address of 172.5.86.6 >> Running in /home/wilde/protests/run.loops.1498 >> protlib2 home is /home/wilde/protlib2 >> Swift svn swift-r3202 cog-r2682 >> >> RunID: 20100119-1309-l72sbpg8 >> Progress: >> Progress: Checking status:1 >> Progress: Selecting site:18 Initializing site shared directory:1 >> Stage in:1 Finished successfully:1 >> Progress: Submitting:19 Submitted:1 Finished successfully:1 >> Progress: Submitted:19 Active:1 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:2 Checking status:1 Finished >> successfully:1 >> Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2 >> Progress: Submitted:15 Active:3 Finished successfully:3 >> >> ...and this keeps up - the script is progressing but only 3 jobs are >> running at a time. (Each takes about 5 minutes) >> >> PBS shows: >> >> login2$ qstat -n >> >> svc.pads.ci.uchicago.edu: >> >> Req'd Req'd Elap >> Job ID Username Queue Jobname SessID NDS TSK >> Memory Time S Time >> -------------------- -------- -------- ---------------- ------ ----- --- >> ------ ----- - ----- >> 912.svc.pads.ci. wilde extended null 14877 1 -- >> -- 00:29 R -- >> c19 >> 913.svc.pads.ci. wilde extended null -- 18 -- >> -- 00:29 R -- >> c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40 >> 914.svc.pads.ci. 
wilde extended null 15135 1 -- >> -- 00:29 R -- >> c19 >> login2$ qstat -f >> Job Id: 912.svc.pads.ci.uchicago.edu >> Job_Name = null >> Job_Owner = wilde at login2.pads.ci.uchicago.edu >> resources_used.cput = 00:00:58 >> resources_used.mem = 165768kb >> resources_used.vmem = 757612kb >> resources_used.walltime = 00:01:14 >> job_state = R >> queue = extended >> server = svc.pads.ci.uchicago.edu >> Checkpoint = u >> ctime = Tue Jan 19 13:09:16 2010 >> Error_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58 >> 66754363410172037.submit.stderr >> exec_host = c19.pads.ci.uchicago.edu/0 >> Hold_Types = n >> Join_Path = n >> Keep_Files = n >> Mail_Points = n >> mtime = Tue Jan 19 13:09:18 2010 >> Output_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 >> 866754363410172037.submit.stdout >> Priority = 0 >> qtime = Tue Jan 19 13:09:16 2010 >> Rerunable = True >> Resource_List.nodect = 1 >> Resource_List.nodes = 1 >> Resource_List.walltime = 00:29:00 >> session_id = 14877 >> Shell_Path_List = /bin/sh >> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, >> >> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s >> >> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. >> >> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin >> >> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar >> >> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. >> >> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. >> >> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma >> >> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: >> >> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- >> >> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ >> >> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- >> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, >> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, >> PBS_SERVER=login2.pads.ci.uchicago.edu, >> PBS_O_HOST=login2.pads.ci.uchicago.edu, >> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, >> PBS_O_QUEUE=extended >> etime = Tue Jan 19 13:09:16 2010 >> submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit >> start_time = Tue Jan 19 13:09:17 2010 >> start_count = 1 >> >> Job Id: 913.svc.pads.ci.uchicago.edu >> Job_Name = null >> Job_Owner = wilde at login2.pads.ci.uchicago.edu >> resources_used.cput = 00:00:36 >> resources_used.mem = 166452kb >> resources_used.vmem = 765732kb >> resources_used.walltime = 00:00:51 >> job_state = R >> queue = extended >> server = svc.pads.ci.uchicago.edu >> Checkpoint = u >> ctime = Tue Jan 19 13:09:16 2010 >> Error_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89 >> 90749016166185054.submit.stderr >> exec_host = >> c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads >> >> .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu >> >> /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u >> >> chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2 >> >> 2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica >> >> go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad >> >> s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed >> u/0 >> Hold_Types = n >> Join_Path = n >> Keep_Files = n >> Mail_Points = n >> mtime = Tue 
Jan 19 13:09:55 2010 >> Output_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8 >> 990749016166185054.submit.stdout >> Priority = 0 >> qtime = Tue Jan 19 13:09:16 2010 >> Rerunable = True >> Resource_List.nodect = 18 >> Resource_List.nodes = 18 >> Resource_List.walltime = 00:29:00 >> session_id = 13956 >> Shell_Path_List = /bin/sh >> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, >> >> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s >> >> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. >> >> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin >> >> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar >> >> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. >> >> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. >> >> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma >> >> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: >> >> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- >> >> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ >> >> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- >> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, >> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, >> PBS_SERVER=login2.pads.ci.uchicago.edu, >> PBS_O_HOST=login2.pads.ci.uchicago.edu, >> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, >> PBS_O_QUEUE=extended >> etime = Tue Jan 19 13:09:16 2010 >> submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit >> start_time = Tue Jan 19 13:09:18 2010 >> start_count = 1 >> >> Job Id: 914.svc.pads.ci.uchicago.edu >> Job_Name = null >> Job_Owner = wilde at login2.pads.ci.uchicago.edu >> resources_used.cput = 00:00:58 >> resources_used.mem = 165760kb >> resources_used.vmem = 757612kb >> resources_used.walltime = 00:01:11 >> job_state = R >> queue = extended >> server = svc.pads.ci.uchicago.edu >> Checkpoint = u >> ctime = Tue Jan 19 13:09:18 2010 >> Error_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54 >> 46269528052212820.submit.stderr >> exec_host = c19.pads.ci.uchicago.edu/1 >> Hold_Types = n >> Join_Path = n >> Keep_Files = n >> Mail_Points = n >> mtime = Tue Jan 19 13:09:20 2010 >> Output_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 >> 446269528052212820.submit.stdout >> Priority = 0 >> qtime = Tue Jan 19 13:09:18 2010 >> Rerunable = True >> Resource_List.nodect = 1 >> Resource_List.nodes = 1 >> Resource_List.walltime = 00:29:00 >> session_id = 15135 >> Shell_Path_List = /bin/sh >> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, >> >> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s >> >> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. >> >> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin >> >> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar >> >> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. >> >> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 
>> >> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma >> >> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: >> >> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- >> >> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ >> >> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- >> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, >> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, >> PBS_SERVER=login2.pads.ci.uchicago.edu, >> PBS_O_HOST=login2.pads.ci.uchicago.edu, >> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, >> PBS_O_QUEUE=extended >> etime = Tue Jan 19 13:09:18 2010 >> submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit >> start_time = Tue Jan 19 13:09:20 2010 >> start_count = 1 >> >> login2$ >> ------------------------------------------------------------------------------------------------------- >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From hategan at mcs.anl.gov Tue Jan 19 13:44:06 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:44:06 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <4B560A44.8050002@mcs.anl.gov> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> Message-ID: <1263930246.22837.3.camel@localhost> On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote: > > On 1/19/10 1:32 PM, Mihael Hategan wrote: > > Maybe PBS is lying about that 18 node job. > > I would be surprised if thats the case. But even if it had *1* node you > would think it would run at least 8 jobs in parallel. I see. Though not with your current setup. You should use "workersPerNode" instead of "coastersPerNode". > > Im confused why it has started three jobs, two with only one core and > one with 18 nodes. It does that. It spreads out the block sizes to exploit non-linearities in queuing times. > > But the 18 node job just hit its wall time limit; now coasters seems to > have started a 10 node job: Don't know about that. Logs please. From wilde at mcs.anl.gov Tue Jan 19 13:49:02 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 13:49:02 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <1263930246.22837.3.camel@localhost> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> Message-ID: <4B560CAE.5020203@mcs.anl.gov> On 1/19/10 1:44 PM, Mihael Hategan wrote: > On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote: >> On 1/19/10 1:32 PM, Mihael Hategan wrote: >>> Maybe PBS is lying about that 18 node job. >> I would be surprised if thats the case. But even if it had *1* node you >> would think it would run at least 8 jobs in parallel. > > I see. Though not with your current setup. You should use > "workersPerNode" instead of "coastersPerNode". Thanks! I'll fix that and try again. This makes more sense now, if its assuming 1 worker per node. Still doesnt explain why its not starting more jobs, since it allocated abundant nodes (even assuming 1 worker per node). > >> Im confused why it has started three jobs, two with only one core and >> one with 18 nodes. > > It does that. 
It spreads out the block sizes to exploit non-linearities > in queuing times. > >> But the 18 node job just hit its wall time limit; now coasters seems to >> have started a 10 node job: > > Don't know about that. Logs please. > Here's the logs from that dir for this run. I dont understand why the coasters.log file in that directory has not been written to since Jan 13. login2$ ls -dt * | head worker-0119-090116-000002.log worker-0114-310129-000005.log worker-0119-090116-000004.log worker-0114-310129-000006.log worker-0119-090116-000003.log worker-0114-310129-000007.log worker-0119-090116-000001.log worker-0114-310129-000008.log worker-0119-090116-000000.log worker-0114-310129-000009.log cscript7310283766853084762.pl worker-0114-310129-000000.log worker-0119-491225-000001.log worker-0114-110123-000004.log worker-0119-491225-000000.log worker-0114-110123-000002.log worker-0119-151225-000001.log worker-0114-110123-000003.log worker-0119-151225-000000.log worker-0114-110123-000000.log login2$ ls -1dt * | head worker-0119-090116-000002.log worker-0119-090116-000004.log worker-0119-090116-000003.log worker-0119-090116-000001.log worker-0119-090116-000000.log cscript7310283766853084762.pl worker-0119-491225-000001.log worker-0119-491225-000000.log worker-0119-151225-000001.log worker-0119-151225-000000.log login2$ more *0119-090116* :::::::::::::: worker-0119-090116-000000.log :::::::::::::: 1263928159 0119-090116-000000 Logging started 1263928159 INFO - Running on node c19.pads.ci.uchicago.edu 1263928159 INFO 000000 Registration successful. ID=000000 :::::::::::::: worker-0119-090116-000001.log :::::::::::::: 1263928159 0119-090116-000001 Logging started 1263928159 INFO - Running on node c46.pads.ci.uchicago.edu 1263928160 INFO 000000 Registration successful. ID=000000 :::::::::::::: worker-0119-090116-000002.log :::::::::::::: 1263928160 0119-090116-000002 Logging started 1263928161 INFO - Running on node c19.pads.ci.uchicago.edu 1263928161 INFO 000000 Registration successful. ID=000000 1263929738 INFO 000000 Acknowledged shutdown. Exiting 1263929738 INFO 000000 Ran a total of 3 jobs 1263929738 INFO - All sub-processes finished. Exiting. :::::::::::::: worker-0119-090116-000003.log :::::::::::::: 1263929733 0119-090116-000003 Logging started 1263929733 INFO - Running on node c38.pads.ci.uchicago.edu 1263929733 INFO 000000 Registration successful. ID=000000 :::::::::::::: worker-0119-090116-000004.log :::::::::::::: 1263929734 0119-090116-000004 Logging started 1263929734 INFO - Running on node c45.pads.ci.uchicago.edu 1263929734 INFO 000000 Registration successful. ID=000000 login2$ From hategan at mcs.anl.gov Tue Jan 19 13:55:33 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:55:33 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <4B560CAE.5020203@mcs.anl.gov> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> <4B560CAE.5020203@mcs.anl.gov> Message-ID: <1263930933.22837.14.camel@localhost> On Tue, 2010-01-19 at 13:49 -0600, Michael Wilde wrote: > > On 1/19/10 1:44 PM, Mihael Hategan wrote: > > On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote: > >> On 1/19/10 1:32 PM, Mihael Hategan wrote: > >>> Maybe PBS is lying about that 18 node job. > >> I would be surprised if thats the case. But even if it had *1* node you > >> would think it would run at least 8 jobs in parallel. > > > > I see. 
Though not with your current setup. You should use > > "workersPerNode" instead of "coastersPerNode". > > Thanks! I'll fix that and try again. This makes more sense now, if its > assuming 1 worker per node. > > Still doesnt explain why its not starting more jobs, since it allocated > abundant nodes (even assuming 1 worker per node). Trunk or branch? > > > > > >> Im confused why it has started three jobs, two with only one core and > >> one with 18 nodes. > > > > It does that. It spreads out the block sizes to exploit non-linearities > > in queuing times. > > > >> But the 18 node job just hit its wall time limit; now coasters seems to > >> have started a 10 node job: > > > > Don't know about that. Logs please. > > > > Here's the logs from that dir for this run. I dont understand why the > coasters.log file in that directory has not been written to since Jan 13. If you run swift on the head node and the coaster bootstrap provider is "local", then the coaster service runs in the same jvm as swift, and it writes to the same log as swift. > > login2$ more *0119-090116* [...] Seems fine so far. Swift log then. From wilde at mcs.anl.gov Tue Jan 19 14:02:34 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 14:02:34 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <1263930933.22837.14.camel@localhost> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> <4B560CAE.5020203@mcs.anl.gov> <1263930933.22837.14.camel@localhost> Message-ID: <4B560FDA.60707@mcs.anl.gov> On 1/19/10 1:55 PM, Mihael Hategan wrote: > On Tue, 2010-01-19 at 13:49 -0600, Michael Wilde wrote: >> On 1/19/10 1:44 PM, Mihael Hategan wrote: >>> On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote: >>>> On 1/19/10 1:32 PM, Mihael Hategan wrote: >>>>> Maybe PBS is lying about that 18 node job. >>>> I would be surprised if thats the case. But even if it had *1* node you >>>> would think it would run at least 8 jobs in parallel. >>> I see. Though not with your current setup. You should use >>> "workersPerNode" instead of "coastersPerNode". >> Thanks! I'll fix that and try again. This makes more sense now, if its >> assuming 1 worker per node. >> >> Still doesnt explain why its not starting more jobs, since it allocated >> abundant nodes (even assuming 1 worker per node). > > Trunk or branch? Stable branch. > >> >>>> Im confused why it has started three jobs, two with only one core and >>>> one with 18 nodes. >>> It does that. It spreads out the block sizes to exploit non-linearities >>> in queuing times. >>> >>>> But the 18 node job just hit its wall time limit; now coasters seems to >>>> have started a 10 node job: >>> Don't know about that. Logs please. >>> >> Here's the logs from that dir for this run. I dont understand why the >> coasters.log file in that directory has not been written to since Jan 13. > > If you run swift on the head node and the coaster bootstrap provider is > "local", then the coaster service runs in the same jvm as swift, and it > writes to the same log as swift. > >> login2$ more *0119-090116* > > [...] > > Seems fine so far. Swift log then. -rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49 /home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log I killed the run and will retry with workersPerNode corrected; maybe you can see, though, in this log, why the run was limited to only 3 active at once. 
I'll see if the same happens with workersPerNode set. This would be explained if leaving workersPerNode *not* set somehow defaults to 1 worker per *block* (i.e., per PBS job) instead of 1 worker per node. Could that be happening? - Mike From hategan at mcs.anl.gov Tue Jan 19 14:09:36 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 14:09:36 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <4B560FDA.60707@mcs.anl.gov> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> <4B560CAE.5020203@mcs.anl.gov> <1263930933.22837.14.camel@localhost> <4B560FDA.60707@mcs.anl.gov> Message-ID: <1263931776.23720.0.camel@localhost> On Tue, 2010-01-19 at 14:02 -0600, Michael Wilde wrote: > -rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49 > /home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log > > I killed the run and will retry with workersPerNode corrected; maybe you > can see, though, in this log, why the run was limited to only 3 active > at once. > > I'll see if the same happens with workersPerNode set. > > This would be explained if leaving workersPerNode *not* set somehow > defaults to 1 worker per *block* (i.e., per PBS job) instead of 1 worker > per node. Could that be happening? Not intentionally. From wilde at mcs.anl.gov Tue Jan 19 14:23:43 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 14:23:43 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <1263931776.23720.0.camel@localhost> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> <4B560CAE.5020203@mcs.anl.gov> <1263930933.22837.14.camel@localhost> <4B560FDA.60707@mcs.anl.gov> <1263931776.23720.0.camel@localhost> Message-ID: <4B5614CF.4030704@mcs.anl.gov> With workersPerNode = 8, I now see 2 PBS jobs; one has 1 node, one has 3 nodes. Now *16* jobs are active. The pattern seems to be that it's only running workersPerNode app() tasks per PBS job (i.e., per block). I'll see if I can get it to run workersPerNode tasks per *node* with more explicit settings in the sites file. The current job is: /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs Running from host with compute-node reachable address of 172.5.86.6 Running in /home/wilde/protests/run.loops.5357 protlib2 home is /home/wilde/protlib2 Swift svn swift-r3202 cog-r2682 RunID: 20100119-1414-q09uz2c0 Progress: Progress: Checking status:1 Progress: Selecting site:18 Initializing site shared directory:1 Stage in:1 Finished successfully:1 Progress: Stage in:19 Submitting:1 Finished successfully:1 Progress: Submitted:19 Active:1 Finished successfully:1 Progress: Submitted:11 Active:9 Finished successfully:1 Progress: Submitted:7 Active:13 Finished successfully:1 Progress: Submitted:4 Active:16 Finished successfully:1 Progress: Submitted:4 Active:16 Finished successfully:1 Progress: Submitted:4 Active:16 Finished successfully:1 Progress: Submitted:4 Active:16 Finished successfully:1 PBS says: login2$ qstat -n svc.pads.ci.uchicago.edu: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 917.svc.pads.ci. wilde extended null 16709 1 -- -- 00:29 R 00:04 c19 918.svc.pads.ci.
wilde extended null 15309 3 -- -- 00:29 R 00:04 c46+c45+c44 login2$ Swift log is in: login2$ ls -l $(pwd)/*0.log -rw-r--r-- 1 wilde ci-users 386242 Jan 19 14:21 /home/wilde/protests/run.loops.5357/psim.loops-20100119 4-q09uz2c0.log login2$ On 1/19/10 2:09 PM, Mihael Hategan wrote: > On Tue, 2010-01-19 at 14:02 -0600, Michael Wilde wrote: > >> -rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49 >> /home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log >> >> I killed the run and will retry with workersPerNode corrected; maybe you >> can see, though, in this log, why the run was limited to only 3 active >> at once. >> >> I'll see if same happens with workersPerNode set. >> >> This would be explained if leaving workersPerNode *not* set somehow >> defaults to 1 worker per *block* (ie per pbs job) instead of 1 worker >> per node. Could that be hapenning? > > Not intentionally. > > From iraicu at cs.uchicago.edu Tue Jan 19 16:49:51 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 19 Jan 2010 16:49:51 -0600 Subject: [Swift-user] CFP: Cloud Futures 2010: Advancing Research with Cloud Computing Message-ID: <4B56370F.3060608@cs.uchicago.edu> Cloud Futures 2010: Advancing Research with Cloud Computing April 8-9, 2010 Redmond, WA Workshop Co-Chairs Dan Reed David A. Patterson Corporate Vice President Professor of Computer Science Extreme Computing Group Reliable Adaptive Distributed Systems Lab Microsoft Research University of California - Berkeley Call for Abstracts Cloud computing is fast becoming the most important platform for research. Scientists today need vast computing resources to collect, share, manipulate, and explore massive data sets as well as to build and deploy new services for research. Cloud computing has the potential to advance research discoveries by making data and computing resources readily available at unprecedented economy of scale and nearly infinite scalability. To realize the full promise of cloud computing for research, however, one must think about the cloud as a holistic platform for creating new services, new experiences, and new methods to pursue research, teaching and scholarly communication. This goal presents a broad range of interesting questions. We invite extended abstracts that illustrate the role of cloud computing across a variety of research and curriculum development areas---including computer science, earth sciences, healthcare, humanities, life sciences, and social sciences---that highlight how new techniques and methods of research in the cloud may solve distinct challenges arising in those diverse areas. Please include a bio (150 words max) and a brief abstract (300 words max) of a 30-minute short talk on a topic that describes practical experiences, experimental results, and vision papers. Please submit your abstract by February 10, 2010 to cloudfut at microsoft.com Invited talks will be announced on February 18, 2010 -- ================================================================= Ioan Raicu, Ph.D. 
NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Wed Jan 20 09:38:51 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 20 Jan 2010 09:38:51 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes Message-ID: <4B57238B.9020806@mcs.anl.gov> Using the sites entry below, I see that coasters is allocating 8 *shared* nods rather than *dedicated* nodes; hence its running many more processes per node than it should, causing the jobs to run longer than expected and exceed their walltime. using this sites entry: 7500 8 12 1 1 1.27 10000 $rundir qstat (below) shows the 12 coaster jobs I requested with "slots=12", but they are only using 2 different nodes, c45 and c46, between them, even though they are running 96 total coaster workers. (I can see that I have 96 jobs active). It seems like between coasters and the PBS provider, Swift is nt telling PBS that each of these jobs should get a dedicated node of 8 cores. Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 1034.svc.pads.ci wilde extended null 13086 1 -- -- 02:04 R 01:26 c46 1035.svc.pads.ci wilde extended null 13168 1 -- -- 02:04 R 01:26 c46 1036.svc.pads.ci wilde extended null 13387 1 -- -- 02:04 R 01:26 c46 1037.svc.pads.ci wilde extended null 14060 1 -- -- 02:04 R 01:26 c46 1038.svc.pads.ci wilde extended null 14237 1 -- -- 02:04 R 01:26 c46 1039.svc.pads.ci wilde extended null 14640 1 -- -- 02:04 R 01:26 c46 1040.svc.pads.ci wilde extended null 15200 1 -- -- 02:04 R 01:26 c46 1041.svc.pads.ci wilde extended null 15753 1 -- -- 02:04 R 01:26 c46 1042.svc.pads.ci wilde extended null 23700 1 -- -- 02:04 R 01:26 c45 1043.svc.pads.ci wilde extended null 23781 1 -- -- 02:04 R 01:26 c45 1044.svc.pads.ci wilde extended null 24016 1 -- -- 02:04 R 01:26 c45 1045.svc.pads.ci wilde extended null 24796 1 -- -- 02:04 R 01:26 c45 From wilde at mcs.anl.gov Wed Jan 20 09:42:51 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 20 Jan 2010 09:42:51 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes In-Reply-To: <4B57238B.9020806@mcs.anl.gov> References: <4B57238B.9020806@mcs.anl.gov> Message-ID: <4B57247B.1050607@mcs.anl.gov> The log for the run below is in: /home/wilde/protests/run.loops.3231/psim.loops-20100120-0802-tsvkj4e7.log - Mike On 1/20/10 9:38 AM, Michael Wilde wrote: > Using the sites entry below, I see that coasters is allocating 8 > *shared* nods rather than *dedicated* nodes; hence its running many more > processes per node than it should, causing the jobs to run longer than > expected and exceed their walltime. 
> > using this sites entry: > > > > > 7500 > 8 > > 12 > 1 > 1 > > 1.27 > 10000 > > $rundir > > > qstat (below) shows the 12 coaster jobs I requested with "slots=12", but > they are only using 2 different nodes, c45 and c46, between them, even > though they are running 96 total coaster workers. (I can see that I have > 96 jobs active). > > It seems like between coasters and the PBS provider, Swift is nt telling > PBS that each of these jobs should get a dedicated node of 8 cores. > > > Job ID Username Queue Jobname SessID NDS TSK > Memory Time S Time > -------------------- -------- -------- ---------------- ------ ----- --- > ------ ----- - ----- > 1034.svc.pads.ci wilde extended null 13086 1 -- > -- 02:04 R 01:26 > c46 > 1035.svc.pads.ci wilde extended null 13168 1 -- > -- 02:04 R 01:26 > c46 > 1036.svc.pads.ci wilde extended null 13387 1 -- > -- 02:04 R 01:26 > c46 > 1037.svc.pads.ci wilde extended null 14060 1 -- > -- 02:04 R 01:26 > c46 > 1038.svc.pads.ci wilde extended null 14237 1 -- > -- 02:04 R 01:26 > c46 > 1039.svc.pads.ci wilde extended null 14640 1 -- > -- 02:04 R 01:26 > c46 > 1040.svc.pads.ci wilde extended null 15200 1 -- > -- 02:04 R 01:26 > c46 > 1041.svc.pads.ci wilde extended null 15753 1 -- > -- 02:04 R 01:26 > c46 > 1042.svc.pads.ci wilde extended null 23700 1 -- > -- 02:04 R 01:26 > c45 > 1043.svc.pads.ci wilde extended null 23781 1 -- > -- 02:04 R 01:26 > c45 > 1044.svc.pads.ci wilde extended null 24016 1 -- > -- 02:04 R 01:26 > c45 > 1045.svc.pads.ci wilde extended null 24796 1 -- > -- 02:04 R 01:26 > c45 > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Wed Jan 20 10:42:16 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 20 Jan 2010 10:42:16 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes In-Reply-To: <4B57247B.1050607@mcs.anl.gov> References: <4B57238B.9020806@mcs.anl.gov> <4B57247B.1050607@mcs.anl.gov> Message-ID: <4B573268.6010003@mcs.anl.gov> I ran a test of the same entry using a simple foreach/cat script and captured the PBS submit file. It shows: login2$ more logs/PBS1883411659688642512.submit #PBS -S /bin/sh #PBS -N null #PBS -m n #PBS -l nodes=1 #PBS -l walltime=01:10:00 #PBS -o /home/wilde/.globus/scripts/PBS1883411659688642512.submit.stdout #PBS -e /home/wilde/.globus/scripts/PBS1883411659688642512.submit.stderr /usr/bin/perl /home/wilde/.globus/coasters/cscript2151716324069557151.pl http://192.5.86. 6:50003 0120-331021-000010 8 /home/wilde/.globus/coasters /bin/echo $? >/home/wilde/.globus/scripts/PBS1883411659688642512.submit.exitcode login2$ It seems that the line "#PBS -l nodes=1" should be: #PBS -l nodes=1:ppn=8 - Mike On 1/20/10 9:42 AM, Michael Wilde wrote: > The log for the run below is in: > > /home/wilde/protests/run.loops.3231/psim.loops-20100120-0802-tsvkj4e7.log > > - Mike > > On 1/20/10 9:38 AM, Michael Wilde wrote: >> Using the sites entry below, I see that coasters is allocating 8 >> *shared* nods rather than *dedicated* nodes; hence its running many more >> processes per node than it should, causing the jobs to run longer than >> expected and exceed their walltime. 
>> >> using this sites entry: >> >> >> >> >> 7500 >> 8 >> >> 12 >> 1 >> 1 >> >> 1.27 >> 10000 >> >> $rundir >> >> >> qstat (below) shows the 12 coaster jobs I requested with "slots=12", but >> they are only using 2 different nodes, c45 and c46, between them, even >> though they are running 96 total coaster workers. (I can see that I have >> 96 jobs active). >> >> It seems like between coasters and the PBS provider, Swift is nt telling >> PBS that each of these jobs should get a dedicated node of 8 cores. >> >> >> Job ID Username Queue Jobname SessID NDS TSK >> Memory Time S Time >> -------------------- -------- -------- ---------------- ------ ----- --- >> ------ ----- - ----- >> 1034.svc.pads.ci wilde extended null 13086 1 -- >> -- 02:04 R 01:26 >> c46 >> 1035.svc.pads.ci wilde extended null 13168 1 -- >> -- 02:04 R 01:26 >> c46 >> 1036.svc.pads.ci wilde extended null 13387 1 -- >> -- 02:04 R 01:26 >> c46 >> 1037.svc.pads.ci wilde extended null 14060 1 -- >> -- 02:04 R 01:26 >> c46 >> 1038.svc.pads.ci wilde extended null 14237 1 -- >> -- 02:04 R 01:26 >> c46 >> 1039.svc.pads.ci wilde extended null 14640 1 -- >> -- 02:04 R 01:26 >> c46 >> 1040.svc.pads.ci wilde extended null 15200 1 -- >> -- 02:04 R 01:26 >> c46 >> 1041.svc.pads.ci wilde extended null 15753 1 -- >> -- 02:04 R 01:26 >> c46 >> 1042.svc.pads.ci wilde extended null 23700 1 -- >> -- 02:04 R 01:26 >> c45 >> 1043.svc.pads.ci wilde extended null 23781 1 -- >> -- 02:04 R 01:26 >> c45 >> 1044.svc.pads.ci wilde extended null 24016 1 -- >> -- 02:04 R 01:26 >> c45 >> 1045.svc.pads.ci wilde extended null 24796 1 -- >> -- 02:04 R 01:26 >> c45 >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From hategan at mcs.anl.gov Wed Jan 20 11:01:21 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 20 Jan 2010 11:01:21 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes In-Reply-To: <4B57238B.9020806@mcs.anl.gov> References: <4B57238B.9020806@mcs.anl.gov> Message-ID: <1264006881.463.5.camel@localhost> On Wed, 2010-01-20 at 09:38 -0600, Michael Wilde wrote: > Using the sites entry below, I see that coasters is allocating 8 > *shared* nods rather than *dedicated* nodes; hence its running many more > processes per node than it should, causing the jobs to run longer than > expected and exceed their walltime. Right. It looks like the pbs provider uses nodes= and doesn't mess with ppn=, which means it allocate nodes as defined by the local policy (which may mean cores instead of nodes). I suggest setting workersPerNode to 1, but then you run into the previous problem, which I'm trying to fix now and for which I have an open ticket with PADS support. 
From hategan at mcs.anl.gov Wed Jan 20 15:02:19 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 20 Jan 2010 15:02:19 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes In-Reply-To: <1264006881.463.5.camel@localhost> References: <4B57238B.9020806@mcs.anl.gov> <1264006881.463.5.camel@localhost> Message-ID: <1264021339.8319.0.camel@localhost> On Wed, 2010-01-20 at 11:01 -0600, Mihael Hategan wrote: > On Wed, 2010-01-20 at 09:38 -0600, Michael Wilde wrote: > > Using the sites entry below, I see that coasters is allocating 8 > > *shared* nods rather than *dedicated* nodes; hence its running many more > > processes per node than it should, causing the jobs to run longer than > > expected and exceed their walltime. > > Right. It looks like the pbs provider uses nodes= and doesn't mess with > ppn=, which means it allocate nodes as defined by the local policy > (which may mean cores instead of nodes). > > I suggest setting workersPerNode to 1, but then you run into the > previous problem, which I'm trying to fix now and for which I have an > open ticket with PADS support. > The ssh problem on PADS was fixed and I committed a patch to the branch to start multiple instances of the app (cog r2683). Mihael From sardjito.antonius at gmail.com Wed Jan 20 21:45:18 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Wed, 20 Jan 2010 21:45:18 -0600 Subject: [Swift-user] qsub problem Message-ID: <110a6b261001201945u3ca19089p7c44644533786df7@mail.gmail.com> Hi, I am still having trouble executing task on PADS, the problem is still qsub. I tried adding +maui and +torque also tried it with the @ symbol instead but I continued to get "cannot run program:qsub error" -Antonius -------------- next part -------------- An HTML attachment was scrubbed... URL: From sardjito.antonius at gmail.com Wed Jan 20 22:00:33 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Wed, 20 Jan 2010 22:00:33 -0600 Subject: [Swift-user] Re: qsub problem In-Reply-To: <110a6b261001201945u3ca19089p7c44644533786df7@mail.gmail.com> References: <110a6b261001201945u3ca19089p7c44644533786df7@mail.gmail.com> Message-ID: <110a6b261001202000y30c8f3a8ybff8b283080fa72@mail.gmail.com> Actually never mind my question, I got it to work now. I downloaded the program (the second to newest version) and use the command in the README.configure. -Antonius On Wed, Jan 20, 2010 at 9:45 PM, Antonius Sardjito < sardjito.antonius at gmail.com> wrote: > Hi, > > I am still having trouble executing task on PADS, the problem is still > qsub. I tried adding +maui and +torque also tried it with the @ symbol > instead but I continued to get "cannot run program:qsub error" > > -Antonius > -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at anl.gov Thu Jan 21 05:20:01 2010 From: foster at anl.gov (Ian Foster) Date: Thu, 21 Jan 2010 05:20:01 -0600 Subject: [Swift-user] Re: [Cmsc34900] qsub problem In-Reply-To: <110a6b261001202000y30c8f3a8ybff8b283080fa72@mail.gmail.com> References: <110a6b261001201945u3ca19089p7c44644533786df7@mail.gmail.com> <110a6b261001202000y30c8f3a8ybff8b283080fa72@mail.gmail.com> Message-ID: <06ECAC42-2D54-4F88-B2A3-80A1396D4F62@anl.gov> Antonius: Thanks for the update. I'm glad it is working. Ian. On Jan 20, 2010, at 10:00 PM, Antonius Sardjito wrote: > Actually never mind my question, I got it to work now. I downloaded the program (the second to newest version) and use the command in the README.configure. 
> > -Antonius > On Wed, Jan 20, 2010 at 9:45 PM, Antonius Sardjito wrote: > Hi, > > I am still having trouble executing task on PADS, the problem is still qsub. I tried adding +maui and +torque also tried it with the @ symbol instead but I continued to get "cannot run program:qsub error" > > -Antonius > > _______________________________________________ > Cmsc34900 mailing list > Cmsc34900 at mailman.cs.uchicago.edu > https://mailman.cs.uchicago.edu/mailman/listinfo/cmsc34900 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sardjito.antonius at gmail.com Thu Jan 21 23:37:05 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Thu, 21 Jan 2010 23:37:05 -0600 Subject: [Swift-user] readData documentation Message-ID: <110a6b261001212137i61e43c18m4dcff72c1823118a@mail.gmail.com> Hi, Is there a more complete documentation on readData? besides the one in the user guide. Thank you -Antonius -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jan 21 23:38:44 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 21 Jan 2010 23:38:44 -0600 Subject: [Swift-user] Re: [Cmsc34900] readData documentation In-Reply-To: <110a6b261001212137i61e43c18m4dcff72c1823118a@mail.gmail.com> References: <110a6b261001212137i61e43c18m4dcff72c1823118a@mail.gmail.com> Message-ID: <4B5939E4.8050805@mcs.anl.gov> There isn't, but if you have questions about it, please ask. Its also helpful to test its behavior with tiny test swift scripts. - Mike On 1/21/10 11:37 PM, Antonius Sardjito wrote: > Hi, > > Is there a more complete documentation on readData? besides the one in > the user guide. Thank you > > > -Antonius > > > ------------------------------------------------------------------------ > > _______________________________________________ > Cmsc34900 mailing list > Cmsc34900 at mailman.cs.uchicago.edu > https://mailman.cs.uchicago.edu/mailman/listinfo/cmsc34900 From wilde at mcs.anl.gov Thu Jan 21 23:45:10 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 21 Jan 2010 23:45:10 -0600 Subject: [Swift-user] Problem in error handling for localhost jobs with status.mode=provider Message-ID: <4B593B66.2040400@mcs.anl.gov> When running a job on localhost with status.mode=provider set in swift.properties, missing-output-file error messages are lost. 
You can replicate this error with this script: -- login2$ cat missingresult.swift type file; app (file f) echo() { echo "foo"; } file f<"missing.txt">; f = echo(); -- With status.mode not set, you get the expected error message, "The following output files were not created by the application: missing.txt" but with it set to "provider" you only get "Job failed with an exit code of 254": (note that I've got a bunch of debug messages below in _swiftwrap) login2$ swift -config props missingresult.swift Swift svn swift-r3202 cog-r2683 RunID: 20100121-2337-pn3jdg2c Progress: To TTY: exit code = 0 _swiftwrap: returned from checkError _swiftwrap: exit step 1 _swiftwrap: exit step 2 _swiftwrap: exit step 3 _swiftwrap: exit step 4 _swiftwrap: exit step 5 checking for outfile missing.txt jobs/j/echo-jlb3snmj/missing.txt is missing The following output files were not created by the application: missing.txt fail(254) logged message The following output files were not created by the application: missing.txt fail(254) logged info Execution failed: Exception in echo: Arguments: [foo] Host: localhost Directory: missingresult-20100121-2337-pn3jdg2c/jobs/j/echo-jlb3snmj stderr.txt: stdout.txt: foo ---- Caused by: Job failed with an exit code of 254 login2$ From sardjito.antonius at gmail.com Fri Jan 22 00:05:04 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Fri, 22 Jan 2010 00:05:04 -0600 Subject: [Swift-user] Re: [Cmsc34900] readData documentation In-Reply-To: <4B5939E4.8050805@mcs.anl.gov> References: <110a6b261001212137i61e43c18m4dcff72c1823118a@mail.gmail.com> <4B5939E4.8050805@mcs.anl.gov> Message-ID: <110a6b261001212205h495ca62fg1588ccc80eae500b@mail.gmail.com> Currently I am playing around with it but I keep getting an execution error something like below: Progress: Submitting:1 Finished successfully:317 To TTY: exit code = 0 _swiftwrap: returned from checkError Execution failed: File header does not match type. Expected 0 whitespace separated items. Got 1 instead. The file that is passed to readData() has no spaces; it was just an array of strings, which I have checked with a text editor. I hope you will be available tomorrow; I think it is easier to show it. -Antonius On Thu, Jan 21, 2010 at 11:38 PM, Michael Wilde wrote: > There isn't, but if you have questions about it, please ask. > > Its also helpful to test its behavior with tiny test swift scripts. > > - Mike > > > > On 1/21/10 11:37 PM, Antonius Sardjito wrote: > >> Hi, >> Is there a more complete documentation on readData? besides the one in >> the user guide. Thank you >> -Antonius >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Cmsc34900 mailing list >> Cmsc34900 at mailman.cs.uchicago.edu >> https://mailman.cs.uchicago.edu/mailman/listinfo/cmsc34900 >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sardjito.antonius at gmail.com Fri Jan 22 09:59:14 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Fri, 22 Jan 2010 09:59:14 -0600 Subject: [Swift-user] regexp problem Message-ID: <110a6b261001220759k4abda459l31e5c95963d3fdd1@mail.gmail.com> Hi Dr. Wilde, I am able to readData from the input now, into an array of strings.
I am currently trying to use the strcut and strcat to cut the ".tif" and replace it with "Colored.tif" I tried to do something like this: string cutresult = @strcut(output[0],"/home/wilde/bigdata/data/modis/output/"); #outfile = eccho(output[0]); trace(output[0]); and the error is : * RunID: 20100122-0951-345qqi15 Progress: SwiftScript trace: /home/wilde/bigdata/data/modis/output/h11v05.tif Execution failed: java.lang.IndexOutOfBoundsException: No group 1 I don't understand why I got the error IndexOutOfBounds, I have checked the string using a procedure that echo the string to a file, and as you can see from above I also checked it with trace(output[0]) which produces a valid value. -Antonius ps. This is the whole test file named mini.swift type file; file input <"input_for_mini_test.txt">; string output[] = readData(input); file outfile <"output_for_mini_test.txt">; string color = "Colored.tif"; app (file out) eccho (string inputStr) { echo inputStr stdout=@out; } outfile = eccho(output[0]); string cutresult = @strcut(output[0],"/home/wilde/bigdata/data/modis/output/"); trace(output[0]); --EOF-- * -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Fri Jan 22 12:40:16 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 22 Jan 2010 12:40:16 -0600 Subject: [Swift-user] regexp problem In-Reply-To: <110a6b261001220759k4abda459l31e5c95963d3fdd1@mail.gmail.com> References: <110a6b261001220759k4abda459l31e5c95963d3fdd1@mail.gmail.com> Message-ID: <4B59F110.4040608@mcs.anl.gov> You need to specify a string pattern which has at least one matching "parenthesized group", as below. - Mike -- $ cat strcut.swift string output0 = "/home/wilde/bigdata/data/modis/output/mydir/my.file"; string pattern = "/home/wilde/bigdata/data/modis/output/(.*)"; string cutresult = @strcut(output0,pattern); trace("output0",output0); trace("cutresult",cutresult);login2$ $ swift strcut.swift Swift svn swift-r3202 cog-r2683 RunID: 20100122-1236-euf2lr6b Progress: SwiftScript trace: output0, /home/wilde/bigdata/data/modis/output/mydir/my.file SwiftScript trace: cutresult, mydir/my.file On 1/22/10 9:59 AM, Antonius Sardjito wrote: > Hi Dr. Wilde, > > I am able to readData from the input now, into an array of string. I am > currently trying to use the strcut and strcat to cut the ".tif" and > replace it with "Colored.tif" > > I tried to do something like this: > > string cutresult = > @strcut(output[0],"/home/wilde/bigdata/data/modis/output/"); > #outfile = eccho(output[0]); > trace(output[0]); > > > and the error is : > * > RunID: 20100122-0951-345qqi15 > Progress: > SwiftScript trace: /home/wilde/bigdata/data/modis/output/h11v05.tif > Execution failed: > java.lang.IndexOutOfBoundsException: No group 1 > > > I don't understand why I got the error IndexOutOfBounds, I have checked > the string using a procedure that echo the string to a file, and as you > can see from above I also checked it with trace(output[0]) which > produces a valid value. > > > -Antonius > > > ps. 
> > This is the whole test file named mini.swift > > type file; > file input <"input_for_mini_test.txt">; > string output[] = readData(input); > file outfile <"output_for_mini_test.txt">; > string color = "Colored.tif"; > > app (file out) eccho (string inputStr) > { > echo inputStr stdout=@out; > } > > outfile = eccho(output[0]); > string cutresult = > @strcut(output[0],"/home/wilde/bigdata/data/modis/output/"); > trace(output[0]); > > --EOF-- > > * > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From sardjito.antonius at gmail.com Fri Jan 22 14:45:32 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Fri, 22 Jan 2010 14:45:32 -0600 Subject: [Swift-user] unmapping a file Message-ID: <110a6b261001221245s67547be8vfd1ba10a4d0dfe5b@mail.gmail.com> Hi, Is there a way to unmap a variable from a file ? say f----> f.txt are there ways to unmap this so I could have say s--->f.txt Thanks -Antonius -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Fri Jan 22 15:49:21 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 22 Jan 2010 15:49:21 -0600 Subject: [Swift-user] unmapping a file In-Reply-To: <110a6b261001221245s67547be8vfd1ba10a4d0dfe5b@mail.gmail.com> References: <110a6b261001221245s67547be8vfd1ba10a4d0dfe5b@mail.gmail.com> Message-ID: <4B5A1D61.6050707@mcs.anl.gov> There is no way to unmap a variable. And you need to manually ensure that you dont map the same physical filename to multiple *output* variables, as then its very likely that Swift will complain at run time that its trying to map and access a file that is already mapped to another variable (it complains of a "cache" conflict in such cases, as the file is already in a site's file cache). Swift does let you map the same filename for *input* to different variables or array members. If a file is already mapped to a variable, one will often reference that same variable in a later statement to re-read the same file again, or to read a previously produced file. But in this example, the purpose of returning a list of tile filenames from analyzelanduse.sh was so that you can read in this list and use it to re-map a *subset* of a previously mapped array of files. - Mike On 1/22/10 2:45 PM, Antonius Sardjito wrote: > Hi, > > Is there a way to unmap a variable from a file ? > say f----> f.txt are there ways to unmap this so I could have say > s--->f.txt > > Thanks > -Antonius > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From jamalphd at gmail.com Tue Jan 26 22:58:17 2010 From: jamalphd at gmail.com (J A) Date: Tue, 26 Jan 2010 23:58:17 -0500 Subject: [Swift-user] Read file and write to files Message-ID: Hi All: I am still learning the swift script. I have a text file (list.txt) contains the following: 3 String1 String2 String3 Where the first line contains the number of sentence with text. I would like to read the file list.txt and output the strings into separate files where file1.txt contains "String1", file2.txt contains "String2", and so on. Any suggestions on what to use or a sample code will be appreciated. 
Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamalphd at gmail.com Tue Jan 26 23:01:06 2010 From: jamalphd at gmail.com (J A) Date: Wed, 27 Jan 2010 00:01:06 -0500 Subject: [Swift-user] first.swift Message-ID: Hi All: I am looking at first.swift: ========== type messagefile {} (messagefile t) greeting() { app { echo "Hello, world!" stdout=@filename(t); } } messagefile outfile <"hello.txt">; outfile = greeting(); ============ Is there a way to avoid using the '<' and '>' in the script and have code still do the same thing? Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Wed Jan 27 04:02:22 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Jan 2010 10:02:22 +0000 (GMT) Subject: [Swift-user] first.swift In-Reply-To: References: Message-ID: > Is there a way to avoid using the '<' and '>' in the script and have code > still do the same thing? This is the kind of question that makes me reply with "why are you asking that?" ;) so: why are you asking that? (why do you not want < and > in the script?) -- From skenny at uchicago.edu Wed Jan 27 17:27:41 2010 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 27 Jan 2010 17:27:41 -0600 (CST) Subject: [Swift-user] #include ? Message-ID: <20100127172741.CIV77065@m4500-02.uchicago.edu> i heard a rumor once :) that there was a working version of an 'include' feature of swift such that i may declare functions in one script and use them in others...is this in the latest swift? i wasn't able to find anything in the doc about it so thought i'd check. ~sk From wilde at mcs.anl.gov Wed Jan 27 17:48:45 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Wed, 27 Jan 2010 17:48:45 -0600 (CST) Subject: [Swift-user] #include ? In-Reply-To: <4512972.68241264636028671.JavaMail.root@zimbra> Message-ID: <19009325.68301264636125591.JavaMail.root@zimbra> Yes, Ben added the start of a simple import mechanism: login2$ cat cati.swift import catapp; type file; file data<"data.txt">; file out<"out.txt">; out = cat(data); login2$ cat catapp.swift import typedefs; app (file o) cat (file i) { cat @i stdout=@o; } At the moment, the files you import need to be in the dir in which you are running Swift. - Mike ----- Original Message ----- From: skenny at uchicago.edu To: swift-user at ci.uchicago.edu Sent: Wednesday, January 27, 2010 5:27:41 PM GMT -06:00 US/Canada Central Subject: [Swift-user] #include ? i heard a rumor once :) that there was a working version of an 'include' feature of swift such that i may declare functions in one script and use them in others...is this in the latest swift? i wasn't able to find anything in the doc about it so thought i'd check. ~sk _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From skenny at uchicago.edu Wed Jan 27 18:01:52 2010 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 27 Jan 2010 18:01:52 -0600 (CST) Subject: [Swift-user] #include ? In-Reply-To: <19009325.68301264636125591.JavaMail.root@zimbra> References: <4512972.68241264636028671.JavaMail.root@zimbra> <19009325.68301264636125591.JavaMail.root@zimbra> Message-ID: <20100127180152.CIV82143@m4500-02.uchicago.edu> great thanks! ---- Original message ---- >Date: Wed, 27 Jan 2010 17:48:45 -0600 (CST) >From: wilde at mcs.anl.gov >Subject: Re: [Swift-user] #include ? 
>To: skenny at uchicago.edu >Cc: swift-user at ci.uchicago.edu > >Yes, Ben added the start of a simple import mechanism: > >login2$ cat cati.swift >import catapp; > >type file; > >file data<"data.txt">; >file out<"out.txt">; >out = cat(data); > > >login2$ cat catapp.swift >import typedefs; >app (file o) cat (file i) >{ > cat @i stdout=@o; >} > >At the moment, the files you import need to be in the dir in which you are running Swift. > >- Mike > > > > >----- Original Message ----- >From: skenny at uchicago.edu >To: swift-user at ci.uchicago.edu >Sent: Wednesday, January 27, 2010 5:27:41 PM GMT -06:00 US/Canada Central >Subject: [Swift-user] #include ? > >i heard a rumor once :) that there was a working version of an >'include' feature of swift such that i may declare functions >in one script and use them in others...is this in the latest >swift? i wasn't able to find anything in the doc about it so >thought i'd check. > >~sk >_______________________________________________ >Swift-user mailing list >Swift-user at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From jamalphd at gmail.com Wed Jan 27 19:59:01 2010 From: jamalphd at gmail.com (J A) Date: Wed, 27 Jan 2010 20:59:01 -0500 Subject: [Swift-user] Re: Read file and write to files In-Reply-To: References: Message-ID: Any suggestions on my question below? Thanks On 1/26/10, J A wrote: > > Hi All: > > I am still learning the swift script. > > I have a text file (list.txt) contains the following: > > > 3 > String1 > String2 > String3 > > > Where the first line contains the number of sentence with text. > > I would like to read the file list.txt and output the strings into separate > files where file1.txt contains "String1", file2.txt contains "String2", and > so on. > > Any suggestions on what to use or a sample code will be appreciated. > > Thanks, > Jamal > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jan 28 05:35:47 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 28 Jan 2010 05:35:47 -0600 (CST) Subject: [Swift-user] Read file and write to files In-Reply-To: Message-ID: <6118345.73361264678547548.JavaMail.root@zimbra> Jamal, You can use an approach like this: Use readData() to read list.txt into an array. Use foreach to iterate over the array. Use the index of the array to form the new file names, and a single_file_mapper to map a variable inside the foreach to these files. Use an app() like echo() to write the text to the files. - Mike ----- Original Message ----- From: "J A" To: swift-user at ci.uchicago.edu Sent: Tuesday, January 26, 2010 10:58:17 PM GMT -06:00 US/Canada Central Subject: [Swift-user] Read file and write to files Hi All: I am still learning the swift script. I have a text file (list.txt) contains the following: 3 String1 String2 String3 Where the first line contains the number of sentence with text. I would like to read the file list.txt and output the strings into separate files where file1.txt contains "String1", file2.txt contains "String2", and so on. Any suggestions on what to use or a sample code will be appreciated. 
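A minimal sketch of the approach outlined above (untested; the app and file names are illustrative, indices start at 0 so it writes file0.txt, file1.txt, ..., and it assumes list.txt holds only the strings, one per line; a leading count line would otherwise be read in as the first element):

type file;

app (file o) echo (string s)
{
  echo s stdout=@o;
}

file listfile <"list.txt">;
string lines[] = readData(listfile);

foreach s, i in lines
{
  # map each output file from the loop index, per the single_file_mapper suggestion
  file out <single_file_mapper; file=@strcat("file", i, ".txt")>;
  out = echo(s);
}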
Thanks, Jamal _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Thu Jan 28 05:40:17 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 28 Jan 2010 05:40:17 -0600 (CST) Subject: [Swift-user] first.swift In-Reply-To: Message-ID: <7046714.73401264678817210.JavaMail.root@zimbra> Jamal, were you asking whether you can avoid using "mappers" (which are indicated by <>) or whether you can avoid using the chars "<>" in a Swift script? In either case, I believe the answer is "no", in that Swift is based on the use of mappers, and the syntax for mappers requires "<>". - Mike ----- Original Message ----- From: "Ben Clifford" To: "J A" Cc: swift-user at ci.uchicago.edu Sent: Wednesday, January 27, 2010 4:02:22 AM GMT -06:00 US/Canada Central Subject: Re: [Swift-user] first.swift > Is there a way to avoid using the '<' and '>' in the script and have code > still do the same thing? This is the kind of question that makes me reply with "why are you asking that?" ;) so: why are you asking that? (why do you not want < and > in the script?) -- _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From jamalphd at gmail.com Thu Jan 28 10:11:36 2010 From: jamalphd at gmail.com (J A) Date: Thu, 28 Jan 2010 11:11:36 -0500 Subject: [Swift-user] compile error Message-ID: Hi: When I try to run the following code: array_iteration.swift: type file {} (file f) echo (string s) { app { echo s stdout=@filename(f); } } (file fa[]) echo_batch (string sa[]) { foreach string s, i in sa { fa[i] = echo(s); } } string sa[] = ["hello","hi there","how are you"]; file fa[]; fa = echo_batch(sa); ...... I get the following error: Could not compile SwiftScript source: line 10:13: expecting an identifier, found 'string' any idea on how to fix it? Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jan 28 10:29:35 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 28 Jan 2010 10:29:35 -0600 (CST) Subject: [Swift-user] compile error In-Reply-To: Message-ID: <12775175.83861264696175449.JavaMail.root@zimbra> One thing I spot here is that this statement: foreach string s, i in sa should be written: foreach s, i in sa The foreach statement does not permit you to re-declare the types of the iteration variables (s and i in your case). - Mike ----- Original Message ----- From: "J A" To: swift-user at ci.uchicago.edu Sent: Thursday, January 28, 2010 10:11:36 AM GMT -06:00 US/Canada Central Subject: [Swift-user] compile error Hi: When I try to run the following code: array_iteration.swift: type file {} (file f) echo (string s) { app { echo s stdout=@filename(f ); } } (file fa[]) echo_batch (string sa[]) { foreach string s, i in sa { fa[i] = echo(s); } } string sa[] = ["hello","hi there","how are you"]; file fa[]; fa = echo_batch(sa); ...... I get the following error: Could not compile SwiftScript source: line 10:13: expecting an identifier, found 'string' any idea on how to fix it? 
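With that fix applied, the procedure reads (only the foreach line changes from the downloaded example):

(file fa[]) echo_batch (string sa[]) {
  foreach s, i in sa {
    fa[i] = echo(s);
  }
}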
Thanks, Jamal _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From jamalphd at gmail.com Thu Jan 28 12:55:07 2010 From: jamalphd at gmail.com (J A) Date: Thu, 28 Jan 2010 13:55:07 -0500 Subject: [Swift-user] compile error In-Reply-To: <12775175.83861264696175449.JavaMail.root@zimbra> References: <12775175.83861264696175449.JavaMail.root@zimbra> Message-ID: This is one of the examples I dowlonaded from the swift website. On Thu, Jan 28, 2010 at 11:29 AM, Michael Wilde wrote: > One thing I spot here is that this statement: > > foreach string s, i in sa > > should be written: > > foreach s, i in sa > > The foreach statement does not permit you to re-declare the types of the > iteration variables (s and i in your case). > > - Mike > > > ----- Original Message ----- > From: "J A" > To: swift-user at ci.uchicago.edu > Sent: Thursday, January 28, 2010 10:11:36 AM GMT -06:00 US/Canada Central > Subject: [Swift-user] compile error > > > > Hi: > > When I try to run the following code: > > array_iteration.swift: > > > type file {} > (file f) echo (string s) { > app { > echo s stdout=@filename(f ); > } > } > (file fa[]) echo_batch (string sa[]) { > foreach string s, i in sa { > fa[i] = echo(s); > } > } > string sa[] = ["hello","hi there","how are you"]; > file fa[]; > fa = echo_batch(sa); > > ...... > > I get the following error: > > Could not compile SwiftScript source: line 10:13: expecting an identifier, > found 'string' > > > any idea on how to fix it? > > > Thanks, > Jamal > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamalphd at gmail.com Thu Jan 28 13:03:48 2010 From: jamalphd at gmail.com (J A) Date: Thu, 28 Jan 2010 14:03:48 -0500 Subject: [Swift-user] compile error In-Reply-To: <12775175.83861264696175449.JavaMail.root@zimbra> References: <12775175.83861264696175449.JavaMail.root@zimbra> Message-ID: That solved the issue Michael. Thanks On Thu, Jan 28, 2010 at 11:29 AM, Michael Wilde wrote: > One thing I spot here is that this statement: > > foreach string s, i in sa > > should be written: > > foreach s, i in sa > > The foreach statement does not permit you to re-declare the types of the > iteration variables (s and i in your case). > > - Mike > > > ----- Original Message ----- > From: "J A" > To: swift-user at ci.uchicago.edu > Sent: Thursday, January 28, 2010 10:11:36 AM GMT -06:00 US/Canada Central > Subject: [Swift-user] compile error > > > > Hi: > > When I try to run the following code: > > array_iteration.swift: > > > type file {} > (file f) echo (string s) { > app { > echo s stdout=@filename(f ); > } > } > (file fa[]) echo_batch (string sa[]) { > foreach string s, i in sa { > fa[i] = echo(s); > } > } > string sa[] = ["hello","hi there","how are you"]; > file fa[]; > fa = echo_batch(sa); > > ...... > > I get the following error: > > Could not compile SwiftScript source: line 10:13: expecting an identifier, > found 'string' > > > any idea on how to fix it? > > > Thanks, > Jamal > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jamalphd at gmail.com Thu Jan 28 16:21:00 2010 From: jamalphd at gmail.com (J A) Date: Thu, 28 Jan 2010 17:21:00 -0500 Subject: [Swift-user] example to write to a file Message-ID: Hi All: does anyone has an example on how to write to a file? Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jan 28 21:26:58 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Thu, 28 Jan 2010 21:26:58 -0600 (CST) Subject: [Swift-user] Using the writeData() built-in function In-Reply-To: <8254476.109671264735416491.JavaMail.root@zimbra> Message-ID: <19940874.109751264735618737.JavaMail.root@zimbra> Below is some preliminary info (a tiny example) on writing files from Swift using the not-yet-well-documented function writeData(). http://www.ci.uchicago.edu/swift/guides/userguide.php#procedure.writedata Other than this function, the only way to write data to a file is to call an app() that writes the data. Note that writeData was meant solely to assist in writing long argument lists to a file that can be passed to an app() function, to avoid passing overly-long command lines. Ive been exploring a few simple "library" functions writing using simple apps like echo and awk to do data formatting or conversion. While thats not ready for release, I just want to hint to those who need it that such techniques are reasonable and feasible. In general, common practice in Swift is *not* to write general data files directly via Swift statements, but rather to call app() functions that write data files. - Mike ----- Forwarded Message ----- From: "Ben Clifford" To: swift-devel at ci.uchicago.edu Sent: Wednesday, July 1, 2009 10:37:28 AM GMT -06:00 US/Canada Central Subject: [Swift-devel] writeData r2994 contains a writeData function which does the opposite of readData. Specifically, you can say: file l; l = writeData(@f); to output the filenames for a data structure into a text file, so that you can pass this instead of passing filenames on the command line. -- _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From fedorov at bwh.harvard.edu Sat Jan 30 16:36:52 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sat, 30 Jan 2010 17:36:52 -0500 Subject: [Swift-user] Looking for the cause of failure Message-ID: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> Hi, I've been running a 1000-job swift script with coaster provider. After executing successfully 998 jobs, I see continuous stream of messages Progress: Submitted:1 Active:1 Finished successfully:998 ... At the same time, there are no jobs in the PBS queue. 
looking at ~/.globus/coasters/coasters.log, I found the following error messages towards the end of the log: 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed: Failed The job manager could not stage out a file 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job: executable: /usr/bin/perl arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl http://141.142.68.180:54622 0130-580326-000001 /u/ac/fedorov/.globus/coasters stdout: null stderr: null directory: null batch: false redirected: false {hostcount=40, maxwalltime=24, count=40, jobtype=multiple} 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed: org.globus.gram.GramException: The job manager could not stage out a file at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531) at org.globus.gram.GramJob.setStatus(GramJob.java:184) at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) at java.lang.Thread.run(Thread.java:595) And then a longer series of what looks like timeout messages: 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN): handling reply timeout; sendReqTime=100130-161740.893, sendTime=100130-161740.893, now=100130-161940.911 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault was: Reply timeout org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN): handling reply timeout; sendReqTime=100130-161740.893, sendTime=100130-161740.893, now=100130-161940.911 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault was: Reply timeout org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Anybody can explain what happened? The same workflow ran earlier, but with fewer (2) workers per node. I am running this on Abe, Swift svn swift-r3202 cog-r2682, site description: /u/ac/fedorov/scratch-global/scratch 2.55 10000 20 false 0.1 4 10 Thanks Andriy Fedorov From wilde at mcs.anl.gov Sat Jan 30 18:27:28 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Sat, 30 Jan 2010 18:27:28 -0600 (CST) Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <12680489.151301264897544218.JavaMail.root@zimbra> Message-ID: <2795624.151361264897648127.JavaMail.root@zimbra> Andriy, I need to look at this in more detail. (Mihael is unavailable this week). But I'm wondering - since you are running Swift on an abe login host, consider changing: to: Also, on abe, don't you want to set workersNerNode to 8, as its nodes are 8-core hosts? You may also want to set the max time of the coster job (in seconds) to, for example: 7500 Change "7500" to whatever makes sense for your application. This is somewhat manual, but I suggest setting the time to some value that ensure that the PBS jobs last long enough for your application jobs. That aspect may need further adjustment. 
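(The profile element itself was stripped by the list archiver; it presumably looked something like the line below, in the same form as the other globus-namespace profiles quoted in this thread:)

<profile namespace="globus" key="maxtime">7500</profile>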
Lastly, instead of the gridftp tag you can use: But the gridftp tag you have is fine, and equivalent. - Mike ----- "Andriy Fedorov" wrote: > Hi, > > I've been running a 1000-job swift script with coaster provider. > After > executing successfully 998 jobs, I see continuous stream of messages > > Progress: Submitted:1 Active:1 Finished successfully:998 > ... > > At the same time, there are no jobs in the PBS queue. looking at > ~/.globus/coasters/coasters.log, I found the following error messages > towards the end of the log: > > 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed: > Failed The job manager could not stage out a file > 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job: > executable: /usr/bin/perl > arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl > http://141.142.68.180:54622 0130-580326-000001 > /u/ac/fedorov/.globus/coasters > stdout: null > stderr: null > directory: null > batch: false > redirected: false > {hostcount=40, maxwalltime=24, count=40, jobtype=multiple} > > 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed: > org.globus.gram.GramException: The job manager could not stage out a > file > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531) > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:595) > > And then a longer series of what looks like timeout messages: > > 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN): > handling reply timeout; sendReqTime=100130-161740.893, > sendTime=100130-161740.893, now=100130-161940.911 > 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault > was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at > org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) > at > org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN): > handling reply timeout; sendReqTime=100130-161740.893, > sendTime=100130-161740.893, now=100130-161940.911 > 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault > was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at > org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) > at > org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > > Anybody can explain what happened? The same workflow ran earlier, but > with fewer (2) workers per node. 
> > I am running this on Abe, Swift svn swift-r3202 cog-r2682, site > description: > > > > url="grid-abe.ncsa.teragrid.org"/> > /u/ac/fedorov/scratch-global/scratch > 2.55 > 10000 > 20 > key="remoteMonitorEnabled">false > 0.1 > 4 > 10 > > > Thanks > > Andriy Fedorov > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From hategan at mcs.anl.gov Sat Jan 30 21:46:33 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 30 Jan 2010 21:46:33 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> References: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> Message-ID: <1264909593.6403.8.camel@localhost> On Sat, 2010-01-30 at 17:36 -0500, Andriy Fedorov wrote: > 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed: > Failed The job manager could not stage out a file > 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job: > executable: /usr/bin/perl > arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl > http://141.142.68.180:54622 0130-580326-000001 > /u/ac/fedorov/.globus/coasters > stdout: null > stderr: null > directory: null > batch: false > redirected: false > {hostcount=40, maxwalltime=24, count=40, jobtype=multiple} > > 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed: > org.globus.gram.GramException: The job manager could not stage out a file > at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531) > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:595) That in itself is not a failure condition as it is something that happens after the worker job completes. > > And then a longer series of what looks like timeout messages: > > 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN): > handling reply timeout; sendReqTime=100130-161740.893, > sendTime=100130-161740.893, now=100130-161940.911 > 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault > was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) > at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) That is an indication that the worker didn't respond to a shutdown command, perhaps because it died previously. In ~/.globus/coasters you will find a bunch of worker logs. If you can identify the ones for your run (based perhaps on the timestamp on the files), they may contain the reason for the failure. 
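For example, assuming the default location, something like this lists the newest files there first:

ls -lt ~/.globus/coasters | head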
> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN): > handling reply timeout; sendReqTime=100130-161740.893, > sendTime=100130-161740.893, now=100130-161940.911 > 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault > was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) > at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > > Anybody can explain what happened? The same workflow ran earlier, but > with fewer (2) workers per node. Does it work if you set workers per node to 2 again? If yes, that may be an indication that the workers per node setting causes a problem, and that's a stronger statement than "it doesn't work right now". From jamalphd at gmail.com Sat Jan 30 22:00:29 2010 From: jamalphd at gmail.com (J A) Date: Sat, 30 Jan 2010 23:00:29 -0500 Subject: [Swift-user] passing a file as an argument Message-ID: Hi: how can I pass a file when executing swift, so i want to do something like this: $ swift file.txt how do I catch the file inside the code? Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From fedorov at bwh.harvard.edu Sat Jan 30 22:07:47 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sat, 30 Jan 2010 23:07:47 -0500 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <2795624.151361264897648127.JavaMail.root@zimbra> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> Message-ID: <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> On Sat, Jan 30, 2010 at 19:27, wrote: > Andriy, I need to look at this in more detail. (Mihael is unavailable this week). > > But I'm wondering - since you are running Swift on an abe login host, consider changing: > > ? ? ? ? ? ? ? url="grid-abe.ncsa.teragrid.org"/> > > to: > > ? > Michael, thank you for the suggestion -- I will try! > Also, on abe, don't you want to set workersNerNode to 8, as its nodes are 8-core hosts? > Yes, you are right! > You may also want to set the max time of the coster job (in seconds) to, for example: > > ?7500 > > Change "7500" to whatever makes sense for your application. This is somewhat manual, but I suggest setting the time to some value that ensure that the PBS jobs last long enough for your application jobs. That aspect may need ?further adjustment. > I am not sure about this one. The documentation says maxtime defines the maximum walltime for a coaster block, and is by default unlimited. It seems to me that setting this parameter could actually create problems. Can you explain? > Lastly, instead of the gridftp tag you can use: > > ? > > But the gridftp tag you have is fine, and equivalent. > > - Mike > > > ----- "Andriy Fedorov" wrote: > >> Hi, >> >> I've been running a 1000-job swift script with coaster provider. >> After >> executing successfully 998 jobs, I see continuous stream of messages >> >> Progress: ?Submitted:1 ?Active:1 ?Finished successfully:998 >> ... >> >> At the same time, there are no jobs in the PBS queue. 
looking at >> ~/.globus/coasters/coasters.log, I found the following error messages >> towards the end of the log: >> >> 2010-01-30 16:17:22,275-0600 INFO ?Block Block task status changed: >> Failed The job manager could not stage out a file >> 2010-01-30 16:17:22,275-0600 INFO ?Block Failed task spec: Job: >> ? ? ? ? executable: /usr/bin/perl >> ? ? ? ? arguments: ?/u/ac/fedorov/.globus/coasters/cscript28331.pl >> http://141.142.68.180:54622 0130-580326-000001 >> /u/ac/fedorov/.globus/coasters >> ? ? ? ? stdout: ? ? null >> ? ? ? ? stderr: ? ? null >> ? ? ? ? directory: ?null >> ? ? ? ? batch: ? ? ?false >> ? ? ? ? redirected: false >> ? ? ? ? {hostcount=40, maxwalltime=24, count=40, jobtype=multiple} >> >> 2010-01-30 16:17:22,275-0600 WARN ?Block Worker task failed: >> org.globus.gram.GramException: The job manager could not stage out a >> file >> ? ? ? ? at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531) >> ? ? ? ? at org.globus.gram.GramJob.setStatus(GramJob.java:184) >> ? ? ? ? at >> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) >> ? ? ? ? at java.lang.Thread.run(Thread.java:595) >> >> And then a longer series of what looks like timeout messages: >> >> 2010-01-30 16:19:40,911-0600 WARN ?Command Command(3, SHUTDOWN): >> handling reply timeout; sendReqTime=100130-161740.893, >> sendTime=100130-161740.893, now=100130-161940.911 >> 2010-01-30 16:19:40,911-0600 WARN ?Command Command(3, SHUTDOWN)fault >> was: Reply timeout >> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >> ? ? ? ? at >> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) >> ? ? ? ? at >> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) >> ? ? ? ? at java.util.TimerThread.mainLoop(Timer.java:512) >> ? ? ? ? at java.util.TimerThread.run(Timer.java:462) >> 2010-01-30 16:19:40,911-0600 WARN ?Command Command(4, SHUTDOWN): >> handling reply timeout; sendReqTime=100130-161740.893, >> sendTime=100130-161740.893, now=100130-161940.911 >> 2010-01-30 16:19:40,911-0600 WARN ?Command Command(4, SHUTDOWN)fault >> was: Reply timeout >> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >> ? ? ? ? at >> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) >> ? ? ? ? at >> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) >> ? ? ? ? at java.util.TimerThread.mainLoop(Timer.java:512) >> ? ? ? ? at java.util.TimerThread.run(Timer.java:462) >> >> Anybody can explain what happened? The same workflow ran earlier, but >> with fewer (2) workers per node. >> >> I am running this on Abe, Swift svn swift-r3202 cog-r2682, site >> description: >> >> >> ? >> ? > ? url="grid-abe.ncsa.teragrid.org"/> >> ? /u/ac/fedorov/scratch-global/scratch >> ? 2.55 >> ? 10000 >> ? 20 >> ? > key="remoteMonitorEnabled">false >> ? 0.1 >> ? 4 >> ? 
10 >> >> >> Thanks >> >> Andriy Fedorov >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From fedorov at bwh.harvard.edu Sat Jan 30 22:10:27 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sat, 30 Jan 2010 23:10:27 -0500 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <1264909593.6403.8.camel@localhost> References: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> <1264909593.6403.8.camel@localhost> Message-ID: <82f536811001302010h641557f4u9e52e91a72b50543@mail.gmail.com> On Sat, Jan 30, 2010 at 22:46, Mihael Hategan wrote: > In ~/.globus/coasters you will find a bunch of worker logs. If you can > identify the ones for your run (based perhaps on the timestamp on the > files), they may contain the reason for the failure. > Strangely, I don't have worker logs for these executions -- the latest are from Jan 18. >> Anybody can explain what happened? The same workflow ran earlier, but >> with fewer (2) workers per node. > > Does it work if you set workers per node to 2 again? If yes, that may be > an indication that the workers per node setting causes a problem, and > that's a stronger statement than "it doesn't work right now". > I will try, and let you know. If this is indeed the case, is there any particular reason why it may not work for 4 workers per node? As Mike pointed out, the nodes actually have 8 cores. > > > From hategan at mcs.anl.gov Sat Jan 30 22:14:04 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 30 Jan 2010 22:14:04 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> Message-ID: <1264911244.7775.3.camel@localhost> On Sat, 2010-01-30 at 23:07 -0500, Andriy Fedorov wrote: > > You may also want to set the max time of the coster job (in seconds) to, for example: > > > > 7500 > > > > Change "7500" to whatever makes sense for your application. This is somewhat manual, but I suggest setting the time to some value that ensure that the PBS jobs last long enough for your application jobs. That aspect may need further adjustment. > > > > I am not sure about this one. The documentation says maxtime defines > the maximum walltime for a coaster block, and is by default unlimited. > It seems to me that setting this parameter could actually create > problems. Can you explain? > What may happen is that the block (the actual PBS job submitted to run the workers) is longer than what the queue allows. For example, you may select the "short" queue, and that may have a limit of, say, 2 hours for the walltime. You want to set the maxtime accordingly in order to prevent coasters from submitting a job with a walltime higher than what the queue allows, which would cause the job to fail immediately. Even in the case you don't explicitly specify a queue, the default queue may itself have a limit. 
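To make that concrete, a coaster pool entry along those lines might look roughly like the sketch below. The handle, queue name, paths and numbers are illustrative rather than Abe's actual settings; a 2-hour queue limit is assumed, so maxtime (in seconds) is kept at or below 7200:

<pool handle="pbs-coasters">
  <execution provider="coaster" jobmanager="local:pbs"/>
  <filesystem provider="local"/>
  <profile namespace="globus" key="queue">normal</profile>
  <profile namespace="globus" key="maxtime">7200</profile>
  <profile namespace="globus" key="workersPerNode">8</profile>
  <profile namespace="karajan" key="jobThrottle">2.55</profile>
  <workdirectory>/path/to/scratch</workdirectory>
</pool>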
From hategan at mcs.anl.gov Sat Jan 30 22:25:02 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 30 Jan 2010 22:25:02 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001302010h641557f4u9e52e91a72b50543@mail.gmail.com> References: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> <1264909593.6403.8.camel@localhost> <82f536811001302010h641557f4u9e52e91a72b50543@mail.gmail.com> Message-ID: <1264911902.7775.14.camel@localhost> On Sat, 2010-01-30 at 23:10 -0500, Andriy Fedorov wrote: > On Sat, Jan 30, 2010 at 22:46, Mihael Hategan wrote: > > In ~/.globus/coasters you will find a bunch of worker logs. If you can > > identify the ones for your run (based perhaps on the timestamp on the > > files), they may contain the reason for the failure. > > > > Strangely, I don't have worker logs for these executions -- the latest > are from Jan 18. That indicates that the workers aren't even started. It's somewhat unfortunate that GRAM fails to stage out stdout/stderr, because those would likely contain information about the failure. What you can probably do in this case is try to reproduce the jobs that the coasters submit and do it manually with qsub or GRAM to see what the queuing system complains about. For that, you could enable log4j debugging for org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler. That would give you the gt2 RSL of the job, and that would likely be useful. > > >> Anybody can explain what happened? The same workflow ran earlier, but > >> with fewer (2) workers per node. > > > > Does it work if you set workers per node to 2 again? If yes, that may be > > an indication that the workers per node setting causes a problem, and > > that's a stronger statement than "it doesn't work right now". > > > > I will try, and let you know. If this is indeed the case, is there any > particular reason why it may not work for 4 workers per node? > > As Mike pointed out, the nodes actually have 8 cores. No idea. I'm pretty much blind about the issue, and in such cases it seems that the reasonable solution is to use a stick and hit random things and get a feel for the obstacles around. Now, Mike's suggestion about using the PBS provider directly seems like a good one because it provides an alternative mechanism for doing the same thing which, well, is pretty much like our stick above, except it's a pretty big stick, so it has decent chances of making a difference. Also, in case you're there, trunk is unstable code. For more stable code, use the stable branch (details on the swift download page). From fedorov at bwh.harvard.edu Sat Jan 30 22:28:23 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sat, 30 Jan 2010 23:28:23 -0500 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <1264911244.7775.3.camel@localhost> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> <1264911244.7775.3.camel@localhost> Message-ID: <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> On Sat, Jan 30, 2010 at 23:14, Mihael Hategan wrote: > What may happen is that the block (the actual PBS job submitted to run > the workers) is longer than what the queue allows. > > For example, you may select the "short" queue, and that may have a limit > of, say, 2 hours for the walltime. 
You want to set the maxtime > accordingly in order to prevent coasters from submitting a job with a > walltime higher than what the queue allows, which would cause the job to > fail immediately. > Even in the case you don't explicitly specify a queue, the default queue > may itself have a limit. This makes sense -- thank you for the explanation! So I changed the number of workers per node to 8, and set the provider to "local:pbs", as Mike suggested. I see 2 PBS jobs (20 and 40 nodes) running, but from what Swift reports to me, only 16 (?) jobs are being active at a time. Selecting site:664 Submitted:240 Active:16 Finished successfully:80 With the previous setup, it made more sense, because the number of active jobs was *. Am I missing something simple? Maybe I should just try the stable branch. I will do this next. > > > From hategan at mcs.anl.gov Sat Jan 30 22:45:53 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 30 Jan 2010 22:45:53 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> <1264911244.7775.3.camel@localhost> <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> Message-ID: <1264913153.8312.15.camel@localhost> On Sat, 2010-01-30 at 23:28 -0500, Andriy Fedorov wrote: > On Sat, Jan 30, 2010 at 23:14, Mihael Hategan wrote: > > What may happen is that the block (the actual PBS job submitted to run > > the workers) is longer than what the queue allows. > > > > For example, you may select the "short" queue, and that may have a limit > > of, say, 2 hours for the walltime. You want to set the maxtime > > accordingly in order to prevent coasters from submitting a job with a > > walltime higher than what the queue allows, which would cause the job to > > fail immediately. > > Even in the case you don't explicitly specify a queue, the default queue > > may itself have a limit. > > This makes sense -- thank you for the explanation! > > So I changed the number of workers per node to 8, and set the provider > to "local:pbs", as Mike suggested. I see 2 PBS jobs (20 and 40 nodes) > running, but from what Swift reports to me, only 16 (?) jobs are being > active at a time. > > Selecting site:664 Submitted:240 Active:16 Finished successfully:80 It may be a strange variation on relativity. What swift sees as the number of concurrent jobs may not be what the cluster sees as the number of concurrent jobs because messages between the two take various amounts of time to make it from one place to the other. This is especially visible when the jobs are short. That or maybe this patch I recently committed (cog branches/4.1.7 r2683) for the PBS provider. 16 is suspiciously equal to number_of_jobs*workers_per_node, which may be a result of the PBS provider starting only one executable irrespective of the number of nodes requested. The patch mentioned uses pdsh to start the proper number of executable instances. > > With the previous setup, it made more sense, because the number of > active jobs was *. Define "previous setup". If it's about one coaster job per node, yes. Unfortunately that's also something that prevents scalability with gram2 or clusters that have limits on the number of jobs in the queue (like the BG/P). You can force that behavior though with maxnodes=1. > > Am I missing something simple? 
Maybe I should just try the stable > branch. I will do this next. > I would advise everybody besides about 2 people doing research on I/O scalability with Swift to use the stable branch. Not only does it get fixes before trunk, but it doesn't get weird changes that may cause random breakage. From fedorov at bwh.harvard.edu Sun Jan 31 09:49:54 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sun, 31 Jan 2010 10:49:54 -0500 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <1264913153.8312.15.camel@localhost> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> <1264911244.7775.3.camel@localhost> <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> <1264913153.8312.15.camel@localhost> Message-ID: <82f536811001310749o37a37509i6606f0f6bfcf3be3@mail.gmail.com> On Sat, Jan 30, 2010 at 23:45, Mihael Hategan wrote: >> With the previous setup, it made more sense, because the number of >> active jobs was *. > > Define "previous setup". "previous setup" is the site configuration I included in the email that started this thread. I just tried this "previous setup", increasing number of workers per node to 8, and everything worked very well (job status plot attached). > If it's about one coaster job per node, yes. > Unfortunately that's also something that prevents scalability with gram2 > or clusters that have limits on the number of jobs in the queue (like > the BG/P). > > You can force that behavior though with maxnodes=1. > >> >> Am I missing something simple? Maybe I should just try the stable >> branch. I will do this next. >> > > I would advise everybody besides about 2 people doing research on I/O > scalability with Swift to use the stable branch. Not only does it get > fixes before trunk, but it doesn't get weird changes that may cause > random breakage. > With the stable branch, and "updated setup" (execution provider "local:pbs") I have this error message: /var/spool/torque/mom_priv/jobs/2489852.abem5.ncsa.uiuc.edu.SC: line 10: pdsh: command not found Should I install pdsh first? I didn't see it right away in the TG software list. I also don't see instructions in the Swift user guide, unless I missed it. > -------------- next part -------------- A non-text attachment was scrubbed... Name: karatasks.JOB_SUBMISSION-trails.png Type: image/png Size: 5011 bytes Desc: not available URL: From hategan at mcs.anl.gov Sun Jan 31 09:56:42 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 31 Jan 2010 09:56:42 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001310749o37a37509i6606f0f6bfcf3be3@mail.gmail.com> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> <1264911244.7775.3.camel@localhost> <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> <1264913153.8312.15.camel@localhost> <82f536811001310749o37a37509i6606f0f6bfcf3be3@mail.gmail.com> Message-ID: <1264953402.11359.1.camel@localhost> On Sun, 2010-01-31 at 10:49 -0500, Andriy Fedorov wrote: > On Sat, Jan 30, 2010 at 23:45, Mihael Hategan wrote: > >> With the previous setup, it made more sense, because the number of > >> active jobs was *. > > > > Define "previous setup". > > "previous setup" is the site configuration I included in the email > that started this thread. 
> > I just tried this "previous setup", increasing number of workers per > node to 8, and everything worked very well (job status plot attached). > > > If it's about one coaster job per node, yes. > > Unfortunately that's also something that prevents scalability with gram2 > > or clusters that have limits on the number of jobs in the queue (like > > the BG/P). > > > > You can force that behavior though with maxnodes=1. > > > >> > >> Am I missing something simple? Maybe I should just try the stable > >> branch. I will do this next. > >> > > > > I would advise everybody besides about 2 people doing research on I/O > > scalability with Swift to use the stable branch. Not only does it get > > fixes before trunk, but it doesn't get weird changes that may cause > > random breakage. > > > > With the stable branch, and "updated setup" (execution provider > "local:pbs") I have this error message: > > /var/spool/torque/mom_priv/jobs/2489852.abem5.ncsa.uiuc.edu.SC: line > 10: pdsh: command not found > > Should I install pdsh first? Yes. Might have a softenv package. > I didn't see it right away in the TG > software list. I also don't see instructions in the Swift user guide, > unless I missed it. It's relatively new. There was also the assumption that it would be installed pretty much everywhere, but it doesn't seem to be the case, so I', thinking a plain ssh solution (which is what gram does) may be better. From wilde at mcs.anl.gov Sun Jan 31 12:20:02 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 31 Jan 2010 12:20:02 -0600 (CST) Subject: [Swift-user] passing a file as an argument In-Reply-To: Message-ID: <27622487.155821264962002707.JavaMail.root@zimbra> Jamal, for this, you pass the file name on the command line as a script argument after all the swift command arguments, and pick it up inside your swift script with the @arg() function, which is like argv in C (except indexed by name rather than position). Its described in the User Guide. - Mike ----- "J A" wrote: > Hi: > > how can I pass a file when executing swift, so i want to do something > like this: > > $ swift file.txt > > > how do I catch the file inside the code? > > Thanks, > Jamal > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
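A small sketch of the @arg() pattern Mike describes above (the argument name, script name, and mapper are illustrative, not taken from the User Guide):

login2$ swift readit.swift -filename=file.txt

and inside readit.swift:

string fn = @arg("filename");
file input <single_file_mapper; file=fn>;
string contents[] = readData(input);
trace(fn);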