From iraicu at cs.uchicago.edu Sat Jan 16 08:46:58 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 16 Jan 2010 08:46:58 -0600 Subject: [Swift-user] CFP: 19th ACM International Symposium on High Performance Distributed Computing Message-ID: <4B51D162.2040805@cs.uchicago.edu> The 19th ACM International Symposium on High Performance Distributed Computing (HPDC 2010) is now accepting submissions of research papers. Authors are invited to submit full papers of at most 12 pages or short papers of at most 4 pages. Details about formatting requirements and the submission process are given at http://hpdc2010.eecs.northwestern.edu/submitpaper.html. The deadline for registering an abstract has been extended to Friday, January 22, 2010, and the deadline for the complete paper remains Friday, January 22, 2010. The detailed call for papers follows, and can also be seen online at http://hpdc2010.eecs.northwestern.edu. ======================================================================= ACM HPDC 2010 Call For Papers 19th ACM International Symposium on High Performance Distributed Computing Chicago, Illinois June 21-25, 2010 http://hpdc2010.eecs.northwestern.edu/ The ACM International Symposium on High Performance Distributed Computing (HPDC) is the premier venue for presenting the latest research on the design, implementation, evaluation, and use of parallel and distributed systems for high performance and high end computing. The 19th installment of HPDC will take place in the heart of Chicago, Illinois, the third largest city in the United States and a major technological and cultural capital. The conference will be held on June 23-25 (Wednesday through Friday) with eight affiliated workshops occurring on June 21-22 (Monday and Tuesday). The Open Grid Forum will be co-located as well, on June 20-22 (Sunday through Tuesday). Submissions are welcomed on all forms of high performance distributed computing, including grids, clouds, clusters, service-oriented computing, utility computing, peer-to-peer systems, and global computing ensembles. New scholarly research showing empirical and reproducible results in architectures, systems, and networks is strongly encouraged, as are experience reports of applications and deployments that can provide insights for future high performance distributed computing research. All papers will be rigorously reviewed by a distinguished program committee, with a strong focus on the combination of rigorous scientific results and likely high impact within high performance distributed computing. Research papers must clearly demonstrate research contributions and novelty, while experience reports must clearly describe lessons learned and demonstrate impact. Topics of interest include (but are not limited to) the following, in the context of high performance distributed computing and high end computing: * Systems * Architectures * Algorithms * Networking * Programming languages and environments * Data management * I/O and file systems * Virtualization * Resource management, scheduling, and load-balancing * Performance modeling, simulation, and prediction * Fault tolerance, reliability and availability * Security, configuration, policy, and management issues * Multicore issues and opportunities * Models and use cases for utility, grid, and cloud computing Both full papers and short papers (for poster presentation and/or demonstrations) may be submitted. 
IMPORTANT DATES Paper Abstract submissions: January 15, 2010 Paper submissions: January 22, 2010 Author notification: March 30, 2010 Final manuscripts: April 23, 2010 SUBMISSIONS Authors are invited to submit full papers of at most 12 pages or short papers of at most 4 pages. The page limits include all figures and references. Papers should be formatted in the ACM proceedings style (e.g., http://www.acm.org/sigs/publications/proceedings-templates). Reviewing is single-blind. Papers must be self-contained and provide the technical substance required for the program committee to evaluate the paper's contribution, including how it differs from prior work. All papers will be reviewed and judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference. Submitted papers must be original work that has not appeared in and is not under consideration for another conference or a journal. PUBLICATION Accepted full and short papers will appear in the conference proceedings. WORKSHOPS Eight workshops have been selected for co-location with HPDC, on June 21st and 22nd, 2010. Please visit the workshop web page at http://hpdc2010.eecs.northwestern.edu/workshops.html for more information. The workshops include: * Emerging Computational Methods for the Life Sciences * LSAP: Large-Scale System and Application Performance * MDQCS: Managing Data Quality for Collaborative Science * ScienceCloud: Scientific Cloud Computing * CLADE: Challenges of Large Applications in Distributed Environments * DIDC: Data Intensive Distributed Computing * MAPREDUCE: MapReduce and its Applications * VTDC: Virtualization Technologies for Distributed Computing OPEN GRID FORUM Open Grid Form (ogf.org ) will have a co-located meeting with HPDC. Please visit the web site for more information. GENERAL CO-CHAIRS Kate Keahey, Argonne National Labs Salim Hariri, University of Arizona STEERING COMMITTEE Salim Hariri, Univ. of Arizona (Chair) Andrew A. Chien, Intel / UCSD Henri Bal, Vrije University Franck Cappello, INRIA Jack Dongarra, Univ. of Tennessee Ian Foster, ANL& Univ. of Chicago Andrew Grimshaw, Univ. of Virginia Carl Kesselman, USC/ISI Dieter Kranzlmueller, Ludwig-Maximilians-Univ. Muenchen Miron Livny, Univ. of Wisconsin Manish Parashar, Rutgers University Karsten Schwan, Georgia Tech David Walker, Univ. 
of Cardiff Rich Wolski, UCSB PROGRAM CHAIR Peter Dinda, Northwestern University PROGRAM COMMITTEE Kento Aida, NII and Tokyo Institute of Technology Ron Brightwell, Sandia National Labs Fabian Bustamante, Northwestern University Henri Bal, Vrije Universiteit Frank Cappello, INRIA Claris Castillo, IBM Research Henri Casanova, University of Hawaii Abhishek Chandra, University of Minnesota Chris Colohan, Google Brian Cooper, Yahoo Research Wu-chun Feng, Virginia Tech Renato Ferreira, Universidade Federal de Minas Gerais Jose Fortes, University of Florida Ian Foster, University of Chicago / Argonne Geoffrey Fox, Indiana University Michael Gerndt, TU-Munich Andrew Grimshaw, University of Virginia Thilo Kielmann, Vrije Universiteit Zhiling Lan, IIT John Lange, Northwestern University Arthur Maccabe, Oak Ridge National Labs Satoshi Matsuoka, Toyota Institute of Technology Jose Moreira, IBM Research Klara Nahrstedt, UIUC Dushyanth Narayanan, Microsoft Research Manish Parashar, Rutgers University Ioan Raicu, Northwestern University Morris Riedel, Juelich Supercomputing Centre Matei Ripeanu, UBC Joel Saltz, Emory University Karsten Schwan, Georgia Tech Thomas Stricker, Google Jaspal Subhlok, University of Houston Martin Swany, University of Delaware Michela Taufer, University of Delaware Valerie Taylor, TAMU Douglas Thain, University of Notre Dame Jon Weissman, University of Minnesota Rich Wolski, UCSB and Eucalyptus Systems Dongyan Xu, Purdue University Ken Yocum, UCSD WORKSHOP CHAIR Douglas Thain, University of Notre Dame PUBLICITY CO-CHAIRS Martin Swany, U. Delaware Morris Riedel, Julich Supercomputing Centre Renato Ferreira, Universidade Federal de Minas Gerais Kento Aida, NII and Tokyo Institute of Technology LOCAL ARRANGEMENTS CHAIR Zhiling Lan, IIT STUDENT ACTIVITIES CO-CHAIRS John Lange, Northwestern University Ioan Raicu, Northwestern University -- ================================================================= Ioan Raicu, Ph.D. NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= -- ================================================================= Ioan Raicu, Ph.D. NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iraicu at cs.uchicago.edu Mon Jan 18 08:27:07 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 18 Jan 2010 08:27:07 -0600 Subject: [Swift-user] CFP: IEEE 2010 Fourth International Workshop on Scientific Workflows (SWF 2010) Message-ID: <4B546FBB.601@cs.uchicago.edu> Call for Papers IEEE 2010 Fourth International Workshop on Scientific Workflows (SWF 2010) http://www.cs.wayne.edu/~shiyong/swf Miami, Florida, U.S.A., one day between July 5-10, 2010 In conjunction with IEEE ICWS/SCC/CLOUD/SERVICES 2010 Description Scientific workflows have become an increasingly popular paradigm for scientists to formalize and structure complex scientific processes to enable and accelerate many significant scientific discoveries. A scientific workflow is a formal specification of a scientific process, which represents, streamlines, and automates the analytical and computational steps that a scientist needs to go through from dataset selection and integration, computation and analysis, to final data product presentation and visualization. The importance of scientific workflows has been recognized by NSF since 2006 and was reemphasized recently in a Science article titled "Beyond the Data Deluge" (Science, Vol. 323, no. 5919, pp. 1297-1298, 2009), which concluded, "In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies." The goal of SWF 2010 is to provide a forum for researchers and practitioners to present their recent research results and best practices of scientific workflows, and to identify the emerging trends, opportunities, problems, and challenges in this area. Authors are invited to submit regular papers (8 pages) and short papers (4 pages) that show original unpublished research results in all areas of scientific workflows. Topics of interest are listed below; however, submissions on all aspects of scientific workflows are welcome. Accepted SWF 2010 papers will be included in the proceedings of IEEE SERVICES 2010, which will be published by IEEE Computer Society Press. 
Topics o Scientific workflow provenance management and analytics o Scientific workflow data, metadata, service, and task management o Scientific workflow architectures, models, and languages o Scientific workflow monitoring, debugging, and failure handling o Streaming data processing in scientific workflows o Pipelined, data, workflow, and task parallelism in scientific workflows o Service, Grid, or Cloud-based scientific workflows o Data, metadata, compute, user-interaction, or visualization-intensive scientific workflows o Scientific workflow composition o Security issues in scientific workflows o Data integration and service integration in scientific workflows o Scientific workflow mapping, optimization, and scheduling o Scientific workflow modeling, simulation, analysis, and verification o Scalability, reliability, extensibility, agility, and interoperability o Scientific workflow applications Important dates Paper Submission March 17, 2010 Decision Notification (Electronic) April 17, 2010 Camera-Ready Submission & Pre-registration April 30, 2010 Workshop chairs: Shiyong Lu, Wayne State University Calton Pu, Georgia Tech Liqiang Wang, University of Wyoming Publication chairs (pending) Ilkay Altintas, San Diego Supercomputer Center Yogesh Simmhan, Microsoft Research Ioan Raicu, Northwestern University Publicity chair Jamil Alhiyafi, Wayne State University Program committee Ilkay Altintas, San Diego Supercomputer Center, USA Roger Barga, Microsoft Research, USA Adam Barker, University of Oxford, UK Shawn Bowers, UC Davis Genome Center, USA Artem Chebotko, University of Texas at Pan American, USA Ian Gorton, PNNL Paul Groth, VU University Amsterdam Marta L. Queirós Mattoso, Federal University of Rio de Janeiro, Brazil Luc Moreau, University of Southampton, UK Ioan Raicu, Northwestern University, USA Yogesh Simmhan, Microsoft Corporation, USA Chung-Wei Hang, North Carolina State University, USA Ian Taylor, Cardiff University, UK Jianwu Wang, San Diego Supercomputer Center Wei Tan, ANL Ping Yang, Binghamton University, USA Ustun Yildiz, UC Davis Yong Zhao, Microsoft Corporation, USA Zhiming Zhao, University of Amsterdam, the Netherlands For any questions, please send e-mails to Shiyong Lu at shiyong at wayne.edu . -- ================================================================= Ioan Raicu, Ph.D. NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Mon Jan 18 15:59:57 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 18 Jan 2010 15:59:57 -0600 Subject: [Swift-user] CFP: 1st ACM Workshop on Scientific Cloud Computing (ScienceCloud) 2010 Message-ID: <4B54D9DD.7000103@cs.uchicago.edu> Call for Papers --------------------------------------------------------------------------------------- 1st ACM Workshop on Scientific Cloud Computing (ScienceCloud) 2010 http://dsl.cs.uchicago.edu/ScienceCloud2010/ --------------------------------------------------------------------------------------- June 21st, 2010 Chicago, Illinois, USA Co-located with the ACM High Performance Distributed Computing Conference (HPDC) 2010 ======================================================================================= Workshop Overview The advent of computation can be compared, in terms of the breadth and depth of its impact on research and scholarship, to the invention of writing and the development of modern mathematics. Scientific Computing has already begun to change how science is done, enabling scientific breakthroughs through new kinds of experiments that would have been impossible only a decade ago. Today's science is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. Support for data intensive computing is critical to advancing modern science, as storage systems have seen the gap between their capacity and their bandwidth widen more than 10-fold over the last decade. There is an emerging need for advanced techniques to manipulate, visualize and interpret large datasets. Scientific Computing is the key to many domains' "holy grail" of new knowledge, and comes in many shapes and forms: high-performance computing (HPC), which is heavily focused on compute-intensive applications; high-throughput computing (HTC), which focuses on using many computing resources over long periods of time to accomplish its computational tasks; many-task computing (MTC), which aims to bridge the gap between HPC and HTC by focusing on using many resources over short periods of time; and data-intensive computing, which is heavily focused on data distribution and on harnessing data locality by scheduling computations close to the data. The 1st Workshop on Scientific Cloud Computing (ScienceCloud) will provide the scientific community with a dedicated forum for discussing new research, development, and deployment efforts in running these kinds of scientific computing workloads on Cloud Computing infrastructures. 
The ScienceCloud workshop will focus on the use of cloud-based technologies to meet new compute intensive and data intensive scientific challenges that are not well served by the current supercomputers, grids or commercial clouds. What architectural changes to the current cloud frameworks (hardware, operating systems, networking and/or programming models) are needed to support science? Dynamic information derived from remote instruments and coupled simulation and sensor ensembles are both important new science pathways and tremendous challenges for current HPC/HTC/MTC technologies. How can cloud technologies enable these new scientific approaches? How are scientists using clouds? Are there scientific HPC/HTC/MTC workloads that are suitable candidates to take advantage of emerging cloud computing resources with high efficiency? What benefits exist by adopting the cloud model, over clusters, grids, or supercomputers? What factors are limiting clouds use or would make them more usable/efficient? This workshop encourages interaction and cross-pollination between those developing applications, algorithms, software, hardware and networking, emphasizing scientific computing for such cloud platforms. We believe the workshop will be an excellent place to help the community define the current state, determine future goals, and define architectures and services for future science clouds. Topics of Interest --------------------------------------------------------------------------------------- We invite the submission of original work that is related to the topics below. The papers can be either short (5 pages) position papers, or long (10 pages) research papers. Topics of interest include (in the context of Cloud Computing): * scientific computing applications o case studies on cloud computing o case studies comparing clouds, cluster, grids, and/or supercomputers o performance evaluation * performance evaluation o real systems o cloud computing benchmarks o reliability of large systems * programming models and tools o map-reduce and its generalizations o many-task computing middleware and applications o integrating parallel programming frameworks with storage clouds o message passing interface (MPI) o service-oriented science applications * storage cloud architectures and implementations o distributed file systems o content distribution systems for large data o data caching frameworks and techniques o data management within and across data centers o data-aware scheduling o data-intensive computing applications o eventual-consistency storage usage and management * compute resource management o dynamic resource provisioning o scheduling o techniques to manage many-core resources and/or GPUs * high-performance computing o high-performance I/O systems o interconnect and network interface architectures for HPC o multi-gigabit wide-area networking o scientific computing tradeoffs between clusters/grids/supercomputers and clouds o parallel file systems in dynamic environments * models, frameworks and systems for cloud security o implementation of access control and scalable isolation Paper Submission and Publication --------------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, (including all text, figures, and references) as per ACM 8.5 x 11 manuscript guidelines 
(http://www.acm.org/publications/instructions_for_proceedings_volumes); document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/ScienceCloud2010/ before the deadline of February 22nd, 2010 at 11:59PM PST; the final 10 page papers in PDF format will be due on March 1st, 2010 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library. Notifications of the paper decisions will be sent out by April 1st, 2010. Selected excellent work will be invited to submit extended versions of the workshop paper to a special issue journal. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please visit http://dsl.cs.uchicago.edu/ScienceCloud2010/. Important Dates --------------------------------------------------------------------------------------- * Abstract Due: February 22nd, 2010 * Papers Due: March 1st, 2010 * Notification of Acceptance: April 1st, 2010 * Workshop Date: June 21st, 2010 Committee Members --------------------------------------------------------------------------------------- Workshop Chairs * Pete Beckman, University of Chicago & Argonne National Laboratory * Ian Foster, University of Chicago & Argonne National Laboratory * Ioan Raicu, Northwestern University Steering Committee * Jeff Broughton, Lawrence Berkeley National Lab., USA * Alok Choudhary, Northwestern University, USA * Dennis Gannon, Microsoft Research, USA * Robert Grossman, University of Illinois at Chicago, USA * Kate Keahey, Nimbus, University of Chicago, Argonne National Laboratory, USA * Ed Lazowska, University of Washington, USA * Ignacio Llorente, Open Nebula, Universidad Complutense de Madrid, Spain * David E. Martin, Argonne National Laboratory, Northwestern University, USA * Gabriel Mateescu, Linkoping University, Sweden * David O'Hallaron, Carnegie Mellon University, Intel Labs, USA * Rich Wolski, Eucalyptus, University of California, Santa Barbara, USA * Kathy Yelick, University of California at Berkeley, Lawrence Berkeley National Lab., USA Technical Committee * David Abramson, Monash University, Australia * Roger Barga, Microsoft Research, USA * Roy Campbell, University of Illinois at Urbana Champaign, USA * Henri Casanova, University of Hawaii at Manoa, USA * Brian Cooper, Yahoo! Research, USA * Peter Dinda, Northwestern University, USA * Jack Dongara, University of Tennessee, USA * Geoffrey Fox, Indiana University, USA * Adriana Iamnitchi, University of South Florida, USA * Alexandru Iosup, Delft University of Technology, Netherlands * James Hamilton, Amazon Web Services, USA * Tevfik Kosar, Louisiana State University, USA * Shiyong Lu, Wayne State University, USA * Ruben S. 
Montero, Universidad Complutense de Madrid, Spain * Reagan Moore, University of North Carolina, Chapel Hill, USA * Jose Moreira, IBM Research, USA * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory * Matei Ripeanu, University of British Columbia, Canada * Larry Rudolph, VMware, USA * Marc Snir, University of Illinois at Urbana Champaign, USA * Xian-He Sun, Illinois Institute of Technology, USA * Hakim Weatherspoon, Cornell University, USA * Mike Wilde, University of Chicago & Argonne National Laboratory, USA * Alec Wolman, Microsoft Research, USA * Yong Zhao, Microsoft, USA -- ================================================================= Ioan Raicu, Ph.D. NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= From sardjito.antonius at gmail.com Mon Jan 18 16:58:07 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Mon, 18 Jan 2010 16:58:07 -0600 Subject: [Swift-user] could not initialized shared directory on pbs Message-ID: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> Hi, I tried to execute a Swift script on the PADS cluster using this command: swift -tc.file tc -sites.file pbs.xml modis.swift but encountered the error below: [antonius at login2 work]$ swift -tc.file tc -sites.file pbs.xml modis.swift Swift svn swift-r3202 cog-r2682 RunID: 20100118-1636-roba879f Progress: Execution failed: Could not initialize shared directory on pbs Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Failed to create directory: /home/wilde/swiftwork/modis-20100118-1636-roba879f/shared Any suggestion on what might have caused the "Could not initialize shared directory on pbs" error? -Antonius -------------- next part -------------- An HTML attachment was scrubbed... 
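For reference, the directory Swift failed to create here comes from the <workdirectory> element of the site's pool entry in the sites file (pbs.xml in this run). A minimal sketch of what such an entry roughly looks like is given below; the handle, provider, and path are illustrative placeholders, not the actual contents of this pbs.xml:

    <!-- illustrative entry only; handle, provider, and path are placeholders -->
    <pool handle="pbs">
      <execution provider="pbs"/>
      <workdirectory>/home/youruser/swiftwork</workdirectory>
    </pool>

The "Failed to create directory" error is what appears when that path points somewhere the submitting user cannot write, which is what the reply below addresses.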
URL: From hategan at mcs.anl.gov Mon Jan 18 17:21:48 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 Jan 2010 17:21:48 -0600 Subject: [Swift-user] could not initialized shared directory on pbs In-Reply-To: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> References: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> Message-ID: <1263856908.9254.2.camel@localhost> On Mon, 2010-01-18 at 16:58 -0600, Antonius Sardjito wrote: > Execution failed: > Could not initialize shared directory on pbs > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: Failed to > create > directory: /home/wilde/swiftwork/modis-20100118-1636-roba879f/shared Right. You need to edit pbs.xml and change the work directory to something you have write permissions to. I added a note on the wiki assignment page (http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftTutorialForBigData#Doing_Swift_Assignment_1) Mihael From hategan at mcs.anl.gov Mon Jan 18 19:44:19 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 Jan 2010 19:44:19 -0600 Subject: [Swift-user] could not initialized shared directory on pbs In-Reply-To: <110a6b261001181553g4a4e1113re5ca30a0ec164897@mail.gmail.com> References: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> <1263856908.9254.2.camel@localhost> <110a6b261001181553g4a4e1113re5ca30a0ec164897@mail.gmail.com> Message-ID: <1263865459.10474.12.camel@localhost> It may be a good idea to CC the list so that if others run into the same problem, they can see what the solution is. On Mon, 2010-01-18 at 17:53 -0600, Antonius Sardjito wrote: > I tried editing the file but when I run it I encountered a long erros > at the top there was: > [...] > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job: Cannot run program "qsub": java.io.IOException: > error=2, No such file or directory [...] Qsub is the standard PBS/Torque (queuing system) submit command. You should generally have it in your path. If not, edit your ~/.soft file and make sure the following lines are there: +maui +torque Then save, run "resoft" and try running qsub. Pay attention to errors that may appear when running "resoft". > > Also the Modis folder is still inaccessible so.. so far I could on > test what I've done with the data in the sample folder only. That directory is a symbolic link. If you follow it, you'll see that it points to /gpfs/pads/projects/see/data/raw/mcd12q1/2002/lct1. Looking at /etc/fstab, it appears that /gpfs/pads is a mount point, and it doesn't seem to be mounted. I would send an email to support at ci.uchicago.edu asking for things to be restored to their proper state (e.g. "please mount /gpfs/pads on the pads login nodes"). From hategan at mcs.anl.gov Mon Jan 18 19:54:17 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 18 Jan 2010 19:54:17 -0600 Subject: [Swift-user] could not initialized shared directory on pbs In-Reply-To: <1263865459.10474.12.camel@localhost> References: <110a6b261001181458x59af150bqca221c6237993a9e@mail.gmail.com> <1263856908.9254.2.camel@localhost> <110a6b261001181553g4a4e1113re5ca30a0ec164897@mail.gmail.com> <1263865459.10474.12.camel@localhost> Message-ID: <1263866057.11003.1.camel@localhost> On Mon, 2010-01-18 at 19:44 -0600, Mihael Hategan wrote: > > > > Also the Modis folder is still inaccessible so.. so far I could on > > test what I've done with the data in the sample folder only. > > That directory is a symbolic link. 
If you follow it, you'll see that it > points to /gpfs/pads/projects/see/data/raw/mcd12q1/2002/lct1. Looking > at /etc/fstab, it appears that /gpfs/pads is a mount point, and it > doesn't seem to be mounted. > > I would send an email to support at ci.uchicago.edu asking for things to be > restored to their proper state (e.g. "please mount /gpfs/pads on the > pads login nodes"). > Scrap that. Use the support address that the assignment page mentions (pads-support at ci.uchicago.edu). Mihael From fedorov at bwh.harvard.edu Tue Jan 19 09:38:45 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Tue, 19 Jan 2010 10:38:45 -0500 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <1256055802.24685.13.camel@localhost> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> Message-ID: <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> Hi Mihael, I've been playing with this following your suggestions, but I can't get it to work. Here's my site description: /u/ac/fedorov/scratch-global/scratch 2.55 10000 10 false 0.1 2 10 My maxWalltime for the job is 2, and I have 100 of them. When I run the script, I see one job in the queue, with 10 nodes and 22 minutes walltime. However, when the script is executing, it appears the jobs are being scheduled one at a time. I have the current checkout of the cog/swift trunk: Swift svn swift-r3202 cog-r2682. I attach the coaster.log file for your reference. Can you help me understand what I am doing wrong? Also, I was trying to look in the code that does allocation, and it seems that the code responsible for determining the block size for allocation is in modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockQueueProcessor.java. Is this correct? And what is the piece of code that decides how to schedule jobs within the allocated block? I would appreciate any help. Thank you. -- Andriy Fedorov, Ph.D. Research Fellow Brigham and Women's Hospital Harvard Medical School 75 Francis Street Boston, MA 02115 USA fedorov at bwh.harvard.edu On Tue, Oct 20, 2009 at 11:23, Mihael Hategan wrote: > On Tue, 2009-10-20 at 12:04 -0400, Andriy Fedorov wrote: >> On Tue, Oct 20, 2009 at 11:55, Mihael Hategan wrote: >> > You need a more recent version of the code. >> > >> >> Mihael, I actually updated svn for both cog and swift yesterday prior >> to running the tests. Here's what swift reports I have right now: >> >> Swift svn swift-r3170 cog-r2529 > > Given that even when you have granularity=10 you still see 2 jobs, I > suspect you are using swift site throttling parameters that force that. > I would set the jobThrottle higher and possibly the initial score > higher. > > For troubleshooting, what you could do is, on the remote side, say cat > ~/.globus/coasters/coasters.log|grep "BlockQueueProcessor">bqp.log and > post that. Also, you could set the remoteMonitorEnabled profile to > "true" to get visual feedback of what's happening. > > The allocation time is 18 minutes because the new stuff doesn't > overallocate using a fixed multiplier (though you can force it to do > so). For small jobs (walltime = 1s) the multiplier is set by > lowOverallocation (10.0 by default) while for large jobs (walltime -> > +inf) the multiplier is 1, with an exponential decay in-between. 
> > If you want to always have blocks being 10 times the job walltime, you > can set highOverallocation to 10. > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: bqp.log Type: text/x-log Size: 369066 bytes Desc: not available URL: From hategan at mcs.anl.gov Tue Jan 19 11:43:26 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 11:43:26 -0600 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> Message-ID: <1263923006.18474.8.camel@localhost> On Tue, 2010-01-19 at 10:38 -0500, Andriy Fedorov wrote: > Hi Mihael, > > I've been playing with this following your suggestions, but I can't > get it to work. > > Here's my site description: > > > > url="grid-abe.ncsa.teragrid.org"/> > /u/ac/fedorov/scratch-global/scratch > 2.55 > 10000 > 10 > false > 0.1 > 2 > 10 > > > My maxWalltime for the job is 2, and I have 100 of them. When I run > the script, I see one job in the queue, with 10 nodes and 22 minutes > walltime. However, when the script is executing, it appears the jobs > are being scheduled one at a time. I have the current checkout of the > cog/swift trunk: Swift svn swift-r3202 cog-r2682. I attach the > coaster.log file for your reference. It doesn't look like Swift is sending more than one job at a time. It may be helpful to understand what the swift part is doing (i.e. swift log, the swift script, etc.). > > Can you help me understand what I am doing wrong? > > Also, I was trying to look in the code that does allocation, and it > seems that the code responsible for determining the block size for > allocation is in > modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockQueueProcessor.java. > Is this correct? And what is the piece of code that decides how to > schedule jobs within the allocated block? Each worker will slurp jobs fitting (walltime < worker_remaining_walltime) from the coaster queue if that's not empty. So there isn't much in the way of scheduling at that point. From fedorov at bwh.harvard.edu Tue Jan 19 12:01:15 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Tue, 19 Jan 2010 13:01:15 -0500 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <1263923006.18474.8.camel@localhost> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> <1263923006.18474.8.camel@localhost> Message-ID: <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> Mihael, The script is very simple: iterate cnt { doStuff } until (cnt<100); I thought this is a parallel construct. Was I wrong? -- Andriy Fedorov, Ph.D. Research Fellow Brigham and Women's Hospital Harvard Medical School 75 Francis Street Boston, MA 02115 USA fedorov at bwh.harvard.edu On Tue, Jan 19, 2010 at 12:43, Mihael Hategan wrote: > On Tue, 2010-01-19 at 10:38 -0500, Andriy Fedorov wrote: >> Hi Mihael, >> >> I've been playing with this following your suggestions, but I can't >> get it to work. 
>> >> Here's my site description: >> >> >> ? >> ? > url="grid-abe.ncsa.teragrid.org"/> >> ? /u/ac/fedorov/scratch-global/scratch >> ? 2.55 >> ? 10000 >> ? 10 >> ? false >> ? 0.1 >> ? 2 >> ? 10 >> >> >> My maxWalltime for the job is 2, and I have 100 of them. When I run >> the script, I see one job in the queue, with 10 nodes and 22 minutes >> walltime. However, when the script is executing, it appears the jobs >> are being scheduled one at a time. I have the current checkout of the >> cog/swift trunk: Swift svn swift-r3202 cog-r2682. I attach the >> coaster.log file for your reference. > > It doesn't look like Swift is sending more than one job at a time. It > may be helpful to understand what the swift part is doing (i.e. swift > log, the swift script, etc.). > >> >> Can you help me understand what I am doing wrong? >> >> Also, I was trying to look in the code that does allocation, and it >> seems that the code responsible for determining the block size for >> allocation is in >> modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockQueueProcessor.java. >> Is this correct? And what is the piece of code that decides how to >> schedule jobs within the allocated block? > > Each worker will slurp jobs fitting (walltime < > worker_remaining_walltime) from the coaster queue if that's not empty. > So there isn't much in the way of scheduling at that point. > > > From wilde at mcs.anl.gov Tue Jan 19 12:22:34 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 12:22:34 -0600 Subject: [Swift-user] Problem staging out from PBS? Message-ID: <4B55F86A.9080102@mcs.anl.gov> I'm getting the messages below in email from PBS on pads.ci.uchicago.edu. Are the messages: "Unable to copy file 908.svc.pads.ci.uchicago.edu.OU to wilde at login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5898136727049685871.submit.stdout" due to Swift errors (ie, somehow PBS cant write to my scripts/ directory) or due to some problem in PBS? The prior PBS error of being unable to write to /gpfs/scratch seems to have gone away. - Mike -------- Original Message -------- Subject: PBS JOB 908.svc.pads.ci.uchicago.edu Date: Tue, 19 Jan 2010 12:16:37 -0600 (CST) From: adm at ci.uchicago.edu (root) To: wilde at ci.uchicago.edu PBS Job Id: 908.svc.pads.ci.uchicago.edu Job Name: null Exec host: c19.pads.ci.uchicago.edu/0 An error has occurred processing your job, see below. 
request to copy stageout files failed on node 'c19.pads.ci.uchicago.edu/0' for job 908.svc.pads.ci.uchicago.edu Unable to copy file 908.svc.pads.ci.uchicago.edu.OU to wilde at login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5898136727049685871.submit.stdout >>> error from copy LD_LIBRARY_PATH= >>> end error output Unable to copy file 908.svc.pads.ci.uchicago.edu.ER to wilde at login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5898136727049685871.submit.stderr >>> error from copy LD_LIBRARY_PATH= >>> end error output From fedorov at bwh.harvard.edu Tue Jan 19 12:46:58 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Tue, 19 Jan 2010 13:46:58 -0500 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> <1263923006.18474.8.camel@localhost> <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> Message-ID: <82f536811001191046y6f5e2509l5963be8229380b8c@mail.gmail.com> On Tue, Jan 19, 2010 at 13:01, Andriy Fedorov wrote: > Mihael, > > The script is very simple: > > iterate cnt { > ?doStuff > } until (cnt<100); > > I thought this is a parallel construct. Was I wrong? > Yes, apparently I was wrong. I need this instead: "foreach i in [1:100] { }". Apologize for not trying this before asking for help... > -- > Andriy Fedorov, Ph.D. > > Research Fellow > Brigham and Women's Hospital > Harvard Medical School > 75 Francis Street > Boston, MA 02115 USA > fedorov at bwh.harvard.edu > > > > On Tue, Jan 19, 2010 at 12:43, Mihael Hategan wrote: >> On Tue, 2010-01-19 at 10:38 -0500, Andriy Fedorov wrote: >>> Hi Mihael, >>> >>> I've been playing with this following your suggestions, but I can't >>> get it to work. >>> >>> Here's my site description: >>> >>> >>> ? >>> ? >> url="grid-abe.ncsa.teragrid.org"/> >>> ? /u/ac/fedorov/scratch-global/scratch >>> ? 2.55 >>> ? 10000 >>> ? 10 >>> ? false >>> ? 0.1 >>> ? 2 >>> ? 10 >>> >>> >>> My maxWalltime for the job is 2, and I have 100 of them. When I run >>> the script, I see one job in the queue, with 10 nodes and 22 minutes >>> walltime. However, when the script is executing, it appears the jobs >>> are being scheduled one at a time. I have the current checkout of the >>> cog/swift trunk: Swift svn swift-r3202 cog-r2682. I attach the >>> coaster.log file for your reference. >> >> It doesn't look like Swift is sending more than one job at a time. It >> may be helpful to understand what the swift part is doing (i.e. swift >> log, the swift script, etc.). >> >>> >>> Can you help me understand what I am doing wrong? >>> >>> Also, I was trying to look in the code that does allocation, and it >>> seems that the code responsible for determining the block size for >>> allocation is in >>> modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockQueueProcessor.java. >>> Is this correct? And what is the piece of code that decides how to >>> schedule jobs within the allocated block? >> >> Each worker will slurp jobs fitting (walltime < >> worker_remaining_walltime) from the coaster queue if that's not empty. >> So there isn't much in the way of scheduling at that point. 
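To make the fix Andriy describes concrete, here is a minimal sketch of the foreach form in SwiftScript; the app name, its arguments, and the tc entry it would need are placeholders for illustration, not taken from his actual script:

    type file;

    // "dostuff" is a placeholder executable, assumed to be listed in tc
    app (file o) doStuff (int i) {
      dostuff i stdout=@filename(o);
    }

    foreach i in [1:100] {
      file o;          // left unmapped; Swift chooses a temporary name
      o = doStuff(i);
    }

Unlike iterate, whose iterations are chained one after another, the iterations of a foreach have no dependencies on each other, so all 100 doStuff calls become runnable at once and can fill the coaster block in parallel.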
>> >> >> > From hategan at mcs.anl.gov Tue Jan 19 13:01:57 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:01:57 -0600 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> <1263923006.18474.8.camel@localhost> <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> Message-ID: <1263927717.21382.19.camel@localhost> On Tue, 2010-01-19 at 13:01 -0500, Andriy Fedorov wrote: > Mihael, > > The script is very simple: > > iterate cnt { > doStuff > } until (cnt<100); > > I thought this is a parallel construct. Was I wrong? Short answer: yes. Long answer: I (and judging by the existence if iterate also "we") don't understand clearly why the normal parallel foreach couldn't be used to implement convergence conditions.The plain solution: a[0] = initialValue; foreach v, k in a { if (a[k] < epsilon) { a[k + 1] = f(a[k]); } else { //nothing } } should work in my view, since it expresses a convergence problem correctly, without an explicitly sequential operation. But I suppose that posed some implementation difficulties that were deemed not worth solving, hence "iterate". I suspect the problem was that of detecting that certain branches of an if can lead to no more data being put in the array, which seems difficult to analyze at compile-time or figure out at run-time. Mihael From hategan at mcs.anl.gov Tue Jan 19 13:04:15 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:04:15 -0600 Subject: [Swift-user] Tuning parameters of coaster execution In-Reply-To: <82f536811001191046y6f5e2509l5963be8229380b8c@mail.gmail.com> References: <82f536810910192035o1eaf761chfff2e006e31fb51a@mail.gmail.com> <1256054147.22279.18.camel@localhost> <82f536810910200904x584d8ca3m2da7fab8dc660b1d@mail.gmail.com> <1256055802.24685.13.camel@localhost> <82f536811001190738u10442036k1e2b24116cb54fbc@mail.gmail.com> <1263923006.18474.8.camel@localhost> <82f536811001191001t33bdea78sb76556f04ec68b2e@mail.gmail.com> <82f536811001191046y6f5e2509l5963be8229380b8c@mail.gmail.com> Message-ID: <1263927855.21382.22.camel@localhost> On Tue, 2010-01-19 at 13:46 -0500, Andriy Fedorov wrote: > On Tue, Jan 19, 2010 at 13:01, Andriy Fedorov wrote: > > Mihael, > > > > The script is very simple: > > > > iterate cnt { > > doStuff > > } until (cnt<100); > > > > I thought this is a parallel construct. Was I wrong? > > > > Yes, apparently I was wrong. I need this instead: "foreach i in > [1:100] { }". Apologize for not trying this before asking for help... > That's ok. Why "iterate" is there and what it's doing always seems like a good question, and maybe if people keep asking, I/we'll have a good answer. From wilde at mcs.anl.gov Tue Jan 19 13:26:36 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 13:26:36 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism Message-ID: <4B56076C.6090507@mcs.anl.gov> Im running a script on PADS that emits 20 jobs in parallel with a foreach(). I set coasters to use 8 workers per node, and my throttle to allow 64 jobs to run in parallel, so I would expect *at least* 8 jobs to be running in parallel. 
But what I see is: - 3 PBS worker jobs start - 2 of these have a single core (c19/0 and c19/1) - 1 of these has 18 *nodes* - all 20 jobs show up as submitted or active, but never more than *3* active (note that 1 job is a setup job ad completes right away). Below is info on this run. Any idea why coaster provider is behaving this way? - Mike pool entry is: 00:05:00 1800 8 .63 10000 $rundir Running on login2, I see: /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs Running from host with compute-node reachable address of 172.5.86.6 Running in /home/wilde/protests/run.loops.1498 protlib2 home is /home/wilde/protlib2 Swift svn swift-r3202 cog-r2682 RunID: 20100119-1309-l72sbpg8 Progress: Progress: Checking status:1 Progress: Selecting site:18 Initializing site shared directory:1 Stage in:1 Finished successfully:1 Progress: Submitting:19 Submitted:1 Finished successfully:1 Progress: Submitted:19 Active:1 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:3 Finished successfully:1 Progress: Submitted:17 Active:2 Checking status:1 Finished successfully:1 Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2 Progress: Submitted:15 Active:3 Finished successfully:3 ...and this keeps up - the script is progressing but only 3 jobs are running at a time. (Each takes about 5 minutes) PBS shows: login2$ qstat -n svc.pads.ci.uchicago.edu: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 912.svc.pads.ci. wilde extended null 14877 1 -- -- 00:29 R -- c19 913.svc.pads.ci. wilde extended null -- 18 -- -- 00:29 R -- c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40 914.svc.pads.ci. 
wilde extended null 15135 1 -- -- 00:29 R -- c19 login2$ qstat -f Job Id: 912.svc.pads.ci.uchicago.edu Job_Name = null Job_Owner = wilde at login2.pads.ci.uchicago.edu resources_used.cput = 00:00:58 resources_used.mem = 165768kb resources_used.vmem = 757612kb resources_used.walltime = 00:01:14 job_state = R queue = extended server = svc.pads.ci.uchicago.edu Checkpoint = u ctime = Tue Jan 19 13:09:16 2010 Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58 66754363410172037.submit.stderr exec_host = c19.pads.ci.uchicago.edu/0 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = n mtime = Tue Jan 19 13:09:18 2010 Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 866754363410172037.submit.stdout Priority = 0 qtime = Tue Jan 19 13:09:16 2010 Rerunable = True Resource_List.nodect = 1 Resource_List.nodes = 1 Resource_List.walltime = 00:29:00 session_id = 14877 Shell_Path_List = /bin/sh Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, PBS_SERVER=login2.pads.ci.uchicago.edu, PBS_O_HOST=login2.pads.ci.uchicago.edu, PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, PBS_O_QUEUE=extended etime = Tue Jan 19 13:09:16 2010 submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit start_time = Tue Jan 19 13:09:17 2010 start_count = 1 Job Id: 913.svc.pads.ci.uchicago.edu Job_Name = null Job_Owner = wilde at login2.pads.ci.uchicago.edu resources_used.cput = 00:00:36 resources_used.mem = 166452kb resources_used.vmem = 765732kb resources_used.walltime = 00:00:51 job_state = R queue = extended server = svc.pads.ci.uchicago.edu Checkpoint = u ctime = Tue Jan 19 13:09:16 2010 Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89 90749016166185054.submit.stderr exec_host = c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2 2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed u/0 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = n mtime = Tue Jan 19 13:09:55 2010 Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8 990749016166185054.submit.stdout Priority = 0 qtime = Tue Jan 19 13:09:16 2010 Rerunable = True Resource_List.nodect = 18 Resource_List.nodes = 18 Resource_List.walltime = 00:29:00 session_id = 13956 Shell_Path_List = 
/bin/sh Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, PBS_SERVER=login2.pads.ci.uchicago.edu, PBS_O_HOST=login2.pads.ci.uchicago.edu, PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, PBS_O_QUEUE=extended etime = Tue Jan 19 13:09:16 2010 submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit start_time = Tue Jan 19 13:09:18 2010 start_count = 1 Job Id: 914.svc.pads.ci.uchicago.edu Job_Name = null Job_Owner = wilde at login2.pads.ci.uchicago.edu resources_used.cput = 00:00:58 resources_used.mem = 165760kb resources_used.vmem = 757612kb resources_used.walltime = 00:01:11 job_state = R queue = extended server = svc.pads.ci.uchicago.edu Checkpoint = u ctime = Tue Jan 19 13:09:18 2010 Error_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54 46269528052212820.submit.stderr exec_host = c19.pads.ci.uchicago.edu/1 Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = n mtime = Tue Jan 19 13:09:20 2010 Output_Path = login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 446269528052212820.submit.stdout Priority = 0 qtime = Tue Jan 19 13:09:18 2010 Rerunable = True Resource_List.nodect = 1 Resource_List.nodes = 1 Resource_List.walltime = 00:29:00 session_id = 15135 Shell_Path_List = /bin/sh Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 
1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, PBS_SERVER=login2.pads.ci.uchicago.edu, PBS_O_HOST=login2.pads.ci.uchicago.edu, PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, PBS_O_QUEUE=extended etime = Tue Jan 19 13:09:18 2010 submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit start_time = Tue Jan 19 13:09:20 2010 start_count = 1 login2$ ------------------------------------------------------------------------------------------------------- From hategan at mcs.anl.gov Tue Jan 19 13:32:30 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:32:30 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <4B56076C.6090507@mcs.anl.gov> References: <4B56076C.6090507@mcs.anl.gov> Message-ID: <1263929550.22225.5.camel@localhost> Maybe PBS is lying about that 18 node job. The coaster or worker logs on pads/~/.globus/coasters could shed some light on this. On Tue, 2010-01-19 at 13:26 -0600, Michael Wilde wrote: > Im running a script on PADS that emits 20 jobs in parallel with a foreach(). > > I set coasters to use 8 workers per node, and my throttle to allow 64 > jobs to run in parallel, so I would expect *at least* 8 jobs to be > running in parallel. But what I see is: > > - 3 PBS worker jobs start > - 2 of these have a single core (c19/0 and c19/1) > - 1 of these has 18 *nodes* > - all 20 jobs show up as submitted or active, but never more than *3* > active (note that 1 job is a setup job ad completes right away). > > Below is info on this run. > > Any idea why coaster provider is behaving this way? 
> > - Mike > > pool entry is: > > > 00:05:00 > 1800 > > 8 > .63 > 10000 > > $rundir > > > Running on login2, I see: > > /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs > Running from host with compute-node reachable address of 172.5.86.6 > Running in /home/wilde/protests/run.loops.1498 > protlib2 home is /home/wilde/protlib2 > Swift svn swift-r3202 cog-r2682 > > RunID: 20100119-1309-l72sbpg8 > Progress: > Progress: Checking status:1 > Progress: Selecting site:18 Initializing site shared directory:1 > Stage in:1 Finished successfully:1 > Progress: Submitting:19 Submitted:1 Finished successfully:1 > Progress: Submitted:19 Active:1 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:3 Finished successfully:1 > Progress: Submitted:17 Active:2 Checking status:1 Finished > successfully:1 > Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2 > Progress: Submitted:15 Active:3 Finished successfully:3 > > ...and this keeps up - the script is progressing but only 3 jobs are > running at a time. (Each takes about 5 minutes) > > PBS shows: > > login2$ qstat -n > > svc.pads.ci.uchicago.edu: > > Req'd Req'd Elap > Job ID Username Queue Jobname SessID NDS TSK > Memory Time S Time > -------------------- -------- -------- ---------------- ------ ----- --- > ------ ----- - ----- > 912.svc.pads.ci. wilde extended null 14877 1 -- > -- 00:29 R -- > c19 > 913.svc.pads.ci. wilde extended null -- 18 -- > -- 00:29 R -- > c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40 > 914.svc.pads.ci. 
wilde extended null 15135 1 -- > -- 00:29 R -- > c19 > login2$ qstat -f > Job Id: 912.svc.pads.ci.uchicago.edu > Job_Name = null > Job_Owner = wilde at login2.pads.ci.uchicago.edu > resources_used.cput = 00:00:58 > resources_used.mem = 165768kb > resources_used.vmem = 757612kb > resources_used.walltime = 00:01:14 > job_state = R > queue = extended > server = svc.pads.ci.uchicago.edu > Checkpoint = u > ctime = Tue Jan 19 13:09:16 2010 > Error_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58 > 66754363410172037.submit.stderr > exec_host = c19.pads.ci.uchicago.edu/0 > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Tue Jan 19 13:09:18 2010 > Output_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 > 866754363410172037.submit.stdout > Priority = 0 > qtime = Tue Jan 19 13:09:16 2010 > Rerunable = True > Resource_List.nodect = 1 > Resource_List.nodes = 1 > Resource_List.walltime = 00:29:00 > session_id = 14877 > Shell_Path_List = /bin/sh > Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, > > PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s > > oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. > > 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin > > :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar > > e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. > > 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. > > 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma > > ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: > > /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- > > r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ > > swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- > svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, > PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, > PBS_SERVER=login2.pads.ci.uchicago.edu, > PBS_O_HOST=login2.pads.ci.uchicago.edu, > PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, > PBS_O_QUEUE=extended > etime = Tue Jan 19 13:09:16 2010 > submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit > start_time = Tue Jan 19 13:09:17 2010 > start_count = 1 > > Job Id: 913.svc.pads.ci.uchicago.edu > Job_Name = null > Job_Owner = wilde at login2.pads.ci.uchicago.edu > resources_used.cput = 00:00:36 > resources_used.mem = 166452kb > resources_used.vmem = 765732kb > resources_used.walltime = 00:00:51 > job_state = R > queue = extended > server = svc.pads.ci.uchicago.edu > Checkpoint = u > ctime = Tue Jan 19 13:09:16 2010 > Error_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89 > 90749016166185054.submit.stderr > exec_host = > c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads > > .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu > > /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u > > chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2 > > 2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica > > go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad > > s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed > u/0 > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Tue Jan 19 13:09:55 2010 > Output_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8 > 
990749016166185054.submit.stdout > Priority = 0 > qtime = Tue Jan 19 13:09:16 2010 > Rerunable = True > Resource_List.nodect = 18 > Resource_List.nodes = 18 > Resource_List.walltime = 00:29:00 > session_id = 13956 > Shell_Path_List = /bin/sh > Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, > > PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s > > oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. > > 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin > > :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar > > e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. > > 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. > > 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma > > ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: > > /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- > > r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ > > swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- > svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, > PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, > PBS_SERVER=login2.pads.ci.uchicago.edu, > PBS_O_HOST=login2.pads.ci.uchicago.edu, > PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, > PBS_O_QUEUE=extended > etime = Tue Jan 19 13:09:16 2010 > submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit > start_time = Tue Jan 19 13:09:18 2010 > start_count = 1 > > Job Id: 914.svc.pads.ci.uchicago.edu > Job_Name = null > Job_Owner = wilde at login2.pads.ci.uchicago.edu > resources_used.cput = 00:00:58 > resources_used.mem = 165760kb > resources_used.vmem = 757612kb > resources_used.walltime = 00:01:11 > job_state = R > queue = extended > server = svc.pads.ci.uchicago.edu > Checkpoint = u > ctime = Tue Jan 19 13:09:18 2010 > Error_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54 > 46269528052212820.submit.stderr > exec_host = c19.pads.ci.uchicago.edu/1 > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Tue Jan 19 13:09:20 2010 > Output_Path = > login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 > 446269528052212820.submit.stdout > Priority = 0 > qtime = Tue Jan 19 13:09:18 2010 > Rerunable = True > Resource_List.nodect = 1 > Resource_List.nodes = 1 > Resource_List.walltime = 00:29:00 > session_id = 15135 > Shell_Path_List = /bin/sh > Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, > > PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s > > oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. > > 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin > > :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar > > e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. > > 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 
> > 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma > > ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: > > /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- > > r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ > > swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- > svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, > PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, > PBS_SERVER=login2.pads.ci.uchicago.edu, > PBS_O_HOST=login2.pads.ci.uchicago.edu, > PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, > PBS_O_QUEUE=extended > etime = Tue Jan 19 13:09:18 2010 > submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit > start_time = Tue Jan 19 13:09:20 2010 > start_count = 1 > > login2$ > ------------------------------------------------------------------------------------------------------- > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Tue Jan 19 13:38:44 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 13:38:44 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <1263929550.22225.5.camel@localhost> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> Message-ID: <4B560A44.8050002@mcs.anl.gov> On 1/19/10 1:32 PM, Mihael Hategan wrote: > Maybe PBS is lying about that 18 node job. I would be surprised if thats the case. But even if it had *1* node you would think it would run at least 8 jobs in parallel. Im confused why it has started three jobs, two with only one core and one with 18 nodes. But the 18 node job just hit its wall time limit; now coasters seems to have started a 10 node job: login2$ qstat -n svc.pads.ci.uchicago.edu: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 912.svc.pads.ci. wilde extended null 14877 1 -- -- 00:29 R 00:25 c19 915.svc.pads.ci. wilde extended null 9028 1 -- -- 00:29 R -- c38 916.svc.pads.ci. wilde extended null -- 10 -- -- 00:29 R -- c45+c44+c06+c07+c08+c10+c12+c14+c17+c22 login2$ The coaster or worker logs on > pads/~/.globus/coasters could shed some light on this. I'll look and make these readable by you. - Mike > On Tue, 2010-01-19 at 13:26 -0600, Michael Wilde wrote: >> Im running a script on PADS that emits 20 jobs in parallel with a foreach(). >> >> I set coasters to use 8 workers per node, and my throttle to allow 64 >> jobs to run in parallel, so I would expect *at least* 8 jobs to be >> running in parallel. But what I see is: >> >> - 3 PBS worker jobs start >> - 2 of these have a single core (c19/0 and c19/1) >> - 1 of these has 18 *nodes* >> - all 20 jobs show up as submitted or active, but never more than *3* >> active (note that 1 job is a setup job ad completes right away). >> >> Below is info on this run. >> >> Any idea why coaster provider is behaving this way? 
>> >> - Mike >> >> pool entry is: >> >> >> 00:05:00 >> 1800 >> >> 8 >> .63 >> 10000 >> >> $rundir >> >> >> Running on login2, I see: >> >> /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs >> Running from host with compute-node reachable address of 172.5.86.6 >> Running in /home/wilde/protests/run.loops.1498 >> protlib2 home is /home/wilde/protlib2 >> Swift svn swift-r3202 cog-r2682 >> >> RunID: 20100119-1309-l72sbpg8 >> Progress: >> Progress: Checking status:1 >> Progress: Selecting site:18 Initializing site shared directory:1 >> Stage in:1 Finished successfully:1 >> Progress: Submitting:19 Submitted:1 Finished successfully:1 >> Progress: Submitted:19 Active:1 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:3 Finished successfully:1 >> Progress: Submitted:17 Active:2 Checking status:1 Finished >> successfully:1 >> Progress: Submitted:15 Active:3 Stage out:1 Finished successfully:2 >> Progress: Submitted:15 Active:3 Finished successfully:3 >> >> ...and this keeps up - the script is progressing but only 3 jobs are >> running at a time. (Each takes about 5 minutes) >> >> PBS shows: >> >> login2$ qstat -n >> >> svc.pads.ci.uchicago.edu: >> >> Req'd Req'd Elap >> Job ID Username Queue Jobname SessID NDS TSK >> Memory Time S Time >> -------------------- -------- -------- ---------------- ------ ----- --- >> ------ ----- - ----- >> 912.svc.pads.ci. wilde extended null 14877 1 -- >> -- 00:29 R -- >> c19 >> 913.svc.pads.ci. wilde extended null -- 18 -- >> -- 00:29 R -- >> c46+c45+c44+c06+c07+c08+c10+c12+c14+c17+c22+c24+c28+c34+c35+c37+c39+c40 >> 914.svc.pads.ci. 
wilde extended null 15135 1 -- >> -- 00:29 R -- >> c19 >> login2$ qstat -f >> Job Id: 912.svc.pads.ci.uchicago.edu >> Job_Name = null >> Job_Owner = wilde at login2.pads.ci.uchicago.edu >> resources_used.cput = 00:00:58 >> resources_used.mem = 165768kb >> resources_used.vmem = 757612kb >> resources_used.walltime = 00:01:14 >> job_state = R >> queue = extended >> server = svc.pads.ci.uchicago.edu >> Checkpoint = u >> ctime = Tue Jan 19 13:09:16 2010 >> Error_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS58 >> 66754363410172037.submit.stderr >> exec_host = c19.pads.ci.uchicago.edu/0 >> Hold_Types = n >> Join_Path = n >> Keep_Files = n >> Mail_Points = n >> mtime = Tue Jan 19 13:09:18 2010 >> Output_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 >> 866754363410172037.submit.stdout >> Priority = 0 >> qtime = Tue Jan 19 13:09:16 2010 >> Rerunable = True >> Resource_List.nodect = 1 >> Resource_List.nodes = 1 >> Resource_List.walltime = 00:29:00 >> session_id = 14877 >> Shell_Path_List = /bin/sh >> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, >> >> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s >> >> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. >> >> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin >> >> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar >> >> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. >> >> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. >> >> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma >> >> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: >> >> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- >> >> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ >> >> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- >> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, >> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, >> PBS_SERVER=login2.pads.ci.uchicago.edu, >> PBS_O_HOST=login2.pads.ci.uchicago.edu, >> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, >> PBS_O_QUEUE=extended >> etime = Tue Jan 19 13:09:16 2010 >> submit_args = /home/wilde/.globus/scripts/PBS5866754363410172037.submit >> start_time = Tue Jan 19 13:09:17 2010 >> start_count = 1 >> >> Job Id: 913.svc.pads.ci.uchicago.edu >> Job_Name = null >> Job_Owner = wilde at login2.pads.ci.uchicago.edu >> resources_used.cput = 00:00:36 >> resources_used.mem = 166452kb >> resources_used.vmem = 765732kb >> resources_used.walltime = 00:00:51 >> job_state = R >> queue = extended >> server = svc.pads.ci.uchicago.edu >> Checkpoint = u >> ctime = Tue Jan 19 13:09:16 2010 >> Error_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS89 >> 90749016166185054.submit.stderr >> exec_host = >> c46.pads.ci.uchicago.edu/0+c45.pads.ci.uchicago.edu/0+c44.pads >> >> .ci.uchicago.edu/0+c06.pads.ci.uchicago.edu/0+c07.pads.ci.uchicago.edu >> >> /0+c08.pads.ci.uchicago.edu/0+c10.pads.ci.uchicago.edu/0+c12.pads.ci.u >> >> chicago.edu/0+c14.pads.ci.uchicago.edu/0+c17.pads.ci.uchicago.edu/0+c2 >> >> 2.pads.ci.uchicago.edu/0+c24.pads.ci.uchicago.edu/0+c28.pads.ci.uchica >> >> go.edu/0+c34.pads.ci.uchicago.edu/0+c35.pads.ci.uchicago.edu/0+c37.pad >> >> s.ci.uchicago.edu/0+c39.pads.ci.uchicago.edu/0+c40.pads.ci.uchicago.ed >> u/0 >> Hold_Types = n >> Join_Path = n >> Keep_Files = n >> Mail_Points = n >> mtime = Tue 
Jan 19 13:09:55 2010 >> Output_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS8 >> 990749016166185054.submit.stdout >> Priority = 0 >> qtime = Tue Jan 19 13:09:16 2010 >> Rerunable = True >> Resource_List.nodect = 18 >> Resource_List.nodes = 18 >> Resource_List.walltime = 00:29:00 >> session_id = 13956 >> Shell_Path_List = /bin/sh >> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, >> >> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s >> >> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. >> >> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin >> >> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar >> >> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. >> >> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. >> >> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma >> >> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: >> >> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- >> >> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ >> >> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- >> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, >> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, >> PBS_SERVER=login2.pads.ci.uchicago.edu, >> PBS_O_HOST=login2.pads.ci.uchicago.edu, >> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, >> PBS_O_QUEUE=extended >> etime = Tue Jan 19 13:09:16 2010 >> submit_args = /home/wilde/.globus/scripts/PBS8990749016166185054.submit >> start_time = Tue Jan 19 13:09:18 2010 >> start_count = 1 >> >> Job Id: 914.svc.pads.ci.uchicago.edu >> Job_Name = null >> Job_Owner = wilde at login2.pads.ci.uchicago.edu >> resources_used.cput = 00:00:58 >> resources_used.mem = 165760kb >> resources_used.vmem = 757612kb >> resources_used.walltime = 00:01:11 >> job_state = R >> queue = extended >> server = svc.pads.ci.uchicago.edu >> Checkpoint = u >> ctime = Tue Jan 19 13:09:18 2010 >> Error_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS54 >> 46269528052212820.submit.stderr >> exec_host = c19.pads.ci.uchicago.edu/1 >> Hold_Types = n >> Join_Path = n >> Keep_Files = n >> Mail_Points = n >> mtime = Tue Jan 19 13:09:20 2010 >> Output_Path = >> login2.pads.ci.uchicago.edu:/home/wilde/.globus/scripts/PBS5 >> 446269528052212820.submit.stdout >> Priority = 0 >> qtime = Tue Jan 19 13:09:18 2010 >> Rerunable = True >> Resource_List.nodect = 1 >> Resource_List.nodes = 1 >> Resource_List.walltime = 00:29:00 >> session_id = 15135 >> Shell_Path_List = /bin/sh >> Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LOGNAME=wilde, >> >> PBS_O_PATH=/soft/apache-ant-1.7.1-r1/bin:/soft/python-2.6.1-r1/bin:/s >> >> oft/swig-1.3.38-r1/bin:/soft/python-rdflib-2.4.0-r1/bin:/soft/pyxml-0. >> >> 8.4-r1/bin:/soft/python-zsi-2.1a1-r1/bin:/soft/python-sip-4.7.9-r1/bin >> >> :/soft/python-setuptools-0.6c9-r1/bin:/soft/matlab-7.7-r1/bin:/softwar >> >> e/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/java-1. >> >> 6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/globus-4.2. 
>> >> 1-r2/bin:/soft/globus-4.2.1-r2/sbin:/soft/torque-2.3.6-r1/bin:/soft/ma >> >> ui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/usr/kerberos/bin:/bin: >> >> /usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0- >> >> r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/home/wilde/ >> >> swift/tools:/home/wilde/swift/src/stable/cog/modules/swift/dist/swift- >> svn/bin:/home/wilde/blast/ncbi/bin:/home/wilde/protlib2/bin, >> PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash, >> PBS_SERVER=login2.pads.ci.uchicago.edu, >> PBS_O_HOST=login2.pads.ci.uchicago.edu, >> PBS_O_WORKDIR=/home/wilde/protests/run.loops.1498, >> PBS_O_QUEUE=extended >> etime = Tue Jan 19 13:09:18 2010 >> submit_args = /home/wilde/.globus/scripts/PBS5446269528052212820.submit >> start_time = Tue Jan 19 13:09:20 2010 >> start_count = 1 >> >> login2$ >> ------------------------------------------------------------------------------------------------------- >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From hategan at mcs.anl.gov Tue Jan 19 13:44:06 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:44:06 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <4B560A44.8050002@mcs.anl.gov> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> Message-ID: <1263930246.22837.3.camel@localhost> On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote: > > On 1/19/10 1:32 PM, Mihael Hategan wrote: > > Maybe PBS is lying about that 18 node job. > > I would be surprised if thats the case. But even if it had *1* node you > would think it would run at least 8 jobs in parallel. I see. Though not with your current setup. You should use "workersPerNode" instead of "coastersPerNode". > > Im confused why it has started three jobs, two with only one core and > one with 18 nodes. It does that. It spreads out the block sizes to exploit non-linearities in queuing times. > > But the 18 node job just hit its wall time limit; now coasters seems to > have started a 10 node job: Don't know about that. Logs please. From wilde at mcs.anl.gov Tue Jan 19 13:49:02 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 13:49:02 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <1263930246.22837.3.camel@localhost> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> Message-ID: <4B560CAE.5020203@mcs.anl.gov> On 1/19/10 1:44 PM, Mihael Hategan wrote: > On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote: >> On 1/19/10 1:32 PM, Mihael Hategan wrote: >>> Maybe PBS is lying about that 18 node job. >> I would be surprised if thats the case. But even if it had *1* node you >> would think it would run at least 8 jobs in parallel. > > I see. Though not with your current setup. You should use > "workersPerNode" instead of "coastersPerNode". Thanks! I'll fix that and try again. This makes more sense now, if its assuming 1 worker per node. Still doesnt explain why its not starting more jobs, since it allocated abundant nodes (even assuming 1 worker per node). > >> Im confused why it has started three jobs, two with only one core and >> one with 18 nodes. > > It does that. 
It spreads out the block sizes to exploit non-linearities > in queuing times. > >> But the 18 node job just hit its wall time limit; now coasters seems to >> have started a 10 node job: > > Don't know about that. Logs please. > Here's the logs from that dir for this run. I dont understand why the coasters.log file in that directory has not been written to since Jan 13. login2$ ls -dt * | head worker-0119-090116-000002.log worker-0114-310129-000005.log worker-0119-090116-000004.log worker-0114-310129-000006.log worker-0119-090116-000003.log worker-0114-310129-000007.log worker-0119-090116-000001.log worker-0114-310129-000008.log worker-0119-090116-000000.log worker-0114-310129-000009.log cscript7310283766853084762.pl worker-0114-310129-000000.log worker-0119-491225-000001.log worker-0114-110123-000004.log worker-0119-491225-000000.log worker-0114-110123-000002.log worker-0119-151225-000001.log worker-0114-110123-000003.log worker-0119-151225-000000.log worker-0114-110123-000000.log login2$ ls -1dt * | head worker-0119-090116-000002.log worker-0119-090116-000004.log worker-0119-090116-000003.log worker-0119-090116-000001.log worker-0119-090116-000000.log cscript7310283766853084762.pl worker-0119-491225-000001.log worker-0119-491225-000000.log worker-0119-151225-000001.log worker-0119-151225-000000.log login2$ more *0119-090116* :::::::::::::: worker-0119-090116-000000.log :::::::::::::: 1263928159 0119-090116-000000 Logging started 1263928159 INFO - Running on node c19.pads.ci.uchicago.edu 1263928159 INFO 000000 Registration successful. ID=000000 :::::::::::::: worker-0119-090116-000001.log :::::::::::::: 1263928159 0119-090116-000001 Logging started 1263928159 INFO - Running on node c46.pads.ci.uchicago.edu 1263928160 INFO 000000 Registration successful. ID=000000 :::::::::::::: worker-0119-090116-000002.log :::::::::::::: 1263928160 0119-090116-000002 Logging started 1263928161 INFO - Running on node c19.pads.ci.uchicago.edu 1263928161 INFO 000000 Registration successful. ID=000000 1263929738 INFO 000000 Acknowledged shutdown. Exiting 1263929738 INFO 000000 Ran a total of 3 jobs 1263929738 INFO - All sub-processes finished. Exiting. :::::::::::::: worker-0119-090116-000003.log :::::::::::::: 1263929733 0119-090116-000003 Logging started 1263929733 INFO - Running on node c38.pads.ci.uchicago.edu 1263929733 INFO 000000 Registration successful. ID=000000 :::::::::::::: worker-0119-090116-000004.log :::::::::::::: 1263929734 0119-090116-000004 Logging started 1263929734 INFO - Running on node c45.pads.ci.uchicago.edu 1263929734 INFO 000000 Registration successful. ID=000000 login2$ From hategan at mcs.anl.gov Tue Jan 19 13:55:33 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 13:55:33 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <4B560CAE.5020203@mcs.anl.gov> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> <4B560CAE.5020203@mcs.anl.gov> Message-ID: <1263930933.22837.14.camel@localhost> On Tue, 2010-01-19 at 13:49 -0600, Michael Wilde wrote: > > On 1/19/10 1:44 PM, Mihael Hategan wrote: > > On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote: > >> On 1/19/10 1:32 PM, Mihael Hategan wrote: > >>> Maybe PBS is lying about that 18 node job. > >> I would be surprised if thats the case. But even if it had *1* node you > >> would think it would run at least 8 jobs in parallel. > > > > I see. 
Though not with your current setup. You should use > > "workersPerNode" instead of "coastersPerNode". > > Thanks! I'll fix that and try again. This makes more sense now, if its > assuming 1 worker per node. > > Still doesnt explain why its not starting more jobs, since it allocated > abundant nodes (even assuming 1 worker per node). Trunk or branch? > > > > > >> Im confused why it has started three jobs, two with only one core and > >> one with 18 nodes. > > > > It does that. It spreads out the block sizes to exploit non-linearities > > in queuing times. > > > >> But the 18 node job just hit its wall time limit; now coasters seems to > >> have started a 10 node job: > > > > Don't know about that. Logs please. > > > > Here's the logs from that dir for this run. I dont understand why the > coasters.log file in that directory has not been written to since Jan 13. If you run swift on the head node and the coaster bootstrap provider is "local", then the coaster service runs in the same jvm as swift, and it writes to the same log as swift. > > login2$ more *0119-090116* [...] Seems fine so far. Swift log then. From wilde at mcs.anl.gov Tue Jan 19 14:02:34 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 14:02:34 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <1263930933.22837.14.camel@localhost> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> <4B560CAE.5020203@mcs.anl.gov> <1263930933.22837.14.camel@localhost> Message-ID: <4B560FDA.60707@mcs.anl.gov> On 1/19/10 1:55 PM, Mihael Hategan wrote: > On Tue, 2010-01-19 at 13:49 -0600, Michael Wilde wrote: >> On 1/19/10 1:44 PM, Mihael Hategan wrote: >>> On Tue, 2010-01-19 at 13:38 -0600, Michael Wilde wrote: >>>> On 1/19/10 1:32 PM, Mihael Hategan wrote: >>>>> Maybe PBS is lying about that 18 node job. >>>> I would be surprised if thats the case. But even if it had *1* node you >>>> would think it would run at least 8 jobs in parallel. >>> I see. Though not with your current setup. You should use >>> "workersPerNode" instead of "coastersPerNode". >> Thanks! I'll fix that and try again. This makes more sense now, if its >> assuming 1 worker per node. >> >> Still doesnt explain why its not starting more jobs, since it allocated >> abundant nodes (even assuming 1 worker per node). > > Trunk or branch? Stable branch. > >> >>>> Im confused why it has started three jobs, two with only one core and >>>> one with 18 nodes. >>> It does that. It spreads out the block sizes to exploit non-linearities >>> in queuing times. >>> >>>> But the 18 node job just hit its wall time limit; now coasters seems to >>>> have started a 10 node job: >>> Don't know about that. Logs please. >>> >> Here's the logs from that dir for this run. I dont understand why the >> coasters.log file in that directory has not been written to since Jan 13. > > If you run swift on the head node and the coaster bootstrap provider is > "local", then the coaster service runs in the same jvm as swift, and it > writes to the same log as swift. > >> login2$ more *0119-090116* > > [...] > > Seems fine so far. Swift log then. -rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49 /home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log I killed the run and will retry with workersPerNode corrected; maybe you can see, though, in this log, why the run was limited to only 3 active at once. 
I'll see if the same happens with workersPerNode set. This would be explained if leaving workersPerNode *not* set somehow defaults to 1 worker per *block* (i.e., per PBS job) instead of 1 worker per node. Could that be happening? - Mike From hategan at mcs.anl.gov Tue Jan 19 14:09:36 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 19 Jan 2010 14:09:36 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <4B560FDA.60707@mcs.anl.gov> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> <4B560CAE.5020203@mcs.anl.gov> <1263930933.22837.14.camel@localhost> <4B560FDA.60707@mcs.anl.gov> Message-ID: <1263931776.23720.0.camel@localhost> On Tue, 2010-01-19 at 14:02 -0600, Michael Wilde wrote: > -rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49 > /home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log > > I killed the run and will retry with workersPerNode corrected; maybe you > can see, though, in this log, why the run was limited to only 3 active > at once. > > I'll see if the same happens with workersPerNode set. > > This would be explained if leaving workersPerNode *not* set somehow > defaults to 1 worker per *block* (i.e., per PBS job) instead of 1 worker > per node. Could that be happening? Not intentionally. From wilde at mcs.anl.gov Tue Jan 19 14:23:43 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Jan 2010 14:23:43 -0600 Subject: [Swift-user] Coaster jobs are not running with expected parallelism In-Reply-To: <1263931776.23720.0.camel@localhost> References: <4B56076C.6090507@mcs.anl.gov> <1263929550.22225.5.camel@localhost> <4B560A44.8050002@mcs.anl.gov> <1263930246.22837.3.camel@localhost> <4B560CAE.5020203@mcs.anl.gov> <1263930933.22837.14.camel@localhost> <4B560FDA.60707@mcs.anl.gov> <1263931776.23720.0.camel@localhost> Message-ID: <4B5614CF.4030704@mcs.anl.gov> With workersPerNode = 8, I now see 2 PBS jobs; one has 1 node, one has 3 nodes. Now *16* jobs are active. The pattern seems to be that it's only running workersPerNode app() tasks per PBS job (i.e., per block). I'll see if I can get it to run workersPerNode tasks per *node* with more explicit settings in the sites file. The current job is: /home/wilde/protlib2/bin/run.loops.sh: Executing on site pbs Running from host with compute-node reachable address of 172.5.86.6 Running in /home/wilde/protests/run.loops.5357 protlib2 home is /home/wilde/protlib2 Swift svn swift-r3202 cog-r2682 RunID: 20100119-1414-q09uz2c0 Progress: Progress: Checking status:1 Progress: Selecting site:18 Initializing site shared directory:1 Stage in:1 Finished successfully:1 Progress: Stage in:19 Submitting:1 Finished successfully:1 Progress: Submitted:19 Active:1 Finished successfully:1 Progress: Submitted:11 Active:9 Finished successfully:1 Progress: Submitted:7 Active:13 Finished successfully:1 Progress: Submitted:4 Active:16 Finished successfully:1 Progress: Submitted:4 Active:16 Finished successfully:1 Progress: Submitted:4 Active:16 Finished successfully:1 Progress: Submitted:4 Active:16 Finished successfully:1 PBS says: login2$ qstat -n svc.pads.ci.uchicago.edu: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 917.svc.pads.ci. wilde extended null 16709 1 -- -- 00:29 R 00:04 c19 918.svc.pads.ci.
wilde extended null 15309 3 -- -- 00:29 R 00:04 c46+c45+c44 login2$ Swift log is in: login2$ ls -l $(pwd)/*0.log -rw-r--r-- 1 wilde ci-users 386242 Jan 19 14:21 /home/wilde/protests/run.loops.5357/psim.loops-20100119 4-q09uz2c0.log login2$ On 1/19/10 2:09 PM, Mihael Hategan wrote: > On Tue, 2010-01-19 at 14:02 -0600, Michael Wilde wrote: > >> -rw-r--r-- 1 wilde ci-users 912946 Jan 19 13:49 >> /home/wilde/protests/run.loops.1498/psim.loops-20100119-1309-l72sbpg8.log >> >> I killed the run and will retry with workersPerNode corrected; maybe you >> can see, though, in this log, why the run was limited to only 3 active >> at once. >> >> I'll see if same happens with workersPerNode set. >> >> This would be explained if leaving workersPerNode *not* set somehow >> defaults to 1 worker per *block* (ie per pbs job) instead of 1 worker >> per node. Could that be hapenning? > > Not intentionally. > > From iraicu at cs.uchicago.edu Tue Jan 19 16:49:51 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 19 Jan 2010 16:49:51 -0600 Subject: [Swift-user] CFP: Cloud Futures 2010: Advancing Research with Cloud Computing Message-ID: <4B56370F.3060608@cs.uchicago.edu> Cloud Futures 2010: Advancing Research with Cloud Computing April 8-9, 2010 Redmond, WA Workshop Co-Chairs Dan Reed David A. Patterson Corporate Vice President Professor of Computer Science Extreme Computing Group Reliable Adaptive Distributed Systems Lab Microsoft Research University of California - Berkeley Call for Abstracts Cloud computing is fast becoming the most important platform for research. Scientists today need vast computing resources to collect, share, manipulate, and explore massive data sets as well as to build and deploy new services for research. Cloud computing has the potential to advance research discoveries by making data and computing resources readily available at unprecedented economy of scale and nearly infinite scalability. To realize the full promise of cloud computing for research, however, one must think about the cloud as a holistic platform for creating new services, new experiences, and new methods to pursue research, teaching and scholarly communication. This goal presents a broad range of interesting questions. We invite extended abstracts that illustrate the role of cloud computing across a variety of research and curriculum development areas---including computer science, earth sciences, healthcare, humanities, life sciences, and social sciences---that highlight how new techniques and methods of research in the cloud may solve distinct challenges arising in those diverse areas. Please include a bio (150 words max) and a brief abstract (300 words max) of a 30-minute short talk on a topic that describes practical experiences, experimental results, and vision papers. Please submit your abstract by February 10, 2010 to cloudfut at microsoft.com Invited talks will be announced on February 18, 2010 -- ================================================================= Ioan Raicu, Ph.D. 
NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Wed Jan 20 09:38:51 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 20 Jan 2010 09:38:51 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes Message-ID: <4B57238B.9020806@mcs.anl.gov> Using the sites entry below, I see that coasters is allocating 8 *shared* nods rather than *dedicated* nodes; hence its running many more processes per node than it should, causing the jobs to run longer than expected and exceed their walltime. using this sites entry: 7500 8 12 1 1 1.27 10000 $rundir qstat (below) shows the 12 coaster jobs I requested with "slots=12", but they are only using 2 different nodes, c45 and c46, between them, even though they are running 96 total coaster workers. (I can see that I have 96 jobs active). It seems like between coasters and the PBS provider, Swift is nt telling PBS that each of these jobs should get a dedicated node of 8 cores. Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- 1034.svc.pads.ci wilde extended null 13086 1 -- -- 02:04 R 01:26 c46 1035.svc.pads.ci wilde extended null 13168 1 -- -- 02:04 R 01:26 c46 1036.svc.pads.ci wilde extended null 13387 1 -- -- 02:04 R 01:26 c46 1037.svc.pads.ci wilde extended null 14060 1 -- -- 02:04 R 01:26 c46 1038.svc.pads.ci wilde extended null 14237 1 -- -- 02:04 R 01:26 c46 1039.svc.pads.ci wilde extended null 14640 1 -- -- 02:04 R 01:26 c46 1040.svc.pads.ci wilde extended null 15200 1 -- -- 02:04 R 01:26 c46 1041.svc.pads.ci wilde extended null 15753 1 -- -- 02:04 R 01:26 c46 1042.svc.pads.ci wilde extended null 23700 1 -- -- 02:04 R 01:26 c45 1043.svc.pads.ci wilde extended null 23781 1 -- -- 02:04 R 01:26 c45 1044.svc.pads.ci wilde extended null 24016 1 -- -- 02:04 R 01:26 c45 1045.svc.pads.ci wilde extended null 24796 1 -- -- 02:04 R 01:26 c45 From wilde at mcs.anl.gov Wed Jan 20 09:42:51 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 20 Jan 2010 09:42:51 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes In-Reply-To: <4B57238B.9020806@mcs.anl.gov> References: <4B57238B.9020806@mcs.anl.gov> Message-ID: <4B57247B.1050607@mcs.anl.gov> The log for the run below is in: /home/wilde/protests/run.loops.3231/psim.loops-20100120-0802-tsvkj4e7.log - Mike On 1/20/10 9:38 AM, Michael Wilde wrote: > Using the sites entry below, I see that coasters is allocating 8 > *shared* nods rather than *dedicated* nodes; hence its running many more > processes per node than it should, causing the jobs to run longer than > expected and exceed their walltime. 
> > using this sites entry: > > > > > 7500 > 8 > > 12 > 1 > 1 > > 1.27 > 10000 > > $rundir > > > qstat (below) shows the 12 coaster jobs I requested with "slots=12", but > they are only using 2 different nodes, c45 and c46, between them, even > though they are running 96 total coaster workers. (I can see that I have > 96 jobs active). > > It seems like between coasters and the PBS provider, Swift is nt telling > PBS that each of these jobs should get a dedicated node of 8 cores. > > > Job ID Username Queue Jobname SessID NDS TSK > Memory Time S Time > -------------------- -------- -------- ---------------- ------ ----- --- > ------ ----- - ----- > 1034.svc.pads.ci wilde extended null 13086 1 -- > -- 02:04 R 01:26 > c46 > 1035.svc.pads.ci wilde extended null 13168 1 -- > -- 02:04 R 01:26 > c46 > 1036.svc.pads.ci wilde extended null 13387 1 -- > -- 02:04 R 01:26 > c46 > 1037.svc.pads.ci wilde extended null 14060 1 -- > -- 02:04 R 01:26 > c46 > 1038.svc.pads.ci wilde extended null 14237 1 -- > -- 02:04 R 01:26 > c46 > 1039.svc.pads.ci wilde extended null 14640 1 -- > -- 02:04 R 01:26 > c46 > 1040.svc.pads.ci wilde extended null 15200 1 -- > -- 02:04 R 01:26 > c46 > 1041.svc.pads.ci wilde extended null 15753 1 -- > -- 02:04 R 01:26 > c46 > 1042.svc.pads.ci wilde extended null 23700 1 -- > -- 02:04 R 01:26 > c45 > 1043.svc.pads.ci wilde extended null 23781 1 -- > -- 02:04 R 01:26 > c45 > 1044.svc.pads.ci wilde extended null 24016 1 -- > -- 02:04 R 01:26 > c45 > 1045.svc.pads.ci wilde extended null 24796 1 -- > -- 02:04 R 01:26 > c45 > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Wed Jan 20 10:42:16 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 20 Jan 2010 10:42:16 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes In-Reply-To: <4B57247B.1050607@mcs.anl.gov> References: <4B57238B.9020806@mcs.anl.gov> <4B57247B.1050607@mcs.anl.gov> Message-ID: <4B573268.6010003@mcs.anl.gov> I ran a test of the same entry using a simple foreach/cat script and captured the PBS submit file. It shows: login2$ more logs/PBS1883411659688642512.submit #PBS -S /bin/sh #PBS -N null #PBS -m n #PBS -l nodes=1 #PBS -l walltime=01:10:00 #PBS -o /home/wilde/.globus/scripts/PBS1883411659688642512.submit.stdout #PBS -e /home/wilde/.globus/scripts/PBS1883411659688642512.submit.stderr /usr/bin/perl /home/wilde/.globus/coasters/cscript2151716324069557151.pl http://192.5.86. 6:50003 0120-331021-000010 8 /home/wilde/.globus/coasters /bin/echo $? >/home/wilde/.globus/scripts/PBS1883411659688642512.submit.exitcode login2$ It seems that the line "#PBS -l nodes=1" should be: #PBS -l nodes=1:ppn=8 - Mike On 1/20/10 9:42 AM, Michael Wilde wrote: > The log for the run below is in: > > /home/wilde/protests/run.loops.3231/psim.loops-20100120-0802-tsvkj4e7.log > > - Mike > > On 1/20/10 9:38 AM, Michael Wilde wrote: >> Using the sites entry below, I see that coasters is allocating 8 >> *shared* nods rather than *dedicated* nodes; hence its running many more >> processes per node than it should, causing the jobs to run longer than >> expected and exceed their walltime. 
>> >> using this sites entry: >> >> >> >> >> 7500 >> 8 >> >> 12 >> 1 >> 1 >> >> 1.27 >> 10000 >> >> $rundir >> >> >> qstat (below) shows the 12 coaster jobs I requested with "slots=12", but >> they are only using 2 different nodes, c45 and c46, between them, even >> though they are running 96 total coaster workers. (I can see that I have >> 96 jobs active). >> >> It seems like between coasters and the PBS provider, Swift is nt telling >> PBS that each of these jobs should get a dedicated node of 8 cores. >> >> >> Job ID Username Queue Jobname SessID NDS TSK >> Memory Time S Time >> -------------------- -------- -------- ---------------- ------ ----- --- >> ------ ----- - ----- >> 1034.svc.pads.ci wilde extended null 13086 1 -- >> -- 02:04 R 01:26 >> c46 >> 1035.svc.pads.ci wilde extended null 13168 1 -- >> -- 02:04 R 01:26 >> c46 >> 1036.svc.pads.ci wilde extended null 13387 1 -- >> -- 02:04 R 01:26 >> c46 >> 1037.svc.pads.ci wilde extended null 14060 1 -- >> -- 02:04 R 01:26 >> c46 >> 1038.svc.pads.ci wilde extended null 14237 1 -- >> -- 02:04 R 01:26 >> c46 >> 1039.svc.pads.ci wilde extended null 14640 1 -- >> -- 02:04 R 01:26 >> c46 >> 1040.svc.pads.ci wilde extended null 15200 1 -- >> -- 02:04 R 01:26 >> c46 >> 1041.svc.pads.ci wilde extended null 15753 1 -- >> -- 02:04 R 01:26 >> c46 >> 1042.svc.pads.ci wilde extended null 23700 1 -- >> -- 02:04 R 01:26 >> c45 >> 1043.svc.pads.ci wilde extended null 23781 1 -- >> -- 02:04 R 01:26 >> c45 >> 1044.svc.pads.ci wilde extended null 24016 1 -- >> -- 02:04 R 01:26 >> c45 >> 1045.svc.pads.ci wilde extended null 24796 1 -- >> -- 02:04 R 01:26 >> c45 >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From hategan at mcs.anl.gov Wed Jan 20 11:01:21 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 20 Jan 2010 11:01:21 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes In-Reply-To: <4B57238B.9020806@mcs.anl.gov> References: <4B57238B.9020806@mcs.anl.gov> Message-ID: <1264006881.463.5.camel@localhost> On Wed, 2010-01-20 at 09:38 -0600, Michael Wilde wrote: > Using the sites entry below, I see that coasters is allocating 8 > *shared* nods rather than *dedicated* nodes; hence its running many more > processes per node than it should, causing the jobs to run longer than > expected and exceed their walltime. Right. It looks like the pbs provider uses nodes= and doesn't mess with ppn=, which means it allocate nodes as defined by the local policy (which may mean cores instead of nodes). I suggest setting workersPerNode to 1, but then you run into the previous problem, which I'm trying to fix now and for which I have an open ticket with PADS support. 
From hategan at mcs.anl.gov Wed Jan 20 15:02:19 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 20 Jan 2010 15:02:19 -0600 Subject: [Swift-user] Coaster provider is not allocating dedicated nodes In-Reply-To: <1264006881.463.5.camel@localhost> References: <4B57238B.9020806@mcs.anl.gov> <1264006881.463.5.camel@localhost> Message-ID: <1264021339.8319.0.camel@localhost> On Wed, 2010-01-20 at 11:01 -0600, Mihael Hategan wrote: > On Wed, 2010-01-20 at 09:38 -0600, Michael Wilde wrote: > > Using the sites entry below, I see that coasters is allocating 8 > > *shared* nods rather than *dedicated* nodes; hence its running many more > > processes per node than it should, causing the jobs to run longer than > > expected and exceed their walltime. > > Right. It looks like the pbs provider uses nodes= and doesn't mess with > ppn=, which means it allocate nodes as defined by the local policy > (which may mean cores instead of nodes). > > I suggest setting workersPerNode to 1, but then you run into the > previous problem, which I'm trying to fix now and for which I have an > open ticket with PADS support. > The ssh problem on PADS was fixed and I committed a patch to the branch to start multiple instances of the app (cog r2683). Mihael From sardjito.antonius at gmail.com Wed Jan 20 21:45:18 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Wed, 20 Jan 2010 21:45:18 -0600 Subject: [Swift-user] qsub problem Message-ID: <110a6b261001201945u3ca19089p7c44644533786df7@mail.gmail.com> Hi, I am still having trouble executing task on PADS, the problem is still qsub. I tried adding +maui and +torque also tried it with the @ symbol instead but I continued to get "cannot run program:qsub error" -Antonius -------------- next part -------------- An HTML attachment was scrubbed... URL: From sardjito.antonius at gmail.com Wed Jan 20 22:00:33 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Wed, 20 Jan 2010 22:00:33 -0600 Subject: [Swift-user] Re: qsub problem In-Reply-To: <110a6b261001201945u3ca19089p7c44644533786df7@mail.gmail.com> References: <110a6b261001201945u3ca19089p7c44644533786df7@mail.gmail.com> Message-ID: <110a6b261001202000y30c8f3a8ybff8b283080fa72@mail.gmail.com> Actually never mind my question, I got it to work now. I downloaded the program (the second to newest version) and use the command in the README.configure. -Antonius On Wed, Jan 20, 2010 at 9:45 PM, Antonius Sardjito < sardjito.antonius at gmail.com> wrote: > Hi, > > I am still having trouble executing task on PADS, the problem is still > qsub. I tried adding +maui and +torque also tried it with the @ symbol > instead but I continued to get "cannot run program:qsub error" > > -Antonius > -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at anl.gov Thu Jan 21 05:20:01 2010 From: foster at anl.gov (Ian Foster) Date: Thu, 21 Jan 2010 05:20:01 -0600 Subject: [Swift-user] Re: [Cmsc34900] qsub problem In-Reply-To: <110a6b261001202000y30c8f3a8ybff8b283080fa72@mail.gmail.com> References: <110a6b261001201945u3ca19089p7c44644533786df7@mail.gmail.com> <110a6b261001202000y30c8f3a8ybff8b283080fa72@mail.gmail.com> Message-ID: <06ECAC42-2D54-4F88-B2A3-80A1396D4F62@anl.gov> Antonius: Thanks for the update. I'm glad it is working. Ian. On Jan 20, 2010, at 10:00 PM, Antonius Sardjito wrote: > Actually never mind my question, I got it to work now. I downloaded the program (the second to newest version) and use the command in the README.configure. 
> > -Antonius > On Wed, Jan 20, 2010 at 9:45 PM, Antonius Sardjito wrote: > Hi, > > I am still having trouble executing task on PADS, the problem is still qsub. I tried adding +maui and +torque also tried it with the @ symbol instead but I continued to get "cannot run program:qsub error" > > -Antonius > > _______________________________________________ > Cmsc34900 mailing list > Cmsc34900 at mailman.cs.uchicago.edu > https://mailman.cs.uchicago.edu/mailman/listinfo/cmsc34900 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sardjito.antonius at gmail.com Thu Jan 21 23:37:05 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Thu, 21 Jan 2010 23:37:05 -0600 Subject: [Swift-user] readData documentation Message-ID: <110a6b261001212137i61e43c18m4dcff72c1823118a@mail.gmail.com> Hi, Is there a more complete documentation on readData? besides the one in the user guide. Thank you -Antonius -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jan 21 23:38:44 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 21 Jan 2010 23:38:44 -0600 Subject: [Swift-user] Re: [Cmsc34900] readData documentation In-Reply-To: <110a6b261001212137i61e43c18m4dcff72c1823118a@mail.gmail.com> References: <110a6b261001212137i61e43c18m4dcff72c1823118a@mail.gmail.com> Message-ID: <4B5939E4.8050805@mcs.anl.gov> There isn't, but if you have questions about it, please ask. Its also helpful to test its behavior with tiny test swift scripts. - Mike On 1/21/10 11:37 PM, Antonius Sardjito wrote: > Hi, > > Is there a more complete documentation on readData? besides the one in > the user guide. Thank you > > > -Antonius > > > ------------------------------------------------------------------------ > > _______________________________________________ > Cmsc34900 mailing list > Cmsc34900 at mailman.cs.uchicago.edu > https://mailman.cs.uchicago.edu/mailman/listinfo/cmsc34900 From wilde at mcs.anl.gov Thu Jan 21 23:45:10 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 21 Jan 2010 23:45:10 -0600 Subject: [Swift-user] Problem in error handling for localhost jobs with status.mode=provider Message-ID: <4B593B66.2040400@mcs.anl.gov> When running a job on localhost with status.mode=provider set in swift.properties, missing-output-file error messages are lost. 
You can replicate this error with this script: -- login2$ cat missingresult.swift type file; app (file f) echo() { echo "foo"; } file f<"missing.txt">; f = echo(); -- With status.mode not set, you get the expected error message, "The following output files were not created by the application: missing.txt" but with it set to "provider" you only get "Job failed with an exit code of 254": (note that I've got a bunch of debug messages below in _swiftwrap) login2$ swift -config props missingresult.swift Swift svn swift-r3202 cog-r2683 RunID: 20100121-2337-pn3jdg2c Progress: To TTY: exit code = 0 _swiftwrap: returned from checkError _swiftwrap: exit step 1 _swiftwrap: exit step 2 _swiftwrap: exit step 3 _swiftwrap: exit step 4 _swiftwrap: exit step 5 checking for outfile missing.txt jobs/j/echo-jlb3snmj/missing.txt is missing The following output files were not created by the application: missing.txt fail(254) logged message The following output files were not created by the application: missing.txt fail(254) logged info Execution failed: Exception in echo: Arguments: [foo] Host: localhost Directory: missingresult-20100121-2337-pn3jdg2c/jobs/j/echo-jlb3snmj stderr.txt: stdout.txt: foo ---- Caused by: Job failed with an exit code of 254 login2$ From sardjito.antonius at gmail.com Fri Jan 22 00:05:04 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Fri, 22 Jan 2010 00:05:04 -0600 Subject: [Swift-user] Re: [Cmsc34900] readData documentation In-Reply-To: <4B5939E4.8050805@mcs.anl.gov> References: <110a6b261001212137i61e43c18m4dcff72c1823118a@mail.gmail.com> <4B5939E4.8050805@mcs.anl.gov> Message-ID: <110a6b261001212205h495ca62fg1588ccc80eae500b@mail.gmail.com> Currently I am playing around with it but I keep getting an execution error something like below: Progress: Submitting:1 Finished successfully:317 To TTY: exit code = 0 _swiftwrap: returned from checkError Execution failed: File header does not match type. Expected 0 whitespace separated items. Got 1 instead. The file that is passed to readData() has no spaces; it was just an array of strings, which I have checked with a text editor. I hope you will be available tomorrow; I think it is easier to show it. -Antonius On Thu, Jan 21, 2010 at 11:38 PM, Michael Wilde wrote: > There isn't, but if you have questions about it, please ask. > > Its also helpful to test its behavior with tiny test swift scripts. > > - Mike > > > > On 1/21/10 11:37 PM, Antonius Sardjito wrote: > >> Hi, >> Is there a more complete documentation on readData? besides the one in >> the user guide. Thank you >> -Antonius >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Cmsc34900 mailing list >> Cmsc34900 at mailman.cs.uchicago.edu >> https://mailman.cs.uchicago.edu/mailman/listinfo/cmsc34900 >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sardjito.antonius at gmail.com Fri Jan 22 09:59:14 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Fri, 22 Jan 2010 09:59:14 -0600 Subject: [Swift-user] regexp problem Message-ID: <110a6b261001220759k4abda459l31e5c95963d3fdd1@mail.gmail.com> Hi Dr. Wilde, I am able to readData from the input now, into an array of strings.
I am currently trying to use the strcut and strcat to cut the ".tif" and replace it with "Colored.tif" I tried to do something like this: string cutresult = @strcut(output[0],"/home/wilde/bigdata/data/modis/output/"); #outfile = eccho(output[0]); trace(output[0]); and the error is : * RunID: 20100122-0951-345qqi15 Progress: SwiftScript trace: /home/wilde/bigdata/data/modis/output/h11v05.tif Execution failed: java.lang.IndexOutOfBoundsException: No group 1 I don't understand why I got the error IndexOutOfBounds, I have checked the string using a procedure that echo the string to a file, and as you can see from above I also checked it with trace(output[0]) which produces a valid value. -Antonius ps. This is the whole test file named mini.swift type file; file input <"input_for_mini_test.txt">; string output[] = readData(input); file outfile <"output_for_mini_test.txt">; string color = "Colored.tif"; app (file out) eccho (string inputStr) { echo inputStr stdout=@out; } outfile = eccho(output[0]); string cutresult = @strcut(output[0],"/home/wilde/bigdata/data/modis/output/"); trace(output[0]); --EOF-- * -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Fri Jan 22 12:40:16 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 22 Jan 2010 12:40:16 -0600 Subject: [Swift-user] regexp problem In-Reply-To: <110a6b261001220759k4abda459l31e5c95963d3fdd1@mail.gmail.com> References: <110a6b261001220759k4abda459l31e5c95963d3fdd1@mail.gmail.com> Message-ID: <4B59F110.4040608@mcs.anl.gov> You need to specify a string pattern which has at least one matching "parenthesized group", as below. - Mike -- $ cat strcut.swift string output0 = "/home/wilde/bigdata/data/modis/output/mydir/my.file"; string pattern = "/home/wilde/bigdata/data/modis/output/(.*)"; string cutresult = @strcut(output0,pattern); trace("output0",output0); trace("cutresult",cutresult);login2$ $ swift strcut.swift Swift svn swift-r3202 cog-r2683 RunID: 20100122-1236-euf2lr6b Progress: SwiftScript trace: output0, /home/wilde/bigdata/data/modis/output/mydir/my.file SwiftScript trace: cutresult, mydir/my.file On 1/22/10 9:59 AM, Antonius Sardjito wrote: > Hi Dr. Wilde, > > I am able to readData from the input now, into an array of string. I am > currently trying to use the strcut and strcat to cut the ".tif" and > replace it with "Colored.tif" > > I tried to do something like this: > > string cutresult = > @strcut(output[0],"/home/wilde/bigdata/data/modis/output/"); > #outfile = eccho(output[0]); > trace(output[0]); > > > and the error is : > * > RunID: 20100122-0951-345qqi15 > Progress: > SwiftScript trace: /home/wilde/bigdata/data/modis/output/h11v05.tif > Execution failed: > java.lang.IndexOutOfBoundsException: No group 1 > > > I don't understand why I got the error IndexOutOfBounds, I have checked > the string using a procedure that echo the string to a file, and as you > can see from above I also checked it with trace(output[0]) which > produces a valid value. > > > -Antonius > > > ps. 
> > This is the whole test file named mini.swift > > type file; > file input <"input_for_mini_test.txt">; > string output[] = readData(input); > file outfile <"output_for_mini_test.txt">; > string color = "Colored.tif"; > > app (file out) eccho (string inputStr) > { > echo inputStr stdout=@out; > } > > outfile = eccho(output[0]); > string cutresult = > @strcut(output[0],"/home/wilde/bigdata/data/modis/output/"); > trace(output[0]); > > --EOF-- > > * > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From sardjito.antonius at gmail.com Fri Jan 22 14:45:32 2010 From: sardjito.antonius at gmail.com (Antonius Sardjito) Date: Fri, 22 Jan 2010 14:45:32 -0600 Subject: [Swift-user] unmapping a file Message-ID: <110a6b261001221245s67547be8vfd1ba10a4d0dfe5b@mail.gmail.com> Hi, Is there a way to unmap a variable from a file ? say f----> f.txt are there ways to unmap this so I could have say s--->f.txt Thanks -Antonius -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Fri Jan 22 15:49:21 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 22 Jan 2010 15:49:21 -0600 Subject: [Swift-user] unmapping a file In-Reply-To: <110a6b261001221245s67547be8vfd1ba10a4d0dfe5b@mail.gmail.com> References: <110a6b261001221245s67547be8vfd1ba10a4d0dfe5b@mail.gmail.com> Message-ID: <4B5A1D61.6050707@mcs.anl.gov> There is no way to unmap a variable. And you need to manually ensure that you dont map the same physical filename to multiple *output* variables, as then its very likely that Swift will complain at run time that its trying to map and access a file that is already mapped to another variable (it complains of a "cache" conflict in such cases, as the file is already in a site's file cache). Swift does let you map the same filename for *input* to different variables or array members. If a file is already mapped to a variable, one will often reference that same variable in a later statement to re-read the same file again, or to read a previously produced file. But in this example, the purpose of returning a list of tile filenames from analyzelanduse.sh was so that you can read in this list and use it to re-map a *subset* of a previously mapped array of files. - Mike On 1/22/10 2:45 PM, Antonius Sardjito wrote: > Hi, > > Is there a way to unmap a variable from a file ? > say f----> f.txt are there ways to unmap this so I could have say > s--->f.txt > > Thanks > -Antonius > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From jamalphd at gmail.com Tue Jan 26 22:58:17 2010 From: jamalphd at gmail.com (J A) Date: Tue, 26 Jan 2010 23:58:17 -0500 Subject: [Swift-user] Read file and write to files Message-ID: Hi All: I am still learning the swift script. I have a text file (list.txt) contains the following: 3 String1 String2 String3 Where the first line contains the number of sentence with text. I would like to read the file list.txt and output the strings into separate files where file1.txt contains "String1", file2.txt contains "String2", and so on. Any suggestions on what to use or a sample code will be appreciated. 
Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamalphd at gmail.com Tue Jan 26 23:01:06 2010 From: jamalphd at gmail.com (J A) Date: Wed, 27 Jan 2010 00:01:06 -0500 Subject: [Swift-user] first.swift Message-ID: Hi All: I am looking at first.swift: ========== type messagefile {} (messagefile t) greeting() { app { echo "Hello, world!" stdout=@filename(t); } } messagefile outfile <"hello.txt">; outfile = greeting(); ============ Is there a way to avoid using the '<' and '>' in the script and have code still do the same thing? Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Wed Jan 27 04:02:22 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Jan 2010 10:02:22 +0000 (GMT) Subject: [Swift-user] first.swift In-Reply-To: References: Message-ID: > Is there a way to avoid using the '<' and '>' in the script and have code > still do the same thing? This is the kind of question that makes me reply with "why are you asking that?" ;) so: why are you asking that? (why do you not want < and > in the script?) -- From skenny at uchicago.edu Wed Jan 27 17:27:41 2010 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 27 Jan 2010 17:27:41 -0600 (CST) Subject: [Swift-user] #include ? Message-ID: <20100127172741.CIV77065@m4500-02.uchicago.edu> i heard a rumor once :) that there was a working version of an 'include' feature of swift such that i may declare functions in one script and use them in others...is this in the latest swift? i wasn't able to find anything in the doc about it so thought i'd check. ~sk From wilde at mcs.anl.gov Wed Jan 27 17:48:45 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Wed, 27 Jan 2010 17:48:45 -0600 (CST) Subject: [Swift-user] #include ? In-Reply-To: <4512972.68241264636028671.JavaMail.root@zimbra> Message-ID: <19009325.68301264636125591.JavaMail.root@zimbra> Yes, Ben added the start of a simple import mechanism: login2$ cat cati.swift import catapp; type file; file data<"data.txt">; file out<"out.txt">; out = cat(data); login2$ cat catapp.swift import typedefs; app (file o) cat (file i) { cat @i stdout=@o; } At the moment, the files you import need to be in the dir in which you are running Swift. - Mike ----- Original Message ----- From: skenny at uchicago.edu To: swift-user at ci.uchicago.edu Sent: Wednesday, January 27, 2010 5:27:41 PM GMT -06:00 US/Canada Central Subject: [Swift-user] #include ? i heard a rumor once :) that there was a working version of an 'include' feature of swift such that i may declare functions in one script and use them in others...is this in the latest swift? i wasn't able to find anything in the doc about it so thought i'd check. ~sk _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From skenny at uchicago.edu Wed Jan 27 18:01:52 2010 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 27 Jan 2010 18:01:52 -0600 (CST) Subject: [Swift-user] #include ? In-Reply-To: <19009325.68301264636125591.JavaMail.root@zimbra> References: <4512972.68241264636028671.JavaMail.root@zimbra> <19009325.68301264636125591.JavaMail.root@zimbra> Message-ID: <20100127180152.CIV82143@m4500-02.uchicago.edu> great thanks! ---- Original message ---- >Date: Wed, 27 Jan 2010 17:48:45 -0600 (CST) >From: wilde at mcs.anl.gov >Subject: Re: [Swift-user] #include ? 
>To: skenny at uchicago.edu >Cc: swift-user at ci.uchicago.edu > >Yes, Ben added the start of a simple import mechanism: > >login2$ cat cati.swift >import catapp; > >type file; > >file data<"data.txt">; >file out<"out.txt">; >out = cat(data); > > >login2$ cat catapp.swift >import typedefs; >app (file o) cat (file i) >{ > cat @i stdout=@o; >} > >At the moment, the files you import need to be in the dir in which you are running Swift. > >- Mike > > > > >----- Original Message ----- >From: skenny at uchicago.edu >To: swift-user at ci.uchicago.edu >Sent: Wednesday, January 27, 2010 5:27:41 PM GMT -06:00 US/Canada Central >Subject: [Swift-user] #include ? > >i heard a rumor once :) that there was a working version of an >'include' feature of swift such that i may declare functions >in one script and use them in others...is this in the latest >swift? i wasn't able to find anything in the doc about it so >thought i'd check. > >~sk >_______________________________________________ >Swift-user mailing list >Swift-user at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From jamalphd at gmail.com Wed Jan 27 19:59:01 2010 From: jamalphd at gmail.com (J A) Date: Wed, 27 Jan 2010 20:59:01 -0500 Subject: [Swift-user] Re: Read file and write to files In-Reply-To: References: Message-ID: Any suggestions on my question below? Thanks On 1/26/10, J A wrote: > > Hi All: > > I am still learning the swift script. > > I have a text file (list.txt) contains the following: > > > 3 > String1 > String2 > String3 > > > Where the first line contains the number of sentence with text. > > I would like to read the file list.txt and output the strings into separate > files where file1.txt contains "String1", file2.txt contains "String2", and > so on. > > Any suggestions on what to use or a sample code will be appreciated. > > Thanks, > Jamal > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jan 28 05:35:47 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 28 Jan 2010 05:35:47 -0600 (CST) Subject: [Swift-user] Read file and write to files In-Reply-To: Message-ID: <6118345.73361264678547548.JavaMail.root@zimbra> Jamal, You can use an approach like this: Use readData() to read list.txt into an array. Use foreach to iterate over the array. Use the index of the array to form the new file names, and a single_file_mapper to map a variable inside the foreach to these files. Use an app() like echo() to write the text to the files. - Mike ----- Original Message ----- From: "J A" To: swift-user at ci.uchicago.edu Sent: Tuesday, January 26, 2010 10:58:17 PM GMT -06:00 US/Canada Central Subject: [Swift-user] Read file and write to files Hi All: I am still learning the swift script. I have a text file (list.txt) contains the following: 3 String1 String2 String3 Where the first line contains the number of sentence with text. I would like to read the file list.txt and output the strings into separate files where file1.txt contains "String1", file2.txt contains "String2", and so on. Any suggestions on what to use or a sample code will be appreciated. 
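A minimal sketch of the approach outlined above (untested; the app and file names are illustrative, indices start at 0 so it writes file0.txt, file1.txt, ..., and it assumes list.txt holds only the strings, one per line; a leading count line would otherwise be read in as the first element):

type file;

app (file o) echo (string s)
{
  echo s stdout=@o;
}

file listfile <"list.txt">;
string lines[] = readData(listfile);

foreach s, i in lines
{
  # map each output file from the loop index, per the single_file_mapper suggestion
  file out <single_file_mapper; file=@strcat("file", i, ".txt")>;
  out = echo(s);
}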
Thanks, Jamal _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Thu Jan 28 05:40:17 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 28 Jan 2010 05:40:17 -0600 (CST) Subject: [Swift-user] first.swift In-Reply-To: Message-ID: <7046714.73401264678817210.JavaMail.root@zimbra> Jamal, were you asking whether you can avoid using "mappers" (which are indicated by <>) or whether you can avoid using the chars "<>" in a Swift script? In either case, I believe the answer is "no", in that Swift is based on the use of mappers, and the syntax for mappers requires "<>". - Mike ----- Original Message ----- From: "Ben Clifford" To: "J A" Cc: swift-user at ci.uchicago.edu Sent: Wednesday, January 27, 2010 4:02:22 AM GMT -06:00 US/Canada Central Subject: Re: [Swift-user] first.swift > Is there a way to avoid using the '<' and '>' in the script and have code > still do the same thing? This is the kind of question that makes me reply with "why are you asking that?" ;) so: why are you asking that? (why do you not want < and > in the script?) -- _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From jamalphd at gmail.com Thu Jan 28 10:11:36 2010 From: jamalphd at gmail.com (J A) Date: Thu, 28 Jan 2010 11:11:36 -0500 Subject: [Swift-user] compile error Message-ID: Hi: When I try to run the following code: array_iteration.swift: type file {} (file f) echo (string s) { app { echo s stdout=@filename(f); } } (file fa[]) echo_batch (string sa[]) { foreach string s, i in sa { fa[i] = echo(s); } } string sa[] = ["hello","hi there","how are you"]; file fa[]; fa = echo_batch(sa); ...... I get the following error: Could not compile SwiftScript source: line 10:13: expecting an identifier, found 'string' any idea on how to fix it? Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jan 28 10:29:35 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 28 Jan 2010 10:29:35 -0600 (CST) Subject: [Swift-user] compile error In-Reply-To: Message-ID: <12775175.83861264696175449.JavaMail.root@zimbra> One thing I spot here is that this statement: foreach string s, i in sa should be written: foreach s, i in sa The foreach statement does not permit you to re-declare the types of the iteration variables (s and i in your case). - Mike ----- Original Message ----- From: "J A" To: swift-user at ci.uchicago.edu Sent: Thursday, January 28, 2010 10:11:36 AM GMT -06:00 US/Canada Central Subject: [Swift-user] compile error Hi: When I try to run the following code: array_iteration.swift: type file {} (file f) echo (string s) { app { echo s stdout=@filename(f ); } } (file fa[]) echo_batch (string sa[]) { foreach string s, i in sa { fa[i] = echo(s); } } string sa[] = ["hello","hi there","how are you"]; file fa[]; fa = echo_batch(sa); ...... I get the following error: Could not compile SwiftScript source: line 10:13: expecting an identifier, found 'string' any idea on how to fix it? 
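With that fix applied, the procedure reads (only the foreach line changes from the downloaded example):

(file fa[]) echo_batch (string sa[]) {
  foreach s, i in sa {
    fa[i] = echo(s);
  }
}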
Thanks, Jamal _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From jamalphd at gmail.com Thu Jan 28 12:55:07 2010 From: jamalphd at gmail.com (J A) Date: Thu, 28 Jan 2010 13:55:07 -0500 Subject: [Swift-user] compile error In-Reply-To: <12775175.83861264696175449.JavaMail.root@zimbra> References: <12775175.83861264696175449.JavaMail.root@zimbra> Message-ID: This is one of the examples I dowlonaded from the swift website. On Thu, Jan 28, 2010 at 11:29 AM, Michael Wilde wrote: > One thing I spot here is that this statement: > > foreach string s, i in sa > > should be written: > > foreach s, i in sa > > The foreach statement does not permit you to re-declare the types of the > iteration variables (s and i in your case). > > - Mike > > > ----- Original Message ----- > From: "J A" > To: swift-user at ci.uchicago.edu > Sent: Thursday, January 28, 2010 10:11:36 AM GMT -06:00 US/Canada Central > Subject: [Swift-user] compile error > > > > Hi: > > When I try to run the following code: > > array_iteration.swift: > > > type file {} > (file f) echo (string s) { > app { > echo s stdout=@filename(f ); > } > } > (file fa[]) echo_batch (string sa[]) { > foreach string s, i in sa { > fa[i] = echo(s); > } > } > string sa[] = ["hello","hi there","how are you"]; > file fa[]; > fa = echo_batch(sa); > > ...... > > I get the following error: > > Could not compile SwiftScript source: line 10:13: expecting an identifier, > found 'string' > > > any idea on how to fix it? > > > Thanks, > Jamal > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamalphd at gmail.com Thu Jan 28 13:03:48 2010 From: jamalphd at gmail.com (J A) Date: Thu, 28 Jan 2010 14:03:48 -0500 Subject: [Swift-user] compile error In-Reply-To: <12775175.83861264696175449.JavaMail.root@zimbra> References: <12775175.83861264696175449.JavaMail.root@zimbra> Message-ID: That solved the issue Michael. Thanks On Thu, Jan 28, 2010 at 11:29 AM, Michael Wilde wrote: > One thing I spot here is that this statement: > > foreach string s, i in sa > > should be written: > > foreach s, i in sa > > The foreach statement does not permit you to re-declare the types of the > iteration variables (s and i in your case). > > - Mike > > > ----- Original Message ----- > From: "J A" > To: swift-user at ci.uchicago.edu > Sent: Thursday, January 28, 2010 10:11:36 AM GMT -06:00 US/Canada Central > Subject: [Swift-user] compile error > > > > Hi: > > When I try to run the following code: > > array_iteration.swift: > > > type file {} > (file f) echo (string s) { > app { > echo s stdout=@filename(f ); > } > } > (file fa[]) echo_batch (string sa[]) { > foreach string s, i in sa { > fa[i] = echo(s); > } > } > string sa[] = ["hello","hi there","how are you"]; > file fa[]; > fa = echo_batch(sa); > > ...... > > I get the following error: > > Could not compile SwiftScript source: line 10:13: expecting an identifier, > found 'string' > > > any idea on how to fix it? > > > Thanks, > Jamal > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jamalphd at gmail.com Thu Jan 28 16:21:00 2010 From: jamalphd at gmail.com (J A) Date: Thu, 28 Jan 2010 17:21:00 -0500 Subject: [Swift-user] example to write to a file Message-ID: Hi All: does anyone has an example on how to write to a file? Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jan 28 21:26:58 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Thu, 28 Jan 2010 21:26:58 -0600 (CST) Subject: [Swift-user] Using the writeData() built-in function In-Reply-To: <8254476.109671264735416491.JavaMail.root@zimbra> Message-ID: <19940874.109751264735618737.JavaMail.root@zimbra> Below is some preliminary info (a tiny example) on writing files from Swift using the not-yet-well-documented function writeData(). http://www.ci.uchicago.edu/swift/guides/userguide.php#procedure.writedata Other than this function, the only way to write data to a file is to call an app() that writes the data. Note that writeData was meant solely to assist in writing long argument lists to a file that can be passed to an app() function, to avoid passing overly-long command lines. Ive been exploring a few simple "library" functions writing using simple apps like echo and awk to do data formatting or conversion. While thats not ready for release, I just want to hint to those who need it that such techniques are reasonable and feasible. In general, common practice in Swift is *not* to write general data files directly via Swift statements, but rather to call app() functions that write data files. - Mike ----- Forwarded Message ----- From: "Ben Clifford" To: swift-devel at ci.uchicago.edu Sent: Wednesday, July 1, 2009 10:37:28 AM GMT -06:00 US/Canada Central Subject: [Swift-devel] writeData r2994 contains a writeData function which does the opposite of readData. Specifically, you can say: file l; l = writeData(@f); to output the filenames for a data structure into a text file, so that you can pass this instead of passing filenames on the command line. -- _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From fedorov at bwh.harvard.edu Sat Jan 30 16:36:52 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sat, 30 Jan 2010 17:36:52 -0500 Subject: [Swift-user] Looking for the cause of failure Message-ID: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> Hi, I've been running a 1000-job swift script with coaster provider. After executing successfully 998 jobs, I see continuous stream of messages Progress: Submitted:1 Active:1 Finished successfully:998 ... At the same time, there are no jobs in the PBS queue. 
looking at ~/.globus/coasters/coasters.log, I found the following error messages towards the end of the log: 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed: Failed The job manager could not stage out a file 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job: executable: /usr/bin/perl arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl http://141.142.68.180:54622 0130-580326-000001 /u/ac/fedorov/.globus/coasters stdout: null stderr: null directory: null batch: false redirected: false {hostcount=40, maxwalltime=24, count=40, jobtype=multiple} 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed: org.globus.gram.GramException: The job manager could not stage out a file at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531) at org.globus.gram.GramJob.setStatus(GramJob.java:184) at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) at java.lang.Thread.run(Thread.java:595) And then a longer series of what looks like timeout messages: 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN): handling reply timeout; sendReqTime=100130-161740.893, sendTime=100130-161740.893, now=100130-161940.911 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault was: Reply timeout org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN): handling reply timeout; sendReqTime=100130-161740.893, sendTime=100130-161740.893, now=100130-161940.911 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault was: Reply timeout org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Anybody can explain what happened? The same workflow ran earlier, but with fewer (2) workers per node. I am running this on Abe, Swift svn swift-r3202 cog-r2682, site description: /u/ac/fedorov/scratch-global/scratch 2.55 10000 20 false 0.1 4 10 Thanks Andriy Fedorov From wilde at mcs.anl.gov Sat Jan 30 18:27:28 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Sat, 30 Jan 2010 18:27:28 -0600 (CST) Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <12680489.151301264897544218.JavaMail.root@zimbra> Message-ID: <2795624.151361264897648127.JavaMail.root@zimbra> Andriy, I need to look at this in more detail. (Mihael is unavailable this week). But I'm wondering - since you are running Swift on an abe login host, consider changing: to: Also, on abe, don't you want to set workersNerNode to 8, as its nodes are 8-core hosts? You may also want to set the max time of the coster job (in seconds) to, for example: 7500 Change "7500" to whatever makes sense for your application. This is somewhat manual, but I suggest setting the time to some value that ensure that the PBS jobs last long enough for your application jobs. That aspect may need further adjustment. 
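(The profile element itself was stripped by the list archiver; it presumably looked something like the line below, in the same form as the other globus-namespace profiles quoted in this thread:)

<profile namespace="globus" key="maxtime">7500</profile>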
Lastly, instead of the gridftp tag you can use: But the gridftp tag you have is fine, and equivalent. - Mike ----- "Andriy Fedorov" wrote: > Hi, > > I've been running a 1000-job swift script with coaster provider. > After > executing successfully 998 jobs, I see continuous stream of messages > > Progress: Submitted:1 Active:1 Finished successfully:998 > ... > > At the same time, there are no jobs in the PBS queue. looking at > ~/.globus/coasters/coasters.log, I found the following error messages > towards the end of the log: > > 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed: > Failed The job manager could not stage out a file > 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job: > executable: /usr/bin/perl > arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl > http://141.142.68.180:54622 0130-580326-000001 > /u/ac/fedorov/.globus/coasters > stdout: null > stderr: null > directory: null > batch: false > redirected: false > {hostcount=40, maxwalltime=24, count=40, jobtype=multiple} > > 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed: > org.globus.gram.GramException: The job manager could not stage out a > file > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531) > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:595) > > And then a longer series of what looks like timeout messages: > > 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN): > handling reply timeout; sendReqTime=100130-161740.893, > sendTime=100130-161740.893, now=100130-161940.911 > 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault > was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at > org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) > at > org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN): > handling reply timeout; sendReqTime=100130-161740.893, > sendTime=100130-161740.893, now=100130-161940.911 > 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault > was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at > org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) > at > org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > > Anybody can explain what happened? The same workflow ran earlier, but > with fewer (2) workers per node. 
> > I am running this on Abe, Swift svn swift-r3202 cog-r2682, site > description: > > > > url="grid-abe.ncsa.teragrid.org"/> > /u/ac/fedorov/scratch-global/scratch > 2.55 > 10000 > 20 > key="remoteMonitorEnabled">false > 0.1 > 4 > 10 > > > Thanks > > Andriy Fedorov > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From hategan at mcs.anl.gov Sat Jan 30 21:46:33 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 30 Jan 2010 21:46:33 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> References: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> Message-ID: <1264909593.6403.8.camel@localhost> On Sat, 2010-01-30 at 17:36 -0500, Andriy Fedorov wrote: > 2010-01-30 16:17:22,275-0600 INFO Block Block task status changed: > Failed The job manager could not stage out a file > 2010-01-30 16:17:22,275-0600 INFO Block Failed task spec: Job: > executable: /usr/bin/perl > arguments: /u/ac/fedorov/.globus/coasters/cscript28331.pl > http://141.142.68.180:54622 0130-580326-000001 > /u/ac/fedorov/.globus/coasters > stdout: null > stderr: null > directory: null > batch: false > redirected: false > {hostcount=40, maxwalltime=24, count=40, jobtype=multiple} > > 2010-01-30 16:17:22,275-0600 WARN Block Worker task failed: > org.globus.gram.GramException: The job manager could not stage out a file > at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531) > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:595) That in itself is not a failure condition as it is something that happens after the worker job completes. > > And then a longer series of what looks like timeout messages: > > 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN): > handling reply timeout; sendReqTime=100130-161740.893, > sendTime=100130-161740.893, now=100130-161940.911 > 2010-01-30 16:19:40,911-0600 WARN Command Command(3, SHUTDOWN)fault > was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) > at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) That is an indication that the worker didn't respond to a shutdown command, perhaps because it died previously. In ~/.globus/coasters you will find a bunch of worker logs. If you can identify the ones for your run (based perhaps on the timestamp on the files), they may contain the reason for the failure. 
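For example, assuming the default location, something like this lists the newest files there first:

ls -lt ~/.globus/coasters | head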
> 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN): > handling reply timeout; sendReqTime=100130-161740.893, > sendTime=100130-161740.893, now=100130-161940.911 > 2010-01-30 16:19:40,911-0600 WARN Command Command(4, SHUTDOWN)fault > was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) > at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > > Anybody can explain what happened? The same workflow ran earlier, but > with fewer (2) workers per node. Does it work if you set workers per node to 2 again? If yes, that may be an indication that the workers per node setting causes a problem, and that's a stronger statement than "it doesn't work right now". From jamalphd at gmail.com Sat Jan 30 22:00:29 2010 From: jamalphd at gmail.com (J A) Date: Sat, 30 Jan 2010 23:00:29 -0500 Subject: [Swift-user] passing a file as an argument Message-ID: Hi: how can I pass a file when executing swift, so i want to do something like this: $ swift file.txt how do I catch the file inside the code? Thanks, Jamal -------------- next part -------------- An HTML attachment was scrubbed... URL: From fedorov at bwh.harvard.edu Sat Jan 30 22:07:47 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sat, 30 Jan 2010 23:07:47 -0500 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <2795624.151361264897648127.JavaMail.root@zimbra> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> Message-ID: <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> On Sat, Jan 30, 2010 at 19:27, wrote: > Andriy, I need to look at this in more detail. (Mihael is unavailable this week). > > But I'm wondering - since you are running Swift on an abe login host, consider changing: > > ? ? ? ? ? ? ? url="grid-abe.ncsa.teragrid.org"/> > > to: > > ? > Michael, thank you for the suggestion -- I will try! > Also, on abe, don't you want to set workersNerNode to 8, as its nodes are 8-core hosts? > Yes, you are right! > You may also want to set the max time of the coster job (in seconds) to, for example: > > ?7500 > > Change "7500" to whatever makes sense for your application. This is somewhat manual, but I suggest setting the time to some value that ensure that the PBS jobs last long enough for your application jobs. That aspect may need ?further adjustment. > I am not sure about this one. The documentation says maxtime defines the maximum walltime for a coaster block, and is by default unlimited. It seems to me that setting this parameter could actually create problems. Can you explain? > Lastly, instead of the gridftp tag you can use: > > ? > > But the gridftp tag you have is fine, and equivalent. > > - Mike > > > ----- "Andriy Fedorov" wrote: > >> Hi, >> >> I've been running a 1000-job swift script with coaster provider. >> After >> executing successfully 998 jobs, I see continuous stream of messages >> >> Progress: ?Submitted:1 ?Active:1 ?Finished successfully:998 >> ... >> >> At the same time, there are no jobs in the PBS queue. 
looking at >> ~/.globus/coasters/coasters.log, I found the following error messages >> towards the end of the log: >> >> 2010-01-30 16:17:22,275-0600 INFO ?Block Block task status changed: >> Failed The job manager could not stage out a file >> 2010-01-30 16:17:22,275-0600 INFO ?Block Failed task spec: Job: >> ? ? ? ? executable: /usr/bin/perl >> ? ? ? ? arguments: ?/u/ac/fedorov/.globus/coasters/cscript28331.pl >> http://141.142.68.180:54622 0130-580326-000001 >> /u/ac/fedorov/.globus/coasters >> ? ? ? ? stdout: ? ? null >> ? ? ? ? stderr: ? ? null >> ? ? ? ? directory: ?null >> ? ? ? ? batch: ? ? ?false >> ? ? ? ? redirected: false >> ? ? ? ? {hostcount=40, maxwalltime=24, count=40, jobtype=multiple} >> >> 2010-01-30 16:17:22,275-0600 WARN ?Block Worker task failed: >> org.globus.gram.GramException: The job manager could not stage out a >> file >> ? ? ? ? at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531) >> ? ? ? ? at org.globus.gram.GramJob.setStatus(GramJob.java:184) >> ? ? ? ? at >> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) >> ? ? ? ? at java.lang.Thread.run(Thread.java:595) >> >> And then a longer series of what looks like timeout messages: >> >> 2010-01-30 16:19:40,911-0600 WARN ?Command Command(3, SHUTDOWN): >> handling reply timeout; sendReqTime=100130-161740.893, >> sendTime=100130-161740.893, now=100130-161940.911 >> 2010-01-30 16:19:40,911-0600 WARN ?Command Command(3, SHUTDOWN)fault >> was: Reply timeout >> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >> ? ? ? ? at >> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) >> ? ? ? ? at >> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) >> ? ? ? ? at java.util.TimerThread.mainLoop(Timer.java:512) >> ? ? ? ? at java.util.TimerThread.run(Timer.java:462) >> 2010-01-30 16:19:40,911-0600 WARN ?Command Command(4, SHUTDOWN): >> handling reply timeout; sendReqTime=100130-161740.893, >> sendTime=100130-161740.893, now=100130-161940.911 >> 2010-01-30 16:19:40,911-0600 WARN ?Command Command(4, SHUTDOWN)fault >> was: Reply timeout >> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >> ? ? ? ? at >> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269) >> ? ? ? ? at >> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274) >> ? ? ? ? at java.util.TimerThread.mainLoop(Timer.java:512) >> ? ? ? ? at java.util.TimerThread.run(Timer.java:462) >> >> Anybody can explain what happened? The same workflow ran earlier, but >> with fewer (2) workers per node. >> >> I am running this on Abe, Swift svn swift-r3202 cog-r2682, site >> description: >> >> >> ? >> ? > ? url="grid-abe.ncsa.teragrid.org"/> >> ? /u/ac/fedorov/scratch-global/scratch >> ? 2.55 >> ? 10000 >> ? 20 >> ? > key="remoteMonitorEnabled">false >> ? 0.1 >> ? 4 >> ? 
10 >> >> >> Thanks >> >> Andriy Fedorov >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From fedorov at bwh.harvard.edu Sat Jan 30 22:10:27 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sat, 30 Jan 2010 23:10:27 -0500 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <1264909593.6403.8.camel@localhost> References: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> <1264909593.6403.8.camel@localhost> Message-ID: <82f536811001302010h641557f4u9e52e91a72b50543@mail.gmail.com> On Sat, Jan 30, 2010 at 22:46, Mihael Hategan wrote: > In ~/.globus/coasters you will find a bunch of worker logs. If you can > identify the ones for your run (based perhaps on the timestamp on the > files), they may contain the reason for the failure. > Strangely, I don't have worker logs for these executions -- the latest are from Jan 18. >> Anybody can explain what happened? The same workflow ran earlier, but >> with fewer (2) workers per node. > > Does it work if you set workers per node to 2 again? If yes, that may be > an indication that the workers per node setting causes a problem, and > that's a stronger statement than "it doesn't work right now". > I will try, and let you know. If this is indeed the case, is there any particular reason why it may not work for 4 workers per node? As Mike pointed out, the nodes actually have 8 cores. > > > From hategan at mcs.anl.gov Sat Jan 30 22:14:04 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 30 Jan 2010 22:14:04 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> Message-ID: <1264911244.7775.3.camel@localhost> On Sat, 2010-01-30 at 23:07 -0500, Andriy Fedorov wrote: > > You may also want to set the max time of the coster job (in seconds) to, for example: > > > > 7500 > > > > Change "7500" to whatever makes sense for your application. This is somewhat manual, but I suggest setting the time to some value that ensure that the PBS jobs last long enough for your application jobs. That aspect may need further adjustment. > > > > I am not sure about this one. The documentation says maxtime defines > the maximum walltime for a coaster block, and is by default unlimited. > It seems to me that setting this parameter could actually create > problems. Can you explain? > What may happen is that the block (the actual PBS job submitted to run the workers) is longer than what the queue allows. For example, you may select the "short" queue, and that may have a limit of, say, 2 hours for the walltime. You want to set the maxtime accordingly in order to prevent coasters from submitting a job with a walltime higher than what the queue allows, which would cause the job to fail immediately. Even in the case you don't explicitly specify a queue, the default queue may itself have a limit. 
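To make that concrete, a coaster pool entry along those lines might look roughly like the sketch below. The handle, queue name, paths and numbers are illustrative rather than Abe's actual settings; a 2-hour queue limit is assumed, so maxtime (in seconds) is kept at or below 7200:

<pool handle="pbs-coasters">
  <execution provider="coaster" jobmanager="local:pbs"/>
  <filesystem provider="local"/>
  <profile namespace="globus" key="queue">normal</profile>
  <profile namespace="globus" key="maxtime">7200</profile>
  <profile namespace="globus" key="workersPerNode">8</profile>
  <profile namespace="karajan" key="jobThrottle">2.55</profile>
  <workdirectory>/path/to/scratch</workdirectory>
</pool>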
From hategan at mcs.anl.gov Sat Jan 30 22:25:02 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 30 Jan 2010 22:25:02 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001302010h641557f4u9e52e91a72b50543@mail.gmail.com> References: <82f536811001301436v56ac72dp489d4a2f9da2fad6@mail.gmail.com> <1264909593.6403.8.camel@localhost> <82f536811001302010h641557f4u9e52e91a72b50543@mail.gmail.com> Message-ID: <1264911902.7775.14.camel@localhost> On Sat, 2010-01-30 at 23:10 -0500, Andriy Fedorov wrote: > On Sat, Jan 30, 2010 at 22:46, Mihael Hategan wrote: > > In ~/.globus/coasters you will find a bunch of worker logs. If you can > > identify the ones for your run (based perhaps on the timestamp on the > > files), they may contain the reason for the failure. > > > > Strangely, I don't have worker logs for these executions -- the latest > are from Jan 18. That indicates that the workers aren't even started. It's somewhat unfortunate that GRAM fails to stage out stdout/stderr, because those would likely contain information about the failure. What you can probably do in this case is try to reproduce the jobs that the coasters submit and do it manually with qsub or GRAM to see what the queuing system complains about. For that, you could enable log4j debugging for org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler. That would give you the gt2 RSL of the job, and that would likely be useful. > > >> Anybody can explain what happened? The same workflow ran earlier, but > >> with fewer (2) workers per node. > > > > Does it work if you set workers per node to 2 again? If yes, that may be > > an indication that the workers per node setting causes a problem, and > > that's a stronger statement than "it doesn't work right now". > > > > I will try, and let you know. If this is indeed the case, is there any > particular reason why it may not work for 4 workers per node? > > As Mike pointed out, the nodes actually have 8 cores. No idea. I'm pretty much blind about the issue, and in such cases it seems that the reasonable solution is to use a stick and hit random things and get a feel for the obstacles around. Now, Mike's suggestion about using the PBS provider directly seems like a good one because it provides an alternative mechanism for doing the same thing which, well, is pretty much like our stick above, except it's a pretty big stick, so it has decent chances of making a difference. Also, in case you're there, trunk is unstable code. For more stable code, use the stable branch (details on the swift download page). From fedorov at bwh.harvard.edu Sat Jan 30 22:28:23 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sat, 30 Jan 2010 23:28:23 -0500 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <1264911244.7775.3.camel@localhost> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> <1264911244.7775.3.camel@localhost> Message-ID: <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> On Sat, Jan 30, 2010 at 23:14, Mihael Hategan wrote: > What may happen is that the block (the actual PBS job submitted to run > the workers) is longer than what the queue allows. > > For example, you may select the "short" queue, and that may have a limit > of, say, 2 hours for the walltime. 
You want to set the maxtime > accordingly in order to prevent coasters from submitting a job with a > walltime higher than what the queue allows, which would cause the job to > fail immediately. > Even in the case you don't explicitly specify a queue, the default queue > may itself have a limit. This makes sense -- thank you for the explanation! So I changed the number of workers per node to 8, and set the provider to "local:pbs", as Mike suggested. I see 2 PBS jobs (20 and 40 nodes) running, but from what Swift reports to me, only 16 (?) jobs are being active at a time. Selecting site:664 Submitted:240 Active:16 Finished successfully:80 With the previous setup, it made more sense, because the number of active jobs was *. Am I missing something simple? Maybe I should just try the stable branch. I will do this next. > > > From hategan at mcs.anl.gov Sat Jan 30 22:45:53 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 30 Jan 2010 22:45:53 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> <1264911244.7775.3.camel@localhost> <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> Message-ID: <1264913153.8312.15.camel@localhost> On Sat, 2010-01-30 at 23:28 -0500, Andriy Fedorov wrote: > On Sat, Jan 30, 2010 at 23:14, Mihael Hategan wrote: > > What may happen is that the block (the actual PBS job submitted to run > > the workers) is longer than what the queue allows. > > > > For example, you may select the "short" queue, and that may have a limit > > of, say, 2 hours for the walltime. You want to set the maxtime > > accordingly in order to prevent coasters from submitting a job with a > > walltime higher than what the queue allows, which would cause the job to > > fail immediately. > > Even in the case you don't explicitly specify a queue, the default queue > > may itself have a limit. > > This makes sense -- thank you for the explanation! > > So I changed the number of workers per node to 8, and set the provider > to "local:pbs", as Mike suggested. I see 2 PBS jobs (20 and 40 nodes) > running, but from what Swift reports to me, only 16 (?) jobs are being > active at a time. > > Selecting site:664 Submitted:240 Active:16 Finished successfully:80 It may be a strange variation on relativity. What swift sees as the number of concurrent jobs may not be what the cluster sees as the number of concurrent jobs because messages between the two take various amounts of time to make it from one place to the other. This is especially visible when the jobs are short. That or maybe this patch I recently committed (cog branches/4.1.7 r2683) for the PBS provider. 16 is suspiciously equal to number_of_jobs*workers_per_node, which may be a result of the PBS provider starting only one executable irrespective of the number of nodes requested. The patch mentioned uses pdsh to start the proper number of executable instances. > > With the previous setup, it made more sense, because the number of > active jobs was *. Define "previous setup". If it's about one coaster job per node, yes. Unfortunately that's also something that prevents scalability with gram2 or clusters that have limits on the number of jobs in the queue (like the BG/P). You can force that behavior though with maxnodes=1. > > Am I missing something simple? 
Maybe I should just try the stable > branch. I will do this next. > I would advise everybody besides about 2 people doing research on I/O scalability with Swift to use the stable branch. Not only does it get fixes before trunk, but it doesn't get weird changes that may cause random breakage. From fedorov at bwh.harvard.edu Sun Jan 31 09:49:54 2010 From: fedorov at bwh.harvard.edu (Andriy Fedorov) Date: Sun, 31 Jan 2010 10:49:54 -0500 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <1264913153.8312.15.camel@localhost> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> <1264911244.7775.3.camel@localhost> <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> <1264913153.8312.15.camel@localhost> Message-ID: <82f536811001310749o37a37509i6606f0f6bfcf3be3@mail.gmail.com> On Sat, Jan 30, 2010 at 23:45, Mihael Hategan wrote: >> With the previous setup, it made more sense, because the number of >> active jobs was *. > > Define "previous setup". "previous setup" is the site configuration I included in the email that started this thread. I just tried this "previous setup", increasing number of workers per node to 8, and everything worked very well (job status plot attached). > If it's about one coaster job per node, yes. > Unfortunately that's also something that prevents scalability with gram2 > or clusters that have limits on the number of jobs in the queue (like > the BG/P). > > You can force that behavior though with maxnodes=1. > >> >> Am I missing something simple? Maybe I should just try the stable >> branch. I will do this next. >> > > I would advise everybody besides about 2 people doing research on I/O > scalability with Swift to use the stable branch. Not only does it get > fixes before trunk, but it doesn't get weird changes that may cause > random breakage. > With the stable branch, and "updated setup" (execution provider "local:pbs") I have this error message: /var/spool/torque/mom_priv/jobs/2489852.abem5.ncsa.uiuc.edu.SC: line 10: pdsh: command not found Should I install pdsh first? I didn't see it right away in the TG software list. I also don't see instructions in the Swift user guide, unless I missed it. > -------------- next part -------------- A non-text attachment was scrubbed... Name: karatasks.JOB_SUBMISSION-trails.png Type: image/png Size: 5011 bytes Desc: not available URL: From hategan at mcs.anl.gov Sun Jan 31 09:56:42 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 31 Jan 2010 09:56:42 -0600 Subject: [Swift-user] Looking for the cause of failure In-Reply-To: <82f536811001310749o37a37509i6606f0f6bfcf3be3@mail.gmail.com> References: <12680489.151301264897544218.JavaMail.root@zimbra> <2795624.151361264897648127.JavaMail.root@zimbra> <82f536811001302007v6940936fn37b9cd79bbab4736@mail.gmail.com> <1264911244.7775.3.camel@localhost> <82f536811001302028p6c58a241l87c4cd9af220c6f8@mail.gmail.com> <1264913153.8312.15.camel@localhost> <82f536811001310749o37a37509i6606f0f6bfcf3be3@mail.gmail.com> Message-ID: <1264953402.11359.1.camel@localhost> On Sun, 2010-01-31 at 10:49 -0500, Andriy Fedorov wrote: > On Sat, Jan 30, 2010 at 23:45, Mihael Hategan wrote: > >> With the previous setup, it made more sense, because the number of > >> active jobs was *. > > > > Define "previous setup". > > "previous setup" is the site configuration I included in the email > that started this thread. 
> > I just tried this "previous setup", increasing number of workers per > node to 8, and everything worked very well (job status plot attached). > > > If it's about one coaster job per node, yes. > > Unfortunately that's also something that prevents scalability with gram2 > > or clusters that have limits on the number of jobs in the queue (like > > the BG/P). > > > > You can force that behavior though with maxnodes=1. > > > >> > >> Am I missing something simple? Maybe I should just try the stable > >> branch. I will do this next. > >> > > > > I would advise everybody besides about 2 people doing research on I/O > > scalability with Swift to use the stable branch. Not only does it get > > fixes before trunk, but it doesn't get weird changes that may cause > > random breakage. > > > > With the stable branch, and "updated setup" (execution provider > "local:pbs") I have this error message: > > /var/spool/torque/mom_priv/jobs/2489852.abem5.ncsa.uiuc.edu.SC: line > 10: pdsh: command not found > > Should I install pdsh first? Yes. Might have a softenv package. > I didn't see it right away in the TG > software list. I also don't see instructions in the Swift user guide, > unless I missed it. It's relatively new. There was also the assumption that it would be installed pretty much everywhere, but it doesn't seem to be the case, so I', thinking a plain ssh solution (which is what gram does) may be better. From wilde at mcs.anl.gov Sun Jan 31 12:20:02 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 31 Jan 2010 12:20:02 -0600 (CST) Subject: [Swift-user] passing a file as an argument In-Reply-To: Message-ID: <27622487.155821264962002707.JavaMail.root@zimbra> Jamal, for this, you pass the file name on the command line as a script argument after all the swift command arguments, and pick it up inside your swift script with the @arg() function, which is like argv in C (except indexed by name rather than position). Its described in the User Guide. - Mike ----- "J A" wrote: > Hi: > > how can I pass a file when executing swift, so i want to do something > like this: > > $ swift file.txt > > > how do I catch the file inside the code? > > Thanks, > Jamal > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
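A small sketch of the @arg() pattern Mike describes above (the argument name, script name, and mapper are illustrative, not taken from the User Guide):

login2$ swift readit.swift -filename=file.txt

and inside readit.swift:

string fn = @arg("filename");
file input <single_file_mapper; file=fn>;
string contents[] = readData(input);
trace(fn);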