From hategan at mcs.anl.gov  Mon Aug  2 18:15:42 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 02 Aug 2010 18:15:42 -0500
Subject: [Swift-devel] [Fwd: [jglobus-announce] CoG JGlobus Update]
Message-ID: <1280790942.17298.1.camel@blabla2.none>

FYI

This won't affect swift directly since jglobus is packaged as a cog4 module, but a few details of that module will have to change.

Mihael

-------- Forwarded Message --------
From: Rachana Ananthakrishnan
To: jglobus-dev at lists.globus.org, jglobus-user at lists.globus.org, jglobus-announce at lists.globus.org
Subject: [jglobus-announce] CoG JGlobus Update
Date: Mon, 2 Aug 2010 17:07:13 -0500

We are planning the next major release of features provided by the CoG JGlobus and CoG JGlobus-FX libraries. The first update will cover the GSI features and will be followed up with support for the GridFTP, GRAM and MyProxy client libraries. The primary goals of the release are:

- upgrade third-party libraries
- port to standard security Java APIs
- improve the packaging and distribution model
- deprecate unused code

The upcoming release will be protocol compatible with CoG JGlobus version 1.8.x, with some APIs and packages deprecated. The following are the key changes planned for the libraries:

1. Packaging and distribution: The single jar distribution that contains the GSI, GRAM, GridFTP, MDS and MyProxy clients will be split into separate logical packages and modules. The dependencies and distribution will be managed using Maven.

2. Project and code repository: The project will be moved to SourceForge and the code will be maintained on GitHub. The repository will be open for read access, with existing dev.globus committers continuing to have commit rights on specific sub-projects.

3. Release plan: Detailed release plans for the library are currently being worked on. CoG JGlobus GSI will be the first piece to be upgraded, and is targeted for an alpha release in September 2010. This will be followed up by updated GridFTP, GRAM and MyProxy client libraries. An alpha version of each of these will be released a few weeks in advance for early testers.

4. Support: The support commitment will continue to be best-effort, and support requests will be monitored on a user mailing list set up for the project.

5. CoG JGlobus 1.x support: Support for the existing library will continue for up to 3 months after the 2.0 release, to support the transition to the new code base.

The GSI features will be the first set to be upgraded and released, and the other clients will build on the new GSI library. The key change will be the upgrade to the standard Java SSL library, replacing PureTLS and its supporting libraries. This will not only deprecate the use of the unsupported PureTLS, but also provide access to better security algorithms, such as SHA-2. The upcoming release will also use the Java Security Provider framework and standard APIs, thus facilitating the use of any standard provider implementations for processing certificates and CRLs, path validation and trust managers. Other than discontinuing the PureTLS package and already deprecated code, most of the existing API will be maintained, although much of it will be deprecated in favor of more standard APIs. There are no plans to continue support for Tomcat connectors for GSI SSL (HTTPS and delegation).
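To ground the "port to standard security Java APIs" goal, the sketch below shows the kind of standard JSSE/JCA wiring the new GSI code would plug into: an SSLContext built from a provider-based TrustManagerFactory, with the GSI-specific proxy-certificate and signing-policy checks slotted into a custom trust manager. This is a minimal illustration using only stock JDK classes, not the actual jGlobus 2.0 API, and the keystore path/password are placeholders.

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;
import java.io.FileInputStream;
import java.security.KeyStore;

public class StandardSslSketch {
    public static SSLContext buildContext(String trustStorePath, char[] password) throws Exception {
        // Load trusted CA certificates from a standard keystore.
        KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
        FileInputStream in = new FileInputStream(trustStorePath);
        try {
            trustStore.load(in, password);
        } finally {
            in.close();
        }

        // Provider-based trust managers; a GSI-aware implementation would add
        // RFC 3820 proxy-certificate and signing-policy validation here.
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);

        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null /* no client key managers in this sketch */, tmf.getTrustManagers(), null);
        return ctx;
    }
}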
Based on prototype work currently being done, the API changes are documented here:
http://www.mcs.anl.gov/~ranantha/jGlobus/jGlobus-jGlobusAPIChanges-02Aug10-1223PM-4.pdf

The GSI pieces will be released as the following software packages, with the following Maven artifacts planned:

1. jGlobus GSI 2.0
   1A. GSI Core - API for the creation of proxy credentials, and utility API to deal with proxy credentials/certificate chains, as needed.
   1B. GSI TrustManager - Trust manager for Java SSL with support for RFC 3820 Proxy Certificates, Signing Policy and authorization.
   1C. GSS-GSI - GSS API wrapper for standard SSL and GSI SSL (SSL with delegation), with support for RFC 3820 Proxy Certificates and Signing Policy.
   1D. GSI CL - Client tools for certificate and proxy credential handling.
2. jGlobus Connectors 2.0
   2A. SSL Proxy Connectors - Tomcat and JBoss connectors to enable SSL with proxy certificates.

The following shows some common library usages and the module to download. The dependencies for the module will automatically be resolved and downloaded:

1. Command line clients (e.g. grid-proxy-init): GSI CL
2. GSS API to integrate with applications: GSS-GSI
3. API to extract properties of a credential (e.g. identity or type of proxy): GSI Core
4. Tomcat requiring SSL access and support for proxies and signing policy: SSL Proxy Connectors

Similar details for the other modules in the library, that is GridFTP, GRAM, MDS and MyProxy, will be provided soon.

Please provide comments/feedback on the planned updates. If your community has specific usages of the current libraries that are not covered by this plan, please let us know.

Thanks,
Rachana

Rachana Ananthakrishnan
Argonne National Lab | University of Chicago

From hockyg at uchicago.edu  Tue Aug  3 11:40:10 2010
From: hockyg at uchicago.edu (Glen Hocky)
Date: Tue, 3 Aug 2010 12:40:10 -0400
Subject: [Swift-devel] swift resume request/suggestion
Message-ID: 

I just tried (I think successfully) to resume a workflow that failed on bgp. Here is what I had to type to resume my workflow:

> swift -resume glassContinueCavities-20100803-0045-nc4gyma0.0.rlog
> -sites.file sites_pedersenipl.xml -tc.file tc_pedersenipl
> ./glassContinueCavities.swift -temp=0.8 -steps=15000 -prevsteps=5000
> -totsteps=20000 -esteps=1000000000 -ceqsteps=10000 -natoms=4915 -volume=4096
> -rlist=rlist -fraca=0.8 -nmodels=10 -nsub=25 -energyfunction=pedersenipl

Is it possible to make this process simpler by saving all of these command line arguments in the *.rlog file? My suggestion would be that all that is necessary to restart a Swift workflow would be

swift -resume rlogfile

and that specifying any Swift arguments such as -tc.file would override those stored in the restart file.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From benc at hawaga.org.uk  Tue Aug  3 11:44:35 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 3 Aug 2010 16:44:35 +0000 (GMT)
Subject: [Swift-devel] swift resume request/suggestion
In-Reply-To: 
References: 
Message-ID: 

> Is it possible to make this process simpler by saving all of these command
> line arguments in the *.rlog file?

Probably it should be mandatory that that happens - it doesn't make sense (in general) to restart a run with different parameters, inasmuch as it doesn't make sense to change the values of constants partway through the execution of a program.
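One way the suggestion could be implemented is to write the options of the original invocation next to the restart log and merge them back on -resume, with explicitly passed options taking precedence. The helper below is a hypothetical sketch of that idea (class name, ".args" suffix and call sites are invented); it is not the existing Swift resume code.

import java.io.*;
import java.util.*;

// Hypothetical helper: persists a run's command-line options next to its
// restart log and merges them back when the run is resumed.
public class ResumeOptions {

    // Called when a run starts: record every "-key value" pair.
    public static void save(File rlog, Map<String, String> options) throws IOException {
        Properties p = new Properties();
        p.putAll(options);
        OutputStream out = new FileOutputStream(rlog.getPath() + ".args");
        try {
            p.store(out, "options captured when the run started");
        } finally {
            out.close();
        }
    }

    // Called on -resume: stored options act as defaults, command-line options win.
    public static Map<String, String> merge(File rlog, Map<String, String> cmdLine) throws IOException {
        Properties stored = new Properties();
        File args = new File(rlog.getPath() + ".args");
        if (args.exists()) {
            InputStream in = new FileInputStream(args);
            try {
                stored.load(in);
            } finally {
                in.close();
            }
        }
        Map<String, String> merged = new LinkedHashMap<String, String>();
        for (String name : stored.stringPropertyNames()) {
            merged.put(name, stored.getProperty(name));
        }
        merged.putAll(cmdLine);  // explicit arguments override stored ones
        return merged;
    }
}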
-- From hockyg at uchicago.edu Tue Aug 3 12:22:11 2010 From: hockyg at uchicago.edu (Glen Hocky) Date: Tue, 3 Aug 2010 13:22:11 -0400 Subject: [Swift-devel] problem compiling latest version of swift Message-ID: For some reason I haven't been able to compile swift from the svn on any of the machine's I've tried. not sure if something has changed since the last time I had to do it many months ago. attached is the log of ant failing as well as details on the repositories I used, etc. any suggestions? -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- $ svn info Path: . URL: https://cogkit.svn.sourceforge.net/svnroot/cogkit/trunk/current/src/cog/modules Repository Root: https://cogkit.svn.sourceforge.net/svnroot/cogkit Repository UUID: 5b74d2a0-fa0e-0410-85ed-ffba77ec0bde Revision: 2829 Node Kind: directory Schedule: normal Last Changed Author: hategan Last Changed Rev: 2829 Last Changed Date: 2010-07-28 14:32:48 -0500 (Wed, 28 Jul 2010) $ svn info Path: . URL: https://svn.ci.uchicago.edu/svn/vdl2/trunk Repository Root: https://svn.ci.uchicago.edu/svn/vdl2 Repository UUID: e2bb083e-7f23-0410-b3a8-8253ac9ef6d8 Revision: 3496 Node Kind: directory Schedule: normal Last Changed Author: jonmon Last Changed Rev: 3494 Last Changed Date: 2010-07-30 13:56:11 -0500 (Fri, 30 Jul 2010) $ ant dist Buildfile: build.xml generateVersion: antlr: [java] ANTLR Parser Generator Version 2.7.5 (20050128) 1989-2005 jGuru.com [java] resources/swiftscript.g:1028: warning:nondeterminism upon [java] resources/swiftscript.g:1028: k==1:LBRACK [java] resources/swiftscript.g:1028: k==2:ID,STRING_LITERAL,LBRACK,LPAREN,AT,PLUS,MINUS,STAR,NOT,INT_LITERAL,FLOAT_LITERAL,"true","false" [java] resources/swiftscript.g:1028: between alt 1 and exit branch of block compileSchema: [java] Time to build schema type system: 0.511 seconds [java] Time to generate code: 1.268 seconds [java] Time to compile code: 2.541 seconds [java] Compiled types to: /autonfs/home/hockyg/swift/cog/modules/swift/../../modules/swift/lib/vdldefinitions.jar dist: dist: log4j.properties: log4j.check.module: log4j.properties.init: [delete] Deleting: /autonfs/home/hockyg/swift/cog/modules/swift/dist/swift-svn/etc/log4j.properties [delete] Deleting: /autonfs/home/hockyg/swift/cog/modules/swift/dist/swift-svn/CHANGES.txt [copy] Copying 1 file to /autonfs/home/hockyg/swift/cog/modules/swift/dist/swift-svn/etc log4j.properties.update: build.dependencies: dependencies: create.dependency.log: delete.dependency.log.1: [delete] Deleting: /autonfs/home/hockyg/swift/cog/dependency.log.dist deps: dep: dep.1: dist: dist: log4j.properties: log4j.check.module: log4j.properties.init: log4j.properties.update: build.dependencies: dependencies: create.dependency.log: delete.dependency.log.1: deps: dep: dep.1: dist: dist: log4j.properties: log4j.check.module: log4j.properties.init: log4j.properties.update: build.dependencies: dependencies: create.dependency.log: delete.dependency.log.1: deps: dep: dep.1: dist: dist: log4j.properties: log4j.check.module: log4j.properties.init: log4j.properties.update: build.dependencies: dependencies: create.dependency.log: delete.dependency.log.1: deps: dep: dep.1: dist: dist: log4j.properties: log4j.check.module: log4j.properties.init: log4j.properties.update: [java] Warning: source log (/autonfs/home/hockyg/swift/cog/modules/jglobus/../../modules/jglobus/CHANGES.txt) does not exist build.dependencies: dependencies: create.dependency.log: delete.dependency.log.1: deps: 
delete.dependency.log.1: [echo] [jglobus]: DIST [echo] [jglobus]: JARCOPY delete.jar: compile: [echo] [jglobus]: COMPILE copy.resources: jar: [echo] [jglobus]: JAR (cog-jglobus-1.7.0.jar) [jar] Building jar: /autonfs/home/hockyg/swift/cog/modules/swift/dist/swift-svn/lib/cog-jglobus-1.7.0.jar create: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: deploy.examples: do.deploy.examples: [copy] Copying 1 file to /autonfs/home/hockyg/swift/cog/modules/swift/dist/swift-svn/lib dep: dep.1: dist: dist: log4j.properties: log4j.check.module: log4j.properties.init: log4j.properties.update: [java] Warning: source log (/autonfs/home/hockyg/swift/cog/modules/util/../../modules/util/CHANGES.txt) does not exist build.dependencies: dependencies: create.dependency.log: delete.dependency.log.1: deps: delete.dependency.log.1: [echo] [util]: DIST [echo] [util]: JARCOPY delete.jar: compile: [echo] [util]: COMPILE copy.resources: jar: [echo] [util]: JAR (cog-util-0.92.jar) create: launcher: create.launcher: deploy.examples: do.deploy.examples: delete.dependency.log.1: [echo] [abstraction-common]: DIST [echo] [abstraction-common]: JARCOPY delete.jar: compile: [echo] [abstraction-common]: COMPILE copy.resources: jar: [echo] [abstraction-common]: JAR (cog-abstraction-common-2.4.jar) create: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: launcher: create.launcher: example.launcher: create.example.launcher: [echo] [abstraction-common]: EXAMPLE LAUNCHER examples/hierarchical-queue-handler example.launcher: create.example.launcher: [echo] [abstraction-common]: EXAMPLE LAUNCHER examples/hierarchical-set-handler launcher: create.launcher: example.launcher: create.example.launcher: [echo] [abstraction-common]: EXAMPLE LAUNCHER examples/taskgraph-2-xml launcher: create.launcher: launcher: create.launcher: example.launcher: create.example.launcher: [echo] [abstraction-common]: EXAMPLE LAUNCHER examples/xml-2-taskgraph launcher: create.launcher: deploy.examples: do.deploy.examples: dep: dep.1: dist: dist: log4j.properties: log4j.check.module: log4j.properties.init: log4j.properties.update: build.dependencies: dependencies: create.dependency.log: delete.dependency.log.1: deps: dep: dep.1: dep: dep.1: dep: dep.1: delete.dependency.log.1: [echo] [provider-gt2]: DIST [echo] [provider-gt2]: JARCOPY delete.jar: [echo] [provider-gt2]: DELETE.JAR (cog-provider-gt2-2.4.jar) compile: [echo] [provider-gt2]: COMPILE [javac] Compiling 16 source files to /autonfs/home/hockyg/swift/cog/modules/provider-gt2/build [javac] /autonfs/home/hockyg/swift/cog/modules/provider-gt2/src/org/globus/cog/abstraction/impl/execution/gt2/JobSubmissionTaskHandler.java:332: cannot find symbol [javac] symbol : method getStageIn() [javac] location: interface org.globus.cog.abstraction.interfaces.JobSpecification [javac] boolean staging = notEmpty(spec.getStageIn()) [javac] ^ [javac] /autonfs/home/hockyg/swift/cog/modules/provider-gt2/src/org/globus/cog/abstraction/impl/execution/gt2/JobSubmissionTaskHandler.java:333: cannot find symbol [javac] symbol : method getStageOut() [javac] location: interface org.globus.cog.abstraction.interfaces.JobSpecification [javac] || notEmpty(spec.getStageOut()); [javac] ^ 
[javac] /autonfs/home/hockyg/swift/cog/modules/provider-gt2/src/org/globus/cog/abstraction/impl/execution/gt2/JobSubmissionTaskHandler.java:364: cannot find symbol [javac] symbol : method getStageIn() [javac] location: interface org.globus.cog.abstraction.interfaces.JobSpecification [javac] addStagingSpec(spec.getStageIn(), rsl, "file_stage_in"); [javac] ^ [javac] /autonfs/home/hockyg/swift/cog/modules/provider-gt2/src/org/globus/cog/abstraction/impl/execution/gt2/JobSubmissionTaskHandler.java:365: cannot find symbol [javac] symbol : method getStageOut() [javac] location: interface org.globus.cog.abstraction.interfaces.JobSpecification [javac] addStagingSpec(spec.getStageOut(), rsl, "file_stage_out"); [javac] ^ [javac] /autonfs/home/hockyg/swift/cog/modules/provider-gt2/src/org/globus/cog/abstraction/impl/execution/gt2/JobSubmissionTaskHandler.java:381: incompatible types [javac] found : java.lang.Object [javac] required: java.lang.String [javac] for (String key : spec.getAttributeNames()) { [javac] ^ [javac] /autonfs/home/hockyg/swift/cog/modules/provider-gt2/src/org/globus/cog/abstraction/impl/execution/gt2/JobSubmissionTaskHandler.java:498: incompatible types [javac] found : java.lang.Object [javac] required: java.lang.String [javac] for (String arg : spec.getArgumentsAsList()) { [javac] ^ [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 6 errors BUILD FAILED /autonfs/home/hockyg/swift/cog/modules/swift/build.xml:73: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:444: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:79: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:52: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/modules/swift/dependencies.xml:4: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:163: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:168: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/modules/karajan/build.xml:59: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:444: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:79: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:52: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/modules/karajan/dependencies.xml:4: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:163: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:168: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/modules/abstraction/build.xml:58: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:444: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:79: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:52: The following error occurred while executing this line: 
/autonfs/home/hockyg/swift/cog/modules/abstraction/dependencies.xml:7: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:163: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:168: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/modules/provider-gt2/build.xml:59: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:465: The following error occurred while executing this line: /autonfs/home/hockyg/swift/cog/mbuild.xml:228: Compile failed; see the compiler error output for details. Total time: 55 seconds From hategan at mcs.anl.gov Tue Aug 3 12:43:05 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Aug 2010 12:43:05 -0500 Subject: [Swift-devel] problem compiling latest version of swift In-Reply-To: References: Message-ID: <1280857386.23003.0.camel@blabla2.none> Make sure you do a svn update on cog. Also do 'ant distclean' in the swift dir. You may have some old classes that Ant doesn't recompile properly. Mihael On Tue, 2010-08-03 at 13:22 -0400, Glen Hocky wrote: > For some reason I haven't been able to compile swift from the svn on > any of the machine's I've tried. not sure if something has changed > since the last time I had to do it many months ago. > > > attached is the log of ant failing as well as details on the > repositories I used, etc. > > > any suggestions? > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Wed Aug 4 03:15:03 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 04 Aug 2010 03:15:03 -0500 Subject: [Swift-devel] CFP: The 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2010, co-located with Supercomputing 2010 -- November 15th, 2010 Message-ID: <4C592187.6050702@cs.uchicago.edu> Call for Papers ------------------------------------------------------------------------------------------------ The 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2010 http://dsl.cs.uchicago.edu/MTAGS10/ ------------------------------------------------------------------------------------------------ November 15th, 2010 New Orleans, Louisiana, USA Co-located with with IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC10) ================================================================================================ The 3rd workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of the workshop encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This workshop will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. 
We welcome paper submissions on all topics related to MTC on large scale systems. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library (pending approval). The workshop will be co-located with the IEEE/ACM Supercomputing 2010 Conference in New Orleans Louisiana on November 15th, 2010. For more information, please see http://dsl.cs.uchicago.edu/MTAGS010/. Topics ------------------------------------------------------------------------------------------------ We invite the submission of original work that is related to the topics below. The papers can be either short (5 pages) position papers, or long (10 pages) research papers. Topics of interest include (in the context of Many-Task Computing): * Compute Resource Management * Scheduling * Job execution frameworks * Local resource manager extensions * Performance evaluation of resource managers in use on large scale systems * Dynamic resource provisioning * Techniques to manage many-core resources and/or GPUs * Challenges and opportunities in running many-task workloads on HPC systems * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Storage architectures and implementations * Distributed file systems * Parallel file systems * Distributed meta-data management * Content distribution systems for large data * Data caching frameworks and techniques * Data management within and across data centers * Data-aware scheduling * Data-intensive computing applications * Eventual-consistency storage usage and management * Programming models and tools * Map-reduce and its generalizations * Many-task computing middleware and applications * Parallel programming frameworks * Ensemble MPI techniques and frameworks * Service-oriented science applications * Large-Scale Workflow Systems * Workflow system performance and scalability analysis * Scalability of workflow systems * Workflow infrastructure and e-Science middleware * Programming Paradigms and Models * Large-Scale Many-Task Applications * High-throughput computing (HTC) applications * Data-intensive applications * Quasi-supercomputing applications, deployments, and experiences * Performance Evaluation * Performance evaluation * Real systems * Simulations * Reliability of large systems Paper Submission and Publication ------------------------------------------------------------------------------------------------ Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines; document templates can be found at ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.pdf and ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.doc. We are also seeking position papers of no more than 5 pages in length. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/MTAGS2010/ before the deadline of August 25th, 2010 at 11:59PM PST; the final 5/10 page papers in PDF format will be due on September 1st, 2010 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library (pending approval). Notifications of the paper decisions will be sent out by October 1st, 2010. 
Selected excellent work may be eligible for additional post-conference publication as journal articles or book chapters; see this year's ongoing special issue in the IEEE Transactions on Parallel and Distributed Systems (TPDS) at http://dsl.cs.uchicago.edu/TPDS_MTC/. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please visit http://dsl.cs.uchicago.edu/MTAGS10/. Important Dates ------------------------------------------------------------------------------------------------ * Abstract Due: August 25th, 2010 * Papers Due: September 1st, 2010 * Notification of Acceptance: October 1st, 2010 * Camera Ready Papers Due: November 1st, 2010 * Workshop Date: November 15th, 2010 Committee Members ------------------------------------------------------------------------------------------------ Workshop Chairs * Ioan Raicu, Illinois Institute of Technology * Ian Foster, University of Chicago & Argonne National Laboratory * Yong Zhao, Microsoft Steering Committee * David Abramson, Monash University, Australia * Alok Choudhary, Northwestern University, USA * Jack Dongara, University of Tennessee, USA * Geoffrey Fox, Indiana University, USA * Robert Grossman, University of Illinois at Chicago, USA * Arthur Maccabe, Oak Ridge National Labs, USA * Dan Reed, Microsoft Research, USA * Marc Snir, University of Illinois at Urbana Champaign, USA * Xian-He Sun, Illinois Institute of Technology, USA * Manish Parashar, Rutgers University, USA Technical Committee * Roger Barga, Microsoft Research, USA * Mihai Budiu, Microsoft Research, USA * Rajkumar Buyya, University of Melbourne, Australia * Henri Casanova, University of Hawaii at Manoa, USA * Jeff Chase, Duke University, USA * Peter Dinda, Northwestern University, USA * Catalin Dumitrescu, Fermi National Labs, USA * Evangelinos Constantinos, Massachusetts Institute of Technology, USA * Indranil Gupta, University of Illinois at Urbana Champaign, USA * Alexandru Iosup, Delft University of Technology, Netherlands * Florin Isaila, Universidad Carlos III de Madrid, Spain * Michael Isard, Microsoft Research, USA * Kamil Iskra, Argonne National Laboratory, USA * Daniel Katz, University of Chicago, USA * Tevfik Kosar, Louisiana State University, USA * Zhiling Lan, Illinois Institute of Technology, USA * Ignacio Llorente, Universidad Complutense de Madrid, Spain * Reagan Moore, University of North Carolina, Chappel Hill, USA * Jose Moreira, IBM Research, USA * Marlon Pierce, Indiana University, USA * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA * Matei Ripeanu, University of British Columbia, Canada * Alain Roy, University of Wisconsin Madison, USA * Edward Walker, Texas Advanced Computing Center, USA * Mike Wilde, University of Chicago & Argonne National Laboratory, USA * Matthew Woitaszek, The University Coorporation for Atmospheric Research, USA * Justin Wozniak, Argonne National Laboratory, USA * Ken Yocum, University of California San Diego, USA -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 
31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= From wilde at mcs.anl.gov Wed Aug 4 10:49:43 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Wed, 4 Aug 2010 09:49:43 -0600 (GMT-06:00) Subject: [Swift-devel] OSG links for site config info and trouble tickets In-Reply-To: <1511315271.572221280936785539.JavaMail.root@zimbra.anl.gov> Message-ID: <266955252.572371280936983678.JavaMail.root@zimbra.anl.gov> Arjun, the links below provided by Marco should explain OSG site directory conventions in detail. For reporting problems and getting help: osg-sites at opensciencegrid.org - to report problems with a site osg-int to report problems with OSG software (int == "integration") An OSG "campfire" chat is staffed three times a week for interactive text chat with OSG support staff (info on the OSG site I think). - Mike ----- Forwarded Message ----- From: "Marco Mambelli" To: "Mike Wilde" Sent: Wednesday, August 4, 2010 10:43:19 AM GMT -06:00 US/Canada Central Subject: OSG links Hi Mike, this is the page describing how to get help: https://twiki.grid.iu.edu/bin/view/ReleaseDocumentation/HelpProcedure This is the page about site planning: https://twiki.grid.iu.edu/bin/view/ReleaseDocumentation/SitePlanning Here describes the possible configurations of the directories: https://twiki.grid.iu.edu/bin/view/ReleaseDocumentation/LocalStorageConfiguration If the ReSS/BDII information is inconsistent with the behavior, there is a site configuration problem and it is OK to report it or open a ticket. Shared directories sometime can give performance/space problems. I have no idea why all of the sudden now it is working. Sites could have been fixing their setup. Another possibility is that another VO that uses them (e.g. LIGO) now has reduced activity. Best, Marco -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Aug 4 12:22:55 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 4 Aug 2010 11:22:55 -0600 (GMT-06:00) Subject: [Swift-devel] Fwd: [Dsl-seminar] Go In-Reply-To: Message-ID: <2747093.578701280942575574.JavaMail.root@zimbra.anl.gov> Go's "go-routines" seem to me to have a lot in common with Karajan elements. Im curious to hear what people think of the similarities and differences, and whether Go has any bearing on Swift. - Mike ----- Forwarded Message ----- From: "Anne Rogers" To: dsl-seminar at cs.uchicago.edu Sent: Wednesday, August 4, 2010 12:19:50 PM GMT -06:00 US/Canada Central Subject: [Dsl-seminar] Go Techstaff has installed the go compiler and linker on the department's linux machines. You can find it here: /opt/go/bin. 
-anne

_______________________________________________
DSL-seminar mailing list
DSL-seminar at mailman.cs.uchicago.edu
https://mailman.cs.uchicago.edu/mailman/listinfo/dsl-seminar

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov  Wed Aug  4 12:37:44 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 04 Aug 2010 12:37:44 -0500
Subject: [Swift-devel] Fwd: [Dsl-seminar] Go
In-Reply-To: <2747093.578701280942575574.JavaMail.root@zimbra.anl.gov>
References: <2747093.578701280942575574.JavaMail.root@zimbra.anl.gov>
Message-ID: <1280943464.31915.9.camel@blabla2.none>

List of languages with co-routines:
http://en.wikipedia.org/wiki/Coroutine#Programming_languages_supporting_coroutines

Languages supporting continuations (which are more powerful than coroutines, and probably what you are looking for):
http://en.wikipedia.org/wiki/Continuation#Programming_language_support

I call your attention to Scala, for its Java library support (and being a JVM language in general). And here's a paper on how continuations are implemented in Scala, which also contains a few examples of how to use continuations for concurrency and scalable I/O:
http://lamp.epfl.ch/~rompf/continuations-icfp09.pdf

Mihael

On Wed, 2010-08-04 at 11:22 -0600, Michael Wilde wrote:
> Go's "go-routines" seem to me to have a lot in common with Karajan elements.
>
> Im curious to hear what people think of the similarities and differences, and whether Go has any bearing on Swift.
>
> - Mike
>
>
> ----- Forwarded Message -----
> From: "Anne Rogers"
> To: dsl-seminar at cs.uchicago.edu
> Sent: Wednesday, August 4, 2010 12:19:50 PM GMT -06:00 US/Canada Central
> Subject: [Dsl-seminar] Go
>
>
> Techstaff has installed the go compiler and linker on the department's linux machines. You can find it here: /opt/go/bin.
>
> -anne
>
> _______________________________________________
> DSL-seminar mailing list
> DSL-seminar at mailman.cs.uchicago.edu
> https://mailman.cs.uchicago.edu/mailman/listinfo/dsl-seminar
>

From hategan at mcs.anl.gov  Wed Aug  4 13:02:16 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 04 Aug 2010 13:02:16 -0500
Subject: [Swift-devel] Fwd: [Dsl-seminar] Go
In-Reply-To: <2747093.578701280942575574.JavaMail.root@zimbra.anl.gov>
References: <2747093.578701280942575574.JavaMail.root@zimbra.anl.gov>
Message-ID: <1280944936.32631.3.camel@blabla2.none>

On Wed, 2010-08-04 at 11:22 -0600, Michael Wilde wrote:
> Go's "go-routines" seem to me to have a lot in common with Karajan elements.

Though I have to mention: from a concurrency perspective, both karajan and continuations try to achieve the same thing: cheap (lightweight) concurrency. However, from the programming perspective, they are pretty different in that coroutines allow one to implement these things in a normal language at the expense of simplicity, whereas in karajan the goal is to use threading as much as possible (as well as make threading simpler) but have the underlying implementation of those threads be lightweight.
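To make the "cheap (lightweight) concurrency" point concrete, here is a small, self-contained Java illustration of the general idea only (it is not Karajan, Go or Scala-continuations code): each logical task is written as a chain of short callback steps multiplexed over a tiny thread pool, so 100,000 tasks can be in flight without 100,000 OS threads.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LightweightTasks {
    static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    // A "task" is a chain of short steps; between steps it holds no thread.
    static void runTask(final int id, final int remainingSteps, final Runnable whenDone) {
        if (remainingSteps == 0) {
            whenDone.run();
            return;
        }
        POOL.submit(new Runnable() {
            public void run() {
                // ... do a small piece of work for task `id` here ...
                runTask(id, remainingSteps - 1, whenDone);  // schedule the continuation
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        int tasks = 100000;
        final CountDownLatch done = new CountDownLatch(tasks);
        Runnable countDown = new Runnable() {
            public void run() {
                done.countDown();
            }
        };
        for (int i = 0; i < tasks; i++) {
            runTask(i, 3, countDown);
        }
        done.await();
        POOL.shutdown();
        System.out.println("all tasks finished on 4 OS threads");
    }
}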
From wilde at mcs.anl.gov Wed Aug 4 13:32:32 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Wed, 4 Aug 2010 12:32:32 -0600 (GMT-06:00) Subject: [Swift-devel] more on job throughput In-Reply-To: <905253644.581641280946313139.JavaMail.root@zimbra.anl.gov> Message-ID: <671115461.582101280946752723.JavaMail.root@zimbra.anl.gov> Is this test run telling us: - overall, 8192 jobs in 25 seconds: 328 jobs / sec - about 5 seconds for anything to start, so: 8192 jobs / 20 seconds: 409 jobs / sec - drop off site select time (can bypass for single-site case?) and stage in (can bypass w/CDM?) and we are at the number you mentioned in an earlier thread: 8192 jobs / 10 seconds = 819 jobs / sec ? Seems a major question is what is practical with a real providers, eg coasters on BG/P? And can the 15 seconds of "overhead" (out of 25 sec) in this run be eliminated or reduced? - Mike ----- "Mihael Hategan" wrote: > Here's a plot of the number of tasks in the various stages that the > runtime stats track. > > This is with 8192 jobs and the fake provider (which does nothing and > finishes tasks almost immediately, and which I should probably commit > somewhere if anybody else wants to play with this). > > I also attached the scripts used. You would need to change > RuntimeStats > to print the stats more often than the 1s default (say something like > (MIN,MAX)_PERIOD_MS=100). > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From benc at hawaga.org.uk Wed Aug 4 15:20:32 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 4 Aug 2010 20:20:32 +0000 (GMT) Subject: [Swift-devel] Fwd: [Dsl-seminar] Go In-Reply-To: <2747093.578701280942575574.JavaMail.root@zimbra.anl.gov> References: <2747093.578701280942575574.JavaMail.root@zimbra.anl.gov> Message-ID: I've looked at summaries of go but not tried it myself. Feature-wise it doesn't seem to have anything particularly novel. It seems to expose concurrency to the user fairly explicitly. From the perspective of Swift as a language, I think thats a bad thing to do. But that's not a criticism of go particularly. -- http://www.hawaga.org.uk/ben/ From hockyg at uchicago.edu Thu Aug 5 16:14:25 2010 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 5 Aug 2010 17:14:25 -0400 Subject: [Swift-devel] problem with coasters on pbs provider (on pads) Message-ID: I'm having a problem running on PADS. It seems that when I submit jobs with workerspernode=8, the queuing system doesn't pick up on the fact that each job submitted by swift should have ppn=8 (specifically, that is missing from the qsub command registered by pbs) when I do a qsub -f on my running jobs I get > submit_args = -A CI-CCR000013 -l nodes=1,walltime=02:00:00, size=1 /home/hockyg/.globus/scripts/PBS4482066898055181239.submit so there's no ppn=8 and i think it should also say size=8. the result is that I get 56 jobs running on one node BTW, this is with the latest version of cog (as well as earlier ones) and a version of swift that's working for me on bgp > Swift svn swift-r3432 (swift modified locally) cog-r2829 my sites entry is > 01:00:00 8 172.5.86.5 16 1 1 1.27 10000 CI-CCR000013 /tmp > /home/hockyg/reichman/glassy_dynamics/code/swift/run/real -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wilde at mcs.anl.gov Thu Aug 5 16:22:05 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Thu, 5 Aug 2010 15:22:05 -0600 (GMT-06:00) Subject: [Swift-devel] problem with coasters on pbs provider (on pads) In-Reply-To: <1783773521.630241281043119503.JavaMail.root@zimbra.anl.gov> Message-ID: <1581357356.630431281043325116.JavaMail.root@zimbra.anl.gov> Glen, I've worked on this problem in the past; I will look to see if I have a fix in one of my test versions. In the meantime, say workerspernode=1, and I *think* it will correctly run single-node jobs and PBS will place up to 8 of these per node. - Mike ----- "Glen Hocky" wrote: > I'm having a problem running on PADS. It seems that when I submit jobs > with workerspernode=8, the queuing system doesn't pick up on the fact > that each job submitted by swift should have ppn=8 (specifically, that > is missing from the qsub command registered by pbs) > > > when I do a qsub -f on my running jobs I get > > > submit_args = -A CI-CCR000013 -l nodes=1,walltime=02:00:00, > > size=1 /home/hockyg/.globus/scripts/PBS4482066898055181239.submit > > > so there's no ppn=8 and i think it should also say size=8. > > > the result is that I get 56 jobs running on one node > > > > BTW, this is with the latest version of cog (as well as earlier ones) > and a version of swift that's working for me on bgp > > > Swift svn swift-r3432 (swift modified locally) cog-r2829 > > > > > > > my sites entry is > > > > > > > 01:00:00 > > 8 > > key="internalHostname">172.5.86.5 > > 16 > > 1 > > 1 > > 1.27 > > 10000 > > CI-CCR000013 > > > > /tmp > > /home/hockyg/reichman/glassy_dynamics/code/swift/run/real > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Aug 5 16:23:35 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Aug 2010 16:23:35 -0500 Subject: [Swift-devel] problem with coasters on pbs provider (on pads) In-Reply-To: References: Message-ID: <1281043415.11210.5.camel@blabla2.none> On Thu, 2010-08-05 at 17:14 -0400, Glen Hocky wrote: > I'm having a problem running on PADS. It seems that when I submit jobs > with workerspernode=8, the queuing system doesn't pick up on the fact > that each job submitted by swift should have ppn=8 (specifically, that > is missing from the qsub command registered by pbs) It's not meant to. At some point in the past the meaning of "workerspernode" has changed from "start n instances of the worker" to "submit at most n concurrent jobs to one worker". Since this applies to SMP "nodes" the end result is similar (i.e. n jobs per node), except only one worker.pl instance (and therefore only one TCP connection) is used per node. > > > when I do a qsub -f on my running jobs I get > submit_args = -A CI-CCR000013 -l nodes=1,walltime=02:00:00, > > size=1 /home/hockyg/.globus/scripts/PBS4482066898055181239.submit > > > so there's no ppn=8 and i think it should also say size=8. > > > the result is that I get 56 jobs running on one node > I lost you there. You get 56 j/n with coasters (which would be bad) or with the manual qsub (whose degree of badness I cannot assess)? 
Mihael From hategan at mcs.anl.gov Thu Aug 5 16:43:08 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Aug 2010 16:43:08 -0500 Subject: [Swift-devel] problem with coasters on pbs provider (on pads) In-Reply-To: References: <1281043415.11210.5.camel@blabla2.none> Message-ID: <1281044588.11402.4.camel@blabla2.none> Oh, I see. PADS treats cores as nodes. Then what Mike says: workersPerNode=1 and nodeGranularity=8. On Thu, 2010-08-05 at 17:27 -0400, Glen Hocky wrote: > with coasters some number of queue jobs spawn (16), and when 1 node > (with 8 cpus) became available, the queuing system starts 7 coasters > jobs which run 56 workers concurrently > > > since this happens, what settings should I pick to have my desired > behavior (i.e. 8 workers per node) > > On Thu, Aug 5, 2010 at 5:23 PM, Mihael Hategan > wrote: > On Thu, 2010-08-05 at 17:14 -0400, Glen Hocky wrote: > > I'm having a problem running on PADS. It seems that when I > submit jobs > > with workerspernode=8, the queuing system doesn't pick up on > the fact > > that each job submitted by swift should have ppn=8 > (specifically, that > > is missing from the qsub command registered by pbs) > > > It's not meant to. > > At some point in the past the meaning of "workerspernode" has > changed > from "start n instances of the worker" to "submit at most n > concurrent > jobs to one worker". > Since this applies to SMP "nodes" the end result is similar > (i.e. n jobs > per node), except only one worker.pl instance (and therefore > only one > TCP connection) is used per node. > > > > > > > when I do a qsub -f on my running jobs I get > > submit_args = -A CI-CCR000013 -l > nodes=1,walltime=02:00:00, > > > > > size=1 /home/hockyg/.globus/scripts/PBS4482066898055181239.submit > > > > > > so there's no ppn=8 and i think it should also say size=8. > > > > > > the result is that I get 56 jobs running on one node > > > > > I lost you there. You get 56 j/n with coasters (which would be > bad) or > with the manual qsub (whose degree of badness I cannot > assess)? > > Mihael > > > > From hategan at mcs.anl.gov Thu Aug 5 22:34:34 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 05 Aug 2010 22:34:34 -0500 Subject: [Swift-devel] persistent coaster service Message-ID: <1281065674.20004.2.camel@blabla2.none> ... was added in cog r2834. Despite having run a few jobs with it, I don't feel very confident about it. So please test. Start with bin/coaster-service and use "coaster-persistent" as provider in sites.xml. Everything else would be the same as in the "coaster" case. Mihael From dk0966 at cs.ship.edu Fri Aug 6 09:48:55 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 6 Aug 2010 10:48:55 -0400 Subject: [Swift-devel] Swift usage statistics Message-ID: Hello all, I have been working on putting together a process for recording Swift usage statistics. The new swift shell script (attached) will gather information such as the swift return code, length of swift script, hashed user id and script id. It will then use a utility called netcat to send the information via UDP packets to a remote listener (swift_listener.pl attached). The listener is currently running on bridled.ci.uchicago.edu. The listener is a perl script which will record that information, along with start time, stop time, hostname, and ip address. The listener stores this data in a MySQL database on db.ci.uchicago.edu called swiftusage. Cron will execute a script called check_listener.sh every 15 minutes to ensure the listener is still running. 
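For readers who prefer code to prose, the sending side of the usage-statistics scheme amounts to something like the following Java rendering of what the shell wrapper and netcat do: build one small record and fire it at the collector over UDP without ever blocking the user. The actual implementation is the attached shell and Perl scripts; the UDP port and record format below are invented placeholders, and only the collector host comes from the description above.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UsageStatsSender {
    // Fire-and-forget UDP report, roughly what the wrapper does with netcat.
    public static void send(String hashedUserId, String scriptId,
                            int scriptLength, int returnCode) {
        String record = String.format("uid=%s script=%s len=%d rc=%d",
                hashedUserId, scriptId, scriptLength, returnCode);
        byte[] payload = record.getBytes();
        DatagramSocket socket = null;
        try {
            socket = new DatagramSocket();
            InetAddress collector = InetAddress.getByName("bridled.ci.uchicago.edu");
            socket.send(new DatagramPacket(payload, payload.length, collector, 9999));
        } catch (Exception e) {
            // usage reporting must never break a run, so any failure is ignored
        } finally {
            if (socket != null) {
                socket.close();
            }
        }
    }
}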
The data can then be used to detect patterns and usage trends over time. Usage statistics can be disabled by setting ALLOW_USAGE_STATS to 0. What data is recorded, and how it is recorded, closely matches the Globus policy for usage statistics ( http://www.globus.org/toolkit/docs/4.2/4.2.1/admin/install/#gtadmin-config-usage-stats). I would like to add this to svn in the next few days. Please let me know if you have any concerns about how this works or have any ideas for how it could be improved. Thanks! Regards, David -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: swift_listener.pl Type: application/x-perl Size: 2587 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: swift Type: application/octet-stream Size: 3860 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: check_listener.sh Type: application/x-sh Size: 158 bytes Desc: not available URL: From iraicu at cs.uchicago.edu Sat Aug 7 14:28:18 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Sat, 07 Aug 2010 14:28:18 -0500 Subject: [Swift-devel] CFP: The 5th Workshop on Workflows in Support of Large-Scale Science 2010 Message-ID: <4C5DB3D2.2040703@cs.uchicago.edu> Call for Papers The 5th Workshop on Workflows in Support of Large-Scale Science in conjunction with SC?10 New Orleans, LA November 14, 2010 http://www.isi.edu/works10 Scientific workflows are a key technology that enables large-scale computations and service management on distributed resources. Workflows enable scientists to design complex analysis that are composed of individual application components or services and often such components and services are designed, developed, and tested collaboratively. The size of the data and the complexity of the analysis often lead to large amounts of shared resources, such as clusters and storage systems, being used to store the data sets and execute the workflows. The process of workflow design and execution in a distributed environment can be very complex and can involve multiple stages including their textual or graphical specification, the mapping of the high-level workflow descriptions onto the available resources, as well as monitoring and debugging of the subsequent execution. Further, since computations and data access operations are performed on shared resources, there is an increased interest in managing the fair allocation and management of those resources at the workflow level. Large-scale scientific applications pose several requirements on the workflow systems. Besides the magnitude of data processed by the workflow components, the intermediate and resulting data needs to be annotated with provenance and other information to evaluate the quality of the data and support the repeatability of the analysis. Further, adequate workflow descriptions are needed to support the complex workflow management process which includes workflow creation, workflow reuse, and modifications made to the workflow over time?for example modifications to the individual workflow components. Additional workflow annotations may provide guidelines and requirements for resource mapping and execution. The Fifth Workshop on Workflows in Support of Large-Scale Science focuses on the entire workflow lifecycle including the workflow composition, mapping, robust execution and the recording of provenance information. 
The workshop also welcomes contributions in the applications area, where the requirements on the workflow management systems can be derived. Special attention will be paid to Bio-Computing applications which are the theme for SC10. The topics of the workshop include but are not limited to: * Workflow applications and their requirements with special emphasis on Bio-Computing applications. * Workflow composition, tools and languages. * Workflow user environments, including portals. * Workflow refinement tools that can manage the workflow mapping process. * Workflow execution in distributed environments. * Workflow fault-tolerance and recovery techniques. * Data-driven workflow processing. * Adaptive workflows. * Workflow monitoring. * Workflow optimizations. * Performance analysis of workflows * Workflow debugging. * Workflow provenance. * Interactive workflows. * Workflow interoperability * Mashups and workflows * Workflows on the cloud. Important Dates: Papers due September 3, 2010 Notifications of acceptance September 30, 2010 Final papers due October 8, 2010 We will accept both short (6pages) and long (10 page) papers. The papers should be in IEEE format. To submit the papers, please submit to EasyChair at http://www.easychair.org/conferences/?conf=works10 If you have questions, please email works10 at isi.edu -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= From hockyg at uchicago.edu Mon Aug 9 16:44:06 2010 From: hockyg at uchicago.edu (Glen Hocky) Date: Mon, 9 Aug 2010 17:44:06 -0400 Subject: [Swift-devel] decaying number of coaster jobs leaves some tasks unfinished Message-ID: Hey everyone, I've been trying to run some short jobs in the "fast" queue on pads. That means I need to keep the wall time under 1 hour, and my tasks are around 20 min. What's been happening, at least for a smallish number of jobs, is that swift decreases the number of jobs submitted to the queue as the number of tasks is reduced and at the end, some tasks remain unfinished while no jobs are in the queue, and this continues indefinately. 
The following is one sites entry where I reproducibly had this problem for 70 tasks 3600 00:25:00 1 172.5.86.5 120 1 1 0.99 10000 CI-CCR000013 /tmp > /home/hockyg/reichman/glassy_dynamics/code/swift/run/real There are also some of this type of error Exception caught while unregistering channel > org.globus.cog.karajan.workflow.service.channels.ChannelException: Trying > to bind invalid channel (2027063355: {}) to 60652275: {} at > org.globus.cog.karajan.workflow.service.channels.MetaChannel.bind(MetaChannel.java:67) at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.unregisterChannel(ChannelManager.java:401) at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411) at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284) at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83) at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257) but i'm not sure that's related... Running with "Swift svn swift-r3432 (swift modified locally) cog-r2829" Swift output went something like Progress: Submitted:69 Active:1 Finished successfully:1 Progress: Submitted:67 Active:3 Finished successfully:1 Progress: Submitted:66 Active:4 Finished successfully:1 Progress: Submitted:65 Active:5 Finished successfully:1 Progress: Submitted:64 Active:6 Finished successfully:1 Progress: Submitted:61 Active:9 Finished successfully:1 Progress: Submitted:58 Active:12 Finished successfully:1 Progress: Submitted:57 Active:13 Finished successfully:1 Progress: Submitted:54 Active:16 Finished successfully:1 Progress: Submitted:52 Active:18 Finished successfully:1 Progress: Submitted:51 Active:19 Finished successfully:1 Progress: Submitted:50 Active:20 Finished successfully:1 Progress: Submitted:49 Active:21 Finished successfully:1 Progress: Submitted:48 Active:22 Finished successfully:1 Progress: Submitted:41 Active:29 Finished successfully:1 Progress: Submitted:38 Active:32 Finished successfully:1 Progress: Submitted:37 Active:33 Finished successfully:1 Progress: Submitted:35 Active:35 Finished successfully:1 Progress: Submitted:31 Active:39 Finished successfully:1 Progress: Submitted:30 Active:40 Finished successfully:1 Progress: Submitted:26 Active:44 Finished successfully:1 Progress: Submitted:26 Active:44 Finished successfully:1 Progress: Submitted:26 Active:44 Finished successfully:1 Progress: Submitted:26 Active:44 Finished successfully:1 Progress: Submitted:26 Active:44 Finished successfully:1 Progress: Submitted:26 Active:44 Finished successfully:1 Progress: Submitted:26 Active:43 Checking status:1 Finished successfully:1 Progress: Submitted:26 Active:43 Finished successfully:2 Progress: Submitted:26 Active:42 Checking status:1 Finished successfully:2 Progress: Submitted:25 Active:42 Checking status:1 Finished successfully:3 Progress: Submitted:25 Active:41 Checking status:1 Finished successfully:4 Progress: Submitted:25 Active:41 Finished successfully:5 Progress: Submitted:25 Active:40 Checking status:1 Finished successfully:5 Progress: Submitted:25 Active:39 Checking status:1 Finished successfully:6 Progress: Submitted:24 Active:40 Finished successfully:7 Progress: Submitted:24 Active:39 Checking status:1 Finished successfully:7 Progress: Submitted:24 Active:38 Checking status:1 Finished successfully:8 Progress: Submitted:24 
Active:38 Finished successfully:9 Progress: Submitted:24 Active:37 Checking status:1 Finished successfully:9 Progress: Submitted:24 Active:35 Checking status:1 Finished successfully:11 Progress: Submitted:23 Active:35 Checking status:1 Finished successfully:12 Progress: Submitted:22 Active:35 Checking status:1 Finished successfully:13 Progress: Submitted:22 Active:35 Finished successfully:14 Progress: Submitted:22 Active:34 Checking status:1 Finished successfully:14 Progress: Submitted:21 Active:34 Checking status:1 Finished successfully:15 Progress: Submitted:21 Active:34 Finished successfully:16 Progress: Submitted:21 Active:33 Checking status:1 Finished successfully:16 Progress: Submitted:21 Active:33 Finished successfully:17 Progress: Submitted:20 Active:32 Checking status:1 Finished successfully:18 Progress: Submitted:20 Active:32 Finished successfully:19 Progress: Submitted:20 Active:31 Checking status:1 Finished successfully:19 Progress: Submitted:19 Active:31 Finished successfully:21 Progress: Submitted:19 Active:30 Checking status:1 Finished successfully:21 Progress: Submitted:18 Active:30 Checking status:1 Finished successfully:22 Progress: Submitted:18 Active:29 Checking status:1 Finished successfully:23 Progress: Submitted:18 Active:28 Checking status:1 Finished successfully:24 Progress: Submitted:17 Active:29 Finished successfully:25 Progress: Submitted:17 Active:29 Finished successfully:25 Progress: Submitted:17 Active:28 Checking status:1 Finished successfully:25 Progress: Submitted:17 Active:27 Checking status:1 Finished successfully:26 Progress: Submitted:17 Active:26 Checking status:1 Finished successfully:27 Progress: Submitted:17 Active:25 Checking status:1 Finished successfully:28 Progress: Submitted:17 Active:24 Checking status:1 Finished successfully:29 Progress: Submitted:16 Active:25 Finished successfully:30 Progress: Submitted:16 Active:24 Checking status:1 Finished successfully:30 Progress: Submitted:15 Active:24 Checking status:1 Finished successfully:31 Progress: Submitted:15 Active:24 Finished successfully:32 Progress: Submitted:15 Active:23 Checking status:1 Finished successfully:32 Progress: Submitted:14 Active:24 Finished successfully:33 Progress: Submitted:14 Active:23 Checking status:1 Finished successfully:33 Progress: Submitted:14 Active:22 Checking status:1 Finished successfully:34 Progress: Submitted:14 Active:22 Finished successfully:35 Progress: Submitted:14 Active:21 Checking status:1 Finished successfully:35 Progress: Submitted:13 Active:22 Finished successfully:36 Progress: Submitted:13 Active:22 Finished successfully:36 Progress: Submitted:13 Active:20 Checking status:1 Finished successfully:37 Progress: Submitted:12 Active:21 Finished successfully:38 Progress: Submitted:12 Active:20 Checking status:1 Finished successfully:38 Progress: Submitted:12 Active:19 Checking status:1 Finished successfully:39 Progress: Submitted:12 Active:19 Finished successfully:40 Progress: Submitted:12 Active:18 Checking status:1 Finished successfully:40 Progress: Submitted:12 Active:17 Checking status:1 Finished successfully:41 Progress: Submitted:11 Active:17 Checking status:1 Finished successfully:42 Progress: Submitted:11 Active:17 Finished successfully:43 Progress: Submitted:11 Active:16 Checking status:1 Finished successfully:43 Progress: Submitted:11 Active:15 Checking status:1 Finished successfully:44 Progress: Submitted:10 Active:16 Finished successfully:45 Progress: Submitted:3 Active:22 Finished successfully:46 Progress: Submitted:3 Active:21 Checking 
Progress: Submitted:3 Active:19 Finished successfully:49
Progress: Submitted:3 Active:19 Finished successfully:49
Progress: Submitted:2 Active:20 Finished successfully:49
Progress: Submitted:1 Active:21 Finished successfully:49
.
.
.
Progress: Submitted:1 Active:15 Finished successfully:55
Progress: Submitted:1 Active:15 Finished successfully:55
Progress: Submitted:1 Active:15 Finished successfully:55
Progress: Submitted:1 Active:15 Finished successfully:55
Progress: Submitted:1 Active:14 Checking status:1 Finished successfully:55
Progress: Submitted:1 Active:14 Finished successfully:56
Progress: Submitted:1 Active:13 Checking status:1 Finished successfully:56
Progress: Submitted:1 Active:12 Checking status:1 Finished successfully:57
Progress: Submitted:1 Active:12 Finished successfully:58
Progress: Submitted:1 Active:11 Checking status:1 Finished successfully:58
Progress: Submitted:1 Active:10 Checking status:1 Finished successfully:59
Progress: Submitted:1 Active:10 Finished successfully:60
Progress: Submitted:1 Active:10 Finished successfully:60
Progress: Submitted:1 Active:10 Finished successfully:60
Progress: Submitted:1 Active:8 Checking status:1 Finished successfully:61
Progress: Submitted:1 Active:8 Finished successfully:62
Progress: Submitted:1 Active:7 Checking status:1 Finished successfully:62
Progress: Submitted:1 Active:7 Finished successfully:63
Progress: Submitted:1 Active:6 Checking status:1 Finished successfully:63
Progress: Submitted:1 Active:4 Checking status:1 Finished successfully:65
Progress: Submitted:1 Active:3 Checking status:1 Finished successfully:66
Progress: Submitted:1 Active:3 Finished successfully:67
Progress: Submitted:1 Active:3 Finished successfully:67
Progress: Submitted:1 Active:2 Checking status:1 Finished successfully:67
Progress: Submitted:1 Active:2 Finished successfully:68
Progress: Submitted:1 Active:1 Checking status:1 Finished successfully:68
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70
Progress: Submitted:1 Finished successfully:70

etc

From hategan at mcs.anl.gov Mon Aug 9 16:49:30 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Aug 2010 16:49:30 -0500 Subject: [Swift-devel] decaying number of coaster jobs leaves some tasks unfinished In-Reply-To: References: Message-ID: <1281390570.13731.1.camel@blabla2.none>

That error might be related. Can I have the full log?

On Mon, 2010-08-09 at 17:44 -0400, Glen Hocky wrote:
> Hey everyone,
> I've been trying to run some short jobs in the "fast" queue on pads.
> That means I need to keep the wall time under 1 hour, and my tasks are
> around 20 min.
From wilde at mcs.anl.gov Mon Aug 9 17:01:24 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 9 Aug 2010 16:01:24 -0600 (GMT-06:00) Subject: [Swift-devel] decaying number of coaster jobs leaves some tasks unfinished In-Reply-To: <1281390570.13731.1.camel@blabla2.none> Message-ID: <931258139.713821281391284726.JavaMail.root@zimbra.anl.gov>

I have seen a common problem where maxwalltime on queued jobs exceeds maxtime, in which case Swift hangs, never finding a block it can fit the jobs into.

I wonder if this is another manifestation of that behavior/bug: the time left in the running block is less than the 25 min maxwalltime for the remaining tasks, and Swift does not realize that it needs to end that block and start a new one.

Did you leave the tail end of this run running long enough for the current block to end, to see if it starts a new 3600 second block?

I'm just surmising one possible cause; the actual problem here might be completely different.

- Mike

----- "Mihael Hategan" wrote:
> That error might be related. Can I have the full log?
>
> On Mon, 2010-08-09 at 17:44 -0400, Glen Hocky wrote:
> > Hey everyone,
> > I've been trying to run some short jobs in the "fast" queue on pads.
> > ...
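In sites.xml terms, the fit described above comes down to two settings; the profile namespace is assumed here and the arithmetic in the comment is only illustrative:

    <!-- each coaster block asks the queue for maxtime = 3600 s (the one-hour
         "fast" limit), while each task declares maxwalltime = 00:25:00 (1500 s);
         two such tasks fit sequentially in one block (2 x 1500 = 3000 <= 3600),
         but once fewer than 1500 s remain in a running block, no further task
         fits there and a new block has to be allocated -->
    <profile namespace="globus" key="maxtime">3600</profile>
    <profile namespace="globus" key="maxwalltime">00:25:00</profile>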
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From aespinosa at cs.uchicago.edu Mon Aug 9 17:09:16 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 9 Aug 2010 17:09:16 -0500 Subject: [Swift-devel] persistent coaster service In-Reply-To: <1281065674.20004.2.camel@blabla2.none> References: <1281065674.20004.2.camel@blabla2.none> Message-ID: <20100809220916.GD2796@origin>

I tried it today on OSG. The coaster service was run on bridled.ci.
But from the session below, it looks like it's connecting to the site headnode instead:

RunID: coaster
Progress:
Progress: uninitialized:1 Selecting site:675 Initializing site shared directory:1
Progress: Initializing:2 Selecting site:1444 Initializing site shared directory:1
Progress: uninitialized:1 Selecting site:2499 Initializing site shared directory:1
Progress: uninitialized:1 Selecting site:3818 Initializing site shared directory:1
Progress: uninitialized:1 Initializing:1 Selecting site:4201 Initializing site shared directory:1
Progress: Initializing:1 Selecting site:3 Stage in:4202
Progress: uninitialized:1 Initializing:1 Selecting site:5 Submitting:4202
Progress: Initializing:1 Selecting site:6 Stage in:2 Submitting:4202
Find: https://ff-grid2.unl.edu:1984
Find: keepalive(120), reconnect - https://ff-grid2.unl.edu:1984
Progress: Initializing:2 Selecting site:6 Stage in:144 Submitting:4303 Failed but can retry:16
Progress: Initializing:2 Selecting site:31 Stage in:80 Submitting:4945 Failed but can retry:54
Progress: Initializing:1 Selecting site:6 Stage in:2 Submitting:5222 Failed but can retry:68
Progress: Initializing:1 Selecting site:6 Stage in:1 Submitting:5686 Submitted:1 Failed but can retry:95
...
...

Corresponding log entry (IMO):

2010-08-09 17:01:31,690-0500 WARN RemoteConfiguration Find: https://ff-grid2.unl.edu:1984
2010-08-09 17:01:31,690-0500 WARN RemoteConfiguration Find: keepalive(120), reconnect - https://ff-grid2.unl.edu:1984

sites.xml:

    jobmanager="gt2:gt2:pbs"
    86400 1290 0.8 10 20 true 1500.0 51.54
    /panfs/panasas/CMS/data/engage-scec/swift_scratch

-Allan

On Thu, Aug 05, 2010 at 10:34:34PM -0500, Mihael Hategan wrote:
> ... was added in cog r2834.
>
> Despite having run a few jobs with it, I don't feel very confident about
> it. So please test.
>
> Start with bin/coaster-service and use "coaster-persistent" as provider
> in sites.xml. Everything else would be the same as in the "coaster"
> case.
>
> Mihael

From hockyg at uchicago.edu Mon Aug 9 17:19:41 2010 From: hockyg at uchicago.edu (Glen Hocky) Date: Mon, 9 Aug 2010 18:19:41 -0400 Subject: [Swift-devel] decaying number of coaster jobs leaves some tasks unfinished In-Reply-To: <931258139.713821281391284726.JavaMail.root@zimbra.anl.gov> References: <1281390570.13731.1.camel@blabla2.none> <931258139.713821281391284726.JavaMail.root@zimbra.anl.gov> Message-ID:

Here's the full log (I think). What Mike's describing is basically my gut feeling as well...

> Did you leave the tail end of this run running long enough for the current
> block to end, to see if it starts a new 3600 second block?

A different run from before I tried to reproduce the problem ran like that all last night without starting any new blocks (the settings were very slightly different, with fewer "slots", and it stalled with 7 jobs left, I think).

On Mon, Aug 9, 2010 at 6:01 PM, Michael Wilde wrote:
> I have seen a common problem where maxwalltime on queued jobs exceeds
> maxtime, in which case Swift hangs, never finding a block it can fit the
> jobs into.
> ...
-------------- next part --------------
A non-text attachment was scrubbed...
Name: glassEquilCavities-20100809-1547-i1s75vd0.log
Type: application/octet-stream
Size: 1659730 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Mon Aug 9 17:36:59 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Aug 2010 17:36:59 -0500 Subject: [Swift-devel] persistent coaster service In-Reply-To: <20100809220916.GD2796@origin> References: <1281065674.20004.2.camel@blabla2.none> <20100809220916.GD2796@origin> Message-ID: <1281393419.14191.0.camel@blabla2.none>

ff-grid2.unl.edu is the url you are supplying in sites.xml. It's connecting to that. Though I'm surprised it works given that you are implying that there is no service running there.
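Concretely, that means the execution url in sites.xml has to name the host where bin/coaster-service was actually started, not the site head node. A sketch of such an entry; the pool handle, the https scheme, the full host name, the port (echoing the 1984 seen in the Find: lines above) and the work directory placement are assumptions, not a tested configuration:

    <pool handle="osg-site">
      <!-- url names the machine running bin/coaster-service (bridled.ci here);
           handle, scheme, full host name and port are assumed -->
      <execution provider="coaster-persistent"
                 url="https://bridled.ci.uchicago.edu:1984"
                 jobmanager="gt2:gt2:pbs"/>
      <workdirectory>/panfs/panasas/CMS/data/engage-scec/swift_scratch</workdirectory>
    </pool>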
On Mon, 2010-08-09 at 17:09 -0500, Allan Espinosa wrote:
> I tried it today on OSG. The coaster service was run on bridled.ci. But from
> the session below, it looks like it's connecting to the site headnode instead:
> ...

From hategan at mcs.anl.gov Mon Aug 9 17:37:27 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Aug 2010 17:37:27 -0500 Subject: [Swift-devel] decaying number of coaster jobs leaves some tasks unfinished In-Reply-To: References: <1281390570.13731.1.camel@blabla2.none> <931258139.713821281391284726.JavaMail.root@zimbra.anl.gov> Message-ID: <1281393447.14191.1.camel@blabla2.none>

On Mon, 2010-08-09 at 18:19 -0400, Glen Hocky wrote:
> Here's the full log (I think).
>
> What Mike's describing is basically my gut feeling as well...

Right. The log should tell us.
> > > > > > > > > The following is one sites entry where I reproducibly had > this > > problem > > > for 70 tasks > > > > > > > > > > > jobManager="local:pbs"/> > > > > > > > key="maxtime">3600 > > > > > key="maxwalltime">00:25:00 > > > > > key="workersPerNode">1 > > > > > key="internalHostname">172.5.86.5 > > > key="slots">120 > > > > > key="nodeGranularity">1 > > > key="maxNodes">1 > > > > > key="jobThrottle">0.99 > > > > > key="initialScore">10000 > > > > > key="project">CI-CCR000013 > > > > > > /tmp > > > > > > > > > /home/hockyg/reichman/glassy_dynamics/code/swift/run/real > > > > > > > > > > > > > > > There are also some of this type of error > > > Exception caught while unregistering channel > > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelException: > > Trying to bind invalid channel (2027063355: {}) to 60652275: > {} > > > at > > > > > > org.globus.cog.karajan.workflow.service.channels.MetaChannel.bind(MetaChannel.java:67) > > > at > > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.unregisterChannel(ChannelManager.java:401) > > > at > > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411) > > > at > > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284) > > > at > > > > > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83) > > > at > > > > > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257) > > > but i'm not sure that's related... > > > > > > > > > > > > > > > Running with "Swift svn swift-r3432 (swift modified > locally) > > > cog-r2829" > > > > > > > > > Swift output went something like > > > Progress: Submitted:69 Active:1 Finished successfully:1 > > > Progress: Submitted:67 Active:3 Finished successfully:1 > > > Progress: Submitted:66 Active:4 Finished successfully:1 > > > Progress: Submitted:65 Active:5 Finished successfully:1 > > > Progress: Submitted:64 Active:6 Finished successfully:1 > > > Progress: Submitted:61 Active:9 Finished successfully:1 > > > Progress: Submitted:58 Active:12 Finished > successfully:1 > > > Progress: Submitted:57 Active:13 Finished > successfully:1 > > > Progress: Submitted:54 Active:16 Finished > successfully:1 > > > Progress: Submitted:52 Active:18 Finished > successfully:1 > > > Progress: Submitted:51 Active:19 Finished > successfully:1 > > > Progress: Submitted:50 Active:20 Finished > successfully:1 > > > Progress: Submitted:49 Active:21 Finished > successfully:1 > > > Progress: Submitted:48 Active:22 Finished > successfully:1 > > > Progress: Submitted:41 Active:29 Finished > successfully:1 > > > Progress: Submitted:38 Active:32 Finished > successfully:1 > > > Progress: Submitted:37 Active:33 Finished > successfully:1 > > > Progress: Submitted:35 Active:35 Finished > successfully:1 > > > Progress: Submitted:31 Active:39 Finished > successfully:1 > > > Progress: Submitted:30 Active:40 Finished > successfully:1 > > > Progress: Submitted:26 Active:44 Finished > successfully:1 > > > Progress: Submitted:26 Active:44 Finished > successfully:1 > > > Progress: Submitted:26 Active:44 Finished > successfully:1 > > > Progress: Submitted:26 Active:44 Finished > successfully:1 > > > Progress: Submitted:26 Active:44 Finished > successfully:1 > > > Progress: Submitted:26 Active:44 Finished > successfully:1 > > > Progress: 
Submitted:26 Active:43 Checking status:1 > Finished > > > successfully:1 > > > Progress: Submitted:26 Active:43 Finished > successfully:2 > > > Progress: Submitted:26 Active:42 Checking status:1 > Finished > > > successfully:2 > > > Progress: Submitted:25 Active:42 Checking status:1 > Finished > > > successfully:3 > > > Progress: Submitted:25 Active:41 Checking status:1 > Finished > > > successfully:4 > > > Progress: Submitted:25 Active:41 Finished > successfully:5 > > > Progress: Submitted:25 Active:40 Checking status:1 > Finished > > > successfully:5 > > > Progress: Submitted:25 Active:39 Checking status:1 > Finished > > > successfully:6 > > > Progress: Submitted:24 Active:40 Finished > successfully:7 > > > Progress: Submitted:24 Active:39 Checking status:1 > Finished > > > successfully:7 > > > Progress: Submitted:24 Active:38 Checking status:1 > Finished > > > successfully:8 > > > Progress: Submitted:24 Active:38 Finished > successfully:9 > > > Progress: Submitted:24 Active:37 Checking status:1 > Finished > > > successfully:9 > > > Progress: Submitted:24 Active:35 Checking status:1 > Finished > > > successfully:11 > > > Progress: Submitted:23 Active:35 Checking status:1 > Finished > > > successfully:12 > > > Progress: Submitted:22 Active:35 Checking status:1 > Finished > > > successfully:13 > > > Progress: Submitted:22 Active:35 Finished > successfully:14 > > > Progress: Submitted:22 Active:34 Checking status:1 > Finished > > > successfully:14 > > > Progress: Submitted:21 Active:34 Checking status:1 > Finished > > > successfully:15 > > > Progress: Submitted:21 Active:34 Finished > successfully:16 > > > Progress: Submitted:21 Active:33 Checking status:1 > Finished > > > successfully:16 > > > Progress: Submitted:21 Active:33 Finished > successfully:17 > > > Progress: Submitted:20 Active:32 Checking status:1 > Finished > > > successfully:18 > > > Progress: Submitted:20 Active:32 Finished > successfully:19 > > > Progress: Submitted:20 Active:31 Checking status:1 > Finished > > > successfully:19 > > > Progress: Submitted:19 Active:31 Finished > successfully:21 > > > Progress: Submitted:19 Active:30 Checking status:1 > Finished > > > successfully:21 > > > Progress: Submitted:18 Active:30 Checking status:1 > Finished > > > successfully:22 > > > Progress: Submitted:18 Active:29 Checking status:1 > Finished > > > successfully:23 > > > Progress: Submitted:18 Active:28 Checking status:1 > Finished > > > successfully:24 > > > Progress: Submitted:17 Active:29 Finished > successfully:25 > > > Progress: Submitted:17 Active:29 Finished > successfully:25 > > > Progress: Submitted:17 Active:28 Checking status:1 > Finished > > > successfully:25 > > > Progress: Submitted:17 Active:27 Checking status:1 > Finished > > > successfully:26 > > > Progress: Submitted:17 Active:26 Checking status:1 > Finished > > > successfully:27 > > > Progress: Submitted:17 Active:25 Checking status:1 > Finished > > > successfully:28 > > > Progress: Submitted:17 Active:24 Checking status:1 > Finished > > > successfully:29 > > > Progress: Submitted:16 Active:25 Finished > successfully:30 > > > Progress: Submitted:16 Active:24 Checking status:1 > Finished > > > successfully:30 > > > Progress: Submitted:15 Active:24 Checking status:1 > Finished > > > successfully:31 > > > Progress: Submitted:15 Active:24 Finished > successfully:32 > > > Progress: Submitted:15 Active:23 Checking status:1 > Finished > > > successfully:32 > > > Progress: Submitted:14 Active:24 Finished > successfully:33 > > > Progress: Submitted:14 Active:23 Checking 
status:1 > Finished > > > successfully:33 > > > Progress: Submitted:14 Active:22 Checking status:1 > Finished > > > successfully:34 > > > Progress: Submitted:14 Active:22 Finished > successfully:35 > > > Progress: Submitted:14 Active:21 Checking status:1 > Finished > > > successfully:35 > > > Progress: Submitted:13 Active:22 Finished > successfully:36 > > > Progress: Submitted:13 Active:22 Finished > successfully:36 > > > Progress: Submitted:13 Active:20 Checking status:1 > Finished > > > successfully:37 > > > Progress: Submitted:12 Active:21 Finished > successfully:38 > > > Progress: Submitted:12 Active:20 Checking status:1 > Finished > > > successfully:38 > > > Progress: Submitted:12 Active:19 Checking status:1 > Finished > > > successfully:39 > > > Progress: Submitted:12 Active:19 Finished > successfully:40 > > > Progress: Submitted:12 Active:18 Checking status:1 > Finished > > > successfully:40 > > > Progress: Submitted:12 Active:17 Checking status:1 > Finished > > > successfully:41 > > > Progress: Submitted:11 Active:17 Checking status:1 > Finished > > > successfully:42 > > > Progress: Submitted:11 Active:17 Finished > successfully:43 > > > Progress: Submitted:11 Active:16 Checking status:1 > Finished > > > successfully:43 > > > Progress: Submitted:11 Active:15 Checking status:1 > Finished > > > successfully:44 > > > Progress: Submitted:10 Active:16 Finished > successfully:45 > > > Progress: Submitted:3 Active:22 Finished > successfully:46 > > > Progress: Submitted:3 Active:21 Checking status:1 > Finished > > > successfully:46 > > > Progress: Submitted:3 Active:19 Finished > successfully:49 > > > Progress: Submitted:3 Active:19 Finished > successfully:49 > > > Progress: Submitted:2 Active:20 Finished > successfully:49 > > > Progress: Submitted:1 Active:21 Finished > successfully:49 > > > . > > > . > > > . 
> > > Progress: Submitted:1 Active:15 Finished successfully:55
> > > Progress: Submitted:1 Active:14 Checking status:1 Finished successfully:55
> > > Progress: Submitted:1 Active:14 Finished successfully:56
> > > Progress: Submitted:1 Active:13 Checking status:1 Finished successfully:56
> > > Progress: Submitted:1 Active:12 Checking status:1 Finished successfully:57
> > > Progress: Submitted:1 Active:12 Finished successfully:58
> > > Progress: Submitted:1 Active:11 Checking status:1 Finished successfully:58
> > > Progress: Submitted:1 Active:10 Checking status:1 Finished successfully:59
> > > Progress: Submitted:1 Active:10 Finished successfully:60
> > > Progress: Submitted:1 Active:8 Checking status:1 Finished successfully:61
> > > Progress: Submitted:1 Active:8 Finished successfully:62
> > > Progress: Submitted:1 Active:7 Checking status:1 Finished successfully:62
> > > Progress: Submitted:1 Active:7 Finished successfully:63
> > > Progress: Submitted:1 Active:6 Checking status:1 Finished successfully:63
> > > Progress: Submitted:1 Active:4 Checking status:1 Finished successfully:65
> > > Progress: Submitted:1 Active:3 Checking status:1 Finished successfully:66
> > > Progress: Submitted:1 Active:3 Finished successfully:67
> > > Progress: Submitted:1 Active:2 Checking status:1 Finished successfully:67
> > > Progress: Submitted:1 Active:2 Finished successfully:68
> > > Progress: Submitted:1 Active:1 Checking status:1 Finished successfully:68
> > > Progress: Submitted:1 Finished successfully:70
> > > Progress: Submitted:1 Finished successfully:70
> > > etc
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>

From hategan at mcs.anl.gov Mon Aug 9 17:43:36 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Aug 2010 17:43:36 -0500
Subject: [Swift-devel] decaying number of coaster jobs leaves some tasks
	unfinished
In-Reply-To:
<1281393447.14191.1.camel@blabla2.none> References: <1281390570.13731.1.camel@blabla2.none> <931258139.713821281391284726.JavaMail.root@zimbra.anl.gov> <1281393447.14191.1.camel@blabla2.none> Message-ID: <1281393816.14191.4.camel@blabla2.none> On Mon, 2010-08-09 at 17:37 -0500, Mihael Hategan wrote: > On Mon, 2010-08-09 at 18:19 -0400, Glen Hocky wrote: > > Here's the full log (I think). > > > > What's Mike's describing is basically my gut feeling as well... > > Right. The log should tell us. The log, towards the end, says that the block scheduler sees no jobs. Which means that either it failed to notify swift when a block failed bringing the jobs down with it, or it otherwise failed to notify swift that the job failed. I should probably run a few things on PADS and try to iron out these things. In the mean time, is this the normal behavior that you see or do you also have successful runs? Mihael From aespinosa at cs.uchicago.edu Mon Aug 9 18:03:55 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 9 Aug 2010 18:03:55 -0500 Subject: [Swift-devel] persistent coaster service In-Reply-To: <1281393419.14191.0.camel@blabla2.none> References: <1281065674.20004.2.camel@blabla2.none> <20100809220916.GD2796@origin> <1281393419.14191.0.camel@blabla2.none> Message-ID: <20100809230355.GA2078@origin> Ah. so the persistent coaster service is meant to run with the manual workers? -Allan On Mon, Aug 09, 2010 at 05:36:59PM -0500, Mihael Hategan wrote: > ff-grid2.unl.edu is the url you are supplying in sites.xml. It's > connecting to that. Though I'm surprised it works given that you are > implying that there is no service running there. > > On Mon, 2010-08-09 at 17:09 -0500, Allan Espinosa wrote: > > I tried it today on OSG. The coaster service was run on bridled.ci . But from > > the session below, it looks like its connecting to the site headnode instead: > > > > RunID: coaster > > Progress: > > Progress: uninitialized:1 Selecting site:675 Initializing site shared > > directory:1 > > Progress: Initializing:2 Selecting site:1444 Initializing site shared > > directory:1 > > Progress: uninitialized:1 Selecting site:2499 Initializing site shared > > directory:1 > > Progress: uninitialized:1 Selecting site:3818 Initializing site shared > > directory:1 > > Progress: uninitialized:1 Initializing:1 Selecting site:4201 Initializing > > site shared directory:1 > > Progress: Initializing:1 Selecting site:3 Stage in:4202 > > Progress: uninitialized:1 Initializing:1 Selecting site:5 Submitting:4202 > > Progress: Initializing:1 Selecting site:6 Stage in:2 Submitting:4202 > > Find: https://ff-grid2.unl.edu:1984 > > Find: keepalive(120), reconnect - https://ff-grid2.unl.edu:1984 > > Progress: Initializing:2 Selecting site:6 Stage in:144 Submitting:4303 > > Failed but can retry:16 > > Progress: Initializing:2 Selecting site:31 Stage in:80 Submitting:4945 > > Failed but can retry:54 > > Progress: Initializing:1 Selecting site:6 Stage in:2 Submitting:5222 Failed > > but can retry:68 > > Progress: Initializing:1 Selecting site:6 Stage in:1 Submitting:5686 > > Submitted:1 Failed but can retry:95 > > ... > > ... 
> > > > Corresponding log entry (IMO): > > 2010-08-09 17:01:31,690-0500 WARN RemoteConfiguration Find: > > https://ff-grid2.unl.edu:1984 > > 2010-08-09 17:01:31,690-0500 WARN RemoteConfiguration Find: keepalive(120), > > reconnect - https://ff-grid2.unl.edu:1984 > > > > > > > > sites.xml > > > > > jobmanager="gt2:gt2:pbs" /> > > > > 86400 > > 1290 > > 0.8 > > 10 > > 20 > > true > > > > 1500.0 > > 51.54 > > > > > > /panfs/panasas/CMS/data/engage-scec/swift_scratch > > > > > > > > -Allan > > > > On Thu, Aug 05, 2010 at 10:34:34PM -0500, Mihael Hategan wrote: > > > > > ... was added in cog r2834. > > > > > > Despite having run a few jobs with it, I don't feel very confident about > > > it. So please test. > > > > > > Start with bin/coaster-service and use "coaster-persistent" as provider > > > in sites.xml. Everything else would be the same as in the "coaster" > > > case. > > > > > > Mihael From hategan at mcs.anl.gov Mon Aug 9 18:07:40 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Aug 2010 18:07:40 -0500 Subject: [Swift-devel] persistent coaster service In-Reply-To: <20100809230355.GA2078@origin> References: <1281065674.20004.2.camel@blabla2.none> <20100809220916.GD2796@origin> <1281393419.14191.0.camel@blabla2.none> <20100809230355.GA2078@origin> Message-ID: <1281395260.14643.1.camel@blabla2.none> On Mon, 2010-08-09 at 18:03 -0500, Allan Espinosa wrote: > Ah. so the persistent coaster service is meant to run with the manual workers? No. It's like, say, GRAM, in that you need to start a service on some head node, and you need to supply the URL of that head node in sites.xml. It won't start the service automatically. > > -Allan > > On Mon, Aug 09, 2010 at 05:36:59PM -0500, Mihael Hategan wrote: > > ff-grid2.unl.edu is the url you are supplying in sites.xml. It's > > connecting to that. Though I'm surprised it works given that you are > > implying that there is no service running there. > > > > On Mon, 2010-08-09 at 17:09 -0500, Allan Espinosa wrote: > > > I tried it today on OSG. The coaster service was run on bridled.ci . But from > > > the session below, it looks like its connecting to the site headnode instead: > > > > > > RunID: coaster > > > Progress: > > > Progress: uninitialized:1 Selecting site:675 Initializing site shared > > > directory:1 > > > Progress: Initializing:2 Selecting site:1444 Initializing site shared > > > directory:1 > > > Progress: uninitialized:1 Selecting site:2499 Initializing site shared > > > directory:1 > > > Progress: uninitialized:1 Selecting site:3818 Initializing site shared > > > directory:1 > > > Progress: uninitialized:1 Initializing:1 Selecting site:4201 Initializing > > > site shared directory:1 > > > Progress: Initializing:1 Selecting site:3 Stage in:4202 > > > Progress: uninitialized:1 Initializing:1 Selecting site:5 Submitting:4202 > > > Progress: Initializing:1 Selecting site:6 Stage in:2 Submitting:4202 > > > Find: https://ff-grid2.unl.edu:1984 > > > Find: keepalive(120), reconnect - https://ff-grid2.unl.edu:1984 > > > Progress: Initializing:2 Selecting site:6 Stage in:144 Submitting:4303 > > > Failed but can retry:16 > > > Progress: Initializing:2 Selecting site:31 Stage in:80 Submitting:4945 > > > Failed but can retry:54 > > > Progress: Initializing:1 Selecting site:6 Stage in:2 Submitting:5222 Failed > > > but can retry:68 > > > Progress: Initializing:1 Selecting site:6 Stage in:1 Submitting:5686 > > > Submitted:1 Failed but can retry:95 > > > ... > > > ... 
> > > > > > Corresponding log entry (IMO): > > > 2010-08-09 17:01:31,690-0500 WARN RemoteConfiguration Find: > > > https://ff-grid2.unl.edu:1984 > > > 2010-08-09 17:01:31,690-0500 WARN RemoteConfiguration Find: keepalive(120), > > > reconnect - https://ff-grid2.unl.edu:1984 > > > > > > > > > > > > sites.xml > > > > > > > > jobmanager="gt2:gt2:pbs" /> > > > > > > 86400 > > > 1290 > > > 0.8 > > > 10 > > > 20 > > > true > > > > > > 1500.0 > > > 51.54 > > > > > > > > > /panfs/panasas/CMS/data/engage-scec/swift_scratch > > > > > > > > > > > > -Allan > > > > > > On Thu, Aug 05, 2010 at 10:34:34PM -0500, Mihael Hategan wrote: > > > > > > > ... was added in cog r2834. > > > > > > > > Despite having run a few jobs with it, I don't feel very confident about > > > > it. So please test. > > > > > > > > Start with bin/coaster-service and use "coaster-persistent" as provider > > > > in sites.xml. Everything else would be the same as in the "coaster" > > > > case. > > > > > > > > Mihael > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Mon Aug 9 18:14:39 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 9 Aug 2010 18:14:39 -0500 Subject: [Swift-devel] persistent coaster service In-Reply-To: <1281395260.14643.1.camel@blabla2.none> References: <1281065674.20004.2.camel@blabla2.none> <20100809220916.GD2796@origin> <1281393419.14191.0.camel@blabla2.none> <20100809230355.GA2078@origin> <1281395260.14643.1.camel@blabla2.none> Message-ID: <20100809231439.GB2078@origin> So the url should be "bridled.ci.uchicago.edu" since I run the service there. But this same field is also used for spawning the workers unless it specifies "manual coasters" right? -Allan On Mon, Aug 09, 2010 at 06:07:40PM -0500, Mihael Hategan wrote: > On Mon, 2010-08-09 at 18:03 -0500, Allan Espinosa wrote: > > Ah. so the persistent coaster service is meant to run with the manual workers? > > No. It's like, say, GRAM, in that you need to start a service on some > head node, and you need to supply the URL of that head node in > sites.xml. > > It won't start the service automatically. > > > > > -Allan > > > > On Mon, Aug 09, 2010 at 05:36:59PM -0500, Mihael Hategan wrote: > > > ff-grid2.unl.edu is the url you are supplying in sites.xml. It's > > > connecting to that. Though I'm surprised it works given that you are > > > implying that there is no service running there. > > > > > > On Mon, 2010-08-09 at 17:09 -0500, Allan Espinosa wrote: > > > > I tried it today on OSG. The coaster service was run on bridled.ci . 
But from
> > > > the session below, it looks like its connecting to the site headnode instead:
> > > >
> > > > RunID: coaster
> > > > Progress:
> > > > ...
> > > > Find: https://ff-grid2.unl.edu:1984
> > > > Find: keepalive(120), reconnect - https://ff-grid2.unl.edu:1984
> > > > ...
> > > >
> > > > -Allan
> > > >
> > > > On Thu, Aug 05, 2010 at 10:34:34PM -0500, Mihael Hategan wrote:
> > > >
> > > > > ... was added in cog r2834.
> > > > > ...
> > > > > Mihael

From hategan at mcs.anl.gov Mon Aug 9 18:17:11 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Aug 2010 18:17:11 -0500
Subject: [Swift-devel] persistent coaster service
In-Reply-To: <20100809231439.GB2078@origin>
References: <1281065674.20004.2.camel@blabla2.none>
	<20100809220916.GD2796@origin> <1281393419.14191.0.camel@blabla2.none>
	<20100809230355.GA2078@origin> <1281395260.14643.1.camel@blabla2.none>
	<20100809231439.GB2078@origin>
Message-ID: <1281395831.14764.0.camel@blabla2.none>

On Mon, 2010-08-09 at 18:14 -0500, Allan Espinosa wrote:
> So the url should be "bridled.ci.uchicago.edu" since I run the service
> there. But this same field is also used for spawning the workers unless it
> specifies "manual coasters" right?

Right.

>
> -Allan
>
> On Mon, Aug 09, 2010 at 06:07:40PM -0500, Mihael Hategan wrote:
> > On Mon, 2010-08-09 at 18:03 -0500, Allan Espinosa wrote:
> > > Ah. so the persistent coaster service is meant to run with the manual workers?
> >
> > No. It's like, say, GRAM, in that you need to start a service on some
> > head node, and you need to supply the URL of that head node in
> > sites.xml.
> >
> > It won't start the service automatically.
> >
> > > ...
> > > > > On Thu, Aug 05, 2010 at 10:34:34PM -0500, Mihael Hategan wrote:
> > > > >
> > > > > > ... was added in cog r2834.
> > > > > >
> > > > > > Despite having run a few jobs with it, I don't feel very confident about
> > > > > > it. So please test.
> > > > > >
> > > > > > Start with bin/coaster-service and use "coaster-persistent" as provider
> > > > > > in sites.xml. Everything else would be the same as in the "coaster"
> > > > > > case.
> > > > > >
> > > > > > Mihael
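Pulling the advice in this thread together, a sites.xml entry for a hand-started persistent coaster service might look roughly like the sketch below. The pool handle, host, port, jobmanager and paths are illustrative (loosely based on values mentioned in the thread), not a verified configuration:

    <pool handle="osg-site">
      <!-- url points at the head node where bin/coaster-service was started
           by hand; the host and port shown here are examples only -->
      <execution provider="coaster-persistent"
                 url="https://bridled.ci.uchicago.edu:1984"
                 jobmanager="gt2:gt2:pbs" />
      <!-- optional: address the workers should use to reach the service,
           as discussed later in the "Worker connection" thread -->
      <profile namespace="globus" key="internalHostname">x.y.z.w</profile>
      <workdirectory>/path/to/swift_scratch</workdirectory>
    </pool>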
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Tue Aug 10 12:38:24 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Aug 2010 11:38:24 -0600 (GMT-06:00)
Subject: [Swift-devel] How to update the swift command for usage tracking?
In-Reply-To:
Message-ID: <1006071066.740891281461904291.JavaMail.root@zimbra.anl.gov>

Mihael, can you provide guidance for how David should proceed?

Thanks,

Mike

----- Forwarded Message -----
From: "David Kelly"
To: wilde at mcs.anl.gov
Sent: Tuesday, August 10, 2010 12:34:50 PM GMT -06:00 US/Canada Central
Subject: Re: Swift listener

Hi Mike,

I wanted to add the new swift shell script with usage tracking last night,
but noticed that the swift shell script is not directly a part of SVN. It
gets generated during "ant dist" based on cog/etc/unix/launcher-template.
I think I can get around this by editing cog/modules/swift/build.xml. The
option should allow me to add my stuff from there. Just wanted to check
with you to make sure this was the way I should do it before I submitted
anything.

Thanks,
David

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Tue Aug 10 12:50:13 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Aug 2010 11:50:13 -0600 (GMT-06:00)
Subject: [Swift-devel] How to update the swift command for usage tracking?
In-Reply-To: <1006071066.740891281461904291.JavaMail.root@zimbra.anl.gov>
Message-ID: <926157286.741471281462613207.JavaMail.root@zimbra.anl.gov>

A thought on this: could/should the swift command be moved from cog/etc to
modules/swift?

Or should Swift developers also be CoG committers?

- Mike

----- "Michael Wilde" wrote:

> Mihael, can you provide guidance for how David should proceed?
> ...

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Tue Aug 10 13:07:31 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Aug 2010 13:07:31 -0500
Subject: [Swift-devel] Re: How to update the swift command for usage tracking?
In-Reply-To: <1006071066.740891281461904291.JavaMail.root@zimbra.anl.gov>
References: <1006071066.740891281461904291.JavaMail.root@zimbra.anl.gov>
Message-ID: <1281463651.19452.0.camel@blabla2.none>

The replace strategy from swift/build.xml is what we have used so far.

On Tue, 2010-08-10 at 11:38 -0600, Michael Wilde wrote:
> Mihael, can you provide guidance for how David should proceed?
> ...
> Thanks,
> David
>

From hategan at mcs.anl.gov Tue Aug 10 13:12:05 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Aug 2010 13:12:05 -0500
Subject: [Swift-devel] How to update the swift command for usage tracking?
In-Reply-To: <926157286.741471281462613207.JavaMail.root@zimbra.anl.gov>
References: <926157286.741471281462613207.JavaMail.root@zimbra.anl.gov>
Message-ID: <1281463925.19452.5.camel@blabla2.none>

On Tue, 2010-08-10 at 11:50 -0600, Michael Wilde wrote:
> A thought on this: could/should the swift command be moved from cog/etc to modules/swift?

The swift command is not in cog/etc. It gets generated from a template
there and then combined with some swift specific things.

We could bypass that and store it plainly in swift/bin.

>
> Or should Swift developers also be CoG committers?
>
> - Mike
> ...
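The "replace strategy" mentioned here presumably refers to something like Ant's <replace> task, which swaps a placeholder token in the generated launcher for real content at build time. A minimal sketch of that pattern is below; the target name, file path and token are made up for illustration and this is not the actual swift/build.xml:

    <target name="add-usage-tracking">
      <!-- hypothetical: inject a usage-tracking hook into the generated launcher -->
      <replace file="${dist.dir}/bin/swift"
               token="@USAGE_TRACKING_HOOK@"
               value="send_usage_record"/>
    </target>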
From wilde at mcs.anl.gov Tue Aug 10 17:31:55 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Aug 2010 16:31:55 -0600 (GMT-06:00)
Subject: [Swift-devel] How to update the swift command for usage tracking?
In-Reply-To: <1281463925.19452.5.camel@blabla2.none>
Message-ID: <1976812125.762141281479515316.JavaMail.root@zimbra.anl.gov>

----- "Mihael Hategan" wrote:

> On Tue, 2010-08-10 at 11:50 -0600, Michael Wilde wrote:
> > A thought on this: could/should the swift command be moved from
> cog/etc to modules/swift?
>
> The swift command is not in cog/etc. It gets generated from a
> template
> there and then combined with some swift specific things.
>
> We could bypass that and store it plainly in swift/bin.

Yes, I think David should do it that way. I suspect he's adding a large
enough code fragment that we don't want to maintain this via build.xml.
And it will be more convenient for other developers to have access to the
swift command via swift svn.

- Mike

> ...

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From iraicu at cs.uchicago.edu Tue Aug 10 20:03:02 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 10 Aug 2010 20:03:02 -0500
Subject: [Swift-devel] CFP: The 20th International ACM Symposium on
	High-Performance Parallel and Distributed Computing (HPDC) 2011
Message-ID: <4C61F6C6.3020203@cs.uchicago.edu>

Call For Papers

The 20th International ACM Symposium on High-Performance Parallel and
Distributed Computing
http://www.hpdc.org/2011/

San Jose, California, June 8-11, 2011

The ACM International Symposium on High-Performance Parallel and
Distributed Computing is the premier conference for presenting the latest
research on the design, implementation, evaluation, and use of parallel
and distributed systems for high end computing. The 20th installment of
HPDC will take place in San Jose, California, in the heart of Silicon
Valley. This year, HPDC is affiliated with the ACM Federated Computing
Research Conference, consisting of fifteen leading ACM conferences all in
one week. HPDC will be held on June 9-11 (Thursday through Saturday) with
affiliated workshops taking place on June 8th (Wednesday).
Submissions are welcomed on all forms of high performance parallel and distributed computing, including but not limited to clusters, clouds, grids, utility computing, data-intensive computing, multicore and parallel computing. All papers will be reviewed by a distinguished program committee, with a strong preference for rigorous results obtained in operational parallel and distributed systems. All papers will be evaluated for correctness, originality, potential impact, quality of presentation, and interest and relevance to the conference. In addition to traditional technical papers, we also invite experience papers. Such papers should present operational details of a production high end system or application, and draw out conclusions gained from operating the system or application. The evaluation of experience papers will place a greater weight on the real-world impact of the system and the value of conclusions to future system designs. Topics of interest include, but are not limited to: ------------------------------------------------------------------------------- # Applications of parallel and distributed computing. # Systems, networks, and architectures for high end computing. # Parallel and multicore issues and opportunities. # Virtualization of machines, networks, and storage. # Programming languages and environments. # I/O, file systems, and data management. # Data intensive computing. # Resource management, scheduling, and load-balancing. # Performance modeling, simulation, and prediction. # Fault tolerance, reliability and availability. # Security, configuration, policy, and management issues. # Models and use cases for utility, grid, and cloud computing. Authors are invited to submit technical papers of at most 12 pages in PDF format, including all figures and references. Papers should be formatted in the ACM Proceedings Style and submitted via the conference web site. Accepted papers will appear in the conference proceedings, and will be incorporated into the ACM Digital Library. Papers must be self-contained and provide the technical substance required for the program committee to evaluate the paper's contribution. Papers should thoughtfully address all related work, particularly work presented at previous HPDC events. Submitted papers must be original work that has not appeared in and is not under consideration for another conference or a journal. See the ACM Prior Publication Policy for more details. Workshops ------------------------------------------------------------------------------- We invite proposals for workshops affiliated with HPDC to be held on Wednesday, June 8th. For more information, see the Call for Workshops at http://www.hpdc.org/2011/cfw.php. Important Dates ------------------------------------------------------------------------------- Workshop Proposals Due 1 October 2010 Technical Papers Due: 17 January 2011 PAPER DEADLINE EXTENDED: 24 January 2011 (No further extensions!) 
Author Notifications: 28 February 2011 Final Papers Due: 24 March 2011 Conference Dates: 8-11 June 2011 Organization ------------------------------------------------------------------------------- General Chair Barney Maccabe, Oak Ridge National Laboratory Program Chair Douglas Thain, University of Notre Dame Workshops Chair Mike Lewis, Binghamton University Local Arrangements Chair Nick Wright, Lawrence Berkeley National Laboratory Publicity Chairs Alexandru Iosup, Delft University John Lange, University of Pittsburgh Ioan Raicu, Illinois Institute of Technology Yong Zhao, Microsoft Program Committee Kento Aida, National Institute of Informatics Henri Bal, Vrije Universiteit Roger Barga, Microsoft Jim Basney, NCSA John Bent, Los Alamos National Laboratory Ron Brightwell, Sandia National Laboratories Shawn Brown, Pittsburgh Supercomputer Center Claris Castillo, IBM Andrew A. Chien, UC San Diego and SDSC Ewa Deelman, USC Information Sciences Institute Peter Dinda, Northwestern University Scott Emrich, University of Notre Dame Dick Epema, TU-Delft Gilles Fedak, INRIA Renato Figuierdo, University of Florida Ian Foster, University of Chicago and Argonne National Laboratory Gabriele Garzoglio, Fermi National Accelerator Laboratory Rong Ge, Marquette University Sebastien Goasguen, Clemson University Kartik Gopalan, Binghamton University Dean Hildebrand, IBM Almaden Adriana Iamnitchi, University of South Florida Alexandru Iosup, TU-Delft Keith Jackson, Lawrence Berkeley Shantenu Jha, Louisiana State University Daniel S. Katz, University of Chicago and Argonne National Laboratory Thilo Kielmann, Vrije Universiteit Charles Killian, Purdue University Tevfik Kosar, Louisiana State University John Lange, University of Pittsburgh Mike Lewis, Binghamton University Barney Maccabe, Oak Ridge National Laboratory Grzegorz Malewicz, Google Satoshi Matsuoka, Tokyo Institute of Technology Jarek Nabrzyski, University of Notre Dame Manish Parashar, Rutgers University Beth Plale, Indiana University Ioan Raicu, Illinois Institute of Technology Philip Rhodes, University of Mississippi Philip Roth, Oak Ridge National Laboratory Karsten Schwan, Georgia Tech Martin Swany, University of Delaware Jon Weissman, University of Minnesota Dongyan Xu, Purdue University Ken Yocum, UCSD Yong Zhao, Microsoft Steering Committee Henri Bal, Vrije Universiteit Andrew A. Chien, UC San Diego and SDSC Peter Dinda, Northwestern University Ian Foster, Argonne National Laboratory and University of Chicago Dennis Gannon, Microsoft Salim Hariri, University of Arizona Dieter Kranzlmueller, Ludwig-Maximilians-Univ. Muenchen Manish Parashar, Rutgers University Karsten Schwan, Georgia Tech Jon Weissman, University of Minnesota (Chair) -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iraicu at cs.uchicago.edu Tue Aug 10 20:47:53 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 10 Aug 2010 20:47:53 -0500 Subject: [Swift-devel] Call for Workshops at ACM HPDC 2011 Message-ID: <4C620149.3060609@cs.uchicago.edu> Call for Workshops The 20th International ACM Symposium on High-Performance Parallel and Distributed Computing http://www.hpdc.org/2011/ San Jose, California, June 8-11, 2011 ------------------------------------------------------------------------------- The ACM Symposium on High Performance Distributed Computing (HPDC) conference organizers invite proposals for Workshops to be held with HPDC in San Jose, California in June 2011. Workshops will run on June 8, preceding the main conference sessions June 9-11. HPDC 2011 is the 20th anniversary of HPDC, a preeminent conference in high performance computing, including cloud and grid computing. This year's conference will be held in conjunction with the Federated Computing Research Conference (FCRC), which includes high profile conferences in complementary research areas, providing a unique opportunity for a broader technical audience and wider impact for successful workshops. Workshops provide forums for discussion among researchers and practitioners on focused topics, emerging research areas, or both. Organizers may structure workshops as they see fit, possibly including invited talks, panel discussions, presentations of work in progress, fully peer-reviewed papers, or some combination. Workshops could be scheduled for a half day or a full day, depending on interest, space constraints, and organizer preference. Organizers should design workshops for approximately 20-40 participants, to balance impact and effective discussion. A workshop proposal must be made in writing, sent to Mike Lewis at mlewis at cs.binghamton.edu, and should include: # The name of the workshop # Several paragraphs describing the theme of the workshop and how it relates to the HPDC conference # Data about previous offerings of the workshop (if any), including attendance, number of papers, or presentations submitted and accepted # Names and affiliations of the workshop organizers, and if applicable, a significant portion of the program committee # A plan for attracting submissions and attendees Due to publication deadlines, workshops must operate within roughly the following timeline: papers due in early February (2-3 weeks after the HPDC deadline) and selected and sent to the publisher by late February. IMPORTANT DATES: # Workshop Proposals Deadline: October 1, 2010 # Notification: October 25, 2010 # Workshop CFPs Online and Distributed: November 8, 2010 -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Aug 11 02:22:09 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Aug 2010 02:22:09 -0500 Subject: [Swift-devel] How to update the swift command for usage tracking? 
In-Reply-To: <1976812125.762141281479515316.JavaMail.root@zimbra.anl.gov>
References: <1976812125.762141281479515316.JavaMail.root@zimbra.anl.gov>
Message-ID: <1281511329.25899.2.camel@blabla2.none>

May I suggest using UDP because TCP currently makes swift hang while
trying to connect.

Mihael

From hategan at mcs.anl.gov Wed Aug 11 02:41:11 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Aug 2010 02:41:11 -0500
Subject: [Swift-devel] How to update the swift command for usage tracking?
In-Reply-To: <1281511329.25899.2.camel@blabla2.none>
References: <1976812125.762141281479515316.JavaMail.root@zimbra.anl.gov>
	<1281511329.25899.2.camel@blabla2.none>
Message-ID: <1281512471.26881.0.camel@blabla2.none>

On Wed, 2010-08-11 at 02:22 -0500, Mihael Hategan wrote:
> May I suggest using UDP because TCP currently makes swift hang while
> trying to connect.

That or sending in a background process.

From wilde at mcs.anl.gov Wed Aug 11 07:53:16 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 11 Aug 2010 06:53:16 -0600 (GMT-06:00)
Subject: [Swift-devel] How to update the swift command for usage tracking?
In-Reply-To: <1281512471.26881.0.camel@blabla2.none>
Message-ID: <171394612.772701281531196810.JavaMail.root@zimbra.anl.gov>

My understanding is that David already switched this to UDP last week.

- Mike

----- "Mihael Hategan" wrote:

> On Wed, 2010-08-11 at 02:22 -0500, Mihael Hategan wrote:
> > May I suggest using UDP because TCP currently makes swift hang
> while
> > trying to connect.
>
> That or sending in a background process.

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From dk0966 at cs.ship.edu Wed Aug 11 08:09:33 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Wed, 11 Aug 2010 09:09:33 -0400
Subject: [Swift-devel] How to update the swift command for usage tracking?
In-Reply-To: <171394612.772701281531196810.JavaMail.root@zimbra.anl.gov>
References: <1281512471.26881.0.camel@blabla2.none>
	<171394612.772701281531196810.JavaMail.root@zimbra.anl.gov>
Message-ID:

Yep, nc is sending the data as UDP. Running it as a background process
makes sense. The user shouldn't have to wait for it to finish. That is
added to the script now.

David

On Wed, Aug 11, 2010 at 8:53 AM, Michael Wilde wrote:

> My understanding is that David already switched this to UDP last week.
> ...

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
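To make the pattern being settled on here concrete, a fire-and-forget UDP send from the launcher script could look roughly like the sketch below. The collector host, port and payload format are placeholders, not the values used by the actual Swift listener:

    # hypothetical collector endpoint and payload; nc -u sends the data as a
    # UDP datagram, and the trailing & keeps swift from waiting on the send
    USAGE_HOST=usage.example.org
    USAGE_PORT=9999
    printf 'swift-start %s %s\n' "$(date +%s)" "$(uname -sm)" \
        | nc -u -w 1 "$USAGE_HOST" "$USAGE_PORT" > /dev/null 2>&1 &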
From dk0966 at cs.ship.edu Wed Aug 11 11:57:31 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Wed, 11 Aug 2010 12:57:31 -0400
Subject: [Swift-devel] auth.defaults
Message-ID:

Hello all,

During today's conference call we were discussing auth.defaults in relation
to swiftconfig, but it expanded into a more general discussion about how
swift uses auth.defaults. I was asked to send an email to the list to
discuss it further.

One of the concerns mentioned was security. Is there a way to transition
from having passwords stored in plaintext to another method, perhaps an
agent-based authentication?

Another thing that would be nice, for swiftconfig/swiftrun, would be to
have a per-host auth.defaults outside of .ssh. Then you could specify the
auth.defaults file to use, as you currently can with sites and tc.
Currently this might not be a good idea due to security concerns, but if
we could eliminate the passwords this might be possible?

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hategan at mcs.anl.gov Wed Aug 11 15:18:56 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Aug 2010 15:18:56 -0500
Subject: [Swift-devel] auth.defaults
In-Reply-To:
References:
Message-ID: <1281557936.27901.3.camel@blabla2.none>

On Wed, 2010-08-11 at 12:57 -0400, David Kelly wrote:
> Hello all,
>
> During today's conference call we were discussing auth.defaults in
> relation to swiftconfig, but it expanded into a more general
> discussion about how swift uses auth.defaults. I was asked to send an
> email to the list to discuss it further.
>
> One of the concerns mentioned was security. Is there a way to
> transition from having passwords stored in plaintext to another
> method, perhaps an agent-based authentication?

Right now ssh agents can't be used from java due to the way they are
implemented (unix domain sockets).

But it's also not necessary to put in the passwords. You get prompted for
one if you don't, and it is cached for as long as the JVM lasts (which may
or may not be a security concern in itself).

>
> Another thing that would be nice, for swiftconfig/swiftrun, would be
> to have a per-host auth.defaults outside of .ssh. Then you could
> specify the auth.defaults file to use, as you currently can with sites
> and tc. Currently this might not be a good idea due to security
> concerns, but if we could eliminate the passwords this might be
> possible?

I don't think I understand what you mean by "per-host auth.defaults".

From dk0966 at cs.ship.edu Wed Aug 11 16:08:16 2010
From: dk0966 at cs.ship.edu (David Kelly)
Date: Wed, 11 Aug 2010 17:08:16 -0400
Subject: [Swift-devel] auth.defaults
In-Reply-To: <1281557936.27901.3.camel@blabla2.none>
References: <1281557936.27901.3.camel@blabla2.none>
Message-ID:

On Wed, Aug 11, 2010 at 4:18 PM, Mihael Hategan wrote:
> I don't think I understand what you mean by "per-host auth.defaults".
>

What I mean by that is, each configuration generated by swiftconfig could
have its own unique auth.defaults file like it has its own sites.xml. Then
you could run something like "swift -auth.file /mypath/auth.defaults"
rather than requiring it to be stored in ~/.ssh.

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hategan at mcs.anl.gov Wed Aug 11 19:58:40 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Aug 2010 19:58:40 -0500
Subject: [Swift-devel] auth.defaults
In-Reply-To:
References: <1281557936.27901.3.camel@blabla2.none>
Message-ID: <1281574720.27901.12.camel@blabla2.none>

On Wed, 2010-08-11 at 17:08 -0400, David Kelly wrote:
> On Wed, Aug 11, 2010 at 4:18 PM, Mihael Hategan
> wrote:
> > I don't think I understand what you mean by "per-host
> auth.defaults".
>
> What I mean by that is, each configuration generated by swiftconfig
> could have its own unique auth.defaults file like it has its own
> sites.xml. Then you could run something like "swift
> -auth.file /mypath/auth.defaults" rather than requiring it to be
> stored in ~/.ssh.

It's possible, but I don't think it's worth the effort. That file, like
authorized_keys or known_hosts, is not meant to contain frequently changing
information, since passwords and usernames are not generally a moving
target (nor are host keys).

From jon.monette at gmail.com Fri Aug 13 11:30:32 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Fri, 13 Aug 2010 11:30:32 -0500
Subject: [Swift-devel] Worker connection
Message-ID: <4C657328.6060304@gmail.com>

Hello,
    How does the worker decide what connection to connect to? Right
now what I think it does is it runs ifconfig and greps the inet address
and then tests each of these connections. Is this correct? When I am
running on PADS it seems that the worker always chooses the wrong
connection to the service. It seems to choose the usb0 connection where
the correct connection is the ib0 connection. Is there a way that maybe
the worker can be fixed to choose a better connection or the correct
connection? This seems to be only happening on PADS.

--
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are
incredibly slow, inaccurate, and brilliant. Together they are powerful
beyond imagination.
- Albert Einstein

From hategan at mcs.anl.gov Fri Aug 13 11:35:59 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Aug 2010 11:35:59 -0500
Subject: [Swift-devel] Re: Worker connection
In-Reply-To: <4C657328.6060304@gmail.com>
References: <4C657328.6060304@gmail.com>
Message-ID: <1281717359.11891.3.camel@blabla2.none>

On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette wrote:
> Hello,
>     How does the worker decide what connection to connect to? Right
> now what I think it does is it runs ifconfig and greps the inet address
> and then tests each of these connections. Is this correct? When I am
> running on PADS it seems that the worker always chooses the wrong
> connection to the service. It seems to choose the usb0 connection where
> the correct connection is the ib0 connection. Is there a way that maybe
> the worker can be fixed to choose a better connection or the correct
> connection? This seems to be only happening on PADS.
>

That was temporary. Initially it would use the same address as the url
in sites.xml. Then I added the "try all interfaces" thing, but in some
cases the connect on certain wrong addresses does not fail quickly
enough and has to time out instead, which usually takes a few minutes. So
that got disabled and only the first address is used now (unless
overridden - see below).

You can say <profile namespace="globus"
key="internalHostname">x.y.z.w</profile> in sites.xml

Mihael

From jon.monette at gmail.com Fri Aug 13 11:43:36 2010
From: jon.monette at gmail.com (Jonathan Monette)
Date: Fri, 13 Aug 2010 11:43:36 -0500
Subject: [Swift-devel] Re: Worker connection
In-Reply-To: <1281717359.11891.3.camel@blabla2.none>
References: <4C657328.6060304@gmail.com>
	<1281717359.11891.3.camel@blabla2.none>
Message-ID: <4C657638.5080609@gmail.com>

Right now I am using "internalHostname". I was just wondering if this
should be changed, since I am always changing this entry depending on
whether I am on login1 or login2?

On 8/13/10 11:35 AM, Mihael Hategan wrote:
> On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette wrote:
>
>> Hello,
>>     How does the worker decide what connection to connect to? Right
>> now what I think it does is it runs ifconfig and greps the inet address
>> and then tests each of these connections. Is this correct?
When I am >> running on PADS it seems that the worker always chooses the wrong >> connection to the service. It seems to choose the UBS0 connection where >> the correct connection is the ib0 connection. Is there a way that maybe >> the worker can be fixed to choose a better connection or the correct >> connection? This seems to be only happening on PADS. >> >> > That was temporary. Initially it would use the same address as the url > in sites.xml. Then I added the "try all interfaces" thing, but in some > cases the connect on certain wrong addresses does not fail quickly > enough and has to timeout instead, which usually takes a few minutes. So > that got disabled and only the frist address is used now (unless > overridden - see below). > > You can say key="internalHostname">x.y.z.w in sites.xml > > Mihael > > > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Fri Aug 13 11:54:50 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Aug 2010 11:54:50 -0500 Subject: [Swift-devel] Re: Worker connection In-Reply-To: <4C657638.5080609@gmail.com> References: <4C657328.6060304@gmail.com> <1281717359.11891.3.camel@blabla2.none> <4C657638.5080609@gmail.com> Message-ID: <1281718490.12095.8.camel@blabla2.none> On Fri, 2010-08-13 at 11:43 -0500, Jonathan Monette wrote: > Right now I am using "internalHostname". I was just wondering if an > should this be changed since I am always changing this entry depending > if I am on login1 or login2? It should, but the question is to what. I offer $20 to the first person to find a reliable (that works on all TG sites + PADS + Intrepid), quick (that does not, by itself, delay worker startup or the overall workflow by more than a few seconds) and automated way of figuring out that IP. I reserve the right to refuse a solution if it does not meet certain propriety criteria that I did not necessarily specify here. (btw you could make a wrapper around swift that detects whether you are on login1 or login2 and picks one of two sites files and passes that to swift). Mihael > > On 8/13/10 11:35 AM, Mihael Hategan wrote: > > On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette wrote: > > > >> Hello, > >> How does the worker decide what connection to connect to? Right > >> now what I think it does is it runs ifconfig and greps the inet address > >> and then test each of these connections. Is this correct? When I am > >> running on PADS it seems that the worker always chooses the wrong > >> connection to the service. It seems to choose the UBS0 connection where > >> the correct connection is the ib0 connection. Is there a way that maybe > >> the worker can be fixed to choose a better connection or the correct > >> connection? This seems to be only happening on PADS. > >> > >> > > That was temporary. Initially it would use the same address as the url > > in sites.xml. Then I added the "try all interfaces" thing, but in some > > cases the connect on certain wrong addresses does not fail quickly > > enough and has to timeout instead, which usually takes a few minutes. So > > that got disabled and only the frist address is used now (unless > > overridden - see below). 
> > > > You can say > key="internalHostname">x.y.z.w in sites.xml > > > > Mihael > > > > > > > > > From hategan at mcs.anl.gov Fri Aug 13 12:04:39 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Aug 2010 12:04:39 -0500 Subject: [Swift-devel] Re: Worker connection In-Reply-To: References: <4C657328.6060304@gmail.com> <1281717359.11891.3.camel@blabla2.none> <4C657638.5080609@gmail.com> <1281718490.12095.8.camel@blabla2.none> Message-ID: <1281719079.12275.0.camel@blabla2.none> On Fri, 2010-08-13 at 12:59 -0400, Glen Hocky wrote: > The OOPS project uses sed to set that parameter and create the sites > file on the fly. it's very effective How was the IP picked? > > On Fri, Aug 13, 2010 at 12:54 PM, Mihael Hategan > wrote: > On Fri, 2010-08-13 at 11:43 -0500, Jonathan Monette wrote: > > Right now I am using "internalHostname". I was just > wondering if an > > should this be changed since I am always changing this entry > depending > > if I am on login1 or login2? > > > It should, but the question is to what. > > I offer $20 to the first person to find a reliable (that works > on all TG > sites + PADS + Intrepid), quick (that does not, by itself, > delay worker > startup or the overall workflow by more than a few seconds) > and > automated way of figuring out that IP. I reserve the right to > refuse a > solution if it does not meet certain propriety criteria that I > did not > necessarily specify here. > > (btw you could make a wrapper around swift that detects > whether you are > on login1 or login2 and picks one of two sites files and > passes that to > swift). > > Mihael > > > > > On 8/13/10 11:35 AM, Mihael Hategan wrote: > > > On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette wrote: > > > > > >> Hello, > > >> How does the worker decide what connection to > connect to? Right > > >> now what I think it does is it runs ifconfig and greps > the inet address > > >> and then test each of these connections. Is this > correct? When I am > > >> running on PADS it seems that the worker always chooses > the wrong > > >> connection to the service. It seems to choose the UBS0 > connection where > > >> the correct connection is the ib0 connection. Is there a > way that maybe > > >> the worker can be fixed to choose a better connection or > the correct > > >> connection? This seems to be only happening on PADS. > > >> > > >> > > > That was temporary. Initially it would use the same > address as the url > > > in sites.xml. Then I added the "try all interfaces" thing, > but in some > > > cases the connect on certain wrong addresses does not fail > quickly > > > enough and has to timeout instead, which usually takes a > few minutes. So > > > that got disabled and only the frist address is used now > (unless > > > overridden - see below). > > > > > > You can say > > key="internalHostname">x.y.z.w in sites.xml > > > > > > Mihael > > > > > > > > > > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From aespinosa at cs.uchicago.edu Fri Aug 13 12:13:48 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 13 Aug 2010 12:13:48 -0500 Subject: [Swift-devel] Re: Worker connection In-Reply-To: <1281717359.11891.3.camel@blabla2.none> References: <4C657328.6060304@gmail.com> <1281717359.11891.3.camel@blabla2.none> Message-ID: <20100813171348.GA2346@origin> So this is simply passed to the worker? 
The "url" parameter is still the one used to spawn the service right? -Allan On Fri, Aug 13, 2010 at 11:35:59AM -0500, Mihael Hategan wrote: > On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette wrote: > > Hello, > > How does the worker decide what connection to connect to? Right > > now what I think it does is it runs ifconfig and greps the inet address > > and then test each of these connections. Is this correct? When I am > > running on PADS it seems that the worker always chooses the wrong > > connection to the service. It seems to choose the UBS0 connection where > > the correct connection is the ib0 connection. Is there a way that maybe > > the worker can be fixed to choose a better connection or the correct > > connection? This seems to be only happening on PADS. > > > > That was temporary. Initially it would use the same address as the url > in sites.xml. Then I added the "try all interfaces" thing, but in some > cases the connect on certain wrong addresses does not fail quickly > enough and has to timeout instead, which usually takes a few minutes. So > that got disabled and only the frist address is used now (unless > overridden - see below). > > You can say key="internalHostname">x.y.z.w in sites.xml > From hategan at mcs.anl.gov Fri Aug 13 12:20:25 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Aug 2010 12:20:25 -0500 Subject: [Swift-devel] Re: Worker connection In-Reply-To: <20100813171348.GA2346@origin> References: <4C657328.6060304@gmail.com> <1281717359.11891.3.camel@blabla2.none> <20100813171348.GA2346@origin> Message-ID: <1281720025.12401.2.camel@blabla2.none> On Fri, 2010-08-13 at 12:13 -0500, Allan Espinosa wrote: > So this is simply passed to the worker? The "url" parameter is still the one > used to spawn the service right? Right. Using the URL worked the most reasonably, but some machines (like Intrepid) would still fail with that (the external ip is not accessible by the WNs). Also it doesn't work well when the URL is "localhost". Btw, the way it is passed or whether it is passed to the worker, I don't care about. A potential solution has flexibility to address the issue in whatever way. > > -Allan > On Fri, Aug 13, 2010 at 11:35:59AM -0500, Mihael Hategan wrote: > > On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette wrote: > > > Hello, > > > How does the worker decide what connection to connect to? Right > > > now what I think it does is it runs ifconfig and greps the inet address > > > and then test each of these connections. Is this correct? When I am > > > running on PADS it seems that the worker always chooses the wrong > > > connection to the service. It seems to choose the UBS0 connection where > > > the correct connection is the ib0 connection. Is there a way that maybe > > > the worker can be fixed to choose a better connection or the correct > > > connection? This seems to be only happening on PADS. > > > > > > > That was temporary. Initially it would use the same address as the url > > in sites.xml. Then I added the "try all interfaces" thing, but in some > > cases the connect on certain wrong addresses does not fail quickly > > enough and has to timeout instead, which usually takes a few minutes. So > > that got disabled and only the frist address is used now (unless > > overridden - see below). 
> > > > You can say > key="internalHostname">x.y.z.w in sites.xml > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Fri Aug 13 12:21:45 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Aug 2010 12:21:45 -0500 Subject: [Swift-devel] Re: Worker connection In-Reply-To: <3415961925365440150@unknownmsgid> References: <4C657328.6060304@gmail.com> <1281717359.11891.3.camel@blabla2.none> <4C657638.5080609@gmail.com> <1281718490.12095.8.camel@blabla2.none> <1281719079.12275.0.camel@blabla2.none> <3415961925365440150@unknownmsgid> Message-ID: <1281720105.12401.4.camel@blabla2.none> On Fri, 2010-08-13 at 13:18 -0400, Glen Hocky wrote: > By picking the principle ip on the submit node (assumes running on a > site from that sites submit host) Right. That's the next best solution, but doesn't work everywhere. > > On Aug 13, 2010, at 1:04 PM, Mihael Hategan wrote: > > > On Fri, 2010-08-13 at 12:59 -0400, Glen Hocky wrote: > >> The OOPS project uses sed to set that parameter and create the sites > >> file on the fly. it's very effective > > > > How was the IP picked? > > > >> > >> On Fri, Aug 13, 2010 at 12:54 PM, Mihael Hategan > >> wrote: > >> On Fri, 2010-08-13 at 11:43 -0500, Jonathan Monette wrote: > >>> Right now I am using "internalHostname". I was just > >> wondering if an > >>> should this be changed since I am always changing this entry > >> depending > >>> if I am on login1 or login2? > >> > >> > >> It should, but the question is to what. > >> > >> I offer $20 to the first person to find a reliable (that works > >> on all TG > >> sites + PADS + Intrepid), quick (that does not, by itself, > >> delay worker > >> startup or the overall workflow by more than a few seconds) > >> and > >> automated way of figuring out that IP. I reserve the right to > >> refuse a > >> solution if it does not meet certain propriety criteria that I > >> did not > >> necessarily specify here. > >> > >> (btw you could make a wrapper around swift that detects > >> whether you are > >> on login1 or login2 and picks one of two sites files and > >> passes that to > >> swift). > >> > >> Mihael > >> > >>> > >>> On 8/13/10 11:35 AM, Mihael Hategan wrote: > >>>> On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette wrote: > >>>> > >>>>> Hello, > >>>>> How does the worker decide what connection to > >> connect to? Right > >>>>> now what I think it does is it runs ifconfig and greps > >> the inet address > >>>>> and then test each of these connections. Is this > >> correct? When I am > >>>>> running on PADS it seems that the worker always chooses > >> the wrong > >>>>> connection to the service. It seems to choose the UBS0 > >> connection where > >>>>> the correct connection is the ib0 connection. Is there a > >> way that maybe > >>>>> the worker can be fixed to choose a better connection or > >> the correct > >>>>> connection? This seems to be only happening on PADS. > >>>>> > >>>>> > >>>> That was temporary. Initially it would use the same > >> address as the url > >>>> in sites.xml. Then I added the "try all interfaces" thing, > >> but in some > >>>> cases the connect on certain wrong addresses does not fail > >> quickly > >>>> enough and has to timeout instead, which usually takes a > >> few minutes. So > >>>> that got disabled and only the frist address is used now > >> (unless > >>>> overridden - see below). 
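[A rough sketch of the "primary IP of the submit node" approach described above, assuming a Linux login node with iproute2; as noted in the reply, the address it picks may still be unreachable from the worker nodes on machines like Intrepid.]

# Ask the routing table which source address outbound traffic would use;
# 8.8.8.8 is only a route-lookup target here, no packet is actually sent.
primary_ip=$(ip route get 8.8.8.8 2>/dev/null | sed -n 's/.* src \([0-9.]*\).*/\1/p')
# Fall back to the first address hostname reports (may be wrong on hosts
# that map their own name to 127.0.0.1 in /etc/hosts).
[ -z "$primary_ip" ] && primary_ip=$(hostname -i | awk '{print $1}')
echo "$primary_ip"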
> >>>> > >>>> You can say >>>> key="internalHostname">x.y.z.w in sites.xml > >>>> > >>>> Mihael > >>>> > >>>> > >>>> > >>>> > >>> > >> > >> > >> > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > >> > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Aug 13 12:23:15 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Fri, 13 Aug 2010 11:23:15 -0600 (GMT-06:00) Subject: [Swift-devel] Re: Worker connection In-Reply-To: <780389740.884421281720106575.JavaMail.root@zimbra.anl.gov> Message-ID: <242879890.884441281720194968.JavaMail.root@zimbra.anl.gov> If we set internal hostname to $(hostname -f) and then the worker just connects to that, resolving the IP address via DNS, won't that typically connect? At least as the default? Then users manually override for any clusters that are not sufficiently sanely configured to make that possible, and we provide a set of manual instructions for users to determine the right settings if this is the case and what to set if not. see below. - Mike login2$ hostname -f login2.pads.ci.uchicago.edu login2$ qsub -I qsub: waiting for job 444923.svc.pads.ci.uchicago.edu to start qsub: job 444923.svc.pads.ci.uchicago.edu ready ---------------------------------------- Begin PBS Prologue Fri Aug 13 12:21:44 CDT 2010 Job ID: 444923.svc.pads.ci.uchicago.edu Username: wilde Group: ci-users Nodes: c40.pads.ci.uchicago.edu End PBS Prologue Fri Aug 13 12:21:44 CDT 2010 ---------------------------------------- c40$ ping login2.pads.ci.uchicago.edu PING login2.pads.ci.uchicago.edu (192.5.86.6) 56(84) bytes of data. 64 bytes from login2.pads.ci.uchicago.edu (192.5.86.6): icmp_seq=1 ttl=64 time=0.099 ms 64 bytes from login2.pads.ci.uchicago.edu (192.5.86.6): icmp_seq=2 ttl=64 time=0.185 ms 64 bytes from login2.pads.ci.uchicago.edu (192.5.86.6): icmp_seq=3 ttl=64 time=0.221 ms 64 bytes from login2.pads.ci.uchicago.edu (192.5.86.6): icmp_seq=4 ttl=64 time=0.164 ms --- login2.pads.ci.uchicago.edu ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3000ms rtt min/avg/max/mdev = 0.099/0.167/0.221/0.045 ms c40$ exit logout qsub: job 444923.svc.pads.ci.uchicago.edu completed login2$ ----- "Mihael Hategan" wrote: > On Fri, 2010-08-13 at 12:59 -0400, Glen Hocky wrote: > > The OOPS project uses sed to set that parameter and create the > sites > > file on the fly. it's very effective > > How was the IP picked? > > > > > On Fri, Aug 13, 2010 at 12:54 PM, Mihael Hategan > > > wrote: > > On Fri, 2010-08-13 at 11:43 -0500, Jonathan Monette wrote: > > > Right now I am using "internalHostname". I was just > > wondering if an > > > should this be changed since I am always changing this > entry > > depending > > > if I am on login1 or login2? > > > > > > It should, but the question is to what. > > > > I offer $20 to the first person to find a reliable (that > works > > on all TG > > sites + PADS + Intrepid), quick (that does not, by itself, > > delay worker > > startup or the overall workflow by more than a few seconds) > > and > > automated way of figuring out that IP. I reserve the right > to > > refuse a > > solution if it does not meet certain propriety criteria that > I > > did not > > necessarily specify here. 
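[A minimal sketch combining the two suggestions quoted above: a sed-generated sites file with internalHostname taken from hostname -f. The template name sites.xml.in and the @INTERNAL@ placeholder are assumptions for illustration.]

# sites.xml.in is assumed to contain, among the usual entries, a line like:
#   <profile namespace="globus" key="internalHostname">@INTERNAL@</profile>
internal=$(hostname -f)   # override by hand on clusters that are not sanely configured
sed "s/@INTERNAL@/$internal/" sites.xml.in > sites.run.xml
swift -sites.file sites.run.xml -tc.file tc.data script.swift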
> > > > (btw you could make a wrapper around swift that detects > > whether you are > > on login1 or login2 and picks one of two sites files and > > passes that to > > swift). > > > > Mihael > > > > > > > > On 8/13/10 11:35 AM, Mihael Hategan wrote: > > > > On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette > wrote: > > > > > > > >> Hello, > > > >> How does the worker decide what connection to > > connect to? Right > > > >> now what I think it does is it runs ifconfig and greps > > the inet address > > > >> and then test each of these connections. Is this > > correct? When I am > > > >> running on PADS it seems that the worker always > chooses > > the wrong > > > >> connection to the service. It seems to choose the > UBS0 > > connection where > > > >> the correct connection is the ib0 connection. Is there > a > > way that maybe > > > >> the worker can be fixed to choose a better connection > or > > the correct > > > >> connection? This seems to be only happening on PADS. > > > >> > > > >> > > > > That was temporary. Initially it would use the same > > address as the url > > > > in sites.xml. Then I added the "try all interfaces" > thing, > > but in some > > > > cases the connect on certain wrong addresses does not > fail > > quickly > > > > enough and has to timeout instead, which usually takes > a > > few minutes. So > > > > that got disabled and only the frist address is used > now > > (unless > > > > overridden - see below). > > > > > > > > You can say > > > key="internalHostname">x.y.z.w in sites.xml > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Fri Aug 13 12:52:33 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Aug 2010 12:52:33 -0500 Subject: [Swift-devel] Re: Worker connection In-Reply-To: <242879890.884441281720194968.JavaMail.root@zimbra.anl.gov> References: <242879890.884441281720194968.JavaMail.root@zimbra.anl.gov> Message-ID: <1281721953.12704.4.camel@blabla2.none> On Fri, 2010-08-13 at 11:23 -0600, wilde at mcs.anl.gov wrote: > If we set internal hostname to $(hostname -f) and then the worker just > connects to that, resolving the IP address via DNS, won't that > typically connect? At least as the default? Yes, it would. Though no prize for "typically". > > Then users manually override for any clusters that are not > sufficiently sanely configured to make that possible, and we provide a > set of manual instructions for users to determine the right settings > if this is the case and what to set if not. Right. I think this is very similar to what used to be the default. Mihael > > see below. 
> > - Mike > > login2$ hostname -f > login2.pads.ci.uchicago.edu > login2$ qsub -I > qsub: waiting for job 444923.svc.pads.ci.uchicago.edu to start > qsub: job 444923.svc.pads.ci.uchicago.edu ready > > ---------------------------------------- > Begin PBS Prologue Fri Aug 13 12:21:44 CDT 2010 > Job ID: 444923.svc.pads.ci.uchicago.edu > Username: wilde > Group: ci-users > Nodes: c40.pads.ci.uchicago.edu > End PBS Prologue Fri Aug 13 12:21:44 CDT 2010 > ---------------------------------------- > c40$ ping login2.pads.ci.uchicago.edu > PING login2.pads.ci.uchicago.edu (192.5.86.6) 56(84) bytes of data. > 64 bytes from login2.pads.ci.uchicago.edu (192.5.86.6): icmp_seq=1 ttl=64 time=0.099 ms > 64 bytes from login2.pads.ci.uchicago.edu (192.5.86.6): icmp_seq=2 ttl=64 time=0.185 ms > 64 bytes from login2.pads.ci.uchicago.edu (192.5.86.6): icmp_seq=3 ttl=64 time=0.221 ms > 64 bytes from login2.pads.ci.uchicago.edu (192.5.86.6): icmp_seq=4 ttl=64 time=0.164 ms > > --- login2.pads.ci.uchicago.edu ping statistics --- > 4 packets transmitted, 4 received, 0% packet loss, time 3000ms > rtt min/avg/max/mdev = 0.099/0.167/0.221/0.045 ms > c40$ exit > logout > > qsub: job 444923.svc.pads.ci.uchicago.edu completed > login2$ > > ----- "Mihael Hategan" wrote: > > > On Fri, 2010-08-13 at 12:59 -0400, Glen Hocky wrote: > > > The OOPS project uses sed to set that parameter and create the > > sites > > > file on the fly. it's very effective > > > > How was the IP picked? > > > > > > > > On Fri, Aug 13, 2010 at 12:54 PM, Mihael Hategan > > > > > wrote: > > > On Fri, 2010-08-13 at 11:43 -0500, Jonathan Monette wrote: > > > > Right now I am using "internalHostname". I was just > > > wondering if an > > > > should this be changed since I am always changing this > > entry > > > depending > > > > if I am on login1 or login2? > > > > > > > > > It should, but the question is to what. > > > > > > I offer $20 to the first person to find a reliable (that > > works > > > on all TG > > > sites + PADS + Intrepid), quick (that does not, by itself, > > > delay worker > > > startup or the overall workflow by more than a few seconds) > > > and > > > automated way of figuring out that IP. I reserve the right > > to > > > refuse a > > > solution if it does not meet certain propriety criteria that > > I > > > did not > > > necessarily specify here. > > > > > > (btw you could make a wrapper around swift that detects > > > whether you are > > > on login1 or login2 and picks one of two sites files and > > > passes that to > > > swift). > > > > > > Mihael > > > > > > > > > > > On 8/13/10 11:35 AM, Mihael Hategan wrote: > > > > > On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette > > wrote: > > > > > > > > > >> Hello, > > > > >> How does the worker decide what connection to > > > connect to? Right > > > > >> now what I think it does is it runs ifconfig and greps > > > the inet address > > > > >> and then test each of these connections. Is this > > > correct? When I am > > > > >> running on PADS it seems that the worker always > > chooses > > > the wrong > > > > >> connection to the service. It seems to choose the > > UBS0 > > > connection where > > > > >> the correct connection is the ib0 connection. Is there > > a > > > way that maybe > > > > >> the worker can be fixed to choose a better connection > > or > > > the correct > > > > >> connection? This seems to be only happening on PADS. > > > > >> > > > > >> > > > > > That was temporary. Initially it would use the same > > > address as the url > > > > > in sites.xml. 
Then I added the "try all interfaces" > > thing, > > > but in some > > > > > cases the connect on certain wrong addresses does not > > fail > > > quickly > > > > > enough and has to timeout instead, which usually takes > > a > > > few minutes. So > > > > > that got disabled and only the frist address is used > > now > > > (unless > > > > > overridden - see below). > > > > > > > > > > You can say > > > > key="internalHostname">x.y.z.w in sites.xml > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From skenny at uchicago.edu Fri Aug 13 13:11:08 2010 From: skenny at uchicago.edu (Sarah Kenny) Date: Fri, 13 Aug 2010 13:11:08 -0500 Subject: [Swift-devel] cleanup fails on abe Message-ID: hi all, not sure if anyone else is running on abe, but for some reason cleanup seems to fail on there very consistently. swift throws a warning: The following warnings have occurred: 1. Cleanup on ABE failed Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) Caused by: org.globus.gram.GramException: Parameter not supported at org.globus.gram.Gram.request(Gram.java:358) at org.globus.gram.GramJob.request(GramJob.java:262) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) ... 4 more if i shut off cleanup, i don't get the warning and the workflow 'apprears' to have completed successfully, however even with cleanup shut off pbs still generates the email below giving the error: i'm still poking around to see if i can figure out what's up, but thought i would throw this out there in case someone else has come across it. swift, coaster and gram logs attached. ~sk ---------- Forwarded message ---------- From: adm Date: Fri, Aug 13, 2010 at 12:53 PM Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu To: skenny at abe1196.ncsa.uiuc.edu PBS Job Id: 3000582.abem5.ncsa.uiuc.edu Job Name: ? configtester Exec host: ?abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 An error has occurred processing your job, see below. 
Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on host abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 Unable to copy file /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout *** error from copy /bin/cp: cannot create regular file `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': No such file or directory *** end error output Unable to copy file /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr *** error from copy /bin/cp: cannot create regular file `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': No such file or directory *** end error output -------------- next part -------------- A non-text attachment was scrubbed... Name: coasters.log Type: application/octet-stream Size: 218637 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: gram_job_mgr_15575.log Type: application/octet-stream Size: 25429 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: rmx-20100813-1251-kx9dzk98.log Type: application/octet-stream Size: 2137298 bytes Desc: not available URL: From wilde at mcs.anl.gov Fri Aug 13 13:16:31 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 13 Aug 2010 12:16:31 -0600 (GMT-06:00) Subject: [Swift-devel] cleanup fails on abe In-Reply-To: Message-ID: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> Sarah, what does "shut off cleanup" mean? Jon, is there any similarity between what Sarah is encountering and what you observed on TeraPort (presumably using provider=coaster jobmanager=local:pbs)? - Mike ----- "Sarah Kenny" wrote: > hi all, not sure if anyone else is running on abe, but for some > reason > cleanup seems to fail on there very consistently. swift throws a > warning: > > The following warnings have occurred: > 1. Cleanup on ABE failed > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > at > org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > Caused by: org.globus.gram.GramException: Parameter not supported > at org.globus.gram.Gram.request(Gram.java:358) > at org.globus.gram.GramJob.request(GramJob.java:262) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) > ... 
4 more > > if i shut off cleanup, i don't get the warning and the workflow > 'apprears' to have completed successfully, however even with cleanup > shut off pbs still generates the email below giving the error: > > > i'm still poking around to see if i can figure out what's up, but > thought i would throw this out there in case someone else has come > across it. > > swift, coaster and gram logs attached. > > ~sk > > ---------- Forwarded message ---------- > From: adm > Date: Fri, Aug 13, 2010 at 12:53 PM > Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu > To: skenny at abe1196.ncsa.uiuc.edu > > > PBS Job Id: 3000582.abem5.ncsa.uiuc.edu > Job Name: ? configtester > Exec host: > ?abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 > An error has occurred processing your job, see below. > Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on > host > abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 > > Unable to copy file > /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to > /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout > *** error from copy > /bin/cp: cannot create regular file > `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': > No such file or directory > *** end error output > > Unable to copy file > /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to > /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr > *** error from copy > /bin/cp: cannot create regular file > `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': > No such file or directory > *** end error output > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Fri Aug 13 13:19:35 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Aug 2010 13:19:35 -0500 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: References: Message-ID: <1281723575.12858.5.camel@blabla2.none> Can you can find the gram log for the cleanup job (it's a /bin/rm)? Also, I remember you being able to run things just fine on Abe. Are you aware of any configuration changes there? Any disks full? On Fri, 2010-08-13 at 13:11 -0500, Sarah Kenny wrote: > hi all, not sure if anyone else is running on abe, but for some reason > cleanup seems to fail on there very consistently. swift throws a > warning: > > The following warnings have occurred: > 1. 
Cleanup on ABE failed > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) > at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) > at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > Caused by: org.globus.gram.GramException: Parameter not supported > at org.globus.gram.Gram.request(Gram.java:358) > at org.globus.gram.GramJob.request(GramJob.java:262) > at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) > ... 4 more > > if i shut off cleanup, i don't get the warning and the workflow > 'apprears' to have completed successfully, however even with cleanup > shut off pbs still generates the email below giving the error: > > > i'm still poking around to see if i can figure out what's up, but > thought i would throw this out there in case someone else has come > across it. > > swift, coaster and gram logs attached. > > ~sk > > ---------- Forwarded message ---------- > From: adm > Date: Fri, Aug 13, 2010 at 12:53 PM > Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu > To: skenny at abe1196.ncsa.uiuc.edu > > > PBS Job Id: 3000582.abem5.ncsa.uiuc.edu > Job Name: configtester > Exec host: abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 > An error has occurred processing your job, see below. 
> Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on > host abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 > > Unable to copy file > /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to > /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout > *** error from copy > /bin/cp: cannot create regular file > `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': > No such file or directory > *** end error output > > Unable to copy file > /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to > /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr > *** error from copy > /bin/cp: cannot create regular file > `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': > No such file or directory > *** end error output > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From jon.monette at gmail.com Fri Aug 13 13:33:30 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Fri, 13 Aug 2010 13:33:30 -0500 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> References: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> Message-ID: <4C658FFA.6090808@gmail.com> My problem is on PADS. And no I do not have a cleanup error. At the end of the run where it qdel all the workers it, it just hands. Once all the jobs in the queue have been deleted swift hangs and I have to 'control c' the job to gain control of the terminal again. On my larger jobs I get a 'Failed to shutdown block' error. On 8/13/10 1:16 PM, Michael Wilde wrote: > Sarah, what does "shut off cleanup" mean? > > Jon, is there any similarity between what Sarah is encountering and what you observed on TeraPort (presumably using provider=coaster jobmanager=local:pbs)? > > - Mike > > ----- "Sarah Kenny" wrote: > > >> hi all, not sure if anyone else is running on abe, but for some >> reason >> cleanup seems to fail on there very consistently. swift throws a >> warning: >> >> The following warnings have occurred: >> 1. Cleanup on ABE failed >> Caused by: >> >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> Cannot submit job >> at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) >> at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) >> at >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) >> at >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) >> at >> org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) >> Caused by: org.globus.gram.GramException: Parameter not supported >> at org.globus.gram.Gram.request(Gram.java:358) >> at org.globus.gram.GramJob.request(GramJob.java:262) >> at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) >> ... 
4 more >> >> if i shut off cleanup, i don't get the warning and the workflow >> 'apprears' to have completed successfully, however even with cleanup >> shut off pbs still generates the email below giving the error: >> >> >> i'm still poking around to see if i can figure out what's up, but >> thought i would throw this out there in case someone else has come >> across it. >> >> swift, coaster and gram logs attached. >> >> ~sk >> >> ---------- Forwarded message ---------- >> From: adm >> Date: Fri, Aug 13, 2010 at 12:53 PM >> Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu >> To: skenny at abe1196.ncsa.uiuc.edu >> >> >> PBS Job Id: 3000582.abem5.ncsa.uiuc.edu >> Job Name: configtester >> Exec host: >> abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >> An error has occurred processing your job, see below. >> Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on >> host >> abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >> >> Unable to copy file >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout >> *** error from copy >> /bin/cp: cannot create regular file >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': >> No such file or directory >> *** end error output >> >> Unable to copy file >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr >> *** error from copy >> /bin/cp: cannot create regular file >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': >> No such file or directory >> *** end error output >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Fri Aug 13 13:36:15 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Aug 2010 13:36:15 -0500 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <4C658FFA.6090808@gmail.com> References: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> <4C658FFA.6090808@gmail.com> Message-ID: <1281724575.13279.1.camel@blabla2.none> On Fri, 2010-08-13 at 13:33 -0500, Jonathan Monette wrote: > My problem is on PADS. And no I do not have a cleanup error. At the > end of the run where it qdel all the workers it, it just hands. Once > all the jobs in the queue have been deleted swift hangs and I have to > 'control c' the job to gain control of the terminal again. On my larger > jobs I get a 'Failed to shutdown block' error. There is a watchdog that will eventually (after 5 minutes) force the service to shut down. It may also be possible to shut the service down asynchronously (i.e. send the command and then not wait for the actual shutdown). 
Mihael > > On 8/13/10 1:16 PM, Michael Wilde wrote: > > Sarah, what does "shut off cleanup" mean? > > > > Jon, is there any similarity between what Sarah is encountering and what you observed on TeraPort (presumably using provider=coaster jobmanager=local:pbs)? > > > > - Mike > > > > ----- "Sarah Kenny" wrote: > > > > > >> hi all, not sure if anyone else is running on abe, but for some > >> reason > >> cleanup seems to fail on there very consistently. swift throws a > >> warning: > >> > >> The following warnings have occurred: > >> 1. Cleanup on ABE failed > >> Caused by: > >> > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > >> Cannot submit job > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) > >> at > >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > >> at > >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > >> at > >> org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > >> Caused by: org.globus.gram.GramException: Parameter not supported > >> at org.globus.gram.Gram.request(Gram.java:358) > >> at org.globus.gram.GramJob.request(GramJob.java:262) > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) > >> ... 4 more > >> > >> if i shut off cleanup, i don't get the warning and the workflow > >> 'apprears' to have completed successfully, however even with cleanup > >> shut off pbs still generates the email below giving the error: > >> > >> > >> i'm still poking around to see if i can figure out what's up, but > >> thought i would throw this out there in case someone else has come > >> across it. > >> > >> swift, coaster and gram logs attached. > >> > >> ~sk > >> > >> ---------- Forwarded message ---------- > >> From: adm > >> Date: Fri, Aug 13, 2010 at 12:53 PM > >> Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu > >> To: skenny at abe1196.ncsa.uiuc.edu > >> > >> > >> PBS Job Id: 3000582.abem5.ncsa.uiuc.edu > >> Job Name: configtester > >> Exec host: > >> abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 > >> An error has occurred processing your job, see below. 
> >> Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on > >> host > >> abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 > >> > >> Unable to copy file > >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to > >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout > >> *** error from copy > >> /bin/cp: cannot create regular file > >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': > >> No such file or directory > >> *** end error output > >> > >> Unable to copy file > >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to > >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr > >> *** error from copy > >> /bin/cp: cannot create regular file > >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': > >> No such file or directory > >> *** end error output > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > From wozniak at mcs.anl.gov Fri Aug 13 13:37:06 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 13 Aug 2010 13:37:06 -0500 (CDT) Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <4C658FFA.6090808@gmail.com> References: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> <4C658FFA.6090808@gmail.com> Message-ID: I have been seeing this as well on PADS, I'm looking into it... (Note that this now also may leave nc processes running, use killall if those pile up.) On Fri, 13 Aug 2010, Jonathan Monette wrote: > My problem is on PADS. And no I do not have a cleanup error. At the end of > the run where it qdel all the workers it, it just hands. Once all the jobs > in the queue have been deleted swift hangs and I have to 'control c' the job > to gain control of the terminal again. On my larger jobs I get a 'Failed to > shutdown block' error. > > On 8/13/10 1:16 PM, Michael Wilde wrote: >> Sarah, what does "shut off cleanup" mean? >> >> Jon, is there any similarity between what Sarah is encountering and what >> you observed on TeraPort (presumably using provider=coaster >> jobmanager=local:pbs)? >> >> - Mike >> >> ----- "Sarah Kenny" wrote: >> >> >>> hi all, not sure if anyone else is running on abe, but for some >>> reason >>> cleanup seems to fail on there very consistently. swift throws a >>> warning: >>> >>> The following warnings have occurred: >>> 1. 
Cleanup on ABE failed >>> Caused by: >>> >>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >>> Cannot submit job >>> at >>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) >>> at >>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) >>> at >>> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) >>> at >>> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) >>> at >>> org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) >>> Caused by: org.globus.gram.GramException: Parameter not supported >>> at org.globus.gram.Gram.request(Gram.java:358) >>> at org.globus.gram.GramJob.request(GramJob.java:262) >>> at >>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) >>> ... 4 more >>> >>> if i shut off cleanup, i don't get the warning and the workflow >>> 'apprears' to have completed successfully, however even with cleanup >>> shut off pbs still generates the email below giving the error: >>> >>> >>> i'm still poking around to see if i can figure out what's up, but >>> thought i would throw this out there in case someone else has come >>> across it. >>> >>> swift, coaster and gram logs attached. >>> >>> ~sk >>> >>> ---------- Forwarded message ---------- >>> From: adm >>> Date: Fri, Aug 13, 2010 at 12:53 PM >>> Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu >>> To: skenny at abe1196.ncsa.uiuc.edu >>> >>> >>> PBS Job Id: 3000582.abem5.ncsa.uiuc.edu >>> Job Name: configtester >>> Exec host: >>> abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >>> An error has occurred processing your job, see below. 
>>> Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on >>> host >>> abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >>> >>> Unable to copy file >>> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to >>> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout >>> *** error from copy >>> /bin/cp: cannot create regular file >>> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': >>> No such file or directory >>> *** end error output >>> >>> Unable to copy file >>> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to >>> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr >>> *** error from copy >>> /bin/cp: cannot create regular file >>> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': >>> No such file or directory >>> *** end error output >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> > > -- Justin M Wozniak From jon.monette at gmail.com Fri Aug 13 14:01:51 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Fri, 13 Aug 2010 14:01:51 -0500 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: References: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> <4C658FFA.6090808@gmail.com> Message-ID: <4C65969F.9000901@gmail.com> Ok. I will do a killall on the nc jobs. Also if swift fails to shutdown the block does the coaster service shutdown? On my largest run it failed to shutdown 4 blocks and no has hung. There are no more jobs in the queue and none are being submitted anymore. On 8/13/10 1:37 PM, Justin M Wozniak wrote: > > I have been seeing this as well on PADS, I'm looking into it... > > (Note that this now also may leave nc processes running, use killall > if those pile up.) > > On Fri, 13 Aug 2010, Jonathan Monette wrote: > >> My problem is on PADS. And no I do not have a cleanup error. At the >> end of the run where it qdel all the workers it, it just hands. Once >> all the jobs in the queue have been deleted swift hangs and I have to >> 'control c' the job to gain control of the terminal again. On my >> larger jobs I get a 'Failed to shutdown block' error. >> >> On 8/13/10 1:16 PM, Michael Wilde wrote: >>> Sarah, what does "shut off cleanup" mean? >>> >>> Jon, is there any similarity between what Sarah is encountering and >>> what you observed on TeraPort (presumably using provider=coaster >>> jobmanager=local:pbs)? >>> >>> - Mike >>> >>> ----- "Sarah Kenny" wrote: >>> >>> >>>> hi all, not sure if anyone else is running on abe, but for some >>>> reason >>>> cleanup seems to fail on there very consistently. swift throws a >>>> warning: >>>> >>>> The following warnings have occurred: >>>> 1. 
Cleanup on ABE failed >>>> Caused by: >>>> >>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >>>> Cannot submit job >>>> at >>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) >>>> >>>> at >>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) >>>> >>>> at >>>> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) >>>> >>>> at >>>> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) >>>> >>>> at >>>> org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) >>>> >>>> Caused by: org.globus.gram.GramException: Parameter not supported >>>> at org.globus.gram.Gram.request(Gram.java:358) >>>> at org.globus.gram.GramJob.request(GramJob.java:262) >>>> at >>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) >>>> >>>> ... 4 more >>>> >>>> if i shut off cleanup, i don't get the warning and the workflow >>>> 'apprears' to have completed successfully, however even with cleanup >>>> shut off pbs still generates the email below giving the error: >>>> >>>> >>>> i'm still poking around to see if i can figure out what's up, but >>>> thought i would throw this out there in case someone else has come >>>> across it. >>>> >>>> swift, coaster and gram logs attached. >>>> >>>> ~sk >>>> >>>> ---------- Forwarded message ---------- >>>> From: adm >>>> Date: Fri, Aug 13, 2010 at 12:53 PM >>>> Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu >>>> To: skenny at abe1196.ncsa.uiuc.edu >>>> >>>> >>>> PBS Job Id: 3000582.abem5.ncsa.uiuc.edu >>>> Job Name: configtester >>>> Exec host: >>>> >>>> abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >>>> >>>> An error has occurred processing your job, see below. 
>>>> Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on >>>> host >>>> abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >>>> >>>> >>>> Unable to copy file >>>> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to >>>> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout >>>> *** error from copy >>>> /bin/cp: cannot create regular file >>>> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': >>>> >>>> No such file or directory >>>> *** end error output >>>> >>>> Unable to copy file >>>> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to >>>> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr >>>> *** error from copy >>>> /bin/cp: cannot create regular file >>>> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': >>>> >>>> No such file or directory >>>> *** end error output >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>> >> >> > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From skenny at uchicago.edu Fri Aug 13 14:37:42 2010 From: skenny at uchicago.edu (Sarah Kenny) Date: Fri, 13 Aug 2010 14:37:42 -0500 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> References: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> Message-ID: On Fri, Aug 13, 2010 at 1:16 PM, Michael Wilde wrote: > Sarah, what does "shut off cleanup" mean? sitedir.keep=true > > Jon, is there any similarity between what Sarah is encountering and what you observed on TeraPort (presumably using provider=coaster jobmanager=local:pbs)? > > - Mike > > ----- "Sarah Kenny" wrote: > >> hi all, not sure if anyone else is running on abe, but for some >> reason >> cleanup seems to fail on there very consistently. swift throws a >> warning: >> >> The following warnings have occurred: >> 1. Cleanup on ABE failed >> Caused by: >> >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> Cannot submit job >> ? ? ? ? at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) >> ? ? ? ? at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) >> ? ? ? ? at >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) >> ? ? ? ? at >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) >> ? ? ? ? at >> org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) >> Caused by: org.globus.gram.GramException: Parameter not supported >> ? ? ? ? at org.globus.gram.Gram.request(Gram.java:358) >> ? ? ? ? at org.globus.gram.GramJob.request(GramJob.java:262) >> ? ? ? ? at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) >> ? ? ? ? ... 
4 more >> >> if i shut off cleanup, i don't get the warning and the workflow >> 'apprears' to have completed successfully, however even with cleanup >> shut off pbs still generates the email below giving the error: >> >> >> i'm still poking around to see if i can figure out what's up, but >> thought i would throw this out there in case someone else has come >> across it. >> >> swift, coaster and gram logs attached. >> >> ~sk >> >> ---------- Forwarded message ---------- >> From: adm >> Date: Fri, Aug 13, 2010 at 12:53 PM >> Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu >> To: skenny at abe1196.ncsa.uiuc.edu >> >> >> PBS Job Id: 3000582.abem5.ncsa.uiuc.edu >> Job Name: ? configtester >> Exec host: >> ?abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >> An error has occurred processing your job, see below. >> Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on >> host >> abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >> >> Unable to copy file >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout >> *** error from copy >> /bin/cp: cannot create regular file >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': >> No such file or directory >> *** end error output >> >> Unable to copy file >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr >> *** error from copy >> /bin/cp: cannot create regular file >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': >> No such file or directory >> *** end error output >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > From dk0966 at cs.ship.edu Fri Aug 13 17:16:45 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 13 Aug 2010 18:16:45 -0400 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: References: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> <4C658FFA.6090808@gmail.com> Message-ID: On Fri, Aug 13, 2010 at 2:37 PM, Justin M Wozniak wrote: > (Note that this now also may leave nc processes running, use killall if > those pile up.) > I couldn't get this to happen on my machine, but I suspect ctrl-c is somehow causing it to sit and wait for input. I just checked in a new version of the swift script which adds a timeout value. The process should die after 60 seconds. Please let me know if you're still seeing any hanging processes and I'll take a closer look at it. David -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Fri Aug 13 19:38:01 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Aug 2010 19:38:01 -0500 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: References: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> <4C658FFA.6090808@gmail.com> Message-ID: <1281746281.4378.6.camel@blabla2.none> I'm seeing a "which: no nc in $PATH" on Intrepid. I don't mind it in particular, but it doesn't look very nice either. Mihael On Fri, 2010-08-13 at 18:16 -0400, David Kelly wrote: > > > On Fri, Aug 13, 2010 at 2:37 PM, Justin M Wozniak > wrote: > > (Note that this now also may leave nc processes running, use > killall if those pile up.) > > I couldn't get this to happen on my machine, but I suspect ctrl-c is > somehow causing it to sit and wait for input. I just checked in a new > version of the swift script which adds a timeout value. The process > should die after 60 seconds. Please let me know if you're still seeing > any hanging processes and I'll take a closer look at it. > > David > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From dk0966 at cs.ship.edu Fri Aug 13 20:30:18 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 13 Aug 2010 21:30:18 -0400 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <1281746281.4378.6.camel@blabla2.none> References: <1083131031.886451281723391529.JavaMail.root@zimbra.anl.gov> <4C658FFA.6090808@gmail.com> <1281746281.4378.6.camel@blabla2.none> Message-ID: Don't have access to intrepid, but my guess is that version of 'which' was sending to stderr and not being captured. That should be fixed now. David On Fri, Aug 13, 2010 at 8:38 PM, Mihael Hategan wrote: > I'm seeing a "which: no nc in $PATH" on Intrepid. > > I don't mind it in particular, but it doesn't look very nice either. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Sun Aug 15 14:04:02 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Sun, 15 Aug 2010 13:04:02 -0600 (GMT-06:00) Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <762540926.915821281898981743.JavaMail.root@zimbra.anl.gov> Message-ID: <1125803895.915841281899042029.JavaMail.root@zimbra.anl.gov> I see that the BG/P login hosts have netcat rather than nc. - Mike sur$ netcat -h [v1.10] connect to somewhere: netcat [-options] hostname port[s] [ports] ... listen for inbound: netcat -l -p port [-options] [hostname] [port] options: -g gateway source-routing hop point[s], up to 8 -G num source-routing pointer: 4, 8, 12, ... -h this cruft -i secs delay interval for lines sent, ports scanned -l listen mode, for inbound connects -n numeric-only IP addresses, no DNS -o file hex dump of traffic -p port local port number -r randomize local and remote ports -s addr local source address -t answer TELNET negotiation -u UDP mode -v verbose [use twice to be more verbose] -w secs timeout for connects and final net reads -z zero-I/O mode [used for scanning] port numbers can be individual or ranges: lo-hi [inclusive] sur$ ----- "David Kelly" wrote: > Don't have access to intrepid, but my guess is that version of 'which' > was sending to stderr and not being captured. That should be fixed > now. > > David > > > On Fri, Aug 13, 2010 at 8:38 PM, Mihael Hategan < hategan at mcs.anl.gov > > wrote: > > > I'm seeing a "which: no nc in $PATH" on Intrepid. 
> > I don't mind it in particular, but it doesn't look very nice either. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Sun Aug 15 14:20:13 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Sun, 15 Aug 2010 13:20:13 -0600 (GMT-06:00) Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <1976291289.915911281899632286.JavaMail.root@zimbra.anl.gov> Message-ID: <1429214909.917161281900013490.JavaMail.root@zimbra.anl.gov> If netcat blocks we should have it time out in a few seconds at most. It would be good to see where its blocking. Im not sure where that would happen to send a UDP packet - its one reason we're using UDP. The next time anyone sees a hung nc, can you find out what its waiting on, with some ps options like this: ps -o pid,wchan=WIDE-WCHAN-COLUMN -o comm (Others may suggest a better way to do this?) I see this note about netcat blocking: From: http://nc110.sourceforge.net/ "Where many other network apps use the FIONBIO ioctl to set non-blocking I/O on network sockets, netcat uses straightforward blocking I/O everywhere. This makes everything very lock-step, relying on the network and filesystem layers to feed in data when needed. Data read in is completely written out before any more is fetched. This may not be quite the right thing to do under some OSes that don't do timed select() right, but this remains to be seen." I dont see where it would block trying to send a UDP packet, so either there are cases where it can (eg buffers full? firewalls making a send() hang?), or maybe its also doing some kind of read/recv? - Mike ----- "David Kelly" wrote: > On Fri, Aug 13, 2010 at 2:37 PM, Justin M Wozniak < > wozniak at mcs.anl.gov > wrote: > > > > (Note that this now also may leave nc processes running, use killall > if those pile up.) > > > I couldn't get this to happen on my machine, but I suspect ctrl-c is > somehow causing it to sit and wait for input. I just checked in a new > version of the swift script which adds a timeout value. The process > should die after 60 seconds. Please let me know if you're still seeing > any hanging processes and I'll take a closer look at it. > > David > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From benc at hawaga.org.uk Sun Aug 15 15:00:41 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 15 Aug 2010 20:00:41 +0000 (GMT) Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <1429214909.917161281900013490.JavaMail.root@zimbra.anl.gov> References: <1429214909.917161281900013490.JavaMail.root@zimbra.anl.gov> Message-ID: > The next time anyone sees a hung nc, can you find out what its waiting > on DNS lookup? that's where every mysterious network hang ever seems to come from. Globus already did a bunch of usage stats stuff with (now) some years of experience. Using that code might be more useful than relearning all that experience. 
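For anyone who wants to poke at this locally, here is a rough sketch of the hardening being discussed: prefer nc, fall back to netcat, send only to an IP literal with DNS disabled, and bound the send with a timeout. The collector address and port below are placeholders and option support differs between nc/netcat builds, so treat this as an illustration of the idea rather than the actual sender in the swift script:

#!/bin/bash
# Illustrative only -- HOST/PORT are made-up placeholders, not the real usage collector.
HOST=192.0.2.10     # an IP literal, so no DNS lookup can hang
PORT=9999

# Prefer nc, fall back to netcat (the BG/P login nodes ship netcat only);
# silence 'which' so "which: no nc in $PATH" never reaches the user.
NC=$(which nc 2>/dev/null || which netcat 2>/dev/null)
[ -z "$NC" ] && exit 0          # nothing to send with; give up quietly

# -u UDP, -n numeric-only (no DNS), -w 10 seconds network timeout
# (all three options appear in the netcat -h output quoted earlier).
echo "swift-usage-ping" | "$NC" -u -n -w 10 "$HOST" "$PORT" &
NC_PID=$!
( sleep 10; kill "$NC_PID" 2>/dev/null ) &    # belt-and-braces timeout
wait "$NC_PID" 2>/dev/null

The IP-literal plus -n sidesteps the DNS hang Ben suspects, and the explicit kill is there because -w semantics vary between builds, particularly for UDP sends; if a process still hangs, Mike's ps line run with -p <pid> should show which wait channel it is stuck on.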
-- From wilde at mcs.anl.gov Sun Aug 15 15:54:24 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 15 Aug 2010 14:54:24 -0600 (GMT-06:00) Subject: [Swift-devel] cleanup fails on abe In-Reply-To: Message-ID: <1573722486.917821281905664472.JavaMail.root@zimbra.anl.gov> ----- "Ben Clifford" wrote: > > The next time anyone sees a hung nc, can you find out what its > waiting > > on > > DNS lookup? that's where every mysterious network hang ever seems to > come > from. Good suspect. > Globus already did a bunch of usage stats stuff with (now) some years > of > experience. Using that code might be more useful than relearning all > that > experience. Yes, thats on the radar. Its got its complexities as well, but maybe the next step. From dk0966 at cs.ship.edu Sun Aug 15 17:30:53 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Sun, 15 Aug 2010 18:30:53 -0400 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <1573722486.917821281905664472.JavaMail.root@zimbra.anl.gov> References: <1573722486.917821281905664472.JavaMail.root@zimbra.anl.gov> Message-ID: It connects directly to the IP address now. All DNS and service lookups are disabled. It checks first for nc, then netcat. It times out after 10 seconds. If anyone finds that nc is still hanging after these updates, Mihael will pay them $20 :-) David On Sun, Aug 15, 2010 at 4:54 PM, Michael Wilde wrote: > > ----- "Ben Clifford" wrote: > > > > The next time anyone sees a hung nc, can you find out what its > > waiting > > > on > > > > DNS lookup? that's where every mysterious network hang ever seems to > > come > > from. > > Good suspect. > > > Globus already did a bunch of usage stats stuff with (now) some years > > of > > experience. Using that code might be more useful than relearning all > > that > > experience. > > Yes, thats on the radar. Its got its complexities as well, but maybe the > next step. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skenny at uchicago.edu Mon Aug 16 11:28:33 2010 From: skenny at uchicago.edu (Sarah Kenny) Date: Mon, 16 Aug 2010 11:28:33 -0500 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: <1281723575.12858.5.camel@blabla2.none> References: <1281723575.12858.5.camel@blabla2.none> Message-ID: here's the entirety of the gram log for the rm job: 8/16 10:58:31 JM: Security context imported 8/16 10:58:31 Pre-parsed RSL string: &( directory = "/scratch/users/skenny" )( arguments = "-rf" "rmx-20100816-1055-k1z90mf7" )( maxnodes = "16" )( executable = "/bin/rm" )( maxwalltime = "30" )( project = "TG-DBS080004N" )( queue = "normal" )( slots = "10" )( nodegranularity = "16" )( name = "cleantest" )( workerspernode = "1" ) 8/16 10:58:31 <<<<>>>>Job Request RSL 8/16 10:58:31 <<<<>>>>Job Request RSL (canonical) 8/16 10:58:31 <<<<>>>>Job RSL 8/16 10:58:31 <<<<>>>>Job RSL (post-eval) 8/16 10:58:31 JMI: testing job manager scripts for type pbs exist and permissions are ok. 8/16 10:58:31 JMI: completed script validation: job manager type is pbs. 8/16 10:58:31 JMI: cmd = cache_cleanup Mon Aug 16 10:58:31 2010 JM_SCRIPT: New Perl JobManager created. 
Mon Aug 16 10:58:31 2010 JM_SCRIPT: Using jm supplied job dir: /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/2408.1281974311 Mon Aug 16 10:58:31 2010 JM_SCRIPT: Using jm supplied job dir: /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/2408.1281974311 Mon Aug 16 10:58:31 2010 JM_SCRIPT: cache_cleanup(enter) Mon Aug 16 10:58:31 2010 JM_SCRIPT: Cleaning files in job dir /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/2408.1281974311 Mon Aug 16 10:58:31 2010 JM_SCRIPT: Removed 1 files from /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/2408.1281974311 Mon Aug 16 10:58:31 2010 JM_SCRIPT: cache_cleanup(exit) 8/16 10:58:31 JM: before sending to client: rc=0 (Success) 8/16 10:58:31 JM: in globus_gram_job_manager_reporting_file_remove() 8/16 10:58:31 JM: in globus_gram_job_manager_reporting_file_remove() 8/16 10:58:31 JM: exiting globus_gram_job_manager. as far as i can tell i'm not at quota on my work or home dir's on abe. yeah we were able to run fine before...haven't changed our config since then so maybe something on their end. On Fri, Aug 13, 2010 at 1:19 PM, Mihael Hategan wrote: > Can you can find the gram log for the cleanup job (it's a /bin/rm)? > > Also, I remember you being able to run things just fine on Abe. Are you > aware of any configuration changes there? Any disks full? > > On Fri, 2010-08-13 at 13:11 -0500, Sarah Kenny wrote: >> hi all, not sure if anyone else is running on abe, but for some reason >> cleanup seems to fail on there very consistently. swift throws a >> warning: >> >> The following warnings have occurred: >> 1. Cleanup on ABE failed >> Caused by: >> >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> Cannot submit job >> ? ? ? ? at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) >> ? ? ? ? at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) >> ? ? ? ? at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) >> ? ? ? ? at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) >> ? ? ? ? at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) >> Caused by: org.globus.gram.GramException: Parameter not supported >> ? ? ? ? at org.globus.gram.Gram.request(Gram.java:358) >> ? ? ? ? at org.globus.gram.GramJob.request(GramJob.java:262) >> ? ? ? ? at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) >> ? ? ? ? ... 4 more >> >> if i shut off cleanup, i don't get the warning and the workflow >> 'apprears' to have completed successfully, however even with cleanup >> shut off pbs still generates the email below giving the error: >> >> >> i'm still poking around to see if i can figure out what's up, but >> thought i would throw this out there in case someone else has come >> across it. >> >> swift, coaster and gram logs attached. >> >> ~sk >> >> ---------- Forwarded message ---------- >> From: adm >> Date: Fri, Aug 13, 2010 at 12:53 PM >> Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu >> To: skenny at abe1196.ncsa.uiuc.edu >> >> >> PBS Job Id: 3000582.abem5.ncsa.uiuc.edu >> Job Name: ? 
configtester >> Exec host: ?abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >> An error has occurred processing your job, see below. >> Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on >> host abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 >> >> Unable to copy file >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout >> *** error from copy >> /bin/cp: cannot create regular file >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': >> No such file or directory >> *** end error output >> >> Unable to copy file >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr >> *** error from copy >> /bin/cp: cannot create regular file >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': >> No such file or directory >> *** end error output >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From hategan at mcs.anl.gov Mon Aug 16 11:36:07 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 16 Aug 2010 11:36:07 -0500 Subject: [Swift-devel] cleanup fails on abe In-Reply-To: References: <1281723575.12858.5.camel@blabla2.none> Message-ID: <1281976567.31286.6.camel@blabla2.none> Looks like some coaster parameters are making it into the RSL and I believe that's what GRAM is complaining about. 
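One way to test that hypothesis from the login node is to hand-submit the same cleanup job with globusrun and add the suspect attributes back one at a time. The contact string and paths below are copied from the logs in this thread or made up (the rm target is deliberately nonexistent so the job is harmless), and the globusrun invocation is the usual GT2 client form, so read this as a sketch of the experiment rather than a verified recipe:

#!/bin/bash
# 1) A plain RSL using only standard GRAM attributes -- expected to be accepted.
globusrun -o -r abe1196.ncsa.uiuc.edu/jobmanager-pbs \
  '&(executable="/bin/rm")(arguments="-rf" "/tmp/does-not-exist")(directory="/scratch/users/skenny")(queue="normal")(project="TG-DBS080004N")'

# 2) The same harmless job with the coaster-profile keys that show up in the
#    pre-parsed RSL in Sarah's gram log (maxnodes, slots, nodegranularity,
#    workerspernode). If the hypothesis is right, this is what should draw
#    the "Parameter not supported" response.
globusrun -o -r abe1196.ncsa.uiuc.edu/jobmanager-pbs \
  '&(executable="/bin/rm")(arguments="-rf" "/tmp/does-not-exist")(slots="10")(workerspernode="1")(maxnodes="16")(nodegranularity="16")'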
On Mon, 2010-08-16 at 11:28 -0500, Sarah Kenny wrote: > here's the entirety of the gram log for the rm job: > > 8/16 10:58:31 JM: Security context imported > 8/16 10:58:31 Pre-parsed RSL string: &( directory = > "/scratch/users/skenny" )( arguments = "-rf" > "rmx-20100816-1055-k1z90mf7" )( maxnodes = "16" )( executable = > "/bin/rm" )( maxwalltime = "30" )( project = "TG-DBS080004N" )( queue > = "normal" )( slots = "10" )( nodegranularity = "16" )( name = > "cleantest" )( workerspernode = "1" ) > 8/16 10:58:31 > <<<< &("directory" = "/scratch/users/skenny" )("arguments" = "-rf" > "rmx-20100816-1055-k1z90mf7" )("maxnodes" = "16" )("executable" = > "/bin/rm" )("maxwalltime" = "30" )("project" = "TG-DBS080004N" > )("queue" = "normal" )("slots" = "10" )("nodegranularity" = "16" > )("name" = "cleantest" )("workerspernode" = "1" ) > >>>>>Job Request RSL > 8/16 10:58:31 > <<<< &("directory" = "/scratch/users/skenny" )("arguments" = "-rf" > "rmx-20100816-1055-k1z90mf7" )("maxnodes" = "16" )("executable" = > "/bin/rm" )("maxwalltime" = "30" )("project" = "TG-DBS080004N" > )("queue" = "normal" )("slots" = "10" )("nodegranularity" = "16" > )("name" = "cleantest" )("workerspernode" = "1" ) > >>>>>Job Request RSL (canonical) > 8/16 10:58:31 > <<<< &("environment" = ("HOME" "/u/ac/skenny" ) ("LOGNAME" "skenny" ) > )("directory" = "/scratch/users/skenny" )("arguments" = "-rf" > "rmx-20100816-1055-k1z90mf7")("maxnodes" = "16" )("executable" = > "/bin/rm" )("maxwalltime" = "30" )("project" = "TG-DBS080004N" > )("queue" = "normal" )("slots" = "10" )("nodegranularity" = "16" > )("name" = "cleantest" )("workerspernode" = "1" ) > >>>>>Job RSL > 8/16 10:58:31 > <<<< &("environment" = ("HOME" "/u/ac/skenny" ) ("LOGNAME" "skenny" ) > )("directory" = "/scratch/users/skenny" )("arguments" = "-rf" > "rmx-20100816-1055-k1z90mf7" )("maxnodes" = "16" )("executable" = > "/bin/rm" )("maxwalltime" = "30" )("project" = "TG-DBS080004N" > )("queue" = "normal" )("slots" = "10" )("nodegranularity\ > " = "16" )("name" = "cleantest" )("workerspernode" = "1" ) > >>>>>Job RSL (post-eval) > 8/16 10:58:31 JMI: testing job manager scripts for type pbs exist and > permissions are ok. > 8/16 10:58:31 JMI: completed script validation: job manager type is pbs. > 8/16 10:58:31 JMI: cmd = cache_cleanup > Mon Aug 16 10:58:31 2010 JM_SCRIPT: New Perl JobManager created. > Mon Aug 16 10:58:31 2010 JM_SCRIPT: Using jm supplied job dir: > /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/2408.1281974311 > Mon Aug 16 10:58:31 2010 JM_SCRIPT: Using jm supplied job dir: > /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/2408.1281974311 > Mon Aug 16 10:58:31 2010 JM_SCRIPT: cache_cleanup(enter) > Mon Aug 16 10:58:31 2010 JM_SCRIPT: Cleaning files in job dir > /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/2408.1281974311 > Mon Aug 16 10:58:31 2010 JM_SCRIPT: Removed 1 files from > /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/2408.1281974311 > Mon Aug 16 10:58:31 2010 JM_SCRIPT: cache_cleanup(exit) > 8/16 10:58:31 JM: before sending to client: rc=0 (Success) > 8/16 10:58:31 JM: in globus_gram_job_manager_reporting_file_remove() > 8/16 10:58:31 JM: in globus_gram_job_manager_reporting_file_remove() > 8/16 10:58:31 JM: exiting globus_gram_job_manager. > > as far as i can tell i'm not at quota on my work or home dir's on abe. > yeah we were able to run fine before...haven't changed our config > since then so maybe something on their end. 
> > > On Fri, Aug 13, 2010 at 1:19 PM, Mihael Hategan wrote: > > Can you can find the gram log for the cleanup job (it's a /bin/rm)? > > > > Also, I remember you being able to run things just fine on Abe. Are you > > aware of any configuration changes there? Any disks full? > > > > On Fri, 2010-08-13 at 13:11 -0500, Sarah Kenny wrote: > >> hi all, not sure if anyone else is running on abe, but for some reason > >> cleanup seems to fail on there very consistently. swift throws a > >> warning: > >> > >> The following warnings have occurred: > >> 1. Cleanup on ABE failed > >> Caused by: > >> > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > >> Cannot submit job > >> at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146) > >> at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100) > >> at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > >> at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > >> at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > >> Caused by: org.globus.gram.GramException: Parameter not supported > >> at org.globus.gram.Gram.request(Gram.java:358) > >> at org.globus.gram.GramJob.request(GramJob.java:262) > >> at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134) > >> ... 4 more > >> > >> if i shut off cleanup, i don't get the warning and the workflow > >> 'apprears' to have completed successfully, however even with cleanup > >> shut off pbs still generates the email below giving the error: > >> > >> > >> i'm still poking around to see if i can figure out what's up, but > >> thought i would throw this out there in case someone else has come > >> across it. > >> > >> swift, coaster and gram logs attached. > >> > >> ~sk > >> > >> ---------- Forwarded message ---------- > >> From: adm > >> Date: Fri, Aug 13, 2010 at 12:53 PM > >> Subject: PBS JOB 3000582.abem5.ncsa.uiuc.edu > >> To: skenny at abe1196.ncsa.uiuc.edu > >> > >> > >> PBS Job Id: 3000582.abem5.ncsa.uiuc.edu > >> Job Name: configtester > >> Exec host: abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 > >> An error has occurred processing your job, see below. 
> >> Post job file processing error; job 3000582.abem5.ncsa.uiuc.edu on > >> host abe0553/0+abe0314/0+abe0313/0+abe0311/0+abe0310/0+abe0307/0+abe0294/0+abe0290/0+abe0287/0+abe0286/0+abe0285/0+abe0284/0+abe0283/0+abe0279/0+abe0278/0+abe0277/0+abe0275/0+abe0273/0+abe0272/0+abe0271/0+abe0256/0+abe0254/0+abe0174/0+abe0173/0+abe0166/0+abe0165/0+abe0163/0+abe0087/0+abe0085/0+abe0084/0+abe0010/0+abe0387/0 > >> > >> Unable to copy file > >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.OU to > >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout > >> *** error from copy > >> /bin/cp: cannot create regular file > >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stdout': > >> No such file or directory > >> *** end error output > >> > >> Unable to copy file > >> /u/ac/skenny/.pbs_spool//3000582.abem5.ncsa.uiuc.edu.ER to > >> /u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr > >> *** error from copy > >> /bin/cp: cannot create regular file > >> `/u/ac/skenny/.globus/job/abe1196.ncsa.uiuc.edu/15575.1281721892/stderr': > >> No such file or directory > >> *** end error output > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > From jon.monette at gmail.com Mon Aug 16 13:38:39 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Mon, 16 Aug 2010 13:38:39 -0500 Subject: [Swift-devel] Coaster error Message-ID: <4C6985AF.5070400@gmail.com> Hello, I am getting this error when running coasters on PADS. Canceling job 449188.svc.pads.ci.uchicago.edu Canceling job 449189.svc.pads.ci.uchicago.edu Failed to shut down block org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. qdel returned with an exit code of 153 at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:159) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:101) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:90) at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:44) at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:293) at org.globus.cog.abstraction.coaster.service.job.manager.Block$1.run(Block.java:284) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) I am assuming is that coasters could not qdel a job. As soon as this error appeared all my jobs in the queue disappeared and no more jobs are submitted. My script hangs because it is waiting for some apps to run but the jobs are never submitted to the PADS scheduler. My run and all the log files are located at /home/jonmon/Workspace/Montage/m101_j_6x6/runs/m101_montage_Aug-16-2010_13-24-43 on the CI machines. -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. 
- Albert Einstein From jon.monette at gmail.com Tue Aug 17 12:08:29 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 17 Aug 2010 12:08:29 -0500 Subject: [Swift-devel] Re: Coaster error In-Reply-To: <4C6985AF.5070400@gmail.com> References: <4C6985AF.5070400@gmail.com> Message-ID: <4C6AC20D.30306@gmail.com> Ok. Have ran more tests on this problem. I am running on both localhost and pads. In the first stage of my workflow I run on localhost to collect some metadata. I then use this metadata to reproject the images submitting these jobs to pads. All the images are reprojected and completes without error. After this the coasters is waiting for more jobs to submit to the workers while localhost is collecting more metadata. I believe coasters starts to shutdown some of the workers because they are idle and wants to free the resources on the machine(am I correct so far?) During the shutdown some workers are shutdown successfully but there is always 1 or 2 that fail to shutdown and I get the qdel error 153 I mentioned yesterday. If coasters fails to shutdown a job does the service terminate? I ask this because after the job fails to shutdown there are no more jobs being submitted in the queue and my script hangs since it is waiting for the next stage in my workflow to complete. Is there a coaster parameter that lets coasters know to not shutdown the workers even if they become idle for a bit or is this a legitimate error in coasters? On 8/16/10 1:38 PM, Jonathan Monette wrote: > Hello, > I am getting this error when running coasters on PADS. > > Canceling job 449188.svc.pads.ci.uchicago.edu > Canceling job 449189.svc.pads.ci.uchicago.edu > Failed to shut down block > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Failed to cancel task. qdel returned with an exit code of 153 > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:159) > > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:70) > > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:101) > > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:90) > > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:44) > > at > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:293) > > at > org.globus.cog.abstraction.coaster.service.job.manager.Block$1.run(Block.java:284) > > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > > I am assuming is that coasters could not qdel a job. As soon as this > error appeared all my jobs in the queue disappeared and no more jobs > are submitted. My script hangs because it is waiting for some apps to > run but the jobs are never submitted to the PADS scheduler. My run > and all the log files are located at > /home/jonmon/Workspace/Montage/m101_j_6x6/runs/m101_montage_Aug-16-2010_13-24-43 > on the CI machines. > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. 
- Albert Einstein From hategan at mcs.anl.gov Tue Aug 17 12:43:47 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 17 Aug 2010 12:43:47 -0500 Subject: [Swift-devel] Re: Coaster error In-Reply-To: <4C6AC20D.30306@gmail.com> References: <4C6985AF.5070400@gmail.com> <4C6AC20D.30306@gmail.com> Message-ID: <1282067027.2881.3.camel@blabla2.none> On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote: > Ok. Have ran more tests on this problem. I am running on both > localhost and pads. In the first stage of my workflow I run on > localhost to collect some metadata. I then use this metadata to > reproject the images submitting these jobs to pads. All the images are > reprojected and completes without error. After this the coasters is > waiting for more jobs to submit to the workers while localhost is > collecting more metadata. I believe coasters starts to shutdown some of > the workers because they are idle and wants to free the resources on the > machine(am I correct so far?) You are. > During the shutdown some workers are > shutdown successfully but there is always 1 or 2 that fail to shutdown > and I get the qdel error 153 I mentioned yesterday. If coasters fails > to shutdown a job does the service terminate? No. The qdel part is not critical and is used when workers don't shut down cleanly or on time. > I ask this because after > the job fails to shutdown there are no more jobs being submitted in the > queue and my script hangs since it is waiting for the next stage in my > workflow to complete. Is there a coaster parameter that lets coasters > know to not shutdown the workers even if they become idle for a bit or > is this a legitimate error in coasters? You are assuming that the shutdown failure has something to do with jobs not being run. I do not think that's necessarily right. From jon.monette at gmail.com Tue Aug 17 13:21:03 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 17 Aug 2010 13:21:03 -0500 Subject: [Swift-devel] Re: Coaster error In-Reply-To: <1282067027.2881.3.camel@blabla2.none> References: <4C6985AF.5070400@gmail.com> <4C6AC20D.30306@gmail.com> <1282067027.2881.3.camel@blabla2.none> Message-ID: <4C6AD30F.505@gmail.com> Or so the qdel error I am seeing is ignorable? And I am assuming that the shutdown failure has something to do with the jobs being run because when I run a smaller data set (10 images instead of 1300 images) the shutdown error happens at the end of the workflow and I also get the error Failed to shut down channel org.globus.cog.karajan.workflow.service.channels.ChannelException: Invalid channel: 1338035062: {} at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:442) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:422) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257) On 8/17/10 12:43 PM, Mihael Hategan wrote: > On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote: > >> Ok. Have ran more tests on this problem. I am running on both >> localhost and pads. 
In the first stage of my workflow I run on >> localhost to collect some metadata. I then use this metadata to >> reproject the images submitting these jobs to pads. All the images are >> reprojected and completes without error. After this the coasters is >> waiting for more jobs to submit to the workers while localhost is >> collecting more metadata. I believe coasters starts to shutdown some of >> the workers because they are idle and wants to free the resources on the >> machine(am I correct so far?) >> > You are. > > >> During the shutdown some workers are >> shutdown successfully but there is always 1 or 2 that fail to shutdown >> and I get the qdel error 153 I mentioned yesterday. If coasters fails >> to shutdown a job does the service terminate? >> > No. The qdel part is not critical and is used when workers don't shut > down cleanly or on time. > > >> I ask this because after >> the job fails to shutdown there are no more jobs being submitted in the >> queue and my script hangs since it is waiting for the next stage in my >> workflow to complete. Is there a coaster parameter that lets coasters >> know to not shutdown the workers even if they become idle for a bit or >> is this a legitimate error in coasters? >> > You are assuming that the shutdown failure has something to do with jobs > not being run. I do not think that's necessarily right. > > > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Tue Aug 17 13:33:20 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 17 Aug 2010 13:33:20 -0500 Subject: [Swift-devel] Re: Coaster error In-Reply-To: <4C6AD30F.505@gmail.com> References: <4C6985AF.5070400@gmail.com> <4C6AC20D.30306@gmail.com> <1282067027.2881.3.camel@blabla2.none> <4C6AD30F.505@gmail.com> Message-ID: <1282070000.3517.0.camel@blabla2.none> The failure to shut down a channel is also ignorable. Essentially the worker shuts down before it gets to acknowledge the shutdown command. I guess this could be fixed, but for now ignore it. On Tue, 2010-08-17 at 13:21 -0500, Jonathan Monette wrote: > Or so the qdel error I am seeing is ignorable? And I am assuming that > the shutdown failure has something to do with the jobs being run because > when I run a smaller data set (10 images instead of 1300 images) the > shutdown error happens at the end of the workflow and I also get the error > > Failed to shut down channel > org.globus.cog.karajan.workflow.service.channels.ChannelException: > Invalid channel: 1338035062: {} > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:442) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:422) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257) > > > On 8/17/10 12:43 PM, Mihael Hategan wrote: > > On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote: > > > >> Ok. Have ran more tests on this problem. 
I am running on both > >> localhost and pads. In the first stage of my workflow I run on > >> localhost to collect some metadata. I then use this metadata to > >> reproject the images submitting these jobs to pads. All the images are > >> reprojected and completes without error. After this the coasters is > >> waiting for more jobs to submit to the workers while localhost is > >> collecting more metadata. I believe coasters starts to shutdown some of > >> the workers because they are idle and wants to free the resources on the > >> machine(am I correct so far?) > >> > > You are. > > > > > >> During the shutdown some workers are > >> shutdown successfully but there is always 1 or 2 that fail to shutdown > >> and I get the qdel error 153 I mentioned yesterday. If coasters fails > >> to shutdown a job does the service terminate? > >> > > No. The qdel part is not critical and is used when workers don't shut > > down cleanly or on time. > > > > > >> I ask this because after > >> the job fails to shutdown there are no more jobs being submitted in the > >> queue and my script hangs since it is waiting for the next stage in my > >> workflow to complete. Is there a coaster parameter that lets coasters > >> know to not shutdown the workers even if they become idle for a bit or > >> is this a legitimate error in coasters? > >> > > You are assuming that the shutdown failure has something to do with jobs > > not being run. I do not think that's necessarily right. > > > > > > > > > From jon.monette at gmail.com Tue Aug 17 13:37:13 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 17 Aug 2010 13:37:13 -0500 Subject: [Swift-devel] Re: Coaster error In-Reply-To: <1282070000.3517.0.camel@blabla2.none> References: <4C6985AF.5070400@gmail.com> <4C6AC20D.30306@gmail.com> <1282067027.2881.3.camel@blabla2.none> <4C6AD30F.505@gmail.com> <1282070000.3517.0.camel@blabla2.none> Message-ID: <4C6AD6D9.9040805@gmail.com> Ok then. Then do you have any ideas on why no more jobs are submitted through coasters after this error? Here is my sites entry for pads 3600 192.5.86.6 1 10 1 1 fast 1 10000 /gpfs/pads/swift/jonmon/Swift/work/pads I have slots set to 10. Does this mean this is the maximum number of jobs that will be submitted and this number should be increased? On 8/17/10 1:33 PM, Mihael Hategan wrote: > The failure to shut down a channel is also ignorable. > Essentially the worker shuts down before it gets to acknowledge the > shutdown command. I guess this could be fixed, but for now ignore it. > > On Tue, 2010-08-17 at 13:21 -0500, Jonathan Monette wrote: > >> Or so the qdel error I am seeing is ignorable? 
And I am assuming that >> the shutdown failure has something to do with the jobs being run because >> when I run a smaller data set (10 images instead of 1300 images) the >> shutdown error happens at the end of the workflow and I also get the error >> >> Failed to shut down channel >> org.globus.cog.karajan.workflow.service.channels.ChannelException: >> Invalid channel: 1338035062: {} >> at >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:442) >> at >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:422) >> at >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411) >> at >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284) >> at >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83) >> at >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257) >> >> >> On 8/17/10 12:43 PM, Mihael Hategan wrote: >> >>> On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote: >>> >>> >>>> Ok. Have ran more tests on this problem. I am running on both >>>> localhost and pads. In the first stage of my workflow I run on >>>> localhost to collect some metadata. I then use this metadata to >>>> reproject the images submitting these jobs to pads. All the images are >>>> reprojected and completes without error. After this the coasters is >>>> waiting for more jobs to submit to the workers while localhost is >>>> collecting more metadata. I believe coasters starts to shutdown some of >>>> the workers because they are idle and wants to free the resources on the >>>> machine(am I correct so far?) >>>> >>>> >>> You are. >>> >>> >>> >>>> During the shutdown some workers are >>>> shutdown successfully but there is always 1 or 2 that fail to shutdown >>>> and I get the qdel error 153 I mentioned yesterday. If coasters fails >>>> to shutdown a job does the service terminate? >>>> >>>> >>> No. The qdel part is not critical and is used when workers don't shut >>> down cleanly or on time. >>> >>> >>> >>>> I ask this because after >>>> the job fails to shutdown there are no more jobs being submitted in the >>>> queue and my script hangs since it is waiting for the next stage in my >>>> workflow to complete. Is there a coaster parameter that lets coasters >>>> know to not shutdown the workers even if they become idle for a bit or >>>> is this a legitimate error in coasters? >>>> >>>> >>> You are assuming that the shutdown failure has something to do with jobs >>> not being run. I do not think that's necessarily right. >>> >>> >>> >>> >>> >> > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. 
- Albert Einstein From hategan at mcs.anl.gov Tue Aug 17 14:37:01 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 17 Aug 2010 14:37:01 -0500 Subject: [Swift-devel] Re: Coaster error In-Reply-To: <4C6AD6D9.9040805@gmail.com> References: <4C6985AF.5070400@gmail.com> <4C6AC20D.30306@gmail.com> <1282067027.2881.3.camel@blabla2.none> <4C6AD30F.505@gmail.com> <1282070000.3517.0.camel@blabla2.none> <4C6AD6D9.9040805@gmail.com> Message-ID: <1282073821.3937.0.camel@blabla2.none> On Tue, 2010-08-17 at 13:37 -0500, Jonathan Monette wrote: > Ok then. Then do you have any ideas on why no more jobs are submitted > through coasters after this error? Nope. Do you have the coaster log? > Here is my sites entry for pads > > > url="login.pads.ci.uchicago.edu" /> > > 3600 > 192.5.86.6 > 1 > 10 > 1 > 1 > fast > 1 > 10000 > /gpfs/pads/swift/jonmon/Swift/work/pads > > > I have slots set to 10. Does this mean this is the maximum number of > jobs that will be submitted and this number should be increased? > > On 8/17/10 1:33 PM, Mihael Hategan wrote: > > The failure to shut down a channel is also ignorable. > > Essentially the worker shuts down before it gets to acknowledge the > > shutdown command. I guess this could be fixed, but for now ignore it. > > > > On Tue, 2010-08-17 at 13:21 -0500, Jonathan Monette wrote: > > > >> Or so the qdel error I am seeing is ignorable? And I am assuming that > >> the shutdown failure has something to do with the jobs being run because > >> when I run a smaller data set (10 images instead of 1300 images) the > >> shutdown error happens at the end of the workflow and I also get the error > >> > >> Failed to shut down channel > >> org.globus.cog.karajan.workflow.service.channels.ChannelException: > >> Invalid channel: 1338035062: {} > >> at > >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:442) > >> at > >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:422) > >> at > >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411) > >> at > >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284) > >> at > >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83) > >> at > >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257) > >> > >> > >> On 8/17/10 12:43 PM, Mihael Hategan wrote: > >> > >>> On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote: > >>> > >>> > >>>> Ok. Have ran more tests on this problem. I am running on both > >>>> localhost and pads. In the first stage of my workflow I run on > >>>> localhost to collect some metadata. I then use this metadata to > >>>> reproject the images submitting these jobs to pads. All the images are > >>>> reprojected and completes without error. After this the coasters is > >>>> waiting for more jobs to submit to the workers while localhost is > >>>> collecting more metadata. I believe coasters starts to shutdown some of > >>>> the workers because they are idle and wants to free the resources on the > >>>> machine(am I correct so far?) > >>>> > >>>> > >>> You are. > >>> > >>> > >>> > >>>> During the shutdown some workers are > >>>> shutdown successfully but there is always 1 or 2 that fail to shutdown > >>>> and I get the qdel error 153 I mentioned yesterday. 
If coasters fails > >>>> to shutdown a job does the service terminate? > >>>> > >>>> > >>> No. The qdel part is not critical and is used when workers don't shut > >>> down cleanly or on time. > >>> > >>> > >>> > >>>> I ask this because after > >>>> the job fails to shutdown there are no more jobs being submitted in the > >>>> queue and my script hangs since it is waiting for the next stage in my > >>>> workflow to complete. Is there a coaster parameter that lets coasters > >>>> know to not shutdown the workers even if they become idle for a bit or > >>>> is this a legitimate error in coasters? > >>>> > >>>> > >>> You are assuming that the shutdown failure has something to do with jobs > >>> not being run. I do not think that's necessarily right. > >>> > >>> > >>> > >>> > >>> > >> > > > > > From jon.monette at gmail.com Tue Aug 17 16:18:35 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 17 Aug 2010 16:18:35 -0500 Subject: [Swift-devel] Re: Coaster error In-Reply-To: <1282073821.3937.0.camel@blabla2.none> References: <4C6985AF.5070400@gmail.com> <4C6AC20D.30306@gmail.com> <1282067027.2881.3.camel@blabla2.none> <4C6AD30F.505@gmail.com> <1282070000.3517.0.camel@blabla2.none> <4C6AD6D9.9040805@gmail.com> <1282073821.3937.0.camel@blabla2.none> Message-ID: <4C6AFCAB.7050202@gmail.com> The one in .globus/coasters/ doesn't get anything new written to it when I do my runs. Could that be because I have my jobmanager local:pbs? Does that put the coaster stuff in the swift log? On 8/17/10 2:37 PM, Mihael Hategan wrote: > On Tue, 2010-08-17 at 13:37 -0500, Jonathan Monette wrote: > >> Ok then. Then do you have any ideas on why no more jobs are submitted >> through coasters after this error? >> > Nope. Do you have the coaster log? > > >> Here is my sites entry for pads >> >> >> > url="login.pads.ci.uchicago.edu" /> >> >> 3600 >> 192.5.86.6 >> 1 >> 10 >> 1 >> 1 >> fast >> 1 >> 10000 >> /gpfs/pads/swift/jonmon/Swift/work/pads >> >> >> I have slots set to 10. Does this mean this is the maximum number of >> jobs that will be submitted and this number should be increased? >> >> On 8/17/10 1:33 PM, Mihael Hategan wrote: >> >>> The failure to shut down a channel is also ignorable. >>> Essentially the worker shuts down before it gets to acknowledge the >>> shutdown command. I guess this could be fixed, but for now ignore it. >>> >>> On Tue, 2010-08-17 at 13:21 -0500, Jonathan Monette wrote: >>> >>> >>>> Or so the qdel error I am seeing is ignorable? 
And I am assuming that >>>> the shutdown failure has something to do with the jobs being run because >>>> when I run a smaller data set (10 images instead of 1300 images) the >>>> shutdown error happens at the end of the workflow and I also get the error >>>> >>>> Failed to shut down channel >>>> org.globus.cog.karajan.workflow.service.channels.ChannelException: >>>> Invalid channel: 1338035062: {} >>>> at >>>> org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:442) >>>> at >>>> org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:422) >>>> at >>>> org.globus.cog.karajan.workflow.service.channels.ChannelManager.shutdownChannel(ChannelManager.java:411) >>>> at >>>> org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:284) >>>> at >>>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.handleChannelException(AbstractStreamKarajanChannel.java:83) >>>> at >>>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:257) >>>> >>>> >>>> On 8/17/10 12:43 PM, Mihael Hategan wrote: >>>> >>>> >>>>> On Tue, 2010-08-17 at 12:08 -0500, Jonathan Monette wrote: >>>>> >>>>> >>>>> >>>>>> Ok. Have ran more tests on this problem. I am running on both >>>>>> localhost and pads. In the first stage of my workflow I run on >>>>>> localhost to collect some metadata. I then use this metadata to >>>>>> reproject the images submitting these jobs to pads. All the images are >>>>>> reprojected and completes without error. After this the coasters is >>>>>> waiting for more jobs to submit to the workers while localhost is >>>>>> collecting more metadata. I believe coasters starts to shutdown some of >>>>>> the workers because they are idle and wants to free the resources on the >>>>>> machine(am I correct so far?) >>>>>> >>>>>> >>>>>> >>>>> You are. >>>>> >>>>> >>>>> >>>>> >>>>>> During the shutdown some workers are >>>>>> shutdown successfully but there is always 1 or 2 that fail to shutdown >>>>>> and I get the qdel error 153 I mentioned yesterday. If coasters fails >>>>>> to shutdown a job does the service terminate? >>>>>> >>>>>> >>>>>> >>>>> No. The qdel part is not critical and is used when workers don't shut >>>>> down cleanly or on time. >>>>> >>>>> >>>>> >>>>> >>>>>> I ask this because after >>>>>> the job fails to shutdown there are no more jobs being submitted in the >>>>>> queue and my script hangs since it is waiting for the next stage in my >>>>>> workflow to complete. Is there a coaster parameter that lets coasters >>>>>> know to not shutdown the workers even if they become idle for a bit or >>>>>> is this a legitimate error in coasters? >>>>>> >>>>>> >>>>>> >>>>> You are assuming that the shutdown failure has something to do with jobs >>>>> not being run. I do not think that's necessarily right. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. 
- Albert Einstein From wilde at mcs.anl.gov Tue Aug 17 18:54:07 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 17 Aug 2010 17:54:07 -0600 (GMT-06:00) Subject: [Swift-devel] David Kelly visiting next week Message-ID: <1542629310.1007931282089247475.JavaMail.root@zimbra.anl.gov> Hi All, David Kelly will be visiting next week, arriving Sunday afternoon Aug 22 and staying through about noon Wed Aug 25. He'll hopefully be able to meet several of you and with some specific users and to work on swiftconfig/swiftrun and the Swift tutorial/starter kit. We'll plan this out further as the week proceeds, and possibly include a visit to Argonne. - Mike From dk0966 at cs.ship.edu Wed Aug 18 12:34:08 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Wed, 18 Aug 2010 13:34:08 -0400 Subject: [Swift-devel] Error messages and execution retries Message-ID: Hello, When execution.retries is greater than 0 and an error occurs, the error message is not displayed or recorded in the log file. In my case, I had an application pointing to the wrong path. I couldn't get a clear idea of what happening until I set execution.retries to 0. Should swift be changed to report these errors as soon as they happen? David -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Wed Aug 18 15:19:08 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 18 Aug 2010 14:19:08 -0600 (GMT-06:00) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: Message-ID: <1827116552.1043911282162748662.JavaMail.root@zimbra.anl.gov> Sarah, this seems like an excellent first improvement to tackle in the project of improving Swift messages and logs. The problem is that Swift buffers some class of runtime error message and emits them at the end of the run *if and only if the run completes*. In the very common case where the user kills the swift command with ^C because it seems to be getting nowhere, the use never sees these critical messages unless they can fush them out of some stdout/err files deep in e.g. the work directory. I think these messages should be clearly presented in the log or some other place as soon as they occur. - Mike ----- "David Kelly" wrote: > Hello, > > When execution.retries is greater than 0 and an error occurs, the > error message is not displayed or recorded in the log file. In my > case, I had an application pointing to the wrong path. I couldn't get > a clear idea of what happening until I set execution.retries to 0. > > Should swift be changed to report these errors as soon as they happen? > > David > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From benc at hawaga.org.uk Wed Aug 18 15:19:27 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Aug 2010 20:19:27 +0000 (GMT) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: References: Message-ID: even when all retries have happened? 
-- http://www.hawaga.org.uk/ben/ From wilde at mcs.anl.gov Wed Aug 18 15:23:08 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 18 Aug 2010 14:23:08 -0600 (GMT-06:00) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: Message-ID: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> I think the case is that these errors are only presented when the swift command reaches "normal" compete termination, which in early debugging of a script is a rare event. By "normal" I mean all runnable tasks are done, at which time more complete messages for prior errors are emitted. - Mike ----- "Ben Clifford" wrote: > even when all retries have happened? > > -- > http://www.hawaga.org.uk/ben/ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Wed Aug 18 15:34:00 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Aug 2010 15:34:00 -0500 Subject: [Swift-devel] Error messages and execution retries In-Reply-To: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> References: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> Message-ID: <1282163640.11453.0.camel@blabla2.none> That's only if lazy errors are enabled. Perhaps they should be disabled by default. On Wed, 2010-08-18 at 14:23 -0600, Michael Wilde wrote: > I think the case is that these errors are only presented when the swift command reaches "normal" compete termination, which in early debugging of a script is a rare event. By "normal" I mean all runnable tasks are done, at which time more complete messages for prior errors are emitted. > > - Mike > > > ----- "Ben Clifford" wrote: > > > even when all retries have happened? > > > > -- > > http://www.hawaga.org.uk/ben/ > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Wed Aug 18 15:39:47 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Aug 2010 20:39:47 +0000 (GMT) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: <1282163640.11453.0.camel@blabla2.none> References: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> <1282163640.11453.0.camel@blabla2.none> Message-ID: > That's only if lazy errors are enabled. Perhaps they should be disabled > by default. They are in trunk according to SVN - looks like I changed it 2y ago. b54a10c7 (benc at CI.UCHICAGO.EDU 2007-09-13 16:30:32 +0000 44) lazy.errors=fal But they interact with retries in a way that makes them still lazier than naively expected if you have lots of tasks - the third retry for the very first task run, if you have BIGNUM tasks, won't be until 2*BIGNUM tasks have been attempted... There was this balance between lazy errors being on or off by default, which was "production" versus "debugging". Maybe the same applies for retries too and a set of defaults aimed at users who wantt o debug rather than run in production should have retries disabled. 
-- From wilde at mcs.anl.gov Wed Aug 18 15:42:43 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 18 Aug 2010 14:42:43 -0600 (GMT-06:00) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: <1282163640.11453.0.camel@blabla2.none> Message-ID: <713927327.1053061282164163715.JavaMail.root@zimbra.anl.gov> We should probably test to make sure that this is the case. For example, if 10 jobs are launched in parallel, and one fails, then even with lazy.errors=false, the 9 running jobs will still finish, right? Its just that no new ones will start. So will the error (stderr?) from the failing job be sent to the log or the swift stdout/err right away, or will it still wait for the full swift termination, which still may get circumvented by a ^C ??? I'm rather suspicious that its the latter (undesirable) case, although I may be wrong. - Mike ----- "Mihael Hategan" wrote: > That's only if lazy errors are enabled. Perhaps they should be > disabled > by default. > > On Wed, 2010-08-18 at 14:23 -0600, Michael Wilde wrote: > > I think the case is that these errors are only presented when the > swift command reaches "normal" compete termination, which in early > debugging of a script is a rare event. By "normal" I mean all runnable > tasks are done, at which time more complete messages for prior errors > are emitted. > > > > - Mike > > > > > > ----- "Ben Clifford" wrote: > > > > > even when all retries have happened? > > > > > > -- > > > http://www.hawaga.org.uk/ben/ > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From benc at hawaga.org.uk Wed Aug 18 15:45:34 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Aug 2010 20:45:34 +0000 (GMT) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: <713927327.1053061282164163715.JavaMail.root@zimbra.anl.gov> References: <713927327.1053061282164163715.JavaMail.root@zimbra.anl.gov> Message-ID: if people are hitting ctrl-C a lot, it might be interesting to put something in to dump final status when that is pressed. Some unix programs do that. (although sometimes its rather annoying - ry pressing ctrl-C in GNU bc, for example) -- http://www.hawaga.org.uk/ben/ From wilde at mcs.anl.gov Wed Aug 18 15:46:54 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 18 Aug 2010 14:46:54 -0600 (GMT-06:00) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: <713927327.1053061282164163715.JavaMail.root@zimbra.anl.gov> Message-ID: <212161918.1053421282164414268.JavaMail.root@zimbra.anl.gov> I think your note just before this, Ben, confirms my suspicion. The improvement needed, I believe, is the more immediate and unbuffered posting of useful error info in a way that makes it less likely to be lost in the event swift never reaches its normal exit processing. - Mike ----- "Michael Wilde" wrote: > We should probably test to make sure that this is the case. For > example, if 10 jobs are launched in parallel, and one fails, then even > with lazy.errors=false, the 9 running jobs will still finish, right? > Its just that no new ones will start. So will the error (stderr?) 
from > the failing job be sent to the log or the swift stdout/err right away, > or will it still wait for the full swift termination, which still may > get circumvented by a ^C ??? > > I'm rather suspicious that its the latter (undesirable) case, although > I may be wrong. > > - Mike > > > ----- "Mihael Hategan" wrote: > > > That's only if lazy errors are enabled. Perhaps they should be > > disabled > > by default. > > > > On Wed, 2010-08-18 at 14:23 -0600, Michael Wilde wrote: > > > I think the case is that these errors are only presented when the > > swift command reaches "normal" compete termination, which in early > > debugging of a script is a rare event. By "normal" I mean all > runnable > > tasks are done, at which time more complete messages for prior > errors > > are emitted. > > > > > > - Mike > > > > > > > > > ----- "Ben Clifford" wrote: > > > > > > > even when all retries have happened? > > > > > > > > -- > > > > http://www.hawaga.org.uk/ben/ > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Wed Aug 18 15:47:48 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Aug 2010 15:47:48 -0500 Subject: [Swift-devel] Error messages and execution retries In-Reply-To: References: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> <1282163640.11453.0.camel@blabla2.none> Message-ID: <1282164468.11796.5.camel@blabla2.none> On Wed, 2010-08-18 at 20:39 +0000, Ben Clifford wrote: > > That's only if lazy errors are enabled. Perhaps they should be disabled > > by default. > > They are in trunk according to SVN - looks like I changed it 2y ago. > > b54a10c7 (benc at CI.UCHICAGO.EDU 2007-09-13 16:30:32 +0000 44) lazy.errors=fal > > > But they interact with retries in a way that makes them still lazier than > naively expected if you have lots of tasks - the third retry for the very > first task run, if you have BIGNUM tasks, won't be until 2*BIGNUM tasks > have been attempted... > > There was this balance between lazy errors being on or off by default, > which was "production" versus "debugging". Maybe the same applies for > retries too and a set of defaults aimed at users who wantt o debug rather > than run in production should have retries disabled. > I'm not sure. Retries are meant to deal with transient errors, where transient is pretty much defined as "eventually stops happening if you retry enough times". The determination of whether they are transient or not (to a certain degree of confidence) requires that the operations are retried. A skilled person could perhaps, by looking at the error, be able to make a quicker determination. But then the same skilled person would probably be able to set retries to 0 if he/she wanted to debug. A normal user (who doesn't care about the details) may be disturbed by the printing of an error message that could be solved by retries. 
So I don't think that's necessarily the right choice unless it is made very clear, in the error message, that the task will be retried. From wilde at mcs.anl.gov Wed Aug 18 15:49:02 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 18 Aug 2010 14:49:02 -0600 (GMT-06:00) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: Message-ID: <694045685.1053651282164542300.JavaMail.root@zimbra.anl.gov> I agree. Also to be able to distinguish this in usage stats. Maybe a "gentle" signal catcher that doesn't try heroically to protect its cleanup processing from a further ^C. - Mike ----- "Ben Clifford" wrote: > if people are hitting ctrl-C a lot, it might be interesting to put > something in to dump final status when that is pressed. Some unix > programs > do that. (although sometimes it's rather annoying - try pressing ctrl-C > in > GNU bc, for example) > > -- > http://www.hawaga.org.uk/ben/ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Wed Aug 18 15:50:37 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Aug 2010 15:50:37 -0500 Subject: [Swift-devel] Error messages and execution retries In-Reply-To: <713927327.1053061282164163715.JavaMail.root@zimbra.anl.gov> References: <713927327.1053061282164163715.JavaMail.root@zimbra.anl.gov> Message-ID: <1282164637.11796.8.camel@blabla2.none> On Wed, 2010-08-18 at 14:42 -0600, Michael Wilde wrote: > We should probably test to make sure that this is the case. For > example, if 10 jobs are launched in parallel, and one fails, then even > with lazy.errors=false, the 9 running jobs will still finish, right? Not exactly. They will only finish if retrying the failing job takes more time than finishing the remaining 9. > It's just that no new ones will start. So will the error (stderr?) > from the failing job be sent to the log or the swift stdout/err right > away, or will it still wait for the full swift termination, If lazy errors are disabled, as soon as all the retries for a job fail, a message will be printed and the execution of the run aborted. > which still may get circumvented by a ^C ??? Btw, we could intercept the ^C and still print the errors. From benc at hawaga.org.uk Wed Aug 18 15:54:49 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Aug 2010 20:54:49 +0000 (GMT) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: <1282164468.11796.5.camel@blabla2.none> References: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> <1282163640.11453.0.camel@blabla2.none> <1282164468.11796.5.camel@blabla2.none> Message-ID: > Retries are meant to deal with transient errors, where transient is > pretty much defined as "eventually stops happening if you retry enough > times". The determination of whether they are transient or not (to a > certain degree of confidence) requires that the operations are retried. Right. Sometimes there are transient errors. Sometimes there are not. The theory of distributed computing likes to talk about transient errors and how they can be dealt with this way. But it's not clear to me in practice how much that happens - my gut feeling from when I ran stuff was that most errors were non-transient and retries happened rarely. But I have no numerical evidence. That numerical evidence (either way) is probably the decider for retries. > A skilled person could perhaps, by looking at the error, be able to make > a quicker determination. 
But then the same skilled person would probably > be able to set retries to 0 if he/she wanted to debug. A skilled person equally well could turn retries on. This thread is starting to sound pretty much like a complaint people have about condor where rather than failing a job, it will keep trying over and over. A 'skilled person' knows how and where to look to see what's going on. A non-skilled person sees their job go into the queue and never complete. > A normal user (who doesn't care about the details) may be disturbed by > the printing of an error message that could be solved by retries. So I > don't think that's necessarily the right choice unless it is made very > clear, in the error message, that the task will be retried. I agree with what I think you are saying, which is that error messages shouldn't be printed if they are not terminal. -- From hategan at mcs.anl.gov Wed Aug 18 16:08:31 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Aug 2010 16:08:31 -0500 Subject: [Swift-devel] Error messages and execution retries In-Reply-To: References: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> <1282163640.11453.0.camel@blabla2.none> <1282164468.11796.5.camel@blabla2.none> Message-ID: <1282165711.12054.8.camel@blabla2.none> On Wed, 2010-08-18 at 20:54 +0000, Ben Clifford wrote: > > Retries are meant to deal with transient errors, where transient is > > pretty much defined as "eventually stops happening if you retry enough > > times". The determination of whether they are transient or not (to a > > certain degree of confidence) requires that the operations are retried. > > Right. > > Sometimes there are transient errors. Sometimes there are not. > > The theory of distributed computing likes to talk about transient errors > and how they can be dealt with this way. But it's not clear to me in > practice how much that happens - my gut feeling from when I ran stuff was > that most errors were non-transient and retries happened rarely. But I > have no numerical evidence. That numerical evidence (either way) is > probably the decider for retries. Right. It used to be the case somewhat with GT2/GT4. There is, of course, also the issue that in the multi-site case, retries also imply re-scheduling. So this may iron out temporarily bad sites. Which I think is an essential issue (and commonly used in automated swift installations). > > > A skilled person could perhaps, by looking at the error, be able to make > > a quicker determination. But then the same skilled person would probably > > be able to set retries to 0 if he/she wanted to debug. > > A skilled person equally well could turn retries on. > > This thread is starting to sound pretty much like a complaint people have > about condor where rather than failing a job, it will keep trying over and > over. A 'skilled person' knows how and where to look to see what's going > on. A non-skilled person sees their job go into the queue and never > complete. The distinction here being "never" as opposed to after some finite (and reasonably short compared to the expected workflow run time) amount of time. From the perspective of a user, they should never even have to see that there were retries. So I think this argument is a bit silly. All we are saying is that there could be a way to find out about errors a little faster. And while we could automate that, it comes at a cost. We already have mechanisms to find out about errors as soon as they happen, and it's called lazy.errors=false and retries=0. 
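For concreteness, the two usage modes discussed in this thread correspond to a pair of swift.properties settings. The snippet below is an illustrative sketch only: lazy.errors is the property shown in the svn annotation above, execution.retries is assumed to be the name of the retry-count property in this era of Swift, and the values are examples rather than recommended defaults.

# fail-fast "debugging" profile: report the first terminal error immediately
lazy.errors=false
execution.retries=0

# "production" profile: let independent work continue and retry (and, on
# multiple sites, re-schedule) apparently transient failures
lazy.errors=true
execution.retries=3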
From benc at hawaga.org.uk Wed Aug 18 16:42:40 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Aug 2010 21:42:40 +0000 (GMT) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: <1282165711.12054.8.camel@blabla2.none> References: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> <1282163640.11453.0.camel@blabla2.none> <1282164468.11796.5.camel@blabla2.none> <1282165711.12054.8.camel@blabla2.none> Message-ID: > faster. And while we could automate that, it comes at a cost. We already > have mechanisms to find out about errors as soon as they happen, and > it's called lazy.errors=false and retries=0. Right. But I think there are 'ways' or 'modes' or 'fashions' of using Swift that differ. (or 'use-cases' if you like that word). One is that you want lazy.errors=false and retries=0. One is that you want non-lazy errors and lots of retries. Presenting that choice to the user as two separate parameters is not the friendliest way to do it. From hategan at mcs.anl.gov Wed Aug 18 16:54:05 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Aug 2010 16:54:05 -0500 Subject: [Swift-devel] Error messages and execution retries In-Reply-To: References: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> <1282163640.11453.0.camel@blabla2.none> <1282164468.11796.5.camel@blabla2.none> <1282165711.12054.8.camel@blabla2.none> Message-ID: <1282168445.12229.3.camel@blabla2.none> On Wed, 2010-08-18 at 21:42 +0000, Ben Clifford wrote: > > faster. And while we could automate that, it comes at a cost. We already > > have mechanisms to find out about errors as soon as they happen, and > > it's called lazy.errors=false and retries=0. > > Right. > > But I think there are 'ways' or 'modes' or 'fashions' of using Swift that > differ. (or 'use-cases' if you like that word). One is that you want > lazy.errors=false and retries=0. One is that you want non-lazy errors and > lots of retries. Presenting that choice to the user as two separate > parameters is not the friendliest way to do it. Maybe. But you should note that the discussion started not in the direction of friendlier presentation but in that of more verbosity (which to me smells of less friendly). Point being that friendly may mean simple to express or indicative of what happens inside. And this conflict, with high level stuff such as swift (or even Java), I don't think will go away. From benc at hawaga.org.uk Wed Aug 18 16:57:54 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Aug 2010 21:57:54 +0000 (GMT) Subject: [Swift-devel] Error messages and execution retries In-Reply-To: <1282168445.12229.3.camel@blabla2.none> References: <1954873812.1044231282162988528.JavaMail.root@zimbra.anl.gov> <1282163640.11453.0.camel@blabla2.none> <1282164468.11796.5.camel@blabla2.none> <1282165711.12054.8.camel@blabla2.none> <1282168445.12229.3.camel@blabla2.none> Message-ID: > But you should note that the discussion started not in the direction of > friendlier presentation but in that of more verbosity (which to me > smells of less friendly). well it started with a secret "I have problem X", and the solution to it is more verbosity. I'm inferring X and then giving a different solution. 
-- From aespinosa at cs.uchicago.edu Thu Aug 19 15:45:10 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 19 Aug 2010 15:45:10 -0500 Subject: [Swift-devel] gram2 coaster workers terminating prematurely Message-ID: Hi, It looks like the service cannot start the workers successfully when I run on OSG resources. I get these messages from the coasters.log: 2010-08-19 14:53:33,274-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1282247428606-1282247568724-1282247568725) setting status to Failed Task failed: Error submitting block task 2010-08-19 14:53:34,532-0500 WARN Block Failed to shut down block 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1282247428746-1282247568932-1282247568933) setting status to Failed Task failed: Error submitting block task 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1282247428749-1282247568935-1282247568936) setting status to Failed Task failed: Error submitting block task 2010-08-19 14:53:41,148-0500 WARN Block Failed to shut down block 2010-08-19 14:53:41,953-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1282247428534-1282247568613-1282247568614) setting status to Failed Task failed: Error submitting block task 2010-08-19 14:53:43,352-0500 WARN Block Failed to shut down block 2010-08-19 14:53:48,904-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1282247428602-1282247568718-1282247568719) setting status to Failed Task failed: Error submitting block task 2010-08-19 14:53:49,969-0500 WARN Block Failed to shut down block pool entry: 129.93.227.78 86400 1290 0.8 10 20 1500.0 51.54 /panfs/panasas/CMS/data/engage-scec/swift_scratch I traced a submission for a block and here are the related log entries: corresponding entry in coasterlog and additional context lines: $ grep 000000 -C 2 coasters.log 2010-08-19 14:52:50,200-0500 INFO JobQueue Adding task Task(type=JOB_SUBMISSION, identity=urn:1282247429268-1282247569718-1282247569719) to coaster queue 2010-08-19 14:52:50,229-0500 INFO BlockTaskSubmitter Queuing block Block 0819-520247-000000 (322x5100.000s) for submission 2010-08-19 14:52:50,232-0500 INFO BlockQueueProcessor Added 350 jobs to new blocks -- 2010-08-19 14:52:51,004-0500 INFO JobQueue Adding task Task(type=JOB_SUBMISSION, identity=urn:1282247430113-1282247570981-1282247570982) to coaster queue 2010-08-19 14:52:51,005-0500 INFO BlockTaskSubmitter Submitting block Block 0819-520247-000000 (322x5100.000s) 2010-08-19 14:52:51,026-0500 INFO JobQueue Adding task Task(type=JOB_SUBMISSION, identity=urn:1282247430114-1282247570984-1282247570985) to coaster queue -- 2010-08-19 14:53:00,436-0500 INFO BlockQueueProcessor Committed 415 new jobs 2010-08-19 14:53:00,437-0500 INFO Block Shutting down block Block 0819-520247-000000 (322x5100.000s) 2010-08-19 14:53:00,437-0500 INFO Block Block Block 0819-520247-000000 (322x5100.000s) not running. Cancelling job. 2010-08-19 14:53:00,438-0500 WARN JobSubmissionTaskHandler Job cleaned before -- ... 
9 more 2010-08-19 14:53:00,438-0500 INFO BlockQueueProcessor Removing block Block 0819-520247-000000 (322x5100.000s) 2010-08-19 14:53:00,438-0500 INFO BlockQueueProcessor Cleaned 1 done blocks gram log: /19 14:52:55 JM: Security context imported 8/19 14:52:55 JM: Adding new callback contact (url=https://129.93.227.78:57277/1282247571979, mask=65535) 8/19 14:52:55 JM: Added successfully 8/19 14:52:55 Pre-parsed RSL string: &( rsl_substitution = (GLOBUSRUN_GASS_URL "https://129.93.227.78:41794") )( directory = "/" )( arguments = "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" "http://129.93.227.78:38194" "0819-520247-000000" "/grid_home/engage/.globus/coasters" )( hostcount = "322" )( executable = "/usr/bin/perl" )( maxwalltime = "85" )( stderr = $(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-1282247568900" )( stdout = $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1282247568900" )( name = "Block-0819-520247-000000" )( jobtype = "multiple" )( count = "322" ) 8/19 14:52:55 <<<<>>>>Job Request RSL 8/19 14:52:55 <<<<>>>>Job Request RSL (canonical) 8/19 14:52:55 JM: Evaluating RSL Value8/19 14:52:55 JM: Evaluated RSL Value to GLOBUSRUN_GASS_URL8/19 14:52:55 JM: Evaluating RSL Value8/19 14:52:55 JM: Evaluated RSL Value to https://129.93.227.78:417948/19 14:52:55 Appending extra env.var LD_LIBRARY_PATH=/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib: 8/19 14:52:55 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR 8/19 14:52:55 <<<<>>>>Job RSL 8/19 14:52:55 <<<<>>>>Job RSL (post-eval) RSL attribute 'name' is not in the validation file! 8/19 14:52:55 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED 8/19 14:52:55 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT 8/19 14:52:55 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP 8/19 14:52:55 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP 8/19 14:52:55 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP 8/19 14:52:55 JMI: testing job manager scripts for type pbs exist and permissions are ok. 8/19 14:52:55 JMI: completed script validation: job manager type is pbs. 8/19 14:52:55 JMI: cmd = cache_cleanup Thu Aug 19 14:52:57 2010 JM_SCRIPT: New Perl JobManager created. 
Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir: /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir: /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(enter) Thu Aug 19 14:52:57 2010 JM_SCRIPT: Cleaning files in job dir /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 Thu Aug 19 14:52:57 2010 JM_SCRIPT: Removed 1 files from /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(exit) 8/19 14:52:57 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP 8/19 14:52:57 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE 8/19 14:52:57 JM: before sending to client: rc=0 (Success) 8/19 14:52:57 Job Manager State Machine (exiting): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove() 8/19 14:52:57 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove() 8/19 14:52:57 JM: exiting globus_gram_job_manager >From the gram log above, it looks like that the workers are finishing nicely from the cancel signal made by the coaster service. -Allan -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Thu Aug 19 15:55:57 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Aug 2010 15:55:57 -0500 Subject: [Swift-devel] gram2 coaster workers terminating prematurely In-Reply-To: References: Message-ID: <1282251357.17966.3.camel@blabla2.none> Why was I under the impression that gram supports a "name" argument to name your job? I'll remove this from the gt2 provider. On Thu, 2010-08-19 at 15:45 -0500, Allan Espinosa wrote: > Hi, > > It looks like the service cannot start the workers successfully when I > run on OSG resources. 
I get these messages from the coasters.log: > > 2010-08-19 14:53:33,274-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:1282247428606-1282247568724-1282247568725) setting status > to Failed Task failed: Error submitting block task > 2010-08-19 14:53:34,532-0500 WARN Block Failed to shut down block > 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:1282247428746-1282247568932-1282247568933) setting status > to Failed Task failed: Error submitting block task > 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:1282247428749-1282247568935-1282247568936) setting status > to Failed Task failed: Error submitting block task > 2010-08-19 14:53:41,148-0500 WARN Block Failed to shut down block > 2010-08-19 14:53:41,953-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:1282247428534-1282247568613-1282247568614) setting status > to Failed Task failed: Error submitting block task > 2010-08-19 14:53:43,352-0500 WARN Block Failed to shut down block > 2010-08-19 14:53:48,904-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:1282247428602-1282247568718-1282247568719) setting status > to Failed Task failed: Error submitting block task > 2010-08-19 14:53:49,969-0500 WARN Block Failed to shut down block > > pool entry: > > jobmanager="gt2:gt2:pbs" /> > 129.93.227.78 > > 86400 > 1290 > 0.8 > 10 > 20 > > > 1500.0 > 51.54 > > > /panfs/panasas/CMS/data/engage-scec/swift_scratch > > > I traced a submission for a block and here are the related log entries: > > corresponding entry in coasterlog and additional context lines: > $ grep 000000 -C 2 coasters.log > 2010-08-19 14:52:50,200-0500 INFO JobQueue Adding task > Task(type=JOB_SUBMISSION, > identity=urn:1282247429268-1282247569718-1282247569719) to coaster > queue > 2010-08-19 14:52:50,229-0500 INFO BlockTaskSubmitter Queuing block > Block 0819-520247-000000 (322x5100.000s) for submission > 2010-08-19 14:52:50,232-0500 INFO BlockQueueProcessor Added 350 jobs > to new blocks > -- > 2010-08-19 14:52:51,004-0500 INFO JobQueue Adding task > Task(type=JOB_SUBMISSION, > identity=urn:1282247430113-1282247570981-1282247570982) to coaster > queue > 2010-08-19 14:52:51,005-0500 INFO BlockTaskSubmitter Submitting block > Block 0819-520247-000000 (322x5100.000s) > 2010-08-19 14:52:51,026-0500 INFO JobQueue Adding task > Task(type=JOB_SUBMISSION, > identity=urn:1282247430114-1282247570984-1282247570985) to coaster > queue > -- > 2010-08-19 14:53:00,436-0500 INFO BlockQueueProcessor Committed 415 new jobs > 2010-08-19 14:53:00,437-0500 INFO Block Shutting down block Block > 0819-520247-000000 (322x5100.000s) > 2010-08-19 14:53:00,437-0500 INFO Block Block Block > 0819-520247-000000 (322x5100.000s) not running. Cancelling job. > 2010-08-19 14:53:00,438-0500 WARN JobSubmissionTaskHandler Job cleaned before > -- > ... 
9 more > 2010-08-19 14:53:00,438-0500 INFO BlockQueueProcessor Removing block > Block 0819-520247-000000 (322x5100.000s) > 2010-08-19 14:53:00,438-0500 INFO BlockQueueProcessor Cleaned 1 done blocks > > > gram log: > > /19 14:52:55 JM: Security context imported > 8/19 14:52:55 JM: Adding new callback contact > (url=https://129.93.227.78:57277/1282247571979, mask=65535) > 8/19 14:52:55 JM: Added successfully > 8/19 14:52:55 Pre-parsed RSL string: &( rsl_substitution = > (GLOBUSRUN_GASS_URL "https://129.93.227.78:41794") )( directory = "/" > )( arguments = "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > "http://129.93.227.78:38194" "0819-520247-000000" > "/grid_home/engage/.globus/coasters" )( hostcount = "322" )( > executable = "/usr/bin/perl" )( maxwalltime = "85" )( stderr = > $(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-1282247568900" )( stdout > = $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1282247568900" )( name > = "Block-0819-520247-000000" )( jobtype = "multiple" )( count = "322" > ) > 8/19 14:52:55 > <<<< &("rsl_substitution" = ("GLOBUSRUN_GASS_URL" > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" = > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > "http://129.93.227.78:38194" "0819-520247-000000" > "/grid_home/engage/.globus/coasters" )("hostcount" = "322" > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" = > $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900" > )("stdout" = $("GLOBUSRUN_GASS_URL") # > "/dev/stdout-urn:cog-1282247568900" )("name" = > "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322" > ) > >>>>>Job Request RSL > 8/19 14:52:55 > <<<< &("rslsubstitution" = ("GLOBUSRUN_GASS_URL" > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" = > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > "http://129.93.227.78:38194" "0819-520247-000000" > "/grid_home/engage/.globus/coasters" )("hostcount" = "322" > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" = > $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900" > )("stdout" = $("GLOBUSRUN_GASS_URL") # > "/dev/stdout-urn:cog-1282247568900" )("name" = > "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322" > ) > >>>>>Job Request RSL (canonical) > 8/19 14:52:55 JM: Evaluating RSL Value8/19 14:52:55 JM: Evaluated RSL > Value to GLOBUSRUN_GASS_URL8/19 14:52:55 JM: Evaluating RSL Value8/19 > 14:52:55 JM: Evaluated RSL Value to https://129.93.227.78:417948/19 > 14:52:55 Appending extra env.var > LD_LIBRARY_PATH=/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib: > 8/19 14:52:55 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR > 8/19 14:52:55 > <<<< &("environment" = ("LD_LIBRARY_PATH" > "/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:" > ) ("HOME" "/grid_home/engage" ) ("LOGNAME" "engage" ) > )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" > "https://129.93.227.78:41794" ) )("directory" = "/" 
)("arguments" = > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > "http://129.93.227.78:38194" "0819-520247-000000" > "/grid_home/engage/.globus/coasters" )("hostcount" = "322" > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" = > $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900" > )("stdout" = $("GLOBUSRUN_GASS_URL") # > "/dev/stdout-urn:cog-1282247568900" )("name" = > "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322" > ) > >>>>>Job RSL > 8/19 14:52:55 > <<<< &("environment" = ("LD_LIBRARY_PATH" > "/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:" > ) ("HOME" "/grid_home/engage" ) ("LOGNAME" "engage" ) > )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" = > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > "http://129.93.227.78:38194" "0819-520247-000000" > "/grid_home/engage/.globus/coasters" )("hostcount" = "322" > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" = > "https://129.93.227.78:41794/dev/stderr-urn:cog-1282247568900" > )("stdout" = "https://129.93.227.78:41794/dev/stdout-urn:cog-1282247568900" > )("name" = "Block-0819-520247-000000" )("jobtype" = "multiple" > )("count" = "322" ) > >>>>>Job RSL (post-eval) > RSL attribute 'name' is not in the validation file! > 8/19 14:52:55 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED > 8/19 14:52:55 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT > 8/19 14:52:55 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP > 8/19 14:52:55 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP > 8/19 14:52:55 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP > 8/19 14:52:55 JMI: testing job manager scripts for type pbs exist and > permissions are ok. > 8/19 14:52:55 JMI: completed script validation: job manager type is pbs. > 8/19 14:52:55 JMI: cmd = cache_cleanup > Thu Aug 19 14:52:57 2010 JM_SCRIPT: New Perl JobManager created. 
> Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir: > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir: > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 > Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(enter) > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Cleaning files in job dir > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Removed 1 files from > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 > Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(exit) > 8/19 14:52:57 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP > 8/19 14:52:57 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE > 8/19 14:52:57 JM: before sending to client: rc=0 (Success) > 8/19 14:52:57 Job Manager State Machine (exiting): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE > 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove() > 8/19 14:52:57 Job Manager State Machine (entering): > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE > 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove() > 8/19 14:52:57 JM: exiting globus_gram_job_manager > > >From the gram log above, it looks like that the workers are finishing > nicely from the cancel signal made by the coaster service. > > -Allan From hategan at mcs.anl.gov Thu Aug 19 16:08:11 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Aug 2010 16:08:11 -0500 Subject: [Swift-devel] gram2 coaster workers terminating prematurely In-Reply-To: <1282251357.17966.3.camel@blabla2.none> References: <1282251357.17966.3.camel@blabla2.none> Message-ID: <1282252091.17966.4.camel@blabla2.none> On Thu, 2010-08-19 at 15:55 -0500, Mihael Hategan wrote: > Why was I under the impression that gram supports a "name" argument to > name your job? > I'll remove this from the gt2 provider. cog r2867. > > On Thu, 2010-08-19 at 15:45 -0500, Allan Espinosa wrote: > > Hi, > > > > It looks like the service cannot start the workers successfully when I > > run on OSG resources. 
I get these messages from the coasters.log: > > > > 2010-08-19 14:53:33,274-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > > identity=urn:1282247428606-1282247568724-1282247568725) setting status > > to Failed Task failed: Error submitting block task > > 2010-08-19 14:53:34,532-0500 WARN Block Failed to shut down block > > 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > > identity=urn:1282247428746-1282247568932-1282247568933) setting status > > to Failed Task failed: Error submitting block task > > 2010-08-19 14:53:39,054-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > > identity=urn:1282247428749-1282247568935-1282247568936) setting status > > to Failed Task failed: Error submitting block task > > 2010-08-19 14:53:41,148-0500 WARN Block Failed to shut down block > > 2010-08-19 14:53:41,953-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > > identity=urn:1282247428534-1282247568613-1282247568614) setting status > > to Failed Task failed: Error submitting block task > > 2010-08-19 14:53:43,352-0500 WARN Block Failed to shut down block > > 2010-08-19 14:53:48,904-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > > identity=urn:1282247428602-1282247568718-1282247568719) setting status > > to Failed Task failed: Error submitting block task > > 2010-08-19 14:53:49,969-0500 WARN Block Failed to shut down block > > > > pool entry: > > > > > jobmanager="gt2:gt2:pbs" /> > > 129.93.227.78 > > > > 86400 > > 1290 > > 0.8 > > 10 > > 20 > > > > > > 1500.0 > > 51.54 > > > > > > /panfs/panasas/CMS/data/engage-scec/swift_scratch > > > > > > I traced a submission for a block and here are the related log entries: > > > > corresponding entry in coasterlog and additional context lines: > > $ grep 000000 -C 2 coasters.log > > 2010-08-19 14:52:50,200-0500 INFO JobQueue Adding task > > Task(type=JOB_SUBMISSION, > > identity=urn:1282247429268-1282247569718-1282247569719) to coaster > > queue > > 2010-08-19 14:52:50,229-0500 INFO BlockTaskSubmitter Queuing block > > Block 0819-520247-000000 (322x5100.000s) for submission > > 2010-08-19 14:52:50,232-0500 INFO BlockQueueProcessor Added 350 jobs > > to new blocks > > -- > > 2010-08-19 14:52:51,004-0500 INFO JobQueue Adding task > > Task(type=JOB_SUBMISSION, > > identity=urn:1282247430113-1282247570981-1282247570982) to coaster > > queue > > 2010-08-19 14:52:51,005-0500 INFO BlockTaskSubmitter Submitting block > > Block 0819-520247-000000 (322x5100.000s) > > 2010-08-19 14:52:51,026-0500 INFO JobQueue Adding task > > Task(type=JOB_SUBMISSION, > > identity=urn:1282247430114-1282247570984-1282247570985) to coaster > > queue > > -- > > 2010-08-19 14:53:00,436-0500 INFO BlockQueueProcessor Committed 415 new jobs > > 2010-08-19 14:53:00,437-0500 INFO Block Shutting down block Block > > 0819-520247-000000 (322x5100.000s) > > 2010-08-19 14:53:00,437-0500 INFO Block Block Block > > 0819-520247-000000 (322x5100.000s) not running. Cancelling job. > > 2010-08-19 14:53:00,438-0500 WARN JobSubmissionTaskHandler Job cleaned before > > -- > > ... 
9 more > > 2010-08-19 14:53:00,438-0500 INFO BlockQueueProcessor Removing block > > Block 0819-520247-000000 (322x5100.000s) > > 2010-08-19 14:53:00,438-0500 INFO BlockQueueProcessor Cleaned 1 done blocks > > > > > > gram log: > > > > /19 14:52:55 JM: Security context imported > > 8/19 14:52:55 JM: Adding new callback contact > > (url=https://129.93.227.78:57277/1282247571979, mask=65535) > > 8/19 14:52:55 JM: Added successfully > > 8/19 14:52:55 Pre-parsed RSL string: &( rsl_substitution = > > (GLOBUSRUN_GASS_URL "https://129.93.227.78:41794") )( directory = "/" > > )( arguments = "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > > "http://129.93.227.78:38194" "0819-520247-000000" > > "/grid_home/engage/.globus/coasters" )( hostcount = "322" )( > > executable = "/usr/bin/perl" )( maxwalltime = "85" )( stderr = > > $(GLOBUSRUN_GASS_URL) # "/dev/stderr-urn:cog-1282247568900" )( stdout > > = $(GLOBUSRUN_GASS_URL) # "/dev/stdout-urn:cog-1282247568900" )( name > > = "Block-0819-520247-000000" )( jobtype = "multiple" )( count = "322" > > ) > > 8/19 14:52:55 > > <<<< > &("rsl_substitution" = ("GLOBUSRUN_GASS_URL" > > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" = > > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > > "http://129.93.227.78:38194" "0819-520247-000000" > > "/grid_home/engage/.globus/coasters" )("hostcount" = "322" > > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" = > > $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900" > > )("stdout" = $("GLOBUSRUN_GASS_URL") # > > "/dev/stdout-urn:cog-1282247568900" )("name" = > > "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322" > > ) > > >>>>>Job Request RSL > > 8/19 14:52:55 > > <<<< > &("rslsubstitution" = ("GLOBUSRUN_GASS_URL" > > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" = > > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > > "http://129.93.227.78:38194" "0819-520247-000000" > > "/grid_home/engage/.globus/coasters" )("hostcount" = "322" > > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" = > > $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900" > > )("stdout" = $("GLOBUSRUN_GASS_URL") # > > "/dev/stdout-urn:cog-1282247568900" )("name" = > > "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322" > > ) > > >>>>>Job Request RSL (canonical) > > 8/19 14:52:55 JM: Evaluating RSL Value8/19 14:52:55 JM: Evaluated RSL > > Value to GLOBUSRUN_GASS_URL8/19 14:52:55 JM: Evaluating RSL Value8/19 > > 14:52:55 JM: Evaluated RSL Value to https://129.93.227.78:417948/19 > > 14:52:55 Appending extra env.var > > LD_LIBRARY_PATH=/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib: > > 8/19 14:52:55 Job Manager State Machine (entering): > > GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR > > 8/19 14:52:55 > > <<<< > &("environment" = ("LD_LIBRARY_PATH" > > "/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:" > > ) ("HOME" "/grid_home/engage" ) 
("LOGNAME" "engage" ) > > )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" > > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" = > > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > > "http://129.93.227.78:38194" "0819-520247-000000" > > "/grid_home/engage/.globus/coasters" )("hostcount" = "322" > > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" = > > $("GLOBUSRUN_GASS_URL") # "/dev/stderr-urn:cog-1282247568900" > > )("stdout" = $("GLOBUSRUN_GASS_URL") # > > "/dev/stdout-urn:cog-1282247568900" )("name" = > > "Block-0819-520247-000000" )("jobtype" = "multiple" )("count" = "322" > > ) > > >>>>>Job RSL > > 8/19 14:52:55 > > <<<< > &("environment" = ("LD_LIBRARY_PATH" > > "/opt/osg/osg-1.2/subversion/lib:/opt/osg/osg-1.2/apache/lib:/opt/osg/osg-1.2/MonaLisa/Service/VDTFarm/pgsql/lib:/opt/osg/osg-1.2/glite/lib64:/opt/osg/osg-1.2/glite/lib:/opt/osg/osg-1.2/prima/lib:/opt/osg/osg-1.2/mysql5/lib/mysql:/opt/osg/osg-1.2/globus/lib:/opt/osg/osg-1.2/berkeley-db/lib:/opt/osg/osg-1.2/expat/lib:" > > ) ("HOME" "/grid_home/engage" ) ("LOGNAME" "engage" ) > > )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" > > "https://129.93.227.78:41794" ) )("directory" = "/" )("arguments" = > > "/grid_home/engage/.globus/coasters/cscript758567006055368697.pl" > > "http://129.93.227.78:38194" "0819-520247-000000" > > "/grid_home/engage/.globus/coasters" )("hostcount" = "322" > > )("executable" = "/usr/bin/perl" )("maxwalltime" = "85" )("stderr" = > > "https://129.93.227.78:41794/dev/stderr-urn:cog-1282247568900" > > )("stdout" = "https://129.93.227.78:41794/dev/stdout-urn:cog-1282247568900" > > )("name" = "Block-0819-520247-000000" )("jobtype" = "multiple" > > )("count" = "322" ) > > >>>>>Job RSL (post-eval) > > RSL attribute 'name' is not in the validation file! > > 8/19 14:52:55 Job Manager State Machine (entering): > > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED > > 8/19 14:52:55 Job Manager State Machine (entering): > > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT > > 8/19 14:52:55 Job Manager State Machine (entering): > > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP > > 8/19 14:52:55 Job Manager State Machine (entering): > > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP > > 8/19 14:52:55 Job Manager State Machine (entering): > > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP > > 8/19 14:52:55 JMI: testing job manager scripts for type pbs exist and > > permissions are ok. > > 8/19 14:52:55 JMI: completed script validation: job manager type is pbs. > > 8/19 14:52:55 JMI: cmd = cache_cleanup > > Thu Aug 19 14:52:57 2010 JM_SCRIPT: New Perl JobManager created. 
> > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir: > > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 > > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Using jm supplied job dir: > > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 > > Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(enter) > > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Cleaning files in job dir > > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 > > Thu Aug 19 14:52:57 2010 JM_SCRIPT: Removed 1 files from > > /grid_home/engage/.globus/job/ff-grid2.unl.edu/23986.1282247575 > > Thu Aug 19 14:52:57 2010 JM_SCRIPT: cache_cleanup(exit) > > 8/19 14:52:57 Job Manager State Machine (entering): > > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP > > 8/19 14:52:57 Job Manager State Machine (entering): > > GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE > > 8/19 14:52:57 JM: before sending to client: rc=0 (Success) > > 8/19 14:52:57 Job Manager State Machine (exiting): > > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE > > 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove() > > 8/19 14:52:57 Job Manager State Machine (entering): > > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE > > 8/19 14:52:57 JM: in globus_gram_job_manager_reporting_file_remove() > > 8/19 14:52:57 JM: exiting globus_gram_job_manager > > > > >From the gram log above, it looks like that the workers are finishing > > nicely from the cancel signal made by the coaster service. > > > > -Allan > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Thu Aug 19 16:32:01 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 19 Aug 2010 16:32:01 -0500 Subject: [Swift-devel] gram2 coaster workers terminating prematurely In-Reply-To: <1282252091.17966.4.camel@blabla2.none> References: <1282251357.17966.3.camel@blabla2.none> <1282252091.17966.4.camel@blabla2.none> Message-ID: <20100819213201.GA3609@origin> Cool, thanks. My jobs are now getting queued properly. Hopefully the workflow will finish later. -Allan On Thu, Aug 19, 2010 at 04:08:11PM -0500, Mihael Hategan wrote: > On Thu, 2010-08-19 at 15:55 -0500, Mihael Hategan wrote: > > Why was I under the impression that gram supports a "name" argument to > > name your job? > > I'll remove this from the gt2 provider. > > cog r2867. > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From dk0966 at cs.ship.edu Fri Aug 20 11:35:51 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 20 Aug 2010 12:35:51 -0400 Subject: [Swift-devel] GPL licensing question Message-ID: Hello, My swiftconfig and swiftrun utilities use a module called XML::Simple. It is not included in most standard perl distributions. I would like to include it so users do not have to manually install modules. Perl modules are licensed by the same terms as perl itself - which is the GNU General Public License version 1 (or later at your discretion), or the "artistic license" ( http://dev.perl.org/licenses/artistic.html). My understanding is that all Globus projects are required to be under an Apache license. I would like to add swiftconfig, swiftrun and associated libraries to the main swift distribution. Can I safely do this without running into any licensing conflicts? Thanks, David -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Fri Aug 20 15:01:00 2010 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 20 Aug 2010 20:01:00 +0000 (GMT) Subject: [Swift-devel] GPL licensing question In-Reply-To: References: Message-ID: > I would like to add swiftconfig, swiftrun and associated libraries to the > main swift distribution. Can I safely do this without running into any > licensing conflicts? If there's anyone who retains knowledge of former discussions about that within Globus, you might find them on gt-dev at globus.org - its certainly been debated a bunch of times. -- From foster at anl.gov Sun Aug 22 23:22:32 2010 From: foster at anl.gov (Ian Foster) Date: Sun, 22 Aug 2010 23:22:32 -0500 Subject: [Swift-devel] GPL licensing question In-Reply-To: References: Message-ID: <64CC40C7-9A29-446A-AEB1-1B17954AFBDF@anl.gov> David: We need to be very careful about including any GPL-licensed code in Swift, as that will make all Swift code GPL. Steve Tuecke can provide more specifics. Ian. On Aug 20, 2010, at 11:35 AM, David Kelly wrote: > Hello, > > My swiftconfig and swiftrun utilities use a module called XML::Simple. It is not included in most standard perl distributions. I would like to include it so users do not have to manually install modules. Perl modules are licensed by the same terms as perl itself - which is the GNU General Public License version 1 (or later at your discretion), or the "artistic license" (http://dev.perl.org/licenses/artistic.html). My understanding is that all Globus projects are required to be under an Apache license. > > I would like to add swiftconfig, swiftrun and associated libraries to the main swift distribution. Can I safely do this without running into any licensing conflicts? > > Thanks, > David > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Sun Aug 22 23:37:46 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 22 Aug 2010 22:37:46 -0600 (GMT-06:00) Subject: [Swift-devel] GPL licensing question In-Reply-To: <64CC40C7-9A29-446A-AEB1-1B17954AFBDF@anl.gov> Message-ID: <560419158.1176481282538265999.JavaMail.root@zimbra.anl.gov> Ive had an off-list thread going with Steve on this topic, and earlier today posed the question of whether Artistic License code can be included in Swift. - Mike ----- "Ian Foster" wrote: > David: > > > We need to be very careful about including any GPL-licensed code in > Swift, as that will make all Swift code GPL. Steve Tuecke can provide > more specifics. > > > Ian. > > > > On Aug 20, 2010, at 11:35 AM, David Kelly wrote: > > > Hello, > > My swiftconfig and swiftrun utilities use a module called XML::Simple. > It is not included in most standard perl distributions. I would like > to include it so users do not have to manually install modules. Perl > modules are licensed by the same terms as perl itself - which is the > GNU General Public License version 1 (or later at your discretion), or > the "artistic license" ( http://dev.perl.org/licenses/artistic.html ). > My understanding is that all Globus projects are required to be under > an Apache license. > > I would like to add swiftconfig, swiftrun and associated libraries to > the main swift distribution. Can I safely do this without running into > any licensing conflicts? 
> > Thanks, > David > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Sun Aug 22 23:48:49 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Sun, 22 Aug 2010 22:48:49 -0600 (GMT-06:00) Subject: [Swift-devel] struct mapping code that formerly worked now fails In-Reply-To: <1977416686.1176681282538727466.JavaMail.root@zimbra.anl.gov> Message-ID: <1237136558.1176751282538929261.JavaMail.root@zimbra.anl.gov> Glen Hocky has a script that used to work but now fails under recent trunk revisions. Ive reproduced the failure in a simplified script: maptest2.swift is: #--- type file; type mystruct { file logfile; } app (mystruct o) cat2log (file i) { cat @i stdout=@filename(o.logfile); } int nmodels=1; mystruct modelOut[] ; file data<"data.txt">; foreach j in [0:nmodels-1] { modelOut[j] = cat2log(data); } # --- mystruct.map is: #!/bin/bash nummodels=$2 outdir=outdir for i in `seq 0 $(($nummodels-1))`;do echo [$i].logfile $outdir/$i.log done --- Under a recent trunk this fails with: e$ swift maptest2.swift swift-r3490 (swift modified locally) cog-r2829 (cog modified locally) RunID: 20100823-0045-bmpsl452 Progress: Execution failed: org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile) for org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20100823-0045-r1xhema5:720000000009 type mystruct with no value at dataset=modelOut path=[0] (not closed) e$ --- Under an older Swift it works: e$ swift maptest2.swift swift-r2974 cog-r2407 RunID: 20100823-0047-j504gggc Progress: Final status: Finished successfully:1 e$ Mihael, do you know what is causing this? Can you reproduce the error using the script and mapper above? - Mike From hategan at mcs.anl.gov Mon Aug 23 00:28:28 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 23 Aug 2010 00:28:28 -0500 Subject: [Swift-devel] struct mapping code that formerly worked now fails In-Reply-To: <1237136558.1176751282538929261.JavaMail.root@zimbra.anl.gov> References: <1237136558.1176751282538929261.JavaMail.root@zimbra.anl.gov> Message-ID: <1282541308.31313.1.camel@blabla2.none> Yep. Should be fixed in swift r3561. I moved some of the functions in vdl-int.k and vdl.k from karajan to Java for performance purposes. Some things got broken in the process. Let me know if you find more things like this. Mihael On Sun, 2010-08-22 at 22:48 -0600, wilde at mcs.anl.gov wrote: > Glen Hocky has a script that used to work but now fails under recent trunk revisions. 
> > Ive reproduced the failure in a simplified script: > > maptest2.swift is: > > #--- > > type file; > > type mystruct { > file logfile; > } > > app (mystruct o) cat2log (file i) > { > cat @i stdout=@filename(o.logfile); > } > > int nmodels=1; > mystruct modelOut[] ; > > file data<"data.txt">; > > foreach j in [0:nmodels-1] { > modelOut[j] = cat2log(data); > } > > # --- > > mystruct.map is: > > #!/bin/bash > > nummodels=$2 > outdir=outdir > > for i in `seq 0 $(($nummodels-1))`;do > echo [$i].logfile $outdir/$i.log > done > > --- > > Under a recent trunk this fails with: > > e$ swift maptest2.swift > > swift-r3490 (swift modified locally) cog-r2829 (cog modified locally) > > RunID: 20100823-0045-bmpsl452 > Progress: > Execution failed: > org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..logfile) for org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20100823-0045-r1xhema5:720000000009 type mystruct with no value at dataset=modelOut path=[0] (not closed) > e$ > > --- > > Under an older Swift it works: > > > e$ swift maptest2.swift > > swift-r2974 cog-r2407 > > RunID: 20100823-0047-j504gggc > Progress: > Final status: Finished successfully:1 > e$ > > Mihael, do you know what is causing this? Can you reproduce the error using the script and mapper above? > > - Mike > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Mon Aug 23 18:25:01 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 23 Aug 2010 17:25:01 -0600 (GMT-06:00) Subject: [Swift-devel] localscheduler (condor/ condorg) breaking on lots of condor jobs In-Reply-To: <1280346516.12761.3.camel@blabla2.none> Message-ID: <1080295881.1222371282605901496.JavaMail.root@zimbra.anl.gov> ----- "Mihael Hategan" wrote: > Yeah. That's why the provider should be updated to use job logs > instead > of condor_qstat/condor_qedit for figuring out status. Is that easy or hard? For such an approach should we make all the submit files specify a single per-user condorg user log file? > That or update limits (and, btw, what does ulimit -a say on that > machine)? Ive asked for the limit to be changed from 1024 to 20,000 - thats what engage-submit on OSG is using. - Mike From hategan at mcs.anl.gov Mon Aug 23 18:31:04 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 23 Aug 2010 18:31:04 -0500 Subject: [Swift-devel] localscheduler (condor/ condorg) breaking on lots of condor jobs In-Reply-To: <1080295881.1222371282605901496.JavaMail.root@zimbra.anl.gov> References: <1080295881.1222371282605901496.JavaMail.root@zimbra.anl.gov> Message-ID: <1282606264.3205.4.camel@blabla2.none> On Mon, 2010-08-23 at 17:25 -0600, Michael Wilde wrote: > ----- "Mihael Hategan" wrote: > > > Yeah. That's why the provider should be updated to use job logs > > instead > > of condor_qstat/condor_qedit for figuring out status. > > Is that easy or hard? Should be doable in a week or two by somebody who has some experience with providers and some with condor. That includes testing. And then a few more scattered hours due to subtleties that weren't obvious from the start. I might already have some code that I never committed. If somebody wants to clean it/test it, I'd be happy to send it. > > For such an approach should we make all the submit files specify a single per-user condorg user log file? Yes. You would want that for scalability reasons. 
From my limited testing, condor seems to properly handle that situation. > > > That or update limits (and, btw, what does ulimit -a say on that > > machine)? > > Ive asked for the limit to be changed from 1024 to 20,000 - thats what engage-submit on OSG is using. Mmm, decimal... From hockyg at uchicago.edu Wed Aug 25 09:45:31 2010 From: hockyg at uchicago.edu (Glen Hocky) Date: Wed, 25 Aug 2010 10:45:31 -0400 Subject: [Swift-devel] replication comment Message-ID: I'm doing a test run on 1 osg site replication is turned on. Since nothing is happening on the site i see Progress: Submitted:6 Finished successfully:2 Progress: Submitted:6 Finished successfully:2 Progress: Submitted:6 Finished successfully:2 Progress: Submitted:5 Replicating:1 Finished successfully:2 Progress: Submitted:6 Finished successfully:2 Progress: Submitted:6 Finished successfully:2 Not sure what happens when a job is replicated on one site but I thought you might like to know... -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Wed Aug 25 09:52:44 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 25 Aug 2010 08:52:44 -0600 (GMT-06:00) Subject: [Swift-devel] replication comment In-Reply-To: Message-ID: <1396102831.1285471282747964806.JavaMail.root@zimbra.anl.gov> Thanks, Glen. I would think that when Swift is running with only a single site, replication should essentially be a no-op and disable itself. If thats what is already happening, we should state that in the userguide and swift.properties file. We might be able to tell form the full log if its really replicating or not. This is a minor issue since it can be addressed with documentation (ie "dont turn on replication with only 1 site") even if the code is not already doing that. - Mike ----- "Glen Hocky" wrote: > I'm doing a test run on 1 osg site replication is turned on. Since > nothing is happening on the site i see > > > > > Progress: Submitted:6 Finished successfully:2 > > Progress: Submitted:6 Finished successfully:2 > > Progress: Submitted:6 Finished successfully:2 > > Progress: Submitted:5 Replicating:1 Finished successfully:2 > > Progress: Submitted:6 Finished successfully:2 > > Progress: Submitted:6 Finished successfully:2 > > > Not sure what happens when a job is replicated on one site but I > thought you might like to know... > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hockyg at uchicago.edu Wed Aug 25 13:42:31 2010 From: hockyg at uchicago.edu (Glen Hocky) Date: Wed, 25 Aug 2010 14:42:31 -0400 Subject: [Swift-devel] replication comment In-Reply-To: <1396102831.1285471282747964806.JavaMail.root@zimbra.anl.gov> References: <1396102831.1285471282747964806.JavaMail.root@zimbra.anl.gov> Message-ID: Further comment: trying 2 sites with replication limit=3 for 30 jobs resulted in 90 jobs in the queue after (when none were starting for some reason) so there's a pigeonhole problem here somewhere since (i would think) for N sites you would want no more than N-1 replicas, one for each other site On Wed, Aug 25, 2010 at 10:52 AM, Michael Wilde wrote: > Thanks, Glen. I would think that when Swift is running with only a single > site, replication should essentially be a no-op and disable itself. 
If thats > what is already happening, we should state that in the userguide and > swift.properties file. > > We might be able to tell form the full log if its really replicating or > not. > > This is a minor issue since it can be addressed with documentation (ie > "dont turn on replication with only 1 site") even if the code is not already > doing that. > > - Mike > > > ----- "Glen Hocky" wrote: > > > I'm doing a test run on 1 osg site replication is turned on. Since > > nothing is happening on the site i see > > > > > > > > > > Progress: Submitted:6 Finished successfully:2 > > > > Progress: Submitted:6 Finished successfully:2 > > > > Progress: Submitted:6 Finished successfully:2 > > > > Progress: Submitted:5 Replicating:1 Finished successfully:2 > > > > Progress: Submitted:6 Finished successfully:2 > > > > Progress: Submitted:6 Finished successfully:2 > > > > > > Not sure what happens when a job is replicated on one site but I > > thought you might like to know... > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Aug 25 14:45:39 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Aug 2010 14:45:39 -0500 Subject: [Swift-devel] replication comment In-Reply-To: References: <1396102831.1285471282747964806.JavaMail.root@zimbra.anl.gov> Message-ID: <1282765539.12069.2.camel@blabla2.none> Right. The replication scheme right now is completely independent of the site information. On Wed, 2010-08-25 at 14:42 -0400, Glen Hocky wrote: > Further comment: > trying 2 sites with replication limit=3 for 30 jobs resulted in 90 > jobs in the queue after (when none were starting for some reason) > > > so there's a pigeonhole problem here somewhere since (i would think) > for N sites you would want no more than N-1 replicas, one for each > other site > > On Wed, Aug 25, 2010 at 10:52 AM, Michael Wilde > wrote: > Thanks, Glen. I would think that when Swift is running with > only a single site, replication should essentially be a no-op > and disable itself. If thats what is already happening, we > should state that in the userguide and swift.properties file. > > We might be able to tell form the full log if its really > replicating or not. > > This is a minor issue since it can be addressed with > documentation (ie "dont turn on replication with only 1 site") > even if the code is not already doing that. > > - Mike > > > > ----- "Glen Hocky" wrote: > > > I'm doing a test run on 1 osg site replication is turned on. > Since > > nothing is happening on the site i see > > > > > > > > > > Progress: Submitted:6 Finished successfully:2 > > > > Progress: Submitted:6 Finished successfully:2 > > > > Progress: Submitted:6 Finished successfully:2 > > > > Progress: Submitted:5 Replicating:1 Finished successfully:2 > > > > Progress: Submitted:6 Finished successfully:2 > > > > Progress: Submitted:6 Finished successfully:2 > > > > > > Not sure what happens when a job is replicated on one site > but I > > thought you might like to know... 
> > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Aug 25 15:08:43 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 25 Aug 2010 14:08:43 -0600 (GMT-06:00) Subject: [Swift-devel] Adjust site scores on job start not job end In-Reply-To: Message-ID: <301440345.1306121282766923887.JavaMail.root@zimbra.anl.gov> We discussed in the Swift internals review meetings the desirability of adjusting the scheduler's site scores more by how many job start events than by sucessful completion events. The rationale was that for workloads consisting entirely of long running jobs, on for example OSG, this approach would much more quickly reward sites that have been starting jobs with additional jobs, until the start rate diminishes when the jobs start queuing up. Another approach we discussed (which was demonstrated by Dinah Sulakhe to be successful in VDS) was to keep sending jobs to sites until each site has some fixed threshold of jobs sitting in its queue, and to keep all the sites at some threshold (possibly a per-site threshold based on the site's throughput). We're now at the point where a few users (Glen and Allan) would benefit from this change in scheduling algorithm. Mihael, all, can you where and how to explore such changes (module-wise) and what pitfalls are likely to be encountered? Thanks, - Mike ----- Forwarded Message ----- From: "Michael Wilde" To: "Michael Wilde" Sent: Wednesday, August 25, 2010 2:22:16 PM GMT -06:00 US/Canada Central Subject: Sched on start not end -- Sent from my mobile device -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Aug 25 15:10:27 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 25 Aug 2010 14:10:27 -0600 (GMT-06:00) Subject: [Swift-devel] Adjust site scores on job start not job end In-Reply-To: <301440345.1306121282766923887.JavaMail.root@zimbra.anl.gov> Message-ID: <1939450963.1306251282767027104.JavaMail.root@zimbra.anl.gov> > Mihael, all, can you where and how to explore such changes > (module-wise) and what pitfalls are likely to be encountered? can you *suggest* where and how... :) Mike From hategan at mcs.anl.gov Wed Aug 25 15:22:51 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Aug 2010 15:22:51 -0500 Subject: [Swift-devel] Re: Adjust site scores on job start not job end In-Reply-To: <301440345.1306121282766923887.JavaMail.root@zimbra.anl.gov> References: <301440345.1306121282766923887.JavaMail.root@zimbra.anl.gov> Message-ID: <1282767771.12552.7.camel@blabla2.none> On Wed, 2010-08-25 at 14:08 -0600, Michael Wilde wrote: > We discussed in the Swift internals review meetings the desirability > of adjusting the scheduler's site scores more by how many job start > events than by sucessful completion events. 
> > The rationale was that for workloads consisting entirely of long > running jobs, on for example OSG, this approach would much more > quickly reward sites that have been starting jobs with additional > jobs, until the start rate diminishes when the jobs start queuing up. Right. The score should take into account multiple things, such as overall throughput and queue throughput rather than just number of jobs finished ok. > > Another approach we discussed (which was demonstrated by Dinah Sulakhe > to be successful in VDS) was to keep sending jobs to sites until each > site has some fixed threshold of jobs sitting in its queue, and to > keep all the sites at some threshold (possibly a per-site threshold > based on the site's throughput). That threshold is currently the site score. > > We're now at the point where a few users (Glen and Allan) would > benefit from this change in scheduling algorithm. > > Mihael, all, can you where and how to explore such changes > (module-wise) and what pitfalls are likely to be encountered? Essentially the decision problem of how to distribute a number of jobs to a number of sites (assuming hard constraints are resolved) only requires one number for each site. So I think the score should be kept because it is the right abstraction and makes it easy to sub-divide the problem. So I think somebody (or somebodies) needs to figure out exactly what the formula for the score should be and why. That's the hard part. Then we can add the various raw measures into the sites properties and change the score calculations according to those. That's probably easier. Mihael From wilde at mcs.anl.gov Wed Aug 25 16:53:06 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 25 Aug 2010 15:53:06 -0600 (GMT-06:00) Subject: [Swift-devel] Re: Adjust site scores on job start not job end In-Reply-To: <1282767771.12552.7.camel@blabla2.none> Message-ID: <137714018.1312061282773186993.JavaMail.root@zimbra.anl.gov> Cool - what you say below makes good sense. I suspect that once Allan and Glen start getting more data (logfiles) and experience on OSG we'll have some ideas on what score formulas are worth trying. - Mike ----- "Mihael Hategan" wrote: > On Wed, 2010-08-25 at 14:08 -0600, Michael Wilde wrote: > > We discussed in the Swift internals review meetings the > desirability > > of adjusting the scheduler's site scores more by how many job start > > events than by sucessful completion events. > > > > The rationale was that for workloads consisting entirely of long > > running jobs, on for example OSG, this approach would much more > > quickly reward sites that have been starting jobs with additional > > jobs, until the start rate diminishes when the jobs start queuing > up. > > Right. The score should take into account multiple things, such as > overall throughput and queue throughput rather than just number of > jobs > finished ok. > > > > > Another approach we discussed (which was demonstrated by Dinah > Sulakhe > > to be successful in VDS) was to keep sending jobs to sites until > each > > site has some fixed threshold of jobs sitting in its queue, and to > > keep all the sites at some threshold (possibly a per-site threshold > > based on the site's throughput). > > That threshold is currently the site score. > > > > > We're now at the point where a few users (Glen and Allan) would > > benefit from this change in scheduling algorithm. > > > > Mihael, all, can you where and how to explore such changes > > (module-wise) and what pitfalls are likely to be encountered? 
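As a purely hypothetical illustration of the kind of score formula being debated here, the sketch below credits job-start events separately from completions, penalizes failures, and decays the score so stale history fades. The class name, weights, and methods are invented for the example; this is not Swift's actual scheduler code.

public class SiteScore {
    private double score = 1.0;

    private static final double START_REWARD    = 0.2;   // a job began executing at the site
    private static final double SUCCESS_REWARD  = 1.0;   // a job finished successfully
    private static final double FAILURE_PENALTY = 2.0;   // a job failed at the site
    private static final double DECAY           = 0.99;  // applied once per scheduling pass

    public synchronized void jobStarted()   { score += START_REWARD; }
    public synchronized void jobSucceeded() { score += SUCCESS_REWARD; }
    public synchronized void jobFailed()    { score = Math.max(score - FAILURE_PENALTY, 0.0); }

    // Called once per scheduling pass so that old history fades out.
    public synchronized void decay()        { score *= DECAY; }

    // The single per-site number the scheduler would rank sites by.
    public synchronized double value()      { return score; }
}

Crediting starts lets a site that keeps dispatching long-running jobs continue to receive work before any of them finish, which is the behavior the long-job OSG workloads described above need; the decay term keeps a site from coasting on old successes once its queue backs up.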
> > Essentially the decision problem of how to distribute a number of > jobs > to a number of sites (assuming hard constraints are resolved) only > requires one number for each site. So I think the score should be > kept > because it is the right abstraction and makes it easy to sub-divide > the > problem. > > So I think somebody (or somebodies) needs to figure out exactly what > the > formula for the score should be and why. That's the hard part. Then > we > can add the various raw measures into the sites properties and change > the score calculations according to those. That's probably easier. > > Mihael -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From dk0966 at cs.ship.edu Wed Aug 25 22:52:51 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Wed, 25 Aug 2010 23:52:51 -0400 Subject: [Swift-devel] ConcurrentModificationException with coasters Message-ID: Hello, Today as I was trying to get a group of MCS machines working with coasters, I ran into the following error: java.util.ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:761) at java.util.LinkedList$ListItr.next(LinkedList.java:696) at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService$ConnectionProcessor.run(BootstrapService.java:251) at java.lang.Thread.run(Thread.java:619) I've tried running the script several more times under the same conditions (same host, config files and swift version) but can not reproduce it. At this point everything seems to be working fine, but I thought it might be useful to report. Script, config files, and log attached. David -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- Hi There -------------- next part -------------- A non-text attachment was scrubbed... Name: cf Type: application/octet-stream Size: 11731 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: tc Type: application/octet-stream Size: 1867 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mcs.servers.coasters.xml Type: text/xml Size: 9484 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: catsn.swift Type: application/octet-stream Size: 343 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: catsn-20100825-1335-8zm7xuq3.log.gz Type: application/x-gzip Size: 16311 bytes Desc: not available URL: From wilde at mcs.anl.gov Thu Aug 26 13:48:31 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Thu, 26 Aug 2010 12:48:31 -0600 (GMT-06:00) Subject: [Swift-devel] Need help on race issue in Swift on Montage code In-Reply-To: <987571999.1345591282848084841.JavaMail.root@zimbra.anl.gov> Message-ID: <327715763.1346111282848511306.JavaMail.root@zimbra.anl.gov> Hi Mihael, Justin, Long email follows - sorry! As I mentioned in passing, Jon is stuck on what looks much like we have a race condition in Swift/Karajan thread synchronization. (Testing on trunk) If Jon runs a Montage problem of size "10" it seems to always complete successfully. If he runs a problem of size ~1600, it always hangs. He now has a problem of size 18 that seems to hang a significant percentage of the time (~ 50%???) 
Jon is now trying to whittle that size-18 failing example down to a simple example you can run yourself to reproduce the problem. He knows pretty well what it is hanging on (see below; Jon is trying to package up a failing test case). The logic is basically: 1. csv map an array of structures from a csv file that describes the output of the earlier stages of Montage processing 2. foreach entry in the array of structures (~ 0..34 in the size-18 problem): a. use simple mapper to map 2 files from the struct b. run a montage function "mDiff" on these two files plus one constant hdr file from outside the loop The program hangs on the foreach loop because (I think, if I have this right) *some* of the mapped dependencies dont seem to be getting set. Its not clear to me whether, in the failing case, *all* the mDiff() calls inside the foreach loop are hanging, or *some* of them are. Jon: please provide the details and correct me as needed. Also, we are relying heavily here on tracef("%k") to print the set-state of various variables. If %k is not 100% correct, then all of our assumptions are questionable. Im also curious to know what tools we have - or could develop - to in general help find what is hanging on what as a debugging aid, both for users to shake out their app errors and for Swift developers to diagnose a hang that is a Swift bug. (Jon told me about some ^T command that causes Swift to enter a Karajan debugging mode? I'd like to learn more about that, and how we might make it most useful for end users and for diagnostic info gathering). Incase its of use, Ive pasted below our latest Skype txt chat on this problem, which details what we know and what Jon will try next. Help and guidance on how to proceed would be great! Thanks, - Mike --- [8/26/10 11:05:04 AM] Jonathan Monette: ... I gathered some stuff to send to devel about the hanging problem just not sure how to word the problem for Mihael and Justin [8/26/10 1:00:25 PM] Michael Wilde: Hi, sorry, I missed your last message above [8/26/10 1:00:42 PM] Michael Wilde: "just not sure how to word the problem for Mihael and Justin" - how can I help on that? [8/26/10 1:01:20 PM] Jonathan Monette: uhhhh......well how do I work the email to describe my problem? [8/26/10 1:01:21 PM] Michael Wilde: Seems like you were on the right track yesterday: adding enough traces that the problem can be readily seen; [8/26/10 1:01:32 PM] Michael Wilde: able to state what seems to be hanging on what; [8/26/10 1:02:04 PM] Michael Wilde: and running enough times to prove that it runs OK some times (and what that traces out as) and then fails to complete other times (and what *that* traces as) [8/26/10 1:03:03 PM] Jonathan Monette: yea but how do word the email? just something like "There is a hanging problem in swift. Jobs do not submit even though the inputs for the apps are satisfied." [8/26/10 1:03:14 PM] Michael Wilde: Then, what you can/should do, is try to start "whittling" down the test case so we can catch the failure in a simple example, that mihael or justin can easily run with minimal setup, to first make the problem happen, and then test their fix [8/26/10 1:03:36 PM] Michael Wilde: "There is a hanging problem in swift. Jobs do not submit even though the inputs for the apps SEEM TO BE satisfied." [8/26/10 1:03:58 PM] Michael Wilde: here is the code, here is the outut trace, and here are the logs [8/26/10 1:04:13 PM] Michael Wilde: I have run this 10 times and it hangs 6 of 10 times [8/26/10 1:04:14 PM] Jonathan Monette: ok. 
well my simple test script completed a couple of times. I will run it more to see if I can get it to hang. [8/26/10 1:04:14 PM] Michael Wilde: etc [8/26/10 1:04:31 PM] Michael Wilde: saving the logs into run.nnnn through run.nnnn+10 [8/26/10 1:04:54 PM] Michael Wilde: Right, thats exactly what you need to do. [8/26/10 1:05:00 PM] Michael Wilde: I usually do this: [8/26/10 1:05:14 PM] Michael Wilde: start with the real code; run N times, see what the failure ratio is. [8/26/10 1:05:42 PM] Michael Wilde: Then make a complete copy, make sure it still fails, and start stripping it down to the simplest program that still fails. [8/26/10 1:05:51 PM] Jonathan Monette: ok. ill see if I can get the test script to fail [8/26/10 1:06:01 PM] Michael Wilde: wherever possible, replace Montage code with cat/sleep etc [8/26/10 1:06:19 PM] Michael Wilde: So this is and needs to be a general Swift development method: [8/26/10 1:07:03 PM] Michael Wilde: when we find any error, but *especially* a race, hang, or similar paralleism-related error, we need to isolate it to a test case that can be added to the test suite [8/26/10 1:07:23 PM] Michael Wilde: This bug, in particular, Mihael *thinks* he fixed, yet it "came back". [8/26/10 1:07:40 PM] Michael Wilde: Thats what SE folks mean by "regression testing": [8/26/10 1:08:17 PM] Michael Wilde: create a test that ensures that a fix is in place; then run that test forever more, to make sure that bug stas fixed and that nothing similar takes its place ; ) [8/26/10 1:08:37 PM] Michael Wilde: This is hard work, no doubt about it [8/26/10 1:09:01 PM] Michael Wilde: But a simple reliable test case that reproduces a bug is THE most important requirement for fixing the bug [8/26/10 1:09:30 PM] Michael Wilde: So you need to swap your user hat for a Swift developer hat, for the moment : ) [8/26/10 1:09:30 PM] Jonathan Monette: yea. I am running several tests on the stripped down function I have and see if I can reproduce the error. [8/26/10 1:09:39 PM] Jonathan Monette: alright. [8/26/10 1:09:48 PM] Jonathan Monette: lets see if I can reproduce the error. [8/26/10 1:09:54 PM] Michael Wilde: Now, for this test to go into the test suite, it needs to go into a loop [8/26/10 1:10:11 PM] Michael Wilde: Some nasty races only occur 1/100 times, or worse. [8/26/10 1:10:34 PM] Michael Wilde: so we can only trust swift when we can run the simple example 100,000 times w/o a hang : ( [8/26/10 1:11:16 PM] Michael Wilde: Once we have the tests, we can then run torture tests before releases that give us a good assurance of having a reliable product [8/26/10 1:11:26 PM] Jonathan Monette: alright. [8/26/10 1:11:42 PM] Michael Wilde: Once more, with enthusiasm! :) [8/26/10 1:12:04 PM] Jonathan Monette: well I am going to try this test several times with several different input files that increase in size. [8/26/10 1:12:30 PM] Jonathan Monette: the problem "always" appears with my larger sets so maybe with the large file it will fail more often [8/26/10 1:12:54 PM] Michael Wilde: I'll echo the above dialog to Mihael and Justin so they can pipe in with suggestions, OK? Hopefully to make your life easier and find the problem faster.... [8/26/10 1:13:29 PM] Jonathan Monette: alright. thanks. that will help. [8/26/10 1:14:01 PM] Michael Wilde: One approach is to strip out the earlier stages and the later stages, and first see if a shorter script with *just* mDiff and the foreach loop will fail. I think it should. 
[8/26/10 1:14:32 PM] Michael Wilde: Then once you have that (ie, a 20 line montage script that fails) you can try to replace Montage with cat [8/26/10 1:14:42 PM] Michael Wilde: because really all it is doing, [8/26/10 1:15:33 PM] Michael Wilde: is a csv mapping (and you can capture and freeze that) and then a simple foreach loop with just one reall app (mDiff) ?which you can replace with a cat of 3 files to 1 file, right? -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jon.monette at gmail.com Thu Aug 26 14:06:18 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Thu, 26 Aug 2010 14:06:18 -0500 Subject: [Swift-devel] Re: Need help on race issue in Swift on Montage code In-Reply-To: <327715763.1346111282848511306.JavaMail.root@zimbra.anl.gov> References: <327715763.1346111282848511306.JavaMail.root@zimbra.anl.gov> Message-ID: <4C76BB2A.60706@gmail.com> If looking at log files or output files will help to give a better idea of the problem, you can take a look at ~jonmon/Workspace/Swift/Montage/m101_j_05x05. This is my 18 image set. In that directory there is two run directories. run.0002 is a run that completed the workflow and run.0001 is a run that hung. In each of those directories there is a swift.out file that contains the output to the screen that was captured. Also, Mihael when the hang occurs, I type v for the inhook you set up and I get Register Futures: and then nothing. Does this mean there is no listeners set up and that is why it hung? On 08/26/2010 01:48 PM, wilde at mcs.anl.gov wrote: > Hi Mihael, Justin, > > Long email follows - sorry! > > As I mentioned in passing, Jon is stuck on what looks much like we have a race condition in Swift/Karajan thread synchronization. (Testing on trunk) > > If Jon runs a Montage problem of size "10" it seems to always complete successfully. > > If he runs a problem of size ~1600, it always hangs. > > He now has a problem of size 18 that seems to hang a significant percentage of the time (~ 50%???) > > Jon is now trying to whittle that size-18 failing example down to a simple example you can run yourself to reproduce the problem. > > He knows pretty well what it is hanging on (see below; Jon is trying to package up a failing test case). > > The logic is basically: > > 1. csv map an array of structures from a csv file that describes the output of the earlier stages of Montage processing > > 2. foreach entry in the array of structures (~ 0..34 in the size-18 problem): > a. use simple mapper to map 2 files from the struct > b. run a montage function "mDiff" on these two files plus one constant hdr file from outside the loop > > The program hangs on the foreach loop because (I think, if I have this right) *some* of the mapped dependencies dont seem to be getting set. Its not clear to me whether, in the failing case, *all* the mDiff() calls inside the foreach loop are hanging, or *some* of them are. Jon: please provide the details and correct me as needed. > > Also, we are relying heavily here on tracef("%k") to print the set-state of various variables. If %k is not 100% correct, then all of our assumptions are questionable. > > Im also curious to know what tools we have - or could develop - to in general help find what is hanging on what as a debugging aid, both for users to shake out their app errors and for Swift developers to diagnose a hang that is a Swift bug. 
> > (Jon told me about some ^T command that causes Swift to enter a Karajan debugging mode? I'd like to learn more about that, and how we might make it most useful for end users and for diagnostic info gathering). > > Incase its of use, Ive pasted below our latest Skype txt chat on this problem, which details what we know and what Jon will try next. > > Help and guidance on how to proceed would be great! > > Thanks, > > - Mike > > --- > > [8/26/10 11:05:04 AM] Jonathan Monette: ... I gathered some stuff to send to devel about the hanging problem just not sure how to word the problem for Mihael and Justin > [8/26/10 1:00:25 PM] Michael Wilde: Hi, sorry, I missed your last message above > [8/26/10 1:00:42 PM] Michael Wilde: "just not sure how to word the problem for Mihael and Justin" - how can I help on that? > [8/26/10 1:01:20 PM] Jonathan Monette: uhhhh......well how do I work the email to describe my problem? > [8/26/10 1:01:21 PM] Michael Wilde: Seems like you were on the right track yesterday: adding enough traces that the problem can be readily seen; > [8/26/10 1:01:32 PM] Michael Wilde: able to state what seems to be hanging on what; > [8/26/10 1:02:04 PM] Michael Wilde: and running enough times to prove that it runs OK some times (and what that traces out as) and then fails to complete other times (and what *that* traces as) > [8/26/10 1:03:03 PM] Jonathan Monette: yea but how do word the email? just something like "There is a hanging problem in swift. Jobs do not submit even though the inputs for the apps are satisfied." > [8/26/10 1:03:14 PM] Michael Wilde: Then, what you can/should do, is try to start "whittling" down the test case so we can catch the failure in a simple example, that mihael or justin can easily run with minimal setup, to first make the problem happen, and then test their fix > [8/26/10 1:03:36 PM] Michael Wilde: "There is a hanging problem in swift. Jobs do not submit even though the inputs for the apps SEEM TO BE satisfied." > [8/26/10 1:03:58 PM] Michael Wilde: here is the code, here is the outut trace, and here are the logs > [8/26/10 1:04:13 PM] Michael Wilde: I have run this 10 times and it hangs 6 of 10 times > [8/26/10 1:04:14 PM] Jonathan Monette: ok. well my simple test script completed a couple of times. I will run it more to see if I can get it to hang. > [8/26/10 1:04:14 PM] Michael Wilde: etc > [8/26/10 1:04:31 PM] Michael Wilde: saving the logs into run.nnnn through run.nnnn+10 > [8/26/10 1:04:54 PM] Michael Wilde: Right, thats exactly what you need to do. > [8/26/10 1:05:00 PM] Michael Wilde: I usually do this: > [8/26/10 1:05:14 PM] Michael Wilde: start with the real code; run N times, see what the failure ratio is. > [8/26/10 1:05:42 PM] Michael Wilde: Then make a complete copy, make sure it still fails, and start stripping it down to the simplest program that still fails. > [8/26/10 1:05:51 PM] Jonathan Monette: ok. ill see if I can get the test script to fail > [8/26/10 1:06:01 PM] Michael Wilde: wherever possible, replace Montage code with cat/sleep etc > [8/26/10 1:06:19 PM] Michael Wilde: So this is and needs to be a general Swift development method: > [8/26/10 1:07:03 PM] Michael Wilde: when we find any error, but *especially* a race, hang, or similar paralleism-related error, we need to isolate it to a test case that can be added to the test suite > [8/26/10 1:07:23 PM] Michael Wilde: This bug, in particular, Mihael *thinks* he fixed, yet it "came back". 
> [8/26/10 1:07:40 PM] Michael Wilde: Thats what SE folks mean by "regression testing": > [8/26/10 1:08:17 PM] Michael Wilde: create a test that ensures that a fix is in place; then run that test forever more, to make sure that bug stas fixed and that nothing similar takes its place ; ) > [8/26/10 1:08:37 PM] Michael Wilde: This is hard work, no doubt about it > [8/26/10 1:09:01 PM] Michael Wilde: But a simple reliable test case that reproduces a bug is THE most important requirement for fixing the bug > [8/26/10 1:09:30 PM] Michael Wilde: So you need to swap your user hat for a Swift developer hat, for the moment : ) > [8/26/10 1:09:30 PM] Jonathan Monette: yea. I am running several tests on the stripped down function I have and see if I can reproduce the error. > [8/26/10 1:09:39 PM] Jonathan Monette: alright. > [8/26/10 1:09:48 PM] Jonathan Monette: lets see if I can reproduce the error. > [8/26/10 1:09:54 PM] Michael Wilde: Now, for this test to go into the test suite, it needs to go into a loop > [8/26/10 1:10:11 PM] Michael Wilde: Some nasty races only occur 1/100 times, or worse. > [8/26/10 1:10:34 PM] Michael Wilde: so we can only trust swift when we can run the simple example 100,000 times w/o a hang : ( > [8/26/10 1:11:16 PM] Michael Wilde: Once we have the tests, we can then run torture tests before releases that give us a good assurance of having a reliable product > [8/26/10 1:11:26 PM] Jonathan Monette: alright. > [8/26/10 1:11:42 PM] Michael Wilde: Once more, with enthusiasm! :) > [8/26/10 1:12:04 PM] Jonathan Monette: well I am going to try this test several times with several different input files that increase in size. > [8/26/10 1:12:30 PM] Jonathan Monette: the problem "always" appears with my larger sets so maybe with the large file it will fail more often > [8/26/10 1:12:54 PM] Michael Wilde: I'll echo the above dialog to Mihael and Justin so they can pipe in with suggestions, OK? Hopefully to make your life easier and find the problem faster.... > [8/26/10 1:13:29 PM] Jonathan Monette: alright. thanks. that will help. > [8/26/10 1:14:01 PM] Michael Wilde: One approach is to strip out the earlier stages and the later stages, and first see if a shorter script with *just* mDiff and the foreach loop will fail. I think it should. > [8/26/10 1:14:32 PM] Michael Wilde: Then once you have that (ie, a 20 line montage script that fails) you can try to replace Montage with cat > [8/26/10 1:14:42 PM] Michael Wilde: because really all it is doing, > [8/26/10 1:15:33 PM] Michael Wilde: is a csv mapping (and you can capture and freeze that) and then a simple foreach loop with just one reall app (mDiff) which you can replace with a cat of 3 files to 1 file, right? > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From zhaozhang at uchicago.edu Thu Aug 26 14:25:58 2010 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 26 Aug 2010 14:25:58 -0500 Subject: [Swift-devel] Hacking swift/karajan to fetch a job description Message-ID: <4C76BFC6.6080809@uchicago.edu> Hi, Mihael I am wondering where I can get the job description? I was trying to hack the _swiftwrap and disable file transfer in order to minimize the workload. Is there any way for me to do this while running with dryrun option of swift? Thanks. 
A job description is some like: mock-d9hph6qj -jobdir d/9 -e /home/zzhang/workplace/swift/bin/mock -out _concurrent/result-c7e71c11-2f94-4312-a7ff-cee339e40595--array/h21/h3//elt-721 -err stderr.txt -i -d input^_concurrent/result-c7e71c11-2f94-4312-a7ff-cee339e40595--array/h21/h3/ -if input/C01546.mol2 -of _concurrent/result-c7e71c11-2f94-4312-a7ff-cee339e40595--array/h21/h3//elt-721 -k -a input/C01546.mol2 best zhao From zhaozhang at uchicago.edu Thu Aug 26 14:36:20 2010 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 26 Aug 2010 14:36:20 -0500 Subject: [Swift-devel] Re: Need help on race issue in Swift on Montage code In-Reply-To: <4C76BB2A.60706@gmail.com> References: <327715763.1346111282848511306.JavaMail.root@zimbra.anl.gov> <4C76BB2A.60706@gmail.com> Message-ID: <4C76C234.10208@uchicago.edu> Hi, Jon When I was running mProject in a batch style, there is a _area.fits file with every .fits file, and I didn't see them jonmon/Workspace/Swift/Montage/m101_j_05x05. I am not sure if that is a necessary input file, but mProject did demand the _area.fits file in my case. best zhao Jonathan Monette wrote: > If looking at log files or output files will help to give a better > idea of the problem, you can take a look at > ~jonmon/Workspace/Swift/Montage/m101_j_05x05. This is my 18 image > set. In that directory there is two run directories. run.0002 is a > run that completed the workflow and run.0001 is a run that hung. In > each of those directories there is a swift.out file that contains the > output to the screen that was captured. > > Also, Mihael when the hang occurs, I type v for the inhook you set up > and I get > Register Futures: > > and then nothing. Does this mean there is no listeners set up and > that is why it hung? > > On 08/26/2010 01:48 PM, wilde at mcs.anl.gov wrote: >> Hi Mihael, Justin, >> >> Long email follows - sorry! >> >> As I mentioned in passing, Jon is stuck on what looks much like we >> have a race condition in Swift/Karajan thread synchronization. >> (Testing on trunk) >> >> If Jon runs a Montage problem of size "10" it seems to always >> complete successfully. >> >> If he runs a problem of size ~1600, it always hangs. >> >> He now has a problem of size 18 that seems to hang a significant >> percentage of the time (~ 50%???) >> >> Jon is now trying to whittle that size-18 failing example down to a >> simple example you can run yourself to reproduce the problem. >> >> He knows pretty well what it is hanging on (see below; Jon is trying >> to package up a failing test case). >> >> The logic is basically: >> >> 1. csv map an array of structures from a csv file that describes the >> output of the earlier stages of Montage processing >> >> 2. foreach entry in the array of structures (~ 0..34 in the size-18 >> problem): >> a. use simple mapper to map 2 files from the struct >> b. run a montage function "mDiff" on these two files plus one >> constant hdr file from outside the loop >> >> The program hangs on the foreach loop because (I think, if I have >> this right) *some* of the mapped dependencies dont seem to be getting >> set. Its not clear to me whether, in the failing case, *all* the >> mDiff() calls inside the foreach loop are hanging, or *some* of them >> are. Jon: please provide the details and correct me as needed. >> >> Also, we are relying heavily here on tracef("%k") to print the >> set-state of various variables. If %k is not 100% correct, then all >> of our assumptions are questionable. 
>> >> Im also curious to know what tools we have - or could develop - to in >> general help find what is hanging on what as a debugging aid, both >> for users to shake out their app errors and for Swift developers to >> diagnose a hang that is a Swift bug. >> >> (Jon told me about some ^T command that causes Swift to enter a >> Karajan debugging mode? I'd like to learn more about that, and how we >> might make it most useful for end users and for diagnostic info >> gathering). >> >> Incase its of use, Ive pasted below our latest Skype txt chat on this >> problem, which details what we know and what Jon will try next. >> >> Help and guidance on how to proceed would be great! >> >> Thanks, >> >> - Mike >> >> --- >> >> [8/26/10 11:05:04 AM] Jonathan Monette: ... I gathered some stuff to >> send to devel about the hanging problem just not sure how to word the >> problem for Mihael and Justin >> [8/26/10 1:00:25 PM] Michael Wilde: Hi, sorry, I missed your last >> message above >> [8/26/10 1:00:42 PM] Michael Wilde: "just not sure how to word the >> problem for Mihael and Justin" - how can I help on that? >> [8/26/10 1:01:20 PM] Jonathan Monette: uhhhh......well how do I work >> the email to describe my problem? >> [8/26/10 1:01:21 PM] Michael Wilde: Seems like you were on the right >> track yesterday: adding enough traces that the problem can be readily >> seen; >> [8/26/10 1:01:32 PM] Michael Wilde: able to state what seems to be >> hanging on what; >> [8/26/10 1:02:04 PM] Michael Wilde: and running enough times to prove >> that it runs OK some times (and what that traces out as) and then >> fails to complete other times (and what *that* traces as) >> [8/26/10 1:03:03 PM] Jonathan Monette: yea but how do word the email? >> just something like "There is a hanging problem in swift. Jobs do >> not submit even though the inputs for the apps are satisfied." >> [8/26/10 1:03:14 PM] Michael Wilde: Then, what you can/should do, is >> try to start "whittling" down the test case so we can catch the >> failure in a simple example, that mihael or justin can easily run >> with minimal setup, to first make the problem happen, and then test >> their fix >> [8/26/10 1:03:36 PM] Michael Wilde: "There is a hanging problem in >> swift. Jobs do not submit even though the inputs for the apps SEEM >> TO BE satisfied." >> [8/26/10 1:03:58 PM] Michael Wilde: here is the code, here is the >> outut trace, and here are the logs >> [8/26/10 1:04:13 PM] Michael Wilde: I have run this 10 times and it >> hangs 6 of 10 times >> [8/26/10 1:04:14 PM] Jonathan Monette: ok. well my simple test >> script completed a couple of times. I will run it more to see if I >> can get it to hang. >> [8/26/10 1:04:14 PM] Michael Wilde: etc >> [8/26/10 1:04:31 PM] Michael Wilde: saving the logs into run.nnnn >> through run.nnnn+10 >> [8/26/10 1:04:54 PM] Michael Wilde: Right, thats exactly what you >> need to do. >> [8/26/10 1:05:00 PM] Michael Wilde: I usually do this: >> [8/26/10 1:05:14 PM] Michael Wilde: start with the real code; run N >> times, see what the failure ratio is. >> [8/26/10 1:05:42 PM] Michael Wilde: Then make a complete copy, make >> sure it still fails, and start stripping it down to the simplest >> program that still fails. >> [8/26/10 1:05:51 PM] Jonathan Monette: ok. 
ill see if I can get the >> test script to fail >> [8/26/10 1:06:01 PM] Michael Wilde: wherever possible, replace >> Montage code with cat/sleep etc >> [8/26/10 1:06:19 PM] Michael Wilde: So this is and needs to be a >> general Swift development method: >> [8/26/10 1:07:03 PM] Michael Wilde: when we find any error, but >> *especially* a race, hang, or similar paralleism-related error, we >> need to isolate it to a test case that can be added to the test suite >> [8/26/10 1:07:23 PM] Michael Wilde: This bug, in particular, Mihael >> *thinks* he fixed, yet it "came back". >> [8/26/10 1:07:40 PM] Michael Wilde: Thats what SE folks mean by >> "regression testing": >> [8/26/10 1:08:17 PM] Michael Wilde: create a test that ensures that a >> fix is in place; then run that test forever more, to make sure that >> bug stas fixed and that nothing similar takes its place ; ) >> [8/26/10 1:08:37 PM] Michael Wilde: This is hard work, no doubt about it >> [8/26/10 1:09:01 PM] Michael Wilde: But a simple reliable test case >> that reproduces a bug is THE most important requirement for fixing >> the bug >> [8/26/10 1:09:30 PM] Michael Wilde: So you need to swap your user hat >> for a Swift developer hat, for the moment : ) >> [8/26/10 1:09:30 PM] Jonathan Monette: yea. I am running several >> tests on the stripped down function I have and see if I can reproduce >> the error. >> [8/26/10 1:09:39 PM] Jonathan Monette: alright. >> [8/26/10 1:09:48 PM] Jonathan Monette: lets see if I can reproduce >> the error. >> [8/26/10 1:09:54 PM] Michael Wilde: Now, for this test to go into the >> test suite, it needs to go into a loop >> [8/26/10 1:10:11 PM] Michael Wilde: Some nasty races only occur 1/100 >> times, or worse. >> [8/26/10 1:10:34 PM] Michael Wilde: so we can only trust swift when >> we can run the simple example 100,000 times w/o a hang : ( >> [8/26/10 1:11:16 PM] Michael Wilde: Once we have the tests, we can >> then run torture tests before releases that give us a good assurance >> of having a reliable product >> [8/26/10 1:11:26 PM] Jonathan Monette: alright. >> [8/26/10 1:11:42 PM] Michael Wilde: Once more, with enthusiasm! :) >> [8/26/10 1:12:04 PM] Jonathan Monette: well I am going to try this >> test several times with several different input files that increase >> in size. >> [8/26/10 1:12:30 PM] Jonathan Monette: the problem "always" appears >> with my larger sets so maybe with the large file it will fail more often >> [8/26/10 1:12:54 PM] Michael Wilde: I'll echo the above dialog to >> Mihael and Justin so they can pipe in with suggestions, OK? Hopefully >> to make your life easier and find the problem faster.... >> [8/26/10 1:13:29 PM] Jonathan Monette: alright. thanks. that will help. >> [8/26/10 1:14:01 PM] Michael Wilde: One approach is to strip out the >> earlier stages and the later stages, and first see if a shorter >> script with *just* mDiff and the foreach loop will fail. I think it >> should. >> [8/26/10 1:14:32 PM] Michael Wilde: Then once you have that (ie, a 20 >> line montage script that fails) you can try to replace Montage with cat >> [8/26/10 1:14:42 PM] Michael Wilde: because really all it is doing, >> [8/26/10 1:15:33 PM] Michael Wilde: is a csv mapping (and you can >> capture and freeze that) and then a simple foreach loop with just one >> reall app (mDiff) which you can replace with a cat of 3 files to 1 >> file, right? 
>> >> > From jon.monette at gmail.com Thu Aug 26 14:38:08 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Thu, 26 Aug 2010 14:38:08 -0500 Subject: [Swift-devel] Re: Need help on race issue in Swift on Montage code In-Reply-To: <4C76C234.10208@uchicago.edu> References: <327715763.1346111282848511306.JavaMail.root@zimbra.anl.gov> <4C76BB2A.60706@gmail.com> <4C76C234.10208@uchicago.edu> Message-ID: <4C76C2A0.9030200@gmail.com> No. That file is not necessary since there is a -n option that can be passed to several of the functions that simply tells that function to ignore those area files. Those area files are just weights that are applied to the projected image. On 08/26/2010 02:36 PM, Zhao Zhang wrote: > Hi, Jon > > When I was running mProject in a batch style, there is a _area.fits > file with every .fits file, and I didn't see them > jonmon/Workspace/Swift/Montage/m101_j_05x05. > I am not sure if that is a necessary input file, but mProject did > demand the _area.fits file in my case. > > best > zhao > > Jonathan Monette wrote: >> If looking at log files or output files will help to give a better >> idea of the problem, you can take a look at >> ~jonmon/Workspace/Swift/Montage/m101_j_05x05. This is my 18 image >> set. In that directory there is two run directories. run.0002 is a >> run that completed the workflow and run.0001 is a run that hung. In >> each of those directories there is a swift.out file that contains the >> output to the screen that was captured. >> >> Also, Mihael when the hang occurs, I type v for the inhook you set up >> and I get >> Register Futures: >> >> and then nothing. Does this mean there is no listeners set up and >> that is why it hung? >> >> On 08/26/2010 01:48 PM, wilde at mcs.anl.gov wrote: >>> Hi Mihael, Justin, >>> >>> Long email follows - sorry! >>> >>> As I mentioned in passing, Jon is stuck on what looks much like we >>> have a race condition in Swift/Karajan thread synchronization. >>> (Testing on trunk) >>> >>> If Jon runs a Montage problem of size "10" it seems to always >>> complete successfully. >>> >>> If he runs a problem of size ~1600, it always hangs. >>> >>> He now has a problem of size 18 that seems to hang a significant >>> percentage of the time (~ 50%???) >>> >>> Jon is now trying to whittle that size-18 failing example down to a >>> simple example you can run yourself to reproduce the problem. >>> >>> He knows pretty well what it is hanging on (see below; Jon is trying >>> to package up a failing test case). >>> >>> The logic is basically: >>> >>> 1. csv map an array of structures from a csv file that describes the >>> output of the earlier stages of Montage processing >>> >>> 2. foreach entry in the array of structures (~ 0..34 in the size-18 >>> problem): >>> a. use simple mapper to map 2 files from the struct >>> b. run a montage function "mDiff" on these two files plus one >>> constant hdr file from outside the loop >>> >>> The program hangs on the foreach loop because (I think, if I have >>> this right) *some* of the mapped dependencies dont seem to be >>> getting set. Its not clear to me whether, in the failing case, *all* >>> the mDiff() calls inside the foreach loop are hanging, or *some* of >>> them are. Jon: please provide the details and correct me as needed. >>> >>> Also, we are relying heavily here on tracef("%k") to print the >>> set-state of various variables. If %k is not 100% correct, then all >>> of our assumptions are questionable. 
>>> >>> Im also curious to know what tools we have - or could develop - to >>> in general help find what is hanging on what as a debugging aid, >>> both for users to shake out their app errors and for Swift >>> developers to diagnose a hang that is a Swift bug. >>> >>> (Jon told me about some ^T command that causes Swift to enter a >>> Karajan debugging mode? I'd like to learn more about that, and how >>> we might make it most useful for end users and for diagnostic info >>> gathering). >>> >>> Incase its of use, Ive pasted below our latest Skype txt chat on >>> this problem, which details what we know and what Jon will try next. >>> >>> Help and guidance on how to proceed would be great! >>> >>> Thanks, >>> >>> - Mike >>> >>> --- >>> >>> [8/26/10 11:05:04 AM] Jonathan Monette: ... I gathered some stuff to >>> send to devel about the hanging problem just not sure how to word >>> the problem for Mihael and Justin >>> [8/26/10 1:00:25 PM] Michael Wilde: Hi, sorry, I missed your last >>> message above >>> [8/26/10 1:00:42 PM] Michael Wilde: "just not sure how to word the >>> problem for Mihael and Justin" - how can I help on that? >>> [8/26/10 1:01:20 PM] Jonathan Monette: uhhhh......well how do I work >>> the email to describe my problem? >>> [8/26/10 1:01:21 PM] Michael Wilde: Seems like you were on the right >>> track yesterday: adding enough traces that the problem can be >>> readily seen; >>> [8/26/10 1:01:32 PM] Michael Wilde: able to state what seems to be >>> hanging on what; >>> [8/26/10 1:02:04 PM] Michael Wilde: and running enough times to >>> prove that it runs OK some times (and what that traces out as) and >>> then fails to complete other times (and what *that* traces as) >>> [8/26/10 1:03:03 PM] Jonathan Monette: yea but how do word the >>> email? just something like "There is a hanging problem in swift. >>> Jobs do not submit even though the inputs for the apps are satisfied." >>> [8/26/10 1:03:14 PM] Michael Wilde: Then, what you can/should do, is >>> try to start "whittling" down the test case so we can catch the >>> failure in a simple example, that mihael or justin can easily run >>> with minimal setup, to first make the problem happen, and then test >>> their fix >>> [8/26/10 1:03:36 PM] Michael Wilde: "There is a hanging problem in >>> swift. Jobs do not submit even though the inputs for the apps SEEM >>> TO BE satisfied." >>> [8/26/10 1:03:58 PM] Michael Wilde: here is the code, here is the >>> outut trace, and here are the logs >>> [8/26/10 1:04:13 PM] Michael Wilde: I have run this 10 times and it >>> hangs 6 of 10 times >>> [8/26/10 1:04:14 PM] Jonathan Monette: ok. well my simple test >>> script completed a couple of times. I will run it more to see if I >>> can get it to hang. >>> [8/26/10 1:04:14 PM] Michael Wilde: etc >>> [8/26/10 1:04:31 PM] Michael Wilde: saving the logs into run.nnnn >>> through run.nnnn+10 >>> [8/26/10 1:04:54 PM] Michael Wilde: Right, thats exactly what you >>> need to do. >>> [8/26/10 1:05:00 PM] Michael Wilde: I usually do this: >>> [8/26/10 1:05:14 PM] Michael Wilde: start with the real code; run N >>> times, see what the failure ratio is. >>> [8/26/10 1:05:42 PM] Michael Wilde: Then make a complete copy, make >>> sure it still fails, and start stripping it down to the simplest >>> program that still fails. >>> [8/26/10 1:05:51 PM] Jonathan Monette: ok. 
ill see if I can get the >>> test script to fail >>> [8/26/10 1:06:01 PM] Michael Wilde: wherever possible, replace >>> Montage code with cat/sleep etc >>> [8/26/10 1:06:19 PM] Michael Wilde: So this is and needs to be a >>> general Swift development method: >>> [8/26/10 1:07:03 PM] Michael Wilde: when we find any error, but >>> *especially* a race, hang, or similar paralleism-related error, we >>> need to isolate it to a test case that can be added to the test suite >>> [8/26/10 1:07:23 PM] Michael Wilde: This bug, in particular, Mihael >>> *thinks* he fixed, yet it "came back". >>> [8/26/10 1:07:40 PM] Michael Wilde: Thats what SE folks mean by >>> "regression testing": >>> [8/26/10 1:08:17 PM] Michael Wilde: create a test that ensures that >>> a fix is in place; then run that test forever more, to make sure >>> that bug stas fixed and that nothing similar takes its place ; ) >>> [8/26/10 1:08:37 PM] Michael Wilde: This is hard work, no doubt >>> about it >>> [8/26/10 1:09:01 PM] Michael Wilde: But a simple reliable test case >>> that reproduces a bug is THE most important requirement for fixing >>> the bug >>> [8/26/10 1:09:30 PM] Michael Wilde: So you need to swap your user >>> hat for a Swift developer hat, for the moment : ) >>> [8/26/10 1:09:30 PM] Jonathan Monette: yea. I am running several >>> tests on the stripped down function I have and see if I can >>> reproduce the error. >>> [8/26/10 1:09:39 PM] Jonathan Monette: alright. >>> [8/26/10 1:09:48 PM] Jonathan Monette: lets see if I can reproduce >>> the error. >>> [8/26/10 1:09:54 PM] Michael Wilde: Now, for this test to go into >>> the test suite, it needs to go into a loop >>> [8/26/10 1:10:11 PM] Michael Wilde: Some nasty races only occur >>> 1/100 times, or worse. >>> [8/26/10 1:10:34 PM] Michael Wilde: so we can only trust swift when >>> we can run the simple example 100,000 times w/o a hang : ( >>> [8/26/10 1:11:16 PM] Michael Wilde: Once we have the tests, we can >>> then run torture tests before releases that give us a good assurance >>> of having a reliable product >>> [8/26/10 1:11:26 PM] Jonathan Monette: alright. >>> [8/26/10 1:11:42 PM] Michael Wilde: Once more, with enthusiasm! :) >>> [8/26/10 1:12:04 PM] Jonathan Monette: well I am going to try this >>> test several times with several different input files that increase >>> in size. >>> [8/26/10 1:12:30 PM] Jonathan Monette: the problem "always" appears >>> with my larger sets so maybe with the large file it will fail more >>> often >>> [8/26/10 1:12:54 PM] Michael Wilde: I'll echo the above dialog to >>> Mihael and Justin so they can pipe in with suggestions, OK? >>> Hopefully to make your life easier and find the problem faster.... >>> [8/26/10 1:13:29 PM] Jonathan Monette: alright. thanks. that will >>> help. >>> [8/26/10 1:14:01 PM] Michael Wilde: One approach is to strip out the >>> earlier stages and the later stages, and first see if a shorter >>> script with *just* mDiff and the foreach loop will fail. I think it >>> should. >>> [8/26/10 1:14:32 PM] Michael Wilde: Then once you have that (ie, a >>> 20 line montage script that fails) you can try to replace Montage >>> with cat >>> [8/26/10 1:14:42 PM] Michael Wilde: because really all it is doing, >>> [8/26/10 1:15:33 PM] Michael Wilde: is a csv mapping (and you can >>> capture and freeze that) and then a simple foreach loop with just >>> one reall app (mDiff) which you can replace with a cat of 3 files >>> to 1 file, right? >>> >> > -- Jon Computers are incredibly fast, accurate, and stupid. 
Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Thu Aug 26 14:42:55 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Aug 2010 14:42:55 -0500 Subject: [Swift-devel] ConcurrentModificationException with coasters In-Reply-To: References: Message-ID: <1282851775.15164.0.camel@blabla2.none> Bonus points if you tell me how it can be fixed. Mihael On Wed, 2010-08-25 at 23:52 -0400, David Kelly wrote: > Hello, > > Today as I was trying to get a group of MCS machines working with > coasters, I ran into the following error: > > java.util.ConcurrentModificationException > at java.util.LinkedList > $ListItr.checkForComodification(LinkedList.java:761) > at java.util.LinkedList$ListItr.next(LinkedList.java:696) > at > org.globus.cog.abstraction.impl.execution.coaster.BootstrapService > $ConnectionProcessor.run(BootstrapService.java:251) > at java.lang.Thread.run(Thread.java:619) > > I've tried running the script several more times under the same > conditions (same host, config files and swift version) but can not > reproduce it. At this point everything seems to be working fine, but I > thought it might be useful to report. Script, config files, and log > attached. > > David > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Aug 26 14:44:59 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Aug 2010 14:44:59 -0500 Subject: [Swift-devel] Re: Need help on race issue in Swift on Montage code In-Reply-To: <4C76BB2A.60706@gmail.com> References: <327715763.1346111282848511306.JavaMail.root@zimbra.anl.gov> <4C76BB2A.60706@gmail.com> Message-ID: <1282851899.15164.2.camel@blabla2.none> On Thu, 2010-08-26 at 14:06 -0500, Jonathan Monette wrote: > If looking at log files or output files will help to give a better idea > of the problem, you can take a look at > ~jonmon/Workspace/Swift/Montage/m101_j_05x05. This is my 18 image set. > In that directory there is two run directories. run.0002 is a run that > completed the workflow and run.0001 is a run that hung. In each of > those directories there is a swift.out file that contains the output to > the screen that was captured. > > Also, Mihael when the hang occurs, I type v for the inhook you set up > and I get > Register Futures: > > and then nothing. Does this mean there is no listeners set up and that > is why it hung? The first part. I'm not sure about the second. Send me the scripts, and I'll try to reproduce it. Can you tell me the exact svn version of swift? Mihael From hategan at mcs.anl.gov Thu Aug 26 14:47:32 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Aug 2010 14:47:32 -0500 Subject: [Swift-devel] Re: Hacking swift/karajan to fetch a job description In-Reply-To: <4C76BFC6.6080809@uchicago.edu> References: <4C76BFC6.6080809@uchicago.edu> Message-ID: <1282852052.15164.4.camel@blabla2.none> On Thu, 2010-08-26 at 14:25 -0500, Zhao Zhang wrote: > Hi, Mihael > > I am wondering where I can get the job description? I was trying to hack > the _swiftwrap and disable file transfer in order to minimize the workload. > Is there any way for me to do this while running with dryrun option of > swift? Thanks. That seems to overlap with CDM. Can you be more exact about what you are trying to do? Justin has some ifs in vdl-int.k that can disable staging for select files. 
Things are a bit different if you use provider staging. You can also get the task object and that will contain the JobSpecification which has all the arguments. Mihael From jon.monette at gmail.com Thu Aug 26 14:48:20 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Thu, 26 Aug 2010 14:48:20 -0500 Subject: [Swift-devel] Re: Need help on race issue in Swift on Montage code In-Reply-To: <1282851899.15164.2.camel@blabla2.none> References: <327715763.1346111282848511306.JavaMail.root@zimbra.anl.gov> <4C76BB2A.60706@gmail.com> <1282851899.15164.2.camel@blabla2.none> Message-ID: <4C76C504.403@gmail.com> trunk r3566. Attached is a tarball of the source files I used along with the screen output of a run that failed and one that completed. On 08/26/2010 02:44 PM, Mihael Hategan wrote: > On Thu, 2010-08-26 at 14:06 -0500, Jonathan Monette wrote: > >> If looking at log files or output files will help to give a better idea >> of the problem, you can take a look at >> ~jonmon/Workspace/Swift/Montage/m101_j_05x05. This is my 18 image set. >> In that directory there is two run directories. run.0002 is a run that >> completed the workflow and run.0001 is a run that hung. In each of >> those directories there is a swift.out file that contains the output to >> the screen that was captured. >> >> Also, Mihael when the hang occurs, I type v for the inhook you set up >> and I get >> Register Futures: >> >> and then nothing. Does this mean there is no listeners set up and that >> is why it hung? >> > The first part. I'm not sure about the second. > > Send me the scripts, and I'll try to reproduce it. Can you tell me the > exact svn version of swift? > > Mihael > > > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: Swift_Montage_src.tar.gz Type: application/x-gzip Size: 4910 bytes Desc: not available URL: From zhaozhang at uchicago.edu Thu Aug 26 14:51:12 2010 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 26 Aug 2010 14:51:12 -0500 Subject: [Swift-devel] Re: Hacking swift/karajan to fetch a job description In-Reply-To: <1282852052.15164.4.camel@blabla2.none> References: <4C76BFC6.6080809@uchicago.edu> <1282852052.15164.4.camel@blabla2.none> Message-ID: <4C76C5B0.8090508@uchicago.edu> Mihael Hategan wrote: > On Thu, 2010-08-26 at 14:25 -0500, Zhao Zhang wrote: > >> Hi, Mihael >> >> I am wondering where I can get the job description? I was trying to hack >> the _swiftwrap and disable file transfer in order to minimize the workload. >> Is there any way for me to do this while running with dryrun option of >> swift? Thanks. >> > > That seems to overlap with CDM. Can you be more exact about what you are > trying to do? > I am trying to use swift dryrun as a compiler to compile swift script to a job list, so I could run with the system I implemented on BG/P. best zhao > Justin has some ifs in vdl-int.k that can disable staging for select > files. Things are a bit different if you use provider staging. You can > also get the task object and that will contain the JobSpecification > which has all the arguments. 
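To make concrete what one entry in such a compiled job list could carry, here is a small hypothetical data holder modeled loosely on the mock-d9hph6qj description earlier in the thread. It is not CoG's actual JobSpecification API; the class, fields, and formatting are invented for illustration only.

import java.util.List;

// Hypothetical stand-in for one job-list entry: the executable, its
// arguments, and the files to stage in and out. NOT the CoG
// JobSpecification class, just an illustration of the data involved.
public class JobEntry {
    public final String executable;        // e.g. the application run via _swiftwrap
    public final List<String> arguments;   // e.g. -jobdir, -out, -err, -i, -of, ...
    public final List<String> stageIn;     // input file names
    public final List<String> stageOut;    // output file names

    public JobEntry(String executable, List<String> arguments,
                    List<String> stageIn, List<String> stageOut) {
        this.executable = executable;
        this.arguments = arguments;
        this.stageIn = stageIn;
        this.stageOut = stageOut;
    }

    @Override
    public String toString() {
        // One line per job, easy to parse on the worker side.
        return executable + " " + String.join(" ", arguments)
                + " | in=" + stageIn + " | out=" + stageOut;
    }
}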
> > Mihael > > > > From wilde at mcs.anl.gov Thu Aug 26 15:21:19 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 26 Aug 2010 14:21:19 -0600 (GMT-06:00) Subject: [Swift-devel] Swift Karajan debugging commands In-Reply-To: <4C76BB2A.60706@gmail.com> Message-ID: <1477001148.1351201282854079837.JavaMail.root@zimbra.anl.gov> ----- "Jonathan Monette" wrote: > ... when the hang occurs, I type v for the inhook you set up > > and I get > Register Futures: > > and then nothing. Does this mean there is no listeners set up and > that > is why it hung? > I'd heard about but forgotten about these debugging hooks (t and v commands). I tried a small example and documented it in the CookBook for now: http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftCookBook#Debugging_hung_Swift_tasks_Karaj Please elaborate if there's more useful things to state about this. It seems there is also a "d" command that wants an X DISPLAY to run. It also seems that the output can be made less noisy so that the meaning of the debug info stands out more clearly. - Mike From hategan at mcs.anl.gov Thu Aug 26 15:33:06 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Aug 2010 15:33:06 -0500 Subject: [Swift-devel] Re: Hacking swift/karajan to fetch a job description In-Reply-To: <4C76C5B0.8090508@uchicago.edu> References: <4C76BFC6.6080809@uchicago.edu> <1282852052.15164.4.camel@blabla2.none> <4C76C5B0.8090508@uchicago.edu> Message-ID: <1282854786.16053.4.camel@blabla2.none> On Thu, 2010-08-26 at 14:51 -0500, Zhao Zhang wrote: > > Mihael Hategan wrote: > > On Thu, 2010-08-26 at 14:25 -0500, Zhao Zhang wrote: > > > >> Hi, Mihael > >> > >> I am wondering where I can get the job description? I was trying to hack > >> the _swiftwrap and disable file transfer in order to minimize the workload. > >> Is there any way for me to do this while running with dryrun option of > >> swift? Thanks. Then you can modify execute-dryrun.k to print out all that stuff. Better yet, make your own execute-zhao.k (or something) and plug it into vdl.k (around line 46). I.e. echo("tr={tr}, arguments={arguments}, ..."). "stagein" is a list of filenames, stageout is a list of [path, var] pairs on each of which you have to do vdl:absfilename(vdl:getfield(var, path=path)). Look at the normal vdl-int.k for guidance. Mihael From hategan at mcs.anl.gov Thu Aug 26 15:34:09 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Aug 2010 15:34:09 -0500 Subject: [Swift-devel] Swift Karajan debugging commands In-Reply-To: <1477001148.1351201282854079837.JavaMail.root@zimbra.anl.gov> References: <1477001148.1351201282854079837.JavaMail.root@zimbra.anl.gov> Message-ID: <1282854849.16053.5.camel@blabla2.none> On Thu, 2010-08-26 at 14:21 -0600, Michael Wilde wrote: > ----- "Jonathan Monette" wrote: > > > ... when the hang occurs, I type v for the inhook you set up > > > > and I get > > Register Futures: > > > > and then nothing. Does this mean there is no listeners set up and > > that > > is why it hung? > > > > I'd heard about but forgotten about these debugging hooks (t and v commands). I tried a small example and documented it in the CookBook for now: > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftCookBook#Debugging_hung_Swift_tasks_Karaj > > Please elaborate if there's more useful things to state about this. > > It seems there is also a "d" command that wants an X DISPLAY to run. Right. Once that started a swing interface which pretty much showed the same information. I am not sure it works right now. 
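For readers who have not seen this style of hook, the sketch below shows the general idea in generic form: a daemon thread reads one-character commands from stdin and dumps every thread's stack trace, so a hung run can be inspected without killing the JVM. It is an illustration only, not Swift's or Karajan's actual implementation, and the 't' binding here is just an example.

import java.io.IOException;
import java.util.Map;

public class DebugHook {
    // Install a stdin-driven debugging hook; illustration only.
    public static void install() {
        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    int c;
                    while ((c = System.in.read()) != -1) {
                        if (c == 't') {
                            dumpAllStacks();
                        }
                    }
                } catch (IOException e) {
                    // stdin closed; nothing more to do
                }
            }
        }, "debug-hook");
        reader.setDaemon(true);
        reader.start();
    }

    private static void dumpAllStacks() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            System.err.println("Thread: " + e.getKey().getName());
            for (StackTraceElement frame : e.getValue()) {
                System.err.println("    at " + frame);
            }
        }
    }
}

A hook like this answers the generic "what is it blocked on" question at the thread level; the Karajan 'v' hook discussed above goes further by listing registered futures, which is the dataflow-level view needed for the kind of hang Jon is chasing.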
Mihael From zhaozhang at uchicago.edu Thu Aug 26 15:34:55 2010 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 26 Aug 2010 15:34:55 -0500 Subject: [Swift-devel] Re: Hacking swift/karajan to fetch a job description In-Reply-To: <1282854786.16053.4.camel@blabla2.none> References: <4C76BFC6.6080809@uchicago.edu> <1282852052.15164.4.camel@blabla2.none> <4C76C5B0.8090508@uchicago.edu> <1282854786.16053.4.camel@blabla2.none> Message-ID: <4C76CFEF.2090602@uchicago.edu> Cool, thanks. zhao Mihael Hategan wrote: > On Thu, 2010-08-26 at 14:51 -0500, Zhao Zhang wrote: > >> Mihael Hategan wrote: >> >>> On Thu, 2010-08-26 at 14:25 -0500, Zhao Zhang wrote: >>> >>> >>>> Hi, Mihael >>>> >>>> I am wondering where I can get the job description? I was trying to hack >>>> the _swiftwrap and disable file transfer in order to minimize the workload. >>>> Is there any way for me to do this while running with dryrun option of >>>> swift? Thanks. >>>> > > Then you can modify execute-dryrun.k to print out all that stuff. Better > yet, make your own execute-zhao.k (or something) and plug it into vdl.k > (around line 46). > > I.e. echo("tr={tr}, arguments={arguments}, ..."). > "stagein" is a list of filenames, stageout is a list of [path, var] > pairs on each of which you have to do vdl:absfilename(vdl:getfield(var, > path=path)). > > Look at the normal vdl-int.k for guidance. > > Mihael > > > From wilde at mcs.anl.gov Thu Aug 26 16:00:21 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Thu, 26 Aug 2010 15:00:21 -0600 (GMT-06:00) Subject: [Swift-devel] persistent coaster service In-Reply-To: <144679378.1353111282856303080.JavaMail.root@zimbra.anl.gov> Message-ID: <1082193619.1353181282856421112.JavaMail.root@zimbra.anl.gov> I have 2 questions on the persistent coaster service: 1) is -nosec working right? I get this when I specify it: bri$ coaster-service -port 55123 -nosec Error loading credential: [JGLOBUS-5] Proxy file (/tmp/x509up_u1031) not found. Error loading credential org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u1031) not found. at org.globus.gsi.GlobusCredential.(GlobusCredential.java:114) at org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590) at org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575) at org.globus.cog.abstraction.coaster.service.CoasterPersistentService.main(CoasterPersistentService.java:73) bri$ I would have expected it to fully eliminate the need for a proxy, no? 2) Following up on Allan's last question, can you clarify: When you start a persistent coaster service do you have the option of either: (a) the Swift client starts workers per the sites.xml profile settings or (b) the user starts the workers manually, connecting to the persisten server, when you specify workerManager passive in the Globus profile tag: key="workerManager">passive Thanks, - Mike ----- "Mihael Hategan" wrote: > On Mon, 2010-08-09 at 18:14 -0500, Allan Espinosa wrote: > > So the url should be "bridled.ci.uchicago.edu" since I run the > service > > there. But this same field is also used for spawning the workers > unless it > > specifies "manual coasters" right? > > Right. > > > > > -Allan > > > > On Mon, Aug 09, 2010 at 06:07:40PM -0500, Mihael Hategan wrote: > > > On Mon, 2010-08-09 at 18:03 -0500, Allan Espinosa wrote: > > > > Ah. so the persistent coaster service is meant to run with the > manual workers? > > > > > > No. 
It's like, say, GRAM, in that you need to start a service on > some > > > head node, and you need to supply the URL of that head node in > > > sites.xml. > > > > > > It won't start the service automatically. > > > > > > > > > > > -Allan > > > > > > > > On Mon, Aug 09, 2010 at 05:36:59PM -0500, Mihael Hategan wrote: > > > > > ff-grid2.unl.edu is the url you are supplying in sites.xml. > It's > > > > > connecting to that. Though I'm surprised it works given that > you are > > > > > implying that there is no service running there. > > > > > > > > > > On Mon, 2010-08-09 at 17:09 -0500, Allan Espinosa wrote: > > > > > > I tried it today on OSG. The coaster service was run on > bridled.ci . But from > > > > > > the session below, it looks like its connecting to the site > headnode instead: > > > > > > > > > > > > RunID: coaster > > > > > > Progress: > > > > > > Progress: uninitialized:1 Selecting site:675 Initializing > site shared > > > > > > directory:1 > > > > > > Progress: Initializing:2 Selecting site:1444 Initializing > site shared > > > > > > directory:1 > > > > > > Progress: uninitialized:1 Selecting site:2499 > Initializing site shared > > > > > > directory:1 > > > > > > Progress: uninitialized:1 Selecting site:3818 > Initializing site shared > > > > > > directory:1 > > > > > > Progress: uninitialized:1 Initializing:1 Selecting > site:4201 Initializing > > > > > > site shared directory:1 > > > > > > Progress: Initializing:1 Selecting site:3 Stage in:4202 > > > > > > Progress: uninitialized:1 Initializing:1 Selecting site:5 > Submitting:4202 > > > > > > Progress: Initializing:1 Selecting site:6 Stage in:2 > Submitting:4202 > > > > > > Find: https://ff-grid2.unl.edu:1984 > > > > > > Find: keepalive(120), reconnect - > https://ff-grid2.unl.edu:1984 > > > > > > Progress: Initializing:2 Selecting site:6 Stage in:144 > Submitting:4303 > > > > > > Failed but can retry:16 > > > > > > Progress: Initializing:2 Selecting site:31 Stage in:80 > Submitting:4945 > > > > > > Failed but can retry:54 > > > > > > Progress: Initializing:1 Selecting site:6 Stage in:2 > Submitting:5222 Failed > > > > > > but can retry:68 > > > > > > Progress: Initializing:1 Selecting site:6 Stage in:1 > Submitting:5686 > > > > > > Submitted:1 Failed but can retry:95 > > > > > > ... > > > > > > ... > > > > > > > > > > > > Corresponding log entry (IMO): > > > > > > 2010-08-09 17:01:31,690-0500 WARN RemoteConfiguration > Find: > > > > > > https://ff-grid2.unl.edu:1984 > > > > > > 2010-08-09 17:01:31,690-0500 WARN RemoteConfiguration Find: > keepalive(120), > > > > > > reconnect - https://ff-grid2.unl.edu:1984 > > > > > > > > > > > > > > > > > > > > > > > > sites.xml > > > > > > > > > > > > url="ff-grid2.unl.edu" > > > > > > jobmanager="gt2:gt2:pbs" /> > > > > > > > > > > > > key="maxTime">86400 > > > > > > key="maxNodes">1290 > > > > > > 0.8 > > > > > > 10 > > > > > > key="lowOverallocation">20 > > > > > > key="remoteMonitorEnabled">true > > > > > > > > > > > > key="initialScore">1500.0 > > > > > > key="jobThrottle">51.54 > > > > > > > > > > > > > > > > > > > /panfs/panasas/CMS/data/engage-scec/swift_scratch > > > > > > > > > > > > > > > > > > > > > > > > -Allan > > > > > > > > > > > > On Thu, Aug 05, 2010 at 10:34:34PM -0500, Mihael Hategan > wrote: > > > > > > > > > > > > > ... was added in cog r2834. > > > > > > > > > > > > > > Despite having run a few jobs with it, I don't feel very > confident about > > > > > > > it. So please test. 
> > > > > > > > > > > > > > Start with bin/coaster-service and use > "coaster-persistent" as provider > > > > > > > in sites.xml. Everything else would be the same as in the > "coaster" > > > > > > > case. > > > > > > > > > > > > > > Mihael > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Aug 26 17:05:25 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 26 Aug 2010 17:05:25 -0500 Subject: [Swift-devel] persistent coaster service In-Reply-To: <1082193619.1353181282856421112.JavaMail.root@zimbra.anl.gov> References: <1082193619.1353181282856421112.JavaMail.root@zimbra.anl.gov> Message-ID: <1282860325.16378.1.camel@blabla2.none> On Thu, 2010-08-26 at 15:00 -0600, wilde at mcs.anl.gov wrote: > I have 2 questions on the persistent coaster service: > > 1) is -nosec working right? I haven't tested it, so I make no such claims. > I get this when I specify it: > > bri$ coaster-service -port 55123 -nosec > Error loading credential: [JGLOBUS-5] Proxy file (/tmp/x509up_u1031) not found. > Error loading credential > org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u1031) not found. > at org.globus.gsi.GlobusCredential.(GlobusCredential.java:114) > at org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590) > at org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575) > at org.globus.cog.abstraction.coaster.service.CoasterPersistentService.main(CoasterPersistentService.java:73) > bri$ > > I would have expected it to fully eliminate the need for a proxy, no? Yes. Something seems messed up indeed. > > 2) Following up on Allan's last question, can you clarify: > > When you start a persistent coaster service do you have the option of either: > > (a) the Swift client starts workers per the sites.xml profile settings or > > (b) the user starts the workers manually, connecting to the persisten server, when you specify workerManager passive in the Globus profile tag: > > key="workerManager">passive Yes. That looks correct. Mihael From wilde at mcs.anl.gov Sun Aug 29 20:17:13 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 29 Aug 2010 19:17:13 -0600 (GMT-06:00) Subject: [Swift-devel] Fwd: [Swift-user] Re: Errors in 13-site OSG run: lazy error question In-Reply-To: Message-ID: <2008567707.28811283131033377.JavaMail.root@zimbra.anl.gov> Glen, "thing 1" below might be simply having a universal front-end command like swiftrun track the initial args to swift in a local file, so that restart is easier. But I guess the cmd line arguments could or should be saved in the restart file. Both sound like projects that David could take on. For now, lets make your fron-end wrappper save a swift.cmd.args file or something like that, for restart. 
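Something along these lines would do as a first cut -- a minimal, untested sketch; the "swiftrun" name and the swift.cmd.args filename are only placeholders, and it is naive about arguments that contain spaces:

#!/bin/bash
# swiftrun: run swift, remembering the arguments so that a later
# "swiftrun -resume <rlogfile>" does not need them retyped.
if [ "$1" = "-resume" ]; then
    rlog=$2; shift 2
    # Replay the saved arguments, then append anything passed on this
    # invocation (whether a repeated option overrides the stored one is
    # up to swift's own argument handling).
    exec swift -resume "$rlog" $(cat swift.cmd.args) "$@"
else
    echo "$@" > swift.cmd.args    # record this run's arguments
    exec swift "$@"
fi
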
- Mike ----- Forwarded Message ----- From: "Glen Hocky" To: "Michael Wilde" Sent: Sunday, August 29, 2010 8:11:15 PM GMT -06:00 US/Canada Central Subject: Re: [Swift-user] Re: Errors in 13-site OSG run: lazy error question oh but two things for the devels that we discussed before 1) if you could get someone to make restarting slightly easier (i.e. you don't have to specify all options to restart, see earlier email to list host) 2) tagging the jobs submitted or at least making sure they get pulled out when a job fails or is canceled with the condor provider On Sun, Aug 29, 2010 at 9:08 PM, Glen Hocky < hockyg at gmail.com > wrote: well 2 sites that would be productive, *vcell* and *mit* (forget exact names) both have jobs failing with "failed to transfer wrapper log" errors but since it works on so many other sites, i think that must be a problem on those sites...if we could work around or get that fixed that would add a lot of machines. otherwise i'm just gonna try to get some productive runs done (almost done one) so we can say that we used OSG productively.... On Sun, Aug 29, 2010 at 8:40 PM, Michael Wilde < wilde at mcs.anl.gov > wrote: Very good, thanks Glen. What's the next prio on this workflow? Still some sites that are not building or running correctly? - Mike ----- "Glen Hocky" < hockyg at gmail.com > wrote: > it works now. thanks a lot > > > On Fri, Aug 27, 2010 at 2:52 PM, Glen Hocky < hockyg at gmail.com > > wrote: > > > ok i'll try again > > > > > > On Fri, Aug 27, 2010 at 2:49 PM, Michael Wilde < wilde at mcs.anl.gov > > wrote: > > > Updated; ~wilde/swift/rev/trunk is now at: swift-r3571 cog-r2868 > > > > > - Mike > > ----- "Glen Hocky" < hockyg at gmail.com > wrote: > > > Let me know when you update... > > > > > > Begin forwarded message: > > > > > > > > > > From: Mihael Hategan < hategan at mcs.anl.gov > > > Date: August 27, 2010 2:01:56 PM EDT > > To: Glen Hocky < hockyg at gmail.com > > > Cc: Mike Wilde < wilde at mcs.anl.gov > > > Subject: Re: [Swift-user] Re: Errors in 13-site OSG run: lazy error > > question > > > > > > > > > > > > swift trunk r3568 > > > > On Fri, 2010-08-27 at 13:05 -0400, Glen Hocky wrote: > > > > > > in ci-home:~hockyg/for_mihael > > > > > > > > > > > > On Fri, Aug 27, 2010 at 12:41 PM, Mihael Hategan < > hategan at mcs.anl.gov > > > > > > > > > wrote: > > > > > > Or even the log itself, because I don't think I have access to > > > > > > engage-submit. > > > > > > > > > > > > > > > > > > On Fri, 2010-08-27 at 11:34 -0500, Mihael Hategan wrote: > > > > > > > > > > Or if you can find the stack trace of that specific error in > > > > > > the log, > > > > > > > > > > that might be useful. > > > > > > > > > > > > > > > > > > > > On Fri, 2010-08-27 at 09:06 -0600, Michael Wilde wrote: > > > > > > > > > > > > > > Glen, as I recall, in the previous incident of this error > > > > > > we re-created with a simpler script, using only the "cat" > > > > > > app(), correct? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Is it possible to re-create this similar error in a > > > > > > similar test script? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Mihael, any thoughts on whether its likely that the prior > > > > > > fix did not address all cases? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- "Glen Hocky" < hockyg at gmail.com > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes nominally the same error but it's not at the > > > > > > beginning but in the > > > > > > > > > > > > > > > > > > middle now for some reason. I think it's a mid-stated > > > > > > error message. > > > > > > > > > > > > > > > > > > I'll attach the log soon > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Aug 27, 2010, at 12:11 AM, Michael Wilde > > > > > > < wilde at mcs.anl.gov > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Glen, I wonder if whats happening here is that Swift > > > > > > will retry and > > > > > > > > > > > > > > > > > > lazily run past *job* errors, but the error below (a > > > > > > mapping error) is > > > > > > > > > > > > > > > > > > maybe being treated as an error in Swift's > > > > > > interpretation of the > > > > > > > > > > > > > > > > > > script itself, and this causes an immediate halt to > > > > > > execution? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can anyone confirm that this is whats happening, and > > > > > > if it is the > > > > > > > > > > > > > > > > > > expected behavior? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Also, Glen, 2 questions: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) Isn't the error below the one that was fixed by > > > > > > Mihael in a > > > > > > > > > > > > > > > > > > recent revision - the same one I looked at earlier in > > > > > > the week? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2) Do you know what errors the "Failed but can > > > > > > retry:8" message is > > > > > > > > > > > > > > > > > > referring to? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Where is the log/run directory for this run? How long > > > > > > did it take > > > > > > > > > > > > > > > > > > to get the 589 jobs finished? It would be good to start > > > > > > plotting > > > > > > > > > > > > > > > > > > these large multi-site runs to get a sense of how the > > > > > > scheduler is > > > > > > > > > > > > > > > > > > doing. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- "Glen Hocky" < hockyg at uchicago.edu > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > here's the result of my 13 site run that ran while i > > > > > > was out this > > > > > > > > > > > > > > > > > > > > > > > > > > evening. It did pretty well! > > > > > > > > > > > > > > > > > > > > > > > > > > but seems to have that problem of not quite lazy > > > > > > errors > > > > > > > > > > > > > > > > > > > > > > > > > > ........ 
> > > > > > > > > > > > > > > > > > > > > > > > > > Progress: Submitting:3 Submitted:262 Active:147 > > > > > > Checking status:3 > > > > > > > > > > > > > > > > > > > > > > > > > > Stage out:1 Finished successfully:586 > > > > > > > > > > > > > > > > > > > > > > > > > > Progress: Submitting:3 Submitted:262 Active:144 > > > > > > Checking status:4 > > > > > > > > > > > > > > > > > > > > > > > > > > Stage out:2 Finished successfully:587 > > > > > > > > > > > > > > > > > > > > > > > > > > Progress: Submitting:3 Submitted:262 Active:142 Stage > > > > > > out:2 > > > > > > > > > > > > > > > > > > Finished > > > > > > > > > > > > > > > > > > > > > > > > > > successfully:587 Failed but can retry:6 > > > > > > > > > > > > > > > > > > > > > > > > > > Progress: Submitting:3 Submitted:262 Active:140 > > > > > > Finished > > > > > > > > > > > > > > > > > > > > > > > > > > successfully:589 Failed but can retry:8 > > > > > > > > > > > > > > > > > > > > > > > > > > Failed to transfer wrapper log from > > > > > > > > > > > > > > > > > > > > > > > > > > glassRunCavities-20100826-1718-7gi0dzs1/info/5 on > > > > > > > > > > > > > > > > > > > > > > > > > > UCHC_CBG_vdgateway.vcell.uchc.edu > > > > > > > > > > > > > > > > > > > > > > > > > > Execution failed: > > > > > > > > > > > > > > > > > > > > > > > > > > org.griphyn.vdl.mapping.InvalidPathException: Invalid > > > > > > path > > > > > > > > > > > > > > > > > > (..logfile) > > > > > > > > > > > > > > > > > > > > > > > > > > for org.griphyn.vdl.mapping.DataNode identifier > > > > > > > > > > > > > > > > > > > > > > > > > > tag:benc at ci.uchicago.edu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ,2008:swift:dataset:20100826-1718-sznq1qr2:720000002968 > > > > > > type > > > > > > > > > > > > > > > > > > GlassOut > > > > > > > > > > > > > > > > > > > > > > > > > > with no value at dataset=modelOut path=[3][1][11] > > > > > > (not closed) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > > > Michael Wilde > > > > > > > > > > > > > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > > > > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > > > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > Swift-user mailing list > > > > > > > > > > Swift-user at ci.uchicago.edu > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > -- > > > > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Mon Aug 30 19:55:16 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Mon, 30 Aug 2010 18:55:16 -0600 (GMT-06:00) Subject: [Swift-devel] Provider staging is failing In-Reply-To: <1353648981.71071283216003333.JavaMail.root@zimbra.anl.gov> Message-ID: <608918146.71141283216116846.JavaMail.root@zimbra.anl.gov> Mihael, Justin, Im trying to use provider-staging for the first time. 
It seems to be starting the worker in /, and hence staging in fails right away (on _swiftwrap). Where is the worker supposed to start when using provider staging? Ive tried to set the jobdir to /tmp using the tag but that doesnt seem to be honored. Ive tried a few different sites configurations; Im running from bridled to communicado using ssh. My most recent is: 8 .07 10000 file /tmp/wilde/scratch - Mike From wilde at mcs.anl.gov Mon Aug 30 20:02:59 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Mon, 30 Aug 2010 19:02:59 -0600 (GMT-06:00) Subject: [Swift-devel] Provider staging is failing In-Reply-To: <7342587.71421283216537919.JavaMail.root@zimbra.anl.gov> Message-ID: <305821435.71441283216579039.JavaMail.root@zimbra.anl.gov> OK, I see now that it is honoring the workdirectory tag. (I thought that was not used with provider staging, but seems that it is). WHen mkdir was failing I was getting an error code 524; now Im getting an error code 520 - seems to be failing now in the actual transfer of swiftwrap. worker log is pasted below. - Mike com$ cat worker-0830-560709-000000.log 1283216169.574 INFO - 0830-560709-000000 Logging started: Mon Aug 30 19:56:09 2010 1283216169.576 INFO - Running on node communicado.ci.uchicago.edu 1283216169.576 DEBUG - uri=http://128.135.125.17:50001 1283216169.576 DEBUG - scheme=http 1283216169.576 DEBUG - host=128.135.125.17 1283216169.576 DEBUG - port=50001 1283216169.576 DEBUG - blockid=0830-560709-000000 1283216169.576 INFO - Connecting (0)... 1283216169.576 DEBUG - Trying 128.135.125.17:50001... 1283216169.578 INFO - Connected 1283216169.578 DEBUG - Replies: {} 1283216169.578 DEBUG - OUT: len=8, tag=0, flags=0 1283216169.578 DEBUG - OUT: len=18, tag=0, flags=0 1283216169.578 DEBUG - OUT: len=0, tag=0, flags=2 1283216169.578 DEBUG - done sending frags for 0 1283216169.623 DEBUG - Fin flag set 1283216169.624 INFO 000000 Registration successful. ID=000000 1283216169.624 DEBUG 000000 New request (1) 1283216169.624 DEBUG 000000 Fin flag set 1283216169.624 DEBUG 000000 Processing request 1283216169.625 DEBUG 000000 Cmd is SUBMITJOB 1283216169.625 INFO 000000 1283216169479 Job info received (tag=1) 1283216169.625 DEBUG 000000 1283216169479 Job check ok (dir: /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj) 1283216169.625 INFO 000000 1283216169479 Sending submit reply (tag=1) 1283216169.625 DEBUG 000000 OUT: len=2, tag=1, flags=3 1283216169.625 DEBUG 000000 done sending frags for 1 1283216169.625 INFO 000000 1283216169479 Submit reply sent (tag=1) 1283216169.625 DEBUG 000000 Replies: {} 1283216169.625 DEBUG 000000 OUT: len=9, tag=1, flags=0 1283216169.625 DEBUG 000000 OUT: len=13, tag=1, flags=0 1283216169.625 DEBUG 000000 OUT: len=2, tag=1, flags=0 1283216169.625 DEBUG 000000 OUT: len=1, tag=1, flags=0 1283216169.625 DEBUG 000000 OUT: len=15, tag=1, flags=2 1283216169.626 DEBUG 000000 done sending frags for 1 1283216169.626 DEBUG 000000 1283216169479 Staging in file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging 1283216169.626 DEBUG 000000 1283216169479 src: file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging, protocol: file, path: localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging 1283216169.627 DEBUG 000000 Opening /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj/_swiftwrap.staging in cwd / ... 
1283216169.628 DEBUG 000000 1283216169479 Opened /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj/_swiftwrap.staging 1283216169.628 DEBUG 000000 Replies: {1 = ARRAY(0x93ce8f0)} 1283216169.628 DEBUG 000000 OUT: len=3, tag=2, flags=0 1283216169.628 DEBUG 000000 OUT: len=78, tag=2, flags=0 1283216169.628 DEBUG 000000 OUT: len=84, tag=2, flags=2 1283216169.628 DEBUG 000000 done sending frags for 2 1283216169.647 DEBUG 000000 Fin flag set 1283216169.653 DEBUG 000000 1283216169479 getFileCBDataIn jobid: 1283216169479, state: 0, tag: 2, err: 4, fin: 0 1283216169.653 DEBUG 000000 Replies: {2 = ARRAY(0x93ce980)} 1283216169.653 DEBUG 000000 OUT: len=9, tag=3, flags=0 1283216169.653 DEBUG 000000 OUT: len=13, tag=3, flags=0 1283216169.653 DEBUG 000000 OUT: len=1, tag=3, flags=0 1283216169.653 DEBUG 000000 OUT: len=3, tag=3, flags=0 1283216169.653 DEBUG 000000 OUT: len=239, tag=3, flags=2 1283216169.653 DEBUG 000000 done sending frags for 3 1283216169.653 DEBUG 000000 Fin flag set 1283216169.653 DEBUG 000000 1283216169479 getFileCBDataIn jobid: 1283216169479, state: 0, tag: 2, err: 4, fin: 2 1283216169.653 DEBUG 000000 Replies: {3 = ARRAY(0x93ce8d0)} 1283216169.653 DEBUG 000000 OUT: len=9, tag=4, flags=0 1283216169.653 DEBUG 000000 OUT: len=13, tag=4, flags=0 1283216169.653 DEBUG 000000 OUT: len=1, tag=4, flags=0 1283216169.653 DEBUG 000000 OUT: len=3, tag=4, flags=0 1283216169.653 DEBUG 000000 OUT: len=2223, tag=4, flags=2 1283216169.653 DEBUG 000000 done sending frags for 4 1283216169.698 DEBUG 000000 Fin flag set 1283216169.698 DEBUG 000000 Fin flag set 1283216169.902 DEBUG 000000 New request (2) 1283216169.902 DEBUG 000000 Fin flag set 1283216169.902 DEBUG 000000 Processing request 1283216169.902 DEBUG 000000 Cmd is SHUTDOWN 1283216169.902 DEBUG 000000 Shutdown command received 1283216169.902 DEBUG 000000 OUT: len=2, tag=2, flags=3 1283216169.902 DEBUG 000000 done sending frags for 2 com$ ----- wilde at mcs.anl.gov wrote: > Mihael, Justin, > > Im trying to use provider-staging for the first time. It seems to be > starting the worker in /, and hence staging in fails right away (on > _swiftwrap). > > Where is the worker supposed to start when using provider staging? > > Ive tried to set the jobdir to /tmp using the tag but that > doesnt seem to be honored. > > Ive tried a few different sites configurations; Im running from > bridled to communicado using ssh. My most recent is: > > > > jobmanager="ssh:local"/> > > > 8 > .07 > 10000 > > file > > > > /tmp/wilde/scratch > > > > - Mike > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Mon Aug 30 20:18:56 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Aug 2010 20:18:56 -0500 Subject: [Swift-devel] Provider staging is failing In-Reply-To: <305821435.71441283216579039.JavaMail.root@zimbra.anl.gov> References: <305821435.71441283216579039.JavaMail.root@zimbra.anl.gov> Message-ID: <1283217536.28556.1.camel@blabla2.none> It's getting some error from the coaster service. I wonder why it isn't being printed. But the coaster/swift log will probably have it. Mihael On Mon, 2010-08-30 at 19:02 -0600, wilde at mcs.anl.gov wrote: > OK, I see now that it is honoring the workdirectory tag. 
(I thought that was not used with provider staging, but seems that it is). > > WHen mkdir was failing I was getting an error code 524; now Im getting an error code 520 - seems to be failing now in the actual transfer of swiftwrap. > > worker log is pasted below. > > - Mike > > com$ cat worker-0830-560709-000000.log > 1283216169.574 INFO - 0830-560709-000000 Logging started: Mon Aug 30 19:56:09 2010 > 1283216169.576 INFO - Running on node communicado.ci.uchicago.edu > 1283216169.576 DEBUG - uri=http://128.135.125.17:50001 > 1283216169.576 DEBUG - scheme=http > 1283216169.576 DEBUG - host=128.135.125.17 > 1283216169.576 DEBUG - port=50001 > 1283216169.576 DEBUG - blockid=0830-560709-000000 > 1283216169.576 INFO - Connecting (0)... > 1283216169.576 DEBUG - Trying 128.135.125.17:50001... > 1283216169.578 INFO - Connected > 1283216169.578 DEBUG - Replies: {} > 1283216169.578 DEBUG - OUT: len=8, tag=0, flags=0 > 1283216169.578 DEBUG - OUT: len=18, tag=0, flags=0 > 1283216169.578 DEBUG - OUT: len=0, tag=0, flags=2 > 1283216169.578 DEBUG - done sending frags for 0 > 1283216169.623 DEBUG - Fin flag set > 1283216169.624 INFO 000000 Registration successful. ID=000000 > 1283216169.624 DEBUG 000000 New request (1) > 1283216169.624 DEBUG 000000 Fin flag set > 1283216169.624 DEBUG 000000 Processing request > 1283216169.625 DEBUG 000000 Cmd is SUBMITJOB > 1283216169.625 INFO 000000 1283216169479 Job info received (tag=1) > 1283216169.625 DEBUG 000000 1283216169479 Job check ok (dir: /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj) > 1283216169.625 INFO 000000 1283216169479 Sending submit reply (tag=1) > 1283216169.625 DEBUG 000000 OUT: len=2, tag=1, flags=3 > 1283216169.625 DEBUG 000000 done sending frags for 1 > 1283216169.625 INFO 000000 1283216169479 Submit reply sent (tag=1) > 1283216169.625 DEBUG 000000 Replies: {} > 1283216169.625 DEBUG 000000 OUT: len=9, tag=1, flags=0 > 1283216169.625 DEBUG 000000 OUT: len=13, tag=1, flags=0 > 1283216169.625 DEBUG 000000 OUT: len=2, tag=1, flags=0 > 1283216169.625 DEBUG 000000 OUT: len=1, tag=1, flags=0 > 1283216169.625 DEBUG 000000 OUT: len=15, tag=1, flags=2 > 1283216169.626 DEBUG 000000 done sending frags for 1 > 1283216169.626 DEBUG 000000 1283216169479 Staging in file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > 1283216169.626 DEBUG 000000 1283216169479 src: file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging, protocol: file, path: localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > 1283216169.627 DEBUG 000000 Opening /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj/_swiftwrap.staging in cwd / > ... 
> 1283216169.628 DEBUG 000000 1283216169479 Opened /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj/_swiftwrap.staging > 1283216169.628 DEBUG 000000 Replies: {1 = ARRAY(0x93ce8f0)} > 1283216169.628 DEBUG 000000 OUT: len=3, tag=2, flags=0 > 1283216169.628 DEBUG 000000 OUT: len=78, tag=2, flags=0 > 1283216169.628 DEBUG 000000 OUT: len=84, tag=2, flags=2 > 1283216169.628 DEBUG 000000 done sending frags for 2 > 1283216169.647 DEBUG 000000 Fin flag set > 1283216169.653 DEBUG 000000 1283216169479 getFileCBDataIn jobid: 1283216169479, state: 0, tag: 2, err: 4, fin: 0 > 1283216169.653 DEBUG 000000 Replies: {2 = ARRAY(0x93ce980)} > 1283216169.653 DEBUG 000000 OUT: len=9, tag=3, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=13, tag=3, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=1, tag=3, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=3, tag=3, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=239, tag=3, flags=2 > 1283216169.653 DEBUG 000000 done sending frags for 3 > 1283216169.653 DEBUG 000000 Fin flag set > 1283216169.653 DEBUG 000000 1283216169479 getFileCBDataIn jobid: 1283216169479, state: 0, tag: 2, err: 4, fin: 2 > 1283216169.653 DEBUG 000000 Replies: {3 = ARRAY(0x93ce8d0)} > 1283216169.653 DEBUG 000000 OUT: len=9, tag=4, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=13, tag=4, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=1, tag=4, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=3, tag=4, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=2223, tag=4, flags=2 > 1283216169.653 DEBUG 000000 done sending frags for 4 > 1283216169.698 DEBUG 000000 Fin flag set > 1283216169.698 DEBUG 000000 Fin flag set > 1283216169.902 DEBUG 000000 New request (2) > 1283216169.902 DEBUG 000000 Fin flag set > 1283216169.902 DEBUG 000000 Processing request > 1283216169.902 DEBUG 000000 Cmd is SHUTDOWN > 1283216169.902 DEBUG 000000 Shutdown command received > 1283216169.902 DEBUG 000000 OUT: len=2, tag=2, flags=3 > 1283216169.902 DEBUG 000000 done sending frags for 2 > com$ > > ----- wilde at mcs.anl.gov wrote: > > > Mihael, Justin, > > > > Im trying to use provider-staging for the first time. It seems to be > > starting the worker in /, and hence staging in fails right away (on > > _swiftwrap). > > > > Where is the worker supposed to start when using provider staging? > > > > Ive tried to set the jobdir to /tmp using the tag but that > > doesnt seem to be honored. > > > > Ive tried a few different sites configurations; Im running from > > bridled to communicado using ssh. My most recent is: > > > > > > > > > jobmanager="ssh:local"/> > > > > > > 8 > > .07 > > 10000 > > > > file > > > > > > > > /tmp/wilde/scratch > > > > > > > > - Mike > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Mon Aug 30 20:26:27 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Mon, 30 Aug 2010 19:26:27 -0600 (GMT-06:00) Subject: [Swift-devel] Provider staging is failing In-Reply-To: <2031665818.71911283217398005.JavaMail.root@zimbra.anl.gov> Message-ID: <1041242549.72071283217987591.JavaMail.root@zimbra.anl.gov> I turned on the TRACE output level in worker.pl. 
I need to dig deeper but it looks to me that the pathnames its trying to fetch are getting mangled/confused with the file:// portion of the URI: org.globus.cog.karajan.workflow.service.ProtocolException: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) The file "/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging" does exist on the client side. I can't yet tell if its really trying to open a pathname of the form "/autonfs/home/wilde/./file:/localhost/home/wilde/etcetc..." Full log is below. - Mike com$ cat worker-0830-070800-000000.log 1283216820.360 INFO - 0830-070800-000000 Logging started: Mon Aug 30 20:07:00 2010 1283216820.362 INFO - Running on node communicado.ci.uchicago.edu 1283216820.362 DEBUG - uri=http://128.135.125.17:50001 1283216820.362 DEBUG - scheme=http 1283216820.362 DEBUG - host=128.135.125.17 1283216820.362 DEBUG - port=50001 1283216820.362 DEBUG - blockid=0830-070800-000000 1283216820.362 INFO - Connecting (0)... 1283216820.362 DEBUG - Trying 128.135.125.17:50001... 1283216820.363 INFO - Connected 1283216820.363 DEBUG - Replies: {} 1283216820.363 DEBUG - OUT: len=8, tag=0, flags=0 1283216820.363 TRACE - REGISTER 1283216820.363 DEBUG - OUT: len=18, tag=0, flags=0 1283216820.363 TRACE - 0830-070800-000000 1283216820.363 DEBUG - OUT: len=0, tag=0, flags=2 1283216820.363 TRACE - 1283216820.363 DEBUG - done sending frags for 0 1283216820.409 TRACE - IN: len=6, actuallen=6, tag=0, flags=3, 000000 1283216820.409 DEBUG - Fin flag set 1283216820.409 INFO 000000 Registration successful. ID=000000 1283216820.410 TRACE 000000 IN: len=9, actuallen=9, tag=1, flags=0, SUBMITJOB 1283216820.410 DEBUG 000000 New request (1) 1283216820.410 TRACE 000000 IN: len=759, actuallen=759, tag=1, flags=2, identity=1283216820263 executable=/bin/bash directory=/home/wilde/swiftwork/catsn-20100830-2006-f0dhgma1-f-cat-fn4k12yj batch=false arg=_swiftwrap.staging arg=-e arg=/bin/cat arg=-out arg=outdir/f.0001.out arg=-err arg=stderr.txt arg=-i arg=-d arg=|outdir arg=-if arg=data.txt arg=-of arg=outdir/f.0001.out arg=-k arg=-cdmfile arg= arg=-status arg=provider arg=-a arg=data.txt stagein=file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging\n_swiftwrap.staging stagein=file://localhost/data.txt\n./data.txt stageout=wrapper.log\nfile://localhost/catsn-20100830-2006-f0dhgma1.d/cat-fn4k12yj.info stageout=./outdir/f.0001.out\nfile://localhost/outdir/f.0001.out cleanup=. 
contact=communicado.ci.uchicago.edu provider=coaster jm=ssh:local 1283216820.410 DEBUG 000000 Fin flag set 1283216820.410 DEBUG 000000 Processing request 1283216820.410 DEBUG 000000 Cmd is SUBMITJOB 1283216820.410 INFO 000000 1283216820263 Job info received (tag=1) 1283216820.411 DEBUG 000000 1283216820263 Job check ok (dir: /home/wilde/swiftwork/catsn-20100830-2006-f0dhgma1-f-cat-fn4k12yj) 1283216820.411 INFO 000000 1283216820263 Sending submit reply (tag=1) 1283216820.411 DEBUG 000000 OUT: len=2, tag=1, flags=3 1283216820.411 TRACE 000000 OK 1283216820.411 DEBUG 000000 done sending frags for 1 1283216820.411 INFO 000000 1283216820263 Submit reply sent (tag=1) 1283216820.411 DEBUG 000000 Replies: {} 1283216820.411 DEBUG 000000 OUT: len=9, tag=1, flags=0 1283216820.411 TRACE 000000 JOBSTATUS 1283216820.411 DEBUG 000000 OUT: len=13, tag=1, flags=0 1283216820.411 TRACE 000000 1283216820263 1283216820.411 DEBUG 000000 OUT: len=2, tag=1, flags=0 1283216820.411 TRACE 000000 16 1283216820.411 DEBUG 000000 OUT: len=1, tag=1, flags=0 1283216820.411 TRACE 000000 0 1283216820.411 DEBUG 000000 OUT: len=15, tag=1, flags=2 1283216820.411 TRACE 000000 workerid=000000 1283216820.411 DEBUG 000000 done sending frags for 1 1283216820.411 DEBUG 000000 1283216820263 Staging in file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging 1283216820.411 DEBUG 000000 1283216820263 src: file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging, protocol: file, path: localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging 1283216820.413 DEBUG 000000 Opening /home/wilde/swiftwork/catsn-20100830-2006-f0dhgma1-f-cat-fn4k12yj/_swiftwrap.staging in cwd / ... 1283216820.414 DEBUG 000000 1283216820263 Opened /home/wilde/swiftwork/catsn-20100830-2006-f0dhgma1-f-cat-fn4k12yj/_swiftwrap.staging 1283216820.414 DEBUG 000000 Replies: {1 = ARRAY(0xe3c4060)} 1283216820.414 DEBUG 000000 OUT: len=3, tag=2, flags=0 1283216820.414 TRACE 000000 GET 1283216820.414 DEBUG 000000 OUT: len=78, tag=2, flags=0 1283216820.414 TRACE 000000 file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging 1283216820.414 DEBUG 000000 OUT: len=84, tag=2, flags=2 1283216820.414 TRACE 000000 /home/wilde/swiftwork/catsn-20100830-2006-f0dhgma1-f-cat-fn4k12yj/_swiftwrap.staging 1283216820.414 DEBUG 000000 done sending frags for 2 1283216820.433 TRACE 000000 IN: len=2, actuallen=2, tag=1, flags=3, OK 1283216820.433 DEBUG 000000 Fin flag set 1283216820.438 TRACE 000000 IN: len=216, actuallen=216, tag=2, flags=5, org.globus.cog.karajan.workflow.service.ProtocolException: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) 1283216820.438 DEBUG 000000 1283216820263 getFileCBDataIn jobid: 1283216820263, state: 0, tag: 2, err: 4, fin: 0 1283216820.438 DEBUG 000000 Replies: {2 = ARRAY(0xe3c40f0)} 1283216820.438 DEBUG 000000 OUT: len=9, tag=3, flags=0 1283216820.438 TRACE 000000 JOBSTATUS 1283216820.438 DEBUG 000000 OUT: len=13, tag=3, flags=0 1283216820.438 TRACE 000000 1283216820263 1283216820.438 DEBUG 000000 OUT: len=1, tag=3, flags=0 1283216820.438 TRACE 000000 5 1283216820.438 DEBUG 000000 OUT: len=3, tag=3, flags=0 1283216820.438 TRACE 000000 520 1283216820.438 DEBUG 000000 OUT: len=239, tag=3, flags=2 1283216820.438 TRACE 000000 Error staging in file: org.globus.cog.karajan.workflow.service.ProtocolException: java.io.FileNotFoundException: 
/autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) 1283216820.438 DEBUG 000000 done sending frags for 3 1283216820.438 TRACE 000000 IN: len=2200, actuallen=2200, tag=2, flags=7, org.globus.cog.karajan.workflow.service.ProtocolException: org.globus.cog.karajan.workflow.service.ProtocolException: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) at org.globus.cog.abstraction.impl.file.coaster.handlers.GetFileHandler.requestComplete(GetFileHandler.java:41) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:387) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:159) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:368) Caused by: org.globus.cog.karajan.workflow.service.ProtocolException: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) at org.globus.cog.abstraction.impl.file.coaster.handlers.GetFileHandler.send(GetFileHandler.java:64) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.sendReply(RequestHandler.java:37) at org.globus.cog.abstraction.impl.file.coaster.handlers.CoasterFileRequestHandler.sendReply(CoasterFileRequestHandler.java:45) at org.globus.cog.abstraction.impl.file.coaster.handlers.GetFileHandler.requestComplete(GetFileHandler.java:38) ... 4 more Caused by: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.(FileInputStream.java:106) at org.globus.cog.abstraction.impl.file.coaster.handlers.providers.LocalIOProvider$Reader.(LocalIOProvider.java:120) at org.globus.cog.abstraction.impl.file.coaster.handlers.providers.LocalIOProvider.pull(LocalIOProvider.java:35) at org.globus.cog.abstraction.impl.file.coaster.handlers.GetFileHandler.send(GetFileHandler.java:60) ... 
7 more 1283216820.438 DEBUG 000000 Fin flag set 1283216820.438 DEBUG 000000 1283216820263 getFileCBDataIn jobid: 1283216820263, state: 0, tag: 2, err: 4, fin: 2 1283216820.438 DEBUG 000000 Replies: {3 = ARRAY(0xe3c4040)} 1283216820.438 DEBUG 000000 OUT: len=9, tag=4, flags=0 1283216820.438 TRACE 000000 JOBSTATUS 1283216820.438 DEBUG 000000 OUT: len=13, tag=4, flags=0 1283216820.438 TRACE 000000 1283216820263 1283216820.438 DEBUG 000000 OUT: len=1, tag=4, flags=0 1283216820.438 TRACE 000000 5 1283216820.438 DEBUG 000000 OUT: len=3, tag=4, flags=0 1283216820.438 TRACE 000000 520 1283216820.438 DEBUG 000000 OUT: len=2223, tag=4, flags=2 1283216820.438 TRACE 000000 Error staging in file: org.globus.cog.karajan.workflow.service.ProtocolException: org.globus.cog.karajan.workflow.service.ProtocolException: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) at org.globus.cog.abstraction.impl.file.coaster.handlers.GetFileHandler.requestComplete(GetFileHandler.java:41) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:387) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:159) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:368) Caused by: org.globus.cog.karajan.workflow.service.ProtocolException: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) at org.globus.cog.abstraction.impl.file.coaster.handlers.GetFileHandler.send(GetFileHandler.java:64) at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.sendReply(RequestHandler.java:37) at org.globus.cog.abstraction.impl.file.coaster.handlers.CoasterFileRequestHandler.sendReply(CoasterFileRequestHandler.java:45) at org.globus.cog.abstraction.impl.file.coaster.handlers.GetFileHandler.requestComplete(GetFileHandler.java:38) ... 4 more Caused by: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.(FileInputStream.java:106) at org.globus.cog.abstraction.impl.file.coaster.handlers.providers.LocalIOProvider$Reader.(LocalIOProvider.java:120) at org.globus.cog.abstraction.impl.file.coaster.handlers.providers.LocalIOProvider.pull(LocalIOProvider.java:35) at org.globus.cog.abstraction.impl.file.coaster.handlers.GetFileHandler.send(GetFileHandler.java:60) ... 
7 more 1283216820.438 DEBUG 000000 done sending frags for 4 1283216820.482 TRACE 000000 IN: len=2, actuallen=2, tag=3, flags=3, OK 1283216820.482 DEBUG 000000 Fin flag set 1283216820.482 TRACE 000000 IN: len=2, actuallen=2, tag=4, flags=3, OK 1283216820.482 DEBUG 000000 Fin flag set 1283216820.687 TRACE 000000 IN: len=8, actuallen=8, tag=2, flags=2, SHUTDOWN 1283216820.687 DEBUG 000000 New request (2) 1283216820.687 DEBUG 000000 Fin flag set 1283216820.687 DEBUG 000000 Processing request 1283216820.687 DEBUG 000000 Cmd is SHUTDOWN 1283216820.687 DEBUG 000000 Shutdown command received 1283216820.687 DEBUG 000000 OUT: len=2, tag=2, flags=3 1283216820.687 TRACE 000000 OK 1283216820.687 DEBUG 000000 done sending frags for 2 com$ ----- wilde at mcs.anl.gov wrote: > OK, I see now that it is honoring the workdirectory tag. (I thought > that was not used with provider staging, but seems that it is). > > WHen mkdir was failing I was getting an error code 524; now Im > getting an error code 520 - seems to be failing now in the actual > transfer of swiftwrap. > > worker log is pasted below. > > - Mike > > com$ cat worker-0830-560709-000000.log > 1283216169.574 INFO - 0830-560709-000000 Logging started: Mon Aug 30 > 19:56:09 2010 > 1283216169.576 INFO - Running on node communicado.ci.uchicago.edu > 1283216169.576 DEBUG - uri=http://128.135.125.17:50001 > 1283216169.576 DEBUG - scheme=http > 1283216169.576 DEBUG - host=128.135.125.17 > 1283216169.576 DEBUG - port=50001 > 1283216169.576 DEBUG - blockid=0830-560709-000000 > 1283216169.576 INFO - Connecting (0)... > 1283216169.576 DEBUG - Trying 128.135.125.17:50001... > 1283216169.578 INFO - Connected > 1283216169.578 DEBUG - Replies: {} > 1283216169.578 DEBUG - OUT: len=8, tag=0, flags=0 > 1283216169.578 DEBUG - OUT: len=18, tag=0, flags=0 > 1283216169.578 DEBUG - OUT: len=0, tag=0, flags=2 > 1283216169.578 DEBUG - done sending frags for 0 > 1283216169.623 DEBUG - Fin flag set > 1283216169.624 INFO 000000 Registration successful. ID=000000 > 1283216169.624 DEBUG 000000 New request (1) > 1283216169.624 DEBUG 000000 Fin flag set > 1283216169.624 DEBUG 000000 Processing request > 1283216169.625 DEBUG 000000 Cmd is SUBMITJOB > 1283216169.625 INFO 000000 1283216169479 Job info received (tag=1) > 1283216169.625 DEBUG 000000 1283216169479 Job check ok (dir: > /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj) > 1283216169.625 INFO 000000 1283216169479 Sending submit reply > (tag=1) > 1283216169.625 DEBUG 000000 OUT: len=2, tag=1, flags=3 > 1283216169.625 DEBUG 000000 done sending frags for 1 > 1283216169.625 INFO 000000 1283216169479 Submit reply sent (tag=1) > 1283216169.625 DEBUG 000000 Replies: {} > 1283216169.625 DEBUG 000000 OUT: len=9, tag=1, flags=0 > 1283216169.625 DEBUG 000000 OUT: len=13, tag=1, flags=0 > 1283216169.625 DEBUG 000000 OUT: len=2, tag=1, flags=0 > 1283216169.625 DEBUG 000000 OUT: len=1, tag=1, flags=0 > 1283216169.625 DEBUG 000000 OUT: len=15, tag=1, flags=2 > 1283216169.626 DEBUG 000000 done sending frags for 1 > 1283216169.626 DEBUG 000000 1283216169479 Staging in > file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > 1283216169.626 DEBUG 000000 1283216169479 src: > file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging, > protocol: file, path: > localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > 1283216169.627 DEBUG 000000 Opening > /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj/_swiftwrap.staging > in cwd / > ... 
> 1283216169.628 DEBUG 000000 1283216169479 Opened > /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj/_swiftwrap.staging > 1283216169.628 DEBUG 000000 Replies: {1 = ARRAY(0x93ce8f0)} > 1283216169.628 DEBUG 000000 OUT: len=3, tag=2, flags=0 > 1283216169.628 DEBUG 000000 OUT: len=78, tag=2, flags=0 > 1283216169.628 DEBUG 000000 OUT: len=84, tag=2, flags=2 > 1283216169.628 DEBUG 000000 done sending frags for 2 > 1283216169.647 DEBUG 000000 Fin flag set > 1283216169.653 DEBUG 000000 1283216169479 getFileCBDataIn jobid: > 1283216169479, state: 0, tag: 2, err: 4, fin: 0 > 1283216169.653 DEBUG 000000 Replies: {2 = ARRAY(0x93ce980)} > 1283216169.653 DEBUG 000000 OUT: len=9, tag=3, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=13, tag=3, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=1, tag=3, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=3, tag=3, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=239, tag=3, flags=2 > 1283216169.653 DEBUG 000000 done sending frags for 3 > 1283216169.653 DEBUG 000000 Fin flag set > 1283216169.653 DEBUG 000000 1283216169479 getFileCBDataIn jobid: > 1283216169479, state: 0, tag: 2, err: 4, fin: 2 > 1283216169.653 DEBUG 000000 Replies: {3 = ARRAY(0x93ce8d0)} > 1283216169.653 DEBUG 000000 OUT: len=9, tag=4, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=13, tag=4, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=1, tag=4, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=3, tag=4, flags=0 > 1283216169.653 DEBUG 000000 OUT: len=2223, tag=4, flags=2 > 1283216169.653 DEBUG 000000 done sending frags for 4 > 1283216169.698 DEBUG 000000 Fin flag set > 1283216169.698 DEBUG 000000 Fin flag set > 1283216169.902 DEBUG 000000 New request (2) > 1283216169.902 DEBUG 000000 Fin flag set > 1283216169.902 DEBUG 000000 Processing request > 1283216169.902 DEBUG 000000 Cmd is SHUTDOWN > 1283216169.902 DEBUG 000000 Shutdown command received > 1283216169.902 DEBUG 000000 OUT: len=2, tag=2, flags=3 > 1283216169.902 DEBUG 000000 done sending frags for 2 > com$ > > ----- wilde at mcs.anl.gov wrote: > > > Mihael, Justin, > > > > Im trying to use provider-staging for the first time. It seems to > be > > starting the worker in /, and hence staging in fails right away (on > > _swiftwrap). > > > > Where is the worker supposed to start when using provider staging? > > > > Ive tried to set the jobdir to /tmp using the tag but > that > > doesnt seem to be honored. > > > > Ive tried a few different sites configurations; Im running from > > bridled to communicado using ssh. 
My most recent is: > > > > > > > > > jobmanager="ssh:local"/> > > > > > > 8 > > .07 > > 10000 > > > > file > > > > > > > > /tmp/wilde/scratch > > > > > > > > - Mike > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Mon Aug 30 20:31:41 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Aug 2010 20:31:41 -0500 Subject: [Swift-devel] Provider staging is failing In-Reply-To: <1041242549.72071283217987591.JavaMail.root@zimbra.anl.gov> References: <1041242549.72071283217987591.JavaMail.root@zimbra.anl.gov> Message-ID: <1283218301.28556.12.camel@blabla2.none> On Mon, 2010-08-30 at 19:26 -0600, wilde at mcs.anl.gov wrote: > I turned on the TRACE output level in worker.pl. I need to dig deeper but it looks to me that the pathnames its trying to fetch are getting mangled/confused with the file:// portion of the URI: > > org.globus.cog.karajan.workflow.service.ProtocolException: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) > > The file "/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging" does exist on the client side. Seems to. I gather "file" is broken. Can you try "proxy", and see if it fails? If not, I'll know a bit better where to look. Mihael From wilde at mcs.anl.gov Mon Aug 30 20:32:57 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 30 Aug 2010 19:32:57 -0600 (GMT-06:00) Subject: [Swift-devel] Provider staging is failing In-Reply-To: <1283217536.28556.1.camel@blabla2.none> Message-ID: <1262974924.72251283218377222.JavaMail.root@zimbra.anl.gov> coaster.log says same as the java error from the worker log: 2010-08-30 20:07:00,430-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1283216814307-1\ 283216820262-1283216820263) setting status to Submitted 2010-08-30 20:07:00,430-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1283216814307-1\ 283216820262-1283216820263) setting status to Stagein workerid=000000 2010-08-30 20:07:00,480-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1283216814307-1\ 283216820262-1283216820263) setting status to Active 2010-08-30 20:07:00,480-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:1283216814307-1\ 283216820262-1283216820263) setting status to Failed Error staging in file: org.globus.cog.karajan\ .workflow.service.ProtocolException: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/lo\ calhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory) 2010-08-30 20:07:00,480-0500 INFO Cpu 0830-070800-000000:0 jobTerminated 2010-08-30 20:07:00,480-0500 INFO Cpu 0830-070800-000000:0 pull 2010-08-30 20:07:00,686-0500 INFO BlockQueueProcessor Shutting down blocks ----- "Mihael Hategan" wrote: > It's getting some error from the coaster service. I wonder why it > isn't > being printed. But the coaster/swift log will probably have it. 
> > Mihael > > On Mon, 2010-08-30 at 19:02 -0600, wilde at mcs.anl.gov wrote: > > OK, I see now that it is honoring the workdirectory tag. (I thought > that was not used with provider staging, but seems that it is). > > > > WHen mkdir was failing I was getting an error code 524; now Im > getting an error code 520 - seems to be failing now in the actual > transfer of swiftwrap. > > > > worker log is pasted below. > > > > - Mike > > > > com$ cat worker-0830-560709-000000.log > > 1283216169.574 INFO - 0830-560709-000000 Logging started: Mon Aug > 30 19:56:09 2010 > > 1283216169.576 INFO - Running on node communicado.ci.uchicago.edu > > 1283216169.576 DEBUG - uri=http://128.135.125.17:50001 > > 1283216169.576 DEBUG - scheme=http > > 1283216169.576 DEBUG - host=128.135.125.17 > > 1283216169.576 DEBUG - port=50001 > > 1283216169.576 DEBUG - blockid=0830-560709-000000 > > 1283216169.576 INFO - Connecting (0)... > > 1283216169.576 DEBUG - Trying 128.135.125.17:50001... > > 1283216169.578 INFO - Connected > > 1283216169.578 DEBUG - Replies: {} > > 1283216169.578 DEBUG - OUT: len=8, tag=0, flags=0 > > 1283216169.578 DEBUG - OUT: len=18, tag=0, flags=0 > > 1283216169.578 DEBUG - OUT: len=0, tag=0, flags=2 > > 1283216169.578 DEBUG - done sending frags for 0 > > 1283216169.623 DEBUG - Fin flag set > > 1283216169.624 INFO 000000 Registration successful. ID=000000 > > 1283216169.624 DEBUG 000000 New request (1) > > 1283216169.624 DEBUG 000000 Fin flag set > > 1283216169.624 DEBUG 000000 Processing request > > 1283216169.625 DEBUG 000000 Cmd is SUBMITJOB > > 1283216169.625 INFO 000000 1283216169479 Job info received (tag=1) > > 1283216169.625 DEBUG 000000 1283216169479 Job check ok (dir: > /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj) > > 1283216169.625 INFO 000000 1283216169479 Sending submit reply > (tag=1) > > 1283216169.625 DEBUG 000000 OUT: len=2, tag=1, flags=3 > > 1283216169.625 DEBUG 000000 done sending frags for 1 > > 1283216169.625 INFO 000000 1283216169479 Submit reply sent (tag=1) > > 1283216169.625 DEBUG 000000 Replies: {} > > 1283216169.625 DEBUG 000000 OUT: len=9, tag=1, flags=0 > > 1283216169.625 DEBUG 000000 OUT: len=13, tag=1, flags=0 > > 1283216169.625 DEBUG 000000 OUT: len=2, tag=1, flags=0 > > 1283216169.625 DEBUG 000000 OUT: len=1, tag=1, flags=0 > > 1283216169.625 DEBUG 000000 OUT: len=15, tag=1, flags=2 > > 1283216169.626 DEBUG 000000 done sending frags for 1 > > 1283216169.626 DEBUG 000000 1283216169479 Staging in > file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > > 1283216169.626 DEBUG 000000 1283216169479 src: > file://localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging, > protocol: file, path: > localhost//home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > > 1283216169.627 DEBUG 000000 Opening > /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj/_swiftwrap.staging > in cwd / > > ... 
> > 1283216169.628 DEBUG 000000 1283216169479 Opened > /home/wilde/swiftwork/catsn-20100830-1956-z61dqk05-d-cat-duz412yj/_swiftwrap.staging > > 1283216169.628 DEBUG 000000 Replies: {1 = ARRAY(0x93ce8f0)} > > 1283216169.628 DEBUG 000000 OUT: len=3, tag=2, flags=0 > > 1283216169.628 DEBUG 000000 OUT: len=78, tag=2, flags=0 > > 1283216169.628 DEBUG 000000 OUT: len=84, tag=2, flags=2 > > 1283216169.628 DEBUG 000000 done sending frags for 2 > > 1283216169.647 DEBUG 000000 Fin flag set > > 1283216169.653 DEBUG 000000 1283216169479 getFileCBDataIn jobid: > 1283216169479, state: 0, tag: 2, err: 4, fin: 0 > > 1283216169.653 DEBUG 000000 Replies: {2 = ARRAY(0x93ce980)} > > 1283216169.653 DEBUG 000000 OUT: len=9, tag=3, flags=0 > > 1283216169.653 DEBUG 000000 OUT: len=13, tag=3, flags=0 > > 1283216169.653 DEBUG 000000 OUT: len=1, tag=3, flags=0 > > 1283216169.653 DEBUG 000000 OUT: len=3, tag=3, flags=0 > > 1283216169.653 DEBUG 000000 OUT: len=239, tag=3, flags=2 > > 1283216169.653 DEBUG 000000 done sending frags for 3 > > 1283216169.653 DEBUG 000000 Fin flag set > > 1283216169.653 DEBUG 000000 1283216169479 getFileCBDataIn jobid: > 1283216169479, state: 0, tag: 2, err: 4, fin: 2 > > 1283216169.653 DEBUG 000000 Replies: {3 = ARRAY(0x93ce8d0)} > > 1283216169.653 DEBUG 000000 OUT: len=9, tag=4, flags=0 > > 1283216169.653 DEBUG 000000 OUT: len=13, tag=4, flags=0 > > 1283216169.653 DEBUG 000000 OUT: len=1, tag=4, flags=0 > > 1283216169.653 DEBUG 000000 OUT: len=3, tag=4, flags=0 > > 1283216169.653 DEBUG 000000 OUT: len=2223, tag=4, flags=2 > > 1283216169.653 DEBUG 000000 done sending frags for 4 > > 1283216169.698 DEBUG 000000 Fin flag set > > 1283216169.698 DEBUG 000000 Fin flag set > > 1283216169.902 DEBUG 000000 New request (2) > > 1283216169.902 DEBUG 000000 Fin flag set > > 1283216169.902 DEBUG 000000 Processing request > > 1283216169.902 DEBUG 000000 Cmd is SHUTDOWN > > 1283216169.902 DEBUG 000000 Shutdown command received > > 1283216169.902 DEBUG 000000 OUT: len=2, tag=2, flags=3 > > 1283216169.902 DEBUG 000000 done sending frags for 2 > > com$ > > > > ----- wilde at mcs.anl.gov wrote: > > > > > Mihael, Justin, > > > > > > Im trying to use provider-staging for the first time. It seems to > be > > > starting the worker in /, and hence staging in fails right away > (on > > > _swiftwrap). > > > > > > Where is the worker supposed to start when using provider > staging? > > > > > > Ive tried to set the jobdir to /tmp using the tag but > that > > > doesnt seem to be honored. > > > > > > Ive tried a few different sites configurations; Im running from > > > bridled to communicado using ssh. 
My most recent is: > > > > > > > > > > > > url="communicado.ci.uchicago.edu" > > > jobmanager="ssh:local"/> > > > > > > > > > 8 > > > .07 > > > key="initialScore">10000 > > > > > > file > > > > > > > > > > > > /tmp/wilde/scratch > > > > > > > > > > > > - Mike > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Mon Aug 30 20:39:26 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 30 Aug 2010 19:39:26 -0600 (GMT-06:00) Subject: [Swift-devel] Provider staging is failing In-Reply-To: <1283218301.28556.12.camel@blabla2.none> Message-ID: <1984765092.72371283218766807.JavaMail.root@zimbra.anl.gov> WIth proxy the stageins seem to complete. Then a get a 254 when it tries to run; Im looking at that now: 1283218480.397 DEBUG 000000 CWD: / 1283218480.397 DEBUG 000000 Running /bin/bash 1283218480.397 DEBUG 000000 Directory: /home/wilde/swiftwork/catsn-20100830-2034-hotqv61h-o-cat-oy\ un22yj 1283218480.397 DEBUG 000000 Command: _swiftwrap.staging -e /bin/cat -out outdir/f.0001.out -err st\ derr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k -cdmfile -status provider -a data.tx\ t 1283218480.397 DEBUG 000000 Command: /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.0001.o\ ut -err stderr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k -cdmfile -status provider \ -a data.txt 1283218480.397 DEBUG 000000 1283218479990 Forked process 17949. Waiting for its completion 1283218480.408 DEBUG 000000 Checking jobs status (1 active) 1283218480.408 DEBUG 000000 1283218479990 Checking pid 17949 1283218480.408 DEBUG 000000 1283218479990 Job 17949 still running 1283218480.408 TRACE 000000 IN: len=2, actuallen=2, tag=4, flags=3, OK 1283218480.408 DEBUG 000000 Fin flag set 1283218480.508 DEBUG 000000 Checking jobs status (1 active) 1283218480.508 DEBUG 000000 1283218479990 Checking pid 17949 1283218480.508 DEBUG 000000 1283218479990 Child process 17949 terminated. Status is 254. - Mike ----- "Mihael Hategan" wrote: > On Mon, 2010-08-30 at 19:26 -0600, wilde at mcs.anl.gov wrote: > > I turned on the TRACE output level in worker.pl. I need to dig > deeper but it looks to me that the pathnames its trying to fetch are > getting mangled/confused with the file:// portion of the URI: > > > > org.globus.cog.karajan.workflow.service.ProtocolException: > java.io.FileNotFoundException: > /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > (No such file or directory) > > > > The file > "/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging" does > exist on the client side. > > Seems to. I gather "file" is broken. > > Can you try "proxy", and see if it fails? If not, I'll know a bit > better > where to look. 
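Mihael's "try proxy" suggestion amounts to flipping the staging-method value in the sites entry and rerunning the same test. A sketch of doing that mechanically; the profile key and namespace shown here (namespace="swift", key="stagingMethod") are an assumption, since the archive stripped the tags from the sites entries quoted above, and the tc/script names and arguments are placeholders:

# Sketch: run the same catsn test once with "file" and once with "proxy"
# provider staging. Assumes sites.xml carries something like
#   <profile namespace="swift" key="stagingMethod">file</profile>
for m in file proxy; do
    sed 's|\(key="stagingMethod">\)[^<]*|\1'"$m"'|' sites.xml > sites.$m.xml
    swift -sites.file sites.$m.xml -tc.file tc.data catsn.swift -n=1
done

If "proxy" succeeds where "file" fails, that narrows the problem to the file: URI handling rather than the staging machinery as a whole.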
> > Mihael -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Mon Aug 30 20:41:19 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 30 Aug 2010 19:41:19 -0600 (GMT-06:00) Subject: [Swift-devel] Provider staging is failing In-Reply-To: <1984765092.72371283218766807.JavaMail.root@zimbra.anl.gov> Message-ID: <432515376.72421283218879072.JavaMail.root@zimbra.anl.gov> _swiftwrap.staging didnt sem to get marked executable: com$ ls -l /home/wilde/swiftwork/catsn-20100830-2034-hotqv61h-o-cat-oy\ > un22yj total 20 -rw-r--r-- 1 wilde ci-users 0 Aug 30 20:34 3 -rw-r--r-- 1 wilde ci-users 5894 Aug 30 20:34 _swiftwrap.staging -rw-r--r-- 1 wilde ci-users 24 Aug 30 20:34 data.txt -rw-r--r-- 1 wilde ci-users 6731 Aug 30 20:34 wrapper.log com$ - mike ----- "Michael Wilde" wrote: > WIth proxy the stageins seem to complete. Then a get a 254 when it > tries to run; Im looking at that now: > > 1283218480.397 DEBUG 000000 CWD: / > 1283218480.397 DEBUG 000000 Running /bin/bash > 1283218480.397 DEBUG 000000 Directory: > /home/wilde/swiftwork/catsn-20100830-2034-hotqv61h-o-cat-oy\ > un22yj > 1283218480.397 DEBUG 000000 Command: _swiftwrap.staging -e /bin/cat > -out outdir/f.0001.out -err st\ > derr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k -cdmfile > -status provider -a data.tx\ > t > 1283218480.397 DEBUG 000000 Command: /bin/bash _swiftwrap.staging -e > /bin/cat -out outdir/f.0001.o\ > ut -err stderr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k > -cdmfile -status provider \ > -a data.txt > 1283218480.397 DEBUG 000000 1283218479990 Forked process 17949. > Waiting for its completion > 1283218480.408 DEBUG 000000 Checking jobs status (1 active) > 1283218480.408 DEBUG 000000 1283218479990 Checking pid 17949 > 1283218480.408 DEBUG 000000 1283218479990 Job 17949 still running > 1283218480.408 TRACE 000000 IN: len=2, actuallen=2, tag=4, flags=3, > OK > 1283218480.408 DEBUG 000000 Fin flag set > 1283218480.508 DEBUG 000000 Checking jobs status (1 active) > 1283218480.508 DEBUG 000000 1283218479990 Checking pid 17949 > 1283218480.508 DEBUG 000000 1283218479990 Child process 17949 > terminated. Status is 254. > > > - Mike > > ----- "Mihael Hategan" wrote: > > > On Mon, 2010-08-30 at 19:26 -0600, wilde at mcs.anl.gov wrote: > > > I turned on the TRACE output level in worker.pl. I need to dig > > deeper but it looks to me that the pathnames its trying to fetch > are > > getting mangled/confused with the file:// portion of the URI: > > > > > > org.globus.cog.karajan.workflow.service.ProtocolException: > > java.io.FileNotFoundException: > > > /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > > (No such file or directory) > > > > > > The file > > "/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging" > does > > exist on the client side. > > > > Seems to. I gather "file" is broken. > > > > Can you try "proxy", and see if it fails? If not, I'll know a bit > > better > > where to look. 
> > > > Mihael > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Mon Aug 30 21:49:47 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 30 Aug 2010 21:49:47 -0500 (Central Daylight Time) Subject: [Swift-devel] Provider staging is failing In-Reply-To: <432515376.72421283218879072.JavaMail.root@zimbra.anl.gov> References: <432515376.72421283218879072.JavaMail.root@zimbra.anl.gov> Message-ID: I think that's ok. Do you have the wrapper.log/info files? On Mon, 30 Aug 2010, Michael Wilde wrote: > _swiftwrap.staging didnt sem to get marked executable: > ----- "Michael Wilde" wrote: > >> WIth proxy the stageins seem to complete. Then a get a 254 when it >> tries to run; Im looking at that now: >> >> 1283218480.397 DEBUG 000000 CWD: / >> 1283218480.397 DEBUG 000000 Running /bin/bash >> 1283218480.397 DEBUG 000000 Directory: >> /home/wilde/swiftwork/catsn-20100830-2034-hotqv61h-o-cat-oy\ >> un22yj >> 1283218480.397 DEBUG 000000 Command: _swiftwrap.staging -e /bin/cat >> -out outdir/f.0001.out -err st\ >> derr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k -cdmfile >> -status provider -a data.tx\ >> t >> 1283218480.397 DEBUG 000000 Command: /bin/bash _swiftwrap.staging -e >> /bin/cat -out outdir/f.0001.o\ >> ut -err stderr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k >> -cdmfile -status provider \ >> -a data.txt >> 1283218480.397 DEBUG 000000 1283218479990 Forked process 17949. >> Waiting for its completion >> 1283218480.408 DEBUG 000000 Checking jobs status (1 active) >> 1283218480.408 DEBUG 000000 1283218479990 Checking pid 17949 >> 1283218480.408 DEBUG 000000 1283218479990 Job 17949 still running >> 1283218480.408 TRACE 000000 IN: len=2, actuallen=2, tag=4, flags=3, >> OK >> 1283218480.408 DEBUG 000000 Fin flag set >> 1283218480.508 DEBUG 000000 Checking jobs status (1 active) >> 1283218480.508 DEBUG 000000 1283218479990 Checking pid 17949 >> 1283218480.508 DEBUG 000000 1283218479990 Child process 17949 >> terminated. Status is 254. >> >> >> - Mike >> >> ----- "Mihael Hategan" wrote: >> >>> On Mon, 2010-08-30 at 19:26 -0600, wilde at mcs.anl.gov wrote: >>>> I turned on the TRACE output level in worker.pl. I need to dig >>> deeper but it looks to me that the pathnames its trying to fetch >> are >>> getting mangled/confused with the file:// portion of the URI: >>>> >>>> org.globus.cog.karajan.workflow.service.ProtocolException: >>> java.io.FileNotFoundException: >>> >> /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging >>> (No such file or directory) >>>> >>>> The file >>> "/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging" >> does >>> exist on the client side. >>> >>> Seems to. I gather "file" is broken. >>> >>> Can you try "proxy", and see if it fails? If not, I'll know a bit >>> better >>> where to look. 
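Justin's "that's ok" matches what the worker log shows: the job is started as "/bin/bash _swiftwrap.staging ...", so the missing execute bit on the staged-in wrapper is harmless. A quick generic demonstration, nothing Swift-specific about it:

# A script with no execute bit cannot be run directly, but it runs fine
# when handed to the interpreter explicitly, which is exactly how the
# worker invokes _swiftwrap.staging above.
cat > demo.sh <<'EOF'
echo "ran as: $0"
EOF
chmod a-x demo.sh
./demo.sh           # Permission denied
/bin/bash demo.sh   # runs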
>>> >>> Mihael >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory > > -- Justin M Wozniak From wilde at mcs.anl.gov Mon Aug 30 23:07:34 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Mon, 30 Aug 2010 22:07:34 -0600 (GMT-06:00) Subject: [Swift-devel] Provider staging is failing In-Reply-To: <1480310186.74411283227473777.JavaMail.root@zimbra.anl.gov> Message-ID: <2097447996.74471283227654093.JavaMail.root@zimbra.anl.gov> ----- "Justin M Wozniak" wrote: > I think that's ok. Right: Mihael pointed out to me in IM that the exec'ed program is /bin/bash with _swiftwrap.staging as an arg. Digging deeper it looks like _swiftwrap.staging is getting run with this command line: /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.0001.out -err stderr.txt -i -d '|outdir' -if data.txt -of outdir/f.0001.out -k -cdmfile -status provider -a data.txt and the extra "|" separator in the -d 'outdir' arg (quotes mine) is causing a spurious mkdir to get invoked for what would have been the "in dirs" argument.That in turn is causing the ret code 254. I think that extra | separator is not supposed to be there when there are no input directories (as in this case). vdl-int.staging has: "-d", flatten(each(fileDirs)), and I now suspect a null value for the dirs of stagein is not being handled right, somewhere around: fileDirs := fileDirs(stagein, stageout) - Mike > Do you have the wrapper.log/info files? > > On Mon, 30 Aug 2010, Michael Wilde wrote: > > > _swiftwrap.staging didnt sem to get marked executable: > > > > ----- "Michael Wilde" wrote: > > > >> WIth proxy the stageins seem to complete. Then a get a 254 when it > >> tries to run; Im looking at that now: > >> > >> 1283218480.397 DEBUG 000000 CWD: / > >> 1283218480.397 DEBUG 000000 Running /bin/bash > >> 1283218480.397 DEBUG 000000 Directory: > >> /home/wilde/swiftwork/catsn-20100830-2034-hotqv61h-o-cat-oy\ > >> un22yj > >> 1283218480.397 DEBUG 000000 Command: _swiftwrap.staging -e > /bin/cat > >> -out outdir/f.0001.out -err st\ > >> derr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k > -cdmfile > >> -status provider -a data.tx\ > >> t > >> 1283218480.397 DEBUG 000000 Command: /bin/bash _swiftwrap.staging > -e > >> /bin/cat -out outdir/f.0001.o\ > >> ut -err stderr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out > -k > >> -cdmfile -status provider \ > >> -a data.txt > >> 1283218480.397 DEBUG 000000 1283218479990 Forked process 17949. > >> Waiting for its completion > >> 1283218480.408 DEBUG 000000 Checking jobs status (1 active) > >> 1283218480.408 DEBUG 000000 1283218479990 Checking pid 17949 > >> 1283218480.408 DEBUG 000000 1283218479990 Job 17949 still running > >> 1283218480.408 TRACE 000000 IN: len=2, actuallen=2, tag=4, > flags=3, > >> OK > >> 1283218480.408 DEBUG 000000 Fin flag set > >> 1283218480.508 DEBUG 000000 Checking jobs status (1 active) > >> 1283218480.508 DEBUG 000000 1283218479990 Checking pid 17949 > >> 1283218480.508 DEBUG 000000 1283218479990 Child process 17949 > >> terminated. Status is 254. > >> > >> > >> - Mike > >> > >> ----- "Mihael Hategan" wrote: > >> > >>> On Mon, 2010-08-30 at 19:26 -0600, wilde at mcs.anl.gov wrote: > >>>> I turned on the TRACE output level in worker.pl. 
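Mike's reading of the "-d |outdir" argument is easy to reproduce outside Swift: with no input directories the flattened list starts with an empty field, and a bare mkdir -p "" fails, which per his diagnosis is what surfaces as the 254 above. A minimal sketch (illustrative shell, not the actual _swiftwrap.staging code; DIRS is a stand-in name):

# Splitting "-d |outdir" on "|" leaves an empty first field.
DIRS='|outdir'
IFS='|' read -r -a fields <<< "$DIRS"
for d in "${fields[@]}"; do
    echo "field: [$d]"
    mkdir -p "$d"    # fails with a non-zero exit when $d is empty
done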
I need to dig > >>> deeper but it looks to me that the pathnames its trying to fetch > >> are > >>> getting mangled/confused with the file:// portion of the URI: > >>>> > >>>> org.globus.cog.karajan.workflow.service.ProtocolException: > >>> java.io.FileNotFoundException: > >>> > >> > /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > >>> (No such file or directory) > >>>> > >>>> The file > >>> "/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging" > >> does > >>> exist on the client side. > >>> > >>> Seems to. I gather "file" is broken. > >>> > >>> Can you try "proxy", and see if it fails? If not, I'll know a bit > >>> better > >>> where to look. > >>> > >>> Mihael > >> > >> -- > >> Michael Wilde > >> Computation Institute, University of Chicago > >> Mathematics and Computer Science Division > >> Argonne National Laboratory > > > > > > -- > Justin M Wozniak -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Mon Aug 30 23:52:44 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Mon, 30 Aug 2010 22:52:44 -0600 (GMT-06:00) Subject: [Swift-devel] Provider staging is failing In-Reply-To: <581036206.74891283228668929.JavaMail.root@zimbra.anl.gov> Message-ID: <969490906.75291283230364810.JavaMail.root@zimbra.anl.gov> Nope - I was wrong again. The "-d |outdir" form has been generated all along. The problem was that this causes a mkdir -p in _swiftwrap.staging to be invoked with a null value. This was obscured in _swiftwrap, which had a jobdir in front of the null input dir, and was thus silently ignored by mkdir -p. I committed a fix (skip mkdir if dir is null), but please keep an eye on _swiftwrap.staging in case it causes other issues. There was also a typo in a var $STDER -> STDERR. - Mike ----- wilde at mcs.anl.gov wrote: > ----- "Justin M Wozniak" wrote: > > > I think that's ok. > > Right: Mihael pointed out to me in IM that the exec'ed program is > /bin/bash with _swiftwrap.staging as an arg. > > Digging deeper it looks like _swiftwrap.staging is getting run with > this command line: > > /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.0001.out -err > stderr.txt -i -d '|outdir' -if data.txt -of outdir/f.0001.out -k > -cdmfile -status provider -a data.txt > > and the extra "|" separator in the -d 'outdir' arg (quotes mine) is > causing a spurious mkdir to get invoked for what would have been the > "in dirs" argument.That in turn is causing the ret code 254. > > I think that extra | separator is not supposed to be there when there > are no input directories (as in this case). vdl-int.staging has: > "-d", flatten(each(fileDirs)), > and I now suspect a null value for the dirs of stagein is not being > handled right, somewhere around: > fileDirs := fileDirs(stagein, stageout) > > - Mike > > > > > > Do you have the wrapper.log/info files? > > > > On Mon, 30 Aug 2010, Michael Wilde wrote: > > > > > _swiftwrap.staging didnt sem to get marked executable: > > > > > > > ----- "Michael Wilde" wrote: > > > > > >> WIth proxy the stageins seem to complete. 
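The committed change is only described here as "skip mkdir if dir is null", so the following is a sketch of that kind of guard, not the actual diff; the variable name and exit code are placeholders:

# Sketch of the guard described above: ignore empty entries when creating
# the "-d" directories, so "-d |outdir" no longer triggers a bare mkdir.
IFS='|'
for d in $FILEDIRS; do
    [ -n "$d" ] || continue        # skip the null entry
    mkdir -p "$d" || exit 254      # exit code illustrative
done
unset IFS

The $STDER -> $STDERR spelling fix mentioned above is a separate one-character change.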
Then a get a 254 when > it > > >> tries to run; Im looking at that now: > > >> > > >> 1283218480.397 DEBUG 000000 CWD: / > > >> 1283218480.397 DEBUG 000000 Running /bin/bash > > >> 1283218480.397 DEBUG 000000 Directory: > > >> /home/wilde/swiftwork/catsn-20100830-2034-hotqv61h-o-cat-oy\ > > >> un22yj > > >> 1283218480.397 DEBUG 000000 Command: _swiftwrap.staging -e > > /bin/cat > > >> -out outdir/f.0001.out -err st\ > > >> derr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k > > -cdmfile > > >> -status provider -a data.tx\ > > >> t > > >> 1283218480.397 DEBUG 000000 Command: /bin/bash > _swiftwrap.staging > > -e > > >> /bin/cat -out outdir/f.0001.o\ > > >> ut -err stderr.txt -i -d |outdir -if data.txt -of > outdir/f.0001.out > > -k > > >> -cdmfile -status provider \ > > >> -a data.txt > > >> 1283218480.397 DEBUG 000000 1283218479990 Forked process 17949. > > >> Waiting for its completion > > >> 1283218480.408 DEBUG 000000 Checking jobs status (1 active) > > >> 1283218480.408 DEBUG 000000 1283218479990 Checking pid 17949 > > >> 1283218480.408 DEBUG 000000 1283218479990 Job 17949 still > running > > >> 1283218480.408 TRACE 000000 IN: len=2, actuallen=2, tag=4, > > flags=3, > > >> OK > > >> 1283218480.408 DEBUG 000000 Fin flag set > > >> 1283218480.508 DEBUG 000000 Checking jobs status (1 active) > > >> 1283218480.508 DEBUG 000000 1283218479990 Checking pid 17949 > > >> 1283218480.508 DEBUG 000000 1283218479990 Child process 17949 > > >> terminated. Status is 254. > > >> > > >> > > >> - Mike > > >> > > >> ----- "Mihael Hategan" wrote: > > >> > > >>> On Mon, 2010-08-30 at 19:26 -0600, wilde at mcs.anl.gov wrote: > > >>>> I turned on the TRACE output level in worker.pl. I need to dig > > >>> deeper but it looks to me that the pathnames its trying to > fetch > > >> are > > >>> getting mangled/confused with the file:// portion of the URI: > > >>>> > > >>>> org.globus.cog.karajan.workflow.service.ProtocolException: > > >>> java.io.FileNotFoundException: > > >>> > > >> > > > /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging > > >>> (No such file or directory) > > >>>> > > >>>> The file > > >>> "/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging" > > >> does > > >>> exist on the client side. > > >>> > > >>> Seems to. I gather "file" is broken. > > >>> > > >>> Can you try "proxy", and see if it fails? If not, I'll know a > bit > > >>> better > > >>> where to look. > > >>> > > >>> Mihael > > >> > > >> -- > > >> Michael Wilde > > >> Computation Institute, University of Chicago > > >> Mathematics and Computer Science Division > > >> Argonne National Laboratory > > > > > > > > > > -- > > Justin M Wozniak > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Aug 31 10:49:44 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Tue, 31 Aug 2010 09:49:44 -0600 (GMT-06:00) Subject: [Swift-devel] Issues to resolve for new coaster modes In-Reply-To: <125914403.90891283268629993.JavaMail.root@zimbra.anl.gov> Message-ID: <276677401.92181283269784964.JavaMail.root@zimbra.anl.gov> Here's the issues that I am aware of so far. 
Many of these need more discussion and/or debug/reproduction info. - -nosec doesnt seem to work on coaster-service command alternative: a wrapper script automatically installs an insecure SimpleCAbased proxy and necessary CA certs. We're using wrappers anyways, to start manual workers. - need -passive option on coaster-service command (for now, can run a dummy swift job to force coaster service into passive mode?) - file protocol for coaster provider staging seems broken - -port option on coaster-service command seems to print confusing messages and its unclear how the specified port vs the default port is being used (try to probe with netstat). - its unclear how to specify URIs for the service in sites.xml, in particular http: vs https: protocol and ports. Both are printed by the service: which to use? - still need an easy way to enable users to use remote coasters without an X509 certificate - in initial tests, the persistent coaster service seemed to work for a few swift commands, then hung or stopped responding. Workers timed out? Need to reproduce this and check logs. - in initial tests, both David and Mike have seen coaster service go into a rapid heartbeat (~ 1 per sec) mode. - distantly related: verify if ssh to a set of hosts that use the same ssh key only request the passphrase once from the user. If not, can that caching be implemented to avoid multiple passphrase entry? David, do you have additional issues, or clarifications/details on any of these? Mihael, and suggestions for approaches on any of these that are low hanging fruit which people other than you can work on? From hategan at mcs.anl.gov Tue Aug 31 11:00:43 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 31 Aug 2010 11:00:43 -0500 Subject: [Swift-devel] Re: Issues to resolve for new coaster modes In-Reply-To: <276677401.92181283269784964.JavaMail.root@zimbra.anl.gov> References: <276677401.92181283269784964.JavaMail.root@zimbra.anl.gov> Message-ID: <1283270443.30261.1.camel@blabla2.none> I think uI'll need to tackle these, but for low hanging fruits, maybe -nosec and the ssh passphrase. Mihael On Tue, 2010-08-31 at 09:49 -0600, wilde at mcs.anl.gov wrote: > Here's the issues that I am aware of so far. Many of these need more discussion and/or debug/reproduction info. > > - -nosec doesnt seem to work on coaster-service command > > alternative: a wrapper script automatically installs an insecure SimpleCAbased proxy and necessary CA certs. We're using wrappers anyways, to start manual workers. > > - need -passive option on coaster-service command (for now, can run a dummy swift job to force coaster service into passive mode?) > > - file protocol for coaster provider staging seems broken > > - -port option on coaster-service command seems to print confusing messages and its unclear how the specified port vs the default port is being used (try to probe with netstat). > > - its unclear how to specify URIs for the service in sites.xml, in particular http: vs https: protocol and ports. Both are printed by the service: which to use? > > - still need an easy way to enable users to use remote coasters without an X509 certificate > > - in initial tests, the persistent coaster service seemed to work for a few swift commands, then hung or stopped responding. Workers timed out? Need to reproduce this and check logs. > > - in initial tests, both David and Mike have seen coaster service go into a rapid heartbeat (~ 1 per sec) mode. 
> > - distantly related: verify if ssh to a set of hosts that use the same ssh key only request the passphrase once from the user. If not, can that caching be implemented to avoid multiple passphrase entry? > > David, do you have additional issues, or clarifications/details on any of these? > > Mihael, and suggestions for approaches on any of these that are low hanging fruit which people other than you can work on? > > > From dk0966 at cs.ship.edu Tue Aug 31 11:19:28 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Tue, 31 Aug 2010 12:19:28 -0400 Subject: [Swift-devel] Re: Issues to resolve for new coaster modes In-Reply-To: <276677401.92181283269784964.JavaMail.root@zimbra.anl.gov> References: <125914403.90891283268629993.JavaMail.root@zimbra.anl.gov> <276677401.92181283269784964.JavaMail.root@zimbra.anl.gov> Message-ID: Hello, I started looking at the code a bit yesterday. I have coaster-service running with -nosec, but I haven't been able to test it because I think swift itself is also requiring a certificate to start? is this correct, and if so, is there currently a way to disable it? Here is the change I made to get it started (untested) Index: modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/CoasterPersistentService.java =================================================================== --- modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/CoasterPersistentService.java (revision 2868) +++ modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/CoasterPersistentService.java (working copy) @@ -66,13 +66,14 @@ secure = false; } GlobusCredential gc; + GSSCredential cred = null; if (ap.hasValue("proxy")) { gc = new GlobusCredential(ap.getStringValue("proxy")); + cred = new GlobusGSSCredentialImpl(gc, GSSCredential.INITIATE_AND_ACCEPT); } - else { - gc = GlobusCredential.getDefaultCredential(); + else if(secure) { + gc = GlobusCredential.getDefaultCredential(); } - GSSCredential cred = new GlobusGSSCredentialImpl(gc, GSSCredential.INITIATE_AND_ACCEPT); int port = 1984; if (ap.hasValue("port")) { -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Aug 31 11:33:08 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 31 Aug 2010 11:33:08 -0500 Subject: [Swift-devel] Re: Issues to resolve for new coaster modes In-Reply-To: References: <125914403.90891283268629993.JavaMail.root@zimbra.anl.gov> <276677401.92181283269784964.JavaMail.root@zimbra.anl.gov> Message-ID: <1283272388.30415.6.camel@blabla2.none> On Tue, 2010-08-31 at 12:19 -0400, David Kelly wrote: > Hello, > > I started looking at the code a bit yesterday. I have coaster-service > running with -nosec, but I haven't been able to test it because I > think swift itself is also requiring a certificate to start? is this > correct, and if so, is there currently a way to disable it? Right. That's another think that needs a mechanism to disable security. > > Here is the change I made to get it started (untested) It's along those lines. Though I'd leave the meaning of the "proxy" command line arg as it is. 
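One workaround that fits Mihael's "leave the meaning of the proxy command line arg as it is": the service already accepts an explicit credential via -proxy (the ap.hasValue("proxy") branch in the patch above), so a fresh proxy can be written to a known file and handed over directly instead of relying on the default credential lookup. A sketch; the file name is a placeholder and 1984 is just the default port from the source above:

# Sketch: pin the persistent coaster service to an explicitly created
# proxy file instead of the default credential.
grid-proxy-init -valid 72:00 -out /tmp/coaster-proxy.$USER
coaster-service -proxy /tmp/coaster-proxy.$USER -port 1984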
Mihael > > Index: > modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/CoasterPersistentService.java > =================================================================== > --- > modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/CoasterPersistentService.java (revision 2868) > +++ > modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/CoasterPersistentService.java (working copy) > @@ -66,13 +66,14 @@ > secure = false; > } > GlobusCredential gc; > + GSSCredential cred = null; > if (ap.hasValue("proxy")) { > gc = new > GlobusCredential(ap.getStringValue("proxy")); > + cred = new GlobusGSSCredentialImpl(gc, > GSSCredential.INITIATE_AND_ACCEPT); > } > - else { > - gc = GlobusCredential.getDefaultCredential(); > + else if(secure) { > + gc = GlobusCredential.getDefaultCredential(); > } > - GSSCredential cred = new GlobusGSSCredentialImpl(gc, > GSSCredential.INITIATE_AND_ACCEPT); > > int port = 1984; > if (ap.hasValue("port")) { > > From wilde at mcs.anl.gov Tue Aug 31 12:14:21 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 31 Aug 2010 11:14:21 -0600 (GMT-06:00) Subject: [Swift-devel] Coaster persistent service issues - logs In-Reply-To: Message-ID: <543522540.97321283274861849.JavaMail.root@zimbra.anl.gov> ----- Forwarded Message ----- From: "David Kelly" To: "Michael Wilde" Cc: "Jonathan Monette" , "Justin Wozniak" , "Mihael Hategan" Sent: Sunday, August 29, 2010 10:32:07 AM GMT -06:00 US/Canada Central Subject: Re: change skype call time today - and some to-do notes Hello all, A few things I've noticed while trying out various coaster configurations this weekend: Had similar problems with the -nosec option. Here is the output I got: davidk at churn:~/cog/modules/swift/dist/swift-svn/bin$ coaster-service -nosec Error loading credential: [JGLOBUS-10] Expired credentials (DC=org,DC=doegrids,OU=People,CN=David Kelly 16830,CN=753950975). Error loading credential org.globus.gsi.GlobusCredentialException: [JGLOBUS-10] Expired credentials (DC=org,DC=doegrids,OU=People,CN=David Kelly 16830,CN=753950975). at org.globus.gsi.GlobusCredential.verify(GlobusCredential.java:321) at org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:593) at org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575) at org.globus.cog.abstraction.coaster.service.CoasterPersistentService.main(CoasterPersistentService.java:73) I tested multiple connections when using a coasters-persistent+active mode. That seemed to have worked fine, with each new swift connection waiting for the previous to finish. 
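The [JGLOBUS-10] "Expired credentials" failure earlier in this message is the stock error for a stale default proxy, so a pre-flight check on whichever host starts the service (or swift) avoids one source of confusion while the -nosec path is being sorted out. Plain Globus commands, nothing Swift-specific:

# Pre-flight: make sure the default proxy is valid for at least 12 hours,
# otherwise create a fresh one before starting coaster-service or swift.
grid-proxy-info -exists -valid 12:00 || grid-proxy-init -valid 72:00
grid-proxy-info -subject -timeleft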
I noticed there were some java exceptions in the log files: org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Channel IOException java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) at java.net.SocketOutputStream.write(SocketOutputStream.java:124) at org.globus.gsi.gssapi.net.impl.GSIGssOutputStream.writeToken(GSIGssOutputStream.java:61) at org.globus.gsi.gssapi.net.impl.GSIGssOutputStream.flush(GSIGssOutputStream.java:45) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:298) at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:247) Not sure if this important or not, but I will include the logs. I can't quite get coasters-persistent working in passive mode. I am not sure if this if a configuration issue, a swift issue, or operator error. Here is what I am trying to do: sites.xml: passive 1 3500 1 1 1 .31 10000 /home/davidk/swiftwork/churn I run grid-proxy init on the submit host (login*. mcs.anl.gov ) and on the remote host ( churn.mcs.anl.gov ). From churn I run coaster-service. When I run the catsn.swift script on login, I notice these kind of errors in the coaster-service output: GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT) These messages seem to repeat several times per second and never stop. The script never finishes. The configurations and log files attached. Once I can get this configuration working manually, I will start working on a script to automate this process for multiple hosts to make things a little easier. David -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: coaster-logs.tar.gz Type: application/x-gzip Size: 48618 bytes Desc: not available URL: From zhaozhang at uchicago.edu Tue Aug 31 14:02:58 2010 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 31 Aug 2010 14:02:58 -0500 Subject: [Swift-devel] quick question about delimiter Message-ID: <4C7D51E2.2000801@uchicago.edu> Hi, Mihael Could you point me where we defined the delimiter "|" in the latest version of karajan? It used to be in vdl-int.k, but I lost track of it in the new version of code. Thanks. 
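The "|" Zhao is asking about is the separator used when lists of files and dirs are flattened into single command-line arguments, the same character that shows up in the "-d |outdir" argument earlier in this archive; Mihael names the implementing class in his reply below. A shell-level illustration of the flattening (an analogy only, not a rendering of the Java class):

# Illustration: joining a list with "|", the shape that appears as
# "-d |outdir" when the first element is empty.
flatten() { local IFS='|'; echo "$*"; }
flatten "" outdir           # -> |outdir
flatten a b/c d             # -> a|b/c|d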
best zhao From hategan at mcs.anl.gov Tue Aug 31 15:01:42 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 31 Aug 2010 15:01:42 -0500 Subject: [Swift-devel] Re: quick question about delimiter In-Reply-To: <4C7D51E2.2000801@uchicago.edu> References: <4C7D51E2.2000801@uchicago.edu> Message-ID: <1283284902.30995.0.camel@blabla2.none> org.griphyn.vdl.karajan.lib.Flatten On Tue, 2010-08-31 at 14:02 -0500, Zhao Zhang wrote: > Hi, Mihael > > Could you point me where we defined the delimiter "|" in the latest > version of karajan? It used to be in vdl-int.k, > but I lost track of it in the new version of code. Thanks. > > best > zhao From glen842 at uchicago.edu Fri Aug 13 11:59:34 2010 From: glen842 at uchicago.edu (Glen Hocky) Date: Fri, 13 Aug 2010 16:59:34 -0000 Subject: [Swift-devel] Re: Worker connection In-Reply-To: <1281718490.12095.8.camel@blabla2.none> References: <4C657328.6060304@gmail.com> <1281717359.11891.3.camel@blabla2.none> <4C657638.5080609@gmail.com> <1281718490.12095.8.camel@blabla2.none> Message-ID: The OOPS project uses sed to set that parameter and create the sites file on the fly. it's very effective On Fri, Aug 13, 2010 at 12:54 PM, Mihael Hategan wrote: > On Fri, 2010-08-13 at 11:43 -0500, Jonathan Monette wrote: > > Right now I am using "internalHostname". I was just wondering if an > > should this be changed since I am always changing this entry depending > > if I am on login1 or login2? > > It should, but the question is to what. > > I offer $20 to the first person to find a reliable (that works on all TG > sites + PADS + Intrepid), quick (that does not, by itself, delay worker > startup or the overall workflow by more than a few seconds) and > automated way of figuring out that IP. I reserve the right to refuse a > solution if it does not meet certain propriety criteria that I did not > necessarily specify here. > > (btw you could make a wrapper around swift that detects whether you are > on login1 or login2 and picks one of two sites files and passes that to > swift). > > Mihael > > > > > On 8/13/10 11:35 AM, Mihael Hategan wrote: > > > On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette wrote: > > > > > >> Hello, > > >> How does the worker decide what connection to connect to? Right > > >> now what I think it does is it runs ifconfig and greps the inet > address > > >> and then test each of these connections. Is this correct? When I am > > >> running on PADS it seems that the worker always chooses the wrong > > >> connection to the service. It seems to choose the UBS0 connection > where > > >> the correct connection is the ib0 connection. Is there a way that > maybe > > >> the worker can be fixed to choose a better connection or the correct > > >> connection? This seems to be only happening on PADS. > > >> > > >> > > > That was temporary. Initially it would use the same address as the url > > > in sites.xml. Then I added the "try all interfaces" thing, but in some > > > cases the connect on certain wrong addresses does not fail quickly > > > enough and has to timeout instead, which usually takes a few minutes. > So > > > that got disabled and only the frist address is used now (unless > > > overridden - see below). 
> > > > > > You can say > > key="internalHostname">x.y.z.w in sites.xml > > > > > > Mihael > > > > > > > > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hockyg at gmail.com Fri Aug 13 12:19:03 2010 From: hockyg at gmail.com (Glen Hocky) Date: Fri, 13 Aug 2010 17:19:03 -0000 Subject: [Swift-devel] Re: Worker connection In-Reply-To: <1281719079.12275.0.camel@blabla2.none> References: <4C657328.6060304@gmail.com> <1281717359.11891.3.camel@blabla2.none> <4C657638.5080609@gmail.com> <1281718490.12095.8.camel@blabla2.none> <1281719079.12275.0.camel@blabla2.none> Message-ID: <3415961925365440150@unknownmsgid> By picking the principle ip on the submit node (assumes running on a site from that sites submit host) On Aug 13, 2010, at 1:04 PM, Mihael Hategan wrote: > On Fri, 2010-08-13 at 12:59 -0400, Glen Hocky wrote: >> The OOPS project uses sed to set that parameter and create the sites >> file on the fly. it's very effective > > How was the IP picked? > >> >> On Fri, Aug 13, 2010 at 12:54 PM, Mihael Hategan >> wrote: >> On Fri, 2010-08-13 at 11:43 -0500, Jonathan Monette wrote: >>> Right now I am using "internalHostname". I was just >> wondering if an >>> should this be changed since I am always changing this entry >> depending >>> if I am on login1 or login2? >> >> >> It should, but the question is to what. >> >> I offer $20 to the first person to find a reliable (that works >> on all TG >> sites + PADS + Intrepid), quick (that does not, by itself, >> delay worker >> startup or the overall workflow by more than a few seconds) >> and >> automated way of figuring out that IP. I reserve the right to >> refuse a >> solution if it does not meet certain propriety criteria that I >> did not >> necessarily specify here. >> >> (btw you could make a wrapper around swift that detects >> whether you are >> on login1 or login2 and picks one of two sites files and >> passes that to >> swift). >> >> Mihael >> >>> >>> On 8/13/10 11:35 AM, Mihael Hategan wrote: >>>> On Fri, 2010-08-13 at 11:30 -0500, Jonathan Monette wrote: >>>> >>>>> Hello, >>>>> How does the worker decide what connection to >> connect to? Right >>>>> now what I think it does is it runs ifconfig and greps >> the inet address >>>>> and then test each of these connections. Is this >> correct? When I am >>>>> running on PADS it seems that the worker always chooses >> the wrong >>>>> connection to the service. It seems to choose the UBS0 >> connection where >>>>> the correct connection is the ib0 connection. Is there a >> way that maybe >>>>> the worker can be fixed to choose a better connection or >> the correct >>>>> connection? This seems to be only happening on PADS. >>>>> >>>>> >>>> That was temporary. Initially it would use the same >> address as the url >>>> in sites.xml. Then I added the "try all interfaces" thing, >> but in some >>>> cases the connect on certain wrong addresses does not fail >> quickly >>>> enough and has to timeout instead, which usually takes a >> few minutes. So >>>> that got disabled and only the frist address is used now >> (unless >>>> overridden - see below). 
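Glen's description of the OOPS approach, picking the principal IP of the submit node and using sed to build the sites file on the fly, would look roughly like this; the @INTERNAL_HOSTNAME@ placeholder and the file names are made up for the example:

# Sketch of the OOPS-style wrapper step: substitute the submit node's
# primary IP for a placeholder in a sites.xml template.
# (hostname -i can return a loopback address on some hosts; check it.)
ip=$(hostname -i | awk '{print $1}')
sed "s/@INTERNAL_HOSTNAME@/$ip/" sites.template.xml > sites.xml
swift -sites.file sites.xml -tc.file tc.data myscript.swift

As Glen notes above, this is effective but assumes the run is submitted from that site's own submit host.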
>>>> >>>> You can say <profile namespace="globus" key="internalHostname">x.y.z.w</profile> in sites.xml >>>> >>>> Mihael >>>> >>>> >>>> >>>> >>> >> >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
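Mihael's parenthetical suggestion in the quoted message, a wrapper around swift that detects which login node you are on and picks the matching sites file, would be a few lines of shell; the file names here are placeholders:

#!/bin/bash
# Sketch of the suggested wrapper: choose a sites file per login node and
# pass everything else straight through to swift.
case $(hostname -s) in
    login1) sites=sites-login1.xml ;;
    login2) sites=sites-login2.xml ;;
    *)      echo "no sites file for $(hostname -s)" >&2; exit 1 ;;
esac
exec swift -sites.file "$sites" "$@"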