From skenny at uchicago.edu  Sat Jan  3 17:47:14 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Sat, 3 Jan 2009 17:47:14 -0600 (CST)
Subject: [Swift-devel] Swift svn swift-r2386 cog-r2261
Message-ID: <20090103174714.BPY33171@m4500-02.uchicago.edu>

so, i compiled the latest swift today and was doing a test
using first.swift. runs fine on ncsa, but i seem to be having
some trouble on ranger (so i'm guessing some sge weirdness?)

the job seems to hang indefinitely w/o making it into the q. i
don't see any errors in the job log, but the coasters log is
showing this:

2009-01-03 16:31:26,366-0600 WARN  Worker Worker 503241197
status change: Failed The job manager detected an invalid script\
response
2009-01-03 16:31:26,366-0600 WARN  WorkerManager Worker
terminated: Worker[503241197]
2009-01-03 16:31:26,366-0600 WARN  Worker Worker -1904910211
status change: Failed null

my sites file looks like this:

1 8 TG-DBS080005N 16 /work/00926/tg459516/sidgrid_out/{username}

any ideas?

From hategan at mcs.anl.gov  Sat Jan  3 17:54:13 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 03 Jan 2009 17:54:13 -0600
Subject: [Swift-devel] Swift svn swift-r2386 cog-r2261
In-Reply-To: <20090103174714.BPY33171@m4500-02.uchicago.edu>
References: <20090103174714.BPY33171@m4500-02.uchicago.edu>
Message-ID: <1231026853.29677.0.camel@localhost>

On Sat, 2009-01-03 at 17:47 -0600, skenny at uchicago.edu wrote:
> so, i compiled the latest swift today and was doing a test
> using first.swift. runs fine on ncsa, but i seem to be having
> some trouble on ranger (so i'm guessing some sge weirdness?)
>
> the job seems to hang indefinitely w/o making it into the q. i
> don't see any errors in the job log, but the coasters log is
> showing this:
>
> 2009-01-03 16:31:26,366-0600 WARN  Worker Worker 503241197
> status change: Failed The job manager detected an invalid script\
> response

That looks like a gram/sge configuration problem.

Can you try a queue globusrun?

From benc at hawaga.org.uk  Sat Jan 10 14:39:24 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 10 Jan 2009 20:39:24 +0000 (GMT)
Subject: [Swift-devel] dataflow dependencies for mapper params and declaration assignments
Message-ID:

I finally overcame my paranoia about breaking the core of Swift and
committed the last of my changes to make dataflow dependencies work for
mapper parameters and combined variable declarations/assignments.

This makes some things in Swift that seemed logical to do, but didn't
actually work, now work; there's no change to the description of the
language.

Previously, mapper parameters had to be defined "early" (that is, the
runtime needed the values of the parameters to be defined by the time it
happened to encounter the mapper initialisation); and variables whose
values were assigned in the same statement as they were declared likewise
needed their values to be defined by the time the runtime encountered
those declarations. Behaviour when those values were not properly defined
by the time they were encountered was somewhat random.

This should now be fixed. However, if you see things not working that
used to work, post here.

--

From benc at hawaga.org.uk  Sun Jan 11 05:19:21 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 11 Jan 2009 11:19:21 +0000 (GMT)
Subject: [Swift-devel] If you build swift from SVN, read this before updating.
Message-ID:

As of r2433, the Swift cog module is called 'swift' instead of 'vdsk'.

If you build from the SVN then there are two things you need to do:

i) move your vdsk directory to be called swift:

$ cd cog/modules/
$ mv vdsk swift

ii) get rid of previous builds:

$ cd cog/modules/swift
$ rm -r dist/vdsk-svn

When you rebuild, the distribution will now appear in dist/swift-svn
instead of vdsk-svn, and you will need to adjust paths in your own
environment accordingly.

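The two steps above could be wrapped in a small guarded helper, sketched
below. This is only an illustration and not something from the Swift tree:
the function name migrate_vdsk_to_swift and its single argument (the path to
cog/modules) are assumptions made for the example, and the guards are there
only so that a tree which has already been renamed is left alone.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift): apply the vdsk -> swift rename
# and remove the stale build output. $1 is the path to cog/modules.
migrate_vdsk_to_swift() {
    modules_dir="$1"

    # i) rename the module directory, unless that has already been done
    if [ -d "$modules_dir/vdsk" ] && [ ! -e "$modules_dir/swift" ]; then
        mv "$modules_dir/vdsk" "$modules_dir/swift"
    fi

    # ii) remove the previous build so it is not mistaken for a current one
    if [ -d "$modules_dir/swift/dist/vdsk-svn" ]; then
        rm -r "$modules_dir/swift/dist/vdsk-svn"
    fi
}

# Example call, assuming the current directory holds the cog checkout:
# migrate_vdsk_to_swift cog/modules
```

Running it a second time should be harmless, since both steps are skipped
once vdsk and dist/vdsk-svn are gone.
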
--

From bugzilla-daemon at mcs.anl.gov  Mon Jan 12 01:03:28 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 12 Jan 2009 01:03:28 -0600 (CST)
Subject: [Swift-devel] [Bug 170] New: APPLICATION_EXCEPTIONS during stageout causes stageout plot events to be generate with -LARGE durations, giving strange looking graphs
Message-ID:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=170

           Summary: APPLICATION_EXCEPTIONS during stageout causes stageout
                    plot events to be generate with -LARGE durations,
                    giving strange looking graphs
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Log processing and plotting
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk


For example:

$ grep 22ech42j modgenproc-20081110-1721-a7538coc.log
2008-11-10 17:28:11,504-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=RInvoke-22ech42j thread=0-8058-1 host=ANLUCTERAGRID32 replicationGroup=3c5ch42j
2008-11-10 17:28:11,506-0600 INFO  vdl:createdirset START jobid=RInvoke-22ech42j host=ANLUCTERAGRID32 - Initializing directory structure
2008-11-10 17:28:11,506-0600 INFO  vdl:createdirset END jobid=RInvoke-22ech42j - Done initializing directory structure
2008-11-10 17:28:11,506-0600 INFO  vdl:dostagein START jobid=RInvoke-22ech42j - Staging in files
2008-11-10 17:28:11,631-0600 INFO  vdl:dostagein END jobid=RInvoke-22ech42j - Staging in finished
2008-11-10 17:28:11,631-0600 DEBUG vdl:execute2 JOB_START jobid=RInvoke-22ech42j tr=RInvoke arguments=[scripts/singlemodels.R, matrices/gestspeech.cov, 4, 13729] tmpdir=modgenproc-20081110-1721-a7538coc/jobs/2/RInvoke-22ech42j host=ANLUCTERAGRID32
2008-11-10 17:28:11,637-0600 INFO  Execute jobid=RInvoke-22ech42j task=Task(type=JOB_SUBMISSION, identity=urn:0-8058-1-1226359319704)
2008-11-10 17:28:14,029-0600 DEBUG vdl:checkjobstatus START jobid=RInvoke-22ech42j
2008-11-10 17:28:14,125-0600 INFO  vdl:checkjobstatus SUCCESS jobid=RInvoke-22ech42j - Success file found
2008-11-10 17:28:14,125-0600 DEBUG vdl:execute2 STAGING_OUT jobid=RInvoke-22ech42j
2008-11-10 17:28:14,125-0600 INFO  vdl:dostageout START jobid=RInvoke-22ech42j - Staging out files
2008-11-10 17:28:15,119-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=RInvoke-22ech42j - Application exception: org.globus.cog.karajan.workflow.service.ProtocolException: /scratch/gpfs/local/sidgrid_out/skenny/modgenproc-20081110-1721-a7538coc/shared/results/13729.min (No such file or directory)

gives an event line like this:

$ grep 22ech42j dostageout.event
1226359694.125 -1226359694.125 RInvoke-22ech42j START

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at mcs.anl.gov  Mon Jan 12 01:09:12 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 12 Jan 2009 01:09:12 -0600 (CST)
Subject: [Swift-devel] [Bug 170] APPLICATION_EXCEPTIONS during stageout causes stageout plot events to be generate with -LARGE durations, giving strange looking graphs
In-Reply-To:
Message-ID: <20090112070912.B2FC5164CE@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=170


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


------- Comment #1 from benc at hawaga.org.uk  2009-01-12 01:09 -------
The stageout never logs any kind of failed log line for itself - instead that
is indicated by the application exception associated with the enclosing
execute2.

Some options: make dostageout catch exception, log a failure, and pass
exception upwards; make the log processing treat this stageout as having had 0
duration; make the log processing aware of how the execute2
APPLICATION_EXCEPTION relates to dostageout.

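The second option above (treat the failed stageout as having had 0 duration)
could be sketched as a filter over the event file. This is an illustration
under stated assumptions, not the actual Swift log-processing code: the column
layout (start time, duration, job id, state) is inferred from the single
dostageout.event line in the report, and the function name
clamp_negative_durations is made up for the example.

```shell
#!/bin/sh
# Hypothetical filter (not Swift's log-processing code): rewrite any event
# whose duration column is negative so that it has zero duration.
# Assumed columns: start-time duration job-id state
clamp_negative_durations() {
    # "$2 + 0" forces a numeric comparison; lines with fewer than two
    # fields are passed through untouched.
    awk 'NF >= 2 && $2 + 0 < 0 { $2 = 0 } { print }' "$@"
}

# Example call:
# clamp_negative_durations dostageout.event > dostageout.clamped.event
```

This only hides the symptom in the plots; it does not record that the
stageout failed, which the other two options would.
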
From bugzilla-daemon at mcs.anl.gov  Mon Jan 12 01:14:21 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 12 Jan 2009 01:14:21 -0600 (CST)
Subject: [Swift-devel] [Bug 170] APPLICATION_EXCEPTIONS during stageout causes stageout plot events to be generate with -LARGE durations, giving strange looking graphs
In-Reply-To:
Message-ID: <20090112071421.99E50164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=170


------- Comment #2 from benc at hawaga.org.uk  2009-01-12 01:14 -------
transitions-to-event *should* be making this appear as having duration until
the end of the run, rather than negative. it doesn't, though; and that is also
undesirable in this particular case.

From bugzilla-daemon at mcs.anl.gov  Mon Jan 12 01:56:21 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 12 Jan 2009 01:56:21 -0600 (CST)
Subject: [Swift-devel] [Bug 170] APPLICATION_EXCEPTIONS during stageout causes stageout plot events to be generate with -LARGE durations, giving strange looking graphs
In-Reply-To:
Message-ID: <20090112075621.21B21164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=170


------- Comment #3 from benc at hawaga.org.uk  2009-01-12 01:56 -------
r2438 corrects the behaviour of compute-t-inf; this means that this plot still
shows incorrect duration dostageouts but they now last until the end of the
run instead of to the start of time; this also corrects a few other plots in
the same run which were suffering from large negative durations...

dostageouts that fail still have incorrect durations reported, however.

From bugzilla-daemon at mcs.anl.gov  Mon Jan 12 12:16:29 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 12 Jan 2009 12:16:29 -0600 (CST)
Subject: [Swift-devel] [Bug 164] types with single character names do not work
In-Reply-To:
Message-ID: <20090112181629.D330B164CE@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=164


benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from benc at hawaga.org.uk  2009-01-12 12:16 -------
should be fixed in r2440

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee. From wilde at mcs.anl.gov Tue Jan 13 10:56:42 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 13 Jan 2009 10:56:42 -0600 Subject: [Swift-devel] Finding apps via $PATH In-Reply-To: References: Message-ID: <496CC7CA.1040703@mcs.anl.gov> was: Re: [Swift-devel] fun with osg Ben, could you (re)post info on how to do this: > There also remains the problem of application location and deployment; in > my playing today I used default $PATH method of finding my test executable > (touch) - this is still hard, though I have some other notes to write > about that elsewhere. Does it work in the current rev? - Mike On 12/9/08 7:57 PM, Ben Clifford wrote: > I played some with Mats Rynge at RENCI today. > > He modified some code he already had to output sites.xml based on the OSG > ReSS information system. > > This made it straightforward to submit jobs to sites that are in OSGEDU > (the only OSG VO that I am a member of) running executables that are > already on the system $PATH. > > However, of the 13 sites advertising that they support OSG EDU, only two > are actually able to run Swift jobs this afternoon. > > I just got setup to submit jobs to the OSG Engagement VO; this changes the > range of sites available, but also opens up some more opportunity for > narrowing down the range of available sites as the Engagement VO has a > richer set of site availability information that can be used to construct > a more-likely-to-work sites.xml file. > > Of the two working-and-published OSGEDU sites, I also tried running with > coasters; that failed dismally on both - in the case of one site, the fork > job manager was the ever-more-common ManagedFork jobmanager, which appears > unable to be able to execute the head jobs. The other site ran the head > job but worker nodes could not communicate properly with that. so frrrr. 
> > There also remains the problem of application location and deployment; in > my playing today I used default $PATH method of finding my test executable > (touch) - this is still hard, though I have some other notes to write > about that elsewhere. > > Replication seems to do a good job with failing sites, although not a > completely perfect job. One site takes jobs to the karajan Active state > and then stays there forever. Replication, as presently implemented, > doesn't cope with that. > > So nothing particularly new - OSG as usual... > From benc at hawaga.org.uk Tue Jan 13 16:27:39 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 13 Jan 2009 22:27:39 +0000 (GMT) Subject: [Swift-devel] Re: Finding apps via $PATH In-Reply-To: <496CC7CA.1040703@mcs.anl.gov> References: <496CC7CA.1040703@mcs.anl.gov> Message-ID: Omit the path in a tc.data entry (eg say echo instead of /bin/echo) and $PATH will be used to look up the executable. This is in SVN head and also in the 0.7 release. Related, if you want to add stuff to the remote path rather than completely reset the remote path losing defaults, look at the PATHPREFIX env profile in the user guide. On Tue, 13 Jan 2009, Michael Wilde wrote: > was: Re: [Swift-devel] fun with osg > > Ben, could you (re)post info on how to do this: > > > There also remains the problem of application location and deployment; in > > my playing today I used default $PATH method of finding my test executable > > (touch) - this is still hard, though I have some other notes to write > > about that elsewhere. > > Does it work in the current rev? > > - Mike > > On 12/9/08 7:57 PM, Ben Clifford wrote: > > I played some with Mats Rynge at RENCI today. > > > > He modified some code he already had to output sites.xml based on the OSG > > ReSS information system. 
> > > > This made it straightforward to submit jobs to sites that are in OSGEDU (the > > only OSG VO that I am a member of) running executables that are already on > > the system $PATH. > > > > However, of the 13 sites advertising that they support OSG EDU, only two are > > actually able to run Swift jobs this afternoon. > > > > I just got setup to submit jobs to the OSG Engagement VO; this changes the > > range of sites available, but also opens up some more opportunity for > > narrowing down the range of available sites as the Engagement VO has a > > richer set of site availability information that can be used to construct a > > more-likely-to-work sites.xml file. > > > > Of the two working-and-published OSGEDU sites, I also tried running with > > coasters; that failed dismally on both - in the case of one site, the fork > > job manager was the ever-more-common ManagedFork jobmanager, which appears > > unable to be able to execute the head jobs. The other site ran the head job > > but worker nodes could not communicate properly with that. so frrrr. > > > > There also remains the problem of application location and deployment; in my > > playing today I used default $PATH method of finding my test executable > > (touch) - this is still hard, though I have some other notes to write about > > that elsewhere. > > > > Replication seems to do a good job with failing sites, although not a > > completely perfect job. One site takes jobs to the karajan Active state and > > then stays there forever. Replication, as presently implemented, doesn't > > cope with that. > > > > So nothing particularly new - OSG as usual... 
> >
> >

From rynge at renci.org  Tue Jan 13 17:02:31 2009
From: rynge at renci.org (Mats Rynge)
Date: Tue, 13 Jan 2009 18:02:31 -0500
Subject: [Swift-devel] Re: Finding apps via $PATH
In-Reply-To:
References: <496CC7CA.1040703@mcs.anl.gov>
Message-ID: <496D1D87.5020908@renci.org>

Ben Clifford wrote:
> Omit the path in a tc.data entry (eg say echo instead of /bin/echo) and
> $PATH will be used to look up the executable. This is in SVN head and also
> in the 0.7 release.

Note that on OSG you will find that most sites have no or a very minimal
$PATH.

--
Mats Rynge
Renaissance Computing Institute

From benc at hawaga.org.uk  Wed Jan 14 02:14:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 14 Jan 2009 08:14:50 +0000 (GMT)
Subject: [Swift-devel] Re: Finding apps via $PATH
In-Reply-To: <496D1D87.5020908@renci.org>
References: <496CC7CA.1040703@mcs.anl.gov> <496D1D87.5020908@renci.org>
Message-ID:

On Tue, 13 Jan 2009, Mats Rynge wrote:

> > Omit the path in a tc.data entry (eg say echo instead of /bin/echo) and
> > $PATH will be used to look up the executable. This is in SVN head and also
> > in the 0.7 release.
>
> Note that on OSG you will find that most sites have no or a very minimal
> $PATH.

Right. In that case, it's probably most useful in conjunction with
PATHPREFIX so that you can define your application bin directory once per
site, rather than once per executable per site.

I think it should also work nicely with the SoftEnv RSL options available
on teragrid (though I haven't tried).

--

From bugzilla-daemon at mcs.anl.gov  Wed Jan 14 07:22:27 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 14 Jan 2009 07:22:27 -0600 (CST)
Subject: [Swift-devel] [Bug 163] ext mapper doesn't like being used for input files.
In-Reply-To:
Message-ID: <20090114132227.CB504164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=163


------- Comment #2 from benc at hawaga.org.uk  2009-01-14 07:22 -------
On looking at this bug again, it looks like there is a type error in the
supplied test case, and that ext mapper as input does actually work.

The error message given is slightly confusing, suggesting some mapper
inconsistency, when actually it is a type violation. This error message should
perhaps be tidied up.

From iraicu at cs.uchicago.edu  Wed Jan 14 11:30:32 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 14 Jan 2009 11:30:32 -0600
Subject: [Swift-devel] CFP: IEEE 2009 Third International Workshop on Scientific Workflows (SWF 2009)
Message-ID: <496E2138.7060104@cs.uchicago.edu>

IEEE 2009 Third International Workshop on Scientific Workflows (SWF 2009)
Los Angeles, USA, July 7, 2009
http://www.servicescongress.org/2009/1/swf-2009.html
In conjunction with IEEE International Conference on Web Services (ICWS 2009)

Call for Papers

Today, many scientific discoveries are achieved through complex and
distributed scientific computations that are represented and structured as
scientific workflows. User friendly scientific workflow systems are
increasingly being developed to enable e-scientists to integrate, structure,
and orchestrate various local or remote data and service resources to perform
various in silico experiments to produce interesting scientific discovery.
The critical role of scientific workflows in cyberinfrastructure has been
recognized by a recent NSF workshop on the challenges of scientific workflows
in May 2006, which concluded that "workflows should become first-class
entities in cyberinfrastructure architecture. For domain scientists, they are
important because workflows document and manage the increasingly complex
processes involved in exploration and discovery through computations. For
computer scientists, workflows provide a formal and declarative representation
of complex distributed computations that must be managed efficiently through
their lifecycle from assembly, to execution, to sharing."

Authors are invited to submit regular papers (8 pages), short papers (4
pages), and demo papers (2 pages) that show original unpublished research
results in all areas of scientific workflows. Topics of interest are listed
below; however, submissions on all aspects of scientific workflows are
welcome. For demo papers, at least one author is expected to present a demo in
the workshop during the demo session; special arrangements will be made to
meet the needs of the authors. Accepted SWF 2009 papers will be included in
the proceedings of SERVICES 2009 (Part I), which will be published by IEEE
Computer Society Press.

Topics

- Architecture, model, and language
- Provenance management
- Task management
- Workflow scheduling
- Data product management
- Monitoring and failure handling
- Service, Grid, and Cloud workflows
- Scientific workflow composition
- Scientific workflow security
- Modeling, simulation, analysis
- Scalability, reliability, extensibility
- Scientific workflow applications

Important dates

February 16, 2009, paper submission;
March 20, 2009, notification;
April 10, 2009, camera-ready version due.

Workshop organizers Workshop chairs: Shiyong Lu, Wayne State University, shiyong at wayne.edu; Calton Pu, Georgia Tech Publicity chairs: Yong Zhao, Microsoft Corporation; Ilkay Altintas, San Diego Supercomputer Center Publication chair: Cui Lin, Wayne State University Previous SWF workshops http://www.cs.wayne.edu/~shiyong/swf -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- A non-text attachment was scrubbed... Name: SWF09_CFP.pdf Type: application/pdf Size: 33125 bytes Desc: not available URL: From iraicu at cs.uchicago.edu Thu Jan 15 12:19:28 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 15 Jan 2009 12:19:28 -0600 Subject: [Swift-devel] CFP: Workshop on Large-Scale System and Application Performance (LSAP2009) Message-ID: <496F7E30.1020305@cs.uchicago.edu> Call for Papers --------------- Workshop on Large-Scale System and Application Performance (LSAP2009) In conjunction with the 18-th International Symposium on High Performance Distributed Computing (HPDC-18) Munich, Germany, June 9 or 10, 2009 http://www.lsap.org *** Submission deadline: March 1, 2009 *** MISSION Over the last decade, computer systems and applications in everyday use have grown to unprecedented scales. 
Large clusters serving millions of search requests per day, grids executing large workflows and parameter sweeps consisting of thousands of jobs, and supercomputers running complex e-science applications, have now hundreds of thousands of processing cores. In addition, clouds are quickly emerging as a large-scale computing infrastructure. Peer-to-peer systems and centralized video distribution systems that dominate the internet and complicated internet applications such as massive multiplayer online games are used by millions of people every day. In view of this tremendous growth, understanding the performance of large-scale computer systems and applications has become vital to institutional, commercial, and private interests. This workshop solicits original papers on performance evaluation methods, tools, and studies focusing on the challenges of large scale, such as decentralization, predictable performance, reliability, and scalability. It aims to bring together system designers and researchers involved with the modeling and performance evaluation of large-scale systems and applications. Topics of interest include, but are not limited to: - Performance aspects of large-scale systems - Performance aspects of large-scale applications - Performance-oriented properties such as availability, reliability, and scalability - Workload characterization and modeling - Mathematical modeling and analysis methods - Simulation methods and tools - Measurement methods and tools - Performance case studies SUBMISSION GUIDELINES Submitted papers should be limited to 8 pages (including tables, images, and references) and formatted according to the ACM SIGS Style. Use the official HPDC conference submission site to submit your paper; only pdf format is accepted. All papers will receive at least three reviews. Submission implies the willingness of at least one of the authors to register for the workshop and present the paper. 
The authors of the best paper in the workshop will receive a best-paper award. PROCEEDINGS The proceedings of the workshop will be published by ACM. IMPORTANT DATES Submission deadline: March 1, 2009 Author notification: March 31, 2009 Final papers due: TBA (in April, 2009) Workshop: June 9 or 10, 2009 SUBMISSION SITE Official HPDC conference submission site, https://ssl.linklings.net/conferences/hpdc/ WORKSHOP WEBSITE www.lsap2009.org PROGRAM CO-CHAIRS Dick Epema, Delft University of Technology, NL, d.h.j.epema at tudelft.nl Jose Moreira, IBM T.J. Watson Research Lab, USA, jmoreira at us.ibm.com Carey Williamson, University of Calgary, Canada, carey at cpsc.ucalgary.ca PROGRAM COMMITTEE Martin Arlitt, HP Labs, USA, and University of Calgary, CA Peter Buchholz, University of Dortmund, Germany Jon Howell, Microsoft Research, USA Adriana Iamnitchi, University of South Florida, USA Alexandru Iosup, Delft University of Technology, NL Evgenia Smirni, College of William and Mary, USA Swami Sivasubramanian, Amazon, USA Allen Snavely, University of California, San Diego, USA Denis Trystram, Labaratoire d'Informatique de Grenoble, FR CONTACT For further information please contact Dick Epema at d.h.j.epema at tudelft.nl. -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email:iraicu at cs.uchicago.edu Web:http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- A non-text attachment was scrubbed... 
Name: lsap2009_cfp.pdf Type: application/pdf Size: 45831 bytes Desc: not available URL: From wilde at mcs.anl.gov Thu Jan 15 20:37:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 15 Jan 2009 20:37:55 -0600 Subject: [Swift-devel] Swift svn swift-r2386 cog-r2261 In-Reply-To: <1231026853.29677.0.camel@localhost> References: <20090103174714.BPY33171@m4500-02.uchicago.edu> <1231026853.29677.0.camel@localhost> Message-ID: <496FF303.1040906@mcs.anl.gov> Was this ever resolved? Im trying to test on Ranger as well, and it seems I can no longer submit SGE jobs due to my login not being properly mapped to an account. A plain qsub gives me the error: ERROR: You have no project in the projectuser.map file. Please contact TACC Consulting (https://portal.tacc.utexas.edu/consulting for UT users, help at teragrid.org for Teragrid users). Ive forwarded it to the help desk, but does anyone know if this similar-sounding problem was resolved? On 1/3/09 5:54 PM, Mihael Hategan wrote: > On Sat, 2009-01-03 at 17:47 -0600, skenny at uchicago.edu wrote: >> so, i compiled the latest swift today and was doing a test >> using first.swift. runs fine on ncsa, but i seem to be having >> some trouble on ranger (so i'm guessing some sge weirdness?) >> >> the job seems to hang indefinitely w/o making it into the q. i >> don't see any errors in the job log, but the coasters log is >> showing this: >> >> 2009-01-03 16:31:26,366-0600 WARN Worker Worker 503241197 >> status change: Failed The job manager detected an invalid script\ >> response > > That looks like a gram/sge configuration problem. > > Can you try a queue globusrun? 
From hategan at mcs.anl.gov  Thu Jan 15 20:38:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 15 Jan 2009 20:38:43 -0600
Subject: [Swift-devel] Swift svn swift-r2386 cog-r2261
In-Reply-To: <496FF303.1040906@mcs.anl.gov>
References: <20090103174714.BPY33171@m4500-02.uchicago.edu> <1231026853.29677.0.camel@localhost> <496FF303.1040906@mcs.anl.gov>
Message-ID: <1232073523.32554.0.camel@localhost>

On Thu, 2009-01-15 at 20:37 -0600, Michael Wilde wrote:
> Was this ever resolved?
>
> Im trying to test on Ranger as well, and it seems I can no longer submit
> SGE jobs due to my login not being properly mapped to an account.
>
> A plain qsub gives me the error:
> ERROR: You have no project in the projectuser.map file.
> Please contact TACC Consulting (https://portal.tacc.utexas.edu/consulting
> for UT users, help at teragrid.org for Teragrid users).

Skenny sent a request to help at teragrid.org. I haven't seen any replies
yet.

From wilde at mcs.anl.gov  Fri Jan 16 08:04:40 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 16 Jan 2009 08:04:40 -0600
Subject: [Swift-devel] Swift svn swift-r2386 cog-r2261
In-Reply-To: <1232073523.32554.0.camel@localhost>
References: <20090103174714.BPY33171@m4500-02.uchicago.edu> <1231026853.29677.0.camel@localhost> <496FF303.1040906@mcs.anl.gov> <1232073523.32554.0.camel@localhost>
Message-ID: <497093F8.2030505@mcs.anl.gov>

I resolved my problem. Some bit in my account at TACC was set incorrectly;
it was fixed by the TG helpdesk. Then I needed to explicitly specify a
project, as I now have several on Ranger.

Sarah, it's possible that your problem is due to specifying an account that's
not valid on Ranger. Try changing the '5N' in your account to '4N' - I think
that's the right one for Ranger.

- Mike


On 1/15/09 8:38 PM, Mihael Hategan wrote:
> On Thu, 2009-01-15 at 20:37 -0600, Michael Wilde wrote:
>> Was this ever resolved?
>>
>> Im trying to test on Ranger as well, and it seems I can no longer submit
>> SGE jobs due to my login not being properly mapped to an account.
>>
>> A plain qsub gives me the error:
>> ERROR: You have no project in the projectuser.map file.
>> Please contact TACC Consulting (https://portal.tacc.utexas.edu/consulting
>> for UT users, help at teragrid.org for Teragrid users).
>
> Skenny sent a request to help at teragrid.org. I haven't seen any replies
> yet.
>
>

From wilde at mcs.anl.gov  Fri Jan 16 08:30:49 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 16 Jan 2009 08:30:49 -0600
Subject: [Swift-devel] Re: fun with osg [Swift-devel Digest, Vol 23, Issue 12]
In-Reply-To:
References: <20081210162237.0AFE82C0044@mail.ci.uchicago.edu> <493FFCFB.6060600@ci.uchicago.edu>
Message-ID: <49709A19.2060408@mcs.anl.gov>

I'm assisting a UChicago application group that we work with (Open
Protein Simulator, OOPS application) to run on OSG, initially in the
Engage VO.

I've read this thread and a related one, and was wondering if you can
send an update on the following issues:

- where to obtain Mats's script to generate sites.xml for Engage?

- can that script be run for other VOs such as OSG, OSGEDU?

- has progress been made on running coasters on OSG?

- what are good settings for Swift without coasters to avoid overheating
head nodes in the absence of Swift gridmonitor support?

- what's the status of the swift-devel thread "[Swift-devel] jobs that go
active forever, and their effect on multisite osg runs" - any
resolution?

- any other tips for OSG swift users?

I think that it would be good to collect OSG info in the swift User
Guide under
"http://www.ci.uchicago.edu/swift/guides/userguide.php#localhowtos"

I'd be willing to help with that.

On 12/10/08 12:59 PM, Ben Clifford wrote: > On Wed, 10 Dec 2008, Alina Bejan wrote: > >> For OSGEDU: the ReSS information should provide only 2 sites >> available, since this is the reality. It hasn't been 13 sites for this >> VO for more than 6 months. But the information seems to still be stale. >> >> Situation might be slightly different if you run under OSG VO, since >> more sites would be available. If you wish to pursue this option as >> well, let me know. We'll be able to add you to OSG VO if you plan on >> running more experiments that way. > > I'm in the Engage VO now as well, as there is a lot of work already done > (and ongoing) there to keep that VO's information system fairly fresh - > that seems most attractive to me from a Swift development perspective. > From rynge at renci.org Fri Jan 16 08:49:02 2009 From: rynge at renci.org (Mats Rynge) Date: Fri, 16 Jan 2009 09:49:02 -0500 Subject: [Swift-devel] Re: fun with osg [Swift-devel Digest, Vol 23, Issue 12] In-Reply-To: <49709A19.2060408@mcs.anl.gov> References: <20081210162237.0AFE82C0044@mail.ci.uchicago.edu> <493FFCFB.6060600@ci.uchicago.edu> <49709A19.2060408@mcs.anl.gov> Message-ID: <49709E5E.1090201@renci.org> Michael Wilde wrote: > I'm assisting a UChicago application group that we work with (Open > Protein Simulator, OOPS application) to run on OSG, initially in the > Engage VO. > > I've read this thread and a related one, and was wondering is you can > send an update on the following issues: > > - where to obtain Mats's script to generate sites.xml for Engage? http://www.renci.org/~rynge/osg/swift/ress-to-site-catalog I'm trying to get the contrib agreement signed, but working something like that through the university legal system will take some time. Note that the script depends on having condor_status in your path. > - can that script be run for other VOs such as OSG, OSGEDU? Yes. 
Try: ./ress-to-site-catalog --vo=OSG Or, ./ress-to-site-catalog --help > - has progress been made on running coasters on OSG? We did a quick try, but ran into some problems. Ben knows more. In general, I think you will run into problem on some sites using managed-fork, as they might kill the forwarding process you put on their submit nodes. > - what are good settings for Swift without coasters to avoid overheating > head nodes in the absence of Swift gridmonitor support? I'm still trying to get a feel for this. I think it is a good idea to start with pretty conservative per-site settings, and try to use many sites instead. -- Mats Rynge Renaissance Computing Institute From benc at hawaga.org.uk Fri Jan 16 09:46:04 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 16 Jan 2009 15:46:04 +0000 (GMT) Subject: [Swift-devel] Re: fun with osg [Swift-devel Digest, Vol 23, Issue 12] In-Reply-To: <49709A19.2060408@mcs.anl.gov> References: <20081210162237.0AFE82C0044@mail.ci.uchicago.edu> <493FFCFB.6060600@ci.uchicago.edu> <49709A19.2060408@mcs.anl.gov> Message-ID: On Fri, 16 Jan 2009, Michael Wilde wrote: > - can that script be run for other VOs such as OSG, OSGEDU? yes. the results you will get for Engage will be higher quality than the results that you get for other VOs, in as much as the sites file generated for the Engage VO has sites filtered out based on additional testing done periodically. > - has progress been made on running coasters on OSG? no > - what are good settings for Swift without coasters to avoid overheating head > nodes in the absence of Swift gridmonitor support? I'd give the same advice that I give for any general use of GRAM2 - don't set the job throttle to higher than 0.2; this will have a maximum of 20 jobs at once on a site. > - what's the status of the swift-devel thread "Swift-devel] jobs that go > active forever, and their effect on multisite osg runs" - any > resolution? 
There's a bugzilla ticket open but no implementation at the moment:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=169

--

From bugzilla-daemon at mcs.anl.gov Mon Jan 19 10:22:09 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 19 Jan 2009 10:22:09 -0600 (CST)
Subject: [Swift-devel] [Bug 171] New: status.mode values need validating when read
Message-ID:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=171

           Summary: status.mode values need validating when read
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: minor
          Priority: P2
         Component: General
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk

status.mode parameters are not checked for validity. They should be.

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.

From benc at hawaga.org.uk Tue Jan 20 12:15:58 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 20 Jan 2009 18:15:58 +0000 (GMT)
Subject: [Swift-devel] packaging log-processing for 0.8 release
Message-ID:

I'd like to get the log-processing code into the 0.8 release; I'm not
entirely sure what packaging of that should look like. So for 0.8 I'd like
to fairly kludgily put it into the release, with swift-plot-log appearing
in the bin/ directory, and in subsequent releases tidy it up whilst
keeping swift-plot-log in the same place so that there is no UI change.
--

From hategan at mcs.anl.gov Tue Jan 20 12:22:34 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 20 Jan 2009 12:22:34 -0600
Subject: [Swift-devel] packaging log-processing for 0.8 release
In-Reply-To:
References:
Message-ID: <1232475754.7487.22.camel@localhost>

+1

On Tue, 2009-01-20 at 18:15 +0000, Ben Clifford wrote:
> I'd like to get the log-processing code into the 0.8 release; I'm not
> entirely sure what packaging of that should look like. So for 0.8 I'd like
> to fairly kludgily put it into the release, with swift-plot-log appearing
> in the bin/ directory, and in subsequent releases tidy it up whilst
> keeping swift-plot-log in the same place so that there is no UI change.
>

From benc at hawaga.org.uk Tue Jan 20 12:28:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 20 Jan 2009 18:28:06 +0000 (GMT)
Subject: [Swift-devel] swift 0.8 release plans
Message-ID:

I plan on making a Swift 0.8 release candidate around this coming Friday
(23rd Jan) with actual release happening around 7 days later if no one
objects to the release candidate; as with previous releases, this will not
follow the dev.globus release process, but will follow the Swift tradition
of releasing if no-one objects.

--

From skenny at uchicago.edu Wed Jan 21 12:48:22 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Wed, 21 Jan 2009 12:48:22 -0600 (CST)
Subject: [Swift-devel] swift jobs hanging on ranger
Message-ID: <20090121124822.BQY14292@m4500-02.uchicago.edu>

so, ranger requires a project/account be specified for any job
going into the scheduler...i can submit a globus-job-run and
also a cog-job-submit with my current project id and it works.
then when i submit with swift (specifying that same project id in my sites file) it hangs and does not make it into the queue: [skenny at gwynn check_env]$ cog-job-submit -p gt2 -jm SGE -a project=TG-DBS090006 -e /bin/hostname -s gatekeeper.ranger.tacc.teragrid.org Forcing redirection because the SGE JM is broken. Job completed [skenny at gwynn check_env]$ globus-job-run gatekeeper.ranger.tacc.teragrid.org/jobmanager-sge -p TG-DBS090006 /bin/hostname ... Job is running. Job 452453 has completed. [skenny at gwynn check_env]$ swift -tc.file /disks/ci-gpfs/fmri/cnari/swift/config/tc.data -sites.file ./sites_ranger.xml env.swift -user="skenny" Swift svn swift-r2386 cog-r2261 RunID: 20090121-1228-s2aha776 Progress: env started env started env started Progress: Selecting site:2 Stage in:1 Progress: Submitted:3 ... **********the sites file entry is: 1 8 TG-DBS090006 16 /work/00043/tg457040/sidgrid_out/{username} ******gram log on remote site contains this: Wed Jan 21 12:36:33 2009 JM_SCRIPT: Checking project details Wed Jan 21 12:36:33 2009 JM_SCRIPT: SGE Regular Edition: NO project support Wed Jan 21 12:36:33 2009 JM_SCRIPT: WARNING: Project set to TG-DBS090006 .... Wed Jan 21 12:36:33 2009 JM_SCRIPT: Submitting a job Wed Jan 21 12:36:33 2009 JM_SCRIPT: ERROR: /opt/sge/bin/lx24-amd64/qsub /share/home/00043/tg457040/.globus/.gass_cac\ he/local/md5/89/ff0dc3a3eeffb3ca94dabdf57b8473/md5/4a/a95989876bc53f3f82515f4507c280/data retcode = 256 Wed Jan 21 12:36:33 2009 JM_SCRIPT: ERROR: job submission failed Wed Jan 21 12:36:33 2009 JM_SCRIPT: check if the project specified does exist 1/21 12:36:33 JMI: while return_buf = GRAM_SCRIPT_ERROR = 24 hopefully i'm not missing something obvious here, but can anyone think of a reason why the project id is producing an error when i submit with swift? thanks!! 
sarah From benc at hawaga.org.uk Wed Jan 21 12:58:35 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 21 Jan 2009 18:58:35 +0000 (GMT) Subject: [Swift-devel] swift jobs hanging on ranger In-Reply-To: <20090121124822.BQY14292@m4500-02.uchicago.edu> References: <20090121124822.BQY14292@m4500-02.uchicago.edu> Message-ID: try swift without using coasters and see if that goes through; try cog-job-submit using coasters and see if that fails. it might be that inside coasters, the project is not passing through correctly. -- From skenny at uchicago.edu Wed Jan 21 13:37:39 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 21 Jan 2009 13:37:39 -0600 (CST) Subject: [Swift-devel] swift jobs hanging on ranger Message-ID: <20090121133739.BQY23827@m4500-02.uchicago.edu> >try swift without using coasters and see if that goes through; still doesn't go thru and gives the same error in the gram log. but it doesn't hang indefinitely on the submit host. it gives: Progress: Stage in:1 Failed but can retry:2 Failed to transfer wrapper log from env-20090121-1320-obxi57t5/info/4 on RANGER env failed Execution failed: Exception in env: Arguments: [] Host: RANGER Directory: env-20090121-1320-obxi57t5/jobs/4/env-4vv5nh5j stderr.txt: stdout.txt: ---- Caused by: The job manager detected an invalid script response >cog-job-submit using coasters and see if that fails. [skenny at gwynn check_env]$ cog-job-submit -p coaster -jm gt2:gt2:SGE -a project=TG-DBS090006 -e /bin/hostname -s gatekeeper.ranger.tacc.teragrid.org Started local service: 128.135.92.83:50004 Socket bound. 
URL is http://gwynn.bsd.uchicago.edu:50005 [/129.114.50.163:34018] GET /coaster-bootstrap.jar HTTP/1.0 [/129.114.50.163:34023] GET /list?serviceId=1487248510 HTTP/1.1 GSSSChannel-null(0): Disabling heartbeats (config is null) Initialized connection handler Multiplexer 0 started (1) Scheduling GSSSChannel-null(1) for addition nullChannel started Connection handler started Multiplexer 1 started GSSSChannel-null(1) REQ: Handler(CHANNELCONFIG) Channel id: 341bc107:11efaac5124:-8000:-60989cff:11efaac517b:-8000 MetaChannel: 11644607 -> null: Disabling heartbeats (disabled in config) MetaChannel: 11644607 -> null.bind -> GSSSChannel-null(1) GSSSChannel-null(1) REQ: Handler(REGISTER) Trying to re-bind current channel Sending Command(1, SUBMITJOB) on GSSSChannel-null(1) Command(1, SUBMITJOB) CMD: Command(1, SUBMITJOB) GSSSChannel-null(1) REPL: Command(1, SUBMITJOB) Submitted task Task(type=JOB_SUBMISSION, identity=urn:cog-1232566229197). Job id: urn:1232566229197-1232566244068-1232566244069 Unregistering Command(1, SUBMITJOB) GSSSChannel-null(1) REQ: Handler(JOBSTATUS) GSSSChannel-null(1) REQ: Handler(JOBSTATUS) Job completed ********logs for this are here: /home/skenny/logs From benc at hawaga.org.uk Wed Jan 21 13:49:29 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 21 Jan 2009 19:49:29 +0000 (GMT) Subject: [Swift-devel] swift jobs hanging on ranger In-Reply-To: <20090121124822.BQY14292@m4500-02.uchicago.edu> References: <20090121124822.BQY14292@m4500-02.uchicago.edu> Message-ID: where is your check_env directory? 
-- From skenny at uchicago.edu Wed Jan 21 15:16:09 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 21 Jan 2009 15:16:09 -0600 (CST) Subject: [Swift-devel] swift jobs hanging on ranger Message-ID: <20090121151609.BQY45261@m4500-02.uchicago.edu> looks like the parsing of maxwalltime is having an effect on whether the job gets into the scheduler: [skenny at gwynn check_env]$ cog-job-submit -p gt2 -jm SGE -a project=TG-DBS090006,maxwalltime=240 -e /bin/hostname -s gatekeeper.ranger.tacc.teragrid.org Forcing redirection because the SGE JM is broken. Job failed: org.globus.gram.GramException: The job manager detected an invalid script response at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:534) at org.globus.gram.GramJob.setStatus(GramJob.java:184) at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) at java.lang.Thread.run(Thread.java:595) whereas this succeeds: [skenny at gwynn check_env]$ cog-job-submit -p gt2 -jm SGE -a project=TG-DBS090006,maxwalltime=60 -e /bin/hostname -s gatekeeper.ranger.tacc.teragrid.org the problem seems to lie in sge's inconsistent passing of maxwalltime. ---- Original message ---- >Date: Wed, 21 Jan 2009 13:37:39 -0600 (CST) >From: >Subject: Re: [Swift-devel] swift jobs hanging on ranger >To: Ben Clifford >Cc: swift-devel at ci.uchicago.edu > >>try swift without using coasters and see if that goes through; > >still doesn't go thru and gives the same error in the gram >log. but it doesn't hang indefinitely on the submit host. 
it >gives: > >Progress: Stage in:1 Failed but can retry:2 >Failed to transfer wrapper log from >env-20090121-1320-obxi57t5/info/4 on RANGER >env failed >Execution failed: > Exception in env: >Arguments: [] >Host: RANGER >Directory: env-20090121-1320-obxi57t5/jobs/4/env-4vv5nh5j >stderr.txt: > >stdout.txt: > >---- > >Caused by: > The job manager detected an invalid script response > > >>cog-job-submit using coasters and see if that fails. > >[skenny at gwynn check_env]$ cog-job-submit -p coaster -jm >gt2:gt2:SGE -a project=TG-DBS090006 -e /bin/hostname -s >gatekeeper.ranger.tacc.teragrid.org >Started local service: 128.135.92.83:50004 >Socket bound. URL is http://gwynn.bsd.uchicago.edu:50005 >[/129.114.50.163:34018] GET /coaster-bootstrap.jar HTTP/1.0 >[/129.114.50.163:34023] GET /list?serviceId=1487248510 HTTP/1.1 >GSSSChannel-null(0): Disabling heartbeats (config is null) >Initialized connection handler >Multiplexer 0 started >(1) Scheduling GSSSChannel-null(1) for addition >nullChannel started >Connection handler started >Multiplexer 1 started >GSSSChannel-null(1) REQ: Handler(CHANNELCONFIG) >Channel id: 341bc107:11efaac5124:-8000:-60989cff:11efaac517b:-8000 >MetaChannel: 11644607 -> null: Disabling heartbeats (disabled >in config) >MetaChannel: 11644607 -> null.bind -> GSSSChannel-null(1) >GSSSChannel-null(1) REQ: Handler(REGISTER) >Trying to re-bind current channel >Sending Command(1, SUBMITJOB) on GSSSChannel-null(1) >Command(1, SUBMITJOB) CMD: Command(1, SUBMITJOB) >GSSSChannel-null(1) REPL: Command(1, SUBMITJOB) >Submitted task Task(type=JOB_SUBMISSION, >identity=urn:cog-1232566229197). 
Job id:
>urn:1232566229197-1232566244068-1232566244069
>Unregistering Command(1, SUBMITJOB)
>GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>Job completed
>
>
>********logs for this are here:
>
>/home/skenny/logs
>
>
>_______________________________________________
>Swift-devel mailing list
>Swift-devel at ci.uchicago.edu
>http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Wed Jan 21 15:20:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 21 Jan 2009 21:20:08 +0000 (GMT)
Subject: [Swift-devel] swift jobs hanging on ranger
In-Reply-To: <20090121151609.BQY45261@m4500-02.uchicago.edu>
References: <20090121151609.BQY45261@m4500-02.uchicago.edu>
Message-ID:

On Wed, 21 Jan 2009, skenny at uchicago.edu wrote:

> the problem seems to lie in sge's inconsistent passing of
> maxwalltime.

or the maxwalltime allowed in the queue, perhaps - if you probe random
values between 240 and 60, I think there should be a duration where all
probes above it fail and all probes below it succeed. Somewhere on the
internet there may even be documentation of what that value is for ranger,
although a google search didn't give me anything immediately.

--

From hategan at mcs.anl.gov Wed Jan 21 15:30:41 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 21 Jan 2009 15:30:41 -0600
Subject: [Swift-devel] swift jobs hanging on ranger
In-Reply-To:
References: <20090121151609.BQY45261@m4500-02.uchicago.edu>
Message-ID: <1232573441.30864.2.camel@localhost>

On Wed, 2009-01-21 at 21:20 +0000, Ben Clifford wrote:
> On Wed, 21 Jan 2009, skenny at uchicago.edu wrote:
>
> > the problem seems to lie in sge's inconsistent passing of
> > maxwalltime.
>
> or the maxwalltime allowed in the queue, perhaps - if you probe random
> values between 240 and 60, I think there should be a duration where all
> probes above it fail and all probes below it succeed.
Somewhere on the
> internet there may even be documentation of what that value is for ranger,
> although a google search didn't give me anything immediately.

120 seems to be the limit. So it may not be a format issue, but a matter
of using the wrong queue.

From hategan at mcs.anl.gov Wed Jan 21 15:34:52 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 21 Jan 2009 15:34:52 -0600
Subject: [Swift-devel] swift jobs hanging on ranger
In-Reply-To:
References: <20090121151609.BQY45261@m4500-02.uchicago.edu>
Message-ID: <1232573692.30864.5.camel@localhost>

From the docs at http://www.tacc.utexas.edu/services/userguides/ranger/

Queue Name    Max Runtime (default)  Max Procs  SU Charge Rate  Purpose
normal        24 hrs                 4096       1               Normal Priority
large         24 hrs                 12288      1               Large Core Count
development   2 hrs                  256        1               development
serial        2 hrs                  16         1               Large Jobs
Request       24 hrs                 16384      1               Special Requests
systest       --                     --         --              System Testing

This would be consistent with the observed behavior if by default the
jobs go to "development" or "serial".

On Wed, 2009-01-21 at 21:20 +0000, Ben Clifford wrote:
> On Wed, 21 Jan 2009, skenny at uchicago.edu wrote:
>
> > the problem seems to lie in sge's inconsistent passing of
> > maxwalltime.
>
> or the maxwalltime allowed in the queue, perhaps - if you probe random
> values between 240 and 60, I think there should be a duration where all
> probes above it fail and all probes below it succeed. Somewhere on the
> internet there may even be documentation of what that value is for ranger,
> although a google search didn't give me anything immediately.
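
The probing idea quoted above can be sketched as a bisection. This is only an illustration, not something taken from Swift or CoG: `probe` is a hypothetical callable that would wrap a real submission (for example a cog-job-submit run with a given maxwalltime) and report whether the job was accepted. Here it is simulated with a 120-minute limit, the value reported in this thread, and the walltime unit is assumed to be minutes.

```python
def largest_accepted(probe, accepted, rejected):
    """Bisect between a walltime known to be accepted and one known to be
    rejected, returning the largest value that is still accepted."""
    lo, hi = accepted, rejected
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid      # mid was accepted, so the limit is at or above mid
        else:
            hi = mid      # mid was rejected, so the limit is below mid
    return lo


def simulated_probe(minutes, limit=120):
    """Stand-in for a real job submission; the 120-minute limit is assumed."""
    return minutes <= limit


# 60 was observed to succeed and 240 to fail in this thread.
limit = largest_accepted(simulated_probe, 60, 240)
```

With a real probe each step costs one test submission, so the 60 to 240 range needs about eight submissions instead of trying every value.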
> From benc at hawaga.org.uk Wed Jan 21 15:50:12 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 21 Jan 2009 21:50:12 +0000 (GMT) Subject: [Swift-devel] swift jobs hanging on ranger In-Reply-To: <1232573692.30864.5.camel@localhost> References: <20090121151609.BQY45261@m4500-02.uchicago.edu> <1232573692.30864.5.camel@localhost> Message-ID: On Wed, 21 Jan 2009, Mihael Hategan wrote: > This would be consistent with the observed behavior if by default the > jobs go to "development" or "serial". Explicitly specifying the normal queue appears to make (non-coaster) execution work with longer wall times. -- From benc at hawaga.org.uk Fri Jan 23 03:47:53 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 23 Jan 2009 09:47:53 +0000 (GMT) Subject: [Swift-devel] nightly builds and tests Message-ID: I noticed that the downloads page still contains links to 'nightly build and test'. In my mind, I abandoned nightly builds as a distribution mechanism a long time ago, with their purpose being on code quality control rather than as software distribution. The reason for this is that my interactions with people who used to download the nightly build almost always ended up with them subsequently having a custom bugfix that they then wanted from SVN or a custom patch, necessitating making a source build. That being the case, it makes more sense to me that our single supported non-release mechanism should be building from SVN. So, I'd like to move away the build/test link from the download page to some other page. -- From benc at hawaga.org.uk Fri Jan 23 05:52:30 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 23 Jan 2009 11:52:30 +0000 (GMT) Subject: [Swift-devel] swift 0.8 rc1 Message-ID: Here is Swift 0.8 release candidate 1: http://www.ci.uchicago.edu/~benc/tmp/swift-0.8rc1.tar.gz I'll release it in about 7 days if there are no critical bugs discovered. Please test. 
I've included swift-plot-log into this release - that is a major addition to the released code. The other big change since 0.7 is handling of data dependencies in various places, so that the language behaves more like it is sometimes expected to. I've previously commented about that on this list, a couple of weeks ago. -- From zhengxiongh at uchicago.edu Fri Jan 23 12:21:34 2009 From: zhengxiongh at uchicago.edu (Zhengxiong Hou) Date: Fri, 23 Jan 2009 12:21:34 -0600 Subject: [Swift-devel] Re: work directory bug fixed In-Reply-To: References: <49501689.4010500@uchicago.edu> <4951108E.1070902@uchicago.edu> <4952A49F.3010605@uchicago.edu> Message-ID: <497A0AAE.3020507@uchicago.edu> Ben Clifford wrote: > swift r2380 should put a tidier version of this into the codebase. I'm > fairly confident about it but I can't test against red.unl.edu at the > moment as it isn't taking my jobs today. > Yes, it seems that there are still some problem on this site. But I don't think it is due to the "work directory". I also checked it. But the status just stay as "Submitted", after many repeated "Progress: Submitted:1", it exits automatically or seems infinite repeat. E.g. [houzx at communicado run-by-swift]$ swift -tc.file ./dock-17-tc.data -sites.file test-condor-problem/site-1-red.xml grid-many-dock6-auto.swift Swift svn swift-r2386 (Swift modified locally) cog-r2125 (CoG modified locally) RunID: 20090123-1126-qfqnd5fb Progress: rundock started Sorted: [Nebraska:0.000(1.000):0/1 overload: 0] Will apply workdir to jobspec directory: /mnt/nfs03/zeng/osg_data/data/osg/houzx/grid-many-dock6-auto-20090123-1126-qfqnd5fb found wrapper.sh: /mnt/nfs03/zeng/osg_data/data/osg/houzx/grid-many-dock6-auto-20090123-1126-qfqnd5fb/shared/wrapper.sh Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 ... ... ... ... 
Progress: Submitted:1 Progress: Submitted:1 Progress: Submitted:1 [houzx at communicado run-by-swift]$ globus-job-run red.unl.edu /bin/ls -al /mnt/nfs03/zeng/osg_data/data/osg/houzx/grid-many-dock6-auto-20090123-1126-qfqnd5fb total 16 drwxr-xr-x 6 osg osg 59 Jan 23 11:00 . drwxr-xr-x 5 osg osg 4096 Jan 23 11:00 .. drwxr-xr-x 2 osg osg 6 Jan 23 11:00 info drwxr-xr-x 2 osg osg 6 Jan 23 11:00 kickstart drwxr-xr-x 3 osg osg 48 Jan 23 11:00 shared drwxr-xr-x 2 osg osg 6 Jan 23 11:00 status From benc at hawaga.org.uk Fri Jan 23 12:32:43 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 23 Jan 2009 18:32:43 +0000 (GMT) Subject: [Swift-devel] Re: work directory bug fixed In-Reply-To: <497A0AAE.3020507@uchicago.edu> References: <49501689.4010500@uchicago.edu> <4951108E.1070902@uchicago.edu> <4952A49F.3010605@uchicago.edu> <497A0AAE.3020507@uchicago.edu> Message-ID: On Fri, 23 Jan 2009, Zhengxiong Hou wrote: > Yes, it seems that there are still some problem on this site. But I don't > think it is due to the "work directory". I also checked it. > But the status just stay as "Submitted", after many repeated "Progress: > Submitted:1", it exits automatically or seems infinite repeat. What does 'it exits automatically' mean? That you get an error message? When I submit using the command-line client like this: $ globus-job-run red.unl.edu/jobmanager-condor /bin/hostname I find that red.unl.edu does not run the job, for as long as I am prepared to wait. So I think you should verify that job submission to red.unl.edu is working properly outside of Swift. (eg with this command: $ globus-job-run red.unl.edu/jobmanager-condor /bin/hostname ) If you are running with multiple sites, then this situation should be handled well by the replication facility in Swift. 
-- From aespinosa at cs.uchicago.edu Sat Jan 24 00:24:04 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Sat, 24 Jan 2009 00:24:04 -0600 Subject: [Swift-devel] coasters on osgedu sites: Message-ID: <50b07b4b0901232224l66b7f00avbbb2f3ef25a7f725@mail.gmail.com> I am running swift 0.8rc1 swift log file: is in http://www.ci.uchicago.edu/~aespinosa/test-20090124-0019-je79k4q4.log swift runtime output: Swift 0.8rc1 swift-r2448 cog-r2261 RunID: 20090124-0019-je79k4q4 Progress: Progress: Selecting site:18 Initializing site shared directory:3 Progress: Selecting site:16 Initializing site shared directory:5 Progress: Selecting site:15 Initializing site shared directory:6 Progress: Selecting site:13 Initializing site shared directory:8 Progress: Selecting site:12 Initializing site shared directory:9 Progress: Selecting site:11 Initializing site shared directory:10 Progress: Selecting site:10 Initializing site shared directory:11 Progress: Selecting site:8 Initializing site shared directory:13 Progress: Selecting site:8 Initializing site shared directory:13 Progress: Selecting site:6 Initializing site shared directory:15 Progress: Selecting site:4 Initializing site shared directory:17 Progress: Selecting site:4 Initializing site shared directory:17 Progress: Selecting site:3 Initializing site shared directory:18 Progress: Selecting site:3 Initializing site shared directory:18 Progress: Selecting site:2 Initializing site shared directory:19 Execution failed: Could not initialize shared directory on WISC Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Failed to start coaster resource on osg-edu.cs.wisc.edu Caused by: Could not start coaster service Caused by: Task ended before registration was received. STDOUT: ERROR: Failed to parse command file (line 16). STDERR: null Caused by: The job failed when the job manager attempted to run it Cleaning up... 
Done sites.xml /export/osg/data/aespinosa /nfs/osg-data/aespinosa manual cog submission: cog-job-submit -verbose -s osg-edu.cs.wisc.edu -p coaster -jm gt2:gt2:condor -e /bin/hostname -r Started local service: 128.135.125.17:50000 Socket bound. URL is http://communicado.ci.uchicago.edu:50001 Service task Task(type=JOB_SUBMISSION, identity=urn:cog-1232777697384) terminated. Removing service. Service does not appear to be registered with this manager Submission Exception: Could not submit job Cleaning up... Done -Allan From benc at hawaga.org.uk Sat Jan 24 02:40:52 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 24 Jan 2009 08:40:52 +0000 (GMT) Subject: [Swift-devel] coasters on osgedu sites: In-Reply-To: <50b07b4b0901232224l66b7f00avbbb2f3ef25a7f725@mail.gmail.com> References: <50b07b4b0901232224l66b7f00avbbb2f3ef25a7f725@mail.gmail.com> Message-ID: On Sat, 24 Jan 2009, Allan Espinosa wrote: > I am running swift 0.8rc1 [...] > Failed to start coaster resource on osg-edu.cs.wisc.edu Have coasters worked for you on that site recently with a different Swift version? If so which one? -- From aespinosa at cs.uchicago.edu Sat Jan 24 07:43:08 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Sat, 24 Jan 2009 07:43:08 -0600 Subject: [Swift-devel] coasters on osgedu sites: In-Reply-To: References: <50b07b4b0901232224l66b7f00avbbb2f3ef25a7f725@mail.gmail.com> Message-ID: <50b07b4b0901240543mf8f9d03r2e922ea44f2b0aa5@mail.gmail.com> No, I am also getting this in v0.7 -Allan On Sat, Jan 24, 2009 at 2:40 AM, Ben Clifford wrote: > > On Sat, 24 Jan 2009, Allan Espinosa wrote: > >> I am running swift 0.8rc1 > [...] >> Failed to start coaster resource on osg-edu.cs.wisc.edu > > Have coasters worked for you on that site recently with a different Swift > version? If so which one? 
>
> --
>
>
>

From aespinosa at cs.uchicago.edu Sat Jan 24 17:03:42 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Sat, 24 Jan 2009 17:03:42 -0600
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
Message-ID: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>

Hi,

I am using swift 0.8rc1. The same also happens with v0.7.

I tried submitting a job from communicado to tp-grid1 (teraport) using
coasters. The swift runtime does not give any error, but it does not
finish either. Looking through the files received by the teraport head
node, I observed that swift keeps submitting gram jobs. It looks like
the submitted PBS scripts kept finishing / failing.

Digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we
see that maxwalltime became 101:00 from 00:10:00 (in sites.xml):

/usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl" "http://128.135.125.118:50001" "1728236079"

#! /bin/sh
# PBS batch job script built by Globus job manager
#
#PBS -S /bin/sh
#PBS -m n
#PBS -q fast
#PBS -l walltime=101:00
#PBS -o /dev/null
#PBS -e /dev/null
#PBS -l nodes=1
HOME="/home/aespinosa"; export HOME;
OSG_DATA="/gpfs1/osg/data";
...
...
counter=0
exit_code=0
while test $counter -lt 1; do
    /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
    read tmp_exit_code < /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
    if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
        exit_code=$tmp_exit_code
    fi
    counter=`expr $counter + 1`
done

exit $exit_code

qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max walltime requirement

Below is my sites.xml:

fast 00:10:00 /disks/tp-gpfs/scratch/aespinosa

This does not happen if I use "local:pbs" as the jobmanager for the
coaster; with that setting I was successful in running jobs.

-Allan

--
Allan M.
Espinosa
PhD student, Computer Science
University of Chicago

From benc at hawaga.org.uk Sun Jan 25 07:54:43 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 25 Jan 2009 13:54:43 +0000 (GMT)
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
In-Reply-To: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>
Message-ID:

Using coasters will cause job submissions with different walltimes than
your individual swift-level jobs.

Coaster workers get submitted with a longer walltime than the jobs you are
trying to send through. This is intended to result in coaster workers that
will run long enough to run many jobs.

At the moment, this is not very configurable. In the source code,

provider-coaster//src/org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.java

contains these fragments:

    public static final Seconds TIME_RESERVE = new Seconds(60);
    public static final int OVERALLOCATION_FACTOR = 10;
    startWorker(new Seconds(req.maxWallTime.getSeconds())
        .multiply(OVERALLOCATION_FACTOR)
        .add(TIME_RESERVE), req.prototype);

so whatever your maxwalltime is, you'll get coaster workers submitted with
ten times that plus one minute.

The coaster workers don't enforce job maxwalltimes, so you can work around
this by making the job maxwalltimes small enough so that 10*that+60s fits
inside the queue maximum wall time, even if that is actually too small for
your jobs.

You should see the same behaviour using local:pbs, which will use direct
PBS submission instead of GRAM; but you don't. That is an inconsistency
that suggests something is not right. My initial suspicion would be that
the cog PBS provider is not correctly passing either the walltime or queue
parameters. I will investigate this.
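
As a worked check of the arithmetic described above, here is a minimal Python sketch (not the coaster code itself). The two constants mirror the values quoted from WorkerManager.java; the function names are made up for illustration, and reading the PBS value walltime=101:00 as minutes:seconds is an assumption.

```python
# Constants mirroring the fragments quoted from WorkerManager.java.
OVERALLOCATION_FACTOR = 10
TIME_RESERVE_SECONDS = 60


def parse_hms(text):
    """Convert an hh:mm:ss string to seconds (hypothetical helper)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def worker_walltime_seconds(job_walltime_seconds):
    """Walltime requested for a coaster worker: ten times the job
    walltime plus a 60 second reserve."""
    return job_walltime_seconds * OVERALLOCATION_FACTOR + TIME_RESERVE_SECONDS


def largest_job_walltime_seconds(queue_limit_seconds):
    """Largest job walltime whose worker request still fits a queue limit."""
    return (queue_limit_seconds - TIME_RESERVE_SECONDS) // OVERALLOCATION_FACTOR


job = parse_hms("00:10:00")             # the maxwalltime from the sites.xml in this thread
worker = worker_walltime_seconds(job)   # 600 * 10 + 60 = 6060 seconds
print(divmod(worker, 60))               # (101, 0), i.e. 101 minutes and 0 seconds
```

The result agrees with the walltime=101:00 line in the generated PBS script. Going the other way, for a hypothetical queue limit of two hours (7200 s), largest_job_walltime_seconds(7200) gives 714 s, so a job maxwalltime just under 12 minutes would be the largest that keeps the worker request within that limit.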
Probably coasters should get another configuration option to allow the worker wall time to be more explicitly set, separately from job execution wall times - that makes sense for sites where parameters such as queue limits are well known by the user. Thanks for testing 0.8rc1. -- From aespinosa at cs.uchicago.edu Sun Jan 25 09:06:30 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Sun, 25 Jan 2009 09:06:30 -0600 Subject: [Swift-devel] swift changing walltime of prews-gram jobs In-Reply-To: References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> Message-ID: <50b07b4b0901250706g479afeacn2936ffc9b254db4a@mail.gmail.com> Or having different queue names for the coaster and workers. So for example, I indicate "long" queue for the site coaster and "fast" for the real jobs. -Allan On Sun, Jan 25, 2009 at 7:54 AM, Ben Clifford wrote: > > Using coasters will cause job submissions with different walltimes than > your individual swift-level jobs. > > Coaster workers get submitted with a longer walltime than the jobs you are > trying to send through. This is intended to result in coaster workers that > will run long enough to run many jobs. > > At the moment, this is not very configurable. In the source code, > > provider-coaster//src/org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.java > > contains these fragments: > > public static final Seconds TIME_RESERVE = new Seconds(60); > public static final int OVERALLOCATION_FACTOR = 10; > startWorker(new Seconds(req.maxWallTime.getSeconds()) > .multiply(OVERALLOCATION_FACTOR) > .add(TIME_RESERVE), req.prototype); > > > so whatever your maxwalltime is, you'll get coaster workers submitted with > ten times that plus one minute. > > The coaster workers don't enforce job maxwalltimes, so you can work around > this by making the job maxwalltimes small enough so that 10*that+60s fits > inside the queue maximum wall time, even if that is actually too small for > your jobs. 
> > You should see the same behaviour using local:pbs, which will use direct > PBS submission instead of GRAM; but you don't. That is an inconsistency > that suggests something is not right. My initial suspicion would be that > the cog PBS provider is not correctly passing either the walltime or queue > parameters. I will investigate this. > > Probably coasters should get another configuration option to allow the > worker wall time to be more explicitly set, separately from job execution > wall times - that makes sense for sites where parameters such as queue > limits are well known by the user. -- Allan M. Espinosa PhD student, Computer Science University of Chicago From benc at hawaga.org.uk Sun Jan 25 09:11:51 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 25 Jan 2009 15:11:51 +0000 (GMT) Subject: [Swift-devel] swift changing walltime of prews-gram jobs In-Reply-To: <50b07b4b0901250706g479afeacn2936ffc9b254db4a@mail.gmail.com> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <50b07b4b0901250706g479afeacn2936ffc9b254db4a@mail.gmail.com> Message-ID: On Sun, 25 Jan 2009, Allan Espinosa wrote: > Or having different queue names for the coaster and workers. So for > example, I indicate "long" queue for the site coaster and "fast" for > the real jobs. Not sure what you mean by the above. When using coasters, 'real jobs' don't go through the PBS queue at all - they go through the coaster queueing system. Specifying a queue when using coasters will affect where the coaster workers go (I think). 
-- From aespinosa at cs.uchicago.edu Sun Jan 25 09:33:36 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Sun, 25 Jan 2009 09:33:36 -0600 Subject: [Swift-devel] swift changing walltime of prews-gram jobs In-Reply-To: References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <50b07b4b0901250706g479afeacn2936ffc9b254db4a@mail.gmail.com> Message-ID: <50b07b4b0901250733r7f64d701w7aae6a86652d15a0@mail.gmail.com> Oh right. The coaster service on the site runs over fork and the workers over LRM right? so just one queue is needed to be specified. -Allan On Sun, Jan 25, 2009 at 9:11 AM, Ben Clifford wrote: > > Not sure what you mean by the above. > > When using coasters, 'real jobs' don't go through the PBS queue at all - > they go through the coaster queueing system. > > Specifying a queue when using coasters will affect where the coaster > workers go (I think). From benc at hawaga.org.uk Sun Jan 25 09:34:37 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 25 Jan 2009 15:34:37 +0000 (GMT) Subject: [Swift-devel] swift changing walltime of prews-gram jobs In-Reply-To: <50b07b4b0901250733r7f64d701w7aae6a86652d15a0@mail.gmail.com> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <50b07b4b0901250706g479afeacn2936ffc9b254db4a@mail.gmail.com> <50b07b4b0901250733r7f64d701w7aae6a86652d15a0@mail.gmail.com> Message-ID: On Sun, 25 Jan 2009, Allan Espinosa wrote: > Oh right. The coaster service on the site runs over fork and the > workers over LRM right? so just one queue is needed to be specified. yes. -- From benc at hawaga.org.uk Mon Jan 26 10:10:05 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Jan 2009 16:10:05 +0000 (GMT) Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site Message-ID: Coasters in the release don't work on RENCI Engage. I fiddled with this a bit before, and just fiddled with it a bit more. 
The external IP address of the cluster head node (152.54.1.231) is not
accessible from the cluster worker nodes, which sit on a different
network.

The headnode *is* accessible from its IP address on that network,
192.168.1.11.

Forcing the URI passed to workers to use that IP address instead of the
automatically determined one is sufficient to make coasters work on the
RENCI Engage site.

The hack I made in my local install to test that is shown below for
interest.

I can't see an easy way to automatically determine what this address
should be in the general case. It might be useful to have a configuration
parameter to allow it to be specified.

--- cog.orig/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.java 2008-11-04 17:28:42.000000000 +0000
+++ cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.java 2009-01-26 15:51:44.000000000 +0000
@@ -14,6 +14,7 @@
 import java.io.IOException;
 import java.io.InputStream;
 import java.net.URI;
+import java.net.URISyntaxException;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collection;
@@ -234,7 +235,18 @@
 JobSpecification js = new JobSpecificationImpl();
 js.setExecutable("/usr/bin/perl");
 js.addArgument(script.getAbsolutePath());
- js.addArgument(callbackURI.toString());
+try {
+ logger.warn("original callback URI is "+callbackURI.toString());
+ URI internalcallbackURI=new URI(callbackURI.getScheme(),
+ callbackURI.getUserInfo(),
+ "192.168.1.11",
+ callbackURI.getPort(), callbackURI.getPath(),
+ callbackURI.getQuery(), callbackURI.getFragment());
+ logger.warn("internal callback URI is "+internalcallbackURI.toString());
+ js.addArgument(internalcallbackURI.toString());
+} catch(URISyntaxException use) { throw new RuntimeException(use); }
+// js.addArgument(callbackURI.toString());
+
 // js.addArgument(id);
 return js;
 }

--
From iraicu at
cs.uchicago.edu (Ioan Raicu) Date: Mon, 26 Jan 2009 10:14:03 -0600 Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site In-Reply-To: References: Message-ID: <497DE14B.7080708@cs.uchicago.edu> We ran into a similar problem with Falkon on the Blue Gene/P, where the automatic address picked up by the Falkon service wasn't the right one in a multi-homed machine. We ended up adding an override mechanism to let the user specify the right IP address. Ioan Ben Clifford wrote: > Coasters in the release don't work on RENCI Engage. > > I fiddled with this a bit before, and just fiddled with it a bit more. > > The external IP address of the cluster head node (152.54.1.231) is not > accessible from the cluster worker nodes, which sit on a different > network. > > The headnode *is* accessible from its IP address on that network, > 192.168.1.11. > > Forcing the URI passed to workers to use that IP address instead of the > automatically determined one is sufficient to make coasters work on the > RENCI Engage site. > > The hack I made in my local install to test that is shown below for > interest. > > I can't see an easy way to automatically determine what this address > should be in the general case. It might be useful to have a configuration > parameter to allow it to be specified. 
> > --- > cog.orig/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.java > 2008-11-04 17:28:42.000000000 +0000 > +++ > cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.java > 2009-01-26 15:51:44.000000000 +0000 > @@ -14,6 +14,7 @@ > import java.io.IOException; > import java.io.InputStream; > import java.net.URI; > +import java.net.URISyntaxException; > import java.util.ArrayList; > import java.util.Arrays; > import java.util.Collection; > @@ -234,7 +235,18 @@ > JobSpecification js = new JobSpecificationImpl(); > js.setExecutable("/usr/bin/perl"); > js.addArgument(script.getAbsolutePath()); > - js.addArgument(callbackURI.toString()); > +try { > + logger.warn("original callback URI is "+callbackURI.toString()); > + URI internalcallbackURI=new URI(callbackURI.getScheme(), > + callbackURI.getUserInfo(), > + "192.168.1.11", > + callbackURI.getPort(), callbackURI.getPath(), > + callbackURI.getQuery(), callbackURI.getFragment()); > + logger.warn("internal callback URI is > "+internalcallbackURI.toString()); > + js.addArgument(internalcallbackURI.toString()); > +} catch(URISyntaxException use) { throw new RuntimeException(use); } > +// js.addArgument(callbackURI.toString()); > + > // js.addArgument(id); > return js; > } > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Jan 26 10:26:40 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Jan 2009 10:26:40 -0600 Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site In-Reply-To: References: Message-ID: <1232987200.15406.2.camel@localhost> I think a site attribute is the place where this should go. I can do that or you can. On Mon, 2009-01-26 at 16:10 +0000, Ben Clifford wrote: > Coasters in the release don't work on RENCI Engage. > > I fiddled with this a bit before, and just fiddled with it a bit more. > > The external IP address of the cluster head node (152.54.1.231) is not > accessible from the cluster worker nodes, which sit on a different > network. > > The headnode *is* accessible from its IP address on that network, > 192.168.1.11. > > Forcing the URI passed to workers to use that IP address instead of the > automatically determined one is sufficient to make coasters work on the > RENCI Engage site. > > The hack I made in my local install to test that is shown below for > interest. > > I can't see an easy way to automatically determine what this address > should be in the general case. It might be useful to have a configuration > parameter to allow it to be specified. 
> > --- > cog.orig/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.java > 2008-11-04 17:28:42.000000000 +0000 > +++ > cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/WorkerManager.java > 2009-01-26 15:51:44.000000000 +0000 > @@ -14,6 +14,7 @@ > import java.io.IOException; > import java.io.InputStream; > import java.net.URI; > +import java.net.URISyntaxException; > import java.util.ArrayList; > import java.util.Arrays; > import java.util.Collection; > @@ -234,7 +235,18 @@ > JobSpecification js = new JobSpecificationImpl(); > js.setExecutable("/usr/bin/perl"); > js.addArgument(script.getAbsolutePath()); > - js.addArgument(callbackURI.toString()); > +try { > + logger.warn("original callback URI is "+callbackURI.toString()); > + URI internalcallbackURI=new URI(callbackURI.getScheme(), > + callbackURI.getUserInfo(), > + "192.168.1.11", > + callbackURI.getPort(), callbackURI.getPath(), > + callbackURI.getQuery(), callbackURI.getFragment()); > + logger.warn("internal callback URI is > "+internalcallbackURI.toString()); > + js.addArgument(internalcallbackURI.toString()); > +} catch(URISyntaxException use) { throw new RuntimeException(use); } > +// js.addArgument(callbackURI.toString()); > + > // js.addArgument(id); > return js; > } > > From benc at hawaga.org.uk Mon Jan 26 10:35:37 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 26 Jan 2009 16:35:37 +0000 (GMT) Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site In-Reply-To: <1232987200.15406.2.camel@localhost> References: <1232987200.15406.2.camel@localhost> Message-ID: On Mon, 26 Jan 2009, Mihael Hategan wrote: > I think a site attribute is the place where this should go. right. I'll add it. 
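As a rough sketch of what a configurable version of the hack might look like: the helper below rewrites only the host part of the callback URI when an override is supplied, and otherwise leaves the URI alone. The class name, method name, and the idea of passing the override as a plain string are assumptions for illustration; how a site attribute would actually be read and named in Swift is not shown here. The port in the example URI is made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch only; names are hypothetical, not CoG/Swift API.
class CallbackHostOverrideSketch {
    // Returns the callback URI with its host replaced by overrideHost,
    // or the original URI when no override is configured.
    static URI withHost(URI callback, String overrideHost) {
        if (overrideHost == null || overrideHost.isEmpty()) {
            return callback;
        }
        try {
            // Same 7-argument java.net.URI constructor used in the patch above.
            return new URI(callback.getScheme(), callback.getUserInfo(),
                    overrideHost, callback.getPort(), callback.getPath(),
                    callback.getQuery(), callback.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("bad override host: " + overrideHost, e);
        }
    }

    public static void main(String[] args) {
        URI original = URI.create("http://152.54.1.231:50001");
        System.out.println(withHost(original, "192.168.1.11")); // http://192.168.1.11:50001
        System.out.println(withHost(original, null));           // http://152.54.1.231:50001
    }
}
```

Keeping the no-override path as a plain pass-through means sites that already work are unaffected.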
--
From rynge at renci.org Mon Jan 26 11:28:26 2009
From: rynge at renci.org (Mats Rynge)
Date: Mon, 26 Jan 2009 12:28:26 -0500
Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site
In-Reply-To:
References:
Message-ID: <497DF2BA.2030009@renci.org>

Ben Clifford wrote:
> Coasters in the release don't work on RENCI Engage.
>
> I fiddled with this a bit before, and just fiddled with it a bit more.
>
> The external IP address of the cluster head node (152.54.1.231) is not
> accessible from the cluster worker nodes, which sit on a different
> network.
>
> The headnode *is* accessible from its IP address on that network,
> 192.168.1.11.
>
> Forcing the URI passed to workers to use that IP address instead of the
> automatically determined one is sufficient to make coasters work on the
> RENCI Engage site.

I'm surprised that the private address was not used already. I mean,
isn't this the use case for putting the coaster forwarder on the
headnode in the first place? If you have outbound network connectivity,
the coaster daemons in the worker nodes could connect directly to the
host where swift is running on.

--
Mats Rynge
Renaissance Computing Institute

From benc at hawaga.org.uk Mon Jan 26 11:34:58 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 26 Jan 2009 17:34:58 +0000 (GMT)
Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site
In-Reply-To: <497DF2BA.2030009@renci.org>
References: <497DF2BA.2030009@renci.org>
Message-ID:

On Mon, 26 Jan 2009, Mats Rynge wrote:

> I'm surprised that the private address was not used already.

It uses whatever the underlying stack tells it, which I think most likely
is what the java.net.* library returns.

Even with a better API, it's not clear to me what the right way is to
determine which IP address goes where - a simple heuristic of 'RFC1918
address = internal, non-RFC1918 address = external' will probably cover
most cases, but not all.
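The 'RFC1918 address = internal' heuristic mentioned above can be sketched in a few lines. This is only an illustration of the heuristic, not something the coaster code does; the class and method names are made up. It relies on `InetAddress.isSiteLocalAddress()`, which for IPv4 addresses is true for the 10/8, 172.16/12 and 192.168/16 ranges.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative sketch only; not part of CoG/Swift.
class Rfc1918Sketch {
    // True when the given IPv4 literal falls in one of the RFC 1918 private ranges.
    static boolean looksInternal(String ipLiteral) {
        try {
            // For a literal IP address getByName only parses; it does no DNS lookup.
            return InetAddress.getByName(ipLiteral).isSiteLocalAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + ipLiteral, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(looksInternal("192.168.1.11")); // true  (head node, cluster side)
        System.out.println(looksInternal("152.54.1.231")); // false (head node, public side)
    }
}
```

As noted in the message, a classification like this says nothing about whether a given address is actually reachable from the worker nodes.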
--
From hategan at mcs.anl.gov Mon Jan 26 12:33:49 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Jan 2009 12:33:49 -0600
Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site
In-Reply-To: <497DF2BA.2030009@renci.org>
References: <497DF2BA.2030009@renci.org>
Message-ID: <1232994829.19371.4.camel@localhost>

On Mon, 2009-01-26 at 12:28 -0500, Mats Rynge wrote:
> Ben Clifford wrote:
> > Coasters in the release don't work on RENCI Engage.
> >
> > I fiddled with this a bit before, and just fiddled with it a bit more.
> >
> > The external IP address of the cluster head node (152.54.1.231) is not
> > accessible from the cluster worker nodes, which sit on a different
> > network.
> >
> > The headnode *is* accessible from its IP address on that network,
> > 192.168.1.11.
> >
> > Forcing the URI passed to workers to use that IP address instead of the
> > automatically determined one is sufficient to make coasters work on the
> > RENCI Engage site.
>
> I'm surprised that the private address was not used already. I mean,
> isn't this the use case for putting the coaster forwarder on the
> headnode in the first place? If you have outbound network connectivity,
> the coaster daemons in the worker nodes could connect directly to the
> host where swift is running on.

Automatically figuring out which eth should be used is a tricky thing
given that cluster configurations don't seem to follow some
pre-established rule. Each site does things its own way, though
most seem to have the head node accessible internally through the public
head node address.

There is a performance and reliability advantage to using a "proxy"
instead of going back directly to the submit site. It's easier to
achieve scalability with a hierarchical approach.
From hategan at mcs.anl.gov Mon Jan 26 12:39:51 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Jan 2009 12:39:51 -0600 Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site In-Reply-To: References: <497DF2BA.2030009@renci.org> Message-ID: <1232995191.19371.11.camel@localhost> On Mon, 2009-01-26 at 17:34 +0000, Ben Clifford wrote: > On Mon, 26 Jan 2009, Mats Rynge wrote: > > > I'm surprised that the private address was not used already. > > It uses whatever the underlying stack tells it, which I think most likely > is what the java.net.* library returns. As far as I can remember from coding it, the coaster service explicitly uses the GLOBUS_HOSTNAME that the service is started with for the callback address. > > Even with a better API, its not clear to me what is the right way to > determine which IP address goes where - a simple heuristic of 'RFC1918 > address = internal, non-RFC1918 address = external' will probably cover > most cases, but not all. > We could test this, but I suspect that given one of the two strategies, there will be some sites for which things won't work with it. From aespinosa at cs.uchicago.edu Mon Jan 26 12:39:56 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 26 Jan 2009 12:39:56 -0600 Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site In-Reply-To: <1232994829.19371.4.camel@localhost> References: <497DF2BA.2030009@renci.org> <1232994829.19371.4.camel@localhost> Message-ID: <50b07b4b0901261039p79925c91t2fa15115b0e442e5@mail.gmail.com> So in the current setup, the coaster service only binds to the public address of the head node and not all the network devices? On Mon, Jan 26, 2009 at 12:33 PM, Mihael Hategan wrote: > On Mon, 2009-01-26 at 12:28 -0500, Mats Rynge wrote: >> Ben Clifford wrote: >> > Coasters in the release don't work on RENCI Engage. >> > >> > I fiddled with this a bit before, and just fiddled with it a bit more. 
>> > >> > The external IP address of the cluster head node (152.54.1.231) is not >> > accessible from the cluster worker nodes, which sit on a different >> > network. >> > >> > The headnode *is* accessible from its IP address on that network, >> > 192.168.1.11. >> > >> > Forcing the URI passed to workers to use that IP address instead of the >> > automatically determined one is sufficient to make coasters work on the >> > RENCI Engage site. >> >> I'm surprised that the private address was not used already. I mean, >> isn't this the use case for putting the coaster forwarder on the >> headnode in the first place? If you have outbound network connectivity, >> the coaster daemons in the worker nodes could connect directly to the >> host where swift is running on. > > Automatically figuring out which eth should be used is a tricky thing > given that cluster configurations don't seem to follow some > pre-established rule. Each sites does stuff in their own way, though > most seem to have the head node accessible internally through the public > head node address. > From hategan at mcs.anl.gov Mon Jan 26 12:43:13 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Jan 2009 12:43:13 -0600 Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site In-Reply-To: <50b07b4b0901261039p79925c91t2fa15115b0e442e5@mail.gmail.com> References: <497DF2BA.2030009@renci.org> <1232994829.19371.4.camel@localhost> <50b07b4b0901261039p79925c91t2fa15115b0e442e5@mail.gmail.com> Message-ID: <1232995393.19371.14.camel@localhost> On Mon, 2009-01-26 at 12:39 -0600, Allan Espinosa wrote: > So in the current setup, the coaster service only binds to the public > address of the head node and not all the network devices? No. It binds to 0.0.0.0. But it tells the workers to contact the service on external.ip.address. Though it seems like binding only to the external eth won't make any difference in this case. 
From benc at hawaga.org.uk Mon Jan 26 12:43:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 26 Jan 2009 18:43:10 +0000 (GMT)
Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site
In-Reply-To: <50b07b4b0901261039p79925c91t2fa15115b0e442e5@mail.gmail.com>
References: <497DF2BA.2030009@renci.org> <1232994829.19371.4.camel@localhost> <50b07b4b0901261039p79925c91t2fa15115b0e442e5@mail.gmail.com>
Message-ID:

On Mon, 26 Jan 2009, Allan Espinosa wrote:

> So in the current setup, the coaster service only binds to the public
> address of the head node and not all the network devices?

The problem I had is not where it is binding to, but where it is
telling workers to connect back to.

Head node launches a worker, telling it its own IP address[*].

Worker starts and connects to the IP address of the head node that it has
been told, in order to start receiving work.

In the RENCI Engage case, the head node is telling the workers the wrong
IP address, in as much as it is an IP address that is inaccessible from
workers.

--
From benc at hawaga.org.uk Mon Jan 26 12:44:22 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 26 Jan 2009 18:44:22 +0000 (GMT)
Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site
In-Reply-To: <1232995191.19371.11.camel@localhost>
References: <497DF2BA.2030009@renci.org> <1232995191.19371.11.camel@localhost>
Message-ID:

On Mon, 26 Jan 2009, Mihael Hategan wrote:

> We could test this, but I suspect that given one of the two strategies,
> there will be some sites for which things won't work with it.

right. pretty much nothing automatic works on my laptop which has a
regularly varying IP address on one interface, and a bunch of VMware fake
interfaces with RFC1918 addresses...
-- From hategan at mcs.anl.gov Mon Jan 26 12:49:53 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 26 Jan 2009 12:49:53 -0600 Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site In-Reply-To: References: <497DF2BA.2030009@renci.org> <1232995191.19371.11.camel@localhost> Message-ID: <1232995793.19371.18.camel@localhost> On Mon, 2009-01-26 at 18:44 +0000, Ben Clifford wrote: > On Mon, 26 Jan 2009, Mihael Hategan wrote: > > > We could test this, but I suspect that given one of the two strategies, > > there will be some sites for which things won't work with it. > > right. pretty much nothing automatic works on my laptop which has > regularly varying IP address on one interface, and a bunch of VMware fake > interfaces with RFC1918 addresses... > However... We could pass a list of addresses to the worker which could be tried in order (like telnet would do if multiple dns entries are there for a host). From iraicu at cs.uchicago.edu Mon Jan 26 15:23:34 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 26 Jan 2009 15:23:34 -0600 Subject: [Swift-devel] CFP: 3rd International Workshop on Virtualization Technologies in Distributed Computing (VTDC-09) Message-ID: <497E29D6.5030507@cs.uchicago.edu> ======================================================================== Call for Papers 3rd International Workshop on Virtualization Technologies in Distributed Computing (VTDC-09) http://grid-appliance.org/vtdc09 In conjunction with ICAC 2009 Barcelona, Spain, June 15 2009 ======================================================================== Workshop scope: --------------- Virtualization has proven to be a powerful enabler in the field of distributed computing and has led to the emergence of the cloud computing paradigm and the provisioning of Infrastructure-as-a-Service (IaaS). 
This new paradigm raises challenges ranging from performance evaluation of IaaS platforms, through new methods of resource management including providing Service Level Agreements (SLAs) and energy- and cost-efficient schedules, to the emergence of supporting technologies such as virtual appliance management. For the last three years, the VTDC workshop has served as a forum for the exchange of ideas and experiences studying the challenges and opportunities created by IaaS/cloud computing and virtualization technologies. VTDC brings together researchers in academia and industry who are involved in research, development and planning activities involving the use of virtualization in the context of distributed systems, where the opportunities and challenges with respect to the management of such virtualized systems is of interest to the ICAC community at large. Important dates: ---------------- * Submission deadline: February 20th, 2009 * Notification of acceptance: March 23rd, 2009 * Final manuscripts due: April 6, 2009 * Workshop: June 15, 2009 Topics: ------ Authors are invited to submit original and unpublished work that exposes a new problem, advocates a specific solution, or reports on actual experience. Papers should be submitted as full-length 8 page papers of double column text using single space 10pt size type on an 8.5x11 paper. Papers will be published in the proceedings of the workshop. 
VTDC 2009 topics of interest include, but are not limited to: * Infrastructure as a service (IaaS) * Virtualization in data centers * Virtualization for resource management and QoS assurance * Security aspects of using virtualization in a distributed environment * Virtual networks * Virtual data, storage as a service * Fault tolerance in virtualized environments * Virtualization in P2P systems * Virtualization-based adaptive/autonomic systems * The creation and management of environments/appliances * Virtualization technologies * Performance modeling (applications and systems) * Virtualization techniques for energy/thermal management * Case studies of applications on IaaS platforms * Deployment studies of virtualization technologies * Tools relevant to virtualization Organization: ------------- * General Chair: o Kate Keahey, University of Chicago, Argonne National Laboratory * Program Chair: o Renato Figueiredo, University of Florida * Steering Committee Chair: o Jose A. B. Fortes, University of Florida * Program Committee: o Franck Cappello, INRIA o Jeff Chase, Duke University o Peter Dinda, Northwestern University o Ian Foster, University of Chicago, Argonne National Laboratory o Dennis Gannon, Microsoft Research o Sebastien Goasguen, Clemson University o Sverre Jarp, CERN o John Lange, Northwestern University o Matei Ripeanu, University of British Columbia o Paul Ruth, University of Mississippi o Kyung Ryu, IBM o Chris Samuel, Victorian Partnership for Advanced Computing o Frank Siebenlist, Argonne National Laboratory o Dilma da Silva, IBM o Mike Wray, HP o Dongyan Xu, Purdue University o Mazin Yousif, Avirtec o Ming Zhao, Florida International University * Publicity Chair: o Ming Zhao, Florida International University For more information: --------------------- VTDC-09 Web site: http://grid-appliance.org/vtdc09 ICAC-09 Web site: http://icac2009.acis.ufl.edu From wilde at mcs.anl.gov Tue Jan 27 12:54:28 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 
27 Jan 2009 12:54:28 -0600
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
In-Reply-To:
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <50b07b4b0901250706g479afeacn2936ffc9b254db4a@mail.gmail.com> <50b07b4b0901250733r7f64d701w7aae6a86652d15a0@mail.gmail.com>
Message-ID: <497F5864.6060509@mcs.anl.gov>

I'm trying to duplicate Allan's success with coasters using the
local:pbs configuration on TeraPort.

(I'm trying local:pbs because my gt2:gt2:pbs coaster jobs are also
failing; am still debugging those and will send separate email on them).

I'm running 0.8rc1, submitting from tp-login to TeraPort.

This combination seems to still get the walltime from the Globus profile
options, but it seems to be putting the time estimate in the seconds
portion of the PBS request instead of the minutes, so my jobs are dying
when PBS reports walltime exceeded (4 jobs, needing 30-60 seconds each).

My sites.xml is: fast 00:05:00 /home/wilde/swiftwork

But PBS qstat -f shows:

Resource_List.walltime = 00:00:51

(full qstat -f below)

And I get a walltime-exceeded email message from PBS for each pbs
coaster job submitted.

The coaster log shows:

attr=maxwalltime=00:05:00

When I change the sites.xml maxwalltime to "05:00:00" I do indeed get 50
minutes, and the entire script runs to completion.

So it seems to be placing the walltime request one unit to the right of
where it should.
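The numbers reported above are consistent with a minutes value ending up where a seconds value is expected; this is only one reading of the symptom, not a diagnosis of the cog PBS provider. The sketch below (hypothetical class and method names, not CoG code) formats a seconds count as HH:MM:SS and shows how a 51-minute worker request would come out as 00:00:51 if the minute count were formatted as though it were seconds.

```java
// Illustrative sketch only; not the cog PBS provider code.
class PbsWalltimeSketch {
    // Formats a duration given in seconds as HH:MM:SS.
    static String toHms(long totalSeconds) {
        long hours = totalSeconds / 3600;
        long minutes = (totalSeconds % 3600) / 60;
        long seconds = totalSeconds % 60;
        return String.format("%02d:%02d:%02d", hours, minutes, seconds);
    }

    public static void main(String[] args) {
        // maxwalltime 00:05:00 -> coaster worker request of 10 * 300 s + 60 s.
        long workerSeconds = 10 * 300 + 60;
        System.out.println(toHms(workerSeconds));      // 00:51:00 (expected request)
        // Passing the minute count (51) where seconds are expected
        // reproduces the value seen in qstat.
        System.out.println(toHms(workerSeconds / 60)); // 00:00:51
    }
}
```

The same reading fits the second observation: a maxwalltime of 05:00:00 gives 180060 seconds, or 3001 minutes, and 3001 treated as seconds is 00:50:01, roughly the 50 minutes reported.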
- Mike

Job Id: 848667.tp-mgt.ci.uchicago.edu
    Job_Name = null
    Job_Owner = wilde at tp-login2.ci.uchicago.edu
    job_state = R
    queue = fast
    server = tp-mgt.ci.uchicago.edu
    Checkpoint = u
    ctime = Tue Jan 27 12:22:21 2009
    Error_Path = tp-login2.ci.uchicago.edu:/home/wilde/.globus/scripts/pbs20627.qsub.stderr
    exec_host = tp-c118/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Tue Jan 27 12:22:24 2009
    Output_Path = tp-login2.ci.uchicago.edu:/home/wilde/.globus/scripts/pbs20627.qsub.stdout
    Priority = 0
    qtime = Tue Jan 27 12:22:21 2009
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:rhel4-compute
    Resource_List.walltime = 00:00:51
    session_id = 32414
    Shell_Path_List = /bin/sh
    Variable_List = PBS_O_HOME=/home/wilde,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=wilde,
        PBS_O_PATH=/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.5.0_06-sun-r1/bin:/soft/java-1.5.0_06-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-6.8.1-r1/bin:/soft/apache-ant-1.6.5-r1/bin:/software/common/cert-scripts-2-5.rev44-r1/bin:/soft/globus-4.0.3-r1/bin:/soft/globus-4.0.3-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel4-x86_64:/home/wilde/bin:/soft/R-2.4.0-r1/bin:/soft/R-2.4.0-r1/lib/R/bin:/soft/torque-2.3.3-r1/bin:/soft/maui-3.2.6p19-gcc-r1/bin:/soft/maui-3.2.6p19-gcc-r1/sbin:/soft/matlab-7.5-r1/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin,
        PBS_O_MAIL=/var/spool/mail/wilde,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=tp-login2.ci.uchicago.edu,
        PBS_O_HOST=tp-login2.ci.uchicago.edu,PBS_O_WORKDIR=/home/wilde,
        PBS_O_QUEUE=fast
    etime = Tue Jan 27 12:22:21 2009
    submit_args = /home/wilde/.globus/scripts/pbs20627.qsub
    start_time = Tue Jan 27 12:22:23 2009
    start_count = 1

tp$

On 1/25/09 9:34 AM, Ben Clifford wrote:
> On Sun, 25 Jan 2009, Allan Espinosa wrote:
>
>> Oh right. The coaster service on the site runs over fork and the
>> workers over LRM right? so just one queue is needed to be specified.
>
> yes.
>

From wilde at mcs.anl.gov Tue Jan 27 13:41:49 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 27 Jan 2009 13:41:49 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>
Message-ID: <497F637D.5080707@mcs.anl.gov>

Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs

I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs
mode.

I'm using 0.8rc1 and submitting from tp-login. I am running with a
DOEgrids cert in the OSG VO.

I *think* the issue is that when a gt2 job on this VO runs with a login
shell, it doesn't get java in its path. When I run /bin/sh *without* the
"-l" option, under globus, I do get a java in my path.

Allan: what VO did you run on when you got a successful gt2:gt2:pbs
coaster run on teraport, after you fixed the walltime issue?

It seems to me that this is a rough edge with coaster startup. Recall
that I had a similar problem running on abe last year: I had to edit out
the "-l" and create a custom .profile to get coasters to work.

It would be great if we can iron this out in 0.8 or soon after. I'm
willing to do some testing and enlist help from Allan and Zhengxiong for
wider testing.

Do we need special site attributes for specific sites to override
default behaviors when they don't work?
My sites.xml is:

fast
00:05:00
/gpfs1/osg/data/oops/swiftwork

I get this on stdout/err:

---------------------------------------------
Swift 0.8rc1 swift-r2448 cog-r2261

RunID: 20090127-1305-hcxdpor3
Progress:
Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1
Progress: Selecting site:2 Stage in:1 Submitting:1
Progress: Selecting site:2 Submitting:1 Submitted:1
Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a on teraport
Execution failed:
Exception in runoops:
Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]]
Host: teraport
Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j
stderr.txt:

stdout.txt:

----

Caused by:
Could not submit job
Caused by:
Could not start coaster service
Caused by:
Task ended before registration was received.
STDOUT: which: no java in (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
dirname: too few arguments
Try `dirname --help' for more information.
http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No such file or directory

STDERR: null
Cleaning up...
Done ------------------------------------ Checking out the environment with this cert I see: tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version' /bin/sh: java: command not found tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version' java version "1.5.0_14" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' JAVA_HOME IS: PATH IS: /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin /usr/bin/which: no java in (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) tp$ tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java JAVA_HOME IS: PATH IS: 
/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt /osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'java version "1.5.0_14" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) - Mike On 1/24/09 5:03 PM, Allan Espinosa wrote: > Hi, > > I am using swift0.8rc1. the same also happens to v0.7 > > I tried submitting a job from communicado to tp-grid1 (teraport) using > coasters. 
The swift runtime does not give any error but it does not > finish as well. Looking through the files received by the teraport > head node, i observed that swift keeps submitting gram jobs. It looks > like that the submitted pbs scripts kept finishing / failing. > > diging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we > see that maxwalltime become 101:00 from 00:10:00 (in sites.xml) > > /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl" > "http://128.135.125.118:50001" "1728236079" > #! /bin/sh > # PBS batch job script built by Globus job manager > # > #PBS -S /bin/sh > #PBS -m n > #PBS -q fast > #PBS -l walltime=101:00 > #PBS -o /dev/null > #PBS -e /dev/null > #PBS -l nodes=1 > HOME="/home/aespinosa"; > export HOME; > OSG_DATA="/gpfs1/osg/data"; > ... > ... > counter=0 > exit_code=0 > while test $counter -lt 1; do > /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter; > > read tmp_exit_code < > /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter > if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then > exit_code=$tmp_exit_code > fi > counter=`expr $counter + 1` > done > > exit $exit_code > qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max > walltime requirement > > > > Below is my sites.xml: > > > > > fast > 00:10:00 > storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4"> > > jobmanager="gt2:gt2:pbs" /> > > /disks/tp-gpfs/scratch/aespinosa > > > > > This does not happen if i use "local:pbs" as the jobmanager for the > coaster and was successful in running jobs > -Allan > > From wilde at mcs.anl.gov Tue Jan 27 14:14:35 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 27 Jan 2009 14:14:35 -0600 Subject: [Swift-devel] Coasters failing on Teraport - cant find Java? 
In-Reply-To: <497F637D.5080707@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov>
Message-ID: <497F6B2B.3030508@mcs.anl.gov>

Further info:

I don't see any .profile or shell .rc files in OSG, so I'm confused about how its environment is getting set up, unless softenv is doing it all and acting differently for a login shell and a non-login shell.

It seems backwards to me that (as in the previous email) the "-l" shell, which *should* do full initialization, is getting a smaller environment than the non-"-l" shell, which has tons of osg directories in its path, including java.

Running a globus fork job without a shell shows the full OSG PATH is set up (see printenv below).

Probably, because there are no .profile or shell .rc files, /bin/sh -l unsets the PATH that was set up by default.

Is globus doing some osg initialization when it launches jobs?

Can we have a per-site option to drop the "-l" when launching coasters?

Am I heading down the right path on this, or is the problem & solution elsewhere?

- Mike

tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'ls -ld $HOME/.*'
drwxr-xr-x 4 osg osgvo 28672 Jan 27 13:50 /home/osgvo/osg/.
drwxr-xr-x 40 root root 4096 May 19 2008 /home/osgvo/osg/..
drwx------ 4 osg osgvo 4096 Jun 12 2008 /home/osgvo/osg/.globus
-rw------- 1 osg osgvo 245 Jun 22 2008 /home/osgvo/osg/.soft
-rw-r--r-- 1 osg osgvo 9044 Jan 27 11:04 /home/osgvo/osg/.soft.cache.csh
-rw-r--r-- 1 osg osgvo 9193 Jan 27 11:04 /home/osgvo/osg/.soft.cache.sh
drwx------ 2 osg osgvo 4096 Jun 22 2008 /home/osgvo/osg/.ssh

tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'cat $HOME/.soft'
#
# This is your SoftEnv configuration run control file.
#
# It is used to tell SoftEnv how to customize your environment by
# setting up variables such as PATH and MANPATH. To learn more
# about this file, do a "man softenv".
# @default tp$ globus-job-run tp-grid1.ci.uchicago.edu /usr/bin/printenv PATH /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt /osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin tp$ On 1/27/09 1:41 PM, Michael Wilde wrote: > Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs > > I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs > mode. > > Im using 0.8rc1 and submitting from tp-login. > > I am running with a DOEgrids cert in the OSG VO. 
> > I *think* the issue is that when a gt2 jobs on this vo runs with a login > shell, it doesnt get java in its path. > > When I run /bin/sh *without* the "-l" option, under globus, I do get a > java in my path. > > Allan: what VO did you run on when you got a sucsessful gt2:gt2:pbs > coaster run on teraport, after you fixed the walltime issue? > > It seems to me that this is a rough edge with coaster startup. Recall > that I had a similar problem running on abe last year: I had to edit out > the "-l" and create a custom .profile to get coasters to work. > > It would be great if we can iron this out in 0.8 or soon after. I'm > willing to do some testing and enlist help from Allan and Zhengxiong for > wider testing. > > Do we need special site attributes for specific sites to override > default behaviors when they dont work? > > > My sites.xml is: > > > > fast > 00:05:00 > > url="tp-grid1.ci.uchicago.edu" > jobmanager="gt2:gt2:pbs" /> > /gpfs1/osg/data/oops/swiftwork > > > > I get this on stdout/err: > > --------------------------------------------- > Swift 0.8rc1 swift-r2448 cog-r2261 > > RunID: 20090127-1305-hcxdpor3 > Progress: > Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1 > Progress: Selecting site:2 Stage in:1 Submitting:1 > Progress: Selecting site:2 Submitting:1 Submitted:1 > Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a > on teraport > Execution failed: > Exception in runoops: > Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, > input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, > [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]] > Host: teraport > Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Could not submit job > Caused by: > Could not start coaster service > Caused by: > Task ended before registration was received. 
> STDOUT: which: no java in > (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) > > dirname: too few arguments > Try `dirname --help' for more information. > http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No > such file or directory > > STDERR: null > Cleaning up... > Done > > ------------------------------------ > > Checking out the environment with this cert I see: > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version' > /bin/sh: java: command not found > > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version' > java version "1.5.0_14" > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) > Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; > echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' > JAVA_HOME IS: > PATH IS: > /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin > > /usr/bin/which: no java in > 
(/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) > > tp$ > > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo > JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' > > /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java > JAVA_HOME IS: > PATH IS: > /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/o pt > > 
/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java > -version'java version "1.5.0_14" > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) > Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > > > - Mike > > > > > > On 1/24/09 5:03 PM, Allan Espinosa wrote: >> Hi, >> >> I am using swift0.8rc1. the same also happens to v0.7 >> >> I tried submitting a job from communicado to tp-grid1 (teraport) using >> coasters. The swift runtime does not give any error but it does not >> finish as well. Looking through the files received by the teraport >> head node, i observed that swift keeps submitting gram jobs. It looks >> like that the submitted pbs scripts kept finishing / failing. >> >> diging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we >> see that maxwalltime become 101:00 from 00:10:00 (in sites.xml) >> >> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl" >> "http://128.135.125.118:50001" "1728236079" >> #! /bin/sh >> # PBS batch job script built by Globus job manager >> # >> #PBS -S /bin/sh >> #PBS -m n >> #PBS -q fast >> #PBS -l walltime=101:00 >> #PBS -o /dev/null >> #PBS -e /dev/null >> #PBS -l nodes=1 >> HOME="/home/aespinosa"; >> export HOME; >> OSG_DATA="/gpfs1/osg/data"; >> ... >> ... 
>> counter=0
>> exit_code=0
>> while test $counter -lt 1; do
>> /bin/touch
>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
>>
>>
>> read tmp_exit_code <
>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
>>
>> if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
>> exit_code=$tmp_exit_code
>> fi
>> counter=`expr $counter + 1`
>> done
>>
>> exit $exit_code
>> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max
>> walltime requirement
>>
>>
>>
>> Below is my sites.xml:
>>
>>
>>
>>
>> fast
>> 00:10:00
>> > url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa"
>> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4">
>>
>> > jobmanager="gt2:gt2:pbs" />
>>
>> /disks/tp-gpfs/scratch/aespinosa
>>
>>
>>
>>
>> This does not happen if i use "local:pbs" as the jobmanager for the
>> coaster and was successful in running jobs
>> -Allan
>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From aespinosa at cs.uchicago.edu Tue Jan 27 15:05:57 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 27 Jan 2009 15:05:57 -0600
Subject: [Swift-devel] Re: Coasters failing on Teraport - cant find Java?
In-Reply-To: <497F637D.5080707@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov>
Message-ID: <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com>

Hi Mike,

I actually emailed Teraport support directly to add my DOEgrids DN to the gridmap file, so my jobs are actually being executed under my username (aespinosa).

As of now, I can only submit to OSG sites supporting the OSGEDU VO. If I remember correctly, we placed OSG as my VO when applying for the DOEgrids certificate. Then I just emailed Alina to include my DN in the OSGEDU VO member list.
I need to email and follow-up OSG operations in the status of my VO application. For the sites.xml, I think you need to specify the filesystem provider which sets up the environment for the coaster (based on what I understood from the documentation). Below is my sites.xml: fast 00:01:00 ***** /disks/tp-gpfs/scratch/aespinosa -Allan On Tue, Jan 27, 2009 at 1:41 PM, Michael Wilde wrote: > When I run /bin/sh *without* the "-l" option, under globus, I do get a java > in my path. > > Allan: what VO did you run on when you got a sucsessful gt2:gt2:pbs coaster > run on teraport, after you fixed the walltime issue? > > > My sites.xml is: > > > > fast > 00:05:00 > > url="tp-grid1.ci.uchicago.edu" > jobmanager="gt2:gt2:pbs" /> > /gpfs1/osg/data/oops/swiftwork > > > > I get this on stdout/err: > > --------------------------------------------- > Swift 0.8rc1 swift-r2448 cog-r2261 > > RunID: 20090127-1305-hcxdpor3 > Progress: > Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1 > Progress: Selecting site:2 Stage in:1 Submitting:1 > Progress: Selecting site:2 Submitting:1 Submitted:1 > Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a on > teraport > Execution failed: > Exception in runoops: > Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, > input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, [TEMP > UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]] > Host: teraport > Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Could not submit job > Caused by: > Could not start coaster service > Caused by: > Task ended before registration was received. 
> STDOUT: which: no java in > (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) > dirname: too few arguments > Try `dirname --help' for more information. > http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No such > file or directory > > STDERR: null > Cleaning up... > Done > > ------------------------------------ > > Checking out the environment with this cert I see: > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version' > /bin/sh: java: command not found > > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version' > java version "1.5.0_14" > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) > Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; echo > JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' > JAVA_HOME IS: > PATH IS: > /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin > /usr/bin/which: no java in > 
(/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) > tp$ > > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo > JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' > > /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java > JAVA_HOME IS: > PATH IS: > /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt > 
/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'java > version "1.5.0_14" > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) > Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > > > - Mike > > > > > > On 1/24/09 5:03 PM, Allan Espinosa wrote: >> >> Hi, >> >> I am using swift0.8rc1. the same also happens to v0.7 >> >> I tried submitting a job from communicado to tp-grid1 (teraport) using >> coasters. The swift runtime does not give any error but it does not >> finish as well. Looking through the files received by the teraport >> head node, i observed that swift keeps submitting gram jobs. It looks >> like that the submitted pbs scripts kept finishing / failing. >> >> diging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we >> see that maxwalltime become 101:00 from 00:10:00 (in sites.xml) >> >> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl" >> "http://128.135.125.118:50001" "1728236079" >> #! /bin/sh >> # PBS batch job script built by Globus job manager >> # >> #PBS -S /bin/sh >> #PBS -m n >> #PBS -q fast >> #PBS -l walltime=101:00 >> #PBS -o /dev/null >> #PBS -e /dev/null >> #PBS -l nodes=1 >> HOME="/home/aespinosa"; >> export HOME; >> OSG_DATA="/gpfs1/osg/data"; >> ... >> ... 
>> counter=0
>> exit_code=0
>> while test $counter -lt 1; do
>> /bin/touch
>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
>>

From benc at hawaga.org.uk Tue Jan 27 15:13:37 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 27 Jan 2009 21:13:37 +0000 (GMT)
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
In-Reply-To: <497F5864.6060509@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <50b07b4b0901250706g479afeacn2936ffc9b254db4a@mail.gmail.com> <50b07b4b0901250733r7f64d701w7aae6a86652d15a0@mail.gmail.com> <497F5864.6060509@mcs.anl.gov>
Message-ID:

On Tue, 27 Jan 2009, Michael Wilde wrote:

> This combination seems to still get the walltime from the Globus profile
> options, but it seems to be putting the time estimate in the seconds portion
> of the PBS request instead of the minutes, so my jobs are dying on wall-time
> exceed (from pbs) (4 jobs, needing 30-60 seconds each).

You should not be seeing the walltime from tc.data level jobs being passed through to coaster worker jobs. Coaster worker jobs should be getting a different maxwalltime, as documented in previous emails: some multiple of a normal job's maxwalltime, plus a constant.

Based on previous emails from Allan, I'm suspicious about provider-pbs passing things through correctly (it's been a bug-ridden piece of shit when I've tried to use it). I'll get round to looking at that in the next few days, most likely.

Until then, the least mysterious path is probably to stick with gt2:gt2:pbs as your coaster jobmanager of choice.

--

From wilde at mcs.anl.gov Tue Jan 27 15:16:48 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 27 Jan 2009 15:16:48 -0600
Subject: [Swift-devel] Re: Coasters failing on Teraport - cant find Java?
In-Reply-To: <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com> Message-ID: <497F79C0.2050903@mcs.anl.gov> Thanks, Allan. So it would be interesting to probe an OSGEDU site with /bin/sh both with and without "-l" to see how the PATH is set there. Also, question for Ben/Mihael: for coasters, are the filesystem and gridftp tags meant to be mutually exclusive? I'll send you mail off-list about the certs. - Mike On 1/27/09 3:05 PM, Allan Espinosa wrote: > Hi Mike, > > I actually emailed directly Teraport support to add my DOEgrids DN to > the gridmap file. so my jobs are actually being executed under my > username (aespinosa). > > As of now, I can only submit to OSG sites supporting the OSGEDU VO. > If i remember correctly, we placed OSG as my VO when applying forthe > DOEgrids certificate. Then I just emailed Alina to include my DN in > the OSGEDU VO member list. I need to email and follow-up OSG > operations in the status of my VO application. > > For the sites.xml, I think you need to specify the filesystem provider > which sets up the environment for the coaster (based on what I > understood from the documentation). Below is my sites.xml: > > > > > fast > 00:01:00 > storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4"> > > jobmanager="gt2:gt2:pbs" /> > ***** > /disks/tp-gpfs/scratch/aespinosa > > > > > > > -Allan > > > > On Tue, Jan 27, 2009 at 1:41 PM, Michael Wilde wrote: >> When I run /bin/sh *without* the "-l" option, under globus, I do get a java >> in my path. >> >> Allan: what VO did you run on when you got a sucsessful gt2:gt2:pbs coaster >> run on teraport, after you fixed the walltime issue? 
>> >> >> My sites.xml is: >> >> >> >> fast >> 00:05:00 >> >> > url="tp-grid1.ci.uchicago.edu" >> jobmanager="gt2:gt2:pbs" /> >> /gpfs1/osg/data/oops/swiftwork >> >> >> >> I get this on stdout/err: >> >> --------------------------------------------- >> Swift 0.8rc1 swift-r2448 cog-r2261 >> >> RunID: 20090127-1305-hcxdpor3 >> Progress: >> Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1 >> Progress: Selecting site:2 Stage in:1 Submitting:1 >> Progress: Selecting site:2 Submitting:1 Submitted:1 >> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a on >> teraport >> Execution failed: >> Exception in runoops: >> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, >> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, [TEMP >> UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]] >> Host: teraport >> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Could not submit job >> Caused by: >> Could not start coaster service >> Caused by: >> Task ended before registration was received. >> STDOUT: which: no java in >> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) >> dirname: too few arguments >> Try `dirname --help' for more information. >> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No such >> file or directory >> >> STDERR: null >> Cleaning up... 
>> Done >> >> ------------------------------------ >> >> Checking out the environment with this cert I see: >> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version' >> /bin/sh: java: command not found >> >> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version' >> java version "1.5.0_14" >> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) >> >> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; echo >> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' >> JAVA_HOME IS: >> PATH IS: >> /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin >> /usr/bin/which: no java in >> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) >> tp$ >> >> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo >> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' >> >> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java >> JAVA_HOME IS: >> PATH IS: >> 
/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/ opt >> /osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'java >> version "1.5.0_14" >> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) >> >> >> - Mike >> >> >> >> >> >> On 1/24/09 5:03 PM, Allan Espinosa wrote: >>> Hi, >>> >>> I am using swift0.8rc1. 
the same also happens to v0.7 >>> >>> I tried submitting a job from communicado to tp-grid1 (teraport) using >>> coasters. The swift runtime does not give any error but it does not >>> finish as well. Looking through the files received by the teraport >>> head node, i observed that swift keeps submitting gram jobs. It looks >>> like that the submitted pbs scripts kept finishing / failing. >>> >>> diging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we >>> see that maxwalltime become 101:00 from 00:10:00 (in sites.xml) >>> >>> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl" >>> "http://128.135.125.118:50001" "1728236079" >>> #! /bin/sh >>> # PBS batch job script built by Globus job manager >>> # >>> #PBS -S /bin/sh >>> #PBS -m n >>> #PBS -q fast >>> #PBS -l walltime=101:00 >>> #PBS -o /dev/null >>> #PBS -e /dev/null >>> #PBS -l nodes=1 >>> HOME="/home/aespinosa"; >>> export HOME; >>> OSG_DATA="/gpfs1/osg/data"; >>> ... >>> ... >>> counter=0 >>> exit_code=0 >>> while test $counter -lt 1; do >>> /bin/touch >>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter; >>> From benc at hawaga.org.uk Tue Jan 27 15:16:39 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Jan 2009 21:16:39 +0000 (GMT) Subject: [Swift-devel] Re: Coasters failing on Teraport - cant find Java? In-Reply-To: <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com> Message-ID: On Tue, 27 Jan 2009, Allan Espinosa wrote: > For the sites.xml, I think you need to specify the filesystem provider > which sets up the environment for the coaster (based on what I > understood from the documentation). Below is my sites.xml: Don't specify <filesystem> and <gridftp> - they are different ways of expressing the same configuration.
On teraport, you should be fine using the <gridftp> line that you showed: > url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa" > storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4"> > Specifying <filesystem> and <gridftp> at the same time is an error, and you get a chocolate bar if you file a bugzilla report that Swift does not correctly detect this (seriously). -- From aespinosa at cs.uchicago.edu Tue Jan 27 15:21:10 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 27 Jan 2009 15:21:10 -0600 Subject: [Swift-devel] Re: Coasters failing on Teraport - cant find Java? In-Reply-To: References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com> Message-ID: <50b07b4b0901271321p8558840o528f5933b12d9460@mail.gmail.com> I guess I should file one to get a chocolate then. :) But from Section 20 of the documentation it's saying that I should specify it: The jobmanager string contains more detail than with other providers. It contains either two or three colon separated fields: 1: the provider to be used to execute the coaster head job - this provider will submit from the Swift client side environment. Commonly this will be one of the GRAM providers; 2: the provider to be used to execute coaster worker jobs. This provider will be used to submit from the coaster head job environment, so a local scheduler provider can sometimes be used instead of GRAM. 3: optionally, the jobmanager to be used when submitting worker job using the provider specified in field 2. To use for file transfer, specify a sites.xml filesystem element like this: The url parameter should be a pseudo-URI formed with the URI scheme being the name of the provider to use to submit the coaster head job, and the hostname portion being the hostname to be used to execute the coaster head job.
Note that this provider and hostname will be used for execution of a coaster head job, not for file transfer; so for example, a GRAM endpoint should be specified here rather than a GridFTP endpoint. On Tue, Jan 27, 2009 at 3:16 PM, Ben Clifford wrote: > > On Tue, 27 Jan 2009, Allan Espinosa wrote: > >> For the sites.xml, I think you need to specify the filesystem provider >> which sets up the environment for the coaster (based on what I >> understood from the documentation). Below is my sites.xml: > > Don't specify <filesystem> and <gridftp> - they are different ways of > expressing the same configuration. > > On teraport, you should be fine using the <gridftp> line that you showed: > >> > url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa" >> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4"> >> > > Specifying <filesystem> and <gridftp> at the same time is an error, and > you get a chocolate bar if you file a bugzilla report that Swift does not > correctly detect this (seriously). From benc at hawaga.org.uk Tue Jan 27 15:21:20 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Jan 2009 21:21:20 +0000 (GMT) Subject: [Swift-devel] Coasters failing on Teraport - cant find Java? In-Reply-To: <497F637D.5080707@mcs.anl.gov> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> Message-ID: On Tue, 27 Jan 2009, Michael Wilde wrote: > It would be great if we can iron this out in 0.8 or soon after. I'm willing to > do some testing and enlist help from Allan and Zhengxiong for wider testing. It won't be in 0.8, as this is, as far as I can tell, not a new bug: it was already present a couple of weeks ago. But it would be nice to get more people contributing experiences for 0.9. It should be straightforward to add configuration options for site initialisation commands, although not particularly straightforward to determine what the correct values for a site are automatically (which is a problem more for OSG than TeraGrid, I think).
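A probe along the lines Mike suggested (run /bin/sh on a site with and without -l and see whether java turns up) might look like the sketch below. The probe_site function and the RUNNER override are hypothetical names, not part of Swift; globus-job-run is invoked the same way as in the examples in this thread, and the result is read from the command's output rather than its exit status, on the assumption that the exit status may not make it back through GRAM.

```shell
# RUNNER is an assumption of this sketch: it defaults to globus-job-run, as used
# in this thread, but can be pointed at a stub to try the logic without a grid.
RUNNER=${RUNNER:-globus-job-run}

# Report, for one site, whether java is found by a plain shell and by a login shell.
probe_site() {
    site=$1
    for flags in "-c" "-l -c"; do
        # $flags is left unquoted on purpose so that "-l -c" becomes two arguments.
        out=$($RUNNER "$site" /bin/sh $flags 'command -v java || echo NOJAVA' 2>/dev/null)
        case "$out" in
            ""|*NOJAVA*) echo "$site: sh $flags: java not found" ;;
            *)           echo "$site: sh $flags: java at $out" ;;
        esac
    done
}
```

Calling probe_site tp-grid1.ci.uchicago.edu for each site of interest would then give two lines per site. An empty reply (for example, a failed submission) is reported as "java not found", so such cases would need a second look by hand.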
-- From benc at hawaga.org.uk Tue Jan 27 15:23:45 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Jan 2009 21:23:45 +0000 (GMT) Subject: [Swift-devel] Re: Coasters failing on Teraport - cant find Java? In-Reply-To: <50b07b4b0901271321p8558840o528f5933b12d9460@mail.gmail.com> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com> <50b07b4b0901271321p8558840o528f5933b12d9460@mail.gmail.com> Message-ID: > But from Section 20 of the documentation it's saying that I should specify it: ok. that's ambiguous. You can replace the gridftp mechanism of file transfer with one provided by coasters - that is, take away gridftp and use coasters instead for that. That is mostly orthogonal to using coasters for job submission. -- From wilde at mcs.anl.gov Tue Jan 27 15:25:13 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 27 Jan 2009 15:25:13 -0600 Subject: [Swift-devel] Re: Coasters failing on Teraport - cant find Java? In-Reply-To: References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com> Message-ID: <497F7BB9.3030105@mcs.anl.gov> I'll let Allan go for the chocolate bar :) To clarify further: with coasters, is a valid option? I.e., coasters has 3 valid data provider options: gridftp, local, and "filesystem" which uses the coasters themselves to move data? (Can you explain how the "filesystem" method works? I didn't find the user guide clear enough on that to help decide if/when to select this method) - Mike On 1/27/09 3:16 PM, Ben Clifford wrote: > On Tue, 27 Jan 2009, Allan Espinosa wrote: > >> For the sites.xml, I think you need to specify the filesystem provider >> which sets up the environment for the coaster (based on what I >> understood from the documentation).
Below is my sites.xml: > > Don't specify <filesystem> and <gridftp> - they are different ways of > expressing the same configuration. > > On teraport, you should be fine using the <gridftp> line that you showed: > >> > url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa" >> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4"> >> > > Specifying <filesystem> and <gridftp> at the same time is an error, and > you get a chocolate bar if you file a bugzilla report that Swift does not > correctly detect this (seriously). > From benc at hawaga.org.uk Tue Jan 27 15:29:52 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Jan 2009 21:29:52 +0000 (GMT) Subject: [Swift-devel] Re: Coasters failing on Teraport - cant find Java? In-Reply-To: <497F7BB9.3030105@mcs.anl.gov> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com> <497F7BB9.3030105@mcs.anl.gov> Message-ID: On Tue, 27 Jan 2009, Michael Wilde wrote: > To clarify further: with coasters, is a valid > option? > I.e., coasters has 3 valid data provider options: gridftp, local, and > "filesystem" which uses the coasters themselves to move data? use any one of those three. > (Can you explain how the "filesystem" method works?
I didn't find the > user guide clear enough on that to help decide if/when to select this > method) <gridftp> is pretty much a synonym for a <filesystem> entry. So then consider the above three as different mechanisms using <filesystem>, with:
provider="gt2" - use gridftp to transfer the file from where the Swift command line client is running to the remote location
provider="local" - use posix filesystem access to transfer the file, which only works where you have posix access to the 'remote' site shared directory
provider="coaster" - execute a remote file server process (being part of coaster) and through that access files on the remote site
-- From benc at hawaga.org.uk Tue Jan 27 15:32:58 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 27 Jan 2009 21:32:58 +0000 (GMT) Subject: [Swift-devel] Re: Coasters failing on Teraport - cant find Java? In-Reply-To: <497F79C0.2050903@mcs.anl.gov> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <50b07b4b0901271305j24c8ef13q3a60c064d61a5995@mail.gmail.com> <497F79C0.2050903@mcs.anl.gov> Message-ID: > Also, question for Ben/Mihael: for coasters, are the filesystem and gridftp > tags meant to be mutually exclusive? yes. There should be better error reporting both for using them together and also for specifying either of them more than once in a site entry. (chocolate bar rule applies) -- From bugzilla-daemon at mcs.anl.gov Tue Jan 27 15:37:32 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 27 Jan 2009 15:37:32 -0600 (CST) Subject: [Swift-devel] [Bug 172] New: filesystem and gridftp element in the same pool. Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=172 Summary: filesystem and gridftp element in the same pool.
Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Specific site issues AssignedTo: benc at hawaga.org.uk ReportedBy: aespinosa at cs.uchicago.edu For coasters, Sec 20 of the documentation says that the coaster provider in the filesystem tag is used to describe how coasters are executed on a resource. But looking at how filesystem and gridftp should function in general, swift should detect that these two elements are mutually exclusive. Thus <filesystem> + <gridftp> should be detected by swift as a misconfiguration of sites.xml -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From hategan at mcs.anl.gov Tue Jan 27 21:33:31 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Jan 2009 21:33:31 -0600 Subject: [Swift-devel] Coasters failing on Teraport - cant find Java? In-Reply-To: <497F637D.5080707@mcs.anl.gov> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> Message-ID: <1233113611.2159.25.camel@localhost> Hmm. Looks like -l has the opposite effect of what I thought it should do (end up with an environment equivalent to the one you get when you log in as an interactive session). Is it my misunderstanding or something else? On Tue, 2009-01-27 at 13:41 -0600, Michael Wilde wrote: > Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs > > I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs > mode. > > Im using 0.8rc1 and submitting from tp-login. > > I am running with a DOEgrids cert in the OSG VO. > > I *think* the issue is that when a gt2 jobs on this vo runs with a login > shell, it doesnt get java in its path. > > When I run /bin/sh *without* the "-l" option, under globus, I do get a > java in my path.
> > Allan: what VO did you run on when you got a sucsessful gt2:gt2:pbs > coaster run on teraport, after you fixed the walltime issue? > > It seems to me that this is a rough edge with coaster startup. Recall > that I had a similar problem running on abe last year: I had to edit out > the "-l" and create a custom .profile to get coasters to work. > > It would be great if we can iron this out in 0.8 or soon after. I'm > willing to do some testing and enlist help from Allan and Zhengxiong for > wider testing. > > Do we need special site attributes for specific sites to override > default behaviors when they dont work? > > > My sites.xml is: > > > > fast > 00:05:00 > > url="tp-grid1.ci.uchicago.edu" > jobmanager="gt2:gt2:pbs" /> > /gpfs1/osg/data/oops/swiftwork > > > > I get this on stdout/err: > > --------------------------------------------- > Swift 0.8rc1 swift-r2448 cog-r2261 > > RunID: 20090127-1305-hcxdpor3 > Progress: > Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1 > Progress: Selecting site:2 Stage in:1 Submitting:1 > Progress: Selecting site:2 Submitting:1 Submitted:1 > Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a > on teraport > Execution failed: > Exception in runoops: > Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, > input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, > [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]] > Host: teraport > Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Could not submit job > Caused by: > Could not start coaster service > Caused by: > Task ended before registration was received. 
> STDOUT: which: no java in > (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) > dirname: too few arguments > Try `dirname --help' for more information. > http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No > such file or directory > > STDERR: null > Cleaning up... > Done > > ------------------------------------ > > Checking out the environment with this cert I see: > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version' > /bin/sh: java: command not found > > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version' > java version "1.5.0_14" > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) > Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; > echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' > JAVA_HOME IS: > PATH IS: > /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin > /usr/bin/which: no java in > 
(/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) > tp$ > > > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo > JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' > > /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java > JAVA_HOME IS: > PATH IS: > /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/o pt > 
/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin > tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java > -version'java version "1.5.0_14" > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) > Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > > > - Mike > > > > > > On 1/24/09 5:03 PM, Allan Espinosa wrote: > > Hi, > > > > I am using swift0.8rc1. the same also happens to v0.7 > > > > I tried submitting a job from communicado to tp-grid1 (teraport) using > > coasters. The swift runtime does not give any error but it does not > > finish as well. Looking through the files received by the teraport > > head node, i observed that swift keeps submitting gram jobs. It looks > > like that the submitted pbs scripts kept finishing / failing. > > > > diging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we > > see that maxwalltime become 101:00 from 00:10:00 (in sites.xml) > > > > /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl" > > "http://128.135.125.118:50001" "1728236079" > > #! /bin/sh > > # PBS batch job script built by Globus job manager > > # > > #PBS -S /bin/sh > > #PBS -m n > > #PBS -q fast > > #PBS -l walltime=101:00 > > #PBS -o /dev/null > > #PBS -e /dev/null > > #PBS -l nodes=1 > > HOME="/home/aespinosa"; > > export HOME; > > OSG_DATA="/gpfs1/osg/data"; > > ... > > ... 
> > counter=0 > > exit_code=0 > > while test $counter -lt 1; do > > /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter; > > > > read tmp_exit_code < > > /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter > > if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then > > exit_code=$tmp_exit_code > > fi > > counter=`expr $counter + 1` > > done > > > > exit $exit_code > > qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max > > walltime requirement > > > > > > > > Below is my sites.xml: > > > > > > > > > > fast > > 00:10:00 > > > storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4"> > > > > > jobmanager="gt2:gt2:pbs" /> > > > > /disks/tp-gpfs/scratch/aespinosa > > > > > > > > > > This does not happen if i use "local:pbs" as the jobmanager for the > > coaster and was successful in running jobs > > -Allan > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Jan 27 21:34:56 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Jan 2009 21:34:56 -0600 Subject: [Swift-devel] swift changing walltime of prews-gram jobs In-Reply-To: References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <50b07b4b0901250706g479afeacn2936ffc9b254db4a@mail.gmail.com> <50b07b4b0901250733r7f64d701w7aae6a86652d15a0@mail.gmail.com> <497F5864.6060509@mcs.anl.gov> Message-ID: <1233113696.2159.27.camel@localhost> On Tue, 2009-01-27 at 21:13 +0000, Ben Clifford wrote: > On Tue, 27 Jan 2009, Michael Wilde wrote: > > > This combination seems to still get the walltime from the Globus profile > > options, but it seems to be putting the time estimate in the seconds portion > > of the PBS request instead of the minutes, so my jobs are dying on wall-time > > exceed (from pbs) (4 jobs, needing 30-60 seconds each). 
> > You should not be seeing the walltime from tc.data level jobs being passed > through to coaster worker jobs. Coaster worker jobs should be getting a > different maxwalltime, as documented in previous emails, being some > multiple plus a constant of a normal job's maxwalltime. > > Based on previous emails from Allan, I'm suspicious about provider-pbs > passing through things correctly (its been a bugridden piece of shit when > I've tried to use it). Yep. It was a quick hack. Scalable, but still a hack. > > I'll get round to looking at that in the next few days, most likely. > > Until then, the least mysterious path is probably to stick with > gt2:gt2:pbs as your coaster jobmanager of choice. > From wilde at mcs.anl.gov Tue Jan 27 23:03:59 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 27 Jan 2009 23:03:59 -0600 Subject: [Swift-devel] Coasters failing on Teraport - cant find Java? In-Reply-To: <1233113611.2159.25.camel@localhost> References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost> Message-ID: <497FE73F.7000307@mcs.anl.gov> I dug a bit deeper. As far as I can tell, this is what's happening: 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars and set the PATH to contain OSG stuff. So if you do a globus-job-run of /usr/bin/printenv (i.e. with no shell) you see all this, including java in the path (from an osg dir). 2) when you globus-job-run /bin/sh, all this stays around, but 3) when you globus-job-run /bin/sh with -l, it runs /etc/profile, which un-does the path and LD_LIBRARY_PATH, setting PATH to some default and LD_LIBRARY_PATH to null. I *think* this is being done by softenv which runs from /etc/profile.d, called at the end of /etc/profile.
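A rough local sketch of step 3, no Globus needed. The scratch directory, the fake java and the fake profile are all made-up stand-ins; the profile simply resets PATH to a fixed default, which is what softenv appears to be doing here:

```shell
# Hypothetical stand-ins: a scratch dir, an "OSG" bin dir holding a fake java,
# and an empty "default" bin dir that the fake profile switches PATH to.
tmp=$(mktemp -d)
mkdir "$tmp/osg-bin" "$tmp/default-bin"
printf '#!/bin/sh\necho fake java\n' > "$tmp/osg-bin/java"
chmod +x "$tmp/osg-bin/java"

# The fake profile throws away the inherited PATH, as /etc/profile seems to on teraport.
echo "PATH=$tmp/default-bin" > "$tmp/profile"

# Case 1: plain shell; PATH is inherited from the caller (the "jobmanager").
without_profile=$(PATH="$tmp/osg-bin:$tmp/default-bin" /bin/sh -c 'command -v java || echo "no java"')

# Case 2: same starting PATH, but the profile is sourced first, as a login shell would do.
with_profile=$(PATH="$tmp/osg-bin:$tmp/default-bin" /bin/sh -c ". '$tmp/profile'; command -v java || echo 'no java'")

echo "without profile: $without_profile"
echo "with profile:    $with_profile"
rm -rf "$tmp"
```

The first case finds the fake java under osg-bin; the second does not, because the profile has replaced the PATH that the caller set up.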
You can simulate this with: globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c "which java; source /etc/profile; which java" (or try printenv instead of which java to see the details) So bottom line: there are at least two cases where -l hurts, this one, and abe, where attempts to run login shells from globus are thwarted. If the purpose of -l was just to get java in the path, then for OSG sites that behave like teraport, just omitting -l should work, because the OSG jobmanager mods put it in the path. For sites like abe, bypassing -l, and forcing the user to put Java in the path with a .bashrc or equivalent, may work. (The hack I used on abe was to remove the -l arg, and insert this in bootstrap.sh:
+if [ -f ~/.myetcprofile ]; then
+ source ~/.myetcprofile
+else
+ source /etc/profile
+fi
) One option is to accept a per-site option from sites.xml to bypass "-l" on the startup shell, and insert the logic above for something like .coasterinit, sourcing that if the user provides it. Another option is to put a +java line in the OSG .soft file on TeraPort. It's possible this problem only exists on the few sites like teraport that run both OSG mods and softenv? I think we need to test coasters broadly across OSG to be sure (Ben's IP problem is a case in point). But a simple shell test across all the OSG VO sites could detect whether Java will be there or not, with and without -l. - Mike On 1/27/09 9:33 PM, Mihael Hategan wrote: > Hmm. Looks like -l has the opposite effect of what I thought it should > do (end up with an environment equivalent to the one you get when you > log in as an interactive session). Is it my misunderstanding or > something else? > > On Tue, 2009-01-27 at 13:41 -0600, Michael Wilde wrote: >> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs >> >> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs >> mode. >> >> Im using 0.8rc1 and submitting from tp-login.
>> >> I am running with a DOEgrids cert in the OSG VO. >> >> I *think* the issue is that when a gt2 jobs on this vo runs with a login >> shell, it doesnt get java in its path. >> >> When I run /bin/sh *without* the "-l" option, under globus, I do get a >> java in my path. >> >> Allan: what VO did you run on when you got a sucsessful gt2:gt2:pbs >> coaster run on teraport, after you fixed the walltime issue? >> >> It seems to me that this is a rough edge with coaster startup. Recall >> that I had a similar problem running on abe last year: I had to edit out >> the "-l" and create a custom .profile to get coasters to work. >> >> It would be great if we can iron this out in 0.8 or soon after. I'm >> willing to do some testing and enlist help from Allan and Zhengxiong for >> wider testing. >> >> Do we need special site attributes for specific sites to override >> default behaviors when they dont work? >> >> >> My sites.xml is: >> >> >> >> fast >> 00:05:00 >> >> > url="tp-grid1.ci.uchicago.edu" >> jobmanager="gt2:gt2:pbs" /> >> /gpfs1/osg/data/oops/swiftwork >> >> >> >> I get this on stdout/err: >> >> --------------------------------------------- >> Swift 0.8rc1 swift-r2448 cog-r2261 >> >> RunID: 20090127-1305-hcxdpor3 >> Progress: >> Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1 >> Progress: Selecting site:2 Stage in:1 Submitting:1 >> Progress: Selecting site:2 Submitting:1 Submitted:1 >> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a >> on teraport >> Execution failed: >> Exception in runoops: >> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, >> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, >> [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]] >> Host: teraport >> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Could not submit job >> Caused by: >> Could not start coaster service 
>> Caused by: >> Task ended before registration was received. >> STDOUT: which: no java in >> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) >> dirname: too few arguments >> Try `dirname --help' for more information. >> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No >> such file or directory >> >> STDERR: null >> Cleaning up... >> Done >> >> ------------------------------------ >> >> Checking out the environment with this cert I see: >> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version' >> /bin/sh: java: command not found >> >> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version' >> java version "1.5.0_14" >> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) >> >> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; >> echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' >> JAVA_HOME IS: >> PATH IS: >> /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin >> /usr/bin/which: no java in >> 
(/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) >> tp$ >> >> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo >> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' >> >> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java >> JAVA_HOME IS: >> PATH IS: >> /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/ o > pt >> 
/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java >> -version'java version "1.5.0_14" >> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) >> >> >> - Mike >> >> >> >> >> >> On 1/24/09 5:03 PM, Allan Espinosa wrote: >>> Hi, >>> >>> I am using swift0.8rc1. the same also happens to v0.7 >>> >>> I tried submitting a job from communicado to tp-grid1 (teraport) using >>> coasters. The swift runtime does not give any error but it does not >>> finish as well. Looking through the files received by the teraport >>> head node, i observed that swift keeps submitting gram jobs. It looks >>> like that the submitted pbs scripts kept finishing / failing. >>> >>> diging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we >>> see that maxwalltime become 101:00 from 00:10:00 (in sites.xml) >>> >>> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl" >>> "http://128.135.125.118:50001" "1728236079" >>> #! /bin/sh >>> # PBS batch job script built by Globus job manager >>> # >>> #PBS -S /bin/sh >>> #PBS -m n >>> #PBS -q fast >>> #PBS -l walltime=101:00 >>> #PBS -o /dev/null >>> #PBS -e /dev/null >>> #PBS -l nodes=1 >>> HOME="/home/aespinosa"; >>> export HOME; >>> OSG_DATA="/gpfs1/osg/data"; >>> ... >>> ... 
>>> counter=0 >>> exit_code=0 >>> while test $counter -lt 1; do >>> /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter; >>> >>> read tmp_exit_code < >>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter >>> if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then >>> exit_code=$tmp_exit_code >>> fi >>> counter=`expr $counter + 1` >>> done >>> >>> exit $exit_code >>> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max >>> walltime requirement >>> >>> >>> >>> Below is my sites.xml: >>> >>> >>> >>> >>> fast >>> 00:10:00 >>> >> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4"> >>> >>> >> jobmanager="gt2:gt2:pbs" /> >>> >>> /disks/tp-gpfs/scratch/aespinosa >>> >>> >>> >>> >>> This does not happen if i use "local:pbs" as the jobmanager for the >>> coaster and was successful in running jobs >>> -Allan >>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Tue Jan 27 23:23:52 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 27 Jan 2009 23:23:52 -0600 Subject: [Swift-devel] Coaster urandom fix? In-Reply-To: <1217357832.10507.1.camel@localhost> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> <488EA562.9020704@mcs.anl.gov> <488F6005.3060400@mcs.anl.gov> <1217357832.10507.1.camel@localhost> Message-ID: <497FEBE8.4050605@mcs.anl.gov> While looking into the coaster "-l" issue on teraport I noticed another diff in my svn tree (not checked in) for the urandom problem below. (-Djava.security.egd=file:///dev/urandom) Did you make that fix somewhere? 
In my code it was in bootstrap.sh, but I dont see it there as of 2261: --- modules/provider-coaster/resources/bootstrap.sh (revision 2261) +++ modules/provider-coaster/resources/bootstrap.sh (working copy) @@ -11,6 +11,11 @@ rm -f $DJ exit 1 } +if [ -f ~/.myetcprofile ]; then + source ~/.myetcprofile +else + source /etc/profile +fi if [ "$L" == "" ]; then L=~/coaster-boot-$ID.log fi @@ -52,8 +57,13 @@ fi echo "JAVA=$JAVA" >>$L if [ -x $JAVA ]; then +<<<<<<< .mine + echo "$JAVA -Djava.home="$JAVA_HOME" -DX509_USER_PROXY="$X509_USER_PROXY" -DGLOBUS_HOSTNAME="$H" -Djava.security.egd=file:///dev/urandom -jar $DJ $BS $LMD5 $LS $ID" >>$L + $JAVA -Djava.home="$JAVA_HOME" -DGLOBUS_TCP_PORT_RANGE="$GLOBUS_TCP_PORT_RANGE" -DX509_USER_PROXY="$X509_USER_PROXY" -DX509_CERT_DIR="$X509_CERT_DIR" -DGLOBUS_HOSTNAME="$H" -Djava.security.egd=file:///dev/urandom -jar $DJ $BS $LMD5 $LS $ID >>$L 2>&1 +======= echo "$JAVA -Djava.home="$JAVA_HOME" -DX509_USER_PROXY="$X509_USER_PROXY" -DGLOBUS_HOSTNAME="$H" -jar $DJ $BS $LMD5 $LS $ID" >>$L $JAVA -Djava.home="$JAVA_HOME" -DGLOBUS_TCP_PORT_RANGE="$GLOBUS_TCP_PORT_RANGE" -DX509_USER_PROXY="$X509_USER_PROXY" -DX509_CERT_DIR="$X509_CERT_DIR" -DGLOBUS_HOSTNAME="$H" -jar $DJ $BS $LMD5 $LS $ID +>>>>>>> .r2261 EC=$? echo "EC: $EC" >>$L rm -f $DJ com$ - Mike On 7/29/08 1:57 PM, Mihael Hategan wrote: > On Tue, 2008-07-29 at 13:23 -0500, Michael Wilde wrote: > >> Another possibility is the /dev/random delay in generating an id due ot >> lack of server entropy. Now *that* would explain things, as its right >> where the delay is occurring: >> >> private void startWorker(int maxWallTime, Task prototype) >> throws InvalidServiceContactException { >> int id = sr.nextInt(); // <<<<<<<<<<<<<<<<<<<<<< >> if (logger.isInfoEnabled()) { >> logger.info("Starting worker with id=" + id + " and >> } >> which uses SecureRandom.getInstance("SHA1PRNG") >> >> This just occurred to me and is perhaps a more likely explanation. 
Is >> this the same scenario that was causing the Swift client to encounter >> long delays as it started trivial workflows? How was that eventually fixed? > > Hmm. Yes. I'll change the bootstrap class to start the service > with /dev/urandom instead (if available). > >> I can stub this out with a simple number generator and test. And/or time >> SecureRandom in a standalone program. >> >> - Mike >> >> >> >> >> >> On 7/29/08 12:06 AM, Michael Wilde wrote: >>> hmmm. my debug statement didnt print. but this time the job on abe ran ok. >>> >>> Tomorrow I'll run more tests and see how stable it is there, and why my >>> logging calls never showed up. >>> >>> - Mike >>> >>> >>> On 7/28/08 11:45 PM, Michael Wilde wrote: >>>> Ive moved on, and put a temp hack in to not use -l and instead run >>>> "~/.myetcprofile" if it exists and /etc/profile if it doesnt. >>>> >>>> .myetcprofile on abe is /etc/profile with the problematic code removed. >>>> >>>> Now abe gets past the problem and runs bootstrap.sh ok. >>>> >>>> The sequence runs OK up to the point where the service on abe's >>>> headnode receives a message to start a job. >>>> >>>> AT this point, the service on abe seems to hang. 
>>>> >>>> Comparing to the message sequence on mercury, which works, I see this: >>>> >>>> *** mercury: >>>> >>>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 >>>> SUBMITJOB(identity=1217268111318 >>>> executable=/bin/bash >>>> directory=/home/ncsa/wilde/swiftwork/ctest-20080728-1301-7c4ok42h >>>> arg=shared/wrapper.sh >>>> arg=echo-myx2e6xi >>>> arg=-jobdir >>>> arg=m >>>> arg=-e >>>> arg=/bin/echo >>>> arg=-out >>>> arg=echo_s000.txt >>>> arg=-err >>>> arg=stderr.txt >>>> arg=-i >>>> arg=-d >>>> ar) >>>> [ChannelManager] DEBUG Channel multiplexer - >>>> Looking up -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS >>>> [ChannelManager] DEBUG Channel multiplexer - Found >>>> -134779b6:11b6ad597e2:-7fff:3598cb3d:11b6ad597b5:-7fffS >>>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 >>>> SUBMITJOB(urn:1217268111318-1217268128309-1217268128310) >>>> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = >>>> true, datalen = 45, data = urn:1217268111318-1217268128309-1217268128310 >>>> [WorkerManager] INFO Coaster Queue Processor - No suitable worker >>>> found. Attempting to start a new one. 
>>>> [WorkerManager] INFO Worker Manager - Got allocation request: >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 >>>> >>>> [WorkerManager] INFO Worker Manager - Starting worker with >>>> id=-615912369 and maxwalltime=6060s >>>> Worker start provider: gt2 >>>> Worker start JM: pbs >>>> >>>> *** abe: >>>> >>>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND< 2 >>>> SUBMITJOB(identity=1217291444315 >>>> executable=/bin/bash >>>> directory=/u/ac/wilde/swiftwork/ctest-20080728-1930-m5a70lvc >>>> arg=shared/wrapper.sh >>>> arg=echo-zc5mt6xi >>>> arg=-jobdir >>>> arg=z >>>> arg=-e >>>> arg=/bin/echo >>>> arg=-out >>>> arg=echo_s000.txt >>>> arg=-err >>>> arg=stderr.txt >>>> arg=-i >>>> arg=-d >>>> arg= >>>> ar) >>>> [ChannelManager] DEBUG Channel multiplexer - >>>> Looking up 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS >>>> [ChannelManager] DEBUG Channel multiplexer - Found >>>> 17badc64:11b6c39944a:-7fff:f7c31d:11b6c399416:-7fffS >>>> [RequestHandler] DEBUG Channel multiplexer - GSSC-null: HND> 2 >>>> SUBMITJOB(urn:1217291444315-1217291458042-1217291458043) >>>> [Replier] DEBUG Worker 1 - Replier(GSSC-null)REPL>: tag = 2, fin = >>>> true, datalen = 45, data = urn:1217291444315-1217291458042-1217291458043 >>>> [WorkerManager] INFO Coaster Queue Processor - No suitable worker >>>> found. Attempting to start a new one. >>>> [WorkerManager] INFO Worker Manager - Got allocation request: >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 95cfbe >>>> >>>> [AbstractKarajanChannel] DEBUG Channel multiplexer - GSSC-null REQ<: >>>> tag = 3, fin = true, err = false, datalen = 15, data = SHUTDOWNSERVICE >>>> >>>> *** >>>> >>>> I *think* the SHUTDOWNSERVICE message on abe is coming much later, >>>> after abe's service hangs, but Im not sure. 
>>>> >>>> What it looks like to me is that what should should happen on abe is >>>> this: >>>> >>>> [WorkerManager] INFO Worker Manager - Got allocation request: >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager$AllocationRequest at 151ca803 >>>> >>>> [WorkerManager] INFO Worker Manager - Starting worker with >>>> id=-615912369 and maxwalltime=6060s >>>> >>>> but on abe the "Worker Manager - Starting worker" is never seen. >>>> >>>> Looking at WorkerManager.run() its hard to see how the "Starting >>>> worker" message could *not* show up right after "Got allocation >>>> request", but there must be some sequence of events that causes this. >>>> >>>> Abe is an 8-core system. Is there perhaps more opportunity for a >>>> multi-thread race or deadlock that could cause this? >>>> >>>> I will insert some more debug logging and try a few more times to see >>>> if thing shang in this manner every time or not. >>>> >>>> - Mike >>>> >>>> ps client Logs with abe server side boot logs are on CI net in >>>> ~wilde/coast/run11 >>>> >>>> >>>> >>>> On 7/28/08 10:50 PM, Mihael Hategan wrote: >>>>> On Mon, 2008-07-28 at 19:32 +0000, Ben Clifford wrote: >>>>>> On Mon, 28 Jul 2008, Michael Wilde wrote: >>>>>> >>>>>>> So it looks like something in the job specs that is launching >>>>>>> coaster for >>>>>>> gt2:pbs is not being accepted by abe. >>>>>> ok. TeraGrid's unified account system is insufficiently unified for >>>>>> me to be able to access abe, but they are aware of that; if and when >>>>>> I am reunified, I'll try this out myself. >>>>> Not to be cynical or anything, but that unified thing: never worked. 
>>>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Wed Jan 28 03:05:35 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Jan 2009 09:05:35 +0000 (GMT) Subject: [Swift-devel] a note on running coasters on osg's RENCI Engage site In-Reply-To: <1232987200.15406.2.camel@localhost> References: <1232987200.15406.2.camel@localhost> Message-ID: On Mon, 26 Jan 2009, Mihael Hategan wrote: > I think a site attribute is the place where this should go. I can do > that or you can. cog r2262 has a coasterInternalIP attribute for coasters. This can go in a site profile entry, as demonstrated in (working for me) tests/sites/coaster/renci-engage-coaster.xml file in swift svn r2453 -- From benc at hawaga.org.uk Wed Jan 28 04:08:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Jan 2009 10:08:26 +0000 (GMT) Subject: [Swift-devel] swift changing walltime of prews-gram jobs In-Reply-To: References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> Message-ID: On Sun, 25 Jan 2009, Ben Clifford wrote: > Probably coasters should get another configuration option to allow the > worker wall time to be more explicitly set, separately from job execution > wall times - that makes sense for sites where parameters such as queue > limits are well known by the user. CoG r2263 has a new parameter to explicitly set the coaster worker maxwalltimes, overriding the default behaviour of computing based on the maxwalltimes of jobs submitted into coasters. 
Add an entry to your sites.xml like this (which specifies 1h20m maxwalltime for coaster workers): 1:20 -- From benc at hawaga.org.uk Wed Jan 28 07:18:10 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Jan 2009 13:18:10 +0000 (GMT) Subject: [Swift-devel] dataset terminology Message-ID: The HPDC paper in SVN contains (in the language section, which I am porting to the user guide): > We often refer to instances of composites of > mapped types as \emph{datasets}. Is this intended to mean that: file f <"mydata.dat">; is not a dataset? In the implementation, pretty much everything is a dataset, in the sense that it is represented by an instance of the DSHandle class. This includes primitive in-memory values such as integers and strings. -- From benc at hawaga.org.uk Wed Jan 28 07:37:50 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Jan 2009 13:37:50 +0000 (GMT) Subject: [Swift-devel] mapping and assignment in one statement. Message-ID: At present, declarations and assignments are documented as: typename variablename ( | = initialValue ) ; That is, you can declare: file f <"foo">; (1) or file f = p(); (2) but not: file f <"foo"> = p(); (3) I think given the way the language has evolved recently that we should permit 3 - the implementation now should not make that hard at all, and it will not break any existing program. -- From benc at hawaga.org.uk Wed Jan 28 08:31:47 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Jan 2009 14:31:47 +0000 (GMT) Subject: [Swift-devel] 'mapped type' terminology Message-ID: From hpdc draft: > Types in Swift can be atomic or composite. An atomic type can be either > a primitive type or a mapped type. The phrasing of this kinda excludes the idea that composite types might be mapped, which is not true at all... 
The atomic types that are mapped to single files are what I initially called marker types; a better term was sought for that but 'mapped type' is not it in my opinion - being a mapped type or not is orthogonal to the atomic/composite distinction. -- From foster at anl.gov Wed Jan 28 08:38:52 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 28 Jan 2009 08:38:52 -0600 Subject: [Swift-devel] 'mapped type' terminology In-Reply-To: References: Message-ID: <91D0DC70-69E4-4A2A-AE78-E1E9D7A9702D@anl.gov> The "dataset" construct is intended to refer to objects that have a representation both in memory and on disk. If we use DSHandle to refer to things that have no disk representation, that seems contrary to the way the construct was originally defined. From wilde at mcs.anl.gov Wed Jan 28 08:56:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 28 Jan 2009 08:56:55 -0600 Subject: [Swift-devel] 'mapped type' terminology In-Reply-To: References: Message-ID: <49807237.4070408@mcs.anl.gov> I agree with this comment. Early on I felt that files should be a primitive type - in the language today they show all the attributes of such. One alternative is to add "file" to the list of atomic types. Another is to call single files a mapped atomic type. A third is to find a better name than "mapped type" or "marker". Can a mapped atomic object be mapped to anything but a file? Ie, if the mapper associates a database or service handle with an object, encoded as a string returned by the mapper, when the app is called, wont the implementation attempt to locate and pass a file to the application? 
(Sorry that this is digressing, but it seems that to name this concept we need to make sure we understand its definition) It seems that its the @filename() designation that causes the current file-handling behavior at runtime, and other @something() primitives could cause other runtime behavior (such as @handle() for passing the opaque mapping or @database() for database-specific passing). If we did this, then we are essentially expanding the possible set of mapped atomic types. One way to do that is to define a tiny set of such mapped atomic types. Two that would be useful are: file == @filename() data == @data() (for an opaque handle, or @handle) - Mike On 1/28/09 8:31 AM, Ben Clifford wrote: >>From hpdc draft: > >> Types in Swift can be atomic or composite. An atomic type can be either >> a primitive type or a mapped type. > > The phrasing of this kinda excludes the idea that composite types might be > mapped, which is not true at all... > > The atomic types that are mapped to single files are what I initially > called marker types; a better term was sought for that but 'mapped type' > is not it in my opinion - being a mapped type or not is orthogonal to the > atomic/composite distinction. > From benc at hawaga.org.uk Wed Jan 28 08:57:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Jan 2009 14:57:26 +0000 (GMT) Subject: [Swift-devel] 'mapped type' terminology In-Reply-To: <91D0DC70-69E4-4A2A-AE78-E1E9D7A9702D@anl.gov> References: <91D0DC70-69E4-4A2A-AE78-E1E9D7A9702D@anl.gov> Message-ID: On Wed, 28 Jan 2009, Ian Foster wrote: > The "dataset" construct is intended to refer to objects that have a > representation both in memory and on disk. If we use DSHandle to refer to > things that have no disk representation, that seems contrary to the way the > construct was originally defined. DSHandle is a specific java interface which is used to deal with data in SwiftScript which looks like data. 
Initially it did not include things like ints and other primitive data, but subsequent development has indicated that it is desirable for primitive data to be not so primitive (for example, being able to participate in data dependency ordering of job execution). I'm more interested in clarifying what is meant by 'dataset' - whether it means 'any data in swiftscript' (which is what DSHandle means), or what subset of data it means. No dataset in Swift has an in-memory representation of its value, though there is in-memory representation of 'managementy' things to do with that data (e.g. where to find the data in URI-space, and how it is interacting with data-dependency ordering of job execution). Some data (the extern type) has no mapped representation in URI-space, because the extern type pretty much explicitly says "Swift is having nothing to do with where you keep this data". So the model is somewhat more complicated than the model that was floating around 3 years ago. -- From benc at hawaga.org.uk Wed Jan 28 09:00:25 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 28 Jan 2009 15:00:25 +0000 (GMT) Subject: [Swift-devel] 'mapped type' terminology In-Reply-To: References: <91D0DC70-69E4-4A2A-AE78-E1E9D7A9702D@anl.gov> Message-ID: On Wed, 28 Jan 2009, Ben Clifford wrote: > So the model is somewhat more complicated than the model that was floating > around 3 years ago. or perhaps less complicated ... everything is a dataset, with each one perhaps having (based on its type) some in memory representation or some URI-space representation. -- From hategan at mcs.anl.gov Wed Jan 28 09:48:07 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Jan 2009 09:48:07 -0600 Subject: [Swift-devel] Re: Coaster urandom fix? 
In-Reply-To: <497FEBE8.4050605@mcs.anl.gov> References: <488CC81C.7030205@mcs.anl.gov> <488D4217.2010006@mcs.anl.gov> <488DF957.9000803@mcs.anl.gov> <488E0C64.1020106@mcs.anl.gov> <1217303412.4347.0.camel@localhost> <488EA079.4000404@mcs.anl.gov> <488EA562.9020704@mcs.anl.gov> <488F6005.3060400@mcs.anl.gov> <1217357832.10507.1.camel@localhost> <497FEBE8.4050605@mcs.anl.gov> Message-ID: <1233157687.5004.5.camel@localhost> On Tue, 2009-01-27 at 23:23 -0600, Michael Wilde wrote: > While looking into the coaster "-l" issue on teraport I noticed another > diff in my svn tree (not checked in) for the urandom problem below. > > (-Djava.security.egd=file:///dev/urandom) > > Did you make that fix somewhere? In my code it was in bootstrap.sh, but > I dont see it there as of 2261: I don't have it in my local copy. From hategan at mcs.anl.gov Wed Jan 28 10:43:35 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Jan 2009 10:43:35 -0600 Subject: [Swift-devel] dataset terminology In-Reply-To: References: Message-ID: <1233161015.6101.13.camel@localhost> On Wed, 2009-01-28 at 13:18 +0000, Ben Clifford wrote: > The HPDC paper in SVN contains (in the language section, which I am > porting to the user guide): > > > We often refer to instances of composites of > > mapped types as \emph{datasets}. > > Is this intended to mean that: > > file f <"mydata.dat">; > > is not a dataset? No. green is not grass. And as far as I can tell, we don't use "dataset" much in connection to single files. But we could as long as we're clear about it. However, for the purpose of the paper, the term "dataset" wasn't explained, which I thought it should. > > In the implementation, pretty much everything is a dataset, in the sense > that it is represented by an instance of the DSHandle class. This includes > primitive in-memory values such as integers and strings. 
> From hategan at mcs.anl.gov Wed Jan 28 11:04:24 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Jan 2009 11:04:24 -0600 Subject: [Swift-devel] 'mapped type' terminology In-Reply-To: References: Message-ID: <1233162264.6770.5.camel@localhost> On Wed, 2009-01-28 at 14:31 +0000, Ben Clifford wrote: > >From hpdc draft: > > > Types in Swift can be atomic or composite. An atomic type can be either > > a primitive type or a mapped type. > > The phrasing of this kinda excludes the idea that composite types might be > mapped, which is not true at all... "Mapped type" was not meant to be the same as "declaration with mapper". By virtue of that fact that composite types can be made of atomic types, composite types can contain mapped types. The mapping itself, and the fact that for a composite type it's specified for the whole declaration, not for individual fields, does not make a type "mapped" or "not mapped". In any event, we need a term to describe types that are atomic but not in-memory. > > The atomic types that are mapped to single files are what I initially > called marker types; a better term was sought for that but 'mapped type' > is not it in my opinion - being a mapped type or not is orthogonal to the > atomic/composite distinction. > From wilde at mcs.anl.gov Wed Jan 28 11:25:34 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 28 Jan 2009 11:25:34 -0600 Subject: [Swift-devel] 'mapped type' terminology In-Reply-To: <49807237.4070408@mcs.anl.gov> References: <49807237.4070408@mcs.anl.gov> Message-ID: <4980950E.5020706@mcs.anl.gov> On 1/28/09 8:56 AM, Michael Wilde wrote: > I agree with this comment. Early on I felt that files should be a > primitive type - in the language today they show all the attributes of > such. > > One alternative is to add "file" to the list of atomic types. > Another is to call single files a mapped atomic type. > A third is to find a better name than "mapped type" or "marker". 
> > Can a mapped atomic object be mapped to anything but a file? Ie, if the > mapper associates a database or service handle with an object, encoded > as a string returned by the mapper, when the app is called, wont the > implementation attempt to locate and pass a file to the application? > > (Sorry that this is digressing, but it seems that to name this concept > we need to make sure we understand its definition) > > It seems that its the @filename() designation that causes the current > file-handling behavior at runtime, and other @something() primitives > could cause other runtime behavior (such as @handle() for passing the > opaque mapping or @database() for database-specific passing). I take that back - its the act of passing a file-mapped object to an atomic that causes the file to be transfered to or from the site of execution. So it would need to be the type of mapping that causes different mappings to be handled differently. - Mike > > If we did this, then we are essentially expanding the possible set of > mapped atomic types. One way to do that is to define a tiny set of such > mapped atomic types. Two that would be useful are: > > file == @filename() > data == @data() (for an opaque handle, or @handle) > > - Mike > > > On 1/28/09 8:31 AM, Ben Clifford wrote: >>> From hpdc draft: >> >>> Types in Swift can be atomic or composite. An atomic type can be >>> either a primitive type or a mapped type. >> >> The phrasing of this kinda excludes the idea that composite types >> might be mapped, which is not true at all... >> >> The atomic types that are mapped to single files are what I initially >> called marker types; a better term was sought for that but 'mapped >> type' is not it in my opinion - being a mapped type or not is >> orthogonal to the atomic/composite distinction. 
>> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Thu Jan 29 07:58:21 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 29 Jan 2009 07:58:21 -0600 Subject: [Swift-devel] swift error handling problem Message-ID: <4981B5FD.5090403@mcs.anl.gov> A minor problem in swift error handling: com$ cat c2.swift int i=@missing(@arg("niter")); // fails int i=@toint(@arg("niter")); // works com$ swift c2.swift -niter=10 Could not start execution. Failed to convert .xml to .kml for c2.swift com$ -- invoking a missing @ function in the expression above seems to make the parser fail. Should I file as a bug? From wilde at mcs.anl.gov Thu Jan 29 07:59:01 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 29 Jan 2009 07:59:01 -0600 Subject: [Swift-devel] swift error handling problem In-Reply-To: <4981B5FD.5090403@mcs.anl.gov> References: <4981B5FD.5090403@mcs.anl.gov> Message-ID: <4981B625.2090903@mcs.anl.gov> 0.8rc1 by the way. On 1/29/09 7:58 AM, Michael Wilde wrote: > A minor problem in swift error handling: > > com$ cat c2.swift > int i=@missing(@arg("niter")); // fails > > int i=@toint(@arg("niter")); // works > > com$ swift c2.swift -niter=10 > Could not start execution. > Failed to convert .xml to .kml for c2.swift > com$ > > -- > > invoking a missing @ function in the expression above seems to make the > parser fail. > > Should I file as a bug? 
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Thu Jan 29 08:28:23 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 29 Jan 2009 14:28:23 +0000 (GMT) Subject: [Swift-devel] swift error handling problem In-Reply-To: <4981B5FD.5090403@mcs.anl.gov> References: <4981B5FD.5090403@mcs.anl.gov> Message-ID: > A minor problem in swift error handling: > > com$ cat c2.swift > int i=@missing(@arg("niter")); // fails > com$ swift c2.swift -niter=10 > Could not start execution. > Failed to convert .xml to .kml for c2.swift > invoking a missing @ function in the expression above seems to make the parser > fail. It's the typechecking that fails - r2474 will still fail, but with a more specific error: Could not start execution. Compile error in assigment at line 2: Unknown function missing -- From hategan at mcs.anl.gov Thu Jan 29 10:45:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 Jan 2009 10:45:11 -0600 Subject: [Swift-devel] swift error handling problem In-Reply-To: References: <4981B5FD.5090403@mcs.anl.gov> Message-ID: <1233247511.13192.0.camel@localhost> On Thu, 2009-01-29 at 14:28 +0000, Ben Clifford wrote: > Compile error in assigment at line 2: Unknown function missing That's just funny. From benc at hawaga.org.uk Thu Jan 29 10:46:42 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 29 Jan 2009 16:46:42 +0000 (GMT) Subject: [Swift-devel] swift error handling problem In-Reply-To: <1233247511.13192.0.camel@localhost> References: <4981B5FD.5090403@mcs.anl.gov> <1233247511.13192.0.camel@localhost> Message-ID: On Thu, 29 Jan 2009, Mihael Hategan wrote: > On Thu, 2009-01-29 at 14:28 +0000, Ben Clifford wrote: > > Compile error in assigment at line 2: Unknown function missing > > That's just funny. It's funny when the unknown function is called 'missing'... 
more generally, Unknown function foobar -- From hategan at mcs.anl.gov Thu Jan 29 10:51:51 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 Jan 2009 10:51:51 -0600 Subject: [Swift-devel] swift error handling problem In-Reply-To: References: <4981B5FD.5090403@mcs.anl.gov> <1233247511.13192.0.camel@localhost> Message-ID: <1233247913.13441.0.camel@localhost> On Thu, 2009-01-29 at 16:46 +0000, Ben Clifford wrote: > On Thu, 29 Jan 2009, Mihael Hategan wrote: > > > On Thu, 2009-01-29 at 14:28 +0000, Ben Clifford wrote: > > > Compile error in assigment at line 2: Unknown function missing > > > > That's just funny. > > Its funny when the unknown function is called 'missing'... more generally, > Unknown function foobar > Right. Quotes may help. From benc at hawaga.org.uk Thu Jan 29 10:59:51 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 29 Jan 2009 16:59:51 +0000 (GMT) Subject: [Swift-devel] swift error handling problem In-Reply-To: <1233247913.13441.0.camel@localhost> References: <4981B5FD.5090403@mcs.anl.gov> <1233247511.13192.0.camel@localhost> <1233247913.13441.0.camel@localhost> Message-ID: looks like this now: > Compile error in assigment at line 2: Unknown function: @missing -- From hategan at mcs.anl.gov Thu Jan 29 11:06:12 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 Jan 2009 11:06:12 -0600 Subject: [Swift-devel] swift error handling problem In-Reply-To: References: <4981B5FD.5090403@mcs.anl.gov> <1233247511.13192.0.camel@localhost> <1233247913.13441.0.camel@localhost> Message-ID: <1233248772.13766.0.camel@localhost> Much better. 
On Thu, 2009-01-29 at 16:59 +0000, Ben Clifford wrote:
> looks like this now:
>
> > Compile error in assigment at line 2: Unknown function: @missing
>
>

From hategan at mcs.anl.gov Thu Jan 29 12:26:36 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 29 Jan 2009 12:26:36 -0600
Subject: [Swift-devel] strsplit
Message-ID: <1233253596.18766.1.camel@localhost>

r2484 contains a strsplit function which can be used to split a string
into multiple pieces based on a regular expression.

At least one application I know needs this and there is no reasonable
way to get the same results otherwise.

From benc at hawaga.org.uk Fri Jan 30 08:12:43 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 30 Jan 2009 14:12:43 +0000 (GMT)
Subject: [Swift-devel] Swift 0.8 released.
Message-ID:

Swift 0.8 is released. Download it at
http://www.ci.uchicago.edu/swift/downloads/

In addition to a number of bugfixes and improvements in output and error
quality, a new swift-plot-log command is included to make pretty graphical
plots of swift runs.

The release notes, with more information, are available at:
http://www.ci.uchicago.edu/swift/downloads/release-notes-0.8.txt

--

From benc at hawaga.org.uk Fri Jan 30 08:13:34 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 30 Jan 2009 14:13:34 +0000 (GMT)
Subject: [Swift-devel] swift 0.9 timeline
Message-ID:

The present every-two-months release cycle still feels right to me, so I
plan on making the 0.9 release in mid/late March.

--

From benc at hawaga.org.uk Fri Jan 30 10:40:16 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 30 Jan 2009 16:40:16 +0000 (GMT)
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
In-Reply-To:
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>
Message-ID:

On Sun, 25 Jan 2009, Ben Clifford wrote:

> You should see the same behaviour using local:pbs, which will use direct
> PBS submission instead of GRAM; but you don't.
> That is an inconsistency
> that suggests something is not right. My initial suspicion would be that
> the cog PBS provider is not correctly passing either the walltime or queue
> parameters. I will investigate this.

A quick peek at this suggests something is awry in walltimes in the PBS
provider stack. Specifying:

globus::maxwalltime="7"

which means 7 minutes

results in a PBS queue walltime of:

$ qstat -f 849545 | grep -i time
[...]
Resource_List.walltime = 00:00:07

which I understand to mean 7 seconds.

--

From hategan at mcs.anl.gov Fri Jan 30 10:45:19 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 30 Jan 2009 10:45:19 -0600
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
In-Reply-To:
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>
Message-ID: <1233333919.13982.1.camel@localhost>

On Fri, 2009-01-30 at 16:40 +0000, Ben Clifford wrote:
> On Sun, 25 Jan 2009, Ben Clifford wrote:
>
> > You should see the same behaviour using local:pbs, which will use direct
> > PBS submission instead of GRAM; but you don't. That is an inconsistency
> > that suggests something is not right. My initial suspicion would be that
> > the cog PBS provider is not correctly passing either the walltime or queue
> > parameters. I will investigate this.
>
> A quick peek at this suggests something is awry in walltimes in the PBS
> provider stack. Specifying:
>
> globus::maxwalltime="7"
>
> which means 7 minutes
>
> results in a PBS queue walltime of:
>
> $ qstat -f 849545 | grep -i time
> [...]
> Resource_List.walltime = 00:00:07
>
>
> which I understand to mean 7 seconds.
>

Right. The walltime seems to suffer from a widespread lack of
well-definedness. In swift we addressed this in some way, but I don't think
that same scheme went into the PBS provider. It should.
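
[Editorial sketch, not part of the original thread.] The mismatch above
(a plain "7" meant as minutes but ending up as 00:00:07) could be avoided
by normalising the value before it reaches the PBS script. The following
is only an illustration of that idea in Python; the function name
normalize_walltime is hypothetical, and the accepted formats (plain
minutes, H:M, H:M:S) are an assumption about what the Swift user guide
documents, not a description of the actual cog PBS provider code.

```python
def normalize_walltime(value: str) -> str:
    """Return a PBS-style HH:MM:SS string.

    Assumed input formats (hypothetical, for illustration only):
      "M"      -> minutes
      "H:M"    -> hours and minutes
      "H:M:S"  -> hours, minutes and seconds
    """
    parts = value.strip().split(":")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unrecognised walltime: {value!r}")
    nums = [int(p) for p in parts]
    if len(nums) == 1:
        hours, minutes, seconds = 0, nums[0], 0
    elif len(nums) == 2:
        hours, minutes, seconds = nums[0], nums[1], 0
    else:
        hours, minutes, seconds = nums
    # Go through total seconds so that e.g. 90 minutes carries into hours.
    total = hours * 3600 + minutes * 60 + seconds
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


if __name__ == "__main__":
    # A plain "7" is treated as 7 minutes rather than 7 seconds.
    print(normalize_walltime("7"))
    print(normalize_walltime("00:10:00"))
```

Emitting the fully qualified HH:MM:SS form leaves no room for the
scheduler to reinterpret a bare number as seconds.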
From hategan at mcs.anl.gov Fri Jan 30 10:53:54 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 30 Jan 2009 10:53:54 -0600
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
In-Reply-To:
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com>
Message-ID: <1233334434.14201.3.camel@localhost>

On Fri, 2009-01-30 at 16:40 +0000, Ben Clifford wrote:
> On Sun, 25 Jan 2009, Ben Clifford wrote:
>
> > You should see the same behaviour using local:pbs, which will use direct
> > PBS submission instead of GRAM; but you don't. That is an inconsistency
> > that suggests something is not right. My initial suspicion would be that
> > the cog PBS provider is not correctly passing either the walltime or queue
> > parameters. I will investigate this.
>
> A quick peek at this suggests something is awry in walltimes in the PBS
> provider stack. Specifying:
>
> globus::maxwalltime="7"
>
> which means 7 minutes
>
> results in a PBS queue walltime of:
>
> $ qstat -f 849545 | grep -i time
> [...]
> Resource_List.walltime = 00:00:07
>
>
> which I understand to mean 7 seconds.
>

So I'm looking at the PBS provider code. It looks like it does not do
any processing on the max wall time. If you say "7" that results in
#PBS -l maxwalltime=7 in the PBS script.

So I should probably conclude that it's Globus that interprets a plain 7
as minutes and transforms that to 00:07:00.

I suppose the PBS provider could adopt the same scheme.

From benc at hawaga.org.uk Fri Jan 30 11:16:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 30 Jan 2009 17:16:39 +0000 (GMT)
Subject: [Swift-devel] swift changing walltime of prews-gram jobs
In-Reply-To: <1233334434.14201.3.camel@localhost>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <1233334434.14201.3.camel@localhost>
Message-ID:

On Fri, 30 Jan 2009, Mihael Hategan wrote:

> I suppose the PBS provider could adopt the same scheme.
If the general cog API is going to have walltimes specified the same as the
Swift user guide documents, then that makes sense.

--

From hategan at mcs.anl.gov Fri Jan 30 12:42:00 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 30 Jan 2009 12:42:00 -0600
Subject: [Swift-devel] Coasters failing on Teraport - cant find Java?
In-Reply-To: <497FE73F.7000307@mcs.anl.gov>
References: <50b07b4b0901241503r72f28b96rec19583bb8044ea1@mail.gmail.com> <497F637D.5080707@mcs.anl.gov> <1233113611.2159.25.camel@localhost> <497FE73F.7000307@mcs.anl.gov>
Message-ID: <1233340920.18750.1.camel@localhost>

Cog r2267 contains a tentative fix for this. The bootstrap script is
started without -l, and if java cannot be found, it attempts to get that
information using bash -l.

I haven't tested it.

On Tue, 2009-01-27 at 23:03 -0600, Michael Wilde wrote:
> I dug a bit deeper. As far as I can tell, this is what's happening:
>
> 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars
> and set the PATH to contain OSG stuff. So if you do a globus-job-run of
> /usr/bin/printenv (i.e. with no shell) you see all this, including java
> in the path (from an osg dir).
>
> 2) when you globus-job-run /bin/sh, all this stays around, but
>
> 3) when you globus-job-run /bin/sh with -l, it runs /etc/profile, which
> un-does the path and LD_LIBRARY_PATH, setting PATH to some default and
> LD_LIBRARY_PATH to null. I *think* this is being done by softenv which
> runs from /etc/profile.d, called at the end of /etc/profile.
>
> You can simulate this with:
>
> globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c "which java; source
> /etc/profile; which java" (or try printenv instead of which java to see
> the details)
>
> So bottom line: there's at least two cases where -l hurts, this one, and
> abe, where attempts to run login shells from globus are thwarted.
>
> If the purpose of -l was just to get java in the path, then
> for OSG sites that behave like teraport, just omitting -l should work,
> because the OSG jobmanager modes put it in the path.
>
> For sites like abe, bypassing -l, and forcing the user to put Java in
> the path with a .bashrc or equivalent, may work. (The hack I used on abe
> was to remove the -l arg, and insert this in bootstrap.sh:
>
> +if [ -f ~/.myetcprofile ]; then
> + source ~/.myetcprofile
> +else
> + source /etc/profile
> +fi
>
> One option is to accept a per-site option from sites.xml to bypass "-l"
> on the startup shell, and insert the logic above for something like
> .coasterinit, sourcing that if the user provides it.
>
> Another option is to put a +java line in the OSG .soft file on TeraPort.
>
> It's possible this problem only exists on the few sites like teraport that
> run both OSG mods and softenv???
>
> I think we need to test coasters broadly across OSG to be sure (Ben's IP
> problem is a case in point). But a simple shell test across all the OSG
> VO sites could detect whether Java will be there or not, with and
> without -l.
>
> - Mike
>
>
> On 1/27/09 9:33 PM, Mihael Hategan wrote:
> > Hmm. Looks like -l has the opposite effect of what I thought it should
> > do (end up with an environment equivalent to the one you get in when you
> > log in as an interactive session). Is it my misunderstanding or
> > something else?
> >
> > On Tue, 2009-01-27 at 13:41 -0600, Michael Wilde wrote:
> >> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
> >>
> >> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs
> >> mode.
> >>
> >> I'm using 0.8rc1 and submitting from tp-login.
> >>
> >> I am running with a DOEgrids cert in the OSG VO.
> >>
> >> I *think* the issue is that when a gt2 job on this vo runs with a login
> >> shell, it doesn't get java in its path.
> >> > >> When I run /bin/sh *without* the "-l" option, under globus, I do get a > >> java in my path. > >> > >> Allan: what VO did you run on when you got a sucsessful gt2:gt2:pbs > >> coaster run on teraport, after you fixed the walltime issue? > >> > >> It seems to me that this is a rough edge with coaster startup. Recall > >> that I had a similar problem running on abe last year: I had to edit out > >> the "-l" and create a custom .profile to get coasters to work. > >> > >> It would be great if we can iron this out in 0.8 or soon after. I'm > >> willing to do some testing and enlist help from Allan and Zhengxiong for > >> wider testing. > >> > >> Do we need special site attributes for specific sites to override > >> default behaviors when they dont work? > >> > >> > >> My sites.xml is: > >> > >> > >> > >> fast > >> 00:05:00 > >> > >> >> url="tp-grid1.ci.uchicago.edu" > >> jobmanager="gt2:gt2:pbs" /> > >> /gpfs1/osg/data/oops/swiftwork > >> > >> > >> > >> I get this on stdout/err: > >> > >> --------------------------------------------- > >> Swift 0.8rc1 swift-r2448 cog-r2261 > >> > >> RunID: 20090127-1305-hcxdpor3 > >> Progress: > >> Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1 > >> Progress: Selecting site:2 Stage in:1 Submitting:1 > >> Progress: Selecting site:2 Submitting:1 Submitted:1 > >> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a > >> on teraport > >> Execution failed: > >> Exception in runoops: > >> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, > >> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, > >> [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]] > >> Host: teraport > >> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j > >> stderr.txt: > >> > >> stdout.txt: > >> > >> ---- > >> > >> Caused by: > >> Could not submit job > >> Caused by: > >> Could not start coaster service > >> Caused by: > >> Task ended before registration 
was received. > >> STDOUT: which: no java in > >> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin) > >> dirname: too few arguments > >> Try `dirname --help' for more information. > >> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No > >> such file or directory > >> > >> STDERR: null > >> Cleaning up... > >> Done > >> > >> ------------------------------------ > >> > >> Checking out the environment with this cert I see: > >> > >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version' > >> /bin/sh: java: command not found > >> > >> > >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version' > >> java version "1.5.0_14" > >> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) > >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > >> > >> > >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; > >> echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH' > >> JAVA_HOME IS: > >> PATH IS: > >> /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin > >> /usr/bin/which: no java in > >> 
(/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
> >> tp$
> >>
> >>
> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo
> >> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
> >>
> >> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java
> >> JAVA_HOME IS:
> >> PATH IS:
> >> /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
> >> java version "1.5.0_14"
> >> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
> >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
> >>
> >>
> >> - Mike
> >>
> >>
> >> On 1/24/09 5:03 PM, Allan Espinosa wrote:
> >>> Hi,
> >>>
> >>> I am using swift0.8rc1. the same also happens to v0.7
> >>>
> >>> I tried submitting a job from communicado to tp-grid1 (teraport) using
> >>> coasters. The swift runtime does not give any error but it does not
> >>> finish as well. Looking through the files received by the teraport
> >>> head node, i observed that swift keeps submitting gram jobs. It looks
> >>> like that the submitted pbs scripts kept finishing / failing.
> >>>
> >>> digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we
> >>> see that maxwalltime become 101:00 from 00:10:00 (in sites.xml)
> >>>
> >>> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl"
> >>> "http://128.135.125.118:50001" "1728236079"
> >>> #! /bin/sh
> >>> # PBS batch job script built by Globus job manager
> >>> #
> >>> #PBS -S /bin/sh
> >>> #PBS -m n
> >>> #PBS -q fast
> >>> #PBS -l walltime=101:00
> >>> #PBS -o /dev/null
> >>> #PBS -e /dev/null
> >>> #PBS -l nodes=1
> >>> HOME="/home/aespinosa";
> >>> export HOME;
> >>> OSG_DATA="/gpfs1/osg/data";
> >>> ...
> >>> ...
> >>> counter=0 > >>> exit_code=0 > >>> while test $counter -lt 1; do > >>> /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter; > >>> > >>> read tmp_exit_code < > >>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter > >>> if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then > >>> exit_code=$tmp_exit_code > >>> fi > >>> counter=`expr $counter + 1` > >>> done > >>> > >>> exit $exit_code > >>> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max > >>> walltime requirement > >>> > >>> > >>> > >>> Below is my sites.xml: > >>> > >>> > >>> > >>> > >>> fast > >>> 00:10:00 > >>> >>> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4"> > >>> > >>> >>> jobmanager="gt2:gt2:pbs" /> > >>> > >>> /disks/tp-gpfs/scratch/aespinosa > >>> > >>> > >>> > >>> > >>> This does not happen if i use "local:pbs" as the jobmanager for the > >>> coaster and was successful in running jobs > >>> -Allan > >>> > >>> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >