From iraicu at cs.uchicago.edu Tue Nov 2 14:47:36 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 02 Nov 2010 14:47:36 -0500 Subject: [Swift-devel] Call for Participation: 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10), co-located with Supercomputing 2010 -- win an Apple iPad!!! Message-ID: <4CD06AD8.1050201@cs.uchicago.edu> Dear all, We invite you to participate in the 3rd workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) on Monday, November 15th, 2010, co-located with IEEE/ACM Supercomputing 2010 in New Orleans LA. MTAGS will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. A few highlights of the workshop: * *Workshop Program: *The program can be found at http://www.cs.iit.edu/~iraicu/MTAGS10/program.htm; papers and slides will be posted by November 15th, 2010 * *Keynote speaker: *Roger Barga, PhD, Architect, Extreme Computing Group, Microsoft Research * *Best Paper Nominees: * o Timothy Armstrong, Mike Wilde, Daniel Katz, Zhao Zhang, Ian Foster. "/Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks/", 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) 2010 o Thomas Budnik, Brant Knudson, Mark Megerian, Sam Miller, Mike Mundy, Will Stockdell. "/Blue Gene/Q Resource Management Architecture/", 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) 2010 * *Attendance Prize: *There will be a /*free Apple iPad */giveaway at the end of the workshop; must attend at least 1 talk throughout the day at the workshop, and must be present to win at the end of the workshop at 6:15PM The workshop program is: * 9:00AM Opening Remarks * 9:10AM *Keynote: Data Laden Clouds, Roger Barga, PhD, Architect, Extreme Computing Group, Microsoft Research * * Session 1: Applications o 10:30AM Many Task Computing for Modeling the Fate of Oil Discharged from the Deep Water Horizon Well Blowout o 11:00AM Many-Task Applications in the Integrated Plasma Simulator o 11:30AM Compute and data management strategies for grid deployment of high throughput protein structure studies * Session 2: Storage o 1:30PM Processing Massive Sized Graphs Using Sector/Sphere o 2:00PM Easy and Instantaneous Processing for Data-Intensive Workflows o 2:30PM Detecting Bottlenecks in Parallel DAG-based Data Flow Programs * Session 3: Resource Management o 3:30PM Improving Many-Task Computing in Scientific Workflows Using P2P Techniques o 4:00PM Dynamic Task Scheduling for the Uintah Framework o 4:30PM Automatic and Coordinated Job Recovery for High Performance Computing * Session 4: Best Papers Nominees o 5:15PM Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks o 5:45PM Blue Gene/Q Resource Management Architecture * 6:15PM Best Paper Award, Attendees Prizes, & Closing Remarks We look forward to seeing you at the workshop in less than 2 weeks! Regards, Ioan Raicu, Yong Zhao, and Ian Foster MTAGS10 Chairs http://www.cs.iit.edu/~iraicu/MTAGS10/ -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 
31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cell: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Wed Nov 3 10:52:03 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 3 Nov 2010 10:52:03 -0500 (CDT) Subject: [Swift-devel] Swift parser fails under Java 1.6.0_07 In-Reply-To: <1214002773.7635.1288798811790.JavaMail.root@zimbra.anl.gov> Message-ID: <287862175.7750.1288799523344.JavaMail.root@zimbra.anl.gov> I (with the help of a new user) just painfully rediscovered that Swift's parser fails under the (somewhat old) Java JRE release 1.6.0_07, which happens to be the default on the UChicago IBI cluster. [wilde at ibicluster t2]$ java -version java version "1.6.0_07" Java(TM) SE Runtime Environment (build 1.6.0_07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) When I run under 1.6.0_20 the problem does not occur. Under _07, Swift fails to compile even the most trivial script, as in the example below, in which I include the complete log file. Has anyone else seen this, and/or know the cause? I'm puzzled that the swift .log file doesn't start with the typical version and environment info. It's almost as if Swift is taking some strangely different execution path under 1.6.0_07. This is not yet a major issue, but it's unsettling that a specific Java version could trigger this behavior. - Mike [wilde at ibicluster t4]$ ls -l total 4 -rw-r--r-- 1 wilde brdfuser 14 Nov 3 10:42 hw.swift [wilde at ibicluster t4]$ cat hw.swift trace ("hi"); [wilde at ibicluster t4]$ swift hw.swift >stdout 2>stderr [wilde at ibicluster t4]$ ls -l total 272 -rw-r--r-- 1 wilde brdfuser 1559 Nov 3 10:44 hw-20101103-1045-3ljqjlz3.log -rw-r--r-- 1 wilde brdfuser 14 Nov 3 10:42 hw.swift -rw-r--r-- 1 wilde brdfuser 1 Nov 3 10:44 hw.xml -rw-r--r-- 1 wilde brdfuser 257393 Nov 3 10:44 stderr -rw-r--r-- 1 wilde brdfuser 0 Nov 3 10:44 stdout -rw-r--r-- 1 wilde brdfuser 57 Nov 3 10:44 swift.log [wilde at ibicluster t4]$ cat hw-20101103-1045-3ljqjlz3.log 2010-11-03 10:45:38,587-0500 DEBUG Loader Max heap: 238616576 2010-11-03 10:45:38,588-0500 INFO Loader hw.swift: source file is new. Recompiling.
2010-11-03 10:45:39,066-0500 DEBUG Loader Detailed exception: org.griphyn.vdl.karajan.CompilationException: Unable to parse intermediate XML at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) at org.griphyn.vdl.karajan.Loader.compile(Loader.java:295) at org.griphyn.vdl.karajan.Loader.main(Loader.java:140) Caused by: org.apache.xmlbeans.XmlException: /userhom2/2/wilde/swift/lab/t4/hw.xml:2:1: error: Unexpected end of file after null at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3467) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) at org.globus.swift.language.ProgramDocument$Factory.parse(ProgramDocument.java:499) at org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) ... 2 more Caused by: org.xml.sax.SAXParseException: Unexpected end of file after null at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435) ... 9 more [wilde at ibicluster t4]$ wc -l stderr 3529 stderr [wilde at ibicluster t4]$ head -30 stderr action parse error in group XDTM line 3; template context is [program] line 1:1: unexpected token: program at org.antlr.stringtemplate.language.ActionParser.primaryExpr(ActionParser.java:722) at org.antlr.stringtemplate.language.ActionParser.expr(ActionParser.java:430) at org.antlr.stringtemplate.language.ActionParser.templatesExpr(ActionParser.java:212) at org.antlr.stringtemplate.language.ActionParser.action(ActionParser.java:126) at org.antlr.stringtemplate.StringTemplate.parseAction(StringTemplate.java:957) at org.antlr.stringtemplate.language.TemplateParser.action(TemplateParser.java:161) at org.antlr.stringtemplate.language.TemplateParser.template(TemplateParser.java:127) at org.antlr.stringtemplate.StringTemplate.breakTemplateIntoChunks(StringTemplate.java:931) at org.antlr.stringtemplate.StringTemplate.setTemplate(StringTemplate.java:532) at org.antlr.stringtemplate.language.GroupParser.template(GroupParser.java:327) at org.antlr.stringtemplate.language.GroupParser.group(GroupParser.java:186) at org.antlr.stringtemplate.StringTemplateGroup.parseGroup(StringTemplateGroup.java:769) at org.antlr.stringtemplate.StringTemplateGroup.(StringTemplateGroup.java:271) at org.antlr.stringtemplate.StringTemplateGroup.(StringTemplateGroup.java:241) at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:57) at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:45) at org.griphyn.vdl.karajan.Loader.compile(Loader.java:290) at org.griphyn.vdl.karajan.Loader.main(Loader.java:140) Can't parse chunk: program xmlns=$defaultNS()$ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema"$if(!namespaces)$ line 1:15: unexpected char: '$' at org.antlr.stringtemplate.language.ActionLexer.nextToken(ActionLexer.java:219) at antlr.TokenBuffer.fill(TokenBuffer.java:69) at antlr.TokenBuffer.LA(TokenBuffer.java:80) at antlr.LLkParser.LA(LLkParser.java:52) at antlr.Parser.consumeUntil(Parser.java:149) at antlr.Parser.recover(Parser.java:312) [wilde at ibicluster t4]$ which java /usr/java/latest/bin/java 
[wilde at ibicluster t4]$ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Wed Nov 3 15:19:55 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 03 Nov 2010 13:19:55 -0700 Subject: [Swift-devel] Swift parser fails under Java 1.6.0_07 In-Reply-To: <287862175.7750.1288799523344.JavaMail.root@zimbra.anl.gov> References: <287862175.7750.1288799523344.JavaMail.root@zimbra.anl.gov> Message-ID: <1288815595.32143.4.camel@blabla2.none> On Wed, 2010-11-03 at 10:52 -0500, Michael Wilde wrote: > I (with the help of a new user) just painfully re-discoverd that Swift's parser fails under the (somewhat old) Java JRE release 1.6.0_07 which happens to be the default under on the UChicago IBI cluster. > > [wilde at ibicluster t2]$ java -version > java version "1.6.0_07" > Java(TM) SE Runtime Environment (build 1.6.0_07-b06) > Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) > > When I run under 1.6.0_20 the problem does not occur. > > Under _07, Swift fails compiling even the most trivial script, as in the example below, in which I include the complete log file. > > Has anyone else seen this, and/or know the cause? > > Im puzzled that the swift .log file doesn't start with the typical > version and environment info. Its almost like swift is taking some > strangely different execution path under 1.6.0_07. I think the version is printed after compilation. > > This is not yet a major issue, but its unsettling that a specific java version could trigger this behavior. I suspect it's the version of SAX (xml parser) included with the JVM. Which may also suggest a solution: use a good version of the SAX jar and override the JVM provided one. Mihael From bugzilla-daemon at mcs.anl.gov Sun Nov 7 13:23:19 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sun, 7 Nov 2010 13:23:19 -0600 (CST) Subject: [Swift-devel] [Bug 3] VDL2 quickstart guide issues In-Reply-To: References: Message-ID: <20101107192319.D52502C9EC@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=3 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED CC| |skenny at uchicago.edu Resolution| |FIXED --- Comment #3 from skenny 2010-11-07 13:23:19 --- link to the really quick start guide has been removed (and the code commented out) in the doc and many of the other issues here have been resolved (as noted below). the remainder of the bullet points seem to be general comments/questions, so i'm not sure this is the right place for them (and as mihael points out this ambiguity could keep the bug open indefinitely). thus i'm closing this bug. but if others feel there is a clear significant bug here in the comments that hasn't been addressed please re-open it as a separate bug report. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching someone on the CC list of the bug. 
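A quick sketch of the SAX-override workaround Mihael suggests above for the 1.6.0_07 parser failure, using Java's endorsed-standards mechanism. This is untested, and the directory name, jar names, and the way the flag reaches the swift launcher are all illustrative rather than part of the Swift distribution:

# put a known-good XML parser implementation (e.g. Xerces) in a scratch directory
mkdir -p "$HOME/endorsed"
cp xercesImpl.jar xml-apis.jar "$HOME/endorsed/"
# then arrange for the JVM command line that launches Swift to carry either
#   -Djava.endorsed.dirs=$HOME/endorsed
# or an explicit factory override such as
#   -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl
# (how the flag gets injected depends on the bin/swift wrapper script in use)

Whether this actually avoids the failure is unverified here; the stderr above points first at the StringTemplate stage, so the SAX suspicion may only be part of the story.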
From hategan at mcs.anl.gov Sun Nov 7 18:33:11 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 07 Nov 2010 16:33:11 -0800 Subject: [Swift-devel] Standard Swift coaster behavior doesnt work wellfor sporadic jobs In-Reply-To: <1287642685.15118.46.camel@blabla2.none> References: <57986955.982201286845836821.JavaMail.root@zimbra.anl.gov> <4CBF0D27.80302@gmail.com><1287619089.13330.2.camel@blabla2.none> <1287642685.15118.46.camel@blabla2.none> Message-ID: <1289176391.11359.1.camel@blabla2.none> I can't reproduce this. I tried to have both the blocks and the services shut down between the jobs, and both scenarios lead to new blocks/services being started and the jobs completing. Though I did clean up the code a bit after that. I'll need full logs from failed runs (both swift and coaster). Mihael On Wed, 2010-10-20 at 23:31 -0700, Mihael Hategan wrote: > On Thu, 2010-10-21 at 00:00 +0000, jon.monette at gmail.com wrote: > > Ok. I can try to put together a script that does it. But I think it just need to be a script in which between two jobs that are submitted to a site there is a long time so all the workers time out. > > I can probably do that with a local sleep sandwiched between two coaster > jobs. So don't worry about it. > > Mihael > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From jon.monette at gmail.com Sun Nov 7 19:55:42 2010 From: jon.monette at gmail.com (Jonathan Monette) Date: Sun, 07 Nov 2010 19:55:42 -0600 Subject: [Swift-devel] Standard Swift coaster behavior doesnt work wellfor sporadic jobs In-Reply-To: <1289176391.11359.1.camel@blabla2.none> References: <57986955.982201286845836821.JavaMail.root@zimbra.anl.gov> <4CBF0D27.80302@gmail.com><1287619089.13330.2.camel@blabla2.none> <1287642685.15118.46.camel@blabla2.none> <1289176391.11359.1.camel@blabla2.none> Message-ID: <4CD7589E.80706@gmail.com> Ok. I can send you those. I have a test tomorrow so I can send them later in the week. On 11/7/10 6:33 PM, Mihael Hategan wrote: > I can't reproduce this. > > I tried to have both the blocks and the services shut down between the > jobs, and both scenarios lead to new blocks/services being started and > the jobs completing. Though I did clean up the code a bit after that. > > I'll need full logs from failed runs (both swift and coaster). > > Mihael > > On Wed, 2010-10-20 at 23:31 -0700, Mihael Hategan wrote: >> On Thu, 2010-10-21 at 00:00 +0000, jon.monette at gmail.com wrote: >>> Ok. I can try to put together a script that does it. But I think it just need to be a script in which between two jobs that are submitted to a site there is a long time so all the workers time out. >> I can probably do that with a local sleep sandwiched between two coaster >> jobs. So don't worry about it. >> >> Mihael >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. 
- Albert Einstein From bugzilla-daemon at mcs.anl.gov Mon Nov 8 14:46:01 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 14:46:01 -0600 (CST) Subject: [Swift-devel] [Bug 39] a poor syntax error In-Reply-To: References: Message-ID: <20101108204601.28D142CB90@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=39 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED CC| |skenny at uchicago.edu Resolution| |FIXED --- Comment #2 from skenny 2010-11-08 14:46:00 --- the latest swift reports the next token (after the missing semicolon) as the unexpected token, which seems appropriate...that is, it does not misinterpret the > as a greater-than symbol: [skenny at login2]$ cat 119-missing-semi.swift type file {}; type student { file name; file age; file gpa; } app (file t) getname(string n) { echo n stdout=@filename(t); } file results ; student fnames[] results = getname(@filename(fnames[0])); [skenny at login2]$ swift 119-missing-semi.swift Could not compile SwiftScript source: line 13:1: unexpected token: results a test for this has been added to swift/tests/language/should-not-work -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 8 15:16:51 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 15:16:51 -0600 (CST) Subject: [Swift-devel] [Bug 5] Directory names seem wrong, and files are missing In-Reply-To: References: Message-ID: <20101108211651.BABBA2B87F@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=5 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED CC| |skenny at uchicago.edu Resolution| |FIXED --- Comment #2 from skenny 2010-11-08 15:16:51 --- this appears to refer to outdated tutorial material. first.swift is the current example in section 2 of the tutorial and does not refer to any non-existent files. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From bugzilla-daemon at mcs.anl.gov Mon Nov 8 16:19:44 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 16:19:44 -0600 (CST) Subject: [Swift-devel] [Bug 6] Not globally unique temporary file names In-Reply-To: References: Message-ID: <20101108221944.A57BF2CB98@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=6 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED CC| |skenny at uchicago.edu --- Comment #3 from skenny 2010-11-08 16:19:44 --- both instances produce 10 files: [skenny at login2]$ cat t2_outnames.swift type messagefile; app (messagefile t) greeting(string s) { echo s stdout=@filename(t); } myprog() { messagefile outfile; outfile=greeting("this file"); } int idx[] = [1:10]; foreach i in idx { myprog(); } foreach i in idx { messagefile outfile; outfile = greeting("this file"); } [skenny at login2]$ swift t2_outnames.swift Swift svn swift-r3680 cog-r2913 RunID: 20101108-1615-7j7nvdc9 Progress: Progress: Selecting site:18 Stage in:1 Finished successfully:1 ... 
[skenny at login2]$ ls _concurrent/ outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-0 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-0 outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-1 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-1 outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-2 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-2 outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-3 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-3 outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-4 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-4 outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-5 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-5 outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-6 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-6 outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-7 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-7 outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-8 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-8 outfile-3976bc53-6010-41b4-adc2-823c3cb17d91-1-9 outfile-7e53f096-cf13-4eea-b52c-ee0c341f03bf-2-9 this addresses ian's issue...then if i re-run it the output files produced are given the same names (so they just overwrite the output from the previous run) which, i *think* is what mihael was referring to (?) this bug might need a bit of clarification as it's not clear to me if it's been resolved... -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 8 17:11:09 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 17:11:09 -0600 (CST) Subject: [Swift-devel] [Bug 6] Not globally unique temporary file names In-Reply-To: References: Message-ID: <20101108231109.CBBA72B86D@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=6 --- Comment #4 from Mihael Hategan 2010-11-08 17:11:09 --- (In reply to comment #3) > this addresses ian's issue...then if i re-run it the output files produced are > given the same names (so they just overwrite the output from the previous run) > which, i *think* is what mihael was referring to (?) Right. Which is Ian's initial complaint. The question is whether you get the same behavior if you were to import that in a different script and repeatedly call that program, would it still work? I suspect that it now does, since swift will use both the static random name, as well as the thread id, which will be a mix of the parent thread id and the stand-alone thread ids. Though if you had hard-mapped files, you would still run into problems. Mihael -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. 
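To make the hard-mapped caveat in comment #4 above concrete, here is a small SwiftScript sketch in the spirit of the t2_outnames.swift test (it reuses the messagefile type and greeting() app from that test; the variable names and shared_output.txt are made up for illustration):

int idx[] = [1:10];
foreach i in idx {
    // unmapped: the concurrent mapper generates a distinct _concurrent/... name per iteration
    messagefile autoNamed;
    autoNamed = greeting("unique output");

    // hard-mapped: every iteration (and every re-run) targets the same physical file,
    // so the writes collide no matter how unique the generated names are
    messagefile hardMapped <"shared_output.txt">;
    hardMapped = greeting("colliding output");
}

The first declaration is the case the comments above show working; the second is the pattern that would still run into the problem Mihael describes.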
From bugzilla-daemon at mcs.anl.gov Mon Nov 8 17:43:56 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 17:43:56 -0600 (CST) Subject: [Swift-devel] [Bug 31] error message when mapper parameter is wrong In-Reply-To: References: Message-ID: <20101108234357.022232B8BF@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=31 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |skenny at uchicago.edu --- Comment #2 from skenny 2010-11-08 17:43:56 --- currently gives the appropriate mapper parameter error (though still shows that there's a java exception): [skenny at communicado]$ swift mapperparam.swift Swift svn swift-r3680 cog-r2913 RunID: 20101108-1737-9nkngll7 Execution failed: java.lang.RuntimeException: org.griphyn.vdl.mapping.InvalidMapperException: csv_mapper: CSV mapper must have a file parameter. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 8 17:52:36 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 17:52:36 -0600 (CST) Subject: [Swift-devel] [Bug 43] simple_mapper and ClassCastException In-Reply-To: References: Message-ID: <20101108235236.0B9642B89A@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=43 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED CC| |skenny at uchicago.edu Resolution| |WORKSFORME -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching someone on the CC list of the bug. You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 8 19:27:56 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 19:27:56 -0600 (CST) Subject: [Swift-devel] [Bug 71] Develop a Matlab version for the SIDGrid workflow In-Reply-To: References: Message-ID: <20101109012756.124B12CBAF@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=71 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED CC| |skenny at uchicago.edu Resolution| |WONTFIX --- Comment #1 from skenny 2010-11-08 19:27:55 --- Bennett and his students are gone. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 8 19:51:02 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 19:51:02 -0600 (CST) Subject: [Swift-devel] [Bug 77] Remote access to the CNARI Data In-Reply-To: References: Message-ID: <20101109015102.A2F0F2CBAF@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=77 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED CC| |skenny at uchicago.edu Resolution| |FIXED --- Comment #1 from skenny 2010-11-08 19:51:02 --- the CNARI project has many workflows for accessing their databases remotely... 
currently documented primarily on the wiki: http://www.ci.uchicago.edu/wiki/bin/view/CNARI/WebHome -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 8 19:59:53 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 19:59:53 -0600 (CST) Subject: [Swift-devel] [Bug 75] Run the SIDGrid workflow through Falkon In-Reply-To: References: Message-ID: <20101109015953.BA95D2CBB0@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=75 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED CC| |skenny at uchicago.edu Resolution| |WONTFIX --- Comment #1 from skenny 2010-11-08 19:59:53 --- the SIDGrid project was comprised of a portal+processing scripts+grid resources. i'm not sure which 'tool' this bug was referring to, but those processing scripts have become part of the CNARI project. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 8 22:35:06 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 22:35:06 -0600 (CST) Subject: [Swift-devel] [Bug 69] Experiment Management Frontend In-Reply-To: References: Message-ID: <20101109043506.34F032B8FF@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=69 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED CC| |skenny at uchicago.edu Resolution| |WONTFIX --- Comment #2 from skenny 2010-11-08 22:35:06 --- i *think* this is referring to some iteration of the old SIDGrid portal. the link provided does not work so i'm not entirely certain what this refers to. i'm closing the bug...please feel free to re-open if you can provide more information and you believe this to be a valid swift bug. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 8 22:47:31 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 8 Nov 2010 22:47:31 -0600 (CST) Subject: [Swift-devel] [Bug 70] LEB problem: Worker vs Investor occupation selection In-Reply-To: References: Message-ID: <20101109044731.ACB162CABB@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=70 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED CC| |skenny at uchicago.edu Resolution| |WONTFIX --- Comment #3 from skenny 2010-11-08 22:47:31 --- this does not appear to be a swift bug/ticket but rather something specific to a matlab application (?) link is broken. closing ticket, please re-open if more detail/info can be provided. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. 
From bugzilla-daemon at mcs.anl.gov Tue Nov 9 00:14:50 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 9 Nov 2010 00:14:50 -0600 (CST) Subject: [Swift-devel] [Bug 6] Not globally unique temporary file names In-Reply-To: References: Message-ID: <20101109061450.1F3112CBC2@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=6 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED --- Comment #5 from skenny 2010-11-09 00:14:49 --- (In reply to comment #4) > (In reply to comment #3) > > this addresses ian's issue...then if i re-run it the output files produced are > > given the same names (so they just overwrite the output from the previous run) > > which, i *think* is what mihael was referring to (?) > > Right. Which is Ian's initial complaint. > > The question is whether you get the same behavior if you were to import that in > a different script and repeatedly call that program, would it still work? I > suspect that it now does correct...it still works when you import the function from another script and then call it repeatedly. thus it seems to have been resolved. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Tue Nov 9 00:56:36 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 9 Nov 2010 00:56:36 -0600 (CST) Subject: [Swift-devel] [Bug 82] Request for a centralized installed applications catalog In-Reply-To: References: Message-ID: <20101109065636.91C352CBD4@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=82 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED CC| |skenny at uchicago.edu Resolution| |WONTFIX --- Comment #2 from skenny 2010-11-09 00:56:36 --- as mentioned by ben, this is a discussion topic more so than a bug/ticket. therefore, i am closing the bug. however, if others feel it is a valid ticket, please re-open with further, detailed specification. it's worth noting that there has been a general shift amongst users (HNL users specifically) to having a single shell wrapper executable entered into tc.data which is used to call all other binaries on a given site. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching someone on the CC list of the bug. You are watching the reporter. From wilde at mcs.anl.gov Tue Nov 9 18:21:56 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 9 Nov 2010 18:21:56 -0600 (CST) Subject: [Swift-devel] Hostnames vs BoundContacts in vdl-int.k In-Reply-To: <1312386496.38446.1289347608250.JavaMail.root@zimbra.anl.gov> Message-ID: <1632030314.38521.1289348516452.JavaMail.root@zimbra.anl.gov> I'm trying to extend Justin's initial cut of "external" CDM file types to connect to Globus Online. The current code in trunk only handles input. I'm trying to hook it into doStageOut as well. I'm making progress but stuck at the moment on passing hosts from vdl-int.k into the CDM external java functions. Specifically I'm getting a string as a hostname argument to the cdm:external() element, where it's expecting a "BoundContact".
It pulls its args off the Karajan stack as follows: --- public void cdm_external(VariableStack stack) throws ExecutionException { String provider = (String) PA_PROVIDER.getValue(stack); String srchost = (String) PA_SRCHOST.getValue(stack); String srcfile = (String) PA_SRCFILE.getValue(stack); String srcdir = (String) PA_SRCDIR.getValue(stack); BoundContact bc = (BoundContact) PA_DESTHOST.getValue(stack); String destdir = (String) PA_DESTDIR.getValue(stack); --- Is it the case that within vdl-int.k some host variables are simple strings (site names) whereas others are structured objects representing the site? I'm having difficulty tracing this and would appreciate any guidance you can offer. Thanks, Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Nov 9 22:53:12 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 9 Nov 2010 22:53:12 -0600 (CST) Subject: [Swift-devel] Re: Hostnames vs BoundContacts in vdl-int.k In-Reply-To: <1632030314.38521.1289348516452.JavaMail.root@zimbra.anl.gov> Message-ID: <1356016262.39112.1289364792649.JavaMail.root@zimbra.anl.gov> OK, I think I get it. The "host" argument to doStagein and doStageOut is a BoundContact (ie, a "site" or pool entry). The dhost and srchost parameters that those elements pass down to the lower level elements are computed from filename arguments and are simple strings. - Mike ----- Original Message ----- > Im trying to extend Justin's initial cut of "external" CDM file types > to connect to Globus Online. > > The current code in trunk only handles input. Im trying to hook it > into doStageOut as well. > > Im making progress but stuck at the moment on passing hosts from > vdl-int.k into the CDM external java functions. > > Specifically I'm getting a string as a hostname argument to the > cdm:external() element, where its expecting a "BoundContact". It pulls > its args off the Karajan stack as follows: > --- > public void cdm_external(VariableStack stack) > throws ExecutionException > { > String provider = (String) PA_PROVIDER.getValue(stack); > String srchost = (String) PA_SRCHOST.getValue(stack); > String srcfile = (String) PA_SRCFILE.getValue(stack); > String srcdir = (String) PA_SRCDIR.getValue(stack); > BoundContact bc = (BoundContact) PA_DESTHOST.getValue(stack); > String destdir = (String) PA_DESTDIR.getValue(stack); > --- > > Is it the case that within vdl-int.k some host variables are simple > strings (site names) whereas others are structured objects > representing the site? > > I'm having difficulty tracing this and would appreciate any guidance > you can offer. > > Thanks, > > Mike > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Nov 10 14:35:00 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 10 Nov 2010 14:35:00 -0600 (CST) Subject: [Swift-devel] File names in gridftp provider seem to need an extra leading / Message-ID: <267320207.42737.1289421300954.JavaMail.root@zimbra.anl.gov> Both when I map files to a physical name starting with gsiftp://, and when I copy files using the swift version of globus-url-copy, I seem to need an extra "/" at the start of the file's pathname. 
Here's an example of the issue from within a .swift script: login1$ swift -config cf -sites.file sites.xml -tc.file tc.data gcat.swift Swift svn swift-r3702 (swift modified locally) cog-r2924 (cog modified locally) RunID: 20101110-1426-uouuvdf1 Progress: Failed to transfer wrapper log from gcat-20101110-1426-uouuvdf1/info/g on localhost Execution failed: Exception in cp: Arguments: [etc/group, home/wilde/godata/gridoutput.txt] Host: localhost Directory: gcat-20101110-1426-uouuvdf1/jobs/g/cp-g3yt5f1k stderr.txt: stdout.txt: ---- Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Failed to retrieve file information about etc/group ^^^^^^^^^ Caused by: Server refused performing the request. Custom message: Server refused MLST command (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed : globus_gridftp_server_file.c:globus_l_gfs_file_stat:389: 500-System error in stat: No such file or directory 500-A system call failed: No such file or directory 500 End.] login1$ cat gcat.swift type file; app (file o) copy (file i) { cp @i @o; } file f1<"gsiftp://pf-grid.unl.edu/etc/group">; file f2<"gsiftp://gridftp.pads.ci.uchicago.edu/home/wilde/godata/gridoutput.txt">; f2 = copy(f1); login1$ When I put 2 slashes after the hostname in the URIs above, it works. A similar issue occurs using Swift globus-url-copy, using the file:// protocol. Rather then the usual 3 slashes after file:, I need *4*. With 3, it looks (in my test) for etc/group instead of /etc/group. With 4 it works. With 2, it drops of etc entirely and looks for the file "group". Are both of these the normal/expected behavior from the Swift gridftp code, or is this an error? - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Nov 10 14:53:46 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 10 Nov 2010 14:53:46 -0600 (CST) Subject: [Swift-devel] Time/hires.pm used by coaster worker.pl is not available on Ranger compute nodes In-Reply-To: <485388925.42818.1289421889393.JavaMail.root@zimbra.anl.gov> Message-ID: <1124957525.42929.1289422426208.JavaMail.root@zimbra.anl.gov> Glen Hocky observed this in recent runs. It makes worker.pl fail to start. I verified that this module is available on the login nodes but not the worker nodes. For now, worker.pl works OK if you just comment out the line: # use Time::HiRes qw(time); The error you get on Ranger is below. Sarah, you may want to watch for this if you run on Ranger. 
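A gentler variant of the comment-it-out workaround above is to load Time::HiRes only where it is installed and silently fall back to the built-in one-second time() elsewhere. This is just a sketch of that idea, not what the shipped worker.pl does:

BEGIN {
    # import the high-resolution time() only if the module is present;
    # on hosts without it (e.g. the Ranger compute nodes) the eval fails
    # quietly and Perl's built-in time() stays in effect
    eval {
        require Time::HiRes;
        Time::HiRes->import(qw(time));
    };
}

With that in place the same worker.pl runs on login and compute nodes; only the timestamp granularity in the worker log differs.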
- Mike Can't locate Time/HiRes.pm in @INC (@INC contains: /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi /usr/lib/perl5/5.8.5 /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.4/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.3/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.2/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.1/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.0/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl/5.8.4 /usr/lib/perl5/site_perl/5.8.3 /usr/lib/perl5/site_perl/5.8.2 /usr/lib/perl5/site_perl/5.8.1 /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.4/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.3/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.2/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.1/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.0/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl/5.8.4 /usr/lib/perl5/vendor_perl/5.8.3 /usr/lib/perl5/vendor_perl/5.8.2 /usr/lib/perl5/vendor_perl/5.8.1 /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl .) at ./worker.pl line 17. BEGIN failed--compilation aborted at ./worker.pl line 17. i115-206$ fg -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Wed Nov 10 15:55:01 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 10 Nov 2010 13:55:01 -0800 Subject: [Swift-devel] Re: Hostnames vs BoundContacts in vdl-int.k In-Reply-To: <1356016262.39112.1289364792649.JavaMail.root@zimbra.anl.gov> References: <1356016262.39112.1289364792649.JavaMail.root@zimbra.anl.gov> Message-ID: <1289426101.24457.2.camel@blabla2.none> On Tue, 2010-11-09 at 22:53 -0600, Michael Wilde wrote: > OK, I think I get it. The "host" argument to doStagein and doStageOut > is a BoundContact (ie, a "site" or pool entry). The dhost and srchost > parameters that those elements pass down to the lower level elements > are computed from filename arguments and are simple strings. Right. I think transfer() accepts both a host object and a hostname[:port] for the srchost/desthost arguments. One of them is passed as a host object while the other is passed as a string. Mihael > > - Mike > > ----- Original Message ----- > > Im trying to extend Justin's initial cut of "external" CDM file types > > to connect to Globus Online. > > > > The current code in trunk only handles input. Im trying to hook it > > into doStageOut as well. > > > > Im making progress but stuck at the moment on passing hosts from > > vdl-int.k into the CDM external java functions. > > > > Specifically I'm getting a string as a hostname argument to the > > cdm:external() element, where its expecting a "BoundContact". 
It pulls > > its args off the Karajan stack as follows: > > --- > > public void cdm_external(VariableStack stack) > > throws ExecutionException > > { > > String provider = (String) PA_PROVIDER.getValue(stack); > > String srchost = (String) PA_SRCHOST.getValue(stack); > > String srcfile = (String) PA_SRCFILE.getValue(stack); > > String srcdir = (String) PA_SRCDIR.getValue(stack); > > BoundContact bc = (BoundContact) PA_DESTHOST.getValue(stack); > > String destdir = (String) PA_DESTDIR.getValue(stack); > > --- > > > > Is it the case that within vdl-int.k some host variables are simple > > strings (site names) whereas others are structured objects > > representing the site? > > > > I'm having difficulty tracing this and would appreciate any guidance > > you can offer. > > > > Thanks, > > > > Mike > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > From hategan at mcs.anl.gov Wed Nov 10 15:59:04 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 10 Nov 2010 13:59:04 -0800 Subject: [Swift-devel] Re: File names in gridftp provider seem to need an extra leading / In-Reply-To: <267320207.42737.1289421300954.JavaMail.root@zimbra.anl.gov> References: <267320207.42737.1289421300954.JavaMail.root@zimbra.anl.gov> Message-ID: <1289426344.24457.6.camel@blabla2.none> On Wed, 2010-11-10 at 14:35 -0600, Michael Wilde wrote: > Both when I map files to a physical name starting with gsiftp://, and > when I copy files using the swift version of globus-url-copy, I seem > to need an extra "/" at the start of the file's pathname. Absolute paths need an extra "/". The first one (i.e. the one between the host name and the path name) is considered a separator and not counted as part of the path name. Mihael > > Here's an example of the issue from within a .swift script: > > login1$ swift -config cf -sites.file sites.xml -tc.file tc.data gcat.swift > Swift svn swift-r3702 (swift modified locally) cog-r2924 (cog modified locally) > > RunID: 20101110-1426-uouuvdf1 > Progress: > Failed to transfer wrapper log from gcat-20101110-1426-uouuvdf1/info/g on localhost > Execution failed: > Exception in cp: > Arguments: [etc/group, home/wilde/godata/gridoutput.txt] > Host: localhost > Directory: gcat-20101110-1426-uouuvdf1/jobs/g/cp-g3yt5f1k > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: Failed to retrieve file information about etc/group > ^^^^^^^^^ > Caused by: > Server refused performing the request. Custom message: Server refused MLST command (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed : globus_gridftp_server_file.c:globus_l_gfs_file_stat:389: > 500-System error in stat: No such file or directory > 500-A system call failed: No such file or directory > 500 End.] > login1$ cat gcat.swift > > type file; > > app (file o) copy (file i) > { > cp @i @o; > } > > file f1<"gsiftp://pf-grid.unl.edu/etc/group">; > file f2<"gsiftp://gridftp.pads.ci.uchicago.edu/home/wilde/godata/gridoutput.txt">; > f2 = copy(f1); > login1$ > > When I put 2 slashes after the hostname in the URIs above, it works. > > A similar issue occurs using Swift globus-url-copy, using the file:// protocol. Rather then the usual 3 slashes after file:, I need *4*. With 3, it looks (in my test) for etc/group instead of /etc/group. With 4 it works. With 2, it drops of etc entirely and looks for the file "group". 
> > Are both of these the normal/expected behavior from the Swift gridftp code, or is this an error? > > - Mike > From hategan at mcs.anl.gov Wed Nov 10 16:06:10 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 10 Nov 2010 14:06:10 -0800 Subject: [Swift-devel] Time/hires.pm used by coaster worker.pl is not available on Ranger compute nodes In-Reply-To: <1124957525.42929.1289422426208.JavaMail.root@zimbra.anl.gov> References: <1124957525.42929.1289422426208.JavaMail.root@zimbra.anl.gov> Message-ID: <1289426770.24457.8.camel@blabla2.none> That, I think, is only used for the timestamps in the log file. Otherwise the granularity of localtime() is seconds (and not very useful for timing the worker script). I'm curious whether there is a way to "only include it if it's there". Essentially it re-defines the standard localtime(), so no other changes would be needed. Mihael On Wed, 2010-11-10 at 14:53 -0600, Michael Wilde wrote: > Glen Hocky observed this in recent runs. It makes worker.pl fail to start. > > I verified that this module is available on the login nodes but not the worker nodes. > > For now, worker.pl works OK if you just comment out the line: > > # use Time::HiRes qw(time); > > The error you get on Ranger is below. > > Sarah, you may want to watch for this if you run on Ranger. > > - Mike > > Can't locate Time/HiRes.pm in @INC (@INC contains: /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi /usr/lib/perl5/5.8.5 /usr/lib64/perl5/site_perl/5.8.5/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.4/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.3/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.2/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.1/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.0/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl/5.8.4 /usr/lib/perl5/site_perl/5.8.3 /usr/lib/perl5/site_perl/5.8.2 /usr/lib/perl5/site_perl/5.8.1 /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.4/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.3/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.2/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.1/x86_64-linux-thread-multi /usr/lib64/perl5/vendor_perl/5.8.0/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl/5.8.4 /usr/lib/perl5/vendor_perl/5.8.3 /usr/lib/perl5/vendor_perl/5.8.2 /usr/lib/perl5/vendor_perl/5.8.1 /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl .) at ./worker.pl line 17. > BEGIN failed--compilation aborted at ./worker.pl line 17. > i115-206$ fg > From wilde at mcs.anl.gov Thu Nov 11 09:11:15 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 11 Nov 2010 09:11:15 -0600 (CST) Subject: [Swift-devel] Problems with provider.staging.pin.swiftfiles In-Reply-To: <1788797735.45828.1289488198159.JavaMail.root@zimbra.anl.gov> Message-ID: <1669558788.45842.1289488275177.JavaMail.root@zimbra.anl.gov> Justin, When Tom Uram turns on this option and runs a simple test script (a foreach and an app that just collects node info), he gets an error "520" returned in the swift log, as if from the app. I am thinking that the 520 is somehow coming from worker. This is going from bridled to Eureka worker nodes, with provider staging turned on in proxy mode. When we set provider.staging.pin.swiftfiles to false, the script runs ok. 
We'll need to collect and send logs and a test case, but I wanted to alert you to a potential problem with this feature. - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Thu Nov 11 09:41:43 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 11 Nov 2010 09:41:43 -0600 (CST) Subject: [Swift-devel] Re: Problems with provider.staging.pin.swiftfiles In-Reply-To: <1669558788.45842.1289488275177.JavaMail.root@zimbra.anl.gov> References: <1669558788.45842.1289488275177.JavaMail.root@zimbra.anl.gov> Message-ID: Hello Yes, this was broken a few weeks ago- I will try to restore it ASAP. (Cf. swift-devel post from 9/27.) Justin On Thu, 11 Nov 2010, Michael Wilde wrote: > Justin, > > When Tom Uram turns on this option and runs a simple test script (a > foreach and an app that just collects node info), he gets an error "520" > returned in the swift log, as if from the app. I am thinking that the > 520 is somehow coming from worker. > > This is going from bridled to Eureka worker nodes, with provider staging > turned on in proxy mode. > > When we set provider.staging.pin.swiftfiles to false, the script runs > ok. > > We'll need to collect and send logs and a test case, but I wanted to > alert you to a potential problem with this feature. > > - Mike -- Justin M Wozniak From wilde at mcs.anl.gov Thu Nov 11 16:28:41 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 11 Nov 2010 16:28:41 -0600 (CST) Subject: [Swift-devel] Problems with karajan lists In-Reply-To: <162047948.49532.1289514442117.JavaMail.root@zimbra.anl.gov> Message-ID: <514919228.49534.1289514521745.JavaMail.root@zimbra.anl.gov> I'm trying to append to a list multiple times, but when I try to append the second time, I get an error: import("sys.k") sequential( vec := list() append(vec,10) print(vec) // append(vec,20) // print(vec) ) If I uncomment the 2 lines above, I get the error: login1$ swift foo.k [10.0] [10.0, 20.0] Ex098 org.globus.cog.karajan.workflow.KarajanRuntimeException: Illegal extra argument `[10.0, 20.0]' to kernel:karajan @ foo.k, line: 1 at org.globus.cog.karajan.arguments.NameBindingVariableArguments.append(NameBindingVariableArguments.java:58) Its almost as if append is not consuming its arguments and Karajan is finding extra stuff on the stack when it exits??? Can you help me understand what Im doing wrong here? Thanks, Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Nov 11 16:35:11 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Nov 2010 14:35:11 -0800 Subject: [Swift-devel] Re: Problems with karajan lists In-Reply-To: <514919228.49534.1289514521745.JavaMail.root@zimbra.anl.gov> References: <514919228.49534.1289514521745.JavaMail.root@zimbra.anl.gov> Message-ID: <1289514911.27871.3.camel@blabla2.none> Append is append(list, items) -> list. So it returns a list. If you are only interested in the side-effect of append, you can probably do: discard(append(list, 10)). 
Alternatively you could say: vec := list() vec := append(list, 10) Mihael On Thu, 2010-11-11 at 16:28 -0600, Michael Wilde wrote: > I'm trying to append to a list multiple times, but when I try to append the second time, I get an error: > > import("sys.k") > sequential( > vec := list() > append(vec,10) > print(vec) > // append(vec,20) > // print(vec) > ) > > If I uncomment the 2 lines above, I get the error: > > login1$ swift foo.k > [10.0] > [10.0, 20.0] > Ex098 > org.globus.cog.karajan.workflow.KarajanRuntimeException: Illegal extra argument `[10.0, 20.0]' to kernel:karajan @ foo.k, line: 1 > at org.globus.cog.karajan.arguments.NameBindingVariableArguments.append(NameBindingVariableArguments.java:58) > > Its almost as if append is not consuming its arguments and Karajan is finding extra stuff on the stack when it exits??? > > Can you help me understand what Im doing wrong here? > > Thanks, > > Mike > > From bugzilla-daemon at mcs.anl.gov Fri Nov 12 13:41:56 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 12 Nov 2010 13:41:56 -0600 (CST) Subject: [Swift-devel] [Bug 130] submitting to TG NCSA Mercury PBS with PATH env profile set causes job to hang In-Reply-To: References: Message-ID: <20101112194156.E74B82BF1E@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=130 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WONTFIX --- Comment #2 from skenny 2010-11-12 13:41:56 --- ncsa mercury has been deprecated -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the reporter. From wilde at mcs.anl.gov Sat Nov 13 18:32:15 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 13 Nov 2010 18:32:15 -0600 (CST) Subject: [Swift-devel] Case handling bug Message-ID: <1329526766.57544.1289694735438.JavaMail.root@zimbra.anl.gov> Here's a cute one I just stumbled on - the 2-line script: int n=10; int N = 12; gives this error: login1$ swift casebug.swift Swift svn swift-r3702 (swift modified locally) cog-r2924 (cog modified locally) RunID: 20101113-1830-nkmmf1b5 Progress: Execution failed: java.lang.IllegalArgumentException: N is closed with a value of 12.0 login1$ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Sun Nov 14 12:45:19 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 14 Nov 2010 12:45:19 -0600 (CST) Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? Message-ID: <15696286.58351.1289760319639.JavaMail.root@zimbra.anl.gov> Im using Justin's new external CDM policy to talk to globusonline.org. Ive got the two IO throttles set to 30: throttle.transfers=30 throttle.file.operations=30 ...but I still see exactly 8 concurrent dostagein calls (as seen by calls to my external IO handler external.sh) Do you know what is limiting this concurrency to 8? I'd like to open it up much wider. My swift command and relevant files are below. 
Logs are in CI: /home/wilde/swift/demo/modis Thanks, Mike login1$ cat rundemo.go.local.sh rm -f external.*.log swift -config cf -tc.file tc.local -sites.file sites.local.xml -cdm.file fs.ftponly modis.go.swift -nfiles=30 # -location= -n= -site== login1$ cat cf wrapperlog.always.transfer=true sitedir.keep=true execution.retries=0 lazy.errors=false status.mode=provider use.provider.staging=false provider.staging.pin.swiftfiles=false throttle.transfers=30 throttle.file.operations=30 login1$ cat sites.local.xml .63 10000 /home/wilde/swiftwork login1$ login1$ cat fs.ftponly rule .*gsiftp://.* EXTERNAL /home/wilde/swift/lab/go/external.sh login1$ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Nov 14 15:38:29 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Nov 2010 13:38:29 -0800 Subject: [Swift-devel] Re: Concurrent dostagein calls limited to 8 ? In-Reply-To: <15696286.58351.1289760319639.JavaMail.root@zimbra.anl.gov> References: <15696286.58351.1289760319639.JavaMail.root@zimbra.anl.gov> Message-ID: <1289770709.10824.0.camel@blabla2.none> My bad. Try cog trunk r2932. Mihael On Sun, 2010-11-14 at 12:45 -0600, Michael Wilde wrote: > Im using Justin's new external CDM policy to talk to globusonline.org. > > Ive got the two IO throttles set to 30: > > throttle.transfers=30 > throttle.file.operations=30 > > ...but I still see exactly 8 concurrent dostagein calls (as seen by calls to my external IO handler external.sh) > > Do you know what is limiting this concurrency to 8? I'd like to open it up much wider. > > My swift command and relevant files are below. > Logs are in CI: /home/wilde/swift/demo/modis > > Thanks, > > Mike > > > login1$ cat rundemo.go.local.sh > rm -f external.*.log > swift -config cf -tc.file tc.local -sites.file sites.local.xml -cdm.file fs.ftponly modis.go.swift -nfiles=30 > # -location= -n= -site== > > login1$ cat cf > wrapperlog.always.transfer=true > sitedir.keep=true > execution.retries=0 > lazy.errors=false > status.mode=provider > use.provider.staging=false > provider.staging.pin.swiftfiles=false > throttle.transfers=30 > throttle.file.operations=30 > > login1$ cat sites.local.xml > > > > > .63 > 10000 > > /home/wilde/swiftwork > > > > login1$ > > login1$ cat fs.ftponly > rule .*gsiftp://.* EXTERNAL /home/wilde/swift/lab/go/external.sh > login1$ > > From wilde at mcs.anl.gov Sun Nov 14 16:12:39 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 14 Nov 2010 16:12:39 -0600 (CST) Subject: [Swift-devel] Re: Concurrent dostagein calls limited to 8 ? 
In-Reply-To: <1289770709.10824.0.camel@blabla2.none> Message-ID: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov> As far as I can tell its still showing the same behavior - only 8 stageins are active: wilde 17568 17564 17564 4873 0 16:05 pts/8 00:00:00 /bin/sh /scratch/local/wilde/swift/src/trunk/cog/modules/swift wilde 17632 17568 17564 4873 5 16:05 pts/8 00:00:09 java -Xmx256M -Djava.endorsed.dirs=/scratch/local/wilde/swif wilde 19193 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- wilde 19357 19193 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 wilde 19198 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- wilde 19358 19198 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 wilde 19205 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- wilde 19359 19205 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 wilde 19210 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- wilde 19361 19210 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 wilde 19214 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- wilde 19362 19214 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 wilde 19218 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- wilde 19363 19218 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 wilde 19224 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- wilde 19364 19224 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 wilde 19240 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- wilde 19365 19240 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 - Mike ----- Original Message ----- > My bad. Try cog trunk r2932. > > Mihael > > On Sun, 2010-11-14 at 12:45 -0600, Michael Wilde wrote: > > Im using Justin's new external CDM policy to talk to > > globusonline.org. > > > > Ive got the two IO throttles set to 30: > > > > throttle.transfers=30 > > throttle.file.operations=30 > > > > ...but I still see exactly 8 concurrent dostagein calls (as seen by > > calls to my external IO handler external.sh) > > > > Do you know what is limiting this concurrency to 8? I'd like to open > > it up much wider. > > > > My swift command and relevant files are below. 
> > Logs are in CI: /home/wilde/swift/demo/modis > > > > Thanks, > > > > Mike > > > > > > login1$ cat rundemo.go.local.sh > > rm -f external.*.log > > swift -config cf -tc.file tc.local -sites.file sites.local.xml > > -cdm.file fs.ftponly modis.go.swift -nfiles=30 > > # -location= -n= -site== > > > > login1$ cat cf > > wrapperlog.always.transfer=true > > sitedir.keep=true > > execution.retries=0 > > lazy.errors=false > > status.mode=provider > > use.provider.staging=false > > provider.staging.pin.swiftfiles=false > > throttle.transfers=30 > > throttle.file.operations=30 > > > > login1$ cat sites.local.xml > > > > > > > > > > .63 > > 10000 > > > > /home/wilde/swiftwork > > > > > > > > login1$ > > > > login1$ cat fs.ftponly > > rule .*gsiftp://.* EXTERNAL /home/wilde/swift/lab/go/external.sh > > login1$ > > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Nov 14 16:21:56 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Nov 2010 14:21:56 -0800 Subject: [Swift-devel] Re: Concurrent dostagein calls limited to 8 ? In-Reply-To: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov> References: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov> Message-ID: <1289773316.10896.10.camel@blabla2.none> I need more info. Some useful questions to answer: - how many stage-ins are queued to the scheduler? - how quick are the stage-ins? - who (i.e. what code) starts the external.sh script and how? In theory, if a sufficient number of tasks are queued then the only limit (after my last commit) should be the throttles and whatever inherent limits are in the mechanism used to implement the stage-ins. Mihael On Sun, 2010-11-14 at 16:12 -0600, Michael Wilde wrote: > As far as I can tell its still showing the same behavior - only 8 stageins are active: > > wilde 17568 17564 17564 4873 0 16:05 pts/8 00:00:00 /bin/sh /scratch/local/wilde/swift/src/trunk/cog/modules/swift > wilde 17632 17568 17564 4873 5 16:05 pts/8 00:00:09 java -Xmx256M -Djava.endorsed.dirs=/scratch/local/wilde/swif > wilde 19193 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- > wilde 19357 19193 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 > wilde 19198 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- > wilde 19358 19198 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 > wilde 19205 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- > wilde 19359 19205 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 > wilde 19210 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- > wilde 19361 19210 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 > wilde 19214 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- > wilde 19362 19214 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 > wilde 19218 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- > wilde 19363 19218 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 > wilde 19224 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- > wilde 19364 19224 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 > wilde 19240 17632 17564 4873 0 16:08 pts/8 00:00:00 /bin/bash /home/wilde/swift/lab/go/external.sh getlanduse- > wilde 19365 19240 17564 4873 0 16:08 pts/8 00:00:00 sleep 3 > > > - 
Mike > > > ----- Original Message ----- > > My bad. Try cog trunk r2932. > > > > Mihael > > > > On Sun, 2010-11-14 at 12:45 -0600, Michael Wilde wrote: > > > Im using Justin's new external CDM policy to talk to > > > globusonline.org. > > > > > > Ive got the two IO throttles set to 30: > > > > > > throttle.transfers=30 > > > throttle.file.operations=30 > > > > > > ...but I still see exactly 8 concurrent dostagein calls (as seen by > > > calls to my external IO handler external.sh) > > > > > > Do you know what is limiting this concurrency to 8? I'd like to open > > > it up much wider. > > > > > > My swift command and relevant files are below. > > > Logs are in CI: /home/wilde/swift/demo/modis > > > > > > Thanks, > > > > > > Mike > > > > > > > > > login1$ cat rundemo.go.local.sh > > > rm -f external.*.log > > > swift -config cf -tc.file tc.local -sites.file sites.local.xml > > > -cdm.file fs.ftponly modis.go.swift -nfiles=30 > > > # -location= -n= -site== > > > > > > login1$ cat cf > > > wrapperlog.always.transfer=true > > > sitedir.keep=true > > > execution.retries=0 > > > lazy.errors=false > > > status.mode=provider > > > use.provider.staging=false > > > provider.staging.pin.swiftfiles=false > > > throttle.transfers=30 > > > throttle.file.operations=30 > > > > > > login1$ cat sites.local.xml > > > > > > > > > > > > > > > .63 > > > 10000 > > > > > > /home/wilde/swiftwork > > > > > > > > > > > > login1$ > > > > > > login1$ cat fs.ftponly > > > rule .*gsiftp://.* EXTERNAL /home/wilde/swift/lab/go/external.sh > > > login1$ > > > > > > > From hategan at mcs.anl.gov Sun Nov 14 17:45:36 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Nov 2010 15:45:36 -0800 Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: References: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov> <1289773316.10896.10.camel@blabla2.none> Message-ID: <1289778336.11636.2.camel@blabla2.none> On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote: > Some answers from my handheld: > - foreach loop has 317 files so ample parallelism I would have assumed it's > 8. But I suspect, given one of the answers below, that it does not matter. > - throttle in sites entry set to .63 to run 64 jobs at once > - the "active" external.sh is called from end of dostagein and > dostageout in vdl-int.k (after all files for the job were put in a > list by prior calls to externa.sh from within those functions How is this call actually implemented. I.e. can you post the respective snippet of vdl-int? > - the actual staging op by globusonline take 30-60 seconds, sometimes > more. I batch them up. From wilde at mcs.anl.gov Sun Nov 14 22:42:14 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 14 Nov 2010 22:42:14 -0600 (CST) Subject: [Swift-devel] SGE provider error parsing qstat output In-Reply-To: Message-ID: <1001141783.59115.1289796134679.JavaMail.root@zimbra.anl.gov> was: Re: Provenance DB Diagrams Hi David, The first bug seems very familiar, and I thought Mihael fixed it once. qstat was giving slightly different output between older versions (eg I think sisboombah) and later ones (eg Ranger). Perhaps this is a different manifestation of similar problems in command output format variations? Feel free to dig into the code. Do you have access to an SGE system where this works? (Let me know if you need access to the UC IBI Cluster; also try godzilla or ranger). 
Regarding the coaster error: whats happening here is that the PE is not being passed from the coaster pool attributes to the attributes of the SGE jobs that the coaster provider is creating. Do you have access to Ranger? I have a fix for this there that needs to be tested and integrated into trunk. Marc Parisien of UChicago IBD is trying to run coasters on the IBI cluster, and getting the same error. If you could find and integrate the fix that would be great. I attach my mods from Ranger. I think the mods related to "coresPerNode" can be removed as hopefully Mihael's PPN fix addresses them. Whats needed is just the code that passes PE from the coasters pool to the job spces of the jSGE jobs it creates. My svn diff is below and modified files are attached. This was done on the stable branch, but the SGE provider has since been moved to trunk. You should either get guidance from Mihael on this, or discuss with him if you'd rather he make the fix. >From svn status: M modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java M modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java M modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java I attach these three files. Just look for the changes that propagate the PE setting. Ignore the coresPerNode changes. There were also changes to ensure that the provider starts the right number of workers per node, which should now always be one copy of worker.pl whose parallelism is controlled by workersPerNode. The PPN setting should ensure that the job gets the expected number of cores allocated, for systems that do node sharing of jobs. from svn diff: (lots of junk below from my experiments): login3$ svn diff Index: modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java =================================================================== --- modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java (revision 2734) +++ modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java (working copy) @@ -52,6 +52,10 @@ writeAttr(attrName, arg, wr, null); } + protected void writeAttrValue(String value, String arg, Writer wr ) throws IOException { + wr.write("#$ " + arg + value + '\n'); + } + protected void writeWallTime(Writer wr) throws IOException { Object walltime = getSpec().getAttribute("maxwalltime"); if (walltime != null) { @@ -77,10 +81,32 @@ wr.write("#$ -V\n"); writeAttr("project", "-A ", wr); - writeAttr("count", "-pe " +// FIXME: testing this change: MW + + Object countValue = getSpec().getAttribute("count"); + int count; + + if (countValue != null) + count = Integer.valueOf(String.valueOf(countValue)).intValue(); + else + count = 1; + + // FIXME: wpn is only meaningful for coasters; is 1 ok otherwise? + // should we flag wpn as error if not coasters? 
+ + Object wpnValue = getAttribute(spec, "workerspernode", "1"); + int wpn = Integer.valueOf(String.valueOf(wpnValue)).intValue(); + logger.info("FETCH OF WPN: " + wpn); // FIXME: DB + + count *= wpn; + logger.info("FETCH OF PE: " + getAttribute(spec, "pe", "NO pe")); + logger.info("FETCH OF CPN: " + getAttribute(spec, "corespernode", "NO cpn")); + writeAttrValue(String.valueOf(count), "-pe " + getAttribute(spec, "pe", getSGEProperties().getDefaultPE()) - + " ", wr, "1"); + + " ", wr); +// FIXME: END OF MW CHANGE + writeWallTime(wr); writeAttr("queue", "-q ", wr); if (spec.getStdInput() != null) { @@ -157,7 +183,8 @@ protected void writeMultiJobPreamble(Writer wr, String exitcodefile) throws IOException { - wr.write("NODES=`cat $PE_HOSTFILE | awk '{ for(i=0;i<$2;i++){print $1} }'`\n"); +// FIXME:MW wr.write("NODES=`cat $PE_HOSTFILE | awk '{ for(i=0;i<$2;i++){print $1} }'`\n"); + wr.write("NODES=`cat $PE_HOSTFILE | awk '{ print $1 }'`\n"); wr.write("ECF=" + exitcodefile + "\n"); wr.write("INDEX=0\n"); wr.write("for NODE in $NODES; do\n"); @@ -188,13 +215,21 @@ return (Properties) getProperties(); } - public static final Pattern JOB_ID_LINE = Pattern.compile("[Yy]our job (\\d+) \\(.*\\) has been submitted"); + public static final Pattern JOB_ID_LINE = Pattern.compile(".*[Yy]our job (\\d+) \\(.*\\) has been submitted"); + // public static final Pattern JOB_ID_LINE = Pattern.compile("[Yy]our job (\\d+) \\(.*\\) has been submitted"); + // public static final Pattern JOB_ID_LINE = Pattern.compile("[Yy]our job \\([0-9]\\+\\) .* has been submitted"); protected String parseSubmitCommandOutput(String out) throws IOException { // > your job 2494189 ("t1.sub") has been submitted BufferedReader br = new BufferedReader(new CharArrayReader(out.toCharArray())); String line = br.readLine(); + if (logger.isInfoEnabled()) { + logger.info("parseSubmitCommandOutput: out=" + out); + } while (line != null) { + if (logger.isInfoEnabled()) { + logger.info("parseSubmitCommandOutput: line=" + line); + } Matcher m = JOB_ID_LINE.matcher(line); if (m.matches()) { String id = m.group(1); Index: modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java =================================================================== --- modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java (revision 2734) +++ modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java (working copy) @@ -22,16 +22,17 @@ public static final Logger logger = Logger.getLogger(Settings.class); public static final String[] NAMES = - new String[] { "slots", "workersPerNode", "nodeGranularity", "allocationStepSize", + new String[] { "slots", "workersPerNode", "coresPerNode", "nodeGranularity", "allocationStepSize", "maxNodes", "lowOverallocation", "highOverallocation", "overallocationDecayFactor", "spread", "reserve", "maxtime", "project", - "queue", "remoteMonitorEnabled", "kernelprofile", "alcfbgpnat", "internalHostname" }; + "queue", "remoteMonitorEnabled", "kernelprofile", "alcfbgpnat", "internalHostname", "pe" }; /** * The maximum number of blocks that can be active at one time */ private int slots = 20; private int workersPerNode = 1; + private int coresPerNode = 1; /** * How many nodes to allocate at once */ @@ -90,6 +91,8 @@ private String queue; + private String pe; + private String kernelprofile; private boolean alcfbgpnat; @@ -116,6 +119,14 @@ this.workersPerNode = workersPerNode; } + public int getCoresPerNode() { + return 
coresPerNode; + } + + public void setCoresPerNode(int coresPerNode) { + this.coresPerNode = coresPerNode; + } + public int getNodeGranularity() { return nodeGranularity; } @@ -273,6 +284,14 @@ this.queue = queue; } + public String getPe() { + return pe; + } + + public void setPe(String pe) { + this.pe = pe; + } + public boolean getRemoteMonitorEnabled() { return remoteMonitorEnabled; } Index: modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java =================================================================== --- modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java (revision 2734) +++ modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java (working copy) @@ -40,6 +40,13 @@ setAttribute(spec, "maxwalltime", WallTime.format((int) block.getWalltime().getSeconds())); setAttribute(spec, "queue", settings.getQueue()); setAttribute(spec, "project", settings.getProject()); + + // added - mw: + setAttribute(spec, "coresPerNode", String.valueOf(settings.getCoresPerNode())); + setAttribute(spec, "workersPerNode", String.valueOf(settings.getWorkersPerNode())); + setAttribute(spec, "pe", settings.getPe()); + // end additions - mw ^^^ + int count = block.getWorkerCount() / settings.getWorkersPerNode(); if (count > 1) { setAttribute(spec, "jobType", "multiple"); login3$ -- - Mike ----- Original Message ----- > Hello Mike, > > I am working on adding more tests to the automated test suite. I am > running into some issues when trying to run swift with SGE on > sisboombah. The tests I wrote are based on the example configurations > you sent out to the list earlier. Here is what is happening. > > I am running the SGE local test with the following config file: > > > > > threaded > .49 > 10000 > > /home/dk0966/swiftwork > > > > The error I am seeing is: > Caused by: > java.io.IOException: Failed to parse qstat line: 623018 0.55500 > SteadyShea xinliang r 11/13/2010 14:09:05 all.q at node1 > 1 > > The next test I try on this machine is SGE-coasters with the following > config: > > > > > threaded > 4 > 128 > 1 > 1 > 5.11 > 10000 > > /home/dk0966/swiftwork > > > > For which I get the following error: > Worker task failed: Error submitting block task > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job: Could not submit job (qsub reported an exit code of > 1). > Unable to run job: job rejected: the requested parallel environment > "16way" does not exist.Exiting. > > I couldn't find much information about SGE setups either in the guide > or the cookbook. Is there anything else I am missing to get this up > and running? > > Thanks, > David -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: sgecoastermods.tar Type: application/x-tar Size: 30720 bytes Desc: not available URL: From wilde at mcs.anl.gov Sun Nov 14 23:03:26 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 14 Nov 2010 23:03:26 -0600 (CST) Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: <1289778336.11636.2.camel@blabla2.none> Message-ID: <1711691138.59170.1289797406428.JavaMail.root@zimbra.anl.gov> Mihael, I attached my vdl-int.k. 
The changes were based on Justin's initial version of the external policy CDM setting, but I added the ability to handle stageout as well as stagein, and to gather all the files for a stagein or stageout in the external script, and process them all at once. In my external script, I now batch the files for multiple requests into one larger transfer command to globusonline, using time-based batching. This adds latency to an individual request but saves greatly overall, as globusonline will only do 3 concurrent transfers for a given user, and has its own latency for checking its work queue. My hooks are the calls to cdm:externalin, externalout, and externalgo. I use a map element as a reference variable to determine when to call externalgo. All this seems to work at the basic level, but I still see only a steady state of 8 external calls running at once. Further, I think the latency involved is causing some strange interaction with coasters which I need to send you. My scripts run fine on localhost but fail on PADS with coasters: after about 80 of 300+ jobs I get a caster failure that I need to log and post - looks like some kind of timeout in worker.pl waiting for a response. - Mike ----- Original Message ----- > On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote: > > Some answers from my handheld: > > - foreach loop has 317 files so ample parallelism > > I would have assumed it's > 8. But I suspect, given one of the answers > below, that it does not matter. > > > - throttle in sites entry set to .63 to run 64 jobs at once > > - the "active" external.sh is called from end of dostagein and > > dostageout in vdl-int.k (after all files for the job were put in a > > list by prior calls to externa.sh from within those functions > > How is this call actually implemented. I.e. can you post the > respective > snippet of vdl-int? > > > - the actual staging op by globusonline take 30-60 seconds, > > sometimes > > more. I batch them up. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: vdl-int.k Type: application/octet-stream Size: 20559 bytes Desc: not available URL: From hategan at mcs.anl.gov Sun Nov 14 23:07:21 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Nov 2010 21:07:21 -0800 Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: References: <2026767406.58546.1289772759569.JavaMail.root@zimbra.anl.gov> <1289773316.10896.10.camel@blabla2.none> <1289778336.11636.2.camel@blabla2.none> Message-ID: <1289797641.13416.5.camel@blabla2.none> The cdm functions (externalin, externalout, externalgo) are not asynchronous. They block the karajan worker threads and therefore, besides preventing anything else from running in the interpreter, are also limited to concurrently running whatever the number of karajan worker threads is (2*cores). I would suggest changing those functions to use the local provider or some other scheme that can free the workers while the sub-processes run. Mihael On Sun, 2010-11-14 at 20:56 -0600, Michael Wilde wrote: > I'm in a cab - vdlint.k is in local fs on: > > Login1.pads.ci > /scratch/local/wilde/swift/src/trunk/... 
> Running from dist/swft-svn in that tree > > On 11/14/10, Mihael Hategan wrote: > > On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote: > >> Some answers from my handheld: > >> - foreach loop has 317 files so ample parallelism > > > > I would have assumed it's > 8. But I suspect, given one of the answers > > below, that it does not matter. > > > >> - throttle in sites entry set to .63 to run 64 jobs at once > >> - the "active" external.sh is called from end of dostagein and > >> dostageout in vdl-int.k (after all files for the job were put in a > >> list by prior calls to externa.sh from within those functions > > > > How is this call actually implemented. I.e. can you post the respective > > snippet of vdl-int? > > > >> - the actual staging op by globusonline take 30-60 seconds, sometimes > >> more. I batch them up. > > > > > > > From wilde at mcs.anl.gov Sun Nov 14 23:24:42 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 14 Nov 2010 23:24:42 -0600 (CST) Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: <1289797641.13416.5.camel@blabla2.none> Message-ID: <1604855895.59214.1289798682652.JavaMail.root@zimbra.anl.gov> That explains a lot - the limited number of Karajan threads probably explains why coasters goes haywire in the larger tests as well. Clearly this should be done as full fledged provider. But that will take a fair bit more work. Would there be any ill effects from bumping up the number of karajan threads to see if I can make this demo work? WHere is that set? Also, when you say "use the local provider or > some other scheme that can free the workers while the sub-processes > run." - do you have anything "quick and easy" in mind there? - Mike ----- Original Message ----- > The cdm functions (externalin, externalout, externalgo) are not > asynchronous. They block the karajan worker threads and therefore, > besides preventing anything else from running in the interpreter, are > also limited to concurrently running whatever the number of karajan > worker threads is (2*cores). > > I would suggest changing those functions to use the local provider or > some other scheme that can free the workers while the sub-processes > run. > > Mihael > > On Sun, 2010-11-14 at 20:56 -0600, Michael Wilde wrote: > > I'm in a cab - vdlint.k is in local fs on: > > > > Login1.pads.ci > > /scratch/local/wilde/swift/src/trunk/... > > Running from dist/swft-svn in that tree > > > > On 11/14/10, Mihael Hategan wrote: > > > On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote: > > >> Some answers from my handheld: > > >> - foreach loop has 317 files so ample parallelism > > > > > > I would have assumed it's > 8. But I suspect, given one of the > > > answers > > > below, that it does not matter. > > > > > >> - throttle in sites entry set to .63 to run 64 jobs at once > > >> - the "active" external.sh is called from end of dostagein and > > >> dostageout in vdl-int.k (after all files for the job were put in > > >> a > > >> list by prior calls to externa.sh from within those functions > > > > > > How is this call actually implemented. I.e. can you post the > > > respective > > > snippet of vdl-int? > > > > > >> - the actual staging op by globusonline take 30-60 seconds, > > >> sometimes > > >> more. I batch them up. 
> > > > > > > > > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Mon Nov 15 00:02:06 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 15 Nov 2010 00:02:06 -0600 (CST) Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: <1604855895.59214.1289798682652.JavaMail.root@zimbra.anl.gov> Message-ID: <1756882204.59280.1289800926179.JavaMail.root@zimbra.anl.gov> I bumped up the thread count to 32*cores. It was 4 * # cores, so maybe there is some 50% allocation factor going on? At any rate, if I reduce the number of files Im processing from 317 to 100, the entire script seems to work fairly reliably. But I can definitely see the ill effects of the large number of threads Im tying up waiting on IO. (For one thing, I cant keep my coaster cores busy, and I get the "Canceling job" message from coaster workers shutting down for lack of work). This will improve a bit when I enhance the interface to globusonline to wait on individual file transfers rather than on the whole allocation request. - Mike ----- Original Message ----- > That explains a lot - the limited number of Karajan threads probably > explains why coasters goes haywire in the larger tests as well. > > Clearly this should be done as full fledged provider. But that will > take a fair bit more work. > > Would there be any ill effects from bumping up the number of karajan > threads to see if I can make this demo work? WHere is that set? > > Also, when you say "use the local provider or > > some other scheme that can free the workers while the sub-processes > > run." - do you have anything "quick and easy" in mind there? > > - Mike > > > ----- Original Message ----- > > The cdm functions (externalin, externalout, externalgo) are not > > asynchronous. They block the karajan worker threads and therefore, > > besides preventing anything else from running in the interpreter, > > are > > also limited to concurrently running whatever the number of karajan > > worker threads is (2*cores). > > > > I would suggest changing those functions to use the local provider > > or > > some other scheme that can free the workers while the sub-processes > > run. > > > > Mihael > > > > On Sun, 2010-11-14 at 20:56 -0600, Michael Wilde wrote: > > > I'm in a cab - vdlint.k is in local fs on: > > > > > > Login1.pads.ci > > > /scratch/local/wilde/swift/src/trunk/... > > > Running from dist/swft-svn in that tree > > > > > > On 11/14/10, Mihael Hategan wrote: > > > > On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote: > > > >> Some answers from my handheld: > > > >> - foreach loop has 317 files so ample parallelism > > > > > > > > I would have assumed it's > 8. But I suspect, given one of the > > > > answers > > > > below, that it does not matter. > > > > > > > >> - throttle in sites entry set to .63 to run 64 jobs at once > > > >> - the "active" external.sh is called from end of dostagein and > > > >> dostageout in vdl-int.k (after all files for the job were put > > > >> in > > > >> a > > > >> list by prior calls to externa.sh from within those functions > > > > > > > > How is this call actually implemented. I.e. can you post the > > > > respective > > > > snippet of vdl-int? > > > > > > > >> - the actual staging op by globusonline take 30-60 seconds, > > > >> sometimes > > > >> more. I batch them up. 
> > > > > > > > > > > > > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Mon Nov 15 00:06:44 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Nov 2010 22:06:44 -0800 Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: <1604855895.59214.1289798682652.JavaMail.root@zimbra.anl.gov> References: <1604855895.59214.1289798682652.JavaMail.root@zimbra.anl.gov> Message-ID: <1289801204.13873.2.camel@blabla2.none> On Sun, 2010-11-14 at 23:24 -0600, Michael Wilde wrote: > That explains a lot - the limited number of Karajan threads probably explains why coasters goes haywire in the larger tests as well. > > Clearly this should be done as full fledged provider. But that will take a fair bit more work. > > Would there be any ill effects from bumping up the number of karajan threads to see if I can make this demo work? WHere is that set? There will be the ill effect of wasting memory to wait for stuff. It's set in EventBus.java. > > Also, when you say "use the local provider or > > some other scheme that can free the workers while the sub-processes > > run." - do you have anything "quick and easy" in mind there? Yep. Say task:execute(script, args) in vdl-int instead. Mihael From hategan at mcs.anl.gov Mon Nov 15 00:08:09 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 14 Nov 2010 22:08:09 -0800 Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: <1756882204.59280.1289800926179.JavaMail.root@zimbra.anl.gov> References: <1756882204.59280.1289800926179.JavaMail.root@zimbra.anl.gov> Message-ID: <1289801289.13873.3.camel@blabla2.none> On Mon, 2010-11-15 at 00:02 -0600, Michael Wilde wrote: > I bumped up the thread count to 32*cores. It was 4 * # cores, so maybe there is some 50% allocation factor going on? There shouldn't be. I had 2*cores on my version though. > > At any rate, if I reduce the number of files Im processing from 317 to 100, the entire script seems to work fairly reliably. But I can definitely see the ill effects of the large number of threads Im tying up waiting on IO. > > (For one thing, I cant keep my coaster cores busy, and I get the "Canceling job" message from coaster workers shutting down for lack of work). > > This will improve a bit when I enhance the interface to globusonline to wait on individual file transfers rather than on the whole allocation request. > > - Mike > > > ----- Original Message ----- > > That explains a lot - the limited number of Karajan threads probably > > explains why coasters goes haywire in the larger tests as well. > > > > Clearly this should be done as full fledged provider. But that will > > take a fair bit more work. > > > > Would there be any ill effects from bumping up the number of karajan > > threads to see if I can make this demo work? WHere is that set? > > > > Also, when you say "use the local provider or > > > some other scheme that can free the workers while the sub-processes > > > run." - do you have anything "quick and easy" in mind there? 
> > > > - Mike > > > > > > ----- Original Message ----- > > > The cdm functions (externalin, externalout, externalgo) are not > > > asynchronous. They block the karajan worker threads and therefore, > > > besides preventing anything else from running in the interpreter, > > > are > > > also limited to concurrently running whatever the number of karajan > > > worker threads is (2*cores). > > > > > > I would suggest changing those functions to use the local provider > > > or > > > some other scheme that can free the workers while the sub-processes > > > run. > > > > > > Mihael > > > > > > On Sun, 2010-11-14 at 20:56 -0600, Michael Wilde wrote: > > > > I'm in a cab - vdlint.k is in local fs on: > > > > > > > > Login1.pads.ci > > > > /scratch/local/wilde/swift/src/trunk/... > > > > Running from dist/swft-svn in that tree > > > > > > > > On 11/14/10, Mihael Hategan wrote: > > > > > On Sun, 2010-11-14 at 17:23 -0600, Michael Wilde wrote: > > > > >> Some answers from my handheld: > > > > >> - foreach loop has 317 files so ample parallelism > > > > > > > > > > I would have assumed it's > 8. But I suspect, given one of the > > > > > answers > > > > > below, that it does not matter. > > > > > > > > > >> - throttle in sites entry set to .63 to run 64 jobs at once > > > > >> - the "active" external.sh is called from end of dostagein and > > > > >> dostageout in vdl-int.k (after all files for the job were put > > > > >> in > > > > >> a > > > > >> list by prior calls to externa.sh from within those functions > > > > > > > > > > How is this call actually implemented. I.e. can you post the > > > > > respective > > > > > snippet of vdl-int? > > > > > > > > > >> - the actual staging op by globusonline take 30-60 seconds, > > > > >> sometimes > > > > >> more. I batch them up. > > > > > > > > > > > > > > > > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From bugzilla-daemon at mcs.anl.gov Mon Nov 15 14:19:29 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 14:19:29 -0600 (CST) Subject: [Swift-devel] [Bug 167] clustering time limit specification in seconds is awkward for large clustering times In-Reply-To: References: Message-ID: <20101115201929.D8C142BF1E@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=167 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |skenny at uchicago.edu --- Comment #2 from skenny 2010-11-15 14:19:29 --- are we deprecating clustering? -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. 
From bugzilla-daemon at mcs.anl.gov Mon Nov 15 14:35:41 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 14:35:41 -0600 (CST) Subject: [Swift-devel] [Bug 178] strange unused string replacement in CSVMapper needs investigating In-Reply-To: References: Message-ID: <20101115203541.5CA482BF52@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |skenny at uchicago.edu --- Comment #2 from skenny 2010-11-15 14:35:41 --- (In reply to comment #1) > I believe we should deprecate the CSV mapper entirely in favor of the ext > mapper, which is both more powerful and easier to use. agreed! myself and my users do not use the csv mapper. do others use this? -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 15 14:58:51 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 14:58:51 -0600 (CST) Subject: [Swift-devel] [Bug 182] Error messages summarized at end of Swift output should also be printed when they occur In-Reply-To: References: Message-ID: <20101115205851.2FBD12BF1E@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=182 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED CC| |skenny at uchicago.edu AssignedTo|benc at hawaga.org.uk |skenny at uchicago.edu -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Nov 15 15:55:24 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 15:55:24 -0600 (CST) Subject: [Swift-devel] [Bug 178] strange unused string replacement in CSVMapper needs investigating In-Reply-To: References: Message-ID: <20101115215524.90C7C2CBD5@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178 Justin Wozniak changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |wozniak at mcs.anl.gov --- Comment #3 from Justin Wozniak 2010-11-15 15:55:24 --- It is used by the Montage application. I fixed this up a bit over the summer. I vote for keeping it. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Mon Nov 15 16:35:30 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 15 Nov 2010 16:35:30 -0600 (CST) Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: <1289801204.13873.2.camel@blabla2.none> Message-ID: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov> > >...do you have anything "quick and easy" in mind there? > > Yep. Say task:execute(script, args) in vdl-int instead. OK, that looks very promising and works for me in simple .k tests. Can you tell me how to pluck the workdir out of the host description from within vdl-int.k? 
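A minimal sketch of what such a call might look like, using only vdl:siteprofile and task:execute as they are named in this thread; the "workdir" profile key, the list() argument packing, and the use of rhost from the enclosing vdl-int.k function are assumptions, not the confirmed API:

    // sketch only: the "workdir" key name and list() argument packing are assumed,
    // and rhost is expected to come from the surrounding vdl-int.k function
    workdir := vdl:siteprofile(rhost, "workdir")
    print("workdir: {workdir}")
    task:execute("/home/wilde/swift/lab/go/external.sh", list(workdir))

Since task:execute hands the script off to a provider instead of blocking a Karajan worker thread, a call along these lines is what should let more than 8 external.sh invocations run at once.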
(I will hunt for this but clues are welcome :) Thanks, - Mike ----- Original Message ----- > On Sun, 2010-11-14 at 23:24 -0600, Michael Wilde wrote: > > That explains a lot - the limited number of Karajan threads probably > > explains why coasters goes haywire in the larger tests as well. > > > > Clearly this should be done as full fledged provider. But that will > > take a fair bit more work. > > > > Would there be any ill effects from bumping up the number of karajan > > threads to see if I can make this demo work? WHere is that set? > > There will be the ill effect of wasting memory to wait for stuff. It's > set in EventBus.java. > > > > > Also, when you say "use the local provider or > > > some other scheme that can free the workers while the > > > sub-processes > > > run." - do you have anything "quick and easy" in mind there? > > Yep. Say task:execute(script, args) in vdl-int instead. > > Mihael -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From bugzilla-daemon at mcs.anl.gov Mon Nov 15 16:55:43 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 16:55:43 -0600 (CST) Subject: [Swift-devel] [Bug 199] error in simple mapper when underscores are used in type declaration In-Reply-To: References: Message-ID: <20101115225543.5BC082CBE5@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=199 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WORKSFORME -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From bugzilla-daemon at mcs.anl.gov Mon Nov 15 17:06:32 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 17:06:32 -0600 (CST) Subject: [Swift-devel] [Bug 215] stdout and stderr redirect for SGE jobmanager causing failure on stageouts In-Reply-To: References: Message-ID: <20101115230632.8EDEE2CBF5@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=215 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED --- Comment #1 from skenny 2010-11-15 17:06:32 --- has been working as of Swift svn swift-r3497 cog-r2829 -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From bugzilla-daemon at mcs.anl.gov Mon Nov 15 17:11:21 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 17:11:21 -0600 (CST) Subject: [Swift-devel] [Bug 220] id's for external data types not stored in rlog for resume In-Reply-To: References: Message-ID: <20101115231121.2D2B52B99D@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=220 --- Comment #1 from skenny 2010-11-15 17:11:20 --- *** Bug 219 has been marked as a duplicate of this bug. *** -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. 
From bugzilla-daemon at mcs.anl.gov Mon Nov 15 17:11:21 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 17:11:21 -0600 (CST) Subject: [Swift-devel] [Bug 219] variables of type external are not mapped/written to rlog In-Reply-To: References: Message-ID: <20101115231121.181F32B99D@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=219 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |DUPLICATE --- Comment #1 from skenny 2010-11-15 17:11:20 --- *** This bug has been marked as a duplicate of bug 220 *** -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From bugzilla-daemon at mcs.anl.gov Mon Nov 15 17:19:19 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 17:19:19 -0600 (CST) Subject: [Swift-devel] [Bug 227] Always keep submit and stdout/err files for failing jobs from localscheduler provider In-Reply-To: References: Message-ID: <20101115231919.23C022CC02@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=227 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the reporter. From wilde at mcs.anl.gov Mon Nov 15 20:10:41 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 15 Nov 2010 20:10:41 -0600 (CST) Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov> Message-ID: <2049120682.64729.1289873441190.JavaMail.root@zimbra.anl.gov> Justin pointed me at the function vdl:getprofile, but this does not seem to contain the same attributes as the pool entry. WHen I call: vdl:siteprofile(desthost,"workdirectory") I get: Execution failed: Exception in cp: Arguments: [etc/group, home/wilde/godata/gridoutput.txt] Host: localhost Directory: gcat-20101115-2005-kh054dx8/jobs/1/cp-1uu8rn1k stderr.txt: stdout.txt: ---- Caused by: Missing profile: workdirectory login1$ cat sites.xml /home/wilde/swift/lab/go/work --- Mike ----- Original Message ----- > > >...do you have anything "quick and easy" in mind there? > > > > Yep. Say task:execute(script, args) in vdl-int instead. > > OK, that looks very promising and works for me in simple .k tests. > > Can you tell me how to pluck the workdir out of the host description > from within vdl-int.k? > > (I will hunt for this but clues are welcome :) > > Thanks, > > - Mike > > > ----- Original Message ----- > > On Sun, 2010-11-14 at 23:24 -0600, Michael Wilde wrote: > > > That explains a lot - the limited number of Karajan threads > > > probably > > > explains why coasters goes haywire in the larger tests as well. > > > > > > Clearly this should be done as full fledged provider. But that > > > will > > > take a fair bit more work. > > > > > > Would there be any ill effects from bumping up the number of > > > karajan > > > threads to see if I can make this demo work? WHere is that set? > > > > There will be the ill effect of wasting memory to wait for stuff. > > It's > > set in EventBus.java. 
> > > > > > > > Also, when you say "use the local provider or > > > > some other scheme that can free the workers while the > > > > sub-processes > > > > run." - do you have anything "quick and easy" in mind there? > > > > Yep. Say task:execute(script, args) in vdl-int instead. > > > > Mihael > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Mon Nov 15 20:49:52 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 15 Nov 2010 18:49:52 -0800 Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov> References: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov> Message-ID: <1289875792.17856.0.camel@blabla2.none> On Mon, 2010-11-15 at 16:35 -0600, Michael Wilde wrote: > > >...do you have anything "quick and easy" in mind there? > > > > Yep. Say task:execute(script, args) in vdl-int instead. > > OK, that looks very promising and works for me in simple .k tests. > > Can you tell me how to pluck the workdir out of the host description from within vdl-int.k? I don't know the details. Worst case scenario you write a fava karajan function that does it. Mihael From bugzilla-daemon at mcs.anl.gov Mon Nov 15 23:05:05 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 15 Nov 2010 23:05:05 -0600 (CST) Subject: [Swift-devel] [Bug 178] strange unused string replacement in CSVMapper needs investigating In-Reply-To: References: Message-ID: <20101116050505.E6B402CC33@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=178 Mihael Hategan changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |hategan at mcs.anl.gov --- Comment #4 from Mihael Hategan 2010-11-15 23:05:05 --- (In reply to comment #3) > It is used by the Montage application. I fixed this up a bit over the summer. > I vote for keeping it. The question is whether we want to keep it in the long run. Orthogonality would dictate that if there is a better way to do it, this should go. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching someone on the CC list of the bug. You are watching the assignee of the bug. You are watching the reporter. From wozniak at mcs.anl.gov Tue Nov 16 09:46:52 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 16 Nov 2010 09:46:52 -0600 (Central Standard Time) Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: <1289875792.17856.0.camel@blabla2.none> References: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov> <1289875792.17856.0.camel@blabla2.none> Message-ID: On Mon, 15 Nov 2010, Mihael Hategan wrote: > On Mon, 2010-11-15 at 16:35 -0600, Michael Wilde wrote: >>>> ...do you have anything "quick and easy" in mind there? >>> >>> Yep. Say task:execute(script, args) in vdl-int instead. >> >> OK, that looks very promising and works for me in simple .k tests. >> >> Can you tell me how to pluck the workdir out of the host description from within vdl-int.k? 
> > I don't know the details. Worst case scenario you write a fava karajan > function that does it. > > Mihael I haven't tried this myself but vdl-sc.k makes it look like these are all stored in the HostNode properties and are therefore accessible by vdl:siteprofile() . Is this correct? -- Justin M Wozniak From wozniak at mcs.anl.gov Tue Nov 16 10:17:00 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 16 Nov 2010 10:17:00 -0600 (Central Standard Time) Subject: [Swift-devel] Concurrent dostagein calls limited to 8 ? In-Reply-To: References: <2108607140.63954.1289860530901.JavaMail.root@zimbra.anl.gov> <1289875792.17856.0.camel@blabla2.none> Message-ID: On Tue, 16 Nov 2010, Justin M Wozniak wrote: > On Mon, 15 Nov 2010, Mihael Hategan wrote: > >> On Mon, 2010-11-15 at 16:35 -0600, Michael Wilde wrote: >>> Can you tell me how to pluck the workdir out of the host description from >>> within vdl-int.k? >> >> I don't know the details. Worst case scenario you write a fava karajan >> function that does it. >> >> Mihael > > I haven't tried this myself but vdl-sc.k makes it look like these are all > stored in the HostNode properties and are therefore accessible by > vdl:siteprofile() . Is this correct? This does seem to work for me: for example, in vdl-int.k:initSharedDir() I can add the following and get the expected result from sites.xml: workdir := vdl:siteprofile(rhost, "workdir") print("workdir: {workdir}") ... Progress: Submitted:1 workdir: /home/wozniak/work Final status: Finished successfully:1 ... -- Justin M Wozniak From bugzilla-daemon at mcs.anl.gov Tue Nov 16 14:27:13 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 16 Nov 2010 14:27:13 -0600 (CST) Subject: [Swift-devel] [Bug 167] clustering time limit specification in seconds is awkward for large clustering times In-Reply-To: References: Message-ID: <20101116202713.545A12C9DA@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=167 Justin Wozniak changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |wozniak at mcs.anl.gov --- Comment #3 from Justin Wozniak 2010-11-16 14:27:13 --- I think so. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From aespinosa at cs.uchicago.edu Tue Nov 16 18:17:23 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 16 Nov 2010 18:17:23 -0600 Subject: [Swift-devel] patch: dostageinfile.transitions Message-ID: Support for dostageinfile.png and dostageinfile-total.png . Hopefully this is a more precise plot of actual file stageins. 
diff --git a/libexec/log-processing/log-to-dostageinfile-transitions b/libexec/log-processing/log-to-dostageinfile-transitions new file mode 100644 index 0000000..48dd399 --- /dev/null +++ b/libexec/log-processing/log-to-dostageinfile-transitions @@ -0,0 +1,11 @@ +#!/usr/bin/env ruby + +require 'time' + +$stdin.grep(/vdl:dostageinfile/).each do |line| + x = line.match(/^(\S*\ \S*)\ \S*\ \S*\ (\S*)\ file=(\S*)\ srchost=(\S*) \S*\ \S* desthost=(\S*)/) + oras = Time.parse(x[1]).to_f + id = "#{x[3]}-#{x[4]}-#{x[5]}" + state = x[2].match(/(START|END)/)[1] + puts "#{oras} #{id} #{state}" +end diff --git a/libexec/log-processing/makefile b/libexec/log-processing/makefile index 40bdb8d..9a52f5b 100644 --- a/libexec/log-processing/makefile +++ b/libexec/log-processing/makefile @@ -95,6 +95,9 @@ createdirset.transitions: $(LOG) dostagein.transitions: $(LOG) log-to-dostagein-transitions < $(LOG) > dostagein.transitions +dostageinfile.transitions: $(LOG) + log-to-dostageinfile-transitions < $(LOG) > dostageinfile.transitions + dostageout.transitions: $(LOG) log-to-dostageout-transitions < $(LOG) > dostageout.transitions -- Allan M. Espinosa PhD student, Computer Science University of Chicago From aespinosa at cs.uchicago.edu Wed Nov 17 15:32:32 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 17 Nov 2010 15:32:32 -0600 Subject: [Swift-devel] Re: the persistence of the persistent coaster service. In-Reply-To: References: Message-ID: Bumping the thread. In an attempt to isolate the bug, I made this workflow: app (external o) sleep(int time) { sleep time; } /* Main program */ external rups[]; int t = 300; int a[]; iterate ix { a[ix] = ix; } until (ix == 1300); foreach ai,i in a { rups[i] = sleep(t); } passive /gpfs/pads/swift/aespinosa/swift-runs localhost sleep /bin/sleep INSTALLED INTEL32::LINUX null and still get the same type of error message: RunID: 20101117-1527-ui6i2lra Progress: Find: https://communicado.ci.uchicago.edu:61999 Find: keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999 Progress: Selecting site:1 Submitting:294 Progress: Selecting site:3 Submitting:367 Progress: Selecting site:3 Submitting:367 Progress: Selecting site:3 Submitting:367 Progress: Selecting site:3 Submitting:367 Command(1, CHANNELCONFIG): handling reply timeout; sendReqTime=101117-152717.209, sendTime=101117 -152717.211, now=101117-152917.232 Progress: Selecting site:3 Submitting:366 Submitted:1 Command(1, CHANNELCONFIG)fault was: Reply timeout org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.ja va:280) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Progress: Selecting site:3 Submitting:366 Failed but can retry:1 Progress: Selecting site:3 Submitting:366 Failed but can retry:1 2010/10/21 Allan Espinosa : > Hi, > > When I'm reusing the coaster service onto the next swift session, i > get reply timeouts in the CHANNELCONFIG command: > > > swift-r3685 cog-r2913 > > RunID: extract > Progress: > Progress: ?uninitialized:2 ?Finished in previous run:2 > Progress: ?uninitialized:2 ?Finished in previous run:2 > Progress: ?Stage in:99 ?Submitting:1 ?Finished in previous run:102 > Find: https://communicado.ci.uchicago.edu:61999 > Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999 > Progress: ?Stage in:92 ?Submitting:8 ?Finished in 
previous run:102 > Passive queue processor initialized. Callback URI is http://128.135.125.17:60999 > Progress: ?Stage in:71 ?Submitting:2 ?Submitted:27 ?Finished in previous run:102 > Progress: ?Stage in:29 ?Submitting:1 ?Submitted:70 ?Finished in previous run:102 > > **Abord** (Ctrl-C) > ** rerun/ resume workflow ** > swift-r3685 cog-r2913 > > RunID: extract > Progress: > Progress: ?uninitialized:3 ?Finished in previous run:2 > Progress: ?Stage in:99 ?Submitting:1 ?Finished in previous run:102 > Find: https://communicado.ci.uchicago.edu:61999 > Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999 > Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102 > Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102 > Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102 > Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102 > Command(1, CHANNELCONFIG): handling reply timeout; > sendReqTime=101021-174124.460, sendTime=101021-174124.471, > now=101021-174324.492 > Command(1, CHANNELCONFIG)fault was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) > ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) > ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512) > ? ? ? ?at java.util.TimerThread.run(Timer.java:462) > Progress: ?Stage in:92 ?Submitting:7 ?Submitted:1 ?Finished in previous run:102 > > My sites.xml sets the persistent service to work in passive mode. > > > thanks, > -Allan > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From aespinosa at cs.uchicago.edu Wed Nov 17 15:35:40 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 17 Nov 2010 15:35:40 -0600 Subject: [Swift-devel] Re: the persistence of the persistent coaster service. In-Reply-To: References: Message-ID: Upon the client's connection, this gets registered in the service log: ... ... Plan time: 1 Plan time: 1 GSSSChannel-null(0)[1175215772: {}]: Disabling heartbeats (config is null) (1) Scheduling GSSSChannel-null(12)[1175215772: {}] for addition nullChannel started Channel id: u-20ccd0f-12c5bc25c45--8000-u-28c73091-12c5b774ab1--7ff5 MetaChannel: 682820082[1175215772: {}] -> null: Disabling heartbeats (disabled in config) MetaChannel: 682820082[1175215772: {}] -> null.bind -> GSSSChannel-null(12)[1175215772: {}] Plan time: 1 Congestion queue size: 0 runTime: 0, sleepTime: 10049 Plan time: 1 ... ... 2010/11/17 Allan Espinosa : > Bumping the thread. ?In an attempt to isolate the bug, I made this workflow: > > app (external o) sleep(int time) { > ?sleep time; > } > > > /* Main program */ > external rups[]; > > int t = 300; > int a[]; > > iterate ix { > ?a[ix] = ix; > } until (ix == 1300); > > foreach ai,i in a { > ?rups[i] = sleep(t); > } > > > > ? > ? ? url="https://communicado.ci.uchicago.edu:61999" > ? ? ? ?jobmanager="local:local" /> > > ? ?passive > > ? ? > ? ?/gpfs/pads/swift/aespinosa/swift-runs > ? > > > > > localhost ?sleep ? ? ? ? 
?/bin/sleep INSTALLED INTEL32::LINUX null > > and still get the same type of error message: > RunID: 20101117-1527-ui6i2lra > Progress: > Find: https://communicado.ci.uchicago.edu:61999 > Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999 > Progress: ?Selecting site:1 ?Submitting:294 > Progress: ?Selecting site:3 ?Submitting:367 > Progress: ?Selecting site:3 ?Submitting:367 > Progress: ?Selecting site:3 ?Submitting:367 > Progress: ?Selecting site:3 ?Submitting:367 > Command(1, CHANNELCONFIG): handling reply timeout; > sendReqTime=101117-152717.209, sendTime=101117 > -152717.211, now=101117-152917.232 > Progress: ?Selecting site:3 ?Submitting:366 ?Submitted:1 > Command(1, CHANNELCONFIG)fault was: Reply timeout > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.ja > va:280) > ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) > ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512) > ? ? ? ?at java.util.TimerThread.run(Timer.java:462) > Progress: ?Selecting site:3 ?Submitting:366 Failed but can retry:1 > Progress: ?Selecting site:3 ?Submitting:366 Failed but can retry:1 > > > 2010/10/21 Allan Espinosa : >> Hi, >> >> When I'm reusing the coaster service onto the next swift session, i >> get reply timeouts in the CHANNELCONFIG command: >> >> >> swift-r3685 cog-r2913 >> >> RunID: extract >> Progress: >> Progress: ?uninitialized:2 ?Finished in previous run:2 >> Progress: ?uninitialized:2 ?Finished in previous run:2 >> Progress: ?Stage in:99 ?Submitting:1 ?Finished in previous run:102 >> Find: https://communicado.ci.uchicago.edu:61999 >> Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999 >> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102 >> Passive queue processor initialized. Callback URI is http://128.135.125.17:60999 >> Progress: ?Stage in:71 ?Submitting:2 ?Submitted:27 ?Finished in previous run:102 >> Progress: ?Stage in:29 ?Submitting:1 ?Submitted:70 ?Finished in previous run:102 >> >> **Abord** (Ctrl-C) >> ** rerun/ resume workflow ** >> swift-r3685 cog-r2913 >> >> RunID: extract >> Progress: >> Progress: ?uninitialized:3 ?Finished in previous run:2 >> Progress: ?Stage in:99 ?Submitting:1 ?Finished in previous run:102 >> Find: https://communicado.ci.uchicago.edu:61999 >> Find: ?keepalive(120), reconnect - https://communicado.ci.uchicago.edu:61999 >> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102 >> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102 >> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102 >> Progress: ?Stage in:92 ?Submitting:8 ?Finished in previous run:102 >> Command(1, CHANNELCONFIG): handling reply timeout; >> sendReqTime=101021-174124.460, sendTime=101021-174124.471, >> now=101021-174324.492 >> Command(1, CHANNELCONFIG)fault was: Reply timeout >> org.globus.cog.karajan.workflow.service.ReplyTimeoutException >> ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) >> ? ? ? ?at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) >> ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512) >> ? ? ? ?at java.util.TimerThread.run(Timer.java:462) >> Progress: ?Stage in:92 ?Submitting:7 ?Submitted:1 ?Finished in previous run:102 >> >> My sites.xml sets the persistent service to work in passive mode. 
>> >> >> thanks, >> -Allan >> >> -- >> Allan M. Espinosa >> PhD student, Computer Science >> University of Chicago >> > > > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From wilde at mcs.anl.gov Wed Nov 17 23:25:19 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 17 Nov 2010 23:25:19 -0600 (CST) Subject: [Swift-devel] Re: the persistence of the persistent coaster service. In-Reply-To: Message-ID: <1516013329.76009.1290057919991.JavaMail.root@zimbra.anl.gov> Allan, Ive had similar symptoms, but think that Im seeing different error messages. When I start a persistent service, I can run repeated Swift scripts against it, but *only* if I do them in fairly quick succession. If I let the service sit idle for more than about 5 minutes, it becomes unusable. I need to carefully capture a test case, as well as testing on an unmodified trunk that would enable Mihael to reproduce and fix the problem. I think thats the key: If you can give Mihael a way to easily reproduce the problem at will, then he'll likely be able to fix it quickly. I also see a possible related problem: when I run coasters with a large number of slots (say 64) and my workload is unable to keep the workers busy due to staging delays, then after the workers start timing out (ie I get the message "Job Cancelled") then this causes an error somewhere on the client side and Swift quickly dies with a fatal error. I need to try to reproduce this as well and/or capture logs from it. I hope to get to this next week after SC. - Mike ----- Original Message ----- > Upon the client's connection, this gets registered in the service log: > > ... > ... > Plan time: 1 > Plan time: 1 > GSSSChannel-null(0)[1175215772: {}]: Disabling heartbeats (config is > null) > (1) Scheduling GSSSChannel-null(12)[1175215772: {}] for addition > nullChannel started > Channel id: u-20ccd0f-12c5bc25c45--8000-u-28c73091-12c5b774ab1--7ff5 > MetaChannel: 682820082[1175215772: {}] -> null: Disabling heartbeats > (disabled in config) > MetaChannel: 682820082[1175215772: {}] -> null.bind -> > GSSSChannel-null(12)[1175215772: {}] > Plan time: 1 > Congestion queue size: 0 > runTime: 0, sleepTime: 10049 > Plan time: 1 > ... > ... > > 2010/11/17 Allan Espinosa : > > Bumping the thread. In an attempt to isolate the bug, I made this > > workflow: > > > > app (external o) sleep(int time) { > > ?sleep time; > > } > > > > > > /* Main program */ > > external rups[]; > > > > int t = 300; > > int a[]; > > > > iterate ix { > > ?a[ix] = ix; > > } until (ix == 1300); > > > > foreach ai,i in a { > > ?rups[i] = sleep(t); > > } > > > > > > > > ? > > ? ? > url="https://communicado.ci.uchicago.edu:61999" > > ? ? ? ?jobmanager="local:local" /> > > > > ? ?passive > > > > ? ? > > ? ?/gpfs/pads/swift/aespinosa/swift-runs > > ? 
> > > > > > > > > > localhost sleep /bin/sleep INSTALLED INTEL32::LINUX null > > > > and still get the same type of error message: > > RunID: 20101117-1527-ui6i2lra > > Progress: > > Find: https://communicado.ci.uchicago.edu:61999 > > Find: keepalive(120), reconnect - > > https://communicado.ci.uchicago.edu:61999 > > Progress: Selecting site:1 Submitting:294 > > Progress: Selecting site:3 Submitting:367 > > Progress: Selecting site:3 Submitting:367 > > Progress: Selecting site:3 Submitting:367 > > Progress: Selecting site:3 Submitting:367 > > Command(1, CHANNELCONFIG): handling reply timeout; > > sendReqTime=101117-152717.209, sendTime=101117 > > -152717.211, now=101117-152917.232 > > Progress: Selecting site:3 Submitting:366 Submitted:1 > > Command(1, CHANNELCONFIG)fault was: Reply timeout > > org.globus.cog.karajan.workflow.service.ReplyTimeoutException > > ? ? ? ?at > > ? ? ? ?org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.ja > > va:280) > > ? ? ? ?at > > ? ? ? ?org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) > > ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512) > > ? ? ? ?at java.util.TimerThread.run(Timer.java:462) > > Progress: Selecting site:3 Submitting:366 Failed but can retry:1 > > Progress: Selecting site:3 Submitting:366 Failed but can retry:1 > > > > > > 2010/10/21 Allan Espinosa : > >> Hi, > >> > >> When I'm reusing the coaster service onto the next swift session, i > >> get reply timeouts in the CHANNELCONFIG command: > >> > >> > >> swift-r3685 cog-r2913 > >> > >> RunID: extract > >> Progress: > >> Progress: uninitialized:2 Finished in previous run:2 > >> Progress: uninitialized:2 Finished in previous run:2 > >> Progress: Stage in:99 Submitting:1 Finished in previous run:102 > >> Find: https://communicado.ci.uchicago.edu:61999 > >> Find: keepalive(120), reconnect - > >> https://communicado.ci.uchicago.edu:61999 > >> Progress: Stage in:92 Submitting:8 Finished in previous run:102 > >> Passive queue processor initialized. Callback URI is > >> http://128.135.125.17:60999 > >> Progress: Stage in:71 Submitting:2 Submitted:27 Finished in > >> previous run:102 > >> Progress: Stage in:29 Submitting:1 Submitted:70 Finished in > >> previous run:102 > >> > >> **Abord** (Ctrl-C) > >> ** rerun/ resume workflow ** > >> swift-r3685 cog-r2913 > >> > >> RunID: extract > >> Progress: > >> Progress: uninitialized:3 Finished in previous run:2 > >> Progress: Stage in:99 Submitting:1 Finished in previous run:102 > >> Find: https://communicado.ci.uchicago.edu:61999 > >> Find: keepalive(120), reconnect - > >> https://communicado.ci.uchicago.edu:61999 > >> Progress: Stage in:92 Submitting:8 Finished in previous run:102 > >> Progress: Stage in:92 Submitting:8 Finished in previous run:102 > >> Progress: Stage in:92 Submitting:8 Finished in previous run:102 > >> Progress: Stage in:92 Submitting:8 Finished in previous run:102 > >> Command(1, CHANNELCONFIG): handling reply timeout; > >> sendReqTime=101021-174124.460, sendTime=101021-174124.471, > >> now=101021-174324.492 > >> Command(1, CHANNELCONFIG)fault was: Reply timeout > >> org.globus.cog.karajan.workflow.service.ReplyTimeoutException > >> ? ? ? ?at > >> ? ? ? ?org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) > >> ? ? ? ?at > >> ? ? ? ?org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) > >> ? ? ? ?at java.util.TimerThread.mainLoop(Timer.java:512) > >> ? ? ? 
?at java.util.TimerThread.run(Timer.java:462) > >> Progress: Stage in:92 Submitting:7 Submitted:1 Finished in previous > >> run:102 > >> > >> My sites.xml sets the persistent service to work in passive mode. > >> > >> > >> thanks, > >> -Allan > >> > >> -- > >> Allan M. Espinosa > >> PhD student, Computer Science > >> University of Chicago > >> > > > > > > > > -- > > Allan M. Espinosa > > PhD student, Computer Science > > University of Chicago > > > > > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From bugzilla-daemon at mcs.anl.gov Thu Nov 18 08:46:51 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 18 Nov 2010 08:46:51 -0600 (CST) Subject: [Swift-devel] [Bug 229] New: Swift log should capture additional environmental information Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=229 Summary: Swift log should capture additional environmental information Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: skenny at uchicago.edu ReportedBy: wilde at mcs.anl.gov Add: - java version - full tc, sites, and property info - printenv - swift script We should control this via a property for users that dont want all this info captured. That can be a second step. With this info in the log, we will more likely be able to diagnose a user's problem with just the single log file. We should consider removing the file "swift.log" as I have never seen any useful info in it. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the reporter. 
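As a rough illustration of the kind of snapshot bug 229 asks for, the fragment below gathers the Java version, the environment, and the tc/sites files into one block of text that could be prepended to a run log. It is only a sketch written as a standalone helper: the real change would live inside Swift's Java-side loader, and the "tc.data" / "sites.xml" paths are placeholders.

#!/usr/bin/env ruby
# Sketch of the environment snapshot requested in bug 229 (standalone
# illustration only; the actual fix would go inside Swift's loader).
# "tc.data" and "sites.xml" below are placeholder file names.

snapshot = []
snapshot << "java version: #{`java -version 2>&1`.lines.first.to_s.strip}"
snapshot << "swift script: #{ARGV[0]}"
ENV.sort.each { |key, val| snapshot << "env #{key}=#{val}" }
["tc.data", "sites.xml"].each do |f|
  snapshot << "---- #{f} ----"
  snapshot << File.read(f) if File.exist?(f)
end
puts snapshot

Output of this sort, captured at startup, is what would let a single log file answer "which Java, which tc.data, which sites.xml" without further round trips with the user.
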
From aespinosa at cs.uchicago.edu Thu Nov 18 14:39:04 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 18 Nov 2010 14:39:04 -0600 Subject: [Swift-devel] misassignment of jobs Message-ID: tc.data for worker15: SPRACE_osg-ce.sprace.org.br worker15 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" But it was assigned to another site instead: $ grep 0erqqq1k worker-*.log 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=worker15-0erqqq1k thread host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k 2010-11-17 15:38:59,110-0600 INFO vdl:createdirset START jobid=worker15-0erqqq1k host=LIGO_UWM_N ce.phys.uwm.edu - Initializing directory structure 2010-11-17 15:38:59,137-0600 INFO vdl:createdirset END jobid=worker15-0erqqq1k - Done initializi structure 2010-11-17 15:38:59,172-0600 INFO vdl:dostagein START jobid=worker15-0erqqq1k - Staging in files 2010-11-17 15:38:59,257-0600 INFO vdl:dostagein END jobid=worker15-0erqqq1k - Staging in finishe 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START jobid=worker15-0erqqq1k tr=worker15 arg //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200] tmpdir=worker-20101117-1538-fe9a orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu 2010-11-17 15:39:01,394-0600 INFO Execute Submit: in: worker-20101117-1538-fe9aq209 command: /bi /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch -e worker15 -out stdout.txt -err stderr.txt -i -k -cdmfile -status files -a http://128.135.125.17:61015 SPRACE_osg-ce.sprace.org.br /tmp 7200 2010-11-17 15:39:01,394-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=u -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch -e worker1 .txt -err stderr.txt -i -d -if -of -k -cdmfile -status files -a http://128.135.125.17:61015 .sprace.org.br /tmp 7200 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START jobid=worker15-0erqqq1k 2010-11-17 16:49:33,278-0600 INFO vdl:checkjobstatus FAILURE jobid=worker15-0erqqq1k - Failure f 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=worker15-0erqqq1k - A ception: Cannot find executable worker15 on site system path There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data -Allan -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Thu Nov 18 16:03:09 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Nov 2010 14:03:09 -0800 Subject: [Swift-devel] misassignment of jobs In-Reply-To: References: Message-ID: <1290117789.30414.1.camel@blabla2.none> I'm sure there is a reasonable explanation for this. Can you post your entire tc.data? And to make sure we're talking about the right one, can you look at the swift log and use exactly the one that swift claims is using? 
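One way to survey the whole log for this symptom, in the same spirit as the libexec/log-processing filters, is a small script that prints each JOB_START event's jobid, transformation and assigned host so the tr/host pairs can be checked against tc.data. This is only a sketch: the field names (jobid=, tr=, host=) are inferred from the vdl:execute2 lines quoted above, and nothing about the script is part of Swift itself.

#!/usr/bin/env ruby
# Hypothetical filter: list jobid, transformation (tr=) and assigned host
# for every JOB_START line on stdin, so mismatches against tc.data stand out.
# Field layout is inferred from the quoted vdl:execute2 log lines.

$stdin.grep(/JOB_START/).each do |line|
  jobid = line[/jobid=(\S+)/, 1]
  tr    = line[/\btr=(\S+)/, 1]
  host  = line[/\bhost=(\S+)/, 1]
  next unless jobid && tr && host
  puts "#{jobid} #{tr} #{host}"
end

Piping the run's log through this and sorting on the last two columns would show at a glance whether worker15 was the only transformation routed to a host that has no matching tc.data entry.
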
Mihael On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote: > tc.data for worker15: > SPRACE_osg-ce.sprace.org.br worker15 /osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > > But it was assigned to another site instead: > $ grep 0erqqq1k worker-*.log > 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION > jobid=worker15-0erqqq1k thread > host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k > 2010-11-17 15:38:59,110-0600 INFO vdl:createdirset START > jobid=worker15-0erqqq1k host=LIGO_UWM_N > ce.phys.uwm.edu - Initializing directory structure > 2010-11-17 15:38:59,137-0600 INFO vdl:createdirset END > jobid=worker15-0erqqq1k - Done initializi > structure > 2010-11-17 15:38:59,172-0600 INFO vdl:dostagein START > jobid=worker15-0erqqq1k - Staging in files > 2010-11-17 15:38:59,257-0600 INFO vdl:dostagein END > jobid=worker15-0erqqq1k - Staging in finishe > 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START > jobid=worker15-0erqqq1k tr=worker15 arg > //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200] > tmpdir=worker-20101117-1538-fe9a > orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > 2010-11-17 15:39:01,394-0600 INFO Execute Submit: in: > worker-20101117-1538-fe9aq209 command: /bi > /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch -e worker15 -out > stdout.txt -err stderr.txt -i > -k -cdmfile -status files -a http://128.135.125.17:61015 > SPRACE_osg-ce.sprace.org.br /tmp 7200 > 2010-11-17 15:39:01,394-0600 INFO GridExec TASK_DEFINITION: > Task(type=JOB_SUBMISSION, identity=u > -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k > -jobdir 0 -scratch -e worker1 > .txt -err stderr.txt -i -d -if -of -k -cdmfile -status files -a > http://128.135.125.17:61015 > .sprace.org.br /tmp 7200 > 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START > jobid=worker15-0erqqq1k > 2010-11-17 16:49:33,278-0600 INFO vdl:checkjobstatus FAILURE > jobid=worker15-0erqqq1k - Failure f > 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=worker15-0erqqq1k - A > ception: Cannot find executable worker15 on site system path > > There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data > > -Allan > From aespinosa at cs.uchicago.edu Thu Nov 18 16:08:56 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 18 Nov 2010 16:08:56 -0600 Subject: [Swift-devel] misassignment of jobs In-Reply-To: <1290117789.30414.1.camel@blabla2.none> References: <1290117789.30414.1.camel@blabla2.none> Message-ID: i'm using a file named tc.data 2010-11-17 15:38:50,115-0600 INFO unknown Using tc.data: tc.data $cat tc.data PADS sleep_pads /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" BNL-ATLAS_gridgk01.racf.bnl.gov worker0 /usatlas/OSG/engage-scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" BNL-ATLAS_gridgk01.racf.bnl.gov sleep0 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" BNL-ATLAS_gridgk01.racf.bnl.gov sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" BNL-ATLAS_gridgk02.racf.bnl.gov worker1 /usatlas/OSG/engage-scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" BNL-ATLAS_gridgk02.racf.bnl.gov sleep1 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" BNL-ATLAS_gridgk02.racf.bnl.gov sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" FNAL_FERMIGRID_fnpcosg1.fnal.gov worker2 /grid/app/engage/scec/worker.pl INSTALLED 
INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" FNAL_FERMIGRID_fnpcosg1.fnal.gov sleep2 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" FNAL_FERMIGRID_fnpcosg1.fnal.gov sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" Firefly_ff-grid3.unl.edu worker3 /panfs/panasas/CMS/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" Firefly_ff-grid3.unl.edu sleep3 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" Firefly_ff-grid3.unl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" GridUNESP_CENTRAL_ce.grid.unesp.br worker4 /osg/app/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" GridUNESP_CENTRAL_ce.grid.unesp.br sleep4 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" GridUNESP_CENTRAL_ce.grid.unesp.br sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu worker5 /opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu sleep5 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" MIT_CMS_ce01.cmsaf.mit.edu worker6 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" MIT_CMS_ce01.cmsaf.mit.edu sleep6 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" MIT_CMS_ce01.cmsaf.mit.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" MIT_CMS_ce02.cmsaf.mit.edu worker7 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" MIT_CMS_ce02.cmsaf.mit.edu sleep7 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" MIT_CMS_ce02.cmsaf.mit.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu worker8 /grid-tmp/grid-apps/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu sleep8 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" Nebraska_gpn-husker.unl.edu worker9 /opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" Nebraska_gpn-husker.unl.edu sleep9 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" Nebraska_gpn-husker.unl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" Nebraska_red.unl.edu worker10 /opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" Nebraska_red.unl.edu sleep10 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" Nebraska_red.unl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" Prairiefire_pf-grid.unl.edu worker11 /opt/pfgridapp/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" Prairiefire_pf-grid.unl.edu sleep11 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" Prairiefire_pf-grid.unl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" Purdue-RCAC_osg.rcac.purdue.edu worker12 /apps/osg/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" Purdue-RCAC_osg.rcac.purdue.edu sleep12 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" 
Purdue-RCAC_osg.rcac.purdue.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" RENCI-Engagement_belhaven-1.renci.org worker13 /nfs/osg-app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" RENCI-Engagement_belhaven-1.renci.org sleep13 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" RENCI-Engagement_belhaven-1.renci.org sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" SBGrid-Harvard-East_osg-east.hms.harvard.edu worker14 /osg/storage/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" SBGrid-Harvard-East_osg-east.hms.harvard.edu sleep14 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" SBGrid-Harvard-East_osg-east.hms.harvard.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" SPRACE_osg-ce.sprace.org.br worker15 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" SPRACE_osg-ce.sprace.org.br sleep15 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" SPRACE_osg-ce.sprace.org.br sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UCHC_CBG_vdgateway.vcell.uchc.edu worker16 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" UCHC_CBG_vdgateway.vcell.uchc.edu sleep16 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UCHC_CBG_vdgateway.vcell.uchc.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UCR-HEP_top.ucr.edu worker17 /data/bottom/osg_app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" UCR-HEP_top.ucr.edu sleep17 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UCR-HEP_top.ucr.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UFlorida-HPC_osg.hpc.ufl.edu worker18 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" UFlorida-HPC_osg.hpc.ufl.edu sleep18 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UFlorida-HPC_osg.hpc.ufl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UFlorida-PG_pg.ihepa.ufl.edu worker19 /raid/osgpg/pg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" UFlorida-PG_pg.ihepa.ufl.edu sleep19 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UFlorida-PG_pg.ihepa.ufl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UMissHEP_umiss001.hep.olemiss.edu worker20 /osgremote/osg_app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" UMissHEP_umiss001.hep.olemiss.edu sleep20 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UMissHEP_umiss001.hep.olemiss.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UTA_SWT2_gk04.swt2.uta.edu worker21 /cluster/grid/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" UTA_SWT2_gk04.swt2.uta.edu sleep21 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" UTA_SWT2_gk04.swt2.uta.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" WQCG-Harvard-OSG_tuscany.med.harvard.edu worker22 /osg/storage/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" WQCG-Harvard-OSG_tuscany.med.harvard.edu sleep22 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" WQCG-Harvard-OSG_tuscany.med.harvard.edu sleep /bin/sleep INSTALLED INTEL32::LINUX 
GLOBUS::maxwalltime="00:05:00" 2010/11/18 Mihael Hategan : > I'm sure there is a reasonable explanation for this. > > Can you post your entire tc.data? And to make sure we're talking about > the right one, can you look at the swift log and use exactly the one > that swift claims is using? > > Mihael > > On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote: >> tc.data for worker15: >> SPRACE_osg-ce.sprace.org.br ?worker15 /osg/app/engage/scec/worker.pl >> ? ?INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" >> >> But it was assigned to another site instead: >> $ grep 0erqqq1k worker-*.log >> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION >> jobid=worker15-0erqqq1k thread >> ?host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k >> 2010-11-17 15:38:59,110-0600 INFO ?vdl:createdirset START >> jobid=worker15-0erqqq1k host=LIGO_UWM_N >> ce.phys.uwm.edu - Initializing directory structure >> 2010-11-17 15:38:59,137-0600 INFO ?vdl:createdirset END >> jobid=worker15-0erqqq1k - Done initializi >> structure >> 2010-11-17 15:38:59,172-0600 INFO ?vdl:dostagein START >> jobid=worker15-0erqqq1k - Staging in files >> 2010-11-17 15:38:59,257-0600 INFO ?vdl:dostagein END >> jobid=worker15-0erqqq1k - Staging in finishe >> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START >> jobid=worker15-0erqqq1k tr=worker15 arg >> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200] >> tmpdir=worker-20101117-1538-fe9a >> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu >> 2010-11-17 15:39:01,394-0600 INFO ?Execute Submit: in: >> worker-20101117-1538-fe9aq209 command: /bi >> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch ?-e worker15 -out >> stdout.txt -err stderr.txt -i >> ?-k ?-cdmfile ?-status files -a http://128.135.125.17:61015 >> SPRACE_osg-ce.sprace.org.br /tmp 7200 >> 2010-11-17 15:39:01,394-0600 INFO ?GridExec TASK_DEFINITION: >> Task(type=JOB_SUBMISSION, identity=u >> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k >> -jobdir 0 -scratch ?-e worker1 >> .txt -err stderr.txt -i -d ?-if ?-of ?-k ?-cdmfile ?-status files -a >> http://128.135.125.17:61015 >> .sprace.org.br /tmp 7200 >> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START >> jobid=worker15-0erqqq1k >> 2010-11-17 16:49:33,278-0600 INFO ?vdl:checkjobstatus FAILURE >> jobid=worker15-0erqqq1k - Failure f >> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION >> jobid=worker15-0erqqq1k - A >> ception: Cannot find executable worker15 on site system path >> >> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data >> >> -Allan >> > > > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Thu Nov 18 17:39:30 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Nov 2010 15:39:30 -0800 Subject: [Swift-devel] misassignment of jobs In-Reply-To: References: <1290117789.30414.1.camel@blabla2.none> Message-ID: <1290123570.30658.0.camel@blabla2.none> Ok. I can see a couple of code paths that can lead to this, but I need to constrain it some more. Does this happen every time you run this? 
Mihael On Thu, 2010-11-18 at 16:08 -0600, Allan Espinosa wrote: > i'm using a file named tc.data > > 2010-11-17 15:38:50,115-0600 INFO unknown Using tc.data: tc.data > $cat tc.data > PADS sleep_pads /bin/sleep INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="00:05:00" > > BNL-ATLAS_gridgk01.racf.bnl.gov worker0 > /usatlas/OSG/engage-scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > BNL-ATLAS_gridgk01.racf.bnl.gov sleep0 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > BNL-ATLAS_gridgk01.racf.bnl.gov sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > BNL-ATLAS_gridgk02.racf.bnl.gov worker1 > /usatlas/OSG/engage-scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > BNL-ATLAS_gridgk02.racf.bnl.gov sleep1 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > BNL-ATLAS_gridgk02.racf.bnl.gov sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > FNAL_FERMIGRID_fnpcosg1.fnal.gov worker2 > /grid/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > FNAL_FERMIGRID_fnpcosg1.fnal.gov sleep2 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > FNAL_FERMIGRID_fnpcosg1.fnal.gov sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > Firefly_ff-grid3.unl.edu worker3 > /panfs/panasas/CMS/app/engage/scec/worker.pl INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > Firefly_ff-grid3.unl.edu sleep3 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Firefly_ff-grid3.unl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > GridUNESP_CENTRAL_ce.grid.unesp.br worker4 /osg/app/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > GridUNESP_CENTRAL_ce.grid.unesp.br sleep4 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > GridUNESP_CENTRAL_ce.grid.unesp.br sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu worker5 > /opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu sleep5 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > MIT_CMS_ce01.cmsaf.mit.edu worker6 /osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > MIT_CMS_ce01.cmsaf.mit.edu sleep6 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > MIT_CMS_ce01.cmsaf.mit.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > MIT_CMS_ce02.cmsaf.mit.edu worker7 /osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > MIT_CMS_ce02.cmsaf.mit.edu sleep7 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > MIT_CMS_ce02.cmsaf.mit.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu worker8 > /grid-tmp/grid-apps/engage/scec/worker.pl INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu sleep8 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="00:05:00" > > Nebraska_gpn-husker.unl.edu worker9 > 
/opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > Nebraska_gpn-husker.unl.edu sleep9 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Nebraska_gpn-husker.unl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > Nebraska_red.unl.edu worker10 /opt/osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > Nebraska_red.unl.edu sleep10 /bin/sleep INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Nebraska_red.unl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > Prairiefire_pf-grid.unl.edu worker11 > /opt/pfgridapp/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > Prairiefire_pf-grid.unl.edu sleep11 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Prairiefire_pf-grid.unl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > Purdue-RCAC_osg.rcac.purdue.edu worker12 > /apps/osg/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > Purdue-RCAC_osg.rcac.purdue.edu sleep12 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Purdue-RCAC_osg.rcac.purdue.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > RENCI-Engagement_belhaven-1.renci.org worker13 > /nfs/osg-app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > RENCI-Engagement_belhaven-1.renci.org sleep13 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > RENCI-Engagement_belhaven-1.renci.org sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > SBGrid-Harvard-East_osg-east.hms.harvard.edu worker14 > /osg/storage/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > SBGrid-Harvard-East_osg-east.hms.harvard.edu sleep14 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > SBGrid-Harvard-East_osg-east.hms.harvard.edu sleep > /bin/sleep INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="00:05:00" > > SPRACE_osg-ce.sprace.org.br worker15 /osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > SPRACE_osg-ce.sprace.org.br sleep15 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > SPRACE_osg-ce.sprace.org.br sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UCHC_CBG_vdgateway.vcell.uchc.edu worker16 > /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > UCHC_CBG_vdgateway.vcell.uchc.edu sleep16 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UCHC_CBG_vdgateway.vcell.uchc.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UCR-HEP_top.ucr.edu worker17 > /data/bottom/osg_app/engage/scec/worker.pl INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > UCR-HEP_top.ucr.edu sleep17 /bin/sleep INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UCR-HEP_top.ucr.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UFlorida-HPC_osg.hpc.ufl.edu worker18 /osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > UFlorida-HPC_osg.hpc.ufl.edu sleep18 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UFlorida-HPC_osg.hpc.ufl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UFlorida-PG_pg.ihepa.ufl.edu worker19 > 
/raid/osgpg/pg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > UFlorida-PG_pg.ihepa.ufl.edu sleep19 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UFlorida-PG_pg.ihepa.ufl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UMissHEP_umiss001.hep.olemiss.edu worker20 > /osgremote/osg_app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > UMissHEP_umiss001.hep.olemiss.edu sleep20 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UMissHEP_umiss001.hep.olemiss.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UTA_SWT2_gk04.swt2.uta.edu worker21 > /cluster/grid/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > UTA_SWT2_gk04.swt2.uta.edu sleep21 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UTA_SWT2_gk04.swt2.uta.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > WQCG-Harvard-OSG_tuscany.med.harvard.edu worker22 > /osg/storage/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > WQCG-Harvard-OSG_tuscany.med.harvard.edu sleep22 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > WQCG-Harvard-OSG_tuscany.med.harvard.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="00:05:00" > > 2010/11/18 Mihael Hategan : > > I'm sure there is a reasonable explanation for this. > > > > Can you post your entire tc.data? And to make sure we're talking about > > the right one, can you look at the swift log and use exactly the one > > that swift claims is using? > > > > Mihael > > > > On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote: > >> tc.data for worker15: > >> SPRACE_osg-ce.sprace.org.br worker15 /osg/app/engage/scec/worker.pl > >> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > >> > >> But it was assigned to another site instead: > >> $ grep 0erqqq1k worker-*.log > >> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION > >> jobid=worker15-0erqqq1k thread > >> host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k > >> 2010-11-17 15:38:59,110-0600 INFO vdl:createdirset START > >> jobid=worker15-0erqqq1k host=LIGO_UWM_N > >> ce.phys.uwm.edu - Initializing directory structure > >> 2010-11-17 15:38:59,137-0600 INFO vdl:createdirset END > >> jobid=worker15-0erqqq1k - Done initializi > >> structure > >> 2010-11-17 15:38:59,172-0600 INFO vdl:dostagein START > >> jobid=worker15-0erqqq1k - Staging in files > >> 2010-11-17 15:38:59,257-0600 INFO vdl:dostagein END > >> jobid=worker15-0erqqq1k - Staging in finishe > >> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START > >> jobid=worker15-0erqqq1k tr=worker15 arg > >> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200] > >> tmpdir=worker-20101117-1538-fe9a > >> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > >> 2010-11-17 15:39:01,394-0600 INFO Execute Submit: in: > >> worker-20101117-1538-fe9aq209 command: /bi > >> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch -e worker15 -out > >> stdout.txt -err stderr.txt -i > >> -k -cdmfile -status files -a http://128.135.125.17:61015 > >> SPRACE_osg-ce.sprace.org.br /tmp 7200 > >> 2010-11-17 15:39:01,394-0600 INFO GridExec TASK_DEFINITION: > >> Task(type=JOB_SUBMISSION, identity=u > >> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k > >> -jobdir 0 -scratch -e worker1 > >> .txt -err stderr.txt 
-i -d -if -of -k -cdmfile -status files -a > >> http://128.135.125.17:61015 > >> .sprace.org.br /tmp 7200 > >> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START > >> jobid=worker15-0erqqq1k > >> 2010-11-17 16:49:33,278-0600 INFO vdl:checkjobstatus FAILURE > >> jobid=worker15-0erqqq1k - Failure f > >> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > >> jobid=worker15-0erqqq1k - A > >> ception: Cannot find executable worker15 on site system path > >> > >> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data > >> > >> -Allan > >> > > > > > > > > > > > From hategan at mcs.anl.gov Thu Nov 18 17:45:29 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Nov 2010 15:45:29 -0800 Subject: [Swift-devel] misassignment of jobs In-Reply-To: References: <1290117789.30414.1.camel@blabla2.none> Message-ID: <1290123929.30658.1.camel@blabla2.none> Also, can you post sites.xml and the full log? On Thu, 2010-11-18 at 16:08 -0600, Allan Espinosa wrote: > i'm using a file named tc.data > > 2010-11-17 15:38:50,115-0600 INFO unknown Using tc.data: tc.data > $cat tc.data > PADS sleep_pads /bin/sleep INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="00:05:00" > > BNL-ATLAS_gridgk01.racf.bnl.gov worker0 > /usatlas/OSG/engage-scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > BNL-ATLAS_gridgk01.racf.bnl.gov sleep0 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > BNL-ATLAS_gridgk01.racf.bnl.gov sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > BNL-ATLAS_gridgk02.racf.bnl.gov worker1 > /usatlas/OSG/engage-scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > BNL-ATLAS_gridgk02.racf.bnl.gov sleep1 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > BNL-ATLAS_gridgk02.racf.bnl.gov sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > FNAL_FERMIGRID_fnpcosg1.fnal.gov worker2 > /grid/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > FNAL_FERMIGRID_fnpcosg1.fnal.gov sleep2 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > FNAL_FERMIGRID_fnpcosg1.fnal.gov sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > Firefly_ff-grid3.unl.edu worker3 > /panfs/panasas/CMS/app/engage/scec/worker.pl INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > Firefly_ff-grid3.unl.edu sleep3 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Firefly_ff-grid3.unl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > GridUNESP_CENTRAL_ce.grid.unesp.br worker4 /osg/app/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > GridUNESP_CENTRAL_ce.grid.unesp.br sleep4 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > GridUNESP_CENTRAL_ce.grid.unesp.br sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu worker5 > /opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu sleep5 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > MIT_CMS_ce01.cmsaf.mit.edu worker6 /osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > MIT_CMS_ce01.cmsaf.mit.edu sleep6 /bin/sleep > INSTALLED 
INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > MIT_CMS_ce01.cmsaf.mit.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > MIT_CMS_ce02.cmsaf.mit.edu worker7 /osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > MIT_CMS_ce02.cmsaf.mit.edu sleep7 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > MIT_CMS_ce02.cmsaf.mit.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu worker8 > /grid-tmp/grid-apps/engage/scec/worker.pl INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu sleep8 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="00:05:00" > > Nebraska_gpn-husker.unl.edu worker9 > /opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > Nebraska_gpn-husker.unl.edu sleep9 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Nebraska_gpn-husker.unl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > Nebraska_red.unl.edu worker10 /opt/osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > Nebraska_red.unl.edu sleep10 /bin/sleep INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Nebraska_red.unl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > Prairiefire_pf-grid.unl.edu worker11 > /opt/pfgridapp/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > Prairiefire_pf-grid.unl.edu sleep11 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Prairiefire_pf-grid.unl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > Purdue-RCAC_osg.rcac.purdue.edu worker12 > /apps/osg/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > Purdue-RCAC_osg.rcac.purdue.edu sleep12 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > Purdue-RCAC_osg.rcac.purdue.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > RENCI-Engagement_belhaven-1.renci.org worker13 > /nfs/osg-app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > RENCI-Engagement_belhaven-1.renci.org sleep13 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > RENCI-Engagement_belhaven-1.renci.org sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > SBGrid-Harvard-East_osg-east.hms.harvard.edu worker14 > /osg/storage/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > SBGrid-Harvard-East_osg-east.hms.harvard.edu sleep14 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > SBGrid-Harvard-East_osg-east.hms.harvard.edu sleep > /bin/sleep INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="00:05:00" > > SPRACE_osg-ce.sprace.org.br worker15 /osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > SPRACE_osg-ce.sprace.org.br sleep15 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > SPRACE_osg-ce.sprace.org.br sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UCHC_CBG_vdgateway.vcell.uchc.edu worker16 > /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > 
UCHC_CBG_vdgateway.vcell.uchc.edu sleep16 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UCHC_CBG_vdgateway.vcell.uchc.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UCR-HEP_top.ucr.edu worker17 > /data/bottom/osg_app/engage/scec/worker.pl INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > UCR-HEP_top.ucr.edu sleep17 /bin/sleep INSTALLED > INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UCR-HEP_top.ucr.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UFlorida-HPC_osg.hpc.ufl.edu worker18 /osg/app/engage/scec/worker.pl > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > UFlorida-HPC_osg.hpc.ufl.edu sleep18 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UFlorida-HPC_osg.hpc.ufl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UFlorida-PG_pg.ihepa.ufl.edu worker19 > /raid/osgpg/pg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > UFlorida-PG_pg.ihepa.ufl.edu sleep19 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UFlorida-PG_pg.ihepa.ufl.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UMissHEP_umiss001.hep.olemiss.edu worker20 > /osgremote/osg_app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > UMissHEP_umiss001.hep.olemiss.edu sleep20 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UMissHEP_umiss001.hep.olemiss.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > UTA_SWT2_gk04.swt2.uta.edu worker21 > /cluster/grid/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > UTA_SWT2_gk04.swt2.uta.edu sleep21 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > UTA_SWT2_gk04.swt2.uta.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > > WQCG-Harvard-OSG_tuscany.med.harvard.edu worker22 > /osg/storage/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="02:00:00" > WQCG-Harvard-OSG_tuscany.med.harvard.edu sleep22 /bin/sleep > INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" > WQCG-Harvard-OSG_tuscany.med.harvard.edu sleep /bin/sleep > INSTALLED INTEL32::LINUX > GLOBUS::maxwalltime="00:05:00" > > 2010/11/18 Mihael Hategan : > > I'm sure there is a reasonable explanation for this. > > > > Can you post your entire tc.data? And to make sure we're talking about > > the right one, can you look at the swift log and use exactly the one > > that swift claims is using? 
> > > > Mihael > > > > On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote: > >> tc.data for worker15: > >> SPRACE_osg-ce.sprace.org.br worker15 /osg/app/engage/scec/worker.pl > >> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00" > >> > >> But it was assigned to another site instead: > >> $ grep 0erqqq1k worker-*.log > >> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION > >> jobid=worker15-0erqqq1k thread > >> host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k > >> 2010-11-17 15:38:59,110-0600 INFO vdl:createdirset START > >> jobid=worker15-0erqqq1k host=LIGO_UWM_N > >> ce.phys.uwm.edu - Initializing directory structure > >> 2010-11-17 15:38:59,137-0600 INFO vdl:createdirset END > >> jobid=worker15-0erqqq1k - Done initializi > >> structure > >> 2010-11-17 15:38:59,172-0600 INFO vdl:dostagein START > >> jobid=worker15-0erqqq1k - Staging in files > >> 2010-11-17 15:38:59,257-0600 INFO vdl:dostagein END > >> jobid=worker15-0erqqq1k - Staging in finishe > >> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START > >> jobid=worker15-0erqqq1k tr=worker15 arg > >> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200] > >> tmpdir=worker-20101117-1538-fe9a > >> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > >> 2010-11-17 15:39:01,394-0600 INFO Execute Submit: in: > >> worker-20101117-1538-fe9aq209 command: /bi > >> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch -e worker15 -out > >> stdout.txt -err stderr.txt -i > >> -k -cdmfile -status files -a http://128.135.125.17:61015 > >> SPRACE_osg-ce.sprace.org.br /tmp 7200 > >> 2010-11-17 15:39:01,394-0600 INFO GridExec TASK_DEFINITION: > >> Task(type=JOB_SUBMISSION, identity=u > >> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k > >> -jobdir 0 -scratch -e worker1 > >> .txt -err stderr.txt -i -d -if -of -k -cdmfile -status files -a > >> http://128.135.125.17:61015 > >> .sprace.org.br /tmp 7200 > >> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START > >> jobid=worker15-0erqqq1k > >> 2010-11-17 16:49:33,278-0600 INFO vdl:checkjobstatus FAILURE > >> jobid=worker15-0erqqq1k - Failure f > >> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > >> jobid=worker15-0erqqq1k - A > >> ception: Cannot find executable worker15 on site system path > >> > >> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data > >> > >> -Allan > >> > > > > > > > > > > > From aespinosa at cs.uchicago.edu Thu Nov 18 18:15:03 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 18 Nov 2010 18:15:03 -0600 Subject: [Swift-devel] misassignment of jobs In-Reply-To: <1290123929.30658.1.camel@blabla2.none> References: <1290117789.30414.1.camel@blabla2.none> <1290123929.30658.1.camel@blabla2.none> Message-ID: 2010/11/18 Mihael Hategan : > Also, can you post sites.xml and the full log? > > On Thu, 2010-11-18 at 16:08 -0600, Allan Espinosa wrote: >> i'm using a file named tc.data >> >> 2010-11-17 15:38:50,115-0600 INFO ?unknown Using tc.data: tc.data >> $cat tc.data >> PADS ?sleep_pads ? ? /bin/sleep ? ? ?INSTALLED INTEL32::LINUX >> GLOBUS::maxwalltime="00:05:00" >> >> BNL-ATLAS_gridgk01.racf.bnl.gov ?worker0 >> /usatlas/OSG/engage-scec/worker.pl ? ? ?INSTALLED INTEL32::LINUX >> GLOBUS::maxwalltime="02:00:00" >> BNL-ATLAS_gridgk01.racf.bnl.gov ?sleep0 ?/bin/sleep >> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00" >> BNL-ATLAS_gridgk01.racf.bnl.gov ?sleep ? ? ? ? ? ?/bin/sleep >> ? ? ? 
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> BNL-ATLAS_gridgk02.racf.bnl.gov  worker1  /usatlas/OSG/engage-scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> BNL-ATLAS_gridgk02.racf.bnl.gov  sleep1  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> BNL-ATLAS_gridgk02.racf.bnl.gov  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> FNAL_FERMIGRID_fnpcosg1.fnal.gov  worker2  /grid/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep2  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> Firefly_ff-grid3.unl.edu  worker3  /panfs/panasas/CMS/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> Firefly_ff-grid3.unl.edu  sleep3  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Firefly_ff-grid3.unl.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> GridUNESP_CENTRAL_ce.grid.unesp.br  worker4  /osg/app/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> GridUNESP_CENTRAL_ce.grid.unesp.br  sleep4  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> GridUNESP_CENTRAL_ce.grid.unesp.br  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  worker5  /opt/osg/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep5  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> MIT_CMS_ce01.cmsaf.mit.edu  worker6  /osg/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> MIT_CMS_ce01.cmsaf.mit.edu  sleep6  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> MIT_CMS_ce01.cmsaf.mit.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> MIT_CMS_ce02.cmsaf.mit.edu  worker7  /osg/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> MIT_CMS_ce02.cmsaf.mit.edu  sleep7  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> MIT_CMS_ce02.cmsaf.mit.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  worker8  /grid-tmp/grid-apps/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep8  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> Nebraska_gpn-husker.unl.edu  worker9  /opt/osg/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> Nebraska_gpn-husker.unl.edu  sleep9  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Nebraska_gpn-husker.unl.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> Nebraska_red.unl.edu  worker10  /opt/osg/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> Nebraska_red.unl.edu  sleep10  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Nebraska_red.unl.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> Prairiefire_pf-grid.unl.edu  worker11  /opt/pfgridapp/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> Prairiefire_pf-grid.unl.edu  sleep11  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Prairiefire_pf-grid.unl.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> Purdue-RCAC_osg.rcac.purdue.edu  worker12  /apps/osg/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> Purdue-RCAC_osg.rcac.purdue.edu  sleep12  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> Purdue-RCAC_osg.rcac.purdue.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> RENCI-Engagement_belhaven-1.renci.org  worker13  /nfs/osg-app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> RENCI-Engagement_belhaven-1.renci.org  sleep13  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> RENCI-Engagement_belhaven-1.renci.org  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> SBGrid-Harvard-East_osg-east.hms.harvard.edu  worker14  /osg/storage/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep14  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> SPRACE_osg-ce.sprace.org.br  worker15  /osg/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> SPRACE_osg-ce.sprace.org.br  sleep15  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> SPRACE_osg-ce.sprace.org.br  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UCHC_CBG_vdgateway.vcell.uchc.edu  worker16  /osg/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> UCHC_CBG_vdgateway.vcell.uchc.edu  sleep16  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UCHC_CBG_vdgateway.vcell.uchc.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UCR-HEP_top.ucr.edu  worker17  /data/bottom/osg_app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> UCR-HEP_top.ucr.edu  sleep17  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UCR-HEP_top.ucr.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UFlorida-HPC_osg.hpc.ufl.edu  worker18  /osg/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> UFlorida-HPC_osg.hpc.ufl.edu  sleep18  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UFlorida-HPC_osg.hpc.ufl.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UFlorida-PG_pg.ihepa.ufl.edu  worker19  /raid/osgpg/pg/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> UFlorida-PG_pg.ihepa.ufl.edu  sleep19  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UFlorida-PG_pg.ihepa.ufl.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UMissHEP_umiss001.hep.olemiss.edu  worker20  /osgremote/osg_app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> UMissHEP_umiss001.hep.olemiss.edu  sleep20  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UMissHEP_umiss001.hep.olemiss.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> UTA_SWT2_gk04.swt2.uta.edu  worker21  /cluster/grid/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> UTA_SWT2_gk04.swt2.uta.edu  sleep21  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> UTA_SWT2_gk04.swt2.uta.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> WQCG-Harvard-OSG_tuscany.med.harvard.edu  worker22  /osg/storage/app/engage/scec/worker.pl  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep22  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>> WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep  /bin/sleep  INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>>
>> 2010/11/18 Mihael Hategan :
>> > I'm sure there is a reasonable explanation for this.
>> >
>> > Can you post your entire tc.data? And to make sure we're talking about
>> > the right one, can you look at the swift log and use exactly the one
>> > that swift claims is using?
>> >
>> > Mihael
>> >
>> > On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote:
>> >> tc.data for worker15:
>> >> SPRACE_osg-ce.sprace.org.br  worker15  /osg/app/engage/scec/worker.pl
>> >> INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
>> >>
>> >> But it was assigned to another site instead:
>> >> $ grep 0erqqq1k worker-*.log
>> >> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION
>> >> jobid=worker15-0erqqq1k thread
>> >> host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k
>> >> 2010-11-17 15:38:59,110-0600 INFO  vdl:createdirset START
>> >> jobid=worker15-0erqqq1k host=LIGO_UWM_N
>> >> ce.phys.uwm.edu - Initializing directory structure
>> >> 2010-11-17 15:38:59,137-0600 INFO  vdl:createdirset END
>> >> jobid=worker15-0erqqq1k - Done initializi
>> >> structure
>> >> 2010-11-17 15:38:59,172-0600 INFO  vdl:dostagein START
>> >> jobid=worker15-0erqqq1k - Staging in files
>> >> 2010-11-17 15:38:59,257-0600 INFO  vdl:dostagein END
>> >> jobid=worker15-0erqqq1k - Staging in finishe
>> >> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START
>> >> jobid=worker15-0erqqq1k tr=worker15 arg
>> >> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200]
>> >> tmpdir=worker-20101117-1538-fe9a
>> >> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
>> >> 2010-11-17 15:39:01,394-0600 INFO  Execute Submit: in:
>> >> worker-20101117-1538-fe9aq209 command: /bi
>> >> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch  -e worker15 -out
>> >> stdout.txt -err stderr.txt -i
>> >> -k  -cdmfile  -status files -a http://128.135.125.17:61015
>> >> SPRACE_osg-ce.sprace.org.br /tmp 7200
>> >> 2010-11-17 15:39:01,394-0600 INFO  GridExec TASK_DEFINITION:
>> >> Task(type=JOB_SUBMISSION, identity=u
>> >> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k
>> >> -jobdir 0 -scratch  -e worker1
>> >> .txt -err stderr.txt -i -d  -if  -of  -k  -cdmfile  -status files -a
>> >> http://128.135.125.17:61015
>> >> .sprace.org.br /tmp 7200
>> >> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START
>> >> jobid=worker15-0erqqq1k
>> >> 2010-11-17 16:49:33,278-0600 INFO  vdl:checkjobstatus FAILURE
>> >> jobid=worker15-0erqqq1k - Failure f
>> >> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>> >> jobid=worker15-0erqqq1k - A
>> >> ception: Cannot find executable worker15 on site system path
>> >>
>> >> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data
>> >>
>> >> -Allan
>> >>
>> >
>> >
>> >
>> >
>>
>>
>
>
>
>

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

-------------- next part --------------
A non-text attachment was scrubbed...
Name: condor_osg.xml
Type: text/xml
Size: 12555 bytes
Desc: not available
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker-20101117-1538-fe9aq209.log.bz2
Type: application/x-bzip2
Size: 1584471 bytes
Desc: not available
URL: 

From aespinosa at cs.uchicago.edu Thu Nov 18 19:38:18 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 18 Nov 2010 19:38:18 -0600
Subject: [Swift-devel] info to provider
Message-ID: 

Hi,

I'm poking around the provider-coaster tree to be able to manually
specify the ports of the local service while the persistent coaster
service bugs are not yet ironed out. Somewhere along the "-localport"
patch I made before.

What's the reference again for extracting information from
the sites.xml file to the provider?

Thanks,
-Allan

--
Allan M.
Espinosa PhD student, Computer Science University of Chicago

From wilde at mcs.anl.gov Thu Nov 18 19:46:12 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 18 Nov 2010 19:46:12 -0600 (CST)
Subject: [Swift-devel] info to provider
In-Reply-To: 
Message-ID: <243722423.89931.1290131172635.JavaMail.root@zimbra.anl.gov>

Allan, assuming you want to fetch this from Java (as opposed to Karajan in
vdl-int.k) there are a few examples in the code snip I posted for David a
few days ago:

+ Object countValue = getSpec().getAttribute("count");
+ int count;
+
+ if (countValue != null)
+     count = Integer.valueOf(String.valueOf(countValue)).intValue();
+ else
+     count = 1;
+
+ // FIXME: wpn is only meaningful for coasters; is 1 ok otherwise?
+ // should we flag wpn as error if not coasters?
+
+ Object wpnValue = getAttribute(spec, "workerspernode", "1");
+ int wpn = Integer.valueOf(String.valueOf(wpnValue)).intValue();
+ logger.info("FETCH OF WPN: " + wpn); // FIXME: DB
+
+ count *= wpn;
+ logger.info("FETCH OF PE: " + getAttribute(spec, "pe", "NO pe"));
+ logger.info("FETCH OF CPN: " + getAttribute(spec, "corespernode", "NO cpn"));
+ writeAttrValue(String.valueOf(count), "-pe " + getAttribute(spec, "pe", getSGEProperties().getDefaultPE())
-     + " ", wr, "1");
+     + " ", wr);

- Mike

----- Original Message -----
> Hi,
>
> I'm poking around the provider-coaster tree to be able to manually
> specify the ports of the local service while the persistent coaster
> service bugs are not yet ironed out. Somewhere along the "-localport"
> patch I made before.
>
> What's the reference again for extracting information from
> the sites.xml file to the provider?
>
> Thanks,
> -Allan
>
> --
> Allan M. Espinosa
> PhD student, Computer Science
> University of Chicago
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Thu Nov 18 23:57:47 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Nov 2010 21:57:47 -0800
Subject: [Swift-devel] misassignment of jobs
In-Reply-To: 
References: <1290117789.30414.1.camel@blabla2.none> <1290123929.30658.1.camel@blabla2.none>
Message-ID: <1290146267.2226.11.camel@blabla2.none>

I was ready to blame cosmic rays, but this seems to be a pretty common
occurrence in your log. So I'm on it.
mike at blabla2 tmp$ cat worker-20101117-1538-fe9aq209.log|grep JOB_START | awk '{print $7 " " $13}'|sort|uniq tr=worker0 host=BNL-ATLAS_gridgk01.racf.bnl.gov tr=worker0 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu tr=worker1 host=BNL-ATLAS_gridgk02.racf.bnl.gov tr=worker10 host=Nebraska_red.unl.edu tr=worker11 host=Prairiefire_pf-grid.unl.edu tr=worker12 host=Purdue-RCAC_osg.rcac.purdue.edu tr=worker13 host=GridUNESP_CENTRAL_ce.grid.unesp.br tr=worker13 host=RENCI-Engagement_belhaven-1.renci.org tr=worker14 host=SBGrid-Harvard-East_osg-east.hms.harvard.edu tr=worker15 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu tr=worker15 host=SPRACE_osg-ce.sprace.org.br tr=worker16 host=UCHC_CBG_vdgateway.vcell.uchc.edu tr=worker17 host=UCR-HEP_top.ucr.edu tr=worker18 host=UFlorida-HPC_osg.hpc.ufl.edu tr=worker19 host=UFlorida-PG_pg.ihepa.ufl.edu tr=worker2 host=FNAL_FERMIGRID_fnpcosg1.fnal.gov tr=worker20 host=MIT_CMS_ce01.cmsaf.mit.edu tr=worker20 host=UMissHEP_umiss001.hep.olemiss.edu tr=worker21 host=Firefly_ff-grid3.unl.edu tr=worker21 host=Nebraska_red.unl.edu tr=worker21 host=UTA_SWT2_gk04.swt2.uta.edu tr=worker22 host=WQCG-Harvard-OSG_tuscany.med.harvard.edu tr=worker3 host=Firefly_ff-grid3.unl.edu tr=worker3 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu tr=worker4 host=GridUNESP_CENTRAL_ce.grid.unesp.br tr=worker5 host=Firefly_ff-grid3.unl.edu tr=worker5 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu tr=worker6 host=MIT_CMS_ce01.cmsaf.mit.edu tr=worker7 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu tr=worker7 host=MIT_CMS_ce02.cmsaf.mit.edu tr=worker8 host=NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu tr=worker9 host=Nebraska_gpn-husker.unl.edu From bugzilla-daemon at mcs.anl.gov Sat Nov 20 17:17:26 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 20 Nov 2010 17:17:26 -0600 (CST) Subject: [Swift-devel] [Bug 231] New: ssh staging gives error if login scripts write to stdout Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=231 Summary: ssh staging gives error if login scripts write to stdout Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: normal Priority: P3 Component: SwiftScript language AssignedTo: skenny at uchicago.edu ReportedBy: wilde at mcs.anl.gov I found this in old notes. Somewhat low prio as not many people use ssh staging. May be easy to re-create it. > > - if either .profile or .bashrc sends anything to stdout, you get > this cryptic, mysterious message from swift: > ... > > Exception in thread "sftp subsystem 1" java.lang.OutOfMemoryError: > Java heap space > > at > com.sshtools.j2ssh.subsystem.SubsystemClient.run(SubsystemClient.java:198) > > at java.lang.Thread.run(Unknown Source) > > That is funny. In other words a bug. Is there any easy way to > reproduce > that? -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the reporter. From hategan at mcs.anl.gov Sun Nov 21 16:56:26 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 21 Nov 2010 14:56:26 -0800 Subject: [Swift-devel] misassignment of jobs In-Reply-To: <1290146267.2226.11.camel@blabla2.none> References: <1290117789.30414.1.camel@blabla2.none> <1290123929.30658.1.camel@blabla2.none> <1290146267.2226.11.camel@blabla2.none> Message-ID: <1290380186.26914.1.camel@blabla2.none> Sadly though, I can't reproduce this. 
Can you give me more details, such as the swift script, the version of swift used, and anything that would be unusual compared to vanilla swift use. Mihael On Thu, 2010-11-18 at 21:57 -0800, Mihael Hategan wrote: > I was ready to blame cosmic rays, but this seems to be a pretty common > occurrence in your log. So I'm on it. > > mike at blabla2 tmp$ cat worker-20101117-1538-fe9aq209.log|grep JOB_START | > awk '{print $7 " " $13}'|sort|uniq > tr=worker0 host=BNL-ATLAS_gridgk01.racf.bnl.gov > tr=worker0 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > tr=worker1 host=BNL-ATLAS_gridgk02.racf.bnl.gov > tr=worker10 host=Nebraska_red.unl.edu > tr=worker11 host=Prairiefire_pf-grid.unl.edu > tr=worker12 host=Purdue-RCAC_osg.rcac.purdue.edu > tr=worker13 host=GridUNESP_CENTRAL_ce.grid.unesp.br > tr=worker13 host=RENCI-Engagement_belhaven-1.renci.org > tr=worker14 host=SBGrid-Harvard-East_osg-east.hms.harvard.edu > tr=worker15 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > tr=worker15 host=SPRACE_osg-ce.sprace.org.br > tr=worker16 host=UCHC_CBG_vdgateway.vcell.uchc.edu > tr=worker17 host=UCR-HEP_top.ucr.edu > tr=worker18 host=UFlorida-HPC_osg.hpc.ufl.edu > tr=worker19 host=UFlorida-PG_pg.ihepa.ufl.edu > tr=worker2 host=FNAL_FERMIGRID_fnpcosg1.fnal.gov > tr=worker20 host=MIT_CMS_ce01.cmsaf.mit.edu > tr=worker20 host=UMissHEP_umiss001.hep.olemiss.edu > tr=worker21 host=Firefly_ff-grid3.unl.edu > tr=worker21 host=Nebraska_red.unl.edu > tr=worker21 host=UTA_SWT2_gk04.swt2.uta.edu > tr=worker22 host=WQCG-Harvard-OSG_tuscany.med.harvard.edu > tr=worker3 host=Firefly_ff-grid3.unl.edu > tr=worker3 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > tr=worker4 host=GridUNESP_CENTRAL_ce.grid.unesp.br > tr=worker5 host=Firefly_ff-grid3.unl.edu > tr=worker5 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > tr=worker6 host=MIT_CMS_ce01.cmsaf.mit.edu > tr=worker7 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > tr=worker7 host=MIT_CMS_ce02.cmsaf.mit.edu > tr=worker8 host=NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu > tr=worker9 host=Nebraska_gpn-husker.unl.edu > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Sun Nov 21 17:10:15 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 21 Nov 2010 17:10:15 -0600 (CST) Subject: [Swift-devel] misassignment of jobs In-Reply-To: <1290380186.26914.1.camel@blabla2.none> Message-ID: <998123415.101613.1290381015065.JavaMail.root@zimbra.anl.gov> Mihael, If you're in fixin' mode, I'll spend some time now trying to reproduce the 3 coaster problems that are high on my "needed for users" list: 1. Swift hangs/fails talking to persistent server if it sites idle for a few minutes, even with large timeout values (which were possibly not set correctly or fully). 2. With normal coaster mode, if workers start toiming out for lack of work, the Swift run dies. 3. Errors in provider staging at high volume. If you already have test cases for these issues, let me know, and I'll focus on the missing ones. But Im assuming for now you need all three. - Mike ----- Original Message ----- > Sadly though, I can't reproduce this. > > Can you give me more details, such as the swift script, the version of > swift used, and anything that would be unusual compared to vanilla > swift > use. > > Mihael > > On Thu, 2010-11-18 at 21:57 -0800, Mihael Hategan wrote: > > I was ready to blame cosmic rays, but this seems to be a pretty > > common > > occurrence in your log. 
So I'm on it. > > > > mike at blabla2 tmp$ cat worker-20101117-1538-fe9aq209.log|grep > > JOB_START | > > awk '{print $7 " " $13}'|sort|uniq > > tr=worker0 host=BNL-ATLAS_gridgk01.racf.bnl.gov > > tr=worker0 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > > tr=worker1 host=BNL-ATLAS_gridgk02.racf.bnl.gov > > tr=worker10 host=Nebraska_red.unl.edu > > tr=worker11 host=Prairiefire_pf-grid.unl.edu > > tr=worker12 host=Purdue-RCAC_osg.rcac.purdue.edu > > tr=worker13 host=GridUNESP_CENTRAL_ce.grid.unesp.br > > tr=worker13 host=RENCI-Engagement_belhaven-1.renci.org > > tr=worker14 host=SBGrid-Harvard-East_osg-east.hms.harvard.edu > > tr=worker15 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > > tr=worker15 host=SPRACE_osg-ce.sprace.org.br > > tr=worker16 host=UCHC_CBG_vdgateway.vcell.uchc.edu > > tr=worker17 host=UCR-HEP_top.ucr.edu > > tr=worker18 host=UFlorida-HPC_osg.hpc.ufl.edu > > tr=worker19 host=UFlorida-PG_pg.ihepa.ufl.edu > > tr=worker2 host=FNAL_FERMIGRID_fnpcosg1.fnal.gov > > tr=worker20 host=MIT_CMS_ce01.cmsaf.mit.edu > > tr=worker20 host=UMissHEP_umiss001.hep.olemiss.edu > > tr=worker21 host=Firefly_ff-grid3.unl.edu > > tr=worker21 host=Nebraska_red.unl.edu > > tr=worker21 host=UTA_SWT2_gk04.swt2.uta.edu > > tr=worker22 host=WQCG-Harvard-OSG_tuscany.med.harvard.edu > > tr=worker3 host=Firefly_ff-grid3.unl.edu > > tr=worker3 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > > tr=worker4 host=GridUNESP_CENTRAL_ce.grid.unesp.br > > tr=worker5 host=Firefly_ff-grid3.unl.edu > > tr=worker5 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > > tr=worker6 host=MIT_CMS_ce01.cmsaf.mit.edu > > tr=worker7 host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu > > tr=worker7 host=MIT_CMS_ce02.cmsaf.mit.edu > > tr=worker8 host=NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu > > tr=worker9 host=Nebraska_gpn-husker.unl.edu > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Nov 21 19:31:18 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 21 Nov 2010 17:31:18 -0800 Subject: [Swift-devel] misassignment of jobs In-Reply-To: <998123415.101613.1290381015065.JavaMail.root@zimbra.anl.gov> References: <998123415.101613.1290381015065.JavaMail.root@zimbra.anl.gov> Message-ID: <1290389478.27403.1.camel@blabla2.none> On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote: > Mihael, > > If you're in fixin' mode, I've been in fixin' mode for the past two months :) > I'll spend some time now trying to reproduce the 3 coaster problems that are high on my "needed for users" list: > > 1. Swift hangs/fails talking to persistent server if it sites idle for > a few minutes, even with large timeout values (which were possibly not > set correctly or fully). > > 2. With normal coaster mode, if workers start toiming out for lack of work, the Swift run dies. That one is addressed by removing the worker timeout. As I mentioned in a previous email, that timeout is a artifact of an older worker management scheme. > > 3. Errors in provider staging at high volume. > > If you already have test cases for these issues, let me know, and I'll > focus on the missing ones. 
But Im assuming for now you need all three. I have test cases for 1 and 3. I couldn't reproduce the problems so far. Mihael From wilde at mcs.anl.gov Sun Nov 21 19:37:18 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 21 Nov 2010 19:37:18 -0600 (CST) Subject: [Swift-devel] misassignment of jobs In-Reply-To: <1290389478.27403.1.camel@blabla2.none> Message-ID: <139095612.101784.1290389838078.JavaMail.root@zimbra.anl.gov> OK, re bug 2: I didnt connect the symptoms of this issue with your earlier comments on timeouts, and just verified that you are correct: with the same extended timeouts I was using to try to keep a persistent coaster service up for an extended time, the failing case for bug 2 works. I'll try to reproduce bug 1 now, then 3. - Mike ----- Original Message ----- > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote: > > Mihael, > > > > If you're in fixin' mode, > > I've been in fixin' mode for the past two months :) > > > I'll spend some time now trying to reproduce the 3 coaster problems > > that are high on my "needed for users" list: > > > > 1. Swift hangs/fails talking to persistent server if it sites idle > > for > > a few minutes, even with large timeout values (which were possibly > > not > > set correctly or fully). > > > > 2. With normal coaster mode, if workers start toiming out for lack > > of work, the Swift run dies. > > That one is addressed by removing the worker timeout. As I mentioned > in > a previous email, that timeout is a artifact of an older worker > management scheme. > > > > > 3. Errors in provider staging at high volume. > > > > If you already have test cases for these issues, let me know, and > > I'll > > focus on the missing ones. But Im assuming for now you need all > > three. > > I have test cases for 1 and 3. I couldn't reproduce the problems so > far. > > Mihael -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Nov 21 20:37:48 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 21 Nov 2010 18:37:48 -0800 Subject: [Swift-devel] misassignment of jobs In-Reply-To: <139095612.101784.1290389838078.JavaMail.root@zimbra.anl.gov> References: <139095612.101784.1290389838078.JavaMail.root@zimbra.anl.gov> Message-ID: <1290393468.27675.0.camel@blabla2.none> Ok. I will remove the idle timeouts from the worker. I do not expect any negative consequences there given the reasoning I outlined before. Mihael On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote: > OK, re bug 2: I didnt connect the symptoms of this issue with your earlier comments on timeouts, and just verified that you are correct: with the same extended timeouts I was using to try to keep a persistent coaster service up for an extended time, the failing case for bug 2 works. > > I'll try to reproduce bug 1 now, then 3. > > - Mike > > > ----- Original Message ----- > > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote: > > > Mihael, > > > > > > If you're in fixin' mode, > > > > I've been in fixin' mode for the past two months :) > > > > > I'll spend some time now trying to reproduce the 3 coaster problems > > > that are high on my "needed for users" list: > > > > > > 1. Swift hangs/fails talking to persistent server if it sites idle > > > for > > > a few minutes, even with large timeout values (which were possibly > > > not > > > set correctly or fully). > > > > > > 2. 
With normal coaster mode, if workers start toiming out for lack > > > of work, the Swift run dies. > > > > That one is addressed by removing the worker timeout. As I mentioned > > in > > a previous email, that timeout is a artifact of an older worker > > management scheme. > > > > > > > > 3. Errors in provider staging at high volume. > > > > > > If you already have test cases for these issues, let me know, and > > > I'll > > > focus on the missing ones. But Im assuming for now you need all > > > three. > > > > I have test cases for 1 and 3. I couldn't reproduce the problems so > > far. > > > > Mihael > From wilde at mcs.anl.gov Sun Nov 21 20:45:38 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 21 Nov 2010 20:45:38 -0600 (CST) Subject: [Swift-devel] misassignment of jobs In-Reply-To: <1290393468.27675.0.camel@blabla2.none> Message-ID: <180644543.101847.1290393938265.JavaMail.root@zimbra.anl.gov> I was testing with the two mods below in place (long values in both worker timeout and service timeout). - Mike login1$ pwd /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/provider-coaster login1$ login1$ svn diff Index: src/org/globus/cog/abstraction/coaster/service/CoasterService.java =================================================================== --- src/org/globus/cog/abstraction/coaster/service/CoasterService.java (revision 2932) +++ src/org/globus/cog/abstraction/coaster/service/CoasterService.java (working copy) @@ -41,7 +41,7 @@ public static final Logger logger = Logger .getLogger(CoasterService.class); - public static final int IDLE_TIMEOUT = 120 * 1000; + public static final int IDLE_TIMEOUT = 120 * 1000 /* extend it: */ * 30 * 240; public static final int CONNECT_TIMEOUT = 2 * 60 * 1000; Index: resources/worker.pl =================================================================== --- resources/worker.pl (revision 2932) +++ resources/worker.pl (working copy) @@ -123,7 +123,7 @@ my $URISTR=$ARGV[0]; my $BLOCKID=$ARGV[1]; my $LOGDIR=$ARGV[2]; -my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60) : $ARGV[3]; +my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60 * 60 * 24) : $ARGV[3]; # REQUESTS holds a map of incoming requests login1$ ----- Original Message ----- > Ok. I will remove the idle timeouts from the worker. I do not expect > any > negative consequences there given the reasoning I outlined before. > > Mihael > > On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote: > > OK, re bug 2: I didnt connect the symptoms of this issue with your > > earlier comments on timeouts, and just verified that you are > > correct: with the same extended timeouts I was using to try to keep > > a persistent coaster service up for an extended time, the failing > > case for bug 2 works. > > > > I'll try to reproduce bug 1 now, then 3. > > > > - Mike > > > > > > ----- Original Message ----- > > > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote: > > > > Mihael, > > > > > > > > If you're in fixin' mode, > > > > > > I've been in fixin' mode for the past two months :) > > > > > > > I'll spend some time now trying to reproduce the 3 coaster > > > > problems > > > > that are high on my "needed for users" list: > > > > > > > > 1. Swift hangs/fails talking to persistent server if it sites > > > > idle > > > > for > > > > a few minutes, even with large timeout values (which were > > > > possibly > > > > not > > > > set correctly or fully). > > > > > > > > 2. With normal coaster mode, if workers start toiming out for > > > > lack > > > > of work, the Swift run dies. 
> > > > > > That one is addressed by removing the worker timeout. As I > > > mentioned > > > in > > > a previous email, that timeout is a artifact of an older worker > > > management scheme. > > > > > > > > > > > 3. Errors in provider staging at high volume. > > > > > > > > If you already have test cases for these issues, let me know, > > > > and > > > > I'll > > > > focus on the missing ones. But Im assuming for now you need all > > > > three. > > > > > > I have test cases for 1 and 3. I couldn't reproduce the problems > > > so > > > far. > > > > > > Mihael > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Nov 21 20:49:13 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 21 Nov 2010 18:49:13 -0800 Subject: [Swift-devel] misassignment of jobs In-Reply-To: <180644543.101847.1290393938265.JavaMail.root@zimbra.anl.gov> References: <180644543.101847.1290393938265.JavaMail.root@zimbra.anl.gov> Message-ID: <1290394153.27777.1.camel@blabla2.none> Right. I would hold off on the service timeout. My tests show that it has no impact, and, in theory, that both shouldn't have an impact and it should not be removed. Mihael On Sun, 2010-11-21 at 20:45 -0600, Michael Wilde wrote: > I was testing with the two mods below in place (long values in both worker timeout and service timeout). > > - Mike > > login1$ pwd > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/provider-coaster > login1$ > > login1$ svn diff > Index: src/org/globus/cog/abstraction/coaster/service/CoasterService.java > =================================================================== > --- src/org/globus/cog/abstraction/coaster/service/CoasterService.java (revision 2932) > +++ src/org/globus/cog/abstraction/coaster/service/CoasterService.java (working copy) > @@ -41,7 +41,7 @@ > public static final Logger logger = Logger > .getLogger(CoasterService.class); > > - public static final int IDLE_TIMEOUT = 120 * 1000; > + public static final int IDLE_TIMEOUT = 120 * 1000 /* extend it: */ * 30 * 240; > > public static final int CONNECT_TIMEOUT = 2 * 60 * 1000; > > Index: resources/worker.pl > =================================================================== > --- resources/worker.pl (revision 2932) > +++ resources/worker.pl (working copy) > @@ -123,7 +123,7 @@ > my $URISTR=$ARGV[0]; > my $BLOCKID=$ARGV[1]; > my $LOGDIR=$ARGV[2]; > -my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60) : $ARGV[3]; > +my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60 * 60 * 24) : $ARGV[3]; > > > # REQUESTS holds a map of incoming requests > login1$ > > > ----- Original Message ----- > > Ok. I will remove the idle timeouts from the worker. I do not expect > > any > > negative consequences there given the reasoning I outlined before. > > > > Mihael > > > > On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote: > > > OK, re bug 2: I didnt connect the symptoms of this issue with your > > > earlier comments on timeouts, and just verified that you are > > > correct: with the same extended timeouts I was using to try to keep > > > a persistent coaster service up for an extended time, the failing > > > case for bug 2 works. > > > > > > I'll try to reproduce bug 1 now, then 3. 
> > > > > > - Mike > > > > > > > > > ----- Original Message ----- > > > > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote: > > > > > Mihael, > > > > > > > > > > If you're in fixin' mode, > > > > > > > > I've been in fixin' mode for the past two months :) > > > > > > > > > I'll spend some time now trying to reproduce the 3 coaster > > > > > problems > > > > > that are high on my "needed for users" list: > > > > > > > > > > 1. Swift hangs/fails talking to persistent server if it sites > > > > > idle > > > > > for > > > > > a few minutes, even with large timeout values (which were > > > > > possibly > > > > > not > > > > > set correctly or fully). > > > > > > > > > > 2. With normal coaster mode, if workers start toiming out for > > > > > lack > > > > > of work, the Swift run dies. > > > > > > > > That one is addressed by removing the worker timeout. As I > > > > mentioned > > > > in > > > > a previous email, that timeout is a artifact of an older worker > > > > management scheme. > > > > > > > > > > > > > > 3. Errors in provider staging at high volume. > > > > > > > > > > If you already have test cases for these issues, let me know, > > > > > and > > > > > I'll > > > > > focus on the missing ones. But Im assuming for now you need all > > > > > three. > > > > > > > > I have test cases for 1 and 3. I couldn't reproduce the problems > > > > so > > > > far. > > > > > > > > Mihael > > > > From wilde at mcs.anl.gov Sun Nov 21 21:00:04 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 21 Nov 2010 21:00:04 -0600 (CST) Subject: [Swift-devel] Persistent coaster service fails after several runs In-Reply-To: <1290394153.27777.1.camel@blabla2.none> Message-ID: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov> subject was: Re: [Swift-devel] misassignment of jobs Re the service-side timeout, OK, will do. Ive just re-created bug1, but its a little different than I thought. Swift runs to the persistent coaster server lock up (ie fail to progress) and then get errors, not after a delay, but seemingly randomly. Thats likely why I was misled into thinking it was delay related. I started a coaster server on localhost with one worker.pl. I then run catsn.swift against it with various n (# of cat jobs) including 1, 10 , and 100. The first several (5-10) swift runs work fine. Then I let it sleep 5 mins and tried again. That too worked fine. But then, after a few more runs, things hang. Here's all the logs and details if you want to look into this particular run. 
working in /home/wilde/swift/lab, on pads login1

The latest .log in this listing is the failing case; the others worked (against the same persistent server):

login1$ ls -lt *.log | head -20
-rw-r--r-- 1 wilde ci-users  95478 Nov 21 20:41 catsn-20101121-2039-1yfngygc.log
-rw-r--r-- 1 wilde ci-users  36085 Nov 21 20:39 swift.log
-rw-r--r-- 1 wilde ci-users 272734 Nov 21 20:37 catsn-20101121-2037-7uk5fj33.log
-rw-r--r-- 1 wilde ci-users 272644 Nov 21 20:37 catsn-20101121-2037-j8xq9aie.log
-rw-r--r-- 1 wilde ci-users 272468 Nov 21 20:36 catsn-20101121-2036-4y0tnimd.log
-rw-r--r-- 1 wilde ci-users  31317 Nov 21 20:36 catsn-20101121-2036-opcvomk4.log
-rw-r--r-- 1 wilde ci-users   7183 Nov 21 20:36 catsn-20101121-2036-u59brtm4.log
-rw-r--r-- 1 wilde ci-users   7183 Nov 21 20:35 catsn-20101121-2035-360kh03b.log
-rw-r--r-- 1 wilde ci-users   7351 Nov 21 20:35 catsn-20101121-2035-8lttnn88.log
-rw-r--r-- 1 wilde ci-users   7183 Nov 21 20:30 catsn-20101121-2030-ddmo6gt3.log
-rw-r--r-- 1 wilde ci-users   7267 Nov 21 20:29 catsn-20101121-2029-sq8y6cnb.log
-rw-r--r-- 1 wilde ci-users   7179 Nov 21 20:29 catsn-20101121-2029-3su2x8v9.log
-rw-r--r-- 1 wilde ci-users   7183 Nov 21 20:29 catsn-20101121-2029-z0g50i50.log
-rw-r--r-- 1 wilde ci-users   7267 Nov 21 20:29 catsn-20101121-2029-5x6pbkde.log

The worker and service logs are in: /tmp/wilde/Swift/{server,worker}

swift is: /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift

The test runs were all of this form, with various n as above:

login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100

I started the persistent coaster with the somewhat ugly script:

/home/wilde/swift/lab/pecos/start-mcs

(which runs a dummy job to force the server to passive mode, for the general case of workers joining and leaving the service)

I'll clean this up for reproducibility if you can't spot the issue from these logs.
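(For reference, the overall shape of the test is sketched below. This is illustrative only, not the actual start-mcs script; the worker block id "0001" is made up, and it assumes the persistent service is already up in passive mode on localhost:1985 as described above.)

# attach one worker to the persistent service; worker.pl takes the service
# URI, a block/worker id, and a log directory (see its argument handling in
# the worker.pl diff earlier in this thread)
perl worker.pl http://localhost:1985 0001 /tmp/wilde/Swift/worker &

# then repeat catsn runs against it, with idle gaps in between
cd /home/wilde/swift/lab
for n in 1 10 100; do
    swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=$n
    sleep 300
done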
Lastly, the last few runs, including the failing one, gave this on stdout/err: login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100 Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally) RunID: 20101121-2037-j8xq9aie Progress: Find: http://localhost:1985 Find: keepalive(120), reconnect - http://localhost:1985 Progress: Selecting site:64 Submitting:3 Submitted:25 Active:4 Finished successfully:4 Progress: Selecting site:52 Submitted:28 Active:3 Checking status:1 Finished successfully:16 Progress: Selecting site:36 Submitting:3 Submitted:25 Active:4 Finished successfully:32 Progress: Selecting site:23 Submitted:28 Active:3 Checking status:1 Finished successfully:45 Progress: Selecting site:7 Submitted:27 Active:3 Checking status:1 Finished successfully:62 Progress: Submitted:14 Active:2 Stage out:3 Finished successfully:81 Progress: Submitted:3 Stage out:3 Finished successfully:94 Final status: Finished successfully:100 login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100 Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally) RunID: 20101121-2037-7uk5fj33 Progress: Find: http://localhost:1985 Find: keepalive(120), reconnect - http://localhost:1985 Progress: Selecting site:64 Submitted:28 Active:3 Checking status:1 Finished successfully:4 Progress: Selecting site:48 Submitting:3 Submitted:25 Active:4 Finished successfully:20 Progress: Selecting site:36 Submitted:28 Active:3 Checking status:1 Finished successfully:32 Progress: Selecting site:23 Submitted:24 Active:4 Stage out:3 Finished successfully:46 Progress: Selecting site:6 Submitted:28 Active:3 Checking status:1 Finished successfully:62 Progress: Submitted:17 Active:3 Checking status:1 Finished successfully:79 Progress: Submitted:3 Active:1 Stage out:3 Finished successfully:93 Final status: Finished successfully:100 login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100 Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally) RunID: 20101121-2039-1yfngygc Progress: Find: http://localhost:1985 Find: keepalive(120), reconnect - http://localhost:1985 Progress: Selecting site:68 Submitting:32 Progress: Selecting site:68 Submitting:32 Progress: Selecting site:68 Submitting:32 Progress: Selecting site:68 Submitting:32 Command(1, CHANNELCONFIG): handling reply timeout; sendReqTime=101121-203902.376, sendTime=101121-203902.377, now=101121-204102.399 Command(1, CHANNELCONFIG)fault was: Reply timeout org.globus.cog.karajan.workflow.service.ReplyTimeoutException at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280) at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Progress: Selecting site:68 Submitting:31 Failed but can retry:1 login1$ ----- - Mike ----- Original Message ----- > Right. I would hold off on the service timeout. My tests show that it > has no impact, and, in theory, that both shouldn't have an impact and > it > should not be removed. > > Mihael > > On Sun, 2010-11-21 at 20:45 -0600, Michael Wilde wrote: > > I was testing with the two mods below in place (long values in both > > worker timeout and service timeout). 
> > > > - Mike > > > > login1$ pwd > > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/provider-coaster > > login1$ > > > > login1$ svn diff > > Index: > > src/org/globus/cog/abstraction/coaster/service/CoasterService.java > > =================================================================== > > --- > > src/org/globus/cog/abstraction/coaster/service/CoasterService.java > > (revision 2932) > > +++ > > src/org/globus/cog/abstraction/coaster/service/CoasterService.java > > (working copy) > > @@ -41,7 +41,7 @@ > > public static final Logger logger = Logger > > .getLogger(CoasterService.class); > > > > - public static final int IDLE_TIMEOUT = 120 * 1000; > > + public static final int IDLE_TIMEOUT = 120 * 1000 /* extend it: */ > > * 30 * 240; > > > > public static final int CONNECT_TIMEOUT = 2 * 60 * 1000; > > > > Index: resources/worker.pl > > =================================================================== > > --- resources/worker.pl (revision 2932) > > +++ resources/worker.pl (working copy) > > @@ -123,7 +123,7 @@ > > my $URISTR=$ARGV[0]; > > my $BLOCKID=$ARGV[1]; > > my $LOGDIR=$ARGV[2]; > > -my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60) : $ARGV[3]; > > +my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60 * 60 * 24) : $ARGV[3]; > > > > > > # REQUESTS holds a map of incoming requests > > login1$ > > > > > > ----- Original Message ----- > > > Ok. I will remove the idle timeouts from the worker. I do not > > > expect > > > any > > > negative consequences there given the reasoning I outlined before. > > > > > > Mihael > > > > > > On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote: > > > > OK, re bug 2: I didnt connect the symptoms of this issue with > > > > your > > > > earlier comments on timeouts, and just verified that you are > > > > correct: with the same extended timeouts I was using to try to > > > > keep > > > > a persistent coaster service up for an extended time, the > > > > failing > > > > case for bug 2 works. > > > > > > > > I'll try to reproduce bug 1 now, then 3. > > > > > > > > - Mike > > > > > > > > > > > > ----- Original Message ----- > > > > > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote: > > > > > > Mihael, > > > > > > > > > > > > If you're in fixin' mode, > > > > > > > > > > I've been in fixin' mode for the past two months :) > > > > > > > > > > > I'll spend some time now trying to reproduce the 3 coaster > > > > > > problems > > > > > > that are high on my "needed for users" list: > > > > > > > > > > > > 1. Swift hangs/fails talking to persistent server if it > > > > > > sites > > > > > > idle > > > > > > for > > > > > > a few minutes, even with large timeout values (which were > > > > > > possibly > > > > > > not > > > > > > set correctly or fully). > > > > > > > > > > > > 2. With normal coaster mode, if workers start toiming out > > > > > > for > > > > > > lack > > > > > > of work, the Swift run dies. > > > > > > > > > > That one is addressed by removing the worker timeout. As I > > > > > mentioned > > > > > in > > > > > a previous email, that timeout is a artifact of an older > > > > > worker > > > > > management scheme. > > > > > > > > > > > > > > > > > 3. Errors in provider staging at high volume. > > > > > > > > > > > > If you already have test cases for these issues, let me > > > > > > know, > > > > > > and > > > > > > I'll > > > > > > focus on the missing ones. But Im assuming for now you need > > > > > > all > > > > > > three. > > > > > > > > > > I have test cases for 1 and 3. I couldn't reproduce the > > > > > problems > > > > > so > > > > > far. 
> > > > > Mihael
> > >
> >

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Sun Nov 21 21:06:22 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 21 Nov 2010 19:06:22 -0800
Subject: [Swift-devel] Re: Persistent coaster service fails after several runs
In-Reply-To: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov>
References: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290395182.27886.0.camel@blabla2.none>

[hategan at login1 Swift]$ cd server
-bash: cd: server: Permission denied

On Sun, 2010-11-21 at 21:00 -0600, Michael Wilde wrote:
>
> The worker and service logs are in: /tmp/wilde/Swift/{server,worker}
>

From wilde at mcs.anl.gov Sun Nov 21 21:59:51 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 21 Nov 2010 21:59:51 -0600 (CST)
Subject: [Swift-devel] Re: Persistent coaster service fails after several runs
In-Reply-To: <1290395182.27886.0.camel@blabla2.none>
Message-ID: <1935641716.101928.1290398391418.JavaMail.root@zimbra.anl.gov>

sorry, fixed.

----- Original Message -----
> [hategan at login1 Swift]$ cd server
> -bash: cd: server: Permission denied
>
>
> On Sun, 2010-11-21 at 21:00 -0600, Michael Wilde wrote:
>
> > The worker and service logs are in: /tmp/wilde/Swift/{server,worker}
> >

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Sun Nov 21 23:10:47 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 21 Nov 2010 23:10:47 -0600 (CST)
Subject: [Swift-devel] Provider staging error in long-running test
In-Reply-To: <602179782.102075.1290402309752.JavaMail.root@zimbra.anl.gov>
Message-ID: <387720139.102083.1290402647892.JavaMail.root@zimbra.anl.gov>

Mihael, here is bug 3:

I was testing a foreach loop doing a cat of 10,000 input files of sizes up to about 300-400K each. The test hit an error after around 3,491 files:

Progress:  Selecting site:1008  Submitted:12  Active:3  Finished successfully:3476
Progress:  Selecting site:1008  Submitted:13  Active:3  Finished successfully:3491
Failed to shut down channel
java.lang.NullPointerException
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57)
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.<init>(AbstractKarajanChannel.java:52)

The test was executed on PADS login1 like this:

cd /home/wilde/swift/lab
./run.local.coast.ps.sh catsall

log file: catsall-20101121-2239-oc2flmn0.log

sites.xml (the XML tags were eaten in the archive; only the element values survive):

    8
    1
    1
    .15
    10000
    proxy
    /scratch/local/wilde/pstest/swiftwork

login1$ cat cf
wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false

login1$ cat catsall.swift
type file;

app (file o) cat (file i)
{
  cat @i stdout=@o;
}

file infile[] ;
file outfile[] ;

foreach f, i in infile {
  outfile[i] = cat(f);
}
login1$

login1$ which swift
/scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift
login1$ java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
login1$

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wozniak at mcs.anl.gov Mon Nov 22 16:31:38 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 22 Nov 2010 16:31:38 -0600 (CST)
Subject: [Swift-devel] Re: Problems with provider.staging.pin.swiftfiles
In-Reply-To: 
References: <1669558788.45842.1289488275177.JavaMail.root@zimbra.anl.gov>
Message-ID: 

Hello
This should be corrected in trunk now.
Justin

On Thu, 11 Nov 2010, Justin M Wozniak wrote:

> Hello
> Yes, this was broken a few weeks ago- I will try to restore it ASAP.
> (Cf. swift-devel post from 9/27.)
> Justin
>
> On Thu, 11 Nov 2010, Michael Wilde wrote:
>
>> Justin,
>>
>> When Tom Uram turns on this option and runs a simple test script (a foreach
>> and an app that just collects node info), he gets an error "520" returned
>> in the swift log, as if from the app. I am thinking that the 520 is somehow
>> coming from worker.
>>
>> This is going from bridled to Eureka worker nodes, with provider staging
>> turned on in proxy mode.
>>
>> When we set provider.staging.pin.swiftfiles to false, the script runs ok.
>>
>> We'll need to collect and send logs and a test case, but I wanted to alert
>> you to a potential problem with this feature.
>>
>> - Mike

--
Justin M Wozniak

From hategan at mcs.anl.gov Mon Nov 22 20:29:57 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 22 Nov 2010 18:29:57 -0800
Subject: [Swift-devel] Re: Provider staging error in long-running test
In-Reply-To: <387720139.102083.1290402647892.JavaMail.root@zimbra.anl.gov>
References: <387720139.102083.1290402647892.JavaMail.root@zimbra.anl.gov>
Message-ID: <1290479397.30590.1.camel@blabla2.none>

Ok. So that doesn't look like it's a staging problem specifically, but
more like something with the comm library. I'll have to look at the
logs. And I can foresee some free time coming in a couple of days just
for that!

Mihael

On Sun, 2010-11-21 at 23:10 -0600, Michael Wilde wrote:
> Mihael, here is bug 3:
>
> I was testing a foreach loop doing a cat of 10,000 input files of sizes up to about 300-400K each.
The test hit an error after around 3,491 files: > > Progress: Selecting site:1008 Submitted:12 Active:3 Finished successfully:3476 > Progress: Selecting site:1008 Submitted:13 Active:3 Finished successfully:3491 > Failed to shut down channel > java.lang.NullPointerException > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57) > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.(AbstractKarajanChannel.java:52) > > The test was executed on PADS login1 like this: > > cd /home/wilde/swift/lab > ./run.local.coast.ps.sh catsall > > log file: catsall-20101121-2239-oc2flmn0.log > > sites.xml: > > > > > > > 8 > 1 > 1 > .15 > 10000 > proxy > /scratch/local/wilde/pstest/swiftwork > > > > login1$ cat cf > wrapperlog.always.transfer=true > sitedir.keep=true > execution.retries=0 > lazy.errors=false > status.mode=provider > use.provider.staging=true > provider.staging.pin.swiftfiles=false > login1$ cat catsall.swift > type file; > > app (file o) cat (file i) > { > cat @i stdout=@o; > } > > file infile[] ; > file outfile[] ; > > foreach f, i in infile { > outfile[i] = cat(f); > } > login1$ > > login1$ which swift > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift > login1$ java -version > java version "1.6.0_22" > Java(TM) SE Runtime Environment (build 1.6.0_22-b04) > Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode) > login1$ > > From bugzilla-daemon at mcs.anl.gov Tue Nov 23 00:49:16 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 23 Nov 2010 00:49:16 -0600 (CST) Subject: [Swift-devel] [Bug 182] Error messages summarized at end of Swift output should also be printed when they occur In-Reply-To: References: Message-ID: <20101123064916.A613F2CD6F@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=182 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the reporter. From damitha119 at gmail.com Tue Nov 23 12:17:34 2010 From: damitha119 at gmail.com (Damitha Wimalasooriya) Date: Tue, 23 Nov 2010 23:47:34 +0530 Subject: [Swift-devel] adding a source file Message-ID: I have coded a source code for a new method. But still I don't know to add that file to the prevailing libraries. Can somebody help me to add this file and test it. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bugzilla-daemon at mcs.anl.gov Tue Nov 23 12:51:42 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 23 Nov 2010 12:51:42 -0600 (CST) Subject: [Swift-devel] [Bug 31] error message should not refer to java exception classes In-Reply-To: References: Message-ID: <20101123185142.C70322CD0C@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=31 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED AssignedTo|hategan at mcs.anl.gov |skenny at uchicago.edu Summary|error message when mapper |error message should not |parameter is wrong |refer to java exception | |classes -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Tue Nov 23 14:10:42 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 23 Nov 2010 14:10:42 -0600 (CST) Subject: [Swift-devel] Next Swift release In-Reply-To: <577622484.117049.1290542892759.JavaMail.root@zimbra.anl.gov> Message-ID: <1278622160.117059.1290543042855.JavaMail.root@zimbra.anl.gov> All, Sarah is going to take the lead in producing the next Swift release, and will propose a release definition and plan. We want to have the release done by Dec 20. - Mike From bugzilla-daemon at mcs.anl.gov Tue Nov 23 14:36:37 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 23 Nov 2010 14:36:37 -0600 (CST) Subject: [Swift-devel] [Bug 235] cryptic error message when app is not specified in tc.data In-Reply-To: References: Message-ID: <20101123203637.80D782CDCE@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=235 skenny changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |swift-devel at ci.uchicago.edu -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From bugzilla-daemon at mcs.anl.gov Tue Nov 23 14:39:55 2010 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 23 Nov 2010 14:39:55 -0600 (CST) Subject: [Swift-devel] [Bug 235] cryptic error message when app is not specified in tc.data In-Reply-To: References: Message-ID: <20101123203955.638512CB38@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=235 --- Comment #1 from skenny 2010-11-23 14:39:55 --- new error is: RunID: 20101123-1225-2xvzsta7 Progress: Execution failed: The application "RInvokee" is not available in your tc.data catalog -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. From wilde at mcs.anl.gov Wed Nov 24 08:35:16 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 24 Nov 2010 08:35:16 -0600 (CST) Subject: [Swift-devel] Hi all, In-Reply-To: Message-ID: <1167986945.119613.1290609316657.JavaMail.root@zimbra.anl.gov> Chaturanga, The only help we can offer to new or potential Swift users is help on specific problems or questions you have on using Swift. If you send to the swift-user list specific Swift code that you have tried, and any errors Swift is giving you, list members will try to help as time permits. As with any user list, we can not guarantee you an answer in the time you need it. 
Regarding Swift logging, you will need to understand how to use Java log4j to use Swift logging. Once you understand that, then logging can be controlled by settings in the property files in the Swift etc/ directory. I am sorry, but I will not be able to answer any further questions unless they are specific enough and show that you have read the Swift documentation carefully, completely executed the tutorial examples, and have made an effort to debug any problems that occur - just as you would have to to learn any other new computer language. That takes a lot of work, and we can not do this work for you. You need to work with your professor or teaching assistants to get guidance on how to complete your class project. - Mike ----- Original Message ----- Sir, Thank you. I'm doing this for class credit. I asked it, they said OK. >You should try to use them in a few programs of your own, to verify your understanding of how they work. And build swift with extra >logging turned on (or added) to trace mapper activity. I tried with my own programs. What did you mean ' build swift with extra logging turned on (or added) to trace mapper activity' ? Is it the same thing that's mentioned in the tutorial under 'writing a mapper' ? (I did them and worked.) Do I have some specific things to do with mappers ? Chaturanga On Tue, Nov 23, 2010 at 8:59 AM, Michael Wilde < wilde at mcs.anl.gov > wrote: ----- Original Message ----- > Michael, > > I was involving some exam stuffs in last few weeks at university. So, > was unable to spend more time with Swift. Hopefully now I can spend my > whole time with Swift. > I did the tutorial and read the user guide. It was interesting but had > some problems while doing. Can you report these to the swift-user list, please, so we can try to fix them? > > I'm interested to do the project 'Re-work the mapper naming > conventions to make code more readable and less verbose, and to fix > some broken mapper semantics' as you had mentioned earlier. OK. That may be a pretty difficult project, but it could perhaps be done incrementally, one small improvement at a time. What is your goal in this project? Are you doing it for class credit (in which case your professor should give you some guidance as to whether it is a good idea), or for learning more programming skills? > I looked at Mappers from the tutorial and user guide and was able to > understand them. You should try to use them in a few programs of your own, to verify your understanding of how they work. And build swift with extra logging turned on (or added) to trace mapper activity. > Can't we implement arrays in a similar manner described in tutorial > for the mappers or does it has another way? I mean by using an > Abstract class from java. Can you clarify what you mean? Java is not visible from Swift - its below the Karajan implementation. And its not clear what you mean by "cant we implement arrays..." The purpose of this particular project is to implement better mappers, not arrays, right? - Mike > > Sir, I'm pretty new to these things. > > If you can share some suggestions it will be very much helpful to me. > > Thank You. > > > > On Tue, Nov 23, 2010 at 5:11 AM, Chaturanga Wimalarathne < > chaturanganamal at gmail.com > wrote: > > > Michael, > > This is really helpful. I have already prepared much for this project > regarding SWIFT. I downloaded swift tutorial and user guide. I'll go > through those and will choose a suitable project. I also downloaded > the source code through Subversion. 
I was almost going to change the > project. Thank you for the suggestions. I will definitely choose one > project from this list and Let you know ASAP. And I am sure I can > count on your continued assistance. Thanks Again. > > Chaturanga > > > -- > Chaturanga Namal > Department of Computer Science and Engineering > University of Moratuwa > Sri Lanka -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -- Chaturanga Namal Department of Computer Science and Engineering University of Moratuwa Sri Lanka -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: From dk0966 at cs.ship.edu Fri Nov 26 21:25:00 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 26 Nov 2010 22:25:00 -0500 Subject: [Swift-devel] SGE qstat and XML Message-ID: Hello, As I was testing Swift with SGE I noticed that the qstat included in newer versions of the Grid Environment was not being correctly parsed by Swift, causing it to fail and exit prematurely. This was caused by a change in the formatting of qstat output. Starting with Grid Environment 6.0, qstat can output data as XML. I believe this should provide a more consistent way to parse the data. Attached are my changes for using and parsing XML. So far I've tested this on the ibicluster and on my Ubuntu laptop with grid environment installed. David -------------- next part -------------- A non-text attachment was scrubbed... Name: sge-updates.diff Type: text/x-patch Size: 9015 bytes Desc: not available URL: From hategan at mcs.anl.gov Sat Nov 27 19:41:46 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 27 Nov 2010 17:41:46 -0800 Subject: [Swift-devel] Re: Provider staging error in long-running test In-Reply-To: <1290479397.30590.1.camel@blabla2.none> References: <387720139.102083.1290402647892.JavaMail.root@zimbra.anl.gov> <1290479397.30590.1.camel@blabla2.none> Message-ID: <1290908506.32250.2.camel@blabla2.none> So I think that was due to incorrect assumption that message headers will never be broken up into pieces by the TCP layer. That caused the worker to fail, presumably under high load, but I cannot be sure about the exact conditions that led to the problem (and therefore I cannot be sure of the solution). I have added code to read things from a socket in a more resilient fashion. I have also removed the idle timeout from the worker. That should not bother us any more. Mihael On Mon, 2010-11-22 at 18:29 -0800, Mihael Hategan wrote: > Ok. So that doesn't look like it's a staging problem specifically, but > more like something with the comm library. I'll have to look at the > logs. And I can foresee some free time coming in a couple of days just > for that! > > Mihael > > On Sun, 2010-11-21 at 23:10 -0600, Michael Wilde wrote: > > Mihael, here is bug 3: > > > > I was testing a foreach loop doing a cat of 10,000 input files of sizes up to about 300-400K each. 
The test hit an error after around 3,491 files: > > > > Progress: Selecting site:1008 Submitted:12 Active:3 Finished successfully:3476 > > Progress: Selecting site:1008 Submitted:13 Active:3 Finished successfully:3491 > > Failed to shut down channel > > java.lang.NullPointerException > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57) > > at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.(AbstractKarajanChannel.java:52) > > > > The test was executed on PADS login1 like this: > > > > cd /home/wilde/swift/lab > > ./run.local.coast.ps.sh catsall > > > > log file: catsall-20101121-2239-oc2flmn0.log > > > > sites.xml: > > > > > > > > > > > > > > 8 > > 1 > > 1 > > .15 > > 10000 > > proxy > > /scratch/local/wilde/pstest/swiftwork > > > > > > > > login1$ cat cf > > wrapperlog.always.transfer=true > > sitedir.keep=true > > execution.retries=0 > > lazy.errors=false > > status.mode=provider > > use.provider.staging=true > > provider.staging.pin.swiftfiles=false > > login1$ cat catsall.swift > > type file; > > > > app (file o) cat (file i) > > { > > cat @i stdout=@o; > > } > > > > file infile[] ; > > file outfile[] ; > > > > foreach f, i in infile { > > outfile[i] = cat(f); > > } > > login1$ > > > > login1$ which swift > > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift > > login1$ java -version > > java version "1.6.0_22" > > Java(TM) SE Runtime Environment (build 1.6.0_22-b04) > > Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode) > > login1$ > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Sat Nov 27 23:58:24 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 27 Nov 2010 21:58:24 -0800 Subject: [Swift-devel] Re: Persistent coaster service fails after several runs In-Reply-To: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov> References: <2030510416.101872.1290394804460.JavaMail.root@zimbra.anl.gov> Message-ID: <1290923904.9422.6.camel@blabla2.none> I think some of the logs in /home/wilde/swift/lab are gone. Nonetheless, I believe that the lockup was caused by the following issue: - when something bad happened on a channel, some method would be called to allow the channel implementation to handle that error. - an existing problem (which I thought I fixed, but it turns out I had not committed it) caused that method to throw an exception - that would in turn (because it was not in a try/catch block) kill the thread used to send messages on behalf of all channels of a given type. This was fixed as follows: 1. I committed what I should have a while ago such that the triggering problem is gone 2. The handling of channel exceptions is now properly isolated Mihael On Sun, 2010-11-21 at 21:00 -0600, Michael Wilde wrote: > subject was: Re: [Swift-devel] misassignment of jobs > > Re the service-side timeout, OK, will do. > > Ive just re-created bug1, but its a little different than I thought. > > Swift runs to the persistent coaster server lock up (ie fail to > progress) and then get errors, not after a delay, but seemingly > randomly. Thats likely why I was misled into thinking it was delay > related. 
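As an aside, the kind of resilient read Mihael describes above can be sketched in a few lines of Java. This is only an illustration with made-up names, not the actual Karajan channel or worker code; the point is simply that a TCP read may return only part of a message header, so the reader must loop until the full header has arrived instead of assuming one read returns it all.

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: hypothetical class, not the actual channel/worker code.
// Reads exactly buf.length bytes, looping because a single TCP read may
// return only part of a message header.
public class ResilientReader {
    public static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException("stream closed after " + off + " of "
                        + buf.length + " header bytes");
            }
            off += n;
        }
    }
}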
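The second part of the fix, isolating per-channel error handling so that an exception thrown while handling one channel's failure cannot kill the thread that sends messages for all channels of a given type, follows the pattern sketched below. Again, the class and method names are hypothetical; this illustrates the pattern only, not the committed code.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch only: hypothetical names, not the committed Karajan code.
// One sender thread serves many channels; a failure in one channel's
// error handler must not propagate and kill the shared thread.
public class SharedSender implements Runnable {

    interface Channel {
        void flushPending() throws Exception;      // send queued messages
        void handleChannelException(Exception e);  // channel-local recovery
    }

    private final List<Channel> channels = new CopyOnWriteArrayList<Channel>();

    public void register(Channel c) {
        channels.add(c);
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            for (Channel c : channels) {
                try {
                    c.flushPending();
                } catch (Exception e) {
                    try {
                        c.handleChannelException(e);
                    } catch (Exception inHandler) {
                        // the handler itself failed; log it and keep the
                        // sender thread alive for the other channels
                        inHandler.printStackTrace();
                    }
                }
            }
            try {
                Thread.sleep(10); // avoid spinning in this toy loop
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
    }
}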
From wilde at mcs.anl.gov Sun Nov 28 00:46:26 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 28 Nov 2010 00:46:26 -0600 (CST) Subject: [Swift-devel] Re: Provider staging error in long-running test In-Reply-To: <1290908506.32250.2.camel@blabla2.none> Message-ID: <1027257314.126477.1290926786640.JavaMail.root@zimbra.anl.gov> Great! The test that failed after 3000+ transfers now ran all 10,000 OK. Im putting that in a loop now to see if it runs all night. Looks promising! - Mike ----- Original Message ----- > So I think that was due to incorrect assumption that message headers > will never be broken up into pieces by the TCP layer. That caused the > worker to fail, presumably under high load, but I cannot be sure about > the exact conditions that led to the problem (and therefore I cannot > be > sure of the solution). > > I have added code to read things from a socket in a more resilient > fashion. > > I have also removed the idle timeout from the worker. That should not > bother us any more. > > Mihael > > On Mon, 2010-11-22 at 18:29 -0800, Mihael Hategan wrote: > > Ok. So that doesn't look like it's a staging problem specifically, > > but > > more like something with the comm library. I'll have to look at the > > logs. And I can foresee some free time coming in a couple of days > > just > > for that! > > > > Mihael > > > > On Sun, 2010-11-21 at 23:10 -0600, Michael Wilde wrote: > > > Mihael, here is bug 3: > > > > > > I was testing a foreach loop doing a cat of 10,000 input files of > > > sizes up to about 300-400K each. The test hit an error after > > > around 3,491 files: > > > > > > Progress: Selecting site:1008 Submitted:12 Active:3 Finished > > > successfully:3476 > > > Progress: Selecting site:1008 Submitted:13 Active:3 Finished > > > successfully:3491 > > > Failed to shut down channel > > > java.lang.NullPointerException > > > at > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.configureHeartBeat(AbstractKarajanChannel.java:57) > > > at > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.(AbstractKarajanChannel.java:52) > > > > > > The test was executed on PADS login1 like this: > > > > > > cd /home/wilde/swift/lab > > > ./run.local.coast.ps.sh catsall > > > > > > log file: catsall-20101121-2239-oc2flmn0.log > > > > > > sites.xml: > > > > > > > > > > > > > > > > > jobmanager="local:local"/> > > > > > > 8 > > > 1 > > > 1 > > > .15 > > > > > key="initialScore">10000 > > > proxy > > > /scratch/local/wilde/pstest/swiftwork > > > > > > > > > > > > login1$ cat cf > > > wrapperlog.always.transfer=true > > > sitedir.keep=true > > > execution.retries=0 > > > lazy.errors=false > > > status.mode=provider > > > use.provider.staging=true > > > provider.staging.pin.swiftfiles=false > > > login1$ cat catsall.swift > > > type file; > > > > > > app (file o) cat (file i) > > > { > > > cat @i stdout=@o; > > > } > > > > > > file infile[] > > suffix=".in">; > > > file outfile[] > > prefix="f.",suffix=".out">; > > > > > > foreach f, i in infile { > > > outfile[i] = cat(f); > > > } > > > login1$ > > > > > > login1$ which swift > > > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift > > > login1$ java -version > > > java version "1.6.0_22" > > > Java(TM) SE Runtime Environment (build 1.6.0_22-b04) > > > Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode) > > > login1$ > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at 
ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Sun Nov 28 00:47:56 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 28 Nov 2010 00:47:56 -0600 (CST) Subject: [Swift-devel] Re: Persistent coaster service fails after several runs In-Reply-To: <1290923904.9422.6.camel@blabla2.none> Message-ID: <810228394.126479.1290926876271.JavaMail.root@zimbra.anl.gov> Will test this one tomorrow. I deleted logs and other junk as I was way over quota. Sorry, I forgot I had pointed you to these. - Mike ----- Original Message ----- > I think some of the logs in /home/wilde/swift/lab are gone. > Nonetheless, > I believe that the lockup was caused by the following issue: > > - when something bad happened on a channel, some method would be > called > to allow the channel implementation to handle that error. > - an existing problem (which I thought I fixed, but it turns out I had > not committed it) caused that method to throw an exception > - that would in turn (because it was not in a try/catch block) kill > the > thread used to send messages on behalf of all channels of a given > type. > > This was fixed as follows: > 1. I committed what I should have a while ago such that the triggering > problem is gone > 2. The handling of channel exceptions is now properly isolated > > Mihael > > On Sun, 2010-11-21 at 21:00 -0600, Michael Wilde wrote: > > subject was: Re: [Swift-devel] misassignment of jobs > > > > Re the service-side timeout, OK, will do. > > > > Ive just re-created bug1, but its a little different than I thought. > > > > Swift runs to the persistent coaster server lock up (ie fail to > > progress) and then get errors, not after a delay, but seemingly > > randomly. Thats likely why I was misled into thinking it was delay > > related. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Nov 28 02:07:30 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 28 Nov 2010 00:07:30 -0800 Subject: [Swift-devel] Re: Provider staging error in long-running test In-Reply-To: <1027257314.126477.1290926786640.JavaMail.root@zimbra.anl.gov> References: <1027257314.126477.1290926786640.JavaMail.root@zimbra.anl.gov> Message-ID: <1290931650.9947.0.camel@blabla2.none> On Sun, 2010-11-28 at 00:46 -0600, Michael Wilde wrote: > Great! The test that failed after 3000+ transfers now ran all 10,000 OK. > Im putting that in a loop now to see if it runs all night. Right. One run is probably not sufficient to tell. Mihael From hategan at mcs.anl.gov Sun Nov 28 02:09:13 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 28 Nov 2010 00:09:13 -0800 Subject: [Swift-devel] Re: Persistent coaster service fails after several runs In-Reply-To: <810228394.126479.1290926876271.JavaMail.root@zimbra.anl.gov> References: <810228394.126479.1290926876271.JavaMail.root@zimbra.anl.gov> Message-ID: <1290931753.9947.2.camel@blabla2.none> On Sun, 2010-11-28 at 00:47 -0600, Michael Wilde wrote: > Will test this one tomorrow. I deleted logs and other junk as I was way over quota. Sorry, I forgot I had pointed you to these. It's ok. The problem (or what I think was the problem) was visible in one of the other logs. 
Mihael From wilde at mcs.anl.gov Sun Nov 28 09:53:53 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 28 Nov 2010 09:53:53 -0600 (CST) Subject: [Swift-devel] Re: Provider staging error in long-running test In-Reply-To: <1290931650.9947.0.camel@blabla2.none> Message-ID: <382373340.126841.1290959633698.JavaMail.root@zimbra.anl.gov> It ran all night and is still going - about 27 runs of 10,000 files each; all finished with no errors. I'll start testing to multiple remote nodes now (the current test is to localhost). Nice work! - Mike ----- Original Message ----- > On Sun, 2010-11-28 at 00:46 -0600, Michael Wilde wrote: > > Great! The test that failed after 3000+ transfers now ran all 10,000 > > OK. > > Im putting that in a loop now to see if it runs all night. > > Right. One run is probably not sufficient to tell. > > Mihael -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Sun Nov 28 23:23:42 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 28 Nov 2010 23:23:42 -0600 (CST) Subject: [Swift-devel] Re: Persistent coaster service fails after several runs In-Reply-To: <1290931753.9947.2.camel@blabla2.none> Message-ID: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov> This fix looks great so far - Ive tested with varying workflow sizes and delays, and have not seen any problems. - Mike ----- Original Message ----- > On Sun, 2010-11-28 at 00:47 -0600, Michael Wilde wrote: > > Will test this one tomorrow. I deleted logs and other junk as I was > > way over quota. Sorry, I forgot I had pointed you to these. > > It's ok. The problem (or what I think was the problem) was visible in > one of the other logs. > > Mihael -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Tue Nov 30 09:59:34 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 30 Nov 2010 09:59:34 -0600 (CST) Subject: [Swift-devel] Re: Persistent coaster service fails after several runs In-Reply-To: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov> References: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov> Message-ID: Along these lines, I'm looking at memory usage in Coasters. There's a plot attached below- usage spikes when the workers start running. 96% of the usage is byte[] which makes me think it could be KarajanChannel stuff... http://www.ci.uchicago.edu/wiki/bin/view/SWFT/PerformanceNotes#Memory On Sun, 28 Nov 2010, Michael Wilde wrote: > This fix looks great so far - Ive tested with varying workflow sizes and delays, and have not seen any problems. > > - Mike -- Justin M Wozniak From hategan at mcs.anl.gov Tue Nov 30 16:51:12 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 30 Nov 2010 14:51:12 -0800 Subject: [Swift-devel] Re: Persistent coaster service fails after several runs In-Reply-To: References: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov> Message-ID: <1291157472.21980.0.camel@blabla2.none> Can you make a heap dump of the relevant issue? On Tue, 2010-11-30 at 09:59 -0600, Justin M Wozniak wrote: > Along these lines, I'm looking at memory usage in Coasters. There's a > plot attached below- usage spikes when the workers start running. > > 96% of the usage is byte[] which makes me think it could be KarajanChannel > stuff... 
> > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/PerformanceNotes#Memory > > On Sun, 28 Nov 2010, Michael Wilde wrote: > > > This fix looks great so far - Ive tested with varying workflow sizes and delays, and have not seen any problems. > > > > - Mike > From wozniak at mcs.anl.gov Tue Nov 30 16:57:45 2010 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 30 Nov 2010 16:57:45 -0600 (CST) Subject: [Swift-devel] Re: Persistent coaster service fails after several runs In-Reply-To: <1291157472.21980.0.camel@blabla2.none> References: <959208306.127425.1291008222274.JavaMail.root@zimbra.anl.gov> <1291157472.21980.0.camel@blabla2.none> Message-ID: I'm on Intrepid so it's an IBM heap dump. There's a good one there in ~wozniak/Public/heapdumps . The byte[]s are definitely associated with TCPChannel but that's all I have been able to figure out so far- I don't see where they are retained. It is possible that the reader is generating the bytes faster than the network can push them out, so we just need to tighten up the throttle? On Tue, 30 Nov 2010, Mihael Hategan wrote: > Can you make a heap dump of the relevant issue? > > On Tue, 2010-11-30 at 09:59 -0600, Justin M Wozniak wrote: >> Along these lines, I'm looking at memory usage in Coasters. There's a >> plot attached below- usage spikes when the workers start running. >> >> 96% of the usage is byte[] which makes me think it could be KarajanChannel >> stuff... >> >> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/PerformanceNotes#Memory >> >> On Sun, 28 Nov 2010, Michael Wilde wrote: >> >>> This fix looks great so far - Ive tested with varying workflow sizes and delays, and have not seen any problems. >>> >>> - Mike >> > > > -- Justin M Wozniak
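A rough sketch of the "tighten up the throttle" idea Justin raises: bound the data queued per channel so a fast producer blocks instead of accumulating byte[] buffers faster than the socket can drain them. The names below are made up and this is not the actual TCPChannel code; whether backpressure is the right fix depends on where the heap dumps show the buffers being retained. (For HotSpot JVMs, jmap -dump:format=b,file=heap.hprof <pid> produces a dump that Eclipse MAT can open.)

import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the backpressure idea only; names are hypothetical and this
// is not the actual TCPChannel implementation. A bounded queue makes the
// producer wait when the network cannot keep up, instead of letting
// byte[] buffers pile up on the heap.
public class BoundedOutgoingQueue {
    private final BlockingQueue<ByteBuffer> pending;

    public BoundedOutgoingQueue(int maxPendingBuffers) {
        this.pending = new ArrayBlockingQueue<ByteBuffer>(maxPendingBuffers);
    }

    // Called by whatever produces outgoing data; blocks when full.
    public void enqueue(ByteBuffer buf) throws InterruptedException {
        pending.put(buf);
    }

    // Called by the sender thread when the socket is writable.
    public ByteBuffer poll(long timeoutMillis) throws InterruptedException {
        return pending.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}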