From wilde at mcs.anl.gov  Thu Jan  3 08:40:01 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 03 Jan 2008 08:40:01 -0600
Subject: [Swift-devel] how to increment int?
In-Reply-To: 
References: <928045.71133.qm@web52301.mail.re2.yahoo.com>
Message-ID: <477CF3C1.4020700@mcs.anl.gov>

Mike, did this ever get resolved?

On 12/18/07 8:08 AM, Ben Clifford wrote:
> 
> On Tue, 18 Dec 2007, Mike Kubal wrote:
> 
>> I tried both approaches, but I get the following
>> errors:
> 
> what version of swift are you using? Both of those errors look like the
> sort of thing I'd expect from a version earlier than v0.3.
> 

From benc at hawaga.org.uk  Thu Jan  3 08:42:50 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Jan 2008 14:42:50 +0000 (GMT)
Subject: [Swift-devel] how to increment int?
In-Reply-To: <477CF3C1.4020700@mcs.anl.gov>
References: <928045.71133.qm@web52301.mail.re2.yahoo.com> <477CF3C1.4020700@mcs.anl.gov>
Message-ID: 

The immediate problem did - both of the approaches I suggested needed
swift 0.3 (though the equivalent one that Mihael gave would have worked on
earlier versions); my understanding is that that upgrade happened and the
suggestions I made then worked.

On Thu, 3 Jan 2008, Michael Wilde wrote:

> Mike, did this ever get resolved?
> 
> 
> On 12/18/07 8:08 AM, Ben Clifford wrote:
> > 
> > On Tue, 18 Dec 2007, Mike Kubal wrote:
> > 
> > > I tried both approaches, but I get the following
> > > errors:
> > 
> > what version of swift are you using? Both of those errors look like the sort
> > of thing I'd expect from a version earlier than v0.3.
> > 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 

From mikekubal at yahoo.com  Thu Jan  3 08:51:06 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Thu, 3 Jan 2008 06:51:06 -0800 (PST)
Subject: [Swift-devel] how to increment int?
Message-ID: <918929.52903.qm@web52303.mail.re2.yahoo.com>

Yes. I upgraded and it worked. Thanks.

----- Original Message ----
From: Ben Clifford
To: Michael Wilde
Cc: Mike Kubal; swift-devel at ci.uchicago.edu
Sent: Thursday, January 3, 2008 8:42:50 AM
Subject: Re: [Swift-devel] how to increment int?

The immediate problem did - both of the approaches I suggested needed
swift 0.3 (though the equivalent one that Mihael gave would have worked on
earlier versions); my understanding is that that upgrade happened and the
suggestions I made then worked.

On Thu, 3 Jan 2008, Michael Wilde wrote:

> Mike, did this ever get resolved?
> 
> 
> On 12/18/07 8:08 AM, Ben Clifford wrote:
> > 
> > On Tue, 18 Dec 2007, Mike Kubal wrote:
> > 
> > > I tried both approaches, but I get the following
> > > errors:
> > 
> > what version of swift are you using? Both of those errors look like the sort
> > of thing I'd expect from a version earlier than v0.3.
> > 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 

_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From benc at hawaga.org.uk Thu Jan 3 19:10:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 4 Jan 2008 01:10:53 +0000 (GMT) Subject: [Swift-devel] lots of commits Message-ID: I've just made a large number of commits to trunk. I've had these sitting in my patch stack for a while and they seem to work with me. But pretty much lots of commits == lots of exciting new bugs to discover. Please find them for me. -- From benc at hawaga.org.uk Mon Jan 7 16:19:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 7 Jan 2008 22:19:30 +0000 (GMT) Subject: [Swift-devel] logging for provenance tracking Message-ID: Recently I've been investigating provenance storage and querying for Swift. So far, I've been extracting the necessary information from log files (the Swift run logs and kickstart records). As there has been some interest expressed in gathering more real data to play with in this space, I have committed the necessary extra log lines to the Swift trunk in r1558. Any runs made with Swift >= r1558 should be able to feed data into my various pieces of prototype provenance code. The rest of my prototype code remains unfit for human eyes, though I hope to have something for others to experiment with by January 18th. -- From hategan at mcs.anl.gov Thu Jan 10 17:50:24 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 10 Jan 2008 17:50:24 -0600 Subject: [Swift-devel] Re: Criteria for rollout In-Reply-To: <1199985821.13373.2.camel@blabla.mcs.anl.gov> References: <16D5653B-2D3F-4504-BFDA-C6F106312521@fnal.gov> <7b6dcb010801091143s25f9958bv236b0dfd633d3912@mail.gmail.com> <1199931480.21815.29.camel@blabla.mcs.anl.gov> <7b6dcb010801100726w147e083fw5ece0e5353480d11@mail.gmail.com> <1199985821.13373.2.camel@blabla.mcs.anl.gov> Message-ID: <1200009024.22770.1.camel@blabla.mcs.anl.gov> > > Again, I looked at the scratch dirs and diffed the frequencyOut file > > that gets plotted by Plot.pl, and those files look different. Here > > are the scratch dirs of the studies I ran: > > Production: > > /disks/i2u2/cosmic/users/AY2004/IL/Batavia/Fermilab/Jordan/undergrads/cosmic/scratch/run-2008.0110.091001.0456 > > Swift: > > /disks/i2u2-dev/cosmic/users/AY2004/IL/Batavia/Fermilab/Jordan/undergrads/cosmic/scratch/2008.01.10.09.09.53.030.7541 > > I'm seeing differences in the plot. In the two plots you linked to, > > you can tell the Production-produced plot has a lower first point > > (around 7100) and the Swift-produced plot has a higher first point > > (around 7500). Also, when doing fitting for both, they came up with > > different lifetime values. You can see those in the extraFun_rawFile > > files. > > Good point. I'll look at this. > Hah. Got it. Looks like when passing an array as a parameter to an app in Swift, the order is not preserved. So there is a mismatch between the cpld frequencies and the files. From benc at hawaga.org.uk Thu Jan 10 17:53:11 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 10 Jan 2008 23:53:11 +0000 (GMT) Subject: [Swift-devel] Re: Criteria for rollout In-Reply-To: <1200009024.22770.1.camel@blabla.mcs.anl.gov> References: <16D5653B-2D3F-4504-BFDA-C6F106312521@fnal.gov> <7b6dcb010801091143s25f9958bv236b0dfd633d3912@mail.gmail.com> <1199931480.21815.29.camel@blabla.mcs.anl.gov> <7b6dcb010801100726w147e083fw5ece0e5353480d11@mail.gmail.com> <1199985821.13373.2.camel@blabla.mcs.anl.gov> <1200009024.22770.1.camel@blabla.mcs.anl.gov> Message-ID: On Thu, 10 Jan 2008, Mihael Hategan wrote: > Hah. Got it. 
Looks like when passing an array as a parameter to an app
> in Swift, the order is not preserved. So there is a mismatch between the
> cpld frequencies and the files.

Array as in @filenames of an array or actually passing an array?

-- 

From hategan at mcs.anl.gov  Thu Jan 10 17:59:19 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 10 Jan 2008 17:59:19 -0600
Subject: [Swift-devel] Re: Criteria for rollout
In-Reply-To: 
References: <16D5653B-2D3F-4504-BFDA-C6F106312521@fnal.gov> <7b6dcb010801091143s25f9958bv236b0dfd633d3912@mail.gmail.com> <1199931480.21815.29.camel@blabla.mcs.anl.gov> <7b6dcb010801100726w147e083fw5ece0e5353480d11@mail.gmail.com> <1199985821.13373.2.camel@blabla.mcs.anl.gov> <1200009024.22770.1.camel@blabla.mcs.anl.gov>
Message-ID: <1200009559.24499.3.camel@blabla.mcs.anl.gov>

On Thu, 2008-01-10 at 23:53 +0000, Ben Clifford wrote:
> On Thu, 10 Jan 2008, Mihael Hategan wrote:
> 
> > Hah. Got it. Looks like when passing an array as a parameter to an app
> > in Swift, the order is not preserved. So there is a mismatch between the
> > cpld frequencies and the files.
> 
> Array as in @filenames of an array or actually passing an array?

Another good point. @filenames. But then looking at the code for passing
an array, it doesn't seem like it keeps any particular order either:

if (handle.getType().isArray()) {
    Map value = handle.getArrayValue();
    if (handle.isClosed()) {
        return new PairIterator(value);
...
public PairIterator(Map map) {
    this.it = map.entrySet().iterator();

However, @filenames hurts. The other one not as much.

> 

From benc at hawaga.org.uk  Thu Jan 10 18:04:07 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 11 Jan 2008 00:04:07 +0000 (GMT)
Subject: [Swift-devel] Re: Criteria for rollout
In-Reply-To: <1200009559.24499.3.camel@blabla.mcs.anl.gov>
References: <16D5653B-2D3F-4504-BFDA-C6F106312521@fnal.gov> <7b6dcb010801091143s25f9958bv236b0dfd633d3912@mail.gmail.com> <1199931480.21815.29.camel@blabla.mcs.anl.gov> <7b6dcb010801100726w147e083fw5ece0e5353480d11@mail.gmail.com> <1199985821.13373.2.camel@blabla.mcs.anl.gov> <1200009024.22770.1.camel@blabla.mcs.anl.gov> <1200009559.24499.3.camel@blabla.mcs.anl.gov>
Message-ID: 

On Thu, 10 Jan 2008, Mihael Hategan wrote:

> Another good point. @filenames. But then looking at the code for passing
> an array, it doesn't seem like it keeps any particular order either:

I didn't even know passing an array outside of @filenames worked...

> if (handle.getType().isArray()) {
>     Map value = handle.getArrayValue();
>     if (handle.isClosed()) {
>         return new PairIterator(value);
> ...
> public PairIterator(Map map) {
>     this.it = map.entrySet().iterator();
> 
> However, @filenames hurts. The other one not as much.

yeah.

I think the correct behaviour should be to output in the order of the
array indices (for both plain array passing and for @filenames).

I don't think anything else maps over an array apart from @filenames.
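
To make the ordering problem concrete, here is a small stand-alone Java
sketch (hypothetical code, not taken from the Swift source; the key type
and map contents are invented for illustration). It iterates a
hash-backed Map through entrySet(), in the same style as the quoted
PairIterator, and the entries do not come back in array-index order:

import java.util.HashMap;
import java.util.Map;

public class PairOrderDemo {
    public static void main(String[] args) {
        // Stand-in for handle.getArrayValue(): array indices stored as
        // string keys mapping to per-element file names.
        Map<String, String> value = new HashMap<String, String>();
        for (int i = 0; i < 12; i++) {
            value.put(Integer.toString(i), "freq_" + i + ".dat");
        }
        // Same iteration as "map.entrySet().iterator()" above; a HashMap
        // visits entries in bucket order, so this does not print 0..11.
        for (Map.Entry<String, String> e : value.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}

Whether the mismatch shows up in practice depends on the concrete Map
implementation and key type, which is why a sorted traversal (discussed
below) is the robust fix.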
-- 

From hategan at mcs.anl.gov  Thu Jan 10 18:14:26 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 10 Jan 2008 18:14:26 -0600
Subject: [Swift-devel] Re: Criteria for rollout
In-Reply-To: 
References: <16D5653B-2D3F-4504-BFDA-C6F106312521@fnal.gov> <7b6dcb010801091143s25f9958bv236b0dfd633d3912@mail.gmail.com> <1199931480.21815.29.camel@blabla.mcs.anl.gov> <7b6dcb010801100726w147e083fw5ece0e5353480d11@mail.gmail.com> <1199985821.13373.2.camel@blabla.mcs.anl.gov> <1200009024.22770.1.camel@blabla.mcs.anl.gov> <1200009559.24499.3.camel@blabla.mcs.anl.gov>
Message-ID: <1200010466.24499.16.camel@blabla.mcs.anl.gov>

On Fri, 2008-01-11 at 00:04 +0000, Ben Clifford wrote:
> On Thu, 10 Jan 2008, Mihael Hategan wrote:
> 
> > Another good point. @filenames. But then looking at the code for passing
> > an array, it doesn't seem like it keeps any particular order either:
> 
> I didn't even know passing an array outside of @filenames worked...

:)

> 
> > However, @filenames hurts. The other one not as much.
> 
> yeah.
> 
> I think the correct behaviour should be to output in the order of the
> array indices (for both plain array passing and for @filenames).

So I think sorting in all cases (i.e. getFringePaths()) isn't necessary.
Not sorting is faster.

So a simple solution would be to sort those paths right in
leavesFileNames(). In the particular case of an array, it will achieve
the desired result. In other cases (i.e. more complex data types) it
would result in some form of order, not particularly worse than no
order. Make sense?

> 
> I don't think anything else maps over an array apart from @filenames.
> 

From benc at hawaga.org.uk  Thu Jan 10 18:29:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 11 Jan 2008 00:29:46 +0000 (GMT)
Subject: [Swift-devel] Re: Criteria for rollout
In-Reply-To: <1200010466.24499.16.camel@blabla.mcs.anl.gov>
References: <16D5653B-2D3F-4504-BFDA-C6F106312521@fnal.gov> <7b6dcb010801091143s25f9958bv236b0dfd633d3912@mail.gmail.com> <1199931480.21815.29.camel@blabla.mcs.anl.gov> <7b6dcb010801100726w147e083fw5ece0e5353480d11@mail.gmail.com> <1199985821.13373.2.camel@blabla.mcs.anl.gov> <1200009024.22770.1.camel@blabla.mcs.anl.gov> <1200009559.24499.3.camel@blabla.mcs.anl.gov> <1200010466.24499.16.camel@blabla.mcs.anl.gov>
Message-ID: 

On Thu, 10 Jan 2008, Mihael Hategan wrote:

> > I think the correct behaviour should be to output in the order of the
> > array indices (for both plain array passing and for @filenames).
> 
> So I think sorting in all cases (i.e. getFringePaths()) isn't necessary.
> Not sorting is faster.
> 
> So a simple solution would be to sort those paths right in
> leavesFileNames(). In the particular case of an array, it will achieve
> the desired result. In other cases (i.e. more complex data types) it
> would result in some form of order, not particularly worse than no
> order. Make sense?

Sorting only needs to happen when the arrays get serialised in some place
that is significant. At the moment, I think the only place for that is
@filename and @filenames. So leavesFileNames would be an ok place to do
it.
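
A minimal sketch of the leavesFileNames() sorting being proposed, under
the assumption (hypothetical - the real method and data structures may
differ) that the array value is a Map from string-typed numeric indices
to leaf file names. Sorting the keys numerically makes @filenames emit
names in declared array order:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class LeafNameSorter {
    static List<String> leafFileNamesInIndexOrder(Map<String, String> arrayValue) {
        List<String> keys = new ArrayList<String>(arrayValue.keySet());
        // Numeric comparison, so "10" sorts after "9" rather than after "1".
        Collections.sort(keys, new Comparator<String>() {
            public int compare(String a, String b) {
                return Integer.valueOf(a).compareTo(Integer.valueOf(b));
            }
        });
        List<String> names = new ArrayList<String>(keys.size());
        for (String k : keys) {
            names.add(arrayValue.get(k)); // stand-in for the mapper's filename lookup
        }
        return names;
    }
}

For non-array (more complex) types a plain lexicographic sort of the
paths would give the "some form of order" described above.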
A more complicated-to-implement approach would be to sort out the return
type of @filenames (it doesn't have one at the moment that fits in the
Swift data model) so that it returns a Swift-style array of strings rather
than a Java array of strings; and then have the sorting happening at the
point that the swift-style array of strings is serialised (whether it's a
Swift array coming out of @filenames or a swift array coming from
somewhere else). That's something that I want to do to tidy up the
language some anyway; but don't let that hold up implementation of a
leavesFileNames sorting fix.

-- 

From hategan at mcs.anl.gov  Thu Jan 10 18:38:28 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 10 Jan 2008 18:38:28 -0600
Subject: [Swift-devel] Re: Criteria for rollout
In-Reply-To: 
References: <16D5653B-2D3F-4504-BFDA-C6F106312521@fnal.gov> <7b6dcb010801091143s25f9958bv236b0dfd633d3912@mail.gmail.com> <1199931480.21815.29.camel@blabla.mcs.anl.gov> <7b6dcb010801100726w147e083fw5ece0e5353480d11@mail.gmail.com> <1199985821.13373.2.camel@blabla.mcs.anl.gov> <1200009024.22770.1.camel@blabla.mcs.anl.gov> <1200009559.24499.3.camel@blabla.mcs.anl.gov> <1200010466.24499.16.camel@blabla.mcs.anl.gov>
Message-ID: <1200011908.24499.17.camel@blabla.mcs.anl.gov>

> A more complicated-to-implement approach would be to sort out the return
> type of @filenames (it doesn't have one at the moment that fits in the
> Swift data model) so that it returns a Swift-style array of strings rather
> than a Java array of strings; and then have the sorting happening at the
> point that the swift-style array of strings is serialised (whether it's a
> Swift array coming out of @filenames or a swift array coming from
> somewhere else). That's something that I want to do to tidy up the
> language some anyway; but don't let that hold up implementation of a
> leavesFileNames sorting fix.
> 

You'd probably need to change said method anyway, so this shouldn't
interfere too much with that.

From bugzilla-daemon at mcs.anl.gov  Fri Jan 11 05:38:41 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 11 Jan 2008 05:38:41 -0600 (CST)
Subject: [Swift-devel] [Bug 124] New: Logging system cannot deal with multiple workflows running within one JVM
Message-ID: 

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=124

           Summary: Logging system cannot deal with multiple workflows
                    running within one JVM
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: nobody at mcs.anl.gov
        ReportedBy: benc at hawaga.org.uk
                CC: swift-devel at ci.uchicago.edu

The present logging style has one log per workflow; however, if several
workflows are run within the same virtual machine (or at least using the
same log4j classes) then logging from all workflows will go only to one
log file.

-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From hategan at mcs.anl.gov  Mon Jan 14 11:56:39 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 14 Jan 2008 11:56:39 -0600
Subject: [Swift-devel] iterations on cartesian and inner products
Message-ID: <1200333400.13642.2.camel@blabla.mcs.anl.gov>

I'm thinking the following might be nice to have:

X[]
Y[]

for (x, y) in (X, Y) {
...
}

for (x, y) in (X * Y) {
...
}

From hategan at mcs.anl.gov  Mon Jan 14 12:44:14 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 14 Jan 2008 12:44:14 -0600
Subject: [Swift-devel] Re: Phone meeting?
In-Reply-To: <4786575C.3020906@fnal.gov>
References: <478586D4.9030606@fnal.gov> <478648BA.2090902@mcs.anl.gov> <4786575C.3020906@fnal.gov>
Message-ID: <1200336254.13803.38.camel@blabla.mcs.anl.gov>

I say we start an email discussion about swift improvements for lqcd (but
not necessarily only for lqcd). I'll start with a few points:

1. The app blocks are ugly. The C/Java world uses modifiers for such
things. Therefore I'm thinking that

(file output) cvt12x12 (file input) {
    app {
        CVT12x12 "-v" "-i" @input "-o" @output;
    }
}

should become

app (quark output) cvt12x12 (file input) {
    CVT12x12 "-v" "-i" @input "-o" @output;
}

2. Improved iteration. I already sent an email to swift-devel, but I'll
re-iterate here: Given X[n] and Y[n],

(a) for (x, y) in (X, Y) {} will be roughly equivalent to

for dummy, i in X {
    x = X[i];
    y = Y[i];
}

and

(b) for (x, y) in (X * Y) {} will be equivalent to:

for x in X {
    for y in Y {
    }
}

I think (a) offers a cleaner appearance than the current alternative,
while (b) is only a marginal improvement.

3. I'll need a list of data objects for which you care about the file
names, and the corresponding file names.

4. There are two basic ways to construct lists/arrays: the
comprehension/functional way and the imperative (for loop) way. Swift has
chosen the imperative style. This works fine for specifying data
dependencies, but falls short for mapping/file names. Swift requires that
mappings be specified in one step, which means that file name lists need
to be constructed before the mapping is expressed. However, in many cases
the naming patterns follow the data patterns, which means that control
structures will be duplicated. Allow me to demonstrate (in a
yet-to-be-implemented version of the language):

string masses[] = ["0.005", "0.007", "0.01", "0.02", "0.03"];
string configs[] = ["000102","000108","000114"];
string fn[];

for (m, c), i in (masses * configs) {
    fn[i] = m + "_" + c;
}

file stags[][] ;

But then another iteration over the configs needs to be done for the
actual calculations (and one for the masses, which is currently hidden in
an app). The problem here is the inability of swift to allow specifying
file names for one piece of data in arbitrary places. The alternative
would be:

for config, c in configs {
    for mass, m in masses {
        stags[c][m] &= m + "_" + c;
        stags[c][m] = stagSolve(gauge, mass, source);
    }
}

This may also eliminate the need for funny mapping tricks (in particular
the assignments with not-so-well-defined semantics).

5. Oh, and the + operator should be able to concatenate strings.

From benc at hawaga.org.uk  Mon Jan 14 16:17:04 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 14 Jan 2008 22:17:04 +0000 (GMT)
Subject: [Swift-devel] tz
Message-ID: 

This just appeared in CHANGES.txt:

> *** Hmm. Maybe we should use UTC for the timestamps.

It probably shouldn't be there.

You can use UTC for timestamps if you want by setting the unix timezone to
that (either system-wide or in TZ to override per shell); similarly any of
the other 500 or so timezones can be used in the same way. That's the same
way it works for pretty much all other unix programs.
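
Since Swift runs on a JVM, the behaviour described here can be seen
directly in Java: date formatting follows the JVM default timezone, which
on unix is derived from the TZ environment variable. The demo class below
is hypothetical (not part of Swift) and just illustrates the mechanism:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TzDemo {
    public static void main(String[] args) {
        Date now = new Date();
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS Z");
        // Formats in the JVM default zone, i.e. whatever TZ (or the
        // system-wide setting) said when the JVM started:
        System.out.println("local: " + fmt.format(now));
        // Force UTC regardless of the environment:
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println("UTC:   " + fmt.format(now));
    }
}

Running it as "TZ=UTC java TzDemo" makes both lines agree, which is the
per-shell override mentioned above.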
-- 

From hategan at mcs.anl.gov  Mon Jan 14 16:27:59 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 14 Jan 2008 16:27:59 -0600
Subject: [Swift-devel] tz
In-Reply-To: 
References: 
Message-ID: <1200349679.26962.1.camel@blabla.mcs.anl.gov>

I was referring to the changelog timestamps.

I wrote it because the previous changelog entry was... in the future. At
least from my perspective. It seemed odd. And funny.

On Mon, 2008-01-14 at 22:17 +0000, Ben Clifford wrote:
> This just appeared in CHANGES.txt:
> 
> > *** Hmm. Maybe we should use UTC for the timestamps.
> 
> It probably shouldn't be there.
> 
> You can use UTC for timestamps if you want by setting the unix timezone to
> that (either system-wide or in TZ to override per shell); similarly any of
> the other 500 or so timezones can be used in the same way. That's the same
> way it works for pretty much all other unix programs.
> 

From benc at hawaga.org.uk  Mon Jan 14 16:32:48 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 14 Jan 2008 22:32:48 +0000 (GMT)
Subject: [Swift-devel] tz
In-Reply-To: <1200349679.26962.1.camel@blabla.mcs.anl.gov>
References: <1200349679.26962.1.camel@blabla.mcs.anl.gov>
Message-ID: 

On Mon, 14 Jan 2008, Mihael Hategan wrote:

> I was referring to the changelog timestamps.
> 
> I wrote it because the previous changelog entry was... in the future. At
> least from my perspective. It seemed odd. And funny.

yeah. though it's only the 14th where I am too. The change claiming to be
the 15th was actually made the 5th.

Caused, perhaps, by another internationalisation/localisation issue in
that I cannot for the life of me form american-short-format dates properly
nor recognise what date they actually represent without immense levels of
concentration - so most of the dates I put in have almost no sanity
checking on them at all.

-- 

From hategan at mcs.anl.gov  Tue Jan 15 14:15:30 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 15 Jan 2008 14:15:30 -0600
Subject: [Swift-devel] Re: Phone meeting?
In-Reply-To: <478CEA5E.9020701@fnal.gov>
References: <478586D4.9030606@fnal.gov> <478648BA.2090902@mcs.anl.gov> <4786575C.3020906@fnal.gov> <1200336254.13803.38.camel@blabla.mcs.anl.gov> <478CEA5E.9020701@fnal.gov>
Message-ID: <1200428131.4848.53.camel@blabla.mcs.anl.gov>

On Tue, 2008-01-15 at 11:16 -0600, Jim Kowalkowski wrote:
> Hi Mihael,
> 
> See below for comments (from all of us).
> You can forward this to your development list, since I don't think I can
> post directly to it.

That's ok. Somebody will get a bounce and they will approve the post.

> 
> Jim (for the LQCD group)
> 
> > app (quark output) cvt12x12 (file input) {
> >     CVT12x12 "-v" "-i" @input "-o" @output;
> > }
> 
> Cosmetic changes like these make the script easier to understand and we
> will be able to convince the domain scientists that it's a lot easier for
> them to adopt it.

I thought it would make more sense for reasons beyond cosmetics. The
app{} block cannot, for example, be used in a foreach loop (or at least
it isn't well defined there). So it is bound to a "procedure" definition.

> 
> One minor issue - would it be better to place this "function attribute"
> closer to the function name, such as "(quark output) app cvt12x12 (file
> input)"? Or perhaps after the function name?

I'd personally like to follow known styles (e.g. Java), but I guess the
parser can be made to understand "app" flexibly.

> > 2. Improved iteration.
> > I already sent an email to swift-devel, but I'll
> > re-iterate here:
> 
> > (a) for (x, y) in (X, Y) {} will be roughly equivalent to
> 
> > and
> 
> > (b) for (x, y) in (X * Y) {} will be equivalent to:
> 
> Again, cosmetic changes like this are useful. And we agree that (a) is
> more useful than (b).
> 
> > 3. I'll need a list of data objects for which you care about the file
> > names, and the corresponding file names.
> 
> Coming up with an answer for questions (3) and (4) was somewhat
> difficult. We thought that it would be best for us to show you an example
> that demonstrates the Swift language features that we are thinking about.
> This also addresses some of the issues involved in (4), essentially
> because it does away with the need to reference descriptive information
> from more than one place (i.e. multiple arrays and file objects).

... in Swift. However, lots of things are now pushed into the mappers, so
the referencing of descriptive information from multiple places may still
happen. Or not.

> Hopefully this will be more clear in the example below than in the short
> description above.
> 
> /**
>  * This workflow was created based on a portion of the 2ptHL SwiftScript.
>  * Many steps are not implemented here, but it does illustrate the
>  * desired functionality of assigning mappers to types. This eliminates
>  * the nested loops and complex array indices.
>  */
> 
> /**
>  * Types can be bound to mappers at declaration.
>  */
> type ParameterSet;
> type HeavyQuark;
> type HeavyQuarkConv;
> type TwoPointHLFile;

Interesting. So declarations will then parametrize those mappers. I think
this needs more thinking. As you point out below, there are some issues.

> 
> /**
>  * The set of parameters is retrieved from a database (or file system) using
>  * the specified key. The parameters come in as name-value pairs. The values
>  * are attached to the object as attributes available using the name.
>  */
> 
> /* in our Dec discussion, this was referred to as "parameter 57" */
> ParameterSet pset;
> 
> /**
>  * The parameter set values are used to create a set of file names - heavy
>  * quarks in this example. In this case, the mapper bound to HeavyQuark accesses
>  * the wsrc attribute to make the files. The [] mean that the variable is a
>  * container (which might be multidimensional internally); it is not an array,
>  * but an object with some array-like semantics.
>  *
>  * We are leaving out some important details here - one is that the mapper is
>  * bound to a HeavyQuark, not a container of HeavyQuarks. This could mean that
>  * mappers must know how to work with containers of the type they are bound to,
>  * i.e. have some sort of container operator defined.

Also, should one construct a struct with fields of HeavyQuark type, there
is some blurriness in how things would be done.

>  *
>  * We are also assuming a very simple structure to the parameter set. A more
>  * interesting scenario would be some hierarchy within the parameter set, in
>  * other words: pset@2pt would give the child parameter set object
>  * specific to TwoPointHLFiles.
>  */
> HeavyQuark heavyQuarkFiles[];
> HeavyQuarkConv heavyQuarkConvertedFiles[];
> TwoPointHLFile twoPointHLFile[];
> 
> /**
>  * Create the heavy quark file from the parameter set. We specify the parameter
>  * set as input, but the mapper attaches the attributes (e.g. kappa) to the
>  * object. So 'out' is really an object here, where all the attributes that
>  * went into making the filename are also individually available.
>  * This eliminates the need for passing in data that was already used in
>  * the creation of the filename, i.e. removes redundancy and need of extra
>  * looping at the higher level.

So in principle file names as well as data depend both on parameters
specified in the script and on input data. However, @ is weird here. I'm
thinking that this can be done fairly well with structs. For example:

type HeavyQuark {
    ParameterSet pset;
    HQData data;
}

and then a constructor:

(HeavyQuark quark) HeavyQuark(ParameterSet pset) {
    quark.pset = pset;
    quark.data = CreateHQ(pset);
}

app (HQData data) CreateHQ(ParameterSet pset) {
    wrapper "-k" pset.kappa "-w" pset.csw "-o" filename(data);
}

app (HeavyQuarkConv out) HQConv(HeavyQuark q) {
    wrapper "-k" q.pset.kappa "-w" q.pset.csw "-o" filename(out);
}

This would avoid a lot of complexity in how the parameter set would be
magically and not quite clearly moved around.

Ok, I'm a bit fuzzy on kappa. It's attached to the quark, but it's
supposed to be there before the quark is there.

>  */
> app (HeavyQuark out) CreateHQ (ParameterSet pset) {
>     wrapper -k out@kappa -w pset@csw -o @filename(out);
> }
> 
> app (HeavyQuarkConv out) HQConv (HeavyQuark q) {
>     wrapper -i q@kappa -o out;
> }
> 
> /**
>  * The parameters lq and wf are not defined explicitly in this example,
>  * but are similar to hq.
>  * We need extra functionality to extract a file name from a set based on a list
>  * of attributes; that is what hq.query () and lq.find () are trying to do.
>  * This adds random access to the container, linking it to the mapper, instead
>  * of just assuming index-only access.
>  * Not all attributes used for finding lq and hq files are used in this example.
>  */
> app (TwoPointHLFile hlFile) TwoPointHL (ParameterSet pset, HeavyQuarkConv[] hq,
>                                         LightQuark[] lq, WaveFile[] wf) {
> 
>     wrapper -o @filename (hlFile) -h hq.query (hlFile@kappa) \
>             -l lq.find (hlFile@kappa, hlFile@mass, hlFile@d1)
> }

So there are two choices here:
1. These query functions are written in Swift (perhaps unwise at this time)
2. The query functions are written in something else. In fact, they could
be mappers:

HeavyQuarkConv[] hqSubset1 ;

...

Mihael

> 
> /* creation only depends on the parameters in the parameter set */
> foreach qf in heavyQuarkFiles {
>     qf = CreateHQ (pset);
> }
> 
> /* conversion requires us to move two iterators at the same time */
> foreach qf,qfout in heavyQuarkFiles, heavyQuarkConvertedFiles {
>     qfout = HQConv (qf);
> }
> 
> /**
>  * This replaces the last set of nested loops of the 2ptHL workflow.
>  * The file objects contain all the information necessary to do the
>  * processing.
>  */
> foreach hl in twoPointHLFile {
>     hl = TwoPointHL (heavyQuarkConvertedFiles, lightQuarkFiles, waveFiles);
> }

From benc at hawaga.org.uk  Tue Jan 15 16:32:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 15 Jan 2008 22:32:55 +0000 (GMT)
Subject: [Swift-devel] Re: Phone meeting?
In-Reply-To: <1200428131.4848.53.camel@blabla.mcs.anl.gov>
References: <478586D4.9030606@fnal.gov> <478648BA.2090902@mcs.anl.gov> <4786575C.3020906@fnal.gov> <1200336254.13803.38.camel@blabla.mcs.anl.gov> <478CEA5E.9020701@fnal.gov> <1200428131.4848.53.camel@blabla.mcs.anl.gov>
Message-ID: 

On Tue, 15 Jan 2008, Mihael Hategan wrote:

> > See below for comments (from all of us).
> > You can forward this to your development list, since I don't think I can
> > post directly to it.
> 
> That's ok. Somebody will get a bounce and they will approve the post.
I didn't for this message; I guess Jim didn't post to swift-devel.

As Mihael suggests - if someone sane-looking posts to the list and it's me
that does the approval, that person will be approved for permanent posting
rights to the list.

-- 

From wilde at mcs.anl.gov  Tue Jan 15 18:50:38 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 15 Jan 2008 18:50:38 -0600
Subject: [Swift-devel] Re: Potential programming languages related projects on Swift for Ian's Grid course?
In-Reply-To: <000401c857d6$aadeb8c0$009c2a40$@uchicago.edu>
References: <000401c857d6$aadeb8c0$009c2a40$@uchicago.edu>
Message-ID: <478D54DE.3000904@mcs.anl.gov>

Hi Lars and Akiva,

Ben Clifford and Mihael Hategan are the Swift development team. Ben works
out of London, but travels here frequently. Mihael is based here at the CI
(and can often be found there).

I passed along a few of the ideas that I mentioned to you after class to
the group email list. (Ian's on the list btw).

For a class project, the difficulty will be scaling to something that can
be done in the available time. I don't quite know the scale that Ian is
expecting from the projects, but for starters, even for example if Mihael
and Ben do much or all of the implementation work for a new language
feature, a good project would be to test and demonstrate its use (and
performance) on a distributed system. (E.g., the cross-product looping
feature I mentioned to you).

A few ideas that might be more feasible would involve a swift
"preprocessor" where you could prototype new language features using
fairly simple preprocessors that emit swift code as output. For example,
the ability to call web services might be prototyped in this manner.

A good start would be for you to try the swift tutorial(s) and experiment
with it - make sure it's something you can work with. They are at:
http://www.ci.uchicago.edu/swift/docs/index.php

So this should hopefully kick off a discussion, and we'll see where it can
take you. Ben and I are headed into 3 weeks of intensive travel and
teaching, but hopefully you can meet face to face with Mihael to
brainstorm ideas. As we mull over things, the shape of a few good projects
is sure to emerge.

- Mike

On 1/15/08 6:28 PM, Lars Bergstrom wrote:
> After your presentation on Swift, Akiva and I were very interested in any
> extension projects to Swift that you might have in mind. In particular,
> we're interested in any language extensions or language implementation
> issues you have at hand, though we're both enthused by the project in
> general and would be interested in any ideas you or your team have for
> further fleshing out the tool that would fit inside of a quarter.
> 
> You mentioned that you would be out of town for a while, but would it be
> possible for you to pass this along to someone that we might be able to
> work with on this?
> 
> Thanks much,
> - Lars and Akiva
> 

From benc at hawaga.org.uk  Sat Jan 19 12:03:58 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 19 Jan 2008 13:03:58 -0500 (EST)
Subject: [Swift-devel] iterations on cartesian and inner products
In-Reply-To: <1200333400.13642.2.camel@blabla.mcs.anl.gov>
References: <1200333400.13642.2.camel@blabla.mcs.anl.gov>
Message-ID: 

On Mon, 14 Jan 2008, Mihael Hategan wrote:

> I'm thinking the following might be nice to have:
> 
> X[]
> Y[]
> 
> for (x, y) in (X, Y) {
> ...
> }
> 
> for (x, y) in (X * Y) {
> ...
> }

Rather than implementing it in the iteration construct, a Haskelly (and
more general) way to do this would be something like this:

i) put anonymously-typed tuples into the language

ii) have proper tuple assignment. we already have this sort-of for
executions, where you can say (x,y) = foo(); if foo() is a procedure, but
not with an expression. So you can't say (x,y) = @foo()

iii) put the two array product operations into the language or library, so
that you can in an expression construct the desired product

iv) make foreach understand tuple assignments - i.e. given an array of
pairs (or n-tuples) assign them in sequence to the loop variables

I think this would need anonymous tuple types in the language, but I think
I'd like that anyway - as we move towards unifying foo() and @foo() into a
single language construct, I think it's desirable to be able to figure out
sensibly what is wrong with 5 + foo() when foo returns (int a, int b) -
i.e. give a type error that (int) is incompatible with (int,int). At
present, 5 + foo() is a syntax error because you cannot invoke a procedure
inside an expression; but that would go away under foo/@foo unification.

-- 

From benc at hawaga.org.uk  Sat Jan 19 11:55:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 19 Jan 2008 12:55:55 -0500 (EST)
Subject: [Swift-devel] Re: Phone meeting?
In-Reply-To: <1200428131.4848.53.camel@blabla.mcs.anl.gov>
References: <478586D4.9030606@fnal.gov> <478648BA.2090902@mcs.anl.gov> <4786575C.3020906@fnal.gov> <1200336254.13803.38.camel@blabla.mcs.anl.gov> <478CEA5E.9020701@fnal.gov> <1200428131.4848.53.camel@blabla.mcs.anl.gov>
Message-ID: 

On Tue, 15 Jan 2008, Mihael Hategan wrote:

> > > app (quark output) cvt12x12 (file input) {
> > >     CVT12x12 "-v" "-i" @input "-o" @output;
> > > }
> > 
[...]
> I thought it would make more sense for reasons beyond cosmetics. The
> app{} block cannot, for example, be used in a foreach loop (or at least
> it isn't well defined there). So it is bound to a "procedure"
> definition.

This has been bugging me for a while. It's a pretty straightforward change
at the text-mode parser level, and as someone else seems to have come up
with the same solution that I've had buzzing round my brain for a while, I
guess that indicates some consensus.

> > One minor issue - would it be better to place this "function attribute"
> > closer to the function name, such as "(quark output) app cvt12x12 (file
> > input)"? Or perhaps after the function name?

> I'd personally like to follow known styles (e.g. Java), but I guess the
> parser can be made to understand "app" flexibly.

I'd prefer to follow Java/C syntax style as much as possible. I'd also
prefer to not have flexibility here - it complicates the parser which
tends to have a fairly direct consequence of making syntax error messages
less useful.

-- 

From benc at hawaga.org.uk  Sat Jan 19 12:32:54 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 19 Jan 2008 13:32:54 -0500 (EST)
Subject: [Swift-devel] Re: Potential programming languages related projects on Swift for Ian's Grid course?
In-Reply-To: <478D54DE.3000904@mcs.anl.gov>
References: <000401c857d6$aadeb8c0$009c2a40$@uchicago.edu> <478D54DE.3000904@mcs.anl.gov>
Message-ID: 

The provenance database prototypes that I've been working on recently are
perhaps at the stage where they would be interesting to work on as a
student project.
Specifically, I have implemented some stuff which takes details of Swift
runs (which procedures were executed, how they were executed, what files
they took as inputs and provided as outputs) and puts those details into
various forms of database (most completely SQL and XML databases at the
moment); but also I'm investigating (to a much lesser depth) several other
forms.

A. One project idea would be to more deeply investigate one of the formats
(such as RDF, prolog, graphgrep) that I haven't had time to investigate
properly (or some other form that I don't have on my list). The project
might include at least:

i) implementing code (which needn't be of production quality) to store
provenance data to the chosen format. this is not very hard and there are
examples for both SQL and XML already.

ii) figure out the necessary queries and methods to solve, for example,
the provenance challenge queries (which I have been using as some
motivating examples)

iii) write a report with the code from i, the queries from ii and some
discussion about whether this database/format seems useful / easy to query
/ high performance.

I'd be willing to provide a fair amount of support for this.

B. Another project idea would be to use the provenance prototype code in
some application - either one that exists already from one of the
application people on this list, or a simple application implemented as
the first part of the project; then do some interesting stuff there. This
is more vague - I'm not sure what the appropriate applications or the
interesting stuff would be.

-- 

From hategan at mcs.anl.gov  Sun Jan 20 10:31:16 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 20 Jan 2008 10:31:16 -0600
Subject: [Swift-devel] iterations on cartesian and inner products
In-Reply-To: 
References: <1200333400.13642.2.camel@blabla.mcs.anl.gov>
Message-ID: <1200846676.3335.2.camel@blabla.mcs.anl.gov>

On Sat, 2008-01-19 at 13:03 -0500, Ben Clifford wrote:
> Rather than implementing it in the iteration construct, a Haskelly (and
> more general) way to do this would be something like this:
> 
> i) put anonymously-typed tuples into the language

That was my initial train of thought, but my brain discarded it for being
too far away from our current implementation.

Mihael

From benc at hawaga.org.uk  Sun Jan 20 15:50:19 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 20 Jan 2008 21:50:19 +0000 (GMT)
Subject: [Swift-devel] iterations on cartesian and inner products
In-Reply-To: <1200846676.3335.2.camel@blabla.mcs.anl.gov>
References: <1200333400.13642.2.camel@blabla.mcs.anl.gov> <1200846676.3335.2.camel@blabla.mcs.anl.gov>
Message-ID: 

> > Rather than implementing it in the iteration construct, a Haskelly (and
> > more general) way to do this would be something like this:
> > 
> > i) put anonymously-typed tuples into the language
> 
> That was my initial train of thought, but my brain discarded it for
> being too far away from our current implementation.

it ties in with an area that's very rough in the current language and that
I'd like to fix; so maybe it isn't too far.

-- 

From wilde at mcs.anl.gov  Wed Jan 23 15:38:54 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 23 Jan 2008 15:38:54 -0600
Subject: [Swift-devel] Re: gridmap file for tg-uc
In-Reply-To: <57027693-8DC8-4025-87C6-1314F9E48D34@mcs.anl.gov>
References: <73084.37523.qm@web52302.mail.re2.yahoo.com> <57027693-8DC8-4025-87C6-1314F9E48D34@mcs.anl.gov>
Message-ID: <4797B3EE.20202@mcs.anl.gov>

info from Joe, below.

MikeK: where are you expecting files to move from and to in this script?

- MikeW


On 1/23/08 3:33 PM, joseph insley wrote:
> Mike,
> 
> You are indeed in the grid-mapfile on the UC/ANL TeraGrid resource:
> 
> insley at tg-viz-login1:~> grep Kubal /etc/grid-security/grid-mapfile
> "/DC=org/DC=doegrids/OU=People/CN=Michael Kubal 486347" kubal
> 
> However, looks like the error is coming from the Teraport resource, not
> TeraGrid. I am not as knowledgeable about that system, but I would
> assume that the same gx-request command should work there as well.
> 
> joe.
> 
> On Jan 23, 2008, at 3:15 PM, Mike Kubal wrote:
> 
>> Hi Joe and Mike,
>>
>> I used the gx-request again to add myself to the
>> gridmap on UC's teragrid again. It seemed to go
>> smoothly, just like the time before.
>>
>> Though I still get the following error message when
>> running my Swift script:
>>
>> Execution failed:
>>         Could not initialize shared directory on
>> teraport
>> Caused by:
>>
>> org.globus.cog.abstraction.impl.file.FileResourceException:
>> Error communicating with the GridFTP server
>> Caused by:
>>         Server refused performing the request. Custom
>> message: Bad password. (error code 1) [Nested
>> exception message: Custom message: Unexpected reply:
>> 530-Login incorrect. :
>> gridmap.c:globus_gss_assist_map_and_authorize:1910:
>> 530-Error invoking callout
>> 530-globus_callout.c:globus_callout_handle_call_type:727:
>> 530-The callout returned an error
>> 530-prima_module.c:Globus Gridmap Callout:430:
>> 530-Gridmap lookup failure: Could not retrieve mapping
>> for /DC=org/DC=doegrids/OU=People/CN=Michael Kubal
>> 486347 from identity mapping server
>> 530-
>> 530 End.]
>>
>> Before running I also removed the certificates dir in
>> ~mkubal/.globus.
>>
>> Any suggestions?
>>
>> Thanks,
>>
>> Mike
>>
>>
>> --- joseph insley wrote:
>>
>>> Mike,
>>>
>>> You should be able to do this yourself using the
>>> gx-request command.
>>> See more info at:
>>> http://www.teragrid.org/userinfo/access/auth_gxmap.php
>>>
>>> joe.
>>>
>>> On Oct 29, 2007, at 2:07 PM, Mike Kubal wrote:
>>>
>>>> Hi Joe,
>>>>
>>>> Please add me to gridmap file for tg-uc. My subject is
>>>> /DC=org/DC=doegrids/OU=People/CN=Michael Kubal 486347
>>>>
>>>> My user name is kubal.
>>>>
>>>> Thanks,
>>>>
>>>> Mike
> 
> ===================================================
> 
> joseph a. insley
> insley at mcs.anl.gov
> 
> mathematics & computer science division    (630) 252-5649
> argonne national laboratory                (630) 252-5986 (fax)
> 

From benc at hawaga.org.uk  Wed Jan 23 15:46:24 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 23 Jan 2008 21:46:24 +0000 (GMT)
Subject: [Swift-devel] Re: gridmap file for tg-uc
In-Reply-To: <4797B3EE.20202@mcs.anl.gov>
References: <73084.37523.qm@web52302.mail.re2.yahoo.com> <57027693-8DC8-4025-87C6-1314F9E48D34@mcs.anl.gov> <4797B3EE.20202@mcs.anl.gov>
Message-ID: 

yes, looks like an OSG rather than teragrid error.

mike kubal, send the sites.xml file that you are using.

On Wed, 23 Jan 2008, Michael Wilde wrote:

> info from Joe, below.
> > MikeK: where are you expecting files to move from and to in this script? > > - MikeW > > > On 1/23/08 3:33 PM, joseph insley wrote: > > Mike, > > > > You are indeed n the grid-mapfile on the UC/ANL TeraGrid resource: > > > > insley at tg-viz-login1:~> grep Kubal /etc/grid-security/grid-mapfile > > "/DC=org/DC=doegrids/OU=People/CN=Michael Kubal 486347" kubal > > > > However, looks like the error is coming from the Teraport resource, not > > TeraGrid. I am not as knowledgeable about that system, but I would assume > > that the same gx-request command should work there as well. > > > > joe. > > > > On Jan 23, 2008, at 3:15 PM, Mike Kubal wrote: > > > > > Hi Joe and Mike, > > > > > > I used the gx-request again to add myself to the > > > gridmap on UC's teragrid again. It seemed to go > > > smoothly, just like the time before. > > > > > > Though I still get the following error message when > > > running my Swift script: > > > > > > Execution failed: > > > Could not initialize shared directory on > > > teraport > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.file.FileResourceException: > > > Error communicating with the GridFTP server > > > Caused by: > > > Server refused performing the request. Custom > > > message: Bad password. (error code 1) [Nested > > > exception message: Custom message: Unexpected reply: > > > 530-Login incorrect. : > > > gridmap.c:globus_gss_assist_map_and_authorize:1910: > > > 530-Error invoking callout > > > 530-globus_callout.c:globus_callout_handle_call_type:727: > > > 530-The callout returned an error > > > 530-prima_module.c:Globus Gridmap Callout:430: > > > 530-Gridmap lookup failure: Could not retrieve mapping > > > for /DC=org/DC=doegrids/OU=People/CN=Michael Kubal > > > 486347 from identity mapping server > > > 530- > > > 530 End.] > > > > > > Before running I also removed the certificates dir in > > > ~mkubal/.globus. > > > > > > Any suggestions? > > > > > > Thanks, > > > > > > Mike > > > > > > > > > --- joseph insley > wrote: > > > > > > > Mike, > > > > > > > > You should be able to do this yourself using the > > > > gx-request command. > > > > See more info at: > > > > > > > http://www.teragrid.org/userinfo/access/auth_gxmap.php > > > > > > > > > > > > joe. > > > > > > > > On Oct 29, 2007, at 2:07 PM, Mike Kubal wrote: > > > > > > > > > Hi Joe, > > > > > > > > > > Please add me to gridmap file for tg-uc. My > > > > subject is > > > > > /DC=org/DC=doegrids/OU=People/CN=Michael Kubal > > > > 486347 > > > > > > > > > > My user name is kubal. > > > > > > > > > > Thanks, > > > > > > > > > > Mike > > > > > > > > > > __________________________________________________ > > > > > Do You Yahoo!? > > > > > Tired of spam? Yahoo! Mail has the best spam > > > > protection around > > > > > http://mail.yahoo.com > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it > > > now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > =================================================== > > > > joseph a. 
insley > > insley at mcs.anl.gov > > > > mathematics & computer science division (630) 252-5649 > > > > argonne national laboratory (630) 252-5986 > > (fax) > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From mikekubal at yahoo.com Wed Jan 23 16:07:39 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Wed, 23 Jan 2008 14:07:39 -0800 (PST) Subject: [Swift-devel] Re: gridmap file for tg-uc In-Reply-To: Message-ID: <467545.88384.qm@web52312.mail.re2.yahoo.com> Here's a tar file of the script and the logs. I get the same error when running against teraport. Thanks, Mike --- Ben Clifford wrote: > > yes, looks like an OSG rather than teragrid error. > > mike kubal, send the sites.xml file that you are > using. > > On Wed, 23 Jan 2008, Michael Wilde wrote: > > > info from Joe, below. > > > > MikeK: where are you expecting files to move from > and to in this script? > > > > - MikeW > > > > > > On 1/23/08 3:33 PM, joseph insley wrote: > > > Mike, > > > > > > You are indeed n the grid-mapfile on the UC/ANL > TeraGrid resource: > > > > > > insley at tg-viz-login1:~> grep Kubal > /etc/grid-security/grid-mapfile > > > "/DC=org/DC=doegrids/OU=People/CN=Michael Kubal > 486347" kubal > > > > > > However, looks like the error is coming from the > Teraport resource, not > > > TeraGrid. I am not as knowledgeable about that > system, but I would assume > > > that the same gx-request command should work > there as well. > > > > > > joe. > > > > > > On Jan 23, 2008, at 3:15 PM, Mike Kubal wrote: > > > > > > > Hi Joe and Mike, > > > > > > > > I used the gx-request again to add myself to > the > > > > gridmap on UC's teragrid again. It seemed to > go > > > > smoothly, just like the time before. > > > > > > > > Though I still get the following error message > when > > > > running my Swift script: > > > > > > > > Execution failed: > > > > Could not initialize shared directory > on > > > > teraport > > > > Caused by: > > > > > > > > > org.globus.cog.abstraction.impl.file.FileResourceException: > > > > Error communicating with the GridFTP server > > > > Caused by: > > > > Server refused performing the request. > Custom > > > > message: Bad password. (error code 1) [Nested > > > > exception message: Custom message: Unexpected > reply: > > > > 530-Login incorrect. : > > > > > gridmap.c:globus_gss_assist_map_and_authorize:1910: > > > > 530-Error invoking callout > > > > > 530-globus_callout.c:globus_callout_handle_call_type:727: > > > > 530-The callout returned an error > > > > 530-prima_module.c:Globus Gridmap Callout:430: > > > > 530-Gridmap lookup failure: Could not retrieve > mapping > > > > for /DC=org/DC=doegrids/OU=People/CN=Michael > Kubal > > > > 486347 from identity mapping server > > > > 530- > > > > 530 End.] > > > > > > > > Before running I also removed the certificates > dir in > > > > ~mkubal/.globus. > > > > > > > > Any suggestions? > > > > > > > > Thanks, > > > > > > > > Mike > > > > > > > > > > > > --- joseph insley > wrote: > > > > > > > > > Mike, > > > > > > > > > > You should be able to do this yourself using > the > > > > > gx-request command. > > > > > See more info at: > > > > > > > > > > http://www.teragrid.org/userinfo/access/auth_gxmap.php > > > > > > > > > > > > > > > joe. > > > > > > > > > > On Oct 29, 2007, at 2:07 PM, Mike Kubal > wrote: > > > > > > > > > > > Hi Joe, > > > > > > > > > > > > Please add me to gridmap file for tg-uc. 
> My > > > > > subject is > > > > > > /DC=org/DC=doegrids/OU=People/CN=Michael > Kubal > > > > > 486347 > > > > > > > > > > > > My user name is kubal. > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Mike > > > > > > > > > > > > > __________________________________________________ > > > > > > Do You Yahoo!? > > > > > > Tired of spam? Yahoo! Mail has the best > spam > > > > > protection around > > > > > > http://mail.yahoo.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > Be a better friend, newshound, and know-it-all > with Yahoo! Mobile. Try it > > > > now. > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > > > > > =================================================== > > > > > > joseph a. insley > > > insley at mcs.anl.gov > > > > > > mathematics & computer science division > (630) 252-5649 > > > > > > argonne national laboratory > (630) 252-5986 > > > (fax) > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs -------------- next part -------------- A non-text attachment was scrubbed... Name: swift_job.tar Type: application/x-tar Size: 40960 bytes Desc: 2238658463-swift_job.tar URL: From bugzilla-daemon at mcs.anl.gov Thu Jan 24 17:07:07 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 24 Jan 2008 17:07:07 -0600 (CST) Subject: [Swift-devel] [Bug 125] New: Single line struct declarations are not quite right Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=125 Summary: Single line struct declarations are not quite right Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: hategan at mcs.anl.gov type LogScale { string x, y, z; } translation: -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From simone at fnal.gov Fri Jan 25 10:20:35 2008 From: simone at fnal.gov (James Simone) Date: Fri, 25 Jan 2008 10:20:35 -0600 Subject: [Swift-devel] Re: Question and update on swift script In-Reply-To: <476064D0.7050208@mcs.anl.gov> References: <476064D0.7050208@mcs.anl.gov> Message-ID: <479A0C53.2060108@fnal.gov> Hi Mike, Could you please send us your target version of the 2-pt QCD workflow? The target swiftscipt would not necessarily be able to run with current swift, but it would include syntax changes that are/will be under development. Thanks, --jim Michael Wilde wrote: > Hi all, > > I made some good progress yesterday in understanding the swift code for > the 2ptHL workflow, and started revising it. 
> 
> I'm doing the following:
> - giving all files a mnemonic type name
> - creating compound types that encapsulate each data/info file pair
> - putting each set of nested foreach loops into a procedure (for better
> abstraction)
> - changing the mapper to tie it to each data type
> 
> For now, I'm also pulling the two cache-loading functions out of the
> workflow, as it seems like these should be part of the runtime
> environment, rather than in the workflow script. Do you feel the same?
> 
> I *thought* that you had a set of wrappers that were python stubs for
> simulated execution, but the wrappers Don sent me looked like wrappers
> that call the *real* code. So for now I'll create my own simple stubs to
> test the data flow with.
> 
> I've got many more questions that I'm collecting, but for now I only need
> the answer to this one:
> 
> In the nested loops that call the function Onia() (lines 136-148 in the
> attached numbered listing), is the actual call at line 145 correct?
> This call is passing the same "HeavyQuarkConverted" as both the
> anti-quark and quark1. I'm assuming that is the correct intent.
> Also, it's passing only the wave[1] wave file (1S) each time (second of
> the three wave files d, 1S, 2S). (Same is true for BStaggered).
> 
> Lastly, Onia seems to be getting called with successive pairs of
> converted heavy quark files, but it seems like the final call is going
> to get a null file for the second file, as the value of "bindex" will be
> one past the last converted heavy quark file computed.
> 
> I'm looking for ways to avoid the way the current script needs to compute
> the various array indices, but I'm not likely to find something that
> doesn't require a language change. Was this approach of computing the
> indices something that your team liked, or did not like, about this
> swift script?
> 
> I'll try to send more details and questions as I proceed.
> A few of these are below. If you have a moment to consider these, that
> will help me get a better understanding of the environment.
> 
> Thanks,
> 
> - Mike
> 
> qcopy: I need to learn more about how you manage data and use dcache,
> but for now, I get the essence of this. Can you point me to a qcopy doc
> page though? I couldn't find one.
> 
> tag_array_mapper: I can guess how this works, but can you send me the
> code for it? It seems like the order in which it fills the array must
> match the index computations in your swift script. Looks to me like the
> leftmost tag in the format script is varied fastest (i.e., is the "least
> significant index"). Is this correct?
> 
> kappa value passing bug: you allude to this in a comment. Was this a
> swift problem or a problem in the py wrapper? Resolved or still open?
> If Swift, I can test to see if I can reproduce it. But I suspect you
> were testing against a pretty old version of swift?
> 
> Is the notion of an "info" file paired with most data files a standard
> metadata convention for LQCD? (I.e., I'm assuming that this is done
> throughout your apps, not just in this swift example, right?) If so, it
> seems to justify a mapping convention so that you can simply pass a
> "Quark" object, and have the data and info files passed automatically,
> together. You can then dereference the object to extract each field
> (file).
> 
> Are the file naming conventions set by the tag mapper the ones already
> in use by the current workflow system? I.e., the order of the foreach
> loops and hence of the mappings was chosen carefully to match the
> existing file-naming conventions?
>
> How is the name of the ensemble chosen? Does it have a relation to the
> phyparams? Is it a database key? (It seems opaque to the current swift
> example. Is that by preference or was there a desire to expose its
> structure? Are its contents related to the phyparams? Or, when looked
> up in a DB, does it yield the phyparams?)
>

From mikekubal at yahoo.com Fri Jan 25 13:26:29 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Fri, 25 Jan 2008 11:26:29 -0800 (PST)
Subject: [Swift-devel] Re: gridmap file for tg-uc
Message-ID: <345055.73057.qm@web52303.mail.re2.yahoo.com>

Perhaps this will provide some insight into the problem. I was able to
get the Swift script to run successfully and execute on the UC teragrid
if I used myproxy-logon, instead of grid-proxy-init, to establish my
credentials.

I did get the following warning:

2008-01-25 13:05:42,266-0600 WARN  JobSubmissionTaskHandler Failed to clean up job
java.lang.IllegalStateException: No registered callback handler for org.globus.gsi.gssapi.GlobusGSSCredentialImpl at ee5a06
        at org.globus.cog.abstraction.impl.execution.gt2.CallbackHandlerManager.decreaseUsageCount(CallbackHandlerManager.java:33)
        at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.cleanup(JobSubmissionTaskHandler.java:482)
        at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:475)
        at org.globus.gram.GramJob.setStatus(GramJob.java:184)
        at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
        at java.lang.Thread.run(Thread.java:595)

--- Mike Kubal wrote:

> Here's a tar file of the script and the logs.
>
> I get the same error when running against teraport.
>
> Thanks,
>
> Mike
>
> --- Ben Clifford wrote:
> >
> > yes, looks like an OSG rather than teragrid error.
> >
> > mike kubal, send the sites.xml file that you are using.
> >
> > On Wed, 23 Jan 2008, Michael Wilde wrote:
> >
> > > info from Joe, below.
> > >
> > > MikeK: where are you expecting files to move from and to in this
> > > script?
> > >
> > > - MikeW
> > >
> > > On 1/23/08 3:33 PM, joseph insley wrote:
> > > > Mike,
> > > >
> > > > You are indeed in the grid-mapfile on the UC/ANL TeraGrid resource:
> > > >
> > > > insley at tg-viz-login1:~> grep Kubal /etc/grid-security/grid-mapfile
> > > > "/DC=org/DC=doegrids/OU=People/CN=Michael Kubal 486347" kubal
> > > >
> > > > However, looks like the error is coming from the Teraport
> > > > resource, not TeraGrid. I am not as knowledgeable about that
> > > > system, but I would assume that the same gx-request command
> > > > should work there as well.
> > > >
> > > > joe.
> > > >
> > > > On Jan 23, 2008, at 3:15 PM, Mike Kubal wrote:
> > > >
> > > > > Hi Joe and Mike,
> > > > >
> > > > > I used gx-request again to add myself to the gridmap on UC's
> > > > > teragrid. It seemed to go smoothly, just like the time before.
> > > > >
> > > > > Though I still get the following error message when running my
> > > > > Swift script:
> > > > >
> > > > > Execution failed:
> > > > >     Could not initialize shared directory on teraport
> > > > > Caused by:
> > > > >     org.globus.cog.abstraction.impl.file.FileResourceException:
> > > > >     Error communicating with the GridFTP server
> > > > > Caused by:
> > > > >     Server refused performing the request. Custom message: Bad password.
> > > > >     (error code 1) [Nested exception message: Custom message:
> > > > >     Unexpected reply:
> > > > >     530-Login incorrect. :
> > > > >     gridmap.c:globus_gss_assist_map_and_authorize:1910:
> > > > >     530-Error invoking callout
> > > > >     530-globus_callout.c:globus_callout_handle_call_type:727:
> > > > >     530-The callout returned an error
> > > > >     530-prima_module.c:Globus Gridmap Callout:430:
> > > > >     530-Gridmap lookup failure: Could not retrieve mapping for
> > > > >     /DC=org/DC=doegrids/OU=People/CN=Michael Kubal 486347 from
> > > > >     identity mapping server
> > > > >     530-
> > > > >     530 End.]
> > > > >
> > > > > Before running I also removed the certificates dir in
> > > > > ~mkubal/.globus.
> > > > >
> > > > > Any suggestions?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Mike
> > > > >
> > > > > --- joseph insley wrote:
> > > > >
> > > > > > Mike,
> > > > > >
> > > > > > You should be able to do this yourself using the gx-request
> > > > > > command. See more info at:
> > > > > > http://www.teragrid.org/userinfo/access/auth_gxmap.php
> > > > > >
> > > > > > joe.
> > > > > >
> > > > > > On Oct 29, 2007, at 2:07 PM, Mike Kubal wrote:
> > > > > >
> > > > > > > Hi Joe,
> > > > > > >
> > > > > > > Please add me to gridmap file for tg-uc. My subject is
> > > > > > > /DC=org/DC=doegrids/OU=People/CN=Michael Kubal 486347
> > > > > > >
> > > > > > > My user name is kubal.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Mike
> > > > > > >
> > > > > > > __________________________________________________
> > > > > > > Do You Yahoo!?
> > > > > > > Tired of spam? Yahoo! Mail has the best spam protection around
> > > > > > > http://mail.yahoo.com
> > > > >
> > > > > ____________________________________________________________________________________
> > > > > Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now.
> > > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
> > > >
> > > > ===================================================
> > > > joseph a. insley                                insley at mcs.anl.gov
> > > > mathematics & computer science division         (630) 252-5649
> > > > argonne national laboratory                     (630) 252-5986 (fax)
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

=== message truncated ===

____________________________________________________________________________________
Never miss a thing. Make Yahoo your home page.
http://www.yahoo.com/r/hs

From benc at hawaga.org.uk Fri Jan 25 13:28:37 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 25 Jan 2008 19:28:37 +0000 (GMT)
Subject: [Swift-devel] Re: gridmap file for tg-uc
In-Reply-To: <345055.73057.qm@web52303.mail.re2.yahoo.com>
References: <345055.73057.qm@web52303.mail.re2.yahoo.com>
Message-ID: 

On Fri, 25 Jan 2008, Mike Kubal wrote:

> I was able to get the Swift script to run successfully and execute on
> the UC teragrid if I used myproxy-logon, instead of grid-proxy-init,
> to establish my credentials.
Do myproxy-logon and then send the output of grid-proxy-info.

Then do a grid-proxy-init and send the output of grid-proxy-info again.

-- 

From hategan at mcs.anl.gov Fri Jan 25 13:39:15 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 25 Jan 2008 13:39:15 -0600
Subject: [Swift-devel] Re: gridmap file for tg-uc
In-Reply-To: 
References: <345055.73057.qm@web52303.mail.re2.yahoo.com>
Message-ID: <1201289955.4010.0.camel@blabla.mcs.anl.gov>

On Fri, 2008-01-25 at 19:28 +0000, Ben Clifford wrote:
>
> On Fri, 25 Jan 2008, Mike Kubal wrote:
>
> > I was able to get the Swift script to run successfully and execute
> > on the UC teragrid if I used myproxy-logon, instead of
> > grid-proxy-init, to establish my credentials.
>
> Do myproxy-logon and then send the output of grid-proxy-info.
>
> Then do a grid-proxy-init and send the output of grid-proxy-info again.

And while we're at it, can you also send the output of "which
grid-proxy-init"?

From mikekubal at yahoo.com Fri Jan 25 14:29:35 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Fri, 25 Jan 2008 12:29:35 -0800 (PST)
Subject: [Swift-devel] Re: gridmap file for tg-uc
In-Reply-To: <1201289955.4010.0.camel@blabla.mcs.anl.gov>
Message-ID: <489401.46742.qm@web52312.mail.re2.yahoo.com>

myproxy-logon returns:

A proxy has been received for user mkubal in /tmp/x509up_u2902.

then grid-proxy-info returns:

subject  : /C=US/O=National Center for Supercomputing Applications/CN=Michael Kubal
issuer   : /C=US/O=National Center for Supercomputing Applications/OU=Certificate Authorities/CN=MyProxy
identity : /C=US/O=National Center for Supercomputing Applications/CN=Michael Kubal
type     : end entity credential
strength : 512 bits
path     : /tmp/x509up_u2902
timeleft : 11:59:06

after doing a grid-proxy-init, grid-proxy-info returns:

subject  : /DC=org/DC=doegrids/OU=People/CN=Michael Kubal 486347/CN=515513296
issuer   : /DC=org/DC=doegrids/OU=People/CN=Michael Kubal 486347
identity : /DC=org/DC=doegrids/OU=People/CN=Michael Kubal 486347
type     : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u2902
timeleft : 11:59:47

`which grid-proxy-init` returns:
/sandbox/software/globus-4.0.1/bin/grid-proxy-init

cheers,

mike

--- Mihael Hategan wrote:

> On Fri, 2008-01-25 at 19:28 +0000, Ben Clifford wrote:
> >
> > On Fri, 25 Jan 2008, Mike Kubal wrote:
> >
> > > I was able to get the Swift script to run successfully and
> > > execute on the UC teragrid if I used myproxy-logon, instead of
> > > grid-proxy-init, to establish my credentials.
> >
> > Do myproxy-logon and then send the output of grid-proxy-info.
> >
> > Then do a grid-proxy-init and send the output of grid-proxy-info
> > again.
>
> And while we're at it, can you also send the output of "which
> grid-proxy-init"?
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

____________________________________________________________________________________
Never miss a thing. Make Yahoo your home page.
http://www.yahoo.com/r/hs

From tiberius at ci.uchicago.edu Mon Jan 28 15:51:24 2008
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Mon, 28 Jan 2008 15:51:24 -0600
Subject: [Swift-devel] RFF (request for feature)
Message-ID: 

Hi gang,

I find myself in need of a queuing facility in swift with the following
operations:

createQ
submitQ(function)
triggerQ(function, #jobs in queue) - to signal empty queues, for instance
deleteQ

I would think that in addition to atomic functions and composite
functions, we will have the queue facility acting as an intermediary.

Is any of this possible/doable in a data-flow language?

Thanks
Tibi

-- 
Tiberiu (Tibi) Stef-Praun, PhD
Research Staff, Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/

From mikekubal at yahoo.com Mon Jan 28 19:13:14 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Mon, 28 Jan 2008 17:13:14 -0800 (PST)
Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG
In-Reply-To: 
Message-ID: <921658.18899.qm@web52308.mail.re2.yahoo.com>

Yes, I'm submitting molecular dynamics simulations using Swift.

Is there a default wall-time limit for jobs on tg-uc?

--- joseph insley wrote:

> Actually, these numbers are now escalating...
>
> top - 17:18:54 up 2:29, 1 user, load average: 149.02, 123.63, 91.94
> Tasks: 469 total, 4 running, 465 sleeping, 0 stopped, 0 zombie
>
> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> 479
>
> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> GRAM Authentication test successful
> real 0m26.134s
> user 0m0.090s
> sys 0m0.010s
>
> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
>
> > Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> > became unresponsive and had to be rebooted. I am now seeing slow
> > response times from the Gatekeeper there again. Authenticating to
> > the gatekeeper should only take a second or two, but it is
> > periodically taking up to 16 seconds:
> >
> > insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> > GRAM Authentication test successful
> > real 0m16.096s
> > user 0m0.060s
> > sys 0m0.020s
> >
> > looking at the load on tg-grid, it is rather high:
> >
> > top - 16:55:26 up 2:06, 1 user, load average: 89.59, 78.69, 62.92
> > Tasks: 398 total, 20 running, 378 sleeping, 0 stopped, 0 zombie
> >
> > And there appear to be a large number of processes owned by kubal:
> > insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> > 380
> >
> > I assume that Mike is using swift to do the job submission. Is
> > there some throttling of the rate at which jobs are submitted to
> > the gatekeeper that could be done that would lighten this load
> > some? (Or has that already been done since earlier today?) The
> > current response times are not unacceptable, but I'm hoping to
> > avoid having the machine grind to a halt as it did earlier today.
> >
> > Thanks,
> > joe.
> >
> > ===================================================
> > joseph a. insley                              insley at mcs.anl.gov
> > mathematics & computer science division       (630) 252-5649
> > argonne national laboratory                   (630) 252-5986 (fax)
>
> ===================================================
> joseph a.
insley > > insley at mcs.anl.gov > mathematics & computer science division (630) > 252-5649 > argonne national laboratory > (630) > 252-5986 (fax) > > > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From mikekubal at yahoo.com Tue Jan 29 09:01:38 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 29 Jan 2008 07:01:38 -0800 (PST) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <66C07B61-ED5A-40F3-BC73-899D591EC8BF@mcs.anl.gov> Message-ID: <427798.76891.qm@web52301.mail.re2.yahoo.com> Yes, I have a TG project. I should have updated my swift sites file after initial testing. Is it common for the TG project id to be the same as the local with just the 'TG-' and 'UC'- prefix switched? --- Ti Leggett wrote: > Also, it looks like you're using a local project and > not a TG project. > We are not able to report this usage to NSF because > it counts against > our discretionary usage. Do you have a TG project? > If not can you > request a DACC allocation to use? > > On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal > wrote: > > > Yes, I'm submitting molecular dynamics simulations > > using Swift. > > > > Is there a default wall-time limit for jobs on > tg-uc? > > > > > > > > --- joseph insley wrote: > > > >> Actually, these numbers are now escalating... > >> > >> top - 17:18:54 up 2:29, 1 user, load average: > >> 149.02, 123.63, 91.94 > >> Tasks: 469 total, 4 running, 465 sleeping, 0 > >> stopped, 0 zombie > >> > >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >> 479 > >> > >> insley at tg-viz-login1:~> time globusrun -a -r > >> tg-grid.uc.teragrid.org > >> GRAM Authentication test successful > >> real 0m26.134s > >> user 0m0.090s > >> sys 0m0.010s > >> > >> > >> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > >> > >>> Earlier today tg-grid.uc.teragrid.org (the > UC/ANL > >> TG GRAM host) > >>> became unresponsive and had to be rebooted. I > am > >> now seeing slow > >>> response times from the Gatekeeper there again. > >> Authenticating to > >>> the gatekeeper should only take a second or two, > >> but it is > >>> periodically taking up to 16 seconds: > >>> > >>> insley at tg-viz-login1:~> time globusrun -a -r > >> tg-grid.uc.teragrid.org > >>> GRAM Authentication test successful > >>> real 0m16.096s > >>> user 0m0.060s > >>> sys 0m0.020s > >>> > >>> looking at the load on tg-grid, it is rather > high: > >>> > >>> top - 16:55:26 up 2:06, 1 user, load average: > >> 89.59, 78.69, 62.92 > >>> Tasks: 398 total, 20 running, 378 sleeping, 0 > >> stopped, 0 zombie > >>> > >>> And there appear to be a large number of > processes > >> owned by kubal: > >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>> 380 > >>> > >>> I assume that Mike is using swift to do the job > >> submission. Is > >>> there some throttling of the rate at which jobs > >> are submitted to > >>> the gatekeeper that could be done that would > >> lighten this load > >>> some? (Or has that already been done since > >> earlier today?) The > >>> current response times are not unacceptable, but > >> I'm hoping to > >>> avoid having the machine grind to a halt as it > did > >> earlier today. > >>> > >>> Thanks, > >>> joe. > >>> > >>> > >>> > >> > =================================================== > >>> joseph a. 
> >>> insley > >> > >>> insley at mcs.anl.gov > >>> mathematics & computer science division > >> (630) 252-5649 > >>> argonne national laboratory > >> (630) > >>> 252-5986 (fax) > >>> > >>> > >> > >> > =================================================== > >> joseph a. insley > >> > >> insley at mcs.anl.gov > >> mathematics & computer science division > (630) > >> 252-5649 > >> argonne national laboratory > >> (630) > >> 252-5986 (fax) > >> > >> > >> > > > > > > > > > > > ____________________________________________________________________________________ > > Be a better friend, newshound, and > > know-it-all with Yahoo! Mobile. Try it now. > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From wilde at mcs.anl.gov Tue Jan 29 10:11:51 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 29 Jan 2008 10:11:51 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> Message-ID: <479F5047.2040805@mcs.anl.gov> [ was Re: Swift jobs on UC/ANL TG ] Hi. Im at OHare and will be flying soon. Ben or Mihael, if you are online, can you investigate? Yes, there are significant throttles turned on by default, and the system opens those very gradually. MikeK, can you post to the swift-devel list your swift.properties file, command line options, and your swift source code? Thanks, MikeW On 1/29/08 8:11 AM, Ti Leggett wrote: > The default walltime is 15 minutes. Are you doing fork jobs or pbs jobs? > You shouldn't be doing fork jobs at all. Mike W, I thought there were > throttles in place in Swift to prevent this type of overrun? Mike K, > I'll need you to either stop these types of jobs until Mike W can verify > throttling or only submit a few 10s of jobs at a time. > > On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: > >> Yes, I'm submitting molecular dynamics simulations >> using Swift. >> >> Is there a default wall-time limit for jobs on tg-uc? >> >> >> >> --- joseph insley wrote: >> >>> Actually, these numbers are now escalating... >>> >>> top - 17:18:54 up 2:29, 1 user, load average: >>> 149.02, 123.63, 91.94 >>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>> stopped, 0 zombie >>> >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>> 479 >>> >>> insley at tg-viz-login1:~> time globusrun -a -r >>> tg-grid.uc.teragrid.org >>> GRAM Authentication test successful >>> real 0m26.134s >>> user 0m0.090s >>> sys 0m0.010s >>> >>> >>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>> >>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>> TG GRAM host) >>>> became unresponsive and had to be rebooted. I am >>> now seeing slow >>>> response times from the Gatekeeper there again. 
>>> Authenticating to >>>> the gatekeeper should only take a second or two, >>> but it is >>>> periodically taking up to 16 seconds: >>>> >>>> insley at tg-viz-login1:~> time globusrun -a -r >>> tg-grid.uc.teragrid.org >>>> GRAM Authentication test successful >>>> real 0m16.096s >>>> user 0m0.060s >>>> sys 0m0.020s >>>> >>>> looking at the load on tg-grid, it is rather high: >>>> >>>> top - 16:55:26 up 2:06, 1 user, load average: >>> 89.59, 78.69, 62.92 >>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>> stopped, 0 zombie >>>> >>>> And there appear to be a large number of processes >>> owned by kubal: >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>> 380 >>>> >>>> I assume that Mike is using swift to do the job >>> submission. Is >>>> there some throttling of the rate at which jobs >>> are submitted to >>>> the gatekeeper that could be done that would >>> lighten this load >>>> some? (Or has that already been done since >>> earlier today?) The >>>> current response times are not unacceptable, but >>> I'm hoping to >>>> avoid having the machine grind to a halt as it did >>> earlier today. >>>> >>>> Thanks, >>>> joe. >>>> >>>> >>>> >>> =================================================== >>>> joseph a. >>>> insley >>> >>>> insley at mcs.anl.gov >>>> mathematics & computer science division >>> (630) 252-5649 >>>> argonne national laboratory >>> (630) >>>> 252-5986 (fax) >>>> >>>> >>> >>> =================================================== >>> joseph a. insley >>> >>> insley at mcs.anl.gov >>> mathematics & computer science division (630) >>> 252-5649 >>> argonne national laboratory >>> (630) >>> 252-5986 (fax) >>> >>> >>> >> >> >> >> >> ____________________________________________________________________________________ >> >> Be a better friend, newshound, and >> know-it-all with Yahoo! Mobile. Try it now. >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >> > > From skenny at uchicago.edu Tue Jan 29 11:09:36 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 29 Jan 2008 11:09:36 -0600 (CST) Subject: [Swift-devel] Fwd: kickstart error Message-ID: <20080129110936.AZI34643@m4500-02.uchicago.edu> ok, tried sending this to swift-user but it doesn't seem to have gone thru (?) so i'm trying the dev list. -------------- next part -------------- An embedded message was scrubbed... From: Subject: kickstart error Date: Tue, 29 Jan 2008 10:31:19 -0600 (CST) Size: 2217 URL: From foster at mcs.anl.gov Tue Jan 29 13:15:51 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Tue, 29 Jan 2008 13:15:51 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <479F5047.2040805@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> Message-ID: <479F7B67.7070400@mcs.anl.gov> Hi, I've CCed Stuart Martin--I'd greatly appreciate some insights into what is causing this. I assume that you are using GRAM4 (aka WS-GRAM)? Ian. Michael Wilde wrote: > [ was Re: Swift jobs on UC/ANL TG ] > > Hi. Im at OHare and will be flying soon. > Ben or Mihael, if you are online, can you investigate? > > Yes, there are significant throttles turned on by default, and the > system opens those very gradually. > > MikeK, can you post to the swift-devel list your swift.properties > file, command line options, and your swift source code? > > Thanks, > > MikeW > > > On 1/29/08 8:11 AM, Ti Leggett wrote: >> The default walltime is 15 minutes. 
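A side note on the walltime question raised above: Swift of this era
could request a longer walltime per site rather than falling back on the
gatekeeper default. The sites.xml schema varies by Swift version, so the
following is an assumed sketch with placeholder host and path values,
not verified config:

    <pool handle="uc-teragrid">
      <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" major="2" minor="4" patch="0"/>
      <jobmanager universe="vanilla" url="tg-grid.uc.teragrid.org/jobmanager-pbs" major="2" minor="4" patch="0"/>
      <!-- request 60 minutes instead of the site's 15-minute default -->
      <profile namespace="globus" key="maxwalltime">60</profile>
      <workdirectory>/home/kubal/swiftwork</workdirectory>
    </pool>

A per-application equivalent could go in the tc.data profiles column
(e.g. GLOBUS::maxwalltime=60), again depending on the Swift version in
use.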
Are you doing fork jobs or pbs >> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought >> there were throttles in place in Swift to prevent this type of >> overrun? Mike K, I'll need you to either stop these types of jobs >> until Mike W can verify throttling or only submit a few 10s of jobs >> at a time. >> >> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: >> >>> Yes, I'm submitting molecular dynamics simulations >>> using Swift. >>> >>> Is there a default wall-time limit for jobs on tg-uc? >>> >>> >>> >>> --- joseph insley wrote: >>> >>>> Actually, these numbers are now escalating... >>>> >>>> top - 17:18:54 up 2:29, 1 user, load average: >>>> 149.02, 123.63, 91.94 >>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>> stopped, 0 zombie >>>> >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>> 479 >>>> >>>> insley at tg-viz-login1:~> time globusrun -a -r >>>> tg-grid.uc.teragrid.org >>>> GRAM Authentication test successful >>>> real 0m26.134s >>>> user 0m0.090s >>>> sys 0m0.010s >>>> >>>> >>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>> >>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>> TG GRAM host) >>>>> became unresponsive and had to be rebooted. I am >>>> now seeing slow >>>>> response times from the Gatekeeper there again. >>>> Authenticating to >>>>> the gatekeeper should only take a second or two, >>>> but it is >>>>> periodically taking up to 16 seconds: >>>>> >>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>> tg-grid.uc.teragrid.org >>>>> GRAM Authentication test successful >>>>> real 0m16.096s >>>>> user 0m0.060s >>>>> sys 0m0.020s >>>>> >>>>> looking at the load on tg-grid, it is rather high: >>>>> >>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>> 89.59, 78.69, 62.92 >>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>> stopped, 0 zombie >>>>> >>>>> And there appear to be a large number of processes >>>> owned by kubal: >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>> 380 >>>>> >>>>> I assume that Mike is using swift to do the job >>>> submission. Is >>>>> there some throttling of the rate at which jobs >>>> are submitted to >>>>> the gatekeeper that could be done that would >>>> lighten this load >>>>> some? (Or has that already been done since >>>> earlier today?) The >>>>> current response times are not unacceptable, but >>>> I'm hoping to >>>>> avoid having the machine grind to a halt as it did >>>> earlier today. >>>>> >>>>> Thanks, >>>>> joe. >>>>> >>>>> >>>>> >>>> =================================================== >>>>> joseph a. >>>>> insley >>>> >>>>> insley at mcs.anl.gov >>>>> mathematics & computer science division >>>> (630) 252-5649 >>>>> argonne national laboratory >>>> (630) >>>>> 252-5986 (fax) >>>>> >>>>> >>>> >>>> =================================================== >>>> joseph a. insley >>>> >>>> insley at mcs.anl.gov >>>> mathematics & computer science division (630) >>>> 252-5649 >>>> argonne national laboratory >>>> (630) >>>> 252-5986 (fax) >>>> >>>> >>>> >>> >>> >>> >>> >>> ____________________________________________________________________________________ >>> >>> Be a better friend, newshound, and >>> know-it-all with Yahoo! Mobile. Try it now. 
>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>> >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Tue Jan 29 13:31:53 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jan 2008 13:31:53 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <479F7B67.7070400@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> Message-ID: <1201635113.21914.1.camel@blabla.mcs.anl.gov> Ah, I was seeing the 16 second submission on Teraport (a cluster at UC), right after an upgrade of sorts. I can ask more about this upgrade... On Tue, 2008-01-29 at 13:15 -0600, Ian Foster wrote: > Hi, > > I've CCed Stuart Martin--I'd greatly appreciate some insights into what > is causing this. I assume that you are using GRAM4 (aka WS-GRAM)? > > Ian. > > Michael Wilde wrote: > > [ was Re: Swift jobs on UC/ANL TG ] > > > > Hi. Im at OHare and will be flying soon. > > Ben or Mihael, if you are online, can you investigate? > > > > Yes, there are significant throttles turned on by default, and the > > system opens those very gradually. > > > > MikeK, can you post to the swift-devel list your swift.properties > > file, command line options, and your swift source code? > > > > Thanks, > > > > MikeW > > > > > > On 1/29/08 8:11 AM, Ti Leggett wrote: > >> The default walltime is 15 minutes. Are you doing fork jobs or pbs > >> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought > >> there were throttles in place in Swift to prevent this type of > >> overrun? Mike K, I'll need you to either stop these types of jobs > >> until Mike W can verify throttling or only submit a few 10s of jobs > >> at a time. > >> > >> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: > >> > >>> Yes, I'm submitting molecular dynamics simulations > >>> using Swift. > >>> > >>> Is there a default wall-time limit for jobs on tg-uc? > >>> > >>> > >>> > >>> --- joseph insley wrote: > >>> > >>>> Actually, these numbers are now escalating... > >>>> > >>>> top - 17:18:54 up 2:29, 1 user, load average: > >>>> 149.02, 123.63, 91.94 > >>>> Tasks: 469 total, 4 running, 465 sleeping, 0 > >>>> stopped, 0 zombie > >>>> > >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>> 479 > >>>> > >>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>> tg-grid.uc.teragrid.org > >>>> GRAM Authentication test successful > >>>> real 0m26.134s > >>>> user 0m0.090s > >>>> sys 0m0.010s > >>>> > >>>> > >>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > >>>> > >>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL > >>>> TG GRAM host) > >>>>> became unresponsive and had to be rebooted. I am > >>>> now seeing slow > >>>>> response times from the Gatekeeper there again. 
> >>>> Authenticating to > >>>>> the gatekeeper should only take a second or two, > >>>> but it is > >>>>> periodically taking up to 16 seconds: > >>>>> > >>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>> tg-grid.uc.teragrid.org > >>>>> GRAM Authentication test successful > >>>>> real 0m16.096s > >>>>> user 0m0.060s > >>>>> sys 0m0.020s > >>>>> > >>>>> looking at the load on tg-grid, it is rather high: > >>>>> > >>>>> top - 16:55:26 up 2:06, 1 user, load average: > >>>> 89.59, 78.69, 62.92 > >>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 > >>>> stopped, 0 zombie > >>>>> > >>>>> And there appear to be a large number of processes > >>>> owned by kubal: > >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>> 380 > >>>>> > >>>>> I assume that Mike is using swift to do the job > >>>> submission. Is > >>>>> there some throttling of the rate at which jobs > >>>> are submitted to > >>>>> the gatekeeper that could be done that would > >>>> lighten this load > >>>>> some? (Or has that already been done since > >>>> earlier today?) The > >>>>> current response times are not unacceptable, but > >>>> I'm hoping to > >>>>> avoid having the machine grind to a halt as it did > >>>> earlier today. > >>>>> > >>>>> Thanks, > >>>>> joe. > >>>>> > >>>>> > >>>>> > >>>> =================================================== > >>>>> joseph a. > >>>>> insley > >>>> > >>>>> insley at mcs.anl.gov > >>>>> mathematics & computer science division > >>>> (630) 252-5649 > >>>>> argonne national laboratory > >>>> (630) > >>>>> 252-5986 (fax) > >>>>> > >>>>> > >>>> > >>>> =================================================== > >>>> joseph a. insley > >>>> > >>>> insley at mcs.anl.gov > >>>> mathematics & computer science division (630) > >>>> 252-5649 > >>>> argonne national laboratory > >>>> (630) > >>>> 252-5986 (fax) > >>>> > >>>> > >>>> > >>> > >>> > >>> > >>> > >>> ____________________________________________________________________________________ > >>> > >>> Be a better friend, newshound, and > >>> know-it-all with Yahoo! Mobile. Try it now. > >>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>> > >> > >> > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From iraicu at cs.uchicago.edu Tue Jan 29 13:38:03 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 29 Jan 2008 13:38:03 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <479F7B67.7070400@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> Message-ID: <479F809B.5050306@cs.uchicago.edu> Can someone double check that the jobs are using PBS (and not FORK) in GRAM? If you are using FORK, then the high load is being caused by the applications running on the GRAM host. If it is PBS, then I don't know, others might have more insight. Ioan Ian Foster wrote: > Hi, > > I've CCed Stuart Martin--I'd greatly appreciate some insights into > what is causing this. I assume that you are using GRAM4 (aka WS-GRAM)? > > Ian. > > Michael Wilde wrote: >> [ was Re: Swift jobs on UC/ANL TG ] >> >> Hi. Im at OHare and will be flying soon. 
>> Ben or Mihael, if you are online, can you investigate? >> >> Yes, there are significant throttles turned on by default, and the >> system opens those very gradually. >> >> MikeK, can you post to the swift-devel list your swift.properties >> file, command line options, and your swift source code? >> >> Thanks, >> >> MikeW >> >> >> On 1/29/08 8:11 AM, Ti Leggett wrote: >>> The default walltime is 15 minutes. Are you doing fork jobs or pbs >>> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought >>> there were throttles in place in Swift to prevent this type of >>> overrun? Mike K, I'll need you to either stop these types of jobs >>> until Mike W can verify throttling or only submit a few 10s of jobs >>> at a time. >>> >>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: >>> >>>> Yes, I'm submitting molecular dynamics simulations >>>> using Swift. >>>> >>>> Is there a default wall-time limit for jobs on tg-uc? >>>> >>>> >>>> >>>> --- joseph insley wrote: >>>> >>>>> Actually, these numbers are now escalating... >>>>> >>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>> 149.02, 123.63, 91.94 >>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>> stopped, 0 zombie >>>>> >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>> 479 >>>>> >>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>> tg-grid.uc.teragrid.org >>>>> GRAM Authentication test successful >>>>> real 0m26.134s >>>>> user 0m0.090s >>>>> sys 0m0.010s >>>>> >>>>> >>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>> >>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>> TG GRAM host) >>>>>> became unresponsive and had to be rebooted. I am >>>>> now seeing slow >>>>>> response times from the Gatekeeper there again. >>>>> Authenticating to >>>>>> the gatekeeper should only take a second or two, >>>>> but it is >>>>>> periodically taking up to 16 seconds: >>>>>> >>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>> tg-grid.uc.teragrid.org >>>>>> GRAM Authentication test successful >>>>>> real 0m16.096s >>>>>> user 0m0.060s >>>>>> sys 0m0.020s >>>>>> >>>>>> looking at the load on tg-grid, it is rather high: >>>>>> >>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>> 89.59, 78.69, 62.92 >>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>> stopped, 0 zombie >>>>>> >>>>>> And there appear to be a large number of processes >>>>> owned by kubal: >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>> 380 >>>>>> >>>>>> I assume that Mike is using swift to do the job >>>>> submission. Is >>>>>> there some throttling of the rate at which jobs >>>>> are submitted to >>>>>> the gatekeeper that could be done that would >>>>> lighten this load >>>>>> some? (Or has that already been done since >>>>> earlier today?) The >>>>>> current response times are not unacceptable, but >>>>> I'm hoping to >>>>>> avoid having the machine grind to a halt as it did >>>>> earlier today. >>>>>> >>>>>> Thanks, >>>>>> joe. >>>>>> >>>>>> >>>>>> >>>>> =================================================== >>>>>> joseph a. >>>>>> insley >>>>> >>>>>> insley at mcs.anl.gov >>>>>> mathematics & computer science division >>>>> (630) 252-5649 >>>>>> argonne national laboratory >>>>> (630) >>>>>> 252-5986 (fax) >>>>>> >>>>>> >>>>> >>>>> =================================================== >>>>> joseph a. 
insley >>>>> >>>>> insley at mcs.anl.gov >>>>> mathematics & computer science division (630) >>>>> 252-5649 >>>>> argonne national laboratory >>>>> (630) >>>>> 252-5986 (fax) >>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> >>>> ____________________________________________________________________________________ >>>> >>>> Be a better friend, newshound, and >>>> know-it-all with Yahoo! Mobile. Try it now. >>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>> >>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- ================================================== Ioan Raicu Ph.D. Candidate ================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 ================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS ================================================== ================================================== From hategan at mcs.anl.gov Tue Jan 29 13:39:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jan 2008 13:39:30 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <1201635113.21914.1.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <1201635113.21914.1.camel@blabla.mcs.anl.gov> Message-ID: <1201635570.21914.5.camel@blabla.mcs.anl.gov> On Tue, 2008-01-29 at 13:31 -0600, Mihael Hategan wrote: > Ah, I was seeing the 16 second submission on Teraport (a cluster at UC), > right after an upgrade of sorts. I can ask more about this upgrade... So teraport uses VDT. Which makes it odd. Whatever change triggers this is both in VDT and that SDCTTTRWSC thing teragrid uses. > > On Tue, 2008-01-29 at 13:15 -0600, Ian Foster wrote: > > Hi, > > > > I've CCed Stuart Martin--I'd greatly appreciate some insights into what > > is causing this. I assume that you are using GRAM4 (aka WS-GRAM)? > > > > Ian. > > > > Michael Wilde wrote: > > > [ was Re: Swift jobs on UC/ANL TG ] > > > > > > Hi. Im at OHare and will be flying soon. > > > Ben or Mihael, if you are online, can you investigate? > > > > > > Yes, there are significant throttles turned on by default, and the > > > system opens those very gradually. > > > > > > MikeK, can you post to the swift-devel list your swift.properties > > > file, command line options, and your swift source code? > > > > > > Thanks, > > > > > > MikeW > > > > > > > > > On 1/29/08 8:11 AM, Ti Leggett wrote: > > >> The default walltime is 15 minutes. Are you doing fork jobs or pbs > > >> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought > > >> there were throttles in place in Swift to prevent this type of > > >> overrun? Mike K, I'll need you to either stop these types of jobs > > >> until Mike W can verify throttling or only submit a few 10s of jobs > > >> at a time. 
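For reference on the throttling question raised above: the stock
swift.properties of this era exposes several client-side throttles. The
property names are from the Swift user guide; the values shown are the
shipped defaults as best they can be reconstructed here, so double-check
them against the installed version:

    # maximum simultaneous job submissions, overall and per site
    throttle.submit=4
    throttle.host.submit=2
    # limits how fast the scheduler ramps up load on a site
    throttle.score.job.factor=4
    # concurrent file transfers / file operations
    throttle.transfers=4
    throttle.file.operations=8

These bound how many jobs Swift submits and keeps in flight at once --
and with pre-WS GRAM, each in-flight job means a polling jobmanager
process on tg-grid, which is the load joe reported. Stuart's suggestion
later in this thread is the other lever: switching the site entry to
GRAM4, roughly as below (schema assumed, not verified against this
Swift version), since GRAM4 avoids the per-job polling jobmanagers of
GRAM2:

    <execution provider="gt4" jobmanager="PBS"
               url="tg-grid.uc.teragrid.org" />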
> > >> > > >> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: > > >> > > >>> Yes, I'm submitting molecular dynamics simulations > > >>> using Swift. > > >>> > > >>> Is there a default wall-time limit for jobs on tg-uc? > > >>> > > >>> > > >>> > > >>> --- joseph insley wrote: > > >>> > > >>>> Actually, these numbers are now escalating... > > >>>> > > >>>> top - 17:18:54 up 2:29, 1 user, load average: > > >>>> 149.02, 123.63, 91.94 > > >>>> Tasks: 469 total, 4 running, 465 sleeping, 0 > > >>>> stopped, 0 zombie > > >>>> > > >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > >>>> 479 > > >>>> > > >>>> insley at tg-viz-login1:~> time globusrun -a -r > > >>>> tg-grid.uc.teragrid.org > > >>>> GRAM Authentication test successful > > >>>> real 0m26.134s > > >>>> user 0m0.090s > > >>>> sys 0m0.010s > > >>>> > > >>>> > > >>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > > >>>> > > >>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL > > >>>> TG GRAM host) > > >>>>> became unresponsive and had to be rebooted. I am > > >>>> now seeing slow > > >>>>> response times from the Gatekeeper there again. > > >>>> Authenticating to > > >>>>> the gatekeeper should only take a second or two, > > >>>> but it is > > >>>>> periodically taking up to 16 seconds: > > >>>>> > > >>>>> insley at tg-viz-login1:~> time globusrun -a -r > > >>>> tg-grid.uc.teragrid.org > > >>>>> GRAM Authentication test successful > > >>>>> real 0m16.096s > > >>>>> user 0m0.060s > > >>>>> sys 0m0.020s > > >>>>> > > >>>>> looking at the load on tg-grid, it is rather high: > > >>>>> > > >>>>> top - 16:55:26 up 2:06, 1 user, load average: > > >>>> 89.59, 78.69, 62.92 > > >>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 > > >>>> stopped, 0 zombie > > >>>>> > > >>>>> And there appear to be a large number of processes > > >>>> owned by kubal: > > >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > >>>>> 380 > > >>>>> > > >>>>> I assume that Mike is using swift to do the job > > >>>> submission. Is > > >>>>> there some throttling of the rate at which jobs > > >>>> are submitted to > > >>>>> the gatekeeper that could be done that would > > >>>> lighten this load > > >>>>> some? (Or has that already been done since > > >>>> earlier today?) The > > >>>>> current response times are not unacceptable, but > > >>>> I'm hoping to > > >>>>> avoid having the machine grind to a halt as it did > > >>>> earlier today. > > >>>>> > > >>>>> Thanks, > > >>>>> joe. > > >>>>> > > >>>>> > > >>>>> > > >>>> =================================================== > > >>>>> joseph a. > > >>>>> insley > > >>>> > > >>>>> insley at mcs.anl.gov > > >>>>> mathematics & computer science division > > >>>> (630) 252-5649 > > >>>>> argonne national laboratory > > >>>> (630) > > >>>>> 252-5986 (fax) > > >>>>> > > >>>>> > > >>>> > > >>>> =================================================== > > >>>> joseph a. insley > > >>>> > > >>>> insley at mcs.anl.gov > > >>>> mathematics & computer science division (630) > > >>>> 252-5649 > > >>>> argonne national laboratory > > >>>> (630) > > >>>> 252-5986 (fax) > > >>>> > > >>>> > > >>>> > > >>> > > >>> > > >>> > > >>> > > >>> ____________________________________________________________________________________ > > >>> > > >>> Be a better friend, newshound, and > > >>> know-it-all with Yahoo! Mobile. Try it now. 
> > >>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > >>> > > >> > > >> > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Tue Jan 29 13:42:45 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jan 2008 13:42:45 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <1201635570.21914.5.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <1201635113.21914.1.camel@blabla.mcs.anl.gov> <1201635570.21914.5.camel@blabla.mcs.anl.gov> Message-ID: <1201635766.21914.7.camel@blabla.mcs.anl.gov> On Tue, 2008-01-29 at 13:39 -0600, Mihael Hategan wrote: > On Tue, 2008-01-29 at 13:31 -0600, Mihael Hategan wrote: > > Ah, I was seeing the 16 second submission on Teraport (a cluster at UC), > > right after an upgrade of sorts. I can ask more about this upgrade... > > So teraport uses VDT. Which makes it odd. Whatever change triggers this > is both in VDT and that SDCTTTRWSC thing teragrid uses. Hmm. So it turns out VDT came with the concept of "managed-fork" which seems to go through a condor queue. I hope this doesn't apply to TG? > > > > > On Tue, 2008-01-29 at 13:15 -0600, Ian Foster wrote: > > > Hi, > > > > > > I've CCed Stuart Martin--I'd greatly appreciate some insights into what > > > is causing this. I assume that you are using GRAM4 (aka WS-GRAM)? > > > > > > Ian. > > > > > > Michael Wilde wrote: > > > > [ was Re: Swift jobs on UC/ANL TG ] > > > > > > > > Hi. Im at OHare and will be flying soon. > > > > Ben or Mihael, if you are online, can you investigate? > > > > > > > > Yes, there are significant throttles turned on by default, and the > > > > system opens those very gradually. > > > > > > > > MikeK, can you post to the swift-devel list your swift.properties > > > > file, command line options, and your swift source code? > > > > > > > > Thanks, > > > > > > > > MikeW > > > > > > > > > > > > On 1/29/08 8:11 AM, Ti Leggett wrote: > > > >> The default walltime is 15 minutes. Are you doing fork jobs or pbs > > > >> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought > > > >> there were throttles in place in Swift to prevent this type of > > > >> overrun? Mike K, I'll need you to either stop these types of jobs > > > >> until Mike W can verify throttling or only submit a few 10s of jobs > > > >> at a time. > > > >> > > > >> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: > > > >> > > > >>> Yes, I'm submitting molecular dynamics simulations > > > >>> using Swift. > > > >>> > > > >>> Is there a default wall-time limit for jobs on tg-uc? > > > >>> > > > >>> > > > >>> > > > >>> --- joseph insley wrote: > > > >>> > > > >>>> Actually, these numbers are now escalating... 
> > > >>>> > > > >>>> top - 17:18:54 up 2:29, 1 user, load average: > > > >>>> 149.02, 123.63, 91.94 > > > >>>> Tasks: 469 total, 4 running, 465 sleeping, 0 > > > >>>> stopped, 0 zombie > > > >>>> > > > >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > >>>> 479 > > > >>>> > > > >>>> insley at tg-viz-login1:~> time globusrun -a -r > > > >>>> tg-grid.uc.teragrid.org > > > >>>> GRAM Authentication test successful > > > >>>> real 0m26.134s > > > >>>> user 0m0.090s > > > >>>> sys 0m0.010s > > > >>>> > > > >>>> > > > >>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > > > >>>> > > > >>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL > > > >>>> TG GRAM host) > > > >>>>> became unresponsive and had to be rebooted. I am > > > >>>> now seeing slow > > > >>>>> response times from the Gatekeeper there again. > > > >>>> Authenticating to > > > >>>>> the gatekeeper should only take a second or two, > > > >>>> but it is > > > >>>>> periodically taking up to 16 seconds: > > > >>>>> > > > >>>>> insley at tg-viz-login1:~> time globusrun -a -r > > > >>>> tg-grid.uc.teragrid.org > > > >>>>> GRAM Authentication test successful > > > >>>>> real 0m16.096s > > > >>>>> user 0m0.060s > > > >>>>> sys 0m0.020s > > > >>>>> > > > >>>>> looking at the load on tg-grid, it is rather high: > > > >>>>> > > > >>>>> top - 16:55:26 up 2:06, 1 user, load average: > > > >>>> 89.59, 78.69, 62.92 > > > >>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 > > > >>>> stopped, 0 zombie > > > >>>>> > > > >>>>> And there appear to be a large number of processes > > > >>>> owned by kubal: > > > >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > >>>>> 380 > > > >>>>> > > > >>>>> I assume that Mike is using swift to do the job > > > >>>> submission. Is > > > >>>>> there some throttling of the rate at which jobs > > > >>>> are submitted to > > > >>>>> the gatekeeper that could be done that would > > > >>>> lighten this load > > > >>>>> some? (Or has that already been done since > > > >>>> earlier today?) The > > > >>>>> current response times are not unacceptable, but > > > >>>> I'm hoping to > > > >>>>> avoid having the machine grind to a halt as it did > > > >>>> earlier today. > > > >>>>> > > > >>>>> Thanks, > > > >>>>> joe. > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>> =================================================== > > > >>>>> joseph a. > > > >>>>> insley > > > >>>> > > > >>>>> insley at mcs.anl.gov > > > >>>>> mathematics & computer science division > > > >>>> (630) 252-5649 > > > >>>>> argonne national laboratory > > > >>>> (630) > > > >>>>> 252-5986 (fax) > > > >>>>> > > > >>>>> > > > >>>> > > > >>>> =================================================== > > > >>>> joseph a. insley > > > >>>> > > > >>>> insley at mcs.anl.gov > > > >>>> mathematics & computer science division (630) > > > >>>> 252-5649 > > > >>>> argonne national laboratory > > > >>>> (630) > > > >>>> 252-5986 (fax) > > > >>>> > > > >>>> > > > >>>> > > > >>> > > > >>> > > > >>> > > > >>> > > > >>> ____________________________________________________________________________________ > > > >>> > > > >>> Be a better friend, newshound, and > > > >>> know-it-all with Yahoo! Mobile. Try it now. 
> > > >>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > >>> > > > >> > > > >> > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From foster at mcs.anl.gov Tue Jan 29 14:01:17 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Tue, 29 Jan 2008 14:01:17 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> Message-ID: <479F860D.2030900@mcs.anl.gov> I think that using WS-GRAM is key here--it has been created, and extensively tested, explicitly to address these concerns. joseph insley wrote: > I was seeing Mike's jobs show up in the queue, and running on the > backend nodes, and the processes I was seeing on tg-grid appeared to > be gram and not some other application, so it would seem that it was > indeed using PBS. > > However, it appears to be using PRE-WS GRAM.... I still had some of > the 'ps | grep kubal' output in my scrollback: > > insley at tg-grid1:~> ps -ef | grep kubal > kubal 16981 1 0 16:41 ? 00:00:00 globus-job-manager > -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs > -rdn jobmanager-pbs -machine-type unknown -publish-jobs > kubal 18390 1 0 16:42 ? 00:00:00 globus-job-manager > -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs > -rdn jobmanager-pbs -machine-type unknown -publish-jobs > kubal 18891 1 0 16:43 ? 00:00:00 globus-job-manager > -conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs > -rdn jobmanager-pbs -machine-type unknown -publish-jobs > kubal 18917 1 0 16:43 ? > > [snip] > > kubal 28200 25985 0 16:50 ? 00:00:00 /usr/bin/perl > /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs > -f /tmp/gram_iwEHrc -c poll > kubal 28201 26954 1 16:50 ? 00:00:00 /usr/bin/perl > /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs > -f /tmp/gram_lQaIPe -c poll > kubal 28202 19438 1 16:50 ? 00:00:00 /usr/bin/perl > /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs > -f /tmp/gram_SPsdme -c poll > > > On Jan 29, 2008, at 1:38 PM, Ioan Raicu wrote: > >> Can someone double check that the jobs are using PBS (and not FORK) >> in GRAM? If you are using FORK, then the high load is being caused >> by the applications running on the GRAM host. If it is PBS, then I >> don't know, others might have more insight. >> >> Ioan >> >> Ian Foster wrote: >>> Hi, >>> >>> I've CCed Stuart Martin--I'd greatly appreciate some insights into >>> what is causing this. I assume that you are using GRAM4 (aka WS-GRAM)? >>> >>> Ian. >>> >>> Michael Wilde wrote: >>>> [ was Re: Swift jobs on UC/ANL TG ] >>>> >>>> Hi. Im at OHare and will be flying soon. >>>> Ben or Mihael, if you are online, can you investigate? 
>>>> >>>> Yes, there are significant throttles turned on by default, and the >>>> system opens those very gradually. >>>> >>>> MikeK, can you post to the swift-devel list your swift.properties >>>> file, command line options, and your swift source code? >>>> >>>> Thanks, >>>> >>>> MikeW >>>> >>>> >>>> On 1/29/08 8:11 AM, Ti Leggett wrote: >>>>> The default walltime is 15 minutes. Are you doing fork jobs or pbs >>>>> jobs? You shouldn't be doing fork jobs at all. Mike W, I thought >>>>> there were throttles in place in Swift to prevent this type of >>>>> overrun? Mike K, I'll need you to either stop these types of jobs >>>>> until Mike W can verify throttling or only submit a few 10s of >>>>> jobs at a time. >>>>> >>>>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: >>>>> >>>>>> Yes, I'm submitting molecular dynamics simulations >>>>>> using Swift. >>>>>> >>>>>> Is there a default wall-time limit for jobs on tg-uc? >>>>>> >>>>>> >>>>>> >>>>>> --- joseph insley >>>>> > wrote: >>>>>> >>>>>>> Actually, these numbers are now escalating... >>>>>>> >>>>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>>>> 149.02, 123.63, 91.94 >>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>>>> stopped, 0 zombie >>>>>>> >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>> 479 >>>>>>> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> tg-grid.uc.teragrid.org >>>>>>> GRAM Authentication test successful >>>>>>> real 0m26.134s >>>>>>> user 0m0.090s >>>>>>> sys 0m0.010s >>>>>>> >>>>>>> >>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>>>> >>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>>>> TG GRAM host) >>>>>>>> became unresponsive and had to be rebooted. I am >>>>>>> now seeing slow >>>>>>>> response times from the Gatekeeper there again. >>>>>>> Authenticating to >>>>>>>> the gatekeeper should only take a second or two, >>>>>>> but it is >>>>>>>> periodically taking up to 16 seconds: >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m16.096s >>>>>>>> user 0m0.060s >>>>>>>> sys 0m0.020s >>>>>>>> >>>>>>>> looking at the load on tg-grid, it is rather high: >>>>>>>> >>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>>>> 89.59, 78.69, 62.92 >>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> And there appear to be a large number of processes >>>>>>> owned by kubal: >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 380 >>>>>>>> >>>>>>>> I assume that Mike is using swift to do the job >>>>>>> submission. Is >>>>>>>> there some throttling of the rate at which jobs >>>>>>> are submitted to >>>>>>>> the gatekeeper that could be done that would >>>>>>> lighten this load >>>>>>>> some? (Or has that already been done since >>>>>>> earlier today?) The >>>>>>>> current response times are not unacceptable, but >>>>>>> I'm hoping to >>>>>>>> avoid having the machine grind to a halt as it did >>>>>>> earlier today. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> joe. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> =================================================== >>>>>>>> joseph a. >>>>>>>> insley >>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division >>>>>>> (630) 252-5649 >>>>>>>> argonne national laboratory >>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> =================================================== >>>>>>> joseph a. 
insley >>>>>>> >>>>>>> insley at mcs.anl.gov >>>>>>> mathematics & computer science division (630) >>>>>>> 252-5649 >>>>>>> argonne national laboratory >>>>>>> (630) >>>>>>> 252-5986 (fax) >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ____________________________________________________________________________________ >>>>>> Be a better friend, newshound, and >>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>> >>>>> >>>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> >> -- >> ================================================== >> Ioan Raicu >> Ph.D. Candidate >> ================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> ================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> >> http://dev.globus.org/wiki/Incubator/Falkon >> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS >> ================================================== >> ================================================== >> >> > > =================================================== > > joseph a. insley > insley at mcs.anl.gov > > mathematics & computer science division (630) 252-5649 > > argonne national laboratory (630) > 252-5986 (fax) > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Jan 29 14:02:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jan 2008 14:02:14 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <1201635113.21914.1.camel@blabla.mcs.anl.gov> <1201635570.21914.5.camel@blabla.mcs.anl.gov> <1201635766.21914.7.camel@blabla.mcs.anl.gov> Message-ID: <1201636934.23430.1.camel@blabla.mcs.anl.gov> So I can't confirm a 16s time on TG. A /bin/date for me takes about 6 seconds. While still high, some of it is probably due to java start-up time. On Tue, 2008-01-29 at 13:43 -0600, JP Navarro wrote: > It does not. > > On Jan 29, 2008, at 1:42 PM, Mihael Hategan wrote: > > > I hope this doesn't apply to TG? > From smartin at mcs.anl.gov Tue Jan 29 14:06:22 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 29 Jan 2008 14:06:22 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> Message-ID: <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> This is the classic GRAM2 scaling issue due to each job polling for status to the LRM. condor-g does all sorts of things to make GRAM2 scale for that scenario. 
If swift is not using condor-g and not doing the condor-g tricks, then I'd recommend swift to switch to using gram4. -Stu On Jan 29, 2008, at Jan 29, 1:57 PM, joseph insley wrote: > I was seeing Mike's jobs show up in the queue, and running on the > backend nodes, and the processes I was seeing on tg-grid appeared to > be gram and not some other application, so it would seem that it was > indeed using PBS. > > However, it appears to be using PRE-WS GRAM.... I still had some of > the 'ps | grep kubal' output in my scrollback: > > insley at tg-grid1:~> ps -ef | grep kubal > kubal 16981 1 0 16:41 ? 00:00:00 globus-job-manager - > conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs > -rdn jobmanager-pbs -machine-type unknown -publish-jobs > kubal 18390 1 0 16:42 ? 00:00:00 globus-job-manager - > conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs > -rdn jobmanager-pbs -machine-type unknown -publish-jobs > kubal 18891 1 0 16:43 ? 00:00:00 globus-job-manager - > conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs > -rdn jobmanager-pbs -machine-type unknown -publish-jobs > kubal 18917 1 0 16:43 ? > > [snip] > > kubal 28200 25985 0 16:50 ? 00:00:00 /usr/bin/perl /soft/ > prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f / > tmp/gram_iwEHrc -c poll > kubal 28201 26954 1 16:50 ? 00:00:00 /usr/bin/perl /soft/ > prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f / > tmp/gram_lQaIPe -c poll > kubal 28202 19438 1 16:50 ? 00:00:00 /usr/bin/perl /soft/ > prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f / > tmp/gram_SPsdme -c poll > > > On Jan 29, 2008, at 1:38 PM, Ioan Raicu wrote: > >> Can someone double check that the jobs are using PBS (and not FORK) >> in GRAM? If you are using FORK, then the high load is being caused >> by the applications running on the GRAM host. If it is PBS, then I >> don't know, others might have more insight. >> >> Ioan >> >> Ian Foster wrote: >>> Hi, >>> >>> I've CCed Stuart Martin--I'd greatly appreciate some insights into >>> what is causing this. I assume that you are using GRAM4 (aka WS- >>> GRAM)? >>> >>> Ian. >>> >>> Michael Wilde wrote: >>>> [ was Re: Swift jobs on UC/ANL TG ] >>>> >>>> Hi. Im at OHare and will be flying soon. >>>> Ben or Mihael, if you are online, can you investigate? >>>> >>>> Yes, there are significant throttles turned on by default, and >>>> the system opens those very gradually. >>>> >>>> MikeK, can you post to the swift-devel list your swift.properties >>>> file, command line options, and your swift source code? >>>> >>>> Thanks, >>>> >>>> MikeW >>>> >>>> >>>> On 1/29/08 8:11 AM, Ti Leggett wrote: >>>>> The default walltime is 15 minutes. Are you doing fork jobs or >>>>> pbs jobs? You shouldn't be doing fork jobs at all. Mike W, I >>>>> thought there were throttles in place in Swift to prevent this >>>>> type of overrun? Mike K, I'll need you to either stop these >>>>> types of jobs until Mike W can verify throttling or only submit >>>>> a few 10s of jobs at a time. >>>>> >>>>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: >>>>> >>>>>> Yes, I'm submitting molecular dynamics simulations >>>>>> using Swift. >>>>>> >>>>>> Is there a default wall-time limit for jobs on tg-uc? >>>>>> >>>>>> >>>>>> >>>>>> --- joseph insley wrote: >>>>>> >>>>>>> Actually, these numbers are now escalating... 
>>>>>>> >>>>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>>>> 149.02, 123.63, 91.94 >>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>>>> stopped, 0 zombie >>>>>>> >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>> 479 >>>>>>> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> tg-grid.uc.teragrid.org >>>>>>> GRAM Authentication test successful >>>>>>> real 0m26.134s >>>>>>> user 0m0.090s >>>>>>> sys 0m0.010s >>>>>>> >>>>>>> >>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>>>> >>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>>>> TG GRAM host) >>>>>>>> became unresponsive and had to be rebooted. I am >>>>>>> now seeing slow >>>>>>>> response times from the Gatekeeper there again. >>>>>>> Authenticating to >>>>>>>> the gatekeeper should only take a second or two, >>>>>>> but it is >>>>>>>> periodically taking up to 16 seconds: >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m16.096s >>>>>>>> user 0m0.060s >>>>>>>> sys 0m0.020s >>>>>>>> >>>>>>>> looking at the load on tg-grid, it is rather high: >>>>>>>> >>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>>>> 89.59, 78.69, 62.92 >>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> And there appear to be a large number of processes >>>>>>> owned by kubal: >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 380 >>>>>>>> >>>>>>>> I assume that Mike is using swift to do the job >>>>>>> submission. Is >>>>>>>> there some throttling of the rate at which jobs >>>>>>> are submitted to >>>>>>>> the gatekeeper that could be done that would >>>>>>> lighten this load >>>>>>>> some? (Or has that already been done since >>>>>>> earlier today?) The >>>>>>>> current response times are not unacceptable, but >>>>>>> I'm hoping to >>>>>>>> avoid having the machine grind to a halt as it did >>>>>>> earlier today. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> joe. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> =================================================== >>>>>>>> joseph a. >>>>>>>> insley >>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division >>>>>>> (630) 252-5649 >>>>>>>> argonne national laboratory >>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> =================================================== >>>>>>> joseph a. insley >>>>>>> >>>>>>> insley at mcs.anl.gov >>>>>>> mathematics & computer science division (630) >>>>>>> 252-5649 >>>>>>> argonne national laboratory >>>>>>> (630) >>>>>>> 252-5986 (fax) >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ____________________________________________________________________________________ >>>>>> Be a better friend, newshound, and >>>>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>> >>>>> >>>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> >> -- >> ================================================== >> Ioan Raicu >> Ph.D. 
Candidate >> ================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> ================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS >> ================================================== >> ================================================== >> >> > > =================================================== > joseph a. > insley insley at mcs.anl.gov > mathematics & computer science division (630) 252-5649 > argonne national laboratory (630) > 252-5986 (fax) > > From hategan at mcs.anl.gov Tue Jan 29 14:10:51 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 29 Jan 2008 14:10:51 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <1201636934.23430.1.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <1201635113.21914.1.camel@blabla.mcs.anl.gov> <1201635570.21914.5.camel@blabla.mcs.anl.gov> <1201635766.21914.7.camel@blabla.mcs.anl.gov> <1201636934.23430.1.camel@blabla.mcs.anl.gov> Message-ID: <1201637451.23430.7.camel@blabla.mcs.anl.gov> However, I read the thread in more detail. We need to distinguish between how it behaves under load and how it behaves on single jobs. It is somewhat to be expected that a high number of submissions will increase the end-to-end time for a job. It seems like the reported long times for a job happen in the context of lots of other jobs being submitted? If yes, can you try submitting single jobs and see how that works? Mihael On Tue, 2008-01-29 at 14:02 -0600, Mihael Hategan wrote: > So I can't confirm a 16s time on TG. A /bin/date for me takes about 6 > seconds. While still high, some of it is probably due to java start-up > time. > > On Tue, 2008-01-29 at 13:43 -0600, JP Navarro wrote: > > It does not. > > > > On Jan 29, 2008, at 1:42 PM, Mihael Hategan wrote: > > > > > I hope this doesn't apply to TG? > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From insley at mcs.anl.gov Mon Jan 28 17:15:41 2008 From: insley at mcs.anl.gov (joseph insley) Date: Mon, 28 Jan 2008 17:15:41 -0600 Subject: [Swift-devel] Swift jobs on UC/ANL TG Message-ID: <3ED4E2D7-704C-49F3-970C-399D335ED8F7@mcs.anl.gov> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host) became unresponsive and had to be rebooted. I am now seeing slow response times from the Gatekeeper there again. Authenticating to the gatekeeper should only take a second or two, but it is periodically taking up to 16 seconds: insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org GRAM Authentication test successful real 0m16.096s user 0m0.060s sys 0m0.020s looking at the load on tg-grid, it is rather high: top - 16:55:26 up 2:06, 1 user, load average: 89.59, 78.69, 62.92 Tasks: 398 total, 20 running, 378 sleeping, 0 stopped, 0 zombie And there appear to be a large number of processes owned by kubal: insley at tg-grid1:~> ps -ef | grep kubal | wc -l 380 I assume that Mike is using swift to do the job submission. 
Is there some throttling of the rate at which jobs are submitted to the gatekeeper that could be done that would lighten this load some? (Or has that already been done since earlier today?) The current response times are not unacceptable, but I'm hoping to avoid having the machine grind to a halt as it did earlier today.

Thanks,
joe.

===================================================
joseph a. insley                     insley at mcs.anl.gov
mathematics & computer science division   (630) 252-5649
argonne national laboratory               (630) 252-5986 (fax)

From insley at mcs.anl.gov Mon Jan 28 17:22:38 2008
From: insley at mcs.anl.gov (joseph insley)
Date: Mon, 28 Jan 2008 17:22:38 -0600
Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG
In-Reply-To: <3ED4E2D7-704C-49F3-970C-399D335ED8F7@mcs.anl.gov>
References: <3ED4E2D7-704C-49F3-970C-399D335ED8F7@mcs.anl.gov>
Message-ID:

Actually, these numbers are now escalating...

top - 17:18:54 up 2:29, 1 user, load average: 149.02, 123.63, 91.94
Tasks: 469 total, 4 running, 465 sleeping, 0 stopped, 0 zombie

insley at tg-grid1:~> ps -ef | grep kubal | wc -l
479

insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
GRAM Authentication test successful
real 0m26.134s
user 0m0.090s
sys 0m0.010s

===================================================
joseph a. insley                     insley at mcs.anl.gov
mathematics & computer science division   (630) 252-5649
argonne national laboratory               (630) 252-5986 (fax)

From leggett at mcs.anl.gov Tue Jan 29 08:11:06 2008
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Tue, 29 Jan 2008 08:11:06 -0600
Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG
In-Reply-To: <921658.18899.qm@web52308.mail.re2.yahoo.com>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com>
Message-ID: <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov>

The default walltime is 15 minutes. Are you doing fork jobs or pbs jobs? You shouldn't be doing fork jobs at all.
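Which flavor a given run used shows up in the jobmanager suffix of the GRAM contact string, and a single-job test from a login node separates gatekeeper latency from queue behavior. A sketch, assuming the stock GT2 command-line clients and the same host Joe tested against:

    # fork jobmanager: runs the command on the gatekeeper node itself
    time globus-job-run tg-grid.uc.teragrid.org/jobmanager-fork /bin/date

    # pbs jobmanager: routes the same job through the PBS queue
    time globus-job-run tg-grid.uc.teragrid.org/jobmanager-pbs /bin/date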
Mike W, I thought there were throttles in place in Swift to prevent this type of overrun? Mike K, I'll need you to either stop these types of jobs until Mike W can verify throttling or only submit a few 10s of jobs at a time. On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: > Yes, I'm submitting molecular dynamics simulations > using Swift. > > Is there a default wall-time limit for jobs on tg-uc? > > > > --- joseph insley wrote: > >> Actually, these numbers are now escalating... >> >> top - 17:18:54 up 2:29, 1 user, load average: >> 149.02, 123.63, 91.94 >> Tasks: 469 total, 4 running, 465 sleeping, 0 >> stopped, 0 zombie >> >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >> 479 >> >> insley at tg-viz-login1:~> time globusrun -a -r >> tg-grid.uc.teragrid.org >> GRAM Authentication test successful >> real 0m26.134s >> user 0m0.090s >> sys 0m0.010s >> >> >> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >> >>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >> TG GRAM host) >>> became unresponsive and had to be rebooted. I am >> now seeing slow >>> response times from the Gatekeeper there again. >> Authenticating to >>> the gatekeeper should only take a second or two, >> but it is >>> periodically taking up to 16 seconds: >>> >>> insley at tg-viz-login1:~> time globusrun -a -r >> tg-grid.uc.teragrid.org >>> GRAM Authentication test successful >>> real 0m16.096s >>> user 0m0.060s >>> sys 0m0.020s >>> >>> looking at the load on tg-grid, it is rather high: >>> >>> top - 16:55:26 up 2:06, 1 user, load average: >> 89.59, 78.69, 62.92 >>> Tasks: 398 total, 20 running, 378 sleeping, 0 >> stopped, 0 zombie >>> >>> And there appear to be a large number of processes >> owned by kubal: >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>> 380 >>> >>> I assume that Mike is using swift to do the job >> submission. Is >>> there some throttling of the rate at which jobs >> are submitted to >>> the gatekeeper that could be done that would >> lighten this load >>> some? (Or has that already been done since >> earlier today?) The >>> current response times are not unacceptable, but >> I'm hoping to >>> avoid having the machine grind to a halt as it did >> earlier today. >>> >>> Thanks, >>> joe. >>> >>> >>> >> =================================================== >>> joseph a. >>> insley >> >>> insley at mcs.anl.gov >>> mathematics & computer science division >> (630) 252-5649 >>> argonne national laboratory >> (630) >>> 252-5986 (fax) >>> >>> >> >> =================================================== >> joseph a. insley >> >> insley at mcs.anl.gov >> mathematics & computer science division (630) >> 252-5649 >> argonne national laboratory >> (630) >> 252-5986 (fax) >> >> >> > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > From leggett at mcs.anl.gov Tue Jan 29 08:19:39 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Tue, 29 Jan 2008 08:19:39 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <921658.18899.qm@web52308.mail.re2.yahoo.com> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> Message-ID: <66C07B61-ED5A-40F3-BC73-899D591EC8BF@mcs.anl.gov> Also, it looks like you're using a local project and not a TG project. We are not able to report this usage to NSF because it counts against our discretionary usage. Do you have a TG project? 
If not can you request a DACC allocation to use? On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: > Yes, I'm submitting molecular dynamics simulations > using Swift. > > Is there a default wall-time limit for jobs on tg-uc? > > > > --- joseph insley wrote: > >> Actually, these numbers are now escalating... >> >> top - 17:18:54 up 2:29, 1 user, load average: >> 149.02, 123.63, 91.94 >> Tasks: 469 total, 4 running, 465 sleeping, 0 >> stopped, 0 zombie >> >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >> 479 >> >> insley at tg-viz-login1:~> time globusrun -a -r >> tg-grid.uc.teragrid.org >> GRAM Authentication test successful >> real 0m26.134s >> user 0m0.090s >> sys 0m0.010s >> >> >> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >> >>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >> TG GRAM host) >>> became unresponsive and had to be rebooted. I am >> now seeing slow >>> response times from the Gatekeeper there again. >> Authenticating to >>> the gatekeeper should only take a second or two, >> but it is >>> periodically taking up to 16 seconds: >>> >>> insley at tg-viz-login1:~> time globusrun -a -r >> tg-grid.uc.teragrid.org >>> GRAM Authentication test successful >>> real 0m16.096s >>> user 0m0.060s >>> sys 0m0.020s >>> >>> looking at the load on tg-grid, it is rather high: >>> >>> top - 16:55:26 up 2:06, 1 user, load average: >> 89.59, 78.69, 62.92 >>> Tasks: 398 total, 20 running, 378 sleeping, 0 >> stopped, 0 zombie >>> >>> And there appear to be a large number of processes >> owned by kubal: >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>> 380 >>> >>> I assume that Mike is using swift to do the job >> submission. Is >>> there some throttling of the rate at which jobs >> are submitted to >>> the gatekeeper that could be done that would >> lighten this load >>> some? (Or has that already been done since >> earlier today?) The >>> current response times are not unacceptable, but >> I'm hoping to >>> avoid having the machine grind to a halt as it did >> earlier today. >>> >>> Thanks, >>> joe. >>> >>> >>> >> =================================================== >>> joseph a. >>> insley >> >>> insley at mcs.anl.gov >>> mathematics & computer science division >> (630) 252-5649 >>> argonne national laboratory >> (630) >>> 252-5986 (fax) >>> >>> >> >> =================================================== >> joseph a. insley >> >> insley at mcs.anl.gov >> mathematics & computer science division (630) >> 252-5649 >> argonne national laboratory >> (630) >> 252-5986 (fax) >> >> >> > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > From insley at mcs.anl.gov Tue Jan 29 09:20:53 2008 From: insley at mcs.anl.gov (joseph insley) Date: Tue, 29 Jan 2008 09:20:53 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <427798.76891.qm@web52301.mail.re2.yahoo.com> References: <427798.76891.qm@web52301.mail.re2.yahoo.com> Message-ID: <60AC3A86-1EF1-40DB-A586-81229AAC8A1E@mcs.anl.gov> Yes, the local project id is typically created by swapping UC for TG in the TG project id. I believe these jobs are in fact PBS jobs submitted to the scheduler through gram. However, pre-ws gram forks off a jobmanager process for each job that is submitted, to keep track of state, etc. This is a known limitation of pre-ws gram. joe. 
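That per-job jobmanager load is easy to watch for on the gatekeeper before it becomes a problem. A minimal sketch of a watch loop, assuming shell access to the GRAM host; the grep pattern and interval are arbitrary:

    # roughly one resident manager process per active pre-WS GRAM job,
    # plus transient globus-job-manager-script.pl poll processes
    while true; do
        date
        ps -ef | grep -c '[g]lobus-job-manager'
        sleep 60
    done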
On Jan 29, 2008, at 9:01 AM, Mike Kubal wrote: > Yes, I have a TG project. I should have updated my > swift sites file after initial testing. > > Is it common for the TG project id to be the same as > the local with just the 'TG-' and 'UC'- prefix > switched? > > > --- Ti Leggett wrote: > >> Also, it looks like you're using a local project and >> not a TG project. >> We are not able to report this usage to NSF because >> it counts against >> our discretionary usage. Do you have a TG project? >> If not can you >> request a DACC allocation to use? >> >> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal >> wrote: >> >>> Yes, I'm submitting molecular dynamics simulations >>> using Swift. >>> >>> Is there a default wall-time limit for jobs on >> tg-uc? >>> >>> >>> >>> --- joseph insley wrote: >>> >>>> Actually, these numbers are now escalating... >>>> >>>> top - 17:18:54 up 2:29, 1 user, load average: >>>> 149.02, 123.63, 91.94 >>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>> stopped, 0 zombie >>>> >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>> 479 >>>> >>>> insley at tg-viz-login1:~> time globusrun -a -r >>>> tg-grid.uc.teragrid.org >>>> GRAM Authentication test successful >>>> real 0m26.134s >>>> user 0m0.090s >>>> sys 0m0.010s >>>> >>>> >>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>> >>>>> Earlier today tg-grid.uc.teragrid.org (the >> UC/ANL >>>> TG GRAM host) >>>>> became unresponsive and had to be rebooted. I >> am >>>> now seeing slow >>>>> response times from the Gatekeeper there again. >>>> Authenticating to >>>>> the gatekeeper should only take a second or two, >>>> but it is >>>>> periodically taking up to 16 seconds: >>>>> >>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>> tg-grid.uc.teragrid.org >>>>> GRAM Authentication test successful >>>>> real 0m16.096s >>>>> user 0m0.060s >>>>> sys 0m0.020s >>>>> >>>>> looking at the load on tg-grid, it is rather >> high: >>>>> >>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>> 89.59, 78.69, 62.92 >>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>> stopped, 0 zombie >>>>> >>>>> And there appear to be a large number of >> processes >>>> owned by kubal: >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>> 380 >>>>> >>>>> I assume that Mike is using swift to do the job >>>> submission. Is >>>>> there some throttling of the rate at which jobs >>>> are submitted to >>>>> the gatekeeper that could be done that would >>>> lighten this load >>>>> some? (Or has that already been done since >>>> earlier today?) The >>>>> current response times are not unacceptable, but >>>> I'm hoping to >>>>> avoid having the machine grind to a halt as it >> did >>>> earlier today. >>>>> >>>>> Thanks, >>>>> joe. >>>>> >>>>> >>>>> >>>> >> =================================================== >>>>> joseph a. >>>>> insley >>>> >>>>> insley at mcs.anl.gov >>>>> mathematics & computer science division >>>> (630) 252-5649 >>>>> argonne national laboratory >>>> (630) >>>>> 252-5986 (fax) >>>>> >>>>> >>>> >>>> >> =================================================== >>>> joseph a. insley >>>> >>>> insley at mcs.anl.gov >>>> mathematics & computer science division >> (630) >>>> 252-5649 >>>> argonne national laboratory >>>> (630) >>>> 252-5986 (fax) >>>> >>>> >>>> >>> >>> >>> >>> >>> >> > ______________________________________________________________________ > ______________ >>> Be a better friend, newshound, and >>> know-it-all with Yahoo! Mobile. Try it now. 
>> > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>> >> >> > > > > > ______________________________________________________________________ > ______________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > =================================================== joseph a. insley insley at mcs.anl.gov mathematics & computer science division (630) 252-5649 argonne national laboratory (630) 252-5986 (fax) -------------- next part -------------- An HTML attachment was scrubbed... URL: From leggett at mcs.anl.gov Tue Jan 29 10:01:00 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Tue, 29 Jan 2008 10:01:00 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <60AC3A86-1EF1-40DB-A586-81229AAC8A1E@mcs.anl.gov> References: <427798.76891.qm@web52301.mail.re2.yahoo.com> <60AC3A86-1EF1-40DB-A586-81229AAC8A1E@mcs.anl.gov> Message-ID: I'm going to remove the local project mapping then. On Jan 29, 2008, at 01/29/08 09:20 AM, joseph insley wrote: > Yes, the local project id is typically created by swapping UC for TG > in the TG project id. > > I believe these jobs are in fact PBS jobs submitted to the scheduler > through gram. However, pre-ws gram forks off a jobmanager process > for each job that is submitted, to keep track of state, etc. This > is a known limitation of pre-ws gram. > > joe. > > On Jan 29, 2008, at 9:01 AM, Mike Kubal wrote: > >> Yes, I have a TG project. I should have updated my >> swift sites file after initial testing. >> >> Is it common for the TG project id to be the same as >> the local with just the 'TG-' and 'UC'- prefix >> switched? >> >> >> --- Ti Leggett wrote: >> >>> Also, it looks like you're using a local project and >>> not a TG project. >>> We are not able to report this usage to NSF because >>> it counts against >>> our discretionary usage. Do you have a TG project? >>> If not can you >>> request a DACC allocation to use? >>> >>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal >>> wrote: >>> >>>> Yes, I'm submitting molecular dynamics simulations >>>> using Swift. >>>> >>>> Is there a default wall-time limit for jobs on >>> tg-uc? >>>> >>>> >>>> >>>> --- joseph insley wrote: >>>> >>>>> Actually, these numbers are now escalating... >>>>> >>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>> 149.02, 123.63, 91.94 >>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>> stopped, 0 zombie >>>>> >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>> 479 >>>>> >>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>> tg-grid.uc.teragrid.org >>>>> GRAM Authentication test successful >>>>> real 0m26.134s >>>>> user 0m0.090s >>>>> sys 0m0.010s >>>>> >>>>> >>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>> >>>>>> Earlier today tg-grid.uc.teragrid.org (the >>> UC/ANL >>>>> TG GRAM host) >>>>>> became unresponsive and had to be rebooted. I >>> am >>>>> now seeing slow >>>>>> response times from the Gatekeeper there again. 
>>>>> Authenticating to >>>>>> the gatekeeper should only take a second or two, >>>>> but it is >>>>>> periodically taking up to 16 seconds: >>>>>> >>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>> tg-grid.uc.teragrid.org >>>>>> GRAM Authentication test successful >>>>>> real 0m16.096s >>>>>> user 0m0.060s >>>>>> sys 0m0.020s >>>>>> >>>>>> looking at the load on tg-grid, it is rather >>> high: >>>>>> >>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>> 89.59, 78.69, 62.92 >>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>> stopped, 0 zombie >>>>>> >>>>>> And there appear to be a large number of >>> processes >>>>> owned by kubal: >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>> 380 >>>>>> >>>>>> I assume that Mike is using swift to do the job >>>>> submission. Is >>>>>> there some throttling of the rate at which jobs >>>>> are submitted to >>>>>> the gatekeeper that could be done that would >>>>> lighten this load >>>>>> some? (Or has that already been done since >>>>> earlier today?) The >>>>>> current response times are not unacceptable, but >>>>> I'm hoping to >>>>>> avoid having the machine grind to a halt as it >>> did >>>>> earlier today. >>>>>> >>>>>> Thanks, >>>>>> joe. >>>>>> >>>>>> >>>>>> >>>>> >>> =================================================== >>>>>> joseph a. >>>>>> insley >>>>> >>>>>> insley at mcs.anl.gov >>>>>> mathematics & computer science division >>>>> (630) 252-5649 >>>>>> argonne national laboratory >>>>> (630) >>>>>> 252-5986 (fax) >>>>>> >>>>>> >>>>> >>>>> >>> =================================================== >>>>> joseph a. insley >>>>> >>>>> insley at mcs.anl.gov >>>>> mathematics & computer science division >>> (630) >>>>> 252-5649 >>>>> argonne national laboratory >>>>> (630) >>>>> 252-5986 (fax) >>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> >>>> >>> >> ____________________________________________________________________________________ >>>> Be a better friend, newshound, and >>>> know-it-all with Yahoo! Mobile. Try it now. >>> >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>> >>> >>> >> >> >> >> >> ____________________________________________________________________________________ >> Never miss a thing. Make Yahoo your home page. >> http://www.yahoo.com/r/hs >> > > =================================================== > joseph a. > insley insley at mcs.anl.gov > mathematics & computer science division (630) 252-5649 > argonne national laboratory (630) > 252-5986 (fax) > > From navarro at mcs.anl.gov Tue Jan 29 13:43:48 2008 From: navarro at mcs.anl.gov (JP Navarro) Date: Tue, 29 Jan 2008 13:43:48 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <1201635766.21914.7.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <1201635113.21914.1.camel@blabla.mcs.anl.gov> <1201635570.21914.5.camel@blabla.mcs.anl.gov> <1201635766.21914.7.camel@blabla.mcs.anl.gov> Message-ID: It does not. On Jan 29, 2008, at 1:42 PM, Mihael Hategan wrote: > I hope this doesn't apply to TG? From insley at mcs.anl.gov Tue Jan 29 13:57:40 2008 From: insley at mcs.anl.gov (joseph insley) Date: Tue, 29 Jan 2008 13:57:40 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? 
In-Reply-To: <479F809B.5050306@cs.uchicago.edu> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> Message-ID: <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> I was seeing Mike's jobs show up in the queue, and running on the backend nodes, and the processes I was seeing on tg-grid appeared to be gram and not some other application, so it would seem that it was indeed using PBS. However, it appears to be using PRE-WS GRAM.... I still had some of the 'ps | grep kubal' output in my scrollback: insley at tg-grid1:~> ps -ef | grep kubal kubal 16981 1 0 16:41 ? 00:00:00 globus-job-manager - conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs - rdn jobmanager-pbs -machine-type unknown -publish-jobs kubal 18390 1 0 16:42 ? 00:00:00 globus-job-manager - conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs - rdn jobmanager-pbs -machine-type unknown -publish-jobs kubal 18891 1 0 16:43 ? 00:00:00 globus-job-manager - conf /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs - rdn jobmanager-pbs -machine-type unknown -publish-jobs kubal 18917 1 0 16:43 ? [snip] kubal 28200 25985 0 16:50 ? 00:00:00 /usr/bin/perl /soft/ prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f / tmp/gram_iwEHrc -c poll kubal 28201 26954 1 16:50 ? 00:00:00 /usr/bin/perl /soft/ prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f / tmp/gram_lQaIPe -c poll kubal 28202 19438 1 16:50 ? 00:00:00 /usr/bin/perl /soft/ prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f / tmp/gram_SPsdme -c poll On Jan 29, 2008, at 1:38 PM, Ioan Raicu wrote: > Can someone double check that the jobs are using PBS (and not FORK) > in GRAM? If you are using FORK, then the high load is being caused > by the applications running on the GRAM host. If it is PBS, then I > don't know, others might have more insight. > > Ioan > > Ian Foster wrote: >> Hi, >> >> I've CCed Stuart Martin--I'd greatly appreciate some insights into >> what is causing this. I assume that you are using GRAM4 (aka WS- >> GRAM)? >> >> Ian. >> >> Michael Wilde wrote: >>> [ was Re: Swift jobs on UC/ANL TG ] >>> >>> Hi. Im at OHare and will be flying soon. >>> Ben or Mihael, if you are online, can you investigate? >>> >>> Yes, there are significant throttles turned on by default, and >>> the system opens those very gradually. >>> >>> MikeK, can you post to the swift-devel list your swift.properties >>> file, command line options, and your swift source code? >>> >>> Thanks, >>> >>> MikeW >>> >>> >>> On 1/29/08 8:11 AM, Ti Leggett wrote: >>>> The default walltime is 15 minutes. Are you doing fork jobs or >>>> pbs jobs? You shouldn't be doing fork jobs at all. Mike W, I >>>> thought there were throttles in place in Swift to prevent this >>>> type of overrun? Mike K, I'll need you to either stop these >>>> types of jobs until Mike W can verify throttling or only submit >>>> a few 10s of jobs at a time. >>>> >>>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: >>>> >>>>> Yes, I'm submitting molecular dynamics simulations >>>>> using Swift. >>>>> >>>>> Is there a default wall-time limit for jobs on tg-uc? >>>>> >>>>> >>>>> >>>>> --- joseph insley wrote: >>>>> >>>>>> Actually, these numbers are now escalating... 
>>>>>> >>>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>>> 149.02, 123.63, 91.94 >>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>>> stopped, 0 zombie >>>>>> >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>> 479 >>>>>> >>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>> tg-grid.uc.teragrid.org >>>>>> GRAM Authentication test successful >>>>>> real 0m26.134s >>>>>> user 0m0.090s >>>>>> sys 0m0.010s >>>>>> >>>>>> >>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>>> >>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>>> TG GRAM host) >>>>>>> became unresponsive and had to be rebooted. I am >>>>>> now seeing slow >>>>>>> response times from the Gatekeeper there again. >>>>>> Authenticating to >>>>>>> the gatekeeper should only take a second or two, >>>>>> but it is >>>>>>> periodically taking up to 16 seconds: >>>>>>> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>> tg-grid.uc.teragrid.org >>>>>>> GRAM Authentication test successful >>>>>>> real 0m16.096s >>>>>>> user 0m0.060s >>>>>>> sys 0m0.020s >>>>>>> >>>>>>> looking at the load on tg-grid, it is rather high: >>>>>>> >>>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>>> 89.59, 78.69, 62.92 >>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>>> stopped, 0 zombie >>>>>>> >>>>>>> And there appear to be a large number of processes >>>>>> owned by kubal: >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>> 380 >>>>>>> >>>>>>> I assume that Mike is using swift to do the job >>>>>> submission. Is >>>>>>> there some throttling of the rate at which jobs >>>>>> are submitted to >>>>>>> the gatekeeper that could be done that would >>>>>> lighten this load >>>>>>> some? (Or has that already been done since >>>>>> earlier today?) The >>>>>>> current response times are not unacceptable, but >>>>>> I'm hoping to >>>>>>> avoid having the machine grind to a halt as it did >>>>>> earlier today. >>>>>>> >>>>>>> Thanks, >>>>>>> joe. >>>>>>> >>>>>>> >>>>>>> >>>>>> =================================================== >>>>>>> joseph a. >>>>>>> insley >>>>>> >>>>>>> insley at mcs.anl.gov >>>>>>> mathematics & computer science division >>>>>> (630) 252-5649 >>>>>>> argonne national laboratory >>>>>> (630) >>>>>>> 252-5986 (fax) >>>>>>> >>>>>>> >>>>>> >>>>>> =================================================== >>>>>> joseph a. insley >>>>>> >>>>>> insley at mcs.anl.gov >>>>>> mathematics & computer science division (630) >>>>>> 252-5649 >>>>>> argonne national laboratory >>>>>> (630) >>>>>> 252-5986 (fax) >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>>> __________________________________________________________________ >>>>> __________________ >>>>> Be a better friend, newshound, and >>>>> know-it-all with Yahoo! Mobile. Try it now. http:// >>>>> mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>> >>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > -- > ================================================== > Ioan Raicu > Ph.D. Candidate > ================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall
> Chicago, IL 60637
> ==================================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
> ==================================================
> ==================================================

===================================================
joseph a. insley                     insley at mcs.anl.gov
mathematics & computer science division   (630) 252-5649
argonne national laboratory               (630) 252-5986 (fax)

From hategan at mcs.anl.gov Tue Jan 29 14:32:24 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jan 2008 14:32:24 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov>
Message-ID: <1201638744.24149.4.camel@blabla.mcs.anl.gov>

On Tue, 2008-01-29 at 14:06 -0600, Stuart Martin wrote:
> This is the classic GRAM2 scaling issue due to each job polling for
> status to the LRM. condor-g does all sorts of things to make GRAM2
> scale for that scenario. If swift is not using condor-g and not doing
> the condor-g tricks, then I'd recommend swift to switch to using gram4.
>
> -Stu

Swift should work with gram4 as it is. One needs, however, to specify that in the sites.xml file.
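For reference, a pool entry of roughly the following shape selects gram4. This is a sketch patterned on the Swift site-catalog documentation of the era, not Mike's actual file; the hostnames, paths, and project code are placeholders. A pre-WS (GRAM2) entry would instead use provider="gt2" with a contact like tg-grid.uc.teragrid.org/jobmanager-pbs.

    <pool handle="uc-teragrid">
      <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org"/>
      <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"/>
      <workdirectory>/home/kubal/swiftwork</workdirectory>
      <profile namespace="globus" key="project">TG-XXXXXXX</profile>
    </pool>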
From hategan at mcs.anl.gov Tue Jan 29 15:39:25 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jan 2008 15:39:25 -0600
Subject: [Swift-devel] Re: Question and update on swift script
In-Reply-To: <479A0C53.2060108@fnal.gov>
References: <476064D0.7050208@mcs.anl.gov> <479A0C53.2060108@fnal.gov>
Message-ID: <1201642765.25746.2.camel@blabla.mcs.anl.gov>

Mike is probably on or around a plane.

Here's my feeble attempt at it.

Mihael

On Fri, 2008-01-25 at 10:20 -0600, James Simone wrote:
> Hi Mike,
>
> Could you please send us your target version of the 2-pt QCD workflow?
> The target swiftscipt would not necessarily be able to run with current > swift, but it would include syntax changes that are/will be under development. > > Thanks, > --jim > > Michael Wilde wrote: > > Hi all, > > > > I made some good progress yesterday in understanding the swift code for > > the 2ptHL workflow, and started revising it. > > > > Im doing the following: > > - giving all files a mnemonic type name > > - creating compound types that encapsulate each data/info file pair > > - putting each set of nested foreach loops into a procedure (for better > > abstraction) > > - changing the mapper to tie it to each data type > > > > For now, Im also pulling the two cache-loading functions out of the > > workflow, as it seems like these should be part of the runtime > > environment, rather than in the workflow script. Do you feel the same? > > > > I *thought* that you had a set of wrappers that were python stubs for > > simulated execution, but the wrappers Don sent me looked like wrappers > > that call the *real* code. So for now I'll create my own simple stubs to > > test the data flow with. > > > > Ive got many more questions that Im collecting, but for now I only need > > the answer to this one: > > > > In the nested loops that call the function Onia() (lines 136-148 in the > > attached numbered listing), is the actual call at line 145 correct? > > This call is passing the same "HeavyQuarkConverted" as both the > > anti-quark and quark1. Im assuming that is the correct intent. > > Also, its passing only the wave[1] wave file (1S) each time (second of > > the three wave files d, 1S, 2S). (Same is true for BStaggered). > > > > Lastly, Onia seems to be getting called with successive pairs of > > converted heavy quark files, but it seems like the final call is going > > to get a null file for the second file, as the value of "bindex" will be > > one past the last converted heavy quark file computed. > > > > Im looking for ways to avoid the way the current script needs to compute > > the various array indices, but Im not likely to find something that > > doesnt require a language change. Was this approach of computing the > > indices something that your team liked, or did not like, about this > > swift script? > > > > I'll try to send more details and questions as I proceed. > > A few of these are below. If you have a moment to consider these, that > > will help me get a better understanding of the environment. > > > > Thanks, > > > > - Mike > > > > qcopy: I need to learn more about how you manage data and use dcache, > > but for now, I get the essence if this. Can you point me to a qcopy doc > > page though? I couldnt find one. > > > > tag_array_mapper: I can guess how this works, but can you send me the > > code for it? It seems like the order in which it fills the array must > > match the index computations in your swift script. Looks to me like the > > leftmost tag in the format script is varied fastest (ie, is the "least > > significant index"). Is this correct? > > > > kappa value passing bug: you allude to this in a comment. Was this a > > swift problem or a problem in the py wrapper? Resolved or still open? > > If Swift, I can test to see if I can reproduce it. But I suspect you > > were testing against a pretty old version of swift? > > > > Is the notion of an "info" file paired with most data files a standard > > metadata convention for LQCD? (Ie, Im assuming that this is done > > throughout your apps, not just in this swift example, right? 
If so, it
> > seems to justify a mapping convention so that you can simply pass a
> > "Quark" object, and have the data and info files passed automatically,
> > together. You can then dereference the object top extract each field
> > (file).
> >
> > Are the file naming conventions set by the tag mapper the ones already
> > in use by the current workflow system? I.e., the order of the foreach
> > loops and hence of the mappings was chosen carefully to match the
> > existing file-naming conventions?
> >
> > How is the name of the ensemble chosen? Does it have a relation to the
> > phyarams? Is it a database key? (It seems opaque to the current swift
> > example. Is that by preference or was there a desire to expose its
> > structure? Its its contents related to the phyparams? Or when looked up
> > in a DB, does it yield the phyparams?
> >
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
-------------- next part --------------
type file;
type Template;
type Gauge;
type Quark;
type Stag;
type Clover;
type QuarkArchive;
type StagArchive;

app (Gauge gauge) stageIn(Template t, String config) {
    stageIn "-v" "-i" filename(t) "-c" config "-o" filename(gauge);
}

app (Stag stags[]) stagSolve(Gauge gauge, String mass, String source) {
    stagSolve "-v" "-g" filename(gauge) "-m" mass "-s" source "-o" filename(stags);
}

app (Clover out) cloverSolve(String kappa, String cSW, Gauge gauge, String source) {
    cloverSolve "-v" "-k" kappa "-c" cSW "-g" filename(gauge) "-s" source "-o" filename(out);
}

app (Quark q) cvt12x12(Clover c) {
    CVT12x12 "-v" "-i" filename(c) "-o" filename(q);
}

app (QuarkArchive a) archive(Quark q) {
    Archive "-v" "-i" filename(q) "-o" filename(a);
}

app (StagArchive a) archiveStag(String mass, Stag stags[]) {
    ArchiveStag "-v" "-m" mass "-i" filename(stags) "-o" filename(a);
}

app (file sdo) twoPtHH(Gauge gauge, Quark antiQ, Quark Q0, Quark Q1, Quark Q2) {
    TwoPtHH "-v" "-g" filename(gauge) "-a" filename(antiQ)
            "-0" filename(Q0) "-1" filename(Q1) "-2" filename(Q2)
            stdout=filename(sdo);
}

app (file sdo) twoPtSH(Gauge gauge, Stag stag, Quark antiQ, Quark Q0, Quark Q1, Quark Q2) {
    TwoPtSH "-v" "-g" filename(gauge) "-s" filename(stag) "-a" filename(antiQ)
            "-0" filename(Q0) "-1" filename(Q1) "-2" filename(Q2)
            stdout=filename(sdo);
}

String[] confList = ["000102", "000108", "000114"];
String ensemble = "l4096f21b708m0031m031";
String kappaQ = "0.127";
String cSW = "1.75";
String mass = "0.005,0.007,0.01,0.02,0.03";

# per-configuration output tags; not yet consumed below
String[] fn = ["m0.005_000102 m0.007_000102 m0.01_000102 m0.02_000102 m0.03_000102",
               "m0.005_000108 m0.007_000108 m0.01_000108 m0.02_000108 m0.03_000108",
               "m0.005_000114 m0.007_000114 m0.01_000114 m0.02_000114 m0.03_000114"];

# source descriptors: a local (point) source and 1S/2S wavefunction sources
String[] sources = ["local,0,0,0,0", "wavefunction,0,1S", "wavefunction,0,2S"];

foreach config in confList {
    # gauge template name
    Template template <"foo">;
    # assumed: the configuration ID selects the gauge file to stage in
    Gauge gauge = stageIn(template, config);

    # staggered solves for all masses; the local source is assumed here
    Stag stags[];
    stags = stagSolve(gauge, mass, sources[0]);
    StagArchive stagTar;
    stagTar = archiveStag(mass, stags);

    # one clover solve and 12x12 conversion per source type
    Quark q[];
    QuarkArchive cvtArch[];
    foreach j in [0:2] {
        Clover clover = cloverSolve(kappaQ, cSW, gauge, sources[j]);
        q[j] = cvt12x12(clover);
        cvtArch[j] = archive(q[j]);
    }

    # heavy-heavy and staggered-heavy two-point functions
    Quark antiQ = q[0];
    file pStdout = twoPtHH(gauge, antiQ, q[0], q[1], q[2]);
    foreach stag in stags {
        file sStdout = twoPtSH(gauge, stag, antiQ, q[0], q[1], q[2]);
    }
}

From simone at fnal.gov Tue Jan 29 18:22:16 2008
From: simone at fnal.gov (James Simone)
Date: Tue, 29 Jan 2008 18:22:16 -0600
Subject: [Swift-devel] Re: Question and update on swift script
In-Reply-To:
<1201642765.25746.2.camel@blabla.mcs.anl.gov> References: <476064D0.7050208@mcs.anl.gov> <479A0C53.2060108@fnal.gov> <1201642765.25746.2.camel@blabla.mcs.anl.gov> Message-ID: <479FC338.8070402@fnal.gov> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Thanks Mihael! Mihael Hategan wrote: | Mike is probably on or around a plane. | | Here's my feeble attempt at it. | | Mihael | | On Fri, 2008-01-25 at 10:20 -0600, James Simone wrote: |> Hi Mike, |> |> Could you please send us your target version of the 2-pt QCD workflow? |> The target swiftscipt would not necessarily be able to run with current |> swift, but it would include syntax changes that are/will be under development. |> |> Thanks, |> --jim |> |> Michael Wilde wrote: |>> Hi all, |>> |>> I made some good progress yesterday in understanding the swift code for |>> the 2ptHL workflow, and started revising it. |>> |>> Im doing the following: |>> - giving all files a mnemonic type name |>> - creating compound types that encapsulate each data/info file pair |>> - putting each set of nested foreach loops into a procedure (for better |>> abstraction) |>> - changing the mapper to tie it to each data type |>> |>> For now, Im also pulling the two cache-loading functions out of the |>> workflow, as it seems like these should be part of the runtime |>> environment, rather than in the workflow script. Do you feel the same? |>> |>> I *thought* that you had a set of wrappers that were python stubs for |>> simulated execution, but the wrappers Don sent me looked like wrappers |>> that call the *real* code. So for now I'll create my own simple stubs to |>> test the data flow with. |>> |>> Ive got many more questions that Im collecting, but for now I only need |>> the answer to this one: |>> |>> In the nested loops that call the function Onia() (lines 136-148 in the |>> attached numbered listing), is the actual call at line 145 correct? |>> This call is passing the same "HeavyQuarkConverted" as both the |>> anti-quark and quark1. Im assuming that is the correct intent. |>> Also, its passing only the wave[1] wave file (1S) each time (second of |>> the three wave files d, 1S, 2S). (Same is true for BStaggered). |>> |>> Lastly, Onia seems to be getting called with successive pairs of |>> converted heavy quark files, but it seems like the final call is going |>> to get a null file for the second file, as the value of "bindex" will be |>> one past the last converted heavy quark file computed. |>> |>> Im looking for ways to avoid the way the current script needs to compute |>> the various array indices, but Im not likely to find something that |>> doesnt require a language change. Was this approach of computing the |>> indices something that your team liked, or did not like, about this |>> swift script? |>> |>> I'll try to send more details and questions as I proceed. |>> A few of these are below. If you have a moment to consider these, that |>> will help me get a better understanding of the environment. |>> |>> Thanks, |>> |>> - Mike |>> |>> qcopy: I need to learn more about how you manage data and use dcache, |>> but for now, I get the essence if this. Can you point me to a qcopy doc |>> page though? I couldnt find one. |>> |>> tag_array_mapper: I can guess how this works, but can you send me the |>> code for it? It seems like the order in which it fills the array must |>> match the index computations in your swift script. Looks to me like the |>> leftmost tag in the format script is varied fastest (ie, is the "least |>> significant index"). Is this correct? 
From mikekubal at yahoo.com Tue Jan 29 20:00:09 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Tue, 29 Jan 2008 18:00:09 -0800 (PST)
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <479F5047.2040805@mcs.anl.gov>
Message-ID: <361627.86365.qm@web52306.mail.re2.yahoo.com>

The attachment contains the swift script, tc file,
sites file and swift.properties file.

I didn't provide any additional command line arguments.

MikeK

--- Michael Wilde wrote:

> [ was Re: Swift jobs on UC/ANL TG ]
>
> Hi. I'm at O'Hare and will be flying soon.
> Ben or Mihael, if you are online, can you investigate?
>
> Yes, there are significant throttles turned on by default, and the
> system opens those very gradually.
>
> MikeK, can you post to the swift-devel list your swift.properties file,
> command line options, and your swift source code?
>
> Thanks,
>
> MikeW
>
> On 1/29/08 8:11 AM, Ti Leggett wrote:
> > The default walltime is 15 minutes. Are you doing fork jobs or pbs
> > jobs? You shouldn't be doing fork jobs at all. Mike W, I thought there
> > were throttles in place in Swift to prevent this type of overrun?
> > Mike K, I'll need you to either stop these types of jobs until Mike W
> > can verify throttling or only submit a few 10s of jobs at a time.
> >
> > On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote:
> >
> >> Yes, I'm submitting molecular dynamics simulations using Swift.
> >>
> >> Is there a default wall-time limit for jobs on tg-uc?
> >>
> >> --- joseph insley wrote:
> >>
> >>> Actually, these numbers are now escalating...
> >>>
> >>> top - 17:18:54 up 2:29, 1 user, load average: 149.02, 123.63, 91.94
> >>> Tasks: 469 total, 4 running, 465 sleeping, 0 stopped, 0 zombie
> >>>
> >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>> 479
> >>>
> >>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>> GRAM Authentication test successful
> >>> real 0m26.134s
> >>> user 0m0.090s
> >>> sys 0m0.010s
> >>>
> >>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote:
> >>>
> >>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL TG GRAM host)
> >>>> became unresponsive and had to be rebooted. I am now seeing slow
> >>>> response times from the Gatekeeper there again. Authenticating to
> >>>> the gatekeeper should only take a second or two, but it is
> >>>> periodically taking up to 16 seconds:
> >>>>
> >>>> insley at tg-viz-login1:~> time globusrun -a -r tg-grid.uc.teragrid.org
> >>>> GRAM Authentication test successful
> >>>> real 0m16.096s
> >>>> user 0m0.060s
> >>>> sys 0m0.020s
> >>>>
> >>>> looking at the load on tg-grid, it is rather high:
> >>>>
> >>>> top - 16:55:26 up 2:06, 1 user, load average: 89.59, 78.69, 62.92
> >>>> Tasks: 398 total, 20 running, 378 sleeping, 0 stopped, 0 zombie
> >>>>
> >>>> And there appear to be a large number of processes owned by kubal:
> >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l
> >>>> 380
> >>>>
> >>>> I assume that Mike is using swift to do the job submission. Is
> >>>> there some throttling of the rate at which jobs are submitted to
> >>>> the gatekeeper that could be done that would lighten this load
> >>>> some? (Or has that already been done since earlier today?) The
> >>>> current response times are not unacceptable, but I'm hoping to
> >>>> avoid having the machine grind to a halt as it did earlier today.
> >>>>
> >>>> Thanks,
> >>>> joe.
> >>>>
> >>>> ===================================================
> >>>> joseph a. insley            insley at mcs.anl.gov
> >>>> mathematics & computer science division    (630) 252-5649
> >>>> argonne national laboratory                (630) 252-5986 (fax)

From wilde at mcs.anl.gov Tue Jan 29 20:02:40 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jan 2008 20:02:40 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <361627.86365.qm@web52306.mail.re2.yahoo.com>
References: <361627.86365.qm@web52306.mail.re2.yahoo.com>
Message-ID: <479FDAC0.20301@mcs.anl.gov>

MikeK, no attachment.
I've narrowed the cc list, and need to read back through the email thread
on this to see what Mihael observed.

- MikeW

On 1/29/08 8:00 PM, Mike Kubal wrote:
> The attachment contains the swift script, tc file,
> sites file and swift.properties file.
>
> I didn't provide any additional command line arguments.
>
> MikeK
> [snip]
From hategan at mcs.anl.gov Tue Jan 29 20:06:10 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jan 2008 20:06:10 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <479FDAC0.20301@mcs.anl.gov>
References: <361627.86365.qm@web52306.mail.re2.yahoo.com>
	<479FDAC0.20301@mcs.anl.gov>
Message-ID: <1201658770.31610.1.camel@blabla.mcs.anl.gov>

On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde wrote:
> MikeK, no attachment.
>
> I've narrowed the cc list, and need to read back through the email
> thread on this to see what Mihael observed.

Let me summarize: too many gt2 gram jobs running concurrently = too many
job manager processes = high load on gram node. Not a new issue.

> [snip]
From mikekubal at yahoo.com Tue Jan 29 20:31:59 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Tue, 29 Jan 2008 18:31:59 -0800 (PST)
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <1201658770.31610.1.camel@blabla.mcs.anl.gov>
Message-ID: <621635.1992.qm@web52302.mail.re2.yahoo.com>

sorry, long day : )

--- Mihael Hategan wrote:

> On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde wrote:
> > MikeK, no attachment.
> > [snip]
>
> Let me summarize: too many gt2 gram jobs running concurrently = too
> many job manager processes = high load on gram node. Not a new issue.

=== message truncated ===

-------------- next part --------------
A non-text attachment was scrubbed...
Name: swift_stuff.tar
Type: application/x-tar
Size: 30720 bytes
Desc: 382151955-swift_stuff.tar
URL:
From hategan at mcs.anl.gov Tue Jan 29 20:42:52 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jan 2008 20:42:52 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <621635.1992.qm@web52302.mail.re2.yahoo.com>
References: <621635.1992.qm@web52302.mail.re2.yahoo.com>
Message-ID: <1201660972.32154.1.camel@blabla.mcs.anl.gov>

You may want to try to lower throttle.score.job.factor from 4 to 1. That
will cap the number of jobs at ~100 instead of ~400.

Mihael

On Tue, 2008-01-29 at 18:31 -0800, Mike Kubal wrote:
> sorry, long day : )
> [snip]
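
In swift.properties terms, the suggestion is the single line below. Only
the property name and the 4-to-1 change come from Mihael's message; the
comment paraphrases his ~400 vs ~100 figures, which are estimates, not
guaranteed caps.

# default is 4 (Swift works up to roughly 400 concurrent jobs);
# a factor of 1 caps it near 100
throttle.score.job.factor=1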
From hategan at mcs.anl.gov Tue Jan 29 20:47:32 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jan 2008 20:47:32 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <1201660972.32154.1.camel@blabla.mcs.anl.gov>
References: <621635.1992.qm@web52302.mail.re2.yahoo.com>
	<1201660972.32154.1.camel@blabla.mcs.anl.gov>
Message-ID: <1201661252.32327.2.camel@blabla.mcs.anl.gov>

That and/or try using ws-gram:

On Tue, 2008-01-29 at 20:42 -0600, Mihael Hategan wrote:
> You may want to try to lower throttle.score.job.factor from 4 to 1. That
> will cap the number of jobs at ~100 instead of ~400.
>
> Mihael
> [snip]
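
The sites.xml element Mihael pasted after "ws-gram:" was eaten by the
archive's HTML scrubbing; only a minor="0" patch="0" fragment survives in
the quotes below. A pool entry selecting WS-GRAM in the sites-file format
of that era would look roughly like this sketch -- the handle, URLs and
work directory are illustrative guesses, not his actual values:

<pool handle="uc-teragrid">
  <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"/>
  <jobmanager universe="vanilla"
              url="tg-grid.uc.teragrid.org/jobmanager-pbs"
              major="4" minor="0" patch="0"/>
  <workdirectory>/home/kubal/swiftwork</workdirectory>
</pool>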
From wilde at mcs.anl.gov Tue Jan 29 21:04:17 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jan 2008 21:04:17 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <1201661252.32327.2.camel@blabla.mcs.anl.gov>
References: <621635.1992.qm@web52302.mail.re2.yahoo.com>
	<1201660972.32154.1.camel@blabla.mcs.anl.gov>
	<1201661252.32327.2.camel@blabla.mcs.anl.gov>
Message-ID: <479FE931.2090306@mcs.anl.gov>

MikeK, this may be obvious but just in case:

On 1/29/08 8:47 PM, Mihael Hategan wrote:
> That and/or try using ws-gram:
> minor="0" patch="0"/>

(this goes in the sites.xml file)

Q for the group: is ws-gram supported on uc.teragrid?

> On Tue, 2008-01-29 at 20:42 -0600, Mihael Hategan wrote:
>> You may want to try to lower throttle.score.job.factor from 4 to 1.
>> That will cap the number of jobs at ~100 instead of ~400.
>>
>> Mihael

for info on setting Swift properties, see "Swift Engine Configuration"
in the users guide at:

http://www.ci.uchicago.edu/swift/guides/userguide.php#properties

- MikeW

>> [snip]
From iraicu at cs.uchicago.edu Tue Jan 29 21:25:49 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 29 Jan 2008 21:25:49 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <479FE931.2090306@mcs.anl.gov>
References: <621635.1992.qm@web52302.mail.re2.yahoo.com>
	<1201660972.32154.1.camel@blabla.mcs.anl.gov>
	<1201661252.32327.2.camel@blabla.mcs.anl.gov>
	<479FE931.2090306@mcs.anl.gov>
Message-ID: <479FEE3D.2050102@cs.uchicago.edu>

Yong and I ran most of our tests (from Swift) using WS-GRAM (aka GRAM4)
on UC/ANL TG, and I use Falkon on the same cluster using only WS-GRAM.
If I am not mistaken, all TG sites support WS-GRAM.

Ioan

Michael Wilde wrote:
> MikeK, this may be obvious but just in case:
>
> On 1/29/08 8:47 PM, Mihael Hategan wrote:
>> That and/or try using ws-gram:
>> minor="0" patch="0"/>
>
> (this goes in the sites.xml file)
>
> Q for the group: is ws-gram supported on uc.teragrid?
> [snip]

--
==================================================
Ioan Raicu
Ph.D. Candidate
==================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
==================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
==================================================
==================================================
From hategan at mcs.anl.gov Tue Jan 29 21:35:21 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jan 2008 21:35:21 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com>
	<5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov>
	<479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov>
	<479F809B.5050306@cs.uchicago.edu>
	<3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov>
	<523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov>
Message-ID: <1201664122.32688.36.camel@blabla.mcs.anl.gov>

So I've been thinking about this...

Our current throttling parameters are the result of long discussions and
trials (pretty heated at times; the discussions, that is). Obviously they
are not always appropriate. But that's not the problem. The problem, I
think, is the lack of consensus on (and sometimes even the ability to
articulate) what is OK and what isn't.

Currently our process of determining this for a site is to try to
maximize performance while avoiding failures (this may imply high
utilization on both the client side and the service side), and to tone
down when the site admins complain. I'm not sure how reasonable this is
for our users.

The other strategies I've seen are:

1. Condor: Make it slow but safe. This works as long as users don't have
a frame of reference to judge how slow things are. My bosses don't seem
to like this one (nor do I, for that matter), but it is a decent
strategy: users get their job done (albeit slowly) and sites don't
complain much.

2. LEAD: Lobby to every consequential body and urge for the services to
be sufficiently scalable to address the specific requirements of that
project (as much as is possible given that LEAD does not have
exclusivity).
I've expressed my opinion on this one.

So how do we figure out the metrics (e.g. how many total concurrent jobs,
the rate of submissions, etc.), and how can we reach some consensus on
the numbers? Can we build some automated system that would allow clients
and services to negotiate such parameters?

Mihael

On Tue, 2008-01-29 at 14:06 -0600, Stuart Martin wrote:
> This is the classic GRAM2 scaling issue, due to each job polling the
> LRM for status. condor-g does all sorts of things to make GRAM2 scale
> for that scenario. If swift is not using condor-g and not doing the
> condor-g tricks, then I'd recommend swift switch to using gram4.
>
> -Stu
>
> On Jan 29, 2008, at Jan 29, 1:57 PM, joseph insley wrote:
>
> > I was seeing Mike's jobs show up in the queue, and running on the
> > backend nodes, and the processes I was seeing on tg-grid appeared to
> > be gram and not some other application, so it would seem that it was
> > indeed using PBS.
> >
> > However, it appears to be using PRE-WS GRAM.... I still had some of
> > the 'ps | grep kubal' output in my scrollback:
> >
> > insley at tg-grid1:~> ps -ef | grep kubal
> > kubal 16981 1 0 16:41 ? 00:00:00 globus-job-manager -conf
> > /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs
> > -rdn jobmanager-pbs -machine-type unknown -publish-jobs
> > kubal 18390 1 0 16:42 ? 00:00:00 globus-job-manager -conf
> > /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs
> > -rdn jobmanager-pbs -machine-type unknown -publish-jobs
> > kubal 18891 1 0 16:43 ? 00:00:00 globus-job-manager -conf
> > /soft/prews-gram-4.0.1-r3/etc/globus-job-manager.conf -type pbs
> > -rdn jobmanager-pbs -machine-type unknown -publish-jobs
> > kubal 18917 1 0 16:43 ?
> >
> > [snip]
> >
> > kubal 28200 25985 0 16:50 ? 00:00:00 /usr/bin/perl
> > /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs
> > -f /tmp/gram_iwEHrc -c poll
> > kubal 28201 26954 1 16:50 ? 00:00:00 /usr/bin/perl
> > /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs
> > -f /tmp/gram_lQaIPe -c poll
> > kubal 28202 19438 1 16:50 ? 00:00:00 /usr/bin/perl
> > /soft/prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs
> > -f /tmp/gram_SPsdme -c poll
> >
> > On Jan 29, 2008, at 1:38 PM, Ioan Raicu wrote:
> >
> >> Can someone double check that the jobs are using PBS (and not FORK)
> >> in GRAM? If you are using FORK, then the high load is being caused
> >> by the applications running on the GRAM host. If it is PBS, then I
> >> don't know, others might have more insight.
> >>
> >> Ioan
> >>
> >> Ian Foster wrote:
> >>> Hi,
> >>>
> >>> I've CCed Stuart Martin--I'd greatly appreciate some insights into
> >>> what is causing this. I assume that you are using GRAM4 (aka
> >>> WS-GRAM)?
> >>>
> >>> Ian.
> >>>
> >>> Michael Wilde wrote:
> >>>> [ was Re: Swift jobs on UC/ANL TG ]
> >>>> [snip]
Mike W, I > >>>>> thought there were throttles in place in Swift to prevent this > >>>>> type of overrun? Mike K, I'll need you to either stop these > >>>>> types of jobs until Mike W can verify throttling or only submit > >>>>> a few 10s of jobs at a time. > >>>>> > >>>>> On Jan 28, 2008, at 01/28/08 07:13 PM, Mike Kubal wrote: > >>>>> > >>>>>> Yes, I'm submitting molecular dynamics simulations > >>>>>> using Swift. > >>>>>> > >>>>>> Is there a default wall-time limit for jobs on tg-uc? > >>>>>> > >>>>>> > >>>>>> > >>>>>> --- joseph insley wrote: > >>>>>> > >>>>>>> Actually, these numbers are now escalating... > >>>>>>> > >>>>>>> top - 17:18:54 up 2:29, 1 user, load average: > >>>>>>> 149.02, 123.63, 91.94 > >>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 > >>>>>>> stopped, 0 zombie > >>>>>>> > >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>> 479 > >>>>>>> > >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>> tg-grid.uc.teragrid.org > >>>>>>> GRAM Authentication test successful > >>>>>>> real 0m26.134s > >>>>>>> user 0m0.090s > >>>>>>> sys 0m0.010s > >>>>>>> > >>>>>>> > >>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > >>>>>>> > >>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL > >>>>>>> TG GRAM host) > >>>>>>>> became unresponsive and had to be rebooted. I am > >>>>>>> now seeing slow > >>>>>>>> response times from the Gatekeeper there again. > >>>>>>> Authenticating to > >>>>>>>> the gatekeeper should only take a second or two, > >>>>>>> but it is > >>>>>>>> periodically taking up to 16 seconds: > >>>>>>>> > >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>> tg-grid.uc.teragrid.org > >>>>>>>> GRAM Authentication test successful > >>>>>>>> real 0m16.096s > >>>>>>>> user 0m0.060s > >>>>>>>> sys 0m0.020s > >>>>>>>> > >>>>>>>> looking at the load on tg-grid, it is rather high: > >>>>>>>> > >>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: > >>>>>>> 89.59, 78.69, 62.92 > >>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 > >>>>>>> stopped, 0 zombie > >>>>>>>> > >>>>>>>> And there appear to be a large number of processes > >>>>>>> owned by kubal: > >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>> 380 > >>>>>>>> > >>>>>>>> I assume that Mike is using swift to do the job > >>>>>>> submission. Is > >>>>>>>> there some throttling of the rate at which jobs > >>>>>>> are submitted to > >>>>>>>> the gatekeeper that could be done that would > >>>>>>> lighten this load > >>>>>>>> some? (Or has that already been done since > >>>>>>> earlier today?) The > >>>>>>>> current response times are not unacceptable, but > >>>>>>> I'm hoping to > >>>>>>>> avoid having the machine grind to a halt as it did > >>>>>>> earlier today. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> joe. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> =================================================== > >>>>>>>> joseph a. > >>>>>>>> insley > >>>>>>> > >>>>>>>> insley at mcs.anl.gov > >>>>>>>> mathematics & computer science division > >>>>>>> (630) 252-5649 > >>>>>>>> argonne national laboratory > >>>>>>> (630) > >>>>>>>> 252-5986 (fax) > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> =================================================== > >>>>>>> joseph a. 
insley
> >>>>>>> insley at mcs.anl.gov
> >>>>>>> mathematics & computer science division    (630) 252-5649
> >>>>>>> argonne national laboratory                (630) 252-5986 (fax)
> >>>>
> >>>> === message truncated ===
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From hategan at mcs.anl.gov Tue Jan 29 21:38:15 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jan 2008 21:38:15 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <479FEE3D.2050102@cs.uchicago.edu>
References: <621635.1992.qm@web52302.mail.re2.yahoo.com> <1201660972.32154.1.camel@blabla.mcs.anl.gov> <1201661252.32327.2.camel@blabla.mcs.anl.gov> <479FE931.2090306@mcs.anl.gov> <479FEE3D.2050102@cs.uchicago.edu>
Message-ID: <1201664295.32688.40.camel@blabla.mcs.anl.gov>

I'm becoming confused now. Last time I spoke to Yong about WS-GRAM, it was less scalable and slower (although that varied) than gt2 gram.

So unless I see some numbers, I personally won't believe either of the statements.

On Tue, 2008-01-29 at 21:25 -0600, Ioan Raicu wrote:
> Yong and I ran most of our tests (from Swift) using WS-GRAM (aka GRAM4)
> on UC/ANL TG, and I use Falkon on the same cluster using only WS-GRAM.
> If I am not mistaken, all TG sites support WS-GRAM.
>
> Ioan
>
> Michael Wilde wrote:
> > MikeK, this may be obvious but just in case:
> >
> > On 1/29/08 8:47 PM, Mihael Hategan wrote:
> >> That and/or try using ws-gram:
> >> minor="0" patch="0"/>
> >
> > (this goes in the sites.xml file)
> >
> > Q for the group: is ws-gram supported on uc.teragrid?
> >
> >> On Tue, 2008-01-29 at 20:42 -0600, Mihael Hategan wrote:
> >>> You may want to try to lower throttle.score.job.factor from 4 to 1. That
> >>> will cap the number of jobs at ~100 instead of ~400.
> >>>
> >>> Mihael
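[The sites.xml element in the ws-gram suggestion above was mangled in the archive; only the trailing version attributes survive. The two suggestions probably amounted to something like the following. The element name, URL, and attribute spellings are a reconstruction from sites.xml files of that era, not the original text, so treat them as an assumption to be checked against the user guide:

    # swift.properties: cap concurrent jobs at roughly 100 instead of roughly 400
    throttle.score.job.factor=1

    <!-- sites.xml: hypothetical ws-gram (GT 4.0.x) entry; reconstructed, unverified -->
    <jobmanager universe="vanilla"
        url="https://tg-grid.uc.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"
        major="4" minor="0" patch="0"/>
]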
> >
> > for info on setting Swift properties, see "Swift Engine Configuration"
> > in the users guide at:
> >
> > http://www.ci.uchicago.edu/swift/guides/userguide.php#properties
> >
> > - MikeW
> >
> >>> On Tue, 2008-01-29 at 18:31 -0800, Mike Kubal wrote:
> >>>> sorry, long day : )
> >>>>
> >>>> --- Mihael Hategan wrote:
> >>>>> On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde wrote:
> >>>>>> MikeK, no attachment.
> >>>>>>
> >>>>>> I've narrowed the cc list, and need to read back through the email
> >>>>>> thread on this to see what Mihael observed.
> >>>>> Let me summarize: too many gt2 gram jobs running concurrently = too many
> >>>>> job manager processes = high load on gram node. Not a new issue.
> >>>>>
> >>>>>> On 1/29/08 8:00 PM, Mike Kubal wrote:
> >>>>>>> The attachment contains the swift script, tc file, sites file and
> >>>>>>> swift.properties file.
> >>>>>>>
> >>>>>>> I didn't provide any additional command line arguments.
> >>>>>>>
> >>>>>>> MikeK
> >>>> [snip]
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From iraicu at cs.uchicago.edu Tue Jan 29 21:44:10 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 29 Jan 2008 21:44:10 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <1201664295.32688.40.camel@blabla.mcs.anl.gov>
References: <621635.1992.qm@web52302.mail.re2.yahoo.com> <1201660972.32154.1.camel@blabla.mcs.anl.gov> <1201661252.32327.2.camel@blabla.mcs.anl.gov> <479FE931.2090306@mcs.anl.gov> <479FEE3D.2050102@cs.uchicago.edu> <1201664295.32688.40.camel@blabla.mcs.anl.gov>
Message-ID: <479FF28A.1070302@cs.uchicago.edu>

Here is a paper from TG07 that compares GRAM2 with GRAM4. The conclusions of the paper are (copied and pasted from the paper at http://www.globus.org/alliance/publications/papers/TG07-GRAM-comparison.pdf):

* GRAM4 provides vastly better functionality than GRAM2, in numerous respects.
* GRAM4 provides better scalability than GRAM2, in terms of the number of concurrent jobs that can be supported. It also greatly reduces load on service nodes, and permits management of that load.
* GRAM4 performance is roughly comparable to that of GRAM2. (We still need to improve sequential submission and file staging performance, and we have plans for doing that, and also for other performance optimizations.)

You can draw your own conclusions once you read the paper. I also bet Stu has more numbers than were reported in this paper. From what I heard, GRAM2 will be optional in GT4.2, and will be phased out completely in GT4.4, so the upgrade to GRAM4 is inevitable.

Ioan

Mihael Hategan wrote:
> I'm becoming confused now. Last time I spoke to Yong about WS-GRAM, it
> was less scalable and slower (although that varied) than gt2 gram.
>
> So unless I see some numbers, I personally won't believe either of the
> statements.
>
> [snip]
From hategan at mcs.anl.gov Tue Jan 29 22:26:39 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 29 Jan 2008 22:26:39 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <479FF28A.1070302@cs.uchicago.edu>
References: <621635.1992.qm@web52302.mail.re2.yahoo.com> <1201660972.32154.1.camel@blabla.mcs.anl.gov> <1201661252.32327.2.camel@blabla.mcs.anl.gov> <479FE931.2090306@mcs.anl.gov> <479FEE3D.2050102@cs.uchicago.edu> <1201664295.32688.40.camel@blabla.mcs.anl.gov> <479FF28A.1070302@cs.uchicago.edu>
Message-ID: <1201667199.1052.34.camel@blabla.mcs.anl.gov>

Gotta love the use of such fuzzy terms as "vastly better", "greatly reduces", "roughly comparable", "virtually painless" (last one isn't from the paper).

Well, try it out. What I remember is that it eats more memory per job on the client side, so you probably need to:

export COG_OPTS="-Xmx512M"

... or more.

On Tue, 2008-01-29 at 21:44 -0600, Ioan Raicu wrote:
> Here is a paper from TG07 that compares GRAM2 with GRAM4.
The > conclusion of the paper are (copied and pasted from the paper at > http://www.globus.org/alliance/publications/papers/TG07-GRAM-comparison.pdf): > * GRAM4 provides vastly better functionality than GRAM2, in > numerous respects. > * GRAM4 provides better scalability than GRAM2, in terms of the > number of concurrent jobs that can be sup-port. It also > greatly reduces load on service nodes, and permits management > of that load. > * GRAM4 performance is roughly comparable to that of GRAM2. (We > still need to improve sequential submission and file staging > performance, and we have plans for doing that, and also for > other performance optimizations.) > You can draw your own conclusions once you read the paper. I also bet > Stu has more numbers than were reported in this paper. From what I > heard, GRAM2 will be optional in GT4.2, and will be phased out > completely in GT4.4, so the upgrade to GRAM4 is inevitable. > > Ioan > > > Mihael Hategan wrote: > > I'm becoming confused now. Last time I spoke to Yong about WS-GRAM, it > > was less scalable and slower (although that varied) than gt2 gram. > > > > So unless I see some numbers, I personally won't believe either of the > > statements. > > > > On Tue, 2008-01-29 at 21:25 -0600, Ioan Raicu wrote: > > > > > Yong and I ran most of our tests (from Swift) using WS-GRAM (aka GRAM4) > > > on UC/ANL TG, and I use Falkon on the same cluster using only WS-GRAM. > > > If I am not mistaken, all TG sites support WS-GRAM. > > > > > > Ioan > > > > > > Michael Wilde wrote: > > > > > > > MikeK, this may be obvious but just in case: > > > > > > > > On 1/29/08 8:47 PM, Mihael Hategan wrote: > > > > > > > > > That and/or try using ws-gram: > > > > > > > > > minor="0" patch="0"/> > > > > > > > > > (this goes in the sites.xml file) > > > > > > > > Q for the group: is ws-gram supported on uc.teragrid? > > > > > > > > > > > > > On Tue, 2008-01-29 at 20:42 -0600, Mihael Hategan wrote: > > > > > > > > > > > You may want to try to lower throttle.score.job.factor from 4 to 1. > > > > > > That > > > > > > will cap the number of jobs at ~100 instead of ~400. > > > > > > > > > > > > Mihael > > > > > > > > > > for info on setting Swift properties, see "Swift Engine Configuration" > > > > in the users guide at: > > > > > > > > http://www.ci.uchicago.edu/swift/guides/userguide.php#properties > > > > > > > > - MikeW > > > > > > > > > > > > > > On Tue, 2008-01-29 at 18:31 -0800, Mike Kubal wrote: > > > > > > > > > > > > > sorry, long day : ) > > > > > > > > > > > > > > > > > > > > > --- Mihael Hategan wrote: > > > > > > > > > > > > > > > > > > > > > > On Tue, 2008-01-29 at 20:02 -0600, Michael Wilde > > > > > > > > wrote: > > > > > > > > > > > > > > > > > MikeK, no attachment. > > > > > > > > > > > > > > > > > > Ive narrowed the cc list, and need to read back > > > > > > > > > > > > > > > > > through the email thread > > > > > > > > > > > > > > > > > on this to see what Mihael observed. > > > > > > > > > > > > > > > > > Let me summarize: too many gt2 gram jobs running > > > > > > > > concurrently = too many > > > > > > > > job manager processes = high load on gram node. Not > > > > > > > > a new issue. > > > > > > > > > > > > > > > > > > > > > > > > > - MikeW > > > > > > > > > > > > > > > > > > On 1/29/08 8:00 PM, Mike Kubal wrote: > > > > > > > > > > > > > > > > > > > The attachment contains the swift script, tc > > > > > > > > > > > > > > > > > > file, > > > > > > > > > > > > > > > > > > sites file and swift.properties file. 
> > > > > > > > > > > > > > > > > > > > I didn't provide any additional command line > > > > > > > > > > arguments. > > > > > > > > > > > > > > > > > > > > MikeK > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- Michael Wilde wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [ was Re: Swift jobs on UC/ANL TG ] > > > > > > > > > > > > > > > > > > > > > > Hi. Im at OHare and will be flying soon. > > > > > > > > > > > Ben or Mihael, if you are online, can you > > > > > > > > > > > investigate? > > > > > > > > > > > > > > > > > > > > > > Yes, there are significant throttles turned on > > > > > > > > > > > > > > > > > > > by > > > > > > > > > > > > > > > > > > > default, and the system opens those very gradually. > > > > > > > > > > > > > > > > > > > > > > MikeK, can you post to the swift-devel list > > > > > > > > > > > > > > > > > > > your > > > > > > > > > > > > > > > > > > > swift.properties file, command line options, and your swift source > > > > > > > > > > > > > > > > > > > code? > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > MikeW > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 1/29/08 8:11 AM, Ti Leggett wrote: > > > > > > > > > > > > > > > > > > > > > > > The default walltime is 15 minutes. Are you > > > > > > > > > > > > > > > > > > > > doing > > > > > > > > > > > > > > > > > > > fork jobs or pbs jobs? > > > > > > > > > > > > > > > > > > > > > > > You shouldn't be doing fork jobs at all. Mike > > > > > > > > > > > > > > > > > > > > W, I > > > > > > > > > > > > > > > > > > > thought there were > > > > > > > > > > > > > > > > > > > > > > > throttles in place in Swift to prevent this > > > > > > > > > > > > > > > > > > > > type > > > > > > > > > > > > > > > > > > > of overrun? Mike K, > > > > > > > > > > > > > > > > > > > > > > > I'll need you to either stop these types of > > > > > > > > > > > > > > > > > > > > jobs > > > > > > > > > > > > > > > > > > > until Mike W can verify > > > > > > > > > > > > > > > > > > > > > > > throttling or only submit a few 10s of jobs at > > > > > > > > > > > > > > > > > > > > a > > > > > > > > > > > > > > > > > > > time. > > > > > > > > > > > > > > > > > > > > > > > On Jan 28, 2008, at 01/28/08 07:13 PM, Mike > > > > > > > > > > > > > > > > > > > > Kubal > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > Yes, I'm submitting molecular dynamics > > > > > > > > > > > > > > > > > > > > > > > > simulations > > > > > > > > > > > > > > > > > > > > > > > > using Swift. > > > > > > > > > > > > > > > > > > > > > > > > > > Is there a default wall-time limit for jobs > > > > > > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > > > > tg-uc? > > > > > > > > > > > > > > > > > > > > > > > > --- joseph insley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Actually, these numbers are now > > > > > > > > > > > > > > > > > > > > > > escalating... 
> > > > > > > > > > > > > > > > > > > > > > top - 17:18:54 up 2:29, 1 user, load > > > > > > > > > > > > > > > > > > > > > > average: > > > > > > > > > > > > > > > > > > > > > > 149.02, 123.63, 91.94 > > > > > > > > > > > > > > Tasks: 469 total, 4 running, 465 sleeping, > > > > > > > > > > > > > > > > > > > > > > 0 > > > > > > > > > > > > > > > > > > > > > > stopped, 0 zombie > > > > > > > > > > > > > > > > > > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc > > > > > > > > > > > > > > > > > > > > > > -l > > > > > > > > > > > > > > > > > > > > > > 479 > > > > > > > > > > > > > > > > > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r > > > > > > > > > > > > > > tg-grid.uc.teragrid.org > > > > > > > > > > > > > > GRAM Authentication test successful > > > > > > > > > > > > > > real 0m26.134s > > > > > > > > > > > > > > user 0m0.090s > > > > > > > > > > > > > > sys 0m0.010s > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Jan 28, 2008, at 5:15 PM, joseph insley > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > Earlier today tg-grid.uc.teragrid.org (the > > > > > > > > > > > > > > > > > > > > > > > > > > UC/ANL > > > > > > > > > > > > > > > > > > > > > > > > > TG GRAM host) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > became unresponsive and had to be rebooted. > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > am > > > > > > > > > > > > > > > > > > > > > > > > > now seeing slow > > > > > > > > > > > > > > > > > > > > > > > > > > > > > response times from the Gatekeeper there > > > > > > > > > > > > > > > > > > > > > > > again. > > > > > > > > > > > > > > > > > > > > > > Authenticating to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the gatekeeper should only take a second or > > > > > > > > > > > > > > > > > > > > > > > > > > two, > > > > > > > > > > > > > > > > > > > > > > > > > but it is > > > > > > > > > > > > > > > > > > > > > > > > > > > > > periodically taking up to 16 seconds: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a > > > > > > > > > > > > > > > > > > > > > > > -r > > > > > > > > > > > > > > > > > > > > > > tg-grid.uc.teragrid.org > > > > > > > > > > > > > > > > > > > > > > > > > > > > > GRAM Authentication test successful > > > > > > > > > > > > > > > real 0m16.096s > > > > > > > > > > > > > > > user 0m0.060s > > > > > > > > > > > > > > > sys 0m0.020s > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > looking at the load on tg-grid, it is > > > > > > > > > > > > > > > > > > > > > > > rather > > > > > > > > > > > > > > > > > > > high: > > > > > > > > > > > > > > > > > > > > > > > > > > top - 16:55:26 up 2:06, 1 user, load > > > > > > > > > > > > > > > > > > > > > > > > > > average: > > > > > > > > > > > > > > > > > > > > > > > > > 89.59, 78.69, 62.92 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Tasks: 398 total, 20 running, 378 > > > > > > > > > > > > > > > > > > > > > > > sleeping, > > > > > > > > > > > > > > > > > > > 0 > > > > > > > > > > > > > > > > > > > > > > > > > stopped, 0 zombie > > > > > > > > > > > > > > > > > > > > > > > > > > > > > And there appear to be a large number of > > > > > > > > > > > > > > > > > > > > > > > > > > processes > > > > > > > > > > > > > > > > > > > > > > > > > owned by kubal: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep 
kubal | wc > > > > > > > > > > > > > > > > > > > > > > > -l > > > > > > > > > > > > > > > > > > > > > > > 380 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I assume that Mike is using swift to do the > > > > > > > > > > > > > > > > > > > > > > > job > > > > > > > > > > > > > > > > > > > > > > submission. Is > > > > > > > > > > > > > > > > > > > > > > > > > > > > > there some throttling of the rate at which > > > > > > > > > > > > > > > > > > > > > > > jobs > > > > > > > > > > > > > > > > > > > > > > are submitted to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the gatekeeper that could be done that > > > > > > > > > > > > > > > > > > > > > > > would > > > > > > > > > > > > > > > > > > > > > > lighten this load > > > > > > > > > > > > > > > > > > > > > > > > > > > > > some? (Or has that already been done since > > > > > > > > > > > > > > > > > > > > > > > > > > > > > earlier today?) The > > > > > > > > > > > > > > > > > > > > > > > > > > > > > current response times are not > > > > > > > > > > > > > > > > > > > > > > > unacceptable, > > > > > > > > > > > > > > > > > > > but > > > > > > > > > > > > > > > > > > > > > > > > > I'm hoping to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > avoid having the machine grind to a halt as > > > > > > > > > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > > > > > did > > > > > > > > > > > > > > > > > > > > > > > > > earlier today. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > joe. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > > > > > > > > > > > > > > > > > > > joseph a. > > > > > > > > > > > > > > > insley > > > > > > > > > > > > > > > insley at mcs.anl.gov > > > > > > > > > > > > > > > mathematics & computer science division > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (630) 252-5649 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > argonne national laboratory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (630) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > > > > > > > > > > > > > > > > > > joseph a. insley > > > > > > > > > > > > > > > > > > > > > > > > > > > > insley at mcs.anl.gov > > > > > > > > > > > > > > mathematics & computer science division > > > > > > > > > > > > > > > > > > > > > > > > > (630) > > > > > > > > > > > > > > > > > > > > > > > > > 252-5649 > > > > > > > > > > > > > > argonne national laboratory > > > > > > > > > > > > > > (630) > > > > > > > > > > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > === message truncated === > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > > > > > > > > > > > Looking for last minute shopping deals? Find them fast with Yahoo! > > > > > > > Search. 
From leggett at mcs.anl.gov Wed Jan 30 06:26:03 2008
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Wed, 30 Jan 2008 06:26:03 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <1201664122.32688.36.camel@blabla.mcs.anl.gov>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov>
Message-ID: 

As a site admin I would rather you ramp up and not throttle down. Starting high and working to a lower number means you could kill the machine many times before you find the lower bound of what a site can handle. Starting slowly and ramping up means you find that lower bound once. From my point of view, one user consistently killing the resource can be turned off to prevent denial of service to all other users *until* they can prove they won't kill the resource. So I prefer the conservative approach.

On Jan 29, 2008, at 01/29/08 09:35 PM, Mihael Hategan wrote:
> So I've been thinking about this...
> Our current throttling parameters are the result of long discussions and
> trials (pretty heated at times; the discussions, that is). Obviously they
> are not always appropriate. But that's not the problem.
> [snip]

From benc at hawaga.org.uk Wed Jan 30 07:46:07 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jan 2008 13:46:07 +0000 (GMT)
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: 
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov>
Message-ID: 

On Wed, 30 Jan 2008, Ti Leggett wrote:

> As a site admin I would rather you ramp up and not throttle down. Starting
> high and working to a lower number means you could kill the machine many
> times before you find the lower bound of what a site can handle. Starting
> slowly and ramping up means you find that lower bound once.
> [snip]
The code does ramp up at the moment, starting with 6 simultaneous jobs by default.

What doesn't happen very well at the moment is automated detection of 'too much' in order to stop ramping up - the only really good feedback at the moment (not just in this particular case, but in other cases before) seems to be a human being sitting in the feedback loop tweaking stuff.

Two things we should work on are: i) making it easier for the human who is sitting in that loop, and ii) figuring out a better way to get automated feedback.

From a TG-UC perspective, for example, what is a good way to know 'too much'? Is it OK to keep submitting jobs until they start failing? Or should there be some lower point at which we stop?

--

From foster at mcs.anl.gov Wed Jan 30 08:31:18 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Wed, 30 Jan 2008 08:31:18 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: 
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov>
Message-ID: <47A08A36.1000502@mcs.anl.gov>

Just to check--before we do all this, have we tried running with GRAM4?

Ben Clifford wrote:
> On Wed, 30 Jan 2008, Ti Leggett wrote:
>> As a site admin I would rather you ramp up and not throttle down.
>> [snip]
>
> The code does ramp up at the moment, starting with 6 simultaneous jobs by
> default.
>
> [snip]

From benc at hawaga.org.uk Wed Jan 30 08:36:30 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jan 2008 14:36:30 +0000 (GMT)
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <47A08A36.1000502@mcs.anl.gov>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov>
Message-ID: 

On Wed, 30 Jan 2008, Ian Foster wrote:

> Just to check--before we do all this, have we tried running with GRAM4?

I personally abandoned using GRAM4 a year ago because deployments generally didn't seem to work right, and its general lack of use meant that deployments rotted very quickly even when they were fixed. This was because I had other stuff that was more important to do than drive the ongoing maintenance of GRAM4.

Maybe this has changed now, but like pretty much anything to do with GRAM4, I'll believe it when I see it.

Mihael posted the details for submitting to GRAM4, so it shouldn't be hard for people to try, though.

--

From iraicu at cs.uchicago.edu Wed Jan 30 08:40:24 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 30 Jan 2008 08:40:24 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: 
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov>
Message-ID: <47A08C58.4090806@cs.uchicago.edu>

Here is something that might help Swift determine when the GRAM host is under heavy load, prior to things starting to fail.

Could a simple service be made to run in the same container as the GRAM4 service that would expose certain low-level information, such as CPU utilization, machine load, free memory, swap used, disk I/O, network I/O, etc.? If this is a standard service that exposes this information as an RP, or even via a simple status-information WS function, then it could be used to determine the load on the machine where GRAM is running. The tricky part is getting this kind of low-level information in a platform-independent fashion, but it might be worth the effort.

BTW, I have done exactly this in the context of Falkon, to monitor the state of the machine where the Falkon service runs. I actually start "vmstat" and scrape the output to get the needed information at regular intervals, and it works quite well on the few Linux distributions I tried it on: RH8, SuSE 9 and SuSE 10.

Ioan

Ben Clifford wrote:
> On Wed, 30 Jan 2008, Ti Leggett wrote:
>> As a site admin I would rather you ramp up and not throttle down.
>> [snip]
>
> The code does ramp up at the moment, starting with 6 simultaneous jobs by
> default.
>
> [snip]
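[A minimal sketch of the vmstat-scraping approach described above, for reference. This is an illustrative reconstruction, not Falkon's actual monitoring code; the column positions assume the classic procps vmstat layout (r b swpd free buff cache si so bi bo in cs us sy id wa) and may differ on other systems:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class VmstatMonitor {
        public static void main(String[] args) throws Exception {
            // "vmstat 5" prints one since-boot summary line, then a fresh sample every 5 seconds
            Process p = Runtime.getRuntime().exec(new String[] { "vmstat", "5" });
            BufferedReader in =
                new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.trim().split("\\s+");
                // skip the two header lines, which do not start with a number
                if (f.length < 16 || !f[0].matches("\\d+")) continue;
                int runQueue = Integer.parseInt(f[0]);  // processes waiting for CPU
                long freeKb  = Long.parseLong(f[3]);    // free memory, in kB
                int idlePct  = Integer.parseInt(f[14]); // CPU idle percentage
                System.out.println("runq=" + runQueue
                    + " free=" + freeKb + "kB idle=" + idlePct + "%");
            }
        }
    }

A client-side throttle could treat, say, a persistently long run queue or a low idle percentage as the "overload" signal discussed in this thread.]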
Ben Clifford wrote:
> On Wed, 30 Jan 2008, Ti Leggett wrote:
>
>> As a site admin I would rather you ramp up and not throttle down. Starting
>> high and working to a lower number means you could kill the machine many
>> times before you find the lower bound of what a site can handle. Starting
>> slowly and ramping up means you find that lower bound once. From my point
>> of view, one user consistently killing the resource can be turned off to
>> prevent denial of service to all other users *until* they can prove they
>> won't kill the resource. So I prefer the conservative.
>
> The code does ramp up at the moment, starting with 6 simultaneous jobs by
> default.
>
> What doesn't happen very well at the moment is automated detection of 'too
> much' in order to stop ramping up - the only really good feedback at the
> moment (not just in this particular case but in other cases before) seems
> to be a human being sitting in the feedback loop tweaking stuff.
>
> Two things we should work on are:
> i) making it easier for the human who is sitting in that loop
> and
> ii) figuring out a better way to get automated feedback.
>
> From a TG-UC perspective, for example, what is a good way to know 'too
> much'? Is it OK to keep submitting jobs until they start failing? Or
> should there be some lower point at which we stop?

--
==================================================
Ioan Raicu
Ph.D. Candidate
==================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
==================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
==================================================
==================================================

From benc at hawaga.org.uk Wed Jan 30 08:42:49 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jan 2008 14:42:49 +0000 (GMT)
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <47A08C58.4090806@cs.uchicago.edu>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08C58.4090806@cs.uchicago.edu>
Message-ID:

On Wed, 30 Jan 2008, Ioan Raicu wrote:

> Here is something that might help Swift determine when the GRAM host is
> under heavy load, prior to things starting to fail.
> Could a simple service be made to run in the same container as the GRAM4
> service that would expose certain low level information, such as CPU

That's reinventing MDS!

We used to have a great set of cross-platform submit-host information
scripts but for whatever reason decided to abandon them (though they're
still used in gLite, I think).

--

From iraicu at cs.uchicago.edu Wed Jan 30 08:46:40 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 30 Jan 2008 08:46:40 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08C58.4090806@cs.uchicago.edu>
Message-ID: <47A08DD0.8050807@cs.uchicago.edu>

If MDS will give you this kind of info, and it's already available to you,
then why not use it to determine the load on the remote machine?

Ben Clifford wrote:
> On Wed, 30 Jan 2008, Ioan Raicu wrote:
>
>> Here is something that might help Swift determine when the GRAM host is
>> under heavy load, prior to things starting to fail.
>> Could a simple service be made to run in the same container as the GRAM4
>> service that would expose certain low level information, such as CPU
>
> That's reinventing MDS!
>
> We used to have a great set of cross-platform submit-host information
> scripts but for whatever reason decided to abandon them (though they're
> still used in gLite, I think).

--
==================================================
Ioan Raicu
Ph.D. Candidate
==================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
==================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS
==================================================
==================================================

From foster at mcs.anl.gov Wed Jan 30 08:54:16 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Wed, 30 Jan 2008 08:54:16 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08C58.4090806@cs.uchicago.edu>
Message-ID: <47A08F98.2000007@mcs.anl.gov>

Isn't it expanding the set of information published via MDS?

Ben Clifford wrote:
> On Wed, 30 Jan 2008, Ioan Raicu wrote:
>
>> Here is something that might help Swift determine when the GRAM host is
>> under heavy load, prior to things starting to fail.
>> Could a simple service be made to run in the same container as the GRAM4
>> service that would expose certain low level information, such as CPU
>
> That's reinventing MDS!
>
> We used to have a great set of cross-platform submit-host information
> scripts but for whatever reason decided to abandon them (though they're
> still used in gLite, I think).

From hategan at mcs.anl.gov Wed Jan 30 09:09:57 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 30 Jan 2008 09:09:57 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <47A08C58.4090806@cs.uchicago.edu>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08C58.4090806@cs.uchicago.edu>
Message-ID: <1201705797.2885.2.camel@blabla.mcs.anl.gov>

We still need to agree on what are ok numbers and what are not ok numbers.

We should perhaps find that first and then talk about the implementation.

On Wed, 2008-01-30 at 08:40 -0600, Ioan Raicu wrote:
> Here is something that might help Swift determine when the GRAM host
> is under heavy load, prior to things starting to fail.
> > Could a simple service be made to run in the same container as the > GRAM4 service that would expose certain low level information, such as > CPU utilization, machine load, memory free, swap used, disk I/O, > network I/O, etc... If this is a standard service that exposes this > information as RP, or even a simple status information WS function, > then it could be used to determine the load on the machine where GRAM > is running. The tricky part is getting this kind of low level > information in a platform independent fashion, but it might be worth > the effort. > > BTW, I have done exactly this in the context of Falkon, to monitor the > state of the machine where the Falkon service runs. I actually start > "vmstat" and scrape the output to get the needed information at > regular intervals, and it works quite well on the few Linux > distributions I tried it on, RH8, SuSe 9 and SuSe 10. > > Ioan > > Ben Clifford wrote: > > On Wed, 30 Jan 2008, Ti Leggett wrote: > > > > > > > As a site admin I would rather you ramp up and not throttle down. Starting > > > high and working to a lower number means you could kill the machine many times > > > before you find the lower bound of what a site can handle. Starting slowly and > > > ramping up means you find that lower bound once. From my point of view, one > > > user consistently killing the resource can be turned off to prevent denial of > > > service to all other users *until* they can prove they won't kill the > > > resource. So I prefer the conservative. > > > > > > > The code does ramp up at the moment, starting with 6 simultaneous jobs by > > default. > > > > What doesn't happen very well at the moment is automated detection of 'too > > much' in order to stop ramping up - the only really good feedback at the > > moment (not just in this particular case but in other cases before) seems > > to be a human being sitting in the feedback loop tweaking stuff. > > > > Two things we should work on are: > > i) making it easier for the human who is sitting in that loop > > and > > ii) figuring out a better way to get automated feedback. > > > > >From a TG-UC perspective, for example, what is a good way to know 'too > > much'? Is it OK to keep submitting jobs until they start failing? Or > > should there be some lower point at which we stop? > > > > > > -- > ================================================== > Ioan Raicu > Ph.D. Candidate > ================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > ================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS > ================================================== > ================================================== > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Jan 30 09:13:17 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 30 Jan 2008 09:13:17 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? 
In-Reply-To: <47A08F98.2000007@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08C58.4090806@cs.uchicago.edu> <47A08F98.2000007@mcs.anl.gov> Message-ID: <1201705997.2885.4.camel@blabla.mcs.anl.gov> It depends on whether the information is published via MDS or another simple service in the container :) On Wed, 2008-01-30 at 08:54 -0600, Ian Foster wrote: > Isn't it expanding the set of information published via MDS? > > Ben Clifford wrote: > > On Wed, 30 Jan 2008, Ioan Raicu wrote: > > > > > > > Here is something that might help Swift determine when the GRAM host is under > > > heavy load, prior to things starting to fail. > > > Could a simple service be made to run in the same container as the GRAM4 > > > service that would expose certain low level information, such as CPU > > > > > > > That's reinventing MDS! > > > > We used to have a great set of cross-platform submit-host information > > scripts but for whatever reason decided to abandon them (though they're > > still used in gLite, I think). > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From navarro at mcs.anl.gov Wed Jan 30 09:27:52 2008 From: navarro at mcs.anl.gov (JP Navarro) Date: Wed, 30 Jan 2008 09:27:52 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> Message-ID: Ben, It's tough for those of us involved in preparing GRAM4, deploying it, and supporting it to address the issues people encounter unless they are consistently reported. Do you have any tickets, bug reports, or other details that you could share that explain "didn't seem to work right" in more detail? Thanks, JP On Jan 30, 2008, at 8:36 AM, Ben Clifford wrote: > I personally abandoned using GRAM4 a year ago because deployments > generally didn't seem to work right and its general lack of use > meant that > deployments rotted very quickly even when they were fixed. From navarro at mcs.anl.gov Wed Jan 30 09:32:17 2008 From: navarro at mcs.anl.gov (JP Navarro) Date: Wed, 30 Jan 2008 09:32:17 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? 
In-Reply-To: <1201705797.2885.2.camel@blabla.mcs.anl.gov>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08C58.4090806@cs.uchicago.edu> <1201705797.2885.2.camel@blabla.mcs.anl.gov>
Message-ID: <9A4C850E-8D3F-4F91-BD1E-2D46D0C68587@mcs.anl.gov>

If you guys can identify the metrics about a GRAM submit host that would
help clients assess host load, I can commit some TeraGrid GIG resources to
help implement it and to test-deploy publishing it on the TeraGrid.

JP

On Jan 30, 2008, at 9:09 AM, Mihael Hategan wrote:

> We still need to agree on what are ok numbers and what are not ok
> numbers.
>
> We should perhaps find that first and then talk about the
> implementation.

From benc at hawaga.org.uk Wed Jan 30 09:48:05 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jan 2008 15:48:05 +0000 (GMT)
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov>
Message-ID:

On Wed, 30 Jan 2008, JP Navarro wrote:

> It's tough for those of us involved in preparing GRAM4, deploying it,
> and supporting it to address the issues people encounter unless they
> are consistently reported.

Right. I'm not criticising you (or any site admin at any site).

> Do you have any tickets, bug reports, or other details that you could
> share that explain "didn't seem to work right" in more detail?

No. The default behaviour when working with a user who is "just trying to
get their stuff to run" is "screw this, use GRAM2 because it works".

It's a self-reinforcing feedback loop that will be broken at the point
that it becomes easier for people to stick with GRAM4 than default back to
GRAM2. I guess we need to keep trying every now and then and hope that one
time it sticks ;-)

--

From leggett at mcs.anl.gov Wed Jan 30 10:00:36 2008
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Wed, 30 Jan 2008 10:00:36 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov>
Message-ID: <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>

On Jan 30, 2008, at 01/30/08 09:48 AM, Ben Clifford wrote:

[snip]

> No. The default behaviour when working with a user who is "just trying to
> get their stuff to run" is "screw this, use GRAM2 because it works".
>
> It's a self-reinforcing feedback loop that will be broken at the point
> that it becomes easier for people to stick with GRAM4 than default back
> to GRAM2.
> I guess we need to keep trying every now and then and hope that one time
> it sticks ;-)
>
> --

Well this works to a point, but if falling back to a technology that is
known to not be scalable for your sizes results in killing a machine, I, as
a site admin, will eventually either a) deny you service, b) shut down the
poorly performing service, or c) all of the above. So it's in your best
interest to find and use those technologies that are best suited to the
task at hand so the users of your software don't get nailed by (a).

In this case it seems to me that using WS-GRAM, extending WS-GRAM and/or
MDS to report site statistics, and/or modifying WS-GRAM to throttle itself
(think of how Apache reports "Server busy. Try again later") is the best
path forward. For the short term, it seems that the Swift developers should
manually find those limits for the sites their users use regularly, *and*
educate their users on how to identify that they could be adversely
affecting a resource and throttle themselves till the ideal, automated
method is a usable reality.

From benc at hawaga.org.uk Wed Jan 30 10:09:45 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jan 2008 16:09:45 +0000 (GMT)
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>
Message-ID:

On Wed, 30 Jan 2008, Ti Leggett wrote:

>> It's a self-reinforcing feedback loop that will be broken at the point

> For the short term, it seems that the Swift developers should manually
> find those limits for the sites their users use regularly,

In terms of peak jobs-on-a-site-at-once, we have a parameter
throttle.score.job.factor which limits the maximum number of jobs on a site
at any one time. By default, this is set to allow 402 jobs on a site at
once.

I guess that should be massively reduced by default. Do you have a ballpark
number for the most jobs that should be on e.g. TG-UC?

--
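[For concreteness, roughly what turning that throttle down looks like in
Swift's swift.properties. The factor-to-jobs mapping is inferred from the
numbers above (the 402-job default appears to correspond to a shipped
factor of 4, i.e. a limit of 100 * factor + 2); treat that formula and the
value chosen here as illustrative assumptions, not documented behaviour.]

    # swift.properties - cap concurrent jobs per site (sketch).
    # Assumed: limit = 100 * factor + 2, so the default factor of 4
    # gives the 402 simultaneous jobs quoted above; 0.5 would allow 52,
    # a much gentler ceiling for a shared GRAM head node.
    throttle.score.job.factor=0.5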
From wilde at mcs.anl.gov Wed Jan 30 10:23:00 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jan 2008 10:23:00 -0600
Subject: [Swift-devel] Need help using Data Capacitor in Swift Workflow
Message-ID: <47A0A464.70707@mcs.anl.gov>

[was: Re: Store files in IU Data Capacitor?]

A week ago I started to help MikeK find several TB of free space to support
his next round of workflows.

Currently the CI SAN has only 700GB free.

I started looking at how to use the 500TB Data Capacitor server at IU.

MikeK, should we assume you still need this, test access to it, and give
you the parameters for your sites file, and Swift examples of mapping data
to/from GridFTP URIs?

If so, Mihael, can you do this? (URL for the DC is below. Can use scratch
space for 2-week stretches. I've applied for an allocation on it.)

- Mike

-------- Original Message --------
Subject: Re: Store files in IU Data Capacitor?
Date: Mon, 21 Jan 2008 14:42:48 -0800 (PST)
From: Mike Kubal
To: Michael Wilde , "Binkowski, Thomas Andrew"

10TB would be plenty.

I've been using Benoit's allocation.

The TG project ID I've been using is:

TG-MCA01S018

Thanks MikeW,

MikeK

--- Michael Wilde wrote:

> Hi Andrew and Mike,
>
> One place we might be able to keep our multi-GB data files is in the
> IndianaU Data Capacitor:
>
> http://datacapacitor.researchtechnologies.uits.iu.edu/
>
> Do you have a rough estimate of what we need? The DC allocates storage
> in 10TB chunks, which I assume will meet our needs for the moment?
>
> Can you send me the TG project # for the LigandAtlas project, and I will
> try to get an allocation of space there?
>
> Are we running on our own TG DAC, or on Benoit's large allocation?
>
> Thanks,
>
> Mike

From mikekubal at yahoo.com Wed Jan 30 10:32:20 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Wed, 30 Jan 2008 08:32:20 -0800 (PST)
Subject: [Swift-devel] Need help using Data Capacitor in Swift Workflow
Message-ID: <736987.22506.qm@web52304.mail.re2.yahoo.com>

Yes, definitely need the space and assistance modifying scripts to utilize
the Data Capacitor at IU (if for no other reason than it sounds cool to say
"I'm using a data capacitor" : ) ).

----- Original Message ----
From: Michael Wilde
To: Mike Kubal ; Mihael Hategan
Cc: swift-devel
Sent: Wednesday, January 30, 2008 10:23:00 AM
Subject: [Swift-devel] Need help using Data Capacitor in Swift Workflow

[was: Re: Store files in IU Data Capacitor?]

A week ago I started to help MikeK find several TB of free space to support
his next round of workflows.

Currently the CI SAN has only 700GB free.

I started looking at how to use the 500TB Data Capacitor server at IU.

MikeK, should we assume you still need this, test access to it, and give
you the parameters for your sites file, and Swift examples of mapping data
to/from GridFTP URIs?

If so, Mihael, can you do this? (URL for the DC is below. Can use scratch
space for 2-week stretches. I've applied for an allocation on it.)

- Mike

-------- Original Message --------
Subject: Re: Store files in IU Data Capacitor?
Date: Mon, 21 Jan 2008 14:42:48 -0800 (PST)
From: Mike Kubal
To: Michael Wilde , "Binkowski, Thomas Andrew"

10TB would be plenty.

I've been using Benoit's allocation.

The TG project ID I've been using is:

TG-MCA01S018

Thanks MikeW,

MikeK

--- Michael Wilde wrote:

> Hi Andrew and Mike,
>
> One place we might be able to keep our multi-GB data files is in the
> IndianaU Data Capacitor:
>
> http://datacapacitor.researchtechnologies.uits.iu.edu/
>
> Do you have a rough estimate of what we need? The DC allocates storage
> in 10TB chunks, which I assume will meet our needs for the moment?
>
> Can you send me the TG project # for the LigandAtlas project, and I will
> try to get an allocation of space there?
>
> Are we running on our own TG DAC, or on Benoit's large allocation?
>
> Thanks,
>
> Mike

_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From leggett at mcs.anl.gov Wed Jan 30 10:34:25 2008
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Wed, 30 Jan 2008 10:34:25 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>
Message-ID: <0AC191D2-CDBD-4E04-AC2E-0DA2DDC8FEE7@mcs.anl.gov>

I can't say how many GRAM jobs can be submitted at one time or over a time
period X. I do know that Torque and Moab can handle 1000s of jobs in the
queue, and rather quickly as well. I think you'll need to discover what
those limits are per resource, per method. For instance, TG will have a
different threshold than Teraport.

On Jan 30, 2008, at 01/30/08 10:09 AM, Ben Clifford wrote:

> On Wed, 30 Jan 2008, Ti Leggett wrote:
>
>>> It's a self-reinforcing feedback loop that will be broken at the point
>
>> For the short term, it seems that the Swift developers should manually
>> find those limits for the sites their users use regularly,
>
> In terms of peak jobs-on-a-site-at-once, we have a parameter
> throttle.score.job.factor which limits the maximum number of jobs on a
> site at any one time. By default, this is set to allow 402 jobs on a site
> at once.
>
> I guess that should be massively reduced by default. Do you have a
> ballpark number for the most jobs that should be on e.g. TG-UC?

From hategan at mcs.anl.gov Wed Jan 30 11:00:58 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 30 Jan 2008 11:00:58 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>
Message-ID: <1201712458.3477.3.camel@blabla.mcs.anl.gov>

> In this case it seems to me that using WS-GRAM, extending WS-GRAM and/or
> MDS to report site statistics, and/or modifying WS-GRAM to throttle
> itself (think of how Apache reports "Server busy. Try again later") is
> the best path forward. For the short term, it seems that the Swift
> developers should manually find those limits for sites that the users
> use regularly

It's a moving target, and there are conflicting trends. Those who take care
of the system will mostly argue for more throttling, while the users
desiring speed will always want... well... speed. So this needs to be a
conscious and clear effort on our side (read: both service providers and
middleware providers).

> for them to use, *and* educate their users on how to identify that they
> could be adversely affecting a resource and throttle themselves till the
> ideal, automated method is a usable reality.
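[A toy sketch of the automated ramp-up/back-off feedback this thread keeps
circling: start low, probe upward while submissions succeed, cut back hard
on an overload signal such as a submission timeout. Class and method names
are invented for illustration; this is not Swift's actual scheduler code.]

    # Illustrative additive-increase/multiplicative-decrease throttle.
    class SiteThrottle:
        def __init__(self, start=6, ceiling=402):
            self.limit = start      # Swift's default initial parallelism
            self.ceiling = ceiling  # never exceed the configured cap

        def job_succeeded(self):
            # additive increase: probe gently for spare site capacity
            self.limit = min(self.ceiling, self.limit + 1)

        def job_timed_out(self):
            # multiplicative decrease: back off hard so a struggling
            # GRAM host is not knocked over repeatedly
            self.limit = max(1, self.limit // 2)

The point of the asymmetry is the one Ti makes above: ramping up slowly
finds a site's lower bound once, while backing off sharply keeps a client
from killing the machine again while it searches for that bound.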
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Wed Jan 30 11:02:40 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 30 Jan 2008 11:02:40 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <0AC191D2-CDBD-4E04-AC2E-0DA2DDC8FEE7@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> <0AC191D2-CDBD-4E04-AC2E-0DA2DDC8FEE7@mcs.anl.gov> Message-ID: <1201712560.3477.4.camel@blabla.mcs.anl.gov> On Wed, 2008-01-30 at 10:34 -0600, Ti Leggett wrote: > I can't say how many gram jobs can be submitted at one time or over a > time period X. I do know that Torque and Moab can handle 1000s of jobs > in the queue and rather quickly as well. I think you'll need to > discover what those limits are per resource per method. For instance > TG will have a different threshold than Teraport for instance. And that is the very problem. This discovery doesn't happen by magic. > > On Jan 30, 2008, at 01/30/08 10:09 AM, Ben Clifford wrote: > > > > > On Wed, 30 Jan 2008, Ti Leggett wrote: > > > >>> Its a self-reinforcing feedback loop, that will be broken at the > >>> point > > > >> For the short term, it seems that the Swift developers should > >> manually find those limits for sites that the users use regularly > >> for them to > >> use, > > > > In terms of peak jobs-on-a-site-at-once, we have > > a parameter throttle.score.job.factor which limits the maximum > > number of > > jobs on a site at any one time. by default, this is set to allow 402 > > jobs > > on a site at once. > > > > I guess that should be massively reduced by default. Do you have a > > ballpark number for the most jobs that should be on eg. TG-UC? > > > > -- > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Wed Jan 30 11:09:10 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 30 Jan 2008 11:09:10 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> Message-ID: <47A0AF36.406@mcs.anl.gov> On 1/30/08 10:00 AM, Ti Leggett wrote: > ... For the short term, it seems that the Swift > developers should manually find those limits for sites that the users > use regularly for them to use, *and* educate their users on how to > identify that they could be adversely affecting a resource and throttle > themselves till the ideal, automated method is a usable reality. I agree, Ti. We'll do this. 
May need to do some controlled testing on uc-teragrid to see where the
limits are. Can probably gauge that without knocking over the machine if we
scale up the tests and watch some health metrics like CPU, memory and queue
length.

From smartin at mcs.anl.gov Wed Jan 30 11:21:35 2008
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Wed, 30 Jan 2008 11:21:35 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>
Message-ID:

All,

I wanted to chime in with a number of things being discussed here.

There is a GRAM RFT Core reliability group focused on ensuring the GRAM
service stays up and functional in spite of an onslaught from a client.
http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team

The ultimate goal here is that a client may get a timeout, and that would
be the signal to back off some.

-----

OSG - VO testing: We worked with Terrence (CMS) recently and here are his
test results.
http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests

GRAM2 handled this 2000 jobs x 2 condor-g clients to the same GRAM service
better than GRAM4. But again, this is with the condor-g tricks. Without the
tricks, GRAM2 will handle the load better.

OSG VTB testing: These were using globusrun-ws and also condor-g.
https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation

Clients in these tests got a variety of errors depending on the jobs run:
timeouts, GridFTP authentication errors, client-side OOM, ... GRAM4
functions pretty well, but it was not able to handle Terrence's scenario.
But it handled 1000 jobs x 1 condor-g client just fine.

-----

It would be very interesting to see how Swift does with GRAM4. This would
make for a nice comparison to condor-g.

As far as having functioning GRAM4 services on TG, things have improved.
LEAD is using GRAM4 exclusively and we've been working with them to make
sure the GRAM4 services are up and functioning. INCA has been updated to
more effectively test and monitor GRAM4 and GridFTP services that LEAD is
targeting. This could be extended for any hosts that Swift would like to
test against. Here are some interesting charts from INCA -
http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi

-Stu

On Jan 30, 2008, at Jan 30, 10:00 AM, Ti Leggett wrote:

> On Jan 30, 2008, at 01/30/08 09:48 AM, Ben Clifford wrote:
>
> [snip]
>
>> No. The default behaviour when working with a user who is "just trying to
>> get their stuff to run" is "screw this, use GRAM2 because it works".
>>
>> It's a self-reinforcing feedback loop that will be broken at the point
>> that it becomes easier for people to stick with GRAM4 than default back
>> to GRAM2. I guess we need to keep trying every now and then and hope
>> that one time it sticks ;-)
>>
>> --
>
> Well this works to a point, but if falling back to a technology that is
> known to not be scalable for your sizes results in killing a machine, I,
> as a site admin, will eventually either a) deny you service, b) shut down
> the poorly performing service, or c) all of the above.
> So it's in your best interest to find and use those technologies that are
> best suited to the task at hand so the users of your software don't get
> nailed by (a).
>
> In this case it seems to me that using WS-GRAM, extending WS-GRAM and/or
> MDS to report site statistics, and/or modifying WS-GRAM to throttle
> itself (think of how Apache reports "Server busy. Try again later") is
> the best path forward. For the short term, it seems that the Swift
> developers should manually find those limits for the sites their users
> use regularly, *and* educate their users on how to identify that they
> could be adversely affecting a resource and throttle themselves till the
> ideal, automated method is a usable reality.

From wilde at mcs.anl.gov Wed Jan 30 11:24:43 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jan 2008 11:24:43 -0600
Subject: [Swift-devel] Need help using Data Capacitor in Swift Workflow
In-Reply-To: <736987.22506.qm@web52304.mail.re2.yahoo.com>
References: <736987.22506.qm@web52304.mail.re2.yahoo.com>
Message-ID: <47A0B2DB.4070005@mcs.anl.gov>

Mike, did you already do a space estimate for the upcoming workflows? If
not, can we help?

Once we have DC access (via certs), it's just a GridFTP server, so it's not
much more work than just adding a new storage site. Files in the DC have
"term limits" of either 2 or 4 weeks of non-use before they are removed
(depending on whether we use scratch or persistent directory areas). The
only extra work will be watching the space and deciding when things need to
move to a tape archive (also via GridFTP, possibly using tgcp).

We will look into getting more space in the CI SAN, but need to find funds
and place orders. This will take some time. Space in the CI SAN is not
reserved, so if we buy more disks, we need to use placeholder files to hold
our space. That's yet more management work.

The DC space will always be handy to have access to, so I think we're doing
the right thing by pushing it forward.

Turns out my application to TG Help for DC space got bounced back a week
ago and I didn't notice. I just refiled it directly with the DC group. I'm
hoping to hear in a day or two.

I *think* we can test against the DC scratch space with an ordinary TG
account. That's what I was testing a week ago before I started traveling.
I'm hoping Mihael can take over on this.

- Mike

On 1/30/08 10:32 AM, Mike Kubal wrote:
> Yes, definitely need the space and assistance modifying scripts to
> utilize the Data Capacitor at IU (if for no other reason than it sounds
> cool to say "I'm using a data capacitor" : ) ).
>
> ----- Original Message ----
> From: Michael Wilde
> To: Mike Kubal ; Mihael Hategan
> Cc: swift-devel
> Sent: Wednesday, January 30, 2008 10:23:00 AM
> Subject: [Swift-devel] Need help using Data Capacitor in Swift Workflow
>
> [was: Re: Store files in IU Data Capacitor?]
>
> A week ago I started to help MikeK find several TB of free space to
> support his next round of workflows.
>
> Currently the CI SAN has only 700GB free.
>
> I started looking at how to use the 500TB Data Capacitor server at IU.
>
> MikeK, should we assume you still need this, test access to it, and give
> you the parameters for your sites file, and Swift examples of mapping
> data to/from GridFTP URIs?
>
> If so, Mihael, can you do this? (URL for the DC is below. Can use scratch
> space for 2-week stretches. I've applied for an allocation on it.)
>
> - Mike
>
> -------- Original Message --------
> Subject: Re: Store files in IU Data Capacitor?
> Date: Mon, 21 Jan 2008 14:42:48 -0800 (PST) > From: Mike Kubal > > To: Michael Wilde >, > "Binkowski, Thomas Andrew" > > > > 10TB would be plenty. > > I've been using Benoit's allocation. > > The TG project ID I've been using is: > > TG-MCA01S018 > > Thanks MikeW, > > MikeK > > --- Michael Wilde > wrote: > > > Hi Andrew and Mike, > > > > One place we might be able to keep our multi-GB data > > files is in the > > IndianaU Data Capacitor: > > > > > http://datacapacitor.researchtechnologies.uits.iu.edu/ > > > > Do you have a rough estimate of what we need? the > > DC allocates storage > > in 10TB chunks, which I assume will meet our needs > > for the moment? > > > > Can you send me the TG project # for the LigandAtlas > > project, and I will > > try to get an allocation of space there? > > > > Are we running on our own TG DAC, or on Benoit's > > large allocation? > > > > Thanks, > > > > Mike > > > > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > ------------------------------------------------------------------------ > Never miss a thing. Make Yahoo your homepage. > From smartin at mcs.anl.gov Wed Jan 30 11:23:55 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Wed, 30 Jan 2008 11:23:55 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> Message-ID: <3D0063FE-27F7-472F-AB8D-72CAA47D4FC1@mcs.anl.gov> On Jan 30, 2008, at Jan 30, 11:21 AM, Stuart Martin wrote: > All, > > I wanted to chime in with a number of things being discussed here. > > There is a GRAM RFT Core reliability group focused on ensuring the > GRAM service stays up and functional in spit of an onslaught from a > client. http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team > > The ultimate goal here is that a client may get a timeout and that > would be the signal to backoff some. > > ----- > > OSG - VO testing: We worked with Terrence (CMS) recently and here > are his test results. > http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests > > GRAM2 handled this 2000 jobs x 2 condor-g clients to the same GRAM > service better than GRAM4. But again, this is with the condor-g > tricks. Without the tricks, GRAM2 will handle the load better. Without the tricks, *GRAM4* will handle the load better. > > > OSG VTB testing: These were using globusrun-ws and also condor-g. > https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation > > clients in these tests got a variety of errors depending on the jobs > run: timeouts, GridFTP authentication errors, client-side OOM, ... > GRAM4 functions pretty well, but it was not able to handle > Terrence's scenario. But it handled 1000 jobs x 1 condor-g client > just fine. > > ----- > > It would be very interesting to see how swift does with GRAM4. 
This > would make for a nice comparison to condor-g. > > As far as having functioning GRAM4 services on TG, things have > improved. LEAD is using GRAM4 exclusively and we've been working > with them to make sure the GRAM4 services are up and functioning. > INCA has been updated to more effectively test and monitor GRAM4 and > GridFTP services that LEAD is targeting. This could be extended for > any hosts that swift would like to test against. Here are some > interesting charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi > > -Stu > > On Jan 30, 2008, at Jan 30, 10:00 AM, Ti Leggett wrote: > >> >> On Jan 30, 2008, at 01/30/08 09:48 AM, Ben Clifford wrote: >> >> [snip] >> >>> No. The default behaviour when working with a user who is "just >>> trying to >>> get their stuff to run" is "screw this, use GRAM2 because it works". >>> >>> Its a self-reinforcing feedback loop, that will be broken at the >>> point >>> that it becomes easier for people to stick with GRAM4 than default >>> back to >>> GRAM2. I guess we need to keep trying every now and then and hope >>> that one >>> time it sticks ;-) >>> >>> -- >> >> Well this works to a point, but if falling back to a technology >> that is known to not be scalable for your sizes results in killing >> a machine, I, as a site admin, will eventually either a) deny you >> service b) shut down the poorly performing service or c) all of the >> above. So it's in your best interest to find and use those >> technologies that are best suited to the task at hand so the users >> of your software don't get nailed by (a). >> >> In this case it seems to me that using WS-GRAM, extending WS-GRAM >> and/or MDS to report site statistics, and/or modifying WS-GRAM to >> throttle itself (think of how apache reports "Server busy. Try >> again later") is the best path forward. For the short term, it >> seems that the Swift developers should manually find those limits >> for sites that the users use regularly for them to use, *and* >> educate their users on how to identify that they could be adversely >> affecting a resource and throttle themselves till the ideal, >> automated method is a usable reality. >> > From simone at fnal.gov Wed Jan 30 11:30:14 2008 From: simone at fnal.gov (James Simone) Date: Wed, 30 Jan 2008 11:30:14 -0600 Subject: [Swift-devel] Re: Question and update on swift script In-Reply-To: <479FC338.8070402@fnal.gov> References: <476064D0.7050208@mcs.anl.gov> <479A0C53.2060108@fnal.gov> <1201642765.25746.2.camel@blabla.mcs.anl.gov> <479FC338.8070402@fnal.gov> Message-ID: <47A0B426.4070405@fnal.gov> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Dear swift developers, Do you have a system diagram for swift? I'd like to show one as part of my talk on Friday. Thanks, --jim James Simone wrote: | Thanks Mihael! | | | Mihael Hategan wrote: | | Mike is probably on or around a plane. | | | | Here's my feeble attempt at it. | | | | Mihael | | | | On Fri, 2008-01-25 at 10:20 -0600, James Simone wrote: | |> Hi Mike, | |> | |> Could you please send us your target version of the 2-pt QCD workflow? | |> The target swiftscipt would not necessarily be able to run with current | |> swift, but it would include syntax changes that are/will be under | development. | |> | |> Thanks, | |> --jim | |> | |> Michael Wilde wrote: | |>> Hi all, | |>> | |>> I made some good progress yesterday in understanding the swift code for | |>> the 2ptHL workflow, and started revising it. 
| |>> | |>> Im doing the following: | |>> - giving all files a mnemonic type name | |>> - creating compound types that encapsulate each data/info file pair | |>> - putting each set of nested foreach loops into a procedure (for better | |>> abstraction) | |>> - changing the mapper to tie it to each data type | |>> | |>> For now, Im also pulling the two cache-loading functions out of the | |>> workflow, as it seems like these should be part of the runtime | |>> environment, rather than in the workflow script. Do you feel the same? | |>> | |>> I *thought* that you had a set of wrappers that were python stubs for | |>> simulated execution, but the wrappers Don sent me looked like wrappers | |>> that call the *real* code. So for now I'll create my own simple | stubs to | |>> test the data flow with. | |>> | |>> Ive got many more questions that Im collecting, but for now I only need | |>> the answer to this one: | |>> | |>> In the nested loops that call the function Onia() (lines 136-148 in the | |>> attached numbered listing), is the actual call at line 145 correct? | |>> This call is passing the same "HeavyQuarkConverted" as both the | |>> anti-quark and quark1. Im assuming that is the correct intent. | |>> Also, its passing only the wave[1] wave file (1S) each time (second of | |>> the three wave files d, 1S, 2S). (Same is true for BStaggered). | |>> | |>> Lastly, Onia seems to be getting called with successive pairs of | |>> converted heavy quark files, but it seems like the final call is going | |>> to get a null file for the second file, as the value of "bindex" | will be | |>> one past the last converted heavy quark file computed. | |>> | |>> Im looking for ways to avoid the way the current script needs to | compute | |>> the various array indices, but Im not likely to find something that | |>> doesnt require a language change. Was this approach of computing the | |>> indices something that your team liked, or did not like, about this | |>> swift script? | |>> | |>> I'll try to send more details and questions as I proceed. | |>> A few of these are below. If you have a moment to consider these, that | |>> will help me get a better understanding of the environment. | |>> | |>> Thanks, | |>> | |>> - Mike | |>> | |>> qcopy: I need to learn more about how you manage data and use dcache, | |>> but for now, I get the essence if this. Can you point me to a qcopy doc | |>> page though? I couldnt find one. | |>> | |>> tag_array_mapper: I can guess how this works, but can you send me the | |>> code for it? It seems like the order in which it fills the array must | |>> match the index computations in your swift script. Looks to me like the | |>> leftmost tag in the format script is varied fastest (ie, is the "least | |>> significant index"). Is this correct? | |>> | |>> kappa value passing bug: you allude to this in a comment. Was this a | |>> swift problem or a problem in the py wrapper? Resolved or still open? | |>> If Swift, I can test to see if I can reproduce it. But I suspect you | |>> were testing against a pretty old version of swift? | |>> | |>> Is the notion of an "info" file paired with most data files a standard | |>> metadata convention for LQCD? (Ie, Im assuming that this is done | |>> throughout your apps, not just in this swift example, right? If so, it | |>> seems to justify a mapping convention so that you can simply pass a | |>> "Quark" object, and have the data and info files passed automatically, | |>> together. 
You can then dereference the object to extract each field (file).
| |>>
| |>> Are the file naming conventions set by the tag mapper the ones already
| |>> in use by the current workflow system? I.e., the order of the foreach
| |>> loops and hence of the mappings was chosen carefully to match the
| |>> existing file-naming conventions?
| |>>
| |>> How is the name of the ensemble chosen? Does it have a relation to the
| |>> phyparams? Is it a database key? (It seems opaque to the current swift
| |>> example. Is that by preference or was there a desire to expose its
| |>> structure? Is its contents related to the phyparams? Or when looked up
| |>> in a DB, does it yield the phyparams?
| |>>
| |> _______________________________________________
| |> Swift-devel mailing list
| |> Swift-devel at ci.uchicago.edu
| |> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Wed Jan 30 11:38:39 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jan 2008 11:38:39 -0600
Subject: [Swift-devel] Re: Question and update on swift script
In-Reply-To: <47A0B426.4070405@fnal.gov>
References: <476064D0.7050208@mcs.anl.gov> <479A0C53.2060108@fnal.gov> <1201642765.25746.2.camel@blabla.mcs.anl.gov> <479FC338.8070402@fnal.gov> <47A0B426.4070405@fnal.gov>
Message-ID: <47A0B61F.70101@mcs.anl.gov>

Jim, the Swift tutorial I gave last week has a few versions of a system
diagram. Let me know if this meets your needs:

https://twiki.grid.iu.edu/pub/Education/FloridaInternationalGridSchool2008Syllabus/6_SwiftWorkflow.pdf
https://twiki.grid.iu.edu/pub/Education/FloridaInternationalGridSchool2008Syllabus/6_SwiftWorkflow.ppt

slides 7, 22, 28. Feel free to use anything else.

- Mike

On 1/30/08 11:30 AM, James Simone wrote:
> Dear swift developers,
>
> Do you have a system diagram for swift?
> I'd like to show one as part of my talk on Friday.
>
> Thanks,
> --jim
>
> James Simone wrote:
> | Thanks Mihael!
> |
> | Mihael Hategan wrote:
> | | Mike is probably on or around a plane.
> | |
> | | Here's my feeble attempt at it.
> | |
> | | Mihael
> | |
> | | On Fri, 2008-01-25 at 10:20 -0600, James Simone wrote:
> | |> Hi Mike,
> | |>
> | |> Could you please send us your target version of the 2-pt QCD workflow?
> | |> The target swiftscript would not necessarily be able to run with
> | |> current swift, but it would include syntax changes that are/will be
> | |> under development.
> | |>
> | |> Thanks,
> | |> --jim
> | |>
> | |> Michael Wilde wrote:
> | |>> Hi all,
> | |>>
> | |>> I made some good progress yesterday in understanding the swift code
> | |>> for the 2ptHL workflow, and started revising it.
> | |>>
> | |>> Im doing the following:
> | |>> - giving all files a mnemonic type name
> | |>> - creating compound types that encapsulate each data/info file pair
> | |>> - putting each set of nested foreach loops into a procedure (for
> | |>> better abstraction)
> | |>> - changing the mapper to tie it to each data type
> | |>>
> | |>> For now, Im also pulling the two cache-loading functions out of the
> | |>> workflow, as it seems like these should be part of the runtime
> | |>> environment, rather than in the workflow script. Do you feel the
> | |>> same?
> | |>> > | |>> I *thought* that you had a set of wrappers that were python stubs for > | |>> simulated execution, but the wrappers Don sent me looked like > wrappers > | |>> that call the *real* code. So for now I'll create my own simple > | stubs to > | |>> test the data flow with. > | |>> > | |>> Ive got many more questions that Im collecting, but for now I only > need > | |>> the answer to this one: > | |>> > | |>> In the nested loops that call the function Onia() (lines 136-148 > in the > | |>> attached numbered listing), is the actual call at line 145 correct? > | |>> This call is passing the same "HeavyQuarkConverted" as both the > | |>> anti-quark and quark1. Im assuming that is the correct intent. > | |>> Also, its passing only the wave[1] wave file (1S) each time > (second of > | |>> the three wave files d, 1S, 2S). (Same is true for BStaggered). > | |>> > | |>> Lastly, Onia seems to be getting called with successive pairs of > | |>> converted heavy quark files, but it seems like the final call is > going > | |>> to get a null file for the second file, as the value of "bindex" > | will be > | |>> one past the last converted heavy quark file computed. > | |>> > | |>> Im looking for ways to avoid the way the current script needs to > | compute > | |>> the various array indices, but Im not likely to find something that > | |>> doesnt require a language change. Was this approach of computing the > | |>> indices something that your team liked, or did not like, about this > | |>> swift script? > | |>> > | |>> I'll try to send more details and questions as I proceed. > | |>> A few of these are below. If you have a moment to consider these, > that > | |>> will help me get a better understanding of the environment. > | |>> > | |>> Thanks, > | |>> > | |>> - Mike > | |>> > | |>> qcopy: I need to learn more about how you manage data and use dcache, > | |>> but for now, I get the essence if this. Can you point me to a > qcopy doc > | |>> page though? I couldnt find one. > | |>> > | |>> tag_array_mapper: I can guess how this works, but can you send me the > | |>> code for it? It seems like the order in which it fills the array must > | |>> match the index computations in your swift script. Looks to me > like the > | |>> leftmost tag in the format script is varied fastest (ie, is the > "least > | |>> significant index"). Is this correct? > | |>> > | |>> kappa value passing bug: you allude to this in a comment. Was this a > | |>> swift problem or a problem in the py wrapper? Resolved or still > open? > | |>> If Swift, I can test to see if I can reproduce it. But I suspect you > | |>> were testing against a pretty old version of swift? > | |>> > | |>> Is the notion of an "info" file paired with most data files a > standard > | |>> metadata convention for LQCD? (Ie, Im assuming that this is done > | |>> throughout your apps, not just in this swift example, right? If > so, it > | |>> seems to justify a mapping convention so that you can simply pass a > | |>> "Quark" object, and have the data and info files passed > automatically, > | |>> together. You can then dereference the object top extract each field > | |>> (file). > | |>> > | |>> Are the file naming conventions set by the tag mapper the ones > already > | |>> in use by the current workflow system? I.e., the order of the foreach > | |>> loops and hence of the mappings was chosen carefully to match the > | |>> existing file-naming conventions? > | |>> > | |>> How is the name of the ensemble chosen? 
Does it have a relation to the phyparams? Is it a database key?
> | |>> (It seems opaque to the current swift example. Is that by preference
> | |>> or was there a desire to expose its structure? Is its contents
> | |>> related to the phyparams? Or when looked up in a DB, does it yield
> | |>> the phyparams?
> | |>
> | |> _______________________________________________
> | |> Swift-devel mailing list
> | |> Swift-devel at ci.uchicago.edu
> | |> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From foster at mcs.anl.gov Wed Jan 30 11:46:23 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Wed, 30 Jan 2008 11:46:23 -0600
Subject: [Swift-devel] Need help using Data Capacitor in Swift Workflow
In-Reply-To: <47A0A464.70707@mcs.anl.gov>
References: <47A0A464.70707@mcs.anl.gov>
Message-ID: <47A0B7EF.5010800@mcs.anl.gov>

It seems crazy to me that we are going to IU for storage.

Did Ti indicate that we cannot provide any space at U.Chicago? Argonne has
no space?

Ian.

Michael Wilde wrote:
> [was: Re: Store files in IU Data Capacitor?]
>
> A week ago I started to help MikeK find several TB of free space to
> support his next round of workflows.
>
> Currently the CI SAN has only 700GB free.
>
> I started looking at how to use the 500TB Data Capacitor server at IU.
>
> MikeK, should we assume you still need this, test access to it, and
> give you the parameters for your sites file, and Swift examples of
> mapping data to/from GridFTP URIs?
>
> If so, Mihael, can you do this? (URL for the DC is below. Can use
> scratch space for 2-week stretches. I've applied for an allocation on
> it.)
>
> - Mike
>
> -------- Original Message --------
> Subject: Re: Store files in IU Data Capacitor?
> Date: Mon, 21 Jan 2008 14:42:48 -0800 (PST)
> From: Mike Kubal
> To: Michael Wilde , "Binkowski, Thomas Andrew"
>
> 10TB would be plenty.
>
> I've been using Benoit's allocation.
>
> The TG project ID I've been using is:
>
> TG-MCA01S018
>
> Thanks MikeW,
>
> MikeK
>
> --- Michael Wilde wrote:
>
>> Hi Andrew and Mike,
>>
>> One place we might be able to keep our multi-GB data
>> files is in the IndianaU Data Capacitor:
>>
>> http://datacapacitor.researchtechnologies.uits.iu.edu/
>>
>> Do you have a rough estimate of what we need? The
>> DC allocates storage in 10TB chunks, which I assume will meet our needs
>> for the moment?
>>
>> Can you send me the TG project # for the LigandAtlas
>> project, and I will try to get an allocation of space there?
>>
>> Are we running on our own TG DAC, or on Benoit's
>> large allocation?
>>
>> Thanks,
>>
>> Mike
From wilde at mcs.anl.gov Wed Jan 30 12:05:08 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 30 Jan 2008 12:05:08 -0600
Subject: [Swift-devel] Need help using Data Capacitor in Swift Workflow
In-Reply-To: <47A0B7EF.5010800@mcs.anl.gov>
References: <47A0A464.70707@mcs.anl.gov> <47A0B7EF.5010800@mcs.anl.gov>
Message-ID: <47A0BC54.1000703@mcs.anl.gov>

Ti indicated that FLASH will free 15 TB in about 3 weeks, and that we
can buy more disks. Currently the SAN is still sitting at 99% full with
less than 1 TB free.

The DC is just another TG resource - I see no reason not to use any
Grid resource available to us.

I don't have any funds I can use for growing the SAN. If you want us to
grow it using CI funds, we can. I don't know that we've created a
policy for how to grow it. We should.

- Mike

On 1/30/08 11:46 AM, Ian Foster wrote:
> It seems crazy to me that we are going to IU for storage.
>
> Did Ti indicate that we cannot provide any space at U.Chicago? Argonne
> has no space?
>
> Ian.
>
> [snip]

From hategan at mcs.anl.gov Wed Jan 30 12:37:39 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 30 Jan 2008 12:37:39 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To:
References: <921658.18899.qm@web52308.mail.re2.yahoo.com>
	<5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov>
	<479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov>
	<479F809B.5050306@cs.uchicago.edu>
	<3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov>
	<523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov>
	<1201664122.32688.36.camel@blabla.mcs.anl.gov>
	<47A08A36.1000502@mcs.anl.gov>
	<0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>
Message-ID: <1201718259.5465.1.camel@blabla.mcs.anl.gov>

I'm confused. Why would you want to test GRAM scalability while
introducing additional biasing elements, such as Condor-G?

On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote:
> All,
>
> I wanted to chime in with a number of things being discussed here.
>
> There is a GRAM RFT Core reliability group focused on ensuring the
> GRAM service stays up and functional in spite of an onslaught from a
> client. http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team
>
> The ultimate goal here is that a client may get a timeout, and that
> would be the signal to back off some.
>
> -----
>
> OSG - VO testing: We worked with Terrence (CMS) recently and here are
> his test results.
> http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests
>
> GRAM2 handled this 2000 jobs x 2 condor-g clients to the same GRAM
> service better than GRAM4. But again, this is with the condor-g
> tricks. Without the tricks, GRAM2 will handle the load better.
>
> OSG VTB testing: These were using globusrun-ws and also condor-g.
> https://twiki.grid.iu.edu/twiki/bin/view/Integration/WSGramValidation
>
> Clients in these tests got a variety of errors depending on the jobs
> run: timeouts, GridFTP authentication errors, client-side OOM, ...
> GRAM4 functions pretty well, but it was not able to handle Terrence's
> scenario. But it handled 1000 jobs x 1 condor-g client just fine.
>
> -----
>
> It would be very interesting to see how swift does with GRAM4. This
> would make for a nice comparison to condor-g.
>
> As far as having functioning GRAM4 services on TG, things have
> improved. LEAD is using GRAM4 exclusively and we've been working with
> them to make sure the GRAM4 services are up and functioning. INCA has
> been updated to more effectively test and monitor GRAM4 and GridFTP
> services that LEAD is targeting. This could be extended for any hosts
> that swift would like to test against. Here are some interesting
> charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi
>
> -Stu
>
> On Jan 30, 2008, at Jan 30, 10:00 AM, Ti Leggett wrote:
>
> >
> > On Jan 30, 2008, at 01/30/08 09:48 AM, Ben Clifford wrote:
> >
> > [snip]
> >
> >> No. The default behaviour when working with a user who is "just
> >> trying to get their stuff to run" is "screw this, use GRAM2 because
> >> it works".
> >>
> >> It's a self-reinforcing feedback loop, which will be broken at the
> >> point that it becomes easier for people to stick with GRAM4 than
> >> default back to GRAM2. I guess we need to keep trying every now and
> >> then and hope that one time it sticks ;-)
> >>
> >> --
> >
> > Well, this works to a point, but if falling back to a technology that
> > is known to not be scalable for your sizes results in killing a
> > machine, I, as a site admin, will eventually either a) deny you
> > service, b) shut down the poorly performing service, or c) all of the
> > above.
> > So it's in your best interest to find and use those
> > technologies that are best suited to the task at hand, so the users
> > of your software don't get nailed by (a).
> >
> > In this case it seems to me that using WS-GRAM, extending WS-GRAM
> > and/or MDS to report site statistics, and/or modifying WS-GRAM to
> > throttle itself (think of how apache reports "Server busy. Try again
> > later") is the best path forward. For the short term, it seems that
> > the Swift developers should manually find those limits for sites
> > that the users use regularly, *and* educate their users on how to
> > identify that they could be adversely affecting a resource and
> > throttle themselves till the ideal, automated method is a usable
> > reality.
> >

From smartin at mcs.anl.gov Wed Jan 30 13:19:47 2008
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Wed, 30 Jan 2008 13:19:47 -0600
Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid?
In-Reply-To: <1201718259.5465.1.camel@blabla.mcs.anl.gov>
References: <921658.18899.qm@web52308.mail.re2.yahoo.com>
	<5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov>
	<479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov>
	<479F809B.5050306@cs.uchicago.edu>
	<3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov>
	<523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov>
	<1201664122.32688.36.camel@blabla.mcs.anl.gov>
	<47A08A36.1000502@mcs.anl.gov>
	<0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov>
	<1201718259.5465.1.camel@blabla.mcs.anl.gov>
Message-ID: <69921E77-384E-4D36-922C-BED4D52F2177@mcs.anl.gov>

I'm saying run swift tests using GRAM4 and see what you get. Run a
similar job scenario, like 2000 jobs to the same GRAM4 service. I will
be interested to see how swift does for performance, scalability,
errors...

It's possible that condor-g is not optimal, so seeing how another GRAM4
client doing similar job submission scenarios fares would make for an
interesting comparison.

-Stu

On Jan 30, 2008, at Jan 30, 12:37 PM, Mihael Hategan wrote:

> I'm confused. Why would you want to test GRAM scalability while
> introducing additional biasing elements, such as Condor-G?
>
> [snip]
From foster at mcs.anl.gov Wed Jan 30 13:39:47 2008
From: foster at mcs.anl.gov (Ian Foster)
Date: Wed, 30 Jan 2008 13:39:47 -0600
Subject: [Swift-devel] Need help using Data Capacitor in Swift Workflow
In-Reply-To: <47A0BC54.1000703@mcs.anl.gov>
References: <47A0A464.70707@mcs.anl.gov> <47A0B7EF.5010800@mcs.anl.gov>
	<47A0BC54.1000703@mcs.anl.gov>
Message-ID: <47A0D283.7070005@mcs.anl.gov>

I'm told that it will be available at the end of next week, so there is
no immediate need to grow the SAN.

I'm always nervous introducing more moving parts into already complex
systems ...

Michael Wilde wrote:
> Ti indicated that FLASH will free 15 TB in about 3 weeks, and that we
> can buy more disks. Currently the SAN is still sitting at 99% full
> with less than 1 TB free.
>
> The DC is just another TG resource - I see no reason not to use any
> Grid resource available to us.
>
> [snip]

From benc at hawaga.org.uk Wed Jan 30 14:15:09 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 30 Jan 2008 20:15:09 +0000 (GMT)
Subject: [Swift-devel] swift 0.4
Message-ID:

I'm thinking about putting out a 0.4 release around the 11th of feb
(when I am finished with my present set of conferences and schools).
--

From bugzilla-daemon at mcs.anl.gov Wed Jan 30 14:43:19 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 30 Jan 2008 14:43:19 -0600 (CST)
Subject: [Swift-devel] [Bug 126] New: mixed float and int in [::] array ranges wonky
Message-ID:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=126

           Summary: mixed float and int in [::] array ranges wonky
           Product: Swift
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk
                CC: swift-devel at ci.uchicago.edu

In the expression [0:1:0.2], an array of integers is created that
behaves the same as [0,0,0,0,0,1]. This should probably produce an
array of floats.

From bugzilla-daemon at mcs.anl.gov Wed Jan 30 15:50:57 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 30 Jan 2008 15:50:57 -0600 (CST)
Subject: [Swift-devel] [Bug 125] Single line struct declarations are not quite right
In-Reply-To:
Message-ID: <20080130215057.43D28164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=125

benc at hawaga.org.uk changed:

           What      |Removed     |Added
           --------------------------------------
           Status    |NEW         |ASSIGNED

------- Comment #1 from benc at hawaga.org.uk 2008-01-30 15:50 -------
This was introduced by r1538.

From bugzilla-daemon at mcs.anl.gov Wed Jan 30 16:16:54 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 30 Jan 2008 16:16:54 -0600 (CST)
Subject: [Swift-devel] [Bug 125] Single line struct declarations are not quite right
In-Reply-To:
Message-ID: <20080130221654.E5290164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=125

benc at hawaga.org.uk changed:

           What       |Removed     |Added
           --------------------------------------
           Status     |ASSIGNED    |RESOLVED
           Resolution |            |FIXED

------- Comment #2 from benc at hawaga.org.uk 2008-01-30 16:16 -------
Fixed in r1605 (which also adds a behaviour test to catch it).
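To make bug 126 above concrete, a small SwiftScript fragment (variable
names are mine, not from the bug report):

    // Bug 126: a range with a float step is coerced to integers and
    // behaves like [0,0,0,0,0,1]:
    int a[] = [0:1:0.2];

    // What one would expect instead is an array of floats:
    // [0, 0.2, 0.4, 0.6, 0.8, 1.0]

    // An all-integer range works as intended:
    int b[] = [0:5];    // [0, 1, 2, 3, 4, 5]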
From bugzilla-daemon at mcs.anl.gov Wed Jan 30 16:24:20 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 30 Jan 2008 16:24:20 -0600 (CST)
Subject: [Swift-devel] [Bug 117] multidimensional array support
In-Reply-To:
Message-ID: <20080130222420.C7088164BB@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=117

benc at hawaga.org.uk changed:

           What       |Removed     |Added
           --------------------------------------
           Status     |NEW         |RESOLVED
           Resolution |            |FIXED

------- Comment #1 from benc at hawaga.org.uk 2008-01-30 16:24 -------
Multidimensional arrays were introduced in r1539.

From bugzilla-daemon at mcs.anl.gov Wed Jan 30 16:41:38 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 30 Jan 2008 16:41:38 -0600 (CST)
Subject: [Swift-devel] [Bug 121] declarations with procedure calls do not allow declarations of arrays.
In-Reply-To:
Message-ID: <20080130224138.E11E41650A@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=121

------- Comment #1 from benc at hawaga.org.uk 2008-01-30 16:41 -------
This syntax is now permitted (I think as of approx. r1538). The two
constructs compile to identical intermediate XML.

From bugzilla-daemon at mcs.anl.gov Wed Jan 30 16:41:58 2008
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 30 Jan 2008 16:41:58 -0600 (CST)
Subject: [Swift-devel] [Bug 121] declarations with procedure calls do not allow declarations of arrays.
In-Reply-To:
Message-ID: <20080130224158.DDFD21650A@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=121

benc at hawaga.org.uk changed:

           What       |Removed     |Added
           --------------------------------------
           Status     |NEW         |RESOLVED
           Resolution |            |FIXED
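Hedged SwiftScript sketches of the two constructs just closed out above
(bugs 117 and 121); the type and procedure names are made up, and the
echo entry is assumed to exist in tc.data:

    type messagefile;

    (messagefile t) greeting() {
        app {
            echo "hello" stdout=@filename(t);
        }
    }

    (messagefile t[]) make3() {
        t[0] = greeting();
        t[1] = greeting();
        t[2] = greeting();
    }

    // Bug 117: multidimensional arrays, supported as of r1539:
    messagefile grid[][];
    grid[0][0] = greeting();
    grid[0][1] = greeting();

    // Bug 121: the separate declaration-then-assignment form...
    messagefile a[];
    a = make3();

    // ...and the combined form, now also accepted for arrays; per
    // comment #1 both compile to the same intermediate XML:
    messagefile b[] = make3();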
Run a > similar job scenario like 2000 jobs to the same GRAM4 service. I will > be interested to see how swift does for performance, scalability, > errors... > It's possible that condor-g is not optimal, so seeing how another > GRAM4 client dong similar job submission scenarios fares would make > for an interesting comparison. > > -Stu > > On Jan 30, 2008, at Jan 30, 12:37 PM, Mihael Hategan wrote: > > > I'm confused. Why would you want to test GRAM scalability while > > introducing additional biasing elements, such as Condor-G? > > > > On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote: > >> All, > >> > >> I wanted to chime in with a number of things being discussed here. > >> > >> There is a GRAM RFT Core reliability group focused on ensuring the > >> GRAM service stays up and functional in spit of an onslaught from a > >> client. http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team > >> > >> The ultimate goal here is that a client may get a timeout and that > >> would be the signal to backoff some. > >> > >> ----- > >> > >> OSG - VO testing: We worked with Terrence (CMS) recently and here are > >> his test results. > >> http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests > >> > >> GRAM2 handled this 2000 jobs x 2 condor-g clients to the same GRAM > >> service better than GRAM4. But again, this is with the condor-g > >> tricks. Without the tricks, GRAM2 will handle the load better. > >> > >> OSG VTB testing: These were using globusrun-ws and also condor-g. > >> https://twiki.grid.iu.edu/twiki/bin/view/Integration/ > >> WSGramValidation > >> > >> clients in these tests got a variety of errors depending on the jobs > >> run: timeouts, GridFTP authentication errors, client-side OOM, ... > >> GRAM4 functions pretty well, but it was not able to handle Terrence's > >> scenario. But it handled 1000 jobs x 1 condor-g client just fine. > >> > >> ----- > >> > >> It would be very interesting to see how swift does with GRAM4. This > >> would make for a nice comparison to condor-g. > >> > >> As far as having functioning GRAM4 services on TG, things have > >> improved. LEAD is using GRAM4 exclusively and we've been working > >> with > >> them to make sure the GRAM4 services are up and functioning. INCA > >> has > >> been updated to more effectively test and monitor GRAM4 and GridFTP > >> services that LEAD is targeting. This could be extended for any > >> hosts > >> that swift would like to test against. Here are some interesting > >> charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi > >> > >> -Stu > >> > >> On Jan 30, 2008, at Jan 30, 10:00 AM, Ti Leggett wrote: > >> > >>> > >>> On Jan 30, 2008, at 01/30/08 09:48 AM, Ben Clifford wrote: > >>> > >>> [snip] > >>> > >>>> No. The default behaviour when working with a user who is "just > >>>> trying to > >>>> get their stuff to run" is "screw this, use GRAM2 because it > >>>> works". > >>>> > >>>> Its a self-reinforcing feedback loop, that will be broken at the > >>>> point > >>>> that it becomes easier for people to stick with GRAM4 than default > >>>> back to > >>>> GRAM2. I guess we need to keep trying every now and then and hope > >>>> that one > >>>> time it sticks ;-) > >>>> > >>>> -- > >>> > >>> Well this works to a point, but if falling back to a technology that > >>> is known to not be scalable for your sizes results in killing a > >>> machine, I, as a site admin, will eventually either a) deny you > >>> service b) shut down the poorly performing service or c) all of the > >>> above. 
So it's in your best interest to find and use those > >>> technologies that are best suited to the task at hand so the users > >>> of your software don't get nailed by (a). > >>> > >>> In this case it seems to me that using WS-GRAM, extending WS-GRAM > >>> and/or MDS to report site statistics, and/or modifying WS-GRAM to > >>> throttle itself (think of how apache reports "Server busy. Try again > >>> later") is the best path forward. For the short term, it seems that > >>> the Swift developers should manually find those limits for sites > >>> that the users use regularly for them to use, *and* educate their > >>> users on how to identify that they could be adversely affecting a > >>> resource and throttle themselves till the ideal, automated method is > >>> a usable reality. > >>> > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > From wilde at mcs.anl.gov Wed Jan 30 21:23:56 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 30 Jan 2008 21:23:56 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <1201742109.9441.3.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> <1201718259.5465.1.camel@blabla.mcs.anl.gov> <69921E77-384E-4D36-922C-BED4D52F2177@mcs.anl.gov> <1201742109.9441.3.camel@blabla.mcs.anl.gov> Message-ID: <47A13F4C.60507@mcs.anl.gov> I suggested we start the tests at a moderate intensity, and record the impact on CPU, mem, qlength, etc. Then ramp up untl those indicators start to suggest that the gk is under strain. Its not 100% foolproof, but better than blind stress testing. - mike On 1/30/08 7:15 PM, Mihael Hategan wrote: > Me doing such tests will probably mess the gatekeeper node again. How do > we proceed? > > On Wed, 2008-01-30 at 13:19 -0600, Stuart Martin wrote: >> I'm saying run swift tests using GRAM4 and see what you get. Run a >> similar job scenario like 2000 jobs to the same GRAM4 service. I will >> be interested to see how swift does for performance, scalability, >> errors... >> It's possible that condor-g is not optimal, so seeing how another >> GRAM4 client dong similar job submission scenarios fares would make >> for an interesting comparison. >> >> -Stu >> >> On Jan 30, 2008, at Jan 30, 12:37 PM, Mihael Hategan wrote: >> >>> I'm confused. Why would you want to test GRAM scalability while >>> introducing additional biasing elements, such as Condor-G? >>> >>> On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote: >>>> All, >>>> >>>> I wanted to chime in with a number of things being discussed here. >>>> >>>> There is a GRAM RFT Core reliability group focused on ensuring the >>>> GRAM service stays up and functional in spit of an onslaught from a >>>> client. http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team >>>> >>>> The ultimate goal here is that a client may get a timeout and that >>>> would be the signal to backoff some. 
>>>> >>>> ----- >>>> >>>> OSG - VO testing: We worked with Terrence (CMS) recently and here are >>>> his test results. >>>> http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests >>>> >>>> GRAM2 handled this 2000 jobs x 2 condor-g clients to the same GRAM >>>> service better than GRAM4. But again, this is with the condor-g >>>> tricks. Without the tricks, GRAM2 will handle the load better. >>>> >>>> OSG VTB testing: These were using globusrun-ws and also condor-g. >>>> https://twiki.grid.iu.edu/twiki/bin/view/Integration/ >>>> WSGramValidation >>>> >>>> clients in these tests got a variety of errors depending on the jobs >>>> run: timeouts, GridFTP authentication errors, client-side OOM, ... >>>> GRAM4 functions pretty well, but it was not able to handle Terrence's >>>> scenario. But it handled 1000 jobs x 1 condor-g client just fine. >>>> >>>> ----- >>>> >>>> It would be very interesting to see how swift does with GRAM4. This >>>> would make for a nice comparison to condor-g. >>>> >>>> As far as having functioning GRAM4 services on TG, things have >>>> improved. LEAD is using GRAM4 exclusively and we've been working >>>> with >>>> them to make sure the GRAM4 services are up and functioning. INCA >>>> has >>>> been updated to more effectively test and monitor GRAM4 and GridFTP >>>> services that LEAD is targeting. This could be extended for any >>>> hosts >>>> that swift would like to test against. Here are some interesting >>>> charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi >>>> >>>> -Stu >>>> >>>> On Jan 30, 2008, at Jan 30, 10:00 AM, Ti Leggett wrote: >>>> >>>>> On Jan 30, 2008, at 01/30/08 09:48 AM, Ben Clifford wrote: >>>>> >>>>> [snip] >>>>> >>>>>> No. The default behaviour when working with a user who is "just >>>>>> trying to >>>>>> get their stuff to run" is "screw this, use GRAM2 because it >>>>>> works". >>>>>> >>>>>> Its a self-reinforcing feedback loop, that will be broken at the >>>>>> point >>>>>> that it becomes easier for people to stick with GRAM4 than default >>>>>> back to >>>>>> GRAM2. I guess we need to keep trying every now and then and hope >>>>>> that one >>>>>> time it sticks ;-) >>>>>> >>>>>> -- >>>>> Well this works to a point, but if falling back to a technology that >>>>> is known to not be scalable for your sizes results in killing a >>>>> machine, I, as a site admin, will eventually either a) deny you >>>>> service b) shut down the poorly performing service or c) all of the >>>>> above. So it's in your best interest to find and use those >>>>> technologies that are best suited to the task at hand so the users >>>>> of your software don't get nailed by (a). >>>>> >>>>> In this case it seems to me that using WS-GRAM, extending WS-GRAM >>>>> and/or MDS to report site statistics, and/or modifying WS-GRAM to >>>>> throttle itself (think of how apache reports "Server busy. Try again >>>>> later") is the best path forward. For the short term, it seems that >>>>> the Swift developers should manually find those limits for sites >>>>> that the users use regularly for them to use, *and* educate their >>>>> users on how to identify that they could be adversely affecting a >>>>> resource and throttle themselves till the ideal, automated method is >>>>> a usable reality. 
>>>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Wed Jan 30 21:35:53 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 30 Jan 2008 21:35:53 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <47A13F4C.60507@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> <1201718259.5465.1.camel@blabla.mcs.anl.gov> <69921E77-384E-4D36-922C-BED4D52F2177@mcs.anl.gov> <1201742109.9441.3.camel@blabla.mcs.anl.gov> <47A13F4C.60507@mcs.anl.gov> Message-ID: <1201750553.11697.8.camel@blabla.mcs.anl.gov> Sure, I'd do that anyway to test the testing script(s)/process. I mean if I do mess it, I want to make sure I only need to do it once. But I'm thinking it's better to agree on some time than for Joe or Ti or JP to randomly wonder what's going on. On the other hand, seeing many processes in my name will probably eliminate the confusion :) On Wed, 2008-01-30 at 21:23 -0600, Michael Wilde wrote: > I suggested we start the tests at a moderate intensity, and record the > impact on CPU, mem, qlength, etc. > > Then ramp up untl those indicators start to suggest that the gk is under > strain. > > Its not 100% foolproof, but better than blind stress testing. > > - mike > > > On 1/30/08 7:15 PM, Mihael Hategan wrote: > > Me doing such tests will probably mess the gatekeeper node again. How do > > we proceed? > > > > On Wed, 2008-01-30 at 13:19 -0600, Stuart Martin wrote: > >> I'm saying run swift tests using GRAM4 and see what you get. Run a > >> similar job scenario like 2000 jobs to the same GRAM4 service. I will > >> be interested to see how swift does for performance, scalability, > >> errors... > >> It's possible that condor-g is not optimal, so seeing how another > >> GRAM4 client dong similar job submission scenarios fares would make > >> for an interesting comparison. > >> > >> -Stu > >> > >> On Jan 30, 2008, at Jan 30, 12:37 PM, Mihael Hategan wrote: > >> > >>> I'm confused. Why would you want to test GRAM scalability while > >>> introducing additional biasing elements, such as Condor-G? > >>> > >>> On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote: > >>>> All, > >>>> > >>>> I wanted to chime in with a number of things being discussed here. > >>>> > >>>> There is a GRAM RFT Core reliability group focused on ensuring the > >>>> GRAM service stays up and functional in spit of an onslaught from a > >>>> client. http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team > >>>> > >>>> The ultimate goal here is that a client may get a timeout and that > >>>> would be the signal to backoff some. > >>>> > >>>> ----- > >>>> > >>>> OSG - VO testing: We worked with Terrence (CMS) recently and here are > >>>> his test results. 
From wilde at mcs.anl.gov Thu Jan 31 10:37:02 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 31 Jan 2008 10:37:02 -0600
Subject: [Swift-devel] 5TB soon in CI SAN
Message-ID: <47A1F92E.3030900@mcs.anl.gov>

Hi Mike,

Next week we should have 5 TB at the CI. The SAN is on the CI network
under /disks. We'll need to get you the GridFTP URI for it (and test it,
as last time I used it there was some confusion as to which server was
working correctly).

- Mike

-------- Original Message --------
Subject: Re: [Fwd: [Swift-devel] Need help using Data Capacitor in Swift Workflow]
Date: Thu, 31 Jan 2008 09:25:17 -0600 (CST)
From: Shawn Needham
Reply-To: shawn at flash.uchicago.edu
To: Michael Wilde
CC: Ti Leggett , Ian Foster ,

Mike,

If it helps the LigandWorks project, I could free about 5 TB by the
middle of next week. I will drop you an email next week once I free
that space.

Cheers,
Shawn

On Wed, 30 Jan 2008, Michael Wilde wrote:

> That certainly works for the LigandWorks project.
>
> If you could let me know when you've freed it, Shawn, I'll grab some
> space to hold it for them.
>
> Thanks,
>
> - Mike
>
>
> On 1/30/08 4:04 PM, Ti Leggett wrote:
>> I talked to Shawn (who's cc'd) and this is what he said:
>>
>> He's archiving the 14TB data set to two sets of tapes.
>> He can archive about 2.1T per day.
>> By next Friday he estimates (at the current rate) that he will have
>> around 4T left to archive.
>> He can remove datasets that have been archived to both tape sets after
>> the second set is finished.
>> So by next Friday there should be ~10T freed, with the remainder freed
>> by Monday.
>>
>> Shawn, did I get that all right?
>>
>> Is this satisfactory to everyone?
>>
>> On Jan 30, 2008, at 01/30/08 01:35 PM, Ian Foster wrote:
>>
>>> Indeed, that is great.
>>>
>>> We do need policies, eventually.
>>>
>>> In the meantime, we will have 15 TB for Mike and Mike by the end of
>>> next week.
>>>

From benc at hawaga.org.uk Thu Jan 31 12:00:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 31 Jan 2008 18:00:55 +0000 (GMT)
Subject: [Swift-devel] status race in batch mode local execution
Message-ID:

In batch mode local execution, there's a status race which makes a
batch job sometimes go Submitted -> Completed -> Active.

I have seen this manifest in cleanup jobs at the end of a workflow.

The task gets set to ACTIVE in the run method in the new execution
thread, but in the case of a batch job gets set to COMPLETED in the
submitting thread; these two status-sets are unordered.

--

From hategan at mcs.anl.gov Thu Jan 31 13:53:40 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 31 Jan 2008 13:53:40 -0600
Subject: [Swift-devel] Re: status race in batch mode local execution
In-Reply-To:
References:
Message-ID: <1201809220.14185.0.camel@blabla.mcs.anl.gov>

Yep. It's a known issue that I planned, and still plan, to address.
However, the karajan scheduler ignores the active-after-completed event.
On Thu, 2008-01-31 at 18:00 +0000, Ben Clifford wrote:
> In batch mode local execution, there's a status race which makes a
> batch job sometimes go Submitted -> Completed -> Active.
>
> [snip]

From benc at hawaga.org.uk Thu Jan 31 14:19:12 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 31 Jan 2008 20:19:12 +0000 (GMT)
Subject: [Swift-devel] Re: status race in batch mode local execution
In-Reply-To: <1201809220.14185.0.camel@blabla.mcs.anl.gov>
References: <1201809220.14185.0.camel@blabla.mcs.anl.gov>
Message-ID:

On Thu, 31 Jan 2008, Mihael Hategan wrote:

> However, the karajan scheduler ignores the active-after-completed
> event.

my stuff doesn't though...

--

From hategan at mcs.anl.gov Thu Jan 31 14:29:52 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 31 Jan 2008 14:29:52 -0600
Subject: [Swift-devel] Re: status race in batch mode local execution
In-Reply-To:
References: <1201809220.14185.0.camel@blabla.mcs.anl.gov>
Message-ID: <1201811392.14866.5.camel@blabla.mcs.anl.gov>

On Thu, 2008-01-31 at 20:19 +0000, Ben Clifford wrote:
>
> On Thu, 31 Jan 2008, Mihael Hategan wrote:
>
> > However, the karajan scheduler ignores the active-after-completed
> > event.
>
> my stuff doesn't though...

There's a complete ordering on:

    unsubmitted < submitted < active < end, where end = [completed|failed]

Which means that you can safely assume all x < y have happened if you
receive y, and safely discard all x < y received after having received
y. This is what I'm planning to integrate into the code. If you need it
urgently, it's not hard to implement at the user level.

From benc at hawaga.org.uk Thu Jan 31 14:33:56 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 31 Jan 2008 20:33:56 +0000 (GMT)
Subject: [Swift-devel] Re: status race in batch mode local execution
In-Reply-To: <1201811392.14866.5.camel@blabla.mcs.anl.gov>
References: <1201809220.14185.0.camel@blabla.mcs.anl.gov>
	<1201811392.14866.5.camel@blabla.mcs.anl.gov>
Message-ID:

On Thu, 31 Jan 2008, Mihael Hategan wrote:

> There's a complete ordering on:
>
>     unsubmitted < submitted < active < end, where end = [completed|failed]
>
> If you need it urgently, it's not hard to implement at the user level.

ok. that'll do.

--

From wilde at mcs.anl.gov Thu Jan 31 16:08:18 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 31 Jan 2008 16:08:18 -0600
Subject: [Swift-devel] Re: Meet friday to discuss CNARI status?
In-Reply-To: <20080131140506.AZL94782@m4500-02.uchicago.edu>
References: <20080131140506.AZL94782@m4500-02.uchicago.edu>
Message-ID: <47A246D2.5060209@mcs.anl.gov>

Sarah, all - I had a chance to dig deeper into Sarah's email describing
problems in expressing her workflow in swift. I've commented on her
observations and am sending this to the list for discussion.

On 1/31/08 2:05 PM, skenny at uchicago.edu wrote:
...
> regarding freesurfer implementation in swift...so, my previous
> implementation in vds (which has been pretty stable) worked in
> single-site execution. the reason for this was that freesurfer
> produces LOTS of files (including logs and sigfiles that are
> sometimes created implicitly but will cause a failure later in
> the workflow if they're not there) which vary somewhat
> depending on your run and are extremely difficult to map
> individually.
Can you send us a sample of their naming pattern, and a note on how many
there are and how large?

> so what i've done is build the directory tree on
> the remote site on the shared filesystem and then let jobs go
> out, crunch away on that tree and then tar the whole thing up
> and send it back at the end.

Sounds reasonable for the moment. Are you already doing this as a
wrapper script that can build a directory anywhere, without requiring
pre-setup of each site you want to use?

> the freesurfer workflow (aka reconall) comes in 3
> stages--recon1, recon2 and recon3. i finished recon1 in swift
> using single-site execution in a similar fashion as i'd done
> in the past. but with many sites failing me (including uc/anl
> a lot now)

Can you describe these failures? We need to improve swift's ability to
deal with site problems, and in some cases report site errors and/or
suggest site improvements.

> and due to the prompting of ben and mihael i've
> decided to try and rework it for multi-site execution...this
> means rewriting recon1 instead of moving on to recon2 which
> was my intent. so that's where i'm at...drudging my way thru a
> multi-site version of recon1.

It would be ideal if there were no difference between a 1-site and
multi-site version (i.e., what you call the multi-site version is really
a general version with no site dependencies and fully explicit file
declarations). One way to do this is with a wrapper script that tars up
the intermediate files and declares the tar as an output of the swift
function. (As you suggest below.)

> incidentally, this relates back
> to our discussion of being able to pass a dir tree as
> input/output in swift. if that were possible we could have
> multi-site execution in swift more easily; but because it's
> not i need to rework every stage of the (hefty) workflow to
> make sure it's mapping all potential input and output files.

This is the part that I'd like to work through with you, because I
thought that in general passing a dir tree was straightforward. So I'd
like to work with you to understand why in this case it's not.

> so, admittedly, it's a little frustrating...it's not
> impossible it's just giving me flashbacks of starting
> freesurfer in vds from the very beginning and how trying it
> was, heh.

That's what I was hoping to avoid. I think we'll need to work on this
harder with the group.

> i guess the other 2 options, which might speed it up a little
> (but are maybe not using the expressiveness of swift so much)
> would be to either 1) continue working on a single-site
> execution changing sites when necessary or 2) passing a tar of
> the entire tree for each job.

Right - that option 2 is not so bad, but let's see if it's needed. Swift
lets you declare an argument to be a dataset made up of multiple files.
I suspect the problem is simply that the names of the output files are
not known before the application starts. The remedy is a wrapper, and if
that's the case, we might as well use the tarball solution instead of
single files.

- Mike
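A minimal SwiftScript sketch of the tarball option discussed above. It
assumes a hypothetical recon1_wrapper (installed via tc.data) that
untars the subject tree, runs the recon1 stage inside it, and re-tars
the result, so Swift maps exactly one input file and one output file per
job regardless of what FreeSurfer creates inside the tree:

    type tarfile;

    (tarfile out) recon1 (tarfile in) {
        app {
            recon1_wrapper @filename(in) @filename(out);
        }
    }

    tarfile subject <"subject001.tar">;
    tarfile result  <"subject001.recon1.tar">;

    result = recon1(subject);

Because the only declared output is the tar, the implicitly created logs
and sigfiles ride along inside it, and the job can land on any site
without pre-built directory trees.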