From bugzilla-daemon at mcs.anl.gov Wed Apr 1 07:32:24 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 1 Apr 2009 07:32:24 -0500 (CDT)
Subject: [Swift-devel] [Bug 191] New: procedures invoked inside iterate{}
don't get unique execution IDs
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=191
Summary: procedures invoked inside iterate{} don't get unique
execution IDs
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Log processing and plotting
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
CC: swift-devel at ci.uchicago.edu
iterate {} is more serialised than I intended. It executes each body inside the
same single thread. Consequently, each iteration of the loop body does not end
up with a unique thread prefix, and so execute IDs, which are based on thread
ID, end up duplicated between invocations.
I made the following hack for the specific purpose of Provenance Challenge 3, as it
provides enough uniqueness, albeit inelegantly, for that project. More
properly, fixing bug 154 (iterate construct causes overserialisation of
execution) could make this problem go away.
Author: Ben Clifford
Date: Tue Mar 31 16:20:41 2009 +0100
make iterate give each iteration a unique thread ID. this is possibly
unsafe. in addition, it does not give a unique ID to the termination
condition distinct from the body of the loop, which can probably give
non-unique IDs when procedure calls are made both in the loop body and
in the termination condition
diff --git a/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java b/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java
index 0d173c3..c6d4e89 100644
--- a/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java
+++ b/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java
@@ -9,6 +9,7 @@ package org.griphyn.vdl.karajan.lib;
import java.util.Arrays;
import java.util.List;
+import org.globus.cog.karajan.util.ThreadingContext;
import org.globus.cog.karajan.workflow.nodes.*;
import org.globus.cog.karajan.stack.VariableStack;
import org.globus.cog.karajan.util.TypeUtil;
@@ -26,6 +27,8 @@ public class InfiniteCountingWhile extends Sequential {
     public void pre(VariableStack stack) throws ExecutionException {
         stack.setVar("#condition", new Condition());
+        ThreadingContext tc = (ThreadingContext)stack.getVar("#thread");
+        stack.setVar("#thread", tc.split(666));
         stack.setVar(VAR, "$");
         String counterName = (String)stack.getVar(VAR);
         stack.setVar(counterName, Arrays.asList(new Integer[] {new Integer(0)}));
@@ -54,6 +57,8 @@ public class InfiniteCountingWhile extends Sequential {
         }
         if (index >= elementCount()) {
             // starting new iteration
+            ThreadingContext tc = (ThreadingContext)stack.getVar("#thread");
+            stack.setVar("#thread", tc.split(666));
             setIndex(stack, 1);
             fn = (FlowElement) getElement(0);
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Wed Apr 1 21:00:52 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 1 Apr 2009 21:00:52 -0500 (CDT)
Subject: [Swift-devel] [Bug 116] simple_mapper handling of numbered files in
arrays broken
In-Reply-To:
References:
Message-ID: <20090402020052.537612CC70@wind.mcs.anl.gov>
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=116
Mihael Hategan changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hategan at mcs.anl.gov
--- Comment #3 from Mihael Hategan 2009-04-01 21:00:52 ---
Additionally, if non-numerically named files exist (say "test.in"), the simple
mapper tries to use that name as the index, which may or may not be the right
thing, but it causes a consistency check failure on the array:
Execution failed:
java.lang.RuntimeException: Array element has index 'test' that does not
parse as an integer.
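As an illustration, a minimal Java sketch of that consistency check; the class
and method names here are hypothetical, not the actual mapper code:

// Hypothetical sketch: a file name component used as an array index must
// parse as an integer, otherwise the check quoted above fails.
public class MapperIndexCheck {
    static int parseIndex(String name) {
        try {
            return Integer.parseInt(name);
        } catch (NumberFormatException e) {
            throw new RuntimeException("Array element has index '" + name
                    + "' that does not parse as an integer.");
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIndex("0007")); // numeric names map to indices
        parseIndex("test");                     // throws the error shown above
    }
}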
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are watching someone on the CC list of the bug.
You are watching the assignee of the bug.
You are watching the reporter.
From aespinosa at cs.uchicago.edu Wed Apr 1 21:13:55 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 1 Apr 2009 21:13:55 -0500
Subject: [Swift-devel] array args in function apps
Message-ID: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
type file;
app (file out) cat(file infile[]) {
cat @infile stdout=@out;
}
file infile[] ;
file out <"test.out">;
out= cat(infile);
Swift svn swift-r2748 cog-r2341
RunID: 20090401-2105-aevaa3o9
Progress:
Execution failed:
Exception in cat:
Arguments: [1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in 10.in]
Host: localhost
Directory: manyargs-20090401-2105-aevaa3o9/jobs/x/cat-xszn9s8j
stderr.txt: /bin/cat: 1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in
10.in: No such file or directory
stdout.txt:
----
Caused by:
Exit code 1
---
I remember that when I ran using regular arguments, there were commas
separating them in Swift's logs. Maybe 1.in ... 10.in is seen as one
string?
log is in http://www.ci.uchicago.edu/~aespinosa/swift/manyargs-20090401-2105-aevaa3o9.log
From aespinosa at cs.uchicago.edu Wed Apr 1 21:44:20 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 1 Apr 2009 21:44:20 -0500
Subject: [Swift-devel] Re: array args in function apps
In-Reply-To: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
References: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
Message-ID: <50b07b4b0904011944q236ede8dg1c6c2175c336855@mail.gmail.com>
I got it working using @filenames:
type file;
app (file out) cat(file infile[]) {
cat @filenames(infile) stdout=@out;
}
file infile[] ;
file out <"test.out">;
out= cat(infile);
On Wed, Apr 1, 2009 at 9:13 PM, Allan Espinosa
wrote:
> type file;
>
> app (file out) cat(file infile[]) {
>   cat @filenames(infile) stdout=@out;
> }
>
> file infile[] ;
> file out <"test.out">;
> out= cat(infile);
>
> wift svn swift-r2748 cog-r2341
>
> RunID: 20090401-2105-aevaa3o9
> Progress:
> Execution failed:
>        Exception in cat:
> Arguments: [1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in 10.in]
> Host: localhost
> Directory: manyargs-20090401-2105-aevaa3o9/jobs/x/cat-xszn9s8j
> stderr.txt: /bin/cat: 1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in
> 10.in: No such file or directory
>
> stdout.txt:
> ----
>
> Caused by:
>        Exit code 1
>
> ---
> I remember when i ran using regular arguments, there are commas
> separating them in swift's logs. maybe 1.in ... 10.in is seen as one
> string?
>
>
> log is in http://www.ci.uchicago.edu/~aespinosa/swift/manyargs-20090401-2105-aevaa3o9.log
>
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From benc at hawaga.org.uk Thu Apr 2 03:10:47 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 2 Apr 2009 08:10:47 +0000 (GMT)
Subject: [Swift-devel] array args in function apps
In-Reply-To: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
References: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
Message-ID:
On Wed, 1 Apr 2009, Allan Espinosa wrote:
> maybe 1.in ... 10.in is seen as one
> string?
Yes. @filename(a) is even documented that way (and @a is an abbreviation for
@filename(a)).
--
From bugzilla-daemon at mcs.anl.gov Thu Apr 2 03:21:10 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 2 Apr 2009 03:21:10 -0500 (CDT)
Subject: [Swift-devel] [Bug 192] New: displeasing stack trace when pwd is
not writable
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=192
Summary: displeasing stack trace when pwd is not writable
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: General
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
The logging system outputs the following trace in swift 0.8 when pwd is
unwritable:
train02 at vm-125-58:/sw/swift$ swift
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: swift.log (Permission denied)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:272)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:151)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:247)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:123)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:87)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:645)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:603)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:500)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:406)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:432)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:460)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:113)
        at org.apache.log4j.Logger.getLogger(Logger.java:94)
        at org.globus.cog.karajan.Loader.<clinit>(Loader.java:43)
No SwiftScript program specified
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Thu Apr 2 21:31:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 02 Apr 2009 21:31:39 -0500
Subject: [Swift-devel] Discussion on next steps for Coasters
Message-ID: <49D5750B.10603@mcs.anl.gov>
I had a brief off-list discussion with Mihael on next steps for
coasters. I'm posting it here for group discussion and to get us started
on the same page.
This follows up on a discussion a few weeks ago on the same topic.
Rather than try to reorganize the email below, I'm posting it largely as-is
in the interest of effort and time.
Bottom line: Mihael will work on Coasters next, as he suggested in a
prior email, taking the next steps to harden them for users, establish a
better test mechanism and procedure, and work on some usability &
enhancement issues.
- Mike
-------- Original Message --------
Subject: Re: Hi / status / next ?
Date: Thu, 02 Apr 2009 21:01:14 -0500
From: Michael Wilde
To: Mihael Hategan
References: <49D551B8.5010105 at mcs.anl.gov>
<1238721084.19231.18.camel at localhost>
OK, all sounds good. Many more details to work out, but a short followup
below.
On 4/2/09 8:11 PM, Mihael Hategan wrote:
> On Thu, 2009-04-02 at 19:00 -0500, Michael Wilde wrote:
>> Hi Mihael,
>>
...
>> So next on Swift: I think you should do a fairly intensive burst of
>> effort on Coaster stabilization and portability, like you suggested on
>> the list a little while ago.
>
> Right.
>
>> At a very high level, what I want to see is:
>>
>> - solid test suite, so we know it's working on an agreed-on and growing
>> set of platforms, mainly the TG, OSG, and a few miscellaneous sites the
>> users need
>>
>> - solve the "GT2 / OSG thing", which I *think* involves starting coaster
>> workers from the submit host with GT2 using Condor-G.
>
> The complexity of adding condor-g into the loop will likely be nasty.
> But I'll try.
Before you start, then, especially if it's not an obvious answer, let's
sanity-check it with discussion on the list, as a proposed update to your
design doc.
>
>> - check that coaster shutdown is working.
>
> Is there any reason to believe it's not?
Yeah, some suspicious behavior that we (me, Glen) haven't been able to pin
down but suspect may be happening.
>
>> Then lower priority:
>>
>> - make it possible to allocate a persistent pool of Coaster workers all
>> at once (say, "gimme 1000 nodes on Ranger for 1 hour").
>
> That I think isn't a good idea. Here's why, and correct me if I'm
> missing something:
> - regardless of whether you use it or not, you need to wait for nodes to
> be available. Whether that waiting happens while swift is running or
> not, it still happens.
true
> - once you have a pre-set number of nodes, you need to quickly start
> swift and use them, otherwise you lose allocations. By contrast, in
> automatic mode, swift will use them as soon as they are available
true
> - allocation of a pre-set number of nodes may be delayed if that number
> of nodes is not available. In the automatic mode, swift will use fewer
> nodes when they are available and ramp up to whatever it can get. A
> limiting case, when your 1k nodes will not be available at all, shows
> that the automatic case will yield better performance (your workflow will
> finish).
true
> - better balancing can be done if there are multiple sites with
> automatic allocation.
all true ;)
The only case where it's handy is benchmarking a workflow on a known quantity
of nodes.
Driven in part by the fact that on the BGP, this is how they are allocated.
(But even there we could do multi-block allocation in varying chunks if
the allocator was aware of the scheduling policy of the cluster.)
So what I was thinking was "ask for N nodes all at once". In all cases,
it would be assumed "...and then start your workflow". So it would not
need to be a separate allocation.
Tied to an option to say "leave my nodes running when the wf is done", this
would, I think, meet all needs. But your points above are compelling,
hence this feature needs deliberation and is nowhere near the top of the
list. Higher on the list would be demand-based grow/shrink of the pool, but
in varying-sized blocks. And on all systems, I think, you need to free
in the same-sized blocks (of CPUs) that you allocated in.
It raises another Q: for some sites like TeraPort, which I think places
jobs on all cores independently, in today's coasters implementation, I
am assuming the user should not specify coastersPerNode > 1. True? (Even
though it has 2-core nodes.) We should clarify this in the user's guide.
I will ask this on the list right now so all can get the answer.
> One advantage to allocating blocks of coasters may be the possibility
> that a single multi-node job is started (so it solves the gt2
> scalability problem, but so does your provisioning point below).
I would be interested in this, both for its intrinsic performance
benefits and as a short-term solution to the OSG GT2 "overheating"
problem, especially if the Condor-G solution gets complex and takes a
long time to implement and perfect. I.e., as a short-term fix with long-term
benefits, it might make sense to do it first, assuming that *it* is not
harder than the Condor-G provider and coaster integration.
>
>> - other ways to leave coasters running for the next jobs
>
> Right. That may be possible with persistent services instead of the
> current transient scheme.
>
>> - ensure that coaster time allocations are being done sensibly
>>
>> - revisit the coaster "provisioning" mechanism in terms of in what
>> increments workers are allocated and released in
>>
>> - some kind of coaster status display
>>
>> - some way to probe a job thats running on a coaster?
>
> Define "probe".
- ps -f on the running process.
- probe its resource usage (/proc, also ps, etc)
- ls -lR of its jobdir (as these will more often be on /tmp)
We have these needs today; on the BGP under falkon we manually login to
the node, but thats cumbersome: hard to find the node; 2-stage login
process.
Low prio, a pipe dream. But theoretically do-able.
So, very cool, we are converging on a plan.
I'll cc most of the above to the list now.
>
>> Issue a shell
>> command on the worker of the job?
>>
>> - other things I missed.
>>
>> I'll send this to the list for discussion; what I mainly want to
>> understand from you first is your time availability, what you feel you
>> owe swift in terms of compensating from i2u2 hours, and anything you
>> know of on swift that is higher priority than the coaster things above?
>> (I don't, but I may be missing something)
>>
>> Lastly, how is Phong doing, and to what extent can he be self-sufficient
>> if you were to go 100% swift for a while?
>
> I think he'll be able to take over most things. However, with the
> current big push, he's probably not confident enough, so it may have to
> happen after the new version is put into production.
...
From wilde at mcs.anl.gov Fri Apr 3 08:38:08 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 03 Apr 2009 08:38:08 -0500
Subject: [Swift-devel] Probing running jobs
In-Reply-To: <1238732253.22128.12.camel@localhost>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov>
<1238732253.22128.12.camel@localhost>
Message-ID: <49D61140.1090109@mcs.anl.gov>
Following up on Mihael's question about a feature I listed in the to-do
list I proposed for coasters:
On 4/2/09 11:17 PM, Mihael Hategan wrote:
> On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
>>>> - some way to probe a job thats running on a coaster?
>>> Define "probe".
>> - ps -f on the running process.
>> - probe its resource usage (/proc, also ps, etc)
>> - ls -lR of its jobdir (as these will more often be on /tmp)
>>
>> We have these needs today; on the BGP under falkon we manually login to
>> the node, but thats cumbersome: hard to find the node; 2-stage login
>> process.
>>
>> Low prio, a pipe dream. But theoretically do-able.
>
> It should be possible (and somewhat interesting) to have a simple shell
> that can execute stuff on the workers while the job is running, so that
> you can issue your own commands.
>
> The question of how to find the right worker remains. Can you go a bit
> deeper into the details? How do you find the node currently (be as
> specific as you can be)?
In the oops workflow, I recall these cases at the moment:
1) Have my (large set of similar) jobs started?
2) Most jobs have finished. Are the remaining ones hung, or proceeding
normally but slower for some application- or data-specific reason?
--
For (1), on the BGP, if most or all cores in the partition have apps
running on them, we pick any core and log in to it. Then to see what that
particular app is doing, we tail its log file for progress compared to
its CPU time consumption (from ps). Note that its log file is on local
disk, because we set the "jobdir on local" option of swiftwrapper.
Logging in to a node means finding its IO node IP addr from a Falkon
dynamic config file, ssh-ing to the ION, then telnetting to an arbitrary
worker node (these are on 192.168.1.[0-63] private addrs), then running
ps and tail. If not all the worker nodes in a processor set are busy,
it's a nuisance to find one that is. If few are busy, it's not practical.
Overall, this technique is just a spot-check to see "are *any* of my
jobs running right", i.e. to see if we've (finally) got their arguments
correct, etc.
(1) is better solved with the same technique needed for (2) - given a
job, find its ION and worker node IPs, and ssh/telnet directly there,
which does not exist but is straightforward. On BGP the WNs are not
running ssh, hence the additional nuisance of telnet.
(2) is theoretically possible, but impractical, until we add a few
scripts to trace from a Swift job to the Falkon service that's running it
to the Falkon agent that's running it (again, in the BGP case). The data
for this exists. So we occasionally need (2) but can't do it.
Regarding the "question of how to find the right worker" - this starts with
having some sort of ID for each job that the user can use to go from
"source code based identity" through job status and then to job
location. (By job here I mean an execution of an app() proc.)
I have not yet looked at your status monitor, but am eager to try it. So
I don't know if you took any steps in there to correlate a job's proc
name and args to its status. But that's what I think the user ultimately
needs and wants.
For example, in oops, the majority of tasks are either of app "runrama"
or "runoops". They have a mixture of scalar and file args.
I'd like to see in the status something sort of like strace, where syscalls
have potentially long arg lists (when formatted) but there's a canonical
way to present them in an acceptably compact format, with ... ellipsis as
needed.
So as app invocations become known to swift, they get IDs starting from
0, (PID-like but not wrapping around), and are listed in the progress
log as:
Job 123 is Proc runrama Args 456 input/prot/.../...00.019.1ubq.pdb etc
Job 123 input transfer OK
Job 123 submitted - teraport/coaster92
Job 124 is
Job 125 is
Job 123 output transfer OK
job 123 ended OK
And then, I can say:
probe 123 "ps -ef | grep runrama; tail -3 /tmp/work/*/*/runrama.log"
(for starters).
So the capability depends on having usable IDs for jobs and coasters,
maybe more objects, so that the user can specify a job of interest and
the system can send the user's probe to that job.
Something simple, flexible, and shell-like is good to start with, so we
can explore what's needed and ideally create scripts to wrap more
powerful capabilities.
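As an illustration of the plumbing this needs, here is a minimal Java sketch;
the job-ID-to-worker registry and the use of plain ssh are assumptions made for
the sketch, not how coasters currently expose workers:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a "probe <job-id> <command>" helper. Only the idea of mapping
// a job ID to its worker and shipping a shell command there is illustrated.
public class JobProbe {
    private final Map<String, String> jobToWorker = new ConcurrentHashMap<String, String>();

    // Called when a job becomes active on a known worker host.
    public void register(String jobId, String workerHost) {
        jobToWorker.put(jobId, workerHost);
    }

    // Run a shell command on the worker executing the given job and
    // return its combined output.
    public String probe(String jobId, String command)
            throws IOException, InterruptedException {
        String host = jobToWorker.get(jobId);
        if (host == null) {
            return "no active worker known for job " + jobId;
        }
        Process p = new ProcessBuilder("ssh", host, command)
                .redirectErrorStream(true).start();
        StringBuilder out = new StringBuilder();
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        for (String line; (line = r.readLine()) != null; ) {
            out.append(line).append('\n');
        }
        p.waitFor();
        return out.toString();
    }
}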
From benc at hawaga.org.uk Fri Apr 3 09:33:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Apr 2009 14:33:13 +0000 (GMT)
Subject: [Swift-devel] Probing running jobs
In-Reply-To: <49D61140.1090109@mcs.anl.gov>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost>
<49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost>
<49D61140.1090109@mcs.anl.gov>
Message-ID:
Not addressing the bulk of your email, just the bit about IDs: almost
everything in Swift that can be identified has some identifier on it from
the log-processing and provenance work - at least datasets, procedure
invocations, job executions, and file transfers.
--
From benc at hawaga.org.uk Fri Apr 3 14:10:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Apr 2009 19:10:10 +0000 (GMT)
Subject: [Swift-devel] sync
Message-ID:
Just now I got this on a pbs+nfs cluster (the one at the University of
Johannesburg that I am involved with).
It seems a little degenerate that, in failing to record restart information
for reliability, the run dies.
Caused by: java.io.SyncFailedException: sync failed
        at java.io.FileDescriptor.sync(Native Method)
        at org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.flush(FlushableLockedFileWriter.java:40)
        at org.globus.cog.karajan.workflow.nodes.restartLog.LogVargOperator.update(LogVargOperator.java:37)
        ... 37 more
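A minimal sketch of the more forgiving behaviour this suggests, assuming a
hypothetical writer class (not the actual FlushableLockedFileWriter code):
degrade a failed sync to a warning so the restart log is best-effort instead
of fatal.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.SyncFailedException;

// Hypothetical restart-log writer that tolerates fsync failures (e.g. on
// NFS). Restart information may be lost on a crash, but the run continues.
public class TolerantRestartLogWriter {
    private final FileOutputStream out;

    public TolerantRestartLogWriter(FileOutputStream out) {
        this.out = out;
    }

    public void flush() throws IOException {
        out.flush();
        try {
            out.getFD().sync();
        } catch (SyncFailedException e) {
            System.err.println("Warning: restart log sync failed: " + e.getMessage());
        }
    }
}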
--
From hategan at mcs.anl.gov Sat Apr 4 16:34:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 04 Apr 2009 16:34:32 -0500
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To: <49D61140.1090109@mcs.anl.gov>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov>
<1238732253.22128.12.camel@localhost> <49D61140.1090109@mcs.anl.gov>
Message-ID: <1238880872.8212.13.camel@localhost>
On Fri, 2009-04-03 at 08:38 -0500, Michael Wilde wrote:
> Following up on Mihael's question about a feature I listed in the to-do
> list I proposed for coasters:
>
> On 4/2/09 11:17 PM, Mihael Hategan wrote:
> > On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
> >>>> - some way to probe a job thats running on a coaster?
> >>> Define "probe".
> >> - ps -f on the running process.
> >> - probe its resource usage (/proc, also ps, etc)
> >> - ls -lR of its jobdir (as these will more often be on /tmp)
> >>
> >> We have these needs today; on the BGP under falkon we manually login to
> >> the node, but thats cumbersome: hard to find the node; 2-stage login
> >> process.
> >>
> >> Low prio, a pipe dream. But theoretically do-able.
> >
> > It should be possible (and somewhat interesting) to have a simple shell
> > that can execute stuff on the workers while the job is running, so that
> > you can issue your own commands.
> >
> > The question of how to find the right worker remains. Can you go a bit
> > deeper into the details? How do you find the node currently (be as
> > specific as you can be)?
>
> In the oops workflow, I recall these cases at the moment:
>
> 1) Have my (large set of similar) jobs started?
>
> 2) Most jobs have finished. Are the remaining ones hung, or proceeding
> normally but slower for some application- or data-specific reason?
[...]
In swift r2821 cog r2365 (I think), there is such a feature.
If you start with the console monitor, you can go to the list of jobs.
Then select desired job, and push enter to display a detail pane. If the
job is in the active state and if it's running on a coaster worker, that
detail pane will have an extra button named "Worker Terminal". Pressing
that will pop up a simple terminal that can be used to run relatively
arbitrary commands on the worker that the job is running on.
It won't run commands that require console input (e.g., vi), so don't
try.
It won't start you in the job directory, but the swift workflow
directory. That's because at some point we stopped using the GRAM
directory attribute for setting the initial job dir because some silly
site on OSG doesn't honor it. I think we should revisit the issue (I
suspect there is a solution that works in both cases).
From wilde at mcs.anl.gov Sat Apr 4 16:59:44 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 16:59:44 -0500
Subject: [Swift-devel] coaster status report
Message-ID: <49D7D850.9060701@mcs.anl.gov>
With OOPS Glen was able to get some promising runs queued on Ranger,
using the default properties and the sites setting from the SEM runs.
Looking great so far, and above all was very easy to get it going.
That's very exciting!
One run shows a few (3 out of 100 or so) failures that were retried
successfully. We need to track these down and see if it was a transient
app failure or something in Swift, etc.
Then we turned to Abe and Queenbee. That was amazingly easy to configure
and get running. Glen is scaling it up as we speak, trying for 2 sites x
40 jobs x 8 cores = 640 cores between the two.
In initial small tests, though - 50 parallel app() calls - it's sending
all jobs to Abe, none to Queenbee. We checked the usual sites and tc
things; it *seems* ok there. Possibly either a bug or a scheduler anomaly?
We'll try with more jobs, and see; will send logs and sites etc files if
that anomaly persists at larger scales.
Seems like both these sites have WS-GRAM enabled; we'd like to try that
as well, to expand beyond the 40-job per site suggested limit. Would
like to get 1000 cores active on this problem. 2 x 60 x 8 or so.
Then will add in a few more fruitful TG sites.
Towards this end, Mihael, if you have the urge to probe at a
setting/config that lets us start coasters in 4-8 node batches, this
would be a great time to try that. I suspect you don't know yet whether that
will be easy, hard, or in between?
Another note on coaster boot:
- old problems on Abe with funky limitations on non-login shells seem
to have gone away, either from the latest coaster strategy (-l issues?)
or from Abe changes.
- on queenbee, initial run got this error:
Could not start coaster service
Caused by:
Task ended before registration was received.
STDOUT: Warning: -jar not understood. Ignoring.
Exception in thread "main" java.lang.NoClassDefFoundError:
.tmp.bootstrap.y10420
at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0)
Turns out default java was 1.4.2 something.
We added @default to .soft to get Java 1.6.
Then coasters bootstrapped fine. This was nice to see, that a simple
workaround was easy!
At any rate, very productive, very promising, very pleasing to use.
Nice work!
- Mike
From wilde at mcs.anl.gov Sat Apr 4 17:01:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 17:01:53 -0500
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To: <1238880872.8212.13.camel@localhost>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov>
<1238732253.22128.12.camel@localhost>
<49D61140.1090109@mcs.anl.gov> <1238880872.8212.13.camel@localhost>
Message-ID: <49D7D8D1.3030304@mcs.anl.gov>
Wow! Way cool - I can't wait to try this and the monitor.
But need to clone myself.
Maybe Glen, you can try this on oops tests...
- Mike
On 4/4/09 4:34 PM, Mihael Hategan wrote:
> On Fri, 2009-04-03 at 08:38 -0500, Michael Wilde wrote:
>> Following up on Mihael's question about a feature I listed in the to-do
>> list I proposed for coasters:
>>
>> On 4/2/09 11:17 PM, Mihael Hategan wrote:
>>> On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
>>>>>> - some way to probe a job thats running on a coaster?
>>>>> Define "probe".
>>>> - ps -f on the running process.
>>>> - probe its resource usage (/proc, also ps, etc)
>>>> - ls -lR of its jobdir (as these will more often be on /tmp)
>>>>
>>>> We have these needs today; on the BGP under falkon we manually login to
>>>> the node, but thats cumbersome: hard to find the node; 2-stage login
>>>> process.
>>>>
>>>> Low prio, a pipe dream. But theoretically do-able.
>>> It should be possible (and somewhat interesting) to have a simple shell
>>> that can execute stuff on the workers while the job is running, so that
>>> you can issue your own commands.
>>>
>>> The question of how to find the right worker remains. Can you go a bit
>>> deeper into the details? How do you find the node currently (be as
>>> specific as you can be)?
>> In the oops workflow, I recall these cases at the moment:
>>
>> 1) Have my (large set of similar) jobs started?
>>
>> 2) Most jobs have finished. Are the remaining ones hung, or proceeding
>> normally but slower for some application- or data-specific reason?
> [...]
>
> In swift r2821 cog r2365 (I think), there is such a feature.
>
> If you start with the console monitor, you can go to the list of jobs.
> Then select desired job, and push enter to display a detail pane. If the
> job is in the active state and if it's running on a coaster worker, that
> detail pane will have an extra button named "Worker Terminal". Pressing
> that will pop up a simple terminal that can be used to run relatively
> arbitrary commands on the worker that the job is running on.
>
> It won't run commands that require console input (e.g., vi), so don't
> try.
>
> It won't start you in the job directory, but the swift workflow
> directory. That's because at some point we stopped using the GRAM
> directory attribute for setting the initial job dir because some silly
> site on OSG doesn't honor it. I think we should revisit the issue (I
> suspect there is a solution that works in both cases).
>
From wilde at mcs.anl.gov Sat Apr 4 17:03:55 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 17:03:55 -0500
Subject: [Swift-devel] coaster status report
In-Reply-To: <49D7D850.9060701@mcs.anl.gov>
References: <49D7D850.9060701@mcs.anl.gov>
Message-ID: <49D7D94B.1090504@mcs.anl.gov>
A small clarification here -
we had to turn away from Ranger because the queue was gruesome.
The 3-failure issue was on Abe. Not much to say till we find and examine
the log on this one.
- Mike
On 4/4/09 4:59 PM, Michael Wilde wrote:
> With OOPS Glen was able to get some promising runs queued on Ranger,
> using the default properties and the sites setting from the SEM runs.
>
> Looking great so far, and above all was very easy to get it going.
>
> Thats very exciting!
>
> One run shows a few (3 out of 100 or so) failures that were retried
> successfully. We need to trak these down, and see if it was a transient
> app failure or something in swift etc.
>
> Then we turned to Abe and Queenbee. That was amazingly easy to configure
> and get running. Glen is scaling it up as we speak, trying for 2 sites x
> 40 jobs x 8 cores = 640 cores tween the two.
>
> In initial small tests, though - 50 parallel app() calls - its sending
> all jobs to abe, none to queenbee. We checked the usual sites, tc
> things, *seems* ok there. Possibly either a bg or a scheduler anomaly?
> We'll try with more jobs, and see; will send logs and sites etc files if
> that anomaly persists at larger scales.
>
> Seems like both these sites have WS-GRAM enabled; we'd like to try that
> as well, to expand beyond the 40-job per site suggested limit. Would
> like to get 1000 cores active on this problem. 2 x 60 x 8 or so.
>
> Then will add in a few more fruitful TG sites.
>
> Towards this end, Mihael, if you have the urge to probe at a
> setting/config that lets us start coasters in 4-8 node batches, this
> would be a great time to try that. I suspect you dont know yet if that
> will be easy, hard, or in between?
>
> Another note on coaster boot:
>
> - old problems on Abe with funky limitations on non-login shells seems
> to have gone away, either from the latest coaster strategy (-l issues?)
> or from Abe changes.
>
> - on queenbee, initial run got this error:
>
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT: Warning: -jar not understood. Ignoring.
> Exception in thread "main" java.lang.NoClassDefFoundError:
> .tmp.bootstrap.y10420
> at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0)
>
> Turns out default java was 1.4.2 something.
>
> We added @default to .soft to get Java 1.6.
> Then coasters bootstrapped fine. This was nice to see, that a simple
> workaround was easy!
>
> At any rate, very productive, very promising, very pleasing to use.
>
> Nice work!
>
> - Mike
>
>
>
>
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Sat Apr 4 17:06:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 04 Apr 2009 17:06:43 -0500
Subject: [Swift-devel] coaster status report
In-Reply-To: <49D7D850.9060701@mcs.anl.gov>
References: <49D7D850.9060701@mcs.anl.gov>
Message-ID: <1238882803.9038.1.camel@localhost>
On Sat, 2009-04-04 at 16:59 -0500, Michael Wilde wrote:
> - on queenbee, initial run got this error:
>
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT: Warning: -jar not understood. Ignoring.
> Exception in thread "main" java.lang.NoClassDefFoundError:
> .tmp.bootstrap.y10420
> at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0)
>
> Turns out default java was 1.4.2 something.
It looks like the default java is GCJ, which I wouldn't dare call "Java"
because it probably fails too many compliance tests.
>
> We added @default to .soft to get Java 1.6.
> Then coasters bootstrapped fine. This was nice to see, that a simple
> workaround was easy!
Right. Good call.
From hategan at mcs.anl.gov Sat Apr 4 17:08:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 04 Apr 2009 17:08:59 -0500
Subject: [Swift-devel] coaster status report
In-Reply-To: <49D7D94B.1090504@mcs.anl.gov>
References: <49D7D850.9060701@mcs.anl.gov> <49D7D94B.1090504@mcs.anl.gov>
Message-ID: <1238882939.9038.4.camel@localhost>
On Sat, 2009-04-04 at 17:03 -0500, Michael Wilde wrote:
> small clarification here -
>
> we had to turn away from range because the queue was gruesome.
Yeah, but when it starts, it goooees.
The beauty of multi-site runs (with replication enabled, which may or
may not work properly) is that swift will make the best use of what's
there.
From wilde at mcs.anl.gov Sat Apr 4 17:15:00 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 17:15:00 -0500
Subject: [Swift-devel] coaster status report
In-Reply-To: <1238882939.9038.4.camel@localhost>
References: <49D7D850.9060701@mcs.anl.gov> <49D7D94B.1090504@mcs.anl.gov>
<1238882939.9038.4.camel@localhost>
Message-ID: <49D7DBE4.3070604@mcs.anl.gov>
On 4/4/09 5:08 PM, Mihael Hategan wrote:
> On Sat, 2009-04-04 at 17:03 -0500, Michael Wilde wrote:
>> small clarification here -
>>
>> we had to turn away from range because the queue was gruesome.
>
> Yeah, but when it starts, it goooees.
>
> The beauty of multi-site runs (with replication enabled, which may or
> may not work properly) is that swift will make the best use of what's
> there.
Exactly. And I think Glen's group is eager to use it in exactly that way
- send to TeraGrid and walk away, and not even bother manually
checking traffic and load etc. Very promising.
OOPS seems to compile cleanly everywhere we have tried, including BGP and
Sicortex, and Glen has tested Zhengxiong's ADEM installer on OSG, where
he got it installed on 8 sites in a blink.
Glen is also working on a tgsites command that generates a correct
user-specific sites.xml for TG, so ADEM and general use for both grids
is within reach.
It's all coming together very nicely.
From benc at hawaga.org.uk Sat Apr 4 17:21:44 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 4 Apr 2009 22:21:44 +0000 (GMT)
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To: <1238880872.8212.13.camel@localhost>
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost>
<49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost>
<49D61140.1090109@mcs.anl.gov> <1238880872.8212.13.camel@localhost>
Message-ID:
On Sat, 4 Apr 2009, Mihael Hategan wrote:
> It won't start you in the job directory, but the swift workflow
> directory. That's because at some point we stopped using the GRAM
> directory attribute for setting the initial job dir because some silly
> site on OSG doesn't honor it. I think we should revisit the issue (I
> suspect there is a solution that works in both cases).
I think that has not been the case since at least before the CI SVN
started to be used.
The first mention of specifying a directory attribute for task:execute in
execute2 was r127, which specified wfdir. Before that, no directory was
specified at all.
The job directory has seemingly always been passed as a parameter of one
kind or another to the wrapper script.
--
From benc at hawaga.org.uk Sun Apr 5 06:11:28 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 5 Apr 2009 11:11:28 +0000 (GMT)
Subject: [Swift-devel] Swift 0.9 release for ~2nd April
In-Reply-To:
References:
Message-ID:
On Mon, 23 Mar 2009, Ben Clifford wrote:
> > I'd like to put out the Swift 0.9 release on the 2nd of April, with the
> > release candidate being made from SVN on the 23rd of March.
>
> the present trunk seems way too unstable for a release candidate. so not
> today.
for now I'm planning on looking at making 0.9 again in the 2nd half of
April.
--
From hategan at mcs.anl.gov Sun Apr 5 09:59:29 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 05 Apr 2009 09:59:29 -0500
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To:
References: <49D551B8.5010105@mcs.anl.gov>
<1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov>
<1238732253.22128.12.camel@localhost> <49D61140.1090109@mcs.anl.gov>
<1238880872.8212.13.camel@localhost>
Message-ID: <1238943569.14220.0.camel@localhost>
On Sat, 2009-04-04 at 22:21 +0000, Ben Clifford wrote:
> On Sat, 4 Apr 2009, Mihael Hategan wrote:
>
> > It won't start you in the job directory, but the swift workflow
> > directory. That's because at some point we stopped using the GRAM
> > directory attribute for setting the initial job dir because some silly
> > site on OSG doesn't honor it. I think we should revisit the issue (I
> > suspect there is a solution that works in both cases).
>
> I think that has not been the case since at least before the CI SVN
> started to be used.
>
> The first mention of specifying a directory attribute for task:execute in
> execute2 was r127, which specified wfdir. Before that, no directory was
> specified at all.
>
> The job directory has seemingly always been passed as a parameter of one
> kind or another to the wrapper script.
>
You are right. I was confused.
From benc at hawaga.org.uk Sun Apr 5 16:51:26 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 5 Apr 2009 21:51:26 +0000 (GMT)
Subject: [Swift-devel] too many site initializations
Message-ID:
vdl-int.k has this code, which I think is meant to make site
initialization happen only once per site (and have only one job in the
"Initializing site shared directory" progress ticker state):
element(initSharedDir, [rhost]
    once(list(rhost, "shared")
        vdl:setprogress("Initializing site shared directory")
However I see things like this:
Progress: Selecting site:2932 Initializing site shared directory:102
Submitted:64 Active:69 Finished successfully:204
when I run with around 20..30 OSG sites, which suggests the onceness
isn't happening there.
I don't have time to investigate properly now, but it seemed interesting to
comment on.
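For reference, the intended "once per key" semantics sketched in Java; this
only illustrates the semantics, it is not the Karajan once() implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of "run initialization at most once per site", keyed by
// the remote host. Names here are hypothetical.
public class OncePerSite {
    private final Map<String, Boolean> initialized = new ConcurrentHashMap<String, Boolean>();

    public void initSharedDir(String rhost, Runnable init) {
        // putIfAbsent returns null only for the first caller for a given
        // key, so init runs once per site even with concurrent jobs.
        if (initialized.putIfAbsent(rhost, Boolean.TRUE) == null) {
            init.run();
        }
    }
}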
--
From bugzilla-daemon at mcs.anl.gov Sun Apr 5 18:09:42 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 5 Apr 2009 18:09:42 -0500 (CDT)
Subject: [Swift-devel] [Bug 193] New: replication job cancellation using pbs
provider causes spurious console output
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=193
Summary: replication job cancellation using pbs provider causes
spurious console output
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: General
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
CC: swift-devel at ci.uchicago.edu
I get lines like this when job replicas are cancelled, where grid.uj.ac.za is a
site I submitted to using provider=pbs
Canceling job 33353.gridvm.grid.uj.ac.za
I guess this is either a spurious print in CoG or an incorrect log setting
in Swift, but I have not looked any deeper.
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Sun Apr 5 18:14:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 05 Apr 2009 18:14:34 -0500
Subject: [Swift-devel] using WS GRAM
Message-ID: <49D93B5A.8040103@mcs.anl.gov>
Glen, try this:
to try a few jobs "plain" on abe and qb
then try coasters using gt4:gt4:pbs
- Mike
ps. beware, both might be blazing new territory
From benc at hawaga.org.uk Sun Apr 5 18:18:46 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 5 Apr 2009 23:18:46 +0000 (GMT)
Subject: [Swift-devel] using WS GRAM
In-Reply-To: <49D93B5A.8040103@mcs.anl.gov>
References: <49D93B5A.8040103@mcs.anl.gov>
Message-ID:
On Sun, 5 Apr 2009, Michael Wilde wrote:
> ps. beware, both might be blazing new territory
Is that code for "every time anyone tries GRAM4 on teragrid, it doesn't
work?" ;)
--
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 06:29:53 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 06:29:53 -0500 (CDT)
Subject: [Swift-devel] [Bug 194] New: more analysis for replication
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=194
Summary: more analysis for replication
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: enhancement
Priority: P2
Component: Log processing and plotting
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
One tab/page with information about replication.
Stuff that exists already:
Comparison for each site of how many jobs were submitted, executed
successfully, cancelled for replication (so similar to the execute2 sites
table)
Queue length distribution - something like the chart 'karajan queued
JOB_SUBMISSION cumulative duration'.
Stuff that could be collected/generated:
over the duration of the run, how the replication threshold (which is based on
mean queue time at the moment) varies.
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From benc at hawaga.org.uk Mon Apr 6 06:34:47 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 11:34:47 +0000 (GMT)
Subject: [Swift-devel] goodness metrics for replication
Message-ID:
Wondering what 'goodness' metrics are for replication.
One is "how many jobs were replicated but the first submission executed
(so the replication was in some sense wasted)".
I'd be interested in ideas for other metrics.
--
From benc at hawaga.org.uk Mon Apr 6 08:26:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 13:26:42 +0000 (GMT)
Subject: [Swift-devel] replication vs site score
Message-ID:
More ongoing ramblings as I'm making slides about this...
I'm not sure at the moment whether a job being cancelled due to
replication causes the site's score to change.
Maybe cancellation-due-to-other-replica-starting should be regarded as
badness and reduce that site's score - "we asked you to run this job but
were so slow we essentially regarded you as failing". Maybe they shouldn't
be.
Either way, it should be documented.
--
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 08:47:40 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 08:47:40 -0500 (CDT)
Subject: [Swift-devel] [Bug 195] New: info vs karajan states graph doesn't
work well when an info file is missing
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=195
Summary: info vs karajan states graph doesn't work well when an
info file is missing
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Log processing and plotting
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
For graphs such as this: "Offsets between job submission Active events and
start times reported by info."
when an info file is missing, it appears to be regarded as having a start time of
the Unix epoch, which causes the automatic axis scaling to hide the actual
useful information in this graph (which is for jobs where both a karajan and
info start/end time are known).
Such jobs should probably be omitted from these charts entirely.
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 09:08:41 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 09:08:41 -0500 (CDT)
Subject: [Swift-devel] [Bug 196] New: site score page should show site
scores colour coded by site
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=196
Summary: site score page should show site scores colour coded
by site
Product: Swift
Version: unspecified
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: enhancement
Priority: P2
Component: Log processing and plotting
AssignedTo: benc at hawaga.org.uk
ReportedBy: benc at hawaga.org.uk
site score page should show site scores colour coded by site
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Mon Apr 6 09:13:10 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 09:13:10 -0500
Subject: [Swift-devel] Swift issues and next steps on OOPS app
In-Reply-To: <49D800B5.4060109@uchicago.edu>
References: <49D800B5.4060109@uchicago.edu>
Message-ID: <49DA0DF6.5030300@mcs.anl.gov>
was: Re: status update
On 4/4/09 7:52 PM, Glen Hocky wrote:
> Things seem to be kind of working on all machines (including ranger,
> which picked up some speed) but not totally.
So for Ranger at the moment we can run default params and hope for 640
cores at a time. We should queue up several science runs of full-scale
rounds, and assess the results and run times.
> Problems to investigate this week:
> swift choking after running lots of jobs successfully (should probably
> just ignore these things)
I'm not sure which errors you mean here - let's examine them first. Do
you mean the "successfully retried" errors?
> swift not balancing load across different sites (dumps all ones for my
> teragrid sites file onto one site, grr!)
Can you send a log of this to the Swift developers? They need that in
order to look at this problem.
I will do a sanity test of WS-GRAM with coasters on Abe and Queenbee. If
it works, we should expand our science runs there.
These are good things to do today while BG/P is down.
- Mike
>
> Glen
From benc at hawaga.org.uk Mon Apr 6 09:19:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 14:19:10 +0000 (GMT)
Subject: [Swift-devel] Swift issues and next steps on OOPS app
In-Reply-To: <49DA0DF6.5030300@mcs.anl.gov>
References: <49D800B5.4060109@uchicago.edu> <49DA0DF6.5030300@mcs.anl.gov>
Message-ID:
On Mon, 6 Apr 2009, Michael Wilde wrote:
> > swift not balancing load accross different sites (dumps all ones for my
> > teragrid sites file onto one site, grr!)
>
> Can you send a log of this to the Swift developers? They need that in order to
> look at this problem.
For this, please also send the command line that you invoke Swift with, your
sites file, and your tc.data.
--
From wilde at mcs.anl.gov Mon Apr 6 09:35:16 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 09:35:16 -0500
Subject: [Swift-devel] Swift issues and next steps on OOPS app
In-Reply-To:
References: <49D800B5.4060109@uchicago.edu> <49DA0DF6.5030300@mcs.anl.gov>
Message-ID: <49DA1324.2010204@mcs.anl.gov>
and swift.properties?
(Aside to devel team: can we snapshot *all* this info into the start of
the log? It's trivially short compared to the length of most logs.)
- command line
- sites file, tc file
- swift.properties
I'll file it as an enhancement bug if there is agreement.
On 4/6/09 9:19 AM, Ben Clifford wrote:
> On Mon, 6 Apr 2009, Michael Wilde wrote:
>
>>> swift not balancing load accross different sites (dumps all ones for my
>>> teragrid sites file onto one site, grr!)
>> Can you send a log of this to the Swift developers? They need that in order to
>> look at this problem.
>
> For this also please sent the commandline that invoke Swift with, your
> sites file and your tc.data.
>
From benc at hawaga.org.uk Mon Apr 6 09:39:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 14:39:11 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
Message-ID:
even more rambling... in the context of a scheduler that is doing things
like prioritising jobs based on more than the order that Swift happened to
submit them (hopefully I will have a student for this in the summer), I
think a replicant job should be pushed toward later execution rather than
earlier execution to reduce the number of replicant jobs in the system at
any one time.
This is because I suspect (though I have gathered no numerical evidence)
that given the choice between submitting a fresh job and a replicant job
(making up terminology here too... mmm), it is almost always better to
submit the fresh job. Either we end up submitting the replicant job
eventually (in which case we are no worse off than if we submitted the
replicant first and then a fresh job); or by delaying the replicant job we
give that replicant's original a chance to start running and thus do not
discard our precious time-and-load-dollars that we have already spent on
queueing that replicant's original.
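A minimal Java sketch of that ordering (the Job class is a hypothetical
stand-in, not Swift's internal task representation): fresh jobs sort ahead of
replicas, and within each group submission order is preserved.

import java.util.Comparator;
import java.util.PriorityQueue;

// Illustration only: push replicant jobs toward later execution.
class Job {
    final long seq;        // submission order
    final boolean replica; // true if this is a replica of a still-queued job
    Job(long seq, boolean replica) { this.seq = seq; this.replica = replica; }
}

class ReplicaLastQueue {
    private long counter = 0;
    private final PriorityQueue<Job> queue = new PriorityQueue<>(
            Comparator.comparing((Job j) -> j.replica)   // fresh (false) first
                      .thenComparingLong(j -> j.seq));   // then FIFO

    void submit(boolean replica) { queue.add(new Job(counter++, replica)); }
    Job next() { return queue.poll(); }
}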
--
From benc at hawaga.org.uk Mon Apr 6 09:43:56 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 14:43:56 +0000 (GMT)
Subject: [Swift-devel] Swift issues and next steps on OOPS app
In-Reply-To: <49DA1324.2010204@mcs.anl.gov>
References: <49D800B5.4060109@uchicago.edu> <49DA0DF6.5030300@mcs.anl.gov>
<49DA1324.2010204@mcs.anl.gov>
Message-ID:
On Mon, 6 Apr 2009, Michael Wilde wrote:
> I'll file as enh bug if there is agreement.
yep
--
From foster at anl.gov Mon Apr 6 09:46:29 2009
From: foster at anl.gov (Ian Foster)
Date: Mon, 6 Apr 2009 09:46:29 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
Message-ID: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Ben:
You may recall the work that was done by Greg Maleciwz (sp?) on
prioritizing jobs that enable new jobs to run. Those ideas seem
relevant here.
I met last week with a smart fellow in Singapore, Qin Zheng (CCed
here), who has been working on the scheduling of replicant jobs. His
interest is in doing this for jobs that have failed, while I think
your interest is in scheduling for jobs that may have failed--a
somewhat different thing. But there may be a connection.
Ian.
On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:
> even more rambling... in the context of a scheduler that is doing
> things
> like prioritising jobs based on more than the order that Swift
> happened to
> submit them (hopefully I will have a student for this in the
> summer), I
> think a replicant job should be pushed toward later execution rather
> than
> earlier execution to reduce the number of replicant jobs in the
> system at
> any one time.
>
> This is because I suspect (though I have gathered no numerical
> evidence)
> that given the choice between submitting a fresh job and a replicant
> job
> (making up terminology here too... mmm), it is almost always better to
> submit the fresh job. Either we end up submitting the replicant job
> eventually (in which case we are no worse off than if we submitted the
> replicant first and then a fresh job); or by delaying the replicant
> job we
> give that replicant's original a chance to start running and thus do
> not
> discard our precious time-and-load-dollars that we have already
> spent on
> queueing that replicant's original.
>
> --
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From benc at hawaga.org.uk Mon Apr 6 10:00:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 15:00:08 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID:
> You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing
> jobs that enable new jobs to run. Those ideas seem relevant here.
Yes, it's ongoing thoughts based on that which lead me to thinking about
this - more generally, what are the useful things to prioritise work on
(both at the Swift level - a SwiftScript procedure call - and at the lower
level of file transfers and remote job submissions)?
> I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who
> has been working on the scheduling of replicant jobs. His interest is in doing
> this for jobs that have failed, while I think your interest is in scheduling
> for jobs that may have failed--a somewhat different thing. But there may be a
> connection.
Replicated jobs are jobs that the remote job submission system (e.g. GRAM)
says are in a queue but that we think we can probably run better
(i.e. quicker, or even run at all) by resubmitting; when doing that, we
don't cancel the original job, and potentially it will be that original job
that runs, not the replica. Sometimes that is because the remote queue is
"infinitely long" (the site is taking jobs and losing them); sometimes it's
because it is "very long" (e.g. Teraport's 14-day queue when my laptop has a
local CPU free and no queue).
In your above paragraph, that sounds more like Swift's retry mechanism -
when a Swift-level job (SwiftScript procedure call) fails, we submit it
again, basically using the same mechanism as with replicated jobs.
However, in that case, the original job does not exist any more.
--
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 10:00:21 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 10:00:21 -0500 (CDT)
Subject: [Swift-devel] [Bug 197] New: Include more runtime environment info
in Swift log for debugging
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=197
Summary: Include more runtime environment info in Swift log for
debugging
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Keywords: debug
Severity: enhancement
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov
More info on the user's runtime environment should be included automatically at
the start of the swift .log file, so that developers can do most debugging with
just the single .log file.
This should include:
- command line
- sites file
- tc file
- swift.properties file
and could include the swift source code itself (at least for now, when most
scripts are very short).
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 10:09:21 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 6 Apr 2009 10:09:21 -0500 (CDT)
Subject: [Swift-devel] [Bug 198] New: Add ability to specify execution sites
on swift command line
Message-ID:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=198
Summary: Add ability to specify execution sites on swift
command line
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Keywords: running
Severity: enhancement
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov
Allow a set of sites to be specified on the command line for each run, rather
than needing to edit the sites file to make such choices.
This feature is (or should be) tied to improvements in how the site data is
generated and maintained.
Discussion of a design for this feature on the devel list should precede any
development.
The related issues are:
- where to keep the site data that the command line options select from
- how to parameterize that data and add options to the selected sites
- how site data is generated and customized
- how a variety of choices can be specified for the selected sites (eg, use
coasters or not; which data movement strategy to use).
--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From wilde at mcs.anl.gov Mon Apr 6 10:20:16 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 10:20:16 -0500
Subject: [Swift-devel] bugzilla keywords
Message-ID: <49DA1DB0.8060107@mcs.anl.gov>
I've noticed these more in the new bugzilla interface, and so started
using them, although I realize the keywords I've created may need rethinking.
Are bug keywords of any use to us, or should I stop doing this?
If they are of use, we should define a small set that we like and that works for all.
From benc at hawaga.org.uk Mon Apr 6 10:27:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 15:27:42 +0000 (GMT)
Subject: [Swift-devel] bugzilla keywords
In-Reply-To: <49DA1DB0.8060107@mcs.anl.gov>
References: <49DA1DB0.8060107@mcs.anl.gov>
Message-ID:
On Mon, 6 Apr 2009, Michael Wilde wrote:
> Ive noticed these more in the new bugzilla interface, and so started using
> them, although I realize the keywords Ive created may need rethinking.
>
> Are bug keywords of any use to us, or should I stop doing this?
>
> If of use, we should define a small set that we like and works for all.
I've never come up with a particularly useful use for them within Swift,
and I don't think we should use them just because they are there.
For the most part, I think even the component classification list is
barely used.
If you find some use in them, though, I see no reason why you shouldn't
do so.
--
From hategan at mcs.anl.gov Mon Apr 6 10:40:38 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 10:40:38 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
Message-ID: <1239032438.30386.8.camel@localhost>
On Mon, 2009-04-06 at 14:39 +0000, Ben Clifford wrote:
> even more rambling... in the context of a scheduler that is doing things
> like prioritising jobs based on more than the order that Swift happened to
> submit them (hopefully I will have a student for this in the summer), I
> think a replicant job should be pushed toward later execution rather than
> earlier execution to reduce the number of replicant jobs in the system at
> any one time.
You have two extremes:
1. Send each job to all sites instantly.
2. Replicate after +inf time (see _too_much_ below)
You're suggesting moving from somewhere in the middle, to somewhere in
the middle, but a little to the right.
>
> This is because I suspect (though I have gathered no numerical evidence)
> that given the choice between submitting a fresh job and a replicant job
> (making up terminology here too... mmm), it is almost always better to
> submit the fresh job. Either we end up submitting the replicant job
> eventually (in which case we are no worse off than if we submitted the
> replicant first and then a fresh job); or by delaying the replicant job we
> give that replicant's original a chance to start running and thus do not
> discard our precious time-and-load-dollars that we have already spent on
> queueing that replicant's original.
You are saying this with the awareness of the fact that replicas are
only sent after the prototype job sat in the queue (and didn't start
running) for what is deemed _too_much_?
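(For reference, the knobs that control this live in swift.properties and are
along these lines - the names are from memory and the values here are purely
illustrative, not what any particular run used:)

  # turn replication on (it is off by default)
  replication.enabled=true
  # seconds a submitted job must sit in a remote queue before it becomes
  # eligible for replication (this is the _too_much_ threshold)
  replication.min.queue.time=60
  # maximum number of replicas created for any one job
  replication.limit=3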
From benc at hawaga.org.uk Mon Apr 6 10:50:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 15:50:04 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239032438.30386.8.camel@localhost>
References:
<1239032438.30386.8.camel@localhost>
Message-ID:
On Mon, 6 Apr 2009, Mihael Hategan wrote:
> You are saying this with the awareness of the fact that replicas are
> only sent after the prototype job sat in the queue (and didn't start
> running) for what is deemed _too_much_?
I'm not suggesting that we reduce any submission load to remote sites. I'm
suggesting a different order for those submissions.
The queue delay is not so _too_much_ that we cancel the original on
replication; and it appears (though I don't have stats on real runs) that
the originals do run sometimes (though it would be interesting to know in
real situations how often)
Given that, I'm suggesting that a better use of our load capacity is to do
it with the ordering I suggested.
As far as I can tell, it will not result in slower runs. In the case where
originals do run eventually, it should result in faster runs.
Thinking about it more, I can see a situation where a site is pretty fully
loaded queuewise by swift yet never actually runs a job, because by the
time a job gets near the front of the queue it has been replicated and run
elsewhere. That's an extreme, but I think it's the extreme of the same
situation I talk about in my original message.
--
From hategan at mcs.anl.gov Mon Apr 6 11:06:31 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 11:06:31 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1239032438.30386.8.camel@localhost>
Message-ID: <1239033991.31063.3.camel@localhost>
On Mon, 2009-04-06 at 15:50 +0000, Ben Clifford wrote:
> On Mon, 6 Apr 2009, Mihael Hategan wrote:
>
> > You are saying this with the awareness of the fact that replicas are
> > only sent after the prototype job sat in the queue (and didn't start
> > running) for what is deemed _too_much_?
>
> I'm not suggesting that we reduce any submission load to remote sites. I'm
> suggesting a different order for those submissions.
>
> The queue delay is not so _too_much_ that we cancel the original on
> replication; and it appears (though I don't have stats on real runs) that
> the originals do run sometimes (though it would be interesting to know in
> real situations how often)
>
> Given that, I'm suggesting that a better use of our load capacity is to do
> it with the ordering I suggested.
I'm still not following. From what I understand, you are suggesting
what's already there. So either that is true and you think the current
scheme is not what it is, or I don't understand how your suggestion is
different than the current scheme.
From benc at hawaga.org.uk Mon Apr 6 11:11:31 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 16:11:31 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239033991.31063.3.camel@localhost>
References:
<1239032438.30386.8.camel@localhost>
<1239033991.31063.3.camel@localhost>
Message-ID:
On Mon, 6 Apr 2009, Mihael Hategan wrote:
> I'm still not following. From what I understand, you are suggesting
> what's already there. So either that is true and you think the current
> scheme is not what it is, or I don't understand how your suggestion is
> different than the current scheme.
It's not the case, as I understand it, that replica jobs will always be run
after primary jobs - they will be run in the order they arrive in the job
queue. Jobs that Swift puts in the queue after that replication decision
has been made (for example, jobs that were waiting for dependent data)
will run after replicas that were submitted before that dependent data
became available.
a=p(x)
b=p(y)
c=q(a);
a, b run. eventually swift gets bored and resubmits b to the local job
queue. then a completes, and so c gets queued in the local job queue.
replica_of_b gets submitted to a site before c does.
or not?
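(For concreteness, a complete SwiftScript version of that fragment might look
like the following - the file names and the use of cat as the app are made up,
and cat would need a tc.data entry:)

  type file;

  (file o) p(file i) {
      app {
          cat @filename(i) stdout=@filename(o);
      }
  }

  (file o) q(file i) {
      app {
          cat @filename(i) stdout=@filename(o);
      }
  }

  file x <"x.dat">;
  file y <"y.dat">;
  file a <"a.out">;
  file b <"b.out">;
  file c <"c.out">;

  a = p(x);
  b = p(y);
  c = q(a);   // cannot be submitted until a has been produced

so c only enters the local job queue once a finishes, by which point a
replica of b may already be sitting ahead of it.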
--
From hategan at mcs.anl.gov Mon Apr 6 11:34:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 11:34:16 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1239032438.30386.8.camel@localhost>
<1239033991.31063.3.camel@localhost>
Message-ID: <1239035656.31668.10.camel@localhost>
On Mon, 2009-04-06 at 16:11 +0000, Ben Clifford wrote:
> On Mon, 6 Apr 2009, Mihael Hategan wrote:
>
> > I'm still not following. From what I understand, you are suggesting
> > what's already there. So either that is true and you think the current
> > scheme is not what it is, or I don't understand how your suggestion is
> > different than the current scheme.
>
> Its not the case, as I understand it, that replica jobs will always be run
> after primary jobs - they will be run in the order they arrive in the job
> queue. Jobs that Swift puts in the queue after that replication decision
> has been made (for example, jobs that were waiting for dependent data)
> will run after the replicas submitted before that dependent data become
> available.
>
> a=p(x)
> b=p(y)
> c=q(a);
>
> a, b run. eventually swift gets bored and resubmits b to the local job
> queue. then a completes, and so c gets queued in the local job queue.
>
> replica_of_b gets submitted to a site before c does.
I see what you're saying now.
I think scheduler priorities are not a bad idea.
From benc at hawaga.org.uk Mon Apr 6 11:34:57 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 6 Apr 2009 16:34:57 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239035656.31668.10.camel@localhost>
References:
<1239032438.30386.8.camel@localhost>
<1239033991.31063.3.camel@localhost>
<1239035656.31668.10.camel@localhost>
Message-ID:
On Mon, 6 Apr 2009, Mihael Hategan wrote:
> I think scheduler priorities are not a bad idea.
right - it's likely that I'll get a summer student to play with that sort of
stuff, which is what is making me think about what sort of things to
prioritise on.
--
From wilde at mcs.anl.gov Mon Apr 6 12:10:51 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 12:10:51 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
Message-ID: <49DA379B.7080403@mcs.anl.gov>
With this sites entry:
TG-CDA070002T
/home/ux454325/swiftwork
I get the error below. Files are on CI net at /home/wilde/swift/lab.
I will try to copy coaster boot logs and gram logs to same place when I
find them, in subdirs named by $RunID.logs.
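(The XML tags of that entry seem to have been eaten in transit; for reference
it is a pool definition of roughly the following shape - only the handle,
project, jobManager and work directory come from this run, the service and
gridftp contacts below are placeholders:)

  <pool handle="qb">
    <!-- placeholder contact strings, not the real queenbee endpoints -->
    <execution provider="coaster" url="gatekeeper.example.org:2119"
               jobManager="gt2:pbs"/>
    <gridftp url="gsiftp://gridftp.example.org"/>
    <profile namespace="globus" key="project">TG-CDA070002T</profile>
    <workdirectory>/home/ux454325/swiftwork</workdirectory>
  </pool>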
--
com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift
Swift svn swift-r2809 cog-r2350
RunID: 20090406-1155-pgc5nj00
Progress:
Progress: Stage in:1
Progress: Submitted:1
Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [data.txt]
Host: qb
Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j
stderr.txt:
stdout.txt:
----
Caused by:
Cannot submit job: Cannot run program "qsub":
java.io.IOException: error=2, No such file or directory
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
Cannot submit job: Cannot run program "qsub": java.io.IOException:
error=2, No such file or directory
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63)
at
org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
at
org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50)
at
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221)
at
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145)
Caused by: java.io.IOException: Cannot run program "qsub":
java.io.IOException: error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at java.lang.Runtime.exec(Runtime.java:593)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73)
at
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53)
... 4 more
Caused by: java.io.IOException: java.io.IOException: error=2, No such
file or directory
at java.lang.UNIXProcess.(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
... 7 more
Cleaning up...
Shutting down service at https://208.100.92.21:44166
Got channel MetaChannel: 24235184 -> GSSSChannel-null(1)
- Done
com$ pwd
From wilde at mcs.anl.gov Mon Apr 6 12:29:21 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 12:29:21 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs
Message-ID: <49DA3BF1.1080206@mcs.anl.gov>
With this sites entry:
TG-CDA070002T
/home/ux454325/swiftwork
I get the error below. Files are on CI net at /home/wilde/swift/lab.
Coaster boot log is in 20090406-1216-f5k8chdg.logs/
There was no GRAM log on the queenbee site.
--
com$ swift -tc.file tc.data -sites.file qb.coasters-gt4-gt4-pbs.xml
cat.swift
Swift svn swift-r2809 cog-r2350
RunID: 20090406-1216-f5k8chdg
Progress:
Progress: Stage in:1
The GT4 provider does not support redirection. Redirection requests will
be ignored without further warnings.
Progress: Submitted:1
Failed to transfer wrapper log from cat-20090406-1216-f5k8chdg/info/0 on qb
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [data.txt]
Host: qb
Directory: cat-20090406-1216-f5k8chdg/jobs/0/cat-0cjfv09j
stderr.txt:
stdout.txt:
----
Caused by:
Could not submit job
Caused by:
Could not start coaster service
Caused by:
Task ended before registration was received: Job failed with an
exit code of 1
Cleaning up...
Done
com$
From hategan at mcs.anl.gov Mon Apr 6 13:02:17 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 13:02:17 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <49DA379B.7080403@mcs.anl.gov>
References: <49DA379B.7080403@mcs.anl.gov>
Message-ID: <1239040937.2410.3.camel@localhost>
Yes. This is one of those "can't find executable unless run through
'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking
how to deal with the situation.
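(Roughly this, as seen from a shell on the head node - the qsub path is just
an example:)

  # non-login shell, the way the bootstrap runs things: the profile that
  # prepends the PBS bin directory to PATH is never sourced
  $ bash -c 'which qsub'
  $ echo $?
  1

  # login shell: the profile is sourced and qsub is found
  $ bash -l -c 'which qsub'
  /usr/local/pbs/bin/qsub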
On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote:
> With this sites entry:
>
>
> TG-CDA070002T
> jobManager="gt2:pbs" />
>
> /home/ux454325/swiftwork
>
>
> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>
> I will try to copy coaster boot logs and gram logs to same place when I
> find them, in subdirs named by $RunID.logs.
>
> [...]
From hategan at mcs.anl.gov Mon Apr 6 13:28:03 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 13:28:03 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs
In-Reply-To: <49DA3BF1.1080206@mcs.anl.gov>
References: <49DA3BF1.1080206@mcs.anl.gov>
Message-ID: <1239042483.3445.2.camel@localhost>
On Mon, 2009-04-06 at 12:29 -0500, Michael Wilde wrote:
> With this sites entry:
>
>
>
> TG-CDA070002T
> jobManager="gt4:gt4:pbs" />
>
> /home/ux454325/swiftwork
>
>
>
> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>
> Coaster boot log is in 20090406-1216-f5k8chdg.logs/
I'll also need ~/.globus/coasters/coasters.log. Sorry for not mentioning
it earlier.
Normally with gt2, there would be a stdout explanation of what happened,
but with gt4 there is no stdout streaming back.
From wilde at mcs.anl.gov Mon Apr 6 13:33:32 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 13:33:32 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs
In-Reply-To: <1239042483.3445.2.camel@localhost>
References: <49DA3BF1.1080206@mcs.anl.gov> <1239042483.3445.2.camel@localhost>
Message-ID: <49DA4AFC.104@mcs.anl.gov>
I just copied coaster.log to that same dir.
On 4/6/09 1:28 PM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 12:29 -0500, Michael Wilde wrote:
>> With this sites entry:
>>
>>
>>
>> TG-CDA070002T
>> > jobManager="gt4:gt4:pbs" />
>>
>> /home/ux454325/swiftwork
>>
>>
>>
>> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>>
>> Coaster boot log is in 20090406-1216-f5k8chdg.logs/
>
> I'll also need ~/.globus/coasters/coasters.log. Sorry for not mentioning
> it earlier.
>
> Normally with gt2, there would be a stdout explanation of what happened,
> but with gt4 there is no stdout streaming back.
>
>
From hategan at mcs.anl.gov Mon Apr 6 14:24:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 14:24:59 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs
In-Reply-To: <49DA4AFC.104@mcs.anl.gov>
References: <49DA3BF1.1080206@mcs.anl.gov>
<1239042483.3445.2.camel@localhost> <49DA4AFC.104@mcs.anl.gov>
Message-ID: <1239045899.4203.3.camel@localhost>
On Mon, 2009-04-06 at 13:33 -0500, Michael Wilde wrote:
> I just copied coaster.log to that same dir.
Unfortunately it does not contain any information on the unfortunate
run.
I committed a patch to also log to the bootstrap log any errors that may
occur during bootstrap.jar startup that may otherwise not be logged to
the coasters log (nor reported back to the client due to the
middleware).
From wilde at mcs.anl.gov Mon Apr 6 15:17:29 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 15:17:29 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <1239040937.2410.3.camel@localhost>
References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost>
Message-ID: <49DA6359.8010207@mcs.anl.gov>
Mihael, I just updated our test swift+cog source and rebuilt.
Glen is now getting:
Caused by:
Invalid GSSCredentials
org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException:
Invalid GSSCredentials
at
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:149)
at
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99)
at
org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
at
org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50)
at
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:222)
at
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145)
Caused by: GSSException: Invalid name provided [Caused by: [JGLOBUS-112]
Malformed name, "=" missing in "38356/jobmanager-pbs"]
at
org.globus.gsi.gssapi.GlobusGSSName.(GlobusGSSName.java:137)
at
org.globus.gsi.gssapi.GlobusGSSManagerImpl.createName(GlobusGSSManagerImpl.java:304)
at
org.globus.gsi.gssapi.auth.IdentityAuthorization.getExpectedName(IdentityAuthorization.java:82)
at org.globus.gram.Gram.gatekeeperConnect(Gram.java:85)
at org.globus.gram.Gram.request(Gram.java:310)
at org.globus.gram.GramJob.request(GramJob.java:262)
at
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133)
... 5 more
what's up here?
Any chance I picked up code in transition, or a new problem in recent
commits?
- Mike
On 4/6/09 1:02 PM, Mihael Hategan wrote:
> Yes. This is one of those "can't find executable unless run through
> 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking
> how to deal with the situation.
>
> On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote:
>> With this sites entry:
>>
>>
>> TG-CDA070002T
>> > jobManager="gt2:pbs" />
>>
>> /home/ux454325/swiftwork
>>
>>
>> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>>
>> I will try to copy coaster boot logs and gram logs to same place when I
>> find them, in subdirs named by $RunID.logs.
>>
>> [...]
>
From wilde at mcs.anl.gov Mon Apr 6 15:20:37 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 15:20:37 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <1239040937.2410.3.camel@localhost>
References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost>
Message-ID: <49DA6415.1010402@mcs.anl.gov>
Mihael, when I svn updated our test swift+cog source and rebuilt, Glen
gets the errors below.
When I reverted back to last Tuesday, Mar 31, this new error does not occur.
Does "Caused by: GSSException: Invalid name provided [Caused by:
[JGLOBUS-112] Malformed name, "=" missing in "38356/jobmanager-pbs"]"
suggest a new error introduced in the commits since Tuesday?
This is with coasters and gt2:gt2:pbs.
- Mike
On 4/6/09 1:02 PM, Mihael Hategan wrote:
> Yes. This is one of those "can't find executable unless run through
> 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking
> how to deal with the situation.
>
> On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote:
>> With this sites entry:
>>
>>
>> TG-CDA070002T
>> > jobManager="gt2:pbs" />
>>
>> /home/ux454325/swiftwork
>>
>>
>> I get the error below. Files are on CI net at /home/wilde/swift/lab.
>>
>> I will try to copy coaster boot logs and gram logs to same place when I
>> find them, in subdirs named by $RunID.logs.
>>
>> --
>>
>> com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift
>> Swift svn swift-r2809 cog-r2350
>>
>> RunID: 20090406-1155-pgc5nj00
>> Progress:
>> Progress: Stage in:1
>> Progress: Submitted:1
>> Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb
>> Progress: Failed:1
>> Execution failed:
>> Exception in cat:
>> Arguments: [data.txt]
>> Host: qb
>> Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Cannot submit job: Cannot run program "qsub":
>> java.io.IOException: error=2, No such file or directory
>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
>> Cannot submit job: Cannot run program "qsub": java.io.IOException:
>> error=2, No such file or directory
>> at
>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63)
>> at
>> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
>> at
>> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50)
>> at
>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221)
>> at
>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145)
>> Caused by: java.io.IOException: Cannot run program "qsub":
>> java.io.IOException: error=2, No such file or directory
>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>> at java.lang.Runtime.exec(Runtime.java:593)
>> at
>> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73)
>> at
>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53)
>> ... 4 more
>> Caused by: java.io.IOException: java.io.IOException: error=2, No such
>> file or directory
>> at java.lang.UNIXProcess.(UNIXProcess.java:148)
>> at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>> ... 7 more
>>
>> Cleaning up...
>> Shutting down service at https://208.100.92.21:44166
>> Got channel MetaChannel: 24235184 -> GSSSChannel-null(1)
>> - Done
>> com$ pwd
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From hategan at mcs.anl.gov Mon Apr 6 15:25:09 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 15:25:09 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <49DA6359.8010207@mcs.anl.gov>
References: <49DA379B.7080403@mcs.anl.gov>
<1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov>
Message-ID: <1239049509.5350.0.camel@localhost>
Oops. cog r2367 should fix that.
On Mon, 2009-04-06 at 15:17 -0500, Michael Wilde wrote:
> Mihael, I just updated our test swift+cog source and rebuilt.
>
> Glen is now getting:
>
> Caused by:
> Invalid GSSCredentials
> [...]
From wilde at mcs.anl.gov Mon Apr 6 16:25:51 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 16:25:51 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <1239049509.5350.0.camel@localhost>
References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost>
<49DA6359.8010207@mcs.anl.gov> <1239049509.5350.0.camel@localhost>
Message-ID: <49DA735F.1020300@mcs.anl.gov>
We just tested that rev, and now it seems as if the jobs are getting
submitted to the fork JM instead of to PBS.
Need a log for that, or is the cause obvious?
On 4/6/09 3:25 PM, Mihael Hategan wrote:
> Oops. cog r2367 should fix that.
>
> On Mon, 2009-04-06 at 15:17 -0500, Michael Wilde wrote:
>> Mihael, I just updated our test swift+cog source and rebuilt.
>>
>> Glen is now getting:
>>
>> Caused by:
>> Invalid GSSCredentials
>> [...]
>
From hategan at mcs.anl.gov Mon Apr 6 16:44:43 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 16:44:43 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <49DA735F.1020300@mcs.anl.gov>
References: <49DA379B.7080403@mcs.anl.gov>
<1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov>
<1239049509.5350.0.camel@localhost> <49DA735F.1020300@mcs.anl.gov>
Message-ID: <1239054283.6821.0.camel@localhost>
On Mon, 2009-04-06 at 16:25 -0500, Michael Wilde wrote:
> We just tested that rev, and now it seems as if the jobs are getting
> submitted to the fork JM instead of to PBS.
>
> Need a log for that, or is the cause obvious?
No. I'll debug and see.
From hategan at mcs.anl.gov Mon Apr 6 16:46:50 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 16:46:50 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <49DA735F.1020300@mcs.anl.gov>
References: <49DA379B.7080403@mcs.anl.gov>
<1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov>
<1239049509.5350.0.camel@localhost> <49DA735F.1020300@mcs.anl.gov>
Message-ID: <1239054410.6821.2.camel@localhost>
On Mon, 2009-04-06 at 16:25 -0500, Michael Wilde wrote:
> We just tested that rev, and now it seems as if the jobs are getting
> submitted to the fork JM instead of to PBS.
>
> Need a log for that, or is the cause obvious?
Actually yes, it just became obvious.
From wilde at mcs.anl.gov Mon Apr 6 17:00:26 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 17:00:26 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
Message-ID: <49DA7B7A.6070802@mcs.anl.gov>
We are seeing the following on Ranger:
Swift has 2 coaster jobs running in SGE, coastersPerNode is set to 16,
yet it seems to be doing "slow start" as if it doesn't know that it can
quickly fill the available coaster slots.
For example, Glen sees the trace below, and is surprised that it's not
running at least 32 app() procs by this point, instead of 2.
Is this expected behavior, or would you have expected the scheduler to
fill all available coaster slots?
--
Every 2.0s: showq | grep hockyg                      Mon Apr  6 16:51:38 2009

641061   data   hockyg   Running   16   01:31:50   Mon Apr  6 16:42:30
641062   data   hockyg   Running   16   01:31:50   Mon Apr  6 16:42:30

that's ranger
Progress: Selecting site:98 Stage in:1 Submitting:1 Finished successfully:4
Progress: Selecting site:98 Submitting:2 Finished successfully:4
Progress: Selecting site:98 Submitting:1 Submitted:1 Finished
successfully:4
Progress: Selecting site:98 Submitted:2 Finished successfully:4
Progress: Selecting site:98 Submitted:1 Active:1 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:2 Finished successfully:4
Progress: Selecting site:98 Active:1 Stage out:1 Finished successfully:4
Progress: Selecting site:98 Active:1 Finished successfully:5
Progress: Selecting site:97 Stage in:1 Active:1 Finished successfully:5
Progress: Selecting site:96 Active:3 Finished successfully:5
Progress: Selecting site:96 Active:3 Finished successfully:5
Progress: Selecting site:96 Active:2 Stage out:1 Finished successfully:5
Progress: Selecting site:96 Active:2 Finished successfully:6
Progress: Selecting site:95 Stage in:1 Active:2 Finished successfully:6
From hategan at mcs.anl.gov Mon Apr 6 17:04:36 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 17:04:36 -0500
Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs
In-Reply-To: <1239054410.6821.2.camel@localhost>
References: <49DA379B.7080403@mcs.anl.gov>
<1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov>
<1239049509.5350.0.camel@localhost> <49DA735F.1020300@mcs.anl.gov>
<1239054410.6821.2.camel@localhost>
Message-ID: <1239055476.6821.9.camel@localhost>
On Mon, 2009-04-06 at 16:46 -0500, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 16:25 -0500, Michael Wilde wrote:
> > We just tested that rev, and now it seems as if the jobs are getting
> > submitted to the fork JM instead of to PBS.
> >
> > Need a log for that, or is the cause obvious?
>
> Actually yes, it just became obvious.
I've corrected the initial fix. Hopefully it works properly this time.
The issue was related to a badly thought-out change needed for the worker
terminal to function.
From hategan at mcs.anl.gov Mon Apr 6 17:08:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 17:08:23 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <49DA7B7A.6070802@mcs.anl.gov>
References: <49DA7B7A.6070802@mcs.anl.gov>
Message-ID: <1239055703.6821.11.camel@localhost>
On Mon, 2009-04-06 at 17:00 -0500, Michael Wilde wrote:
> We are seeing the following on Ranger:
>
> Swift has 2 coaster jobs running in SGE, coastersPerNode is set to 16,
> yet it seems to be doing "slow start" as if it doesnt know that it ca
> quickly fill the available coaster slots.
Right. Swift doing a slow start is a given.
Coasters allocating more workers than needed is the issue.
From wilde at mcs.anl.gov Mon Apr 6 17:32:22 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 17:32:22 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <1239055703.6821.11.camel@localhost>
References: <49DA7B7A.6070802@mcs.anl.gov> <1239055703.6821.11.camel@localhost>
Message-ID: <49DA82F6.9090502@mcs.anl.gov>
OK, this one seems to be more of a nuisance/anomaly that we can set
aside for now, I think.
Opening up the throttle a bit should make this a minor issue.
Eventually, you'd hope it would fill the available coasters when there is
demand, or at least base the ramp-up on the fact that jobs started, and
not wait for them to finish. Then it would sense faster that there were
more ready workers.
On 4/6/09 5:08 PM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 17:00 -0500, Michael Wilde wrote:
>> We are seeing the following on Ranger:
>>
>> Swift has 2 coaster jobs running in SGE, coastersPerNode is set to 16,
>> yet it seems to be doing "slow start" as if it doesnt know that it ca
>> quickly fill the available coaster slots.
>
> Right. Swift doing a slow start is a given.
>
> Coasters allocating more workers than needed is the issue.
>
>
From hategan at mcs.anl.gov Mon Apr 6 17:43:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 17:43:23 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <49DA82F6.9090502@mcs.anl.gov>
References: <49DA7B7A.6070802@mcs.anl.gov>
<1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov>
Message-ID: <1239057803.8721.4.camel@localhost>
On Mon, 2009-04-06 at 17:32 -0500, Michael Wilde wrote:
> OK, this one seems to be more of a nuisance/anomaly that we can set
> aside for now I think.
>
> Opening up the throttle a bit should make this a minor issue.
> Eventually, you'd hope it would fill available coasters when there is
> demand, or at least base the rampup on the fast that jobs started, and
> not wait for them to finish. Then it would sense faster that there were
> more ready workers.
Yes. I mentioned this a while ago, that with coasters, throttling
guesses become unnecessary. You simply throttle to the number of
available workers.
This, however, falls out of the model we started with, so there are some
possibly non-trivial changes to swift needed in order to support this
with coasters, while still keeping the old behaviour without coasters.
From wilde at mcs.anl.gov Mon Apr 6 17:53:17 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 17:53:17 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <1239057803.8721.4.camel@localhost>
References: <49DA7B7A.6070802@mcs.anl.gov>
<1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov>
<1239057803.8721.4.camel@localhost>
Message-ID: <49DA87DD.1010704@mcs.anl.gov>
OK, sounds reasonable.
For what it's worth, Glen provided another example of coasters going idle
while there are jobs ready to run.
Nothing more to say on this, except to point out that it affects more
than just startup.
Is there a simpler, alternate scheduler algorithm that you could plug in
as a global, settable alternative to the current one when all sites are
using coasters?
(No need to answer that now; we'll see how far we can get with things as
they are, in various combinations of sites and settings).
We're digging into the imbalance problem at the moment; that one may be
more worth your time, as is the larger-nodes-per-job allocation
enhancement.
--- from Glen:
again, not using these coasters effectively
5:42
Michael Wilde
?
5:42
Glen Hocky
e.g.
qb now has qb2:
                                                           Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
94741.qb2            hockyg   workq    scheduler_  30786   1   1    --  01:41 R 00:53
94742.qb2            hockyg   workq    scheduler_  31391   1   1    --  01:41 R 00:53
94808.qb2            hockyg   workq    scheduler_   2274   1   1    --  01:41 R 00:22
94809.qb2            hockyg   workq    scheduler_  27186   1   1    --  01:41 R 00:21
94811.qb2            hockyg   workq    scheduler_  31647   1   1    --  01:41 R 00:21
94812.qb2            hockyg   workq    scheduler_   4773   1   1    --  01:41 R 00:18
but only 4 active jobs
4 submitted
*7 submitted
all the rest done
so what is it doing with all those extra cpus
5:43
...
Glen Hocky
for my run on only qb
Progress: Submitted:7 Active:4 Finished successfully:93
5:43
Glen Hocky
again, the problem may be that these jobs are taking 15 minutes or more
so they don't end very often
On 4/6/09 5:43 PM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 17:32 -0500, Michael Wilde wrote:
>> OK, this one seems to be more of a nuisance/anomaly that we can set
>> aside for now I think.
>>
>> Opening up the throttle a bit should make this a minor issue.
>> Eventually, you'd hope it would fill available coasters when there is
>> demand, or at least base the rampup on the fast that jobs started, and
>> not wait for them to finish. Then it would sense faster that there were
>> more ready workers.
>
> Yes. I mentioned this a while ago, that with coasters, throttling
> guesses become unnecessary. You simply throttle to the number of
> available workers.
>
> This, however, falls out of the model we started with, so there are some
> possibly non-trivial changes to swift needed in order to support this
> with coasters, while still keeping the old behaviour without coasters.
>
>
>
From hategan at mcs.anl.gov Mon Apr 6 18:18:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 18:18:59 -0500
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <49DA87DD.1010704@mcs.anl.gov>
References: <49DA7B7A.6070802@mcs.anl.gov>
<1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov>
<1239057803.8721.4.camel@localhost> <49DA87DD.1010704@mcs.anl.gov>
Message-ID: <1239059939.8843.21.camel@localhost>
On Mon, 2009-04-06 at 17:53 -0500, Michael Wilde wrote:
> OK, sounds reasonable.
>
> For what its worth, Glen provided another example of coasters going idle
> while there are jobs ready to run.
Or maybe the jobs don't fit in the time some of the workers have left.
In other words, don't be surprised that workers are not the same as the
jobs they are meant to run, because that's obvious.
There are only two promises related to how workers are allocated: no
more workers than jobs will be started (modulo the broken
coastersPerNode issue - and this promise may have to be dropped if we do
block allocations) and that no worker will stay idle for more than a
certain amount of time, which is currently 10 minutes (probably too
large).
>
> Nothing more to say on this, except to point out that it affects more
> than just startup.
Where "it" may be a very different it.
From wilde at mcs.anl.gov Mon Apr 6 18:28:04 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 18:28:04 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites
Message-ID: <49DA9004.8010409@mcs.anl.gov>
Glen seems to have a good example of this in:
/home/hockyg/oops/swift/output/teragridoutdir.1
com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' |
sort | uniq -c
159 host=abe
8 host=localhost
13 host=qb
11 host=ranger
com$
---
But then I looked in the log and I see that for qb and ranger, it tries
to start jobs there and gets an exception on each of them, while jobs
for abe keep on zipping through.
As far as I can tell, there is, e.g. on queenbee, no coaster boot log at
the time of the exception, and I can't glean any clues from the GRAM log
at that time either (no obvious errors in it).
I am trying now to reproduce this with simple echo-like jobs under my
own id & cert where I can see all the server-side logs.
I *think* that for the run above, Glen first tested each of the 3
sites.xml pool elements separately, for the 3 sites, before trying the
3-site test. I *think* he verified that all three sites worked separately.
But when put together, it *seems* that only the first one works, as if
the ability to start coasters on 3 sites at once is broken.
I am not at all sure, and will try to isolate this with a simpler test that
you can run as well, but at the moment that's a plausible theory.
Btw, this is still with the Mar 31 code rev. I need to catch up on mail
to see if I can now go back to testing on trunk.
From wilde at mcs.anl.gov Mon Apr 6 23:25:45 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 23:25:45 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DA9004.8010409@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov>
Message-ID: <49DAD5C9.7080607@mcs.anl.gov>
I tried this test and discovered some more things about coaster time
management that I don't understand.
It seems that on Queenbee coasters were timing out, while on abe the
workers were getting queued, but abe's coasters.log showed lots of java
exceptions.
If you're interested, all the logs for this run, including the coasters.logs
from the two sites' .globus dirs, are on the CI network at
/home/wilde/swift/lab/20090406-2120-04ythaie
I will re-run with the latest cog/swift revs to see if the behavior
persists.
- Mike
On 4/6/09 6:28 PM, Michael Wilde wrote:
> Glen seems to have a good example of this in:
> /home/hockyg/oops/swift/output/teragridoutdir.1
>
> com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' |
> sort | uniq -c
> 159 host=abe
> 8 host=localhost
> 13 host=qb
> 11 host=ranger
> com$
>
> ---
>
> But then I looked in the log and I see that for qb and ranger, it tries
> to start jobs there and gets an exception on each of them, while jobs
> for abe keep on zipping through.
>
> As far as I can tell, there is, eg on queenbee, no coaster boot log at
> the time of the exception, and I cant glean any clues from the GRAM log
> at the time of the exception (no obvious errors in it).
>
> I am trying now to reproduce this with simple echo-like jobs under my
> own id & cert where I can see all the server-side logs.
>
> I *think* that for the run above, Glen first tested ach of the 3
> sites.xml pool elements separately, for the 3 sites, before trying the
> 3-site test. I *think* he verified that all three sites worked separately.
>
> But when put together, it *seems* that only the first one works, as if
> the ability to start coasters on 3 sites at once is broken.
>
> I am not at all sure, and will try to isolate with a simpler test that
> you can run as well, but at the moment thats a plausible theory.
>
> Btw, this is still with the Mar 31 code rev. I need to catch up on mail
> to see if I can no go back to testing on trunk.
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Mon Apr 6 23:45:45 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 23:45:45 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAD5C9.7080607@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
Message-ID: <1239079545.15719.3.camel@localhost>
On Mon, 2009-04-06 at 23:25 -0500, Michael Wilde wrote:
> I tried this test and discovered some more things about coaster time
> management that I dont understand.
>
> It seems that on Queenbee coasters were timing out, while on abe the
> workers were getting queued, but abe's coasters.log showed lots of java
> exceptions.
Yes. It still seems to have been run with the unfortunate version. I
can't tell which exceptions are legit and which ones are the result of
the coaster code being in that particular bad state.
>
> If you're interested, all logs for this run including coasters.logs from
> the two sites .globus dirs is on ci net at
> /home/wilde/swift/lab/20090406-2120-04ythaie
>
> I will re-run with the latest cog/swift revs to see if the behavior
> persists.
>
> - Mike
>
>
> On 4/6/09 6:28 PM, Michael Wilde wrote:
> > Glen seems to have a good example of this in:
> > /home/hockyg/oops/swift/output/teragridoutdir.1
> >
> > com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' |
> > sort | uniq -c
> > 159 host=abe
> > 8 host=localhost
> > 13 host=qb
> > 11 host=ranger
> > com$
> >
> > ---
> >
> > But then I looked in the log and I see that for qb and ranger, it tries
> > to start jobs there and gets an exception on each of them, while jobs
> > for abe keep on zipping through.
> >
> > As far as I can tell, there is, eg on queenbee, no coaster boot log at
> > the time of the exception, and I cant glean any clues from the GRAM log
> > at the time of the exception (no obvious errors in it).
> >
> > I am trying now to reproduce this with simple echo-like jobs under my
> > own id & cert where I can see all the server-side logs.
> >
> > I *think* that for the run above, Glen first tested ach of the 3
> > sites.xml pool elements separately, for the 3 sites, before trying the
> > 3-site test. I *think* he verified that all three sites worked separately.
> >
> > But when put together, it *seems* that only the first one works, as if
> > the ability to start coasters on 3 sites at once is broken.
> >
> > I am not at all sure, and will try to isolate with a simpler test that
> > you can run as well, but at the moment thats a plausible theory.
> >
> > Btw, this is still with the Mar 31 code rev. I need to catch up on mail
> > to see if I can no go back to testing on trunk.
> >
> >
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Mon Apr 6 23:56:54 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 23:56:54 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAD5C9.7080607@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
Message-ID: <49DADD16.2010507@mcs.anl.gov>
The latest rev shows a similar failure on the surface, but I think
different patterns in the coaster logs.
The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile.
This time 39 of 40 jobs ran on abe, and then the workflow lingered and
finally failed, with 39 ok, 1 failure.
All the logs for this run are in
/home/wilde/swift/lab/20090406-2330-72p9ale0
Below that are dirs for the abe and qb coaster and GRAM logs.
Abe had no GRAM log for this run.
I suspect this one is worth looking at.
On 4/6/09 11:25 PM, Michael Wilde wrote:
> I tried this test and discovered some more things about coaster time
> management that I dont understand.
>
> It seems that on Queenbee coasters were timing out, while on abe the
> workers were getting queued, but abe's coasters.log showed lots of java
> exceptions.
>
> If you're interested, all logs for this run including coasters.logs from
> the two sites .globus dirs is on ci net at
> /home/wilde/swift/lab/20090406-2120-04ythaie
>
> I will re-run with the latest cog/swift revs to see if the behavior
> persists.
>
> - Mike
>
>
> On 4/6/09 6:28 PM, Michael Wilde wrote:
>> Glen seems to have a good example of this in:
>> /home/hockyg/oops/swift/output/teragridoutdir.1
>>
>> com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' |
>> sort | uniq -c
>> 159 host=abe
>> 8 host=localhost
>> 13 host=qb
>> 11 host=ranger
>> com$
>>
>> ---
>>
>> But then I looked in the log and I see that for qb and ranger, it
>> tries to start jobs there and gets an exception on each of them, while
>> jobs for abe keep on zipping through.
>>
>> As far as I can tell, there is, eg on queenbee, no coaster boot log at
>> the time of the exception, and I cant glean any clues from the GRAM
>> log at the time of the exception (no obvious errors in it).
>>
>> I am trying now to reproduce this with simple echo-like jobs under my
>> own id & cert where I can see all the server-side logs.
>>
>> I *think* that for the run above, Glen first tested ach of the 3
>> sites.xml pool elements separately, for the 3 sites, before trying the
>> 3-site test. I *think* he verified that all three sites worked
>> separately.
>>
>> But when put together, it *seems* that only the first one works, as if
>> the ability to start coasters on 3 sites at once is broken.
>>
>> I am not at all sure, and will try to isolate with a simpler test that
>> you can run as well, but at the moment thats a plausible theory.
>>
>> Btw, this is still with the Mar 31 code rev. I need to catch up on
>> mail to see if I can no go back to testing on trunk.
>>
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Tue Apr 7 00:09:44 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 00:09:44 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DADD16.2010507@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
Message-ID: <1239080984.16125.1.camel@localhost>
On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
> The latest rev shows a similar failure on the surface, but I think
> different patterns in the coaster logs.
>
> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile.
>
> This time 39 of 40 jobs ran on abe, and then the workflow lingered and
> finally failed, with 39 ok, 1 failure.
>
> All the logs for this run are in
> /home/wilde/swift/lab/20090406-2330-72p9ale0
>
> below that are dirs for the abe and qb logs coaster and gram logs.
> Abe had no gram log for this run.
>
> I suspect this one is worth looking at.
Indeed. Can you paste your sites file?
There's some oddity there.
From wilde at mcs.anl.gov Tue Apr 7 00:09:58 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 00:09:58 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239080984.16125.1.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
Message-ID: <49DAE026.3040909@mcs.anl.gov>
com$ cat abe+qb.xml
TG-CDA070002T
8
02:30:00
/u/ac/wilde/swiftwork
TG-CDA070002T
8
02:30:00
/home/ux454325/swiftwork
com$
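(Roughly, with its tags intact, each pool entry above would look something
like the sketch below. The service URLs and exact key names are
placeholders/assumptions rather than the actual abe/qb values; only the
project, the "8", the walltime and the work directories come from the
paste above and the attribute fragments quoted further down.)

<config>
  <pool handle="abe">
    <execution provider="coaster" url="GATEKEEPER.HOST.PLACEHOLDER"
               jobManager="gt2:gt2:pbs" />
    <profile namespace="globus" key="project">TG-CDA070002T</profile>
    <profile namespace="globus" key="coastersPerNode">8</profile>
    <profile namespace="globus"
             key="coasterWorkerMaxwalltime">02:30:00</profile>
    <gridftp url="gsiftp://GRIDFTP.HOST.PLACEHOLDER" />
    <workdirectory>/u/ac/wilde/swiftwork</workdirectory>
  </pool>
  <pool handle="qb">
    <execution provider="coaster" url="GATEKEEPER.HOST.PLACEHOLDER"
               jobManager="gt2:gt2:pbs" />
    <profile namespace="globus" key="project">TG-CDA070002T</profile>
    <profile namespace="globus" key="coastersPerNode">8</profile>
    <profile namespace="globus"
             key="coasterWorkerMaxwalltime">02:30:00</profile>
    <gridftp url="gsiftp://GRIDFTP.HOST.PLACEHOLDER" />
    <workdirectory>/home/ux454325/swiftwork</workdirectory>
  </pool>
</config>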
On 4/7/09 12:09 AM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>> The latest rev shows a similar failure on the surface, but I think
>> different patterns in the coaster logs.
>>
>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile.
>>
>> This time 39 of 40 jobs ran on abe, and then the workflow lingered and
>> finally failed, with 39 ok, 1 failure.
>>
>> All the logs for this run are in
>> /home/wilde/swift/lab/20090406-2330-72p9ale0
>>
>> below that are dirs for the abe and qb logs coaster and gram logs.
>> Abe had no gram log for this run.
>>
>> I suspect this one is worth looking at.
>
> Indeed. Can you paste your sites file?
>
> There's some oddity there.
>
>
From wilde at mcs.anl.gov Tue Apr 7 00:15:23 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 00:15:23 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAE026.3040909@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov>
<49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov>
Message-ID: <49DAE16B.6000508@mcs.anl.gov>
Note on below: I used 2hr30min as the time to match Glen's time, for the
runs in which he first saw the "imbalance".
In my first tests, I had used 5 min for coasterWorkerMaxwalltime and
specified no site or tc maxwalltime. I thought that would work, based on
our earlier lengthy exchanges on this topic. But apparently coasters was
calculating some default max walltime for "cat" and it gave me an error
about insufficient time. I was trying to gather that along with several
other anomalies in another report.
On 4/7/09 12:09 AM, Michael Wilde wrote:
> com$ cat abe+qb.xml
>
>
>
>
> TG-CDA070002T
> 8
> key="coasterWorkerMaxwalltime">02:30:00
>
> jobManager="gt2:gt2:pbs" />
>
> /u/ac/wilde/swiftwork
>
>
>
>
>
> TG-CDA070002T
> 8
> key="coasterWorkerMaxwalltime">02:30:00
>
> jobManager="gt2:gt2:pbs" />
>
> /home/ux454325/swiftwork
>
>
>
>
> com$
>
>
> On 4/7/09 12:09 AM, Mihael Hategan wrote:
>> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>>> The latest rev shows a similar failure on the surface, but I think
>>> different patterns in the coaster logs.
>>>
>>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped
>>> outfile.
>>>
>>> This time 39 of 40 jobs ran on abe, and then the workflow lingered
>>> and finally failed, with 39 ok, 1 failure.
>>>
>>> All the logs for this run are in
>>> /home/wilde/swift/lab/20090406-2330-72p9ale0
>>>
>>> below that are dirs for the abe and qb logs coaster and gram logs.
>>> Abe had no gram log for this run.
>>>
>>> I suspect this one is worth looking at.
>>
>> Indeed. Can you paste your sites file?
>>
>> There's some oddity there.
>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Tue Apr 7 00:26:35 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 00:26:35 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAE16B.6000508@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
Message-ID: <1239081995.16125.8.camel@localhost>
On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote:
> Note on below: I used 2hr30min as the time to match Glen's time, for the
> runs in which he first saw the "imbalance".
>
> In my first tests,I had used 5 min for coasterWorkerMaxwalltime and
> specified no site or tc maxwalltime. I thought that would work, based on
> our earlier lengthy exchanges on this topic. But apparantly coasters was
> calculating some default max walltime for "cat" and it gave me an error
> about insufficient time.
Right. Previously it would just loop starting workers and then not using
them because they didn't have enough time. The default walltime is 10
minutes.
> I was trying to gather that alolng with several
> other anomalies in another report.
Now, the oddity below is that both coaster services are started with the
same service id. Not only that, but the same service id was used for
subsequent runs (the bootstrap logs contain multiple "runs"). This,
roughly, makes no sense, and I can't imagine it being a cause for
goodness.
>
>
> On 4/7/09 12:09 AM, Michael Wilde wrote:
> > com$ cat abe+qb.xml
> >
> >
> >
> >
> > TG-CDA070002T
> > 8
> > > key="coasterWorkerMaxwalltime">02:30:00
> >
> > > jobManager="gt2:gt2:pbs" />
> >
> > /u/ac/wilde/swiftwork
> >
> >
> >
> >
> >
> > TG-CDA070002T
> > 8
> > > key="coasterWorkerMaxwalltime">02:30:00
> >
> > > jobManager="gt2:gt2:pbs" />
> >
> > /home/ux454325/swiftwork
> >
> >
> >
> >
> > com$
> >
> >
> > On 4/7/09 12:09 AM, Mihael Hategan wrote:
> >> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
> >>> The latest rev shows a similar failure on the surface, but I think
> >>> different patterns in the coaster logs.
> >>>
> >>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped
> >>> outfile.
> >>>
> >>> This time 39 of 40 jobs ran on abe, and then the workflow lingered
> >>> and finally failed, with 39 ok, 1 failure.
> >>>
> >>> All the logs for this run are in
> >>> /home/wilde/swift/lab/20090406-2330-72p9ale0
> >>>
> >>> below that are dirs for the abe and qb logs coaster and gram logs.
> >>> Abe had no gram log for this run.
> >>>
> >>> I suspect this one is worth looking at.
> >>
> >> Indeed. Can you paste your sites file?
> >>
> >> There's some oddity there.
> >>
> >>
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From wilde at mcs.anl.gov Tue Apr 7 00:33:54 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 00:33:54 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239081995.16125.8.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
Message-ID: <49DAE5C2.6070806@mcs.anl.gov>
On 4/7/09 12:26 AM, Mihael Hategan wrote:
> On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote:
>> Note on below: I used 2hr30min as the time to match Glen's time, for the
>> runs in which he first saw the "imbalance".
>>
>> In my first tests,I had used 5 min for coasterWorkerMaxwalltime and
>> specified no site or tc maxwalltime. I thought that would work, based on
>> our earlier lengthy exchanges on this topic. But apparantly coasters was
>> calculating some default max walltime for "cat" and it gave me an error
>> about insufficient time.
>
> Right. Previously it would just loop starting workers and then not using
> them because they didn't have enough time. The default walltime is 10
> minutes.
That makes sense then. The error I got was:
2009-04-06 20:52:35,397-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
jobid=cat-e3agg19j - Application exception: Job cannot be run with the
given max walltime worker constraint
The other few anomalies I saw I will ignore unless they happen again, as
I was using the bad 3/31 revision. These were things like starting a new
service with some strange default max time ("01:41:00", or 101 minutes)
after the initial services were started with the correct time, and some
strange error retry behavior.
Bear with me - these things are very difficult and tedious to report.
>> I was trying to gather that alolng with several
>> other anomalies in another report.
>
> Now, the oddity below is that both coaster services are started with the
> same service id. Not only that, the same service id was used for
> subsequent runs (the bootstrap logs contain multiple "runs"). This,
> roughly, makes no sense, but I can't imagine it being cause for
> goodness.
OK. Any chance I messed up copying log files (and duplicated one) or are
you seeing the duplicate service id in truly distinct logs?
(No need for reply - I'm assuming that if there was a chance I duplicated
a log it would be obvious...)
>
>>
>> On 4/7/09 12:09 AM, Michael Wilde wrote:
>>> com$ cat abe+qb.xml
>>>
>>>
>>>
>>>
>>> TG-CDA070002T
>>> 8
>>> >> key="coasterWorkerMaxwalltime">02:30:00
>>>
>>> >> jobManager="gt2:gt2:pbs" />
>>>
>>> /u/ac/wilde/swiftwork
>>>
>>>
>>>
>>>
>>>
>>> TG-CDA070002T
>>> 8
>>> >> key="coasterWorkerMaxwalltime">02:30:00
>>>
>>> >> jobManager="gt2:gt2:pbs" />
>>>
>>> /home/ux454325/swiftwork
>>>
>>>
>>>
>>>
>>> com$
>>>
>>>
>>> On 4/7/09 12:09 AM, Mihael Hategan wrote:
>>>> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote:
>>>>> The latest rev shows a similar failure on the surface, but I think
>>>>> different patterns in the coaster logs.
>>>>>
>>>>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped
>>>>> outfile.
>>>>>
>>>>> This time 39 of 40 jobs ran on abe, and then the workflow lingered
>>>>> and finally failed, with 39 ok, 1 failure.
>>>>>
>>>>> All the logs for this run are in
>>>>> /home/wilde/swift/lab/20090406-2330-72p9ale0
>>>>>
>>>>> below that are dirs for the abe and qb logs coaster and gram logs.
>>>>> Abe had no gram log for this run.
>>>>>
>>>>> I suspect this one is worth looking at.
>>>> Indeed. Can you paste your sites file?
>>>>
>>>> There's some oddity there.
>>>>
>>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
From hategan at mcs.anl.gov Tue Apr 7 00:39:14 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 00:39:14 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DAE5C2.6070806@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost> <49DAE5C2.6070806@mcs.anl.gov>
Message-ID: <1239082754.16125.12.camel@localhost>
On Tue, 2009-04-07 at 00:33 -0500, Michael Wilde wrote:
>
> On 4/7/09 12:26 AM, Mihael Hategan wrote:
> > On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote:
> >> Note on below: I used 2hr30min as the time to match Glen's time, for the
> >> runs in which he first saw the "imbalance".
> >>
> >> In my first tests,I had used 5 min for coasterWorkerMaxwalltime and
> >> specified no site or tc maxwalltime. I thought that would work, based on
> >> our earlier lengthy exchanges on this topic. But apparantly coasters was
> >> calculating some default max walltime for "cat" and it gave me an error
> >> about insufficient time.
> >
> > Right. Previously it would just loop starting workers and then not using
> > them because they didn't have enough time. The default walltime is 10
> > minutes.
>
> That makes sense then. The error I got was:
>
> 2009-04-06 20:52:35,397-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> jobid=cat-e3agg19j - Application exception: Job cannot be run with the
> given max walltime worker constraint
>
> The other few anomalies I saw I will ignore unless they happen again, as
> I was using the bad 3/31 revision. This was things like starting a new
> service with some strange default max time ("01:41:00" or 101 minutes)
Not strange. 101 = 10 * 10 + 1, i.e. DEFAULT_MAXWALLTIME *
OVERALLOCATION_FACTOR + RESERVE: the 10-minute default walltime times an
overallocation factor of 10, plus a 1-minute reserve, gives 101 minutes,
which displays as 01:41:00.
> after the initial services were started with the correct time, and some
> strange error retry behavior.
>
> Bear with me - these things are very difficult and tedious to report.
No problem. I'm glad you're exercising the code.
From hategan at mcs.anl.gov Tue Apr 7 01:04:22 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 01:04:22 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239081995.16125.8.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
Message-ID: <1239084262.16125.18.camel@localhost>
On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote:
> > I was trying to gather that alolng with several
> > other anomalies in another report.
>
> Now, the oddity below is that both coaster services are started with the
> same service id. Not only that, the same service id was used for
> subsequent runs (the bootstrap logs contain multiple "runs"). This,
> roughly, makes no sense, but I can't imagine it being cause for
> goodness.
That was just another one of my brilliant ideas. It was dimmed a bit in
cog r2369. Prior to that (and after the big fiddle with the bootstrap
script a while ago), multi-site coaster runs were broken.
From benc at hawaga.org.uk Tue Apr 7 03:37:31 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 7 Apr 2009 08:37:31 +0000 (GMT)
Subject: [Swift-devel] Expected behavior for scheduler slow-start with
coasters?
In-Reply-To: <49DA87DD.1010704@mcs.anl.gov>
References: <49DA7B7A.6070802@mcs.anl.gov> <1239055703.6821.11.camel@localhost>
<49DA82F6.9090502@mcs.anl.gov> <1239057803.8721.4.camel@localhost>
<49DA87DD.1010704@mcs.anl.gov>
Message-ID:
> Is there a simpler, alternate scheduler algorithm that you could plug in as a
> global, settable alternative to the current one when all sites are using
> coasters?
You can set the initialScore profile key very high[1] so that Swift
starts at full load rather than low load. This is basically "the simpler,
alternative scheduler algorithm" that you are looking for.
You will however run into a different manifestation of the same problem:
coastersPerNode does not work properly and will likely attempt to
massively overallocate workers.
It's not a bug in the scheduler - it's a bug in the implementation of
coastersPerNode that causes it to attempt to allocate one node per excess
job.
In the longer term, as Mihael said, the interface between the scheduler
and execution systems needs to change because coasters don't fit in the
present abstraction very well.
[1] (to about 100 - the actual formula is rather opaque and I have to
rederive it every time because I never write it down)
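For reference, a minimal sketch of how that might be expressed in a
sites.xml pool entry (the namespace and exact key spelling here are
assumptions, and the value of 100 just follows the rough guidance in [1]):

  <profile namespace="karajan" key="initialScore">100</profile>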
--
From qinz at ihpc.a-star.edu.sg Tue Apr 7 04:07:20 2009
From: qinz at ihpc.a-star.edu.sg (Qin Zheng)
Date: Tue, 7 Apr 2009 17:07:20 +0800
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID:
Prof Foster, thanks for introducing me to the team.
My research interest is in scheduling workflows (DAGs). Ben, we decided not to use resubmission because a DAG cannot complete when any of its tasks fails, and each failure would trigger resubmission/retry of the DAG. Instead, we provide fault tolerance by pre-scheduling a replica (backup) for each task (see enclosure for details). The objective is to guarantee that the DAG can complete (in a preplanned manner, with fast failover to the backup upon failure) before its deadline.
Currently I am also working on workflow scheduling under uncertainty in task running times. This work includes prioritizing tasks based on the impact of the variation in their running times on the overall response time, offline planning for high-priority tasks, and runtime adaptation for all tasks once up-to-date information is available.
I am looking forward to talking to you and learning about your research!
Regards,
Qin Zheng
________________________________
From: Ian Foster [mailto:foster at anl.gov]
Sent: Monday, April 06, 2009 10:46 PM
To: Ben Clifford
Cc: swift-devel; Qin Zheng
Subject: Re: [Swift-devel] Re: replication vs site score
Ben:
You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here.
I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection.
Ian.
On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:
even more rambling... in the context of a scheduler that is doing things
like prioritising jobs based on more than the order that Swift happened to
submit them (hopefully I will have a student for this in the summer), I
think a replicant job should be pushed toward later execution rather than
earlier execution to reduce the number of replicant jobs in the system at
any one time.
This is because I suspect (though I have gathered no numerical evidence)
that given the choice between submitting a fresh job and a replicant job
(making up terminology here too... mmm), it is almost always better to
submit the fresh job. Either we end up submitting the replicant job
eventually (in which case we are no worse off than if we submitted the
replicant first and then a fresh job); or by delaying the replicant job we
give that replicant's original a chance to start running and thus do not
discard our precious time-and-load-dollars that we have already spent on
queueing that replicant's original.
--
_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
________________________________
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fault Tolerance_TC_Mar09.pdf
Type: application/pdf
Size: 2142133 bytes
Desc: Fault Tolerance_TC_Mar09.pdf
URL:
From wilde at mcs.anl.gov Tue Apr 7 06:09:15 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 06:09:15 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239082754.16125.12.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost> <49DAE5C2.6070806@mcs.anl.gov>
<1239082754.16125.12.camel@localhost>
Message-ID: <49DB345B.3030406@mcs.anl.gov>
>> The other few anomalies I saw I will ignore unless they happen
>> again, as
>> I was using the bad 3/31 revision. This was things like starting a new
>> service with some strange default max time ("01:41:00" or 101 minutes)
>
> Not strange. 101 = 10 * 10 + 1 or DEFAULT_MAXWALLTIME *
> OVERALLOCATION_FACTOR + RESERVE.
I assumed 1:41 was derived from some formula. The unexpected behavior
here was that it looked like a job was submitted by coasters that
ignored the specified coasterWorkerMaxwalltime, after the initial jobs
honored it.
But again, the code base was suspect. I'll keep an eye out for it
happening again.
From wilde at mcs.anl.gov Tue Apr 7 06:13:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 06:13:47 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239084262.16125.18.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
Message-ID: <49DB356B.4050808@mcs.anl.gov>
putting Glen back on cc: Multi-site coaster runs will not work until
Mihael posts a fix.
On 4/7/09 1:04 AM, Mihael Hategan wrote:
> On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote:
>
>>> I was trying to gather that alolng with several
>>> other anomalies in another report.
>> Now, the oddity below is that both coaster services are started with the
>> same service id. Not only that, the same service id was used for
>> subsequent runs (the bootstrap logs contain multiple "runs"). This,
>> roughly, makes no sense, but I can't imagine it being cause for
>> goodness.
>
> That was just another one of my brilliant ideas. It was dimmed a bit in
> cog r2369. Previous to that, and after the big fiddle with the bootstrap
> script a while ago, multi-site coaster runs are broken.
>
From benc at hawaga.org.uk Tue Apr 7 06:30:59 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 7 Apr 2009 11:30:59 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID:
Hi.
Most/all of the work that we've done with Swift works with fairly
opportunistic use of resources - we submit work into job queues on one or
more sites, where those job queues are shared with many other users, and
where the runtimes for both our jobs and other users' jobs are not well
defined ahead of time.
So whilst we use the word 'scheduling' sometimes in Swift, it's more a case
of "what do we think is the best site to queue a job on right now?" rather
than making an execution plan that we think will be valid for a long
period of time.
Our replication mechanism sounds fairly similar to your pre-scheduled
backups, but I think there are these important differences:
* we don't launch a replica until we think there is a reasonable chance
that the replica will run instead of the original (based on queue time)
* as soon as one of the jobs *starts* running, we cancel all the others.
From what I understand, you do that when one of the jobs *ends*
successfully.
We do have one situation where we have some pre-allocation of resources,
and that is when coasters are being used. These use the above
opportunistic queuing methods to acquire a worker node for a long period
of time, and then run Swift-level jobs in there, at present on a
first-come, first-served basis. It's likely that we'll change that to have
some other job prioritisation, but still pre-scheduling the jobs.
Where Swift would have trouble working with an ahead-of-time
planner/scheduler is that the module that generates file transfer and
execution tasks from high level SwiftScripts does not submit a dependent
task for scheduling and execution until its predecessors have been
successfully executed.
What the scheduler sees is a stream, over time, of file transfer and
execution tasks that are safe to run immediately.
It might be easy, or it might be hard, to make the Swift code submit more
eagerly, with a description of task dependencies, which would allow you to
plug in a pre-planner underneath.
On Tue, 7 Apr 2009, Qin Zheng wrote:
> Prof Foster, thanks for introducing me to the team.
>
> My research interest is on scheduling workflows (DAGs). Ben, we decided
> not to use resubmission in the consideration that a DAG cannot be
> completed when any of its tasks fails, which each time would trigger the
> resubmission\retry of the DAG. Instead, we use fault tolerance by
> pre-scheduling replica (backup) for each task (see enclosure for
> details). The objective is to guarantee that this DAG can be completed
> (in a preplanned manner with fast failover to the backup upon failure)
> before its deadline.
>
> Currently I am also working on workflow scheduling under uncertainties
> of task running times. This work includes priorities tasks based on the
> impact of the variation of its running time on the overall response time
> and offline planning for high-priority tasks as well as runtime
> adaptation for all tasks once up-to-date information is available.
>
> I am looking forward to talking to you guys and knowing your research!
>
> Regards,
> Qin Zheng
> ________________________________
> From: Ian Foster [mailto:foster at anl.gov]
> Sent: Monday, April 06, 2009 10:46 PM
> To: Ben Clifford
> Cc: swift-devel; Qin Zheng
> Subject: Re: [Swift-devel] Re: replication vs site score
>
> Ben:
>
> You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here.
>
> I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection.
>
> Ian.
>
>
> On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:
>
>
> even more rambling... in the context of a scheduler that is doing things
> like prioritising jobs based on more than the order that Swift happened to
> submit them (hopefully I will have a student for this in the summer), I
> think a replicant job should be pushed toward later execution rather than
> earlier execution to reduce the number of replicant jobs in the system at
> any one time.
>
> This is because I suspect (though I have gathered no numerical evidence)
> that given the choice between submitting a fresh job and a replicant job
> (making up terminology here too... mmm), it is almost always better to
> submit the fresh job. Either we end up submitting the replicant job
> eventually (in which case we are no worse off than if we submitted the
> replicant first and then a fresh job); or by delaying the replicant job we
> give that replicant's original a chance to start running and thus do not
> discard our precious time-and-load-dollars that we have already spent on
> queueing that replicant's original.
>
> --
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
> ________________________________
> This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
>
From foster at anl.gov Tue Apr 7 07:33:02 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 7 Apr 2009 07:33:02 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DB356B.4050808@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
Message-ID: <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
Is there a description somewhere of the algorithms used for starting
coasters and submitting jobs to them?
Ian.
From benc at hawaga.org.uk Tue Apr 7 07:36:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 7 Apr 2009 12:36:10 +0000 (GMT)
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
<26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
Message-ID:
On Tue, 7 Apr 2009, Ian Foster wrote:
> Is there a description somewhere of the algorithms used for starting coasters
> and submitting jobs to them?
Plenty in the archives of this list, I expect.
Basically: if a job arrives and there is a free coaster slot, launch a new
coaster worker. If there is no free coaster slot existing for it, launch a
new coaster worker.
--
From wilde at mcs.anl.gov Tue Apr 7 07:42:25 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 07:42:25 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To:
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
<26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
Message-ID: <49DB4A31.80108@mcs.anl.gov>
This contains a lot of the startup details:
http://wiki.cogkit.org/wiki/Coasters
On 4/7/09 7:36 AM, Ben Clifford wrote:
> On Tue, 7 Apr 2009, Ian Foster wrote:
>
>> Is there a description somewhere of the algorithms used for starting coasters
>> and submitting jobs to them?
>
> Plenty in the archives of this list, I expect.
>
> Basically: if a job arrives and there is a free coaster slot, launch a new
> coaster worker. If there is no free coaster slot existing for it, launch a
> new coaster worker.
>
From benc at hawaga.org.uk Tue Apr 7 07:47:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 7 Apr 2009 12:47:39 +0000 (GMT)
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DB4A31.80108@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
<26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
<49DB4A31.80108@mcs.anl.gov>
Message-ID:
On Tue, 7 Apr 2009, Michael Wilde wrote:
> This contains a lot of the startup details:
>
> http://wiki.cogkit.org/wiki/Coasters
Would be good to link to that from the Swift user guide.
--
From wilde at mcs.anl.gov Tue Apr 7 08:03:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 08:03:47 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To:
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
<26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov>
<49DB4A31.80108@mcs.anl.gov>
Message-ID: <49DB4F33.5040502@mcs.anl.gov>
done. (but not tested)
On 4/7/09 7:47 AM, Ben Clifford wrote:
>
> On Tue, 7 Apr 2009, Michael Wilde wrote:
>
>> This contains a lot of the startup details:
>>
>> http://wiki.cogkit.org/wiki/Coasters
>
> Would be good to link to that from the Swift user guide.
>
From wilde at mcs.anl.gov Tue Apr 7 08:14:54 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 08:14:54 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DB4F33.5040502@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov>
<49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov>
<49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov> <49DB4A31.80108@mcs.anl.gov>
<49DB4F33.5040502@mcs.anl.gov>
Message-ID: <49DB51CE.3090502@mcs.anl.gov>
On 4/7/09 8:03 AM, Michael Wilde wrote:
> done. (but not tested)
but I should have. Fixed, *and* tested.
>
> On 4/7/09 7:47 AM, Ben Clifford wrote:
>>
>> On Tue, 7 Apr 2009, Michael Wilde wrote:
>>
>>> This contains a lot of the startup details:
>>>
>>> http://wiki.cogkit.org/wiki/Coasters
>>
>> Would be good to link to that from the Swift user guide.
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Tue Apr 7 10:08:10 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 07 Apr 2009 10:08:10 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <49DB356B.4050808@mcs.anl.gov>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov>
Message-ID: <1239116890.18531.0.camel@localhost>
On Tue, 2009-04-07 at 06:13 -0500, Michael Wilde wrote:
> putting Glen back on cc: Multi-site coaster runs will not work until
> Mihael posts a fix.
What I'm saying below is that the fix is in cog r2369.
>
> On 4/7/09 1:04 AM, Mihael Hategan wrote:
> > On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote:
> >
> >>> I was trying to gather that alolng with several
> >>> other anomalies in another report.
> >> Now, the oddity below is that both coaster services are started with the
> >> same service id. Not only that, the same service id was used for
> >> subsequent runs (the bootstrap logs contain multiple "runs"). This,
> >> roughly, makes no sense, but I can't imagine it being cause for
> >> goodness.
> >
> > That was just another one of my brilliant ideas. It was dimmed a bit in
> > cog r2369. Previous to that, and after the big fiddle with the bootstrap
> > script a while ago, multi-site coaster runs are broken.
> >
From wilde at mcs.anl.gov Tue Apr 7 10:13:23 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 10:13:23 -0500
Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple
sites
In-Reply-To: <1239116890.18531.0.camel@localhost>
References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov>
<49DADD16.2010507@mcs.anl.gov>
<1239080984.16125.1.camel@localhost>
<49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov>
<1239081995.16125.8.camel@localhost>
<1239084262.16125.18.camel@localhost>
<49DB356B.4050808@mcs.anl.gov> <1239116890.18531.0.camel@localhost>
Message-ID: <49DB6D93.7010900@mcs.anl.gov>
Cool. I interpreted your note below as meaning it's still broken; didn't
realize that r2369 was the latest. Got it, and am building now for Glen and
me to test. I'll re-run the "cats" test.
On 4/7/09 10:08 AM, Mihael Hategan wrote:
> On Tue, 2009-04-07 at 06:13 -0500, Michael Wilde wrote:
>> putting Glen back on cc: Multi-site coaster runs will not work until
>> Mihael posts a fix.
>
> What I'm saying below is that the fix is in cog r2369.
>
>> On 4/7/09 1:04 AM, Mihael Hategan wrote:
>>> On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote:
>>>
>>>>> I was trying to gather that alolng with several
>>>>> other anomalies in another report.
>>>> Now, the oddity below is that both coaster services are started with the
>>>> same service id. Not only that, the same service id was used for
>>>> subsequent runs (the bootstrap logs contain multiple "runs"). This,
>>>> roughly, makes no sense, but I can't imagine it being cause for
>>>> goodness.
>>> That was just another one of my brilliant ideas. It was dimmed a bit in
>>> cog r2369. Previous to that, and after the big fiddle with the bootstrap
>>> script a while ago, multi-site coaster runs are broken.
>>>
>
From wilde at mcs.anl.gov Tue Apr 7 17:36:01 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 07 Apr 2009 17:36:01 -0500
Subject: [Swift-devel] Possible problem in coaster data transfer
Message-ID: <49DBD551.3000201@mcs.anl.gov>
It looks as if something in swift is garbling data files.
We see this when trying coaster data transfer to circumvent a problem
that the abe gridftp server was reporting (when using gridftp data
transfer).
The oops "pdt" file is the main output of the simulation (the
coordinates of each atom in the folded protein). These files should have
very regular multi-column lines, but in a few we see garbled lines.
This is in run: ci:/home/hockyg/oops/swift/output/abeoutdir.20
These files range from 1.5MB to 3MB in this test. There's one per job, 50
files in this run.
The lines on top are normal; the lines on the bottom are long due to
file corruption.
We've used coaster transfer off and on; we usually do gridftp transfer
and were using coaster transfer in this case while Mihael debugs a
problem that's manifesting as a gridftp error.
Glen suspected he saw such corruption earlier; this run seems to confirm it.
I'm not inclined to go deep into this at the moment, but rather to say
that we'll stick to gridftp transfer for the duration of this paper
writing effort.
- Mike
TOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147
0.047 1.00 0.00
com$
com$ awk 'length($0) > 150 {print $0}' `find | grep pdt`
com$ awk 'length($0) > 120 {print $0}' `find | grep pdt`
ATOM 335 C ASN 0 56 -34.964 1.477 15.043 1.00 0.00
0 13 -2.528 18.017 -1.927 1.00 0.00
ATOM 335 C ASN 0 56 -21.516 -6.860 -31.404 1.00 0.00
91 C ASN 0 32 -10.865 31.809 -15.581 1.00 0.00
ATOM 404 HN ALA 0 68 -10.285 -33.690 -26.233 1.00 0.00
135 CA ALA 0 23 12.808 -6.713 -11.148 1.00 0.00
ATOM 335 C ASN 0 56 0.510 -30.608 0.783 1.00 0.00
LEU 0 2 0.505 3.186 -1.484 1.00 0.00
ATOM 335 C ASN 0 56 -3.155 25.367 -4.095 1.00 83 C
ALA 0 64 5.541 11.559 -1.063 1.00 0.00
ATOM 404 HN ALA 0 68 66.525 32.704 -21.958 GLN 0 57
135 CA ALA 0 23 19.234 14.087 -7.779 1.00 0.00
ATOM 335 C ASN 0 56 16.926 -3.414 -5.774 1.00 0.00
EU 0 43 13.554 22.230 19.827 1.00 0.00
ATOM 335 C ASN 0 56 14.805 34.413 23.907 1.00 0.00
59 -18.300 2.743 -27.536 1.00 0.00
ATOM 335 C ASN 0 56 19.787 15.477 24.896 1.00 0.00
0 13 9.613 11.599 -1.295 1.00 0.00
ATOM 404 HN ALA 0 68 11.882 -14.3 337 N GLN 0 57
135 CA ALA 0 23 21.798 -14.600 -6.379 1.00 0.00
ATOM 112 HA2 GLY 0 19 3.632 -11.142 -24.657 1.00 0. 315
CA LEU 0 53 0.180 -22.479 -33.671 1.00 0.00
ATOM 335 C ASN 0 56 -3.145 -30.419 -39.260 1.00 E 0 66
-8.925 -40.775 -24.402 1.00 0.00
com$
From aespinosa at cs.uchicago.edu Wed Apr 8 02:12:16 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 8 Apr 2009 02:12:16 -0500
Subject: [Swift-devel] jobs finishes but swift reports "execution failed".
Message-ID: <50b07b4b0904080012u668c6921y9012ac066c8156f7@mail.gmail.com>
Waiting for notification for 0 ms
Received notification with 1 messages
Progress: Submitted:1 Active:1
Progress: uninitialized:1 Finished successfully:2
Execution failed:
Could not find any valid host for task "Task(type=UNKNOWN,
identity=urn:cog-1239170783751)" with constraints
{filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 53f253f2,
filenames=[Ljava.lang.String;@53945394, trfqn=cat, tr=cat}
probably in one of the staging components
cog2365 swift 2824 on surveyor BGP
the modification made is just the conversion of "|" to "^". Right, Zhao?
log: http://www.ci.uchicago.edu/~aespinosa/swift/blast-20090408-0144-evyvbf93.log
--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
From qinz at ihpc.a-star.edu.sg Wed Apr 8 03:03:34 2009
From: qinz at ihpc.a-star.edu.sg (Qin Zheng)
Date: Wed, 8 Apr 2009 16:03:34 +0800
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID:
Dear Ben,
Thanks for your detailed reply; it helps me understand scheduling in Swift better.
I wrote from a researcher's perspective, and I understand that for development there are many more practical issues, which are more challenging. I agree with you that scheduling a task after its parents complete is cost effective. It is the best "time" given all the updated info on the completion times of its parents. Also, it makes DAG submission easy (without dependency descriptions) and minimizes the number of job instances in queues. The concern is that at that point the task still needs to be submitted to a queue and wait. This may not be sufficient for workflows with deadlines, where a certain delivery guarantee on response time is necessary. The same applies to the other remaining tasks in the workflow.
I feel that besides offline planning, runtime adaptation is necessary, considering task duration variation (overruns) and faults. But the number of updates should be kept to a minimum and only for the very near future as the workflow proceeds. I am writing a paper on this and hopefully I can share it with you guys in a few weeks. This implies that Swift could submit tasks a little more eagerly, with a short-sighted look-ahead.
Yes, your points on the differences are valid: the replica in my case is used for fault tolerance, while in Swift it could enable a task to run earlier (by submitting a replica to a shorter queue). You mentioned queue time; can you share more on it, for example its accuracy, and also the planned change to some other job prioritization for coasters?
I will be on a Star cruise to Malaysia in a few hours :). If I cannot access email there, I will reply to you guys on Friday when I return to Singapore.
Qin Zheng
-----Original Message-----
From: Ben Clifford [mailto:benc at hawaga.org.uk]
Sent: Tuesday, April 07, 2009 7:31 PM
To: Qin Zheng
Cc: Ian Foster; swift-devel
Subject: RE: [Swift-devel] Re: replication vs site score
Hi.
Most/all of the work that we've done with Swift works with fairly
opportunistic use of resources - we submit work into job queues on one or
more sites, where those job queues are shared with many other users, and
where the runtimes for both our jobs and other users jobs are not well
defined ahead of time.
So whilst we use the word 'scheduling' sometimes in Swift, it's more a case
of "what do we think is the best site to queue a job on right now?" rather
than making an execution plan that we think will be valid for a long
period of time.
Our replication mechanism sounds fairly similar to your pre-scheduled
backups, but I think there are these important differences:
* we don't launch a replica until we think there is a reasonable chance
that the replica will run instead of the original (based on queue time)
* as soon as one of the jobs *starts* running, we cancel all the others.
From what I understand, you do that when one of the jobs *ends*
successfully.
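As a rough sketch of that rule (not the actual Swift/Karajan code; all the
names below are invented for illustration):

    import java.util.List;

    // Illustration only: the replication rule described above.
    // None of these classes exist in Swift; names are made up.
    class ReplicationPolicy {
        private final double expectedQueueTime; // e.g. a moving average of observed queue times

        ReplicationPolicy(double expectedQueueTime) {
            this.expectedQueueTime = expectedQueueTime;
        }

        // launch a replica only once the original has waited longer than we expected
        boolean shouldReplicate(double timeInQueueSoFar) {
            return timeInQueueSoFar > expectedQueueTime;
        }

        // as soon as any copy *starts* running, cancel all the others
        void onJobStarted(Job started, List<Job> copies) {
            for (Job j : copies) {
                if (j != started) {
                    j.cancel();
                }
            }
        }
    }

    interface Job {
        void cancel();
    }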
We do have one situation where we have some pre-allocation of resources,
and that is when coasters are being used. These use the above
opportunistic queuing methods to acquire a worker node for a long period
of time, and then run Swift-level jobs in there, at present on a
first-come, first-served basis. It's likely that we'll change that to have
some other job prioritisation, but still pre-scheduling the jobs.
Where Swift would have trouble working with an ahead-of-time
planner/scheduler is that the module that generates file transfer and
execution tasks from high level SwiftScripts does not submit a dependent
task for scheduling and execution until its predecessors have been
successfully executed.
What the scheduler sees is a stream, over time, of file transfer and
execution tasks that are safe to run immediately.
It might be easy, or it might be hard, to make the Swift code submit more
eagerly, with description of task dependencies, which would allow you to
plug in a pre-planner underneath.
On Tue, 7 Apr 2009, Qin Zheng wrote:
> Prof Foster, thanks for introducing me to the team.
>
> My research interest is on scheduling workflows (DAGs). Ben, we decided
> not to use resubmission in the consideration that a DAG cannot be
> completed when any of its tasks fails, which each time would trigger the
> resubmission\retry of the DAG. Instead, we use fault tolerance by
> pre-scheduling replica (backup) for each task (see enclosure for
> details). The objective is to guarantee that this DAG can be completed
> (in a preplanned manner with fast failover to the backup upon failure)
> before its deadline.
>
> Currently I am also working on workflow scheduling under uncertainties
> of task running times. This work includes priorities tasks based on the
> impact of the variation of its running time on the overall response time
> and offline planning for high-priority tasks as well as runtime
> adaptation for all tasks once up-to-date information is available.
>
> I am looking forward to talking to you guys and knowing your research!
>
> Regards,
> Qin Zheng
> ________________________________
> From: Ian Foster [mailto:foster at anl.gov]
> Sent: Monday, April 06, 2009 10:46 PM
> To: Ben Clifford
> Cc: swift-devel; Qin Zheng
> Subject: Re: [Swift-devel] Re: replication vs site score
>
> Ben:
>
> You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here.
>
> I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection.
>
> Ian.
>
>
> On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:
>
>
> even more rambling... in the context of a scheduler that is doing things
> like prioritising jobs based on more than the order that Swift happened to
> submit them (hopefully I will have a student for this in the summer), I
> think a replicant job should be pushed toward later execution rather than
> earlier execution to reduce the number of replicant jobs in the system at
> any one time.
>
> This is because I suspect (though I have gathered no numerical evidence)
> that given the choice between submitting a fresh job and a replicant job
> (making up terminology here too... mmm), it is almost always better to
> submit the fresh job. Either we end up submitting the replicant job
> eventually (in which case we are no worse off than if we submitted the
> replicant first and then a fresh job); or by delaying the replicant job we
> give that replicant's original a chance to start running and thus do not
> discard our precious time-and-load-dollars that we have already spent on
> queueing that replicant's original.
>
> --
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
> ________________________________
> This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
>
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
From benc at hawaga.org.uk Wed Apr 8 04:48:31 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 8 Apr 2009 09:48:31 +0000 (GMT)
Subject: [Swift-devel] jobs finishes but swift reports "execution failed".
In-Reply-To: <50b07b4b0904080012u668c6921y9012ac066c8156f7@mail.gmail.com>
References: <50b07b4b0904080012u668c6921y9012ac066c8156f7@mail.gmail.com>
Message-ID:
that looks to me like you have tc.data entries for mockblast but not for
cat.
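For reference, a missing entry is a line in tc.data along these lines (the
site name and platform column here are just placeholders and need to match
your sites file):

    sitename   cat   /bin/cat   INSTALLED   INTEL32::LINUX   null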
--
From benc at hawaga.org.uk Wed Apr 8 07:20:41 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 8 Apr 2009 12:20:41 +0000 (GMT)
Subject: [Swift-devel] Possible problem in coaster data transfer
In-Reply-To: <49DBD551.3000201@mcs.anl.gov>
References: <49DBD551.3000201@mcs.anl.gov>
Message-ID:
if you do decide to dig deeper, you can turn on sitedir.keep in
swift.properties and check that the file in the remote shared directory is
uncorrupted for the same run where the staged-out copy appears corrupted.
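i.e. something like this in swift.properties (a minimal sketch; normally the
remote site directory is cleaned up at the end of the run):

    sitedir.keep=true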
--
From hockyg at uchicago.edu Wed Apr 8 09:03:50 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Wed, 08 Apr 2009 09:03:50 -0500
Subject: [Swift-devel] Possible problem in coaster data transfer
In-Reply-To:
References: <49DBD551.3000201@mcs.anl.gov>
Message-ID: <49DCAEC6.1090905@uchicago.edu>
Here you go. Same file from the remote site and from communicado after
transfer by coasterIO
Ben Clifford wrote:
> if you do decide to dig deeper, you can turn on sitedir.keep in
> swift.properties and check that the file in the remote shared directory is
> uncorrupted for the same run that the staged out copy appears corrupted.
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pdt_ci.gz
Type: application/gzip
Size: 46773 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: abe_pdt.gz
Type: application/gzip
Size: 576854 bytes
Desc: not available
URL:
From hategan at mcs.anl.gov Wed Apr 8 10:03:07 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 10:03:07 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
Message-ID: <1239202987.12586.17.camel@localhost>
On Wed, 2009-04-08 at 16:03 +0800, Qin Zheng wrote:
> Dear Ben,
>
> Thanks for your detailed reply and it helps me understand scheduling in Swift better.
>
> I wrote from a researcher perspective and I understand that for
> development, there are much more practical issues and are more
> challenging. I agree with you that scheduling a task after its parents
> completes is cost effective. It is the best "time" given all the
> updated info on the completion times of its parents. Also, it makes
> DAG submission easy (without dependency description) and minimizes the
> number of job instances in queues.
The main reasoning was that it can be dealt with efficiently and that
planning the whole workflow buys us little in a (very) dynamic
environment in which submitting a job one minute later may mean the
difference between 1 minute of queue time and one hour of queue time
(though that's statistically a rare occurrence).
> The concern is that at this time, the task still needs to be
> submitted in queue and wait. This may not be sufficient for workflows
> with deadlines, where certain delivery guarantee in response time is
> necessary.
You need some SLA/QOS to address that. Guessing the average queue time
does not reduce its variation hence the risk of not finishing it by the
time promised. You can use replication (i.e. race competing jobs) to
reduce that variation (assuming that it follows some reasonable
distribution), but I don't see how there could be a guarantee.
> The same applies for other remaining tasks in the workflow.
>
> I felt besides offline planning, runtime adaptation is necessary
> considering task duration variation (overrun) and faults. But the
> number of updates should be kept minimum and only for the very near
> future as the workflow proceeds. I am writing a paper on this and
> hopefully I could share it with you guys in a few weeks. This implies
> that the Swift code could be submitted a little bit more eagerly with
> a short-sighted look ahead.
I remember somebody mentioning (or having implemented) a similar scheme.
If we have dependent jobs A and B, in Swift that would go something
like:
Qa + Ra + Qb + Rb (where Qx = queuing time and Rx = run time)
But there's also the possibility of submitting B earlier, by the average
queue time or less, and then having it wait until A produces its results.
But then again, that's pretty much what glide-ins/coasters do.
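As a made-up illustration of the trade-off: with Qa = Qb = 30 min and
Ra = 60 min, submitting B only after A finishes gives 30 + 60 + 30 + Rb =
120 min + Rb, while submitting B about 30 min before A is expected to finish
lets Qb overlap with Ra, for roughly 90 min + Rb, at the cost of B possibly
sitting idle on a node if A overruns.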
Mihael
From benc at hawaga.org.uk Wed Apr 8 10:08:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 8 Apr 2009 15:08:04 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239202987.12586.17.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
Message-ID:
On Wed, 8 Apr 2009, Mihael Hategan wrote:
This:
> planning the whole workflow buys us little in a (very) dynamic
> environment in which submitting a job one minute later may mean the
> difference between 1 minute of queue time and one hour of queue time
and this:
> You need some SLA/QOS to address that.
seem to be significant characteristics that make the environments we run
on not amenable to scheduling in the traditional sense. The lack of any
meaningful guarantees about almost anything time-related makes everything
basically opportunistic rather than scheduled.
--
From hategan at mcs.anl.gov Wed Apr 8 14:53:28 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 14:53:28 -0500
Subject: [Swift-devel] updates
Message-ID: <1239220408.15551.4.camel@localhost>
There are some fixes in cog r2381, most notably:
- gridftp sessions were sometimes left in a messed state leading to
subsequent transfers throwing obscure errors
- coaster workers were left in an inconsistent state when jobs submitted
to them exceeded their walltimes and the remaining runtime of the
workers
- an alleged fix for "qsub not found". This tied in to our earlier
problems with finding executables. Even though, for example, java was
found using bash -l, the process wasn't subsequently started using bash
-l, leading to qsub not being in the path. The current scheme assumes
that either everything needed can be found using bash -l or everything
needed can be found without bash -l. I suppose some corner cases may
still exist, but they seem unlikely.
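The difference is easy to see by hand; something like the following (the path
shown is made up):

    # login shell: profile scripts run, so the scheduler's bin directory is on PATH
    bash -l -c 'which qsub'    # e.g. /usr/local/torque/bin/qsub (hypothetical)

    # non-login shell: profile scripts are skipped, so the same lookup fails
    bash -c 'which qsub'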
From hategan at mcs.anl.gov Wed Apr 8 15:44:44 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 15:44:44 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD0B3F.7050903@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
Message-ID: <1239223485.26815.2.camel@localhost>
On Wed, 2009-04-08 at 13:38 -0700, Ioan Raicu wrote:
> Does a batch-queue prediction service help things in any way?
> https://portal.teragrid.org/gridsphere/gridsphere?cid=queue-prediction
>
> I've always wondered how the Swift scheduler would behave differently
> if it had statistical information about queue times.
It would help. Statistically.
> Qin, have you compared your job replication strategy with one that
> was cognizant of the expected wait queue time, in order to meet
> deadlines? On the surface, assuming that the batch queue prediction is
> accurate, it would seem that scheduling with known queue times might
> solve the same deadline cognizant scheduling problem, but without
> wasting resources by unnecessary replication.
The replication isn't unnecessary. If it starts, it starts because the
queue time is larger than the expected queue time.
> The place where the queue prediction doesn't help, is when there is a
> bad node which causes an application to be slow or fail.
No. The prediction doesn't help when it fails to predict accurately.
> In this case, replication is probably the better recourse to
> guarantee meeting deadlines.
From iraicu at cs.uchicago.edu Wed Apr 8 15:38:23 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 08 Apr 2009 13:38:23 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
Message-ID: <49DD0B3F.7050903@cs.uchicago.edu>
Does a batch-queue prediction service help things in any way?
https://portal.teragrid.org/gridsphere/gridsphere?cid=queue-prediction
I've always wondered how the Swift scheduler would behave differently if
it had statistical information about queue times. Qin, have you compared
your job replication strategy with one that was cognizant of the
expected wait queue time, in order to meet deadlines? On the surface,
assuming that the batch queue prediction is accurate, it would seem that
scheduling with known queue times might solve the same deadline
cognizant scheduling problem, but without wasting resources by
unnecessary replication. The place where the queue prediction doesn't
help, is when there is a bad node which causes an application to be slow
or fail. In this case, replication is probably the better recourse to
guarantee meeting deadlines.
Here is their latest paper on this:
http://www.springerlink.com/content/7552901360631246/fulltext.pdf. The
system is deployed on the TeraGrid, and has been for a few years now. As
far as I have heard, it is quite robust and accurate.
Cheers,
Ioan
Ben Clifford wrote:
> On Wed, 8 Apr 2009, Mihael Hategan wrote:
>
> This:
>
>
>> planning the whole workflow buys us little in a (very) dynamic
>> environment in which submitting a job one minute later may mean the
>> difference between 1 minute of queue time and one hour of queue time
>>
>
> and this:
>
>
>> You need some SLA/QOS to address that.
>>
>
> seem to be significant characteristics that make the environments we run
> on not amenable to scheduling in the traditional sense. The lack of any
> meaningful guarantees about almost anything time-related makes everything
> basically opportunistic rather than scheduled.
>
>
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From iraicu at cs.uchicago.edu Wed Apr 8 15:46:28 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 08 Apr 2009 13:46:28 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239223485.26815.2.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<1239223485.26815.2.camel@localhost>
Message-ID: <49DD0D24.2010909@cs.uchicago.edu>
Mihael Hategan wrote:
>> The place where the queue prediction doesn't help, is when there is a
>> bad node which causes an application to be slow or fail.
>>
>
> No. The prediction doesn't help when it fails to predict accurately.
>
>
The prediction that I was referring to was only for the queue time, not
the execution time. A failed node, causing an application run time to be
longer than expected, has no impact on the prediction of the wait queue
time.
Ioan
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From hategan at mcs.anl.gov Wed Apr 8 15:54:30 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 15:54:30 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD0D24.2010909@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu>
Message-ID: <1239224070.27089.4.camel@localhost>
On Wed, 2009-04-08 at 13:46 -0700, Ioan Raicu wrote:
>
>
> Mihael Hategan wrote:
> > > The place where the queue prediction doesn't help, is when there is a
> > > bad node which causes an application to be slow or fail.
> > >
> >
> > No. The prediction doesn't help when it fails to predict accurately.
> >
> >
> The prediction that I was referring to was only for the queue time,
> not the execution time. A failed node, causing an application run time
> to be longer than expected, has no impact on the prediction of the
> wait queue time.
You're right. I was trying to say that fundamentally the problem of
uncertainty in queue times will remain, by virtue of the fact that the
times when people submit jobs (as well as the number of jobs) are
unpredictable and can affect other people's job queue times.
The predictor in the paper answers the question "if you were to submit
your job before the state of the queue changes in any way, what would be
the expected queue time for the job" and not "what will be the queue
time for the job".
From iraicu at cs.uchicago.edu Wed Apr 8 15:58:10 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 08 Apr 2009 13:58:10 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239224070.27089.4.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu>
<1239224070.27089.4.camel@localhost>
Message-ID: <49DD0FE2.3000505@cs.uchicago.edu>
Mihael Hategan wrote:
>
> You're right. I was trying to say that fundamentally the problem of
> uncertainty in queue times will remain by virtue of the fact that the
> times when people submit jobs (as well as the amount of jobs) is
> unpredictable and it can affect other people's job queue times.
>
> The predictor in the paper answers the question "if you were to submit
> your job before the state of the queue changes in any way, what would be
> the expected queue time for the job" and not "what will be the queue
> time for the job".
>
>
Yes, it's possible that between a prediction query and the actual
submission the state of the queues changes, and therefore so does the
actual result. But every prediction comes with some error bounds, so
it's possible that the change in queue state might be reflected in the
error bars. Nevertheless, I think it might be an interesting improvement
to the current Swift scheduler. Ben, was this on the list of Google
summer of code projects? If not, perhaps you might want to add it.
Ioan
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From hategan at mcs.anl.gov Wed Apr 8 16:32:00 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 16:32:00 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD0FE2.3000505@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu> <1239224070.27089.4.camel@localhost>
<49DD0FE2.3000505@cs.uchicago.edu>
Message-ID: <1239226320.3974.5.camel@localhost>
On Wed, 2009-04-08 at 13:58 -0700, Ioan Raicu wrote:
>
>
> Mihael Hategan wrote:
> >
> > You're right. I was trying to say that fundamentally the problem of
> > uncertainty in queue times will remain by virtue of the fact that the
> > times when people submit jobs (as well as the amount of jobs) is
> > unpredictable and it can affect other people's job queue times.
> >
> > The predictor in the paper answers the question "if you were to submit
> > your job before the state of the queue changes in any way, what would be
> > the expected queue time for the job" and not "what will be the queue
> > time for the job".
> >
> >
> Yes, its possible that between a query of prediction, and actual
> submission, the state of the queues change, and therefore the actual
> result change. But, every prediction comes with some error bounds, so
> its possible that the change in queue state, might be reflected in the
> error bars.
I don't know... The system predicted that a 2 minute job on Abe would
sit 11.2 hours in the queue and 2.4 hours on QueenBee, but I've run 20
such jobs on both in the past 15 minutes.
From iraicu at cs.uchicago.edu Wed Apr 8 18:00:37 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 08 Apr 2009 16:00:37 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239226320.3974.5.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu>
<1239224070.27089.4.camel@localhost>
<49DD0FE2.3000505@cs.uchicago.edu>
<1239226320.3974.5.camel@localhost>
Message-ID: <49DD2C95.3030706@cs.uchicago.edu>
Aha, but I think the predictions are upper bounds, not upper and lower
bounds. In essence, when they predict that your job will wait for 11.2
hours, with 95% confidence, and your job runs in 15 minutes, then in no
way have they made a prediction in error. Now, if they had predicted 1
minute, and it took 15 minutes, then it would have been an error. It is
possible that they do not use knowledge of back-filling, which would make
small jobs run immediately although they would predict a long queue wait
time, as if no back-filling were enabled. It's not clear how customized
the predictor is to the scheduler and the features of the LRM, so there is
certainly room for being pessimistic in their predictions.
Ioan
Mihael Hategan wrote:
> On Wed, 2009-04-08 at 13:58 -0700, Ioan Raicu wrote:
>
>> Mihael Hategan wrote:
>>
>>> You're right. I was trying to say that fundamentally the problem of
>>> uncertainty in queue times will remain by virtue of the fact that the
>>> times when people submit jobs (as well as the amount of jobs) is
>>> unpredictable and it can affect other people's job queue times.
>>>
>>> The predictor in the paper answers the question "if you were to submit
>>> your job before the state of the queue changes in any way, what would be
>>> the expected queue time for the job" and not "what will be the queue
>>> time for the job".
>>>
>>>
>>>
>> Yes, its possible that between a query of prediction, and actual
>> submission, the state of the queues change, and therefore the actual
>> result change. But, every prediction comes with some error bounds, so
>> its possible that the change in queue state, might be reflected in the
>> error bars.
>>
>
> I don't know... The system predicted that a 2 minute job on Abe would
> sit 11.2 hours in the queue and 2.4 hours on QueenBee, but I've ran 20
> such jobs on both in the past 15 minutes.
>
>
>
>
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From hategan at mcs.anl.gov Wed Apr 8 21:16:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 08 Apr 2009 21:16:58 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD2C95.3030706@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost>
<49DD0D24.2010909@cs.uchicago.edu> <1239224070.27089.4.camel@localhost>
<49DD0FE2.3000505@cs.uchicago.edu> <1239226320.3974.5.camel@localhost>
<49DD2C95.3030706@cs.uchicago.edu>
Message-ID: <1239243418.17988.39.camel@localhost>
On Wed, 2009-04-08 at 16:00 -0700, Ioan Raicu wrote:
> Aha, but I think the predictions are upper bounds, not upper and lower
> bounds. In essence, when they predict that your job will wait for 11.2
> hours, with 95% confidence, and your job runs in 15 minutes, then in
> no way have they made a prediction in error.
Heh. "It's not even wrong".
From foster at anl.gov Thu Apr 9 06:27:31 2009
From: foster at anl.gov (Ian Foster)
Date: Thu, 9 Apr 2009 06:27:31 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DD0B3F.7050903@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
Message-ID: <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Hi,
I wanted to point out that when we use Falkon/coasters, we have full
control over scheduling, so in that case we could in principle pre-
compute schedules. However, in practice we still don't tend to have
enough information about execution times for this to be that useful.
At least that's my belief.
I assume that estimates of queue time bounds would surely be helpful
for determining where to send things, and whether a job was stuck.
Ian.
From hategan at mcs.anl.gov Thu Apr 9 10:30:54 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 09 Apr 2009 10:30:54 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Message-ID: <1239291054.32146.14.camel@localhost>
On Thu, 2009-04-09 at 06:27 -0500, Ian Foster wrote:
> Hi,
>
> I wanted to point out that when we use Falkon/coasters, we have full
> control over scheduling,
Once we get the nodes, yes.
> so in that case we could in principle pre-
> compute schedules. However, in practice we still don't tend to have
> enough information about execution times for this to be that useful.
> At least that's my belief.
>
> I assume that estimates of queue time bounds would surely be helpful
> for determining where to send things, and whether a job was stuck.
>
> Ian.
From benc at hawaga.org.uk Thu Apr 9 10:30:37 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 9 Apr 2009 15:30:37 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Message-ID:
On Thu, 9 Apr 2009, Ian Foster wrote:
> I wanted to point out that when we use Falkon/coasters, we have full control
> over scheduling, so in that case we could in principle pre-compute schedules.
Coasters as they are now are still allocated on an opportunistic basis, so
once we have a coaster, stuff could be scheduled to it, but when coaster
workers actually exist is as unknown as when jobs will run in the
non-coaster case, I think.
Where Falkon has been used for pre-allocated resources on machines, with
no dynamic allocation/unallocation, though, the available resources
probably are known well enough for this.
> However, in practice we still don't tend to have enough information about
> execution times for this to be that useful. At least that's my belief.
yes.
--
From hategan at mcs.anl.gov Thu Apr 9 10:49:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 09 Apr 2009 10:49:25 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Message-ID: <1239292165.339.1.camel@localhost>
On Thu, 2009-04-09 at 15:30 +0000, Ben Clifford wrote:
> On Thu, 9 Apr 2009, Ian Foster wrote:
>
> > I wanted to point out that when we use Falkon/coasters, we have full control
> > over scheduling, so in that case we could in principle pre-compute schedules.
>
> Coasters as they are now are still allocated on an opportunistic basis, so
> once we have a coaster stuff could be scheduled to it, but when coaster
> workers actually exist is as unknown as when jobs will run in the
> non-coaster case, I think.
>
> Where Falkon has been used for pre-allocated resources on machines, with
> no dynamic allocation/unallocation, though, the available resources
> probably are known well enough for this.
Except when using pre-allocated resources, you are still waiting for
them, but the waiting is not automated.
>
> > However, in practice we still don't tend to have enough information about
> > execution times for this to be that useful. At least that's my belief.
>
> yes.
>
From benc at hawaga.org.uk Thu Apr 9 10:49:50 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 9 Apr 2009 15:49:50 +0000 (GMT)
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239292165.339.1.camel@localhost>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
<1239292165.339.1.camel@localhost>
Message-ID:
On Thu, 9 Apr 2009, Mihael Hategan wrote:
> Except when using pre-allocated resources, you are still waiting for
> them, but the waiting is not automated.
Also you have chosen to not attempt to opportunistically get any more once
you have decided you have waited enough.
--
From hategan at mcs.anl.gov Thu Apr 9 10:57:14 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 09 Apr 2009 10:57:14 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
<1239292165.339.1.camel@localhost>
Message-ID: <1239292634.406.6.camel@localhost>
On Thu, 2009-04-09 at 15:49 +0000, Ben Clifford wrote:
> On Thu, 9 Apr 2009, Mihael Hategan wrote:
>
> > Except when using pre-allocated resources, you are still waiting for
> > them, but the waiting is not automated.
>
> Also you have chosen to not attempt to opportunistically get any more once
> you have decided you have waited enough.
>
Right. Overall it leads to inefficiencies and wasted cpu-hours, but it
gives you a known set of resources, which is valuable. I think the known
set of resources part can be achieved anyway if there were that
back-channel mentioned in random chatter that informed Swift about the
available nodes.
From iraicu at cs.uchicago.edu Thu Apr 9 11:49:33 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 09 Apr 2009 09:49:33 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To:
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
Message-ID: <49DE271D.8030003@cs.uchicago.edu>
Right, Falkon supports both static and dynamic allocation of resources.
I believe coaster only supports dynamic allocation of resources. We have
lots of information under static allocation, that could help scheduling,
but under dynamic allocation, there is a mixture of known information
(the already allocated resources) and the unknown (the jobs in the wait
queue). In a sense, a smarter scheduler could make use of at least known
information, although this information might frequently change, and the
scheduler would have to adapt frequently.
Ioan
Ben Clifford wrote:
> On Thu, 9 Apr 2009, Ian Foster wrote:
>
>
>> I wanted to point out that when we use Falkon/coasters, we have full control
>> over scheduling, so in that case we could in principle pre-compute schedules.
>>
>
> Coasters as they are now are still allocated on an opportunistic basis, so
> once we have a coaster stuff could be scheduled to it, but when coaster
> workers actually exist is as unknown as when jobs will run in the
> non-coaster case, I think.
>
> Where Falkon has been used for pre-allocated resources on machines, with
> no dynamic allocation/unallocation, though, the available resources
> probably are known well enough for this.
>
>
>> However, in practice we still don't tend to have enough information about
>> execution times for this to be that useful. At least that's my belief.
>>
>
> yes.
>
>
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From foster at anl.gov Thu Apr 9 11:51:27 2009
From: foster at anl.gov (Ian Foster)
Date: Thu, 9 Apr 2009 11:51:27 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <49DE271D.8030003@cs.uchicago.edu>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
<49DE271D.8030003@cs.uchicago.edu>
Message-ID: <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov>
I didn't appreciate that about Coaster. It should (IMHO) support
static allocation, as a special case. People will clearly want that.
On Apr 9, 2009, at 11:49 AM, Ioan Raicu wrote:
> Right, Falkon supports both static and dynamic allocation of
> resources. I believe coaster only supports dynamic allocation of
> resources. We have lots of information under static allocation, that
> could help scheduling, but under dynamic allocation, there is a
> mixture of known information (the already allocated resources) and
> the unknown (the jobs in the wait queue). In a sense, a smarter
> scheduler could make use of at least known information, although
> this information might frequently change, and the scheduler would
> have to adapt frequently.
>
> Ioan
>
> Ben Clifford wrote:
>>
>> On Thu, 9 Apr 2009, Ian Foster wrote:
>>
>>
>>> I wanted to point out that when we use Falkon/coasters, we have
>>> full control
>>> over scheduling, so in that case we could in principle pre-compute
>>> schedules.
>>>
>> Coasters as they are now are still allocated on an opportunistic
>> basis, so
>> once we have a coaster stuff could be scheduled to it, but when
>> coaster
>> workers actually exist is as unknown as when jobs will run in the
>> non-coaster case, I think.
>>
>> Where Falkon has been used for pre-allocated resources on machines,
>> with
>> no dynamic allocation/unallocation, though, the available resources
>> probably are known well enough for this.
>>
>>
>>> However, in practice we still don't tend to have enough
>>> information about
>>> execution times for this to be that useful. At least that's my
>>> belief.
>>>
>> yes.
>>
>>
>
> --
> ===================================================
> Ioan Raicu, Ph.D.
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dev.globus.org/wiki/Incubator/Falkon
> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hategan at mcs.anl.gov Thu Apr 9 11:56:30 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 09 Apr 2009 11:56:30 -0500
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov>
References:
<1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov>
<1239202987.12586.17.camel@localhost>
<49DD0B3F.7050903@cs.uchicago.edu>
<4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov>
<49DE271D.8030003@cs.uchicago.edu>
<7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov>
Message-ID: <1239296190.1717.1.camel@localhost>
On Thu, 2009-04-09 at 11:51 -0500, Ian Foster wrote:
> I didn't appreciate that about Coaster. It should (IMHO) support
> static allocation, as a special case. People will clearly want that.
Yes. People clearly make irrational choices.
>
>
>
> On Apr 9, 2009, at 11:49 AM, Ioan Raicu wrote:
>
> > Right, Falkon supports both static and dynamic allocation of
> > resources. I believe coaster only supports dynamic allocation of
> > resources. We have lots of information under static allocation, that
> > could help scheduling, but under dynamic allocation, there is a
> > mixture of known information (the already allocated resources) and
> > the unknown (the jobs in the wait queue). In a sense, a smarter
> > scheduler could make use of at least known information, although
> > this information might frequently change, and the scheduler would
> > have to adapt frequently.
> >
> > Ioan
> >
> > Ben Clifford wrote:
> > > On Thu, 9 Apr 2009, Ian Foster wrote:
> > >
> > >
> > > > I wanted to point out that when we use Falkon/coasters, we have full control
> > > > over scheduling, so in that case we could in principle pre-compute schedules.
> > > >
> > > Coasters as they are now are still allocated on an opportunistic basis, so
> > > once we have a coaster stuff could be scheduled to it, but when coaster
> > > workers actually exist is as unknown as when jobs will run in the
> > > non-coaster case, I think.
> > >
> > > Where Falkon has been used for pre-allocated resources on machines, with
> > > no dynamic allocation/unallocation, though, the available resources
> > > probably are known well enough for this.
> > >
> > >
> > > > However, in practice we still don't tend to have enough information about
> > > > execution times for this to be that useful. At least that's my belief.
> > > >
> > > yes.
> > >
> > >
> >
> > --
> > ===================================================
> > Ioan Raicu, Ph.D.
> > ===================================================
> > Distributed Systems Laboratory
> > Computer Science Department
> > University of Chicago
> > 1100 E. 58th Street, Ryerson Hall
> > Chicago, IL 60637
> > ===================================================
> > Email: iraicu at cs.uchicago.edu
> > Web: http://www.cs.uchicago.edu/~iraicu
> > http://dev.globus.org/wiki/Incubator/Falkon
> > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> > ===================================================
> > ===================================================
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From iraicu at cs.uchicago.edu Thu Apr 9 11:52:48 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 09 Apr 2009 09:52:48 -0700
Subject: [Swift-devel] Re: replication vs site score
In-Reply-To: <1239292165.339.1.camel@localhost>
References: