From bugzilla-daemon at mcs.anl.gov Wed Apr 1 07:32:24 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 1 Apr 2009 07:32:24 -0500 (CDT)
Subject: [Swift-devel] [Bug 191] New: procedures invoked inside iterate{} don't get unique execution IDs
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=191

           Summary: procedures invoked inside iterate{} don't get unique execution IDs
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Log processing and plotting
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk
                CC: swift-devel at ci.uchicago.edu

iterate {} is more serialised than I intended. It executes each body inside the same single thread. Consequently, each iteration of the loop body does not end up with a unique thread prefix, and then execute IDs, which are based on thread ID, end up duplicated between invocations.

I made the following hack for the specific purpose of provenance challenge 3, as it provides enough uniqueness, albeit inelegantly, for that project. More properly, fixing bug 154 (iterate construct causes overserialisation of execution) could make this problem go away.

Author: Ben Clifford
Date: Tue Mar 31 16:20:41 2009 +0100

    make iterate give each iteration a unique thread ID. this is possibly unsafe.
    in addition, it does not give a unique ID to the termination condition distinct from the body of the loop, which can probably give non-unique IDs when procedure calls are made both in the loop body and in the termination condition

diff --git a/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java b/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java
index 0d173c3..c6d4e89 100644
--- a/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java
+++ b/src/org/griphyn/vdl/karajan/lib/InfiniteCountingWhile.java
@@ -9,6 +9,7 @@ package org.griphyn.vdl.karajan.lib;
 import java.util.Arrays;
 import java.util.List;
+import org.globus.cog.karajan.util.ThreadingContext;
 import org.globus.cog.karajan.workflow.nodes.*;
 import org.globus.cog.karajan.stack.VariableStack;
 import org.globus.cog.karajan.util.TypeUtil;
@@ -26,6 +27,8 @@ public class InfiniteCountingWhile extends Sequential {
     public void pre(VariableStack stack) throws ExecutionException {
         stack.setVar("#condition", new Condition());
+        ThreadingContext tc = (ThreadingContext)stack.getVar("#thread");
+        stack.setVar("#thread", tc.split(666));
         stack.setVar(VAR, "$");
         String counterName = (String)stack.getVar(VAR);
         stack.setVar(counterName, Arrays.asList(new Integer[] {new Integer(0)}));
@@ -54,6 +57,8 @@ public class InfiniteCountingWhile extends Sequential {
         }
         if (index >= elementCount()) {
             // starting new iteration
+            ThreadingContext tc = (ThreadingContext)stack.getVar("#thread");
+            stack.setVar("#thread", tc.split(666));
             setIndex(stack, 1);
             fn = (FlowElement) getElement(0);

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are watching the assignee of the bug.
You are watching the reporter.
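The effect of the `tc.split(666)` calls in the patch can be illustrated with a minimal stand-in for Karajan's ThreadingContext. The class below is a hypothetical sketch, not the real implementation: the only assumption is that a split appends a fresh component to the thread-ID prefix, so execute IDs derived from that prefix become unique per iteration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of org.globus.cog.karajan.util.ThreadingContext:
// a thread ID is a dash-separated path, and split(tag) derives a child ID.
class ThreadingContext {
    private final String id;
    private int subthread = 0;

    ThreadingContext(String id) { this.id = id; }

    // Each call yields a distinct child context under this one.
    ThreadingContext split(int tag) {
        return new ThreadingContext(id + "-" + tag + "." + subthread++);
    }

    public String toString() { return id; }
}

public class IterateIds {
    public static void main(String[] args) {
        // Without the fix, every iteration runs in the same context, so
        // execute IDs (thread prefix + call site) collide across iterations.
        // With the fix, each new iteration re-splits the current context:
        ThreadingContext loop = new ThreadingContext("0");
        Set<String> ids = new HashSet<>();
        for (int i = 0; i < 3; i++) {
            loop = loop.split(666);           // fresh prefix per iteration
            ids.add(loop + ":execute2-1");    // hypothetical execute ID
        }
        System.out.println(ids.size()); // prints 3: all execute IDs distinct
    }
}
```

Because each iteration splits the previous iteration's context, the IDs nest ("0-666.0", "0-666.0-666.0", ...), which also mirrors why the real patch is described as a hack rather than a clean fix.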
From bugzilla-daemon at mcs.anl.gov Wed Apr 1 21:00:52 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 1 Apr 2009 21:00:52 -0500 (CDT)
Subject: [Swift-devel] [Bug 116] simple_mapper handling of numbered files in arrays broken
In-Reply-To: 
References: 
Message-ID: <20090402020052.537612CC70@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=116

Mihael Hategan changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hategan at mcs.anl.gov

--- Comment #3 from Mihael Hategan 2009-04-01 21:00:52 ---
Additionally, if non-numerically named files exist (say "test.in"), simple mapper tries to use that name as index, which may or may not be the right thing, but it causes a consistency check failure on the array:

Execution failed:
java.lang.RuntimeException: Array element has index 'test' that does not parse as an integer.

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are watching someone on the CC list of the bug.
You are watching the assignee of the bug.
You are watching the reporter.
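The failure mode is easy to model: a mapper in this style derives each array index from the filename stem, so a non-numeric stem like "test" fails the integer check. The sketch below is a hypothetical reconstruction of that logic (the method name is invented), not the actual simple_mapper code:

```java
// Minimal model of how a simple_mapper-style mapper might derive array
// indices from file names. indexFor is a hypothetical helper name.
public class MapperIndex {

    // Strip the suffix and treat the remaining stem as the array index.
    static int indexFor(String fileName, String suffix) {
        String stem = fileName.substring(0, fileName.length() - suffix.length());
        try {
            return Integer.parseInt(stem);
        } catch (NumberFormatException e) {
            // This reproduces the consistency-check failure from the report:
            throw new RuntimeException("Array element has index '" + stem
                    + "' that does not parse as an integer.");
        }
    }

    public static void main(String[] args) {
        System.out.println(indexFor("0001.in", ".in")); // numeric stem: prints 1
        try {
            indexFor("test.in", ".in");                 // non-numeric stem
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether "test.in" should be skipped or rejected is the open design question in the bug; the sketch only shows why the current behavior surfaces as a RuntimeException.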
From aespinosa at cs.uchicago.edu Wed Apr 1 21:13:55 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 1 Apr 2009 21:13:55 -0500
Subject: [Swift-devel] array args in function apps
Message-ID: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>

type file;

app (file out) cat(file infile[]) {
  cat @infile stdout=@out;
}

file infile[] ;
file out <"test.out">;
out= cat(infile);

Swift svn swift-r2748 cog-r2341

RunID: 20090401-2105-aevaa3o9
Progress:
Execution failed:
    Exception in cat:
Arguments: [1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in 10.in]
Host: localhost
Directory: manyargs-20090401-2105-aevaa3o9/jobs/x/cat-xszn9s8j
stderr.txt: /bin/cat: 1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in 10.in: No such file or directory

stdout.txt:
----

Caused by:
    Exit code 1

---
I remember when I ran using regular arguments, there were commas separating them in swift's logs. maybe 1.in ... 10.in is seen as one string?

log is in http://www.ci.uchicago.edu/~aespinosa/swift/manyargs-20090401-2105-aevaa3o9.log

From aespinosa at cs.uchicago.edu Wed Apr 1 21:44:20 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 1 Apr 2009 21:44:20 -0500
Subject: [Swift-devel] Re: array args in function apps
In-Reply-To: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
References: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
Message-ID: <50b07b4b0904011944q236ede8dg1c6c2175c336855@mail.gmail.com>

I got it working using @filenames:

type file;

app (file out) cat(file infile[]) {
  cat @filenames(infile) stdout=@out;
}

file infile[] ;
file out <"test.out">;
out= cat(infile);

On Wed, Apr 1, 2009 at 9:13 PM, Allan Espinosa wrote:
> type file;
>
> app (file out) cat(file infile[]) {
>   cat @infile stdout=@out;
> }
>
> file infile[] ;
> file out <"test.out">;
> out= cat(infile);
>
> Swift svn swift-r2748 cog-r2341
>
> RunID: 20090401-2105-aevaa3o9
> Progress:
> Execution failed:
>     Exception in cat:
> Arguments: [1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in 10.in]
> Host: localhost
> Directory: manyargs-20090401-2105-aevaa3o9/jobs/x/cat-xszn9s8j
> stderr.txt: /bin/cat: 1.in 2.in 3.in 4.in 5.in 6.in 7.in 8.in 9.in
> 10.in: No such file or directory
>
> stdout.txt:
> ----
>
> Caused by:
>     Exit code 1
>
> ---
> I remember when I ran using regular arguments, there were commas
> separating them in swift's logs. maybe 1.in ... 10.in is seen as one
> string?
>
> log is in http://www.ci.uchicago.edu/~aespinosa/swift/manyargs-20090401-2105-aevaa3o9.log

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From benc at hawaga.org.uk Thu Apr 2 03:10:47 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 2 Apr 2009 08:10:47 +0000 (GMT)
Subject: [Swift-devel] array args in function apps
In-Reply-To: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
References: <50b07b4b0904011913u4fafafc8qe2624780137f1dad@mail.gmail.com>
Message-ID: 

On Wed, 1 Apr 2009, Allan Espinosa wrote:

> maybe 1.in ... 10.in is seen as one
> string?

yes. @filename(a) is even documented that way (which is what @a is an abbreviation for).

--

From bugzilla-daemon at mcs.anl.gov Thu Apr 2 03:21:10 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 2 Apr 2009 03:21:10 -0500 (CDT)
Subject: [Swift-devel] [Bug 192] New: displeasing stack trace when pwd is not writable
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=192

           Summary: displeasing stack trace when pwd is not writable
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk

The logging system outputs the following trace in swift 0.8 when pwd is unwritable:

train02 at vm-125-58:/sw/swift$ swift
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: swift.log (Permission denied)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:272)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:151)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:247)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:123)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:87)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:645)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:603)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:500)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:406)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:432)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:460)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:113)
        at org.apache.log4j.Logger.getLogger(Logger.java:94)
        at org.globus.cog.karajan.Loader.<clinit>(Loader.java:43)
No SwiftScript program specified

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From wilde at mcs.anl.gov Thu Apr 2 21:31:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 02 Apr 2009 21:31:39 -0500
Subject: [Swift-devel] Discussion on next steps for Coasters
Message-ID: <49D5750B.10603@mcs.anl.gov>

I had a brief off-list discussion with Mihael on next steps for coasters. Im posting it here for group discussion and to get us started on the same page. This follows up on discussion a few weeks ago on the same topic.
Rather than try to reorg the email below, Im posting it largely as-is in the interest of effort and time.

Bottom line: Mihael will work on Coasters next, as he suggested in a prior email, taking the next steps to harden them for users, establish a better test mechanism and procedure, and work on some usability & enhancement issues.

- Mike

-------- Original Message --------
Subject: Re: Hi / status / next ?
Date: Thu, 02 Apr 2009 21:01:14 -0500
From: Michael Wilde 
To: Mihael Hategan 
References: <49D551B8.5010105 at mcs.anl.gov> <1238721084.19231.18.camel at localhost>

OK, all sounds good. Many more details to work out, but a short followup below.

On 4/2/09 8:11 PM, Mihael Hategan wrote:
> On Thu, 2009-04-02 at 19:00 -0500, Michael Wilde wrote:
>> Hi Mihael,
>> ...
>> So next on Swift: I think you should do a fairly intensive burst of
>> effort on Coaster stabilization and portability, like you suggested on
>> the list a little while ago.
>
> Right.
>
>> At a very high level, what I want to see is:
>>
>> - solid test suite, so we know its working on an agreed-on and growing
>> set of platforms, mainly the TG, OSG and a few miscellaneous sites the
>> users need
>>
>> - solve the "GT2 / OSG thing", which I *think* involves starting coaster
>> workers from the submit host with GT2 using Condor-G.
>
> The complexity of adding condor-g into the loop will likely be nasty.
> But I'll try.

Before you start, then, especially if its not an obvious answer, lets sanity check with discussion on list. As a proposed update to your design doc.

>
>> - check that coaster shutdown is working.
>
> Is there any reason to believe it's not?

Yeah, some suspicious behavior that we havent been able to pin down (me, Glen) but suspect may be happening.

>
>> Then lower priority:
>>
>> - make it possible to allocate a persistent pool of Coaster workers all
>> at once (say, gimme 1000 nodes on Ranger for 1 hour".
>
> That I think isn't a good idea.
> Here's why, and correct me if I'm missing something:
> - regardless of whether you use it or not, you need to wait for nodes to
> be available. Whether that waiting happens while swift is running or
> not, it still happens.

true

> - once you have a pre-set number of nodes, you need to quickly start
> swift and use them, otherwise you lose allocations. By contrast, in
> automatic mode, swift will use them as soon as they are available

true

> - allocation of a pre-set number of nodes may be delayed if that number
> of nodes is not available. In the automatic mode, swift will use fewer
> nodes when they are available and ramp up to whatever it can get. A
> limiting case, when your 1k nodes will not be available at all, shows
> that the automatic case will yield better performance (your workflow will
> finish).

true

> - better balancing can be done if there are multiple sites with
> automatic allocation.

all true ;)

Only case where its handy is benchmarking a workflow on a known quantity of nodes. Driven in part by fact that on BGP, this is how they are allocated. (But even there we could do multi-block allocation in varying chunks if the allocator was aware of the scheduling policy of the cluster)

So what I was thinking was "ask for N nodes all at once". In all cases, it would be assumed "...and then start your workflow". So it would not need to be a separate allocation. Tied to an option to say "leave my nodes running when wf done" this would I think meet all needs.

But your points above are compelling, hence this feature needs deliberation and is nowhere near the top of the list.

Higher on the list would be demand-based grow-shrink of pool, but in varying sized blocks. And on all systems, I think, you need to free in the same sized blocks (of CPUs) that you allocated in.
It raises another Q: for some sites like TeraPort, which I think places jobs on all cores, independently, in todays coasters implementation, I am assuming the user should not specify coastersPerNode > 1. True? (even though it has 2-core nodes.) We should clarify this in the users guide. I will ask this on the list right now so all can get the answer.

> One advantage to allocating blocks of coasters may be the possibility
> that a single multi-node job is started (so it solves the gt2
> scalability problem, but so does your provisioning point below).

I would be interested in this, both for its intrinsic performance benefits, but also as a short-term solution to the OSG GT2 "overheating" problem. Especially if the Condor-G solution gets complex and takes long to implement and perfect. Ie, as a short term fix with long term benefits, might make sense to do it first, assuming that *it* is not harder than Condor-G provider and coaster integration.

>
>> - other ways to leave coasters running for the next jobs
>
> Right. That may be possible with persistent services instead of the
> current transient scheme.
>
>> - ensure that coaster time allocations are being done sensibly
>>
>> - revisit the coaster "provisioning" mechanism in terms of in what
>> increments workers are allocated and released in
>>
>> - some kind of coaster status display
>>
>> - some way to probe a job thats running on a coaster?
>
> Define "probe".

- ps -f on the running process.
- probe its resource usage (/proc, also ps, etc)
- ls -lR of its jobdir (as these will more often be on /tmp)

We have these needs today; on the BGP under falkon we manually login to the node, but thats cumbersome: hard to find the node; 2-stage login process.

Low prio, a pipe dream. But theoretically do-able.

So, very cool, we are converging on a plan. I'll cc most of the above to the list now.

>
>> Issue a shell
>> command on the worker of the job?
>>
>> - other things I missed.
>>
>> I'll send this to the list for discussion; what I mainly want to
>> understand from you first is your time availability, what you feel you
>> owe swift in terms of compensating from i2u2 hours, and anything you
>> know of on swift that is higher priority than the coaster things above?
>> (I dont, but maybe missing something)
>>
>> Lastly, how is Phong doing, and to what extent can he be self-sufficient
>> if you were to go 100% swift for a while?
>
> I think he'll be able to take over most things. However, with the
> current big push, he's probably not confident enough, so it may have to
> happen after the new version is put into production.

...

From wilde at mcs.anl.gov Fri Apr 3 08:38:08 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 03 Apr 2009 08:38:08 -0500
Subject: [Swift-devel] Probing running jobs
In-Reply-To: <1238732253.22128.12.camel@localhost>
References: <49D551B8.5010105@mcs.anl.gov> <1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost>
Message-ID: <49D61140.1090109@mcs.anl.gov>

Following up on Mihael's question about a feature I listed in the to-do list I proposed for coasters:

On 4/2/09 11:17 PM, Mihael Hategan wrote:
> On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
>>>> - some way to probe a job thats running on a coaster?
>>> Define "probe".
>> - ps -f on the running process.
>> - probe its resource usage (/proc, also ps, etc)
>> - ls -lR of its jobdir (as these will more often be on /tmp)
>>
>> We have these needs today; on the BGP under falkon we manually login to
>> the node, but thats cumbersome: hard to find the node; 2-stage login
>> process.
>>
>> Low prio, a pipe dream. But theoretically do-able.
>
> It should be possible (and somewhat interesting) to have a simple shell
> that can execute stuff on the workers while the job is running, so that
> you can issue your own commands.
>
> The question of how to find the right worker remains.
> Can you go a bit
> deeper into the details? How do you find the node currently (be as
> specific as you can be)?

In the oops workflow, I recall these cases at the moment:

1) Have my (large set of similar) jobs started?

2) Most jobs have finished. Are the remaining ones hung, or proceeding normally but slower for some application- or data-specific reason?

--

For (1), on the BGP, if most or all cores in the partition have apps running on them, we pick any core and login to it. Then to see what that particular app is doing, we tail its log file for progress compared to its CPU time consumption (from ps). Note that its log file is on local disk, because we set the "jobdir on local" option of swiftwrapper.

Logging in to a node means finding its IO node IP addr from a Falkon dynamic config file, ssh-ing to the ION, then telnetting to an arbitrary worker node (these are on 192.168.1.[0-63] private addrs), then running ps and tail. If not all the worker nodes in a processor set are busy, its a nuisance to find one that is. If few are busy, its not practical.

Overall, this technique is just a spot-check to see "are *any* of my jobs running right", ie to see if we've (finally) got their arguments correct, etc.

(1) is better solved with the same technique needed for (2) - given a job, find its ION and worker node IPs, and ssh/telnet directly there, which does not exist but is straightforward. On BGP the WNs are not running ssh, hence the additional nuisance of telnet.

(2) Is theoretically possible, but impractical, until we add a few scripts to trace from a swift job to the falkon service thats running it to the falkon agent thats running it (again, in the bgp case). The data for this exists. So we occasionally need (2) but cant do it.

Regarding "question of how to find the right worker" - this starts with having some sort of ID for each job that the user can use to go from "source code based identity" through job status and then to job location.
(by job here I mean execution of an app() proc).

I have not yet looked at your status monitor, but am eager to try it. So I dont know if you took any steps in there to correlate a job's proc name and args to its status. But thats what I think the user ultimately needs and wants.

For example, in oops, the majority of tasks are either of app "runrama" or "runoops". They have a mixture of scalar and file args. I'd like to see in status something sort of like strace, where syscalls have potentially long arg lists (when formatted) but there's a canonical way to present them in an acceptably compact format with ... ellipsis as needed.

So as app invocations become known to swift, they get IDs starting from 0 (PID-like but not wrapping around), and are listed in the progress log as:

Job 123 is Proc runrama Args 456 input/prot/.../...00.019.1ubq.pdb etc
Job 123 input transfer OK
Job 123 submitted - teraport/coaster92
Job 124 is
Job 125 is
Job 123 output transfer OK
job 123 ended OK

And then, I can say:

probe 123 "ps -ef | grep runrama; tail -3 /tmp/work/*/*/runrama.log"

(for starters).

So the capability depends on having usable IDs for jobs and coasters, maybe more objects, so that the user can specify a job of interest and the system can send the users probe to that job. Something simple, flexible and shell-like is good to start with so we can explore whats needed and ideally create scripts to wrap more powerful capabilities.

From benc at hawaga.org.uk Fri Apr 3 09:33:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Apr 2009 14:33:13 +0000 (GMT)
Subject: [Swift-devel] Probing running jobs
In-Reply-To: <49D61140.1090109@mcs.anl.gov>
References: <49D551B8.5010105@mcs.anl.gov> <1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost> <49D61140.1090109@mcs.anl.gov>
Message-ID: 

Not addressing the bulk of your email, just the bit about IDs.
almost everything in swift that can be identified has some identifier on it from log-processing and provenance work - at least datasets, procedure invocations, job executions, file transfers.

--

From benc at hawaga.org.uk Fri Apr 3 14:10:10 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Apr 2009 19:10:10 +0000 (GMT)
Subject: [Swift-devel] sync
Message-ID: 

Just now I got this on a pbs+nfs cluster (the one at University of Johannesburg that I am involved with). It seems a little degenerate that in failing to record restart information for reliability, the run dies.

Caused by: java.io.SyncFailedException: sync failed
        at java.io.FileDescriptor.sync(Native Method)
        at org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.flush(FlushableLockedFileWriter.java:40)
        at org.globus.cog.karajan.workflow.nodes.restartLog.LogVargOperator.update(LogVargOperator.java:37)
        ... 37 more

--

From hategan at mcs.anl.gov Sat Apr 4 16:34:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 04 Apr 2009 16:34:32 -0500
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To: <49D61140.1090109@mcs.anl.gov>
References: <49D551B8.5010105@mcs.anl.gov> <1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost> <49D61140.1090109@mcs.anl.gov>
Message-ID: <1238880872.8212.13.camel@localhost>

On Fri, 2009-04-03 at 08:38 -0500, Michael Wilde wrote:
> Following up on Mihael's question about a feature I listed in the to-do
> list I proposed for coasters:
>
> On 4/2/09 11:17 PM, Mihael Hategan wrote:
> > On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
> >>>> - some way to probe a job thats running on a coaster?
> >>> Define "probe".
> >> - ps -f on the running process.
> >> - probe its resource usage (/proc, also ps, etc)
> >> - ls -lR of its jobdir (as these will more often be on /tmp)
> >>
> >> We have these needs today; on the BGP under falkon we manually login to
> >> the node, but thats cumbersome: hard to find the node; 2-stage login
> >> process.
> >>
> >> Low prio, a pipe dream. But theoretically do-able.
> >
> > It should be possible (and somewhat interesting) to have a simple shell
> > that can execute stuff on the workers while the job is running, so that
> > you can issue your own commands.
> >
> > The question of how to find the right worker remains. Can you go a bit
> > deeper into the details? How do you find the node currently (be as
> > specific as you can be)?
>
> In the oops workflow, I recall these cases at the moment:
>
> 1) Have my (large set of similar) jobs started?
>
> 2) Most jobs have finished. Are the remaining ones hung, or proceeding
> normally but slower for some application- or data-specific reason?

[...]

In swift r2821 cog r2365 (I think), there is such a feature.

If you start with the console monitor, you can go to the list of jobs. Then select desired job, and push enter to display a detail pane. If the job is in the active state and if it's running on a coaster worker, that detail pane will have an extra button named "Worker Terminal". Pressing that will pop up a simple terminal that can be used to run relatively arbitrary commands on the worker that the job is running on.

It won't run commands that require console input (e.g., vi), so don't try.

It won't start you in the job directory, but in the swift workflow directory. That's because at some point we stopped using the GRAM directory attribute for setting the initial job dir because some silly site on OSG doesn't honor it. I think we should revisit the issue (I suspect there is a solution that works in both cases).
From wilde at mcs.anl.gov Sat Apr 4 16:59:44 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 16:59:44 -0500
Subject: [Swift-devel] coaster status report
Message-ID: <49D7D850.9060701@mcs.anl.gov>

With OOPS Glen was able to get some promising runs queued on Ranger, using the default properties and the sites setting from the SEM runs.

Looking great so far, and above all was very easy to get it going.

Thats very exciting!

One run shows a few (3 out of 100 or so) failures that were retried successfully. We need to track these down, and see if it was a transient app failure or something in swift etc.

Then we turned to Abe and Queenbee. That was amazingly easy to configure and get running. Glen is scaling it up as we speak, trying for 2 sites x 40 jobs x 8 cores = 640 cores between the two.

In initial small tests, though - 50 parallel app() calls - its sending all jobs to abe, none to queenbee. We checked the usual sites, tc things, *seems* ok there. Possibly either a bug or a scheduler anomaly? We'll try with more jobs, and see; will send logs and sites etc files if that anomaly persists at larger scales.

Seems like both these sites have WS-GRAM enabled; we'd like to try that as well, to expand beyond the 40-job per site suggested limit. Would like to get 1000 cores active on this problem. 2 x 60 x 8 or so.

Then will add in a few more fruitful TG sites.

Towards this end, Mihael, if you have the urge to probe at a setting/config that lets us start coasters in 4-8 node batches, this would be a great time to try that. I suspect you dont know yet if that will be easy, hard, or in between?

Another note on coaster boot:

- old problems on Abe with funky limitations on non-login shells seems to have gone away, either from the latest coaster strategy (-l issues?) or from Abe changes.

- on queenbee, initial run got this error:

Could not start coaster service
Caused by:
Task ended before registration was received.
STDOUT: Warning: -jar not understood. Ignoring.
Exception in thread "main" java.lang.NoClassDefFoundError: .tmp.bootstrap.y10420
        at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0)

Turns out default java was 1.4.2 something.

We added @default to .soft to get Java 1.6. Then coasters bootstrapped fine. This was nice to see, that a simple workaround was easy!

At any rate, very productive, very promising, very pleasing to use.

Nice work!

- Mike

From wilde at mcs.anl.gov Sat Apr 4 17:01:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 17:01:53 -0500
Subject: [Swift-devel] Re: Probing running jobs
In-Reply-To: <1238880872.8212.13.camel@localhost>
References: <49D551B8.5010105@mcs.anl.gov> <1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost> <49D61140.1090109@mcs.anl.gov> <1238880872.8212.13.camel@localhost>
Message-ID: <49D7D8D1.3030304@mcs.anl.gov>

Wow! Way cool - I cant wait to try this and the monitor. But need to clone myself. Maybe Glen, you can try this on oops tests...

- Mike

On 4/4/09 4:34 PM, Mihael Hategan wrote:
> On Fri, 2009-04-03 at 08:38 -0500, Michael Wilde wrote:
>> Following up on Mihael's question about a feature I listed in the to-do
>> list I proposed for coasters:
>>
>> On 4/2/09 11:17 PM, Mihael Hategan wrote:
>>> On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
>>>>>> - some way to probe a job thats running on a coaster?
>>>>> Define "probe".
>>>> - ps -f on the running process.
>>>> - probe its resource usage (/proc, also ps, etc)
>>>> - ls -lR of its jobdir (as these will more often be on /tmp)
>>>>
>>>> We have these needs today; on the BGP under falkon we manually login to
>>>> the node, but thats cumbersome: hard to find the node; 2-stage login
>>>> process.
>>>>
>>>> Low prio, a pipe dream. But theoretically do-able.
>>> It should be possible (and somewhat interesting) to have a simple shell
>>> that can execute stuff on the workers while the job is running, so that
>>> you can issue your own commands.
>>>
>>> The question of how to find the right worker remains. Can you go a bit
>>> deeper into the details? How do you find the node currently (be as
>>> specific as you can be)?
>> In the oops workflow, I recall these cases at the moment:
>>
>> 1) Have my (large set of similar) jobs started?
>>
>> 2) Most jobs have finished. Are the remaining ones hung, or proceeding
>> normally but slower for some application- or data-specific reason?
> [...]
>
> In swift r2821 cog r2365 (I think), there is such a feature.
>
> If you start with the console monitor, you can go to the list of jobs.
> Then select desired job, and push enter to display a detail pane. If the
> job is in the active state and if it's running on a coaster worker, that
> detail pane will have an extra button named "Worker Terminal". Pressing
> that will pop up a simple terminal that can be used to run relatively
> arbitrary commands on the worker that the job is running on.
>
> It won't run commands that require console input (e.g., vi), so don't
> try.
>
> It won't start you in the job directory, but the swift workflow
> directory. That's because at some point we stopped using the GRAM
> directory attribute for setting the initial job dir because some silly
> site on OSG doesn't honor it. I think we should revisit the issue (I
> suspect there is a solution that works in both cases).
>

From wilde at mcs.anl.gov Sat Apr 4 17:03:55 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 04 Apr 2009 17:03:55 -0500
Subject: [Swift-devel] coaster status report
In-Reply-To: <49D7D850.9060701@mcs.anl.gov>
References: <49D7D850.9060701@mcs.anl.gov>
Message-ID: <49D7D94B.1090504@mcs.anl.gov>

small clarification here - we had to turn away from Ranger because the queue was gruesome. the 3-failure issue was on abe.
not much to say till we find and examine the log on this one.

- Mike

On 4/4/09 4:59 PM, Michael Wilde wrote:
> With OOPS Glen was able to get some promising runs queued on Ranger,
> using the default properties and the sites setting from the SEM runs.
>
> Looking great so far, and above all was very easy to get it going.
>
> Thats very exciting!
>
> One run shows a few (3 out of 100 or so) failures that were retried
> successfully. We need to track these down, and see if it was a transient
> app failure or something in swift etc.
>
> Then we turned to Abe and Queenbee. That was amazingly easy to configure
> and get running. Glen is scaling it up as we speak, trying for 2 sites x
> 40 jobs x 8 cores = 640 cores between the two.
>
> In initial small tests, though - 50 parallel app() calls - its sending
> all jobs to abe, none to queenbee. We checked the usual sites, tc
> things, *seems* ok there. Possibly either a bug or a scheduler anomaly?
> We'll try with more jobs, and see; will send logs and sites etc files if
> that anomaly persists at larger scales.
>
> Seems like both these sites have WS-GRAM enabled; we'd like to try that
> as well, to expand beyond the 40-job per site suggested limit. Would
> like to get 1000 cores active on this problem. 2 x 60 x 8 or so.
>
> Then will add in a few more fruitful TG sites.
>
> Towards this end, Mihael, if you have the urge to probe at a
> setting/config that lets us start coasters in 4-8 node batches, this
> would be a great time to try that. I suspect you dont know yet if that
> will be easy, hard, or in between?
>
> Another note on coaster boot:
>
> - old problems on Abe with funky limitations on non-login shells seems
> to have gone away, either from the latest coaster strategy (-l issues?)
> or from Abe changes.
>
> - on queenbee, initial run got this error:
>
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT: Warning: -jar not understood. Ignoring.
> Exception in thread "main" java.lang.NoClassDefFoundError: > .tmp.bootstrap.y10420 > at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0) > > Turns out default java was 1.4.2 something. > > We added @default to .soft to get Java 1.6. > Then coasters bootstrapped fine. This was nice to see, that a simple > workaround was easy! > > At any rate, very productive, very promising, very pleasing to use. > > Nice work! > > - Mike > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Sat Apr 4 17:06:43 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 04 Apr 2009 17:06:43 -0500 Subject: [Swift-devel] coaster status report In-Reply-To: <49D7D850.9060701@mcs.anl.gov> References: <49D7D850.9060701@mcs.anl.gov> Message-ID: <1238882803.9038.1.camel@localhost> On Sat, 2009-04-04 at 16:59 -0500, Michael Wilde wrote: > - on queenbee, initial run got this error: > > Could not start coaster service > Caused by: > Task ended before registration was received. > STDOUT: Warning: -jar not understood. Ignoring. > Exception in thread "main" java.lang.NoClassDefFoundError: > .tmp.bootstrap.y10420 > at gnu.gcj.runtime.FirstThread.run() (/usr/lib64/libgcj.so.5.0.0) > > Turns out default java was 1.4.2 something. It looks like default java is GCJ which I wouldn't dare call "Java" because it probably fails too many compliance tests. > > We added @default to .soft to get Java 1.6. > Then coasters bootstrapped fine. This was nice to see, that a simple > workaround was easy! Right. Good call. 
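The GCJ failure above (a 1.4.2-era default `java` that doesn't understand `-jar`) could be caught before bootstrap with a version preflight. A minimal sketch of such a check — the function name and heuristics here are invented for illustration, not part of Swift's bootstrap:

```python
import re

def java_ok_for_coasters(version_output: str) -> bool:
    """Heuristic check of `java -version` output: reject GCJ and anything
    older than Java 1.5. Illustrative only; Swift does not ship this."""
    if "gcj" in version_output.lower():
        return False
    m = re.search(r'version "(\d+)\.(\d+)', version_output)
    if not m:
        return False
    # "1.4.2" parses as (1, 4); require at least (1, 5)
    return (int(m.group(1)), int(m.group(2))) >= (1, 5)

if __name__ == "__main__":
    # A real preflight would capture: java -version 2>&1
    print(java_ok_for_coasters('java version "1.4.2" gij (GNU libgcj)'))  # False
    print(java_ok_for_coasters('java version "1.6.0_13"'))                # True
```

A check like this would have reported the queenbee problem directly instead of via a NoClassDefFoundError from libgcj.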
From hategan at mcs.anl.gov Sat Apr 4 17:08:59 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 04 Apr 2009 17:08:59 -0500 Subject: [Swift-devel] coaster status report In-Reply-To: <49D7D94B.1090504@mcs.anl.gov> References: <49D7D850.9060701@mcs.anl.gov> <49D7D94B.1090504@mcs.anl.gov> Message-ID: <1238882939.9038.4.camel@localhost> On Sat, 2009-04-04 at 17:03 -0500, Michael Wilde wrote: > small clarification here - > > we had to turn away from range because the queue was gruesome. Yeah, but when it starts, it goooees. The beauty of multi-site runs (with replication enabled, which may or may not work properly) is that swift will make the best use of what's there. From wilde at mcs.anl.gov Sat Apr 4 17:15:00 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 04 Apr 2009 17:15:00 -0500 Subject: [Swift-devel] coaster status report In-Reply-To: <1238882939.9038.4.camel@localhost> References: <49D7D850.9060701@mcs.anl.gov> <49D7D94B.1090504@mcs.anl.gov> <1238882939.9038.4.camel@localhost> Message-ID: <49D7DBE4.3070604@mcs.anl.gov> On 4/4/09 5:08 PM, Mihael Hategan wrote: > On Sat, 2009-04-04 at 17:03 -0500, Michael Wilde wrote: >> small clarification here - >> >> we had to turn away from range because the queue was gruesome. > > Yeah, but when it starts, it goooees. > > The beauty of multi-site runs (with replication enabled, which may or > may not work properly) is that swift will make the best use of what's > there. Exactly. And I think Glen's group is eager to use it in exactly that way - send to TeraGrid and walk away, and not even bother manually checking traffic and load etc. Very promising. OOPS seems to compile clean everywhere we have tried, including BGP and Sicortex, and Glen has tested Zhengxiong's ADEM installer on OSG, where he got it installed on 8 sites in a blink. Glen is also working on a tgsites command that gens a correct user-specific sites.xml for TG, so ADEM and general use for both grids is within reach. 
Its all coming together very nice. From benc at hawaga.org.uk Sat Apr 4 17:21:44 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 4 Apr 2009 22:21:44 +0000 (GMT) Subject: [Swift-devel] Re: Probing running jobs In-Reply-To: <1238880872.8212.13.camel@localhost> References: <49D551B8.5010105@mcs.anl.gov> <1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost> <49D61140.1090109@mcs.anl.gov> <1238880872.8212.13.camel@localhost> Message-ID: On Sat, 4 Apr 2009, Mihael Hategan wrote: > It won't start you in the job directory, but the swift workflow > directory. That's because at some point we stopped using the GRAM > directory attribute for setting the initial job dir because some silly > site on OSG doesn't honor it. I think we should revisit the issue (I > suspect there is a solution that works in both cases). I think that has not been the case since at least before the CI SVN started to be used. The first mention of specifying a directory attribute for task:execute in execute2 was r127, which specified wfdir. Before that, no directory was specified at all. The job directory has seemingly always been passed as a parameter of one kind or another to the wrapper script. -- From benc at hawaga.org.uk Sun Apr 5 06:11:28 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 5 Apr 2009 11:11:28 +0000 (GMT) Subject: [Swift-devel] Swift 0.9 release for ~2nd April In-Reply-To: References: Message-ID: On Mon, 23 Mar 2009, Ben Clifford wrote: > > I'd like to put out the Swift 0.9 release on the 2nd of April, with the > > release candidate being made from SVN on the 23rd of March. > > the present trunk seems way too unstable for a release candidate. so not > today. for now I'm planning on looking at making 0.9 again in the 2nd half of April. 
-- From hategan at mcs.anl.gov Sun Apr 5 09:59:29 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 05 Apr 2009 09:59:29 -0500 Subject: [Swift-devel] Re: Probing running jobs In-Reply-To: References: <49D551B8.5010105@mcs.anl.gov> <1238721084.19231.18.camel@localhost> <49D56DEA.104@mcs.anl.gov> <1238732253.22128.12.camel@localhost> <49D61140.1090109@mcs.anl.gov> <1238880872.8212.13.camel@localhost> Message-ID: <1238943569.14220.0.camel@localhost> On Sat, 2009-04-04 at 22:21 +0000, Ben Clifford wrote: > On Sat, 4 Apr 2009, Mihael Hategan wrote: > > > It won't start you in the job directory, but the swift workflow > > directory. That's because at some point we stopped using the GRAM > > directory attribute for setting the initial job dir because some silly > > site on OSG doesn't honor it. I think we should revisit the issue (I > > suspect there is a solution that works in both cases). > > I think that has not been the case since at least before the CI SVN > started to be used. > > The first mention of specifying a directory attribute for task:execute in > execute2 was r127, which specified wfdir. Before that, no directory was > specified at all. > > The job directory has seemingly always been passed as a parameter of one > kind or another to the wrapper script. > You are right. I was confused. From benc at hawaga.org.uk Sun Apr 5 16:51:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 5 Apr 2009 21:51:26 +0000 (GMT) Subject: [Swift-devel] too many site initializations Message-ID: vdl-int.k has this code, which I think is meant to make site initialization happen only once per site (and have only one job in initializing site shared directory progress ticker state). 
element(initSharedDir, [rhost] once(list(rhost, "shared") vdl:setprogress("Initializing site shared directory") However I see things like this: Progress: Selecting site:2932 Initializing site shared directory:102 Submitted:64 Active:69 Finished successfully:204 when I run with around 20 .. 30 OSG sites which suggests the onceness isn't happening there. Don't have time to investigate properly now but it seemed interesting to comment on. -- From bugzilla-daemon at mcs.anl.gov Sun Apr 5 18:09:42 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sun, 5 Apr 2009 18:09:42 -0500 (CDT) Subject: [Swift-devel] [Bug 193] New: replication job cancellation using pbs provider causes spurious console output Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=193 Summary: replication job cancellation using pbs provider causes spurious console output Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CC: swift-devel at ci.uchicago.edu I get lines like this when job replicas are cancelled, where grid.uj.ac.za is a site I submitted to using provider=pbs Canceling job 33353.gridvm.grid.uj.ac.za I guess this is either an spurious print in CoG or is an incorrect log setting in Swift, but not looked any deeper. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Sun Apr 5 18:14:34 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 05 Apr 2009 18:14:34 -0500 Subject: [Swift-devel] using WS GRAM Message-ID: <49D93B5A.8040103@mcs.anl.gov> Glen, try this: to try a few jobs "plain" on abe and qb then try coasters using gt4:gt4:pbs - Mike ps. 
beware, both might be blazing new territory From benc at hawaga.org.uk Sun Apr 5 18:18:46 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 5 Apr 2009 23:18:46 +0000 (GMT) Subject: [Swift-devel] using WS GRAM In-Reply-To: <49D93B5A.8040103@mcs.anl.gov> References: <49D93B5A.8040103@mcs.anl.gov> Message-ID: On Sun, 5 Apr 2009, Michael Wilde wrote: > ps. beware, both might be blazing new territory Is that code for "every time anyone tries GRAM4 on teragrid, it doesn't work?" ;) -- From bugzilla-daemon at mcs.anl.gov Mon Apr 6 06:29:53 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 6 Apr 2009 06:29:53 -0500 (CDT) Subject: [Swift-devel] [Bug 194] New: more analysis for replication Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=194 Summary: more analysis for replication Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: Log processing and plotting AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk One tab/page with information about replication. Stuff that exists already: Comparison for each site of how many jobs were submitted, executed successfully, cancelled for replication (so similar to the execute2 sites table) Queue length distribution - something like the chart 'karajan queued JOB_SUBMISSION cumulative duration'. Stuff that could be collected/generated: over duration of run, how the replication threshold (which is based on mean queue time at the moment) varies. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. 
From benc at hawaga.org.uk Mon Apr 6 06:34:47 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 11:34:47 +0000 (GMT) Subject: [Swift-devel] goodness metrics for replication Message-ID: wondering what 'goodness' metrics are for replication. one is "how many jobs were replicated but the first submission executed (so the replication was in some sense wasted)" I'd be interested in ideas for other metrics -- From benc at hawaga.org.uk Mon Apr 6 08:26:42 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 13:26:42 +0000 (GMT) Subject: [Swift-devel] replication vs site score Message-ID: More ongoing ramblings as I'm making slides about this... I'm not sure at the moment whether a job being cancelled due to replication causes the site's score to change. Maybe cancellation-due-to-other-replica-starting should be regarded as badness and reduce that site's score - "we asked you to run this job but were so slow we essentially regarded you as failing". Maybe they shouldn't be. Either way, it should be documented. -- From bugzilla-daemon at mcs.anl.gov Mon Apr 6 08:47:40 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 6 Apr 2009 08:47:40 -0500 (CDT) Subject: [Swift-devel] [Bug 195] New: info vs karajan states graph doesn't work well when an info file is missing Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=195 Summary: info vs karajan states graph doesn't work well when an info file is missing Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Log processing and plotting AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk For graphs such as this: "Offsets between job submission Active events and start times reported by info." 
when an info file is missing, it appears to be regarded as having start time of the unix epoch, which causes the automatic axes scaling to hide the actual useful information in this graph (which is for jobs where both a karajan and info start/end time are known). Such jobs should probably be omitted from these charts entirely. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Mon Apr 6 09:08:41 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 6 Apr 2009 09:08:41 -0500 (CDT) Subject: [Swift-devel] [Bug 196] New: site score page should show site scores colour coded by site Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=196 Summary: site score page should show site scores colour coded by site Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: Log processing and plotting AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk site score page should show site scores colour coded by site -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Mon Apr 6 09:13:10 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 09:13:10 -0500 Subject: [Swift-devel] Swift issues and next steps on OOPS app In-Reply-To: <49D800B5.4060109@uchicago.edu> References: <49D800B5.4060109@uchicago.edu> Message-ID: <49DA0DF6.5030300@mcs.anl.gov> was: Re: status update On 4/4/09 7:52 PM, Glen Hocky wrote: > Things seem to be kind of working on all machines (including ranger, > which picked up some speed) but not totally. 
So for ranger at the moment we can run default params and hope for 640 cores at a time. We should queue up several science runs of full-scale rounds, and assess the results and run times. > Problems to investigate this week: > swift choking after running lots of jobs successfully (should probably > just ignore these things) I'm not sure which errors you mean here - let's examine them first. Do you mean the "successfully retried" errors? > swift not balancing load across different sites (dumps all ones for my > teragrid sites file onto one site, grr!) Can you send a log of this to the Swift developers? They need that in order to look at this problem. I will do a sanity test of WS-GRAM with coasters on abe and queenbee. If it works we should expand our science runs there. These are good things to do today while BG/P is down. - Mike > > Glen From benc at hawaga.org.uk Mon Apr 6 09:19:10 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 14:19:10 +0000 (GMT) Subject: [Swift-devel] Swift issues and next steps on OOPS app In-Reply-To: <49DA0DF6.5030300@mcs.anl.gov> References: <49D800B5.4060109@uchicago.edu> <49DA0DF6.5030300@mcs.anl.gov> Message-ID: On Mon, 6 Apr 2009, Michael Wilde wrote: > > swift not balancing load across different sites (dumps all ones for my > > teragrid sites file onto one site, grr!) > > Can you send a log of this to the Swift developers? They need that in order to > look at this problem. For this also please send the command line that you invoke Swift with, your sites file and your tc.data. -- From wilde at mcs.anl.gov Mon Apr 6 09:35:16 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 09:35:16 -0500 Subject: [Swift-devel] Swift issues and next steps on OOPS app In-Reply-To: References: <49D800B5.4060109@uchicago.edu> <49DA0DF6.5030300@mcs.anl.gov> Message-ID: <49DA1324.2010204@mcs.anl.gov> and swift.properties? (Aside to devel team: can we snapshot *all* this info into the start of the log? 
It's trivially short compared to the length of most logs) - command line - sites file, tc file - swift.properties I'll file as enh bug if there is agreement. On 4/6/09 9:19 AM, Ben Clifford wrote: > On Mon, 6 Apr 2009, Michael Wilde wrote: > >>> swift not balancing load across different sites (dumps all ones for my >>> teragrid sites file onto one site, grr!) >> Can you send a log of this to the Swift developers? They need that in order to >> look at this problem. > > For this also please send the command line that you invoke Swift with, your > sites file and your tc.data. > From benc at hawaga.org.uk Mon Apr 6 09:39:11 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 14:39:11 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: Message-ID: even more rambling... in the context of a scheduler that is doing things like prioritising jobs based on more than the order that Swift happened to submit them (hopefully I will have a student for this in the summer), I think a replicant job should be pushed toward later execution rather than earlier execution to reduce the number of replicant jobs in the system at any one time. This is because I suspect (though I have gathered no numerical evidence) that given the choice between submitting a fresh job and a replicant job (making up terminology here too... mmm), it is almost always better to submit the fresh job. Either we end up submitting the replicant job eventually (in which case we are no worse off than if we submitted the replicant first and then a fresh job); or by delaying the replicant job we give that replicant's original a chance to start running and thus do not discard our precious time-and-load-dollars that we have already spent on queueing that replicant's original. 
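The ordering argued for here — when both a fresh job and a replica are waiting, submit the fresh one first, FIFO within each class — can be sketched as a two-level priority queue. This is an illustration of the proposal only, with invented names; it is not how Swift's scheduler actually orders submissions:

```python
import heapq
import itertools

# Fresh jobs always leave the queue before replicas; within each class,
# FIFO order is preserved via a monotonically increasing sequence number.
FRESH, REPLICA = 0, 1

class SubmitQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for stable FIFO order

    def push(self, job, kind):
        heapq.heappush(self._heap, (kind, next(self._seq), job))

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = SubmitQueue()
q.push("replica_of_b", REPLICA)  # replication decision made earlier
q.push("c", FRESH)               # c became runnable only after a finished
print(q.pop())  # -> c: the later fresh job overtakes the earlier replica
print(q.pop())  # -> replica_of_b
```

This matches the argument above: delaying replica_of_b costs nothing if the original never runs, and wins if it does.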
-- From benc at hawaga.org.uk Mon Apr 6 09:43:56 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 14:43:56 +0000 (GMT) Subject: [Swift-devel] Swift issues and next steps on OOPS app In-Reply-To: <49DA1324.2010204@mcs.anl.gov> References: <49D800B5.4060109@uchicago.edu> <49DA0DF6.5030300@mcs.anl.gov> <49DA1324.2010204@mcs.anl.gov> Message-ID: On Mon, 6 Apr 2009, Michael Wilde wrote: > I'll file as enh bug if there is agreement. yep -- From foster at anl.gov Mon Apr 6 09:46:29 2009 From: foster at anl.gov (Ian Foster) Date: Mon, 6 Apr 2009 09:46:29 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: Message-ID: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> Ben: You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here. I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection. Ian. On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote: > even more rambling... in the context of a scheduler that is doing > things > like prioritising jobs based on more than the order that Swift > happened to > submit them (hopefully I will have a student for this in the > summer), I > think a replicant job should be pushed toward later execution rather > than > earlier execution to reduce the number of replicant jobs in the > system at > any one time. > > This is because I suspect (though I have gathered no numerical > evidence) > that given the choice between submitting a fresh job and a replicant > job > (making up terminology here too... mmm), it is almost always better to > submit the fresh job. 
Either we end up submitting the replicant job > eventually (in which case we are no worse off than if we submitted the > replicant first and then a fresh job); or by delaying the replicant > job we > give that replicant's original a chance to start running and thus do > not > discard our precious time-and-load-dollars that we have already > spent on > queueing that replicant's original. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Mon Apr 6 10:00:08 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 15:00:08 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> Message-ID: > You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing > jobs that enable new jobs to run. Those ideas seem relevant here. yes, its ongoing thoughts based on that that lead me to thinking about this - more generally, what are the useful things to prioritise work on (both at the Swift level - a SwiftScript procedure call - and at the lower level of file transfers and remote job submissions) > I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who > has been working on the scheduling of replicant jobs. His interest is in doing > this for jobs that have failed, while I think your interest is in scheduling > for jobs that may have failed--a somewhat different thing. But there may be a > connection. Replicated jobs are jobs that the remote job submission system (eg GRAM) says are in a queue but that we think that we can probably run better (i.e. 
quicker or even run at all) by resubmitting; when doing that, we don't cancel the original job and potentially it will be that original job that runs, not the replica. Sometimes that is because the remote queue is "infintely long" (the site is taking jobs and losing them); sometimes its because it is "very long" (eg teraport's 14 day queue when my laptop has a local CPU free and no queue) In your above paragraph, that sounds more like Swift's retry mechanism - when a Swift-level job (SwiftScript procedure call) fails, we submit it again, basically using the same mechanism as with replicated jobs. However, in that case, the original job does not exist any more. -- From bugzilla-daemon at mcs.anl.gov Mon Apr 6 10:00:21 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 6 Apr 2009 10:00:21 -0500 (CDT) Subject: [Swift-devel] [Bug 197] New: Include more runtime environment info in Swift log for debugging Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=197 Summary: Include more runtime environment info in Swift log for debugging Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Keywords: debug Severity: enhancement Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: wilde at mcs.anl.gov More info on the user's runtime environment should be included automatically at the start of the swift .log file, so that developers can do most debugging with just the single .log file. This should include: - command line - sites file - tc file - swift.properties file and could include the swift source code itself (at least for now, when most scripts are very short. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. 
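Bug 197's request — replay the run's configuration into the head of the .log file so one file suffices for debugging — amounts to something like the following at startup. A hedged sketch only: the function name and file list are placeholders, not Swift internals.

```python
import logging
import sys
from pathlib import Path

def snapshot_run_config(log: logging.Logger, config_files):
    """Echo the command line and each config file verbatim into the log.
    Illustrative sketch of the bug 197 proposal; not part of Swift."""
    log.info("command line: %s", " ".join(sys.argv))
    for name in config_files:
        p = Path(name)
        if p.exists():
            log.info("--- begin %s ---\n%s--- end %s ---",
                     name, p.read_text(), name)
        else:
            log.info("%s: not present", name)

# Usage, with the files named in the bug report:
# snapshot_run_config(logger, ["sites.xml", "tc.data", "swift.properties"])
```

As the report notes, these files are trivially short next to a typical run log, so the cost of inlining them is negligible.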
From bugzilla-daemon at mcs.anl.gov Mon Apr 6 10:09:21 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 6 Apr 2009 10:09:21 -0500 (CDT) Subject: [Swift-devel] [Bug 198] New: Add ability to specify execution sites on swift command line Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=198 Summary: Add ability to specify execution sites on swift command line Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Keywords: running Severity: enhancement Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: wilde at mcs.anl.gov Allow a set of sites to be specified on the command line for each run, rather than needing to edit the sites file to make such choices. This feature is (or should be) tied to improvements in how the site data is generated and maintained. Discussion of a design for this feature on the devel list should precede any development. The related issues are: - where to keep the site data that the command line options select from - how to parameterize that data and add options to the selected sites - how site data is generated and customized - how a variety of choices can be specified for the selected sites (eg, use coasters or not; which data movement strategy to use). -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Mon Apr 6 10:20:16 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 10:20:16 -0500 Subject: [Swift-devel] bugzilla keywords Message-ID: <49DA1DB0.8060107@mcs.anl.gov> I've noticed these more in the new bugzilla interface, and so started using them, although I realize the keywords I've created may need rethinking. Are bug keywords of any use to us, or should I stop doing this? 
If of use, we should define a small set that we like and works for all. From benc at hawaga.org.uk Mon Apr 6 10:27:42 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 15:27:42 +0000 (GMT) Subject: [Swift-devel] bugzilla keywords In-Reply-To: <49DA1DB0.8060107@mcs.anl.gov> References: <49DA1DB0.8060107@mcs.anl.gov> Message-ID: On Mon, 6 Apr 2009, Michael Wilde wrote: > I've noticed these more in the new bugzilla interface, and so started using > them, although I realize the keywords I've created may need rethinking. > > Are bug keywords of any use to us, or should I stop doing this? > > If of use, we should define a small set that we like and works for all. I've never come up with a particularly useful use for them within Swift, and I don't think we should use them just because they are there. For the most part, I think even the component classification list is barely used. If you find some use in using them, though, I see no reason why you shouldn't do so. -- From hategan at mcs.anl.gov Mon Apr 6 10:40:38 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 10:40:38 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: Message-ID: <1239032438.30386.8.camel@localhost> On Mon, 2009-04-06 at 14:39 +0000, Ben Clifford wrote: > even more rambling... in the context of a scheduler that is doing things > like prioritising jobs based on more than the order that Swift happened to > submit them (hopefully I will have a student for this in the summer), I > think a replicant job should be pushed toward later execution rather than > earlier execution to reduce the number of replicant jobs in the system at > any one time. You have two extremes: 1. Send each job to all sites instantly. 2. Replicate after +inf time (see _too_much_ below) You're suggesting moving from somewhere in the middle, to somewhere in the middle, but a little to the right. 
> > This is because I suspect (though I have gathered no numerical evidence) > that given the choice between submitting a fresh job and a replicant job > (making up terminology here too... mmm), it is almost always better to > submit the fresh job. Either we end up submitting the replicant job > eventually (in which case we are no worse off than if we submitted the > replicant first and then a fresh job); or by delaying the replicant job we > give that replicant's original a chance to start running and thus do not > discard our precious time-and-load-dollars that we have already spent on > queueing that replicant's original. You are saying this with the awareness of the fact that replicas are only sent after the prototype job sat in the queue (and didn't start running) for what is deemed _too_much_? From benc at hawaga.org.uk Mon Apr 6 10:50:04 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 15:50:04 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239032438.30386.8.camel@localhost> References: <1239032438.30386.8.camel@localhost> Message-ID: On Mon, 6 Apr 2009, Mihael Hategan wrote: > You are saying this with the awareness of the fact that replicas are > only sent after the prototype job sat in the queue (and didn't start > running) for what is deemed _too_much_? I'm not suggesting that we reduce any submission load to remote sites. I'm suggesting a different order for those submissions. The queue delay is not so _too_much_ that we cancel the original on replication; and it appears (though I don't have stats on real runs) that the originals do run sometimes (though it would be interesting to know in real situations how often) Given that, I'm suggesting that a better use of our load capacity is to do it with the ordering I suggested. As far as I can tell, it will not result in slower runs. In the case where originals do run eventually, this should results in faster runs. 
Thinking about it more, I can see a situation where a site is pretty fully loaded queuewise by swift yet never actually runs a job, because by the time a job gets near the front of the queue it has been replicated and run elsewhere. That's an extreme, but I think its the extreme of the same situation I talk about in my original message. -- From hategan at mcs.anl.gov Mon Apr 6 11:06:31 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 11:06:31 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1239032438.30386.8.camel@localhost> Message-ID: <1239033991.31063.3.camel@localhost> On Mon, 2009-04-06 at 15:50 +0000, Ben Clifford wrote: > On Mon, 6 Apr 2009, Mihael Hategan wrote: > > > You are saying this with the awareness of the fact that replicas are > > only sent after the prototype job sat in the queue (and didn't start > > running) for what is deemed _too_much_? > > I'm not suggesting that we reduce any submission load to remote sites. I'm > suggesting a different order for those submissions. > > The queue delay is not so _too_much_ that we cancel the original on > replication; and it appears (though I don't have stats on real runs) that > the originals do run sometimes (though it would be interesting to know in > real situations how often) > > Given that, I'm suggesting that a better use of our load capacity is to do > it with the ordering I suggested. I'm still not following. From what I understand, you are suggesting what's already there. So either that is true and you think the current scheme is not what it is, or I don't understand how your suggestion is different than the current scheme. 
From benc at hawaga.org.uk Mon Apr 6 11:11:31 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 16:11:31 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239033991.31063.3.camel@localhost> References: <1239032438.30386.8.camel@localhost> <1239033991.31063.3.camel@localhost> Message-ID: On Mon, 6 Apr 2009, Mihael Hategan wrote: > I'm still not following. From what I understand, you are suggesting > what's already there. So either that is true and you think the current > scheme is not what it is, or I don't understand how your suggestion is > different than the current scheme. Its not the case, as I understand it, that replica jobs will always be run after primary jobs - they will be run in the order they arrive in the job queue. Jobs that Swift puts in the queue after that replication decision has been made (for example, jobs that were waiting for dependent data) will run after the replicas submitted before that dependent data become available. a=p(x) b=p(y) c=q(a); a, b run. eventually swift gets bored and resubmits b to the local job queue. then a completes, and so c gets queued in the local job queue. replica_of_b gets submitted to a site before c does. or not? -- From hategan at mcs.anl.gov Mon Apr 6 11:34:16 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 11:34:16 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1239032438.30386.8.camel@localhost> <1239033991.31063.3.camel@localhost> Message-ID: <1239035656.31668.10.camel@localhost> On Mon, 2009-04-06 at 16:11 +0000, Ben Clifford wrote: > On Mon, 6 Apr 2009, Mihael Hategan wrote: > > > I'm still not following. From what I understand, you are suggesting > > what's already there. So either that is true and you think the current > > scheme is not what it is, or I don't understand how your suggestion is > > different than the current scheme. 
> > Its not the case, as I understand it, that replica jobs will always be run > after primary jobs - they will be run in the order they arrive in the job > queue. Jobs that Swift puts in the queue after that replication decision > has been made (for example, jobs that were waiting for dependent data) > will run after the replicas submitted before that dependent data become > available. > > a=p(x) > b=p(y) > c=q(a); > > a, b run. eventually swift gets bored and resubmits b to the local job > queue. then a completes, and so c gets queued in the local job queue. > > replica_of_b gets submitted to a site before c does. I see what you're saying now. I think scheduler priorities are not a bad idea. From benc at hawaga.org.uk Mon Apr 6 11:34:57 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 6 Apr 2009 16:34:57 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239035656.31668.10.camel@localhost> References: <1239032438.30386.8.camel@localhost> <1239033991.31063.3.camel@localhost> <1239035656.31668.10.camel@localhost> Message-ID: On Mon, 6 Apr 2009, Mihael Hategan wrote: > I think scheduler priorities are not a bad idea. right - its likely that I get a summer student to play with that sort of stuff, which is what is making me think of what sort of things to prioritise on. -- From wilde at mcs.anl.gov Mon Apr 6 12:10:51 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 12:10:51 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs Message-ID: <49DA379B.7080403@mcs.anl.gov> With this sites entry: TG-CDA070002T /home/ux454325/swiftwork I get the error below. Files are on CI net at /home/wilde/swift/lab. I will try to copy coaster boot logs and gram logs to same place when I find them, in subdirs named by $RunID.logs. 
-- com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift Swift svn swift-r2809 cog-r2350 RunID: 20090406-1155-pgc5nj00 Progress: Progress: Stage in:1 Progress: Submitted:1 Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb Progress: Failed:1 Execution failed: Exception in cat: Arguments: [data.txt] Host: qb Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j stderr.txt: stdout.txt: ---- Caused by: Cannot submit job: Cannot run program "qsub": java.io.IOException: error=2, No such file or directory org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job: Cannot run program "qsub": java.io.IOException: error=2, No such file or directory at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) Caused by: java.io.IOException: Cannot run program "qsub": java.io.IOException: error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at java.lang.Runtime.exec(Runtime.java:593) at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) ... 4 more Caused by: java.io.IOException: java.io.IOException: error=2, No such file or directory at java.lang.UNIXProcess.<init>(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) ... 7 more Cleaning up... 
Shutting down service at https://208.100.92.21:44166 Got channel MetaChannel: 24235184 -> GSSSChannel-null(1) - Done com$ pwd From wilde at mcs.anl.gov Mon Apr 6 12:29:21 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 12:29:21 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs Message-ID: <49DA3BF1.1080206@mcs.anl.gov> With this sites entry: TG-CDA070002T /home/ux454325/swiftwork I get the error below. Files are on CI net at /home/wilde/swift/lab. Coaster boot log is in 20090406-1216-f5k8chdg.logs/ There was no GRAM log on the queenbee site. -- com$ swift -tc.file tc.data -sites.file qb.coasters-gt4-gt4-pbs.xml cat.swift Swift svn swift-r2809 cog-r2350 RunID: 20090406-1216-f5k8chdg Progress: Progress: Stage in:1 The GT4 provider does not support redirection. Redirection requests will be ignored without further warnings. Progress: Submitted:1 Failed to transfer wrapper log from cat-20090406-1216-f5k8chdg/info/0 on qb Progress: Failed:1 Execution failed: Exception in cat: Arguments: [data.txt] Host: qb Directory: cat-20090406-1216-f5k8chdg/jobs/0/cat-0cjfv09j stderr.txt: stdout.txt: ---- Caused by: Could not submit job Caused by: Could not start coaster service Caused by: Task ended before registration was received: Job failed with an exit code of 1 Cleaning up... Done com$ From hategan at mcs.anl.gov Mon Apr 6 13:02:17 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 13:02:17 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs In-Reply-To: <49DA379B.7080403@mcs.anl.gov> References: <49DA379B.7080403@mcs.anl.gov> Message-ID: <1239040937.2410.3.camel@localhost> Yes. This is one of those "can't find executable unless run through 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking how to deal with the situation. 
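[Editorial sketch of the failure mode described above: a process spawned by the coaster service inherits a minimal non-login PATH, so "qsub" (on a directory added only by the login profile) is not found; wrapping the command in "bash -lc" re-reads the login profile first. This is a hypothetical illustration of one way to deal with it, not the actual cog code; the class and method names are invented.]

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class LoginShellExec {
    // Build the argv for running `command` under a login shell, so that
    // PATH entries set up by ~/.profile / ~/.bash_profile are visible.
    static String[] wrapInLoginShell(String command) {
        return new String[] { "/bin/bash", "-lc", command };
    }

    public static void main(String[] args) throws Exception {
        // Print the PATH a login shell would give us; "qsub job.pbs"
        // would be launched the same way.
        Process p = new ProcessBuilder(wrapInLoginShell("echo $PATH"))
                .redirectErrorStream(true).start();
        BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) System.out.println(line);
        p.waitFor();
    }
}
```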
On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote: > With this sites entry: > > > TG-CDA070002T > jobManager="gt2:pbs" /> > > /home/ux454325/swiftwork > > > I get the error below. Files are on CI net at /home/wilde/swift/lab. > > I will try to copy coaster boot logs and gram logs to same place when I > find them, in subdirs named by $RunID.logs. > > -- > > com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift > Swift svn swift-r2809 cog-r2350 > > RunID: 20090406-1155-pgc5nj00 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [data.txt] > Host: qb > Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job: Cannot run program "qsub": > java.io.IOException: error=2, No such file or directory > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job: Cannot run program "qsub": java.io.IOException: > error=2, No such file or directory > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > Caused by: java.io.IOException: Cannot run program "qsub": > java.io.IOException: error=2, No such file or directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) > at java.lang.Runtime.exec(Runtime.java:593) > at > 
org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > ... 4 more > Caused by: java.io.IOException: java.io.IOException: error=2, No such > file or directory > at java.lang.UNIXProcess.(UNIXProcess.java:148) > at java.lang.ProcessImpl.start(ProcessImpl.java:65) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) > ... 7 more > > Cleaning up... > Shutting down service at https://208.100.92.21:44166 > Got channel MetaChannel: 24235184 -> GSSSChannel-null(1) > - Done > com$ pwd > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Apr 6 13:28:03 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 13:28:03 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs In-Reply-To: <49DA3BF1.1080206@mcs.anl.gov> References: <49DA3BF1.1080206@mcs.anl.gov> Message-ID: <1239042483.3445.2.camel@localhost> On Mon, 2009-04-06 at 12:29 -0500, Michael Wilde wrote: > With this sites entry: > > > > TG-CDA070002T > jobManager="gt4:gt4:pbs" /> > > /home/ux454325/swiftwork > > > > I get the error below. Files are on CI net at /home/wilde/swift/lab. > > Coaster boot log is in 20090406-1216-f5k8chdg.logs/ I'll also need ~/.globus/coasters/coasters.log. Sorry for not mentioning it earlier. Normally with gt2, there would be a stdout explanation of what happened, but with gt4 there is no stdout streaming back. 
From wilde at mcs.anl.gov Mon Apr 6 13:33:32 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 13:33:32 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs In-Reply-To: <1239042483.3445.2.camel@localhost> References: <49DA3BF1.1080206@mcs.anl.gov> <1239042483.3445.2.camel@localhost> Message-ID: <49DA4AFC.104@mcs.anl.gov> I just copied coaster.log to that same dir. On 4/6/09 1:28 PM, Mihael Hategan wrote: > On Mon, 2009-04-06 at 12:29 -0500, Michael Wilde wrote: >> With this sites entry: >> >> >> >> TG-CDA070002T >> > jobManager="gt4:gt4:pbs" /> >> >> /home/ux454325/swiftwork >> >> >> >> I get the error below. Files are on CI net at /home/wilde/swift/lab. >> >> Coaster boot log is in 20090406-1216-f5k8chdg.logs/ > > I'll also need ~/.globus/coasters/coasters.log. Sorry for not mentioning > it earlier. > > Normally with gt2, there would be a stdout explanation of what happened, > but with gt4 there is no stdout streaming back. > > From hategan at mcs.anl.gov Mon Apr 6 14:24:59 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 14:24:59 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt4:gt4:pbs In-Reply-To: <49DA4AFC.104@mcs.anl.gov> References: <49DA3BF1.1080206@mcs.anl.gov> <1239042483.3445.2.camel@localhost> <49DA4AFC.104@mcs.anl.gov> Message-ID: <1239045899.4203.3.camel@localhost> On Mon, 2009-04-06 at 13:33 -0500, Michael Wilde wrote: > I just copied coaster.log to that same dir. Unfortunately it does not contain any information on the unfortunate run. I committed a patch to also log to the bootstrap log any errors that may occur during bootstrap.jar startup that may otherwise not be logged to the coasters log (nor reported back to the client due to the middleware). 
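[Editorial sketch of the idea in the patch described above: any exception thrown during bootstrap startup is appended to a local bootstrap log before the process exits, so failures remain diagnosable even when the middleware swallows stdout/stderr. Hypothetical class and file names, not the actual cog commit.]

```java
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.StringWriter;

public class BootstrapLogger {
    // Render a throwable's stack trace as a string.
    static String format(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    // Append a startup failure to a local log file as a last resort.
    static void logStartupError(String logFile, Throwable t) {
        try {
            PrintWriter out = new PrintWriter(new FileWriter(logFile, true));
            out.println("bootstrap startup failed:");
            out.print(format(t));
            out.close();
        } catch (Exception e) {
            e.printStackTrace(); // nothing else left to do
        }
    }

    public static void main(String[] args) {
        try {
            throw new IllegalStateException("simulated bootstrap failure");
        } catch (Throwable t) {
            logStartupError("bootstrap.log", t);
        }
    }
}
```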
From wilde at mcs.anl.gov Mon Apr 6 15:17:29 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 15:17:29 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs In-Reply-To: <1239040937.2410.3.camel@localhost> References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost> Message-ID: <49DA6359.8010207@mcs.anl.gov> Mihael, I just updated our test swift+cog source and rebuilt. Glen is now getting: Caused by: Invalid GSSCredentials org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid GSSCredentials at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:149) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:222) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) Caused by: GSSException: Invalid name provided [Caused by: [JGLOBUS-112] Malformed name, "=" missing in "38356/jobmanager-pbs"] at org.globus.gsi.gssapi.GlobusGSSName.<init>(GlobusGSSName.java:137) at org.globus.gsi.gssapi.GlobusGSSManagerImpl.createName(GlobusGSSManagerImpl.java:304) at org.globus.gsi.gssapi.auth.IdentityAuthorization.getExpectedName(IdentityAuthorization.java:82) at org.globus.gram.Gram.gatekeeperConnect(Gram.java:85) at org.globus.gram.Gram.request(Gram.java:310) at org.globus.gram.GramJob.request(GramJob.java:262) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) ... 5 more What's up here? Any chance I picked up code in transition, or a new problem in recent commits? 
- Mike On 4/6/09 1:02 PM, Mihael Hategan wrote: > Yes. This is one of those "can't find executable unless run through > 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking > how to deal with the situation. > > On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote: >> With this sites entry: >> >> >> TG-CDA070002T >> > jobManager="gt2:pbs" /> >> >> /home/ux454325/swiftwork >> >> >> I get the error below. Files are on CI net at /home/wilde/swift/lab. >> >> I will try to copy coaster boot logs and gram logs to same place when I >> find them, in subdirs named by $RunID.logs. >> >> -- >> >> com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift >> Swift svn swift-r2809 cog-r2350 >> >> RunID: 20090406-1155-pgc5nj00 >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb >> Progress: Failed:1 >> Execution failed: >> Exception in cat: >> Arguments: [data.txt] >> Host: qb >> Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Cannot submit job: Cannot run program "qsub": >> java.io.IOException: error=2, No such file or directory >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> Cannot submit job: Cannot run program "qsub": java.io.IOException: >> error=2, No such file or directory >> at >> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) >> at >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) >> at >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) >> at >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) >> at >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) >> Caused by: 
java.io.IOException: Cannot run program "qsub": >> java.io.IOException: error=2, No such file or directory >> at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) >> at java.lang.Runtime.exec(Runtime.java:593) >> at >> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73) >> at >> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) >> ... 4 more >> Caused by: java.io.IOException: java.io.IOException: error=2, No such >> file or directory >> at java.lang.UNIXProcess.(UNIXProcess.java:148) >> at java.lang.ProcessImpl.start(ProcessImpl.java:65) >> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) >> ... 7 more >> >> Cleaning up... >> Shutting down service at https://208.100.92.21:44166 >> Got channel MetaChannel: 24235184 -> GSSSChannel-null(1) >> - Done >> com$ pwd >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Mon Apr 6 15:20:37 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 15:20:37 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs In-Reply-To: <1239040937.2410.3.camel@localhost> References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost> Message-ID: <49DA6415.1010402@mcs.anl.gov> Mihael, when I svn updated our test swift+cog source and rebuilt, Glen gets the errors below. When I reverted back to last Tuesday Mar 31, this new error does not occur. Does "Caused by: GSSException: Invalid name provided [Caused by: [JGLOBUS-112] Malformed name, "=" missing in "38356/jobmanager-pbs"]" suggest a new error introduced in the commits since Tuesday? This is with coasters and gt2:gt2:pbs. 
- Mike Caused by: Invalid GSSCredentials org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: Invalid GSSCredentials at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:149) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:222) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) Caused by: GSSException: Invalid name provided [Caused by: [JGLOBUS-112] Malformed name, "=" missing in "38356/jobmanager-pbs"] at org.globus.gsi.gssapi.GlobusGSSName.<init>(GlobusGSSName.java:137) at org.globus.gsi.gssapi.GlobusGSSManagerImpl.createName(GlobusGSSManagerImpl.java:304) at org.globus.gsi.gssapi.auth.IdentityAuthorization.getExpectedName(IdentityAuthorization.java:82) at org.globus.gram.Gram.gatekeeperConnect(Gram.java:85) at org.globus.gram.Gram.request(Gram.java:310) at org.globus.gram.GramJob.request(GramJob.java:262) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) ... 5 more What's up here? Any chance I picked up code in transition, or a new problem in recent commits? - Mike On 4/6/09 1:02 PM, Mihael Hategan wrote: > Yes. This is one of those "can't find executable unless run through > 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking > how to deal with the situation. > > On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote: >> With this sites entry: >> >> >> TG-CDA070002T >> > jobManager="gt2:pbs" /> >> >> /home/ux454325/swiftwork >> >> >> I get the error below. 
Files are on CI net at /home/wilde/swift/lab. >> >> I will try to copy coaster boot logs and gram logs to same place when I >> find them, in subdirs named by $RunID.logs. >> >> -- >> >> com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift >> Swift svn swift-r2809 cog-r2350 >> >> RunID: 20090406-1155-pgc5nj00 >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb >> Progress: Failed:1 >> Execution failed: >> Exception in cat: >> Arguments: [data.txt] >> Host: qb >> Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Cannot submit job: Cannot run program "qsub": >> java.io.IOException: error=2, No such file or directory >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> Cannot submit job: Cannot run program "qsub": java.io.IOException: >> error=2, No such file or directory >> at >> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) >> at >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) >> at >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) >> at >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) >> at >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) >> Caused by: java.io.IOException: Cannot run program "qsub": >> java.io.IOException: error=2, No such file or directory >> at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) >> at java.lang.Runtime.exec(Runtime.java:593) >> at >> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73) >> at >> 
org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) >> ... 4 more >> Caused by: java.io.IOException: java.io.IOException: error=2, No such >> file or directory >> at java.lang.UNIXProcess.(UNIXProcess.java:148) >> at java.lang.ProcessImpl.start(ProcessImpl.java:65) >> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) >> ... 7 more >> >> Cleaning up... >> Shutting down service at https://208.100.92.21:44166 >> Got channel MetaChannel: 24235184 -> GSSSChannel-null(1) >> - Done >> com$ pwd >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Apr 6 15:25:09 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 15:25:09 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs In-Reply-To: <49DA6359.8010207@mcs.anl.gov> References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov> Message-ID: <1239049509.5350.0.camel@localhost> Oops. cog r2367 should fix that. On Mon, 2009-04-06 at 15:17 -0500, Michael Wilde wrote: > Mihael, I just updated our test swift+cog source and rebuilt. 
> > Glen is now getting: > > Caused by: > Invalid GSSCredentials > org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: > Invalid GSSCredentials > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:149) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:222) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > Caused by: GSSException: Invalid name provided [Caused by: [JGLOBUS-112] > Malformed name, "=" missing in "38356/jobmanager-pbs"] > at > org.globus.gsi.gssapi.GlobusGSSName.(GlobusGSSName.java:137) > at > org.globus.gsi.gssapi.GlobusGSSManagerImpl.createName(GlobusGSSManagerImpl.java:304) > at > org.globus.gsi.gssapi.auth.IdentityAuthorization.getExpectedName(IdentityAuthorization.java:82) > at org.globus.gram.Gram.gatekeeperConnect(Gram.java:85) > at org.globus.gram.Gram.request(Gram.java:310) > at org.globus.gram.GramJob.request(GramJob.java:262) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) > ... 5 more > what's up here > > > Any chance I picked up code in transition, or a new problem in recent > commits? > > - Mike > > > > On 4/6/09 1:02 PM, Mihael Hategan wrote: > > Yes. This is one of those "can't find executable unless run through > > 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking > > how to deal with the situation. 
> > > > On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote: > >> With this sites entry: > >> > >> > >> TG-CDA070002T > >> >> jobManager="gt2:pbs" /> > >> > >> /home/ux454325/swiftwork > >> > >> > >> I get the error below. Files are on CI net at /home/wilde/swift/lab. > >> > >> I will try to copy coaster boot logs and gram logs to same place when I > >> find them, in subdirs named by $RunID.logs. > >> > >> -- > >> > >> com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift > >> Swift svn swift-r2809 cog-r2350 > >> > >> RunID: 20090406-1155-pgc5nj00 > >> Progress: > >> Progress: Stage in:1 > >> Progress: Submitted:1 > >> Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb > >> Progress: Failed:1 > >> Execution failed: > >> Exception in cat: > >> Arguments: [data.txt] > >> Host: qb > >> Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j > >> stderr.txt: > >> > >> stdout.txt: > >> > >> ---- > >> > >> Caused by: > >> Cannot submit job: Cannot run program "qsub": > >> java.io.IOException: error=2, No such file or directory > >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > >> Cannot submit job: Cannot run program "qsub": java.io.IOException: > >> error=2, No such file or directory > >> at > >> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) > >> at > >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > >> at > >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > >> at > >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > >> at > >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > >> Caused by: java.io.IOException: Cannot run program "qsub": > >> java.io.IOException: error=2, No such file or directory > >> at 
java.lang.ProcessBuilder.start(ProcessBuilder.java:459) > >> at java.lang.Runtime.exec(Runtime.java:593) > >> at > >> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73) > >> at > >> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > >> ... 4 more > >> Caused by: java.io.IOException: java.io.IOException: error=2, No such > >> file or directory > >> at java.lang.UNIXProcess.(UNIXProcess.java:148) > >> at java.lang.ProcessImpl.start(ProcessImpl.java:65) > >> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) > >> ... 7 more > >> > >> Cleaning up... > >> Shutting down service at https://208.100.92.21:44166 > >> Got channel MetaChannel: 24235184 -> GSSSChannel-null(1) > >> - Done > >> com$ pwd > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From wilde at mcs.anl.gov Mon Apr 6 16:25:51 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 16:25:51 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs In-Reply-To: <1239049509.5350.0.camel@localhost> References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov> <1239049509.5350.0.camel@localhost> Message-ID: <49DA735F.1020300@mcs.anl.gov> We just tested that rev, and now it seems as if the jobs are getting submitted to the fork JM instead of to PBS. Need a log for that, or is the cause obvious? On 4/6/09 3:25 PM, Mihael Hategan wrote: > Oops. cog r2367 should fix that. > > On Mon, 2009-04-06 at 15:17 -0500, Michael Wilde wrote: >> Mihael, I just updated our test swift+cog source and rebuilt. 
>> >> Glen is now getting: >> >> Caused by: >> Invalid GSSCredentials >> org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: >> Invalid GSSCredentials >> at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:149) >> at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) >> at >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) >> at >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) >> at >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:222) >> at >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) >> Caused by: GSSException: Invalid name provided [Caused by: [JGLOBUS-112] >> Malformed name, "=" missing in "38356/jobmanager-pbs"] >> at >> org.globus.gsi.gssapi.GlobusGSSName.(GlobusGSSName.java:137) >> at >> org.globus.gsi.gssapi.GlobusGSSManagerImpl.createName(GlobusGSSManagerImpl.java:304) >> at >> org.globus.gsi.gssapi.auth.IdentityAuthorization.getExpectedName(IdentityAuthorization.java:82) >> at org.globus.gram.Gram.gatekeeperConnect(Gram.java:85) >> at org.globus.gram.Gram.request(Gram.java:310) >> at org.globus.gram.GramJob.request(GramJob.java:262) >> at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) >> ... 5 more >> what's up here >> >> >> Any chance I picked up code in transition, or a new problem in recent >> commits? >> >> - Mike >> >> >> >> On 4/6/09 1:02 PM, Mihael Hategan wrote: >>> Yes. This is one of those "can't find executable unless run through >>> 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking >>> how to deal with the situation. 
>>> >>> On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote: >>>> With this sites entry: >>>> >>>> >>>> TG-CDA070002T >>>> >>> jobManager="gt2:pbs" /> >>>> >>>> /home/ux454325/swiftwork >>>> >>>> >>>> I get the error below. Files are on CI net at /home/wilde/swift/lab. >>>> >>>> I will try to copy coaster boot logs and gram logs to same place when I >>>> find them, in subdirs named by $RunID.logs. >>>> >>>> -- >>>> >>>> com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift >>>> Swift svn swift-r2809 cog-r2350 >>>> >>>> RunID: 20090406-1155-pgc5nj00 >>>> Progress: >>>> Progress: Stage in:1 >>>> Progress: Submitted:1 >>>> Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb >>>> Progress: Failed:1 >>>> Execution failed: >>>> Exception in cat: >>>> Arguments: [data.txt] >>>> Host: qb >>>> Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j >>>> stderr.txt: >>>> >>>> stdout.txt: >>>> >>>> ---- >>>> >>>> Caused by: >>>> Cannot submit job: Cannot run program "qsub": >>>> java.io.IOException: error=2, No such file or directory >>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >>>> Cannot submit job: Cannot run program "qsub": java.io.IOException: >>>> error=2, No such file or directory >>>> at >>>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) >>>> at >>>> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) >>>> at >>>> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) >>>> at >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) >>>> at >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) >>>> Caused by: java.io.IOException: Cannot run program "qsub": >>>> java.io.IOException: error=2, No such file or directory >>>> at 
java.lang.ProcessBuilder.start(ProcessBuilder.java:459) >>>> at java.lang.Runtime.exec(Runtime.java:593) >>>> at >>>> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73) >>>> at >>>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) >>>> ... 4 more >>>> Caused by: java.io.IOException: java.io.IOException: error=2, No such >>>> file or directory >>>> at java.lang.UNIXProcess.(UNIXProcess.java:148) >>>> at java.lang.ProcessImpl.start(ProcessImpl.java:65) >>>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) >>>> ... 7 more >>>> >>>> Cleaning up... >>>> Shutting down service at https://208.100.92.21:44166 >>>> Got channel MetaChannel: 24235184 -> GSSSChannel-null(1) >>>> - Done >>>> com$ pwd >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Apr 6 16:44:43 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 16:44:43 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs In-Reply-To: <49DA735F.1020300@mcs.anl.gov> References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov> <1239049509.5350.0.camel@localhost> <49DA735F.1020300@mcs.anl.gov> Message-ID: <1239054283.6821.0.camel@localhost> On Mon, 2009-04-06 at 16:25 -0500, Michael Wilde wrote: > We just tested that rev, and now it seems as if the jobs are getting > submitted to the fork JM instead of to PBS. > > Need a log for that, or is the cause obvious? No. I'll debug and see. > > > On 4/6/09 3:25 PM, Mihael Hategan wrote: > > Oops. cog r2367 should fix that. > > > > On Mon, 2009-04-06 at 15:17 -0500, Michael Wilde wrote: > >> Mihael, I just updated our test swift+cog source and rebuilt. 
> >> > >> Glen is now getting: > >> > >> Caused by: > >> Invalid GSSCredentials > >> org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: > >> Invalid GSSCredentials > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:149) > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) > >> at > >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > >> at > >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > >> at > >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:222) > >> at > >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > >> Caused by: GSSException: Invalid name provided [Caused by: [JGLOBUS-112] > >> Malformed name, "=" missing in "38356/jobmanager-pbs"] > >> at > >> org.globus.gsi.gssapi.GlobusGSSName.(GlobusGSSName.java:137) > >> at > >> org.globus.gsi.gssapi.GlobusGSSManagerImpl.createName(GlobusGSSManagerImpl.java:304) > >> at > >> org.globus.gsi.gssapi.auth.IdentityAuthorization.getExpectedName(IdentityAuthorization.java:82) > >> at org.globus.gram.Gram.gatekeeperConnect(Gram.java:85) > >> at org.globus.gram.Gram.request(Gram.java:310) > >> at org.globus.gram.GramJob.request(GramJob.java:262) > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) > >> ... 5 more > >> what's up here > >> > >> > >> Any chance I picked up code in transition, or a new problem in recent > >> commits? > >> > >> - Mike > >> > >> > >> > >> On 4/6/09 1:02 PM, Mihael Hategan wrote: > >>> Yes. This is one of those "can't find executable unless run through > >>> 'bash -l' or maybe not" which we saw using wget and md5sum. 
I'm thinking > >>> how to deal with the situation. > >>> > >>> On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote: > >>>> With this sites entry: > >>>> > >>>> > >>>> TG-CDA070002T > >>>> >>>> jobManager="gt2:pbs" /> > >>>> > >>>> /home/ux454325/swiftwork > >>>> > >>>> > >>>> I get the error below. Files are on CI net at /home/wilde/swift/lab. > >>>> > >>>> I will try to copy coaster boot logs and gram logs to same place when I > >>>> find them, in subdirs named by $RunID.logs. > >>>> > >>>> -- > >>>> > >>>> com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift > >>>> Swift svn swift-r2809 cog-r2350 > >>>> > >>>> RunID: 20090406-1155-pgc5nj00 > >>>> Progress: > >>>> Progress: Stage in:1 > >>>> Progress: Submitted:1 > >>>> Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb > >>>> Progress: Failed:1 > >>>> Execution failed: > >>>> Exception in cat: > >>>> Arguments: [data.txt] > >>>> Host: qb > >>>> Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j > >>>> stderr.txt: > >>>> > >>>> stdout.txt: > >>>> > >>>> ---- > >>>> > >>>> Caused by: > >>>> Cannot submit job: Cannot run program "qsub": > >>>> java.io.IOException: error=2, No such file or directory > >>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > >>>> Cannot submit job: Cannot run program "qsub": java.io.IOException: > >>>> error=2, No such file or directory > >>>> at > >>>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) > >>>> at > >>>> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > >>>> at > >>>> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > >>>> at > >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > >>>> at > >>>> 
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > >>>> Caused by: java.io.IOException: Cannot run program "qsub": > >>>> java.io.IOException: error=2, No such file or directory > >>>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) > >>>> at java.lang.Runtime.exec(Runtime.java:593) > >>>> at > >>>> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73) > >>>> at > >>>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > >>>> ... 4 more > >>>> Caused by: java.io.IOException: java.io.IOException: error=2, No such > >>>> file or directory > >>>> at java.lang.UNIXProcess.(UNIXProcess.java:148) > >>>> at java.lang.ProcessImpl.start(ProcessImpl.java:65) > >>>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) > >>>> ... 7 more > >>>> > >>>> Cleaning up... > >>>> Shutting down service at https://208.100.92.21:44166 > >>>> Got channel MetaChannel: 24235184 -> GSSSChannel-null(1) > >>>> - Done > >>>> com$ pwd > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Mon Apr 6 16:46:50 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 16:46:50 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs In-Reply-To: <49DA735F.1020300@mcs.anl.gov> References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov> <1239049509.5350.0.camel@localhost> <49DA735F.1020300@mcs.anl.gov> Message-ID: <1239054410.6821.2.camel@localhost> On Mon, 2009-04-06 at 16:25 -0500, Michael Wilde wrote: > We just tested that rev, and now it seems as if the jobs are getting > submitted to the fork JM instead of to PBS. > > Need a log for that, or is the cause obvious? 
Actually yes, it just became obvious. > > > On 4/6/09 3:25 PM, Mihael Hategan wrote: > > Oops. cog r2367 should fix that. > > > > On Mon, 2009-04-06 at 15:17 -0500, Michael Wilde wrote: > >> Mihael, I just updated our test swift+cog source and rebuilt. > >> > >> Glen is now getting: > >> > >> Caused by: > >> Invalid GSSCredentials > >> org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: > >> Invalid GSSCredentials > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:149) > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) > >> at > >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > >> at > >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > >> at > >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:222) > >> at > >> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > >> Caused by: GSSException: Invalid name provided [Caused by: [JGLOBUS-112] > >> Malformed name, "=" missing in "38356/jobmanager-pbs"] > >> at > >> org.globus.gsi.gssapi.GlobusGSSName.(GlobusGSSName.java:137) > >> at > >> org.globus.gsi.gssapi.GlobusGSSManagerImpl.createName(GlobusGSSManagerImpl.java:304) > >> at > >> org.globus.gsi.gssapi.auth.IdentityAuthorization.getExpectedName(IdentityAuthorization.java:82) > >> at org.globus.gram.Gram.gatekeeperConnect(Gram.java:85) > >> at org.globus.gram.Gram.request(Gram.java:310) > >> at org.globus.gram.GramJob.request(GramJob.java:262) > >> at > >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) > >> ... 
5 more > >> what's up here > >> > >> > >> Any chance I picked up code in transition, or a new problem in recent > >> commits? > >> > >> - Mike > >> > >> > >> > >> On 4/6/09 1:02 PM, Mihael Hategan wrote: > >>> Yes. This is one of those "can't find executable unless run through > >>> 'bash -l' or maybe not" which we saw using wget and md5sum. I'm thinking > >>> how to deal with the situation. > >>> > >>> On Mon, 2009-04-06 at 12:10 -0500, Michael Wilde wrote: > >>>> With this sites entry: > >>>> > >>>> > >>>> TG-CDA070002T > >>>> >>>> jobManager="gt2:pbs" /> > >>>> > >>>> /home/ux454325/swiftwork > >>>> > >>>> > >>>> I get the error below. Files are on CI net at /home/wilde/swift/lab. > >>>> > >>>> I will try to copy coaster boot logs and gram logs to same place when I > >>>> find them, in subdirs named by $RunID.logs. > >>>> > >>>> -- > >>>> > >>>> com$ swift -tc.file tc.data -sites.file qb.coasters.xml cat.swift > >>>> Swift svn swift-r2809 cog-r2350 > >>>> > >>>> RunID: 20090406-1155-pgc5nj00 > >>>> Progress: > >>>> Progress: Stage in:1 > >>>> Progress: Submitted:1 > >>>> Failed to transfer wrapper log from cat-20090406-1155-pgc5nj00/info/m on qb > >>>> Progress: Failed:1 > >>>> Execution failed: > >>>> Exception in cat: > >>>> Arguments: [data.txt] > >>>> Host: qb > >>>> Directory: cat-20090406-1155-pgc5nj00/jobs/m/cat-m91ku09j > >>>> stderr.txt: > >>>> > >>>> stdout.txt: > >>>> > >>>> ---- > >>>> > >>>> Caused by: > >>>> Cannot submit job: Cannot run program "qsub": > >>>> java.io.IOException: error=2, No such file or directory > >>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > >>>> Cannot submit job: Cannot run program "qsub": java.io.IOException: > >>>> error=2, No such file or directory > >>>> at > >>>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) > >>>> at > >>>> 
org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > >>>> at > >>>> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > >>>> at > >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > >>>> at > >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > >>>> Caused by: java.io.IOException: Cannot run program "qsub": > >>>> java.io.IOException: error=2, No such file or directory > >>>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) > >>>> at java.lang.Runtime.exec(Runtime.java:593) > >>>> at > >>>> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:73) > >>>> at > >>>> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > >>>> ... 4 more > >>>> Caused by: java.io.IOException: java.io.IOException: error=2, No such > >>>> file or directory > >>>> at java.lang.UNIXProcess.(UNIXProcess.java:148) > >>>> at java.lang.ProcessImpl.start(ProcessImpl.java:65) > >>>> at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) > >>>> ... 7 more > >>>> > >>>> Cleaning up... > >>>> Shutting down service at https://208.100.92.21:44166 > >>>> Got channel MetaChannel: 24235184 -> GSSSChannel-null(1) > >>>> - Done > >>>> com$ pwd > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From wilde at mcs.anl.gov Mon Apr 6 17:00:26 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 17:00:26 -0500 Subject: [Swift-devel] Expected behavior for scheduler slow-start with coasters? 
Message-ID: <49DA7B7A.6070802@mcs.anl.gov> We are seeing the following on Ranger: Swift has 2 coaster jobs running in SGE, coastersPerNode is set to 16, yet it seems to be doing "slow start" as if it doesn't know that it can quickly fill the available coaster slots. For example, Glen sees the trace below, and is surprised that it's not running at least 32 app() procs by this point, instead of 2. Is this expected behavior, or would you have expected the scheduler to fill all available coaster slots? -- Every 2.0s: showq | grep hockyg Mon Apr 6 16:51:38 2009 641061 data hockyg Running 16 01:31:50 Mon Apr 6 16:42:30 641062 data hockyg Running 16 01:31:50 Mon Apr 6 16:42:30 that's ranger Progress: Selecting site:98 Stage in:1 Submitting:1 Finished successfully:4 Progress: Selecting site:98 Submitting:2 Finished successfully:4 Progress: Selecting site:98 Submitting:1 Submitted:1 Finished successfully:4 Progress: Selecting site:98 Submitted:2 Finished successfully:4 Progress: Selecting site:98 Submitted:1 Active:1 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2 Finished successfully:4 Progress: Selecting site:98 Active:2
Finished successfully:4 Progress: Selecting site:98 Active:1 Stage out:1 Finished successfully:4 Progress: Selecting site:98 Active:1 Finished successfully:5 Progress: Selecting site:97 Stage in:1 Active:1 Finished successfully:5 Progress: Selecting site:96 Active:3 Finished successfully:5 Progress: Selecting site:96 Active:3 Finished successfully:5 Progress: Selecting site:96 Active:2 Stage out:1 Finished successfully:5 Progress: Selecting site:96 Active:2 Finished successfully:6 Progress: Selecting site:95 Stage in:1 Active:2 Finished successfully:6 From hategan at mcs.anl.gov Mon Apr 6 17:04:36 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 17:04:36 -0500 Subject: [Swift-devel] coaster problem with jobmanager=gt2:pbs In-Reply-To: <1239054410.6821.2.camel@localhost> References: <49DA379B.7080403@mcs.anl.gov> <1239040937.2410.3.camel@localhost> <49DA6359.8010207@mcs.anl.gov> <1239049509.5350.0.camel@localhost> <49DA735F.1020300@mcs.anl.gov> <1239054410.6821.2.camel@localhost> Message-ID: <1239055476.6821.9.camel@localhost> On Mon, 2009-04-06 at 16:46 -0500, Mihael Hategan wrote: > On Mon, 2009-04-06 at 16:25 -0500, Michael Wilde wrote: > > We just tested that rev, and now it seems as if the jobs are getting > > submitted to the fork JM instead of to PBS. > > > > Need a log for that, or is the cause obvious? > > Actually yes, it just became obvious. I've corrected the initial fix. Hopefully it works properly this time. The issue was related to a badly thought change needed for the worker terminal to function. From hategan at mcs.anl.gov Mon Apr 6 17:08:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 17:08:23 -0500 Subject: [Swift-devel] Expected behavior for scheduler slow-start with coasters? 
In-Reply-To: <49DA7B7A.6070802@mcs.anl.gov> References: <49DA7B7A.6070802@mcs.anl.gov> Message-ID: <1239055703.6821.11.camel@localhost> On Mon, 2009-04-06 at 17:00 -0500, Michael Wilde wrote: > We are seeing the following on Ranger: > > Swift has 2 coaster jobs running in SGE, coastersPerNode is set to 16, > yet it seems to be doing "slow start" as if it doesnt know that it ca > quickly fill the available coaster slots. Right. Swift doing a slow start is a given. Coasters allocating more workers than needed is the issue. From wilde at mcs.anl.gov Mon Apr 6 17:32:22 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 17:32:22 -0500 Subject: [Swift-devel] Expected behavior for scheduler slow-start with coasters? In-Reply-To: <1239055703.6821.11.camel@localhost> References: <49DA7B7A.6070802@mcs.anl.gov> <1239055703.6821.11.camel@localhost> Message-ID: <49DA82F6.9090502@mcs.anl.gov> OK, this one seems to be more of a nuisance/anomaly that we can set aside for now, I think. Opening up the throttle a bit should make this a minor issue. Eventually, you'd hope it would fill available coasters when there is demand, or at least base the rampup on the fact that jobs started, and not wait for them to finish. Then it would sense faster that there were more ready workers. On 4/6/09 5:08 PM, Mihael Hategan wrote: > On Mon, 2009-04-06 at 17:00 -0500, Michael Wilde wrote: >> We are seeing the following on Ranger: >> >> Swift has 2 coaster jobs running in SGE, coastersPerNode is set to 16, >> yet it seems to be doing "slow start" as if it doesnt know that it ca >> quickly fill the available coaster slots. > > Right. Swift doing a slow start is a given. > > Coasters allocating more workers than needed is the issue. > > From hategan at mcs.anl.gov Mon Apr 6 17:43:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 17:43:23 -0500 Subject: [Swift-devel] Expected behavior for scheduler slow-start with coasters?
In-Reply-To: <49DA82F6.9090502@mcs.anl.gov> References: <49DA7B7A.6070802@mcs.anl.gov> <1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov> Message-ID: <1239057803.8721.4.camel@localhost> On Mon, 2009-04-06 at 17:32 -0500, Michael Wilde wrote: > OK, this one seems to be more of a nuisance/anomaly that we can set > aside for now I think. > > Opening up the throttle a bit should make this a minor issue. > Eventually, you'd hope it would fill available coasters when there is > demand, or at least base the rampup on the fast that jobs started, and > not wait for them to finish. Then it would sense faster that there were > more ready workers. Yes. I mentioned this a while ago, that with coasters, throttling guesses become unnecessary. You simply throttle to the number of available workers. This, however, falls out of the model we started with, so there are some possibly non-trivial changes to swift needed in order to support this with coasters, while still keeping the old behaviour without coasters. From wilde at mcs.anl.gov Mon Apr 6 17:53:17 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 17:53:17 -0500 Subject: [Swift-devel] Expected behavior for scheduler slow-start with coasters? In-Reply-To: <1239057803.8721.4.camel@localhost> References: <49DA7B7A.6070802@mcs.anl.gov> <1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov> <1239057803.8721.4.camel@localhost> Message-ID: <49DA87DD.1010704@mcs.anl.gov> OK, sounds reasonable. For what it's worth, Glen provided another example of coasters going idle while there are jobs ready to run. Nothing more to say on this, except to point out that it affects more than just startup. Is there a simpler, alternate scheduler algorithm that you could plug in as a global, settable alternative to the current one when all sites are using coasters? (No need to answer that now; we'll see how far we can get with things as they are, in various combinations of sites and settings).
We're digging into the imbalance problem at the moment; that one may be more worth your time, as is the larger-node-per-job allocation enhancement.
---
from Glen: again, not using their coasters effectively
5:42 Michael Wilde ?
5:42 Glen Hocky e.g. qb now has
qb2:
                                             Req'd Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
94741.qb2            hockyg   workq    scheduler_  30786     1   1     -- 01:41 R 00:53
94742.qb2            hockyg   workq    scheduler_  31391     1   1     -- 01:41 R 00:53
94808.qb2            hockyg   workq    scheduler_   2274     1   1     -- 01:41 R 00:22
94809.qb2            hockyg   workq    scheduler_  27186     1   1     -- 01:41 R 00:21
94811.qb2            hockyg   workq    scheduler_  31647     1   1     -- 01:41 R 00:21
94812.qb2            hockyg   workq    scheduler_   4773     1   1     -- 01:41 R 00:18
but only 4 active jobs
4 submitted
*7 submitted
all the rest done
so what is it doing with all those extra cpus
5:43 ... Glen Hocky for my run on only qb
Progress: Submitted:7 Active:4 Finished successfully:93
5:43 Glen Hocky again, the problem may be that these jobs are taking 15 minutes or more so they don't end very often
On 4/6/09 5:43 PM, Mihael Hategan wrote:
> On Mon, 2009-04-06 at 17:32 -0500, Michael Wilde wrote:
>> OK, this one seems to be more of a nuisance/anomaly that we can set
>> aside for now I think.
>>
>> Opening up the throttle a bit should make this a minor issue.
>> Eventually, you'd hope it would fill available coasters when there is
>> demand, or at least base the rampup on the fast that jobs started, and
>> not wait for them to finish. Then it would sense faster that there were
>> more ready workers.
>
> Yes. I mentioned this a while ago, that with coasters, throttling
> guesses become unnecessary. You simply throttle to the number of
> available workers.
> > This, however, falls out of the model we started with, so there are some > possibly non-trivial changes to swift needed in order to support this > with coasters, while still keeping the old behaviour without coasters. > > > From hategan at mcs.anl.gov Mon Apr 6 18:18:59 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 18:18:59 -0500 Subject: [Swift-devel] Expected behavior for scheduler slow-start with coasters? In-Reply-To: <49DA87DD.1010704@mcs.anl.gov> References: <49DA7B7A.6070802@mcs.anl.gov> <1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov> <1239057803.8721.4.camel@localhost> <49DA87DD.1010704@mcs.anl.gov> Message-ID: <1239059939.8843.21.camel@localhost> On Mon, 2009-04-06 at 17:53 -0500, Michael Wilde wrote: > OK, sounds reasonable. > > For what its worth, Glen provided another example of coasters going idle > while there are jobs ready to run. Or maybe the jobs don't fit in the time some of the workers have left. In other words don't be surprised that workers are not the same as the jobs they are meant to run, because that's obvious. There are only two promises related to how workers are allocated: no more workers than jobs will be started (modulo the broken coastersPerNode issue - and this promise may have to be dropped if we do block allocations) and that no worker will stay idle for more than a certain amount of time, which is currently 10 minutes (probably too large). > > Nothing more to say on this, except to point out that it affects more > than just startup. Where "it" may be a very different it. 
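The two allocation promises described in the message above (never start more workers than there are jobs, and reclaim any worker idle past a timeout) can be sketched roughly as follows. This is an illustrative sketch only, not the actual coaster scheduler code; the names and the 10-minute constant simply mirror the description in the email:

```python
IDLE_TIMEOUT = 10 * 60  # seconds; the message above cites 10 minutes ("probably too large")

class Worker:
    def __init__(self, idle_since=None):
        # idle_since is a timestamp while the worker has no job, None while busy
        self.idle_since = idle_since

def workers_to_start(queued_jobs, live_workers):
    # Promise 1: never allocate more workers than there are jobs waiting.
    return max(0, queued_jobs - live_workers)

def reap_idle(workers, now):
    # Promise 2: no worker stays idle past IDLE_TIMEOUT; its slot is
    # returned to the batch system instead of holding a CPU.
    keep = [w for w in workers
            if w.idle_since is None or now - w.idle_since <= IDLE_TIMEOUT]
    reaped = [w for w in workers if w not in keep]
    return keep, reaped
```

Under this policy a worker between jobs is not "wasted" until the timeout expires, which matches the observation that workers and the jobs they run need not line up one-to-one.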
From wilde at mcs.anl.gov Mon Apr 6 18:28:04 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 18:28:04 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites Message-ID: <49DA9004.8010409@mcs.anl.gov> Glen seems to have a good example of this in: /home/hockyg/oops/swift/output/teragridoutdir.1
com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' | sort | uniq -c
159 host=abe
8 host=localhost
13 host=qb
11 host=ranger
com$
---
But then I looked in the log and I see that for qb and ranger, it tries to start jobs there and gets an exception on each of them, while jobs for abe keep on zipping through. As far as I can tell, there is, e.g. on queenbee, no coaster boot log at the time of the exception, and I can't glean any clues from the GRAM log at the time of the exception (no obvious errors in it). I am trying now to reproduce this with simple echo-like jobs under my own id & cert where I can see all the server-side logs. I *think* that for the run above, Glen first tested each of the 3 sites.xml pool elements separately, for the 3 sites, before trying the 3-site test. I *think* he verified that all three sites worked separately. But when put together, it *seems* that only the first one works, as if the ability to start coasters on 3 sites at once is broken. I am not at all sure, and will try to isolate with a simpler test that you can run as well, but at the moment that's a plausible theory. Btw, this is still with the Mar 31 code rev. I need to catch up on mail to see if I can now go back to testing on trunk.
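The grep/awk/sort/uniq pipeline shown in the message above can also be done in one pass; a rough Python equivalent is sketched below. It scans each matching line for the `host=` field rather than hard-coding awk's `$8` column, and the sample log lines are invented for illustration, not taken from a real Swift log:

```python
from collections import Counter

def tally_sites(log_lines):
    # Rough equivalent of:
    #   grep 'execute2 THREAD_ASSOCIATION' *.log | awk '{print $8}' | sort | uniq -c
    counts = Counter()
    for line in log_lines:
        if "execute2 THREAD_ASSOCIATION" in line:
            for field in line.split():
                if field.startswith("host="):
                    counts[field] += 1
                    break
    return counts

sample = [
    "... execute2 THREAD_ASSOCIATION thread=0-1 host=abe",
    "... execute2 THREAD_ASSOCIATION thread=0-2 host=qb",
    "... execute2 THREAD_ASSOCIATION thread=0-3 host=abe",
    "... unrelated log line",
]
for host, n in tally_sites(sample).most_common():
    print(n, host)  # prints "2 host=abe" then "1 host=qb"
```

Scanning for the `host=` prefix instead of a fixed column keeps the count correct even if earlier fields in the log line shift between Swift versions.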
From wilde at mcs.anl.gov Mon Apr 6 23:25:45 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 23:25:45 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DA9004.8010409@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> Message-ID: <49DAD5C9.7080607@mcs.anl.gov> I tried this test and discovered some more things about coaster time management that I dont understand. It seems that on Queenbee coasters were timing out, while on abe the workers were getting queued, but abe's coasters.log showed lots of java exceptions. If you're interested, all logs for this run including coasters.logs from the two sites .globus dirs is on ci net at /home/wilde/swift/lab/20090406-2120-04ythaie I will re-run with the latest cog/swift revs to see if the behavior persists. - Mike On 4/6/09 6:28 PM, Michael Wilde wrote: > Glen seems to have a good example of this in: > /home/hockyg/oops/swift/output/teragridoutdir.1 > > com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' | > sort | uniq -c > 159 host=abe > 8 host=localhost > 13 host=qb > 11 host=ranger > com$ > > --- > > But then I looked in the log and I see that for qb and ranger, it tries > to start jobs there and gets an exception on each of them, while jobs > for abe keep on zipping through. > > As far as I can tell, there is, eg on queenbee, no coaster boot log at > the time of the exception, and I cant glean any clues from the GRAM log > at the time of the exception (no obvious errors in it). > > I am trying now to reproduce this with simple echo-like jobs under my > own id & cert where I can see all the server-side logs. > > I *think* that for the run above, Glen first tested ach of the 3 > sites.xml pool elements separately, for the 3 sites, before trying the > 3-site test. I *think* he verified that all three sites worked separately. 
> > But when put together, it *seems* that only the first one works, as if > the ability to start coasters on 3 sites at once is broken. > > I am not at all sure, and will try to isolate with a simpler test that > you can run as well, but at the moment thats a plausible theory. > > Btw, this is still with the Mar 31 code rev. I need to catch up on mail > to see if I can no go back to testing on trunk. > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Apr 6 23:45:45 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 23:45:45 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DAD5C9.7080607@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> Message-ID: <1239079545.15719.3.camel@localhost> On Mon, 2009-04-06 at 23:25 -0500, Michael Wilde wrote: > I tried this test and discovered some more things about coaster time > management that I dont understand. > > It seems that on Queenbee coasters were timing out, while on abe the > workers were getting queued, but abe's coasters.log showed lots of java > exceptions. Yes. It still seems to have been run with the unfortunate version. I can't tell which exceptions are legit and which ones are the result of coasters code in the particular bad state. > > If you're interested, all logs for this run including coasters.logs from > the two sites .globus dirs is on ci net at > /home/wilde/swift/lab/20090406-2120-04ythaie > > I will re-run with the latest cog/swift revs to see if the behavior > persists. 
> > - Mike > > > On 4/6/09 6:28 PM, Michael Wilde wrote: > > Glen seems to have a good example of this in: > > /home/hockyg/oops/swift/output/teragridoutdir.1 > > > > com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' | > > sort | uniq -c > > 159 host=abe > > 8 host=localhost > > 13 host=qb > > 11 host=ranger > > com$ > > > > --- > > > > But then I looked in the log and I see that for qb and ranger, it tries > > to start jobs there and gets an exception on each of them, while jobs > > for abe keep on zipping through. > > > > As far as I can tell, there is, eg on queenbee, no coaster boot log at > > the time of the exception, and I cant glean any clues from the GRAM log > > at the time of the exception (no obvious errors in it). > > > > I am trying now to reproduce this with simple echo-like jobs under my > > own id & cert where I can see all the server-side logs. > > > > I *think* that for the run above, Glen first tested ach of the 3 > > sites.xml pool elements separately, for the 3 sites, before trying the > > 3-site test. I *think* he verified that all three sites worked separately. > > > > But when put together, it *seems* that only the first one works, as if > > the ability to start coasters on 3 sites at once is broken. > > > > I am not at all sure, and will try to isolate with a simpler test that > > you can run as well, but at the moment thats a plausible theory. > > > > Btw, this is still with the Mar 31 code rev. I need to catch up on mail > > to see if I can no go back to testing on trunk. 
> > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Mon Apr 6 23:56:54 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 23:56:54 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DAD5C9.7080607@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> Message-ID: <49DADD16.2010507@mcs.anl.gov> The latest rev shows a similar failure on the surface, but I think different patterns in the coaster logs. The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile. This time 39 of 40 jobs ran on abe, and then the workflow lingered and finally failed, with 39 ok, 1 failure. All the logs for this run are in /home/wilde/swift/lab/20090406-2330-72p9ale0 below that are dirs for the abe and qb logs coaster and gram logs. Abe had no gram log for this run. I suspect this one is worth looking at. On 4/6/09 11:25 PM, Michael Wilde wrote: > I tried this test and discovered some more things about coaster time > management that I dont understand. > > It seems that on Queenbee coasters were timing out, while on abe the > workers were getting queued, but abe's coasters.log showed lots of java > exceptions. > > If you're interested, all logs for this run including coasters.logs from > the two sites .globus dirs is on ci net at > /home/wilde/swift/lab/20090406-2120-04ythaie > > I will re-run with the latest cog/swift revs to see if the behavior > persists. 
> > - Mike > > > On 4/6/09 6:28 PM, Michael Wilde wrote: >> Glen seems to have a good example of this in: >> /home/hockyg/oops/swift/output/teragridoutdir.1 >> >> com$ grep 'execute2 THREAD_ASSOCIATION' *ji3.log | awk '{print $8}' | >> sort | uniq -c >> 159 host=abe >> 8 host=localhost >> 13 host=qb >> 11 host=ranger >> com$ >> >> --- >> >> But then I looked in the log and I see that for qb and ranger, it >> tries to start jobs there and gets an exception on each of them, while >> jobs for abe keep on zipping through. >> >> As far as I can tell, there is, eg on queenbee, no coaster boot log at >> the time of the exception, and I cant glean any clues from the GRAM >> log at the time of the exception (no obvious errors in it). >> >> I am trying now to reproduce this with simple echo-like jobs under my >> own id & cert where I can see all the server-side logs. >> >> I *think* that for the run above, Glen first tested ach of the 3 >> sites.xml pool elements separately, for the 3 sites, before trying the >> 3-site test. I *think* he verified that all three sites worked >> separately. >> >> But when put together, it *seems* that only the first one works, as if >> the ability to start coasters on 3 sites at once is broken. >> >> I am not at all sure, and will try to isolate with a simpler test that >> you can run as well, but at the moment thats a plausible theory. >> >> Btw, this is still with the Mar 31 code rev. I need to catch up on >> mail to see if I can no go back to testing on trunk. 
>> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Apr 7 00:09:44 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 07 Apr 2009 00:09:44 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DADD16.2010507@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> Message-ID: <1239080984.16125.1.camel@localhost> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote: > The latest rev shows a similar failure on the surface, but I think > different patterns in the coaster logs. > > The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile. > > This time 39 of 40 jobs ran on abe, and then the workflow lingered and > finally failed, with 39 ok, 1 failure. > > All the logs for this run are in > /home/wilde/swift/lab/20090406-2330-72p9ale0 > > below that are dirs for the abe and qb logs coaster and gram logs. > Abe had no gram log for this run. > > I suspect this one is worth looking at. Indeed. Can you paste your sites file? There's some oddity there. 
From wilde at mcs.anl.gov Tue Apr 7 00:09:58 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 00:09:58 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <1239080984.16125.1.camel@localhost> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> Message-ID: <49DAE026.3040909@mcs.anl.gov> com$ cat abe+qb.xml TG-CDA070002T 8 02:30:00 /u/ac/wilde/swiftwork TG-CDA070002T 8 02:30:00 /home/ux454325/swiftwork com$ On 4/7/09 12:09 AM, Mihael Hategan wrote: > On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote: >> The latest rev shows a similar failure on the surface, but I think >> different patterns in the coaster logs. >> >> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped outfile. >> >> This time 39 of 40 jobs ran on abe, and then the workflow lingered and >> finally failed, with 39 ok, 1 failure. >> >> All the logs for this run are in >> /home/wilde/swift/lab/20090406-2330-72p9ale0 >> >> below that are dirs for the abe and qb logs coaster and gram logs. >> Abe had no gram log for this run. >> >> I suspect this one is worth looking at. > > Indeed. Can you paste your sites file? > > There's some oddity there. > > From wilde at mcs.anl.gov Tue Apr 7 00:15:23 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 00:15:23 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DAE026.3040909@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> Message-ID: <49DAE16B.6000508@mcs.anl.gov> Note on below: I used 2hr30min as the time to match Glen's time, for the runs in which he first saw the "imbalance". In my first tests,I had used 5 min for coasterWorkerMaxwalltime and specified no site or tc maxwalltime. 
I thought that would work, based on our earlier lengthy exchanges on this topic. But apparently coasters was calculating some default max walltime for "cat" and it gave me an error about insufficient time. I was trying to gather that along with several other anomalies in another report. On 4/7/09 12:09 AM, Michael Wilde wrote: > com$ cat abe+qb.xml > > > > > TG-CDA070002T > 8 > key="coasterWorkerMaxwalltime">02:30:00 > > jobManager="gt2:gt2:pbs" /> > > /u/ac/wilde/swiftwork > > > > > > TG-CDA070002T > 8 > key="coasterWorkerMaxwalltime">02:30:00 > > jobManager="gt2:gt2:pbs" /> > > /home/ux454325/swiftwork > > > > > com$ > > > On 4/7/09 12:09 AM, Mihael Hategan wrote: >> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote: >>> The latest rev shows a similar failure on the surface, but I think >>> different patterns in the coaster logs. >>> >>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped >>> outfile. >>> >>> This time 39 of 40 jobs ran on abe, and then the workflow lingered >>> and finally failed, with 39 ok, 1 failure. >>> >>> All the logs for this run are in >>> /home/wilde/swift/lab/20090406-2330-72p9ale0 >>> >>> below that are dirs for the abe and qb logs coaster and gram logs. >>> Abe had no gram log for this run. >>> >>> I suspect this one is worth looking at. >> >> Indeed. Can you paste your sites file? >> >> There's some oddity there.
>> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Apr 7 00:26:35 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 07 Apr 2009 00:26:35 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DAE16B.6000508@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> Message-ID: <1239081995.16125.8.camel@localhost> On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote: > Note on below: I used 2hr30min as the time to match Glen's time, for the > runs in which he first saw the "imbalance". > > In my first tests,I had used 5 min for coasterWorkerMaxwalltime and > specified no site or tc maxwalltime. I thought that would work, based on > our earlier lengthy exchanges on this topic. But apparantly coasters was > calculating some default max walltime for "cat" and it gave me an error > about insufficient time. Right. Previously it would just loop starting workers and then not using them because they didn't have enough time. The default walltime is 10 minutes. > I was trying to gather that alolng with several > other anomalies in another report. Now, the oddity below is that both coaster services are started with the same service id. Not only that, the same service id was used for subsequent runs (the bootstrap logs contain multiple "runs"). This, roughly, makes no sense, but I can't imagine it being cause for goodness. 
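Mihael's point about the insufficient-time error can be sketched as a one-line admission check: a job whose (possibly defaulted) walltime exceeds the worker's walltime can never be placed, which surfaces as "Job cannot be run with the given max walltime worker constraint". The function name and packaging below are illustrative, not the actual coaster source; only the 10-minute default and the 5-minute/02:30:00 worker times come from this thread.

```python
# Illustrative admission check, not the actual coaster code.
DEFAULT_JOB_WALLTIME_MIN = 10  # per Mihael: the default walltime is 10 minutes

def can_run_on_worker(job_walltime_min, worker_walltime_min):
    """True if the job fits inside the worker's walltime."""
    return job_walltime_min <= worker_walltime_min

# Mike's first test: 5-minute workers, "cat" defaulted to 10 minutes -> rejected
assert not can_run_on_worker(DEFAULT_JOB_WALLTIME_MIN, 5)
# With coasterWorkerMaxwalltime=02:30:00 (150 minutes) the job fits
assert can_run_on_worker(DEFAULT_JOB_WALLTIME_MIN, 150)
```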
> > > On 4/7/09 12:09 AM, Michael Wilde wrote: > > com$ cat abe+qb.xml > > > > > > > > > > TG-CDA070002T > > 8 > > > key="coasterWorkerMaxwalltime">02:30:00 > > > > > jobManager="gt2:gt2:pbs" /> > > > > /u/ac/wilde/swiftwork > > > > > > > > > > > > TG-CDA070002T > > 8 > > > key="coasterWorkerMaxwalltime">02:30:00 > > > > > jobManager="gt2:gt2:pbs" /> > > > > /home/ux454325/swiftwork > > > > > > > > > > com$ > > > > > > On 4/7/09 12:09 AM, Mihael Hategan wrote: > >> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote: > >>> The latest rev shows a similar failure on the surface, but I think > >>> different patterns in the coaster logs. > >>> > >>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped > >>> outfile. > >>> > >>> This time 39 of 40 jobs ran on abe, and then the workflow lingered > >>> and finally failed, with 39 ok, 1 failure. > >>> > >>> All the logs for this run are in > >>> /home/wilde/swift/lab/20090406-2330-72p9ale0 > >>> > >>> below that are dirs for the abe and qb logs coaster and gram logs. > >>> Abe had no gram log for this run. > >>> > >>> I suspect this one is worth looking at. > >> > >> Indeed. Can you paste your sites file? > >> > >> There's some oddity there. 
> >> > >> > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Apr 7 00:33:54 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 00:33:54 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <1239081995.16125.8.camel@localhost> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> Message-ID: <49DAE5C2.6070806@mcs.anl.gov> On 4/7/09 12:26 AM, Mihael Hategan wrote: > On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote: >> Note on below: I used 2hr30min as the time to match Glen's time, for the >> runs in which he first saw the "imbalance". >> >> In my first tests,I had used 5 min for coasterWorkerMaxwalltime and >> specified no site or tc maxwalltime. I thought that would work, based on >> our earlier lengthy exchanges on this topic. But apparantly coasters was >> calculating some default max walltime for "cat" and it gave me an error >> about insufficient time. > > Right. Previously it would just loop starting workers and then not using > them because they didn't have enough time. The default walltime is 10 > minutes. That makes sense then. The error I got was: 2009-04-06 20:52:35,397-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=cat-e3agg19j - Application exception: Job cannot be run with the given max walltime worker constraint The other few anomalies I saw I will ignore unless they happen again, as I was using the bad 3/31 revision. This was things like starting a new service with some strange default max time ("01:41:00" or 101 minutes) after the initial services were started with the correct time, and some strange error retry behavior. 
Bear with me - these things are very difficult and tedious to report. >> I was trying to gather that alolng with several >> other anomalies in another report. > > Now, the oddity below is that both coaster services are started with the > same service id. Not only that, the same service id was used for > subsequent runs (the bootstrap logs contain multiple "runs"). This, > roughly, makes no sense, but I can't imagine it being cause for > goodness. OK. Any chance I messed up copying log files (and duplicated one) or are you seeing the duplicate service id in truly distinct logs? (No need for reply - Im assuming if there was a chance I duplicated a log it would be obvious...) > >> >> On 4/7/09 12:09 AM, Michael Wilde wrote: >>> com$ cat abe+qb.xml >>> >>> >>> >>> >>> TG-CDA070002T >>> 8 >>> >> key="coasterWorkerMaxwalltime">02:30:00 >>> >>> >> jobManager="gt2:gt2:pbs" /> >>> >>> /u/ac/wilde/swiftwork >>> >>> >>> >>> >>> >>> TG-CDA070002T >>> 8 >>> >> key="coasterWorkerMaxwalltime">02:30:00 >>> >>> >> jobManager="gt2:gt2:pbs" /> >>> >>> /home/ux454325/swiftwork >>> >>> >>> >>> >>> com$ >>> >>> >>> On 4/7/09 12:09 AM, Mihael Hategan wrote: >>>> On Mon, 2009-04-06 at 23:56 -0500, Michael Wilde wrote: >>>>> The latest rev shows a similar failure on the surface, but I think >>>>> different patterns in the coaster logs. >>>>> >>>>> The workflow is 40 simple "cat" jobs, data.txt to a default-mapped >>>>> outfile. >>>>> >>>>> This time 39 of 40 jobs ran on abe, and then the workflow lingered >>>>> and finally failed, with 39 ok, 1 failure. >>>>> >>>>> All the logs for this run are in >>>>> /home/wilde/swift/lab/20090406-2330-72p9ale0 >>>>> >>>>> below that are dirs for the abe and qb logs coaster and gram logs. >>>>> Abe had no gram log for this run. >>>>> >>>>> I suspect this one is worth looking at. >>>> Indeed. Can you paste your sites file? >>>> >>>> There's some oddity there. 
>>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Tue Apr 7 00:39:14 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 07 Apr 2009 00:39:14 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DAE5C2.6070806@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <49DAE5C2.6070806@mcs.anl.gov> Message-ID: <1239082754.16125.12.camel@localhost> On Tue, 2009-04-07 at 00:33 -0500, Michael Wilde wrote: > > On 4/7/09 12:26 AM, Mihael Hategan wrote: > > On Tue, 2009-04-07 at 00:15 -0500, Michael Wilde wrote: > >> Note on below: I used 2hr30min as the time to match Glen's time, for the > >> runs in which he first saw the "imbalance". > >> > >> In my first tests,I had used 5 min for coasterWorkerMaxwalltime and > >> specified no site or tc maxwalltime. I thought that would work, based on > >> our earlier lengthy exchanges on this topic. But apparantly coasters was > >> calculating some default max walltime for "cat" and it gave me an error > >> about insufficient time. > > > > Right. Previously it would just loop starting workers and then not using > > them because they didn't have enough time. The default walltime is 10 > > minutes. > > That makes sense then. The error I got was: > > 2009-04-06 20:52:35,397-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=cat-e3agg19j - Application exception: Job cannot be run with the > given max walltime worker constraint > > The other few anomalies I saw I will ignore unless they happen again, as > I was using the bad 3/31 revision. 
This was things like starting a new > service with some strange default max time ("01:41:00" or 101 minutes) Not strange. 101 = 10 * 10 + 1 or DEFAULT_MAXWALLTIME * OVERALLOCATION_FACTOR + RESERVE. > after the initial services were started with the correct time, and some > strange error retry behavior. > > Bear with me - these things are very difficult and tedious to report. No problem. I'm glad you're exercising the code. From hategan at mcs.anl.gov Tue Apr 7 01:04:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 07 Apr 2009 01:04:22 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <1239081995.16125.8.camel@localhost> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> Message-ID: <1239084262.16125.18.camel@localhost> On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote: > > I was trying to gather that alolng with several > > other anomalies in another report. > > Now, the oddity below is that both coaster services are started with the > same service id. Not only that, the same service id was used for > subsequent runs (the bootstrap logs contain multiple "runs"). This, > roughly, makes no sense, but I can't imagine it being cause for > goodness. That was just another one of my brilliant ideas. It was dimmed a bit in cog r2369. Previous to that, and after the big fiddle with the bootstrap script a while ago, multi-site coaster runs are broken. From benc at hawaga.org.uk Tue Apr 7 03:37:31 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 7 Apr 2009 08:37:31 +0000 (GMT) Subject: [Swift-devel] Expected behavior for scheduler slow-start with coasters? 
In-Reply-To: <49DA87DD.1010704@mcs.anl.gov> References: <49DA7B7A.6070802@mcs.anl.gov> <1239055703.6821.11.camel@localhost> <49DA82F6.9090502@mcs.anl.gov> <1239057803.8721.4.camel@localhost> <49DA87DD.1010704@mcs.anl.gov> Message-ID: > Is there a simpler, alternate scheduler algorithm that you could plug in as a > global, settable alternative to the current one when all sites are using > coasters? You can set the initialScore profile key very high[1] so that Swift will start at full load rather than low load. This is basically "the simpler, alternative scheduler algorithm" that you are looking for. You will however run into a different manifestation of the same problem that coastersPerNode does not work properly and will likely attempt to massively overallocate workers. It's not a bug in the scheduler - it's a bug in the implementation of coastersPerNode that causes it to attempt to allocate one node per excess job. In the longer term, as Mihael said, the interface between the scheduler and execution systems needs to change because coasters don't fit in the present abstraction very well. [1] (to about 100 - the actual formula is rather opaque and I have to rederive it every time because I never write it down) -- From qinz at ihpc.a-star.edu.sg Tue Apr 7 04:07:20 2009 From: qinz at ihpc.a-star.edu.sg (Qin Zheng) Date: Tue, 7 Apr 2009 17:07:20 +0800 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> Message-ID: Prof Foster, thanks for introducing me to the team. My research interest is on scheduling workflows (DAGs). Ben, we decided not to use resubmission in the consideration that a DAG cannot be completed when any of its tasks fails, which each time would trigger the resubmission/retry of the DAG. Instead, we use fault tolerance by pre-scheduling a replica (backup) for each task (see enclosure for details).
The objective is to guarantee that this DAG can be completed (in a preplanned manner with fast failover to the backup upon failure) before its deadline. Currently I am also working on workflow scheduling under uncertainties of task running times. This work includes priorities tasks based on the impact of the variation of its running time on the overall response time and offline planning for high-priority tasks as well as runtime adaptation for all tasks once up-to-date information is available. I am looking forward to talking to you guys and knowing your research! Regards, Qin Zheng ________________________________ From: Ian Foster [mailto:foster at anl.gov] Sent: Monday, April 06, 2009 10:46 PM To: Ben Clifford Cc: swift-devel; Qin Zheng Subject: Re: [Swift-devel] Re: replication vs site score Ben: You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here. I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection. Ian. On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote: even more rambling... in the context of a scheduler that is doing things like prioritising jobs based on more than the order that Swift happened to submit them (hopefully I will have a student for this in the summer), I think a replicant job should be pushed toward later execution rather than earlier execution to reduce the number of replicant jobs in the system at any one time. This is because I suspect (though I have gathered no numerical evidence) that given the choice between submitting a fresh job and a replicant job (making up terminology here too... mmm), it is almost always better to submit the fresh job. 
Either we end up submitting the replicant job eventually (in which case we are no worse off than if we submitted the replicant first and then a fresh job); or by delaying the replicant job we give that replicant's original a chance to start running and thus do not discard our precious time-and-load-dollars that we have already spent on queueing that replicant's original. -- _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel ________________________________ This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Fault Tolerance_TC_Mar09.pdf Type: application/pdf Size: 2142133 bytes Desc: Fault Tolerance_TC_Mar09.pdf URL: From wilde at mcs.anl.gov Tue Apr 7 06:09:15 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 06:09:15 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <1239082754.16125.12.camel@localhost> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <49DAE5C2.6070806@mcs.anl.gov> <1239082754.16125.12.camel@localhost> Message-ID: <49DB345B.3030406@mcs.anl.gov> >> The other few anomalies I saw I will ignore unless they happen again, as >> I was using the bad 3/31 revision. This was things like starting a new >> service with some strange default max time ("01:41:00" or 101 minutes) > > Not strange. 
101 = 10 * 10 + 1 or DEFAULT_MAXWALLTIME * > OVERALLOCATION_FACTOR + RESERVE. I assumed 1:41 was derived from some formula. The unexpected behavior here was that it looked like a job was submitted by coasters that ignored the specified coasterWorkerMaxwalltime, after the initial jobs honored it. But again, the code base was suspect. I'll keep an eye out for it happening again. From wilde at mcs.anl.gov Tue Apr 7 06:13:47 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 06:13:47 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <1239084262.16125.18.camel@localhost> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> Message-ID: <49DB356B.4050808@mcs.anl.gov> putting Glen back on cc: Multi-site coaster runs will not work until Mihael posts a fix. On 4/7/09 1:04 AM, Mihael Hategan wrote: > On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote: > >>> I was trying to gather that alolng with several >>> other anomalies in another report. >> Now, the oddity below is that both coaster services are started with the >> same service id. Not only that, the same service id was used for >> subsequent runs (the bootstrap logs contain multiple "runs"). This, >> roughly, makes no sense, but I can't imagine it being cause for >> goodness. > > That was just another one of my brilliant ideas. It was dimmed a bit in > cog r2369. Previous to that, and after the big fiddle with the bootstrap > script a while ago, multi-site coaster runs are broken. 
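The 101-minute figure discussed above (101 = 10 * 10 + 1) can be reproduced with a short calculation. The constant names are the ones Mihael quotes; wrapping them in a function is purely illustrative.

```python
# Worker walltime over-allocation as Mihael describes it:
# DEFAULT_MAXWALLTIME * OVERALLOCATION_FACTOR + RESERVE, in minutes.
DEFAULT_MAXWALLTIME = 10   # minutes (the default job walltime)
OVERALLOCATION_FACTOR = 10
RESERVE = 1                # minutes

def requested_worker_walltime(job_walltime_min=DEFAULT_MAXWALLTIME):
    return job_walltime_min * OVERALLOCATION_FACTOR + RESERVE

minutes = requested_worker_walltime()
assert minutes == 101                  # the "01:41:00" Mike saw
assert divmod(minutes, 60) == (1, 41)  # 1 hour 41 minutes
```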
> From benc at hawaga.org.uk Tue Apr 7 06:30:59 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 7 Apr 2009 11:30:59 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> Message-ID: Hi. Most/all of the work that we've done with Swift works with fairly opportunistic use of resources - we submit work into job queues on one or more sites, where those job queues are shared with many other users, and where the runtimes for both our jobs and other users jobs are not well defined ahead of time. So whilst we use the word 'scheduling' sometimes in Swift, its more a case of "what do we think is the best site to queue a job on right now?" rather than making an execution plan that we think will be valid for a long period of time. Our replication mechanism sounds fairly similar to your pre-scheduled backups, but I think there are these important differences: * we don't launch a replica until we think there is a reasonable chance that the replica will run instead of the original (based on queue time) * as soon as one of the jobs *starts* running, we cancel all the others. from what I understand, you do that when one of the jobs *ends* successfully. We do have one situation where we have some pre-allocation of resources, and that is when coasters are being used. These use the above opportunistic queuing methods to acquire a worker node for a long period of time, and then runs Swift level jobs in there, at present on a first-come first-serve basis. Its likely that we'll change that to have some other job prioritisation, but still pre-scheduling the jobs. Where Swift would have trouble working with an ahead-of-time planner/scheduler is that the module that generates file transfer and execution tasks from high level SwiftScripts does not submit a dependent task for scheduling and execution until its predecessors have been successfully executed. 
What the scheduler sees is a stream, over time, of file transfer and execution tasks that are safe to run immediately. It might be easy, or it might be hard, to make the Swift code submit more eagerly, with description of task dependencies, which would allow you to plug in a pre-planner underneath. On Tue, 7 Apr 2009, Qin Zheng wrote: > Prof Foster, thanks for introducing me to the team. > > My research interest is on scheduling workflows (DAGs). Ben, we decided > not to use resubmission in the consideration that a DAG cannot be > completed when any of its tasks fails, which each time would trigger the > resubmission\retry of the DAG. Instead, we use fault tolerance by > pre-scheduling replica (backup) for each task (see enclosure for > details). The objective is to guarantee that this DAG can be completed > (in a preplanned manner with fast failover to the backup upon failure) > before its deadline. > > Currently I am also working on workflow scheduling under uncertainties > of task running times. This work includes priorities tasks based on the > impact of the variation of its running time on the overall response time > and offline planning for high-priority tasks as well as runtime > adaptation for all tasks once up-to-date information is available. > > I am looking forward to talking to you guys and knowing your research! > > Regards, > Qin Zheng > ________________________________ > From: Ian Foster [mailto:foster at anl.gov] > Sent: Monday, April 06, 2009 10:46 PM > To: Ben Clifford > Cc: swift-devel; Qin Zheng > Subject: Re: [Swift-devel] Re: replication vs site score > > Ben: > > You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here. > > I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. 
His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection. > > Ian. > > > On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote: > > > even more rambling... in the context of a scheduler that is doing things > like prioritising jobs based on more than the order that Swift happened to > submit them (hopefully I will have a student for this in the summer), I > think a replicant job should be pushed toward later execution rather than > earlier execution to reduce the number of replicant jobs in the system at > any one time. > > This is because I suspect (though I have gathered no numerical evidence) > that given the choice between submitting a fresh job and a replicant job > (making up terminology here too... mmm), it is almost always better to > submit the fresh job. Either we end up submitting the replicant job > eventually (in which case we are no worse off than if we submitted the > replicant first and then a fresh job); or by delaying the replicant job we > give that replicant's original a chance to start running and thus do not > discard our precious time-and-load-dollars that we have already spent on > queueing that replicant's original. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > ________________________________ > This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you. 
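Ben's two-point summary of Swift's replication mechanism earlier in this thread — don't launch a replica until the original has queued long enough that the replica has a reasonable chance of running instead, and cancel all other copies as soon as any copy starts — can be sketched roughly as below. The class, labels, and threshold rule are made up for illustration; Swift's actual implementation differs.

```python
# Illustrative sketch of the replication policy Ben describes;
# names and the queue-time threshold are invented for the example.
class ReplicatedJob:
    def __init__(self, expected_queue_time):
        self.expected_queue_time = expected_queue_time
        self.copies = ["original"]   # submitted copies, by label
        self.cancelled = []
        self.running = None

    def maybe_replicate(self, queue_time):
        # Only submit a replica once the original has waited long enough
        # that the replica has a reasonable chance of running first.
        if self.running is None and queue_time > self.expected_queue_time:
            self.copies.append("replica-%d" % len(self.copies))

    def on_start(self, copy):
        # The first copy to *start* wins; all others are cancelled.
        self.running = copy
        self.cancelled = [c for c in self.copies if c != copy]

job = ReplicatedJob(expected_queue_time=60)
job.maybe_replicate(queue_time=30)  # too early: no replica yet
job.maybe_replicate(queue_time=90)  # original has over-waited: replicate
job.on_start("replica-1")           # the replica starts first; original cancelled
```

Note the contrast Ben draws: cancellation here happens when a copy *starts*, whereas Qin Zheng's scheme keeps backups until a copy *ends* successfully.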
> From foster at anl.gov Tue Apr 7 07:33:02 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 7 Apr 2009 07:33:02 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DB356B.4050808@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> Message-ID: <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov> Is there a description somewhere of the algorithms used for starting coasters and submitting jobs to them? Ian. From benc at hawaga.org.uk Tue Apr 7 07:36:10 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 7 Apr 2009 12:36:10 +0000 (GMT) Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov> Message-ID: On Tue, 7 Apr 2009, Ian Foster wrote: > Is there a description somewhere of the algorithms used for starting coasters > and submitting jobs to them? Plenty in the archives of this list, I expect. Basically: if a job arrives and there is a free coaster slot, launch a new coaster worker. If there is no free coaster slot existing for it, launch a new coaster worker. 
-- From wilde at mcs.anl.gov Tue Apr 7 07:42:25 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 07:42:25 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov> Message-ID: <49DB4A31.80108@mcs.anl.gov> This contains a lot of the startup details: http://wiki.cogkit.org/wiki/Coasters On 4/7/09 7:36 AM, Ben Clifford wrote: > On Tue, 7 Apr 2009, Ian Foster wrote: > >> Is there a description somewhere of the algorithms used for starting coasters >> and submitting jobs to them? > > Plenty in the archives of this list, I expect. > > Basically: if a job arrives and there is a free coaster slot, launch a new > coaster worker. If there is no free coaster slot existing for it, launch a > new coaster worker. > From benc at hawaga.org.uk Tue Apr 7 07:47:39 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 7 Apr 2009 12:47:39 +0000 (GMT) Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DB4A31.80108@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov> <49DB4A31.80108@mcs.anl.gov> Message-ID: On Tue, 7 Apr 2009, Michael Wilde wrote: > This contains a lot of the startup details: > > http://wiki.cogkit.org/wiki/Coasters Would be good to link to that from the Swift user guide. 
-- From wilde at mcs.anl.gov Tue Apr 7 08:03:47 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 08:03:47 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov> <49DB4A31.80108@mcs.anl.gov> Message-ID: <49DB4F33.5040502@mcs.anl.gov> done. (but not tested) On 4/7/09 7:47 AM, Ben Clifford wrote: > > On Tue, 7 Apr 2009, Michael Wilde wrote: > >> This contains a lot of the startup details: >> >> http://wiki.cogkit.org/wiki/Coasters > > Would be good to link to that from the Swift user guide. > From wilde at mcs.anl.gov Tue Apr 7 08:14:54 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 08:14:54 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DB4F33.5040502@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> <26D2255E-D8B2-4816-A3D5-FF26743E519C@anl.gov> <49DB4A31.80108@mcs.anl.gov> <49DB4F33.5040502@mcs.anl.gov> Message-ID: <49DB51CE.3090502@mcs.anl.gov> On 4/7/09 8:03 AM, Michael Wilde wrote: > done. (but not tested) but i should have. fixed, *and* tested. > > On 4/7/09 7:47 AM, Ben Clifford wrote: >> >> On Tue, 7 Apr 2009, Michael Wilde wrote: >> >>> This contains a lot of the startup details: >>> >>> http://wiki.cogkit.org/wiki/Coasters >> >> Would be good to link to that from the Swift user guide. 
>> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Apr 7 10:08:10 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 07 Apr 2009 10:08:10 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <49DB356B.4050808@mcs.anl.gov> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> Message-ID: <1239116890.18531.0.camel@localhost> On Tue, 2009-04-07 at 06:13 -0500, Michael Wilde wrote: > putting Glen back on cc: Multi-site coaster runs will not work until > Mihael posts a fix. What I'm saying below is that the fix is in cog r2369. > > On 4/7/09 1:04 AM, Mihael Hategan wrote: > > On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote: > > > >>> I was trying to gather that alolng with several > >>> other anomalies in another report. > >> Now, the oddity below is that both coaster services are started with the > >> same service id. Not only that, the same service id was used for > >> subsequent runs (the bootstrap logs contain multiple "runs"). This, > >> roughly, makes no sense, but I can't imagine it being cause for > >> goodness. > > > > That was just another one of my brilliant ideas. It was dimmed a bit in > > cog r2369. Previous to that, and after the big fiddle with the bootstrap > > script a while ago, multi-site coaster runs are broken. 
> > From wilde at mcs.anl.gov Tue Apr 7 10:13:23 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 10:13:23 -0500 Subject: [Swift-devel] Imbalanced scheduling with coasters and multiple sites In-Reply-To: <1239116890.18531.0.camel@localhost> References: <49DA9004.8010409@mcs.anl.gov> <49DAD5C9.7080607@mcs.anl.gov> <49DADD16.2010507@mcs.anl.gov> <1239080984.16125.1.camel@localhost> <49DAE026.3040909@mcs.anl.gov> <49DAE16B.6000508@mcs.anl.gov> <1239081995.16125.8.camel@localhost> <1239084262.16125.18.camel@localhost> <49DB356B.4050808@mcs.anl.gov> <1239116890.18531.0.camel@localhost> Message-ID: <49DB6D93.7010900@mcs.anl.gov> Cool. I interpreted your note below as meaning it's still broken; I didn't realize that r2369 was the latest. Got it, and am building now for Glen and me to test. I'll re-run the "cats" test. On 4/7/09 10:08 AM, Mihael Hategan wrote: > On Tue, 2009-04-07 at 06:13 -0500, Michael Wilde wrote: >> putting Glen back on cc: Multi-site coaster runs will not work until >> Mihael posts a fix. > > What I'm saying below is that the fix is in cog r2369. > >> On 4/7/09 1:04 AM, Mihael Hategan wrote: >>> On Tue, 2009-04-07 at 00:26 -0500, Mihael Hategan wrote: >>> >>>>> I was trying to gather that alolng with several >>>>> other anomalies in another report. >>>> Now, the oddity below is that both coaster services are started with the >>>> same service id. Not only that, the same service id was used for >>>> subsequent runs (the bootstrap logs contain multiple "runs"). This, >>>> roughly, makes no sense, but I can't imagine it being cause for >>>> goodness. >>> That was just another one of my brilliant ideas. It was dimmed a bit in >>> cog r2369. Previous to that, and after the big fiddle with the bootstrap >>> script a while ago, multi-site coaster runs are broken. 
> >>> > From wilde at mcs.anl.gov Tue Apr 7 17:36:01 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 07 Apr 2009 17:36:01 -0500 Subject: [Swift-devel] Possible problem in coaster data transfer Message-ID: <49DBD551.3000201@mcs.anl.gov> It looks as if something in swift is garbling data files. We see this when trying coaster data transfer to circumvent a problem that the abe gridftp server was reporting (when using gridftp data transfer). The oops "pdt" file is the main output of the simulation (the coordinates of each atom in the folded protein). These files should have very regular multi-column lines, but in a few we see garbled lines. This is in run: ci:/home/hockyg/oops/swift/output/abeoutdir.20 These files range from 1.5MB to 3MB in this test. There's one per job, 50 files in this run. The lines on top are normal; the lines on the bottom are long due to file corruption. We've used coaster transfer off and on; we usually do gridftp transfer and were using coaster transfer in this case while Mihael debugs a problem that's manifesting as a gridftp error. Glen suspected he saw such corruption earlier; this run seems to confirm it. I'm not inclined to go deep into this at the moment, but rather to say that we'll stick to gridftp transfer for the duration of this paper writing effort. 
- Mike TOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147 0.047 1.00 0.00 ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147 0.047 1.00 0.00 ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147 0.047 1.00 0.00 ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147 0.047 1.00 0.00 ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147 0.047 1.00 0.00 ATOM 337 N GLN 0 57 135 CA ALA 0 23 -17.316 8.147 0.047 1.00 0.00 com$ com$ awk 'length($0) > 150 {print $0}' `find | grep pdt` com$ awk 'length($0) > 120 {print $0}' `find | grep pdt` ATOM 335 C ASN 0 56 -34.964 1.477 15.043 1.00 0.00 0 13 -2.528 18.017 -1.927 1.00 0.00 ATOM 335 C ASN 0 56 -21.516 -6.860 -31.404 1.00 0.00 91 C ASN 0 32 -10.865 31.809 -15.581 1.00 0.00 ATOM 404 HN ALA 0 68 -10.285 -33.690 -26.233 1.00 0.00 135 CA ALA 0 23 12.808 -6.713 -11.148 1.00 0.00 ATOM 335 C ASN 0 56 0.510 -30.608 0.783 1.00 0.00 LEU 0 2 0.505 3.186 -1.484 1.00 0.00 ATOM 335 C ASN 0 56 -3.155 25.367 -4.095 1.00 83 C ALA 0 64 5.541 11.559 -1.063 1.00 0.00 ATOM 404 HN ALA 0 68 66.525 32.704 -21.958 GLN 0 57 135 CA ALA 0 23 19.234 14.087 -7.779 1.00 0.00 ATOM 335 C ASN 0 56 16.926 -3.414 -5.774 1.00 0.00 EU 0 43 13.554 22.230 19.827 1.00 0.00 ATOM 335 C ASN 0 56 14.805 34.413 23.907 1.00 0.00 59 -18.300 2.743 -27.536 1.00 0.00 ATOM 335 C ASN 0 56 19.787 15.477 24.896 1.00 0.00 0 13 9.613 11.599 -1.295 1.00 0.00 ATOM 404 HN ALA 0 68 11.882 -14.3 337 N GLN 0 57 135 CA ALA 0 23 21.798 -14.600 -6.379 1.00 0.00 ATOM 112 HA2 GLY 0 19 3.632 -11.142 -24.657 1.00 0. 315 CA LEU 0 53 0.180 -22.479 -33.671 1.00 0.00 ATOM 335 C ASN 0 56 -3.145 -30.419 -39.260 1.00 E 0 66 -8.925 -40.775 -24.402 1.00 0.00 com$ From aespinosa at cs.uchicago.edu Wed Apr 8 02:12:16 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 8 Apr 2009 02:12:16 -0500 Subject: [Swift-devel] jobs finishes but swift reports "execution failed". 
Message-ID: <50b07b4b0904080012u668c6921y9012ac066c8156f7@mail.gmail.com> Waiting for notification for 0 ms Received notification with 1 messages Progress: Submitted:1 Active:1 Progress: uninitialized:1 Finished successfully:2 Execution failed: Could not find any valid host for task "Task(type=UNKNOWN, identity=urn:cog-1239170783751)" with constraints {filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 53f253f2, filenames=[Ljava.lang.String;@53945394, trfqn=cat, tr=cat} probably in one of the staging components cog2365 swift 2824 on surveyor BGP the modifications made are just the conversion of "|" to "^". Right, Zhao? log: http://www.ci.uchicago.edu/~aespinosa/swift/blast-20090408-0144-evyvbf93.log -- Allan M. Espinosa PhD student, Computer Science University of Chicago From qinz at ihpc.a-star.edu.sg Wed Apr 8 03:03:34 2009 From: qinz at ihpc.a-star.edu.sg (Qin Zheng) Date: Wed, 8 Apr 2009 16:03:34 +0800 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> Message-ID: Dear Ben, Thanks for your detailed reply; it helps me understand scheduling in Swift better. I wrote from a researcher's perspective, and I understand that for development there are many more practical issues, which are more challenging. I agree with you that scheduling a task after its parents complete is cost effective. It is the best "time" given all the updated info on the completion times of its parents. Also, it makes DAG submission easy (without dependency description) and minimizes the number of job instances in queues. The concern is that at this time, the task still needs to be submitted to a queue and wait. This may not be sufficient for workflows with deadlines, where a certain delivery guarantee in response time is necessary. The same applies for other remaining tasks in the workflow. I felt that besides offline planning, runtime adaptation is necessary, considering task duration variation (overrun) and faults. 
But the number of updates should be kept to a minimum and only for the very near future as the workflow proceeds. I am writing a paper on this and hopefully I could share it with you guys in a few weeks. This implies that the Swift code could be submitted a little bit more eagerly with a short-sighted look ahead. Yes, your points on the differences are valid and the replica in my case is used for FT while in Swift it could enable a task to run earlier (by submitting a replica at a short queue). You mentioned queue time; can you share more on it, for example its accuracy, and also the change to have some other job prioritization for coasters? I will be on a star cruise to Malaysia in a few hours :). If I cannot access email there, I will reply to you guys on Friday when I return to Singapore. Qin Zheng -----Original Message----- From: Ben Clifford [mailto:benc at hawaga.org.uk] Sent: Tuesday, April 07, 2009 7:31 PM To: Qin Zheng Cc: Ian Foster; swift-devel Subject: RE: [Swift-devel] Re: replication vs site score Hi. Most/all of the work that we've done with Swift works with fairly opportunistic use of resources - we submit work into job queues on one or more sites, where those job queues are shared with many other users, and where the runtimes for both our jobs and other users' jobs are not well defined ahead of time. So whilst we use the word 'scheduling' sometimes in Swift, it's more a case of "what do we think is the best site to queue a job on right now?" rather than making an execution plan that we think will be valid for a long period of time. Our replication mechanism sounds fairly similar to your pre-scheduled backups, but I think there are these important differences: * we don't launch a replica until we think there is a reasonable chance that the replica will run instead of the original (based on queue time) * as soon as one of the jobs *starts* running, we cancel all the others. From what I understand, you do that when one of the jobs *ends* successfully. 
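The two bullets describe a policy: leave the original queued until it has waited longer than the site's queue-time estimate suggested, then submit a replica, and cancel every outstanding copy the moment any one of them starts. A small Python sketch of that policy follows; the threshold test, replica cap, and class names are assumptions for illustration, not Swift's actual replication code:

```python
# Hypothetical sketch of queue-time-based replication as described above:
# replicate when the observed wait exceeds the expected wait, and cancel
# all sibling copies as soon as any one copy starts running.

class ReplicatedJob:
    def __init__(self, expected_queue_time, max_replicas=3):
        self.expected = expected_queue_time
        self.max_replicas = max_replicas
        self.submitted = []          # submit timestamps of live copies
        self.running = False

    def submit(self, now):
        self.submitted.append(now)

    def maybe_replicate(self, now):
        """Submit another copy if the oldest has waited past the estimate."""
        if self.running or not self.submitted:
            return False
        waited = now - min(self.submitted)
        if waited > self.expected and len(self.submitted) < self.max_replicas:
            self.submitted.append(now)
            return True
        return False

    def on_start(self, started_index):
        # As soon as one copy starts, cancel all the others.
        self.running = True
        cancelled = len(self.submitted) - 1
        self.submitted = [self.submitted[started_index]]
        return cancelled
```

The key difference from the pre-scheduled-backup scheme is visible in `on_start`: cancellation happens when a copy *starts*, not when one *ends* successfully.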
We do have one situation where we have some pre-allocation of resources, and that is when coasters are being used. These use the above opportunistic queuing methods to acquire a worker node for a long period of time, and then run Swift-level jobs in there, at present on a first-come, first-served basis. It's likely that we'll change that to have some other job prioritisation, but still pre-scheduling the jobs. Where Swift would have trouble working with an ahead-of-time planner/scheduler is that the module that generates file transfer and execution tasks from high level SwiftScripts does not submit a dependent task for scheduling and execution until its predecessors have been successfully executed. What the scheduler sees is a stream, over time, of file transfer and execution tasks that are safe to run immediately. It might be easy, or it might be hard, to make the Swift code submit more eagerly, with description of task dependencies, which would allow you to plug in a pre-planner underneath. On Tue, 7 Apr 2009, Qin Zheng wrote: > Prof Foster, thanks for introducing me to the team. > > My research interest is on scheduling workflows (DAGs). Ben, we decided > not to use resubmission in the consideration that a DAG cannot be > completed when any of its tasks fails, which each time would trigger the > resubmission\retry of the DAG. Instead, we use fault tolerance by > pre-scheduling replica (backup) for each task (see enclosure for > details). The objective is to guarantee that this DAG can be completed > (in a preplanned manner with fast failover to the backup upon failure) > before its deadline. > > Currently I am also working on workflow scheduling under uncertainties > of task running times. This work includes prioritizing tasks based on the > impact of the variation of its running time on the overall response time > and offline planning for high-priority tasks as well as runtime > adaptation for all tasks once up-to-date information is available. 
> > I am looking forward to talking to you guys and knowing your research! > > Regards, > Qin Zheng > ________________________________ > From: Ian Foster [mailto:foster at anl.gov] > Sent: Monday, April 06, 2009 10:46 PM > To: Ben Clifford > Cc: swift-devel; Qin Zheng > Subject: Re: [Swift-devel] Re: replication vs site score > > Ben: > > You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here. > > I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection. > > Ian. > > > On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote: > > > even more rambling... in the context of a scheduler that is doing things > like prioritising jobs based on more than the order that Swift happened to > submit them (hopefully I will have a student for this in the summer), I > think a replicant job should be pushed toward later execution rather than > earlier execution to reduce the number of replicant jobs in the system at > any one time. > > This is because I suspect (though I have gathered no numerical evidence) > that given the choice between submitting a fresh job and a replicant job > (making up terminology here too... mmm), it is almost always better to > submit the fresh job. Either we end up submitting the replicant job > eventually (in which case we are no worse off than if we submitted the > replicant first and then a fresh job); or by delaying the replicant job we > give that replicant's original a chance to start running and thus do not > discard our precious time-and-load-dollars that we have already spent on > queueing that replicant's original. 
> > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > ________________________________ > This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you. From benc at hawaga.org.uk Wed Apr 8 04:48:31 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 8 Apr 2009 09:48:31 +0000 (GMT) Subject: [Swift-devel] jobs finishes but swift reports "execution failed". In-Reply-To: <50b07b4b0904080012u668c6921y9012ac066c8156f7@mail.gmail.com> References: <50b07b4b0904080012u668c6921y9012ac066c8156f7@mail.gmail.com> Message-ID: that looks to me like you have tc.data entries for mockblast but not for cat. -- From benc at hawaga.org.uk Wed Apr 8 07:20:41 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 8 Apr 2009 12:20:41 +0000 (GMT) Subject: [Swift-devel] Possible problem in coaster data transfer In-Reply-To: <49DBD551.3000201@mcs.anl.gov> References: <49DBD551.3000201@mcs.anl.gov> Message-ID: if you do decide to dig deeper, you can turn on sitedir.keep in swift.properties and check that the file in the remote shared directory is uncorrupted for the same run that the staged out copy appears corrupted. -- From hockyg at uchicago.edu Wed Apr 8 09:03:50 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Wed, 08 Apr 2009 09:03:50 -0500 Subject: [Swift-devel] Possible problem in coaster data transfer In-Reply-To: References: <49DBD551.3000201@mcs.anl.gov> Message-ID: <49DCAEC6.1090905@uchicago.edu> Here you go. 
Same file from the remote site and from communicado after transfer by coasterIO Ben Clifford wrote: > if you do decide to dig deeper, you can turn on sitedir.keep in > swift.properties and check that the file in the remote shared directory is > uncorrupted for the same run that the staged out copy appears corrupted. > > -------------- next part -------------- A non-text attachment was scrubbed... Name: pdt_ci.gz Type: application/gzip Size: 46773 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: abe_pdt.gz Type: application/gzip Size: 576854 bytes Desc: not available URL: From hategan at mcs.anl.gov Wed Apr 8 10:03:07 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 08 Apr 2009 10:03:07 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> Message-ID: <1239202987.12586.17.camel@localhost> On Wed, 2009-04-08 at 16:03 +0800, Qin Zheng wrote: > Dear Ben, > > Thanks for your detailed reply and it helps me understand scheduling in Swift better. > > I wrote from a researcher perspective and I understand that for > development, there are much more practical issues and are more > challenging. I agree with you that scheduling a task after its parents > completes is cost effective. It is the best "time" given all the > updated info on the completion times of its parents. Also, it makes > DAG submission easy (without dependency description) and minimizes the > number of job instances in queues. The main reasoning was that it can be dealt with efficiently and that planning the whole workflow buys us little in a (very) dynamic environment in which submitting a job one minute later may mean the difference between 1 minute of queue time and one hour of queue time (though that's statistically a rare occurrence). > The concern is that at this time, the task still needs to be > submitted in queue and wait. 
This may not be sufficient for workflows > with deadlines, where certain delivery guarantee in response time is > necessary. You need some SLA/QOS to address that. Guessing the average queue time does not reduce its variation, and hence the risk of not finishing it by the time promised. You can use replication (i.e. race competing jobs) to reduce that variation (assuming that it follows some reasonable distribution), but I don't see how there could be a guarantee. > The same applies for other remaining tasks in the workflow. > > I felt besides offline planning, runtime adaptation is necessary > considering task duration variation (overrun) and faults. But the > number of updates should be kept minimum and only for the very near > future as the workflow proceeds. I am writing a paper on this and > hopefully I could share it with you guys in a few weeks. This implies > that the Swift code could be submitted a little bit more eagerly with > a short-sighted look ahead. I remember somebody mentioning (or having implemented) a similar scheme. If we have dependent jobs a and b, in swift that would go something like: Qa + Ra + Qb + Rb (where Qx is the queuing time and Rx the run time of job x) But there's also the possibility of submitting B earlier by the average queue time or less and then having it wait until A produces its results. But then glide-ins/coasters, that's pretty much what they do. 
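Mihael's sum can be made concrete: submitting b only after a completes costs Qa + Ra + Qb + Rb, while submitting b early lets its queue wait overlap a's time in the system, hiding up to Qb of the total; the cost is a worker that may come up before a's output exists and sit idle, which is essentially the trade glide-ins/coasters make. A toy comparison with made-up numbers (the `head_start` parameter is illustrative, not anything Swift exposes):

```python
# Toy comparison of serial vs. overlapped submission for dependent jobs
# a -> b.  Qx = queue wait, Rx = run time.  All numbers are illustrative.

def serial_makespan(Qa, Ra, Qb, Rb):
    # b is submitted only after a completes
    return Qa + Ra + Qb + Rb

def overlapped_makespan(Qa, Ra, Qb, Rb, head_start):
    # b is submitted 'head_start' time units before a completes; b cannot
    # start useful work before a's output exists, so b effectively begins
    # at whichever is later: its slot becoming free, or a finishing.
    b_slot_ready = (Qa + Ra - head_start) + Qb
    a_done = Qa + Ra
    return max(b_slot_ready, a_done) + Rb

Qa, Ra, Qb, Rb = 30, 10, 30, 10
assert serial_makespan(Qa, Ra, Qb, Rb) == 80
# submitting b a full average queue time early hides Qb entirely:
assert overlapped_makespan(Qa, Ra, Qb, Rb, head_start=30) == 50
```

With `head_start=0` the two formulas coincide, which is the serial case Mihael starts from.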
Mihael From benc at hawaga.org.uk Wed Apr 8 10:08:04 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 8 Apr 2009 15:08:04 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239202987.12586.17.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> Message-ID: On Wed, 8 Apr 2009, Mihael Hategan wrote: This: > planning the whole workflow buys us little in a (very) dynamic > environment in which submitting a job one minute later may mean the > difference between 1 minute of queue time and one hour of queue time and this: > You need some SLA/QOS to address that. seem to be significant characteristics that make the environments we run on not amenable to scheduling in the traditional sense. The lack of any meaningful guarantees about almost anything time-related makes everything basically opportunistic rather than scheduled. -- From hategan at mcs.anl.gov Wed Apr 8 14:53:28 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 08 Apr 2009 14:53:28 -0500 Subject: [Swift-devel] updates Message-ID: <1239220408.15551.4.camel@localhost> There are some fixes in cog r2381, most notably: - gridftp sessions were sometimes left in a messed state leading to subsequent transfers throwing obscure errors - coaster workers were left in an inconsistent state when jobs submitted to them exceeded their walltimes and the remaining runtime of the workers - an alleged fix for "qsub not found". This tied in to our earlier problems with finding executables. Even though, for example, java was found using bash -l, the process wasn't subsequently started using bash -l, leading to qsub not being in the path. The current scheme assumes that either everything needed can be found using bash -l or everything needed can be found without bash -l. I suppose some corner cases may still exist, but they seem unlikely. 
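The "qsub not found" failure Mihael describes boils down to a PATH mismatch: the executable probe ran under a login shell (bash -l, which reads login-only startup files that extend PATH) while the worker process was later started without one. The mismatch can be illustrated in Python with a stub executable and two explicit PATH values standing in for the two shell modes; this is a hypothetical sketch, not the cog code:

```python
import os
import shutil
import stat
import tempfile

# Sketch of the probe/launch PATH mismatch described above: an executable
# visible under a login shell's extended PATH is invisible to a process
# started with the plain PATH.  The directories are stand-ins for what a
# site's login-only startup files would normally add.

site_bin = tempfile.mkdtemp()                 # pretend this is the PBS bin dir
stub = os.path.join(site_bin, "qsub")
with open(stub, "w") as f:
    f.write("#!/bin/sh\nexit 0\n")
os.chmod(stub, stat.S_IRWXU)                  # make the stub executable

login_path = os.pathsep.join([site_bin, "/usr/bin", "/bin"])  # PATH after profile
plain_path = os.pathsep.join(["/usr/bin", "/bin"])            # PATH without it

found_by_probe = shutil.which("qsub", path=login_path)   # probe under "bash -l"
found_at_launch = shutil.which("qsub", path=plain_path)  # launch without it
# The probe succeeds but the launch would fail; the fix described above is
# to use the same shell mode for both the probe and the actual launch.
```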
From hategan at mcs.anl.gov Wed Apr 8 15:44:44 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 08 Apr 2009 15:44:44 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DD0B3F.7050903@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> Message-ID: <1239223485.26815.2.camel@localhost> On Wed, 2009-04-08 at 13:38 -0700, Ioan Raicu wrote: > Does a batch-queue prediction service help things in any way? > https://portal.teragrid.org/gridsphere/gridsphere?cid=queue-prediction > > I've always wondered how the Swift scheduler would behave differently > if it had statistical information about queue times. It would help. Statistically. > Qin, have you compared your job replication strategy with one that > was cognizant of the expected wait queue time, in order to meet > deadlines? On the surface, assuming that the batch queue prediction is > accurate, it would seem that scheduling with known queue times might > solve the same deadline cognizant scheduling problem, but without > wasting resources by unnecessary replication. The replication isn't unnecessary. If it starts, it starts because the queue time is larger than the expected queue time. > The place where the queue prediction doesn't help, is when there is a > bad node which causes an application to be slow or fail. No. The prediction doesn't help when it fails to predict accurately. > In this case, replication is probably the better recourse to > guarantee meeting deadlines. From iraicu at cs.uchicago.edu Wed Apr 8 15:38:23 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 08 Apr 2009 13:38:23 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> Message-ID: <49DD0B3F.7050903@cs.uchicago.edu> Does a batch-queue prediction service help things in any way? 
https://portal.teragrid.org/gridsphere/gridsphere?cid=queue-prediction I've always wondered how the Swift scheduler would behave differently if it had statistical information about queue times. Qin, have you compared your job replication strategy with one that was cognizant of the expected wait queue time, in order to meet deadlines? On the surface, assuming that the batch queue prediction is accurate, it would seem that scheduling with known queue times might solve the same deadline cognizant scheduling problem, but without wasting resources by unnecessary replication. The place where the queue prediction doesn't help, is when there is a bad node which causes an application to be slow or fail. In this case, replication is probably the better recourse to guarantee meeting deadlines. Here is their latest paper on this: http://www.springerlink.com/content/7552901360631246/fulltext.pdf. The system is deployed on the TeraGrid, and has been for a few years now. As far as I have heard, it is quite robust and accurate. Cheers, Ioan Ben Clifford wrote: > On Wed, 8 Apr 2009, Mihael Hategan wrote: > > This: > > >> planning the whole workflow buys us little in a (very) dynamic >> environment in which submitting a job one minute later may mean the >> difference between 1 minute of queue time and one hour of queue time >> > > and this: > > >> You need some SLA/QOS to address that. >> > > seem to be significant characteristics that make the environments we run > on not amenable to scheduling in the traditional sense. The lack of any > meaningful guarantees about almost anything time-related makes everything > basically opportunistic rather than scheduled. > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Wed Apr 8 15:46:28 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 08 Apr 2009 13:46:28 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239223485.26815.2.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost> Message-ID: <49DD0D24.2010909@cs.uchicago.edu> Mihael Hategan wrote: >> The place where the queue prediction doesn't help, is when there is a >> bad node which causes an application to be slow or fail. >> > > No. The prediction doesn't help when it fails to predict accurately. > > The prediction that I was referring to was only for the queue time, not the execution time. A failed node, causing an application run time to be longer than expected, has no impact on the prediction of the wait queue time. Ioan -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Apr 8 15:54:30 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 08 Apr 2009 15:54:30 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DD0D24.2010909@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost> <49DD0D24.2010909@cs.uchicago.edu> Message-ID: <1239224070.27089.4.camel@localhost> On Wed, 2009-04-08 at 13:46 -0700, Ioan Raicu wrote: > > > Mihael Hategan wrote: > > > The place where the queue prediction doesn't help, is when there is a > > > bad node which causes an application to be slow or fail. > > > > > > > No. The prediction doesn't help when it fails to predict accurately. > > > > > The prediction that I was referring to was only for the queue time, > not the execution time. A failed node, causing an application run time > to be longer than expected, has no impact on the prediction of the > wait queue time. You're right. I was trying to say that fundamentally the problem of uncertainty in queue times will remain by virtue of the fact that the times when people submit jobs (as well as the amount of jobs) is unpredictable and it can affect other people's job queue times. The predictor in the paper answers the question "if you were to submit your job before the state of the queue changes in any way, what would be the expected queue time for the job" and not "what will be the queue time for the job". 
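The distinction Mihael draws (a snapshot estimate conditional on the current queue state versus the wait a job actually sees) matters for how such predictors are scored. A predictor that publishes an upper bound at a stated confidence is counted as correct whenever the realized wait lands at or under the bound, however loose the bound was. A small sketch of that scoring rule, with made-up numbers in hours:

```python
# Scoring an upper-bound (quantile) queue-wait predictor: a prediction
# is a hit if the actual wait does not exceed the predicted bound.
# The values below are illustrative, not measured data.

def hit_rate(predicted_bounds, actual_waits):
    hits = sum(1 for p, a in zip(predicted_bounds, actual_waits) if a <= p)
    return hits / len(actual_waits)

predicted = [11.2, 2.4, 0.5, 1.0]   # published bounds, hours
actual    = [0.25, 0.25, 0.75, 0.5] # realized waits, hours

rate = hit_rate(predicted, actual)
# A 95%-confidence predictor is well calibrated if, over many jobs, about
# 95% of actual waits fall under the bound -- note that grossly loose
# bounds still count as "hits" while being useless for site selection.
```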
From iraicu at cs.uchicago.edu Wed Apr 8 15:58:10 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 08 Apr 2009 13:58:10 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239224070.27089.4.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost> <49DD0D24.2010909@cs.uchicago.edu> <1239224070.27089.4.camel@localhost> Message-ID: <49DD0FE2.3000505@cs.uchicago.edu> Mihael Hategan wrote: > > You're right. I was trying to say that fundamentally the problem of > uncertainty in queue times will remain by virtue of the fact that the > times when people submit jobs (as well as the amount of jobs) is > unpredictable and it can affect other people's job queue times. > > The predictor in the paper answers the question "if you were to submit > your job before the state of the queue changes in any way, what would be > the expected queue time for the job" and not "what will be the queue > time for the job". > > Yes, it's possible that between a query of the prediction and the actual submission, the state of the queues changes, and therefore the actual result changes. But every prediction comes with some error bounds, so it's possible that the change in queue state might be reflected in the error bars. Nevertheless, I think it might be an interesting improvement to the current Swift scheduler. Ben, was this on the list of Google summer of code projects? If not, perhaps you might want to add it. Ioan -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Apr 8 16:32:00 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 08 Apr 2009 16:32:00 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DD0FE2.3000505@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost> <49DD0D24.2010909@cs.uchicago.edu> <1239224070.27089.4.camel@localhost> <49DD0FE2.3000505@cs.uchicago.edu> Message-ID: <1239226320.3974.5.camel@localhost> On Wed, 2009-04-08 at 13:58 -0700, Ioan Raicu wrote: > > > Mihael Hategan wrote: > > > > You're right. I was trying to say that fundamentally the problem of > > uncertainty in queue times will remain by virtue of the fact that the > > times when people submit jobs (as well as the amount of jobs) is > > unpredictable and it can affect other people's job queue times. > > > > The predictor in the paper answers the question "if you were to submit > > your job before the state of the queue changes in any way, what would be > > the expected queue time for the job" and not "what will be the queue > > time for the job". > > > > > Yes, its possible that between a query of prediction, and actual > submission, the state of the queues change, and therefore the actual > result change. But, every prediction comes with some error bounds, so > its possible that the change in queue state, might be reflected in the > error bars. I don't know... 
The system predicted that a 2 minute job on Abe would sit 11.2 hours in the queue and 2.4 hours on QueenBee, but I've run 20 such jobs on both in the past 15 minutes. From iraicu at cs.uchicago.edu Wed Apr 8 18:00:37 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 08 Apr 2009 16:00:37 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239226320.3974.5.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost> <49DD0D24.2010909@cs.uchicago.edu> <1239224070.27089.4.camel@localhost> <49DD0FE2.3000505@cs.uchicago.edu> <1239226320.3974.5.camel@localhost> Message-ID: <49DD2C95.3030706@cs.uchicago.edu> Aha, but I think the predictions are upper bounds, not upper and lower bounds. In essence, when they predict that your job will wait for 11.2 hours, with 95% confidence, and your job runs in 15 minutes, then in no way have they made a prediction in error. Now, if they had predicted 1 minute, and it took 15 minutes, then it would have been an error. It is possible that they do not use knowledge of back-filling, which would make small jobs run immediately, although they would predict a long queue wait time, as if no back-filling is enabled. It's not clear how customized the predictor is to the scheduler and features of the LRM, so there is certainly room for being pessimistic on their predictions. Ioan Mihael Hategan wrote: > On Wed, 2009-04-08 at 13:58 -0700, Ioan Raicu wrote: > >> Mihael Hategan wrote: >> >>> You're right. I was trying to say that fundamentally the problem of >>> uncertainty in queue times will remain by virtue of the fact that the >>> times when people submit jobs (as well as the amount of jobs) is >>> unpredictable and it can affect other people's job queue times.
>>> >>> The predictor in the paper answers the question "if you were to submit >>> your job before the state of the queue changes in any way, what would be >>> the expected queue time for the job" and not "what will be the queue >>> time for the job". >>> >>> >>> >> Yes, its possible that between a query of prediction, and actual >> submission, the state of the queues change, and therefore the actual >> result change. But, every prediction comes with some error bounds, so >> its possible that the change in queue state, might be reflected in the >> error bars. >> > > I don't know... The system predicted that a 2 minute job on Abe would > sit 11.2 hours in the queue and 2.4 hours on QueenBee, but I've ran 20 > such jobs on both in the past 15 minutes. > > > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Wed Apr 8 21:16:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 08 Apr 2009 21:16:58 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DD2C95.3030706@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost> <49DD0D24.2010909@cs.uchicago.edu> <1239224070.27089.4.camel@localhost> <49DD0FE2.3000505@cs.uchicago.edu> <1239226320.3974.5.camel@localhost> <49DD2C95.3030706@cs.uchicago.edu> Message-ID: <1239243418.17988.39.camel@localhost> On Wed, 2009-04-08 at 16:00 -0700, Ioan Raicu wrote: > Aha, but I think the predictions are upper bounds, not upper and lower > bounds. In essence, when they predict that your job will wait for 11.2 > hours, with 95% confidence, and your job runs in 15 minutes, then in > no way have they made a prediction in error. Heh. "It's not even wrong". From foster at anl.gov Thu Apr 9 06:27:31 2009 From: foster at anl.gov (Ian Foster) Date: Thu, 9 Apr 2009 06:27:31 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DD0B3F.7050903@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> Message-ID: <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> Hi, I wanted to point out that when we use Falkon/coasters, we have full control over scheduling, so in that case we could in principle pre- compute schedules. However, in practice we still don't tend to have enough information about execution times for this to be that useful. At least that's my belief. I assume that estimates of queue time bounds would surely be helpful for determining where to send things, and whether a job was stuck. Ian. 
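[The asymmetry discussed above — a 95%-confidence prediction is an upper bound, so a job starting early does not falsify it, while a job waiting past the bound suggests trouble — together with Ian's point about using bound estimates for site selection and stuck-job detection, can be sketched as follows. The predictor interface, site names, and numbers here are hypothetical, not from any real Swift or QBETS code.]

```python
# Hypothetical sketch: using a predicted upper bound on queue wait time
# (in hours) both to rank sites and to flag a job as possibly stuck.
# Site names and bound values are illustrative only.

def pick_site(bounds):
    """Given {site: predicted upper bound on queue wait}, pick the
    site with the smallest bound."""
    return min(bounds, key=bounds.get)

def looks_stuck(waited_hours, bound_hours):
    """An upper-bound prediction is only falsified in one direction:
    a job that starts early is consistent with the bound, but a job
    that has waited past the bound is evidence something is wrong."""
    return waited_hours > bound_hours

bounds = {"Abe": 11.2, "QueenBee": 2.4}
site = pick_site(bounds)                     # prefers QueenBee
assert not looks_stuck(0.25, bounds[site])   # started in 15 min: consistent
assert looks_stuck(3.0, bounds[site])        # waited past the bound: suspect
```

In this framing, the 15-minute runs on Abe and QueenBee are compatible with the 11.2-hour and 2.4-hour predictions, which is exactly why such bounds are hard to act on for replication decisions.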
From hategan at mcs.anl.gov Thu Apr 9 10:30:54 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 10:30:54 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> Message-ID: <1239291054.32146.14.camel@localhost> On Thu, 2009-04-09 at 06:27 -0500, Ian Foster wrote: > Hi, > > I wanted to point out that when we use Falkon/coasters, we have full > control over scheduling, Once we get the nodes, yes. > so in that case we could in principle pre- > compute schedules. However, in practice we still don't tend to have > enough information about execution times for this to be that useful. > At least that's my belief. > > I assume that estimates of queue time bounds would surely be helpful > for determining where to send things, and whether a job was stuck. > > Ian. From benc at hawaga.org.uk Thu Apr 9 10:30:37 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 9 Apr 2009 15:30:37 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> Message-ID: On Thu, 9 Apr 2009, Ian Foster wrote: > I wanted to point out that when we use Falkon/coasters, we have full control > over scheduling, so in that case we could in principle pre-compute schedules. Coasters as they are now are still allocated on an opportunistic basis, so once we have a coaster stuff could be scheduled to it, but when coaster workers actually exist is as unknown as when jobs will run in the non-coaster case, I think. 
Where Falkon has been used for pre-allocated resources on machines, with no dynamic allocation/unallocation, though, the available resources probably are known well enough for this. > However, in practice we still don't tend to have enough information about > execution times for this to be that useful. At least that's my belief. yes. -- From hategan at mcs.anl.gov Thu Apr 9 10:49:25 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 10:49:25 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> Message-ID: <1239292165.339.1.camel@localhost> On Thu, 2009-04-09 at 15:30 +0000, Ben Clifford wrote: > On Thu, 9 Apr 2009, Ian Foster wrote: > > > I wanted to point out that when we use Falkon/coasters, we have full control > > over scheduling, so in that case we could in principle pre-compute schedules. > > Coasters as they are now are still allocated on an opportunistic basis, so > once we have a coaster stuff could be scheduled to it, but when coaster > workers actually exist is as unknown as when jobs will run in the > non-coaster case, I think. > > Where Falkon has been used for pre-allocated resources on machines, with > no dynamic allocation/unallocation, though, the available resources > probably are known well enough for this. Except when using pre-allocated resources, you are still waiting for them, but the waiting is not automated. > > > However, in practice we still don't tend to have enough information about > > execution times for this to be that useful. At least that's my belief. > > yes. 
> From benc at hawaga.org.uk Thu Apr 9 10:49:50 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 9 Apr 2009 15:49:50 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239292165.339.1.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> Message-ID: On Thu, 9 Apr 2009, Mihael Hategan wrote: > Except when using pre-allocated resources, you are still waiting for > them, but the waiting is not automated. Also you have chosen to not attempt to opportunistically get any more once you have decided you have waited enough. -- From hategan at mcs.anl.gov Thu Apr 9 10:57:14 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 10:57:14 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> Message-ID: <1239292634.406.6.camel@localhost> On Thu, 2009-04-09 at 15:49 +0000, Ben Clifford wrote: > On Thu, 9 Apr 2009, Mihael Hategan wrote: > > > Except when using pre-allocated resources, you are still waiting for > > them, but the waiting is not automated. > > Also you have chosen to not attempt to opportunistically get any more once > you have decided you have waited enough. > Right. Overall it leads to inefficiencies and wasted cpu-hours, but it gives you a known set of resources, which is valuable. I think the known set of resources part can be achieved anyway if there was that back-channel mentioned in random chatter that informed swift about the nodes available. 
From iraicu at cs.uchicago.edu Thu Apr 9 11:49:33 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 09:49:33 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> Message-ID: <49DE271D.8030003@cs.uchicago.edu> Right, Falkon supports both static and dynamic allocation of resources. I believe coaster only supports dynamic allocation of resources. We have lots of information under static allocation, that could help scheduling, but under dynamic allocation, there is a mixture of known information (the already allocated resources) and the unknown (the jobs in the wait queue). In a sense, a smarter scheduler could make use of at least known information, although this information might frequently change, and the scheduler would have to adapt frequently. Ioan Ben Clifford wrote: > On Thu, 9 Apr 2009, Ian Foster wrote: > > >> I wanted to point out that when we use Falkon/coasters, we have full control >> over scheduling, so in that case we could in principle pre-compute schedules. >> > > Coasters as they are now are still allocated on an opportunistic basis, so > once we have a coaster stuff could be scheduled to it, but when coaster > workers actually exist is as unknown as when jobs will run in the > non-coaster case, I think. > > Where Falkon has been used for pre-allocated resources on machines, with > no dynamic allocation/unallocation, though, the available resources > probably are known well enough for this. > > >> However, in practice we still don't tend to have enough information about >> execution times for this to be that useful. At least that's my belief. >> > > yes. > > -- =================================================== Ioan Raicu, Ph.D. 
=================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at anl.gov Thu Apr 9 11:51:27 2009 From: foster at anl.gov (Ian Foster) Date: Thu, 9 Apr 2009 11:51:27 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DE271D.8030003@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <49DE271D.8030003@cs.uchicago.edu> Message-ID: <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov> I didn't appreciate that about Coaster. It should (IMHO) support static allocation, as a special case. People will clearly want that. On Apr 9, 2009, at 11:49 AM, Ioan Raicu wrote: > Right, Falkon supports both static and dynamic allocation of > resources. I believe coaster only supports dynamic allocation of > resources. We have lots of information under static allocation, that > could help scheduling, but under dynamic allocation, there is a > mixture of known information (the already allocated resources) and > the unknown (the jobs in the wait queue). In a sense, a smarter > scheduler could make use of at least known information, although > this information might frequently change, and the scheduler would > have to adapt frequently. 
> > Ioan > > Ben Clifford wrote: >> >> On Thu, 9 Apr 2009, Ian Foster wrote: >> >> >>> I wanted to point out that when we use Falkon/coasters, we have >>> full control >>> over scheduling, so in that case we could in principle pre-compute >>> schedules. >>> >> Coasters as they are now are still allocated on an opportunistic >> basis, so >> once we have a coaster stuff could be scheduled to it, but when >> coaster >> workers actually exist is as unknown as when jobs will run in the >> non-coaster case, I think. >> >> Where Falkon has been used for pre-allocated resources on machines, >> with >> no dynamic allocation/unallocation, though, the available resources >> probably are known well enough for this. >> >> >>> However, in practice we still don't tend to have enough >>> information about >>> execution times for this to be that useful. At least that's my >>> belief. >>> >> yes. >> >> > > -- > =================================================== > Ioan Raicu, Ph.D. > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Apr 9 11:56:30 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 11:56:30 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <49DE271D.8030003@cs.uchicago.edu> <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov> Message-ID: <1239296190.1717.1.camel@localhost> On Thu, 2009-04-09 at 11:51 -0500, Ian Foster wrote: > I didn't appreciate that about Coaster. It should (IMHO) support > static allocation, as a special case. People will clearly want that. Yes. People clearly make irrational choices. > > > > On Apr 9, 2009, at 11:49 AM, Ioan Raicu wrote: > > > Right, Falkon supports both static and dynamic allocation of > > resources. I believe coaster only supports dynamic allocation of > > resources. We have lots of information under static allocation, that > > could help scheduling, but under dynamic allocation, there is a > > mixture of known information (the already allocated resources) and > > the unknown (the jobs in the wait queue). In a sense, a smarter > > scheduler could make use of at least known information, although > > this information might frequently change, and the scheduler would > > have to adapt frequently. > > > > Ioan > > > > Ben Clifford wrote: > > > On Thu, 9 Apr 2009, Ian Foster wrote: > > > > > > > > > > I wanted to point out that when we use Falkon/coasters, we have full control > > > > over scheduling, so in that case we could in principle pre-compute schedules. > > > > > > > Coasters as they are now are still allocated on an opportunistic basis, so > > > once we have a coaster stuff could be scheduled to it, but when coaster > > > workers actually exist is as unknown as when jobs will run in the > > > non-coaster case, I think. 
> > > > > > Where Falkon has been used for pre-allocated resources on machines, with > > > no dynamic allocation/unallocation, though, the available resources > > > probably are known well enough for this. > > > > > > > > > > However, in practice we still don't tend to have enough information about > > > > execution times for this to be that useful. At least that's my belief. > > > > > > > yes. > > > > > > > > > > -- > > =================================================== > > Ioan Raicu, Ph.D. > > =================================================== > > Distributed Systems Laboratory > > Computer Science Department > > University of Chicago > > 1100 E. 58th Street, Ryerson Hall > > Chicago, IL 60637 > > =================================================== > > Email: iraicu at cs.uchicago.edu > > Web: http://www.cs.uchicago.edu/~iraicu > > http://dev.globus.org/wiki/Incubator/Falkon > > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > > =================================================== > > =================================================== > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Thu Apr 9 11:52:48 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 09:52:48 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239292165.339.1.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> Message-ID: <49DE27E0.2020100@cs.uchicago.edu> What we usually do on the BG/P when using static provisioning, is that Swift does not 
start until the resources have been allocated, and that is because Falkon does not start until the resources are allocated. The whole process is automated, in terms of Swift waiting for Falkon to start, and Falkon waiting to start after resources get allocated. So, at Swift startup, in a static provisioned case, Swift could have all the information it might need, such as number of processors, number of sites, load (i.e. idle, as the resources are dedicated to Swift), etc. Ioan Mihael Hategan wrote: > On Thu, 2009-04-09 at 15:30 +0000, Ben Clifford wrote: > >> On Thu, 9 Apr 2009, Ian Foster wrote: >> >> >>> I wanted to point out that when we use Falkon/coasters, we have full control >>> over scheduling, so in that case we could in principle pre-compute schedules. >>> >> Coasters as they are now are still allocated on an opportunistic basis, so >> once we have a coaster stuff could be scheduled to it, but when coaster >> workers actually exist is as unknown as when jobs will run in the >> non-coaster case, I think. >> >> Where Falkon has been used for pre-allocated resources on machines, with >> no dynamic allocation/unallocation, though, the available resources >> probably are known well enough for this. >> > > Except when using pre-allocated resources, you are still waiting for > them, but the waiting is not automated. > > >>> However, in practice we still don't tend to have enough information about >>> execution times for this to be that useful. At least that's my belief. >>> >> yes. >> >> > > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Apr 9 12:01:53 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 12:01:53 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DE27E0.2020100@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <49DE27E0.2020100@cs.uchicago.edu> Message-ID: <1239296513.1717.7.camel@localhost> On Thu, 2009-04-09 at 09:52 -0700, Ioan Raicu wrote: > What we usually do on the BG/P when using static provisioning, is that > Swift does not start until the resources have been allocated, and that > is because Falkon does not start until the resources are allocated. > The whole process is automated, in terms of Swift waiting for Falkon > to start, and Falkon waiting to start after resources get allocated. > So, at Swift startup, in a static provisioned case, Swift could have > all the information it might need, such as number of processors, > number of sites, load (i.e. idle, as the resources are dedicated to > Swift), etc. You seem to be describing the scenario of not submitting a job unless you know it will be executed immediately because there is an active worker for it. Which I agree with. 
What I don't get is (and this is what I understand by "static provisioning") where the benefit is in having a barrier that waits for all requested workers to start, given that some workers will start before others and will invariably have to sit idle until all workers are started. From iraicu at cs.uchicago.edu Thu Apr 9 11:56:39 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 09:56:39 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239292634.406.6.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <1239292634.406.6.camel@localhost> Message-ID: <49DE28C7.5080002@cs.uchicago.edu> Assuming you have more work to do than the static resources allocated, you are not wasting any resources. The workflow will run until the resources are de-allocated, and whatever was not completed will get rescheduled on the next round of static resources allocated. As far as I know, this is the usage pattern of the static resource allocation on the BG/P for the few regular users, who are running several jobs per day, where each job is a static resource allocation of 1K~10K processors for several hours each. Their parameter space is large enough that they keep doing this over and over again, and they still have more work to do! Ioan Mihael Hategan wrote: > On Thu, 2009-04-09 at 15:49 +0000, Ben Clifford wrote: > >> On Thu, 9 Apr 2009, Mihael Hategan wrote: >> >> >>> Except when using pre-allocated resources, you are still waiting for >>> them, but the waiting is not automated. >>> >> Also you have chosen to not attempt to opportunistically get any more once >> you have decided you have waited enough. >> >> > > Right. Overall it leads to inefficiencies and wasted cpu-hours, but it > gives you a known set of resources, which is valuable.
I think the known > set of resources part can be achieved anyway if there was that > back-channel mentioned in random chatter that informed swift about the > nodes available. > > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Thu Apr 9 12:04:10 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 10:04:10 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239296190.1717.1.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <49DE271D.8030003@cs.uchicago.edu> <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov> <1239296190.1717.1.camel@localhost> Message-ID: <49DE2A8A.4060705@cs.uchicago.edu> There are use cases where static resource allocation are better than dynamic ones. Again, we come back to the BG/P system. There is a policy that only allows you to submit X number of jobs to Cobalt, and X is < 10 jobs. Now, if you want to allocate resources dynamically, in smaller chunks, you are limited to only a few jobs. Static provisioning all of a sudden seems attractive. Another thing that you have to remember, that for some systems, like the BG/P, getting 2 allocations of 64 nodes each, is not the same as getting 1 allocation of 128 nodes. 
The 1 single allocation of 128 nodes has networking configured in such a way to allow node-to-node communication efficiently. The 2 separate allocations, could be allocated in completely opposite ends of the system, and hence having poor networking properties to do node-to-node communication, between the separate allocations (if its even possible, I am not sure, the networks might be completely separate). This might not be important for vanilla Swift, but some of the MTDM work (previously known as collective I/O) relies on good network connectivity between any node in the allocation to pass data around and avoiding GPFS. Ioan Mihael Hategan wrote: > On Thu, 2009-04-09 at 11:51 -0500, Ian Foster wrote: > >> I didn't appreciate that about Coaster. It should (IMHO) support >> static allocation, as a special case. People will clearly want that. >> > > Yes. People clearly make irrational choices. > > >> >> On Apr 9, 2009, at 11:49 AM, Ioan Raicu wrote: >> >> >>> Right, Falkon supports both static and dynamic allocation of >>> resources. I believe coaster only supports dynamic allocation of >>> resources. We have lots of information under static allocation, that >>> could help scheduling, but under dynamic allocation, there is a >>> mixture of known information (the already allocated resources) and >>> the unknown (the jobs in the wait queue). In a sense, a smarter >>> scheduler could make use of at least known information, although >>> this information might frequently change, and the scheduler would >>> have to adapt frequently. >>> >>> Ioan >>> >>> Ben Clifford wrote: >>> >>>> On Thu, 9 Apr 2009, Ian Foster wrote: >>>> >>>> >>>> >>>>> I wanted to point out that when we use Falkon/coasters, we have full control >>>>> over scheduling, so in that case we could in principle pre-compute schedules. 
>>>>> >>>>> >>>> Coasters as they are now are still allocated on an opportunistic basis, so >>>> once we have a coaster stuff could be scheduled to it, but when coaster >>>> workers actually exist is as unknown as when jobs will run in the >>>> non-coaster case, I think. >>>> >>>> Where Falkon has been used for pre-allocated resources on machines, with >>>> no dynamic allocation/unallocation, though, the available resources >>>> probably are known well enough for this. >>>> >>>> >>>> >>>>> However, in practice we still don't tend to have enough information about >>>>> execution times for this to be that useful. At least that's my belief. >>>>> >>>>> >>>> yes. >>>> >>>> >>>> >>> -- >>> =================================================== >>> Ioan Raicu, Ph.D. >>> =================================================== >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> =================================================== >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dev.globus.org/wiki/Incubator/Falkon >>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >>> =================================================== >>> =================================================== >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Apr 9 12:08:45 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 12:08:45 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DE28C7.5080002@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <1239292634.406.6.camel@localhost> <49DE28C7.5080002@cs.uchicago.edu> Message-ID: <1239296925.1717.14.camel@localhost> On Thu, 2009-04-09 at 09:56 -0700, Ioan Raicu wrote: > Assuming you have more work to do, than the static resources > allocated, you are not wasting any resources. The workflow will run > until the resources are de-allocated, and whatever was not completed, > will get rescheduled on the next round of static resources allocated. Right. As opposed to the system figuring out that there is more work and having workers ready appropriately. > As far as I know, this is the usage pattern of the static resource > allocation on the BG/P for the few regular users, that are running > several jobs per day, where each job is a static resource allocation > of 1K~10K processors for several hours each. Their parameter space is > large enough that they keep doing this over and over again, and they > still more work to do! 
Which seems to show that, for static provisioning to work efficiently, you need to fit work exactly into the pre-allocated resources, or have pre-allocated resources that exactly fit your work load. I'm still not sure why you would choose this instead of "allocate resources on demand and de-allocate resources when you stop needing them". From iraicu at cs.uchicago.edu Thu Apr 9 12:12:00 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 10:12:00 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239296513.1717.7.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <49DE27E0.2020100@cs.uchicago.edu> <1239296513.1717.7.camel@localhost> Message-ID: <49DE2C60.7050905@cs.uchicago.edu> Mihael Hategan wrote: > What I don't get is (and this is what I understand by "static > provisioning") where is the benefit in having a barrier that waits for > all requested workers to start, given that some workers will start > before others and will invariably have to sit idle until all workers are > started. > > No workers sit idle, waiting for other workers to start. The resource allocation takes some amount of time to boot up the OS on each node, mount GPFS, start the Falkon service, start the Falkon workers, etc... see http://dev.globus.org/wiki/Image:Falkon-BGP-startup-time.jpg. It's true that there is some difference between the 1st worker starting and the last worker starting, probably on the order of seconds to maybe minutes at the largest scale of 160K processors. If the idle time as the system starts up is a concern, you can start Swift before 100% of the system is operational. The system is partitioned into 64-node chunks, so, in theory, Swift could start as soon as 64 nodes are online. Although, this could also have its own problems. 
It's not clear to me how dynamic the sites.xml file is. The location of the Falkon services is placed in the sites.xml file. Let's take the example of a 4096-processor run, which would have 16 Falkon services when it's 100% allocated. That means 16 entries in the sites.xml. If we wait for all 16 entries, we might waste a few idle cycles. If we start when the 1st entry is in the sites.xml (the first 64 nodes), and later the sites.xml file is updated with the remaining 15 entries, will Swift re-read the sites.xml and figure out that there are additional sites to consider? How often does Swift re-read the sites.xml? If it does not re-read it, then in the current setup, we cannot do this, and have to wait for all resources to be 100% allocated before we start. Ioan > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Thu Apr 9 12:14:18 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 10:14:18 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239296925.1717.14.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <1239292634.406.6.camel@localhost> <49DE28C7.5080002@cs.uchicago.edu> <1239296925.1717.14.camel@localhost> Message-ID: <49DE2CEA.7050909@cs.uchicago.edu> Add another constraint, that you only 
have 6 jobs that you can submit to the LRM queue, are you still as optimistic about using dynamic resource provisioning? Mihael Hategan wrote: > On Thu, 2009-04-09 at 09:56 -0700, Ioan Raicu wrote: > >> Assuming you have more work to do, than the static resources >> allocated, you are not wasting any resources. The workflow will run >> until the resources are de-allocated, and whatever was not completed, >> will get rescheduled on the next round of static resources allocated. >> > > Right. As opposed to the system figuring out that there is more work and > having workers ready appropriately. > > >> As far as I know, this is the usage pattern of the static resource >> allocation on the BG/P for the few regular users, that are running >> several jobs per day, where each job is a static resource allocation >> of 1K~10K processors for several hours each. Their parameter space is >> large enough that they keep doing this over and over again, and they >> still more work to do! >> > > Which, seems to show that, for static provisioning to work efficiently, > you need to fit work exactly into the pre-allocated resources, or have > pre-allocated resources to exactly fit your work load. > > I'm still not sure why you would choose this instead of "allocate > resources on demand and de-allocate resources when you stop needing > them". > > > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Apr 9 12:16:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 12:16:22 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DE2A8A.4060705@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <49DE271D.8030003@cs.uchicago.edu> <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov> <1239296190.1717.1.camel@localhost> <49DE2A8A.4060705@cs.uchicago.edu> Message-ID: <1239297382.1717.23.camel@localhost> On Thu, 2009-04-09 at 10:04 -0700, Ioan Raicu wrote: > There are use cases where static resource allocation are better than > dynamic ones. Again, we come back to the BG/P system. There is a > policy that only allows you to submit X number of jobs to Cobalt, and > X is < 10 jobs. Now, if you want to allocate resources dynamically, in > smaller chunks, you are limited to only a few jobs. Static > provisioning all of a sudden seems attractive. It's a valid scenario and a valid solution, but asserting that it's the only solution or that it's the best solution seems inappropriate. A better solution is to dynamically allocate workers in larger blocks if you don't have arbitrary granularity on the allocation sizes. It provides the middle ground that meets the scheduling system constraints and minimizes inefficiencies. 
> > Another thing that you have to remember, that for some systems, like > the BG/P, getting 2 allocations of 64 nodes each, is not the same as > getting 1 allocation of 128 nodes. The 1 single allocation of 128 > nodes has networking configured in such a way to allow node-to-node > communication efficiently. The 2 separate allocations, could be > allocated in completely opposite ends of the system, and hence having > poor networking properties to do node-to-node communication, between > the separate allocations (if its even possible, I am not sure, the > networks might be completely separate). This might not be important > for vanilla Swift, but some of the MTDM work (previously known as > collective I/O) relies on good network connectivity between any node > in the allocation to pass data around and avoiding GPFS. I'm not sure what dynamic vs. static allocation of workers, in principle, has to do with the implementation hurdles of CIO on the BG/P. Different systems have different constraints. Dynamic allocation can be made to adapt to those constraints. 
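[Editorial note: the "dynamically allocate workers in larger blocks" middle ground proposed in this message can be sketched concretely. The helper below is hypothetical, not actual Coaster or Falkon code; it just shows how a node request could be split into the fewest-constrained blocks under an LRM job-count cap, so each block can still be released independently.]

```python
def plan_blocks(nodes_needed, max_jobs):
    """Split a request for `nodes_needed` nodes into at most `max_jobs`
    LRM jobs, keeping blocks as small (and thus as independently
    releasable) as the job-count cap allows."""
    if nodes_needed <= 0 or max_jobs <= 0:
        return []
    jobs = min(max_jobs, nodes_needed)
    base, extra = divmod(nodes_needed, jobs)
    # `extra` blocks get one additional node so the sizes sum exactly.
    return [base + 1] * extra + [base] * (jobs - extra)

# With a Cobalt-style limit of fewer than 10 queued jobs, 128 nodes
# cannot be allocated one node per job; 8 blocks of 16 nodes fit.
print(plan_blocks(128, 8))   # [16, 16, 16, 16, 16, 16, 16, 16]
print(plan_blocks(10, 4))    # [3, 3, 2, 2]
```

Single-node allocation is the `max_jobs >= nodes_needed` special case, and a single static block is the `max_jobs == 1` case, which is one way to read "middle ground".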
From hategan at mcs.anl.gov Thu Apr 9 12:22:49 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 12:22:49 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DE2C60.7050905@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <49DE27E0.2020100@cs.uchicago.edu> <1239296513.1717.7.camel@localhost> <49DE2C60.7050905@cs.uchicago.edu> Message-ID: <1239297769.1717.27.camel@localhost> On Thu, 2009-04-09 at 10:12 -0700, Ioan Raicu wrote: > > Mihael Hategan wrote: > > Why I don't get is (and this is what I understand by "static > > provisioning") where is the benefit in having a barrier that waits for > > all requested workers to start, given that some workers will start > > before others and will invariably have to sit idle until all workers are > > started. > > > > > No workers sit idle, waiting for other workers to start. The resource > allocation takes some amount of time to boot up the OS on each node, > mount GPFS, start Falkon service, start Falkon workers, etc... see > http://dev.globus.org/wiki/Image:Falkon-BGP-startup-time.jpg. Its true > that there is some difference between the 1st worker starting, and the > last worker starting, probably on the order of seconds to maybe minutes > at the largest scale of 160K processors. If this is a concern, the idle > time as the system starts up, you can start Swift before 100% of the > system is operational. The system is partitioned in 64 node chunks, so, > in theory, Swift could start as soon as 64 nodes are online. Although, > this could also have its own problems. This assumes a single site and exact knowledge of how to fit the workload. I also assume this works when you have a reservation, otherwise you may have better chances with smaller chunks. 
From hategan at mcs.anl.gov Thu Apr 9 12:23:59 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 12:23:59 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DE2CEA.7050909@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <1239292634.406.6.camel@localhost> <49DE28C7.5080002@cs.uchicago.edu> <1239296925.1717.14.camel@localhost> <49DE2CEA.7050909@cs.uchicago.edu> Message-ID: <1239297839.1717.30.camel@localhost> On Thu, 2009-04-09 at 10:14 -0700, Ioan Raicu wrote: > Add another constraint, that you only have 6 jobs that you can submit > to the LRM queue, are you still as optimistic about using dynamic > resource provisioning? 6 jobs is better than one job. From foster at anl.gov Thu Apr 9 12:24:29 2009 From: foster at anl.gov (Ian Foster) Date: Thu, 9 Apr 2009 12:24:29 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239297769.1717.27.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <49DE27E0.2020100@cs.uchicago.edu> <1239296513.1717.7.camel@localhost> <49DE2C60.7050905@cs.uchicago.edu> <1239297769.1717.27.camel@localhost> Message-ID: <9B273A14-2414-4D5E-834A-C5098BB731A5@anl.gov> Mihael: Ioan (and I) are just saying, I think, that there are situations in which people want to work with (or in some cases, need to work with) a static allocation. No-one is arguing that this will always be the case, or even should mostly be the case. Just sometimes. 
Ian, From iraicu at cs.uchicago.edu Thu Apr 9 12:32:11 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 10:32:11 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239297382.1717.23.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <49DE271D.8030003@cs.uchicago.edu> <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov> <1239296190.1717.1.camel@localhost> <49DE2A8A.4060705@cs.uchicago.edu> <1239297382.1717.23.camel@localhost> Message-ID: <49DE311B.4070309@cs.uchicago.edu> Mihael Hategan wrote: > On Thu, 2009-04-09 at 10:04 -0700, Ioan Raicu wrote: > >> There are use cases where static resource allocation are better than >> dynamic ones. Again, we come back to the BG/P system. There is a >> policy that only allows you to submit X number of jobs to Cobalt, and >> X is < 10 jobs. Now, if you want to allocate resources dynamically, in >> smaller chunks, you are limited to only a few jobs. Static >> provisioning all of a sudden seems attractive. >> > > It's a valid scenario and a valid solution, but asserting that it's the > only solution or that it's the best solution seems inappropriate. > > A better solution is to dynamically allocate workers in larger blocks if > you don't have arbitrary granularity on the allocation sizes. It > provides the middle ground that meets the scheduling system constraints > and minimizes inefficiencies. > When you allocate 1 node at a time, in dynamic provisioning, its trivial to de-allocate nodes/workers, with a timeout. When you allocate N nodes in a single job, where N>1, de-allocation is not trivial anymore. If a worker simply de-allocates (exit the process), that node remains allocated from the LRM's perspective, but de-allocated from the Coaster/Falkon perspective. When all N nodes are de-allocated, then the N nodes are released to the LRM. 
That is potentially a great deal of wastage. The better solution would be for there to be a centralized manager that keeps track of an entire job (N nodes) and its utilization, and decides to de-allocate the entire N nodes at the same time, from both Coaster/Falkon and the LRM. Falkon doesn't support this, unfortunately. Does Coaster support this? If not, then I'd say that dynamic resource provisioning has to be kept at the single-node-per-job level, and not bunch together multiple node requests per job. This will obviously limit the use of dynamic provisioning for large-scale runs to LRMs that support a large number of job submissions, proportional to the scale of the runs. Don't get me wrong, I think dynamic resource provisioning is the best approach in general, especially when workloads vary in load and you have an infrastructure that supports it (e.g. TeraGrid), but it's not suitable for other systems, with the current implementation that I am aware of from Falkon (maybe Coaster as well), on systems like the BG/P. > >> Another thing that you have to remember, that for some systems, like >> the BG/P, getting 2 allocations of 64 nodes each, is not the same as >> getting 1 allocation of 128 nodes. The 1 single allocation of 128 >> nodes has networking configured in such a way to allow node-to-node >> communication efficiently. The 2 separate allocations, could be >> allocated in completely opposite ends of the system, and hence having >> poor networking properties to do node-to-node communication, between >> the separate allocations (if its even possible, I am not sure, the >> networks might be completely separate). This might not be important >> for vanilla Swift, but some of the MTDM work (previously known as >> collective I/O) relies on good network connectivity between any node >> in the allocation to pass data around and avoiding GPFS. >> > > I'm not sure what dynamic vs. 
static allocation of workers, in > principle, has to do with the implementation hurdles of CIO on the BG/P. > It has to do with the fact that if the network interconnect is important (as is the case for MTDM), then submitting multiple independent jobs to the LRM is detrimental to the overall performance of the application, as opposed to submitting a single job to the LRM. If the jobs are submitted to the LRM as independent jobs, there is no guarantee on their placement and proximity to each other (node-wise). > Different systems have different constraints. Dynamic allocation can be > made to adapt to those constraints. > But after adapting it enough, it's going to look like static provisioning. See this paper http://pegasus.isi.edu/publications/2008/JuveG-ResourceProvisioningOptions.pdf which discusses the various approaches to resource provisioning. You will find some systems support static provisioning, others support dynamic provisioning, and others support both. It shows that there are clear use cases for one, the other, or both. Ioan > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iraicu at cs.uchicago.edu Thu Apr 9 12:34:26 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 10:34:26 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239297769.1717.27.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <49DE27E0.2020100@cs.uchicago.edu> <1239296513.1717.7.camel@localhost> <49DE2C60.7050905@cs.uchicago.edu> <1239297769.1717.27.camel@localhost> Message-ID: <49DE31A2.5020806@cs.uchicago.edu> Mihael Hategan wrote: > On Thu, 2009-04-09 at 10:12 -0700, Ioan Raicu wrote: > >>> >>> >> No workers sit idle, waiting for other workers to start. The resource >> allocation takes some amount of time to boot up the OS on each node, >> mount GPFS, start Falkon service, start Falkon workers, etc... see >> http://dev.globus.org/wiki/Image:Falkon-BGP-startup-time.jpg. Its true >> that there is some difference between the 1st worker starting, and the >> last worker starting, probably on the order of seconds to maybe minutes >> at the largest scale of 160K processors. If this is a concern, the idle >> time as the system starts up, you can start Swift before 100% of the >> system is operational. The system is partitioned in 64 node chunks, so, >> in theory, Swift could start as soon as 64 nodes are online. Although, >> this could also have its own problems. >> > > This assumes a single site and exact knowledge of how to fit the > workload. > Nope, its a single site if you want to start at the earliest possible time, but once all nodes are started, it becomes a multi-site allocation, where each site is a 64 node chunk of the allocation. > I also assume this works when you have a reservation, otherwise you may > have better chances with smaller chunks. > Up to 8K cores, we usually run without reservations. 
Beyond that, we do get reservations. > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Thu Apr 9 12:36:15 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 10:36:15 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <9B273A14-2414-4D5E-834A-C5098BB731A5@anl.gov> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <49DE27E0.2020100@cs.uchicago.edu> <1239296513.1717.7.camel@localhost> <49DE2C60.7050905@cs.uchicago.edu> <1239297769.1717.27.camel@localhost> <9B273A14-2414-4D5E-834A-C5098BB731A5@anl.gov> Message-ID: <49DE320F.8060306@cs.uchicago.edu> Yes, that is all I was arguing for as well... that dynamic provisioning cannot substitute static provisioning in some cases. Ioan Ian Foster wrote: > Mihael: > > Ioan (and I) are just saying, I think, that there are situations in > which people want to work with (or in some cases, need to work with) a > static allocation. > > No-one is arguing that this will always be the case, or even should > mostly be the case. Just sometimes. > > Ian, > > -- =================================================== Ioan Raicu, Ph.D. 
=================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Thu Apr 9 12:47:53 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 12:47:53 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <9B273A14-2414-4D5E-834A-C5098BB731A5@anl.gov> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <49DE27E0.2020100@cs.uchicago.edu> <1239296513.1717.7.camel@localhost> <49DE2C60.7050905@cs.uchicago.edu> <1239297769.1717.27.camel@localhost> <9B273A14-2414-4D5E-834A-C5098BB731A5@anl.gov> Message-ID: <1239299273.4415.10.camel@localhost> On Thu, 2009-04-09 at 12:24 -0500, Ian Foster wrote: > Mihael: > > Ioan (and I) are just saying, I think, that there are situations in > which people want to work with (or in some cases, need to work with) a > static allocation. That was the piece I was missing: The allocation scheme will make better decisions if information about node reservations is available to it. Users should be allowed to specify such information and it should be considered when allocating workers. > > No-one is arguing that this will always be the case, or even should > mostly be the case. Just sometimes. Being left with work done after the reservation ends seemed like a no-no, when one could allocate nodes opportunistically in addition to the reservation to meet the demand. 
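[Editorial note: the policy Mihael sketches above (use reservation information when the user supplies it, and allocate opportunistic workers on top to meet the remaining demand) could look roughly like the following. The function and its names are hypothetical, not the actual Swift/Coaster scheduler.]

```python
def plan_allocation(demand_nodes, reserved_nodes, reservation_active):
    """Return (nodes_from_reservation, opportunistic_nodes_to_request).

    Prefer the reservation while it is active; request opportunistic
    workers only for the demand the reservation cannot cover, so work
    left over when the reservation ends still has somewhere to run."""
    if reservation_active:
        from_reservation = min(demand_nodes, reserved_nodes)
        return from_reservation, demand_nodes - from_reservation
    return 0, demand_nodes

print(plan_allocation(2048, 1024, True))   # (1024, 1024)
print(plan_allocation(512, 1024, True))    # (512, 0)
print(plan_allocation(512, 1024, False))   # (0, 512)
```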
From hategan at mcs.anl.gov Thu Apr 9 12:56:31 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 12:56:31 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DE311B.4070309@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <49DE271D.8030003@cs.uchicago.edu> <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov> <1239296190.1717.1.camel@localhost> <49DE2A8A.4060705@cs.uchicago.edu> <1239297382.1717.23.camel@localhost> <49DE311B.4070309@cs.uchicago.edu> Message-ID: <1239299791.4415.19.camel@localhost> On Thu, 2009-04-09 at 10:32 -0700, Ioan Raicu wrote: > > A better solution is to dynamically allocate workers in larger blocks if > > you don't have arbitrary granularity on the allocation sizes. It > > provides the middle ground that meets the scheduling system constraints > > and minimizes inefficiencies. > > > When you allocate 1 node at a time, in dynamic provisioning, its > trivial to de-allocate nodes/workers, with a timeout. When you > allocate N nodes in a single job, where N>1, de-allocation is not > trivial anymore. Exactly. > If a worker simply de-allocates (exit the process), that node remains > allocated from the LRM's perspective, but de-allocated from the > Coaster/Falkon perspective. When all N nodes are de-allocated, then > the N nodes are released to the LRM. That is potentially a great deal > of wastage. The better solution would be for there to be a centralized > manager, that keeps track of an entire job (N nodes) and their > utilization, and decide to de-allocate the entire N nodes at the same > time, from both Coaster/Falkon and the LRM. Falkon doesn't support > this unfortunately. Does Coaster support this? That was the thing Mike was pushing for and which I started thinking of. 
It's good that we're having this discussion, because it uncovers some of the details involved. > If not, then I'd say that dynamic resource provisioning has to be > kept at jobs of a single node level, That does not follow. The mutually-exclusive choices you are presenting are 1) no current support for block allocations, 2) single node allocations and you're missing 3) future support for block allocations. If (1) is false (2) is not the only alternative. > and not bunch together multiple node requests per job. This will > obviously limit the use of dynamic provisioning for large scale runs, > to LRMs that support large number of job submissions, proportional to > the scale of the runs. From iraicu at cs.uchicago.edu Thu Apr 9 12:59:55 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 10:59:55 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239299791.4415.19.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <49DE271D.8030003@cs.uchicago.edu> <7A6429F1-CDE4-4B09-95BB-C7DB24E78905@anl.gov> <1239296190.1717.1.camel@localhost> <49DE2A8A.4060705@cs.uchicago.edu> <1239297382.1717.23.camel@localhost> <49DE311B.4070309@cs.uchicago.edu> <1239299791.4415.19.camel@localhost> Message-ID: <49DE379B.4060006@cs.uchicago.edu> Mihael Hategan wrote: > On Thu, 2009-04-09 at 10:32 -0700, Ioan Raicu wrote: > > > > >> If not, then I'd say that dynamic resource provisioning has to be >> kept at jobs of a single node level, >> > > That does not follow. The mutually-exclusive choices you are presenting > are 1) no current support for block allocations, 2) single node > allocations and you're missing 3) future support for block allocations. > If (1) is false (2) is not the only alternative. 
> > Remember, I said "if not", in other words, if Coaster does not support this today, then (2) is your only alternative, unless you want to waste many resources over time. Future support, obviously means that (2) will not be the only alternative :) -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Apr 9 13:03:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 13:03:22 -0500 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DE320F.8060306@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <49DE27E0.2020100@cs.uchicago.edu> <1239296513.1717.7.camel@localhost> <49DE2C60.7050905@cs.uchicago.edu> <1239297769.1717.27.camel@localhost> <9B273A14-2414-4D5E-834A-C5098BB731A5@anl.gov> <49DE320F.8060306@cs.uchicago.edu> Message-ID: <1239300202.4415.26.camel@localhost> On Thu, 2009-04-09 at 10:36 -0700, Ioan Raicu wrote: > Yes, that is all I was arguing for as well... that dynamic provisioning > cannot substitute static provisioning in some cases. There's a big difference between Ian's statement and yours. 
Being able to have dynamic workers that can make use of static allocations is not the same as not being able to have dynamic workers in some cases. I think you do actually mean what Ian says. It wasn't apparent though. > > Ioan > > Ian Foster wrote: > > Mihael: > > > > Ioan (and I) are just saying, I think, that there are situations in > > which people want to work with (or in some cases, need to work with) a > > static allocation. > > > > No-one is arguing that this will always be the case, or even should > > mostly be the case. Just sometimes. > > > > Ian, > > > From iraicu at cs.uchicago.edu Thu Apr 9 13:19:15 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 09 Apr 2009 11:19:15 -0700 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <1239300202.4415.26.camel@localhost> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <4DED210A-3AB9-4FE6-B82D-60B0966280C5@anl.gov> <1239292165.339.1.camel@localhost> <49DE27E0.2020100@cs.uchicago.edu> <1239296513.1717.7.camel@localhost> <49DE2C60.7050905@cs.uchicago.edu> <1239297769.1717.27.camel@localhost> <9B273A14-2414-4D5E-834A-C5098BB731A5@anl.gov> <49DE320F.8060306@cs.uchicago.edu> <1239300202.4415.26.camel@localhost> Message-ID: <49DE3C23.6050606@cs.uchicago.edu> I meant what Ian said. Mihael Hategan wrote: > On Thu, 2009-04-09 at 10:36 -0700, Ioan Raicu wrote: > >> Yes, that is all I was arguing for as well... that dynamic provisioning >> cannot substitute static provisioning in some cases. >> > > There's a big difference between Ian's statement and yours. Being able > to have dynamic workers that can make use of static allocations is > not the same as not being able to have dynamic workers in some cases. > > I think you do actually mean what Ian says. It wasn't apparent though. 
> > >> Ioan >> >> Ian Foster wrote: >> >>> Mihael: >>> >>> Ioan (and I) are just saying, I think, that there are situations in >>> which people want to work with (or in some cases, need to work with) a >>> static allocation. >>> >>> No-one is arguing that this will always be the case, or even should >>> mostly be the case. Just sometimes. >>> >>> Ian, >>> >>> >>> > > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Fri Apr 10 11:38:05 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 10 Apr 2009 11:38:05 -0500 Subject: [Swift-devel] Is there a site count limit? Message-ID: <49DF75ED.1060704@mcs.anl.gov> Hi, We're trying to run an oops run on 8 racks of the BGP. It's possible this is larger than has been done to date with swift. Our sites.xml file has localhost plus 128 Falkon sites, one for each pset in the 8-rack partition. From what I can tell, Swift sees all 128 sites, but only sends jobs to exactly the first 32, bgp000-bgp031. While I debug this further, does anyone know of some hardwired limit that would cause swift to send to only the first 32 bgp sites? From hategan at mcs.anl.gov Fri Apr 10 11:42:06 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 10 Apr 2009 11:42:06 -0500 Subject: [Swift-devel] Is there a site count limit? 
In-Reply-To: <49DF75ED.1060704@mcs.anl.gov> References: <49DF75ED.1060704@mcs.anl.gov> Message-ID: <1239381726.8860.0.camel@localhost> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote: > Hi, > > We're trying to run an oops run on 8 racks of the BGP. It's possible this > is larger than has been done to date with swift. > > Our sites.xml file has localhost plus 128 Falkon sites, one for each > pset in the 8-rack partition. > > From what I can tell, Swift sees all 128 sites, but only sends jobs to > exactly the first 32, bgp000-bgp031. > > While I debug this further, does anyone know of some hardwired limit > that would cause swift to send to only the first 32 bgp sites? I can't think of anything that would make that the case. The sites file and a log would be useful. From qinz at ihpc.a-star.edu.sg Fri Apr 10 11:48:23 2009 From: qinz at ihpc.a-star.edu.sg (Qin Zheng) Date: Sat, 11 Apr 2009 00:48:23 +0800 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DD0B3F.7050903@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> , <49DD0B3F.7050903@cs.uchicago.edu> Message-ID: Dear all, I came from the angle of application (such as an Enterprise application or disaster recovery for an extreme case) requirements in an SLA. Reservation can give some idea of response time (let's talk separately about failure and inaccurate execution time estimation), while queuing time prediction can give some probability of (mean and upper bound of) the expected start time. Knowing queuing time is important according to feedback from users of our in-house supercomputers, while they bear errors in their execution time estimations. However, even for a single task, only dynamically queuing it (or a number of its replicas) does not provide time-related information (as has been mentioned). Ioan, I was also thinking along the line of queue time estimation, which may be sufficient for what I am doing now. 
I considered reservation (so no queuing time) in my previous fault tolerance work due to the strict timing sequence requirement. I will read the paper soon to clarify a few points, especially the two points made by Mihael. Because to me it is only useful if it can tell (a) a queuing time that holds not only for the current state and does not immediately change when new jobs are queued; (b) a mean and an upper bound on queuing time, or if only the upper bound is given, it should be tight in some sense (at most 20 minutes for the 2-minute job example). Finally, when a node can fail, it can also affect jobs queuing for it, and this paper briefly mentions something about detecting the failure using queuing time data. I will share my findings regarding queuing time with you guys soon. Cheers, Qin Zheng ________________________________ From: Ioan Raicu [iraicu at cs.uchicago.edu] Sent: Thursday, April 09, 2009 4:38 AM To: Ben Clifford Cc: Mihael Hategan; Qin Zheng; swift-devel; Ian Foster Subject: Re: [Swift-devel] Re: replication vs site score Does a batch-queue prediction service help things in any way? https://portal.teragrid.org/gridsphere/gridsphere?cid=queue-prediction I've always wondered how the Swift scheduler would behave differently if it had statistical information about queue times. Qin, have you compared your job replication strategy with one that was cognizant of the expected wait queue time, in order to meet deadlines? On the surface, assuming that the batch queue prediction is accurate, it would seem that scheduling with known queue times might solve the same deadline-cognizant scheduling problem, but without wasting resources by unnecessary replication. The place where the queue prediction doesn't help is when there is a bad node which causes an application to be slow or fail. In this case, replication is probably the better recourse to guarantee meeting deadlines. Here is their latest paper on this: http://www.springerlink.com/content/7552901360631246/fulltext.pdf. 
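The trade-off Ioan describes (queue-time-aware placement versus replication) can be sketched in a few lines. This is purely illustrative and not Swift's or Falkon's actual scheduler; the function name, the site dictionary, and all numbers are invented:

```python
# Hypothetical deadline-aware placement: prefer the single site whose
# predicted queue wait plus estimated runtime meets the deadline; fall
# back to replicating across all sites only when no single site qualifies.
def plan(predicted_wait, runtime, deadline):
    # predicted_wait: site name -> predicted queue wait (same units as runtime)
    feasible = [s for s, w in predicted_wait.items() if w + runtime <= deadline]
    if feasible:
        return [min(feasible, key=predicted_wait.get)]  # one submission suffices
    return list(predicted_wait)  # replicate everywhere as a last resort

plan({"siteA": 10, "siteB": 120}, runtime=30, deadline=60)  # ["siteA"]
```

With accurate predictions the first branch avoids the wasted replicas; a bad node that stalls a job defeats the prediction, which is where replication still earns its keep.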
The system is deployed on the TeraGrid, and has been for a few years now. As far as I have heard, it is quite robust and accurate. Cheers, Ioan Ben Clifford wrote: On Wed, 8 Apr 2009, Mihael Hategan wrote: This: planning the whole workflow buys us little in a (very) dynamic environment in which submitting a job one minute later may mean the difference between 1 minute of queue time and one hour of queue time and this: You need some SLA/QOS to address that. seem to be significant characteristics that make the environments we run on not amenable to scheduling in the traditional sense. The lack of any meaningful guarantees about almost anything time-related makes everything basically opportunistic rather than scheduled. -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== ________________________________ This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Fri Apr 10 12:00:26 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 10 Apr 2009 12:00:26 -0500 Subject: [Swift-devel] Is there a site count limit? 
In-Reply-To: <1239381726.8860.0.camel@localhost> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> Message-ID: <49DF7B2A.4000201@mcs.anl.gov> They are in ci:/home/wilde/oops.1063.2 I spotted the anomaly (if thats what it is) as below. Also: we discussed on the list way way back how to get the swift scheduler to send no more jobs to each "site" than there are cores in that site (for this bgp/falkon case) so that jobs dont get committed to busy sites while other sites have free cores. In this run, we are trying to send 32K jobs to 32K cores. Each of the 128 "sites" have 256 cores. The #s below show about 19K of those jobs as having been dispatched to 32*256 = 8192 cores. int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c 24 365 host=bgp000 790 host=bgp001 371 host=bgp002 383 host=bgp003 365 host=bgp004 791 host=bgp005 415 host=bgp006 775 host=bgp007 790 host=bgp008 791 host=bgp009 369 host=bgp010 790 host=bgp011 359 host=bgp012 791 host=bgp013 394 host=bgp014 402 host=bgp015 358 host=bgp016 595 host=bgp017 790 host=bgp018 790 host=bgp019 791 host=bgp020 790 host=bgp021 370 host=bgp022 790 host=bgp023 790 host=bgp024 674 host=bgp025 567 host=bgp026 389 host=bgp027 778 host=bgp028 366 host=bgp029 787 host=bgp030 695 host=bgp031 int$ pwd On 4/10/09 11:42 AM, Mihael Hategan wrote: > On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote: >> Hi, >> >> We're trying to run an oops run on 8 racks of the BGP. Its possible this >> is larger than has been done to date with swift. >> >> Our sites.xml file has localhost plus 128 Falkon sites, one for each >> pset in the 8-rack partition. >> >> From what I can tell, Swift sees all 128 sites, but only sends jobs to >> exactly the first 32, bgp000-bgp031. >> >> While I debug this further, does anyone know of some hardwired limit >> that would cause swift to send to only the first 32 bgp sites? > > I can't think of anything that would make that the case. 
The sites file > and a log would be useful. > From hategan at mcs.anl.gov Fri Apr 10 12:05:57 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 10 Apr 2009 12:05:57 -0500 Subject: [Swift-devel] Is there a site count limit? In-Reply-To: <49DF7B2A.4000201@mcs.anl.gov> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> Message-ID: <1239383157.10739.1.camel@localhost> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote: > They are in ci:/home/wilde/oops.1063.2 > > I spotted the anomaly (if thats what it is) as below. > > Also: we discussed on the list way way back how to get the swift > scheduler to send no more jobs to each "site" than there are cores in > that site (for this bgp/falkon case) so that jobs dont get committed to > busy sites while other sites have free cores. > > In this run, we are trying to send 32K jobs to 32K cores. > Each of the 128 "sites" have 256 cores. > > The #s below show about 19K of those jobs as having been dispatched to > 32*256 = 8192 cores. That is if all the cores are the same. In this case it seems that only 8192 cores are the same. I'll investigate why. 
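The per-site tallies in this thread come from pipelines of the shape quoted above. A self-contained variant (the sample host lines are invented; in the real run they came from `grep JOB_START *.log | awk '{print $19}'`) that also reports how many distinct sites received work:

```shell
# Count jobs per host, then report the number of distinct sites and the total.
printf 'host=bgp000\nhost=bgp001\nhost=bgp000\n' |
  sort | uniq -c |
  awk '{ sites += 1; total += $1 } END { printf "%d sites, %d jobs\n", sites, total }'
# prints: 2 sites, 3 jobs
```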
> > int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c > > 24 > 365 host=bgp000 > 790 host=bgp001 > 371 host=bgp002 > 383 host=bgp003 > 365 host=bgp004 > 791 host=bgp005 > 415 host=bgp006 > 775 host=bgp007 > 790 host=bgp008 > 791 host=bgp009 > 369 host=bgp010 > 790 host=bgp011 > 359 host=bgp012 > 791 host=bgp013 > 394 host=bgp014 > 402 host=bgp015 > 358 host=bgp016 > 595 host=bgp017 > 790 host=bgp018 > 790 host=bgp019 > 791 host=bgp020 > 790 host=bgp021 > 370 host=bgp022 > 790 host=bgp023 > 790 host=bgp024 > 674 host=bgp025 > 567 host=bgp026 > 389 host=bgp027 > 778 host=bgp028 > 366 host=bgp029 > 787 host=bgp030 > 695 host=bgp031 > int$ pwd > > > On 4/10/09 11:42 AM, Mihael Hategan wrote: > > On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote: > >> Hi, > >> > >> We're trying to run an oops run on 8 racks of the BGP. Its possible this > >> is larger than has been done to date with swift. > >> > >> Our sites.xml file has localhost plus 128 Falkon sites, one for each > >> pset in the 8-rack partition. > >> > >> From what I can tell, Swift sees all 128 sites, but only sends jobs to > >> exactly the first 32, bgp000-bgp031. > >> > >> While I debug this further, does anyone know of some hardwired limit > >> that would cause swift to send to only the first 32 bgp sites? > > > > I can't think of anything that would make that the case. The sites file > > and a log would be useful. > > From hategan at mcs.anl.gov Fri Apr 10 12:18:05 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 10 Apr 2009 12:18:05 -0500 Subject: [Swift-devel] Is there a site count limit? In-Reply-To: <49DF7B2A.4000201@mcs.anl.gov> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> Message-ID: <1239383885.10739.3.camel@localhost> Increase foreach.max.threads to at least 4096. 
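For reference, `foreach.max.threads` is set in `swift.properties`; a minimal sketch of the change being suggested (the 4096 value is the one proposed above):

```
# swift.properties: upper limit on concurrently active foreach iterations
foreach.max.threads=4096
```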
On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote: > They are in ci:/home/wilde/oops.1063.2 > > I spotted the anomaly (if thats what it is) as below. > > Also: we discussed on the list way way back how to get the swift > scheduler to send no more jobs to each "site" than there are cores in > that site (for this bgp/falkon case) so that jobs dont get committed to > busy sites while other sites have free cores. > > In this run, we are trying to send 32K jobs to 32K cores. > Each of the 128 "sites" have 256 cores. > > The #s below show about 19K of those jobs as having been dispatched to > 32*256 = 8192 cores. > > int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c > > 24 > 365 host=bgp000 > 790 host=bgp001 > 371 host=bgp002 > 383 host=bgp003 > 365 host=bgp004 > 791 host=bgp005 > 415 host=bgp006 > 775 host=bgp007 > 790 host=bgp008 > 791 host=bgp009 > 369 host=bgp010 > 790 host=bgp011 > 359 host=bgp012 > 791 host=bgp013 > 394 host=bgp014 > 402 host=bgp015 > 358 host=bgp016 > 595 host=bgp017 > 790 host=bgp018 > 790 host=bgp019 > 791 host=bgp020 > 790 host=bgp021 > 370 host=bgp022 > 790 host=bgp023 > 790 host=bgp024 > 674 host=bgp025 > 567 host=bgp026 > 389 host=bgp027 > 778 host=bgp028 > 366 host=bgp029 > 787 host=bgp030 > 695 host=bgp031 > int$ pwd > > > On 4/10/09 11:42 AM, Mihael Hategan wrote: > > On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote: > >> Hi, > >> > >> We're trying to run an oops run on 8 racks of the BGP. Its possible this > >> is larger than has been done to date with swift. > >> > >> Our sites.xml file has localhost plus 128 Falkon sites, one for each > >> pset in the 8-rack partition. > >> > >> From what I can tell, Swift sees all 128 sites, but only sends jobs to > >> exactly the first 32, bgp000-bgp031. > >> > >> While I debug this further, does anyone know of some hardwired limit > >> that would cause swift to send to only the first 32 bgp sites? > > > > I can't think of anything that would make that the case. 
The sites file > > and a log would be useful. > > From hategan at mcs.anl.gov Fri Apr 10 12:22:10 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 10 Apr 2009 12:22:10 -0500 Subject: [Swift-devel] Is there a site count limit? In-Reply-To: <1239383885.10739.3.camel@localhost> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> Message-ID: <1239384130.10739.5.camel@localhost> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote: > Increase foreach.max.threads to at least 4096. That doesn't seem to be the cause though. Do you have all the sites/executables properly in tc.data? > > On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote: > > They are in ci:/home/wilde/oops.1063.2 > > > > I spotted the anomaly (if thats what it is) as below. > > > > Also: we discussed on the list way way back how to get the swift > > scheduler to send no more jobs to each "site" than there are cores in > > that site (for this bgp/falkon case) so that jobs dont get committed to > > busy sites while other sites have free cores. > > > > In this run, we are trying to send 32K jobs to 32K cores. > > Each of the 128 "sites" have 256 cores. > > > > The #s below show about 19K of those jobs as having been dispatched to > > 32*256 = 8192 cores. 
> > > > int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c > > > > 24 > > 365 host=bgp000 > > 790 host=bgp001 > > 371 host=bgp002 > > 383 host=bgp003 > > 365 host=bgp004 > > 791 host=bgp005 > > 415 host=bgp006 > > 775 host=bgp007 > > 790 host=bgp008 > > 791 host=bgp009 > > 369 host=bgp010 > > 790 host=bgp011 > > 359 host=bgp012 > > 791 host=bgp013 > > 394 host=bgp014 > > 402 host=bgp015 > > 358 host=bgp016 > > 595 host=bgp017 > > 790 host=bgp018 > > 790 host=bgp019 > > 791 host=bgp020 > > 790 host=bgp021 > > 370 host=bgp022 > > 790 host=bgp023 > > 790 host=bgp024 > > 674 host=bgp025 > > 567 host=bgp026 > > 389 host=bgp027 > > 778 host=bgp028 > > 366 host=bgp029 > > 787 host=bgp030 > > 695 host=bgp031 > > int$ pwd > > > > > > On 4/10/09 11:42 AM, Mihael Hategan wrote: > > > On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote: > > >> Hi, > > >> > > >> We're trying to run an oops run on 8 racks of the BGP. Its possible this > > >> is larger than has been done to date with swift. > > >> > > >> Our sites.xml file has localhost plus 128 Falkon sites, one for each > > >> pset in the 8-rack partition. > > >> > > >> From what I can tell, Swift sees all 128 sites, but only sends jobs to > > >> exactly the first 32, bgp000-bgp031. > > >> > > >> While I debug this further, does anyone know of some hardwired limit > > >> that would cause swift to send to only the first 32 bgp sites? > > > > > > I can't think of anything that would make that the case. The sites file > > > and a log would be useful. > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Apr 10 12:39:42 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 10 Apr 2009 12:39:42 -0500 Subject: [Swift-devel] Is there a site count limit? 
In-Reply-To: <1239384130.10739.5.camel@localhost> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> Message-ID: <49DF845E.6000908@mcs.anl.gov> On 4/10/09 12:22 PM, Mihael Hategan wrote: > On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote: >> Increase foreach.max.threads to at least 4096. it was set to 100000 (100K) > That doesn't seem to be the cause though. Do you have all the > sites/executables properly in tc.data? duh. of course not :) thats the problem, thanks. > >> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote: >>> They are in ci:/home/wilde/oops.1063.2 >>> >>> I spotted the anomaly (if thats what it is) as below. >>> >>> Also: we discussed on the list way way back how to get the swift >>> scheduler to send no more jobs to each "site" than there are cores in >>> that site (for this bgp/falkon case) so that jobs dont get committed to >>> busy sites while other sites have free cores. >>> >>> In this run, we are trying to send 32K jobs to 32K cores. >>> Each of the 128 "sites" have 256 cores. >>> >>> The #s below show about 19K of those jobs as having been dispatched to >>> 32*256 = 8192 cores. 
>>> >>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c >>> >>> 24 >>> 365 host=bgp000 >>> 790 host=bgp001 >>> 371 host=bgp002 >>> 383 host=bgp003 >>> 365 host=bgp004 >>> 791 host=bgp005 >>> 415 host=bgp006 >>> 775 host=bgp007 >>> 790 host=bgp008 >>> 791 host=bgp009 >>> 369 host=bgp010 >>> 790 host=bgp011 >>> 359 host=bgp012 >>> 791 host=bgp013 >>> 394 host=bgp014 >>> 402 host=bgp015 >>> 358 host=bgp016 >>> 595 host=bgp017 >>> 790 host=bgp018 >>> 790 host=bgp019 >>> 791 host=bgp020 >>> 790 host=bgp021 >>> 370 host=bgp022 >>> 790 host=bgp023 >>> 790 host=bgp024 >>> 674 host=bgp025 >>> 567 host=bgp026 >>> 389 host=bgp027 >>> 778 host=bgp028 >>> 366 host=bgp029 >>> 787 host=bgp030 >>> 695 host=bgp031 >>> int$ pwd >>> >>> >>> On 4/10/09 11:42 AM, Mihael Hategan wrote: >>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote: >>>>> Hi, >>>>> >>>>> We're trying to run an oops run on 8 racks of the BGP. Its possible this >>>>> is larger than has been done to date with swift. >>>>> >>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for each >>>>> pset in the 8-rack partition. >>>>> >>>>> From what I can tell, Swift sees all 128 sites, but only sends jobs to >>>>> exactly the first 32, bgp000-bgp031. >>>>> >>>>> While I debug this further, does anyone know of some hardwired limit >>>>> that would cause swift to send to only the first 32 bgp sites? >>>> I can't think of anything that would make that the case. The sites file >>>> and a log would be useful. >>>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Fri Apr 10 14:44:45 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 10 Apr 2009 14:44:45 -0500 Subject: [Swift-devel] Is there a site count limit? 
In-Reply-To: <49DF845E.6000908@mcs.anl.gov> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> Message-ID: <49DFA1AD.8050000@mcs.anl.gov> Mihael, your suggestion of: 2.56 1000 Is *almost* right on: int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c | awk '{ sum += $1} END {print sum}' 8131 int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c 3 254 host=bgp000 254 host=bgp001 254 host=bgp002 ... 254 host=bgp030 254 host=bgp031 int$ Can you suggest how to tweak it up to 256? Use jobThrottle=2.58 maybe? I will experiment, but if there's a precise way to hit it "just right" that would be great. If not, we will adjust as needed and reduce the total # of jobs. Is this a roundoff issue, or does the formula subtract 2 somewhere from the throttle * score product? - Mike On 4/10/09 12:39 PM, Michael Wilde wrote: > > > On 4/10/09 12:22 PM, Mihael Hategan wrote: >> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote: >>> Increase foreach.max.threads to at least 4096. > > it was set to 100000 (100K) > >> That doesn't seem to be the cause though. Do you have all the >> sites/executables properly in tc.data? > > duh. of course not :) > > thats the problem, thanks. > >> >>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote: >>>> They are in ci:/home/wilde/oops.1063.2 >>>> >>>> I spotted the anomaly (if thats what it is) as below. >>>> >>>> Also: we discussed on the list way way back how to get the swift >>>> scheduler to send no more jobs to each "site" than there are cores >>>> in that site (for this bgp/falkon case) so that jobs dont get >>>> committed to busy sites while other sites have free cores. >>>> >>>> In this run, we are trying to send 32K jobs to 32K cores. >>>> Each of the 128 "sites" have 256 cores. 
>>>> >>>> The #s below show about 19K of those jobs as having been dispatched >>>> to 32*256 = 8192 cores. >>>> >>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c >>>> 24 >>>> 365 host=bgp000 >>>> 790 host=bgp001 >>>> 371 host=bgp002 >>>> 383 host=bgp003 >>>> 365 host=bgp004 >>>> 791 host=bgp005 >>>> 415 host=bgp006 >>>> 775 host=bgp007 >>>> 790 host=bgp008 >>>> 791 host=bgp009 >>>> 369 host=bgp010 >>>> 790 host=bgp011 >>>> 359 host=bgp012 >>>> 791 host=bgp013 >>>> 394 host=bgp014 >>>> 402 host=bgp015 >>>> 358 host=bgp016 >>>> 595 host=bgp017 >>>> 790 host=bgp018 >>>> 790 host=bgp019 >>>> 791 host=bgp020 >>>> 790 host=bgp021 >>>> 370 host=bgp022 >>>> 790 host=bgp023 >>>> 790 host=bgp024 >>>> 674 host=bgp025 >>>> 567 host=bgp026 >>>> 389 host=bgp027 >>>> 778 host=bgp028 >>>> 366 host=bgp029 >>>> 787 host=bgp030 >>>> 695 host=bgp031 >>>> int$ pwd >>>> >>>> >>>> On 4/10/09 11:42 AM, Mihael Hategan wrote: >>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote: >>>>>> Hi, >>>>>> >>>>>> We're trying to run an oops run on 8 racks of the BGP. Its >>>>>> possible this is larger than has been done to date with swift. >>>>>> >>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for >>>>>> each pset in the 8-rack partition. >>>>>> >>>>>> From what I can tell, Swift sees all 128 sites, but only sends >>>>>> jobs to exactly the first 32, bgp000-bgp031. >>>>>> >>>>>> While I debug this further, does anyone know of some hardwired >>>>>> limit that would cause swift to send to only the first 32 bgp sites? >>>>> I can't think of anything that would make that the case. The sites >>>>> file >>>>> and a log would be useful. 
>>>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Fri Apr 10 15:15:09 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 10 Apr 2009 15:15:09 -0500 Subject: [Swift-devel] Is there a site count limit? In-Reply-To: <49DFA1AD.8050000@mcs.anl.gov> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov> Message-ID: <1239394510.27021.1.camel@localhost> On Fri, 2009-04-10 at 14:44 -0500, Michael Wilde wrote: > Mihael, your suggestion of: > > 2.56 > 1000 > > Is *almost* right on: > > int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c | awk > '{ sum += $1} END {print sum}' > 8131 > int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c > > 3 > 254 host=bgp000 > 254 host=bgp001 > 254 host=bgp002 > ... > 254 host=bgp030 > 254 host=bgp031 > int$ > > Can you suggest how to tweak it up to 256? Use jobThrottle=2.58 maybe? Make the initial score larger. 10000 should be enough. As it goes to +inf, you should have a max of 100*jobThrottle + 1 jobs. > I > will experiment, but if there's a precise way to hit it "just right" > that would be great. If not, we will adjust as needed and reduce the > total # of jobs. > > Is this a roundoff issue, or does the formula subtract 2 somewhere from > the throttle * score product? 
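The arithmetic behind that advice, assuming nothing beyond the 100 * jobThrottle + 1 limit quoted above (the function names are mine): with jobThrottle 2.56 the asymptotic per-site cap is 257, so the observed 254 reflects a score that had not yet converged rather than a subtraction in the formula.

```python
def max_jobs_per_site(job_throttle):
    # Asymptotic per-site cap as the site score grows toward +inf,
    # per the formula quoted above: 100 * jobThrottle + 1.
    return int(round(100 * job_throttle)) + 1

def throttle_for(jobs):
    # Smallest jobThrottle whose asymptotic cap reaches `jobs` per site.
    return (jobs - 1) / 100.0

print(max_jobs_per_site(2.56))  # 257
print(throttle_for(256))        # 2.55
```

So raising the initial score (as suggested) closes the gap to the cap, while nudging jobThrottle to 2.58 would merely raise the cap to 259.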
> > - Mike > > > On 4/10/09 12:39 PM, Michael Wilde wrote: > > > > > > On 4/10/09 12:22 PM, Mihael Hategan wrote: > >> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote: > >>> Increase foreach.max.threads to at least 4096. > > > > it was set to 100000 (100K) > > > >> That doesn't seem to be the cause though. Do you have all the > >> sites/executables properly in tc.data? > > > > duh. of course not :) > > > > thats the problem, thanks. > > > >> > >>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote: > >>>> They are in ci:/home/wilde/oops.1063.2 > >>>> > >>>> I spotted the anomaly (if thats what it is) as below. > >>>> > >>>> Also: we discussed on the list way way back how to get the swift > >>>> scheduler to send no more jobs to each "site" than there are cores > >>>> in that site (for this bgp/falkon case) so that jobs dont get > >>>> committed to busy sites while other sites have free cores. > >>>> > >>>> In this run, we are trying to send 32K jobs to 32K cores. > >>>> Each of the 128 "sites" have 256 cores. > >>>> > >>>> The #s below show about 19K of those jobs as having been dispatched > >>>> to 32*256 = 8192 cores. 
> >>>> > >>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c > >>>> 24 > >>>> 365 host=bgp000 > >>>> 790 host=bgp001 > >>>> 371 host=bgp002 > >>>> 383 host=bgp003 > >>>> 365 host=bgp004 > >>>> 791 host=bgp005 > >>>> 415 host=bgp006 > >>>> 775 host=bgp007 > >>>> 790 host=bgp008 > >>>> 791 host=bgp009 > >>>> 369 host=bgp010 > >>>> 790 host=bgp011 > >>>> 359 host=bgp012 > >>>> 791 host=bgp013 > >>>> 394 host=bgp014 > >>>> 402 host=bgp015 > >>>> 358 host=bgp016 > >>>> 595 host=bgp017 > >>>> 790 host=bgp018 > >>>> 790 host=bgp019 > >>>> 791 host=bgp020 > >>>> 790 host=bgp021 > >>>> 370 host=bgp022 > >>>> 790 host=bgp023 > >>>> 790 host=bgp024 > >>>> 674 host=bgp025 > >>>> 567 host=bgp026 > >>>> 389 host=bgp027 > >>>> 778 host=bgp028 > >>>> 366 host=bgp029 > >>>> 787 host=bgp030 > >>>> 695 host=bgp031 > >>>> int$ pwd > >>>> > >>>> > >>>> On 4/10/09 11:42 AM, Mihael Hategan wrote: > >>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote: > >>>>>> Hi, > >>>>>> > >>>>>> We're trying to run an oops run on 8 racks of the BGP. Its > >>>>>> possible this is larger than has been done to date with swift. > >>>>>> > >>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for > >>>>>> each pset in the 8-rack partition. > >>>>>> > >>>>>> From what I can tell, Swift sees all 128 sites, but only sends > >>>>>> jobs to exactly the first 32, bgp000-bgp031. > >>>>>> > >>>>>> While I debug this further, does anyone know of some hardwired > >>>>>> limit that would cause swift to send to only the first 32 bgp sites? > >>>>> I can't think of anything that would make that the case. The sites > >>>>> file > >>>>> and a log would be useful. 
> >>>>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From bugzilla-daemon at mcs.anl.gov Fri Apr 10 17:11:25 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 10 Apr 2009 17:11:25 -0500 (CDT) Subject: [Swift-devel] [Bug 199] New: error in simple mapper when underscores are used in type declaration Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=199 Summary: error in simple mapper when underscores are used in type declaration Product: Swift Version: unspecified Platform: PC OS/Version: Windows Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: skenny at uchicago.edu java.lang.IllegalStateException: mapper.existing() returned a path that it cannot subsequently map -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From qinz at ihpc.a-star.edu.sg Sat Apr 11 10:10:49 2009 From: qinz at ihpc.a-star.edu.sg (Qin Zheng) Date: Sat, 11 Apr 2009 23:10:49 +0800 Subject: [Swift-devel] Re: replication vs site score In-Reply-To: References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> , <49DD0B3F.7050903@cs.uchicago.edu>, Message-ID: Hi, I read the paper and found that (a) ok, it's bound on future queuing time; (b) only an upper bound for 0.95 quantile (with 95% confidence). 
Note that to me it should be relatively tight to the actual wait time (so as to be useful when deciding where to queue), while on the other hand it's much safer to give a higher bound to guarantee a certain confidence level. Mihael, the case you mentioned is very likely, as seen from their Fig 1, where for the majority of actual wait times (in black), which are near 0 units, the bounds (in red) could be 50 thousand units or more. It's still a success (not a failure). It is not what I want on expected queuing time for my work. Let me know if you guys have thoughts on the above. Qin Zheng ________________________________ From: Qin Zheng Sent: Saturday, April 11, 2009 12:48 AM To: iraicu at cs.uchicago.edu; Ben Clifford Cc: Mihael Hategan; swift-devel; Ian Foster Subject: RE: [Swift-devel] Re: replication vs site score Dear all, I came from the angle of application (such as an Enterprise application or disaster recovery for an extreme case) requirements in an SLA. Reservation can give some idea of response time (let's talk separately about failure and inaccurate execution time estimation), while queuing time prediction can give some probability of (mean and upper bound of) the expected start time. Knowing queuing time is important according to feedback from users of our in-house supercomputers, while they bear errors in their execution time estimations. However, even for a single task, only dynamically queuing it (or a number of its replicas) does not provide time-related information (as has been mentioned). Ioan, I was also thinking along the line of queue time estimation, which may be sufficient for what I am doing now. 
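Qin's tightness concern can be made concrete with a toy distribution. All numbers below are invented to mimic the shape described for Fig 1 (most waits near zero, a heavy tail); nothing is taken from the paper's data:

```python
import statistics

waits = [1, 2, 2, 3, 3, 4, 5, 5000]   # most jobs start almost immediately

typical = statistics.median(waits)     # typical wait: 3.0
bound = 6000                           # an upper bound that covers even the tail
ratio = bound / typical                # 2000.0: the bound is valid, yet says
                                       # little about the common case
```

A bound like this "succeeds" by the paper's criterion while sitting three orders of magnitude above the typical wait, which is exactly why it is unhelpful for deciding where to queue.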
Because to me it is only useful if it can tell (a) a queuing time, not one that holds only for the current state and immediately changes when new jobs are queued; (b) a mean and an upper bound on queuing time, or if only the upper bound is given, it should be tight in some sense (at most 20 minutes for the 2-minute job example). Finally, when a node can fail, it can also affect jobs queuing for it, and this paper briefly mentions something about detecting the failure using queuing time data. I will share my findings regarding queuing time with you guys soon. Cheers, Qin Zheng ________________________________ From: Ioan Raicu [iraicu at cs.uchicago.edu] Sent: Thursday, April 09, 2009 4:38 AM To: Ben Clifford Cc: Mihael Hategan; Qin Zheng; swift-devel; Ian Foster Subject: Re: [Swift-devel] Re: replication vs site score Does a batch-queue prediction service help things in any way? https://portal.teragrid.org/gridsphere/gridsphere?cid=queue-prediction I've always wondered how the Swift scheduler would behave differently if it had statistical information about queue times. Qin, have you compared your job replication strategy with one that was cognizant of the expected wait queue time, in order to meet deadlines? On the surface, assuming that the batch queue prediction is accurate, it would seem that scheduling with known queue times might solve the same deadline-cognizant scheduling problem, but without wasting resources through unnecessary replication. The place where the queue prediction doesn't help is when there is a bad node which causes an application to be slow or fail. In this case, replication is probably the better recourse to guarantee meeting deadlines. Here is their latest paper on this: http://www.springerlink.com/content/7552901360631246/fulltext.pdf. The system is deployed on the TeraGrid, and has been for a few years now. As far as I have heard, it is quite robust and accurate.
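[Editor's sketch] The kind of bound under discussion (an upper bound on the 0.95 quantile of wait time, with 95% confidence) can be illustrated with the standard binomial/order-statistic argument that batch-queue predictors of this family use. This is a rough sketch, not the TeraGrid service's actual code; the function name and sample data are made up:

```python
from math import comb

def quantile_upper_bound(samples, q=0.95, conf=0.95):
    """Smallest order statistic of `samples` that upper-bounds the
    q-quantile of the underlying distribution with probability >= conf.
    Uses P(true q-quantile <= k-th smallest sample) = P(Binomial(n, q) < k)."""
    xs = sorted(samples)
    n = len(xs)
    cdf = 0.0
    for k in range(1, n + 1):
        # add the Binomial(n, q) probability mass at i = k - 1
        cdf += comb(n, k - 1) * q ** (k - 1) * (1 - q) ** (n - k + 1)
        if cdf >= conf:
            return xs[k - 1]
    return None  # too few samples to reach this confidence level

# 200 hypothetical wait times: mostly short, a few very long
waits = [1, 2, 2, 3, 5, 8, 13, 600] * 25
bound = quantile_upper_bound(waits)
```

With heavy-tailed samples like these the bound lands on the extreme observation (600), which is exactly the behaviour Qin describes from Fig 1: safe at the stated confidence, but far above most actual waits.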
Cheers, Ioan Ben Clifford wrote: On Wed, 8 Apr 2009, Mihael Hategan wrote: This: planning the whole workflow buys us little in a (very) dynamic environment in which submitting a job one minute later may mean the difference between 1 minute of queue time and one hour of queue time and this: You need some SLA/QOS to address that. seem to be significant characteristics that make the environments we run on not amenable to scheduling in the traditional sense. The lack of any meaningful guarantees about almost anything time-related makes everything basically opportunistic rather than scheduled. -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== ________________________________ This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Mon Apr 13 01:00:28 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 06:00:28 +0000 (GMT) Subject: [Swift-devel] simple executable staging Message-ID: I've seen enough simple (conceptually, not in run size) uses of Swift (2) and heard enough from OSG Engage people about it being useful that I think Swift should have an option to stage an executable instead of finding it remotely. 
This seems more useful in some fields and less so in others (for example, where people are writing numeric code in C and so have few library dependencies). It can be done in Swift in a round-about way at the moment, but I think it should be better supported. -- From hockyg at uchicago.edu Mon Apr 13 01:04:01 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Mon, 13 Apr 2009 01:04:01 -0500 Subject: [Swift-devel] simple executable staging In-Reply-To: References: Message-ID: <49E2D5D1.6050607@uchicago.edu> This could be potentially very useful for data analysis esp. if the analysis were performed with standard tools (e.g. perl, python, awk and gnuplot). If you could make simple changes to your analysis script and then push the data and analysis tools out for number crunching that would be great. Ben Clifford wrote: > I've seen enough simple (conceptually, not in run size) uses of Swift (2) > and heard enough from OSG Engage people about it being useful that I think > Swift should have an option to stage an executable instead of finding it > remotely. > > This seems more useful in some fields and less so in others (for example, > where people are writing numeric code in C and so have few library > dependencies). > > It can be done in Swift in a round-about way at the moment, but I think it > should be better supported. > > From benc at hawaga.org.uk Mon Apr 13 08:00:36 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 13:00:36 +0000 (GMT) Subject: [Swift-devel] simple executable staging In-Reply-To: <49E2D5D1.6050607@uchicago.edu> References: <49E2D5D1.6050607@uchicago.edu> Message-ID: You can do it now without too much hassle, but it's not as beautiful as it could be. I thought I had added an example to the userguide, but it turns out I haven't. To do it now, make the analysis script an input data file and launch it using /bin/sh (so /bin/sh is what goes into the tc.data file, not your actual application).
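[Editor's sketch] A minimal version of that workaround might look roughly like this (not from the userguide; the site name, file names, and profile fields are made up for illustration):

```
# tc.data: the "transformation" is the shell, not the analysis code
localhost   sh   /bin/sh   INSTALLED   INTEL32::LINUX   null

# SwiftScript: the analysis script is staged like any other input file
type file;

(file out) runAnalysis (file script, file data) {
    app {
        sh @script @data stdout=@out;
    }
}

file script <"analyze.sh">;
file data   <"input.dat">;
file out    <"result.txt">;
out = runAnalysis(script, data);
```

Because the script travels with the job as an ordinary input file, editing analyze.sh locally and re-running is enough to push the new analysis out to the workers.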
Someone (?skenny) does similar stuff with a script Rinvoke (that is mapped in tc.data) that launches R on an input script. But I'd like that to look more elegant. On Mon, 13 Apr 2009, Glen Hocky wrote: > This could be potentially very useful for data analysis esp. if the analysis > were performed with standard tools (e.g. perl, python, awk and gnuplot). If > you could make simple changes to your analysis script and then push the data > and analysis tools out for number crunching that would be great. > > Ben Clifford wrote: > > I've seen enough simple (conceptually, not in run size) uses of Swift (2) > > and heard enough from OSG Engage people about it being useful that I think > > Swift should have an option to stage an executable instead of finding it > > remotely. > > > > This seems more useful in some fields and less so in others (for example, > > where people are writing numeric code in C and so have few library > > dependencies). > > > > It can be done in Swift in a round-about way at the moment, but I think it > > should be better supported. > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From wilde at mcs.anl.gov Mon Apr 13 08:50:37 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Apr 2009 08:50:37 -0500 Subject: [Swift-devel] simple executable staging In-Reply-To: References: <49E2D5D1.6050607@uchicago.edu> Message-ID: <49E3432D.7090906@mcs.anl.gov> I like the idea of executable staging. I feel that while it can start small and simple, it should eventually encompass these things in a clean, elegant model: - versioning of apps and connecting that back to provenance - some way to specify app version from app() to tc.data - some notion of "package" for namespace management (we should re-examine the old VDL concept of namespace::name:version as a general name specification for app and proc names).
- some way to manage the set of app() declarations that specify the "entry points" into an application - management of app PATHs and install dirs - retaining of installed apps (vs copy every time) - both should be possible - some connection to / generalization of ADEM, including both binary and src/build installs On 4/13/09 8:00 AM, Ben Clifford wrote: > You can do it now without too much hassle but its not as beautiful as it > could be. I thought I had added an example to the userguide but turns out > I haven't. > > To do it now, make the analysis script an input data file and launch it > using /bin/sh (so /bin/sh is what does into the tc.data file, not your > actual application). > > Some (?skenny) does similar stuff with a script Rinvoke (that is mapped in > tc.data) that launches R on an input script. > > But I'd like that to look more elegant. > > On Mon, 13 Apr 2009, Glen Hocky wrote: > >> This could be potentially very useful for data analysis esp. if the analysis >> were performed with standard tools (e.g. perl, python, awk and gnuplot). If >> you could make simple changes to your analysis script and then push the data >> and analysis tools out for number crunching that would be great. >> >> Ben Clifford wrote: >>> I've seen enough simple (conceptually, not in run size) uses of Swift (2) >>> and heard enough from OSG Engage people about it being useful that I think >>> Swift should have an option to stage an executable instead of finding it >>> remotely. >>> >>> This seems more useful in some fields and less so in others (for example, >>> where people are writing numeric code in C and so have few library >>> dependencies). >>> >>> It can be done in Swift in a round-about way at the moment, but I think it >>> should be better supported. 
>>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Mon Apr 13 09:02:07 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Apr 2009 09:02:07 -0500 Subject: [Swift-devel] Re: swift plot In-Reply-To: <49E2D96D.8080401@uchicago.edu> References: <49E2D96D.8080401@uchicago.edu> Message-ID: <49E345DF.5090500@mcs.anl.gov> Cool - thanks. Can you tell me (a) the incantation you used for this and (b) if there are options to get "cooler" plots? Maybe Ben can tell us (b). For now, the plot that tells the main story is this one I think: http://kff4.uchicago.edu/~ghocks/report-oops-20090412-0804-jbeux7s9/execstages.png This plot is boring but "great" in the sense that it shows Swift quickly loading up the BG/P and then running steady till jobs start finishing, at which time a "triangle" of idle time on the partition becomes available. Would plots of the Ranger and of Abe/Qb/Ranger runs show something more "interesting" (ie more color and jitter in the workload?) - Mike On 4/13/09 1:19 AM, Glen Hocky wrote: > Hi Mike, > here is the swift plot for the 16 hour run i ran today on bgp. 
it took > >1/2 hour to generate > http://kff4.uchicago.edu/~ghocks/report-oops-20090412-0804-jbeux7s9/ > > Glen From hockyg at uchicago.edu Mon Apr 13 09:05:29 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Mon, 13 Apr 2009 09:05:29 -0500 Subject: [Swift-devel] Re: swift plot In-Reply-To: <49E345DF.5090500@mcs.anl.gov> References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> Message-ID: <49E346A9.9040708@uchicago.edu> I just did swift-plot-log logfile As per: http://www.ci.uchicago.edu/swift/downloads/release-notes-0.8.txt I can run on TG logfiles today but wanted to do the bluegene one because of maintenance Michael Wilde wrote: > Cool - thanks. Can you tell me (a) the incantation you used for this > and (b) if there are options to get "cooler" plots? > > Maybe Ben can tell us (b). > > For now, the plot that tells the main story is this one I think: > http://kff4.uchicago.edu/~ghocks/report-oops-20090412-0804-jbeux7s9/execstages.png > > > This plot is boring but "great" in the sense that it shows Swift > quickly loading up the BG/P and then running steady till jobs start > finishing, at which time a "triangle" of idle time on the partition > becomes available. > > Would plots of the Ranger and of Abe/Qb/Ranger runs show something > more "interesting" (ie more color and jitter in the workload?) > > - Mike > > On 4/13/09 1:19 AM, Glen Hocky wrote: >> Hi Mike, >> here is the swift plot for the 16 hour run i ran today on bgp. 
it >> took >1/2 hour to generate >> http://kff4.uchicago.edu/~ghocks/report-oops-20090412-0804-jbeux7s9/ >> >> Glen From benc at hawaga.org.uk Mon Apr 13 09:11:35 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 14:11:35 +0000 (GMT) Subject: [Swift-devel] Re: swift plot In-Reply-To: <49E345DF.5090500@mcs.anl.gov> References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> Message-ID: The graph on the front page 'Number of karajan level job submissions that are 'Active'' looks like for the most part only 1 or 2 CPUs were in use at any one time. That seems not so good - either 1 or 2 CPUs really were in use at any one time, which is not good from the app perspective, or maybe the execution provider that you are using is not reporting job statuses correctly. For comparison, run with: wrapperlog.always.transfer=true in swift.properties, and make sure that if you move the .log file you move the correspondingly named .d directory too, and then plot logs and make sure that graphs on the info tab appear (they are broken images at the moment). That will give a worker-nodes view of how many CPU cores are being used at once, for comparison. -- From hockyg at uchicago.edu Mon Apr 13 09:37:26 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Mon, 13 Apr 2009 09:37:26 -0500 Subject: [Swift-devel] Re: swift plot In-Reply-To: References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> Message-ID: <49E34E26.8060704@uchicago.edu> Ah. The falkon provider only shows "submitted" when a job is running and only returns "active" when a job is staging out (or about to). Glen Ben Clifford wrote: > The graph on the front page 'Number of karajan level job submissions that > are 'Active'' looks like for the most part only 1 or 2 CPUs were in use at > any one time.
> > That seems not so good - either 1 or 2 CPUs really were in use at any one > time, which is not good for the app perspective, or maybe the execution > provider that you are using is not reporting job statuses correctly. > > For comparison, run with: > wrapperlog.always.transfer=true > in swift.properties, and make sure that if you move the .log file you move > the correspondingly named .d directory too, and then plot logs and make > sure that graphs on the info tab appear (they are broken images at the > moment). > > That will give a worker-nodes view of how many CPU cores are being used as > once, for comparison. > > From wilde at mcs.anl.gov Mon Apr 13 09:41:34 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Apr 2009 09:41:34 -0500 Subject: [Swift-devel] Re: swift plot In-Reply-To: References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> Message-ID: <49E34F1E.8040906@mcs.anl.gov> I dont understand - I interpreted the plot as 8K cores active for the majority of the workflow. This workflow starts with a few tiny jobs on localhost, and then expands to 8K cores. Glen, do you see what Ben is referring to? On 4/13/09 9:11 AM, Ben Clifford wrote: > The graph on the front page 'Number of karajan level job submissions that > are 'Active'' looks like for the most part only 1 or 2 CPUs were in use at > any one time. > > That seems not so good - either 1 or 2 CPUs really were in use at any one > time, which is not good for the app perspective, or maybe the execution > provider that you are using is not reporting job statuses correctly. > > For comparison, run with: > wrapperlog.always.transfer=true > in swift.properties, and make sure that if you move the .log file you move > the correspondingly named .d directory too, and then plot logs and make > sure that graphs on the info tab appear (they are broken images at the > moment). > > That will give a worker-nodes view of how many CPU cores are being used as > once, for comparison. 
> From benc at hawaga.org.uk Mon Apr 13 09:45:49 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 14:45:49 +0000 (GMT) Subject: [Swift-devel] Re: swift plot In-Reply-To: <49E34E26.8060704@uchicago.edu> References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> <49E34E26.8060704@uchicago.edu> Message-ID: On Mon, 13 Apr 2009, Glen Hocky wrote: > Ah. The falkon provider only shows "submitted" when a job is running and only > returns "active" when a job is staging out (or about to). OK, I suspected that. That restricts the amount of analysis that Swift can provide based on the karajan.JOB_SUBMISSION plots. The worker side info logs will provide similar information, though it may be that you have disabled those. -- From benc at hawaga.org.uk Mon Apr 13 09:48:31 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 14:48:31 +0000 (GMT) Subject: [Swift-devel] Re: swift plot In-Reply-To: <49E34F1E.8040906@mcs.anl.gov> References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> <49E34F1E.8040906@mcs.anl.gov> Message-ID: On Mon, 13 Apr 2009, Michael Wilde wrote: > I dont understand - I interpreted the plot as 8K cores active for the majority > of the workflow. For the most part, that graph you pasted is showing that Swift has 8000 jobs sent to the queueing system (local execution provider and falkon). It does not indicate whether they are actually running on a CPU or not - neither of the two mechanisms swift has to collect that information are implemented/working in this run. 
-- From wilde at mcs.anl.gov Mon Apr 13 09:56:17 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Apr 2009 09:56:17 -0500 Subject: [Swift-devel] Re: swift plot In-Reply-To: References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> <49E34E26.8060704@uchicago.edu> Message-ID: <49E35291.9020002@mcs.anl.gov> That's probably the same reason that Falkon runs don't show the same useful progress in the swift status output. We've suspected as much. On 4/13/09 9:45 AM, Ben Clifford wrote: > On Mon, 13 Apr 2009, Glen Hocky wrote: > >> Ah. The falkon provider only shows "submitted" when a job is running and only >> returns "active" when a job is staging out (or about to). > > OK, I suspected that. That restricts the amount of analysis that Swift can > provide based on the karajan.JOB_SUBMISSION plots. The worker side info > logs will provide similar information, though it may be that you have > disabled those. > From benc at hawaga.org.uk Mon Apr 13 10:04:34 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 15:04:34 +0000 (GMT) Subject: [Swift-devel] Re: swift plot In-Reply-To: <49E35291.9020002@mcs.anl.gov> References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> <49E34E26.8060704@uchicago.edu> <49E35291.9020002@mcs.anl.gov> Message-ID: On Mon, 13 Apr 2009, Michael Wilde wrote: > That's probably the same reason that Falkon runs don't show the same useful > progress in the swift status output. We've suspected as much. yes almost definitely - the Submitting/submitted/active transitions in the status ticker are driven by those notifications.
-- From benc at hawaga.org.uk Mon Apr 13 10:19:00 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 15:19:00 +0000 (GMT) Subject: [Swift-devel] swift + condor-g Message-ID: from a superficial reading of the code it looks like the present condor provider can be made to submit to condor-g with a relatively small set of modifications - a small set of changes to the job specification, submitting into an existing condor installation. I'd be interested on working on this in the next week as I am presently working with people for the next few weeks who have an existing condor-g setup, and who would like to submit to OSG. -- From benc at hawaga.org.uk Mon Apr 13 10:23:46 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 15:23:46 +0000 (GMT) Subject: [Swift-devel] swift-plot-log in tutorial Message-ID: I just did a swift hands on session with about 15 users who now have a few days experience in building simple cluster/grid scale applications. One of the things I did was have them run swift-plot-log on Swift runs, and look at some of the graphics there. I think this gave a very useful visual impression of what was going on during runs, and so I think some incorporation of that into future Swift tutorials would be useful. -- From wilde at mcs.anl.gov Mon Apr 13 10:33:17 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Apr 2009 10:33:17 -0500 Subject: [Swift-devel] swift-plot-log in tutorial In-Reply-To: References: Message-ID: <49E35B3D.4030500@mcs.anl.gov> Nice, sounds good. I need to start using this, and the status monitor as well. I realize the plots are slow to produce with the current scripts (understandably, np) but conceptually these two things could be integrated? 
How my run is progressing, both as a numeric status display and as an evolving plot, is something almost all users care about, especially for initial runs where the user is asking "is my run running correctly" and "is it running as fast as it should/could?" On 4/13/09 10:23 AM, Ben Clifford wrote: > I just did a swift hands on session with about 15 users who now have a few > days experience in building simple cluster/grid scale applications. > > One of the things I did was have them run swift-plot-log on Swift runs, > and look at some of the graphics there. > > I think this gave a very useful visual impression of what was going on > during runs, and so I think some incorporation of that into future Swift > tutorials would be useful. > From foster at anl.gov Mon Apr 13 10:34:33 2009 From: foster at anl.gov (Ian Foster) Date: Mon, 13 Apr 2009 10:34:33 -0500 Subject: [Swift-devel] swift-plot-log in tutorial In-Reply-To: References: Message-ID: <2B1AE431-E734-44E8-95F5-DCCAC6C96219@anl.gov> yes that seems very important to me ... On Apr 13, 2009, at 10:23 AM, Ben Clifford wrote: > > I just did a swift hands on session with about 15 users who now have > a few > days experience in building simple cluster/grid scale applications. > > One of the things I did was have them run swift-plot-log on Swift > runs, > and look at some of the graphics there. > > I think this gave a very useful visual impression of what was going on > during runs, and so I think some incorporation of that into future > Swift > tutorials would be useful.
> > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Mon Apr 13 10:41:28 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 15:41:28 +0000 (GMT) Subject: [Swift-devel] swift-plot-log in tutorial In-Reply-To: <49E35B3D.4030500@mcs.anl.gov> References: <49E35B3D.4030500@mcs.anl.gov> Message-ID: It could be integrated into the runtime. But I think that is not very high priority - you can run the existing log processing code on a run in progress to get that kind of information, and there is already a fair amount of information available during a run. On Mon, 13 Apr 2009, Michael Wilde wrote: > Nice, sounds good. I need to start using this, and the status monitor as well. > > I realize the plots are slow to produce with the current scripts > (understandably, np) but conceptually these two things could be integrated? > > How is my run progressing, both as a numeric status display and as an evolving > plot, is something almost all users care about, especially for initial runs > where the user is asking "is my running running correctly" and "is it running > as fast as it should/could?" > > On 4/13/09 10:23 AM, Ben Clifford wrote: > > I just did a swift hands on session with about 15 users who now have a few > > days experience in building simple cluster/grid scale applications. > > > > One of the things I did was have them run swift-plot-log on Swift runs, and > > look at some of the graphics there. > > > > I think this gave a very useful visual impression of what was going on > > during runs, and so I think some incorporation of that into future Swift > > tutorials would be useful. 
> > > > From wilde at mcs.anl.gov Mon Apr 13 10:44:53 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Apr 2009 10:44:53 -0500 Subject: [Swift-devel] swift + condor-g In-Reply-To: References: Message-ID: <49E35DF5.5090600@mcs.anl.gov> That sounds good. I assume that this enhancement would also provide one of the pre-requisites for making coasters use condor-g. Having condor-g working, even without coasters, OOPS can start using OSG and TG for real production (as most real oops jobs are hour-long or so). This assumes that along with condor-g support, that the condor-g grid monitor feature also works. On 4/13/09 10:19 AM, Ben Clifford wrote: > from a superficial reading of the code it looks like the present condor > provider can be made to submit to condor-g with a relatively small set of > modifications - a small set of changes to the job specification, > submitting into an existing condor installation. > > I'd be interested on working on this in the next week as I am presently > working with people for the next few weeks who have an existing condor-g > setup, and who would like to submit to OSG. > From benc at hawaga.org.uk Mon Apr 13 10:49:05 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 15:49:05 +0000 (GMT) Subject: [Swift-devel] swift + condor-g In-Reply-To: <49E35DF5.5090600@mcs.anl.gov> References: <49E35DF5.5090600@mcs.anl.gov> Message-ID: On Mon, 13 Apr 2009, Michael Wilde wrote: > This assumes that along with condor-g support, that the condor-g grid monitor > feature also works. That's really my only motivation for doing this. -- From wilde at mcs.anl.gov Mon Apr 13 10:57:08 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 13 Apr 2009 10:57:08 -0500 Subject: [Swift-devel] swift-plot-log in tutorial In-Reply-To: References: <49E35B3D.4030500@mcs.anl.gov> Message-ID: <49E360D4.8010201@mcs.anl.gov> On 4/13/09 10:41 AM, Ben Clifford wrote: > It could be integrated into the runtime. 
But I think that is not very high > priority - you can run the existing log processing code on a run in > progress to get that kind of information, and there is already a fair > amount of information available during a run. Right, I agree. I was thinking a low priority distant future nicety. > On Mon, 13 Apr 2009, Michael Wilde wrote: > >> Nice, sounds good. I need to start using this, and the status monitor as well. >> >> I realize the plots are slow to produce with the current scripts >> (understandably, np) but conceptually these two things could be integrated? >> >> How is my run progressing, both as a numeric status display and as an evolving >> plot, is something almost all users care about, especially for initial runs >> where the user is asking "is my running running correctly" and "is it running >> as fast as it should/could?" >> >> On 4/13/09 10:23 AM, Ben Clifford wrote: >>> I just did a swift hands on session with about 15 users who now have a few >>> days experience in building simple cluster/grid scale applications. >>> >>> One of the things I did was have them run swift-plot-log on Swift runs, and >>> look at some of the graphics there. >>> >>> I think this gave a very useful visual impression of what was going on >>> during runs, and so I think some incorporation of that into future Swift >>> tutorials would be useful. >>> >> From benc at hawaga.org.uk Mon Apr 13 14:36:53 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 19:36:53 +0000 (GMT) Subject: [Swift-devel] Re: swift + condor-g In-Reply-To: References: Message-ID: after more fiddling than I was expecting with the cog condor provider and the condor installation that I'm using, I managed to get first.swift running through condor-g onto the University of Johannesburg site that I am working at. so this looks hopeful. 
-- From hategan at mcs.anl.gov Mon Apr 13 14:47:14 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Apr 2009 14:47:14 -0500 Subject: [Swift-devel] Re: swift + condor-g In-Reply-To: References: Message-ID: <1239652034.14873.0.camel@localhost> That provider I think needs some updating to use job logs instead of the current hold-in-queue-and-then-remove scheme. On Mon, 2009-04-13 at 19:36 +0000, Ben Clifford wrote: > after more fiddling than I was expecting with the cog condor provider and > the condor installation that I'm using, I managed to get first.swift > running through condor-g onto the University of Johannesburg site that I > am working at. > > so this looks hopeful. > From benc at hawaga.org.uk Mon Apr 13 15:01:23 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 20:01:23 +0000 (GMT) Subject: [Swift-devel] Re: swift + condor-g In-Reply-To: <1239652034.14873.0.camel@localhost> References: <1239652034.14873.0.camel@localhost> Message-ID: It has more serious problems (as in problems that stop it working for me entirely) that are perhaps more important to address. The changes I made to get it going with condor-g (but some are to make it work with condor-anything) are in here: http://www.ci.uchicago.edu/~benc/tmp/condor-g.patch In summary what I changed to make it run at UJ are: * exit code handling - the in-SVN version expects an exit code file to appear but never gets one. The patch removes that expectation - I haven't looked enough to see if it returns exit codes properly or not because Swift doesn't care. * LeaveJobInQueue needs to have no + at the start of it (I can't find much clear documentation about this on the condor web site) * quoting is kinda screwy (as seems traditional with condor) - what breaks running first.swift is for an empty argument, the present quoting behaviour makes the swift wrapper look for an input file called "" rather than taking no input files.
The patch above contains some other mess, as well as the changes I made to get it to submit to the OSG CE at UJ. On Mon, 13 Apr 2009, Mihael Hategan wrote: > That provider I think needs some updating to use job logs instead of the > current hold-in-queue-and-then-remove scheme. > > On Mon, 2009-04-13 at 19:36 +0000, Ben Clifford wrote: > > after more fiddling than I was expecting with the cog condor provider and > > the condor installation that I'm using, I managed to get first.swift > > running through condor-g onto the University of Johannesburg site that I > > am working at. > > > > so this looks hopeful. > > > > From hategan at mcs.anl.gov Mon Apr 13 15:08:32 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Apr 2009 15:08:32 -0500 Subject: [Swift-devel] Re: swift + condor-g In-Reply-To: References: <1239652034.14873.0.camel@localhost> Message-ID: <1239653312.15311.1.camel@localhost> On Mon, 2009-04-13 at 20:01 +0000, Ben Clifford wrote: > It has more serious problems (as in problems that stop it working for me > entirely) that are perhaps more important to address. The changes I made > to get it going with condor-g (but some are to make it work with > condor-anything) are in here: > > http://www.ci.uchicago.edu/~benc/tmp/condor-g.patch > > In summary what I changed to make it run at UJ are: > > * exit code handling - the in-SVN version expects an exit code file to > appear but never gets one. Odd. It was supposed to use the exit code ad. > the patch removes that expectation - I haven't > looked enough to see if it returns exit codes properly or not because > Swift doesn't care. 
> > * LeaveJobInQueue needs to have no + at the start of it (I can't find > much clear documentation about this on the condor web site) > > * quoting is kinda screwy (as seems traditional with condor) - what > breaks running first.swift is for an empty argument, he preent quoting > behaviour makes the swift wrapper look forinput file called "" rather than > taking no input files. What version of condor is that? > > The patch above contains some other mess, as well as the changes I made to > get it to submit to the OSG CE at UJ. > > > > On Mon, 13 Apr 2009, Mihael Hategan wrote: > > > That provider I think needs some updating to use job logs instead of the > > current hold-in-queue-and-then-remove scheme. > > > > On Mon, 2009-04-13 at 19:36 +0000, Ben Clifford wrote: > > > after more fiddling than I was expecting with the cog condor provider and > > > the condor installation that I'm using, I managed to get first.swift > > > running through condor-g onto the University of Johannesburg site that I > > > am working at. > > > > > > so this looks hopeful. > > > > > > > From benc at hawaga.org.uk Mon Apr 13 15:11:08 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 20:11:08 +0000 (GMT) Subject: [Swift-devel] Re: swift + condor-g In-Reply-To: <1239653312.15311.1.camel@localhost> References: <1239652034.14873.0.camel@localhost> <1239653312.15311.1.camel@localhost> Message-ID: On Mon, 13 Apr 2009, Mihael Hategan wrote: > Odd. It was supposed to use the exit code ad. It looks like it pulls exit codes from condor_q status information, but I haven't actually checked if that flows all the way through. > What version of condor is that? 
[benc at osg-ui ~]$ condor_version $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $ -- From hategan at mcs.anl.gov Mon Apr 13 15:25:53 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Apr 2009 15:25:53 -0500 Subject: [Swift-devel] Re: swift + condor-g In-Reply-To: References: <1239652034.14873.0.camel@localhost> Message-ID: <1239654353.15550.4.camel@localhost> On Mon, 2009-04-13 at 20:01 +0000, Ben Clifford wrote: > It has more serious problems (as in problems that stop it working for me > entirely) that are perhaps more important to address. The changes I made > to get it going with condor-g (but some are to make it work with > condor-anything) are in here: > > http://www.ci.uchicago.edu/~benc/tmp/condor-g.patch > > In summary what I changed to make it run at UJ are: > > * exit code handling - the in-SVN version expects an exit code file to > appear but never gets one. the patch removes that expectation - I haven't > looked enough to see if it returns exit codes properly or not because > Swift doesn't care. > > * LeaveJobInQueue needs to have no + at the start of it (I can't find > much clear documentation about this on the condor web site) So http://www.cs.wisc.edu/condor/manual/v7.2/condor_submit.html, mentions the following (and so does http://www.cs.wisc.edu/condor/manual/v7.0/condor_submit.html): + = A line which begins with a '+' (plus) character instructs condor_submit to insert the following attribute into the job ClassAd with the given value. Luckily using log files would get rid of this issue. From benc at hawaga.org.uk Mon Apr 13 15:51:19 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 20:51:19 +0000 (GMT) Subject: [Swift-devel] Re: swift + condor-g In-Reply-To: <1239654353.15550.4.camel@localhost> References: <1239652034.14873.0.camel@localhost> <1239654353.15550.4.camel@localhost> Message-ID: I get the + but I don't see anything describing the specific classads. 
Different from both what I did and what is in the SVN at the moment, both the 7.0 and 7.2 docs list a leave_in_queue = submit file command. -- From hategan at mcs.anl.gov Mon Apr 13 16:20:34 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 13 Apr 2009 16:20:34 -0500 Subject: [Swift-devel] Re: swift + condor-g In-Reply-To: References: <1239652034.14873.0.camel@localhost> <1239654353.15550.4.camel@localhost> Message-ID: <1239657634.16478.1.camel@localhost> On Mon, 2009-04-13 at 20:51 +0000, Ben Clifford wrote: > I get the + but I don't see anything describing the specific classads. If you do a condor_q -f (or whatever stands for full), it will display all the classads for the job. There you'll see it. It is also mentioned in several mailing list threads. > > Different from both what I did and what is in the SVN at the moment, both > the 7.0 and 7.2 docs list a leave_in_queue = submit file command. > From benc at hawaga.org.uk Mon Apr 13 16:23:33 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 13 Apr 2009 21:23:33 +0000 (GMT) Subject: [Swift-devel] Re: swift + condor-g In-Reply-To: <1239657634.16478.1.camel@localhost> References: <1239652034.14873.0.camel@localhost> <1239654353.15550.4.camel@localhost> <1239657634.16478.1.camel@localhost> Message-ID: On Mon, 13 Apr 2009, Mihael Hategan wrote: > On Mon, 2009-04-13 at 20:51 +0000, Ben Clifford wrote: > > I get the + but I don't see anything describing the specific classads. > > If you do a condor_q -f (or whatever stands for full), it will display > all the classads for the job. There you'll see it. It is also mentioned > in several mailing list threads. -long Yes, I can use google and see it in plenty of threads. 
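To make the two spellings being debated concrete, the relevant lines of a Condor submit description file would look roughly like this (a sketch; only the relevant lines are shown, and the TRUE value follows the patch in this thread):

```
# ClassAd-insertion form: the leading '+' tells condor_submit to insert
# the attribute LeaveJobInQueue directly into the job ClassAd
+LeaveJobInQueue = TRUE

# Documented submit-command form, per the 7.0/7.2 condor_submit manual
leave_in_queue = TRUE
```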
But the only in-manual documented option seems to be leave_in_queue -- From iraicu at cs.uchicago.edu Mon Apr 13 17:15:50 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 13 Apr 2009 15:15:50 -0700 Subject: [Swift-devel] Re: swift plot In-Reply-To: <49E34E26.8060704@uchicago.edu> References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> <49E34E26.8060704@uchicago.edu> Message-ID: <49E3B996.7040209@cs.uchicago.edu> There are no notifications sent from Falkon to Swift when going from queued to running state. However, for the purpose of running on the BG/P, if Swift is configured properly, then any job submitted to Falkon will go into a running state almost instantly, as Swift should not be submitting jobs to Falkon unless Falkon has idle CPUs. So, you can treat the "submitted" jobs as "active" jobs, for the purpose of getting a good idea of how the workflow is progressing. Perhaps, we could even modify the Falkon provider to change the job state from submitted to active, upon successful submission. Again, if Swift is not configured right, to throttle job submission (i.e. 256 jobs per Falkon on BG/P), then this change will be misleading, such as would be the case when the throttles are open wide, and there is a single Falkon service running. Ioan Glen Hocky wrote: > Ah. The falkon provider only shows "submitted" when a job is running > and only returns "active" when a job is staging out (or about to). > > Glen > > Ben Clifford wrote: >> The graph on the front page 'Number of karajan level job submissions >> that are 'Active'' looks like for the most part only 1 or 2 CPUs were >> in use at any one time. >> >> That seems not so good - either 1 or 2 CPUs really were in use at any >> one time, which is not good for the app perspective, or maybe the >> execution provider that you are using is not reporting job statuses >> correctly. 
>> >> For comparison, run with: >> wrapperlog.always.transfer=true >> in swift.properties, and make sure that if you move the .log file you >> move the correspondingly named .d directory too, and then plot logs >> and make sure that graphs on the info tab appear (they are broken >> images at the moment). >> >> That will give a worker-nodes view of how many CPU cores are being >> used as once, for comparison. >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Mon Apr 13 17:17:30 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 13 Apr 2009 15:17:30 -0700 Subject: [Swift-devel] Re: swift plot In-Reply-To: <49E34E26.8060704@uchicago.edu> References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> <49E34E26.8060704@uchicago.edu> Message-ID: <49E3B9FA.1060709@cs.uchicago.edu> In the long run, if its important, we could have the Falkon service send back notifications about jobs going from "submitted" to "active". It would require modifications to both the Falkon service and the Falkon provider, but could be done. Ioan Glen Hocky wrote: > Ah. The falkon provider only shows "submitted" when a job is running > and only returns "active" when a job is staging out (or about to). 
> > Glen > > Ben Clifford wrote: >> The graph on the front page 'Number of karajan level job submissions >> that are 'Active'' looks like for the most part only 1 or 2 CPUs were >> in use at any one time. >> >> That seems not so good - either 1 or 2 CPUs really were in use at any >> one time, which is not good for the app perspective, or maybe the >> execution provider that you are using is not reporting job statuses >> correctly. >> >> For comparison, run with: >> wrapperlog.always.transfer=true >> in swift.properties, and make sure that if you move the .log file you >> move the correspondingly named .d directory too, and then plot logs >> and make sure that graphs on the info tab appear (they are broken >> images at the moment). >> >> That will give a worker-nodes view of how many CPU cores are being >> used as once, for comparison. >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From gabri.turcu at gmail.com Mon Apr 13 18:42:46 2009 From: gabri.turcu at gmail.com (Gabri Turcu) Date: Mon, 13 Apr 2009 18:42:46 -0500 Subject: [Swift-devel] java.lang.OutOfMemoryError when running grep on 10k files Message-ID: <9f808f850904131642t3318f63ax424a697278acea8f@mail.gmail.com> Hi, I am trying to run grep on newslab data on teraport. 
While everything works fine for a small number of patterns and files
(e.g. ~20 patterns, ~1000 files), I get errors for larger numbers of
files (~10k). I would be very grateful for any help.

The main files I'm using are (at CI):
/home/gabri/swift-0.8/examples/swift/newslabex/count/tc.data
/home/gabri/swift-0.8/examples/swift/newslabex/count/sites.xml (-using the fast queue)
/home/gabri/swift-0.8/examples/swift/newslabex/count/count.swift
/home/gabri/swift-0.8/examples/swift/newslabex/count/grp

For number of files=10k and number of patterns=2
- I'm getting a "java.lang.OutOfMemoryError". I have tried increasing
the heap size by running Swift with (-Xms1536m -Xmx4096m) in the command
line, but that seemed to just push the failure point a little further.
Is this at all the way to go?
- The corresponding logs are at:
/home/gabri/swift-0.8/examples/swift/newslabex/count/errmanyfiles/

Thank you very much for any suggestions.
Best,
Gabri
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hockyg at uchicago.edu Mon Apr 13 18:45:45 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Mon, 13 Apr 2009 18:45:45 -0500
Subject: [Swift-devel] java.lang.OutOfMemoryError when running grep on 10k files
In-Reply-To: <9f808f850904131642t3318f63ax424a697278acea8f@mail.gmail.com>
References: <9f808f850904131642t3318f63ax424a697278acea8f@mail.gmail.com>
Message-ID: <49E3CEA9.8020001@uchicago.edu>

I did the same but in COG_OPTS, i.e.
export COG_OPTS="-Xmx1024m"
or 2048 for really big jobs

Gabri Turcu wrote:
> Hi,
>
> I am trying to run grep on newslab data on teraport. While everything
> works fine for a small number of patterns and files (e.g. ~20 patterns,
> ~1000 files), I get errors for larger numbers of files (~10k). I would
> be very grateful for any help.
> > The main files I'm using are (at CI) : > /home/gabri/swift-0.8/examples/swift/newslabex/count/tc.data > /home/gabri/swift-0.8/examples/swift/newslabex/count/sites.xml > (-using the fast queue) > /home/gabri/swift-0.8/examples/swift/newslabex/count/count.swift > /home/gabri/swift-0.8/examples/swift/newslabex/count/grp > > For number of files=10k and number of patterns=2 > - I'm getting an "java.lang.OutOfMemoryError". I have tried increasing > the heap size by runnig Swift with (-Xms1536m -Xmx4096m) in the > command line, but that seemed to just push the failure point a little > further. Is this at all the way to go? > - The corresponding logs are at: > /home/gabri/swift-0.8/examples/swift/newslabex/count/errmanyfiles/ > > Thank you very much for any suggestions. > Best, > Gabri > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From gabri.turcu at gmail.com Mon Apr 13 19:23:28 2009 From: gabri.turcu at gmail.com (Gabri Turcu) Date: Mon, 13 Apr 2009 19:23:28 -0500 Subject: [Swift-devel] java.lang.OutOfMemoryError when running grep on 10k files In-Reply-To: <49E3CEA9.8020001@uchicago.edu> References: <9f808f850904131642t3318f63ax424a697278acea8f@mail.gmail.com> <49E3CEA9.8020001@uchicago.edu> Message-ID: <9f808f850904131723o1f1453fcg3764c6d37abd56b3@mail.gmail.com> Hi Glen, Thanks a lot! I can now see in the log file that what I had was not working. It seems to be on its way now. Best, Gabri On Mon, Apr 13, 2009 at 6:45 PM, Glen Hocky wrote: > I did the same but in COG_OPTS, i.e. > export COG_OPTS="-Xmx1024m" > or 2048 for really big jobs > > Gabri Turcu wrote: > >> Hi, >> >> I am trying to run grep on newslab data on teraport. While everything >> works fine for a small number of patterns and files(e.g. 
~20 patterns, ~1000 >> files), I get errors for larger numbers of files (~10k). I would be very >> grateful for any help. >> >> The main files I'm using are (at CI) : >> /home/gabri/swift-0.8/examples/swift/newslabex/count/tc.data >> /home/gabri/swift-0.8/examples/swift/newslabex/count/sites.xml (-using >> the fast queue) >> /home/gabri/swift-0.8/examples/swift/newslabex/count/count.swift >> /home/gabri/swift-0.8/examples/swift/newslabex/count/grp >> >> For number of files=10k and number of patterns=2 >> - I'm getting an "java.lang.OutOfMemoryError". I have tried increasing the >> heap size by runnig Swift with (-Xms1536m -Xmx4096m) in the command line, >> but that seemed to just push the failure point a little further. Is this at >> all the way to go? >> - The corresponding logs are at: >> /home/gabri/swift-0.8/examples/swift/newslabex/count/errmanyfiles/ >> >> Thank you very much for any suggestions. >> Best, >> Gabri >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Apr 14 00:50:21 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 14 Apr 2009 05:50:21 +0000 (GMT) Subject: [Swift-devel] Re: replication vs site score In-Reply-To: <49DD0FE2.3000505@cs.uchicago.edu> References: <1C37ED4E-D752-4870-A52C-076FC3EBBF49@anl.gov> <1239202987.12586.17.camel@localhost> <49DD0B3F.7050903@cs.uchicago.edu> <1239223485.26815.2.camel@localhost> <49DD0D24.2010909@cs.uchicago.edu> <1239224070.27089.4.camel@localhost> <49DD0FE2.3000505@cs.uchicago.edu> Message-ID: On Wed, 8 Apr 2009, Ioan Raicu wrote: > in queue state, might be reflected in the error bars. Nevertheless, I think it > might be an interesting improvement to the current Swift scheduler. 
Ben, was > this on the list of Google summer of code projects? If not, perhaps you might > want to add it. There is a project that I hope will be accepted to do more scheduler work. It can go as the "might be interesting to try", although the examples hategan gives in his reply to the above suggest the answers are so vague they may be useless. -- From benc at hawaga.org.uk Tue Apr 14 01:06:09 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 14 Apr 2009 06:06:09 +0000 (GMT) Subject: [Swift-devel] Re: swift plot In-Reply-To: <49E3B9FA.1060709@cs.uchicago.edu> References: <49E2D96D.8080401@uchicago.edu> <49E345DF.5090500@mcs.anl.gov> <49E34E26.8060704@uchicago.edu> <49E3B9FA.1060709@cs.uchicago.edu> Message-ID: On Mon, 13 Apr 2009, Ioan Raicu wrote: > In the long run, if its important, we could have the Falkon service send back > notifications about jobs going from "submitted" to "active". It would require > modifications to both the Falkon service and the Falkon provider, but could be > done. > I think it is useful for the sake of the Principle of Least Surprise. -- From benc at hawaga.org.uk Tue Apr 14 10:19:06 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 14 Apr 2009 15:19:06 +0000 (GMT) Subject: [Swift-devel] localscheduler condor fix Message-ID: The localscheduler condor code as of cog r2382 does not behave in such a way that I can use it to run Swift jobs to a local condor pool on gwynn.bsd.uchicago.edu. The attached patch is sufficient (but not all necessary) to make it submit basic Swift jobs (Swift quoting tests fail, but quoting tests fail for GRAM2+jobmanager-condor so I'm not overly concerned) I would like to commit this, but here's a chance to object. 
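The quoting change in the attached patch can be tried in isolation with a small standalone sketch. The class name, the TRIGGERS table, and the simplified copy loop below are illustrative only; the real CoG CondorExecutor differs in detail (in particular, its copy loop does per-character work that is omitted here):

```java
// Standalone sketch of the quoting behaviour the attached patch aims for.
public class CondorQuoteSketch {

    // Characters that force an argument to be quoted (mirrors the patched
    // TRIGGERS table: '|' dropped, '"' added).
    private static final boolean[] TRIGGERS = new boolean[256];
    static {
        TRIGGERS[' '] = true;
        TRIGGERS['\n'] = true;
        TRIGGERS['\t'] = true;
        TRIGGERS['\\'] = true;
        TRIGGERS['>'] = true;
        TRIGGERS['<'] = true;
        TRIGGERS['"'] = true;
    }

    static String quote(String s) {
        // Patched behaviour: an empty argument maps to an empty string.
        // Returning "\"\"" made the swift wrapper look for an input file
        // literally named "".
        if ("".equals(s)) {
            return "";
        }
        boolean quotes = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 256 && TRIGGERS[c]) {
                quotes = true;
                break;
            }
        }
        StringBuilder sb = new StringBuilder();
        if (quotes) {
            // Patched behaviour: the surrounding quote is backslash-escaped.
            sb.append('\\').append('"');
        }
        sb.append(s); // simplification: per-character escaping omitted
        if (quotes) {
            sb.append('\\').append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(quote("plain"));      // no trigger chars: unchanged
        System.out.println(quote("two words"));  // space triggers \"...\" quoting
        System.out.println(quote("").isEmpty()); // empty argument stays empty
    }
}
```

This is only meant to show why an empty argument no longer turns into a literal "" on the condor command line.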
--
-------------- next part --------------
diff --git a/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/CondorExecutor.java b/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/CondorExecutor.java
index eeb1380..1921615 100644
--- a/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/CondorExecutor.java
+++ b/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/CondorExecutor.java
@@ -91,7 +91,7 @@ public class CondorExecutor extends AbstractExecutor {
         }
         wr.write('\n');
         wr.write("notification = Never\n");
-        wr.write("+LeaveJobInQueue = TRUE\n");
+        wr.write("leave_in_queue = TRUE\n");
         wr.write("queue\n");
         wr.close();
     }
@@ -103,15 +103,15 @@ public class CondorExecutor extends AbstractExecutor {
         TRIGGERS[' '] = true;
         TRIGGERS['\n'] = true;
         TRIGGERS['\t'] = true;
-        TRIGGERS['|'] = true;
         TRIGGERS['\\'] = true;
         TRIGGERS['>'] = true;
         TRIGGERS['<'] = true;
+        TRIGGERS['"'] = true;
     }

     protected String quote(String s) {
         if ("".equals(s)) {
-            return "\"\"";
+            return "";
         }
         boolean quotes = false;
         for (int i = 0; i < s.length(); i++) {
@@ -126,6 +126,7 @@ public class CondorExecutor extends AbstractExecutor {
         }
         StringBuffer sb = new StringBuffer();
         if (quotes) {
+            sb.append('\\');
             sb.append('"');
         }
         for (int i = 0; i < s.length(); i++) {
@@ -136,6 +137,7 @@ public class CondorExecutor extends AbstractExecutor {
             sb.append(c);
         }
         if (quotes) {
+            sb.append('\\');
             sb.append('"');
         }
         return sb.toString();
@@ -198,7 +200,7 @@ public class CondorExecutor extends AbstractExecutor {
             FileLocation stdErrorLocation, String exitcode,
             AbstractExecutor executor) {
         return new Job(jobid, stdout, stdOutputLocation, stderr,
-                stdErrorLocation, exitcode, executor);
+                stdErrorLocation, null, executor);
     }

     private static QueuePoller poller;
diff --git a/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/QueuePoller.java b/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/QueuePoller.java
index a94dfa7..2d653c3 100644
--- a/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/QueuePoller.java
+++ b/modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/QueuePoller.java
@@ -39,7 +39,7 @@ public class QueuePoller extends AbstractQueuePoller {

     protected void processStdout(InputStream is) throws IOException {
         if (logger.isDebugEnabled()) {
-            logger.debug("Processing qstat stdout");
+            logger.debug("Processing condor_q stdout");
         }
         BufferedReader br = new BufferedReader(new InputStreamReader(is));
         String line;

From hategan at mcs.anl.gov Tue Apr 14 10:40:09 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 14 Apr 2009 10:40:09 -0500
Subject: [Swift-devel] localscheduler condor fix
In-Reply-To: 
References: 
Message-ID: <1239723609.29943.1.camel@localhost>

On Tue, 2009-04-14 at 15:19 +0000, Ben Clifford wrote:
> The localscheduler condor code as of cog r2382 does not behave in such a
> way that I can use it to run Swift jobs to a local condor pool on
> gwynn.bsd.uchicago.edu.
>
> The attached patch is sufficient (but not all necessary) to make it submit
> basic Swift jobs (Swift quoting tests fail, but quoting tests fail for
> GRAM2+jobmanager-condor so I'm not overly concerned)
>
> I would like to commit this, but here's a chance to object.

No objection from me.

From benc at hawaga.org.uk Wed Apr 15 02:08:19 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 15 Apr 2009 07:08:19 +0000 (GMT)
Subject: [Swift-devel] cleanup job attributes
Message-ID: 

cleanup jobs don't seem to get the globus profile keys that normal jobs
get from sites.xml. this is causing me trouble with my condor-g provider
code, because those profile keys are used to configure appropriate
submission settings.
it's not immediately apparent why that submission is not picking up the
keys but the execute2 submissions are.

--

From benc at hawaga.org.uk Wed Apr 15 03:17:19 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 15 Apr 2009 08:17:19 +0000 (GMT)
Subject: [Swift-devel] Swift + Condor-G
Message-ID: 

CoG svn r2388 and Swift r2846 contain the ability to submit Swift jobs
to gt2 sites via Condor-G. Below is a site definition that I have used
to submit to an OSG site at RENCI. I would appreciate feedback from
anyone who tests this successfully or unsuccessfully.

As previously mentioned, this is not intended to replace the existing
gram2 submission mechanisms; it provides a way to submit to OSG sites
where plain gram2 is (to a greater or lesser extent) discouraged or
disallowed or dysfunctional. It requires a local condor installation
(which is a strong argument against using this if you do not already
have such in place - this was one of the traumatic parts of getting VDS
running)

/nfs/home/osgedu/benc grid gt2 belhaven-1.renci.org/jobmanager-fork

--

From benc at hawaga.org.uk Wed Apr 15 04:25:30 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 15 Apr 2009 09:25:30 +0000 (GMT)
Subject: [Swift-devel] Is there a site count limit?
In-Reply-To: <1239394510.27021.1.camel@localhost>
References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost>
Message-ID: 

On Fri, 10 Apr 2009, Mihael Hategan wrote:

> Make the initial score larger. 10000 should be enough. As it goes to
> +inf, you should have a max of 100*jobThrottle + 1 jobs.

I would like to change the user interface for jobThrottle and initialScore
to be much more human friendly - the present input numbers, whilst
sufficiently expressive, are ridiculous from a normal user perspective.
I would like to make a parameter to replace jobThrottle (either in CoG or
in Swift) with which you specify the actual maximum number of jobs
(so that maxJobs = 100*jobThrottle + 1)

For initial score I would like to find some value range that is easier to
understand for users - perhaps a fraction that indicates how many of
the maxJobs the initialScore will start at.

--

From benc at hawaga.org.uk Wed Apr 15 04:21:57 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 15 Apr 2009 09:21:57 +0000 (GMT)
Subject: [Swift-devel] Is there a site count limit?
In-Reply-To: <49DFA1AD.8050000@mcs.anl.gov>
References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov>
Message-ID: 

On Fri, 10 Apr 2009, Michael Wilde wrote:

> int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c | awk '{ sum
> += $1} END {print sum}'

The log processing tools generate an HTML table with jobs per site
(including information on how many succeeded and failed per site).

For swift <=r2857 you could say:

swift-plot-log foo.log jobs-sites.table

r2858 and onwards, it's now called jobs-sites.html to make it easier to
open in a web browser.

--

From wilde at mcs.anl.gov Wed Apr 15 07:47:02 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 15 Apr 2009 07:47:02 -0500
Subject: [Swift-devel] Is there a site count limit?
In-Reply-To: 
References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost>
Message-ID: <49E5D746.1090304@mcs.anl.gov>

Yes, this would be very good. And at the same time review the rest of the
throttle parameters.
I always have to scratch my head and re-read the descriptions to get the
behavior I want. If the other throttles are OK as-is, then it would still
be good to clarify their explanations in the properties file text and
provide examples in the user guide.

On 4/15/09 4:25 AM, Ben Clifford wrote:
> On Fri, 10 Apr 2009, Mihael Hategan wrote:
>
>> Make the initial score larger. 10000 should be enough. As it goes to
>> +inf, you should have a max of 100*jobThrottle + 1 jobs.
>
> I would like to change the user interface for jobThrottle and initialScore
> to be much more human friendly - the present input numbers, whilst
> sufficiently expressive, are ridiculous from a normal user perspective.
>
> I would like to make a parameter to replace jobThrottle (either in CoG or
> in Swift) that you specify the actual maximum number of jobs
> (so that maxJobs = 100*jobThrottle +1)
>
> For initial score I would like to find some value range that is easier to
> understand for users - perhaps a fraction that indicates how many of
> the maxJobs the initialScore will start at.

From wilde at mcs.anl.gov Wed Apr 15 08:23:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 15 Apr 2009 08:23:47 -0500
Subject: [Swift-devel] oops provenance
In-Reply-To: <49E2706B.70205@uchicago.edu>
References: <49E2652D.9010901@uchicago.edu> <49E26BC9.5090103@mcs.anl.gov> <49E26CD6.6060700@uchicago.edu> <49E26F94.4020907@mcs.anl.gov> <49E2706B.70205@uchicago.edu>
Message-ID: <49E5DFE3.7040804@mcs.anl.gov>

was: Re: first stab

Glen, I'd like to pick up on the comment on provenance you made last
Sunday while gathering data for the oops paper.

Ben is focusing on provenance at the moment, and so I'd like to include
him in the discussion (although I know you are focusing on the provenance
challenge at the moment).
I think a starting point for oops provenance is this: For every run, you want to know: - an ID for the run - time and date started / ended - how many jobs ran where - location of the output files & logs - the run parameters (proteins, config params, scale) - analyzed scores of the run output - what version of oops was used - what version of swift/cog was used - what version of the oops.swift script was used Given this in a database, you could also compare structure scores for one version of code or one algorithm vs another We're also always looking to see what level of parallelism was achieved by swift, so some way of getting that out of the logs, up to an including full log plots, would be handy. - Mike On 4/12/09 5:51 PM, Glen Hocky wrote: > well, if i'd done a summary before hand, i may have tried to do a few > extra proteins or something. anyway, i think everything is fine, but i > am definitely going to have to think about some way of summarizing. i > think this makes a good case for provenance tracking though :) > > Michael Wilde wrote: >> Hi Glen, >> >> Not sure what you mean by "... annoyed at the bredth of the runs that >> i've done. i may do a few more on abe qb and ranger if they are >> working because that would just take a few hours" >> >> As in the runs are not yielding something interesting to write about? >> >> One thing we can talk about is just the run time, etc. We havent >> looked at that closely, but hopefully its good enough to be worth citing. >> >> Anything I can do to help organize or discuss this section with you? 
>> >> - Mike From benc at hawaga.org.uk Wed Apr 15 08:57:37 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 13:57:37 +0000 (GMT) Subject: [Swift-devel] Re: oops provenance In-Reply-To: <49E5DFE3.7040804@mcs.anl.gov> References: <49E2652D.9010901@uchicago.edu> <49E26BC9.5090103@mcs.anl.gov> <49E26CD6.6060700@uchicago.edu> <49E26F94.4020907@mcs.anl.gov> <49E2706B.70205@uchicago.edu> <49E5DFE3.7040804@mcs.anl.gov> Message-ID: On Wed, 15 Apr 2009, Michael Wilde wrote: > We're also always looking to see what level of parallelism was achieved by > swift, so some way of getting that out of the logs, up to an including full > log plots, would be handy. You can get that already based on the -info log plots - get wrapper logs like I said and you'll get some text stats and also a graph of CPUs in use over time. I'll send notes on provenancedb in a bit. -- From benc at hawaga.org.uk Wed Apr 15 10:58:29 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 15:58:29 +0000 (GMT) Subject: [Swift-devel] Re: oops provenance In-Reply-To: <49E5DFE3.7040804@mcs.anl.gov> References: <49E2652D.9010901@uchicago.edu> <49E26BC9.5090103@mcs.anl.gov> <49E26CD6.6060700@uchicago.edu> <49E26F94.4020907@mcs.anl.gov> <49E2706B.70205@uchicago.edu> <49E5DFE3.7040804@mcs.anl.gov> Message-ID: If you want to look at the provenance db as it is now, read sections 2 and 3 of this page: http://www.ci.uchicago.edu/~benc/provenance.html#owndb I recommend if you try this at home to use sqlite3, not postgres. > I think a starting point for oops provenance is this: For every run, you want > to know: many of these are straightforward to add, and i will look at doing so after pc3 stuff > - an ID for the run this exists now > - analyzed scores of the run output not sure what that is - is this application specific output? > - what version of oops was used the extrainfo stuff I implemented previously for the oops app may be used here. 
I've heard no feedback about it actually being used, though. > - what version of the oops.swift script was used For all the version stuff, you need to figure out what version semantics you want (eg md5sum of swift script, which gives fine grained version distinction but no order; user specified version numbering which is pretty much guaranteed to be wrong but you might think you want that, and also gives ordering; ... there are lots of schemes ...) > Given this in a database, you could also compare structure scores for one > version of code or one algorithm vs another This is more application specific data? Are you looking here to have data output from a run end up in a database? -- From benc at hawaga.org.uk Wed Apr 15 11:19:34 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 16:19:34 +0000 (GMT) Subject: [Swift-devel] Is there a site count limit? In-Reply-To: <49E5D746.1090304@mcs.anl.gov> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost> <49E5D746.1090304@mcs.anl.gov> Message-ID: On Wed, 15 Apr 2009, Michael Wilde wrote: > Yes, this would be very good. And at the same time review the rest of the > throttle parameters. I always have to scratch my head and re-read the > descriptions to get the behavior I want. If the other throttles asre OK as-is, > then it would still be good to clarify their explanations in the properties > file test and provide examples in the user guide. I drew a diagram and sent it to the swift-devel list that was a draft of a graphical explanation of where the various throttle settings fit in. No one commented so I have taken it no further. 
http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-February/004318.html

--

From wilde at mcs.anl.gov Wed Apr 15 11:44:55 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 15 Apr 2009 11:44:55 -0500
Subject: [Swift-devel] Re: oops provenance
In-Reply-To: 
References: <49E2652D.9010901@uchicago.edu> <49E26BC9.5090103@mcs.anl.gov> <49E26CD6.6060700@uchicago.edu> <49E26F94.4020907@mcs.anl.gov> <49E2706B.70205@uchicago.edu> <49E5DFE3.7040804@mcs.anl.gov>
Message-ID: <49E60F07.8070508@mcs.anl.gov>

On 4/15/09 10:58 AM, Ben Clifford wrote:
> If you want to look at the provenance db as it is now, read sections 2 and
> 3 of this page:
>
> http://www.ci.uchicago.edu/~benc/provenance.html#owndb
>
> I recommend if you try this at home to use sqlite3, not postgres.
>
>> I think a starting point for oops provenance is this: For every run, you want
>> to know:
>
> many of these are straightforward to add, and i will look at doing so
> after pc3 stuff
>
>> - an ID for the run
>
> this exists now

yes, "but". It's long and hard to manage. We have experience now with
both Falkon and Swift in giving runs simple short IDs, and that has
worked well. It's so much easier to talk about oops run 0042 than run
*imqvgr8. The long ID is also useful but should be more hidden and
internal. How we do this should tie in with where we go with swift run
management conventions.

>
>> - analyzed scores of the run output
>
> not sure what that is - is this application specific output?

yes

>
>> - what version of oops was used
>
> the extrainfo stuff I implemented previously for the oops app may be used
> here. I've heard no feedback about it actually being used, though.

right, that is the solution. we need to test it.
> >> - what version of the oops.swift script was used > > For all the version stuff, you need to figure out what version semantics > you want (eg md5sum of swift script, which gives fine grained version > distinction but no order; user specified version numbering which is pretty > much guaranteed to be wrong but you might think you want that, and also > gives ordering; ... there are lots of schemes ...) hmmm - all those sound good - can we have them all? ;) seriously, though - a few thoughts on this: - I lean to close integration with svn on versions, ie use svn to version code, including swift scripts, and use svn revision IDs as well as software release numbers to define versions of code. Ie, oops rev 0428 or oops release 1.2.4, depending on what you were running. - I can now see the merits and use of the old vdl constructs namespace::name:version, and would like to explore how to use and integrate that into Swift. - I think the md5sum etc stuff is useful, and also good for research into "airtight" provenance, but less immediately needed by users. And when added, seems like that kind of thing that's nice to have always running in the background, to resolve thorny provenance questions, but should seldom be visible to the end user. >> Given this in a database, you could also compare structure scores for one >> version of code or one algorithm vs another > > This is more application specific data? I think it's a join of app-specific and swift-maintained. Eg, in the current oops.swift script, the user can specify via cmd line arg which of 2 oops algorithms to use ("classic" or "rama"). So I could easily see a parameter sweep that says: for each protein in plist, do the full sweep for both algorithms, and give me tables, plots etc that let me compare them. So far, that is more "application" than provenance. But now, do the same thing but compare rama 1.2.4 with rama 1.2.6. Depending on how that's expressed, it could utilize provenance info. 
Especially if the question was asked "retrospectively" on the provenance data, as opposed to set up in advance as a comparative workflow. Ie, look at the runtime-per-simulation of each of the last 3 rama versions. > Are you looking here to have data output from a run end up in a database? Yes, that's being considered, as an application thing, in addition to and separate from the provenance data. I will send the OOPS paper to swift and try to get it posted on the swift web soon. It's got some nice stats in sec 5 on #runs, that would be great to derive on a running basis from collected provenance data. - Mike From hategan at mcs.anl.gov Wed Apr 15 11:50:35 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 11:50:35 -0500 Subject: [Swift-devel] Is there a site count limit? In-Reply-To: References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost> <49E5D746.1090304@mcs.anl.gov> Message-ID: <1239814235.17883.1.camel@localhost> On Wed, 2009-04-15 at 16:19 +0000, Ben Clifford wrote: > On Wed, 15 Apr 2009, Michael Wilde wrote: > > > Yes, this would be very good. And at the same time review the rest of the > > throttle parameters. I always have to scratch my head and re-read the > > descriptions to get the behavior I want. If the other throttles are OK as-is, > > then it would still be good to clarify their explanations in the properties > > file text and provide examples in the user guide. > > I drew a diagram and sent it to the swift-devel list that was a draft of a > graphical explanation of where the various throttle settings fit in. No > one commented so I have taken it no further. 
> > http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-February/004318.html > Yeah, we should put that in the user guide, but could you enhance the contrast a bit before doing that? From wilde at mcs.anl.gov Wed Apr 15 11:58:47 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 15 Apr 2009 11:58:47 -0500 Subject: [Swift-devel] Is there a site count limit? In-Reply-To: References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost> <49E5D746.1090304@mcs.anl.gov> Message-ID: <49E61247.8090007@mcs.anl.gov> Ah, cool. Now that I see it again I remember why I didn't comment: it's because it was complex and I didn't understand it. I just spent a few minutes digesting the diagram, and yes, that gives a nice explanation of the top-level behavior. I feel that the throttling mechanism is both important enough, timely (in that real users need it yet are stymied by it), complex enough, and interesting enough (ie it could merit a research paper) that we should dig back into it. Add in coasters to the mix, and scheduling/throttling gets yet another dimension (and both more complexity and an opportunity to reduce complexity based on knowledge of available workers). On 4/15/09 11:19 AM, Ben Clifford wrote: > On Wed, 15 Apr 2009, Michael Wilde wrote: > >> Yes, this would be very good. And at the same time review the rest of the >> throttle parameters. I always have to scratch my head and re-read the >> descriptions to get the behavior I want. If the other throttles are OK as-is, >> then it would still be good to clarify their explanations in the properties >> file text and provide examples in the user guide. > > I drew a diagram and sent it to the swift-devel list that was a draft of a > graphical explanation of where the various throttle settings fit in. 
No > one commented so I have taken it no further. > > http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-February/004318.html > From wilde at mcs.anl.gov Wed Apr 15 12:01:22 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 15 Apr 2009 12:01:22 -0500 Subject: [Swift-devel] Is there a site count limit? In-Reply-To: <1239814235.17883.1.camel@localhost> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost> <49E5D746.1090304@mcs.anl.gov> <1239814235.17883.1.camel@localhost> Message-ID: <49E612E2.9090802@mcs.anl.gov> On 4/15/09 11:50 AM, Mihael Hategan wrote: > On Wed, 2009-04-15 at 16:19 +0000, Ben Clifford wrote: >> On Wed, 15 Apr 2009, Michael Wilde wrote: >> >>> Yes, this would be very good. And at the same time review the rest of the >>> throttle parameters. I always have to scratch my head and re-read the >>> descriptions to get the behavior I want. If the other throttles are OK as-is, >>> then it would still be good to clarify their explanations in the properties >>> file text and provide examples in the user guide. >> I drew a diagram and sent it to the swift-devel list that was a draft of a >> graphical explanation of where the various throttle settings fit in. No >> one commented so I have taken it no further. >> >> http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-February/004318.html >> > > Yeah, we should put that in the user guide, but could you enhance the > contrast a bit before doing that? Do we have a standard tool for graphics for the ug (so we can maintain the source file in a maintainable way)? Although the original does have a nice home-cooked flair. 
:) From benc at hawaga.org.uk Wed Apr 15 12:08:17 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 17:08:17 +0000 (GMT) Subject: [Swift-devel] Re: oops provenance In-Reply-To: <49E60F07.8070508@mcs.anl.gov> References: <49E2652D.9010901@uchicago.edu> <49E26BC9.5090103@mcs.anl.gov> <49E26CD6.6060700@uchicago.edu> <49E26F94.4020907@mcs.anl.gov> <49E2706B.70205@uchicago.edu> <49E5DFE3.7040804@mcs.anl.gov> <49E60F07.8070508@mcs.anl.gov> Message-ID: On Wed, 15 Apr 2009, Michael Wilde wrote: > yes, "but". It's long and hard to manage. We have experience now with both > Falkon and Swift in giving runs simple short IDs, and that has worked well. > It's so much easier to talk about oops run 0042 than run *imqvgr8. The long ID > is also useful but should be more hidden and internal. Swift has a -runid parameter that allows you to override the date/time/randomid with a string of your choosing. This was implemented specifically to address this case, where the end user believes they can successfully give runs IDs themselves and so swift does not need to. None of the code requires the run id to be in the default generated format - that's a conservative verbose default for when the user does not want to generate identifiers themselves. swift -runid foo first.xml for example, will generate a logfile first-foo.log -- From benc at hawaga.org.uk Wed Apr 15 12:09:54 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 17:09:54 +0000 (GMT) Subject: [Swift-devel] Is there a site count limit? In-Reply-To: <49E61247.8090007@mcs.anl.gov> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost> <49E5D746.1090304@mcs.anl.gov> <49E61247.8090007@mcs.anl.gov> Message-ID: On Wed, 15 Apr 2009, Michael Wilde wrote: > Ah, cool. 
Now that I see it again I remember why I didn't comment: it's because > it was complex and I didn't understand it. > I just spent a few minutes digesting the diagram, and yes, that gives a nice > explanation of the top-level behavior. yes, no amount of documentation in the world will make it possible for you to absorb a complex concept without actually engaging your brain ;) > I feel that the throttling mechanism is both important enough, timely (in that > real users need it yet are stymied by it), complex enough, and interesting > enough (ie it could merit a research paper) that we should dig back into it. Messing with it is very different from documenting what exists now. -- From skenny at uchicago.edu Wed Apr 15 12:21:27 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 15 Apr 2009 12:21:27 -0500 (CDT) Subject: [Swift-devel] vacation 4/16 - 4/24 Message-ID: <20090415122127.BVQ07987@m4500-02.uchicago.edu> I will be on vacation/off-line starting tomorrow and will return Monday 4/27. my brain needs a break :) (of course, will be checking email intermittently) ~sk From hockyg at uchicago.edu Wed Apr 15 13:49:57 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Wed, 15 Apr 2009 13:49:57 -0500 Subject: [Swift-devel] feature request Message-ID: <49E62C55.5080107@uchicago.edu> Hi Everyone, While I'm thinking of it, one problem I had using coasters on multiple TG sites is that jobs would commit themselves to a site at the beginning of the run. This was a problem because all of the jobs would finish on one machine while jobs for the other machines were sitting in a queue. I know you may have considered this before, but an option to select a site/coaster for a single job only when one is available would be very useful for us (note: we aren't too worried about overhead in not pre-setting up files and directories because our jobs run > 10 minutes, usually closer to an hour). 
Glen From hategan at mcs.anl.gov Wed Apr 15 13:58:15 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 13:58:15 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49E62C55.5080107@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> Message-ID: <1239821895.23411.10.camel@localhost> On Wed, 2009-04-15 at 13:49 -0500, Glen Hocky wrote: > Hi Everyone, > While I'm thinking of it, one problem I had using coasters on multiple > TG sites is that jobs would commit themselves to a site at the beginning > of the run. If the site scoring parameters are left at the default, only a couple of jobs should commit to sites at start, and the number would progressively increase as sites complete jobs. Combined with replication, which is probably disabled by default, even jobs committed to sites that don't do much should be re-submitted to different sites eventually. > This was a problem because all of the jobs would finish on > one machine while jobs for the other machines were sitting in a queue. I > know you may have considered this before, but an option to select a > site/coaster for a single job only when one is available would be very > useful for us (note: we aren't too worried about overhead in not > pre-setting up files and directories because our jobs run > 10 minutes, > usually closer to an hour). That is one thing that is planned with coasters, but hasn't materialized yet, in part due to the fact that the above solution should provide a similar experience. From hockyg at uchicago.edu Wed Apr 15 14:05:36 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Wed, 15 Apr 2009 14:05:36 -0500 Subject: [Swift-devel] feature request In-Reply-To: <1239821895.23411.10.camel@localhost> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> Message-ID: <49E63000.40009@uchicago.edu> The problem with the first method was that the number of jobs, i.e. the score, increased too slowly. 
In that configuration, i believe that the behavior was 1 or 2 coasters were submitted to a single site and none to the others and then it just stayed that way for a long time. Another problem w/ the default configuration was on sites w/ Coasters per node > 1. My experience on ranger with the default parameters is that 2 coasters would start in the queue and only ~6 jobs would run on them, rather than 32. Since our jobs take ~1 hour, this means that for 32 hours of CPU time, I was getting about 6 CPU hours of work. and even after jobs started finishing in that config, the ramp up was too slow Mihael Hategan wrote: > On Wed, 2009-04-15 at 13:49 -0500, Glen Hocky wrote: > >> Hi Everyone, >> While I'm thinking of it, one problem I had using coasters on multiple >> TG sites is that jobs would commit themselves to a site at the beginning >> of the run. >> > > If the site scoring parameters are left at the default, only a couple of > jobs should commit to sites at start, and the number would progressively > increase as sites complete jobs. > > Combined with replication, which is probably disabled by default, even > jobs committed to sites that don't do much, should be re-submitted to > different sites eventually. > > >> This was a problem because all of the jobs would finish on >> one machine while jobs for the other machines were sitting in a queue. I >> know you may have considered this before, but an option to select a >> site/coaster for a single job only when one is available would be very >> useful for us (note: we aren't too worried about overhead in not >> pre-setting up files and directories because our jobs run > 10 minutes, >> usually closer to an hour). >> > > That is one thing that is planned with coasters, but hasn't materialized > yet, in part due to the fact that the above solution should provide a > similar experience. 
> > > From benc at hawaga.org.uk Wed Apr 15 14:09:47 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 19:09:47 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <1239821895.23411.10.camel@localhost> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> Message-ID: On Wed, 15 Apr 2009, Mihael Hategan wrote: > Combined with replication, which is probably disabled by default, even > jobs committed to sites that don't do much, should be re-submitted to > different sites eventually. coasters seem mostly irrelevant to the original problem report - jobs go into a queue and don't dequeue into active state fast enough. so this is pretty much exactly what replication is for. -- From hategan at mcs.anl.gov Wed Apr 15 14:16:45 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 14:16:45 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> Message-ID: <1239823005.23411.31.camel@localhost> On Wed, 2009-04-15 at 19:09 +0000, Ben Clifford wrote: > On Wed, 15 Apr 2009, Mihael Hategan wrote: > > > Combined with replication, which is probably disabled by default, even > > jobs committed to sites that don't do much, should be re-submitted to > > different sites eventually. > > coasters seem mostly irrelevant to the original problem report - jobs go > into a queue and don't dequeue into active state fast enough. so this is > pretty much exactly what replication is for. > They are relevant to the extent that re-submitting to a site that has active workers, will make the jobs go to the active state quickly. It is therefore close in behavior with having swift commit jobs only to sites that have active workers. 
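Mihael's point above, that replication re-submits a stuck job to a site that already has active workers, can be illustrated with a toy model. Everything here (the dict shapes, site names, and the 60-second threshold) is a hypothetical sketch, not actual Swift/Karajan code:

```python
# Toy model of the replication idea: a job that has sat queued on one
# site beyond a threshold gets a replica submitted to another site,
# preferring sites that already have active coaster workers.

def pick_replica_site(sites, exclude):
    """Prefer a site with active workers; fall back to any other site."""
    candidates = [s for s in sites if s["name"] != exclude]
    if not candidates:
        return None
    with_workers = [s for s in candidates if s["active_workers"] > 0]
    pool = with_workers or candidates
    return max(pool, key=lambda s: s["active_workers"])["name"]

def maybe_replicate(job, sites, queue_threshold=60):
    """Return the site to replicate a stuck queued job to, or None."""
    if job["state"] == "queued" and job["queued_for"] > queue_threshold:
        return pick_replica_site(sites, exclude=job["site"])
    return None

# Hypothetical example: one site with no workers yet, one with four.
sites = [
    {"name": "ranger", "active_workers": 0},
    {"name": "abe", "active_workers": 4},
]
```

A job queued too long on "ranger" would be replicated to "abe", which is why replication approximates "only commit jobs to sites with active workers" without changing the submission model.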
From hategan at mcs.anl.gov Wed Apr 15 14:22:05 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 14:22:05 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49E63000.40009@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> Message-ID: <1239823325.23411.38.camel@localhost> On Wed, 2009-04-15 at 14:05 -0500, Glen Hocky wrote: > The problem with the first method was that the number of jobs, i.e. > score increased too slowly. In that configuration, i believe that the > behavior was 1 or 2 coasters were submitted to a single site and none to > the others and then it just stayed that way for a long time. The behavior you mention is contrary to what should be happening, in that all sites should have had 2 swift jobs submitted to. It is possible that you've made the observation while coasters were not working properly on certain sites. The solution to that is not to fundamentally re-engineer the way swift submission works, but to make coasters run properly on those sites. > > Another problem w/ the default configuration was on sites w/ Coasters > per node > 1. My experience on ranger with the default parameters is > that 2 coasters would start in the queue and only ~6 jobs would run on > them, rather than 32. Since our jobs take ~1 hour, this means that for > 32 hours of CPU time, I was getting about 6 CPU hours of work. and even > after jobs started finishing in that config, the ramp up was too slow That is indeed a scenario which cannot be addressed by the scheme I mentioned. Telling swift that a site has a certain granularity, and 2 jobs eat the same resources as 16 jobs, is not something we have. Though I think the scoring could easily be adapted for that. 
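The scoring adaptation Mihael mentions could look roughly like the toy formula below: scale the number of jobs a given site score admits by how many workers run per node, so 2 committed coaster jobs are not costed as 2 whole nodes. The function name, parameters, and formula are illustrative assumptions, not Swift's actual scheduler:

```python
# Toy sketch of granularity-aware scoring: a site running many coaster
# workers per node should admit proportionally more jobs at the same
# score than a site running one job per node.

def allowed_jobs(score, workers_per_node=1, jobs_per_score_unit=2):
    """Jobs to dispatch to a site at the given score (toy formula)."""
    base = max(1, int(score * jobs_per_score_unit))  # score-driven ramp-up
    return base * workers_per_node                   # granularity scaling
```

Under this model a fresh site still gets only a couple of jobs by default, but declaring 16 workers per node would let the same initial score admit 32 jobs, which is the gap Glen observed on Ranger.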
From wilde at mcs.anl.gov Wed Apr 15 14:33:07 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 15 Apr 2009 14:33:07 -0500 Subject: [Swift-devel] feature request In-Reply-To: <1239823325.23411.38.camel@localhost> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> Message-ID: <49E63673.7070102@mcs.anl.gov> On 4/15/09 2:22 PM, Mihael Hategan wrote: > On Wed, 2009-04-15 at 14:05 -0500, Glen Hocky wrote: >> The problem with the first method was that the number of jobs, i.e. >> score increased too slowly. In that configuration, i believe that the >> behavior was 1 or 2 coasters were submitted to a single site and none to >> the others and then it just stayed that way for a long time. > > The behavior you mention is contrary to what should be happening, in > that all sites should have had 2 swift jobs submitted to. > > It is possible that you've made the observation while coasters were not > working properly on certain sites. The solution to that is not to > fundamentally re-engineer the way swift submission works, but to make > coasters run properly on those sites. And also to find some way to explain to users what to expect, adhering to the principle of least-astonishment. >> Another problem w/ the default configuration was on sites w/ Coasters >> per node > 1. My experience on ranger with the default parameters is >> that 2 coasters would start in the queue and only ~6 jobs would run on >> them, rather than 32. Since our jobs take ~1 hour, this means that for >> 32 hours of CPU time, I was getting about 6 CPU hours of work. and even >> after jobs started finishing in that config, the ramp up was too slow > > That is indeed a scenario which cannot be addressed by the scheme I > mentioned. Telling swift that a site has a certain granularity, and 2 > jobs eat the same resources as 16 jobs, is not something we have. 
Though > I think the scoring could easily be adapted for that. Could it not glean that from coastersPerNode? Or is that what you mean by "scoring could be adapted"? > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Apr 15 14:37:48 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 14:37:48 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49E63673.7070102@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63673.7070102@mcs.anl.gov> Message-ID: <1239824268.24331.1.camel@localhost> On Wed, 2009-04-15 at 14:33 -0500, Michael Wilde wrote: > > Could it not glean that from coastersPerNode? Or is that what you mean > by "scoring could be adapted"? That's approximately what I mean by "scoring could be adapted". From wilde at mcs.anl.gov Wed Apr 15 14:36:00 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 15 Apr 2009 14:36:00 -0500 Subject: [Swift-devel] feature request In-Reply-To: <1239823325.23411.38.camel@localhost> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> Message-ID: <49E63720.9020706@mcs.anl.gov> On 4/15/09 2:22 PM, Mihael Hategan wrote: > On Wed, 2009-04-15 at 14:05 -0500, Glen Hocky wrote: >> The problem with the first method was that the number of jobs, i.e. >> score increased too slowly. In that configuration, i believe that the >> behavior was 1 or 2 coasters were submitted to a single site and none to >> the others and then it just stayed that way for a long time. > > The behavior you mention is contrary to what should be happening, in > that all sites should have had 2 swift jobs submitted to. 
> > It is possible that you've made the observation while coasters were not > working properly on certain sites. The solution to that is not to > fundamentally re-engineer the way swift submission works, but to make > coasters run properly on those sites. I'd like to ask Zhao to do some of this testing, with OOPS on Ranger and other TG and OSG sites, as prep for getting the tools in broader use in the OOPS group. To what extent can we create and maintain a test suite for coasters that puts it through its paces in a realistic environment? Seems hard, yet highly desirable, imo. > >> Another problem w/ the default configuration was on sites w/ Coasters >> per node > 1. My experience on ranger with the default parameters is >> that 2 coasters would start in the queue and only ~6 jobs would run on >> them, rather than 32. Since our jobs take ~1 hour, this means that for >> 32 hours of CPU time, I was getting about 6 CPU hours of work. and even >> after jobs started finishing in that config, the ramp up was too slow > > That is indeed a scenario which cannot be addressed by the scheme I > mentioned. Telling swift that a site has a certain granularity, and 2 > jobs eat the same resources as 16 jobs, is not something we have. Though > I think the scoring could easily be adapted for that. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Apr 15 14:39:06 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 15 Apr 2009 14:39:06 -0500 Subject: [Swift-devel] More OOPS feature requests Message-ID: <49E637DA.6000109@mcs.anl.gov> I'd like to raise 2 initial requests for language or runtime change as a result of the oops work. - handling long in/out file lists via a file to wrapper when they get too long (already offered on the list by Ben, so this seems doable. Should I file a bugzilla req for this?) 
- global variables: are these feasible and/or desirable in oops? The motivation for this particular case was to fetch the @arg values (4 to 6 now, but growing) all in the "main" proc, but have their values available to several levels of deeper proc calls, without passing all the values down all the way. What the code does now is fetch the @args at multiple levels, which I thought was not elegant. I suspect there is further use for globals in normal coding style. So I'm wondering what the view on this is, both from a language design view and from difficulty of implementing them. Has anyone else wished for globals? Bugzilla req or more discussion? The oops paper murmurs about a 3rd feature, but I need to test that further, as the language may support it just fine. We have two variants of the oops, and provide an option to call one or the other, all within the same structure of nested rounds of simulation. One returns, eg, 5 files for each simulation. The other only returns 2 of those five. We wanted to use the same structure to describe both outputs, and just let non-returned fields stay null. These fields are final outputs, so no swift code will inspect their value (at the moment, although that will change). So I need to verify that we can return a struct where some fields are left null, without tripping into any undesired data-flow dependency semantics. I hope that this will work without problem. 
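The global-constants idea above can be shown by analogy in Python: parse the run arguments once at the top level and let deeply nested procedures read them, instead of threading every value through each call signature. The option names (--protein, --rounds) and values are hypothetical, and this is a sketch of the pattern, not SwiftScript itself:

```python
# Analogy for single-assignment globals: @arg-style options are parsed
# once in "main" and exposed as a module-level constant that nested
# code reads directly, so 4-6 values need not be passed down every level.
import argparse

def parse_run_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--protein", required=True)
    parser.add_argument("--rounds", type=int, default=1)
    return parser.parse_args(argv)

# Assigned exactly once at the top level; never reassigned afterwards,
# which is what keeps this free of mutable-global ickiness.
ARGS = parse_run_args(["--protein", "T0283", "--rounds", "3"])

def inner_simulation_round():
    # Deeply nested code uses the constants without them appearing in
    # any intermediate procedure signature.
    return "%s round-count=%d" % (ARGS.protein, ARGS.rounds)
```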
From hategan at mcs.anl.gov Wed Apr 15 14:43:41 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 14:43:41 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49E63720.9020706@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> Message-ID: <1239824621.24450.2.camel@localhost> On Wed, 2009-04-15 at 14:36 -0500, Michael Wilde wrote: > > On 4/15/09 2:22 PM, Mihael Hategan wrote: > > On Wed, 2009-04-15 at 14:05 -0500, Glen Hocky wrote: > >> The problem with the first method was that the number of jobs, i.e. > >> score increased too slowly. In that configuration, i believe that the > >> behavior was 1 or 2 coasters were submitted to a single site and none to > >> the others and then it just stayed that way for a long time. > > > > The behavior you mention is contrary to what should be happening, in > > that all sites should have had 2 swift jobs submitted to. > > > > It is possible that you've made the observation while coasters were not > > working properly on certain sites. The solution to that is not to > > fundamentally re-engineer the way swift submission works, but to make > > coasters run properly on those sites. > > I'd like to ask Zhao to do some of this testing, with OOPS on Ranger and > other TG and OSG sites, as prep for getting the tools in broader use in > the OOPS group. If it results in clear and specific problem reports, I'd be happy with it. If, however, it will mostly produce "swift doesn't work" type of reports, then we'd better divert the effort towards other things. 
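One way to get the clear and specific problem reports Mihael asks for is to script the per-site checks so every report names the site definition it ran against. A minimal hypothetical driver (the directory layout echoes the tests/sites/ convention, but the runner command is a placeholder, not the actual Swift test scripts):

```python
# Hypothetical per-site test driver: run the same smoke command against
# every *.xml site definition in a directory and collect a pass/fail
# map keyed by site name, so results are specific and reproducible.
from pathlib import Path
import subprocess

def run_site_tests(sites_dir, runner=("true",)):
    """Return {site-name: passed} for each site definition found."""
    results = {}
    for site_xml in sorted(Path(sites_dir).glob("*.xml")):
        proc = subprocess.run(list(runner) + [str(site_xml)],
                              capture_output=True)
        results[site_xml.stem] = proc.returncode == 0
    return results
```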
From benc at hawaga.org.uk Wed Apr 15 14:42:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 19:42:26 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49E63720.9020706@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> Message-ID: On Wed, 15 Apr 2009, Michael Wilde wrote: > I'd like to ask Zhao to do some of this testing, with OOPS on Ranger and other > TG and OSG sites, as prep for getting the tools in broader use in the OOPS > group. There is a per-site test suite already in place, though not run particularly often. I think any per-site testing should be based around that (for example, new site definitions contributed to the tests/sites/ svn directory and tests driven by the scripts there and any new tests also contributed there). I don't see coaster per-site testing to be hugely different than any other per-site testing (other than less of it has been done, I think) -- From hategan at mcs.anl.gov Wed Apr 15 14:47:59 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 14:47:59 -0500 Subject: [Swift-devel] More OOPS feature requests In-Reply-To: <49E637DA.6000109@mcs.anl.gov> References: <49E637DA.6000109@mcs.anl.gov> Message-ID: <1239824879.24450.5.camel@localhost> On Wed, 2009-04-15 at 14:39 -0500, Michael Wilde wrote: > I'd like to raise 2 initial requests for language or runtime change as a > result of the oops work. > > - handling long in/out file lists via a file to wrapper when they get to > long (already offered on the list by Ben, so this seems doable. Should I > file an bugzilla req for this?) > > - global variables: are these feasible and/or desirable in oops? 
> > The motivation for this particular case was to fetch the @arg values (4 > to 6 now, but growing) all in the "main" proc, but have their values > available to several levels of deeper proc calls, without passing all > the values down all the way. > > What the code does now is fetch the @args at multiple levels, which I > thought was not elegant. I suspect there is further use for globals in > normal coding style. > > So I'm wondering what the view on this is, both from a language design > view and from difficulty of implementing them. Has anyone else wished > for globals? Some form of lexically scoped constants are probably a good idea. From benc at hawaga.org.uk Wed Apr 15 14:47:24 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 19:47:24 +0000 (GMT) Subject: [Swift-devel] More OOPS feature requests In-Reply-To: <49E637DA.6000109@mcs.anl.gov> References: <49E637DA.6000109@mcs.anl.gov> Message-ID: On Wed, 15 Apr 2009, Michael Wilde wrote: > - handling long in/out file lists via a file to wrapper when they get too long > (already offered on the list by Ben, so this seems doable. Should I file a > bugzilla req for this?) File an enhancement request in bugzilla for this. I think it's straightforward. > - global variables: are these feasible and/or desirable in oops? > > The motivation for this particular case was to fetch the @arg values (4 to 6 > now, but growing) all in the "main" proc, but have their values available to > several levels of deeper proc calls, without passing all the values down all > the way. That has bothered me in the past too. 
I think it's fine to do - it doesn't introduce the ickiness that global variables do because they're single assignment (so really, global constants). File an enhancement request for that too -- From benc at hawaga.org.uk Wed Apr 15 14:51:43 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 19:51:43 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <1239824621.24450.2.camel@localhost> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <1239824621.24450.2.camel@localhost> Message-ID: On Wed, 15 Apr 2009, Mihael Hategan wrote: > If it results in clear and specific problem reports, I'd be happy with > it. If, however, it will mostly produce "swift doesn't work" type of > reports, then we'd better divert the effort towards other things. I think developing site definitions that go into the SVN and that anyone can run to reproduce (or not) is a reasonable approach to try to avoid the "I ran this poorly described test and it generically didn't work" problem. -- From wilde at mcs.anl.gov Wed Apr 15 15:04:49 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 15 Apr 2009 15:04:49 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <1239824621.24450.2.camel@localhost> Message-ID: <49E63DE1.9040707@mcs.anl.gov> Towards that end - Glen wrote a nice script to query the info.teragrid.org services and generate a sites.xml tailored for the user. 
The idea was this:

- user gets a TG login and is enabled on N sites
- user gets a proxy on their TG cert via myproxy.teragrid.org
- script probes the info service to get the site handles
- script gsissh'es to each site to find the user's home/scratch dir and anything else that is user specific

Needs testing, cleanup, improvement, I think, but I'd like to urge Glen to contribute it and us to work with him to package and maintain it. I think it would enable many people in the OOPS group to start using TG resources. Same can be done with OSG. And ADEM in this context looks very promising.

- Mike

On 4/15/09 2:51 PM, Ben Clifford wrote:
> On Wed, 15 Apr 2009, Mihael Hategan wrote:
> 
>> If it results in clear and specific problem reports, I'd be happy with
>> it. If, however, it will mostly produce "swift doesn't work" type of
>> reports, then we'd better divert the effort towards other things.
> 
> I think developing site definitions that go into the SVN and that anyone
> can run to reproduce (or not) is a reasonable approach to try to avoid the
> "I ran this poorly described test and it generically didn't work" problem.
> 

From bugzilla-daemon at mcs.anl.gov Wed Apr 15 15:07:56 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 15 Apr 2009 15:07:56 -0500 (CDT)
Subject: [Swift-devel] [Bug 200] New: Add global variables to swift
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=200

Summary: Add global variables to swift
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov

This started with a request based on coding oops.swift:

The motivation for this particular case was to fetch the @arg values (4 to 6 now, but growing) all in the "main" proc, but have their values available to several levels of deeper proc calls, without passing all the values down all the way.
What the code does now is fetch the @args at multiple levels, which I thought was not elegant. I suspect there is further use for globals in normal coding style.

-- email comment from Ben:

That has bothered me in the past too. I think it's fine to do - it doesn't introduce the ickiness that global variables do, because they're single assignment (so really, global constants). File an enhancement request for that too.

-- email comment from Mihael:

Some form of lexically scoped constants is probably a good idea.

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From bugzilla-daemon at mcs.anl.gov Wed Apr 15 15:13:11 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 15 Apr 2009 15:13:11 -0500 (CDT)
Subject: [Swift-devel] [Bug 201] New: Pass log lists of in/out files to _swiftwrapper in a file
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=201

Summary: Pass log lists of in/out files to _swiftwrapper in a file
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov

We should handle long in/out file lists via a file to wrapper when they get too long.

-- email comment from Ben:

File an enhancement request in bugzilla for this. I think it's straightforward.

-- said earlier in email (approximately) by Mike:

We should do this in a way that the normal short cases stay fast, and only incur the overhead of a file when needed, ideally. That's assuming that there are no other ways to do this. Don't want to create an extra file for each app invocation when not necessary.
-- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Wed Apr 15 15:13:30 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 15 Apr 2009 15:13:30 -0500 (CDT) Subject: [Swift-devel] [Bug 200] Add global variables to swift In-Reply-To: References: Message-ID: <20090415201330.4148A2CD86@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=200 Michael Wilde changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|normal |enhancement -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From foster at anl.gov Wed Apr 15 15:17:44 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 15 Apr 2009 15:17:44 -0500 Subject: [Swift-devel] [Bug 200] Add global variables to swift In-Reply-To: <20090415201330.4148A2CD86@wind.mcs.anl.gov> References: <20090415201330.4148A2CD86@wind.mcs.anl.gov> Message-ID: <6C2E5AC7-3FA1-490E-B122-04932BABA6DA@anl.gov> I think this is a good idea. Of course one could grab all the constants and stick them in a structure of type "global stuff we want to pass around", and pass around a pointer to that, but that is a bit cumbersome. This way we basically do that implicitly. Ia. 
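Ian's "structure of global stuff we want to pass around" workaround translates directly into ordinary code. The sketch below is plain Java and purely illustrative (none of these names exist in Swift): the argument values are fetched once in main, frozen in an immutable object, and a single reference is passed down the call chain - which is what a single-assignment global constant would do implicitly.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class GlobalArgs {
    // Immutable "global stuff we want to pass around" structure.
    // Single-assignment, like the proposed Swift global constants.
    static final class Args {
        final Map<String, String> values;
        Args(Map<String, String> values) {
            this.values = Collections.unmodifiableMap(new HashMap<>(values));
        }
        String get(String name) { return values.get(name); }
    }

    // Deeply nested procedures receive one reference instead of 4-6 parameters.
    static String deepProc(Args args) {
        return "protein=" + args.get("protein");
    }

    static String midProc(Args args) {
        // No re-fetching of @args at this level; just pass the reference on.
        return deepProc(args);
    }

    public static void main(String[] argv) {
        Map<String, String> parsed = new HashMap<>();
        parsed.put("protein", "T0380");   // stand-in for an @arg value
        Args args = new Args(parsed);     // fetched once, in "main"
        System.out.println(midProc(args));
    }
}
```

Because the structure is immutable, it keeps the safety property that makes the proposal less icky than real globals: no callee can mutate what another callee reads.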
On Apr 15, 2009, at 3:13 PM, bugzilla-daemon at mcs.anl.gov wrote: > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=200 > > > Michael Wilde changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Severity|normal |enhancement > > > > > -- > Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are watching the assignee of the bug. > You are watching the reporter. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From bugzilla-daemon at mcs.anl.gov Wed Apr 15 15:26:36 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 15 Apr 2009 15:26:36 -0500 (CDT) Subject: [Swift-devel] [Bug 201] Pass long lists of in/out files to _swiftwrapper in a file In-Reply-To: References: Message-ID: <20090415202636.5D5DB2CD86@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=201 Michael Wilde changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Pass log lists of in/out |Pass long lists of in/out |files to _swiftwrapper in a |files to _swiftwrapper in a |file |file -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. 
From wilde at mcs.anl.gov Wed Apr 15 17:46:17 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 15 Apr 2009 17:46:17 -0500
Subject: [Swift-devel] feature request
In-Reply-To: 
References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov>
Message-ID: <49E663B9.4000007@mcs.anl.gov>

Zhao, based on Ben's suggestion and our earlier discussion, can you locate and try the Swift test suite, and then look in detail at, and try, the per-site suite.

I'm looking for an assessment of what it takes to create a branch of the suite to test coasters in the same manner that the sites are tested. And what it takes to run that on a regular basis to find problems in sites and Swift *before* our users find them.

In parallel, as we discussed, please check out the latest OOPS, and follow the README under oops/swift to do an initial sanity test. Then test the 3-site Ranger-Abe-Queenbee config, then expand to Mercury-SDSC-ANL. All under coasters.

We'll need to map out more clearly where this is heading, but I want you to get familiar with both OOPS and coasters, and report all problems clearly back to the swift-devel list. As Ben says, in a way that someone else can reproduce the problem.

Thanks.

- Mike

On 4/15/09 2:42 PM, Ben Clifford wrote:
> On Wed, 15 Apr 2009, Michael Wilde wrote:
> 
>> I'd like to ask Zhao to do some of this testing, with OOPS on Ranger and other
>> TG and OSG sites, as prep for getting the tools in broader use in the OOPS
>> group.
> 
> There is a per-site test suite already in place, though not run
> particularly often.
> 
> I think any per-site testing should be based around that (for example, new
> site definitions contributed to the tests/sites/ svn directory and tests
> driven by the scripts there and any new tests also contributed there).
> 
> I don't see coaster per-site testing to be hugely different than any other
> per-site testing (other than less of it has been done, I think)
> 

From wilde at mcs.anl.gov Wed Apr 15 20:02:49 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 15 Apr 2009 20:02:49 -0500
Subject: [Swift-devel] How to do coasters on bgp?
Message-ID: <49E683B9.8020707@mcs.anl.gov>

What work is needed if we want to run coasters on the bgp?

I can think of the following, but there might be more, or different, issues.

The scheduler is cobalt - it seems pretty PBS-like to the user; I don't know if it's PBS-like enough for swift and coasters.

We need to see if we can first run a simple job with the pbs provider.

Then we can test if we start a single coaster job. That should ensure that the perl we need is all there.

Once all that works, we need to work out how to ask coasters to allocate all the cores it wants in one job. I'm hoping this would work the same on bgp as the feature you're developing for that purpose on other hosts. (Does it have a property name or profile variable name?)

Does this seem close, anything missing, any other problems expected on BGP?

p.s.
Then there is sicortex, which is not a priority at all, but it's interesting to think about:
- no Java on head node
- slurm scheduler
- we have tested falkon on it by ssh tunneling from communicado

From wilde at mcs.anl.gov Wed Apr 15 20:22:44 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 15 Apr 2009 20:22:44 -0500
Subject: [Swift-devel] How to do coasters on bgp?
In-Reply-To: <49E68517.7040705@cs.uchicago.edu>
References: <49E683B9.8020707@mcs.anl.gov> <49E68517.7040705@cs.uchicago.edu>
Message-ID: <49E68864.4070600@mcs.anl.gov>

On 4/15/09 8:08 PM, Ioan Raicu wrote:
...Before moving to
Once that works in general, on your favorite LRM, then > I'd move to the BG/P, and tackle BG/P specific issues. Right. Mihael is currently developing that (see below). > Ioan > > Michael Wilde wrote: >> What work is needed if we want to run coasters on the bgp? >> >> I can think of the following, but there might be more, or differen, >> issues. >> >> The scheduler is cobalt - it seems pretty PBS-like to the user; I dont >> know if its pbs-like enough for swift and coasters. >> >> We need to see if we can first run a simple job with the pbs provider. >> >> Then we can test if we start a single coaster job. That should ensure >> that the perl we need is all there. >> >> Once all that works, we need to work out how to ask coasters to >> allocate all the cores it wants in one job. Im hoping this would work >> the same on bgp as the feature you're developing for that purpose on >> other hosts. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> (does it have property name or profile variable name?) >> >> Does this seem close, anything missing, any other problems expected on >> BGP? >> >> >> p.s. >> Then there is sicortex, which is not a priority at all, but its >> interesting to think about: >> - no Java on head node >> - slurm scheduler >> - we have tested falkon on it by ssh tunneling from communicado >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > From iraicu at cs.uchicago.edu Wed Apr 15 20:08:39 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 15 Apr 2009 20:08:39 -0500 Subject: [Swift-devel] How to do coasters on bgp? In-Reply-To: <49E683B9.8020707@mcs.anl.gov> References: <49E683B9.8020707@mcs.anl.gov> Message-ID: <49E68517.7040705@cs.uchicago.edu> I believe the BG/P is one of those machines, where static resource provisioning is necessary. From the discussion we had on static vs. 
dynamic provisioning a week or so back, I think that might be a big hurdle (besides the other typical hurdles, such as network connectivity, language support, and nuances specific to the BG/P). Before moving to the BG/P, I'd argue for implementing static provisioning, where one could specify N nodes/cores for H hours, all contained in a single LRM job submission. Once that works in general, on your favorite LRM, then I'd move to the BG/P, and tackle BG/P specific issues. Ioan Michael Wilde wrote: > What work is needed if we want to run coasters on the bgp? > > I can think of the following, but there might be more, or differen, > issues. > > The scheduler is cobalt - it seems pretty PBS-like to the user; I dont > know if its pbs-like enough for swift and coasters. > > We need to see if we can first run a simple job with the pbs provider. > > Then we can test if we start a single coaster job. That should ensure > that the perl we need is all there. > > Once all that works, we need to work out how to ask coasters to > allocate all the cores it wants in one job. Im hoping this would work > the same on bgp as the feature you're developing for that purpose on > other hosts. > (does it have property name or profile variable name?) > > Does this seem close, anything missing, any other problems expected on > BGP? > > > p.s. > Then there is sicortex, which is not a priority at all, but its > interesting to think about: > - no Java on head node > - slurm scheduler > - we have tested falkon on it by ssh tunneling from communicado > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Wed Apr 15 20:28:49 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 15 Apr 2009 18:28:49 -0700 Subject: [Swift-devel] How to do coasters on bgp? In-Reply-To: <49E68864.4070600@mcs.anl.gov> References: <49E683B9.8020707@mcs.anl.gov> <49E68517.7040705@cs.uchicago.edu> <49E68864.4070600@mcs.anl.gov> Message-ID: <49E689D1.70102@cs.uchicago.edu> Aha, right, I didn't catch that ;) Michael Wilde wrote: > On 4/15/09 8:08 PM, Ioan Raicu wrote: > ...Before moving to >> the BG/P, I'd argue for implementing static provisioning, where one >> could specify N nodes/cores for H hours, all contained in a single >> LRM job submission. Once that works in general, on your favorite LRM, >> then I'd move to the BG/P, and tackle BG/P specific issues. > > Right. Mihael is currently developing that (see below). > >> Ioan >> >> Michael Wilde wrote: >>> What work is needed if we want to run coasters on the bgp? >>> >>> I can think of the following, but there might be more, or differen, >>> issues. >>> >>> The scheduler is cobalt - it seems pretty PBS-like to the user; I >>> dont know if its pbs-like enough for swift and coasters. >>> >>> We need to see if we can first run a simple job with the pbs provider. >>> >>> Then we can test if we start a single coaster job. That should >>> ensure that the perl we need is all there. >>> >>> Once all that works, we need to work out how to ask coasters to >>> allocate all the cores it wants in one job. Im hoping this would >>> work the same on bgp as the feature you're developing for that >>> purpose on other hosts. 
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >>> (does it have property name or profile variable name?) >>> >>> Does this seem close, anything missing, any other problems expected >>> on BGP? >>> >>> >>> p.s. >>> Then there is sicortex, which is not a priority at all, but its >>> interesting to think about: >>> - no Java on head node >>> - slurm scheduler >>> - we have tested falkon on it by ssh tunneling from communicado >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Wed Apr 15 23:11:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 23:11:02 -0500 Subject: [Swift-devel] Re: How to do coasters on bgp? In-Reply-To: <49E683B9.8020707@mcs.anl.gov> References: <49E683B9.8020707@mcs.anl.gov> Message-ID: <1239855062.3693.1.camel@localhost> On Wed, 2009-04-15 at 20:02 -0500, Michael Wilde wrote: > What work is needed if we want to run coasters on the bgp? > > I can think of the following, but there might be more, or differen, issues. > > The scheduler is cobalt - it seems pretty PBS-like to the user; I dont > know if its pbs-like enough for swift and coasters. It's very much unlike PBS as far as I know. However, there is a cobalt provider. 
From hategan at mcs.anl.gov Wed Apr 15 23:18:13 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 23:18:13 -0500 Subject: [Swift-devel] How to do coasters on bgp? In-Reply-To: <49E68517.7040705@cs.uchicago.edu> References: <49E683B9.8020707@mcs.anl.gov> <49E68517.7040705@cs.uchicago.edu> Message-ID: <1239855493.3693.9.camel@localhost> On Wed, 2009-04-15 at 20:08 -0500, Ioan Raicu wrote: > I believe the BG/P is one of those machines, where static resource > provisioning is necessary. From the discussion we had on static vs. > dynamic provisioning a week or so back, I think that might be a big > hurdle (besides the other typical hurdles, such as network connectivity, > language support, and nuances specific to the BG/P). Before moving to > the BG/P, I'd argue for implementing static provisioning, I see there's still some confusion about "static provisioning". Perhaps we should abandon the term. Instead we could use: 1. pre-defined window, in which we say when the system should try to allocate nodes, in what amount, and for how long. 2. moving window, in which the system decides when to allocate nodes and for how long, subject to certain constraints (such as 64 nodes per job, 8 cores per node). 3. mixed mode, in which pre-defined windows could be used if available, and supplemented by moving windows if necessary. From hategan at mcs.anl.gov Wed Apr 15 23:22:36 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 23:22:36 -0500 Subject: [Swift-devel] Re: How to do coasters on bgp? In-Reply-To: <49E683B9.8020707@mcs.anl.gov> References: <49E683B9.8020707@mcs.anl.gov> Message-ID: <1239855756.3693.13.camel@localhost> On Wed, 2009-04-15 at 20:02 -0500, Michael Wilde wrote: > Once all that works, we need to work out how to ask coasters to allocate > all the cores it wants in one job. Im hoping this would work the same on > bgp as the feature you're developing for that purpose on other hosts. 
> (does it have property name or profile variable name?) So far it has a simulator that allows me to get immediate feedback on how it works. > > Does this seem close, anything missing, any other problems expected on BGP? Yes. I expect a number of obscure problems on the BGP. From iraicu at cs.uchicago.edu Wed Apr 15 23:21:33 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 15 Apr 2009 21:21:33 -0700 Subject: [Swift-devel] How to do coasters on bgp? In-Reply-To: <1239855493.3693.9.camel@localhost> References: <49E683B9.8020707@mcs.anl.gov> <49E68517.7040705@cs.uchicago.edu> <1239855493.3693.9.camel@localhost> Message-ID: <49E6B24D.5060909@cs.uchicago.edu> Yes, the combination of the things you mentioned should allow you to run on the BG/P with policies that restrict you to only a few jobs in the queue, where each job will generally contain a large portion of the allocation in terms of number of processors. Ioan Mihael Hategan wrote: > On Wed, 2009-04-15 at 20:08 -0500, Ioan Raicu wrote: > >> I believe the BG/P is one of those machines, where static resource >> provisioning is necessary. From the discussion we had on static vs. >> dynamic provisioning a week or so back, I think that might be a big >> hurdle (besides the other typical hurdles, such as network connectivity, >> language support, and nuances specific to the BG/P). Before moving to >> the BG/P, I'd argue for implementing static provisioning, >> > > I see there's still some confusion about "static provisioning". Perhaps > we should abandon the term. Instead we could use: > > 1. pre-defined window, in which we say when the system should try to > allocate nodes, in what amount, and for how long. > 2. moving window, in which the system decides when to allocate nodes and > for how long, subject to certain constraints (such as 64 nodes per job, > 8 cores per node). > 3. mixed mode, in which pre-defined windows could be used if available, > and supplemented by moving windows if necessary. 
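Mihael's three policies quoted above can be sketched as a toy block-planning function. This is illustrative Java only, not the actual coaster scheduler; the names are made up and the 64-nodes-per-job limit is just the example constraint from the email.

```java
import java.util.ArrayList;
import java.util.List;

public class AllocationModes {
    // Toy model of the three coaster allocation policies under discussion.
    enum Mode { PREDEFINED, MOVING, MIXED }

    static final int MAX_NODES_PER_JOB = 64;   // example constraint from the email

    // Plan the node counts of the block jobs to submit to cover `demand` nodes.
    static List<Integer> planBlocks(Mode mode, int predefined, int demand) {
        List<Integer> blocks = new ArrayList<>();
        int remaining = demand;
        // (1) Pre-defined window: the user-specified allocation is used first.
        if (mode == Mode.PREDEFINED || mode == Mode.MIXED) {
            int n = Math.min(predefined, remaining);
            if (n > 0) { blocks.add(n); remaining -= n; }
        }
        // (2) Moving window: the system sizes further jobs itself,
        // subject to per-job constraints. (3) Mixed mode is the combination.
        if (mode == Mode.MOVING || mode == Mode.MIXED) {
            while (remaining > 0) {
                int n = Math.min(MAX_NODES_PER_JOB, remaining);
                blocks.add(n);
                remaining -= n;
            }
        }
        return blocks;
    }

    public static void main(String[] a) {
        // Mixed mode with a 32-node pre-defined window and 100 nodes of demand:
        // the pre-defined block is used, then moving-window blocks top it up.
        System.out.println(planBlocks(Mode.MIXED, 32, 100)); // prints [32, 64, 4]
    }
}
```

In this toy model, pure PREDEFINED never exceeds the user's window, pure MOVING ignores it, and MIXED supplements the window with system-sized blocks - mirroring the three options in the email.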
> > > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Wed Apr 15 23:40:23 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 15 Apr 2009 23:40:23 -0500 Subject: [Swift-devel] How to do coasters on bgp? In-Reply-To: <1239855493.3693.9.camel@localhost> References: <49E683B9.8020707@mcs.anl.gov> <49E68517.7040705@cs.uchicago.edu> <1239855493.3693.9.camel@localhost> Message-ID: <49E6B6B7.2020003@mcs.anl.gov> Modulo subtleties which I dont understand or cant foresee, this sounds very good to me. I realized I may have caused confusion in my previous post on this: when I said that "this is what you are doing", I meant approach (1) below. Which sounded to me as good or better that whats been called static provisioning. Im interested in what others think, but I like the scheme below. On 4/15/09 11:18 PM, Mihael Hategan wrote: > On Wed, 2009-04-15 at 20:08 -0500, Ioan Raicu wrote: >> I believe the BG/P is one of those machines, where static resource >> provisioning is necessary. From the discussion we had on static vs. >> dynamic provisioning a week or so back, I think that might be a big >> hurdle (besides the other typical hurdles, such as network connectivity, >> language support, and nuances specific to the BG/P). 
Before moving to
>> the BG/P, I'd argue for implementing static provisioning,
> 
> I see there's still some confusion about "static provisioning". Perhaps
> we should abandon the term. Instead we could use:
> 
> 1. pre-defined window, in which we say when the system should try to
> allocate nodes, in what amount, and for how long.
> 2. moving window, in which the system decides when to allocate nodes and
> for how long, subject to certain constraints (such as 64 nodes per job,
> 8 cores per node).
> 3. mixed mode, in which pre-defined windows could be used if available,
> and supplemented by moving windows if necessary.
> 
> 

From benc at hawaga.org.uk Thu Apr 16 00:22:45 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 16 Apr 2009 05:22:45 +0000 (GMT)
Subject: [Swift-devel] feature request
In-Reply-To: <49E663B9.4000007@mcs.anl.gov>
References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov>
Message-ID: 

On Wed, 15 Apr 2009, Michael Wilde wrote:

> Zhao, based on Ben's suggestion and our earlier discussion, can you locate and
> try the Swift test suite, and then look in detail at, and try, the per-site
> suite.
> 
> I'm looking for an assessment of what it takes to create a branch of the suite
> to test coasters in the same manner that the sites are tested.

Put swift on your path and get a proxy. Then:

cd tests/sites
./run-all coaster/

This will start running the tests in the coaster/ subdirectory. Each of the files in there is a site definition. One site test is run for each of those. When/if all of them have exited, the list of sites that worked and the list of sites that did not work is output.

If you want to run a single site, say:

./run-site coaster/coaster-local.xml

(for example)

To add new sites, put a file in the coaster/ subdirectory containing an appropriate site definition.
In order to make site tests that many people can run, I usually make a remote work directory and chmod a+rwxt on that remote work directory so no matter who runs, Swift will not encounter permission problems.

> And what it takes to run that on a regular basis to find problems in sites and
> Swift *before* our users find them.

The main problem with the site tests is that they generally need credentials, and it's unclear what the right way to handle long-term testing credentials is.

--

From hategan at mcs.anl.gov Fri Apr 17 16:33:45 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 17 Apr 2009 16:33:45 -0500
Subject: [Swift-devel] coaster block allocations
Message-ID: <1240004025.5783.6.camel@localhost>

I mentioned a simulator:
http://www.mcs.anl.gov/~hategan/CS.java

It will give you a visual idea of how things (will) work (run it and press the up key to advance time). You can play with parameters, work loads and pre-allocations.

If you find some behavior that you don't like, send me the parameters (which are printed when you close the window) and work load (which you'll have to manually paste from the code), so that I can reproduce it.

From wilde at mcs.anl.gov Fri Apr 17 18:15:18 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 17 Apr 2009 18:15:18 -0500
Subject: [Swift-devel] coaster block allocations
In-Reply-To: <1240004025.5783.6.camel@localhost>
References: <1240004025.5783.6.camel@localhost>
Message-ID: <49E90D86.2080902@mcs.anl.gov>

Am I doing this wrong?
com$ javac CS.java CS.java:616: method does not override a method from its superclass @Override ^ CS.java:634: method does not override a method from its superclass @Override ^ CS.java:714: method does not override a method from its superclass @Override ^ CS.java:719: method does not override a method from its superclass @Override ^ CS.java:733: method does not override a method from its superclass @Override ^ 5 errors com$ which javac /soft/java-1.5.0_06-sun-r1/bin/javac com$ On 4/17/09 4:33 PM, Mihael Hategan wrote: > I mentioned a simulator: > http://www.mcs.anl.gov/~hategan/CS.java > > It will given you a visual idea of how things (will) work (run it and > press the up key to advance time). You can play with parameters, work > loads and pre-allocations. > > If you find some behavior that you don't like, send me the parameters > (which are printed when you close the window) and work load (which > you'll have to manually paste from the code), so that I can reproduce > it. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Fri Apr 17 19:09:33 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 17 Apr 2009 19:09:33 -0500 Subject: [Swift-devel] coaster block allocations In-Reply-To: <49E90D86.2080902@mcs.anl.gov> References: <1240004025.5783.6.camel@localhost> <49E90D86.2080902@mcs.anl.gov> Message-ID: <1240013373.8374.1.camel@localhost> Looks like 1.5 doesn't like @Override when implementing an interface method. Either try 1.6 or fetch it again (I removed the respective annotations). On Fri, 2009-04-17 at 18:15 -0500, Michael Wilde wrote: > Am I doing this wrong? 
> 
> com$ javac CS.java
> CS.java:616: method does not override a method from its superclass
> @Override
> ^
> CS.java:634: method does not override a method from its superclass
> @Override
> ^
> CS.java:714: method does not override a method from its superclass
> @Override
> ^
> CS.java:719: method does not override a method from its superclass
> @Override
> ^
> CS.java:733: method does not override a method from its superclass
> @Override
> ^
> 5 errors
> com$ which javac
> /soft/java-1.5.0_06-sun-r1/bin/javac
> com$
> 
> 
> On 4/17/09 4:33 PM, Mihael Hategan wrote:
> > I mentioned a simulator:
> > http://www.mcs.anl.gov/~hategan/CS.java
> > 
> > It will give you a visual idea of how things (will) work (run it and
> > press the up key to advance time). You can play with parameters, work
> > loads and pre-allocations.
> > 
> > If you find some behavior that you don't like, send me the parameters
> > (which are printed when you close the window) and work load (which
> > you'll have to manually paste from the code), so that I can reproduce
> > it.
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Sun Apr 19 09:24:50 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 19 Apr 2009 09:24:50 -0500
Subject: [Swift-devel] Avoid copy of local files to work directory?
Message-ID: <49EB3432.4090901@mcs.anl.gov>

Is there a Swift setting to use links rather than copies to access input files when you specify local data transfer with gridftp url=local://localhost? Or better yet, to access the files directly?

I thought we had such a setting, but can't find it. Maybe I got that confused with the links-vs-copies for workdir-to-jobdir. Or, Ben, was that one of the 4 patches you did for some earlier Falkon testing about a year ago?

If not, would such behavior be useful and feasible?
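The link-versus-copy question above can be illustrated with a small sketch. This is plain Java NIO, not Swift's actual staging code, and all names are made up: staging by symlink is a constant-size metadata operation, while staging by copy moves the whole file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StageIn {
    // Sketch of the two input-staging strategies under discussion:
    // copying an input file into the job directory vs. linking to it.
    static Path stage(Path input, Path jobDir, boolean useLink) throws IOException {
        Path target = jobDir.resolve(input.getFileName());
        if (useLink) {
            // Symbolic link: no data movement, cost independent of file size.
            Files.createSymbolicLink(target, input.toAbsolutePath());
        } else {
            // Copy: cost proportional to file size.
            Files.copy(input, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("stage-demo");
        Path input = Files.writeString(tmp.resolve("in.txt"), "data");
        Path jobDir = Files.createDirectory(tmp.resolve("jobdir"));
        Path linked = stage(input, jobDir, true);
        System.out.println(Files.readString(linked)); // prints "data"
    }
}
```

For many small files - the NewsLab case - the interesting question is exactly the one Ben raises below: whether creating a link per file is actually cheaper than copying, which only a benchmark of both paths can settle.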
It's also being discussed in the CIO project, as some BG/P workflows have similar issues.

I ask in this case because the NewsLab project has a script that is processing a huge number of small files, with what seems to be high IO and low CPU overhead. This is running on TeraPort with data on the CI SAN GPFS. Gabri is working on this for Svetlozar; I've suggested she send measurements and log plots to the swift-user list for discussion.

From benc at hawaga.org.uk Sun Apr 19 10:10:51 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 19 Apr 2009 15:10:51 +0000 (GMT)
Subject: [Swift-devel] Avoid copy of local files to work directory?
In-Reply-To: <49EB3432.4090901@mcs.anl.gov>
References: <49EB3432.4090901@mcs.anl.gov>
Message-ID: 

I once implemented a patch to give that behaviour, and sent it to this list. I think it probably either still works or almost still works, if you can find it in the swift-devel archives.

It was never benchmarked. Mostly I would be concerned about the overhead of creating the links being roughly the same as actually copying the data, so I think it would be important to get decent run logs from both approaches that can be compared.

Whilst you are busy googling for that patch, it might be interesting to see a log file (including -info wrapper logs) for a large run with Swift as it is now, to see what the actual breakdown of times is in an individual job.

--

From benc at hawaga.org.uk Sun Apr 19 10:56:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 19 Apr 2009 15:56:06 +0000 (GMT)
Subject: [Swift-devel] swift 0.9 rc2 is available
Message-ID: 

rc2 for swift is available from http://www.ci.uchicago.edu/~benc/swift-0.9rc2.tar.gz

Please test and report back. If there are no major bugs, then this will become 0.9 final release in 7 days from now.

As before, I will ignore the voting stuff that dev.globus mandates for making releases.
-- From wilde at mcs.anl.gov Sun Apr 19 11:39:53 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 19 Apr 2009 11:39:53 -0500 Subject: [Swift-devel] Avoid copy of local files to work directory? In-Reply-To: References: <49EB3432.4090901@mcs.anl.gov> Message-ID: <49EB53D9.8050600@mcs.anl.gov> On 4/19/09 10:10 AM, Ben Clifford wrote: > I once implemented a patch to give that behaviour, and sent to this list. Its here: http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-April/002931.html > I think it probably either still works or almost still works, if you can > find it in the swift-devel archives. > > It was never benchmarked. Mostly I would be concerned about the overhead > of creating the links being roughly the same as actually copying the data, > so I think it would be important to get decent run logs from both > approaches that can be compared. > Whilst you are busy googling for that patch, it might be interesting to > see a log file (including -info wrapper logs) for a large run with Swift > as it is now, to see what the actual breakdown of times is in an > individual job. > From benc at hawaga.org.uk Tue Apr 21 03:06:25 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 08:06:25 +0000 (GMT) Subject: [Swift-devel] switchable copious provenance logging Message-ID: As I work on provenance, the amount of log output for that becomes larger. Two things are probably interesting wrt that: i) make logging of provenance stuff switchable to on/off ii) move the provenance related log information to a separate file Comments. 
-- From wilde at mcs.anl.gov Tue Apr 21 07:28:27 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 21 Apr 2009 07:28:27 -0500 Subject: [Swift-devel] switchable copious provenance logging In-Reply-To: References: Message-ID: <49EDBBEB.6010404@mcs.anl.gov> My perhaps uninformed and random observation on logging: the debug log seems very noisy, and it seems what's needed for provenance (i.e., the major events, not all the minor ones) could be made much more compact. I think ultimately we want 2 control bits on logging: - debug level - provenance (on/off) but maybe a level as well. It's possible these should go to separate channels, or be separable by a clear descriptor on the log line (for space management). It seems that while Swift is still a young system, having debug on all the time is useful, so that error info gets captured without having to do re-runs. It seems that for most work we can afford the time overhead of these logs; the space overhead may become burdensome. Is the provenance logging making the size of the log unmanageable? On 4/21/09 3:06 AM, Ben Clifford wrote: > As I work on provenance, the amount of log output for that becomes larger. > > Two things are probably interesting wrt that: > > i) make logging of provenance stuff switchable to on/off > ii) move the provenance related log information to a separate file > > Comments. > From benc at hawaga.org.uk Tue Apr 21 07:33:05 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 12:33:05 +0000 (GMT) Subject: [Swift-devel] switchable copious provenance logging In-Reply-To: <49EDBBEB.6010404@mcs.anl.gov> References: <49EDBBEB.6010404@mcs.anl.gov> Message-ID: On Tue, 21 Apr 2009, Michael Wilde wrote: > the debug log seems very noisy, and it seems what's needed for provenance (i.e., > the major events, not all the minor ones) could be made much more compact. yes.
> It seems that while Swift is stll a young system, having debug on all the time > is useful, so that error info gets captured without having to do re-runs. It > seems that for most work we can afford the time overhead of these logs; the > space overhead may become burdensome. I think for anything debuggable it is useful to capture large runs. > Is the provenance logging making the size of the log unmanageable? They're already annoyingly large when I want to move them onto my laptop. Additional information is likely to make that more so. -- From foster at anl.gov Tue Apr 21 07:44:19 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 21 Apr 2009 07:44:19 -0500 Subject: [Swift-devel] switchable copious provenance logging In-Reply-To: <49EDBBEB.6010404@mcs.anl.gov> References: <49EDBBEB.6010404@mcs.anl.gov> Message-ID: <5DADBCDF-CC59-40A8-9E7A-4CE75A610650@anl.gov> Given the fact that Swift programs describe the structure of computations, we should be able to compress logs considerably, by reference to the program. E.g., the program: f() { g(); h(); } with no arguments will always (ignoring errors) execute g and h if f is called. So we could just record that f() has been called. I can imagine taking ideas of that sort quite a long way. Now if f() has arguments, things get more complex. But one could record subsets of argument information maybe? Ian. On Apr 21, 2009, at 7:28 AM, Michael Wilde wrote: > My perhaps uninformed and random observation on logging: > > the debug log seems very noisy, and it seems whats needed for > provenance (ie, the major events, not all the minor ones) could be > made much more compact. > > I think ultimately we want 2 control bits on logging: > > - debug level > - provenance (on/off) but maybe a level as well. 
> > Its possible these should go to separate channels, or be separable > by a clear descriptor on the log line (for space management) > > It seems that while Swift is stll a young system, having debug on > all the time is useful, so that error info gets captured without > having to do re-runs. It seems that for most work we can afford the > time overhead of these logs; the space overhead may become burdensome. > > Is the provenance logging making the size of the log unmanageable? > > On 4/21/09 3:06 AM, Ben Clifford wrote: >> As I work on provenance, the amount of log output for that becomes >> larger. >> Two things are probably interesting wrt that: >> i) make logging of provenance stuff switchable to on/off >> ii) move the provenance related log information to a separate file >> Comments. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at anl.gov Tue Apr 21 07:46:35 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 21 Apr 2009 07:46:35 -0500 Subject: [Swift-devel] switchable copious provenance logging In-Reply-To: <5DADBCDF-CC59-40A8-9E7A-4CE75A610650@anl.gov> References: <49EDBBEB.6010404@mcs.anl.gov> <5DADBCDF-CC59-40A8-9E7A-4CE75A610650@anl.gov> Message-ID: PS: This could be an interesting research project, perhaps? On Apr 21, 2009, at 7:44 AM, Ian Foster wrote: > Given the fact that Swift programs describe the structure of > computations, we should be able to compress logs considerably, by > reference to the program. E.g., the program: > > f() { > g(); > h(); > } > > with no arguments will always (ignoring errors) execute g and h if f > is called. So we could just record that f() has been called. > > I can imagine taking ideas of that sort quite a long way. > > Now if f() has arguments, things get more complex. 
But one could > record subsets of argument information maybe? > > Ian. > > > On Apr 21, 2009, at 7:28 AM, Michael Wilde wrote: > >> My perhaps uninformed and random observation on logging: >> >> the debug log seems very noisy, and it seems whats needed for >> provenance (ie, the major events, not all the minor ones) could be >> made much more compact. >> >> I think ultimately we want 2 control bits on logging: >> >> - debug level >> - provenance (on/off) but maybe a level as well. >> >> Its possible these should go to separate channels, or be separable >> by a clear descriptor on the log line (for space management) >> >> It seems that while Swift is stll a young system, having debug on >> all the time is useful, so that error info gets captured without >> having to do re-runs. It seems that for most work we can afford the >> time overhead of these logs; the space overhead may become >> burdensome. >> >> Is the provenance logging making the size of the log unmanageable? >> >> On 4/21/09 3:06 AM, Ben Clifford wrote: >>> As I work on provenance, the amount of log output for that becomes >>> larger. >>> Two things are probably interesting wrt that: >>> i) make logging of provenance stuff switchable to on/off >>> ii) move the provenance related log information to a separate file >>> Comments. >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Tue Apr 21 08:17:20 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 13:17:20 +0000 (GMT) Subject: [Swift-devel] switchable copious provenance logging In-Reply-To: References: <49EDBBEB.6010404@mcs.anl.gov> <5DADBCDF-CC59-40A8-9E7A-4CE75A610650@anl.gov> Message-ID: On Tue, 21 Apr 2009, Ian Foster wrote: > > Given the fact that Swift programs describe the structure of computations, > > we should be able to compress logs considerably, by reference to the > > program. E.g., the program: Taking that to its extreme, there is only a single log entry: > go which is the top level execution. I think what is interesting in the logs, and what they almost entirely consist of, is the stuff that can only be seen in retrospect - things like the errors that are ignored in the below, or times of executions, or what ran where. > > > > f() { > > g(); > > h(); > > } > > > > with no arguments will always (ignoring errors) execute g and h if f is > > called. So we could just record that f() has been called. -- From foster at anl.gov Tue Apr 21 08:23:12 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 21 Apr 2009 08:23:12 -0500 Subject: [Swift-devel] switchable copious provenance logging In-Reply-To: References: <49EDBBEB.6010404@mcs.anl.gov> <5DADBCDF-CC59-40A8-9E7A-4CE75A610650@anl.gov> Message-ID: <89079822-B8EF-4408-BB7E-9ADB92FE7566@anl.gov> Ben: That will be true if we also capture all data that was accessed, and if we use any random number generators, the numbers generated. In other cases, there will be tradeoffs between the amount of input data recorded and the amount of provenance information recorded, I think? (Not sure if I am right there.) It does seem to me that we should want to distinguish between logging (for errors) and provenance (for recording the computation performed to generate a particular result). Ian. 
On Apr 21, 2009, at 8:17 AM, Ben Clifford wrote: > > Taking that to its extreme, there is only a single log entry: > >> go > > which is the top level execution. From benc at hawaga.org.uk Tue Apr 21 08:54:01 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 13:54:01 +0000 (GMT) Subject: [Swift-devel] switchable copious provenance logging In-Reply-To: <89079822-B8EF-4408-BB7E-9ADB92FE7566@anl.gov> References: <49EDBBEB.6010404@mcs.anl.gov> <5DADBCDF-CC59-40A8-9E7A-4CE75A610650@anl.gov> <89079822-B8EF-4408-BB7E-9ADB92FE7566@anl.gov> Message-ID: On Tue, 21 Apr 2009, Ian Foster wrote: > In other cases, there will be tradeoffs between the amount of input data > recorded and the amount of provenance information recorded, I think? (Not sure > if I am right there.) yes, plenty of tradeoffs. The stuff I did initially didn't capture enough to satisfy people. The stuff I'm working on now captures lots more. > It does seem to me that we should want to distinguish between logging > (for errors) and provenance (for recording the computation performed to > generate a particular result). yes. -- From andric at uchicago.edu Tue Apr 21 09:56:47 2009 From: andric at uchicago.edu (Michael Andric) Date: Tue, 21 Apr 2009 09:56:47 -0500 Subject: [Swift-devel] swift not working Message-ID: Normally, I would hit up Sarah for a fix on this, but since she's on vacation I'm hoping someone else out there could help with this. I'm unable to get swift jobs submitted. I've tried submitting to both the ucanl64 and bsd clusters. 
The run dir (with log files) is here: /disks/ci-gpfs/fmri/cnari/swift/projects/andric/SNR/RFL2 Here's what I get from ucanl: [...]Progress: Submitting:1 Submitted:1 Failed but can retry:1 Failed to transfer wrapper log from AFNIsnr-20090421-0930-q403bn99/info/e on ANLUCTERAGRID64 Progress: Submitting:1 Failed but can retry:2 Progress: Submitting:1 Failed but can retry:2 Progress: Submitting:1 Failed but can retry:2 Progress: Stage in:1 Submitting:1 Failed but can retry:1 Progress: Submitting:2 Failed but can retry:1 Progress: Submitting:1 Submitted:1 Failed but can retry:1 Failed to transfer wrapper log from AFNIsnr-20090421-0930-q403bn99/info/g on ANLUCTERAGRID64 Progress: Submitting:1 Failed:1 Failed but can retry:1 Execution failed: Exception in AFNI_3dvolreg: Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run6_trim, -base, ts.6_trim+orig[92], -prefix, volreg.RFL2.run6_trim, ts.6_trim+orig.BRIK] Host: ANLUCTERAGRID64 Directory: AFNIsnr-20090421-0930-q403bn99/jobs/g/AFNI_3dvolreg-gxefdp9j stderr.txt: stdout.txt: ---- Caused by: Cannot submit job: ; nested exception is: java.net.SocketTimeoutException: Read timed out gwynn 5% and this is what I get on bsd: RunID: 20090421-0943-o1bb0081 Progress: Progress: Selecting site:1 Initializing site shared directory:1 Stage in:1 Progress: Stage in:2 Submitting:1 2009.04.21 09:44:04.848 CDT: [ERROR] Parsing profiles on line 1800 Illegal character ':'at position 60 :Illegal character ':' Progress: Submitted:1 Failed but can retry:2 Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/f on BSD Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/g on BSD Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/e on BSD Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/i on BSD Progress: Failed but can retry:3 Progress: Stage in:1 Failed but can retry:2 Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/m on BSD Progress: Failed but 
can retry:3 Progress: Failed but can retry:3 Progress: Failed but can retry:3 Progress: Stage in:1 Failed but can retry:2 Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/o on BSD Progress: Failed but can retry:3 Progress: Failed but can retry:3 Progress: Failed but can retry:3 Progress: Failed but can retry:3 Progress: Stage in:1 Failed but can retry:2 Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/q on BSD Progress: Failed:1 Failed but can retry:2 Execution failed: Exception in AFNI_3dvolreg: Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run5_trim, -base, ts.5_trim+orig[92], -prefix, volreg.RFL2.run5_trim, ts.5_trim+orig.BRIK] Host: BSD Directory: AFNIsnr-20090421-0943-o1bb0081/jobs/q/AFNI_3dvolreg-q0rydp9j stderr.txt: stdout.txt: ---- Caused by: Cannot submit job Caused by: Data transfer to the server failed [Caused by: Token length 1248813600 > 33554432] -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Apr 21 10:10:40 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 15:10:40 +0000 (GMT) Subject: [Swift-devel] swift not working In-Reply-To: References: Message-ID: Please can you send the sites.xml files for the below two sites. They both look like network errors of some kind. On Tue, 21 Apr 2009, Michael Andric wrote: > Normally, I would hit up Sarah for a fix on this, but since she's on > vacation I'm hoping someone else out there could help with this. I'm unable > to get swift jobs submitted. I've tried submitting to both the ucanl64 and > bsd clusters. 
The run dir (with log files) is here: > /disks/ci-gpfs/fmri/cnari/swift/projects/andric/SNR/RFL2 > > > Here's what I get from ucanl: > > [...]Progress: Submitting:1 Submitted:1 Failed but can retry:1 > Failed to transfer wrapper log from AFNIsnr-20090421-0930-q403bn99/info/e on > ANLUCTERAGRID64 > Progress: Submitting:1 Failed but can retry:2 > Progress: Submitting:1 Failed but can retry:2 > Progress: Submitting:1 Failed but can retry:2 > Progress: Stage in:1 Submitting:1 Failed but can retry:1 > Progress: Submitting:2 Failed but can retry:1 > Progress: Submitting:1 Submitted:1 Failed but can retry:1 > Failed to transfer wrapper log from AFNIsnr-20090421-0930-q403bn99/info/g on > ANLUCTERAGRID64 > Progress: Submitting:1 Failed:1 Failed but can retry:1 > Execution failed: > Exception in AFNI_3dvolreg: > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run6_trim, -base, > ts.6_trim+orig[92], -prefix, volreg.RFL2.run6_trim, ts.6_trim+orig.BRIK] > Host: ANLUCTERAGRID64 > Directory: AFNIsnr-20090421-0930-q403bn99/jobs/g/AFNI_3dvolreg-gxefdp9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job: ; nested exception is: > java.net.SocketTimeoutException: Read timed out > gwynn 5% > > > > and this is what I get on bsd: > > RunID: 20090421-0943-o1bb0081 > Progress: > Progress: Selecting site:1 Initializing site shared directory:1 Stage > in:1 > Progress: Stage in:2 Submitting:1 > 2009.04.21 09:44:04.848 CDT: [ERROR] Parsing profiles on line 1800 Illegal > character ':'at position 60 :Illegal character ':' > Progress: Submitted:1 Failed but can retry:2 > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/f on > BSD > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/g on > BSD > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/e on > BSD > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/i on > BSD > Progress: Failed but can retry:3 > Progress: Stage in:1 Failed but 
can retry:2 > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/m on > BSD > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Stage in:1 Failed but can retry:2 > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/o on > BSD > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Stage in:1 Failed but can retry:2 > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/q on > BSD > Progress: Failed:1 Failed but can retry:2 > Execution failed: > Exception in AFNI_3dvolreg: > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run5_trim, -base, > ts.5_trim+orig[92], -prefix, volreg.RFL2.run5_trim, ts.5_trim+orig.BRIK] > Host: BSD > Directory: AFNIsnr-20090421-0943-o1bb0081/jobs/q/AFNI_3dvolreg-q0rydp9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job > Caused by: > Data transfer to the server failed [Caused by: Token length > 1248813600 > 33554432] > From benc at hawaga.org.uk Tue Apr 21 10:23:50 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 15:23:50 +0000 (GMT) Subject: [Swift-devel] karajan thread IDs as surrogates for swift execution IDs Message-ID: In the provenance work that I am doing for Swift, I have been using the karajan thread ID to identify a lot of things, on the assumption that every distinctly identifiable swift call gets its own karajan thread ID. This fails in two places: compound procedures which run only a single statement - in that case, a new karajan thread is not started, and the same thread is used for both the containing and the contained procedure; and iterate(){} which runs each iteration in sequence in the same thread. 
For the purposes of my immediate development, I have hacks that look like this: // starting new iteration + ThreadingContext tc = (ThreadingContext)stack.getVar("#thread"); + stack.setVar("#thread", tc.split(666)); so that the thread ID is different for each iteration (though in a rather longwinded format using church numerals to number the iterations). Can I expect things to break or work if I make changes to the thread ID? Thus far everything seems to be working for me. -- From andric at uchicago.edu Tue Apr 21 10:31:53 2009 From: andric at uchicago.edu (Michael Andric) Date: Tue, 21 Apr 2009 10:31:53 -0500 Subject: [Swift-devel] swift not working In-Reply-To: References: Message-ID: /disks/ci-gpfs/fmri/cnari/config/sites_ucanl64.xml and /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml On Tue, Apr 21, 2009 at 10:10 AM, Ben Clifford wrote: > > Please can you send the sites.xml files for the below two sites. They both > look like network errors of some kind. > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > Normally, I would hit up Sarah for a fix on this, but since she's on > > vacation I'm hoping someone else out there could help with this. I'm > unable > > to get swift jobs submitted. I've tried submitting to both the ucanl64 > and > > bsd clusters. 
The run dir (with log files) is here: > > /disks/ci-gpfs/fmri/cnari/swift/projects/andric/SNR/RFL2 > > > > > > Here's what I get from ucanl: > > > > [...]Progress: Submitting:1 Submitted:1 Failed but can retry:1 > > Failed to transfer wrapper log from AFNIsnr-20090421-0930-q403bn99/info/e > on > > ANLUCTERAGRID64 > > Progress: Submitting:1 Failed but can retry:2 > > Progress: Submitting:1 Failed but can retry:2 > > Progress: Submitting:1 Failed but can retry:2 > > Progress: Stage in:1 Submitting:1 Failed but can retry:1 > > Progress: Submitting:2 Failed but can retry:1 > > Progress: Submitting:1 Submitted:1 Failed but can retry:1 > > Failed to transfer wrapper log from AFNIsnr-20090421-0930-q403bn99/info/g > on > > ANLUCTERAGRID64 > > Progress: Submitting:1 Failed:1 Failed but can retry:1 > > Execution failed: > > Exception in AFNI_3dvolreg: > > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run6_trim, -base, > > ts.6_trim+orig[92], -prefix, volreg.RFL2.run6_trim, ts.6_trim+orig.BRIK] > > Host: ANLUCTERAGRID64 > > Directory: AFNIsnr-20090421-0930-q403bn99/jobs/g/AFNI_3dvolreg-gxefdp9j > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: > > Cannot submit job: ; nested exception is: > > java.net.SocketTimeoutException: Read timed out > > gwynn 5% > > > > > > > > and this is what I get on bsd: > > > > RunID: 20090421-0943-o1bb0081 > > Progress: > > Progress: Selecting site:1 Initializing site shared directory:1 Stage > > in:1 > > Progress: Stage in:2 Submitting:1 > > 2009.04.21 09:44:04.848 CDT: [ERROR] Parsing profiles on line 1800 > Illegal > > character ':'at position 60 :Illegal character ':' > > Progress: Submitted:1 Failed but can retry:2 > > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/f > on > > BSD > > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/g > on > > BSD > > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/e > on > > BSD > > Failed to transfer wrapper log from 
AFNIsnr-20090421-0943-o1bb0081/info/i > on > > BSD > > Progress: Failed but can retry:3 > > Progress: Stage in:1 Failed but can retry:2 > > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/m > on > > BSD > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Stage in:1 Failed but can retry:2 > > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/o > on > > BSD > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Stage in:1 Failed but can retry:2 > > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/q > on > > BSD > > Progress: Failed:1 Failed but can retry:2 > > Execution failed: > > Exception in AFNI_3dvolreg: > > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run5_trim, -base, > > ts.5_trim+orig[92], -prefix, volreg.RFL2.run5_trim, ts.5_trim+orig.BRIK] > > Host: BSD > > Directory: AFNIsnr-20090421-0943-o1bb0081/jobs/q/AFNI_3dvolreg-q0rydp9j > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: > > Cannot submit job > > Caused by: > > Data transfer to the server failed [Caused by: Token length > > 1248813600 > 33554432] > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Apr 21 10:46:19 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 10:46:19 -0500 Subject: [Swift-devel] karajan thread IDs as surrogates for swift execution IDs In-Reply-To: References: Message-ID: <1240328779.17861.10.camel@localhost> On Tue, 2009-04-21 at 15:23 +0000, Ben Clifford wrote: > In the provenance work that I am doing for Swift, I have been using the > karajan thread ID to identify a lot of things, on the assumption that > every distinctly identifiable swift call gets its own karajan thread ID. 
> > This fails in two places: compound procedures which run only a single > statement - in that case, a new karajan thread is not started, and the > same thread is used for both the containing and the contained procedure; > and iterate(){} which runs each iteration in sequence in the same thread. > > For the purposes of my immediate development, I have hacks that look like > this: > > // starting new iteration > + ThreadingContext tc = > (ThreadingContext)stack.getVar("#thread"); > + stack.setVar("#thread", tc.split(666)); > > so that the thread ID is different for each iteration (though in a rather > longwinded format using church numerals to number the iterations). > > Can I expect things to break or work if I make changes to the thread ID? Not really. As long as no two threads with the same ID execute at the same time. Though I'd prefer something like an iteration counter rather than [.666]*. > > Thus far everything seems to be working for me. > From wilde at mcs.anl.gov Tue Apr 21 10:57:34 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 21 Apr 2009 10:57:34 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240004025.5783.6.camel@localhost> References: <1240004025.5783.6.camel@localhost> Message-ID: <49EDECEE.3040601@mcs.anl.gov> was: Re: [Swift-devel] coaster block allocations I have not been able to try this yet - been working on grant deadlines. But regarding an "idea of how things (will) work" - please send an update on the coaster development - are you proceeding with the 3-level allocation strategy you outlined in a prior thread, and how is that progressing? What is on the list of priority fixes or enhancements to coasters? Which of these can make it into 0.9? (I would like see 0.9 have a highly usable coaster release with a converging set of capabilities. 
I would argue to delay 0.9 by up to 2 weeks to get it in, if needed enhancements are close) I agree with Ben that coaster testing (and implicitly coaster stability) is important for this release; I would like to see us push to get the feature into a state where it's broadly usable and heavily used. On 4/17/09 4:33 PM, Mihael Hategan wrote: > I mentioned a simulator: > http://www.mcs.anl.gov/~hategan/CS.java > > It will give you a visual idea of how things (will) work (run it and > press the up key to advance time). You can play with parameters, work > loads and pre-allocations. > > If you find some behavior that you don't like, send me the parameters > (which are printed when you close the window) and work load (which > you'll have to manually paste from the code), so that I can reproduce > it. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Tue Apr 21 10:57:47 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 15:57:47 +0000 (GMT) Subject: [Swift-devel] karajan thread IDs as surrogates for swift execution IDs In-Reply-To: <1240328779.17861.10.camel@localhost> References: <1240328779.17861.10.camel@localhost> Message-ID: On Tue, 21 Apr 2009, Mihael Hategan wrote: > Though I'd prefer something like an iteration counter rather than > [.666]*. So would I. But I thought I could try summoning satan this way.
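[Editorial note contrasting the two numbering schemes in this thread. The `ThreadId` class below is a stand-in written purely for illustration; Karajan's real `ThreadingContext` has a different API, and only `split(int)` is modelled here:]

```java
// Stand-in for Karajan's ThreadingContext, modelling only split(int);
// written for illustration, not taken from the cog/Karajan source.
final class ThreadId {
    private final String id;
    ThreadId(String id) { this.id = id; }
    ThreadId split(int n) { return new ThreadId(id + "-" + n); }
    @Override public String toString() { return id; }
}

public class IterateIds {
    public static void main(String[] args) {
        ThreadId root = new ThreadId("R");

        // The committed hack: split(666) on the *current* thread ID once
        // per iteration, so the ID grows by one "-666" segment each time
        // (the church-numeral flavour Ben mentions).
        ThreadId church = root;
        for (int i = 0; i < 3; i++) {
            church = church.split(666);
        }
        System.out.println(church);            // R-666-666-666

        // Mihael's suggestion: split the root once per iteration using
        // the iteration counter, giving short, distinct IDs.
        for (int i = 0; i < 3; i++) {
            System.out.println(root.split(i)); // R-0, then R-1, then R-2
        }
    }
}
```

[Both schemes give each iteration a unique ID; the counter form stays constant-length per iteration instead of growing with the iteration number.]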
-- From hategan at mcs.anl.gov Tue Apr 21 11:05:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 11:05:58 -0500 Subject: [Swift-devel] Re: Coaster capabilities for release 0.9 In-Reply-To: <49EDECEE.3040601@mcs.anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> Message-ID: <1240329958.18487.2.camel@localhost> On Tue, 2009-04-21 at 10:57 -0500, Michael Wilde wrote: > was: Re: [Swift-devel] coaster block allocations > > I have not been able to try this yet - been working on grant deadlines. > > But regarding an "idea of how things (will) work" - please send an > update on the coaster development - are you proceeding with the 3-level > allocation strategy you outlined in a prior thread, and how is that > progressing? There's a working algorithm which I'm now working to get into the coasters. > > What is on the list of priority fixes or enhancements to coasters? I'm focusing on the above. > > Which of these can make it into 0.9? (I would like see 0.9 have a highly > usable coaster release with a converging set of capabilities. I would > argue to delay 0.9 by up to 2 weeks to get it in, if needed enhancements > are close) I'd argue against it, because it's a major change and it would need more testing than that. > > I agree with Ben that coaster testing (and implicitly coaster stability) > is important for this release; I would like to see us push to get the > feature into a state where its broadly usable and heavily used. Yes. All the things we do we want to get into a state where it's broadly usable and heavily used. 
From hategan at mcs.anl.gov Tue Apr 21 11:07:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 11:07:23 -0500 Subject: [Swift-devel] karajan thread IDs as surrogates for swift execution IDs In-Reply-To: References: <1240328779.17861.10.camel@localhost> Message-ID: <1240330043.18487.5.camel@localhost> On Tue, 2009-04-21 at 15:57 +0000, Ben Clifford wrote: > On Tue, 21 Apr 2009, Mihael Hategan wrote: > > > Though I'd prefer something like an iteration counter rather than > > [.666]*. > > So would I. But I thought I could try summoning satan this way. > I think Church is not impressed. From hategan at mcs.anl.gov Tue Apr 21 11:08:29 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 11:08:29 -0500 Subject: [Swift-devel] swift not working In-Reply-To: References: Message-ID: <1240330109.18487.6.camel@localhost> I don't see anything wrong with your setup. Is this a persistent problem (i.e. can you try to run these again)? On Tue, 2009-04-21 at 10:31 -0500, Michael Andric wrote: > /disks/ci-gpfs/fmri/cnari/config/sites_ucanl64.xml > > > and > > > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml > > On Tue, Apr 21, 2009 at 10:10 AM, Ben Clifford > wrote: > > Please can you send the sites.xml files for the below two > sites. They both > look like network errors of some kind. > > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > Normally, I would hit up Sarah for a fix on this, but since > she's on > > vacation I'm hoping someone else out there could help with > this. I'm unable > > to get swift jobs submitted. I've tried submitting to both > the ucanl64 and > > bsd clusters. 
The run dir (with log files) is here: > > /disks/ci-gpfs/fmri/cnari/swift/projects/andric/SNR/RFL2 > > > > > > Here's what I get from ucanl: > > > > [...]Progress: Submitting:1 Submitted:1 Failed but can > retry:1 > > Failed to transfer wrapper log from > AFNIsnr-20090421-0930-q403bn99/info/e on > > ANLUCTERAGRID64 > > Progress: Submitting:1 Failed but can retry:2 > > Progress: Submitting:1 Failed but can retry:2 > > Progress: Submitting:1 Failed but can retry:2 > > Progress: Stage in:1 Submitting:1 Failed but can retry:1 > > Progress: Submitting:2 Failed but can retry:1 > > Progress: Submitting:1 Submitted:1 Failed but can retry:1 > > Failed to transfer wrapper log from > AFNIsnr-20090421-0930-q403bn99/info/g on > > ANLUCTERAGRID64 > > Progress: Submitting:1 Failed:1 Failed but can retry:1 > > Execution failed: > > Exception in AFNI_3dvolreg: > > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run6_trim, > -base, > > ts.6_trim+orig[92], -prefix, volreg.RFL2.run6_trim, > ts.6_trim+orig.BRIK] > > Host: ANLUCTERAGRID64 > > Directory: > AFNIsnr-20090421-0930-q403bn99/jobs/g/AFNI_3dvolreg-gxefdp9j > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: > > Cannot submit job: ; nested exception is: > > java.net.SocketTimeoutException: Read timed out > > gwynn 5% > > > > > > > > and this is what I get on bsd: > > > > RunID: 20090421-0943-o1bb0081 > > Progress: > > Progress: Selecting site:1 Initializing site shared > directory:1 Stage > > in:1 > > Progress: Stage in:2 Submitting:1 > > 2009.04.21 09:44:04.848 CDT: [ERROR] Parsing profiles on > line 1800 Illegal > > character ':'at position 60 :Illegal character ':' > > Progress: Submitted:1 Failed but can retry:2 > > Failed to transfer wrapper log from > AFNIsnr-20090421-0943-o1bb0081/info/f on > > BSD > > Failed to transfer wrapper log from > AFNIsnr-20090421-0943-o1bb0081/info/g on > > BSD > > Failed to transfer wrapper log from > AFNIsnr-20090421-0943-o1bb0081/info/e on > > BSD > > Failed to transfer 
wrapper log from > AFNIsnr-20090421-0943-o1bb0081/info/i on > > BSD > > Progress: Failed but can retry:3 > > Progress: Stage in:1 Failed but can retry:2 > > Failed to transfer wrapper log from > AFNIsnr-20090421-0943-o1bb0081/info/m on > > BSD > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Stage in:1 Failed but can retry:2 > > Failed to transfer wrapper log from > AFNIsnr-20090421-0943-o1bb0081/info/o on > > BSD > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Failed but can retry:3 > > Progress: Stage in:1 Failed but can retry:2 > > Failed to transfer wrapper log from > AFNIsnr-20090421-0943-o1bb0081/info/q on > > BSD > > Progress: Failed:1 Failed but can retry:2 > > Execution failed: > > Exception in AFNI_3dvolreg: > > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run5_trim, > -base, > > ts.5_trim+orig[92], -prefix, volreg.RFL2.run5_trim, > ts.5_trim+orig.BRIK] > > Host: BSD > > Directory: > AFNIsnr-20090421-0943-o1bb0081/jobs/q/AFNI_3dvolreg-q0rydp9j > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: > > Cannot submit job > > Caused by: > > Data transfer to the server failed [Caused by: Token > length > > 1248813600 > 33554432] > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Apr 21 11:12:52 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 21 Apr 2009 11:12:52 -0500 Subject: [Swift-devel] swift not working In-Reply-To: <1240330109.18487.6.camel@localhost> References: <1240330109.18487.6.camel@localhost> Message-ID: <49EDF084.5000908@mcs.anl.gov> What do these log messages mean: 2009-04-21 09:44:05,634-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=AFNI_3dvolreg-f0rydp9j - Application exception: Cannot submit job Caused 
by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job Caused by: org.globus.gram.GramException: Data transfer to the server failed [Caused by: Token length 1248813600 > 33554432] 2009-04-21 09:44:05,635-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=AFNI_3dvolreg-g0rydp9j - Application exception: Cannot submit job Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job Caused by: org.globus.gram.GramException: Data transfer to the server failed [Caused by: Token length 1248813600 > 33554432] Googling for Token length > 33554432 gets a lot of hits including: http://bugzilla.globus.org/globus/show_bug.cgi?id=2210 On 4/21/09 11:08 AM, Mihael Hategan wrote: > I don't see anything wrong with your setup. > > Is this a persistent problem (i.e. can you try to run these again)? > > On Tue, 2009-04-21 at 10:31 -0500, Michael Andric wrote: >> /disks/ci-gpfs/fmri/cnari/config/sites_ucanl64.xml >> >> >> and >> >> >> /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml >> >> On Tue, Apr 21, 2009 at 10:10 AM, Ben Clifford >> wrote: >> >> Please can you send the sites.xml files for the below two >> sites. They both >> look like network errors of some kind. >> >> >> On Tue, 21 Apr 2009, Michael Andric wrote: >> >> > Normally, I would hit up Sarah for a fix on this, but since >> she's on >> > vacation I'm hoping someone else out there could help with >> this. I'm unable >> > to get swift jobs submitted. I've tried submitting to both >> the ucanl64 and >> > bsd clusters. 
The run dir (with log files) is here: >> > /disks/ci-gpfs/fmri/cnari/swift/projects/andric/SNR/RFL2 >> > >> > >> > Here's what I get from ucanl: >> > >> > [...]Progress: Submitting:1 Submitted:1 Failed but can >> retry:1 >> > Failed to transfer wrapper log from >> AFNIsnr-20090421-0930-q403bn99/info/e on >> > ANLUCTERAGRID64 >> > Progress: Submitting:1 Failed but can retry:2 >> > Progress: Submitting:1 Failed but can retry:2 >> > Progress: Submitting:1 Failed but can retry:2 >> > Progress: Stage in:1 Submitting:1 Failed but can retry:1 >> > Progress: Submitting:2 Failed but can retry:1 >> > Progress: Submitting:1 Submitted:1 Failed but can retry:1 >> > Failed to transfer wrapper log from >> AFNIsnr-20090421-0930-q403bn99/info/g on >> > ANLUCTERAGRID64 >> > Progress: Submitting:1 Failed:1 Failed but can retry:1 >> > Execution failed: >> > Exception in AFNI_3dvolreg: >> > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run6_trim, >> -base, >> > ts.6_trim+orig[92], -prefix, volreg.RFL2.run6_trim, >> ts.6_trim+orig.BRIK] >> > Host: ANLUCTERAGRID64 >> > Directory: >> AFNIsnr-20090421-0930-q403bn99/jobs/g/AFNI_3dvolreg-gxefdp9j >> > stderr.txt: >> > >> > stdout.txt: >> > >> > ---- >> > >> > Caused by: >> > Cannot submit job: ; nested exception is: >> > java.net.SocketTimeoutException: Read timed out >> > gwynn 5% >> > >> > >> > >> > and this is what I get on bsd: >> > >> > RunID: 20090421-0943-o1bb0081 >> > Progress: >> > Progress: Selecting site:1 Initializing site shared >> directory:1 Stage >> > in:1 >> > Progress: Stage in:2 Submitting:1 >> > 2009.04.21 09:44:04.848 CDT: [ERROR] Parsing profiles on >> line 1800 Illegal >> > character ':'at position 60 :Illegal character ':' >> > Progress: Submitted:1 Failed but can retry:2 >> > Failed to transfer wrapper log from >> AFNIsnr-20090421-0943-o1bb0081/info/f on >> > BSD >> > Failed to transfer wrapper log from >> AFNIsnr-20090421-0943-o1bb0081/info/g on >> > BSD >> > Failed to transfer wrapper log from >> 
AFNIsnr-20090421-0943-o1bb0081/info/e on >> > BSD >> > Failed to transfer wrapper log from >> AFNIsnr-20090421-0943-o1bb0081/info/i on >> > BSD >> > Progress: Failed but can retry:3 >> > Progress: Stage in:1 Failed but can retry:2 >> > Failed to transfer wrapper log from >> AFNIsnr-20090421-0943-o1bb0081/info/m on >> > BSD >> > Progress: Failed but can retry:3 >> > Progress: Failed but can retry:3 >> > Progress: Failed but can retry:3 >> > Progress: Stage in:1 Failed but can retry:2 >> > Failed to transfer wrapper log from >> AFNIsnr-20090421-0943-o1bb0081/info/o on >> > BSD >> > Progress: Failed but can retry:3 >> > Progress: Failed but can retry:3 >> > Progress: Failed but can retry:3 >> > Progress: Failed but can retry:3 >> > Progress: Stage in:1 Failed but can retry:2 >> > Failed to transfer wrapper log from >> AFNIsnr-20090421-0943-o1bb0081/info/q on >> > BSD >> > Progress: Failed:1 Failed but can retry:2 >> > Execution failed: >> > Exception in AFNI_3dvolreg: >> > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run5_trim, >> -base, >> > ts.5_trim+orig[92], -prefix, volreg.RFL2.run5_trim, >> ts.5_trim+orig.BRIK] >> > Host: BSD >> > Directory: >> AFNIsnr-20090421-0943-o1bb0081/jobs/q/AFNI_3dvolreg-q0rydp9j >> > stderr.txt: >> > >> > stdout.txt: >> > >> > ---- >> > >> > Caused by: >> > Cannot submit job >> > Caused by: >> > Data transfer to the server failed [Caused by: Token >> length >> > 1248813600 > 33554432] >> > >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Apr 21 11:15:07 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 21 Apr 2009 11:15:07 -0500 Subject: [Swift-devel] feature request In-Reply-To: 
<49EDEC49.2020204@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> Message-ID: <49EDF10B.5050204@mcs.anl.gov> [meant to cc this to swift-devel] Two separate questions here: 1) what to do: Test coasters on a set of agreed upon (and growing list of) sites: Start with these: localhost: TG: teraport, ucanl, mercury, sdsc, abe, queenbee, ranger OSG: red.unl.edu, some wisc.edu site (pick 3 for now) Local: HNL cluster (gwynn) Thats a good list to start for 0.9. We can easily grow this once you get that far. If I missed a high prio target please let me know. Others in the wings: Jazz, MCS kBT cluster, TG bigred and purdue condor; many more OSG sites. 2) what failed I didnt look in your log yet, but be aware that you need a proxy to run coasters even on localhost (for secure messaging with GSI) Best to start with localhost testing on communicado, not surveyor On 4/21/09 10:54 AM, Zhao Zhang wrote: > Dear All > > I am trying to run swift on local site. I checked out the latest swift > code, and built it. Started test as below, it failed. > I am not clear with my test goals here, am I making sure coaster is > working on all sites we have, or am I testing the existing > coaster could be use by any users? > > zhao > > zzhang at login6.surveyor:/home/falkon/swift_coaster/cog/modules/swift/tests/sites> > ./run-site coaster/coaster-local.xml > testing site configuration: coaster/coaster-local.xml > Removing files from previous runs > Running test 061-cattwo at Tue Apr 21 10:51:21 CDT 2009 > Swift svn swift-r2865 cog-r2388 > > RunID: 20090421-1051-er55uyr8 > Progress: > Multiple entries found for cat on localhost. 
Using the first one > Progress: Submitted:1 > Failed to transfer wrapper log from > 061-cattwo-20090421-1051-er55uyr8/info/i on localhost > Progress: Submitted:1 > Failed to transfer wrapper log from > 061-cattwo-20090421-1051-er55uyr8/info/k on localhost > Progress: Submitted:1 > Failed to transfer wrapper log from > 061-cattwo-20090421-1051-er55uyr8/info/m on localhost > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: localhost > Directory: 061-cattwo-20090421-1051-er55uyr8/jobs/m/cat-mt8ngp9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Could not submit job > Caused by: > Could not start coaster service > Caused by: > Task ended before registration was received. > STDOUT: > STDERR: which: no gmd5sum in > (/home/falkon/swift_coaster/cog/modules/swift/dist/swift-svn/bin:/home/zzhang/ruby-1.8.7-p72/bin/bin:/home/zzhang/chirp/bin:/home/zzhang/gridftp/bin:/home/zzhang/gridftp/sbin:/home/zzhang/xar/bin:/home/falkon/swift_scratch/cog/modules/swift/dist/swift-svn/bin:/home/falkon/falkon/bin:/home/falkon/falkon/service:/home/falkon/falkon/worker:/home/falkon/falkon/client:/home/falkon/falkon/monitor:/home/falkon/falkon/webserver:/home/falkon/falkon/ploticus/src:/home/falkon/falkon/apache-ant-1.7.0:/home/falkon/falkon/apache-ant-1.7.0/bin:/usr/lib/jvm/java:/usr/lib/jvm/java/bin:/home/falkon/falkon/container:/home/falkon/falkon/container/bin:/bin:/usr/sbin:/etc:/usr/X11R6/bin:/usr/bin:/sbin:/usr/local/bin:/bgsys/drivers/ppcfloor/bin:/bgsys/drivers/ppcfloor/comm/bin:/dbhome/bgpdb2c/sqllib/lib:/opt/ibmcmp/vac/bg/9.0/bin:/opt/ibmcmp/vacpp/bg/9.0/bin:/opt/ibmcmp/xlf/bg/11.1/bin:/software/common/apps/mpiscripts:/software/common/apps/projects-list/bin:/software/common/adm/softenv/bin:/home/ zzhang/bin/linux-sles10-ppc64:/home/zzhang/bin:.:/software/common/apps/misc-scripts:/bgsys/drivers/ppcfloor/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin) > > > > Caused by: > Job failed with an exit code of 1 > Cleaning up... 
> Done > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > > Michael Wilde wrote: >> Zhao, based on prior discussion of coaster testing on this list, we >> all agree its high priority. >> >> Can you set aside everything else and focus on it, and let us know at >> the end of the day if you've been able to run the existing site tests >> as a starting point? >> >> Then you need to test running coasters manually, then (I assume, Ben >> and Mihael) create additional site tests that exercise coasters? >> >> Mihael, you should provide a list of coaster-specific aspects to test. >> The job-time management aspects come to mind, as do coaster cleanup >> and termination. >> >> >> On 4/21/09 3:15 AM, Ben Clifford wrote: >>> At the weekend, I put out a release candidate for Swift 0.9 with a >>> 7-day day test period. >>> >>> Performing coaster testing on that release candidate using the >>> existing Swift site tests, and with additional sites, is something >>> that would be useful to the release process and has a clearly defined >>> short timescale - it is something that should happen before the weekend. >>> >>> On Thu, 16 Apr 2009, Ben Clifford wrote: >>> >>>> On Wed, 15 Apr 2009, Michael Wilde wrote: >>>> >>>>> Zhao, based on Ben's suggestion and our earlier discussion, can you >>>>> locate and >>>>> try the Swift test suite, and then look in detail at, and try, the >>>>> per-site >>>>> suite. >>>>> >>>>> Im looking for an assessment of what it takes to create a branch of >>>>> the suite >>>>> to test coasters in th same manner that the sites are tested. >>>> Put swift on your path and get a proxy. >>>> >>>> Then: >>>> cd tests/sites >>>> ./run-all coaster/ >>>> >>>> This will start running the tests in the coaster/ subdirectory. >>>> >>>> Each of the files in there is a site definition. One site test is >>>> run for each of those. When/if all of them have exited, the list of >>>> sites that worked and the list of sites that did not work is output. 
>>>> >>>> If you want to run a single site, say ./run-site >>>> coaster/coaster-local.xml >>>> (for example) >>>> >>>> To add new sites, put an file in the coaster/ subdirectory >>>> containing an appropriate site definition. >>>> >>>> In order to make site tests that many people can run, I usually make >>>> a remote work directory and chmod a+rwxt on that remote work >>>> directory so no matter who runs, Swift will not encounter permission >>>> problems. >>>> >>>>> And what it takes to run that on a regular basis to find problems >>>>> in sites and >>>>> Swift *before* our users find them. >>>> The main problem with the site tests is that they generally need >>>> credentials, and its unclear what the right way to handle long term >>>> testing credentials is. >>>> >>>> >> > From hategan at mcs.anl.gov Tue Apr 21 11:19:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 11:19:02 -0500 Subject: [Swift-devel] swift not working In-Reply-To: <49EDF084.5000908@mcs.anl.gov> References: <1240330109.18487.6.camel@localhost> <49EDF084.5000908@mcs.anl.gov> Message-ID: <1240330742.18908.0.camel@localhost> On Tue, 2009-04-21 at 11:12 -0500, Michael Wilde wrote: > What do these log messages mean: > > 2009-04-21 09:44:05,634-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=AFNI_3dvolreg-f0rydp9j - Application exception: Cannot submit job > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > Caused by: org.globus.gram.GramException: Data transfer to the server > failed [Caused by: Token length 1248813600 > 33554432] > 2009-04-21 09:44:05,635-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=AFNI_3dvolreg-g0rydp9j - Application exception: Cannot submit job > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > Caused by: org.globus.gram.GramException: Data transfer to the server > failed [Caused by: Token length 1248813600 > 33554432] > > Googling 
for Token length > 33554432 gets a lot of hits including: > > http://bugzilla.globus.org/globus/show_bug.cgi?id=2210 > Right. Except that error he gets with gram2. I saw it happening before when high loads are involved. From zhaozhang at uchicago.edu Tue Apr 21 11:22:04 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 21 Apr 2009 11:22:04 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49EDF10B.5050204@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF10B.5050204@mcs.anl.gov> Message-ID: <49EDF2AC.60504@uchicago.edu> Also, I found my grid proxy expired. [zzhang at communicado ~]$ grid-proxy-init Your identity: /DC=org/DC=doegrids/OU=People/CN=Zhao Zhang 385894 Enter GRID pass phrase for this identity: Creating proxy ........................................... Done ERROR: Your certificate has expired: Thu Feb 26 12:47:51 2009 Michael Wilde wrote: > [meant to cc this to swift-devel] > > Two separate questions here: > > 1) what to do: > > Test coasters on a set of agreed upon (and growing list of) sites: > > Start with these: > > localhost: > TG: teraport, ucanl, mercury, sdsc, abe, queenbee, ranger > OSG: red.unl.edu, some wisc.edu site (pick 3 for now) > Local: HNL cluster (gwynn) > > Thats a good list to start for 0.9. We can easily grow this once you > get that far. > > If I missed a high prio target please let me know. > > Others in the wings: Jazz, MCS kBT cluster, TG bigred and purdue condor; > many more OSG sites. 
> > > 2) what failed > > I didnt look in your log yet, but be aware that you need a proxy to run > coasters even on localhost (for secure messaging with GSI) > > Best to start with localhost testing on communicado, not surveyor > > > On 4/21/09 10:54 AM, Zhao Zhang wrote: >> Dear All >> >> I am trying to run swift on local site. I checked out the latest >> swift code, and built it. Started test as below, it failed. >> I am not clear with my test goals here, am I making sure coaster is >> working on all sites we have, or am I testing the existing >> coaster could be use by any users? >> >> zhao >> >> zzhang at login6.surveyor:/home/falkon/swift_coaster/cog/modules/swift/tests/sites> >> ./run-site coaster/coaster-local.xml >> testing site configuration: coaster/coaster-local.xml >> Removing files from previous runs >> Running test 061-cattwo at Tue Apr 21 10:51:21 CDT 2009 >> Swift svn swift-r2865 cog-r2388 >> >> RunID: 20090421-1051-er55uyr8 >> Progress: >> Multiple entries found for cat on localhost. Using the first one >> Progress: Submitted:1 >> Failed to transfer wrapper log from >> 061-cattwo-20090421-1051-er55uyr8/info/i on localhost >> Progress: Submitted:1 >> Failed to transfer wrapper log from >> 061-cattwo-20090421-1051-er55uyr8/info/k on localhost >> Progress: Submitted:1 >> Failed to transfer wrapper log from >> 061-cattwo-20090421-1051-er55uyr8/info/m on localhost >> Execution failed: >> Exception in cat: >> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] >> Host: localhost >> Directory: 061-cattwo-20090421-1051-er55uyr8/jobs/m/cat-mt8ngp9j >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Could not submit job >> Caused by: >> Could not start coaster service >> Caused by: >> Task ended before registration was received. 
>> STDOUT: >> STDERR: which: no gmd5sum in >> (/home/falkon/swift_coaster/cog/modules/swift/dist/swift-svn/bin:/home/zzhang/ruby-1.8.7-p72/bin/bin:/home/zzhang/chirp/bin:/home/zzhang/gridftp/bin:/home/zzhang/gridftp/sbin:/home/zzhang/xar/bin:/home/falkon/swift_scratch/cog/modules/swift/dist/swift-svn/bin:/home/falkon/falkon/bin:/home/falkon/falkon/service:/home/falkon/falkon/worker:/home/falkon/falkon/client:/home/falkon/falkon/monitor:/home/falkon/falkon/webserver:/home/falkon/falkon/ploticus/src:/home/falkon/falkon/apache-ant-1.7.0:/home/falkon/falkon/apache-ant-1.7.0/bin:/usr/lib/jvm/java:/usr/lib/jvm/java/bin:/home/falkon/falkon/container:/home/falkon/falkon/container/bin:/bin:/usr/sbin:/etc:/usr/X11R6/bin:/usr/bin:/sbin:/usr/local/bin:/bgsys/drivers/ppcfloor/bin:/bgsys/drivers/ppcfloor/comm/bin:/dbhome/bgpdb2c/sqllib/lib:/opt/ibmcmp/vac/bg/9.0/bin:/opt/ibmcmp/vacpp/bg/9.0/bin:/opt/ibmcmp/xlf/bg/11.1/bin:/software/common/apps/mpiscripts:/software/common/apps/projects-list/bin:/software/common/adm/softenv/bin:/home/ >> > > zzhang/bin/linux-sles10-ppc64:/home/zzhang/bin:.:/software/common/apps/misc-scripts:/bgsys/drivers/ppcfloor/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin) > >> >> >> >> Caused by: >> Job failed with an exit code of 1 >> Cleaning up... >> Done >> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >> >> >> Michael Wilde wrote: >>> Zhao, based on prior discussion of coaster testing on this list, we >>> all agree its high priority. >>> >>> Can you set aside everything else and focus on it, and let us know >>> at the end of the day if you've been able to run the existing site >>> tests as a starting point? >>> >>> Then you need to test running coasters manually, then (I assume, Ben >>> and Mihael) create additional site tests that exercise coasters? >>> >>> Mihael, you should provide a list of coaster-specific aspects to test. >>> The job-time management aspects come to mind, as do coaster cleanup >>> and termination. 
>>> >>> >>> On 4/21/09 3:15 AM, Ben Clifford wrote: >>>> At the weekend, I put out a release candidate for Swift 0.9 with a >>>> 7-day day test period. >>>> >>>> Performing coaster testing on that release candidate using the >>>> existing Swift site tests, and with additional sites, is something >>>> that would be useful to the release process and has a clearly >>>> defined short timescale - it is something that should happen before >>>> the weekend. >>>> >>>> On Thu, 16 Apr 2009, Ben Clifford wrote: >>>> >>>>> On Wed, 15 Apr 2009, Michael Wilde wrote: >>>>> >>>>>> Zhao, based on Ben's suggestion and our earlier discussion, can >>>>>> you locate and >>>>>> try the Swift test suite, and then look in detail at, and try, >>>>>> the per-site >>>>>> suite. >>>>>> >>>>>> Im looking for an assessment of what it takes to create a branch >>>>>> of the suite >>>>>> to test coasters in th same manner that the sites are tested. >>>>> Put swift on your path and get a proxy. >>>>> >>>>> Then: >>>>> cd tests/sites >>>>> ./run-all coaster/ >>>>> >>>>> This will start running the tests in the coaster/ subdirectory. >>>>> >>>>> Each of the files in there is a site definition. One site test is >>>>> run for each of those. When/if all of them have exited, the list >>>>> of sites that worked and the list of sites that did not work is >>>>> output. >>>>> >>>>> If you want to run a single site, say ./run-site >>>>> coaster/coaster-local.xml >>>>> (for example) >>>>> >>>>> To add new sites, put an file in the coaster/ subdirectory >>>>> containing an appropriate site definition. >>>>> >>>>> In order to make site tests that many people can run, I usually >>>>> make a remote work directory and chmod a+rwxt on that remote work >>>>> directory so no matter who runs, Swift will not encounter >>>>> permission problems. >>>>> >>>>>> And what it takes to run that on a regular basis to find problems >>>>>> in sites and >>>>>> Swift *before* our users find them. 
>>>>> The main problem with the site tests is that they generally need >>>>> credentials, and its unclear what the right way to handle long >>>>> term testing credentials is. >>>>> >>>>> >>> >> > > From benc at hawaga.org.uk Tue Apr 21 11:28:18 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 16:28:18 +0000 (GMT) Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <49EDECEE.3040601@mcs.anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> Message-ID: On Tue, 21 Apr 2009, Michael Wilde wrote: > Which of these can make it into 0.9? (I would like see 0.9 have a highly > usable coaster release with a converging set of capabilities. I would > argue to delay 0.9 by up to 2 weeks to get it in, if needed enhancements > are close) A 0.9 release candidate has already been made. Bugs would have to be fairly serious in order to prevent me releasing it. Coasters, as an in-development feature, not a production-quality feature, pretty much shouldn't get to stop or delay releases unless they are substantially worse than swift 0.8 coasters. Post 0.9rc2 changes will end up in the next swift release, which should be roughly 8 weeks after the 0.9 release. That is not so long to wait. If there is likelihood that the advice from now onwards until that next release will be "download the latest svn" to coaster users, then there is little point in putting coaster changes into a release. 
-- From zhaozhang at uchicago.edu Tue Apr 21 11:48:01 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 21 Apr 2009 11:48:01 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49EDF456.5050202@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> Message-ID: <49EDF8C1.7050201@uchicago.edu> Hi, All I checked out the swift-0.9rc2 from http://www.ci.uchicago.edu/~benc/swift-0.9rc2.tar.gz And I manually copy it to ~/swift_coaster/cog/modules/swift/dist/swift-0.9rc2/ [zzhang at communicado swift]$ which swift ~/swift_coaster/cog/modules/swift/dist/swift-0.9rc2/bin/swift Then I started with the coaster-local test: with the following error message: [zzhang at communicado sites]$ ./run-site coaster/coaster-local.xml testing site configuration: coaster/coaster-local.xml Removing files from previous runs Running test 061-cattwo at Tue Apr 21 11:45:17 CDT 2009 Warning: -Xmx256M not understood. Ignoring. log4j:ERROR Error occured while converting date. 
java.lang.IllegalArgumentException: Illegal pattern character at java.text.SimpleDateFormat.format(java.util.Date, java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib64/libgcj.so.5.0.0) at java.text.DateFormat.format(java.util.Date) (/usr/lib64/libgcj.so.5.0.0) at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer, org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown Source) at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source) at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown Source) log4j:ERROR Error occured while converting date. 
java.lang.IllegalArgumentException: Illegal pattern character at java.text.SimpleDateFormat.format(java.util.Date, java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib64/libgcj.so.5.0.0) at java.text.DateFormat.format(java.util.Date) (/usr/lib64/libgcj.so.5.0.0) at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer, org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown Source) at org.apache.log4j.Category.debug(java.lang.Object) (Unknown Source) at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown Source) log4j:ERROR Error occured while converting date. 
java.lang.IllegalArgumentException: Illegal pattern character at java.text.SimpleDateFormat.format(java.util.Date, java.lang.StringBuffer, java.text.FieldPosition) (/usr/lib64/libgcj.so.5.0.0) at java.text.DateFormat.format(java.util.Date) (/usr/lib64/libgcj.so.5.0.0) at org.apache.log4j.helpers.PatternParser$DatePatternConverter.convert(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.PatternConverter.format(java.lang.StringBuffer, org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.PatternLayout.format(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.subAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.WriterAppender.append(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.AppenderSkeleton.doAppend(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) (Unknown Source) at org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) (Unknown Source) at org.apache.log4j.Category.info(java.lang.Object) (Unknown Source) at org.griphyn.vdl.karajan.Loader.compile(java.lang.String) (Unknown Source) at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown Source) Exception in thread "main" java.lang.NoSuchMethodError: method java.io.File.toURI was not found. 
at _Jv_ResolvePoolEntry(java.lang.Class, int) (/usr/lib64/libgcj.so.5.0.0) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(java.io.File, org.apache.xmlbeans.SchemaType, org.apache.xmlbeans.XmlOptions) (Unknown Source) at org.globus.swift.language.ProgramDocument$Factory.parse(java.io.File, org.apache.xmlbeans.XmlOptions) (Unknown Source) at org.griphyn.vdl.engine.Karajan.parseProgramXML(java.lang.String) (Unknown Source) at org.griphyn.vdl.engine.Karajan.compile(java.lang.String, java.io.PrintStream) (Unknown Source) at org.griphyn.vdl.karajan.Loader.compile(java.lang.String) (Unknown Source) at org.griphyn.vdl.karajan.Loader.main(java.lang.String[]) (Unknown Source) SWIFT RETURN CODE NON-ZERO - test 061-cattwo Michael Wilde wrote: > http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-April/005167.html > > If you ever need it, swift-devel is archived at: > > http://mail.ci.uchicago.edu/pipermail/swift-devel/ > > (swift-user similarly) > > > > On 4/21/09 11:17 AM, Zhao Zhang wrote: >> Hi, Ben >> >> It seems that I missed that email. I could not find it in my >> thunderbird. Could you remind me the address of >> 0.9rc2 swift? >> >> I had a problem receiving emails from Swift-devel list some time ago. >> >> zhao >> >> Ben Clifford wrote: >>> You should be testing Swift 0.9rc2, not the latest Swift code from SVN. >>> >>> So you should find the message that I sent on the mailing list >>> announcing Swift 0.9rc2, and download that tarball. >>> >>> Then set your PATH to include the bin/ directory in there. >>> >>> Don't build Swift from SVN. >>> >>> The only thing coming from SVN should be the tests/ subdirectory. >>> >>> You can tell if you are running the 0.9rc2 version by looking at the >>> version output. It should say Swift 0.92c when you start a Swift run. >>> >>> You should be running the site tests from a regular linux machine >>> like communicado. 
I think surveyor is a more specialised machine >>> which probably doesn't have what is expected for local coaster >>> execution. >>> >>> >>>> I am not clear with my test goals here, am I making sure coaster is >>>> working on all sites we have, or am I testing the existing coaster >>>> could be use by any users? >>>> >>> >>> You should be testing how well coasters in 0.9rc2 will work for any >>> user running on any site. >>> >>> Imagine a new user arrives, and downloads Swift 0.9. He then picks a >>> site at random, and tries to use coasters on that site. Will it >>> work? That is the answer you are trying to find. >>> >>> Which sites do coasters work on now? Which sites do they not work >>> on? What are the error messages from the sites they do not work on? >>> >>> > From benc at hawaga.org.uk Tue Apr 21 11:53:14 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 16:53:14 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49EDF8C1.7050201@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> Message-ID: On Tue, 21 Apr 2009, Zhao Zhang wrote: > Warning: -Xmx256M not understood. Ignoring. This comes from a change that happened to communicado sometimes recently. Modify your ~/.soft file so that you have the line: +java-sun before @default Otherwise, you end up using gcj for java instead of a real java. 
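[Archive note] The ~/.soft fix described just above can be sketched concretely. This is a minimal example, not taken verbatim from the thread: softenv applies entries in order, so the Sun JVM key must appear before the default macro for it to win on PATH.

```shell
# ~/.soft -- softenv configuration; order matters, earlier entries take
# precedence on PATH. Putting +java-sun before @default ensures the Sun
# JDK shadows the gcj java that @default would otherwise supply.
+java-sun
@default
```

After editing, log in again (or run softenv's `resoft`, where available) and confirm with `java -version` that the reported VM is Sun/HotSpot rather than gcj; the log4j "Illegal pattern character" and `java.io.File.toURI` errors earlier in this thread are symptoms of running Swift under gcj.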
-- From andric at uchicago.edu Tue Apr 21 11:58:53 2009 From: andric at uchicago.edu (Michael Andric) Date: Tue, 21 Apr 2009 11:58:53 -0500 Subject: [Swift-devel] swift not working In-Reply-To: <1240330742.18908.0.camel@localhost> References: <1240330109.18487.6.camel@localhost> <49EDF084.5000908@mcs.anl.gov> <1240330742.18908.0.camel@localhost> Message-ID: i tried again, still failing 2009.04.21 11:54:02.700 CDT: [ERROR] Parsing profiles on line 1800 Illegal character ':'at position 60 :Illegal character ':' Swift svn swift-r2854 cog-r2382 RunID: 20090421-1154-v89vd1q8 Progress: Progress: Selecting site:1 Initializing site shared directory:1 Stage in:1 Progress: Stage in:2 Submitting:1 2009.04.21 11:54:11.689 CDT: [ERROR] Parsing profiles on line 1800 Illegal character ':'at position 60 :Illegal character ':' Failed to transfer wrapper log from AFNIsnr-20090421-1154-v89vd1q8/info/s on BSD Failed to transfer wrapper log from AFNIsnr-20090421-1154-v89vd1q8/info/r on BSD Progress: Submitted:1 Failed but can retry:2 Failed to transfer wrapper log from AFNIsnr-20090421-1154-v89vd1q8/info/q on BSD Failed to transfer wrapper log from AFNIsnr-20090421-1154-v89vd1q8/info/u on BSD Progress: Failed but can retry:3 Progress: Stage in:1 Failed but can retry:2 Failed to transfer wrapper log from AFNIsnr-20090421-1154-v89vd1q8/info/z on BSD Progress: Failed but can retry:3 Progress: Failed but can retry:3 Progress: Failed but can retry:3 Progress: Stage in:1 Failed but can retry:2 Failed to transfer wrapper log from AFNIsnr-20090421-1154-v89vd1q8/info/1 on BSD Progress: Failed but can retry:3 Progress: Failed but can retry:3 Progress: Failed but can retry:3 Progress: Failed but can retry:3 Failed to transfer wrapper log from AFNIsnr-20090421-1154-v89vd1q8/info/3 on BSD Progress: Failed:1 Failed but can retry:2 Execution failed: Exception in AFNI_3dvolreg: Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run6_trim, -base, ts.6_trim+orig[92], -prefix, volreg.RFL2.run6_trim, 
ts.6_trim+orig.BRIK] Host: BSD Directory: AFNIsnr-20090421-1154-v89vd1q8/jobs/3/AFNI_3dvolreg-34u5jp9j stderr.txt: stdout.txt: ---- Caused by: Cannot submit job Caused by: Data transfer to the server failed [Caused by: Token length 1248813600 > 33554432] gwynn 4% On Tue, Apr 21, 2009 at 11:19 AM, Mihael Hategan wrote: > On Tue, 2009-04-21 at 11:12 -0500, Michael Wilde wrote: > > What do these log messages mean: > > > > 2009-04-21 09:44:05,634-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > jobid=AFNI_3dvolreg-f0rydp9j - Application exception: Cannot submit job > > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Cannot submit job > > Caused by: org.globus.gram.GramException: Data transfer to the server > > failed [Caused by: Token length 1248813600 > 33554432] > > 2009-04-21 09:44:05,635-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > jobid=AFNI_3dvolreg-g0rydp9j - Application exception: Cannot submit job > > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Cannot submit job > > Caused by: org.globus.gram.GramException: Data transfer to the server > > failed [Caused by: Token length 1248813600 > 33554432] > > > > Googling for Token length > 33554432 gets a lot of hits including: > > > > http://bugzilla.globus.org/globus/show_bug.cgi?id=2210 > > > > Right. Except that error he gets with gram2. I saw it happening before > when high loads are involved. > > > -------------- next part -------------- An HTML attachment was scrubbed... 
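The "Token length 1248813600 > 33554432" failure above is worth decoding: jglobus caps GSI tokens at 32 MB, and a "length" this far over the cap usually means the bytes read off the socket were not a length at all. A small sketch, assuming for illustration that the reader takes the first 4 bytes of the stream as a big-endian unsigned length (the explain_token_length helper is hypothetical, not part of jglobus):

```python
import struct

# Sketch: decode a suspicious GSI "token length" back into raw bytes.
# Assumes a 4-byte big-endian unsigned length prefix (an assumption
# about the token framing, not something stated in this thread).
MAX_TOKEN = 32 * 1024 * 1024  # the 33554432 limit from the error

def explain_token_length(n):
    """Return the 4 bytes that would have produced length n, if suspect."""
    if n > MAX_TOKEN:
        # Likely plaintext that was misread as a binary length prefix.
        return struct.pack(">I", n)
    return None

print(explain_token_length(1248813600))  # b'Job '
```

Decoded this way, 1248813600 is the ASCII bytes "Job ", consistent with the server having sent a plaintext reply that got misread as token framing, which fits the high-load scenario Mihael describes.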
URL: From zhaozhang at uchicago.edu Tue Apr 21 12:00:03 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 21 Apr 2009 12:00:03 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> Message-ID: <49EDFB93.10902@uchicago.edu> Hi, Ben I modified .soft, then ran the test, it failed with the same error as I had on surveyor: [zzhang at communicado sites]$ ./run-site coaster/coaster-local.xml testing site configuration: coaster/coaster-local.xml Removing files from previous runs Running test 061-cattwo at Tue Apr 21 11:57:16 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090421-1157-hdc8m6c4 Progress: Multiple entries found for cat on localhost. Using the first one Progress: Submitted:1 Failed to transfer wrapper log from 061-cattwo-20090421-1157-hdc8m6c4/info/l on localhost Progress: Submitted:1 Failed to transfer wrapper log from 061-cattwo-20090421-1157-hdc8m6c4/info/n on localhost Progress: Stage in:1 Progress: Submitted:1 Failed to transfer wrapper log from 061-cattwo-20090421-1157-hdc8m6c4/info/p on localhost Execution failed: Exception in cat: Arguments: [061-cattwo.1.in, 061-cattwo.2.in] Host: localhost Directory: 061-cattwo-20090421-1157-hdc8m6c4/jobs/p/cat-pqfajp9j stderr.txt: stdout.txt: ---- Caused by: Could not submit job Caused by: Could not start coaster service Caused by: Task ended before registration was received. 
STDOUT: STDERR: which: no gmd5sum in (/home/zzhang/swift_coaster/cog/modules/swift/dist/swift-0.9rc2/bin:/home/zzhang/chirp/bin:/home/zzhang/xar/bin:/soft/java-1.5.0_06-sun-r1/bin:/soft/java-1.5.0_06-sun-r1/jre/bin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/zzhang/bin/linux-rhel4-x86_64:/home/zzhang/bin:/soft/apache-ant-1.6.5-r1/bin:/software/common/cert-scripts-2-5.rev44-r1/bin:/soft/condor-6.8.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/globus-4.0.3-r1/bin:/soft/globus-4.0.3-r1/sbin) Caused by: Job failed with an exit code of 1 Cleaning up... Done SWIFT RETURN CODE NON-ZERO - test 061-cattwo Ben Clifford wrote: > On Tue, 21 Apr 2009, Zhao Zhang wrote: > > >> Warning: -Xmx256M not understood. Ignoring. >> > > This comes from a change that happened to communicado sometimes recently. > > Modify your ~/.soft file so that you have the line: > > +java-sun > > before @default > > Otherwise, you end up using gcj for java instead of a real java. > > From benc at hawaga.org.uk Tue Apr 21 12:08:36 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 17:08:36 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49EDFB93.10902@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> Message-ID: On Tue, 21 Apr 2009, Zhao Zhang wrote: > I modified .soft, then ran the test, it failed with the same error as I > had on surveyor: You will need a valid proxy to test coasters. They will not work for you until you have one. 
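The proxy requirement turns out to be the root cause a few messages later: the certificate had expired in February. As a minimal illustration of the expiry check itself (not what grid-proxy-init actually runs; the is_expired helper is hypothetical), Python's stdlib can parse the OpenSSL notAfter date format directly:

```python
import ssl
import time

# Sketch: decide whether a certificate's notAfter timestamp has passed.
# ssl.cert_time_to_seconds() parses the OpenSSL "notAfter" string format;
# the date below is the expiry reported later in this thread.
def is_expired(not_after, now=None):
    """True if the notAfter timestamp is earlier than `now` (default: current time)."""
    expiry = ssl.cert_time_to_seconds(not_after)
    return expiry < (now if now is not None else time.time())

print(is_expired("Feb 26 12:47:51 2009 GMT"))  # True
```

The real tools compare this against the system clock before building a proxy, which is why grid-proxy-init fails loudly once the end-entity certificate is past its notAfter date.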
-- From hategan at mcs.anl.gov Tue Apr 21 12:12:46 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 12:12:46 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49EDFB93.10902@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> Message-ID: <1240333966.20167.8.camel@localhost> There's a bunch of "coaster-bootstrap-.log" files in your home directory, which give more details: I see this in them: Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-10] Expired credentials (DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894,C N=1487316538). On Tue, 2009-04-21 at 12:00 -0500, Zhao Zhang wrote: > Hi, Ben > > I modified .soft, then ran the test, it failed with the same error as I > had on surveyor: > > [zzhang at communicado sites]$ ./run-site coaster/coaster-local.xml > testing site configuration: coaster/coaster-local.xml > Removing files from previous runs > Running test 061-cattwo at Tue Apr 21 11:57:16 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1157-hdc8m6c4 > Progress: > Multiple entries found for cat on localhost. 
Using the first one > Progress: Submitted:1 > Failed to transfer wrapper log from > 061-cattwo-20090421-1157-hdc8m6c4/info/l on localhost > Progress: Submitted:1 > Failed to transfer wrapper log from > 061-cattwo-20090421-1157-hdc8m6c4/info/n on localhost > Progress: Stage in:1 > Progress: Submitted:1 > Failed to transfer wrapper log from > 061-cattwo-20090421-1157-hdc8m6c4/info/p on localhost > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: localhost > Directory: 061-cattwo-20090421-1157-hdc8m6c4/jobs/p/cat-pqfajp9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Could not submit job > Caused by: > Could not start coaster service > Caused by: > Task ended before registration was received. > STDOUT: > STDERR: which: no gmd5sum in > (/home/zzhang/swift_coaster/cog/modules/swift/dist/swift-0.9rc2/bin:/home/zzhang/chirp/bin:/home/zzhang/xar/bin:/soft/java-1.5.0_06-sun-r1/bin:/soft/java-1.5.0_06-sun-r1/jre/bin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/zzhang/bin/linux-rhel4-x86_64:/home/zzhang/bin:/soft/apache-ant-1.6.5-r1/bin:/software/common/cert-scripts-2-5.rev44-r1/bin:/soft/condor-6.8.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/globus-4.0.3-r1/bin:/soft/globus-4.0.3-r1/sbin) > > > Caused by: > Job failed with an exit code of 1 > Cleaning up... > Done > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > > Ben Clifford wrote: > > On Tue, 21 Apr 2009, Zhao Zhang wrote: > > > > > >> Warning: -Xmx256M not understood. Ignoring. > >> > > > > This comes from a change that happened to communicado sometimes recently. > > > > Modify your ~/.soft file so that you have the line: > > > > +java-sun > > > > before @default > > > > Otherwise, you end up using gcj for java instead of a real java. 
> > > > From benc at hawaga.org.uk Tue Apr 21 12:11:38 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 17:11:38 +0000 (GMT) Subject: [Swift-devel] swift not working In-Reply-To: References: Message-ID: On Tue, 21 Apr 2009, Michael Andric wrote: > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml I cannot make a manual submission to gwynn.bsd.uchicago.edu using globus-job-run, so this is not a Swift problem. I think this is something that support at ci should deal with. -- From andric at uchicago.edu Tue Apr 21 12:14:18 2009 From: andric at uchicago.edu (Michael Andric) Date: Tue, 21 Apr 2009 12:14:18 -0500 Subject: [Swift-devel] swift not working In-Reply-To: References: Message-ID: what about ucanl? On Tue, Apr 21, 2009 at 12:11 PM, Ben Clifford wrote: > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml > > I cannot make a manual submission to gwynn.bsd.uchicago.edu using > globus-job-run, so this is not a Swift problem. I think this is something > that support at ci should deal with. > > -- > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Apr 21 12:17:44 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 17:17:44 +0000 (GMT) Subject: [Swift-devel] swift not working In-Reply-To: References: Message-ID: On Tue, 21 Apr 2009, Michael Andric wrote: > what about ucanl? patience patience! It looks like the gram4 installation on TG-UC is broken (specifically, hanging) which is an issue for TG helpdesk. I've already submitted a CI support req for gwynn (ci ticket 569), and I'll put in a teragrid ticket for TGUC. -- From hategan at mcs.anl.gov Tue Apr 21 12:20:53 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 12:20:53 -0500 Subject: [Swift-devel] swift not working In-Reply-To: References: Message-ID: <1240334453.21927.0.camel@localhost> Did you also try that one again? 
On Tue, 2009-04-21 at 12:14 -0500, Michael Andric wrote: > what about ucanl? > > On Tue, Apr 21, 2009 at 12:11 PM, Ben Clifford > wrote: > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml > > > I cannot make a manual submission to gwynn.bsd.uchicago.edu > using > globus-job-run, so this is not a Swift problem. I think this > is something > that support at ci should deal with. > > -- > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Tue Apr 21 12:19:55 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 17:19:55 +0000 (GMT) Subject: [Swift-devel] swift not working In-Reply-To: <1240334453.21927.0.camel@localhost> References: <1240334453.21927.0.camel@localhost> Message-ID: I can recreate non-swift brokenness on both of those sites. On Tue, 21 Apr 2009, Mihael Hategan wrote: > Did you also try that one again? > > On Tue, 2009-04-21 at 12:14 -0500, Michael Andric wrote: > > what about ucanl? > > > > On Tue, Apr 21, 2009 at 12:11 PM, Ben Clifford > > wrote: > > > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > > > > > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml > > > > > > I cannot make a manual submission to gwynn.bsd.uchicago.edu > > using > > globus-job-run, so this is not a Swift problem. I think this > > is something > > that support at ci should deal with. 
> > > > -- > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Tue Apr 21 12:21:57 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 17:21:57 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49EDFB93.10902@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> Message-ID: zhao, on communicado please type: grid-cert-info and paste the results here. Then type: grid-proxy-init and enter your password and paste the results. -- From hategan at mcs.anl.gov Tue Apr 21 12:25:06 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 12:25:06 -0500 Subject: [Swift-devel] swift not working In-Reply-To: References: <1240334453.21927.0.camel@localhost> Message-ID: <1240334706.21927.3.camel@localhost> Ok. My point was, in general, that if there are such catastrophic failures, spacing out in time a couple of attempts generally raises the confidence that the problem is not some spooky transient issue. On Tue, 2009-04-21 at 17:19 +0000, Ben Clifford wrote: > I can recreate non-swift brokenness on both of those sites. > > On Tue, 21 Apr 2009, Mihael Hategan wrote: > > > Did you also try that one again? > > > > On Tue, 2009-04-21 at 12:14 -0500, Michael Andric wrote: > > > what about ucanl? 
> > > > > > On Tue, Apr 21, 2009 at 12:11 PM, Ben Clifford > > > wrote: > > > > > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > > > > > > > > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml > > > > > > > > > I cannot make a manual submission to gwynn.bsd.uchicago.edu > > > using > > > globus-job-run, so this is not a Swift problem. I think this > > > is something > > > that support at ci should deal with. > > > > > > -- > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From wilde at mcs.anl.gov Tue Apr 21 12:24:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 21 Apr 2009 12:24:55 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> Message-ID: <49EE0167.5090909@mcs.anl.gov> Raising the urgency of testing coasters for 0.9 is inconsistent with saying there is "little point in putting coaster changes into a release". If we are willing to insert more point releases between 0.9 and 1.0, or we don't attach special significance to 1.0, I am happy to let 0.9 go. Otherwise, I want to delay 0.9. That's a separate but related question, and good to decide before we call the next release 0.9. That aside, I want to set a firm goal that a coaster feature we can call production-quality be in the next point release after the one currently in candidate state. It's OK to have known limitations, but it should do a useful core set of things very well, with tests to validate it, and be documented so users know what they can and cannot expect it to do (i.e., when to use it and when not). I can see that this may take 8 weeks or more. I still want a list of the known coaster improvements needed, a time estimate for doing them, and design discussion on any features that need it (such as coasters that work well on OSG). 
Otherwise we can't tell if coasters need 8 weeks or 80 weeks. I started such a list in prior emails. Mihael, please merge that with a list of issues you see from emailed reports, other issues you know of or have concerns about, and send it out to swift-devel. Now that we have some experience with coasters - much of it very positive - we should bear down and focus on making it production quality. I hear the dismissal of my previous statement on that as a mom-and-apple-pie rant, but a list of known deficiencies and how you plan to address them is the concrete part of what I'm asking for when I say this. - Mike On 4/21/09 11:28 AM, Ben Clifford wrote: > On Tue, 21 Apr 2009, Michael Wilde wrote: > >> Which of these can make it into 0.9? (I would like to see 0.9 have a highly >> usable coaster release with a converging set of capabilities. I would >> argue to delay 0.9 by up to 2 weeks to get it in, if needed enhancements >> are close) > > A 0.9 release candidate has already been made. Bugs would have to be > fairly serious in order to prevent me releasing it. > > Coasters, as an in-development feature, not a production-quality feature, > pretty much shouldn't get to stop or delay releases unless they are > substantially worse than Swift 0.8 coasters. > > Post 0.9rc2 changes will end up in the next Swift release, which should be > roughly 8 weeks after the 0.9 release. That is not so long to wait. > > If there is a likelihood that the advice from now onwards until that next > release will be "download the latest svn" to coaster users, then there is > little point in putting coaster changes into a release. > From andric at uchicago.edu Tue Apr 21 12:29:15 2009 From: andric at uchicago.edu (Michael Andric) Date: Tue, 21 Apr 2009 12:29:15 -0500 Subject: [Swift-devel] swift not working In-Reply-To: <1240334706.21927.3.camel@localhost> References: <1240334453.21927.0.camel@localhost> <1240334706.21927.3.camel@localhost> Message-ID: just tried on bigred - also failed [...] 
Progress: Failed but can retry:3 Progress: Stage in:1 Failed but can retry:2 Exception occured in the exception handling code, so it cannot be properly propagated to the user java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) at java.net.SocketOutputStream.write(SocketOutputStream.java:115) at java.io.DataOutputStream.writeByte(DataOutputStream.java:136) at org.globus.ftp.dc.EBlockImageDCWriter.endOfData(EBlockImageDCWriter.java:63) at org.globus.ftp.dc.GridFTPTransferSourceThread.shutdown(GridFTPTransferSourceThread.java:62) at org.globus.ftp.dc.TransferSourceThread.run(TransferSourceThread.java:87) Progress: Failed but can retry:3 Failed to transfer wrapper log from AFNIsnr-20090421-1220-q9598ll1/info/o on BIGRED Progress: Failed:1 Failed but can retry:2 Execution failed: Exception in AFNI_3dvolreg: Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run4_trim, -base, ts.4_trim+orig[92], -prefix, volreg.RFL2.run4_trim, ts.4_trim+orig.BRIK] Host: BIGRED Directory: AFNIsnr-20090421-1220-q9598ll1/jobs/o/AFNI_3dvolreg-od58kp9j stderr.txt: stdout.txt: ---- Caused by: Server refused performing the request. 
Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 451 ocurred during retrieve() org.globus.ftp.exception.ServerException: Refusing to start transfer before previous transfer completes (error code 5) org.globus.ftp.exception.ServerException: Refusing to start transfer before previous transfer completes (error code 5) at org.globus.ftp.dc.TransferThreadManager.startTransfer(TransferThreadManager.java:129) at org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:431) at org.globus.ftp.FTPClient.put(FTPClient.java:1289) at org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:427) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355) at org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492) at java.lang.Thread.run(Thread.java:595) ] gwynn 4% On Tue, Apr 21, 2009 at 12:25 PM, Mihael Hategan wrote: > Ok. My point was, in general, that if there are such catastrophic > failures, spacing out in time a couple of attempts generally raises the > confidence that the problem is not some spooky transient issue. > > On Tue, 2009-04-21 at 17:19 +0000, Ben Clifford wrote: > > I can recreate non-swift brokenness on both of those sites. > > > > On Tue, 21 Apr 2009, Mihael Hategan wrote: > > > > > Did you also try that one again? > > > > > > On Tue, 2009-04-21 at 12:14 -0500, Michael Andric wrote: > > > > what about ucanl? 
> > > > > > > > On Tue, Apr 21, 2009 at 12:11 PM, Ben Clifford > > > > wrote: > > > > > > > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > > > > > > > > > > > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml > > > > > > > > > > > > I cannot make a manual submission to gwynn.bsd.uchicago.edu > > > > using > > > > globus-job-run, so this is not a Swift problem. I think this > > > > is something > > > > that support at ci should deal with. > > > > > > > > -- > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhaozhang at uchicago.edu Tue Apr 21 12:31:08 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 21 Apr 2009 12:31:08 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> Message-ID: <49EE02DC.5060405@uchicago.edu> Yep, I found that my certificate has expired. I check the webpage about grid proxy, it says I need to request a new certificate. zhao [zzhang at communicado ~]$ grid-proxy-init Your identity: /DC=org/DC=doegrids/OU=People/CN=Zhao Zhang 385894 Enter GRID pass phrase for this identity: Creating proxy ........................................... Done ERROR: Your certificate has expired: Thu Feb 26 12:47:51 2009 Ben Clifford wrote: > zhao, on communicado please type: > > grid-cert-info > > and paste the results here. > > Then type: grid-proxy-init and enter your password and paste the results. 
> > From zhaozhang at uchicago.edu Tue Apr 21 12:37:43 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 21 Apr 2009 12:37:43 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> Message-ID: <49EE0467.8060408@uchicago.edu> Ok, I found there is a pair of keys, with name "newcert.pem" and "newkey.pem", I could not remember when I renewed them. But yes, it is working [zzhang at communicado .globus]$ grid-proxy-init Your identity: DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 Enter GRID pass phrase for this identity: Creating proxy, please wait... Proxy verify OK Your proxy is valid until Wed Apr 22 00:33:38 CDT 2009 [zzhang at communicado sites]$ ./run-site coaster/coaster-local.xml testing site configuration: coaster/coaster-local.xml Removing files from previous runs Running test 061-cattwo at Tue Apr 21 12:34:45 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090421-1234-beqr2e0d Progress: Multiple entries found for cat on localhost. Using the first one Progress: Submitted:1 Progress: Active:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://127.0.0.1:50003 Got channel MetaChannel: 810427 -> GSSSChannel-null(1) - Done expecting 061-cattwo.out.expected checking 061-cattwo.out.expected Skipping exception test due to test configuration Test passed at Tue Apr 21 12:34:52 CDT 2009 ----------===========================---------- Running test 130-fmri at Tue Apr 21 12:34:52 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090421-1234-0ig7ezt6 Progress: Multiple entries found for touch on localhost. 
Using the first one Progress: Selecting site:2 Submitting:1 Submitted:1 Progress: Selecting site:2 Submitted:1 Active:1 Progress: Selecting site:1 Stage in:1 Finished successfully:2 Progress: Submitted:1 Finished successfully:4 Progress: Submitted:3 Finished successfully:6 Progress: Submitted:1 Finished successfully:10 Final status: Finished successfully:11 Cleaning up... Shutting down service at https://127.0.0.1:50003 Got channel MetaChannel: 27652339 -> GSSSChannel-null(1) - Done expecting 130-fmri.0000.jpeg.expected 130-fmri.0001.jpeg.expected 130-fmri.0002.jpeg.expected checking 130-fmri.0000.jpeg.expected Skipping exception test due to test configuration checking 130-fmri.0001.jpeg.expected Skipping exception test due to test configuration checking 130-fmri.0002.jpeg.expected Skipping exception test due to test configuration Test passed at Tue Apr 21 12:35:05 CDT 2009 ----------===========================---------- Running test 103-quote at Tue Apr 21 12:35:05 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090421-1235-fh6odt20 Progress: Multiple entries found for echo on localhost. Using the first one Progress: Submitted:1 Progress: Active:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://127.0.0.1:50003 Got channel MetaChannel: 32165316 -> GSSSChannel-null(1) - Done expecting 103-quote.out.expected checking 103-quote.out.expected Skipping exception test due to test configuration Test passed at Tue Apr 21 12:35:12 CDT 2009 ----------===========================---------- Running test 1032-singlequote at Tue Apr 21 12:35:12 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090421-1235-355iqmu2 Progress: Multiple entries found for echo on localhost. Using the first one Progress: Submitted:1 Progress: Active:1 Final status: Finished successfully:1 Cleaning up... 
Shutting down service at https://127.0.0.1:50003 Got channel MetaChannel: 27362451 -> GSSSChannel-null(1) - Done expecting 1032-singlequote.out.expected checking 1032-singlequote.out.expected Skipping exception test due to test configuration Test passed at Tue Apr 21 12:35:19 CDT 2009 ----------===========================---------- Running test 1031-quote at Tue Apr 21 12:35:19 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090421-1235-kz20ns09 Progress: Multiple entries found for echo on localhost. Using the first one Progress: Submitted:1 Progress: Active:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://127.0.0.1:50003 Got channel MetaChannel: 32165316 -> GSSSChannel-null(1) - Done expecting 1031-quote.*.expected No expected output files specified for this test case - not checking output. Skipping exception test due to test configuration Test passed at Tue Apr 21 12:35:26 CDT 2009 ----------===========================---------- Running test 1033-singlequote at Tue Apr 21 12:35:26 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090421-1235-d82mk769 Progress: Multiple entries found for echo on localhost. Using the first one Progress: Submitted:1 Progress: Active:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://127.0.0.1:50003 Got channel MetaChannel: 3934565 -> GSSSChannel-null(1) - Done expecting 1033-singlequote.out.expected checking 1033-singlequote.out.expected Skipping exception test due to test configuration Test passed at Tue Apr 21 12:35:33 CDT 2009 ----------===========================---------- Running test 141-space-in-filename at Tue Apr 21 12:35:33 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090421-1235-53ylesn6 Progress: Multiple entries found for echo on localhost. Using the first one Progress: Submitted:1 Progress: Active:1 Final status: Finished successfully:1 Cleaning up... 
Shutting down service at https://127.0.0.1:50003 Got channel MetaChannel: 3934565 -> GSSSChannel-null(1) - Done expecting 141-space-in-filename.space here.out.expected checking 141-space-in-filename.space here.out.expected Skipping exception test due to test configuration Test passed at Tue Apr 21 12:35:40 CDT 2009 ----------===========================---------- Running test 142-space-and-quotes at Tue Apr 21 12:35:40 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090421-1235-ko20j409 Progress: Multiple entries found for touch on localhost. Using the first one Progress: Selecting site:2 Submitting:1 Submitted:1 Progress: Selecting site:2 Submitted:1 Active:1 Progress: Selecting site:1 Stage in:1 Finished successfully:2 Final status: Finished successfully:4 Cleaning up... Shutting down service at https://127.0.0.1:50003 Got channel MetaChannel: 15140795 -> GSSSChannel-null(1) - Done expecting 142-space-and-quotes.2" space ".out.expected 142-space-and-quotes.3' space '.out.expected 142-space-and-quotes.out.expected 142-space-and-quotes. space .out.expected checking 142-space-and-quotes.2" space ".out.expected Skipping exception test due to test configuration checking 142-space-and-quotes.3' space '.out.expected Skipping exception test due to test configuration checking 142-space-and-quotes.out.expected Skipping exception test due to test configuration checking 142-space-and-quotes. space .out.expected Skipping exception test due to test configuration Test passed at Tue Apr 21 12:35:50 CDT 2009 ----------===========================---------- All language behaviour tests passed Ben Clifford wrote: > zhao, on communicado please type: > > grid-cert-info > > and paste the results here. > > Then type: grid-proxy-init and enter your password and paste the results. 
> > From benc at hawaga.org.uk Tue Apr 21 12:37:56 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 17:37:56 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49EE02DC.5060405@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> Message-ID: Please do all of what I asked. On Tue, 21 Apr 2009, Zhao Zhang wrote: > Yep, I found that my certificate has expired. I check the webpage about grid > proxy, it says I need to request a new certificate. > > zhao > > [zzhang at communicado ~]$ grid-proxy-init > Your identity: /DC=org/DC=doegrids/OU=People/CN=Zhao Zhang 385894 > Enter GRID pass phrase for this identity: > Creating proxy ........................................... Done > > > ERROR: Your certificate has expired: Thu Feb 26 12:47:51 2009 > > > > Ben Clifford wrote: > > zhao, on communicado please type: > > > > grid-cert-info > > > > and paste the results here. > > > > Then type: grid-proxy-init and enter your password and paste the results. 
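The run-site/run-all harness whose output appears above can hang on a bad site and then has to be killed by hand. One way to bound each site test with a per-test timeout, sketched with Python's subprocess module - the command names and timeout values are illustrative, not the harness's real interface:

```python
import subprocess

def run_with_timeout(cmd, timeout_secs):
    """Run one site-test command, killing it if it exceeds the timeout,
    so a hung site does not block the rest of the sweep."""
    try:
        result = subprocess.run(cmd, timeout=timeout_secs,
                                capture_output=True, text=True)
        return result.returncode
    except subprocess.TimeoutExpired:
        return None  # a hang is a distinct outcome from a failure

# Illustrative stand-ins for "./run-site <config>":
print(run_with_timeout(["true"], 5))        # completes normally: 0
print(run_with_timeout(["sleep", "60"], 1)) # killed after timeout: None
```

`subprocess.run` kills and reaps the child when the timeout expires before re-raising `TimeoutExpired`, which is exactly the ctrl-C-and-move-on behaviour done automatically.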
> > > > > > From benc at hawaga.org.uk Tue Apr 21 12:41:45 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 17:41:45 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49EE0467.8060408@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE0467.8060408@uchicago.edu> Message-ID: ok. now try: ./run-all coaster/ You will need to watch this occasionally because sometimes you might find some sites hang rather than failing. You can kill the test for that site by pressing ctrl-C; I think the test script will then move onto the next site (but I am not sure). On Tue, 21 Apr 2009, Zhao Zhang wrote: > Ok, I found there is a pair of keys, with name "newcert.pem" and "newkey.pem", > I could not remember when I renewed them. But yes, it is working > > [zzhang at communicado .globus]$ grid-proxy-init > Your identity: DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 > Enter GRID pass phrase for this identity: Creating proxy, please > wait... > Proxy verify OK > Your proxy is valid until Wed Apr 22 00:33:38 CDT 2009 > > > > [zzhang at communicado sites]$ ./run-site coaster/coaster-local.xml > testing site configuration: coaster/coaster-local.xml > Removing files from previous runs > Running test 061-cattwo at Tue Apr 21 12:34:45 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1234-beqr2e0d > Progress: > Multiple entries found for cat on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... 
> Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 810427 -> GSSSChannel-null(1) > - Done > expecting 061-cattwo.out.expected > checking 061-cattwo.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:34:52 CDT 2009 > ----------===========================---------- > Running test 130-fmri at Tue Apr 21 12:34:52 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1234-0ig7ezt6 > Progress: > Multiple entries found for touch on localhost. Using the first one > Progress: Selecting site:2 Submitting:1 Submitted:1 > Progress: Selecting site:2 Submitted:1 Active:1 > Progress: Selecting site:1 Stage in:1 Finished successfully:2 > Progress: Submitted:1 Finished successfully:4 > Progress: Submitted:3 Finished successfully:6 > Progress: Submitted:1 Finished successfully:10 > Final status: Finished successfully:11 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 27652339 -> GSSSChannel-null(1) > - Done > expecting 130-fmri.0000.jpeg.expected 130-fmri.0001.jpeg.expected > 130-fmri.0002.jpeg.expected > checking 130-fmri.0000.jpeg.expected > Skipping exception test due to test configuration > checking 130-fmri.0001.jpeg.expected > Skipping exception test due to test configuration > checking 130-fmri.0002.jpeg.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:05 CDT 2009 > ----------===========================---------- > Running test 103-quote at Tue Apr 21 12:35:05 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-fh6odt20 > Progress: > Multiple entries found for echo on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... 
> Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 32165316 -> GSSSChannel-null(1) > - Done > expecting 103-quote.out.expected > checking 103-quote.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:12 CDT 2009 > ----------===========================---------- > Running test 1032-singlequote at Tue Apr 21 12:35:12 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-355iqmu2 > Progress: > Multiple entries found for echo on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 27362451 -> GSSSChannel-null(1) > - Done > expecting 1032-singlequote.out.expected > checking 1032-singlequote.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:19 CDT 2009 > ----------===========================---------- > Running test 1031-quote at Tue Apr 21 12:35:19 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-kz20ns09 > Progress: > Multiple entries found for echo on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 32165316 -> GSSSChannel-null(1) > - Done > expecting 1031-quote.*.expected > No expected output files specified for this test case - not checking output. > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:26 CDT 2009 > ----------===========================---------- > Running test 1033-singlequote at Tue Apr 21 12:35:26 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-d82mk769 > Progress: > Multiple entries found for echo on localhost. 
Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 3934565 -> GSSSChannel-null(1) > - Done > expecting 1033-singlequote.out.expected > checking 1033-singlequote.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:33 CDT 2009 > ----------===========================---------- > Running test 141-space-in-filename at Tue Apr 21 12:35:33 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-53ylesn6 > Progress: > Multiple entries found for echo on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 3934565 -> GSSSChannel-null(1) > - Done > expecting 141-space-in-filename.space here.out.expected > checking 141-space-in-filename.space here.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:40 CDT 2009 > ----------===========================---------- > Running test 142-space-and-quotes at Tue Apr 21 12:35:40 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-ko20j409 > Progress: > Multiple entries found for touch on localhost. Using the first one > Progress: Selecting site:2 Submitting:1 Submitted:1 > Progress: Selecting site:2 Submitted:1 Active:1 > Progress: Selecting site:1 Stage in:1 Finished successfully:2 > Final status: Finished successfully:4 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 15140795 -> GSSSChannel-null(1) > - Done > expecting 142-space-and-quotes.2" space ".out.expected 142-space-and-quotes.3' > space '.out.expected 142-space-and-quotes.out.expected 142-space-and-quotes. 
> space .out.expected > checking 142-space-and-quotes.2" space ".out.expected > Skipping exception test due to test configuration > checking 142-space-and-quotes.3' space '.out.expected > Skipping exception test due to test configuration > checking 142-space-and-quotes.out.expected > Skipping exception test due to test configuration > checking 142-space-and-quotes. space .out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:50 CDT 2009 > ----------===========================---------- > All language behaviour tests passed > > > Ben Clifford wrote: > > zhao, on communicado please type: > > > > grid-cert-info > > > > and paste the results here. > > > > Then type: grid-proxy-init and enter your password and paste the results. > > > > > > From andric at uchicago.edu Tue Apr 21 12:41:58 2009 From: andric at uchicago.edu (Michael Andric) Date: Tue, 21 Apr 2009 12:41:58 -0500 Subject: [Swift-devel] swift not working In-Reply-To: References: <1240334453.21927.0.camel@localhost> <1240334706.21927.3.camel@localhost> Message-ID: also failed again on ucanl64 [...] 
Progress: Submitting:1 Submitted:1 Failed but can retry:1 Failed to transfer wrapper log from AFNIsnr-20090421-1231-4h6oa8k5/info/e on ANLUCTERAGRID64 Progress: Submitting:1 Failed but can retry:2 Progress: Submitting:1 Failed but can retry:2 Progress: Submitting:1 Failed but can retry:2 Progress: Stage in:1 Submitting:1 Failed but can retry:1 Progress: Submitting:1 Submitted:1 Failed but can retry:1 Progress: Submitting:1 Failed but can retry:2 Failed to transfer wrapper log from AFNIsnr-20090421-1231-4h6oa8k5/info/g on ANLUCTERAGRID64 Execution failed: Exception in AFNI_3dvolreg: Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run5_trim, -base, ts.5_trim+orig[92], -prefix, volreg.RFL2.run5_trim, ts.5_trim+orig.BRIK] Host: ANLUCTERAGRID64 Directory: AFNIsnr-20090421-1231-4h6oa8k5/jobs/g/AFNI_3dvolreg-gbrmkp9j stderr.txt: stdout.txt: ---- Caused by: Cannot submit job: ; nested exception is: java.net.SocketTimeoutException: Read timed out gwynn 7% On Tue, Apr 21, 2009 at 12:29 PM, Michael Andric wrote: > just tried on bigred - also failed > [...] 
> Progress: Failed but can retry:3 > Progress: Stage in:1 Failed but can retry:2 > Exception occured in the exception handling code, so it cannot be properly > propagated to the user > java.net.SocketException: Broken pipe > at java.net.SocketOutputStream.socketWrite0(Native Method) > at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) > at java.net.SocketOutputStream.write(SocketOutputStream.java:115) > at java.io.DataOutputStream.writeByte(DataOutputStream.java:136) > at > org.globus.ftp.dc.EBlockImageDCWriter.endOfData(EBlockImageDCWriter.java:63) > at > org.globus.ftp.dc.GridFTPTransferSourceThread.shutdown(GridFTPTransferSourceThread.java:62) > at > org.globus.ftp.dc.TransferSourceThread.run(TransferSourceThread.java:87) > Progress: Failed but can retry:3 > Failed to transfer wrapper log from AFNIsnr-20090421-1220-q9598ll1/info/o > on BIGRED > Progress: Failed:1 Failed but can retry:2 > Execution failed: > Exception in AFNI_3dvolreg: > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run4_trim, -base, > ts.4_trim+orig[92], -prefix, volreg.RFL2.run4_trim, ts.4_trim+orig.BRIK] > Host: BIGRED > Directory: AFNIsnr-20090421-1220-q9598ll1/jobs/o/AFNI_3dvolreg-od58kp9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Server refused performing the request. 
Custom message: (error code > 1) [Nested exception message: Custom message: Unexpected reply: 451 ocurred > during retrieve() > org.globus.ftp.exception.ServerException: Refusing to start transfer before > previous transfer completes (error code 5) > org.globus.ftp.exception.ServerException: Refusing to start transfer before > previous transfer completes (error code 5) > at > org.globus.ftp.dc.TransferThreadManager.startTransfer(TransferThreadManager.java:129) > at > org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:431) > at org.globus.ftp.FTPClient.put(FTPClient.java:1289) > at > org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:427) > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355) > at > org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492) > at java.lang.Thread.run(Thread.java:595) > ] > gwynn 4% > > On Tue, Apr 21, 2009 at 12:25 PM, Mihael Hategan wrote: > >> Ok. My point was, in general, that if there are such catastrophic >> failures, spacing out in time a couple of attempts generally raises the >> confidence that the problem is not some spooky transient issue. >> >> On Tue, 2009-04-21 at 17:19 +0000, Ben Clifford wrote: >> > I can recreate non-swift brokenness on both of those sites. >> > >> > On Tue, 21 Apr 2009, Mihael Hategan wrote: >> > >> > > Did you also try that one again? >> > > >> > > On Tue, 2009-04-21 at 12:14 -0500, Michael Andric wrote: >> > > > what about ucanl? 
>> > > > >> > > > On Tue, Apr 21, 2009 at 12:11 PM, Ben Clifford >> > > > wrote: >> > > > >> > > > On Tue, 21 Apr 2009, Michael Andric wrote: >> > > > >> > > > >> > > > > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml >> > > > >> > > > >> > > > I cannot make a manual submission to gwynn.bsd.uchicago.edu >> > > > using >> > > > globus-job-run, so this is not a Swift problem. I think this >> > > > is something >> > > > that support at ci should deal with. >> > > > >> > > > -- >> > > > >> > > > >> > > > _______________________________________________ >> > > > Swift-devel mailing list >> > > > Swift-devel at ci.uchicago.edu >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > >> > > >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Apr 21 12:44:40 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 17:44:40 +0000 (GMT) Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <49EE0167.5090909@mcs.anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> Message-ID: On Tue, 21 Apr 2009, Michael Wilde wrote: > If we are willing to insert more point releases between 0.9 and 1.0, or we > dont attach special significance to 1.0, I am happy to let 0.9 go. Otherwise, > I want to delay 0.9. Thats a separate but related question, and good to decide > before we call the next release 0.9. My assumption was that 0.9 + 0.1 = 0.10. I don't see anything special about the release after 0.9 compared to any other release. > That aside, I want to set a firm goal that a coaster feature we can call > production-quality be in the next point release after the one currently in > candidate state. The release policy so far has been every 2 months or so, there should be a release; it is time based and not feature based. I don't think that should change. 
From a release manager's perspective that means that in 2 months I release what is in the SVN heads, rather than delaying because some feature has not arrived yet; if a feature is not ready, it appears in the next point release after it eventually is ready. That's the same policy as (for example) the provenance code that I am working on for pc3 - if it's in trunk, it gets released; if it's not in trunk, it doesn't get released. A lot happens in Swift in 2 months and it's unfortunate to withhold that from users. That does not mean that it is not worthwhile setting coasters as a goal for the next release, but it means it should be in the sense of "coasters in 2 months time" not "0.10 release will be delayed until coasters are ready". -- From zhaozhang at uchicago.edu Tue Apr 21 12:52:29 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 21 Apr 2009 12:52:29 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49EE0467.8060408@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE0467.8060408@uchicago.edu> Message-ID: <49EE07DD.9000408@uchicago.edu> Hi, Ben and Mihael Once it ran fine on the local site, I tried "./run-all coaster/". I am attaching the log. zhao Zhao Zhang wrote: > Ok, I found there is a pair of keys, with name "newcert.pem" and > "newkey.pem", I could not remember when I renewed them. But yes, it is > working > > [zzhang at communicado .globus]$ grid-proxy-init > Your identity: DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 > Enter GRID pass phrase for this identity: Creating proxy, > please wait... 
> Proxy verify OK > Your proxy is valid until Wed Apr 22 00:33:38 CDT 2009 > > > > [zzhang at communicado sites]$ ./run-site coaster/coaster-local.xml > testing site configuration: coaster/coaster-local.xml > Removing files from previous runs > Running test 061-cattwo at Tue Apr 21 12:34:45 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1234-beqr2e0d > Progress: > Multiple entries found for cat on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 810427 -> GSSSChannel-null(1) > - Done > expecting 061-cattwo.out.expected > checking 061-cattwo.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:34:52 CDT 2009 > ----------===========================---------- > Running test 130-fmri at Tue Apr 21 12:34:52 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1234-0ig7ezt6 > Progress: > Multiple entries found for touch on localhost. Using the first one > Progress: Selecting site:2 Submitting:1 Submitted:1 > Progress: Selecting site:2 Submitted:1 Active:1 > Progress: Selecting site:1 Stage in:1 Finished successfully:2 > Progress: Submitted:1 Finished successfully:4 > Progress: Submitted:3 Finished successfully:6 > Progress: Submitted:1 Finished successfully:10 > Final status: Finished successfully:11 > Cleaning up... 
> Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 27652339 -> GSSSChannel-null(1) > - Done > expecting 130-fmri.0000.jpeg.expected 130-fmri.0001.jpeg.expected > 130-fmri.0002.jpeg.expected > checking 130-fmri.0000.jpeg.expected > Skipping exception test due to test configuration > checking 130-fmri.0001.jpeg.expected > Skipping exception test due to test configuration > checking 130-fmri.0002.jpeg.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:05 CDT 2009 > ----------===========================---------- > Running test 103-quote at Tue Apr 21 12:35:05 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-fh6odt20 > Progress: > Multiple entries found for echo on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 32165316 -> GSSSChannel-null(1) > - Done > expecting 103-quote.out.expected > checking 103-quote.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:12 CDT 2009 > ----------===========================---------- > Running test 1032-singlequote at Tue Apr 21 12:35:12 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-355iqmu2 > Progress: > Multiple entries found for echo on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... 
> Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 27362451 -> GSSSChannel-null(1) > - Done > expecting 1032-singlequote.out.expected > checking 1032-singlequote.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:19 CDT 2009 > ----------===========================---------- > Running test 1031-quote at Tue Apr 21 12:35:19 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-kz20ns09 > Progress: > Multiple entries found for echo on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 32165316 -> GSSSChannel-null(1) > - Done > expecting 1031-quote.*.expected > No expected output files specified for this test case - not checking > output. > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:26 CDT 2009 > ----------===========================---------- > Running test 1033-singlequote at Tue Apr 21 12:35:26 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-d82mk769 > Progress: > Multiple entries found for echo on localhost. Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 3934565 -> GSSSChannel-null(1) > - Done > expecting 1033-singlequote.out.expected > checking 1033-singlequote.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:33 CDT 2009 > ----------===========================---------- > Running test 141-space-in-filename at Tue Apr 21 12:35:33 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-53ylesn6 > Progress: > Multiple entries found for echo on localhost. 
Using the first one > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 3934565 -> GSSSChannel-null(1) > - Done > expecting 141-space-in-filename.space here.out.expected > checking 141-space-in-filename.space here.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:40 CDT 2009 > ----------===========================---------- > Running test 142-space-and-quotes at Tue Apr 21 12:35:40 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090421-1235-ko20j409 > Progress: > Multiple entries found for touch on localhost. Using the first one > Progress: Selecting site:2 Submitting:1 Submitted:1 > Progress: Selecting site:2 Submitted:1 Active:1 > Progress: Selecting site:1 Stage in:1 Finished successfully:2 > Final status: Finished successfully:4 > Cleaning up... > Shutting down service at https://127.0.0.1:50003 > Got channel MetaChannel: 15140795 -> GSSSChannel-null(1) > - Done > expecting 142-space-and-quotes.2" space ".out.expected > 142-space-and-quotes.3' space '.out.expected > 142-space-and-quotes.out.expected 142-space-and-quotes. space > .out.expected > checking 142-space-and-quotes.2" space ".out.expected > Skipping exception test due to test configuration > checking 142-space-and-quotes.3' space '.out.expected > Skipping exception test due to test configuration > checking 142-space-and-quotes.out.expected > Skipping exception test due to test configuration > checking 142-space-and-quotes. space .out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 21 12:35:50 CDT 2009 > ----------===========================---------- > All language behaviour tests passed > > > Ben Clifford wrote: >> zhao, on communicado please type: >> >> grid-cert-info >> >> and paste the results here. 
>> >> Then type: grid-proxy-init and enter your password and paste the >> results. >> >> > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: log URL: From zhaozhang at uchicago.edu Tue Apr 21 12:55:15 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 21 Apr 2009 12:55:15 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <1239821895.23411.10.camel@localhost> <49E63000.40009@uchicago.edu> <1239823325.23411.38.camel@localhost> <49E63720.9020706@mcs.anl.gov> <49E663B9.4000007@mcs.anl.gov> <49EDCC59.6080507@mcs.anl.gov> <49EDEC49.2020204@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> Message-ID: <49EE0883.4070908@uchicago.edu> [zzhang at communicado coaster]$ grid-cert-info subject : DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 issuer : DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1 start date : Wed Feb 25 14:32:08 CST 2009 end date : Thu Feb 25 14:32:08 CST 2010 [zzhang at communicado coaster]$ grid-proxy-init Your identity: DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 Enter GRID pass phrase for this identity: Creating proxy, please wait... Proxy verify OK Your proxy is valid until Wed Apr 22 00:54:47 CDT 2009 zhao Ben Clifford wrote: > Please do all of what I asked. > > On Tue, 21 Apr 2009, Zhao Zhang wrote: > > >> Yep, I found that my certificate has expired. I check the webpage about grid >> proxy, it says I need to request a new certificate. >> >> zhao >> >> [zzhang at communicado ~]$ grid-proxy-init >> Your identity: /DC=org/DC=doegrids/OU=People/CN=Zhao Zhang 385894 >> Enter GRID pass phrase for this identity: >> Creating proxy ........................................... 
Done >> >> >> ERROR: Your certificate has expired: Thu Feb 26 12:47:51 2009 >> >> >> >> Ben Clifford wrote: >> >>> zhao, on communicado please type: >>> >>> grid-cert-info >>> >>> and paste the results here. >>> >>> Then type: grid-proxy-init and enter your password and paste the results. >>> >>> >>> >> > > From benc at hawaga.org.uk Tue Apr 21 13:00:51 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 18:00:51 +0000 (GMT) Subject: [Swift-devel] swift not working In-Reply-To: References: <1240334453.21927.0.camel@localhost> <1240334706.21927.3.camel@localhost> Message-ID: Unlike the other two, I can talk to bigred manually ok. I can also run examples/swift/first.swift OK on bigred. Please try running examples/swift/first.swift against bigred from your setup. Also send the swift version string from the start of swift output. On Tue, 21 Apr 2009, Michael Andric wrote: > just tried on bigred - also failed > [...] > Progress: Failed but can retry:3 > Progress: Stage in:1 Failed but can retry:2 > Exception occured in the exception handling code, so it cannot be properly > propagated to the user > java.net.SocketException: Broken pipe > at java.net.SocketOutputStream.socketWrite0(Native Method) > at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) > at java.net.SocketOutputStream.write(SocketOutputStream.java:115) > at java.io.DataOutputStream.writeByte(DataOutputStream.java:136) > at > org.globus.ftp.dc.EBlockImageDCWriter.endOfData(EBlockImageDCWriter.java:63) > at > org.globus.ftp.dc.GridFTPTransferSourceThread.shutdown(GridFTPTransferSourceThread.java:62) > at > org.globus.ftp.dc.TransferSourceThread.run(TransferSourceThread.java:87) > Progress: Failed but can retry:3 > Failed to transfer wrapper log from AFNIsnr-20090421-1220-q9598ll1/info/o on > BIGRED > Progress: Failed:1 Failed but can retry:2 > Execution failed: > Exception in AFNI_3dvolreg: > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run4_trim, -base, > 
ts.4_trim+orig[92], -prefix, volreg.RFL2.run4_trim, ts.4_trim+orig.BRIK] > Host: BIGRED > Directory: AFNIsnr-20090421-1220-q9598ll1/jobs/o/AFNI_3dvolreg-od58kp9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Server refused performing the request. Custom message: (error code > 1) [Nested exception message: Custom message: Unexpected reply: 451 ocurred > during retrieve() > org.globus.ftp.exception.ServerException: Refusing to start transfer before > previous transfer completes (error code 5) > org.globus.ftp.exception.ServerException: Refusing to start transfer before > previous transfer completes (error code 5) > at > org.globus.ftp.dc.TransferThreadManager.startTransfer(TransferThreadManager.java:129) > at > org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:431) > at org.globus.ftp.FTPClient.put(FTPClient.java:1289) > at > org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:427) > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355) > at > org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492) > at java.lang.Thread.run(Thread.java:595) > ] > gwynn 4% > > On Tue, Apr 21, 2009 at 12:25 PM, Mihael Hategan wrote: > > > Ok. My point was, in general, that if there are such catastrophic > > failures, spacing out in time a couple of attempts generally raises the > > confidence that the problem is not some spooky transient issue. > > > > On Tue, 2009-04-21 at 17:19 +0000, Ben Clifford wrote: > > > I can recreate non-swift brokenness on both of those sites. > > > > > > On Tue, 21 Apr 2009, Mihael Hategan wrote: > > > > > > > Did you also try that one again? 
> > > > > > > > On Tue, 2009-04-21 at 12:14 -0500, Michael Andric wrote: > > > > > what about ucanl? > > > > > > > > > > On Tue, Apr 21, 2009 at 12:11 PM, Ben Clifford > > > > > wrote: > > > > > > > > > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > > > > > > > > > > > > > > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml > > > > > > > > > > > > > > > I cannot make a manual submission to gwynn.bsd.uchicago.edu > > > > > using > > > > > globus-job-run, so this is not a Swift problem. I think this > > > > > is something > > > > > that support at ci should deal with. > > > > > > > > > > -- > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > From hategan at mcs.anl.gov Tue Apr 21 13:05:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 13:05:02 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <49EE0167.5090909@mcs.anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> Message-ID: <1240337102.22528.10.camel@localhost> On Tue, 2009-04-21 at 12:24 -0500, Michael Wilde wrote: > Raising the urgency of testing coasters for 0.9 is inconsistent with > saying there is "little point in putting coaster changes into a release". > > If we are willing to insert more point releases between 0.9 and 1.0, or > we dont attach special significance to 1.0, I am happy to let 0.9 go. > Otherwise, I want to delay 0.9. We have been attaching no special significance to release numbers. We increment them by one and we have no roadmap (i.e. something that says 1.0 will have these and these features and they will have been tested extensively). > Thats a separate but related question, > and good to decide before we call the next release 0.9. 
> > That aside, I want to set a firm goal that a coaster feature we can call > production-quality be in the next point release after the one currently > in candidate state. Its OK to have known limitations, but it should do a > useful core set of things very well, with tests to validate it, and > documented so users know what they can and can not expect it to do. (Ie > when to use it and when not). > > I can see that this may take 8 weeks or more. I still want a list of the > known coaster improvements needed, and a time estimate for doing them, > and design discussion on any features that need it (such as coasters > that work well on OSG). Otherwise we cant tell if coasters needs 8 weeks > or 80 weeks. I started such a list in prior emails. > > Mihael, please merge that with a list of issues you see from emailed > reports, other issues you know of or have concerns about, and send out > to swift-devel. Which should I rather do now, continue implementing block allocations, or sort out discussions about coasters from the mailing list? (i.e. would you please let me do my job?) From wilde at mcs.anl.gov Tue Apr 21 13:32:54 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 21 Apr 2009 13:32:54 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases Message-ID: <49EE1156.3090107@mcs.anl.gov> Before 0.9 is released, I want to discuss an agreed upon sequence for the next set of enhancements and improvements, and a target set of dates and release numbers for them. It is a minor issue, but I see Swift 1.0 as having some greater significance as a release number. You may or may not agree - but it should be discussed and we should make a conscious decision on whats next. Also minor but going from 0.9 to 0.10 is odd. 0.09 to 0.10 would have been OK. 0.91 .92 makes more sense, or 0.9.1. I dont much care at the moment. 
The significance of calling something 1.0, though, will extend to funding issues, as our next NSF report will be due soon - June 1 I think - and concurrently I need to go to NSF and other agencies for continued funding, probably with an unsolicited proposal.

My preference is to do 2 or more point releases in the next 16 weeks and shoot for 1.0 by the end of August. The significance of that release is that it will be the last under the current round of NSF funding. So whether we call it 1.0, 1.1, whatever, it will be a milestone of sorts.

Below is a *very* rough draft of my feature priorities. I want to ask others to make a list of features important to you, so we can sort and prioritize, and turn them into a set of planned point releases. Ian challenged me to narrow it to 10 features; I came close but have not yet given it the thought it needs, looked through bugzilla, etc. Then I backslid, but to pick 10, we need to consider more. Serious effort on the feature and release list can start next week after 0.9 (and after my current grant pressure eases up a bit).

- Mike

*** rough features on Mike's mind:

(These are not sorted into categories yet, not of similar size; A few more distant features crept in; others are missing)

Some of these require language deliberation, which makes them harder. Some would make huge differences in usability (like better runtime errors, with line numbers. Also hard, but very valuable).

Clean up mapper model.
- Names and interfaces (and maybe model):
- pos params to make externs more compact;
- single_file_mapper to take expressions in its short form.
- re-consider mixed type mappings (file and scalar)

Structure assignment and related issues
- remove limitations on this
- maybe some similar language limits to address

Good run time error messages with source code line numbers
- summary of errors at end of output is ineffective
- review of messages in log that should go to output
- review of message content in log

Coaster polishing and testing/robustness
- address scheduler anomalies
- 3 modes of provisioning

Condor-G support

tc.data and PATH extensions
- rename tc.data (e.g.: applist.txt)?
- clean up its format, make simpler to add apps.

app() syntax review
- trailing ";" confusing
- quoting review and improvements if needed
- multi-line

Auto app install

Make sites & tc config easy/instant for osg, tg, local clusters
- productive ADEM

namespace::name:version for function/app identifiers

Global variables

Swift command
- multi-site selection
- run management (maybe a separate command)

Scaling
- Efficient handling of 1M and more jobs per wf
- what else needed here?

Library / #include feature

Provenance
- make it end-user useful and ready
- easy to get perf numbers
- easy to associate swift parameters with files and runs

More logging / status reporting enhancement
- enhance format and info content of log tools report

Review of iteration and related flow of control issues
- why iterate seems hard to use
- for, do, while?
- functional equivalents

Built-in and/or intrinsic functions
- remove @
- additional string functions, perhaps other related builtins (time, date, sizeof, ...)
- Built-in function extensibility - general external function mechanism - shell out? lib path?
-- Longer term:

Coasters on bgp, kraken and sico

IO and Data management improvements
- mtdm (cio) data management fully integrated, with extensibility
- broadcast
- pull
- batched output
- local data transfer bypass

From benc at hawaga.org.uk Tue Apr 21 13:49:12 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 18:49:12 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EE1156.3090107@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> Message-ID: On Tue, 21 Apr 2009, Michael Wilde wrote: > Before 0.9 is released, I want to discuss an agreed upon sequence for the next > set of enhancements and improvements, and a target set of dates and release > numbers for them. I think the below is good to discuss. I don't see how it affects the transition (or not) of 0.9rc2 to 0.9. > It is a minor issue, but I see Swift 1.0 as having some greater > significance as a release number. Yes, it does have significance. Though until you mentioned it today, I hadn't regard Swift as being > Also minor but going from 0.9 to 0.10 is odd. Not really. Lots of software uses . as a field separator for integers, rather than as a decimal-fraction indicator. > The significance of calling something 1.0 thought will extend to funding > issues, as our next NSF report will be due soon - June 1 I think - and > concurrently I need to go to NSF and other agencies for continued > funding, probably with an unsolicited proposal. OK. I don't really care what the versions are called as long as they are clearly ordered, so if you have a preference for 0.9 + 2 to be 1.0 that's fine. Or 0.9 + 1 be 1.0. > My preference is to do 2 or more point releases in the next 16 weeks and > shoot for 1.0 by the end of August. 2 releases match the existing release schedule for that time period, assuming that the first release candidate is released for each.
Rough dates on that schedule are late June and late August (so 1.0 would be the August one). However, rushing to put less-well tested code into trunk is likely to break the first-rc-is-release goal, as well as lowering the quality of experience for people who build from SVN. ==== > Condor-G support That exists in 0.9rc2, at least for regular job submission. > - why iterate seems hard to use Post some stuff to swift-devel about that. -- From andric at uchicago.edu Tue Apr 21 13:58:16 2009 From: andric at uchicago.edu (Michael Andric) Date: Tue, 21 Apr 2009 13:58:16 -0500 Subject: [Swift-devel] swift not working In-Reply-To: References: <1240334453.21927.0.camel@localhost> <1240334706.21927.3.camel@localhost> Message-ID: do you have a full path for this? "examples/swift/first.swift " On Tue, Apr 21, 2009 at 1:00 PM, Ben Clifford wrote: > > Unlike the other two, I can talk to bigred manually ok. I can also run > examples/swift/first.swift OK on bigred. > > Please try running examples/swift/first.swift against bigred from your > setup. > > Also send the swift version string from the start of swift output. > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > just tried on bigred - also failed > > [...]
> > Progress: Failed but can retry:3 > > Progress: Stage in:1 Failed but can retry:2 > > Exception occured in the exception handling code, so it cannot be > properly > > propagated to the user > > java.net.SocketException: Broken pipe > > at java.net.SocketOutputStream.socketWrite0(Native Method) > > at > > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) > > at java.net.SocketOutputStream.write(SocketOutputStream.java:115) > > at java.io.DataOutputStream.writeByte(DataOutputStream.java:136) > > at > > > org.globus.ftp.dc.EBlockImageDCWriter.endOfData(EBlockImageDCWriter.java:63) > > at > > > org.globus.ftp.dc.GridFTPTransferSourceThread.shutdown(GridFTPTransferSourceThread.java:62) > > at > > org.globus.ftp.dc.TransferSourceThread.run(TransferSourceThread.java:87) > > Progress: Failed but can retry:3 > > Failed to transfer wrapper log from AFNIsnr-20090421-1220-q9598ll1/info/o > on > > BIGRED > > Progress: Failed:1 Failed but can retry:2 > > Execution failed: > > Exception in AFNI_3dvolreg: > > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run4_trim, -base, > > ts.4_trim+orig[92], -prefix, volreg.RFL2.run4_trim, ts.4_trim+orig.BRIK] > > Host: BIGRED > > Directory: AFNIsnr-20090421-1220-q9598ll1/jobs/o/AFNI_3dvolreg-od58kp9j > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: > > Server refused performing the request. 
Custom message: (error > code > > 1) [Nested exception message: Custom message: Unexpected reply: 451 > ocurred > > during retrieve() > > org.globus.ftp.exception.ServerException: Refusing to start transfer > before > > previous transfer completes (error code 5) > > org.globus.ftp.exception.ServerException: Refusing to start transfer > before > > previous transfer completes (error code 5) > > at > > > org.globus.ftp.dc.TransferThreadManager.startTransfer(TransferThreadManager.java:129) > > at > > > org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:431) > > at org.globus.ftp.FTPClient.put(FTPClient.java:1289) > > at > > > org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:427) > > at > > > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:355) > > at > > > org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) > > at > > > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:492) > > at java.lang.Thread.run(Thread.java:595) > > ] > > gwynn 4% > > > > On Tue, Apr 21, 2009 at 12:25 PM, Mihael Hategan >wrote: > > > > > Ok. My point was, in general, that if there are such catastrophic > > > failures, spacing out in time a couple of attempts generally raises the > > > confidence that the problem is not some spooky transient issue. > > > > > > On Tue, 2009-04-21 at 17:19 +0000, Ben Clifford wrote: > > > > I can recreate non-swift brokenness on both of those sites. > > > > > > > > On Tue, 21 Apr 2009, Mihael Hategan wrote: > > > > > > > > > Did you also try that one again? > > > > > > > > > > On Tue, 2009-04-21 at 12:14 -0500, Michael Andric wrote: > > > > > > what about ucanl? 
> > > > > > > > > > > > On Tue, Apr 21, 2009 at 12:11 PM, Ben Clifford < > benc at hawaga.org.uk> > > > > > > wrote: > > > > > > > > > > > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > > > > > > > > > > > > > > > > > /disks/ci-gpfs/fmri/cnari/config/sites_bsd.xml > > > > > > > > > > > > > > > > > > I cannot make a manual submission to > gwynn.bsd.uchicago.edu > > > > > > using > > > > > > globus-job-run, so this is not a Swift problem. I think > this > > > > > > is something > > > > > > that support at ci should deal with. > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Apr 21 13:58:56 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 18:58:56 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> Message-ID: On Tue, 21 Apr 2009, Ben Clifford wrote: > Yes, it does have significance. Though until you mentioned it today, I > hadn't regard Swift as being oops, missed half that paragraph. I really should write my emails in order. "hadn't regarded Swift as being specifically at a 1.0 stage (whatever that means...)" > > Also minor but going from 0.9 to 0.10 is odd. > > Not really. Lots of software uses . as a field separate for integers, > rather than as a decimal-fraction indicator. > > > The significance of calling something 1.0 thought will extend to funding > > issues, as our next NSF report will be due soon - June 1 I think - and > > concurrently I need to go to NSF and other agencies for continued > > funding, probably with an unsolicited proposal. > > OK. 
I don't really care what the versions are called as long as they are > clearly ordered so if you have a preference for 0.9 + 2 to be 1.0 thats > fine. Or 0.9 + 1 be 1.0. > > > My preference is to do 2 or more point releases in the next 16 weeks and > > shoot for 1.0 by the end of August. > > 2 releases matches the existing release schedule for that time period, > assuming that the first release candidate is released for each. Rough > dates on that schedule are late June and late August (so 1.0 would be the > august one) > > However, rushing to put less-well tested code into trunk is likely to > break the first-rc-is-release goal, as well as lowering the quality of > experience for people who build from SVN. > > ==== > > > Condor-G support > > That exists in 0.9rc2, at least for regular job submission. > > > - why iterate seems hard to use > > Post some stuff to swift-devel about htat. > > From aespinosa at cs.uchicago.edu Tue Apr 21 13:54:18 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 21 Apr 2009 13:54:18 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49EE0883.4070908@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> Message-ID: <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> I ran the same test that was conducted w/ Zhao (his certificate does not yet have any specific VO membership) and have the following result: icating with the GridFTP server Caused by: Server refused performing the request. Custom message: Bad password. (error code 1) [Nested exception message: Custom message: Unexpected reply: 530-Login incorrect. : globus_gss_assist: Gridmap lookup failure: Could not map /DC=org/DC=doegrids/OU=People/CN=Allan M. Espinosa 374652 530- 530 End.] SWIFT RETURN CODE NON-ZERO - test 061-cattwo SITE FAIL!
Exit code 1 for site definition coaster/tgncsa-hg-coaster-pbs-gram4.xml These sites failed: coaster/fletch-coaster-gram2-gram2-condor.xml coaster/fletch-coaster-gram2-gram2-fork.xml coaster/teraport-gt2-gt2-pbs.xml coaster/tgncsa-hg-coaster-pbs-gram2.xml coaster/tgncsa-hg-coaster-pbs-gram4.xml These sites worked: coaster/coaster-local.xml coaster/renci-engage-coaster.xml cause of errors: 1. fletch has an expired host certificate 2. i don't have teragrid allocation on tgncsa (my roaming account expired). also i have a different certificate for using teragrid resources. attached in this email is the full log file. -Allan On Tue, Apr 21, 2009 at 12:55 PM, Zhao Zhang wrote: > [zzhang at communicado coaster]$ grid-cert-info > subject ? ? : DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 > issuer ? ? ?: DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1 > start date ?: Wed Feb 25 14:32:08 CST 2009 > end date ? ?: Thu Feb 25 14:32:08 CST 2010 > > > [zzhang at communicado coaster]$ grid-proxy-init > Your identity: DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 > Enter GRID pass phrase for this identity: ? ? ? ?Creating proxy, please > wait... > Proxy verify OK > Your proxy is valid until Wed Apr 22 00:54:47 CDT 2009 > > zhao > > Ben Clifford wrote: >> >> Please do all of what I asked. >> >> On Tue, 21 Apr 2009, Zhao Zhang wrote: >> >> >>> >>> Yep, I found that my certificate has expired. I check the webpage about >>> grid >>> proxy, it says I need to request a new certificate. >>> >>> zhao >>> >>> [zzhang at communicado ~]$ grid-proxy-init >>> Your identity: /DC=org/DC=doegrids/OU=People/CN=Zhao Zhang 385894 >>> Enter GRID pass phrase for this identity: >>> Creating proxy ........................................... Done >>> >>> >>> ERROR: Your certificate has expired: Thu Feb 26 12:47:51 2009 >>> >>> >>> >>> Ben Clifford wrote: >>> >>>> >>>> zhao, on communicado please type: >>>> >>>> grid-cert-info >>>> >>>> and paste the results here. 
>>>> >>>> Then type: grid-proxy-init and enter your password and paste the >>>> results. >>>> >>>> >>> >>> >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago -------------- next part -------------- A non-text attachment was scrubbed... Name: log Type: application/octet-stream Size: 19572 bytes Desc: not available URL: From benc at hawaga.org.uk Tue Apr 21 13:59:52 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 21 Apr 2009 18:59:52 +0000 (GMT) Subject: [Swift-devel] swift not working In-Reply-To: References: <1240334453.21927.0.camel@localhost> <1240334706.21927.3.camel@localhost> Message-ID: On Tue, 21 Apr 2009, Michael Andric wrote: > do you have a full path for this? "examples/swift/first.swift " Whereever your swift install is, next to the bin/ directory that the swift command lives in, you should find the examples directory: $ ls CHANGES.txt bin etc examples lib libexec -- From andric at uchicago.edu Tue Apr 21 14:05:45 2009 From: andric at uchicago.edu (Michael Andric) Date: Tue, 21 Apr 2009 14:05:45 -0500 Subject: [Swift-devel] swift not working In-Reply-To: References: <1240334453.21927.0.camel@localhost> <1240334706.21927.3.camel@localhost> Message-ID: (on bigred) #!/bin/tcsh swift first.swift -sites.file /disks/gpfs/fmri/cnari/swift/config/sites_bigred.xml -tc.file /disks/gpfs/fmri/cnari/swift/config/tc.data -user="andric" gwynn 3% swift_execute.sh 2009.04.21 14:04:47.291 CDT: [ERROR] Parsing profiles on line 1800 Illegal character ':'at position 60 :Illegal character ':' Swift svn swift-r2854 cog-r2382 RunID: 20090421-1404-zm2j2t4f Progress: Progress: Stage in:1 2009.04.21 14:04:53.436 CDT: [ERROR] Parsing profiles on line 1800 Illegal character ':'at position 60 :Illegal character ':' Progress: Submitted:1 Failed to transfer wrapper log from 
first-20090421-1404-zm2j2t4f/info/t on BIGRED Progress: Stage in:1 Progress: Submitted:1 Failed to transfer wrapper log from first-20090421-1404-zm2j2t4f/info/v on BIGRED Progress: Stage in:1 Progress: Submitted:1 Failed to transfer wrapper log from first-20090421-1404-zm2j2t4f/info/y on BIGRED Progress: Failed:1 Execution failed: Exception in echo: Arguments: [Hello, world!] Host: BIGRED Directory: first-20090421-1404-zm2j2t4f/jobs/y/echo-yysdop9j stderr.txt: stdout.txt: ---- Caused by: Cannot submit job Caused by: The job manager failed to open the user proxy gwynn 4% On Tue, Apr 21, 2009 at 1:59 PM, Ben Clifford wrote: > > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > do you have a full path for this? "examples/swift/first.swift " > > Whereever your swift install is, next to the bin/ directory that the swift > command lives in, you should find the examples directory: > > $ ls > CHANGES.txt bin etc examples lib libexec > > -- > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Apr 21 14:10:50 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 14:10:50 -0500 Subject: [Swift-devel] feature request In-Reply-To: <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> Message-ID: <1240341050.23910.3.camel@localhost> Thanks. I think the most valuable tests are those for which there is a working environment (i.e. valid proxy, valid allocation, etc.) 
Mihael On Tue, 2009-04-21 at 13:54 -0500, Allan Espinosa wrote: > I ran the same testt that was was conducted w/ Zhao (his certificates > does not yet have any specific vo membership) and have the following > result: > > icating with the GridFTP server > Caused by: > Server refused performing the request. Custom message: Bad password. > (error code 1) [Nested exception message: Custom message: Unexpected > reply: 530-Login incorrect. : globus_gss_assist: Gridmap lookup > failure: Could not map /DC=org/DC=doegrids/OU=People/CN=Allan M. > Espinosa 374652 > 530- > 530 End.] > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > SITE FAIL! Exit code 1 for site definition > coaster/tgncsa-hg-coaster-pbs-gram4.xml > These sites failed: coaster/fletch-coaster-gram2-gram2-condor.xml > coaster/fletch-coaster-gram2-gram2-fork.xml > coaster/teraport-gt2-gt2-pbs.xml > coaster/tgncsa-hg-coaster-pbs-gram2.xml > coaster/tgncsa-hg-coaster-pbs-gram4.xml > These sites worked: coaster/coaster-local.xml coaster/renci-engage-coaster.xml > > > cause of errors: 1. fletch has an expired host certificate > 2. i don't have teragrid allocation on tgncsa (my roaming account > expired). also i have a different certificate for using teragrid > resources. > > attached in this email is the full log file. > > -Allan > > > On Tue, Apr 21, 2009 at 12:55 PM, Zhao Zhang wrote: > > [zzhang at communicado coaster]$ grid-cert-info > > subject : DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 > > issuer : DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1 > > start date : Wed Feb 25 14:32:08 CST 2009 > > end date : Thu Feb 25 14:32:08 CST 2010 > > > > > > [zzhang at communicado coaster]$ grid-proxy-init > > Your identity: DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 > > Enter GRID pass phrase for this identity: Creating proxy, please > > wait... 
> > Proxy verify OK > > Your proxy is valid until Wed Apr 22 00:54:47 CDT 2009 > > > > zhao > > > > Ben Clifford wrote: > >> > >> Please do all of what I asked. > >> > >> On Tue, 21 Apr 2009, Zhao Zhang wrote: > >> > >> > >>> > >>> Yep, I found that my certificate has expired. I check the webpage about > >>> grid > >>> proxy, it says I need to request a new certificate. > >>> > >>> zhao > >>> > >>> [zzhang at communicado ~]$ grid-proxy-init > >>> Your identity: /DC=org/DC=doegrids/OU=People/CN=Zhao Zhang 385894 > >>> Enter GRID pass phrase for this identity: > >>> Creating proxy ........................................... Done > >>> > >>> > >>> ERROR: Your certificate has expired: Thu Feb 26 12:47:51 2009 > >>> > >>> > >>> > >>> Ben Clifford wrote: > >>> > >>>> > >>>> zhao, on communicado please type: > >>>> > >>>> grid-cert-info > >>>> > >>>> and paste the results here. > >>>> > >>>> Then type: grid-proxy-init and enter your password and paste the > >>>> results. > >>>> > >>>> > >>> > >>> > >> > >> > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Apr 21 14:12:11 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 21 Apr 2009 14:12:11 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240337102.22528.10.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> Message-ID: <49EE1A8B.1@mcs.anl.gov> > Which should I rather do now, continue implementing block allocations, > or sort out discussions about coasters from the mailing list? 
You should do this: First send a sentence or two on how block allocations are going: whats involved, and how long you think they will take till checked in. What kind of testing will be required to ensure functional and stable. Then continue developing them. Between now and Friday, do the list I asked for. > (i.e. would you please let me do my job?) Yes. I know development takes concentration. To achieve that, you manage your time, and the rate at which you read and answer email or take any other interrupts. If something needs immediate attention, I or others say so, or your best judgment does. If not, it doesn't, but needs eventual attention. A list of coaster issues was started on swift-devel Feb 13: http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-February/004511.html A report on progress and summary of open design issues and work remaining would benefit everyone. Find a good stopping point in the next few days, then make list of work items and design and testing issues on coasters. Then discuss and get feedback. Then develop what you listed. Then repeat. That *is* your job. From hategan at mcs.anl.gov Tue Apr 21 15:02:10 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 15:02:10 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <49EE1A8B.1@mcs.anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> Message-ID: <1240344130.24159.40.camel@localhost> On Tue, 2009-04-21 at 14:12 -0500, Michael Wilde wrote: > > Which should I rather do now, continue implementing block allocations, > > or sort out discussions about coasters from the mailing list? > > You should do this: > > First send a sentence or two on how block allocations are going: whats > involved, and how long you think they will take till checked in. What > kind of testing will be required to ensure functional and stable. 
I haven't explicitly done so because I thought I mentioned this before. I have an algorithm that I tested with various loads and configurations. It can work with limited numbers of jobs, and pre-set allocations (though there are some tweaks there that are still needed - a major one being when to choose pre-existing allocations instead of automated allocations; for example, if the pre-existing allocation is tomorrow, and I submit now, I'd presumably want some jobs to be started before tomorrow). I need to now plug that into the rest of the coaster code (this is the part I'm doing right now), find ways to specify parameters in the sites file, efficiently ship those parameters remotely, and do some basic testing. I estimate that to take somewhere in the range of one to two weeks of development time, but that's not with very high confidence. The kind of testing needed to ensure stability is not different from what we've discussed and what we already know: we'll need to run synthetic tests on various sites, and we'll need to run existing applications on various sites. This is not new. > > Then continue developing them. > > Between now and Friday, do the list I asked for. > > > (i.e. would you please let me do my job?) > > Yes. > > I know development takes concentration. To achieve that, you manage your > time, and the rate at which you read and answer email or take any other > interrupts. > > If something needs immediate attention, I or others say so, or your best > judgment does. If not, it doesn't, but needs eventual attention. 
> > A list of coaster issues was started on swift-devel Feb 13: > http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-February/004511.html There are 4 issues there: - bootstrap issues (these are solved - this was discussed on the mailing list) - sites.xml attribute for java - the priority of this has become lower because the detection is done better now, but it should probably still be implemented (this was also on the mailing list as far as I can remember; also, I think one can specify JAVA_HOME or make sure java is in the path using the environment or otherwise) - service on the worker node - after block allocations, unless Ben does otherwise - the scalability problem - we're doing block allocations mainly to address this > > A report on progress and summary of open design issues and work > remaining would benefit everyone. > > Find a good stopping point in the next few days, then make list of work > items and design and testing issues on coasters. > > Then discuss and get feedback. Then develop what you listed. Then > repeat. That *is* your job. > I wish :) Yep. We discussed, got feedback, and I was now developing. Discussing the same issues again was interfering with my developing. So let me re-state this: "I am now developing coaster block allocations. I will send feedback from time to time, when I have interesting things to send feedback about. I will probably have testable code in 1 to 2 weeks." From foster at anl.gov Tue Apr 21 16:16:36 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 21 Apr 2009 16:16:36 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240344130.24159.40.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> Message-ID: <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> Is it possible to argue for simplicity in these algorithms? 
E.g., if I want to submit against an allocation, I should be able to do that, and not have the algorithm second-guess me and do something different. Having more complex things is ok, too, as long as they can easily be turned off--or (my recommendation) require that they be turned on explicitly. Ian. On Apr 21, 2009, at 3:02 PM, Mihael Hategan wrote: > On Tue, 2009-04-21 at 14:12 -0500, Michael Wilde wrote: >>> Which should I rather do now, continue implementing block >>> allocations, >>> or sort out discussions about coasters from the mailing list? >> >> You should do this: >> >> First send a sentence or two on how block allocations are going: >> whats >> involved, and how long you think they will take till checked in. What >> kind of testing will be required to ensure functional and stable. > > I haven't explicitly done so because I thought I mentioned this > before. > > I have an algorithm that I tested with various loads and > configurations. > It can work with limited numbers of jobs, and pre-set allocations > (though there are some tweaks there that are still needed - a major > one > being when to choose pre-existing allocations instead of automated > allocations; for example, if the pre-existing allocation is tomorrow, > and I submit now, I'd presumably want some jobs to be started before > tomorrow). > > I need to now plug that into the rest of the coaster code (this is the > part I'm doing right now), find ways to specify parameters in the > sites > file, efficiently ship those parameters remotely, and do some basic > testing. I estimate that to take somewhere in the range of one to two > weeks of development time, but that's not with very high confidence. > > The kind of testing needed to ensure stability is not different from > what we've discussed and what we already know: we'll need to run > synthetic tests on various sites, and we'll need to run existing > applications on various sites. This is not new. > >> >> Then continue developing them. 
>> >> Between now and Friday, do the list I asked for. >> >>> (i.e. would you please let me do my job?) >> >> Yes. >> >> I know development takes concentration. To achieve that, you manage >> your >> time, and the rate at which you read and answer email or take any >> other >> interrupts. >> >> If something needs immediate attention, I or others say so, or your >> best >> judgment does. If not, it doesn't, but needs eventual attention. >> >> A list of coaster issues was started on swift-devel Feb 13: >> http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-February/004511.html > > There are 4 issues there: > - bootstrap issues (these are solved - this was discussed on the > mailing > list) > - sites.xml attribute for java - the priority of this has become lower > because the detection is done better now, but it should probably still > be implemented (this was also on the mailing list as far as I can > remember; also, I think one can specify JAVA_HOME or make sure java is > in the path using the environment or otherwise) > - service on the worker node - after block allocations, unless Ben > does > otherwise > - the scalability problem - we're doing block allocations mainly to > address this > >> >> A report on progress and summary of open design issues and work >> remaining would benefit everyone. >> >> Find a good stopping point in the next few days, then make list of >> work >> items and design and testing issues on coasters. >> >> Then discuss and get feedback. Then develop what you listed. Then >> repeat. That *is* your job. >> > > I wish :) > > Yep. We discussed, got feedback, and I was now developing. Discussing > the same issues again was interfering with my developing. So let me > re-state this: "I am now developing coaster block allocations. I will > send feedback from time to time, when I have interesting things to > send > feedback about. I will probably have testable code in 1 to 2 weeks." 
> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Apr 21 16:44:24 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 16:44:24 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> Message-ID: <1240350264.26210.20.camel@localhost> On Tue, 2009-04-21 at 16:16 -0500, Ian Foster wrote: > Is it possible to argue for simplicity in these algorithms? > Yes. > E.g., if I want to submit against an allocation, I should be able to > do that, and not have the algorithm second-guess me and do something > different. Yes. That would be one particular case. > > Having more complex things is ok, too, as long as they can easily be > turned off--or (my recommendation) require that they be turned on > explicitly. What you say does beg for a couple of questions: - if all work is done in a run but the allocation has more time left, should the workers be shut down or not? - if more work remains to be done in a run after an explicit allocation was used, should the system attempt to allocate more nodes? If not, should it hang? Fail? - if the allocation is far in the distance from now, and a run is started now, is allocating nodes now a matter of second-guessing or a matter of trying to finish the work faster? What, besides alleged complexity of the algorithm, would be the downside of doing so? 
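One way to reconcile Ian's "do X means do X" request with the questions Mihael raises is a policy switch where any automatic behavior is strictly opt-in. The sketch below is purely illustrative: `AllocationPolicy`, `Mode`, and `extraWorkers` are invented names, not part of the actual coaster settings, and the top-up heuristic in automatic mode is a placeholder.

```java
// Hypothetical sketch: automatic allocation is off unless explicitly enabled,
// so "submit against my allocation" means exactly that.
public class AllocationPolicy {
    public enum Mode { EXPLICIT_ONLY, AUTOMATIC }

    private final Mode mode;

    public AllocationPolicy(Mode mode) {
        this.mode = mode;
    }

    /** Workers to request beyond those the user allocated by hand. */
    public int extraWorkers(int queuedJobs, int userAllocatedWorkers) {
        if (mode == Mode.EXPLICIT_ONLY) {
            return 0; // never second-guess an explicit allocation
        }
        // Opt-in automatic mode: top up toward one worker per queued job.
        return Math.max(0, queuedJobs - userAllocatedWorkers);
    }

    public static void main(String[] args) {
        AllocationPolicy explicit = new AllocationPolicy(Mode.EXPLICIT_ONLY);
        AllocationPolicy auto = new AllocationPolicy(Mode.AUTOMATIC);
        System.out.println(explicit.extraWorkers(100, 8)); // 0
        System.out.println(auto.extraWorkers(100, 8));     // 92
    }
}
```

Under this scheme the "allocation is tomorrow, submit now" case becomes a question of which mode the user chose, rather than something the algorithm decides on its own.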
From hategan at mcs.anl.gov Tue Apr 21 20:53:50 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 20:53:50 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240344130.24159.40.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> Message-ID: <1240365230.810.7.camel@localhost> On Tue, 2009-04-21 at 15:02 -0500, Mihael Hategan wrote: > The kind of testing needed to ensure stability is not different from > what we've discussed and what we already know: we'll need to run > synthetic tests on various sites, and we'll need to run existing > applications on various sites. This is not new. There would be an addition that I would like to make there. In a sense, the simulator I sent last week was part of an effort to test coasters without using any remote resources, so that specifically problems with the allocation algorithm can be isolated. In addition to that I'm contemplating the idea of having a fake provider that can quickly simulate one or more large clusters. I'm not sure how doable that is, and how many useful results that can provide for real life applications, but if done I think it would provide timely feedback on a number of components in the coaster code. 
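The fake provider Mihael describes might look roughly like the sketch below: jobs "complete" on a timer rather than on a real cluster, with wall time divided by a compression factor so long jobs finish quickly. All names here are invented for illustration; this is not the real CoG provider interface.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a fake provider: each submitted job fires a completion callback
// after its compressed walltime elapses, so a one-hour job runs "in a snap"
// and a two-hour job in two snaps.
public class FakeProvider {
    // Small pool of scheduler threads; a single timer thread would serialize
    // the completion events.
    private final ScheduledExecutorService timers =
        Executors.newScheduledThreadPool(4);
    private final long compression; // e.g. 3600 => 1h walltime ~ 1s real time

    public FakeProvider(long compression) {
        this.compression = compression;
    }

    /** Real milliseconds to simulate the given walltime. */
    public long compressedDelayMs(long walltimeSeconds) {
        return (walltimeSeconds * 1000) / compression;
    }

    /** "Submit" a job; onCompleted fires when its compressed walltime elapses. */
    public void submit(String jobId, long walltimeSeconds, Runnable onCompleted) {
        timers.schedule(onCompleted, compressedDelayMs(walltimeSeconds),
                        TimeUnit.MILLISECONDS);
    }

    public void shutdown() {
        timers.shutdown();
    }
}
```

A companion faked clock (scaling the real clock by the same factor) would also be needed, since block-allocation planning depends on how much time is left in a block, and therefore on what time it "is" now.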
From iraicu at cs.uchicago.edu Tue Apr 21 21:03:26 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 21 Apr 2009 21:03:26 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240365230.810.7.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <1240365230.810.7.camel@localhost> Message-ID: <49EE7AEE.1000309@cs.uchicago.edu> I don't want to throw Falkon back in the mix, but Falkon does have an emulator as well that can probably do the simulation of multiple clusters as you said you wanted. This assumes Coasters will run over Falkon, and that implementing this in Coasters or a fake provider would take significant effort. Ioan Mihael Hategan wrote: > On Tue, 2009-04-21 at 15:02 -0500, Mihael Hategan wrote: > > >> The kind of testing needed to ensure stability is not different from >> what we've discussed and what we already know: we'll need to run >> synthetic tests on various sites, and we'll need to run existing >> applications on various sites. This is not new. >> > > There would be an addition that I would like to make there. In a sense, > the simulator I sent last week was part of an effort to test coasters > without using any remote resources, so that specifically problems with > the allocation algorithm can be isolated. In addition to that I'm > contemplating the idea of having a fake provider that can quickly > simulate one or more large clusters. I'm not sure how doable that is, > and how many useful results that can provide for real life applications, > but if done I think it would provide timely feedback on a number of > components in the coaster code. 
> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Apr 21 21:11:26 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 21:11:26 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <49EE7AEE.1000309@cs.uchicago.edu> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <1240365230.810.7.camel@localhost> <49EE7AEE.1000309@cs.uchicago.edu> Message-ID: <1240366286.1648.4.camel@localhost> On Tue, 2009-04-21 at 21:03 -0500, Ioan Raicu wrote: > I don't want to through Falkon back in the mix, but Falkon does have > an emulator as well, that can probably do the simulation of multiple > clusters as you said you wanted. This is assuming Coasters will run > over Falkon, I think it's a matter of saying "deef" instead of something else in the provider. > and you need significant effort to implement it in Coaster or a fake > provider. I think a provider to simulate running a task is probably not the difficult part. 
More difficult may be a mechanism to compress time, such that a 1 hour job will run in a snap, while a two hour job will run in two snaps. I'll let you know as I get a better idea formed in my head. From iraicu at cs.uchicago.edu Tue Apr 21 21:31:15 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 21 Apr 2009 21:31:15 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240366286.1648.4.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <1240365230.810.7.camel@localhost> <49EE7AEE.1000309@cs.uchicago.edu> <1240366286.1648.4.camel@localhost> Message-ID: <49EE8173.30101@cs.uchicago.edu> Mihael Hategan wrote: > On Tue, 2009-04-21 at 21:03 -0500, Ioan Raicu wrote: > >> and you need significant effort to implement it in Coaster or a fake >> provider. >> > > I think a provider to simulate running a task is probably not the > difficult part. More difficult may be a mechanism to compress time, such > that a 1 hour job will run in a snap, while a two hour job will run in > two snaps. > > I'll let you know as I get a better idea formed in my head. > > I did this with timers. Each task has a max wall time, that dictates when the timer will fire for a particular task to mark its completion. I simulated up to 1M processors, and billions of tasks, so the timer based approach in Java seems to scale quite well to extremes of millions of concurrent timers. If you are interested, I can dig through the Falkon code where I implemented this logic of emulating workers. Ioan > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Apr 21 21:47:46 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 21 Apr 2009 21:47:46 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <49EE8173.30101@cs.uchicago.edu> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <1240365230.810.7.camel@localhost> <49EE7AEE.1000309@cs.uchicago.edu> <1240366286.1648.4.camel@localhost> <49EE8173.30101@cs.uchicago.edu> Message-ID: <1240368466.2927.7.camel@localhost> On Tue, 2009-04-21 at 21:31 -0500, Ioan Raicu wrote: > > > Mihael Hategan wrote: > > On Tue, 2009-04-21 at 21:03 -0500, Ioan Raicu wrote: > > > > > and you need significant effort to implement it in Coaster or a fake > > > provider. > > > > > > > I think a provider to simulate running a task is probably not the > > difficult part. More difficult may be a mechanism to compress time, such > > that a 1 hour job will run in a snap, while a two hour job will run in > > two snaps. > > > > I'll let you know as I get a better idea formed in my head. > > > > > I did this with timers. Each task has a max wall time, that dictates > when the timer will fire for a particular task to mark its completion. > I simulated up to 1M processors, and billions of tasks, so the timer > based approach in Java seems to scale quite well to extremes of > millions of concurrent timers. 
One problem is that if you use the same timer object, the events they generate aren't really concurrent (which is why it's so scalable). Each timer is a thread that goes through a loop of sorted tasks. I think a pool of a few timers would do the trick though. > If you are interested, I can dig through the Falkon code where I > implemented this logic of emulating workers. Right. That's almost all that's needed, but there's also the current time which needs to be faked given that planning depends on how much time is left in a block, and that depends on what time it is now. Seems fairly straightforward. From skenny at uchicago.edu Tue Apr 21 23:05:12 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 21 Apr 2009 23:05:12 -0500 (CDT) Subject: [Swift-devel] swift not working Message-ID: <20090421230512.BVZ33715@m4500-02.uchicago.edu> i've added a sites file for gram2 on ucanl64 here: /disks/ci-gpfs/fmri/cnari/swift/config/sites_ucanl64_gt2.xml i was able to get a swift job thru w/that just now. andric, you should be able to use that until gram4 is fixed on the site--unless you tried this already and i missed it, but this thread is long and what am i doing reading work email on vacation anyway? 
heh ;) ~sk ---- Original message ---- >Date: Tue, 21 Apr 2009 14:05:45 -0500 >From: Michael Andric >Subject: Re: [Swift-devel] swift not working >To: Ben Clifford >Cc: swift-devel at ci.uchicago.edu > > (on bigred) > #!/bin/tcsh > swift first.swift -sites.file > /disks/gpfs/fmri/cnari/swift/config/sites_bigred.xml > -tc.file /disks/gpfs/fmri/cnari/swift/config/tc.data > -user="andric" > gwynn 3% swift_execute.sh > 2009.04.21 14:04:47.291 CDT: [ERROR] Parsing > profiles on line 1800 Illegal character ':'at > position 60 :Illegal character ':' > Swift svn swift-r2854 cog-r2382 > RunID: 20090421-1404-zm2j2t4f > Progress: > Progress: Stage in:1 > 2009.04.21 14:04:53.436 CDT: [ERROR] Parsing > profiles on line 1800 Illegal character ':'at > position 60 :Illegal character ':' > Progress: Submitted:1 > Failed to transfer wrapper log from > first-20090421-1404-zm2j2t4f/info/t on BIGRED > Progress: Stage in:1 > Progress: Submitted:1 > Failed to transfer wrapper log from > first-20090421-1404-zm2j2t4f/info/v on BIGRED > Progress: Stage in:1 > Progress: Submitted:1 > Failed to transfer wrapper log from > first-20090421-1404-zm2j2t4f/info/y on BIGRED > Progress: Failed:1 > Execution failed: > Exception in echo: > Arguments: [Hello, world!] > Host: BIGRED > Directory: > first-20090421-1404-zm2j2t4f/jobs/y/echo-yysdop9j > stderr.txt: > stdout.txt: > ---- > Caused by: > Cannot submit job > Caused by: > The job manager failed to open the user > proxy > gwynn 4% > On Tue, Apr 21, 2009 at 1:59 PM, Ben Clifford > wrote: > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > do you have a full path for this? > "examples/swift/first.swift " > > Wherever your swift install is, next to the bin/ > directory that the swift > command lives in, you should find the examples > directory: > > $ ls > CHANGES.txt bin etc > examples lib 
libexec > -- >________________ >_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From andric at uchicago.edu Wed Apr 22 01:16:21 2009 From: andric at uchicago.edu (Michael Andric) Date: Wed, 22 Apr 2009 01:16:21 -0500 Subject: [Swift-devel] swift not working In-Reply-To: <20090421230512.BVZ33715@m4500-02.uchicago.edu> References: <20090421230512.BVZ33715@m4500-02.uchicago.edu> Message-ID: that worked.(HUGE thanks, Sarah) On Tue, Apr 21, 2009 at 11:05 PM, wrote: > i've added a sites file for gram2 on ucanl64 here: > > /disks/ci-gpfs/fmri/cnari/swift/config/sites_ucanl64_gt2.xml > > i was able to get a swift job thru w/that just now. andric, > you should be able to use that until gram4 is fixed on the > site--unless you tried this already and i missed it, but this > thread is long and what am i doing reading work email on > vacation anyway? heh ;) > > ~sk > > ---- Original message ---- > >Date: Tue, 21 Apr 2009 14:05:45 -0500 > >From: Michael Andric > >Subject: Re: [Swift-devel] swift not working > >To: Ben Clifford > >Cc: swift-devel at ci.uchicago.edu > > > > (on bigred) > > #!/bin/tcsh > > swift first.swift -sites.file > > /disks/gpfs/fmri/cnari/swift/config/sites_bigred.xml > > -tc.file /disks/gpfs/fmri/cnari/swift/config/tc.data > > -user="andric" > > gwynn 3% swift_execute.sh > > 2009.04.21 14:04:47.291 CDT: [ERROR] Parsing > > profiles on line 1800 Illegal character ':'at > > position 60 :Illegal character ':' > > Swift svn swift-r2854 cog-r2382 > > RunID: 20090421-1404-zm2j2t4f > > Progress: > > Progress: Stage in:1 > > 2009.04.21 14:04:53.436 CDT: [ERROR] Parsing > > profiles on line 1800 Illegal character ':'at > > position 60 :Illegal character ':' > > Progress: Submitted:1 > > Failed to transfer wrapper log from > > first-20090421-1404-zm2j2t4f/info/t on BIGRED > > Progress: Stage in:1 > > Progress: Submitted:1 > > Failed to transfer wrapper 
log from > > first-20090421-1404-zm2j2t4f/info/v on BIGRED > > Progress: Stage in:1 > > Progress: Submitted:1 > > Failed to transfer wrapper log from > > first-20090421-1404-zm2j2t4f/info/y on BIGRED > > Progress: Failed:1 > > Execution failed: > > Exception in echo: > > Arguments: [Hello, world!] > > Host: BIGRED > > Directory: > > first-20090421-1404-zm2j2t4f/jobs/y/echo-yysdop9j > > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > > Cannot submit job > > Caused by: > > The job manager failed to open the user > > proxy > > gwynn 4% > > On Tue, Apr 21, 2009 at 1:59 PM, Ben Clifford > > wrote: > > > > On Tue, 21 Apr 2009, Michael Andric wrote: > > > > > do you have a full path for this? > > "examples/swift/first.swift " > > > > Whereever your swift install is, next to the bin/ > > directory that the swift > > command lives in, you should find the examples > > directory: > > > > $ ls > > CHANGES.txt bin etc > > examples lib libexec > > -- > >________________ > >_______________________________________________ > >Swift-devel mailing list > >Swift-devel at ci.uchicago.edu > >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Wed Apr 22 04:55:19 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 09:55:19 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EE1156.3090107@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> Message-ID: I would argue fairly strongly against doing anything particularly revolutionary before a 1.0 release - the time for developing significant 1.0 new features, if 1.0 is to be in August, is pretty much over (if you regard the development period for that being the past 3 years); experience has shown that it takes many many months for there to be enough developer/user iterations to get stuff working well. 
Pay attention especially to the special significance of calling a release 1.0 - nothing in there should be a "development feature" that has only recently been made, and perhaps little development time should be spent after the son-of-0.9 release on working on new stuff vs fixing existing stuff. > Ian challenged me to narrow it to 10 features; I came close but have not yet > given it the thought it needs, looked through bugzilla, etc. Then I backslid, > but to pick 10, we need to consider more. I think picking 10 things when those things can vary wildly in scope, is not a good thing to do. The below commentary is wrt what can be done for an August 1.0 release, not whether they are worth doing in general in the longer term. > Clean up mapper model. definitely not. far too much scope for breakage, and definitely breaking of everyone's understanding of how mappers work now. > Structure assignment and related issues > > - remove limitations on this > - maybe some similar language limits to address needs more description. but there is a relatively small set of things that can be implemented that look more like bugs in implementation / not-implemented-yet that are ok to do. anything that is significantly changing existing semantics should not happen. > Good run time error messages with source code line numbers > > - summary of errors at end of output is ineffective > - review of messages in log that should go to output > - review of message content in log yes. > Coaster polishing and testing/robustness > > - address scheduler anomalies > - 3 modes of provisioning > > Condor-G support it's there in 0.9rc2 but needs more testing and debugging. > tc.data and PATH extensions neither of the two subbullets talk about PATH > - rename tc.data (e.g.: applist.txt)? no. changing one name to another doesn't help people, and breaks both existing users and documentation. > - clean up its format, make simpler to add apps. yes. this is a usability-killer. 
> > app() syntax review > > - trailing ";" confusing > - quoting review and improvements if needed > - multi-line If someone can come up with a compelling new format for app syntax then it should be straightforward to implement it. But coming up with that format is the hard thing. I don't think that lopping off a semicolon makes it particularly less or more intuitive compared to the rest of the app syntax. > Auto app install single-executable staging is, I think, easily achievable. anything more complex than that is not achievable for a 1.0 release. That does not preclude ongoing ADEM (or related to ADEM) development as a separate component (much as Falkon is or the log-processing stuff was) > Make sites & tc config easy/instant for osg, tg, local clusters that is too wishy-washy a goal. for osg, I am pleased with the present site generation stuff. my main concern is that it does not generate tc.data entries for /bin/sh, which is how self-deployed apps are made at the moment. But this might be addressed either in the single-app-deployment or tc.data points above. > - productive ADEM this should be orthogonal to Swift releases. > namespace::name:version for function/app identifiers no. the implementation of this runs too deep throughout the code, and there is no real consensus at the moment on how these should behave. > Global variables yes. I think pretty easy and non-disruptive > Swift command > > - multi-site selection By that I take it to mean a commandline parameter to specifically restrict the set of sites to run on. in which case, yes, should be easy. > - run management (maybe a separate command) This needs breaking up into specific features, many of which the answer will be yes to. > Scaling > > - Efficient handling of 1M and more jobs per wf > - what else needed here? At least some of this is karajan threading work, which pretty much means hategan has to do it. 
I don't pretend to understand how easy or hard it is, but I would be very wary about introducing major changes to the threading system there. > Library / #include feature By library here I take it you mean including swift code, not linking to other non-swift libraries. in which case, yes, I think a simple include mechanism can go in - you already have prototypes that show the semantics. > Provenance > > - make it end-user useful and ready > - easy to get perf numbers > - easy to associate swift parameters with files and runs yes. (the rest of this paragraph is a rant) I am fairly happy with the ongoing development of this at the moment for PC3. However, there has been a consistent lack of feedback for the provenance work. I attribute no blame on any person for this, but it is the case that no one aside from Luiz has spent a significant amount of time working with the provenance code over the 16 months that it has been in the SVN. People throw in plenty of feature requests, some of which already exist, and then make no use of them. If other people in the group do not test out features in the provenance code properly (i.e. more than spending one hour every month) then people cannot complain when the features do not work how they want. The same applies to the log processing graphs - people need to come up with concrete technical criticisms rather than vague "please make it better" > More logging / status reporting enhancement > - enhance format and info content of log tools report see the above rant. > > Review of iteration and related flow of control issues > - why iterate seems hard to use > - for, do, while? > - functional equivalents Major changes in the language should not happen before 1.0. So no. Ongoing review of the semantics and syntax is a good thing to do, distinct from 1.0. However, I think you need to have a better understanding of what the semantics are now before starting to suggest changes. > Built-in and/or intrinsic functions > - remove @ yes. 
pretty easy, I think. > - additional string functions, perhaps other related builtins > - (time, date, sizeof, ...) I don't think we should go about adding large numbers of builtins without clear use for them. So I think each proposed builtin should be proposed on its individual merits, so there is no yes/no answer for this point. > - Built-in function extensibility what does that mean? > - general external function mechanism - shell out? lib path? no for 1.0 - mechanisms already exist for this, although they are syntactically ugly. As this list is primarily about 1.0 features, I do not comment on your Long-term section which appeared here, other than to say they are all definitely post 1.0 -- From benc at hawaga.org.uk Wed Apr 22 05:26:15 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 10:26:15 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> Message-ID: You also asked for other stuff that could go into 1.0. There are a bunch of bugs/enhancements that are relatively minor in the bugzilla that I would like to happen, but it's not tragic if they don't. The most disrupting change I have on a local development branch is a very rough prototype of some different site semantics that I wrote last week to work with i) the gLite workload management system and ii) condor in "local cluster without a shared filesystem" mode The implementation I have is in no way suitable for inclusion in Swift, as careful study of my next email to this should make apparent. However, both running on gLite and running on condor pools without a shared fs have been long term goals. I think the appropriate course of action is that I continue development on these features, and that they should go into trunk when they are ready, which may turn out to be before son-of-0.9 or may turn out to be after 1.0. 
-- From benc at hawaga.org.uk Wed Apr 22 07:12:28 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 12:12:28 +0000 (GMT) Subject: [Swift-devel] swift with input and output sandboxes instead of a shared site filesystem Message-ID: I implemented an extremely poor quality prototype to try to get my head around some of the execution semantics when running through: i) the gLite workload management system (hereafter, WMS) as used in the South African National Grid (my short-term interest) and in EGEE (my longer term interest) ii) condor used as an LRM to manage nodes which do not have a shared filesystem, without any "grid stuff" involved In the case of the WMS, it is a goal to have the WMS perform site selection, rather than submitting clients (such as Swift). I don't particularly agree with this, but there it is. In the case of condor-with-no-shared-fs, one of the basic requirements of a Swift site is violated - that of an easily accessible shared file system. Both the WMS and condor provide an alternative to Swift's file management model; and both of their approaches look similar. In a job submission, one specifies the files to be staged into an arbitrary working directory before execution, and the files to be staged out after execution. My prototype is intended to get practical experience interfacing Swift to a job submission with those semantics. What I have done in my implementation is rip out almost the entirety of the execute2/site file cache/wrapper.sh layers, and replace it with a callout to a user-specified shell script. The shell script is passed the submit-side paths of input files and of output files, and the commandline. The shell script is then entirely responsible for causing the job to run somewhere and for doing appropriate input and output staging. Into this shell interface, I then have two scripts, one for sagrid/glite and one for condor-with-no-shared-fs. 
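The callout layer described above could be driven from the submit side by something like the sketch below. The `--in`/`--out`/`--` argument convention is invented for illustration; the email does not specify the prototype's actual calling convention.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: Swift hands a user-supplied script the submit-side paths of input
// and output files plus the command line; the script owns all staging and
// execution.
public class StagingCallout {

    static List<String> buildArgv(String script, List<String> inputs,
                                  List<String> outputs, List<String> commandLine) {
        List<String> argv = new ArrayList<>();
        argv.add(script);
        for (String in : inputs)   { argv.add("--in");  argv.add(in);  }
        for (String out : outputs) { argv.add("--out"); argv.add(out); }
        argv.add("--"); // everything after this is the application command
        argv.addAll(commandLine);
        return argv;
    }

    /** Run the script and block until it reports the job done. */
    public static int run(String script, List<String> inputs,
                          List<String> outputs, List<String> commandLine)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                buildArgv(script, inputs, outputs, commandLine))
            .inheritIO().start();
        return p.waitFor(); // the script polls the LRM/WMS until the job ends
    }
}
```

The same interface serves both back ends; only the script body (glite-wms-job-submit vs. condor_submit syntax) differs.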
They are similar to each other, differing only in the syntax of the submission commands and files. These scripts create a single input tarball, create a job submission file, submit to the appropriate submit command, hang around polling for status until the job is finished, and unpack an output tarball. Tarballs are used rather than explicitly listing each input and output file for two reasons: i) if an output file is missing (perhaps due to application failure) I would like the job submission to still return what it has (most especially remote log files). As long as a tarball is made with *something*, this works. ii) condor (and perhaps WMS) apparently cannot handle directory hierarchies in their stagein/stageout parameters. I have tested on the SAgrid testing environment (for WMS) and this works (although quite slowly, as the WMS reports job status changes quite slowly); and on a condor installation on gwynn.bsd.uchicago.edu (this has a shared filesystem, so is not a totally satisfactory test). I also sent this to Mats to test in his environment (as a project he has was my immediate motivation for the condor side of this). This prototype approach loses a huge chunk of Swift execution-side functionality such as replication, clustering, coasters (deliberately - I was targeting getting SwiftScript programs running, rather than getting a decent integration with the interesting execution stuff we have made). As such, it is entirely inappropriate for production (or even most experimental) use. However, it has given me another perspective on submitting jobs to the above two environments. For condor: The zipped input/output sandbox approach seems to work nicely. 
To mould this into something more in tune with what is in Swift now, I think is not crazy hard - the input and output staging parts of execute2 would need to change into something that creates/unpacks a tarball and appropriately modifies the job description so that when it is run by the existing execution mechanism, the tarballs get carried along. (to test if you bothered reading this, if you paste me the random string H14n$=N:t)Z you get a free beer) As specified above, that approach does not work with clustering or with coasters, though both could be modified so as to support such (for example, clustering could be made to merge all stagein and stageout listings for jobs; and coasters could be given a different interface to the existing coaster file transfer mechanism). It might be that coasters and clusters are not particularly desired in this environment, though. For glite execution - the big loss here I think is coasters, because its a very spread out grid environment. So with this approach, applications which work well without coasters will probably work well; but applications which are reliant on coasters for their performance will work as dismally as when run without coasters in any other grid environment. I can think of various modifications, similar to those mentioned in the condor section above, to try to make them work through this submission system, but it might be that a totally different approach to my above implementation is warranted for coaster based execution on glite, with more explicit specification of which sites to run on, rather than allowing the WMS any choice, and only running on sites which do have a shared filesystem available. I think in the short term, my interest is in getting this stuff more closely integrated without focusing too much on coasters and clusters. Comments. 
-- From iraicu at cs.uchicago.edu Wed Apr 22 07:14:12 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 22 Apr 2009 07:14:12 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240368466.2927.7.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <1240365230.810.7.camel@localhost> <49EE7AEE.1000309@cs.uchicago.edu> <1240366286.1648.4.camel@localhost> <49EE8173.30101@cs.uchicago.edu> <1240368466.2927.7.camel@localhost> Message-ID: <49EF0A14.5030106@cs.uchicago.edu> Mihael Hategan wrote: > On Tue, 2009-04-21 at 21:31 -0500, Ioan Raicu wrote: > >> >> I did this with timers. Each task has a max wall time, that dictates >> when the timer will fire for a particular task to mark its completion. >> I simulated up to 1M processors, and billions of tasks, so the timer >> based approach in Java seems to scale quite well to extremes of >> millions of concurrent timers. >> > > One problem is that if you use the same timer object, the events they > generate aren't really concurrent (which is why it's so scalable). Each > timer is a thread that goes through a loop of sorted tasks. I think a > pool of a few timers would do the trick though. > > Right, I know. I implemented it with a pool of threads. >> If you are interested, I can dig through the Falkon code where I >> implemented this logic of emulating workers. >> > > Right. That's almost all that's needed, but there's also the current > time which needs to be faked given that planning depends on how much > time is left in a block, and that depends on what time it is now. > In my case, there were two ways to remove timer events. One way was to let them fire themselves when the time asked elapsed. This is one timer per task. The other reason to remove a timer event would be that the worker's allocation ended. 
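The one-timer-per-task scheme discussed above - a small shared pool of timer threads driving a very large number of pending task-completion events - can be sketched as follows (hypothetical names, not Falkon's actual code):

```java
// Sketch of emulating many workers with a few timer threads: each simulated
// task "completes" when its walltime timer fires, so a handful of threads
// sharing a sorted timer queue can drive a huge number of pending tasks.
import java.util.concurrent.*;

public class WorkerEmulation {
    public static void main(String[] args) throws InterruptedException {
        int tasks = 10_000;
        // A few threads share the timer queue -- the fired events are not
        // truly concurrent, which is exactly why this scales so well.
        ScheduledExecutorService timers = Executors.newScheduledThreadPool(4);
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            long walltimeMs = 10 + (i % 50);   // fake per-task max walltime
            timers.schedule(done::countDown, walltimeMs, TimeUnit.MILLISECONDS);
        }
        done.await();                          // all simulated tasks finished
        timers.shutdown();
        System.out.println("completed " + tasks + " simulated tasks");
    }
}
```

A second set of timers, one per worker, can cancel a worker's outstanding task events when its allocation ends, matching the two removal paths described in the email.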
This is done also with timers, one per worker. Overall, I used real time as a reference time, and the implementation could keep up with 1000s, maybe even 10000s of timer events per second. Ioan > Seems fairly straightforward. > > > > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at anl.gov Wed Apr 22 07:26:13 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 22 Apr 2009 07:26:13 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240350264.26210.20.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> Message-ID: <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> With respect to the questions below, I think it is important that people be able to say "do X" and have the system do X. Of course if it is also possible to say "do X, but if you think that can do better than X, give it a try", that will be good too. But that would be something to ask for explicitly. On Apr 21, 2009, at 4:44 PM, Mihael Hategan wrote: > On Tue, 2009-04-21 at 16:16 -0500, Ian Foster wrote: >> Is it possible to argue for simplicity in these algorithms? >> > > Yes. 
> >> E.g., if I want to submit against an allocation, I should be able to >> do that, and not have the algorithm second-guess me and do something >> different. > > Yes. That would be one particular case. > >> >> Having more complex things is ok, too, as long as they can easily be >> turned off--or (my recommendation) require that they be turned on >> explicitly. > > What you say does beg for a couple of questions: > - if all work is done in a run but the allocation has more time left, > should the workers be shut down or not? > - if more work remains to be done in a run after an explicit > allocation > was used, should the system attempt to allocate more nodes? If not, > should it hang? Fail? > - if the allocation is far in the distance from now, and a run is > started now, is allocating nodes now a matter of second-guessing or a > matter of trying to finish the work faster? What, besides alleged > complexity of the algorithm, would be the downside of doing so? > > > From wilde at mcs.anl.gov Wed Apr 22 07:42:58 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 07:42:58 -0500 Subject: [Swift-devel] swift with input and output sandboxes instead of a shared site filesystem In-Reply-To: References: Message-ID: <49EF10D2.4080502@mcs.anl.gov> This sounds good, Ben. I only skimmed, need to read, and will comment later. We should compare to the experiments on collective I/O (currently named "many-task data management) that Allan, Zhao, Ian, and I are doing. 
The approach involves some things that would apply in the two environments you mention as well: - pull input files to the worker node FS rather than push them multihop - batch output files up into tarballs and expand them back on their target filesystem (on the submit host for now) It also involves some things that apply more to large clusters, but may be generalizable to generic grid environments: - broadcast common files used by many jobs from the submit host to the worker nodes - use "intermediate" filesystems striped across cluster nodes, rather than local filesystems on cluster nodes, where this is more efficient or is needed - have the worker nodes selectively access files from local, intermediate, or global storage, depending on where the submit host workflow decided to place them - keep a catalog in the cluster of what files are on what local host, and a protocol to transfer them from where they were produced to where they need to be consumed (this feature is like data diffusion, but needs more thought and experience to determine how useful it is; not many workflows need it). This is being done so far in discussion with me, Ioan, and Ian, but we'd like to get you and Mihael and anyone else interested to join in; Kamil Iskra and Justin Wozniak from the MCS ZeptoOS and Radix groups are involved as well. We should use this thread to discuss your IO strategy below first, before we involve the MTDM experiments, but one common thread seems to be mastering the changes in Swift data management that allow for us to explore these new data management modes. If you recall, we've had discussions in the past on having something like "pluggable data management strategies" that allowed a given script to be executed in different environments with different strategies - either globally set or set by site. I'm offline a lot until Monday with a proposal deadline, and hope to comment and rejoin the discussion by then or shortly after.
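The per-cluster file catalog mentioned above could, at its simplest, be a concurrent map from logical file name to the host that produced the file; consumers that miss in the catalog fall back to global storage. This is only a sketch with hypothetical names - the email itself says the design needs more thought:

```java
// Minimal sketch of the "catalog of what files are on what local host" idea.
// Hypothetical API; a real version would add a transfer protocol between nodes.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FileCatalog {
    // logical file name -> host where the producing job left it
    private final Map<String, String> location = new ConcurrentHashMap<>();

    public void produced(String file, String host) {
        location.put(file, host);
    }

    // A consumer asks where to fetch its input; null means fall back
    // to global (submit-side) storage.
    public String whereIs(String file) {
        return location.get(file);
    }

    public static void main(String[] args) {
        FileCatalog cat = new FileCatalog();
        cat.produced("stage1.out", "node07");
        for (String f : List.of("stage1.out", "common.dat")) {
            String host = cat.whereIs(f);
            System.out.println(f + " -> " + (host == null ? "global storage" : host));
        }
    }
}
```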
- Mike On 4/22/09 7:12 AM, Ben Clifford wrote: > I implemented an extremely poor quality prototype to try to get my head > around some of the execution semantics when running through: > > i) the gLite workload management system (hereafter, WMS) as used in the > South African Natioanl Grid (my short-term interest) and in EGEE (my > longer term interest) > > ii) condor used as an LRM to manage nodes which do not have a shared > filesystem, without any "grid stuff" involved > > In the case of the WMS, it is a goal to have the WMS perform site > selection, rather than submitting clients (such as Swift). I don't > particularly agree with this, but there it is. > > In the case of condor-with-no-shared-fs, one of the basic requirements of > a Swift site is violated - that of an easily accessible shared file > system. > > Both the WMS and condor provide an alternative to Swift's file management > model; and both of their approaches look similar. > > In a job submission, one specifies the files to be staged into an > arbitrary working directory before execution, and the files to be staged > out after execution. > > My prototype is intended to get practical experience interfacing Swift to > a job submission with those semantics. > > What I have done in my implementation is rip out almost the entirety of > the execute2/site file cache/wrapper.sh layers, and replace it with a > callout to a user-specified shell script. The shell script is passed the > submit side paths of input files and of output files, and the commandline. > > The shell script is then entirely responsible for causing the job to run > somewhere and for doing appropriate input and output staging. > > Into this shell interface, I then have two scripts, one for sagrid/glite > and one for condor-with-no-shared-fs. > > They are similar to each other, differing only in the syntax of the > submission commands and files. 
> > These scripts create a single input tarball, create a job submission file, > submit to the appropriate submit command, hang round polling for status > until the job is finished, and unpack an output tarball. Tarballs are used > rather than explicitly listing each input and output file for two reasons: > i) if an output file is missing (perhaps due to application failure) I > would like the job submission to still return what it has (most especially > remote log files). As long as a tarball is made with *something*, this > works. ii) condor (and perhaps WMS) apparently cannot handle directory > hierarchies in their stagein/stageout parameters. > > I have tested on the SAgrid testing environment (for WMS) and this works > (although quite slowly, as the WMS reports job status changes quite > slowly); and on a condor installation on gwynn.bsd.uchicago.edu (this has > a shared filesystem, so is not a totally satisfactory test). I also sent > this to Mats to test in his environment (as a project he has was my > immediate motivation for the condor side of this). > > This prototype approach loses a huge chunk of Swift execution-side > functionality such as replication, clustering, coasters (deliberately - I > was targetting getting SwiftScript programs running, rather than getting a > decent integration with the interesting execution stuff we have made). > > As such, it is entirely inappropriate for production (or even most > experimental) use. > > However, it has given me another perspective on submitting jobs to the > above two environments. > > For condor: > > The zipped input/output sandbox approach seems to work nicely. 
> > To mould this into something more in tune with what is in Swift now, I > think is not crazy hard - the input and output staging parts of execute2 > would need to change into something that creates/unpacks a tarball and > appropriately modifies the job description so that when it is run by the > existing execution mechanism, the tarballs get carried along. (to test if > you bothered reading this, if you paste me the random string H14n$=N:t)Z > you get a free beer) > > As specified above, that approach does not work with clustering or with > coasters, though both could be modified so as to support such (for > example, clustering could be made to merge all stagein and stageout > listings for jobs; and coasters could be given a different interface to > the existing coaster file transfer mechanism). It might be that coasters > and clusters are not particularly desired in this environment, though. > > For glite execution - the big loss here I think is coasters, because its a > very spread out grid environment. So with this approach, applications > which work well without coasters will probably work well; but applications > which are reliant on coasters for their performance will work as dismally > as when run without coasters in any other grid environment. I can think of > various modifications, similar to those mentioned in the condor section > above, to try to make them work through this submission system, but it > might be that a totally different approach to my above implementation is > warranted for coaster based execution on glite, with more explicit > specification of which sites to run on, rather than allowing the WMS any > choice, and only running on sites which do have a shared filesystem > available. > > I think in the short term, my interest is in getting this stuff more > closely integrated without focusing too much on coasters and clusters. > > Comments. 
> From wilde at mcs.anl.gov Wed Apr 22 07:52:23 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 07:52:23 -0500 Subject: [Swift-devel] swift with input and output sandboxes instead of a shared site filesystem In-Reply-To: <49EF10D2.4080502@mcs.anl.gov> References: <49EF10D2.4080502@mcs.anl.gov> Message-ID: <49EF1307.4050403@mcs.anl.gov> one clarification: when I said "pull input files to the worker node FS rather than push them multihop" the multi-hop was referring to the current system of source dir to work dir to job dir, not your new system below. - mike On 4/22/09 7:42 AM, Michael Wilde wrote: > This sounds good, Ben. I only skimmed, need to read, and will comment > later. > > We should compare to the experiments on collective I/O (currently named > "many-task data management) that Allan, Zhao, Ian, and I are doing. > > The approach involves some things that would apply in the two > environments you mention as well: > > - pull input files to the worker node FS rather than push them multihop > - batch output files up into tarballs and expand them back on their > target filesystem (on the submit host for now) > > It also involves some things that apply more to large clusters, but may > be generalizable to generic grid environments: > > - broadcast common files used by many jobs from the submit host to the > worker nodes > - use "intermediate" filesystems striped across cluster nodes, rather > than local filesystems on cluster nodes, where this is more efficient or > is needed > - have the worker nodes selectively access files from local, > intermediate, or global storage, depending on where the submit host > workflow decided to place them > - keep a catalog in the cluster of what files are on what local host, > and a protocol to transfer them from where they were produced to where > they need to be consumed (this feature is like data diffusion, but needs > more thought and experience to determine how useful it is; not many > workflows need it).
> > This is being done so far in discussion with me, Ioan, and Ian, but we'd > like to get you and Mihael and anyone else interested to join in; Kamil > Iskra and Justin Wozniak from the MCS ZeptoOS and Radix groups are > involved as well. > > We should use this thread to discuss your IO strategy below first, > before we involve the MTDM experiments, but one common thread seems to > be mastering the changes in Swift data management that allow for us to > explore these new data management modes. > > If you recall, we've had discussions in the past on having something > like "pluggable data management strategies" that allowed a given script > to be executed in different environments with different strategies - > either globally set or set by site. > > I'm offline a lot until Monday with a proposal deadline, and hope to > comment and rejoin the discussion by then or shortly after. > > - Mike > > > On 4/22/09 7:12 AM, Ben Clifford wrote: >> I implemented an extremely poor quality prototype to try to get my >> head around some of the execution semantics when running through: >> >> i) the gLite workload management system (hereafter, WMS) as used in >> the South African National Grid (my short-term interest) and in >> EGEE (my longer term interest) >> >> ii) condor used as an LRM to manage nodes which do not have a shared >> filesystem, without any "grid stuff" involved >> >> In the case of the WMS, it is a goal to have the WMS perform site >> selection, rather than submitting clients (such as Swift). I don't >> particularly agree with this, but there it is. >> >> In the case of condor-with-no-shared-fs, one of the basic requirements >> of a Swift site is violated - that of an easily accessible shared file >> system. >> >> Both the WMS and condor provide an alternative to Swift's file >> management model; and both of their approaches look similar.
>> >> In a job submission, one specifies the files to be staged into an >> arbitrary working directory before execution, and the files to be >> staged out after execution. >> >> My prototype is intended to get practical experience interfacing Swift >> to a job submission with those semantics. >> >> What I have done in my implementation is rip out almost the entirety >> of the execute2/site file cache/wrapper.sh layers, and replace it with >> a callout to a user-specified shell script. The shell script is passed >> the submit side paths of input files and of output files, and the >> commandline. >> >> The shell script is then entirely responsible for causing the job to >> run somewhere and for doing appropriate input and output staging. >> >> Into this shell interface, I then have two scripts, one for >> sagrid/glite and one for condor-with-no-shared-fs. >> >> They are similar to each other, differing only in the syntax of the >> submission commands and files. >> >> These scripts create a single input tarball, create a job submission >> file, submit to the appropriate submit command, hang round polling for >> status until the job is finished, and unpack an output tarball. >> Tarballs are used rather than explicitly listing each input and output >> file for two reasons: i) if an output file is missing (perhaps due to >> application failure) I would like the job submission to still return >> what it has (most especially remote log files). As long as a tarball >> is made with *something*, this works. ii) condor (and perhaps WMS) >> apparently cannot handle directory hierarchies in their >> stagein/stageout parameters. >> >> I have tested on the SAgrid testing environment (for WMS) and this >> works (although quite slowly, as the WMS reports job status changes >> quite slowly); and on a condor installation on gwynn.bsd.uchicago.edu >> (this has a shared filesystem, so is not a totally satisfactory test). 
>> I also sent this to Mats to test in his environment (as a project he >> has was my immediate motivation for the condor side of this). >> >> This prototype approach loses a huge chunk of Swift execution-side >> functionality such as replication, clustering, coasters (deliberately >> - I was targetting getting SwiftScript programs running, rather than >> getting a decent integration with the interesting execution stuff we >> have made). >> >> As such, it is entirely inappropriate for production (or even most >> experimental) use. >> >> However, it has given me another perspective on submitting jobs to the >> above two environments. >> >> For condor: >> >> The zipped input/output sandbox approach seems to work nicely. >> >> To mould this into something more in tune with what is in Swift now, I >> think is not crazy hard - the input and output staging parts of >> execute2 would need to change into something that creates/unpacks a >> tarball and appropriately modifies the job description so that when it >> is run by the existing execution mechanism, the tarballs get carried >> along. (to test if you bothered reading this, if you paste me the >> random string H14n$=N:t)Z you get a free beer) >> >> As specified above, that approach does not work with clustering or >> with coasters, though both could be modified so as to support such >> (for example, clustering could be made to merge all stagein and >> stageout listings for jobs; and coasters could be given a different >> interface to the existing coaster file transfer mechanism). It might >> be that coasters and clusters are not particularly desired in this >> environment, though. >> >> For glite execution - the big loss here I think is coasters, because >> its a very spread out grid environment. 
So with this approach, >> applications which work well without coasters will probably work well; >> but applications which are reliant on coasters for their performance >> will work as dismally as when run without coasters in any other grid >> environment. I can think of various modifications, similar to those >> mentioned in the condor section above, to try to make them work >> through this submission system, but it might be that a totally >> different approach to my above implementation is warranted for coaster >> based execution on glite, with more explicit specification of which >> sites to run on, rather than allowing the WMS any choice, and only >> running on sites which do have a shared filesystem available. >> >> I think in the short term, my interest is in getting this stuff more >> closely integrated without focusing too much on coasters and clusters. >> >> Comments. >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From foster at anl.gov Wed Apr 22 08:49:26 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 22 Apr 2009 08:49:26 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> Message-ID: <4951E091-E0D8-4255-BCCB-96BB15EBA5C6@anl.gov> Could we get feedback from users on the time it takes to do this, maybe even ask them to do it and report the time required? On Apr 22, 2009, at 4:55 AM, Ben Clifford wrote: >> Make sites & tc config easy/instant for osg, tg, local clusters > > that is too wishywashy a goal. > > for osg, I am pleased with the present site generation stuff. my main > concern is that it does not generate tc.data entries for /bin/sh, > which is > how self-deployed apps are made at the moment. But this might be > addressed > either in the single-app-deployment or tc.data points above. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at anl.gov Wed Apr 22 08:51:15 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 22 Apr 2009 08:51:15 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> Message-ID: If no-one is using these things, then perhaps we should not be doing them. I myself think they are really important, and so hope that we would instead get people to use them. But if Mike can't get people to use them, then maybe we decide they are unimportant. On Apr 22, 2009, at 4:55 AM, Ben Clifford wrote: >>> Provenance >>> >>> - make it end-user useful and ready >>> - easy to get perf numbers >>> - easy to associate swift parameters with files and runs > > yes. > > (the rest of this paragraph is a rant) I am fairly happy with the > ongoing > development of this at the moment for PC3. However, there has been a > consistent lack of feedback for the provenance work. I attribute no > blame > on any person for this, but it is the case that no one aside from > Luiz has > spent a significant amount of time working with the provenance code > over > the 16 months that it has been in the SVN. People throw in plenty of > feature requests, some of which already exist, and then make no use of > them. If other people in the group do not test out features in the > provenance code properly (i.e. more than spending one hour every > month) > then people cannot complain when the features do not work how they > want.
> The same applies to the log processing graphs - people need to come up > with concrete technical criticisms rather than vague "please make it > better" From benc at hawaga.org.uk Wed Apr 22 08:54:48 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 13:54:48 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Ian Foster wrote: > If no-one is using these things, then perhaps we should not be doing > them. That's generally why I abandon such projects for several months to work on other things that people seem responsive to. > I myself think they are really important, and so hope that we would > instead get people to use them. But if Mike can't get people to use > them, then maybe we decide they are unimportant. Yes, I repeatedly get told "this is important", but I repeatedly run into the frustrations vented below. -- From wilde at mcs.anl.gov Wed Apr 22 09:05:04 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 09:05:04 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <4951E091-E0D8-4255-BCCB-96BB15EBA5C6@anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <4951E091-E0D8-4255-BCCB-96BB15EBA5C6@anl.gov> Message-ID: <49EF2410.9020403@mcs.anl.gov> On 4/22/09 8:49 AM, Ian Foster wrote: > Could we get feedback from users on the time it takes to do this, maybe > even ask them to do it and report the time required? Glen, comments? > On Apr 22, 2009, at 4:55 AM, Ben Clifford wrote: > >>> Make sites & tc config easy/instant for osg, tg, local clusters >> >> that is too wishywashy a goal. No, sorry - it is a real goal. It will get refined, but having worked closely with many users, this is a very real impediment. Perhaps Glen and Sarah can comment, as a user and user-liaison. >> >> for osg, I am pleased with the present site generation stuff.
It's pretty close; I think it needs to be smarter about where the work directory goes. my main >> concern is that it does not generate tc.data entries for /bin/sh, >> which is >> how self-deployed apps are made at the moment. But this might be >> addressed >> either in the single-app-deployment or tc.data points above. Yes. > From benc at hawaga.org.uk Wed Apr 22 09:06:52 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 14:06:52 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF2410.9020403@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <4951E091-E0D8-4255-BCCB-96BB15EBA5C6@anl.gov> <49EF2410.9020403@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Michael Wilde wrote: > > > that is too wishywashy a goal. > > No, sorry - it is a real goal. It will get refined, but having worked closely > with many users, this is a very real impediment. Perhaps Glen and Sarah can > comment, as a user and user-liaison. I accept that it's a real goal. It's too wishy washy to make plans around. Otherwise we might as well plan around the single goal "get money and retire rich". -- From wilde at mcs.anl.gov Wed Apr 22 09:07:25 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 09:07:25 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> Message-ID: <49EF249D.9000605@mcs.anl.gov> We have, minimally, a funded and still unmet charter from CNARI to do real provenance and get it in use in HNL. And we have a live and growing user group in OOPS who can benefit from provenance as I stated on the list in the past week or two. (and who in fact has built a simple web interface to view swift results) On 4/22/09 8:51 AM, Ian Foster wrote: > If no-one is using these things, then perhaps we should not be doing them. No one can use them till they exist.
> I myself think they are really important, and so hope that we would > instead get people to use them. But if Mike can't get people to use > them, then maybe we decide they are unimportant. See above. > On Apr 22, 2009, at 4:55 AM, Ben Clifford wrote: > >>> Provenance >>> >>> - make it end-user useful and ready >>> - easy to get perf numbers >>> - easy to associate swift parameters with files and runs >> >> yes. >> >> (the rest of this paragraph is a rant) I am fairly happy with the ongoing >> development of this at the moment for PC3. However, there has been a >> consistent lack of feedback for the provenance work. I attribute no blame >> on any person for this, but it is the case that no one aside from Luiz has >> spent a significant amount of time working with the provenance code over >> the 16 months that it has been in the SVN. People throw in plenty of >> feature requests, some of which already exist, and then make no use of >> them. If other people in the group do not test out features in the >> provenance code properly (i.e. more than spending one hour every month) >> then people cannot complain when the features do not work how they want. >> The same applies to the log processing graphs - people need to come up >> with concrete technical criticisms rather than vague "please make it >> better" > From benc at hawaga.org.uk Wed Apr 22 09:07:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 14:07:26 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF2410.9020403@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <4951E091-E0D8-4255-BCCB-96BB15EBA5C6@anl.gov> <49EF2410.9020403@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Michael Wilde wrote: > It's pretty close; I think it needs to be smarter about where the work > directory goes. What is insufficiently smart about it now? It's based around what the site publishes.
-- From benc at hawaga.org.uk Wed Apr 22 09:09:56 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 14:09:56 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF249D.9000605@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Michael Wilde wrote: > And we have a live and growing user group in OOPS who can benefit from > provenance as I stated on the list in the past week or two. (and who in fact > has built a simple web interface to view swift results) You repeatedly state that people want it. Ian repeatedly states that people want it. No one gives me feedback. Stop stating that people want it and give me feedback. Who has written this web interface? Has it been discussed here? Does it use the existing provenance database? What is the project management failure that has caused this work to happen in secret? -- From foster at anl.gov Wed Apr 22 09:13:01 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 22 Apr 2009 09:13:01 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> Message-ID: <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> Mike: As I understand things, we've had a mechanism for storing and analyzing/viewing logs for some time. Also a mechanism for recording some basic provenance information. If the CNARI people view these features as very important, why aren't they working with what we have, and providing feedback? Ian. On Apr 22, 2009, at 9:09 AM, Ben Clifford wrote: > > On Wed, 22 Apr 2009, Michael Wilde wrote: > >> And we have a live and growing user group in OOPS who can benefit >> from >> provenance as I stated on the list in the past week or two. (and >> who in fact >> has built a simple web interface to view swift results) > > You repeatedly state that people want it.
Ian repeatedly states that > people want > it and give me feedback. > > Who has written this web interface? Has it been discussed here? Does > it > use the existing provenance database? What is the project management > failure that has caused this work to happen in secret? > > -- > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Wed Apr 22 09:24:39 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 14:24:39 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF249D.9000605@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Michael Wilde wrote: > We have, minimally, a funded and still unmet charter from CNARI to do real > provenance and get it in use in HNL. Is there a CNARI application person funded to work on provenance? If so, what proportion of their time is specifically assigned to working specifically on provenance and not on any other work? -- From wilde at mcs.anl.gov Wed Apr 22 09:32:00 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 09:32:00 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <4951E091-E0D8-4255-BCCB-96BB15EBA5C6@anl.gov> <49EF2410.9020403@mcs.anl.gov> Message-ID: <49EF2A60.4020900@mcs.anl.gov> It seemed to me to not find the data dir for the VO I was in. (or maybe was hardcoded to use tmp?) But maybe that was a bug or insufficient site data, pilot error, etc. But that was the issue I recall. On 4/22/09 9:07 AM, Ben Clifford wrote: > On Wed, 22 Apr 2009, Michael Wilde wrote: > >> It's pretty close; I think it needs to be smarter about where the work >> directory goes. > > What is insufficiently smart about it now? It's based around what the site > publishes.
> From benc at hawaga.org.uk Wed Apr 22 09:35:50 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 14:35:50 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF2A60.4020900@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <4951E091-E0D8-4255-BCCB-96BB15EBA5C6@anl.gov> <49EF2410.9020403@mcs.anl.gov> <49EF2A60.4020900@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Michael Wilde wrote: > It seemed to me to not find the data dir for the VO I was in. > (or maybe was hardcoded to use tmp?) The directory gets constructed like this: $workdir/$lc_vo/tmp/$sitename Where $workdir is the ReSS published value of OSG_DATA, lc_vo is the name of the VO, and sitename is the name of the site. Is this what you were seeing? -- From wilde at mcs.anl.gov Wed Apr 22 09:38:03 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 09:38:03 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> Message-ID: <49EF2BCB.7010702@mcs.anl.gov> When I last reviewed the provenance work it was in the form of notes written by Ben. The queries were esoteric and based on hard to use IDs. It was a 15 page document and I never got the feedback out on it. It was still very experimental - which was certainly what I'd expect at that state - but seemed still very disconnected from the user. Feedback was hard because I kept asking myself, do I need to read this more carefully, or am I missing the way in which this is useful? So it lingered and got buried below more immediate things on my plate. Yes, my fault. On 4/22/09 9:13 AM, Ian Foster wrote: > Mike: > > As I understand things, we've had a mechanism for storing and > analyzing/viewing logs for some time. 
Also a mechanism for recording > some basic provenance information. If the CNARI people view these > features as very important, why aren't they working with what we have, > and providing feedback? Because they don't have anything they can use. Sarah started looking at how to apply Ben's tools, but did not get far enough to be able to give the end users anything of value. - Mike > > Ian. > > > On Apr 22, 2009, at 9:09 AM, Ben Clifford wrote: > >> >> On Wed, 22 Apr 2009, Michael Wilde wrote: >> >>> And we have a live and growing user group in OOPS who can benefit from >>> provenance as I stated on the list in the past week or two. (and who >>> in fact >>> has built a simple web interface to view swift results) >> >> You repeatedly state that people want it. Ian repeatedly states that >> people want it. No one gives me feedback. Stop stating that people want >> it and give me feedback. >> >> Who has written this web interface? Has it been discussed here? Does it >> use the existing provenance database? What is the project management >> failure that has caused this work to happen in secret? >> >> -- >> > From wilde at mcs.anl.gov Wed Apr 22 09:41:06 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 09:41:06 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> Message-ID: <49EF2C82.3060207@mcs.anl.gov> On 4/22/09 9:24 AM, Ben Clifford wrote: > On Wed, 22 Apr 2009, Michael Wilde wrote: > >> We have, minimally, a funded and still unmet charter from CNARI to do real >> provenance and get it in use in HNL. > > Is there a CNARI application person funded to work on provenance? If so, > what proportion of their time is specifically assigned to working > on provenance and not on any other work? Yes. You are partly funded by CNARI to work on provenance, as is Sarah. The % time to work on provenance is not well defined.
Why is a discussion on a list of features devolving into recriminations and frustration? From benc at hawaga.org.uk Wed Apr 22 09:41:44 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 14:41:44 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF2BCB.7010702@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> <49EF2BCB.7010702@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Michael Wilde wrote: > So it lingered and got buried below more immediate things on my plate. That's my main concern - it's a project that everyone says they want, but no one (not just you) appears to have time to interact with, which to me suggests people don't want it very badly. -- From foster at anl.gov Wed Apr 22 09:42:28 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 22 Apr 2009 09:42:28 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF2BCB.7010702@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> <49EF2BCB.7010702@mcs.anl.gov> Message-ID: <4110DB56-718A-4290-AAFE-7EBA56A52E6D@anl.gov> Hey, It seems that we should schedule a get-together on these topics when Ben visits, and work out a plan that will meet CNARI needs. I'd enjoy joining that discussion. Ian. On Apr 22, 2009, at 9:38 AM, Michael Wilde wrote: > When I last reviewed the provenance work it was in the form of notes > written by Ben. > > The queries were esoteric and based on hard to use IDs. > > It was a 15 page document and I never got the feedback out on it. > > It was still very experimental - which was certainly what I'd expect > at that state - but seemed still very disconnected from the user. > > Feedback was hard because I kept asking myself, do I need to read > this more carefully, or am I missing the way in which this is useful?
> > So it lingered and got buried below more immediate things on my plate. > > Yes, my fault. > > On 4/22/09 9:13 AM, Ian Foster wrote: >> Mike: >> As I understand things, we've had a mechanism for storing and >> analyzing/viewing logs for some time. Also a mechanism for >> recording some basic provenance information. If the CNARI people >> view these features as very important, why aren't they working with >> what we have, and providing feedback? > > Because they dont have anything they can use. > Sarah started looking at how to apply Ben's tools, but did not get > far enough to be able to give the end users anything of value. > > - Mike > >> Ian. On Apr 22, 2009, at 9:09 AM, Ben Clifford wrote: >>> >>> On Wed, 22 Apr 2009, Michael Wilde wrote: >>> >>>> And we have a live and growing user group in OOPS who can benefit >>>> from >>>> provenance as I stated on the list in the past week or two. (and >>>> who in fact >>>> has built a simple web interface to view swift results) >>> >>> You repeatedly state that people want it. Ian repeatedly states that >>> people want it. No one gives me feed back. Stop stating that >>> people want >>> it and give me feed back. >>> >>> Who has written this web interface? Has it been discussed here? >>> Does it >>> use the existing provenance data base? What is the project >>> management >>> failure that has caused this work to happen in secret? >>> >>> -- >>> From benc at hawaga.org.uk Wed Apr 22 09:43:34 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 14:43:34 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF2C82.3060207@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <49EF2C82.3060207@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Michael Wilde wrote: > Yes. You are partly funded by CNARI to work on provenance, as is Sarah. > > The % time to work on provenance is not well defined. OK. 
For provenance work to progress in CNARI, it would probably help if you assigned a specific amount of her time to work on this to the exclusion and suffering of other goals. -- From wilde at mcs.anl.gov Wed Apr 22 09:49:58 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 09:49:58 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <4951E091-E0D8-4255-BCCB-96BB15EBA5C6@anl.gov> <49EF2410.9020403@mcs.anl.gov> <49EF2A60.4020900@mcs.anl.gov> Message-ID: <49EF2E96.3050802@mcs.anl.gov> On 4/22/09 9:35 AM, Ben Clifford wrote: > On Wed, 22 Apr 2009, Michael Wilde wrote: > >> It seemed to me to not find the data dir for the VO I was in. >> (or maybe was hardcoded to use tmp?) > > The directory gets constructed like this: > > $workdir/$lc_vo/tmp/$sitename > > Where $workdir is the ReSS published value of OSG_DATA, lc_vo is the name > of the VO, and sitename is the name of the site. > > Is this what you were seeing? Don't know. I seem to recall something that was more like a scratch dir. Likely I tried an older version of the code from Mats. No matter, if it's a non-issue that's great. Need to test more. From wilde at mcs.anl.gov Wed Apr 22 09:55:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 09:55:55 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> <49EF2BCB.7010702@mcs.anl.gov> Message-ID: <49EF2FFB.9040802@mcs.anl.gov> I would simply say that provenance is bubbling up in the priority list. Provenance is of little interest until you start generating enough results to have something to show the provenance *of*. Now that we're getting some results on several projects, provenance is moving up in priority. That seems as it should be given our development and management resources.
Let's keep working through the list analytically. On 4/22/09 9:41 AM, Ben Clifford wrote: > On Wed, 22 Apr 2009, Michael Wilde wrote: > >> So it lingered and got buried below more immediate things on my plate. > > That's my main concern - it's a project that everyone says they want, but no > one (not just you) appears to have time to interact with, which to me > suggests people don't want it very badly. > From wilde at mcs.anl.gov Wed Apr 22 09:58:47 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 09:58:47 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <4110DB56-718A-4290-AAFE-7EBA56A52E6D@anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> <49EF2BCB.7010702@mcs.anl.gov> <4110DB56-718A-4290-AAFE-7EBA56A52E6D@anl.gov> Message-ID: <49EF30A7.5080503@mcs.anl.gov> Yes. But when we meet the consensus often seems to be that we should also work these issues incrementally on this list, and at the meetings do things that we need whiteboards and brainstorming for. So let's keep pushing through these topics. Should this list move soon to bugzilla? Should discussion be via bugzilla at this point? Maybe stay on this list for now??? On 4/22/09 9:42 AM, Ian Foster wrote: > Hey, > > It seems that we should schedule a get-together on these topics when Ben > visits, and work out a plan that will meet CNARI needs. I'd enjoy > joining that discussion. > > Ian. > > > On Apr 22, 2009, at 9:38 AM, Michael Wilde wrote: > >> When I last reviewed the provenance work it was in the form of notes >> written by Ben. >> >> The queries were esoteric and based on hard to use IDs. >> >> It was a 15 page document and I never got the feedback out on it. >> >> It was still very experimental - which was certainly what I'd expect >> at that state - but seemed still very disconnected from the user.
>> >> Feedback was hard because I kept asking myself, do I need to read this >> more carefully, or am I missing the way in which this is useful? >> >> So it lingered and got buried below more immediate things on my plate. >> >> Yes, my fault. >> >> On 4/22/09 9:13 AM, Ian Foster wrote: >>> Mike: >>> As I understand things, we've had a mechanism for storing and >>> analyzing/viewing logs for some time. Also a mechanism for recording >>> some basic provenance information. If the CNARI people view these >>> features as very important, why aren't they working with what we >>> have, and providing feedback? >> >> Because they dont have anything they can use. >> Sarah started looking at how to apply Ben's tools, but did not get far >> enough to be able to give the end users anything of value. >> >> - Mike >> >>> Ian. On Apr 22, 2009, at 9:09 AM, Ben Clifford wrote: >>>> >>>> On Wed, 22 Apr 2009, Michael Wilde wrote: >>>> >>>>> And we have a live and growing user group in OOPS who can benefit from >>>>> provenance as I stated on the list in the past week or two. (and >>>>> who in fact >>>>> has built a simple web interface to view swift results) >>>> >>>> You repeatedly state that people want it. Ian repeatedly states that >>>> people want it. No one gives me feed back. Stop stating that people >>>> want >>>> it and give me feed back. >>>> >>>> Who has written this web interface? Has it been discussed here? Does it >>>> use the existing provenance data base? What is the project management >>>> failure that has caused this work to happen in secret? 
>>>> >>>> -- >>>> > From wilde at mcs.anl.gov Wed Apr 22 10:16:05 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 10:16:05 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> Message-ID: <49EF34B5.4080506@mcs.anl.gov> On 4/22/09 9:09 AM, Ben Clifford wrote: > On Wed, 22 Apr 2009, Michael Wilde wrote: > >> And we have a live and growing user group in OOPS who can benefit from >> provenance as I stated on the list in the past week or two. (and who in fact >> has built a simple web interface to view swift results) > > You repeatedly state that people want it. Ian repeatedly states that > people want it. No one gives me feedback. Stop stating that people want > it and give me feedback. > > Who has written this web interface? Glen Hocky: http://freedgroup.uchicago.edu/oops.html Has it been discussed here? I thought so, but maybe not. > Does it > use the existing provenance database? No. > What is the project management > failure that has caused this work to happen in secret? None that I can see. I have to go back through email - or you can - to look for the point where I discussed the possibilities for OOPS provenance. It was simply this: - first, I worked with Glen to make a swift script to run their science - then we ran these scripts (that was clearly on the list and in the open) - then we wanted a simple web interface where OOPS users could locate the (science) results of runs - then we wrote a paper. That was discussed with you and Mihael. Mihael was co-author on the submission for helping with the coaster side (at the time I thought he would get more deeply involved in scaling up runs) - I wanted to post the paper to Swift, but before putting it on the list I asked the science PIs for permission. They never answered, and posting it fell off my plate. ^^^ OK, that was a project management failure. Sorry. Jeez.
- The paper is here and should be posted on the web when I get permission: http://www.ci.uchicago.edu/~wilde/OOPS.SC09.submitted.pdf - The source code in the paper got "stylized" at the last minute to shrink it down; a snapshot of the live running code is posted on the web site listed first above. The code needs to get refined so that we can post enough fragments of running code to be readable. From benc at hawaga.org.uk Wed Apr 22 10:25:30 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 15:25:30 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF34B5.4080506@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <49EF34B5.4080506@mcs.anl.gov> Message-ID: > - then we wanted a simple web interface where OOPS users could locate the > (science) results of runs That's the sort of thing which would have been useful for the provenance project to have had interaction on. I appreciate that in the short term, it doesn't help your immediate results, and almost definitely actively hinders due to the increased cost of development. But that is the way in which the provenance stuff will move forwards - by having people who are prepared to bear the pain of being early users, just like happened for mainline Swift 2-3 years ago, just like for coaster users in the past year. -- From benc at hawaga.org.uk Wed Apr 22 10:28:56 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 15:28:56 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF30A7.5080503@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> <49EF2BCB.7010702@mcs.anl.gov> <4110DB56-718A-4290-AAFE-7EBA56A52E6D@anl.gov> <49EF30A7.5080503@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Michael Wilde wrote: > Should this list move soon to bugzilla?
> Should discussion be via bugzilla at this point? I think having each clearly defined piece as a separate bugzilla item is useful for giving everything a unique number that can be used rather than having to informally describe things each time they are mentioned, whether they're going to be targeted for 1.0 or not. -- From foster at anl.gov Wed Apr 22 10:40:34 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 22 Apr 2009 10:40:34 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> <49EF2BCB.7010702@mcs.anl.gov> <4110DB56-718A-4290-AAFE-7EBA56A52E6D@anl.gov> <49EF30A7.5080503@mcs.anl.gov> Message-ID: <30029B2D-D23B-4BE1-8BBA-E87C4AD0ECC6@anl.gov> It would be good to have some "user stories" on how provenance is to work. On Apr 22, 2009, at 10:28 AM, Ben Clifford wrote: > On Wed, 22 Apr 2009, Michael Wilde wrote: > >> Should this list move soon to bugzilla? >> Should discussion be via bugzilla at this point? > > I think having each clearly defined piece as a separate bugzilla > item is > useful for giving everything a unique number that can be used rather > than > having to informally describe things each time they are mentioned, > whether > they're going to be targeted for 1.0 or not.
> > -- From hategan at mcs.anl.gov Wed Apr 22 10:44:01 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 22 Apr 2009 10:44:01 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> Message-ID: <1240415041.2927.9.camel@localhost> On Wed, 2009-04-22 at 07:26 -0500, Ian Foster wrote: > With respect to the questions below, There are 3 very specific questions below, not one. You answered none of them. > I think it is important that > people be able to say "do X" and have the system do X. Of course if > it is also possible to say "do X, but if you think that can do better > than X, give it a try", that will be good too. But that would be > something to ask for explicitly. > > > On Apr 21, 2009, at 4:44 PM, Mihael Hategan wrote: > > > On Tue, 2009-04-21 at 16:16 -0500, Ian Foster wrote: > >> Is it possible to argue for simplicity in these algorithms? > >> > > > > Yes. > > > >> E.g., if I want to submit against an allocation, I should be able to > >> do that, and not have the algorithm second-guess me and do something > >> different. > > > > Yes. That would be one particular case. > > > >> > >> Having more complex things is ok, too, as long as they can easily be > >> turned off--or (my recommendation) require that they be turned on > >> explicitly. > > > > What you say does beg for a couple of questions: > > - if all work is done in a run but the allocation has more time left, > > should the workers be shut down or not? 
> > - if more work remains to be done in a run after an explicit > > allocation > > was used, should the system attempt to allocate more nodes? If not, > > should it hang? Fail? > > - if the allocation is far in the distance from now, and a run is > > started now, is allocating nodes now a matter of second-guessing or a > > matter of trying to finish the work faster? What, besides alleged > > complexity of the algorithm, would be the downside of doing so? > > > > > > > From wilde at mcs.anl.gov Wed Apr 22 10:44:16 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 10:44:16 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <49EF34B5.4080506@mcs.anl.gov> Message-ID: <49EF3B50.9010500@mcs.anl.gov> On 4/22/09 10:25 AM, Ben Clifford wrote: >> - then we wanted a simple web interface where OOPS users could locate the >> (science) results of runs > > Thats the sort of thing which would have been useful for the provenance > project to have had interaction on. Yes. We started that discussion *immediately* after the paper was submitted. On 4/15 we started a thread on this list, subject "oops provenance": http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-April/005119.html I appreciate that in the short term, > it doesn't help your immediate results, and almost definitely actively > hinders due to the increased cost of development. But that is the way in > which the provenance stuff will move forwards - by having people who are > prepared to bear to pain of being early users, just like happened for > mainline Swift 2..3y ago, just like coaster users in the past year. Yes. Recall, what the user wanted was a web page that showed science plots and protein folds visualized, and tables of science numbers. That was done, for live use, and was spurred by a paper deadline. 
Then right after the paper I sent the message above saying "let's get this connected to provenance". Seems like the right approach was taken, and is heading in the direction you suggest. From wilde at mcs.anl.gov Wed Apr 22 10:44:51 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 10:44:51 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <30029B2D-D23B-4BE1-8BBA-E87C4AD0ECC6@anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <49EF249D.9000605@mcs.anl.gov> <96C3FDD9-7168-4E07-B904-A2C0F85E3E19@anl.gov> <49EF2BCB.7010702@mcs.anl.gov> <4110DB56-718A-4290-AAFE-7EBA56A52E6D@anl.gov> <49EF30A7.5080503@mcs.anl.gov> <30029B2D-D23B-4BE1-8BBA-E87C4AD0ECC6@anl.gov> Message-ID: <49EF3B73.8070108@mcs.anl.gov> On 4/22/09 10:40 AM, Ian Foster wrote: > It would be good to have some "user stories" on how provenance is to work. > Here is one starting point: http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-April/005119.html > > > On Apr 22, 2009, at 10:28 AM, Ben Clifford wrote: > >> On Wed, 22 Apr 2009, Michael Wilde wrote: >> >>> Should this list move soon to bugzilla? >>> Should discussion be via bugzilla at this point? >> >> I think having each clearly defined piece as a separate bugzilla item is >> useful for giving everything a unique number that can be used rather than >> having to informally describe things each time they are mentioned, >> whether >> they're going to be targeted for 1.0 or not.
>> >> -- > From foster at anl.gov Wed Apr 22 10:49:07 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 22 Apr 2009 10:49:07 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240415041.2927.9.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> <1240415041.2927.9.camel@localhost> Message-ID: <8BD13021-DCEC-4F42-BE2A-A4D14F7B35FC@anl.gov> >>> >>> >>> What you say does beg for a couple of questions: >>> - if all work is done in a run but the allocation has more time >>> left, >>> should the workers be shut down or not? Shut down. >>> >>> - if more work remains to be done in a run after an explicit >>> allocation >>> was used, should the system attempt to allocate more nodes? If not, >>> should it hang? Fail? Fail. >>> >>> - if the allocation is far in the distance from now, and a run is >>> started now, is allocating nodes now a matter of second-guessing >>> or a >>> matter of trying to finish the work faster? What, besides alleged >>> complexity of the algorithm, would be the downside of doing so? Maybe someone has requested an allocation at 10am tomorrow because that is when they want to run the application. Maybe they are benchmarking, and want things to run with a specified number of nodes). Maybe someone doesn't trust the clever algorithm, or finds that it fails for odd reason. Having a more complex algorithm as well is great. I'm not saying this would not be wonderful. But it shouldn't be obligatory. Ian. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Wed Apr 22 11:16:14 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 22 Apr 2009 11:16:14 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <8BD13021-DCEC-4F42-BE2A-A4D14F7B35FC@anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> <1240415041.2927.9.camel@localhost> <8BD13021-DCEC-4F42-BE2A-A4D14F7B35FC@anl.gov> Message-ID: <1240416974.2927.30.camel@localhost> On Wed, 2009-04-22 at 10:49 -0500, Ian Foster wrote: > > > > > > > > > > > > What you say does beg for a couple of questions: > > > > - if all work is done in a run but the allocation has more time > > > > left, > > > > should the workers be shut down or not? > > > Shut down. Ok. > > > > > > > > > - if more work remains to be done in a run after an explicit > > > > allocation > > > > was used, should the system attempt to allocate more nodes? If > > > > not, > > > > should it hang? Fail? > > > Fail. I disagree. If the user didn't want the work to complete, they wouldn't run it. It should be possible to force this mode, but I don't think it should be the default. > > > > > > > > > - if the allocation is far in the distance from now, and a run > > > > is > > > > started now, is allocating nodes now a matter of second-guessing > > > > or a > > > > matter of trying to finish the work faster? What, besides > > > > alleged > > > > complexity of the algorithm, would be the downside of doing so? > > > Maybe someone has requested an allocation at 10am tomorrow because > that is when they want to run the application. I'd assume then that they would start swift somewhere around 10am tomorrow, not one or two days in advance. 
> > > Maybe they are benchmarking, and want things to run with a specified > number of nodes). > Being able to force a "use exactly these nodes for this amount of time at this time" is a given. Making it the default I have issue with. > > Maybe someone doesn't trust the clever algorithm, or finds that it > fails for odd reason. > Right. Many people don't trust garbage collection either. I find it funny that people insist that non-trivial things such as distributed computing, GC, special relativity be entirely intuitive. > > Having a more complex algorithm as well is great. I'm not saying this > would not be wonderful. But it shouldn't be obligatory. That's far from the statement of making the other one the default. Let's ask our users though! From foster at anl.gov Wed Apr 22 11:19:05 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 22 Apr 2009 11:19:05 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240416974.2927.30.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> <1240415041.2927.9.camel@localhost> <8BD13021-DCEC-4F42-BE2A-A4D14F7B35FC@anl.gov> <1240416974.2927.30.camel@localhost> Message-ID: <957E083E-DDD2-4EB6-8DB5-D850AC317FB7@anl.gov> Yes, perhaps the automated system should be default. I don't feel strongly about that. On Apr 22, 2009, at 11:16 AM, Mihael Hategan wrote: > On Wed, 2009-04-22 at 10:49 -0500, Ian Foster wrote: >>>>> >>>>> >>>>> What you say does beg for a couple of questions: >>>>> - if all work is done in a run but the allocation has more time >>>>> left, >>>>> should the workers be shut down or not? >> >> >> Shut down. > > Ok. 
> >> >>>>> >>>>> - if more work remains to be done in a run after an explicit >>>>> allocation >>>>> was used, should the system attempt to allocate more nodes? If >>>>> not, >>>>> should it hang? Fail? >> >> >> Fail. > > I disagree. If the user didn't want the work to complete, they > wouldn't > run it. It should be possible to force this mode, but I don't think it > should be the default. > >> >>>>> >>>>> - if the allocation is far in the distance from now, and a run >>>>> is >>>>> started now, is allocating nodes now a matter of second-guessing >>>>> or a >>>>> matter of trying to finish the work faster? What, besides >>>>> alleged >>>>> complexity of the algorithm, would be the downside of doing so? >> >> >> Maybe someone has requested an allocation at 10am tomorrow because >> that is when they want to run the application. > > I'd assume then that they would start swift somewhere around 10am > tomorrow, not one or two days in advance. > >> >> >> Maybe they are benchmarking, and want things to run with a specified >> number of nodes). >> > > Being able to force a "use exactly these nodes for this amount of time > at this time" is a given. Making it the default I have issue with. > >> >> Maybe someone doesn't trust the clever algorithm, or finds that it >> fails for odd reason. >> > > Right. Many people don't trust garbage collection either. I find it > funny that people insist that non-trivial things such as distributed > computing, GC, special relativity be entirely intuitive. > >> >> Having a more complex algorithm as well is great. I'm not saying this >> would not be wonderful. But it shouldn't be obligatory. > > That's far from the statement of making the other one the default. > > Let's ask our users though! 
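[Editor's note: the positions staked out in the coaster thread above can be summarized as a small decision table. The sketch below is purely illustrative; the mode names and function are hypothetical and not part of the actual coaster implementation.]

```python
# Illustrative sketch of the coaster provisioner policy being debated.
# "strict" = do exactly what the user asked and fail rather than allocate
# more (Ian's requested option); "auto" = allocate more nodes to finish
# the remaining work (Mihael's preferred default). Names are hypothetical.

def next_action(mode, work_remaining, allocation_active):
    """Decide what the provisioner should do next."""
    if not work_remaining:
        # Agreed on both sides: idle workers are shut down even if the
        # allocation still has time left.
        return "shut down workers"
    if allocation_active:
        return "keep running"
    # Work remains but the explicit allocation is exhausted:
    # this is the case where the proposed defaults differ.
    if mode == "strict":
        return "fail"
    return "allocate more nodes"
```

In this framing, the disagreement reduces to which value of `mode` is the default, with both sides accepting that the other mode should be selectable.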
> From hategan at mcs.anl.gov Wed Apr 22 11:40:43 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 22 Apr 2009 11:40:43 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> Message-ID: <1240418443.7115.8.camel@localhost> On Wed, 2009-04-22 at 09:55 +0000, Ben Clifford wrote: > > Auto app install > > single-executable staging is, I think, easily achievable. anything more > complex than that is not achievable for a 1.0 release. That does not > preclude ongoing ADEM (or related to ADEM) development as a separate > component (much as Falkon is or the log-processing stuff was) > I think the auto app install issue is a difficult one. I'm thinking of our alleged typical science user, who patches together applications from various sources, some stable, some not, some picky about their environment, some less so. I believe that the only reasonable solution to the problem is virtualization. The scenario is one in which the user builds a VM that they can test locally, and which works the same on any resource used, without the need to re-compile, tweak, troubleshoot, debug, etc. I'm not sure what the state of VM support is on the TG and/or OSG (maybe Kate can comment on this), but I think without it, our users will invariably need to trade between the effort of installing applications on resources and the ability to load-balance or deal with transient site issues.
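[Editor's note: to make the "single-executable staging" option Ben mentions concrete, here is a minimal sketch of what staging one self-contained executable to a site work directory involves. The function name and paths are hypothetical; this shows the shape of the idea, not Swift's or ADEM's actual mechanism.]

```python
import os
import shutil
import stat
import subprocess

def stage_and_run(app_src, workdir, args):
    """Copy a single self-contained executable into the site work
    directory, mark it executable, and run the staged copy.
    Hypothetical sketch, not Swift's actual staging code."""
    os.makedirs(workdir, exist_ok=True)
    staged = os.path.join(workdir, os.path.basename(app_src))
    shutil.copy2(app_src, staged)
    # ensure the staged copy is executable by its owner
    os.chmod(staged, os.stat(staged).st_mode | stat.S_IXUSR)
    return subprocess.run([staged] + list(args)).returncode
```

Everything beyond this single-file case (shared libraries, interpreter versions, environment assumptions, per-site quirks) is exactly what makes general auto-install hard and motivates the virtualization argument in Mihael's message.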
From wilde at mcs.anl.gov Wed Apr 22 11:42:08 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 11:42:08 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <957E083E-DDD2-4EB6-8DB5-D850AC317FB7@anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> <1240415041.2927.9.camel@localhost> <8BD13021-DCEC-4F42-BE2A-A4D14F7B35FC@anl.gov> <1240416974.2927.30.camel@localhost> <957E083E-DDD2-4EB6-8DB5-D850AC317FB7@anl.gov> Message-ID: <49EF48E0.8010407@mcs.anl.gov> I agree - automated should be the default. I prefer, Mihael, that you get the work now underway completed as you have it envisioned - with the caveat that if the automation process looks like it will exceed the ~10 day estimate you gave yesterday, you raise a flag and discuss with the group what the difficulties and alternatives are. My view is: - manual system that works OK: good - automated system that works poorly: bad - automated system that works well: best Automation of the core scheduler has proven to be hard, but has made good progress. One fear I have is of similar difficulties automating the coaster scheduler/provisioner. If that proves similarly problematic, I don't want to hold back users from getting work done with a manual system while the fully automated system takes a long development effort. But either way, once it's working well, I feel automated should be the default. - Mike On 4/22/09 11:19 AM, Ian Foster wrote: > Yes, perhaps the automated system should be default. I don't feel > strongly about that.
> > > On Apr 22, 2009, at 11:16 AM, Mihael Hategan wrote: > >> On Wed, 2009-04-22 at 10:49 -0500, Ian Foster wrote: >>>>>> >>>>>> >>>>>> What you say does beg for a couple of questions: >>>>>> - if all work is done in a run but the allocation has more time >>>>>> left, >>>>>> should the workers be shut down or not? >>> >>> >>> Shut down. >> >> Ok. >> >>> >>>>>> >>>>>> - if more work remains to be done in a run after an explicit >>>>>> allocation >>>>>> was used, should the system attempt to allocate more nodes? If >>>>>> not, >>>>>> should it hang? Fail? >>> >>> >>> Fail. >> >> I disagree. If the user didn't want the work to complete, they wouldn't >> run it. It should be possible to force this mode, but I don't think it >> should be the default. >> >>> >>>>>> >>>>>> - if the allocation is far in the distance from now, and a run >>>>>> is >>>>>> started now, is allocating nodes now a matter of second-guessing >>>>>> or a >>>>>> matter of trying to finish the work faster? What, besides >>>>>> alleged >>>>>> complexity of the algorithm, would be the downside of doing so? >>> >>> >>> Maybe someone has requested an allocation at 10am tomorrow because >>> that is when they want to run the application. >> >> I'd assume then that they would start swift somewhere around 10am >> tomorrow, not one or two days in advance. >> >>> >>> >>> Maybe they are benchmarking, and want things to run with a specified >>> number of nodes). >>> >> >> Being able to force a "use exactly these nodes for this amount of time >> at this time" is a given. Making it the default I have issue with. >> >>> >>> Maybe someone doesn't trust the clever algorithm, or finds that it >>> fails for odd reason. >>> >> >> Right. Many people don't trust garbage collection either. I find it >> funny that people insist that non-trivial things such as distributed >> computing, GC, special relativity be entirely intuitive. >> >>> >>> Having a more complex algorithm as well is great. 
I'm not saying this >>> would not be wonderful. But it shouldn't be obligatory. >> >> That's far from the statement of making the other one the default. >> >> Let's ask our users though! >> > From foster at anl.gov Wed Apr 22 11:42:43 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 22 Apr 2009 11:42:43 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <1240418443.7115.8.camel@localhost> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> Message-ID: <4B1995A1-3405-443E-9AF4-C00CA79450CB@anl.gov> I have asked Yi Zhu, a new student at U.Chicago, to experiment with doing this on Amazon EC2 (and Nimbus) for Swift. He should join this list, tell us of his plans, and keep us updated on his progress. On Apr 22, 2009, at 11:40 AM, Mihael Hategan wrote: > On Wed, 2009-04-22 at 09:55 +0000, Ben Clifford wrote: > >>> Auto app install >> >> single-executable staging is, I think, easily achievable. anything >> more >> complex than that is not achievable for a 1.0 release. That does not >> preclude ongoing ADEM (or related to ADEM) development as a separate >> component (much as Falkon is or the log-processing stuff was) >> > > I think the auto app install issue is a difficult one. I'm thinking of > our alleged typical science user, who patches together applications > from > various sources, some stable, some not, some picky about their > environment, some less so. > > I believe that the only reasonable solution to the problem is > virtualization. The scenario is one in which the user builds a VM > that > they can test locally, and which works the same on any resource used, > without the need to re-compile, tweak, troubleshoot, debug, etc.
I'm > not > sure what the state of VM support is on the TG and/or OSG (maybe Kate > can comment on this), but I think without it, our users will > invariably > need to trade between effort of installing applications on resources > and > ability to load-balance or deal with transient site issues. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Apr 22 11:51:41 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 22 Apr 2009 11:51:41 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <49EF48E0.8010407@mcs.anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> <1240415041.2927.9.camel@localhost> <8BD13021-DCEC-4F42-BE2A-A4D14F7B35FC@anl.gov> <1240416974.2927.30.camel@localhost> <957E083E-DDD2-4EB6-8DB5-D850AC317FB7@anl.gov> <49EF48E0.8010407@mcs.anl.gov> Message-ID: <1240419101.7475.5.camel@localhost> On Wed, 2009-04-22 at 11:42 -0500, Michael Wilde wrote: > I agree - automated should be the default. > > I prefer, Mihael, that you get the work now underway completed as you > have it envisioned - with the caveat that if the automation process > looks like it will exceed the ~10 day estimate you gave yesterday, I think I said 1-2 weeks of effort, which includes more uncertainty than "10 days". > that > you raise a flag and discuss with the group what the difficulties and > alternatives are. 
> > My view is: > > - manual system that works OK: good > > - automated system that works poorly: bad :) > - automated system that works well: best > > Automation of the core scheduler has proven to be hard, but has made > good progress. I don't think the scheduler has changed much, and I don't think it was hard. The problem I've seen was a misunderstanding of the problem rather than a problem with the solution. From benc at hawaga.org.uk Wed Apr 22 11:43:17 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 16:43:17 +0000 (GMT) Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <1240418443.7115.8.camel@localhost> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> Message-ID: On Wed, 22 Apr 2009, Mihael Hategan wrote: > I believe that the only reasonable solution to the problem is > virtualization. I don't think virtualisation is particularly more reasonable than the various other mechanisms. Each has its own benefits and downsides, and none of them appears to me to be particularly dominant. -- From hategan at mcs.anl.gov Wed Apr 22 12:05:46 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 22 Apr 2009 12:05:46 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> Message-ID: <1240419946.7967.4.camel@localhost> On Wed, 2009-04-22 at 16:43 +0000, Ben Clifford wrote: > On Wed, 22 Apr 2009, Mihael Hategan wrote: > > > I believe that the only reasonable solution to the problem is > > virtualization. > > I don't think virtualisation is particularly more reasonable than the > various other mechanisms. Each has its own benefits and downsides, and > none of them appears to me to be particularly dominant. > Well, the differential cost of going from one site to an arbitrary number of sites is 0 from the perspective of setting applications up.
I think this is essential to making swift a system that commoditizes running applications on grids. From wilde at mcs.anl.gov Wed Apr 22 12:13:08 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 12:13:08 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <1240419101.7475.5.camel@localhost> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> <1240415041.2927.9.camel@localhost> <8BD13021-DCEC-4F42-BE2A-A4D14F7B35FC@anl.gov> <1240416974.2927.30.camel@localhost> <957E083E-DDD2-4EB6-8DB5-D850AC317FB7@anl.gov> <49EF48E0.8010407@mcs.anl.gov> <1240419101.7475.5.camel@localhost> Message-ID: <49EF5024.8040802@mcs.anl.gov> On 4/22/09 11:51 AM, Mihael Hategan wrote: >> Automation of the core scheduler has proven to be hard, but has made >> good progress. > > I don't think the scheduler has changed much, and I don't think it was > hard. The problem I've seen was a misunderstanding of the problem rather > than a problem with the solution. I say this based on experience using the system and helping users use it. The problems I observe, which persist, are: -- 1) slow start starts too slow. Maybe the user needs a simple setting of how aggressively to schedule, where the automated default is somewhat more aggressive than the current default. And it should be based on job starting, not job completion - I don't know if it is, but it *seemed* to me it was not. I might be mistaken. 2) the throttle settings by which a user can seek to adjust things are too complex for any of the users I have worked with to deal with, including me. Some of that can be fixed by documentation.
3) the settings need to get tuned by experts on a per-site basis to be considered "automated". That's not a defect in the scheduler per se (in fact, it's a valuable feature that such tuning can be done). BUT end users should not have to dicker with these settings for every site. We need to provide site definitions for TG, OSG, etc (ie "supported" sites) and we need documentation that tells a user (eg a "swift admin") how to do this for new sites. -- I think if we do #3, then #1 and #2 are solved issues. But part of the "misunderstanding of the problem" is understanding that the feature is not done till it's working well for ordinary end users. The current scheduler is almost there. This should go on the work list under discussion now as a work item. We should look at user logs after it gets into heavier use, tune as needed, and ideally write a research/practice paper on it. From wilde at mcs.anl.gov Wed Apr 22 12:24:25 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 22 Apr 2009 12:24:25 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <1240419946.7967.4.camel@localhost> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <1240419946.7967.4.camel@localhost> Message-ID: <49EF52C9.9040707@mcs.anl.gov> The problem with VMs is that most of the resources our users currently need to run on are not VM-enabled, and won't be in, say, the next 12 months or more. Automated deployers like ADEM show signs of working well *iff* the user app installs nicely. Many do, some don't. And the ADEM model offers a nice way to find the sites on which an app *can* install and run cleanly. So a user can start with those sites, and gradually work to improve their build/install process to handle the sites that pose problems. Anecdotal story: Glen tried ADEM on OOPS, guided only by Zhengxiong's document. With little effort he was able to run on 8 sites.
He sent me this: "Zhengxiong and I used ADEM and got oops working on up to 8 osg sites. Swift script is identical to the one above. (by up to, I mean those are the ones it installed successfully on for Engage VO, and we're not finished testing). " Getting 8 OSG sites in a first test is a very promising sign - a new user who could do that would start out pretty happy. VMs I think are definitely part of the long-term picture but not the near term. I don't want us to wait for VMs rather than going after the immediate low-hanging fruit. On 4/22/09 12:05 PM, Mihael Hategan wrote: > On Wed, 2009-04-22 at 16:43 +0000, Ben Clifford wrote: >> On Wed, 22 Apr 2009, Mihael Hategan wrote: >> >>> I believe that the only reasonable solution to the problem is >>> virtualization. >> I don't think virtualisation is particularly more reasonable than the >> various other mechanisms. Each has its own benefits and downsides, and >> none of them appears to me to be particularly dominant. >> > > Well, the differential cost of going from one site to an arbitrary > number of sites is 0 from the perspective of setting applications up. I > think this is essential to making swift a system that commoditizes > running applications on grids.
> From hategan at mcs.anl.gov Wed Apr 22 12:47:08 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 22 Apr 2009 12:47:08 -0500 Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <49EF5024.8040802@mcs.anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> <1240415041.2927.9.camel@localhost> <8BD13021-DCEC-4F42-BE2A-A4D14F7B35FC@anl.gov> <1240416974.2927.30.camel@localhost> <957E083E-DDD2-4EB6-8DB5-D850AC317FB7@anl.gov> <49EF48E0.8010407@mcs.anl.gov> <1240419101.7475.5.camel@localhost> <49EF5024.8040802@mcs.anl.gov> Message-ID: <1240422428.10150.26.camel@localhost> On Wed, 2009-04-22 at 12:13 -0500, Michael Wilde wrote: > On 4/22/09 11:51 AM, Mihael Hategan wrote: > > >> Automation of the core scheduler has proven to be hard, but has made > >> good progress. > > > > I don't think the scheduler has changed much, and I don't think it was > > hard. The problem I've seen was a misunderstanding of the problem rather > > than a problem with the solution. > > I say this based on experience using the system and helping users use > the system. > > The problems I observe, which persist, are: > > -- > > 1) slow start starts too slow. What objective measurement or determination of how it should be makes it too slow? I'm all for re-adjusting the defaults. So far, nobody has come up with a set of values with a reasonable motivation behind them. > Maybe the user needs a simple setting of > how aggressively to schedule, where the automated default is somewhat more > aggressive than the current default. > > And it should be based on job starting, not job completion - I don't know > if it is, but it *seemed* to me it was not. I might be mistaken. Interesting point.
I think both should count, in different measures. But I think increasing the score when a job goes from queued to active is a good thing. > > 2) the throttle settings by which a user can seek to adjust things are > too complex for any of the users I have worked with to deal with, > including me. Some of that can be fixed by documentation. Which particular throttle settings? All? > > 3) the settings need to get tuned by experts on a per-site basis to be > considered "automated". Which settings, specifically? > That's not a defect in the scheduler per se (in > fact, it's a valuable feature that such tuning can be done). > > BUT end users should not have to dicker with these settings for every > site. We need to provide site definitions for TG, OSG, etc (ie > "supported" sites) and we need documentation that tells a user (eg a > "swift admin") how to do this for new sites. > > -- > > I think if we do #3, then #1 and #2 are solved issues. > > But part of the "misunderstanding of the problem" is understanding that > the feature is not done till it's working well for ordinary end users. I think one problem is that we keep using unquantifiable measures like "well", "ordinary", "too slow", "simple", etc, which are seen differently by different people. We need concrete suggestions and specific comments on what isn't as it should be, and proposed solutions should also be detailed and mindful of the problem(s) at hand.
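[Editorial note: the trade-off discussed above - crediting a site when a job merely goes active versus only when it completes - can be sketched as follows. This is an illustrative toy, not Swift's actual scheduler code, and the weights are invented.]

```java
// Toy site-score model (not Swift's real scheduler): the score grows
// when jobs go active and when they complete, and shrinks on failure.
class SiteScore {
    private double score = 1.0;                      // conservative slow start
    private static final double START_CREDIT = 0.5;  // queued -> active
    private static final double DONE_CREDIT = 1.0;   // successful completion
    private static final double FAIL_PENALTY = 2.0;  // failed job

    void jobStarted()   { score += START_CREDIT; }
    void jobCompleted() { score += DONE_CREDIT; }
    void jobFailed()    { score = Math.max(1.0, score - FAIL_PENALTY); }

    // how many concurrent jobs this model would allow on the site
    int allowedConcurrentJobs() { return (int) Math.floor(score); }
}
```

With START_CREDIT greater than zero, the throttle opens up as soon as a site demonstrates it can start work, rather than waiting a full job duration for the first completion - which is the behaviour Wilde is asking about.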
From benc at hawaga.org.uk Wed Apr 22 13:23:03 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 18:23:03 +0000 (GMT) Subject: [Swift-devel] Coaster capabilities for release 0.9 In-Reply-To: <49EF5024.8040802@mcs.anl.gov> References: <1240004025.5783.6.camel@localhost> <49EDECEE.3040601@mcs.anl.gov> <49EE0167.5090909@mcs.anl.gov> <1240337102.22528.10.camel@localhost> <49EE1A8B.1@mcs.anl.gov> <1240344130.24159.40.camel@localhost> <16013AC9-9604-4BAC-B3DA-38FE8EA2E3F4@anl.gov> <1240350264.26210.20.camel@localhost> <55D159F8-17D4-4CD6-AD69-1F4709AEA2A8@anl.gov> <1240415041.2927.9.camel@localhost> <8BD13021-DCEC-4F42-BE2A-A4D14F7B35FC@anl.gov> <1240416974.2927.30.camel@localhost> <957E083E-DDD2-4EB6-8DB5-D850AC317FB7@anl.gov> <49EF48E0.8010407@mcs.anl.gov> <1240419101.7475.5.camel@localhost> <49EF5024.8040802@mcs.anl.gov> Message-ID: On Wed, 22 Apr 2009, Michael Wilde wrote: > 1) slow start starts too slow. Maybe the user needs a simple setting of how > aggressively to schedule, where the automated default is somewhat more > aggressive than the current default. > > And it should be based on job starting, not job completion - I don't know if it > is, but it *seemed* to me it was not. I might be mistaken. It's based on final completion. The low defaults are deliberate because people were getting upset about Swift overloading gram2 installations. I have a TODO to make the two main load parameters more user-friendly: most likely as these settings, as integers: * maximum number of jobs to run on this site * how many jobs to run at once should the scheduler start with The defaults for those are 20 and 1 at present (though currently specified in a different parameter space). Those seem to be the two main parameters people want to change, in my experience, and I think changing them to the above two bullet points is a helpful thing to do. > BUT end users should not have to dicker with these settings for every site.
We > need to provide site definitions for TG, OSG, etc (ie "supported" sites) and > we need documentation that tells a user (eg a "swift admin") how to do this > for new sites. For gram2 sites, the defaults are the settings I recommend. That's why they are set there. I have no confidence in any OSG site being really able to deal with a higher load without people coming back and blaming swift for overloading. Probably the docs could have a nice table of what is recommended for the different providers and situations. When something autogenerates a site, it can probably be made to set it to the recommended defaults for that site if it has no other information. Almost all other execution mechanisms can handle substantially higher loads, but given the poor publicity generated when someone blows up a site using Swift, I am against changing the defaults for gram2. -- From gabri.turcu at gmail.com Wed Apr 22 23:51:44 2009 From: gabri.turcu at gmail.com (Gabri Turcu) Date: Wed, 22 Apr 2009 23:51:44 -0500 Subject: [Swift-devel] Avoid copy of local files to work directory? In-Reply-To: <49EB53D9.8050600@mcs.anl.gov> References: <49EB3432.4090901@mcs.anl.gov> <49EB53D9.8050600@mcs.anl.gov> Message-ID: <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> Hi, I followed the steps at http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-April/002931.html, but could not recompile swift after adding the provider. I think I may not be using the right version of swift (Swift 0.8 swift-r2448 cog-r2261)?
[javac] /autonfs/home/gabri/cog/modules/provider-ln/src/org/globus/cog/abstraction/impl/execution/local/JobSubmissionTaskHandler.java:39: org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler is not abstract and does not override abstract method cancel(java.lang.String) in org.globus.cog.abstraction.interfaces.DelegatedTaskHandler [javac] public class JobSubmissionTaskHandler implements DelegatedTaskHandler, [javac] ^ [javac] Note: /autonfs/home/gabri/cog/modules/provider-ln/src/org/globus/cog/abstraction/impl/file/ln/FileResourceImpl.java uses or overrides a deprecated API. [javac] Note: Recompile with -deprecation for details. [javac] 1 error Thank you for any suggestions. Best, Gabri On Sun, Apr 19, 2009 at 11:39 AM, Michael Wilde wrote: > > > On 4/19/09 10:10 AM, Ben Clifford wrote: > >> I once implemented a patch to give that behaviour, and sent to this list. >> > > Its here: > > http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-April/002931.html > > > I think it probably either still works or almost still works, if you can >> find it in the swift-devel archives. >> >> It was never benchmarked. Mostly I would be concerned about the overhead >> of creating the links being roughly the same as actually copying the data, >> so I think it would be important to get decent run logs from both approaches >> that can be compared. >> > > Whilst you are busy googling for that patch, it might be interesting to >> see a log file (including -info wrapper logs) for a large run with Swift as >> it is now, to see what the actual breakdown of times is in an individual >> job. >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hockyg at uchicago.edu Thu Apr 23 00:54:11 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 00:54:11 -0500 Subject: [Swift-devel] problem using queenbee Message-ID: <49F00283.1090402@uchicago.edu> Hi everyone, I just made a bunch of code changes in OOPS and pushed them out on queenbee, ranger, and ucanl, and I'm testing out swift rc2 / coasters. I've been totally successful on ucanl and abe, and the jury is still out on ranger since it's busy. However, I'm having odd errors on queenbee. All files are in > /home/hockyg/oops/swift/output/qboutdir.1 on ci home. Any idea what's wrong? Glen From benc at hawaga.org.uk Thu Apr 23 01:13:24 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 06:13:24 +0000 (GMT) Subject: [Swift-devel] Avoid copy of local files to work directory? In-Reply-To: <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> References: <49EB3432.4090901@mcs.anl.gov> <49EB53D9.8050600@mcs.anl.gov> <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> Message-ID: On Wed, 22 Apr 2009, Gabri Turcu wrote: > I followed the steps at > http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-April/002931.html, > but could not recompile swift after adding the provider. I think I may not be > using the right version of swift (Swift 0.8 swift-r2448 cog-r2261)? Yes, that is a version mismatch - the provider API has changed a little. I will fix it for you. In the meantime, do you have log files (and wrapper log files (-info files)) for one of the runs that you are experimenting on - I would like to see the timings of individual pieces.
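[Editorial note: the javac error quoted earlier has the classic shape of an interface gaining a new abstract method - a cancel(java.lang.String) overload was added to DelegatedTaskHandler in the newer cog API, so a provider compiled against the older API no longer implements the full interface. Below is a self-contained illustration of the mismatch; the names mirror the real ones, but the interface and method bodies here are hypothetical simplifications, not the actual cog signatures.]

```java
// Simplified stand-in for the real cog interface, to show the mismatch.
interface DelegatedTaskHandler {
    void cancel();
    void cancel(String message); // overload added in the newer API
}

// A provider written against the old API fails to compile ("is not
// abstract and does not override abstract method cancel(java.lang.String)")
// until it implements the new overload.
class JobSubmissionTaskHandler implements DelegatedTaskHandler {
    private boolean cancelled = false;
    private String reason;

    public void cancel() {
        cancel("cancelled by user"); // delegate to the new overload
    }

    public void cancel(String message) {
        cancelled = true;
        reason = message;
    }

    boolean isCancelled() { return cancelled; }
    String getReason()    { return reason; }
}
```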
-- From benc at hawaga.org.uk Thu Apr 23 01:18:09 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 06:18:09 +0000 (GMT) Subject: [Swift-devel] problem using queenbee In-Reply-To: <49F00283.1090402@uchicago.edu> References: <49F00283.1090402@uchicago.edu> Message-ID: On Thu, 23 Apr 2009, Glen Hocky wrote: > However, I'm having odd errors on queenbee. All files are in > > /home/hockyg/oops/swift/output/qboutdir.1 > on ci home. > > Any idea what's wrong? Lots of problems with gridftp, it looks like. There are lots of messages like this: org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) in your most recent log file oops-20090422-2354-v1g3l76g.log in that directory. Is this repeatable with the same swiftscript? Does examples/swift/first.swift run ok with that sites file? -- From gabri.turcu at gmail.com Thu Apr 23 01:32:58 2009 From: gabri.turcu at gmail.com (Gabri Turcu) Date: Thu, 23 Apr 2009 01:32:58 -0500 Subject: [Swift-devel] Avoid copy of local files to work directory? In-Reply-To: References: <49EB3432.4090901@mcs.anl.gov> <49EB53D9.8050600@mcs.anl.gov> <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> Message-ID: <9f808f850904222332n22188c0ewab27a29442a0d242@mail.gmail.com> Hi Ben, Thank you for your quick response and your help. I have plots at http://people.cs.uchicago.edu/~gabri/ The larger run of ~16 hours (300 patterns searched against 100k files) is at http://people.cs.uchicago.edu/~gabri/report-count-20090416-2119-s03yn299/ and the log file is at CI at /home/gabri/swift-0.8/examples/swift/newslabex/count/logs/count-20090416-2119-s03yn299.log Smaller runs (6 patterns against 20k files) are also in http://people.cs.uchicago.edu/~gabri/concurrent_mapper/ and http://people.cs.uchicago.edu/~gabri/simple_mapper/ Thanks a lot.
Best, Gabri On Thu, Apr 23, 2009 at 1:13 AM, Ben Clifford wrote: > > On Wed, 22 Apr 2009, Gabri Turcu wrote: > > > I followed the steps at > > http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-April/002931.html > , > > but could not recompile swift after adding the provider. I think I may > not > > using the right version of swift (Swift 0.8 swift-r2448 cog-r2261)? > > Yes, that is a version mismatch - the provider API has changed a little. > > I will fix it for you. > > In the meantime, do you have log files (and wrapper log files (-info > files)) for one of the runs that you are experimenting on - I would like > to see the timings of individual pieces. > > -- > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hockyg at uchicago.edu Thu Apr 23 01:37:59 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 01:37:59 -0500 Subject: [Swift-devel] problem using queenbee In-Reply-To: References: <49F00283.1090402@uchicago.edu> Message-ID: <49F00CC7.7030006@uchicago.edu> results seem reproducible. output in qboutdir.2 here is the result for first.swift logfile: /home/hockyg/oops/swift/first-20090423-0135-gigv99m6.log > swift first.swift -tc.file qb.data -sites.file qb.xml > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-0135-gigv99m6 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > first-20090423-0135-gigv99m6/info/n on qb > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > first-20090423-0135-gigv99m6/info/p on qb > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > first-20090423-0135-gigv99m6/info/r on qb > Progress: Failed:1 > Execution failed: > Exception in echo: > Arguments: [Hello, world!] 
> Host: qb > Directory: first-20090423-0135-gigv99m6/jobs/r/echo-r4sk4s9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Failed to start worker: Worker ended prematurely > Cleaning up... > Shutting down service at https://208.100.92.21:46927 > Got channel MetaChannel: 18653132 -> GSSSChannel-null(1) > - Done Ben Clifford wrote: > On Thu, 23 Apr 2009, Glen Hocky wrote: > > >> However, i'm having odd errors on queenbee. All files are in >> >>> /home/hockyg/oops/swift/output/qboutdir.1 >>> >> on ci home. >> >> Any idea what's wrong? >> > > Lots of problems with gridftp, it looks like. There are lots of messages > like this: > > org.globus.ftp.exception.DataChannelException: setPassive() > must match store() and setActive() - retrieve() (error code 2) > > in your most recent log file oops-20090422-2354-v1g3l76g.log in that > directory. > > Is thsi repeatable with the same swiftscript? > > Does examples/swift/first.swift run ok with that sites file? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Thu Apr 23 01:44:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 06:44:26 +0000 (GMT) Subject: [Swift-devel] problem using queenbee In-Reply-To: <49F00CC7.7030006@uchicago.edu> References: <49F00283.1090402@uchicago.edu> <49F00CC7.7030006@uchicago.edu> Message-ID: That error in qboutdir.2 looks like the same ftp error; but the below paste from first.swift look like a different problem, with coasters. On Thu, 23 Apr 2009, Glen Hocky wrote: > results seem reproducible. 
output in qboutdir.2 > > here is the result for first.swift > logfile: /home/hockyg/oops/swift/first-20090423-0135-gigv99m6.log > > swift first.swift -tc.file qb.data -sites.file qb.xml > > Swift 0.9rc2 swift-r2860 cog-r2388 > > > > RunID: 20090423-0135-gigv99m6 > > Progress: > > Progress: Stage in:1 > > Progress: Submitted:1 > > Progress: Active:1 > > Failed to transfer wrapper log from first-20090423-0135-gigv99m6/info/n on > > qb > > Progress: Stage in:1 > > Progress: Submitted:1 > > Progress: Active:1 > > Failed to transfer wrapper log from first-20090423-0135-gigv99m6/info/p on > > qb > > Progress: Stage in:1 > > Progress: Active:1 > > Failed to transfer wrapper log from first-20090423-0135-gigv99m6/info/r on > > qb > > Progress: Failed:1 > > Execution failed: > > Exception in echo: > > Arguments: [Hello, world!] > > Host: qb > > Directory: first-20090423-0135-gigv99m6/jobs/r/echo-r4sk4s9j > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: > > Failed to start worker: Worker ended prematurely > > Cleaning up... > > Shutting down service at https://208.100.92.21:46927 > > Got channel MetaChannel: 18653132 -> GSSSChannel-null(1) > > - Done > > > Ben Clifford wrote: > > On Thu, 23 Apr 2009, Glen Hocky wrote: > > > > > > > However, i'm having odd errors on queenbee. All files are in > > > > > > > /home/hockyg/oops/swift/output/qboutdir.1 > > > > > > > on ci home. > > > > > > Any idea what's wrong? > > > > > > > Lots of problems with gridftp, it looks like. There are lots of messages > > like this: > > > > org.globus.ftp.exception.DataChannelException: setPassive() must match > > store() and setActive() - retrieve() (error code 2) > > > > in your most recent log file oops-20090422-2354-v1g3l76g.log in that > > directory. > > > > Is thsi repeatable with the same swiftscript? > > > > Does examples/swift/first.swift run ok with that sites file? 
> > > > > > From hockyg at uchicago.edu Thu Apr 23 01:47:41 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 01:47:41 -0500 Subject: [Swift-devel] problem using queenbee In-Reply-To: References: <49F00283.1090402@uchicago.edu> <49F00CC7.7030006@uchicago.edu> Message-ID: <49F00F0D.3010308@uchicago.edu> On a related note, my abe run of 2000 5-10 minute jobs is very close to finishing and ran very stably @ 256 cpus. cool! Ben Clifford wrote: > That error in qboutdir.2 looks like the same ftp error; but the below > paste from first.swift look like a different problem, with coasters. > > On Thu, 23 Apr 2009, Glen Hocky wrote: > > >> results seem reproducible. output in qboutdir.2 >> >> here is the result for first.swift >> logfile: /home/hockyg/oops/swift/first-20090423-0135-gigv99m6.log >> >>> swift first.swift -tc.file qb.data -sites.file qb.xml >>> Swift 0.9rc2 swift-r2860 cog-r2388 >>> >>> RunID: 20090423-0135-gigv99m6 >>> Progress: >>> Progress: Stage in:1 >>> Progress: Submitted:1 >>> Progress: Active:1 >>> Failed to transfer wrapper log from first-20090423-0135-gigv99m6/info/n on >>> qb >>> Progress: Stage in:1 >>> Progress: Submitted:1 >>> Progress: Active:1 >>> Failed to transfer wrapper log from first-20090423-0135-gigv99m6/info/p on >>> qb >>> Progress: Stage in:1 >>> Progress: Active:1 >>> Failed to transfer wrapper log from first-20090423-0135-gigv99m6/info/r on >>> qb >>> Progress: Failed:1 >>> Execution failed: >>> Exception in echo: >>> Arguments: [Hello, world!] >>> Host: qb >>> Directory: first-20090423-0135-gigv99m6/jobs/r/echo-r4sk4s9j >>> stderr.txt: >>> >>> stdout.txt: >>> >>> ---- >>> >>> Caused by: >>> Failed to start worker: Worker ended prematurely >>> Cleaning up... 
>>> Shutting down service at https://208.100.92.21:46927 >>> Got channel MetaChannel: 18653132 -> GSSSChannel-null(1) >>> - Done >>> >> Ben Clifford wrote: >> >>> On Thu, 23 Apr 2009, Glen Hocky wrote: >>> >>> >>> >>>> However, i'm having odd errors on queenbee. All files are in >>>> >>>> >>>>> /home/hockyg/oops/swift/output/qboutdir.1 >>>>> >>>>> >>>> on ci home. >>>> >>>> Any idea what's wrong? >>>> >>>> >>> Lots of problems with gridftp, it looks like. There are lots of messages >>> like this: >>> >>> org.globus.ftp.exception.DataChannelException: setPassive() must match >>> store() and setActive() - retrieve() (error code 2) >>> >>> in your most recent log file oops-20090422-2354-v1g3l76g.log in that >>> directory. >>> >>> Is thsi repeatable with the same swiftscript? >>> >>> Does examples/swift/first.swift run ok with that sites file? >>> >>> >>> >> From benc at hawaga.org.uk Thu Apr 23 01:47:35 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 06:47:35 +0000 (GMT) Subject: [Swift-devel] Avoid copy of local files to work directory? In-Reply-To: <9f808f850904222332n22188c0ewab27a29442a0d242@mail.gmail.com> References: <49EB3432.4090901@mcs.anl.gov> <49EB53D9.8050600@mcs.anl.gov> <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> <9f808f850904222332n22188c0ewab27a29442a0d242@mail.gmail.com> Message-ID: Can you do a run with swift.properties option wrapperlog.always.transfer=true (it is set to false by default). There is interesting timing data in there. -- From gabri.turcu at gmail.com Thu Apr 23 02:06:16 2009 From: gabri.turcu at gmail.com (Gabri Turcu) Date: Thu, 23 Apr 2009 02:06:16 -0500 Subject: [Swift-devel] Avoid copy of local files to work directory? 
In-Reply-To: References: <49EB3432.4090901@mcs.anl.gov> <49EB53D9.8050600@mcs.anl.gov> <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> <9f808f850904222332n22188c0ewab27a29442a0d242@mail.gmail.com> Message-ID: <9f808f850904230006y7778017bt27d155ae62a8fc85@mail.gmail.com> Hi Ben, The more recent logs in http://people.cs.uchicago.edu/~gabri/concurrent_mapper/ and http://people.cs.uchicago.edu/~gabri/simple_mapper/ were done with wrapperlog.always.transfer = true. I have put the properties file I was using for these at http://people.cs.uchicago.edu/~gabri/count.prop I will restart the large run too with this setting and get back. Thanks a lot. Best, Gabri On Thu, Apr 23, 2009 at 1:47 AM, Ben Clifford wrote: > > Can you do a run with swift.properties option > wrapperlog.always.transfer=true (it is set to false by default). > > There is interesting timing data in there. > > -- From benc at hawaga.org.uk Thu Apr 23 03:31:43 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 08:31:43 +0000 (GMT) Subject: [Swift-devel] Avoid copy of local files to work directory? In-Reply-To: <9f808f850904230006y7778017bt27d155ae62a8fc85@mail.gmail.com> References: <49EB3432.4090901@mcs.anl.gov> <49EB53D9.8050600@mcs.anl.gov> <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> <9f808f850904222332n22188c0ewab27a29442a0d242@mail.gmail.com> <9f808f850904230006y7778017bt27d155ae62a8fc85@mail.gmail.com> Message-ID: On Thu, 23 Apr 2009, Gabri Turcu wrote: > http://people.cs.uchicago.edu/~gabri/concurrent_mapper/ and I looked at http://people.cs.uchicago.edu/~gabri/concurrent_mapper/report-count-20090422-1853-c819lhq7/karajan.html and it doesn't look like the file transfer stuff is using up a large portion of time. The graphs in the karajan tab "karajan FILE_TRANSFER tasks" show that most of the time no file transfers are happening.
If that stagein/stageout step was taking a lot of time, then I would expect that to be a solid red block showing that the file transfer throttle was being reached often. Similarly on the info tab, the graph 'how wrapper.sh is spending its time' and manual observation of a few randomly chosen info files shows that hardly any time is being used on input file management compared to execution time. It has been suggested that there is a problem with input file staging taking too long, but the above log file seems to suggest that this is not at all a problem. Please can you (plural - whoever it is that thinks file staging is a problem here) present some specific evidence to convince me? -- From benc at hawaga.org.uk Thu Apr 23 04:05:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 09:05:26 +0000 (GMT) Subject: [Swift-devel] Avoid copy of local files to work directory? In-Reply-To: <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> References: <49EB3432.4090901@mcs.anl.gov> <49EB53D9.8050600@mcs.anl.gov> <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> Message-ID: On Wed, 22 Apr 2009, Gabri Turcu wrote: > I followed the steps at > http://mail.ci.uchicago.edu/pipermail/swift-devel/2008-April/002931.html, > but could not recompile swift after adding the provider. I think I may not > using the right version of swift (Swift 0.8 swift-r2448 cog-r2261)? Try http://www.ci.uchicago.edu/~benc/provider-ln-2009-0423.tar.gz As I have said in previous mails, I am fairly unconvinced that this will be useful in your present situation, and I invite you to provide more details (in terms of high level description, and concrete log files) of what you think your problem is. 
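For anyone reproducing the timing analysis discussed above: the property Ben asked for lives in swift.properties (Gabri's count.prop is one such file). A sketch of the relevant line — the property name and its default are as stated in this thread; the file location comment is illustrative:

```properties
# swift.properties (e.g. etc/swift.properties in the Swift distribution)
# Wrapper logs are not transferred back by default; enable this so the
# per-job timing data recorded by wrapper.sh is staged back for analysis:
wrapperlog.always.transfer=true
```

With this enabled, the info/ wrapper logs feed the "how wrapper.sh is spending its time" graph in the generated log report.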
-- From zhaozhang at uchicago.edu Thu Apr 23 09:21:59 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 23 Apr 2009 09:21:59 -0500 Subject: [Swift-devel] feature request In-Reply-To: <1240341050.23910.3.camel@localhost> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> Message-ID: <49F07987.8050908@uchicago.edu> Several Error Message from coaster test: Here is my guess, correct me if I am wrong. Error 1: This is related to CI network setting, /etc/grid-security/hostcert.pem. Could anyone help on this? Who should I contact? Error 2: My certificate is not enabled on teraport, As Mike and I talked last night, "certificate revocation list" on CI network is misconfigured. Error 3 & Error 4: I am not active on tgncsa site. Mike said he needed to add me to another group. zhao Error 1: testing site configuration: coaster/fletch-coaster-gram2-gram2-condor.xml Removing files from previous runs Running test 061-cattwo at Thu Apr 23 07:55:35 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-0755-2kxbne86 Progress: Progress: Initializing site shared directory:1 Execution failed: Could not initialize shared directory on teraport Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Error communicating with the GridFTP server Caused by: Server refused performing the request. Custom message: Server refused GSSAPI authentication. 
(error code 1) [Nested exception message: Custom message: Unexpected reply: 530-globus_xio: Server side credential failure 530-globus_gsi_gssapi: Error with GSI credential 530-globus_gsi_gssapi: Error with gss credential handle *530-globus_credential: Error with credential: The host credential: /etc/grid-security/hostcert.pem* 530- with subject: /DC=org/DC=doegrids/OU=Services/CN=fletch.bsd.uchicago.edu 530- has expired 77452 minutes ago. 530- 530 End.] Error 2: testing site configuration: coaster/teraport-gt2-gt2-pbs.xml Removing files from previous runs Running test 061-cattwo at Thu Apr 23 08:02:35 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-0802-2ra8nd2a Progress: Progress: Initializing site shared directory:1 Progress: Initializing site shared directory:1 Execution failed: Could not initialize shared directory on teraport Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Cannot create directory /gpfs1 *Caused by: Server refused performing the request. Custom message: Server refused creating directory (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed : globus_gridftp_server_file.c:globus_l_gf\ s_file_mkdir:554:* 500-System error in mkdir: Permission denied 500-A system call failed: Permission denied 500 End.] SWIFT RETURN CODE NON-ZERO - test 061-cattwo SITE FAIL! Exit code 1 for site definition coaster/teraport-gt2-gt2-pbs.xml testing site configuration: coaster/tgncsa-hg-coaster-pbs-gram2.xml Removing files from previous runs Running test 061-cattwo at Thu Apr 23 08:02:41 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-0802-h404ec20 Progress: Progress: Initializing site shared directory:1 Progress: Failed:1 Execution failed: Could not initialize shared directory on tgncsa-hg Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Error communicating with the GridFTP server Caused by: S*erver refused performing the request. Custom message: Bad password. 
(error code 1) [Nested exception message: Custom message: Unexpected reply: 530-Login incorrect. : globus_gss_assist: Gridmap lookup failure: Could not map /\ DC=org/DC=doegrids/OU=People/CN=Zhao Zhang 385894* 530- 530 End.] SWIFT RETURN CODE NON-ZERO - test 061-cattwo SITE FAIL! Exit code 1 for site definition coaster/tgncsa-hg-coaster-pbs-gram2.xml testing site configuration: coaster/tgncsa-hg-coaster-pbs-gram4.xml Removing files from previous runs Running test 061-cattwo at Thu Apr 23 08:02:47 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-0802-6quf1lnb Progress: Progress: Initializing site shared directory:1 Progress: Failed:1 Execution failed: Could not initialize shared directory on tgncsa-hg Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Error communicating with the GridFTP server Caused by: *Server refused performing the request. Custom message: Bad password. (error code 1) [Nested exception message: Custom message: Unexpected reply: 530-Login incorrect. : globus_gss_assist: Gridmap lookup failure: Could not map /\ DC=org/DC=doegrids/OU=People/CN=Zhao Zhang 385894* 530- 530 End.] SWIFT RETURN CODE NON-ZERO - test 061-cattwo SITE FAIL! Exit code 1 for site definition coaster/tgncsa-hg-coaster-pbs-gram4.xm Mihael Hategan wrote: > Thanks. > > I think the most valuable tests are those for which there is a working > environment (i.e. valid proxy, valid allocation, etc.) > > Mihael > > On Tue, 2009-04-21 at 13:54 -0500, Allan Espinosa wrote: > >> I ran the same testt that was was conducted w/ Zhao (his certificates >> does not yet have any specific vo membership) and have the following >> result: >> >> icating with the GridFTP server >> Caused by: >> Server refused performing the request. Custom message: Bad password. >> (error code 1) [Nested exception message: Custom message: Unexpected >> reply: 530-Login incorrect. : globus_gss_assist: Gridmap lookup >> failure: Could not map /DC=org/DC=doegrids/OU=People/CN=Allan M. 
>> Espinosa 374652 >> 530- >> 530 End.] >> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >> SITE FAIL! Exit code 1 for site definition >> coaster/tgncsa-hg-coaster-pbs-gram4.xml >> These sites failed: coaster/fletch-coaster-gram2-gram2-condor.xml >> coaster/fletch-coaster-gram2-gram2-fork.xml >> coaster/teraport-gt2-gt2-pbs.xml >> coaster/tgncsa-hg-coaster-pbs-gram2.xml >> coaster/tgncsa-hg-coaster-pbs-gram4.xml >> These sites worked: coaster/coaster-local.xml coaster/renci-engage-coaster.xml >> >> >> cause of errors: 1. fletch has an expired host certificate >> 2. i don't have teragrid allocation on tgncsa (my roaming account >> expired). also i have a different certificate for using teragrid >> resources. >> >> attached in this email is the full log file. >> >> -Allan >> >> >> On Tue, Apr 21, 2009 at 12:55 PM, Zhao Zhang wrote: >> >>> [zzhang at communicado coaster]$ grid-cert-info >>> subject : DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 >>> issuer : DC=org,DC=DOEGrids,OU=Certificate Authorities,CN=DOEGrids CA 1 >>> start date : Wed Feb 25 14:32:08 CST 2009 >>> end date : Thu Feb 25 14:32:08 CST 2010 >>> >>> >>> [zzhang at communicado coaster]$ grid-proxy-init >>> Your identity: DC=org,DC=doegrids,OU=People,CN=Zhao Zhang 385894 >>> Enter GRID pass phrase for this identity: Creating proxy, please >>> wait... >>> Proxy verify OK >>> Your proxy is valid until Wed Apr 22 00:54:47 CDT 2009 >>> >>> zhao >>> >>> Ben Clifford wrote: >>> >>>> Please do all of what I asked. >>>> >>>> On Tue, 21 Apr 2009, Zhao Zhang wrote: >>>> >>>> >>>> >>>>> Yep, I found that my certificate has expired. I check the webpage about >>>>> grid >>>>> proxy, it says I need to request a new certificate. >>>>> >>>>> zhao >>>>> >>>>> [zzhang at communicado ~]$ grid-proxy-init >>>>> Your identity: /DC=org/DC=doegrids/OU=People/CN=Zhao Zhang 385894 >>>>> Enter GRID pass phrase for this identity: >>>>> Creating proxy ........................................... 
Done >>>>> >>>>> >>>>> ERROR: Your certificate has expired: Thu Feb 26 12:47:51 2009 >>>>> >>>>> >>>>> >>>>> Ben Clifford wrote: >>>>> >>>>> >>>>>> zhao, on communicado please type: >>>>>> >>>>>> grid-cert-info >>>>>> >>>>>> and paste the results here. >>>>>> >>>>>> Then type: grid-proxy-init and enter your password and paste the >>>>>> results. >>>>>> >>>>>> >>>>>> >>>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >>> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > > From benc at hawaga.org.uk Thu Apr 23 09:35:38 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 14:35:38 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49F07987.8050908@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> Message-ID: On Thu, 23 Apr 2009, Zhao Zhang wrote: > Error 1: This is related to CI network setting, > /etc/grid-security/hostcert.pem. Could anyone help on this? Who should I > contact? fletch is broken. But try changing those sites files to use gwynn.bsd.uchicago.edu instead. > Error 2: My certificate is not enabled on teraport, As Mike and I talked last > night, "certificate revocation list" on CI network is misconfigured. This looks more like a permissions problem - the directory being used in the sites.xml file for that test does not exist and you do not have permission to create it. 
In r2874 I have changed tests/sites/coaster/teraport-gt2-gt2-pbs.xml to use a different path that should work for you now. > Error 3 & Error 4: I am not active on tgncsa site. Mike said he needed to add > me to another group. yes. Do you have the list from the end of your test run about which sites worked and which did not? -- From zhaozhang at uchicago.edu Thu Apr 23 10:20:41 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 23 Apr 2009 10:20:41 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> Message-ID: <49F08749.5000909@uchicago.edu> Yes, here it is: These sites failed: coaster/fletch-coaster-gram2-gram2-condor.xml coaster/fletch-coaster-gram2-gram2-fork.xml coaster/teraport-gt2-gt2-pbs.xml coaster/tgncsa-hg-coaster-pbs-gram2.xml coaster/tgncsa-hg-coaster-pbs-gram4.xml These sites worked: coaster/coaster-local.xml coaster/renci-engage-coaster.xml zhao Ben Clifford wrote: > On Thu, 23 Apr 2009, Zhao Zhang wrote: > > >> Error 1: This is related to CI network setting, >> /etc/grid-security/hostcert.pem. Could anyone help on this? Who should I >> contact? >> > > fletch is broken. But try changing those sites files to use > gwynn.bsd.uchicago.edu instead. > > >> Error 2: My certificate is not enabled on teraport, As Mike and I talked last >> night, "certificate revocation list" on CI network is misconfigured. >> > > This looks more like a permissions problem - the directory being used in > the sites.xml file for that test does not exist and you do not have > permission to create it.
> > In r2874 I have changed tests/sites/coaster/teraport-gt2-gt2-pbs.xml to > use a different path that should work for you now. > > >> Error 3 & Error 4: I am not active on tgncsa site. Mike said he needed to add >> me to another group. >> > > yes. > > Do you have the list from the end of your test run about which sites > worked and which did not? > > From wilde at mcs.anl.gov Thu Apr 23 10:33:51 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 23 Apr 2009 10:33:51 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F08749.5000909@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F08749.5000909@uchicago.edu> Message-ID: <49F08A5F.5090701@mcs.anl.gov> My understanding is that Gwynn is the new gatekeeper replacing fletch (as Ben says) so Fletch should be removed from the tests. The tests should be annotated to call this the "HNL Condor Pool". See also: http://www.ci.uchicago.edu/wiki/bin/view/CNARI/ResourceStructure - Mike On 4/23/09 10:20 AM, Zhao Zhang wrote: > Yes, here it is: > > These sites failed: coaster/fletch-coaster-gram2-gram2-condor.xml > coaster/fletch-coaster-gram2-gram2-fork.xml > coaster/teraport-gt2-gt2-pbs.xml coaster/tgncsa-hg-coaster-pbs-gram2.xml > coaster/tgncsa-hg-coaster-pbs-gram4.xml > These sites worked: coaster/coaster-local.xml > coaster/renci-engage-coaster.xml > > zhao > > Ben Clifford wrote: >> On Thu, 23 Apr 2009, Zhao Zhang wrote: >> >> >>> Error 1: This is related to CI network setting, >>> /etc/grid-security/hostcert.pem. Could anyone help on this? Who should I >>> contact? >>> >> >> fletch is broken. But try changing those sites files to use >> gwynn.bsd.uchicago.edu instead.
>> >> >>> Error 2: My certificate is not enabled on teraport, As Mike and I >>> talked last >>> night, "certificate revocation list" on CI network is misconfigured. >>> >> >> This looks more like a permissions problem - the directory being used >> in the sites.xml file for that test does not exist and you do not have >> permission to create it. >> >> In r2874 I have changes tests/sites/coaster/teraport-gt2-gt2-pbs.xml >> to use a different path that should work for you now. >> >> >>> Error 3 & Error 4: I am not active on tgncsa site. Mike said he >>> needed to add >>> me to another group. >>> >> >> yes. >> >> Do you have the list from the end of your test run about which sites >> worked and which did not? >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Apr 23 10:56:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Apr 2009 10:56:58 -0500 Subject: [Swift-devel] problem using queenbee In-Reply-To: <49F00283.1090402@uchicago.edu> References: <49F00283.1090402@uchicago.edu> Message-ID: <1240502218.23721.1.camel@localhost> Although I'm not quite sure how the gridftp session gets in the bad state that it does there, I've put in a patch in cog 2390 that should treat that case differently and create a new session instead of re-using the one that's messed up. On Thu, 2009-04-23 at 00:54 -0500, Glen Hocky wrote: > Hi everyone, > I just made a bunch of code changes in OOPS and pushed them out on > ranger, queenbee, ranger, and ucanl and i'm testing out swift rc2 / > coasters > I've been totally successful on ucanl and abe, and the word is still out > on ranger since it's busy > > However, i'm having odd errors on queenbee. All files are in > > /home/hockyg/oops/swift/output/qboutdir.1 > on ci home. > > Any idea what's wrong? 
> > Glen > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From zhaozhang at uchicago.edu Thu Apr 23 11:18:18 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 23 Apr 2009 11:18:18 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> Message-ID: <49F094CA.6070704@uchicago.edu> Hi, Ben Ben Clifford wrote: > On Thu, 23 Apr 2009, Zhao Zhang wrote: > > >> Error 1: This is related to CI network setting, >> /etc/grid-security/hostcert.pem. Could anyone help on this? Who should I >> contact? >> > > fletch is broken. But try changing those sites files to use > gwynn.bsd.uchicago.edu instead. > > >> Error 2: My certificate is not enabled on teraport, As Mike and I talked last >> night, "certificate revocation list" on CI network is misconfigured. >> > > This looks more like a permissions problem - the directory being used in > the sites.xml file for that test does not exist and you do not have > permission to create it. > > In r2874 I have changes tests/sites/coaster/teraport-gt2-gt2-pbs.xml to > use a different path that should work for you now. > I tried this out, it failed, then I increased the wall-time to 15 minutes in the coaster/teraport-gt2-gt2-pbs.xml file. And I am waiting now. 
zhao [zzhang at communicado sites]$ ./run-site coaster/teraport-gt2-gt2-pbs.xml testing site configuration: coaster/teraport-gt2-gt2-pbs.xml Removing files from previous runs Running test 061-cattwo at Thu Apr 23 11:12:09 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-1112-6jqlxfcf Progress: Progress: Stage in:1 Progress: Submitted:1 Failed to transfer wrapper log from 061-cattwo-20090423-1112-6jqlxfcf/info/q on teraport Failed to transfer wrapper log from 061-cattwo-20090423-1112-6jqlxfcf/info/s on teraport Progress: Stage in:1 Failed to transfer wrapper log from 061-cattwo-20090423-1112-6jqlxfcf/info/u on teraport Progress: Failed:1 Execution failed: Exception in cat: Arguments: [061-cattwo.1.in, 061-cattwo.2.in] Host: teraport Directory: 061-cattwo-20090423-1112-6jqlxfcf/jobs/u/cat-umlmrs9j stderr.txt: stdout.txt: ---- Caused by: Job cannot be run with the given max walltime worker constraint (task: 600, maxwalltime: 300s) Cleaning up... Shutting down service at https://128.135.125.118:58204 Got channel MetaChannel: 1297642 -> GSSSChannel-null(1) - Done SWIFT RETURN CODE NON-ZERO - test 061-cattwo > >> Error 3 & Error 4: I am not active on tgncsa site. Mike said he needed to add >> me to another group. >> > > yes. > > Do you have the list from the end of your test run about which sites > worked and which did not? 
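The walltime failure pasted above ("task: 600, maxwalltime: 300s") is an admission check on the coaster side: a worker whose walltime cap is shorter than a task's requested walltime can never run that task, so the job is refused before submission. A minimal sketch of the comparison involved — illustrative only, not Swift's actual code:

```python
def fits_worker(task_walltime_s, worker_maxwalltime_s):
    """A task can only be placed on a worker whose walltime cap
    covers the task's requested walltime."""
    return task_walltime_s <= worker_maxwalltime_s

# Values from the error message: the cat job requested 600s, but the site
# profile capped worker walltime at 300s, so submission was refused.
print(fits_worker(600, 300))  # False
print(fits_worker(600, 900))  # True: fits once the cap is raised to 15 min
```

Raising the walltime in coaster/teraport-gt2-gt2-pbs.xml, as Zhao did, corresponds to the second case, which is why the retry in the next message succeeds.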
> > From zhaozhang at uchicago.edu Thu Apr 23 12:20:12 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 23 Apr 2009 12:20:12 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F094CA.6070704@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> Message-ID: <49F0A34C.6040306@uchicago.edu> Hi, again the test on teraport is successful, here is the log zhao testing site configuration: coaster/teraport-gt2-gt2-pbs.xml Removing files from previous runs Running test 061-cattwo at Thu Apr 23 11:27:19 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-1127-aluxx4m9 Progress: Progress: Stage in:1 Progress: Submitted:1 Progress: Submitted:1 ... Progress: Active:1 Progress: Finished successfully:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://128.135.125.118:57080 Got channel MetaChannel: 2129305 -> GSSSChannel-null(1) - Done expecting 061-cattwo.out.expected checking 061-cattwo.out.expected Skipping exception test due to test configuration Test passed at Thu Apr 23 11:57:39 CDT 2009 ----------===========================---------- Running test 130-fmri at Thu Apr 23 11:57:39 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-1157-r8sarc77 Progress: Progress: Selecting site:2 Initializing site shared directory:1 Stage in:1 Progress: Selecting site:2 Stage in:1 Submitting:1 Progress: Selecting site:2 Submitting:1 Submitted:1 ... 
Progress: Selecting site:2 Submitted:2 Progress: Selecting site:2 Submitted:2 Progress: Selecting site:2 Submitted:2 Progress: Selecting site:2 Submitted:1 Active:1 Progress: Selecting site:2 Active:1 Stage out:1 Progress: Selecting site:1 Stage in:1 Stage out:1 Finished successfully:1 Progress: Submitted:1 Stage out:1 Finished successfully:2 Progress: Active:1 Finished successfully:4 Progress: Submitting:2 Submitted:1 Finished successfully:5 Progress: Active:2 Stage out:1 Finished successfully:5 Progress: Submitted:1 Stage out:2 Finished successfully:8 Final status: Finished successfully:11 Cleaning up... Shutting down service at https://128.135.125.118:52773 Got channel MetaChannel: 28761475 -> GSSSChannel-null(1) - Done expecting 130-fmri.0000.jpeg.expected 130-fmri.0001.jpeg.expected 130-fmri.0002.jpeg.expected checking 130-fmri.0000.jpeg.expected Skipping exception test due to test configuration checking 130-fmri.0001.jpeg.expected Skipping exception test due to test configuration checking 130-fmri.0002.jpeg.expected Skipping exception test due to test configuration Test passed at Thu Apr 23 12:04:47 CDT 2009 ----------===========================---------- Running test 103-quote at Thu Apr 23 12:04:47 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-1204-sjzpkfd3 Progress: Progress: Stage in:1 Progress: Submitted:1 Progress: Active:1 Progress: Finished successfully:1 Final status: Finished successfully:1 Cleaning up... 
Shutting down service at https://128.135.125.118:40813 Got channel MetaChannel: 28500325 -> GSSSChannel-null(1) - Done expecting 103-quote.out.expected checking 103-quote.out.expected Skipping exception test due to test configuration Test passed at Thu Apr 23 12:05:05 CDT 2009 ----------===========================---------- Running test 1032-singlequote at Thu Apr 23 12:05:05 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-1205-x2d55af3 Progress: Progress: Stage in:1 Progress: Submitted:1 Progress: Active:1 Progress: Finished successfully:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://128.135.125.118:44126 Got channel MetaChannel: 18100302 -> GSSSChannel-null(1) - Done expecting 1032-singlequote.out.expected checking 1032-singlequote.out.expected Skipping exception test due to test configuration Test passed at Thu Apr 23 12:05:22 CDT 2009 ----------===========================---------- Running test 1031-quote at Thu Apr 23 12:05:22 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-1205-5aa1ko4e Progress: Progress: Stage in:1 Progress: Submitted:1 Progress: Active:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://128.135.125.118:43759 Got channel MetaChannel: 19002607 -> GSSSChannel-null(1) - Done expecting 1031-quote.*.expected No expected output files specified for this test case - not checking output. Skipping exception test due to test configuration Test passed at Thu Apr 23 12:05:38 CDT 2009 ----------===========================---------- Running test 1033-singlequote at Thu Apr 23 12:05:38 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-1205-8nopyujc Progress: Progress: Stage in:1 Progress: Submitted:1 Progress: Active:1 Progress: Finished successfully:1 Final status: Finished successfully:1 Cleaning up... 
Shutting down service at https://128.135.125.118:39924 Got channel MetaChannel: 31196317 -> GSSSChannel-null(1) - Done expecting 1033-singlequote.out.expected checking 1033-singlequote.out.expected Skipping exception test due to test configuration Test passed at Thu Apr 23 12:05:56 CDT 2009 ----------===========================---------- Running test 141-space-in-filename at Thu Apr 23 12:05:56 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-1205-aalqz1c4 Progress: Progress: Stage in:1 Progress: Submitted:1 Progress: Active:1 Progress: Finished successfully:1 Final status: Finished successfully:1 Cleaning up... Shutting down service at https://128.135.125.118:60177 Got channel MetaChannel: 4728458 -> GSSSChannel-null(1) - Done expecting 141-space-in-filename.space here.out.expected checking 141-space-in-filename.space here.out.expected Skipping exception test due to test configuration Test passed at Thu Apr 23 12:06:15 CDT 2009 ----------===========================---------- Running test 142-space-and-quotes at Thu Apr 23 12:06:15 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090423-1206-8617gag1 Progress: Progress: Selecting site:2 Initializing site shared directory:1 Stage in:1 Progress: Selecting site:2 Submitting:1 Submitted:1 Progress: Selecting site:2 Submitted:1 Active:1 Progress: Selecting site:2 Active:1 Finished successfully:1 Progress: Stage out:1 Finished successfully:3 Final status: Finished successfully:4 Cleaning up... Shutting down service at https://128.135.125.118:57945 Got channel MetaChannel: 16387060 -> GSSSChannel-null(1) - Done expecting 142-space-and-quotes.2" space ".out.expected 142-space-and-quotes.3' space '.out.expected 142-space-and-quotes.out.expected 142-space-and-quotes. 
space .out.expected checking 142-space-and-quotes.2" space ".out.expected Skipping exception test due to test configuration checking 142-space-and-quotes.3' space '.out.expected Skipping exception test due to test configuration checking 142-space-and-quotes.out.expected Skipping exception test due to test configuration checking 142-space-and-quotes. space .out.expected Skipping exception test due to test configuration Test passed at Thu Apr 23 12:06:35 CDT 2009 ----------===========================---------- All language behaviour tests passed Zhao Zhang wrote: > Hi, Ben > > Ben Clifford wrote: >> On Thu, 23 Apr 2009, Zhao Zhang wrote: >> >> >>> Error 1: This is related to CI network setting, >>> /etc/grid-security/hostcert.pem. Could anyone help on this? Who >>> should I >>> contact? >>> >> >> fletch is broken. But try changing those sites files to use >> gwynn.bsd.uchicago.edu instead. >> >> >>> Error 2: My certificate is not enabled on teraport, As Mike and I >>> talked last >>> night, "certificate revocation list" on CI network is misconfigured. >>> >> >> This looks more like a permissions problem - the directory being used >> in the sites.xml file for that test does not exist and you do not >> have permission to create it. >> >> In r2874 I have changes tests/sites/coaster/teraport-gt2-gt2-pbs.xml >> to use a different path that should work for you now. >> > I tried this out, it failed, then I increased the wall-time to 15 > minutes in the coaster/teraport-gt2-gt2-pbs.xml file. > And I am waiting now. 
> > zhao > > [zzhang at communicado sites]$ ./run-site coaster/teraport-gt2-gt2-pbs.xml > testing site configuration: coaster/teraport-gt2-gt2-pbs.xml > Removing files from previous runs > Running test 061-cattwo at Thu Apr 23 11:12:09 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-1112-6jqlxfcf > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Failed to transfer wrapper log from > 061-cattwo-20090423-1112-6jqlxfcf/info/q on teraport > Failed to transfer wrapper log from > 061-cattwo-20090423-1112-6jqlxfcf/info/s on teraport > Progress: Stage in:1 > Failed to transfer wrapper log from > 061-cattwo-20090423-1112-6jqlxfcf/info/u on teraport > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: teraport > Directory: 061-cattwo-20090423-1112-6jqlxfcf/jobs/u/cat-umlmrs9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Job cannot be run with the given max walltime worker constraint > (task: 600, maxwalltime: 300s) > Cleaning up... > Shutting down service at https://128.135.125.118:58204 > Got channel MetaChannel: 1297642 -> GSSSChannel-null(1) > - Done > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > >> >>> Error 3 & Error 4: I am not active on tgncsa site. Mike said he >>> needed to add >>> me to another group. >>> >> >> yes. >> >> Do you have the list from the end of your test run about which sites >> worked and which did not? 
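The failure above ("Job cannot be run with the given max walltime worker constraint (task: 600, maxwalltime: 300s)") means the task requested more walltime (600s) than a coaster worker was allowed to live (300s). A sketch of the kind of sites.xml settings involved, assuming globus-namespace profile keys along the lines of maxwalltime (per-job) and maxtime (per-worker); the key names, units, hostname, and paths here are illustrative, not copied from the actual teraport file:

```xml
<!-- Hypothetical sketch: the worker lifetime must cover the job's walltime. -->
<pool handle="teraport">
  <execution provider="coaster" url="gatekeeper.example.org" jobManager="gt2:gt2:pbs"/>
  <!-- walltime each job may request (illustrative value) -->
  <profile namespace="globus" key="maxwalltime">10</profile>
  <!-- how long each coaster worker block stays alive; must exceed the job walltime -->
  <profile namespace="globus" key="maxtime">3600</profile>
  <workdirectory>/home/user/swiftwork</workdirectory>
</pool>
```

Zhao's fix of raising the wall-time to 15 minutes corresponds to loosening the worker-side limit so the 600s task fits.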
>> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Thu Apr 23 12:29:09 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 17:29:09 +0000 (GMT) Subject: [Swift-devel] condor-g version sensitivity in swift 0.9rc2 Message-ID: Looks like the condor-g stuff I committed the other day needs the local condor to be at least 7.0.x, as the two 6.8.x installations I've tried it on don't work. I think that's fine for the 0.9 release, though - the current osg software stack has condor 7.0.x and that was my main target for this functionality. -- From wilde at mcs.anl.gov Thu Apr 23 12:39:57 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 23 Apr 2009 12:39:57 -0500 Subject: [Swift-devel] condor-g version sensitivity in swift 0.9rc2 In-Reply-To: References: Message-ID: <49F0A7ED.3030805@mcs.anl.gov> I'm behind on my swift-devel reading so it's likely this was already stated: - does this rev let us now use condor-g to run more jobs on a GRAM-2 site than we could before? - should we try doing so, first on test sites like uc.teragrid, then OSG sites? - does it work with coasters, or are further mods needed for that? On 4/23/09 12:29 PM, Ben Clifford wrote: > Looks like the condor-g stuff I committed the other day needs the local > condor to be at least 7.0.x, as the two 6.8.x installations I've tried it > on don't work. > > I think that's fine for the 0.9 release, though - the current osg software > stack has condor 7.0.x and that was my main target for this functionality. > From wilde at mcs.anl.gov Thu Apr 23 13:05:25 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 23 Apr 2009 13:05:25 -0500 Subject: [Swift-devel] Swift + Condor-G In-Reply-To: References: Message-ID: <49F0ADE5.6020704@mcs.anl.gov> This sounds great - I missed this note. Zhao, this is worth testing as well. 
Glen, if this works well it opens the door to broader OOPS runs on OSG and TG. Ben: in this profile item: > gt2 > belhaven-1.renci.org/jobmanager-fork is the format gt2[space][contact-string]? (it wrapped in my email) On 4/15/09 3:17 AM, Ben Clifford wrote: > CoG svn r2388 and Swift r2846 contain the ability to submit Swift jobs to > gt2 sites via Condor-G. > > Below is a site definition that I have used to submit to an OSG site at > RENCI. > > I would appreciate feedback from anyone who tests this successfully or > unsuccessfully. > > As previously mentioned, this is not intended to replace the existing > gram2 submission mechanisms; it provides a way to submit to OSG sites > where plain gram2 is (to a greater or lesser extent) discouraged or > disallowed or dysfunctional. It requires a local condor installation > (which is a strong argument against using this if you do not already have > such in place - this was one of the traumatic parts of getting VDS > running) > > > > > /nfs/home/osgedu/benc > grid > gt2 > belhaven-1.renci.org/jobmanager-fork > > From wilde at mcs.anl.gov Thu Apr 23 14:04:11 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 23 Apr 2009 14:04:11 -0500 Subject: [Swift-devel] Swift + Condor-G In-Reply-To: <49F0ADE5.6020704@mcs.anl.gov> References: <49F0ADE5.6020704@mcs.anl.gov> Message-ID: <49F0BBAB.8010909@mcs.anl.gov> I was pondering whether this would work with coasters, but then realized some of the complexities. 
The 0.9 user guide would be a good place to document which coaster job manager configurations work, what scaling/throttling limits to be aware of, how/when to use the new condor-g capability, and what coasters entries work best where (e.g., gt2:gt2:pbs, gt2:pbs, etc.) Users today need to navigate a tricky sites.xml config process constrained by limits that are hard to understand and deal with: - throttle to limit jobs sent under GT2 GRAM to < 40 - how to avoid exceeding this with coasters - what LRM is on each site - what providers are available (e.g., pbs and condor but not sge) - what coaster combinations work and don't, from where Some way of making this as clear as possible in the users guide or through some set of preconfigured sites files would be very helpful. Right now, users "trade" sites.xml entries like recipes for secret sauces, and it would be good to provide them a cookbook or some better process. On 4/23/09 1:05 PM, Michael Wilde wrote: > This sounds great - I missed this note. > > Zhao, this is worth testing as well. > > Glen, if this works well it opens the door to broader OOPS runs on OSG > and TG. > > Ben: in this profile item: > > > gt2 > > belhaven-1.renci.org/jobmanager-fork > > is the format gt2[space][contact-string]? (it wrapped in my email) > > On 4/15/09 3:17 AM, Ben Clifford wrote: >> CoG svn r2388 and Swift r2846 contain the ability to submit Swift jobs >> to gt2 sites via Condor-G. >> >> Below is a site definition that I have used to submit to an OSG site >> at RENCI. >> >> I would appreciate feedback from anyone who tests this successfully or >> unsuccessfully. >> >> As previously mentioned, this is not intended to replace the existing >> gram2 submission mechanisms; it provides a way to submit to OSG sites >> where plain gram2 is (to a greater or lesser extent) discouraged or >> disallowed or dysfunctional. 
It requires a local condor installation >> (which is a strong argument against using this if you do not already >> have such in place - this was one of the traumatic parts of getting >> VDS running) >> >> >> >> >> /nfs/home/osgedu/benc >> grid >> gt2 >> belhaven-1.renci.org/jobmanager-fork >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From jroelofs at gmail.com Thu Apr 23 15:24:43 2009 From: jroelofs at gmail.com (Jon Roelofs) Date: Thu, 23 Apr 2009 14:24:43 -0600 Subject: [Swift-devel] Google Summer of Code Student Message-ID: Hello, This is Jon Roelofs. My proposal, "Scheduling Algorithms for Swift", was accepted for Google Summer of Code 2009, so this summer I'll be working on developing a better scheduler for the Swift toolkit. You can see more about what I plan to do here: http://socghop.appspot.com/student_project/show/google/gsoc2009/globus/t124022382550 and here: http://www.cs.colostate.edu/~roelofs/gsoc_09.php I look forward to working with you all. Regards, Jon -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Thu Apr 23 16:16:49 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 21:16:49 +0000 (GMT) Subject: [Swift-devel] Swift + Condor-G In-Reply-To: <49F0BBAB.8010909@mcs.anl.gov> References: <49F0ADE5.6020704@mcs.anl.gov> <49F0BBAB.8010909@mcs.anl.gov> Message-ID: > Right now, users "trade" sites.xml entries like recipes for secret sauces, and > it would be good to provide them a cookbook or some better process. The OSG site file generator exists and generates nice site files for OSG VOs, for as long as OSG continues to use ReSS; perhaps another year or so based on previous OSG information system practice. 
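The site definition quoted in this thread lost its XML tags in the archive, leaving only the values ("/nfs/home/osgedu/benc grid gt2 belhaven-1.renci.org/jobmanager-fork"). A hedged reconstruction of what such a Condor-G pool entry looks like, based on those surviving values and the 0.9 release notes (a jobType of "grid" plus a gridResource string copied verbatim into the Condor-G grid_resource classad); the element and attribute names follow my understanding of the Swift sites.xml schema and may not match the original exactly:

```xml
<!-- Sketch of a Condor-G site entry; the handle and overall structure are illustrative. -->
<pool handle="renci-osg">
  <!-- submit through the local Condor installation -->
  <execution provider="condor"/>
  <workdirectory>/nfs/home/osgedu/benc</workdirectory>
  <!-- "grid" routes the job through Condor-G rather than the local pool -->
  <profile namespace="globus" key="jobType">grid</profile>
  <!-- copied, unparsed, into Condor-G's grid_resource line: gt2 [contact-string] -->
  <profile namespace="globus" key="gridResource">gt2 belhaven-1.renci.org/jobmanager-fork</profile>
</pool>
```

This matches Michael's reading later in the thread: the gridResource value is "gt2", a space, then the GRAM contact string, exactly as Condor-G expects it in a grid_resource line.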
It's not clear to me if TeraGrid publishes enough useful information in a mechanically recoverable form for a similar thing to happen for TG sites - what I've seen is the web pages, which seem very human readable and mostly sufficient; and the MDS stuff, which seemed rather sparse though usable. We could have a manually curated list of sites; but that requires ongoing curation, which is a non-trivial ongoing task. The distribution already contains such a manually curated list. There's also a set of sites in the tests/sites/ directory. In principle, that could become a general purpose site repository which the community would maintain, wiki-style; but the existing sites list is also open to community maintenance and has received few if any updates. -- From benc at hawaga.org.uk Thu Apr 23 16:30:12 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 21:30:12 +0000 (GMT) Subject: [Swift-devel] Swift + Condor-G In-Reply-To: <49F0ADE5.6020704@mcs.anl.gov> References: <49F0ADE5.6020704@mcs.anl.gov> Message-ID: On Thu, 23 Apr 2009, Michael Wilde wrote: > Zhao, this is worth testing as well. Tomorrow I should commit a condor-g site tests directory. > Glen, if this works well it opens the door to broader OOPS runs on OSG and TG. Not sure about TG, but it is OSG's recommended job submission mechanism. > is the format gt2[space][contact-string]? (it wrapped in my email) Yes. It's the format as used in condor-g grid_resource lines (in that it is directly copied there and not parsed at all). -- From benc at hawaga.org.uk Thu Apr 23 16:36:39 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 21:36:39 +0000 (GMT) Subject: [Swift-devel] condor-g version sensitivity in swift 0.9rc2 In-Reply-To: <49F0A7ED.3030805@mcs.anl.gov> References: <49F0A7ED.3030805@mcs.anl.gov> Message-ID: On Thu, 23 Apr 2009, Michael Wilde wrote: > - does this rev let us now use condor-g to run more jobs on a GRAM-2 site than > we could before? 
It lets you send a bunch of jobs to condor-g and let condor-g handle load management, as OSG prefers. If Condor-G submits more jobs then yes; but in my relatively short testing, I haven't seen it be particularly enthusiastic about sending in large numbers of jobs. However, it gives me more confidence for the future about testing larger loads on OSG, because Swift can use the OSG recommended mechanism rather than a mechanism that is specifically disrecommended. > - does it work with coasters, or are further mods needed for that? Probably not, though I have not tried it. What is likely to break is, I think, straightforward plumbing work, not some great conceptual design problem, so if someone actually pays attention to making it work, it should not be too hard. In the end, it looks mostly like regular gram2 execution. -- From wilde at mcs.anl.gov Thu Apr 23 17:01:58 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 23 Apr 2009 17:01:58 -0500 Subject: [Swift-devel] Swift + Condor-G In-Reply-To: References: <49F0ADE5.6020704@mcs.anl.gov> <49F0BBAB.8010909@mcs.anl.gov> Message-ID: <49F0E556.2080305@mcs.anl.gov> On 4/23/09 4:16 PM, Ben Clifford wrote: >> Right now, users "trade" sites.xml entries like recipes for secret sauces, and >> it would be good to provide them a cookbook or some better process. > > The OSG site file generator exists and generates nice site files for OSG > VOs, for as long as OSG continues to use ReSS; perhaps another year or so > based on previous OSG information system practice. That's certainly a good start, but settings of Globus or Karajan namespace variables like throttle etc. remain to be done, as do coaster settings. > Its not clear to me if TeraGrid publishes enough useful information in a > mechanically recoverable form for a similar thing to happen for TG sites - > what I've seen is the web pages which seem very human readable and mostly > sufficient; and the MDS stuff, which seemed rather sparse though usable. 
It seems to - Glen did what seems a good start on a TG rough equivalent to the OSG mechanism. I urge him to post and share it (or repost ;) > We could have a manually curated list of sites; but that requires ongoing > curation which is a non-trivial ongoing task. The distribution already > contains such a manually curated list. > > Theres also a set of sites in the tests/sites/ directory. In principle, > that could become a general purpose site repository which the Community > would maintain, wiki-style; but the existing sites list is also open to > Community maintenance and has received few if any updates. The automated version seems best once its complete and usable out of the box. Good (complete) manual working defs for a modest set of example sites might be a good stopgap if automated is not achievable in some short time (say 4 weeks and certainly by next point release). From hockyg at uchicago.edu Thu Apr 23 18:16:47 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 18:16:47 -0500 Subject: [Swift-devel] problem at end of workflow Message-ID: <49F0F6DF.3040902@uchicago.edu> Problem getting last few jobs to run All logs in /home/hockyg/oops/swift/output/rangeroutdir.1002 on ci home Here are my jobs in the queue on ranger, so i still have 256 coasters running, (I would assume): > showq | grep hockyg > 675953 data hockyg Running 16 00:52:43 Thu Apr 23 > 14:06:47 > 675954 data hockyg Running 16 00:53:45 Thu Apr 23 > 14:07:49 > 675955 data hockyg Running 16 01:00:50 Thu Apr 23 > 14:14:54 > 675956 data hockyg Running 16 01:04:48 Thu Apr 23 > 14:18:52 > 675957 data hockyg Running 16 01:04:48 Thu Apr 23 > 14:18:52 > 675958 data hockyg Running 16 01:07:45 Thu Apr 23 > 14:21:49 > 675959 data hockyg Running 16 01:07:45 Thu Apr 23 > 14:21:49 > 675960 data hockyg Running 16 01:10:43 Thu Apr 23 > 14:24:47 > 675961 data hockyg Running 16 01:11:46 Thu Apr 23 > 14:25:50 > 675962 data hockyg Running 16 01:16:46 Thu Apr 23 > 14:30:50 > 675963 data hockyg 
Running 16 01:18:46 Thu Apr 23 > 14:32:50 Here is my progress screen: > Progress: Submitted:1 Active:3 Finished successfully:10002 > Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 > Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 > Progress: Submitted:1 Active:1 Stage out:1 Finished successfully:10003 > Progress: Submitted:1 Active:1 Finished successfully:10004 > Progress: Submitted:1 Active:1 Finished successfully:10004 > Progress: Submitted:1 Stage out:1 Finished successfully:10004 > Progress: Submitted:1 Finished successfully:10005 > Progress: uninitialized:1 Submitted:1 Finished successfully:10005 > Progress: Submitted:2 Finished successfully:10005 > Progress: Submitted:2 Finished successfully:10005 > Progress: Submitted:2 Finished successfully:10005 From wilde at mcs.anl.gov Thu Apr 23 19:26:43 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 23 Apr 2009 19:26:43 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <49F0F6DF.3040902@uchicago.edu> References: <49F0F6DF.3040902@uchicago.edu> Message-ID: <49F10743.6020905@mcs.anl.gov> By "last few" do you mean the 2 that seem to be stuck in "submitted" state? 
On 4/23/09 6:16 PM, Glen Hocky wrote: > Problem getting last few jobs to run > All logs in > > /home/hockyg/oops/swift/output/rangeroutdir.1002 > > on ci home > > Here are my jobs in the queue on ranger, so i still have 256 coasters > running, (I would assume): >> showq | grep hockyg >> 675953 data hockyg Running 16 00:52:43 Thu Apr 23 >> 14:06:47 >> 675954 data hockyg Running 16 00:53:45 Thu Apr 23 >> 14:07:49 >> 675955 data hockyg Running 16 01:00:50 Thu Apr 23 >> 14:14:54 >> 675956 data hockyg Running 16 01:04:48 Thu Apr 23 >> 14:18:52 >> 675957 data hockyg Running 16 01:04:48 Thu Apr 23 >> 14:18:52 >> 675958 data hockyg Running 16 01:07:45 Thu Apr 23 >> 14:21:49 >> 675959 data hockyg Running 16 01:07:45 Thu Apr 23 >> 14:21:49 >> 675960 data hockyg Running 16 01:10:43 Thu Apr 23 >> 14:24:47 >> 675961 data hockyg Running 16 01:11:46 Thu Apr 23 >> 14:25:50 >> 675962 data hockyg Running 16 01:16:46 Thu Apr 23 >> 14:30:50 >> 675963 data hockyg Running 16 01:18:46 Thu Apr 23 >> 14:32:50 > > Here is my progress screen: >> Progress: Submitted:1 Active:3 Finished successfully:10002 >> Progress: Submitted:1 Active:2 Stage out:1 Finished >> successfully:10002 >> Progress: Submitted:1 Active:2 Stage out:1 Finished >> successfully:10002 >> Progress: Submitted:1 Active:1 Stage out:1 Finished >> successfully:10003 >> Progress: Submitted:1 Active:1 Finished successfully:10004 >> Progress: Submitted:1 Active:1 Finished successfully:10004 >> Progress: Submitted:1 Stage out:1 Finished successfully:10004 >> Progress: Submitted:1 Finished successfully:10005 >> Progress: uninitialized:1 Submitted:1 Finished successfully:10005 >> Progress: Submitted:2 Finished successfully:10005 >> Progress: Submitted:2 Finished successfully:10005 >> Progress: Submitted:2 Finished successfully:10005 > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hockyg at uchicago.edu Thu 
Apr 23 20:06:30 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 20:06:30 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <49F10743.6020905@mcs.anl.gov> References: <49F0F6DF.3040902@uchicago.edu> <49F10743.6020905@mcs.anl.gov> Message-ID: <49F11096.6090209@uchicago.edu> yes, though on a different run (on abe) I had more, ~16. Those logs are also in my output dir Michael Wilde wrote: > By "last few" do you mean the 2 that seem to be stuck in "submitted" > state? > > > On 4/23/09 6:16 PM, Glen Hocky wrote: >> Problem getting last few jobs to run >> All logs in >> >> /home/hockyg/oops/swift/output/rangeroutdir.1002 >> >> on ci home >> >> Here are my jobs in the queue on ranger, so i still have 256 coasters >> running, (I would assume): >>> showq | grep hockyg >>> 675953 data hockyg Running 16 00:52:43 Thu Apr >>> 23 14:06:47 >>> 675954 data hockyg Running 16 00:53:45 Thu Apr >>> 23 14:07:49 >>> 675955 data hockyg Running 16 01:00:50 Thu Apr >>> 23 14:14:54 >>> 675956 data hockyg Running 16 01:04:48 Thu Apr >>> 23 14:18:52 >>> 675957 data hockyg Running 16 01:04:48 Thu Apr >>> 23 14:18:52 >>> 675958 data hockyg Running 16 01:07:45 Thu Apr >>> 23 14:21:49 >>> 675959 data hockyg Running 16 01:07:45 Thu Apr >>> 23 14:21:49 >>> 675960 data hockyg Running 16 01:10:43 Thu Apr >>> 23 14:24:47 >>> 675961 data hockyg Running 16 01:11:46 Thu Apr >>> 23 14:25:50 >>> 675962 data hockyg Running 16 01:16:46 Thu Apr >>> 23 14:30:50 >>> 675963 data hockyg Running 16 01:18:46 Thu Apr >>> 23 14:32:50 >> >> Here is my progress screen: >>> Progress: Submitted:1 Active:3 Finished successfully:10002 >>> Progress: Submitted:1 Active:2 Stage out:1 Finished >>> successfully:10002 >>> Progress: Submitted:1 Active:2 Stage out:1 Finished >>> successfully:10002 >>> Progress: Submitted:1 Active:1 Stage out:1 Finished >>> successfully:10003 >>> Progress: Submitted:1 Active:1 Finished successfully:10004 >>> Progress: Submitted:1 Active:1 Finished 
successfully:10004 >>> Progress: Submitted:1 Stage out:1 Finished successfully:10004 >>> Progress: Submitted:1 Finished successfully:10005 >>> Progress: uninitialized:1 Submitted:1 Finished successfully:10005 >>> Progress: Submitted:2 Finished successfully:10005 >>> Progress: Submitted:2 Finished successfully:10005 >>> Progress: Submitted:2 Finished successfully:10005 >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Apr 23 20:35:05 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Apr 2009 20:35:05 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <49F0F6DF.3040902@uchicago.edu> References: <49F0F6DF.3040902@uchicago.edu> Message-ID: <1240536905.703.0.camel@localhost> On Thu, 2009-04-23 at 18:16 -0500, Glen Hocky wrote: > Problem getting last few jobs to run > All logs in > > /home/hockyg/oops/swift/output/rangeroutdir.1002 Does that also contain the coaster logs (remote:~/.globus/coasters/*.log)? 
> > on ci home > > Here are my jobs in the queue on ranger, so i still have 256 coasters > running, (I would assume): > > showq | grep hockyg > > 675953 data hockyg Running 16 00:52:43 Thu Apr 23 > > 14:06:47 > > 675954 data hockyg Running 16 00:53:45 Thu Apr 23 > > 14:07:49 > > 675955 data hockyg Running 16 01:00:50 Thu Apr 23 > > 14:14:54 > > 675956 data hockyg Running 16 01:04:48 Thu Apr 23 > > 14:18:52 > > 675957 data hockyg Running 16 01:04:48 Thu Apr 23 > > 14:18:52 > > 675958 data hockyg Running 16 01:07:45 Thu Apr 23 > > 14:21:49 > > 675959 data hockyg Running 16 01:07:45 Thu Apr 23 > > 14:21:49 > > 675960 data hockyg Running 16 01:10:43 Thu Apr 23 > > 14:24:47 > > 675961 data hockyg Running 16 01:11:46 Thu Apr 23 > > 14:25:50 > > 675962 data hockyg Running 16 01:16:46 Thu Apr 23 > > 14:30:50 > > 675963 data hockyg Running 16 01:18:46 Thu Apr 23 > > 14:32:50 > > Here is my progress screen: > > Progress: Submitted:1 Active:3 Finished successfully:10002 > > Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 > > Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 > > Progress: Submitted:1 Active:1 Stage out:1 Finished successfully:10003 > > Progress: Submitted:1 Active:1 Finished successfully:10004 > > Progress: Submitted:1 Active:1 Finished successfully:10004 > > Progress: Submitted:1 Stage out:1 Finished successfully:10004 > > Progress: Submitted:1 Finished successfully:10005 > > Progress: uninitialized:1 Submitted:1 Finished successfully:10005 > > Progress: Submitted:2 Finished successfully:10005 > > Progress: Submitted:2 Finished successfully:10005 > > Progress: Submitted:2 Finished successfully:10005 > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hockyg at uchicago.edu Thu Apr 23 20:42:06 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 20:42:06 -0500 Subject: 
[Swift-devel] problem at end of workflow In-Reply-To: <1240536905.703.0.camel@localhost> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> Message-ID: <49F118EE.5000609@uchicago.edu> from now on, if i have a question about a site, i'll sync my remote:~/.globus/coasters folder to /home/hockyg/oops/swift/coaster_logs/SITE i just did that for ranger /home/hockyg/oops/swift/coaster_logs/ranger/coasters Mihael Hategan wrote: > On Thu, 2009-04-23 at 18:16 -0500, Glen Hocky wrote: > >> Problem getting last few jobs to run >> All logs in >> >> /home/hockyg/oops/swift/output/rangeroutdir.1002 >> > > Does that also contain the coaster logs > (remote:~/.globus/coasters/*.log)? > > >> on ci home >> >> Here are my jobs in the queue on ranger, so i still have 256 coasters >> running, (I would assume): >> >>> showq | grep hockyg >>> 675953 data hockyg Running 16 00:52:43 Thu Apr 23 >>> 14:06:47 >>> 675954 data hockyg Running 16 00:53:45 Thu Apr 23 >>> 14:07:49 >>> 675955 data hockyg Running 16 01:00:50 Thu Apr 23 >>> 14:14:54 >>> 675956 data hockyg Running 16 01:04:48 Thu Apr 23 >>> 14:18:52 >>> 675957 data hockyg Running 16 01:04:48 Thu Apr 23 >>> 14:18:52 >>> 675958 data hockyg Running 16 01:07:45 Thu Apr 23 >>> 14:21:49 >>> 675959 data hockyg Running 16 01:07:45 Thu Apr 23 >>> 14:21:49 >>> 675960 data hockyg Running 16 01:10:43 Thu Apr 23 >>> 14:24:47 >>> 675961 data hockyg Running 16 01:11:46 Thu Apr 23 >>> 14:25:50 >>> 675962 data hockyg Running 16 01:16:46 Thu Apr 23 >>> 14:30:50 >>> 675963 data hockyg Running 16 01:18:46 Thu Apr 23 >>> 14:32:50 >>> >> Here is my progress screen: >> >>> Progress: Submitted:1 Active:3 Finished successfully:10002 >>> Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 >>> Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 >>> Progress: Submitted:1 Active:1 Stage out:1 Finished successfully:10003 >>> Progress: Submitted:1 Active:1 Finished successfully:10004 >>> Progress: 
Submitted:1 Active:1 Finished successfully:10004 >>> Progress: Submitted:1 Stage out:1 Finished successfully:10004 >>> Progress: Submitted:1 Finished successfully:10005 >>> Progress: uninitialized:1 Submitted:1 Finished successfully:10005 >>> Progress: Submitted:2 Finished successfully:10005 >>> Progress: Submitted:2 Finished successfully:10005 >>> Progress: Submitted:2 Finished successfully:10005 >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From hategan at mcs.anl.gov Thu Apr 23 20:48:29 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Apr 2009 20:48:29 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <49F118EE.5000609@uchicago.edu> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> Message-ID: <1240537709.1093.0.camel@localhost> It may also be a good idea to clean them up from time to time, until such time as we come up with a scheme to make them more organized. On Thu, 2009-04-23 at 20:42 -0500, Glen Hocky wrote: > from now on, if i have a question about a site, i'll sync my > remote:~/.globus/coasters folder to > /home/hockyg/oops/swift/coaster_logs/SITE > > i just did that for ranger > /home/hockyg/oops/swift/coaster_logs/ranger/coasters > > Mihael Hategan wrote: > > On Thu, 2009-04-23 at 18:16 -0500, Glen Hocky wrote: > > > >> Problem getting last few jobs to run > >> All logs in > >> > >> /home/hockyg/oops/swift/output/rangeroutdir.1002 > >> > > > > Does that also contain the coaster logs > > (remote:~/.globus/coasters/*.log)? 
> > > > > >> on ci home > >> > >> Here are my jobs in the queue on ranger, so i still have 256 coasters > >> running, (I would assume): > >> > >>> showq | grep hockyg > >>> 675953 data hockyg Running 16 00:52:43 Thu Apr 23 > >>> 14:06:47 > >>> 675954 data hockyg Running 16 00:53:45 Thu Apr 23 > >>> 14:07:49 > >>> 675955 data hockyg Running 16 01:00:50 Thu Apr 23 > >>> 14:14:54 > >>> 675956 data hockyg Running 16 01:04:48 Thu Apr 23 > >>> 14:18:52 > >>> 675957 data hockyg Running 16 01:04:48 Thu Apr 23 > >>> 14:18:52 > >>> 675958 data hockyg Running 16 01:07:45 Thu Apr 23 > >>> 14:21:49 > >>> 675959 data hockyg Running 16 01:07:45 Thu Apr 23 > >>> 14:21:49 > >>> 675960 data hockyg Running 16 01:10:43 Thu Apr 23 > >>> 14:24:47 > >>> 675961 data hockyg Running 16 01:11:46 Thu Apr 23 > >>> 14:25:50 > >>> 675962 data hockyg Running 16 01:16:46 Thu Apr 23 > >>> 14:30:50 > >>> 675963 data hockyg Running 16 01:18:46 Thu Apr 23 > >>> 14:32:50 > >>> > >> Here is my progress screen: > >> > >>> Progress: Submitted:1 Active:3 Finished successfully:10002 > >>> Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 > >>> Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 > >>> Progress: Submitted:1 Active:1 Stage out:1 Finished successfully:10003 > >>> Progress: Submitted:1 Active:1 Finished successfully:10004 > >>> Progress: Submitted:1 Active:1 Finished successfully:10004 > >>> Progress: Submitted:1 Stage out:1 Finished successfully:10004 > >>> Progress: Submitted:1 Finished successfully:10005 > >>> Progress: uninitialized:1 Submitted:1 Finished successfully:10005 > >>> Progress: Submitted:2 Finished successfully:10005 > >>> Progress: Submitted:2 Finished successfully:10005 > >>> Progress: Submitted:2 Finished successfully:10005 > >>> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > > From hategan at 
mcs.anl.gov Thu Apr 23 20:55:33 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Apr 2009 20:55:33 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <49F118EE.5000609@uchicago.edu> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> Message-ID: <1240538133.1093.4.camel@localhost> On Thu, 2009-04-23 at 20:42 -0500, Glen Hocky wrote: > from now on, if i have a question about a site, i'll sync my > remote:~/.globus/coasters folder to > /home/hockyg/oops/swift/coaster_logs/SITE > > i just did that for ranger > /home/hockyg/oops/swift/coaster_logs/ranger/coasters What I see in the logs is a bunch of starting workers, none running and a bunch of queued jobs. I'm not sure what went wrong and where. I'll continue looking. > > Mihael Hategan wrote: > > On Thu, 2009-04-23 at 18:16 -0500, Glen Hocky wrote: > > > >> Problem getting last few jobs to run > >> All logs in > >> > >> /home/hockyg/oops/swift/output/rangeroutdir.1002 > >> > > > > Does that also contain the coaster logs > > (remote:~/.globus/coasters/*.log)? 
> > > > > >> on ci home > >> > >> Here are my jobs in the queue on ranger, so i still have 256 coasters > >> running, (I would assume): > >> > >>> showq | grep hockyg > >>> 675953 data hockyg Running 16 00:52:43 Thu Apr 23 > >>> 14:06:47 > >>> 675954 data hockyg Running 16 00:53:45 Thu Apr 23 > >>> 14:07:49 > >>> 675955 data hockyg Running 16 01:00:50 Thu Apr 23 > >>> 14:14:54 > >>> 675956 data hockyg Running 16 01:04:48 Thu Apr 23 > >>> 14:18:52 > >>> 675957 data hockyg Running 16 01:04:48 Thu Apr 23 > >>> 14:18:52 > >>> 675958 data hockyg Running 16 01:07:45 Thu Apr 23 > >>> 14:21:49 > >>> 675959 data hockyg Running 16 01:07:45 Thu Apr 23 > >>> 14:21:49 > >>> 675960 data hockyg Running 16 01:10:43 Thu Apr 23 > >>> 14:24:47 > >>> 675961 data hockyg Running 16 01:11:46 Thu Apr 23 > >>> 14:25:50 > >>> 675962 data hockyg Running 16 01:16:46 Thu Apr 23 > >>> 14:30:50 > >>> 675963 data hockyg Running 16 01:18:46 Thu Apr 23 > >>> 14:32:50 > >>> > >> Here is my progress screen: > >> > >>> Progress: Submitted:1 Active:3 Finished successfully:10002 > >>> Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 > >>> Progress: Submitted:1 Active:2 Stage out:1 Finished successfully:10002 > >>> Progress: Submitted:1 Active:1 Stage out:1 Finished successfully:10003 > >>> Progress: Submitted:1 Active:1 Finished successfully:10004 > >>> Progress: Submitted:1 Active:1 Finished successfully:10004 > >>> Progress: Submitted:1 Stage out:1 Finished successfully:10004 > >>> Progress: Submitted:1 Finished successfully:10005 > >>> Progress: uninitialized:1 Submitted:1 Finished successfully:10005 > >>> Progress: Submitted:2 Finished successfully:10005 > >>> Progress: Submitted:2 Finished successfully:10005 > >>> Progress: Submitted:2 Finished successfully:10005 > >>> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > > From hategan at 
mcs.anl.gov Thu Apr 23 20:56:52 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Apr 2009 20:56:52 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <1240538133.1093.4.camel@localhost> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> <1240538133.1093.4.camel@localhost> Message-ID: <1240538212.1093.6.camel@localhost> On Thu, 2009-04-23 at 20:55 -0500, Mihael Hategan wrote: > On Thu, 2009-04-23 at 20:42 -0500, Glen Hocky wrote: > > from now on, if i have a question about a site, i'll sync my > > remote:~/.globus/coasters folder to > > /home/hockyg/oops/swift/coaster_logs/SITE > > > > i just did that for ranger > > /home/hockyg/oops/swift/coaster_logs/ranger/coasters > > What I see in the logs is a bunch of starting workers, none running and > a bunch of queued jobs. > > I'm not sure what went wrong and where. I'll continue looking. In the mean time, I recommend killing the queued jobs, in the hope that it will trigger a failure of the queued jobs and a subsequent re-submission from swift. From hategan at mcs.anl.gov Thu Apr 23 21:44:51 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Apr 2009 21:44:51 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <1240538133.1093.4.camel@localhost> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> <1240538133.1093.4.camel@localhost> Message-ID: <1240541091.1616.5.camel@localhost> On Thu, 2009-04-23 at 20:55 -0500, Mihael Hategan wrote: > On Thu, 2009-04-23 at 20:42 -0500, Glen Hocky wrote: > > from now on, if i have a question about a site, i'll sync my > > remote:~/.globus/coasters folder to > > /home/hockyg/oops/swift/coaster_logs/SITE > > > > i just did that for ranger > > /home/hockyg/oops/swift/coaster_logs/ranger/coasters > > What I see in the logs is a bunch of starting workers, none running and > a bunch of queued jobs. 
> > I'm not sure what went wrong and where. I'll continue looking. I committed a patch (cog r2391) that tries to avoid the problem seen there (that of an improper count of active workers which keeps growing to the maximum allowed and prevents other workers from being created).

From hockyg at uchicago.edu Fri Apr 24 01:17:54 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Fri, 24 Apr 2009 01:17:54 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <1240541091.1616.5.camel@localhost> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> <1240538133.1093.4.camel@localhost> <1240541091.1616.5.camel@localhost> Message-ID: <49F15992.4040003@uchicago.edu>

I upgraded to the newest cog and swift (Swift svn swift-r2880 cog-r2391) and got new behavior this time: instead of just 16 jobs going to the queue, I got up to 50 before I killed it. So that's probably a good thing. The bad thing is that a job became active before any of the jobs in the queue started, and I started getting error messages (error transfer log messages). Everything is in /home/hockyg/oops/swift/output/rangeroutdir.3000 and, as promised, new ranger coaster logs are in > /home/hockyg/oops/swift/coaster_logs/ranger/coasters Sorry I don't have more time to stress test this tonight. Glen

Mihael Hategan wrote: > On Thu, 2009-04-23 at 20:55 -0500, Mihael Hategan wrote: > >> On Thu, 2009-04-23 at 20:42 -0500, Glen Hocky wrote: >> >>> from now on, if i have a question about a site, i'll sync my >>> remote:~/.globus/coasters folder to >>> /home/hockyg/oops/swift/coaster_logs/SITE >>> >>> i just did that for ranger >>> /home/hockyg/oops/swift/coaster_logs/ranger/coasters >>> >> What I see in the logs is a bunch of starting workers, none running and >> a bunch of queued jobs. >> >> I'm not sure what went wrong and where. I'll continue looking.
>> > > I committed a patch (cog r2391) that tries to avoid the problem seen > there (that of improper count of active workers which keeps growing to > the maximum allowed and prevents other workers from being created). > >

From benc at hawaga.org.uk Fri Apr 24 05:53:18 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 24 Apr 2009 10:53:18 +0000 (GMT) Subject: [Swift-devel] swift 0.9rc2 -> 0.9 final on Sunday Message-ID:

I've not heard compelling reasons for 0.9rc2 to not become 0.9 final, so I still intend to make that release on Sunday. --

From benc at hawaga.org.uk Fri Apr 24 07:50:31 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 24 Apr 2009 12:50:31 +0000 (GMT) Subject: [Swift-devel] 0.9 release notes draft Message-ID:

Below are the draft release notes for 0.9. They need more formatting, but all the text is there.

I have said little about coasters other than documenting two new configuration parameters and 'substantial ongoing development work'. If there are any specific additional points to be made there, comment here.

======

User interface
==============

*** Added console text interface to provide live information about swift runs, which can be enabled with the -tui commandline parameter

*** when replication is enabled, swift will locally kill jobs that have run for twice their specified walltime

Execution modes
===============

*** Support for Condor-G submissions, by setting a job specification attribute of "grid" and specifying a gridResource attribute containing the string to be placed into the Condor-G grid_resource classad.

*** Support for submissions to a local condor pool

*** Coasters: substantial ongoing development work

Configuration parameters
========================

*** Environment variable SWIFT_EXTRA_INFO, when set in an environment profile, is executed with the result being recorded in wrapper logs. This allows arbitrary information about the remote environment to be gathered and returned to the submit side.
*** New configuration option wrapper.invocation.mode, specifiable either globally in the configuration file or per-site as a profile key, that configures whether wrapper script invocations are made with an absolute path (as was the behaviour in Swift 0.8) or with a relative path (as was the behaviour in previous versions of Swift).

*** coasterWorkerMaxwalltime - a coaster parameter to explicitly set worker maxwalltime, overriding the default computation based on job walltimes. This should be useful when it is desirable to specify worker parameters based on the known properties of the target queue rather than on properties of the jobs to be executed inside coasters.

*** coasterInternalIP parameter that allows the IP address used by coaster workers to communicate to the coaster head job to be set explicitly. This can be used when running on a cluster with an internal network which cannot access the IP address that the head node picks.

*** TODO maxheapsize setting on commandline...

*** property ticker.disable to disable runtime ticker display

*** strsplit function which will split the input string based on separators that match the given pattern and return a string array.

New commands
============

*** The log-processing code, primarily exposed as the swift-plot-log command, has been merged into the main Swift release, rather than being a separate download.

*** There is a new command swift-osg-ress-site-catalog which will generate a site catalog based on data gathered from OSG's ReSS information system. This can be used by OSG users to easily generate a large number of site catalog entries for sites that are likely to work.

Language changes
================

*** Procedure invocations can be made in any expression context now, rather than only directly in an assignment.

*** Mappings can now be made in any declaration, whether it has an assignment or not. Previously, a procedure call assignment and a mapping could not be made in the same statement.

*** Handling of [*] and . has changed. [*] has been broken for several Swift releases, but now has some approximation of its former behaviour, in a cleaner implementation. [*] is now an identity operation, that is array[*] == array. The structure access operator . can now take an array on the left hand side. In situations where a[*].img was used, it is permissible to continue to write a[*].img, or to write a.img - both of these will return an array of img elements.

*** Tighter type checking on app blocks. Primitive types and arrays of primitive types are permitted. Other types are prohibited.

*** Arrays of primitive types can be passed to app blocks, and will be expanded into multiple command-line parameters, one parameter per array element.

*** The != operator had been broken forever. It works now.

*** The output of trace() is changed for non-primitive datasets:

* Array handling: Previously, trace would show the Java object representation of a (Future)PairIterator for arrays, which is fairly useless for a user. This commit makes trace show more about the array, and only emit the trace when the array is closed.

* Other datasets: Trace will wait for those datasets to be closed, and emit their default string representation (including the variable name and path used in SwiftScript).

*** The loop condition in iterate statements can now refer to variables declared within the iteration body (this was bug 177)

Deprecations and removal of old functionality
=============================================

*** Removed support for the .dtm file extension, which was deprecated in Swift 0.4

Internal changes
================

*** The wrapper.sh and seq.sh scripts that are deployed to remote sites to help with execution have been renamed to more Swift-specific names, to avoid collision with user-supplied files of the same name.
The new names are _swiftwrap and _swiftseq

*** Recompilation will happen if a .kml file was compiled with a different version of Swift to the version being invoked. This is in addition to the existing behaviour where a .swift file will be recompiled if it is newer than the corresponding .kml file.

*** Added a throttling parallelFor and changed the underlying implementation of the swift foreach to it. The throttling parallelFor limits the number of concurrent iterations allowed, which should allow swift to scale better in certain cases. This is controlled by the configuration property foreach.max.threads --

From bugzilla-daemon at mcs.anl.gov Fri Apr 24 07:53:07 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 24 Apr 2009 07:53:07 -0500 (CDT) Subject: [Swift-devel] [Bug 202] New: ticker.disable and -tui parameters should merge into single display mode choice Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=202 Summary: ticker.disable and -tui parameters should merge into single display mode choice Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk

The output form of swift is controlled by both the ticker.disable config property and the -tui command line option. These should merge into a single parameter allowing the output display format to be chosen (none, ticker, tui).

-- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter.

From wilde at mcs.anl.gov Fri Apr 24 08:43:41 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Apr 2009 08:43:41 -0500 Subject: [Swift-devel] 0.9 release notes draft In-Reply-To: References: Message-ID: <49F1C20D.60308@mcs.anl.gov>

That's a nice package of work. Kudos to all involved!
I think documenting some things about coasters is important, primarily how jobs are sent to workers based on time (and that there is a default time of 10 minutes if time is not specified). I found this very confusing - not because it's complicated, but because the current text on this topic was incomplete and not clear enough. Mihael, can you do a read through the user guide and parameter descriptions on coasters, and either write what's needed or list what needs to be written later?

Other minor things from a quick read:

> *** when replication is enabled, swift will locally kill jobs that have > run for twice their specified walltime

what does "locally kill" mean?

> *** Recompilation will happen if a .kml file was compiled with a different

This was an occasional problem: does this close the last known glitch in avoiding recompilation, or are there any other cases where a user could get tripped up on this? I.e., ideally should the user not need to know that Swift applies this heuristic?

- Mike

On 4/24/09 7:50 AM, Ben Clifford wrote: > Below are the draft release notes for 0.9. They need more formatting, but > all the text is there. > > I have said little about coasters other than document two new > configuration parameters and 'substantial ongoing development work'. If > there are any specific additional points to be made there, comment here. > > ====== > > > User interface > ============== > > *** Added console text interface to provide live information about swift > runs, which can be enabled with the -tui commandline parameter > > *** when replication is enabled, swift will locally kill jobs that have > run for twice their specified walltime > > Execution modes > =============== > > *** Support for Condor-G submissions, by setting a job specification attribute > of "grid" and specifying a gridResource attribute containing the string > to be placed into the Condor-G grid_resource classad.
> > *** Support for submissions to a local condor pool > > *** Coasters: substantial ongoing development work > > Configuration parameters > ======================== > > *** Environment variable SWIFT_EXTRA_INFO, when set in an environment profile, > is executed with the result being recorded in wrapper logs. This allows > arbitrary information about the remote environment to be gathered and > returned to the submit side. > > *** New configuration option wrapper.invocation.mode, specifiable either > globally in the configuration file or per-site as a profile key, > that configures whether wrapper script invocations are made with an > absolute path (as was the behaviour in Swift 0.8) or with a relative > path (as was behaviour in previous versions of Swift). > > *** coasterWorkerMaxwalltime - a coaster parameter to explicitly set worker maxwalltime, overriding the default computation based on job walltimes. this should be useful when it is desirable to specify worker parameters based on the known properties of the target queue rather than on properties of the jobs to be executed inside coasters. > > *** coasterInternalIP parameter that allows the IP address used by coaster workers to communicate to the coaster head job to be set explicitly. this can be used when running on a cluster which an internal network which cannot access the IP address that the head node picks > > *** TODO maxheapsize setting on commandline... > > *** property ticker.disable to disable runtime ticker display > > *** strsplit function which will split the input string based on separators that match the given pattern and return a string array. > > New commands > ============ > > *** The log-processing code, primarily exposed as the swift-plot-log command, > has been merged into the main Swift release, rather than being a separate > download. 
> > *** There is a new command swift-osg-ress-site-catalog which will generate > a site catalog based on data gathered from OSG's ReSS information > system. This can be used by OSG users to easily generate a large number > of site catalog entries for sites that are likely to work. > > Language changes > ================ > > *** Procedure invocations can be made in any expression context now, rather > than only directly in an assignment. > > *** Mappings can now be made in any declaration, whether it has an assignment > or not. Previous a procedure call assignment and a mapping could not be > made in the same statement. > > *** Handling of [*] and . has changed. [*] has been broken for several > Swift releases, but now has some approximation of its former behaviour, > in a cleaner implementation. > [*] is now an identity operation, that is array[*] == array. > The structure access operator . can now take an array on the left > hand side. In situations where a[*].img was used, it is permissible > to continue to write a[*].img, or to write a.img - both of these will > return an array of img elements. > > *** Tighter type checking on app blocks. Primitive types and arrays of > primitive types are permitted. Other types are prohibited. > > *** Arrays of primitive types can be passed to app blocks, and will be > expanded into multiple command-line parameters, one parameter per > array element. > > *** != operator had been broken forever. It works now. > > *** output of trace() is changed for non-primitive datasets: > > * Array handling: > Previous to this, trace would show the Java object representation of a (Future)PairIterator for arrays, which is fairly useless for a user. This commit makes trace show more about the array, and only emit the trace when the array is closed. 
> > * Other datasets: > Trace will wait for those datasets to be closed, and emit their default > string representation (including the variable name and path used in > SwiftScript) > > *** loop condition in iterate statements can now refer to variables declared within the iteration body (this was bug 177) > > *** Mappings can now be made in any declaration, whether it has an assignment or not. > > > > Deprecations and removal of old functionality > ============================================= > > *** Removed support for .dtm file extension which was deprecated in Swift 0.4 > > Internal changes > ================ > > *** The wrapper.sh and seq.sh scripts that are deployed to remote sites to > help with execution have been renamed to more Swift specific names, to > avoid collision with user-supplied files of the same name. The new names > are _swiftwrap and _swiftseq > > *** Recompilation will happen if a .kml file was compiled with a different > version of Swift to the version being invoked. This is in addition to the > existing behaviour where a .swift file will be recompiled if it is newer > than the corresponding .kml file. > > *** Added a throttling parallelFor and changed the swift foreach underlying > implementation to it. The throttling parallelFor limits the number of > concurrent iterations allowed, which should allow swift to scale better > in certain cases. This is controlled by the configuration property > foreach.max.threads > From benc at hawaga.org.uk Fri Apr 24 08:59:53 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 24 Apr 2009 13:59:53 +0000 (GMT) Subject: [Swift-devel] 0.9 release notes draft In-Reply-To: <49F1C20D.60308@mcs.anl.gov> References: <49F1C20D.60308@mcs.anl.gov> Message-ID: On Fri, 24 Apr 2009, Michael Wilde wrote: > > *** when replication is enabled, swift will locally kill jobs that have > > run for twice their specified walltime > > what does "locally kill" mean? 
Traditionally, the maxwalltime setting was passed to the execution layer and would end up somewhere like PBS for enforcement deep in the stack. If something was broken or not implemented somewhere between Swift and that enforcer, then maxwalltimes would have no effect. Local enforcement means that the Swift client will attempt to kill a job that has run for 2x its specified maxwalltime, without relying on communication with anything outside of the swift client.

This is tied in with replication in that it is behaviour that happens when the swift client believes a job has been in a particular state for too long. For the queued state, the time is based on a simple analysis of existing queue times, and the enforcement behaviour is to launch a replica. For the active (running) state, the time is based on maxwalltime, and the enforcement behaviour is to kill the job, which will then cause a retry (or a too-many-retries failure).

> > *** Recompilation will happen if a .kml file was compiled with a different > > This was an occasional problem: does this close the last known glitch in > avoiding recompilation, or are there any other cases where a user could get > tripped up on this? I.e., ideally should the user not need to know that Swift > applies this heuristic?

I think this catches everything in this area that has been catching people in the past. --

From wilde at mcs.anl.gov Fri Apr 24 09:13:37 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Apr 2009 09:13:37 -0500 Subject: [Swift-devel] Swift User Guide as MCS preprint Message-ID: <49F1C911.5010208@mcs.anl.gov>

I would like to submit the 0.9 Swift Users Guide as an MCS report, so that I can cite it in a grant proposal which doesn't permit URL-only citations. I will use as authors: Ben, Mihael, Ian, Sarah, Mike. I will do this today (as a pdf doc) unless there are any objections.
- Mike From foster at anl.gov Fri Apr 24 09:16:25 2009 From: foster at anl.gov (Ian Foster) Date: Fri, 24 Apr 2009 09:16:25 -0500 Subject: [Swift-devel] Swift User Guide as MCS preprint In-Reply-To: <49F1C911.5010208@mcs.anl.gov> References: <49F1C911.5010208@mcs.anl.gov> Message-ID: <6E9834FF-AE84-4248-9E58-CA5041C2518E@anl.gov> great. If you want me to review before that is done, let me know. On Apr 24, 2009, at 9:13 AM, Michael Wilde wrote: > I would like to submit the 0.9 Swift Users Guide as an MCS report, > so that I can cite it in a grant proposal which doesnt permit URL- > only citations. > > I will use as authors: Ben, Mihael, Ian, Sarah, Mike. > > I will do this today (as a pdf doc) unless there are any objections. > > - Mike > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Fri Apr 24 09:18:04 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 24 Apr 2009 14:18:04 +0000 (GMT) Subject: [Swift-devel] Swift User Guide as MCS preprint In-Reply-To: <49F1C911.5010208@mcs.anl.gov> References: <49F1C911.5010208@mcs.anl.gov> Message-ID: On Fri, 24 Apr 2009, Michael Wilde wrote: > I will use as authors: Ben, Mihael, Ian, Sarah, Mike. SVN history additionally gives tibi and yong as having committed stuff to the user guide XML file. -- From benc at hawaga.org.uk Fri Apr 24 09:44:19 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 24 Apr 2009 14:44:19 +0000 (GMT) Subject: [Swift-devel] Swift User Guide as MCS preprint In-Reply-To: References: <49F1C911.5010208@mcs.anl.gov> Message-ID: On Fri, 24 Apr 2009, Ben Clifford wrote: > SVN history additionally gives tibi and yong as having committed stuff to > the user guide XML file. Mats also has a small amount of content in that document, though not committed himself. 
--

From wilde at mcs.anl.gov Fri Apr 24 09:53:44 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Apr 2009 09:53:44 -0500 Subject: [Swift-devel] Swift User Guide as MCS preprint In-Reply-To: References: <49F1C911.5010208@mcs.anl.gov> Message-ID: <49F1D278.4040607@mcs.anl.gov>

Sounds great - yes, thanks, I will include them. Certainly Yong was instrumental in getting it all started. I want to list all who contributed.

On 4/24/09 9:18 AM, Ben Clifford wrote: > On Fri, 24 Apr 2009, Michael Wilde wrote: > >> I will use as authors: Ben, Mihael, Ian, Sarah, Mike. > > SVN history additionally gives tibi and yong as having committed stuff to > the user guide XML file. >

From wilde at mcs.anl.gov Fri Apr 24 10:24:12 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Apr 2009 10:24:12 -0500 Subject: [Swift-devel] Swift User Guide as MCS preprint In-Reply-To: References: <49F1C911.5010208@mcs.anl.gov> Message-ID: <49F1D99C.7080106@mcs.anl.gov>

Ben, if you could gen the PDF for me that would be great; else if there is no rush I will do it on Tuesday after the grant is in. (I'm in no rush, I just need to be able to insert the citation for now)

I can add the front and back in Word then bind all as pdf, or we can put the info below in docbook in svn (which would be better)

All: please see people named below, let me know who else to add.

The Doc # is: ANL/MCS-P1614-0409

It needs a cover page, and a trailer page that has the ack listed on: http://dsl-wiki.cs.uchicago.edu/index.php/Wiki:PublicationInstructions which is "The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ("Argonne"). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S.
Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government." as well as an NSF DOE and NIH ack: This work was supported in part by: the National Science Foundation under Grant OCI-0721939 NASA Ames Research Center GSRP Grant Number NNA06CB89H the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy, under Contract DE-AC02-06CH11357, the National Institute of Deafness and Other Communication Disorders of the National Institutes of Health under Grant R21/R33 DC008638, and by the Computation Institute of The University of Chicago. We thankfully acknowledge the contributions and usage feedback of: Mike Kubal Andrew Binkowski Nika Nefedova Andriy Fedorov Glen Hocky Zhao Zhang Allan Espinosa Mats Rynge Michael Andric Uri Hasson Yue Chen ...more On 4/24/09 9:44 AM, Ben Clifford wrote: > On Fri, 24 Apr 2009, Ben Clifford wrote: > >> SVN history additionally gives tibi and yong as having committed stuff to >> the user guide XML file. > > Mats also has a small amount of content in that document, though not > committed himself. > From foster at anl.gov Fri Apr 24 10:25:16 2009 From: foster at anl.gov (Ian Foster) Date: Fri, 24 Apr 2009 10:25:16 -0500 Subject: [Swift-devel] Swift User Guide as MCS preprint In-Reply-To: <49F1D99C.7080106@mcs.anl.gov> References: <49F1C911.5010208@mcs.anl.gov> <49F1D99C.7080106@mcs.anl.gov> Message-ID: We should really make this a CI preprint, not a MCS preprint. Ian. On Apr 24, 2009, at 10:24 AM, Michael Wilde wrote: > Ben, if you could gen the PDF for me that would be great; else if > there is no rush I will do it on Tuesday after the grant is in. 
(Im > in no rush, I just need to be able to insert the citation for now) > > I can add the front and back in Word then bind all as pdf, or we can > put the info below in docbook in svn (which would be better) > > All: please see people named below, let me know who else to add. > > The Doc # is: ANL/MCS-P1614-0409 > > It needs a cover page, and a trailer page that has the ack listed on: > > http://dsl-wiki.cs.uchicago.edu/index.php/Wiki:PublicationInstructions > > which is > > "The submitted manuscript has been created by UChicago Argonne, LLC, > Operator of Argonne National Laboratory ("Argonne"). Argonne, a U.S. > Department of Energy Office of Science laboratory, is operated under > Contract No. DE-AC02-06CH11357. The U.S. Government retains for > itself, and others acting on its behalf, a paid-up nonexclusive, > irrevocable worldwide license in said article to reproduce, prepare > derivative works, distribute copies to the public, and perform > publicly and display publicly, by or on behalf of the Government." > > as well as an NSF DOE and NIH ack: > > This work was supported in part by: > > the National Science Foundation under Grant OCI-0721939 > > NASA Ames Research Center GSRP Grant Number NNA06CB89H > > the Mathematical, Information, and Computational Sciences Division > subprogram of the Office of Advanced Scientific Computing Research, > Office of Science, U.S. Dept. of Energy, under Contract DE- > AC02-06CH11357, > > the National Institute of Deafness and Other Communication Disorders > of the National Institutes of Health under Grant R21/R33 DC008638, > > and by the Computation Institute of The University of Chicago. 
> > We thankfully acknowledge the contributions and usage feedback of: > Mike Kubal > Andrew Binkowski > Nika Nefedova > Andriy Fedorov > Glen Hocky > Zhao Zhang > Allan Espinosa > Mats Rynge > Michael Andric > Uri Hasson > Yue Chen > ...more > > > On 4/24/09 9:44 AM, Ben Clifford wrote: >> On Fri, 24 Apr 2009, Ben Clifford wrote: >>> SVN history additionally gives tibi and yong as having committed >>> stuff to the user guide XML file. >> Mats also has a small amount of content in that document, though >> not committed himself. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Fri Apr 24 10:34:22 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 24 Apr 2009 15:34:22 +0000 (GMT) Subject: [Swift-devel] Swift User Guide as MCS preprint In-Reply-To: <49F1D99C.7080106@mcs.anl.gov> References: <49F1C911.5010208@mcs.anl.gov> <49F1D99C.7080106@mcs.anl.gov> Message-ID:

> I can add the front and back in Word then bind all as pdf, or we can put > the info below in docbook in svn (which would be better)

I don't think the live ongoing userguide should be made to look like an MCS publication, when as soon as someone makes a commit, it won't be. The acknowledgements list you provide is something that should go on its own web page though.

The PDF at http://www.ci.uchicago.edu/swift/guides/userguide.pdf is (as of writing this email) up to date with the latest SVN docbook source, so that's probably what you should add boilerplate to.
> ...more Milena Nikolic -- From zhaozhang at uchicago.edu Fri Apr 24 12:52:02 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 24 Apr 2009 12:52:02 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F0A34C.6040306@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> Message-ID: <49F1FC42.1090008@uchicago.edu> Hi, All As I got mapped on gwynn, I redid the tests. The results are All language behaviour tests passed These sites failed: coaster/tgncsa-hg-coaster-pbs-gram2.xml coaster/tgncsa-hg-coaster-pbs-gram4.xml These sites worked: coaster/coaster-local.xml coaster/gwynn-coaster-gram2-gram2-condor.xml coaster/gwynn-coaster-gram2-gram2-fork.xml coaster/renci-engage-coaster.xml coaster/teraport-gt2-gt2-pbs.xml coaster/uj-pbs-gram2.xml Logs could be found at /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/log_all on CI network. zhao Zhao Zhang wrote: > Hi, again > > the test on teraport is successful, here is the log > > zhao > > testing site configuration: coaster/teraport-gt2-gt2-pbs.xml > Removing files from previous runs > Running test 061-cattwo at Thu Apr 23 11:27:19 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-1127-aluxx4m9 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Submitted:1 > ... > Progress: Active:1 > Progress: Finished successfully:1 > Final status: Finished successfully:1 > Cleaning up... 
> Shutting down service at https://128.135.125.118:57080 > Got channel MetaChannel: 2129305 -> GSSSChannel-null(1) > - Done > expecting 061-cattwo.out.expected > checking 061-cattwo.out.expected > Skipping exception test due to test configuration > Test passed at Thu Apr 23 11:57:39 CDT 2009 > ----------===========================---------- > Running test 130-fmri at Thu Apr 23 11:57:39 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-1157-r8sarc77 > Progress: > Progress: Selecting site:2 Initializing site shared directory:1 > Stage in:1 > Progress: Selecting site:2 Stage in:1 Submitting:1 > Progress: Selecting site:2 Submitting:1 Submitted:1 > ... > Progress: Selecting site:2 Submitted:2 > Progress: Selecting site:2 Submitted:2 > Progress: Selecting site:2 Submitted:2 > Progress: Selecting site:2 Submitted:1 Active:1 > Progress: Selecting site:2 Active:1 Stage out:1 > Progress: Selecting site:1 Stage in:1 Stage out:1 Finished > successfully:1 > Progress: Submitted:1 Stage out:1 Finished successfully:2 > Progress: Active:1 Finished successfully:4 > Progress: Submitting:2 Submitted:1 Finished successfully:5 > Progress: Active:2 Stage out:1 Finished successfully:5 > Progress: Submitted:1 Stage out:2 Finished successfully:8 > Final status: Finished successfully:11 > Cleaning up... 
> Shutting down service at https://128.135.125.118:52773 > Got channel MetaChannel: 28761475 -> GSSSChannel-null(1) > - Done > expecting 130-fmri.0000.jpeg.expected 130-fmri.0001.jpeg.expected > 130-fmri.0002.jpeg.expected > checking 130-fmri.0000.jpeg.expected > Skipping exception test due to test configuration > checking 130-fmri.0001.jpeg.expected > Skipping exception test due to test configuration > checking 130-fmri.0002.jpeg.expected > Skipping exception test due to test configuration > Test passed at Thu Apr 23 12:04:47 CDT 2009 > ----------===========================---------- > Running test 103-quote at Thu Apr 23 12:04:47 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-1204-sjzpkfd3 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Active:1 > Progress: Finished successfully:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://128.135.125.118:40813 > Got channel MetaChannel: 28500325 -> GSSSChannel-null(1) > - Done > expecting 103-quote.out.expected > checking 103-quote.out.expected > Skipping exception test due to test configuration > Test passed at Thu Apr 23 12:05:05 CDT 2009 > ----------===========================---------- > Running test 1032-singlequote at Thu Apr 23 12:05:05 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-1205-x2d55af3 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Active:1 > Progress: Finished successfully:1 > Final status: Finished successfully:1 > Cleaning up... 
> Shutting down service at https://128.135.125.118:44126 > Got channel MetaChannel: 18100302 -> GSSSChannel-null(1) > - Done > expecting 1032-singlequote.out.expected > checking 1032-singlequote.out.expected > Skipping exception test due to test configuration > Test passed at Thu Apr 23 12:05:22 CDT 2009 > ----------===========================---------- > Running test 1031-quote at Thu Apr 23 12:05:22 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-1205-5aa1ko4e > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://128.135.125.118:43759 > Got channel MetaChannel: 19002607 -> GSSSChannel-null(1) > - Done > expecting 1031-quote.*.expected > No expected output files specified for this test case - not checking > output. > Skipping exception test due to test configuration > Test passed at Thu Apr 23 12:05:38 CDT 2009 > ----------===========================---------- > Running test 1033-singlequote at Thu Apr 23 12:05:38 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-1205-8nopyujc > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Active:1 > Progress: Finished successfully:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://128.135.125.118:39924 > Got channel MetaChannel: 31196317 -> GSSSChannel-null(1) > - Done > expecting 1033-singlequote.out.expected > checking 1033-singlequote.out.expected > Skipping exception test due to test configuration > Test passed at Thu Apr 23 12:05:56 CDT 2009 > ----------===========================---------- > Running test 141-space-in-filename at Thu Apr 23 12:05:56 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-1205-aalqz1c4 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Active:1 > Progress: Finished successfully:1 > Final status: Finished successfully:1 > Cleaning up... 
> Shutting down service at https://128.135.125.118:60177 > Got channel MetaChannel: 4728458 -> GSSSChannel-null(1) > - Done > expecting 141-space-in-filename.space here.out.expected > checking 141-space-in-filename.space here.out.expected > Skipping exception test due to test configuration > Test passed at Thu Apr 23 12:06:15 CDT 2009 > ----------===========================---------- > Running test 142-space-and-quotes at Thu Apr 23 12:06:15 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090423-1206-8617gag1 > Progress: > Progress: Selecting site:2 Initializing site shared directory:1 > Stage in:1 > Progress: Selecting site:2 Submitting:1 Submitted:1 > Progress: Selecting site:2 Submitted:1 Active:1 > Progress: Selecting site:2 Active:1 Finished successfully:1 > Progress: Stage out:1 Finished successfully:3 > Final status: Finished successfully:4 > Cleaning up... > Shutting down service at https://128.135.125.118:57945 > Got channel MetaChannel: 16387060 -> GSSSChannel-null(1) > - Done > expecting 142-space-and-quotes.2" space ".out.expected > 142-space-and-quotes.3' space '.out.expected > 142-space-and-quotes.out.expected 142-space-and-quotes. space > .out.expected > checking 142-space-and-quotes.2" space ".out.expected > Skipping exception test due to test configuration > checking 142-space-and-quotes.3' space '.out.expected > Skipping exception test due to test configuration > checking 142-space-and-quotes.out.expected > Skipping exception test due to test configuration > checking 142-space-and-quotes. space .out.expected > Skipping exception test due to test configuration > Test passed at Thu Apr 23 12:06:35 CDT 2009 > ----------===========================---------- > All language behaviour tests passed > > > > Zhao Zhang wrote: >> Hi, Ben >> >> Ben Clifford wrote: >>> On Thu, 23 Apr 2009, Zhao Zhang wrote: >>> >>> >>>> Error 1: This is related to CI network setting, >>>> /etc/grid-security/hostcert.pem. Could anyone help on this? 
Who >>>> should I >>>> contact? >>>> >>> >>> fletch is broken. But try changing those sites files to use >>> gwynn.bsd.uchicago.edu instead. >>> >>> >>>> Error 2: My certificate is not enabled on teraport. As Mike and I >>>> talked last >>>> night, "certificate revocation list" on CI network is misconfigured. >>>> >>> >>> This looks more like a permissions problem - the directory being >>> used in the sites.xml file for that test does not exist and you do >>> not have permission to create it. >>> >>> In r2874 I have changed tests/sites/coaster/teraport-gt2-gt2-pbs.xml >>> to use a different path that should work for you now. >>> >> I tried this out and it failed, so I increased the wall-time to 15 >> minutes in the coaster/teraport-gt2-gt2-pbs.xml file. >> And I am waiting now. >> >> zhao >> >> [zzhang at communicado sites]$ ./run-site coaster/teraport-gt2-gt2-pbs.xml >> testing site configuration: coaster/teraport-gt2-gt2-pbs.xml >> Removing files from previous runs >> Running test 061-cattwo at Thu Apr 23 11:12:09 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090423-1112-6jqlxfcf >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Failed to transfer wrapper log from >> 061-cattwo-20090423-1112-6jqlxfcf/info/q on teraport >> Failed to transfer wrapper log from >> 061-cattwo-20090423-1112-6jqlxfcf/info/s on teraport >> Progress: Stage in:1 >> Failed to transfer wrapper log from >> 061-cattwo-20090423-1112-6jqlxfcf/info/u on teraport >> Progress: Failed:1 >> Execution failed: >> Exception in cat: >> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] >> Host: teraport >> Directory: 061-cattwo-20090423-1112-6jqlxfcf/jobs/u/cat-umlmrs9j >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Job cannot be run with the given max walltime worker >> constraint (task: 600, maxwalltime: 300s) >> Cleaning up...
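The "Job cannot be run with the given max walltime worker constraint" failure above means the task's requested walltime (600 s) exceeds what the coaster worker blocks offer (300 s). Raising the wall time in the sites file, as Zhao did, amounts to a profile entry along these lines (a sketch only: the host URL, work directory, and exact layout of the real teraport-gt2-gt2-pbs.xml are assumptions):

```xml
<pool handle="teraport">
  <!-- coaster execution through GT2 GRAM to PBS -->
  <execution provider="coaster" url="tp-grid1.ci.uchicago.edu"
             jobManager="gt2:pbs"/>
  <!-- per-job walltime, in minutes: must be at least as large as the
       longest task, so a 600 s task fits inside a worker block -->
  <profile namespace="globus" key="maxwalltime">15</profile>
  <workdirectory>/home/zzhang/swiftwork</workdirectory>
</pool>
```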
>> Shutting down service at https://128.135.125.118:58204 >> Got channel MetaChannel: 1297642 -> GSSSChannel-null(1) >> - Done >> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >> >>> >>>> Error 3 & Error 4: I am not active on tgncsa site. Mike said he >>>> needed to add >>>> me to another group. >>>> >>> >>> yes. >>> >>> Do you have the list from the end of your test run about which sites >>> worked and which did not? >>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri Apr 24 13:09:09 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Apr 2009 13:09:09 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <49F15992.4040003@uchicago.edu> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> <1240538133.1093.4.camel@localhost> <1240541091.1616.5.camel@localhost> <49F15992.4040003@uchicago.edu> Message-ID: <1240596549.9287.2.camel@localhost> On Fri, 2009-04-24 at 01:17 -0500, Glen Hocky wrote: > I upgraded to newest cog and swift (Swift svn swift-r2880 cog-r2391) > I got new behavior this time. Instead of just 16 jobs to the queue, I > got up to 50 before I killed it. So that's probably a good thing. > > The bad thing is that a job became active before any of the jobs in the > queue started and I started getting error messages (error transfer log > messages) How much time before?
> > Everything is in > /home/hockyg/oops/swift/output/rangeroutdir.3000 > > and as promised, new ranger coaster logs in > > /home/hockyg/oops/swift/coaster_logs/ranger/coasters > > sorry i don't have more time to stress test this tonight You seem to be hitting an old problem that is messy to reproduce because (I think) it involves certain proxies or credentials. I've put some code in (cog r2392) to alleviate some of the symptoms. From wilde at mcs.anl.gov Fri Apr 24 13:29:37 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Apr 2009 13:29:37 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <1240596549.9287.2.camel@localhost> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> <1240538133.1093.4.camel@localhost> <1240541091.1616.5.camel@localhost> <49F15992.4040003@uchicago.edu> <1240596549.9287.2.camel@localhost> Message-ID: <49F20511.1080005@mcs.anl.gov> Any pointers that might enable Zhao to reproduce these things in testing? On 4/24/09 1:09 PM, Mihael Hategan wrote: > On Fri, 2009-04-24 at 01:17 -0500, Glen Hocky wrote: >> I upgraded to newest cog and swift (Swift svn swift-r2880 cog-r2391) >> I got new behavior this time. instead of just 16 jobs to the queue, i >> got up to 50 before I killed. So that's probably a good thing. >> >> The bad thing is that a job became active before any of the jobs in the >> queue started and i started getting error messages (error transfer log >> messages) > > How much time before? > >> Everything is in >> /home/hockyg/oops/swift/output/rangeroutdir.3000 >> >> and as promised, new ranger coaster logs in >>> /home/hockyg/oops/swift/coaster_logs/ranger/coasters >> sorry i don't have more time to stress test this tonight > > You seem to be hitting an old problem that is messy to reproduce because > (I think) it involves certain proxies or credentials. I've put some code > in (cog r2392) to alleviate some of the symptoms. 
> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Fri Apr 24 13:46:05 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Apr 2009 13:46:05 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <49F20511.1080005@mcs.anl.gov> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> <1240538133.1093.4.camel@localhost> <1240541091.1616.5.camel@localhost> <49F15992.4040003@uchicago.edu> <1240596549.9287.2.camel@localhost> <49F20511.1080005@mcs.anl.gov> Message-ID: <1240598765.10279.0.camel@localhost> On Fri, 2009-04-24 at 13:29 -0500, Michael Wilde wrote: > Any pointers that might enable Zhao to reproduce these things in testing? Yeah. Use Glen's proxy. > > On 4/24/09 1:09 PM, Mihael Hategan wrote: > > On Fri, 2009-04-24 at 01:17 -0500, Glen Hocky wrote: > >> I upgraded to newest cog and swift (Swift svn swift-r2880 cog-r2391) > >> I got new behavior this time. instead of just 16 jobs to the queue, i > >> got up to 50 before I killed. So that's probably a good thing. > >> > >> The bad thing is that a job became active before any of the jobs in the > >> queue started and i started getting error messages (error transfer log > >> messages) > > > > How much time before? > > > >> Everything is in > >> /home/hockyg/oops/swift/output/rangeroutdir.3000 > >> > >> and as promised, new ranger coaster logs in > >>> /home/hockyg/oops/swift/coaster_logs/ranger/coasters > >> sorry i don't have more time to stress test this tonight > > > > You seem to be hitting an old problem that is messy to reproduce because > > (I think) it involves certain proxies or credentials. I've put some code > > in (cog r2392) to alleviate some of the symptoms. 
> > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Apr 24 14:14:16 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Apr 2009 14:14:16 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <1240598765.10279.0.camel@localhost> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> <1240538133.1093.4.camel@localhost> <1240541091.1616.5.camel@localhost> <49F15992.4040003@uchicago.edu> <1240596549.9287.2.camel@localhost> <49F20511.1080005@mcs.anl.gov> <1240598765.10279.0.camel@localhost> Message-ID: <49F20F88.5050002@mcs.anl.gov> On 4/24/09 1:46 PM, Mihael Hategan wrote: > On Fri, 2009-04-24 at 13:29 -0500, Michael Wilde wrote: >> Any pointers that might enable Zhao to reproduce these things in testing? > > Yeah. Use Glen's proxy. OK. Glen, when you get a chance, can you post how Zhao might test with a proxy similar to yours? Or even run the tests with you, under yours? I'm not following this thread closely enough to know how the proxy figures into the problem, but it sounds like it does. My goal is that testing beats up the system and finds bugs before users do, so if there are indeed strange aspects of various proxies (or user environments) we should try to understand that. Thanks, - Mike >> On 4/24/09 1:09 PM, Mihael Hategan wrote: >>> On Fri, 2009-04-24 at 01:17 -0500, Glen Hocky wrote: >>>> I upgraded to newest cog and swift (Swift svn swift-r2880 cog-r2391) >>>> I got new behavior this time. instead of just 16 jobs to the queue, i >>>> got up to 50 before I killed. So that's probably a good thing. >>>> >>>> The bad thing is that a job became active before any of the jobs in the >>>> queue started and i started getting error messages (error transfer log >>>> messages) >>> How much time before?
>>> >>>> Everything is in >>>> /home/hockyg/oops/swift/output/rangeroutdir.3000 >>>> >>>> and as promised, new ranger coaster logs in >>>>> /home/hockyg/oops/swift/coaster_logs/ranger/coasters >>>> sorry i don't have more time to stress test this tonight >>> You seem to be hitting an old problem that is messy to reproduce because >>> (I think) it involves certain proxies or credentials. I've put some code >>> in (cog r2392) to alleviate some of the symptoms. >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From zhaozhang at uchicago.edu Fri Apr 24 14:16:10 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 24 Apr 2009 14:16:10 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <49F20F88.5050002@mcs.anl.gov> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> <1240538133.1093.4.camel@localhost> <1240541091.1616.5.camel@localhost> <49F15992.4040003@uchicago.edu> <1240596549.9287.2.camel@localhost> <49F20511.1080005@mcs.anl.gov> <1240598765.10279.0.camel@localhost> <49F20F88.5050002@mcs.anl.gov> Message-ID: <49F20FFA.4090309@uchicago.edu> Yes, that will be helpful, I am tracing back this email thread gathering information. zhao Michael Wilde wrote: > > > On 4/24/09 1:46 PM, Mihael Hategan wrote: >> On Fri, 2009-04-24 at 13:29 -0500, Michael Wilde wrote: >>> Any pointers that might enable Zhao to reproduce these things in >>> testing? >> >> Yeah. Use Glen's proxy. > > OK. Glen, when you get a chance, can you post how Zhao might test with > a proxy similar to yours? Or even run the tests with you, under yours? > > Im not following this thread close enough to know how the proxy > figures into the problem, but sounds like it does. 
> > My goal is that testing beat up the system and find bugs before users, > so if there indeed strange aspects of various proxies (or user > environments) we should try to understand that. > > Thanks, > > - Mike > >>> On 4/24/09 1:09 PM, Mihael Hategan wrote: >>>> On Fri, 2009-04-24 at 01:17 -0500, Glen Hocky wrote: >>>>> I upgraded to newest cog and swift (Swift svn swift-r2880 cog-r2391) >>>>> I got new behavior this time. instead of just 16 jobs to the >>>>> queue, i got up to 50 before I killed. So that's probably a good >>>>> thing. >>>>> >>>>> The bad thing is that a job became active before any of the jobs >>>>> in the queue started and i started getting error messages (error >>>>> transfer log messages) >>>> How much time before? >>>> >>>>> Everything is in >>>>> /home/hockyg/oops/swift/output/rangeroutdir.3000 >>>>> >>>>> and as promised, new ranger coaster logs in >>>>>> /home/hockyg/oops/swift/coaster_logs/ranger/coasters >>>>> sorry i don't have more time to stress test this tonight >>>> You seem to be hitting an old problem that is messy to reproduce >>>> because >>>> (I think) it involves certain proxies or credentials. I've put some >>>> code >>>> in (cog r2392) to alleviate some of the symptoms. 
>>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Fri Apr 24 14:30:49 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Apr 2009 14:30:49 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F1FC42.1090008@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> Message-ID: <49F21369.1040307@mcs.anl.gov> Great, Zhao. What's next? Testing of the new condor provider features is important. For failures, it would be good to see if you can go further in at least lifting out the errors, so that developers can tell you whether each one was an error in the testing (which you should fix) or an error in the code (which they should fix, with your help getting them the information they need). That would also identify faster which errors need more immediate attention. Can you suggest a methodical approach to testing, in terms of: - what tests you need to and plan to run, on what systems? - how the reports are organized - how errors are listed and diagnosed I want this to be a more interactive process between you and the developers, not just "it broke, see dir X". Thanks, Mike On 4/24/09 12:52 PM, Zhao Zhang wrote: > Hi, All > > As I got mapped on gwynn, I redid the tests.
The results are > > All language behaviour tests passed > These sites failed: coaster/tgncsa-hg-coaster-pbs-gram2.xml > coaster/tgncsa-hg-coaster-pbs-gram4.xml > These sites worked: coaster/coaster-local.xml > coaster/gwynn-coaster-gram2-gram2-condor.xml > coaster/gwynn-coaster-gram2-gram2-fork.xml > coaster/renci-engage-coaster.xml coaster/teraport-gt2-gt2-pbs.xml > coaster/uj-pbs-gram2.xml > > Logs could be found at > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/log_all on CI > network. > > zhao > > Zhao Zhang wrote: >> Hi, again >> >> the test on teraport is successful, here is the log >> >> zhao >> >> testing site configuration: coaster/teraport-gt2-gt2-pbs.xml >> Removing files from previous runs >> Running test 061-cattwo at Thu Apr 23 11:27:19 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090423-1127-aluxx4m9 >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Progress: Submitted:1 >> ... >> Progress: Active:1 >> Progress: Finished successfully:1 >> Final status: Finished successfully:1 >> Cleaning up... >> Shutting down service at https://128.135.125.118:57080 >> Got channel MetaChannel: 2129305 -> GSSSChannel-null(1) >> - Done >> expecting 061-cattwo.out.expected >> checking 061-cattwo.out.expected >> Skipping exception test due to test configuration >> Test passed at Thu Apr 23 11:57:39 CDT 2009 >> ----------===========================---------- >> Running test 130-fmri at Thu Apr 23 11:57:39 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090423-1157-r8sarc77 >> Progress: >> Progress: Selecting site:2 Initializing site shared directory:1 >> Stage in:1 >> Progress: Selecting site:2 Stage in:1 Submitting:1 >> Progress: Selecting site:2 Submitting:1 Submitted:1 >> ... 
>> Progress: Selecting site:2 Submitted:2 >> Progress: Selecting site:2 Submitted:2 >> Progress: Selecting site:2 Submitted:2 >> Progress: Selecting site:2 Submitted:1 Active:1 >> Progress: Selecting site:2 Active:1 Stage out:1 >> Progress: Selecting site:1 Stage in:1 Stage out:1 Finished >> successfully:1 >> Progress: Submitted:1 Stage out:1 Finished successfully:2 >> Progress: Active:1 Finished successfully:4 >> Progress: Submitting:2 Submitted:1 Finished successfully:5 >> Progress: Active:2 Stage out:1 Finished successfully:5 >> Progress: Submitted:1 Stage out:2 Finished successfully:8 >> Final status: Finished successfully:11 >> Cleaning up... >> Shutting down service at https://128.135.125.118:52773 >> Got channel MetaChannel: 28761475 -> GSSSChannel-null(1) >> - Done >> expecting 130-fmri.0000.jpeg.expected 130-fmri.0001.jpeg.expected >> 130-fmri.0002.jpeg.expected >> checking 130-fmri.0000.jpeg.expected >> Skipping exception test due to test configuration >> checking 130-fmri.0001.jpeg.expected >> Skipping exception test due to test configuration >> checking 130-fmri.0002.jpeg.expected >> Skipping exception test due to test configuration >> Test passed at Thu Apr 23 12:04:47 CDT 2009 >> ----------===========================---------- >> Running test 103-quote at Thu Apr 23 12:04:47 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090423-1204-sjzpkfd3 >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Progress: Active:1 >> Progress: Finished successfully:1 >> Final status: Finished successfully:1 >> Cleaning up... 
>> Shutting down service at https://128.135.125.118:40813 >> Got channel MetaChannel: 28500325 -> GSSSChannel-null(1) >> - Done >> expecting 103-quote.out.expected >> checking 103-quote.out.expected >> Skipping exception test due to test configuration >> Test passed at Thu Apr 23 12:05:05 CDT 2009 >> ----------===========================---------- >> Running test 1032-singlequote at Thu Apr 23 12:05:05 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090423-1205-x2d55af3 >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Progress: Active:1 >> Progress: Finished successfully:1 >> Final status: Finished successfully:1 >> Cleaning up... >> Shutting down service at https://128.135.125.118:44126 >> Got channel MetaChannel: 18100302 -> GSSSChannel-null(1) >> - Done >> expecting 1032-singlequote.out.expected >> checking 1032-singlequote.out.expected >> Skipping exception test due to test configuration >> Test passed at Thu Apr 23 12:05:22 CDT 2009 >> ----------===========================---------- >> Running test 1031-quote at Thu Apr 23 12:05:22 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090423-1205-5aa1ko4e >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Progress: Active:1 >> Final status: Finished successfully:1 >> Cleaning up... >> Shutting down service at https://128.135.125.118:43759 >> Got channel MetaChannel: 19002607 -> GSSSChannel-null(1) >> - Done >> expecting 1031-quote.*.expected >> No expected output files specified for this test case - not checking >> output. 
>> Skipping exception test due to test configuration >> Test passed at Thu Apr 23 12:05:38 CDT 2009 >> ----------===========================---------- >> Running test 1033-singlequote at Thu Apr 23 12:05:38 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090423-1205-8nopyujc >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Progress: Active:1 >> Progress: Finished successfully:1 >> Final status: Finished successfully:1 >> Cleaning up... >> Shutting down service at https://128.135.125.118:39924 >> Got channel MetaChannel: 31196317 -> GSSSChannel-null(1) >> - Done >> expecting 1033-singlequote.out.expected >> checking 1033-singlequote.out.expected >> Skipping exception test due to test configuration >> Test passed at Thu Apr 23 12:05:56 CDT 2009 >> ----------===========================---------- >> Running test 141-space-in-filename at Thu Apr 23 12:05:56 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090423-1205-aalqz1c4 >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Progress: Active:1 >> Progress: Finished successfully:1 >> Final status: Finished successfully:1 >> Cleaning up... 
>> Shutting down service at https://128.135.125.118:60177 >> Got channel MetaChannel: 4728458 -> GSSSChannel-null(1) >> - Done >> expecting 141-space-in-filename.space here.out.expected >> checking 141-space-in-filename.space here.out.expected >> Skipping exception test due to test configuration >> Test passed at Thu Apr 23 12:06:15 CDT 2009 >> ----------===========================---------- >> Running test 142-space-and-quotes at Thu Apr 23 12:06:15 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090423-1206-8617gag1 >> Progress: >> Progress: Selecting site:2 Initializing site shared directory:1 >> Stage in:1 >> Progress: Selecting site:2 Submitting:1 Submitted:1 >> Progress: Selecting site:2 Submitted:1 Active:1 >> Progress: Selecting site:2 Active:1 Finished successfully:1 >> Progress: Stage out:1 Finished successfully:3 >> Final status: Finished successfully:4 >> Cleaning up... >> Shutting down service at https://128.135.125.118:57945 >> Got channel MetaChannel: 16387060 -> GSSSChannel-null(1) >> - Done >> expecting 142-space-and-quotes.2" space ".out.expected >> 142-space-and-quotes.3' space '.out.expected >> 142-space-and-quotes.out.expected 142-space-and-quotes. space >> .out.expected >> checking 142-space-and-quotes.2" space ".out.expected >> Skipping exception test due to test configuration >> checking 142-space-and-quotes.3' space '.out.expected >> Skipping exception test due to test configuration >> checking 142-space-and-quotes.out.expected >> Skipping exception test due to test configuration >> checking 142-space-and-quotes. 
space .out.expected >> Skipping exception test due to test configuration >> Test passed at Thu Apr 23 12:06:35 CDT 2009 >> ----------===========================---------- >> All language behaviour tests passed >> >> >> >> Zhao Zhang wrote: >>> Hi, Ben >>> >>> Ben Clifford wrote: >>>> On Thu, 23 Apr 2009, Zhao Zhang wrote: >>>> >>>> >>>>> Error 1: This is related to CI network setting, >>>>> /etc/grid-security/hostcert.pem. Could anyone help on this? Who >>>>> should I >>>>> contact? >>>>> >>>> >>>> fletch is broken. But try changing those sites files to use >>>> gwynn.bsd.uchicago.edu instead. >>>> >>>> >>>>> Error 2: My certificate is not enabled on teraport, As Mike and I >>>>> talked last >>>>> night, "certificate revocation list" on CI network is misconfigured. >>>>> >>>> >>>> This looks more like a permissions problem - the directory being >>>> used in the sites.xml file for that test does not exist and you do >>>> not have permission to create it. >>>> >>>> In r2874 I have changes tests/sites/coaster/teraport-gt2-gt2-pbs.xml >>>> to use a different path that should work for you now. >>>> >>> I tried this out, it failed, then I increased the wall-time to 15 >>> minutes in the coaster/teraport-gt2-gt2-pbs.xml file. >>> And I am waiting now. 
>>> >>> zhao >>> >>> [zzhang at communicado sites]$ ./run-site coaster/teraport-gt2-gt2-pbs.xml >>> testing site configuration: coaster/teraport-gt2-gt2-pbs.xml >>> Removing files from previous runs >>> Running test 061-cattwo at Thu Apr 23 11:12:09 CDT 2009 >>> Swift 0.9rc2 swift-r2860 cog-r2388 >>> >>> RunID: 20090423-1112-6jqlxfcf >>> Progress: >>> Progress: Stage in:1 >>> Progress: Submitted:1 >>> Failed to transfer wrapper log from >>> 061-cattwo-20090423-1112-6jqlxfcf/info/q on teraport >>> Failed to transfer wrapper log from >>> 061-cattwo-20090423-1112-6jqlxfcf/info/s on teraport >>> Progress: Stage in:1 >>> Failed to transfer wrapper log from >>> 061-cattwo-20090423-1112-6jqlxfcf/info/u on teraport >>> Progress: Failed:1 >>> Execution failed: >>> Exception in cat: >>> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] >>> Host: teraport >>> Directory: 061-cattwo-20090423-1112-6jqlxfcf/jobs/u/cat-umlmrs9j >>> stderr.txt: >>> >>> stdout.txt: >>> >>> ---- >>> >>> Caused by: >>> Job cannot be run with the given max walltime worker >>> constraint (task: 600, maxwalltime: 300s) >>> Cleaning up... >>> Shutting down service at https://128.135.125.118:58204 >>> Got channel MetaChannel: 1297642 -> GSSSChannel-null(1) >>> - Done >>> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >>> >>>> >>>>> Error 3 & Error 4: I am not active on tgncsa site. Mike said he >>>>> needed to add >>>>> me to another group. >>>>> >>>> >>>> yes. >>>> >>>> Do you have the list from the end of your test run about which sites >>>> worked and which did not? 
>>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Fri Apr 24 14:38:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Apr 2009 14:38:11 -0500 Subject: [Swift-devel] problem at end of workflow In-Reply-To: <49F20F88.5050002@mcs.anl.gov> References: <49F0F6DF.3040902@uchicago.edu> <1240536905.703.0.camel@localhost> <49F118EE.5000609@uchicago.edu> <1240538133.1093.4.camel@localhost> <1240541091.1616.5.camel@localhost> <49F15992.4040003@uchicago.edu> <1240596549.9287.2.camel@localhost> <49F20511.1080005@mcs.anl.gov> <1240598765.10279.0.camel@localhost> <49F20F88.5050002@mcs.anl.gov> Message-ID: <1240601891.10463.3.camel@localhost> On Fri, 2009-04-24 at 14:14 -0500, Michael Wilde wrote: > > On 4/24/09 1:46 PM, Mihael Hategan wrote: > > On Fri, 2009-04-24 at 13:29 -0500, Michael Wilde wrote: > >> Any pointers that might enable Zhao to reproduce these things in testing? > > > > Yeah. Use Glen's proxy. > > OK. Glen, when you get a chance, can you post how Zhao might test with a > proxy similar to yours? If I knew the similarity that triggers the problem, there would probably be a solution already. I'm not 100% sure it's the proxy, but given what I know so far, I suspect it is. > Or even run the tests with you, under yours? > > Im not following this thread close enough to know how the proxy figures > into the problem, but sounds like it does. 
> > My goal is that testing beat up the system and find bugs before users, > so if there indeed strange aspects of various proxies (or user > environments) we should try to understand that. > > Thanks, > > - Mike > > >> On 4/24/09 1:09 PM, Mihael Hategan wrote: > >>> On Fri, 2009-04-24 at 01:17 -0500, Glen Hocky wrote: > >>>> I upgraded to newest cog and swift (Swift svn swift-r2880 cog-r2391) > >>>> I got new behavior this time. instead of just 16 jobs to the queue, i > >>>> got up to 50 before I killed. So that's probably a good thing. > >>>> > >>>> The bad thing is that a job became active before any of the jobs in the > >>>> queue started and i started getting error messages (error transfer log > >>>> messages) > >>> How much time before? > >>> > >>>> Everything is in > >>>> /home/hockyg/oops/swift/output/rangeroutdir.3000 > >>>> > >>>> and as promised, new ranger coaster logs in > >>>>> /home/hockyg/oops/swift/coaster_logs/ranger/coasters > >>>> sorry i don't have more time to stress test this tonight > >>> You seem to be hitting an old problem that is messy to reproduce because > >>> (I think) it involves certain proxies or credentials. I've put some code > >>> in (cog r2392) to alleviate some of the symptoms. 
> >>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Fri Apr 24 15:08:27 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Apr 2009 15:08:27 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F21369.1040307@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> Message-ID: <1240603707.10463.16.camel@localhost> On Fri, 2009-04-24 at 14:30 -0500, Michael Wilde wrote: > Great, Zhao. > > What's next? > > Testing of the new condor provider features is important. The block allocator and coasters+condor-g are mutually useless. So if we were otherwise planning to stabilize both at the same time, then it would probably be better to focus on just one. > > For failures, it would be good to see if you can go further in at least > lifting out the errors so that developers can either tell you if there > was an error in the testing (which you should fix) or an error in the > code (which they should fix, and you can help get them the info they > need, and identify faster which errors might need more immediate attention). > > Can you suggest a methodical approach to testing, in terms of: > > - what tests you need to and plan to run on what systems?
> - how the reports are organized > - how errors are listed and diagnosed > > I want this to be a more interactive process between you and the > developers, not just "it broke, see dir X" > > Thanks, > > Mike > > > On 4/24/09 12:52 PM, Zhao Zhang wrote: > > Hi, All > > > > As I got mapped on gwynn, I redid the tests. The results are > > > > All language behaviour tests passed > > These sites failed: coaster/tgncsa-hg-coaster-pbs-gram2.xml > > coaster/tgncsa-hg-coaster-pbs-gram4.xml > > These sites worked: coaster/coaster-local.xml > > coaster/gwynn-coaster-gram2-gram2-condor.xml > > coaster/gwynn-coaster-gram2-gram2-fork.xml > > coaster/renci-engage-coaster.xml coaster/teraport-gt2-gt2-pbs.xml > > coaster/uj-pbs-gram2.xml > > > > Logs could be found at > > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/log_all on CI > > network. > > > > zhao > > > > Zhao Zhang wrote: > >> Hi, again > >> > >> the test on teraport is successful, here is the log > >> > >> zhao > >> > >> testing site configuration: coaster/teraport-gt2-gt2-pbs.xml > >> Removing files from previous runs > >> Running test 061-cattwo at Thu Apr 23 11:27:19 CDT 2009 > >> Swift 0.9rc2 swift-r2860 cog-r2388 > >> > >> RunID: 20090423-1127-aluxx4m9 > >> Progress: > >> Progress: Stage in:1 > >> Progress: Submitted:1 > >> Progress: Submitted:1 > >> ... > >> Progress: Active:1 > >> Progress: Finished successfully:1 > >> Final status: Finished successfully:1 > >> Cleaning up... 
> >> Shutting down service at https://128.135.125.118:57080 > >> Got channel MetaChannel: 2129305 -> GSSSChannel-null(1) > >> - Done > >> expecting 061-cattwo.out.expected > >> checking 061-cattwo.out.expected > >> Skipping exception test due to test configuration > >> Test passed at Thu Apr 23 11:57:39 CDT 2009 > >> ----------===========================---------- > >> Running test 130-fmri at Thu Apr 23 11:57:39 CDT 2009 > >> Swift 0.9rc2 swift-r2860 cog-r2388 > >> > >> RunID: 20090423-1157-r8sarc77 > >> Progress: > >> Progress: Selecting site:2 Initializing site shared directory:1 > >> Stage in:1 > >> Progress: Selecting site:2 Stage in:1 Submitting:1 > >> Progress: Selecting site:2 Submitting:1 Submitted:1 > >> ... > >> Progress: Selecting site:2 Submitted:2 > >> Progress: Selecting site:2 Submitted:2 > >> Progress: Selecting site:2 Submitted:2 > >> Progress: Selecting site:2 Submitted:1 Active:1 > >> Progress: Selecting site:2 Active:1 Stage out:1 > >> Progress: Selecting site:1 Stage in:1 Stage out:1 Finished > >> successfully:1 > >> Progress: Submitted:1 Stage out:1 Finished successfully:2 > >> Progress: Active:1 Finished successfully:4 > >> Progress: Submitting:2 Submitted:1 Finished successfully:5 > >> Progress: Active:2 Stage out:1 Finished successfully:5 > >> Progress: Submitted:1 Stage out:2 Finished successfully:8 > >> Final status: Finished successfully:11 > >> Cleaning up... 
> >> Shutting down service at https://128.135.125.118:52773 > >> Got channel MetaChannel: 28761475 -> GSSSChannel-null(1) > >> - Done > >> expecting 130-fmri.0000.jpeg.expected 130-fmri.0001.jpeg.expected > >> 130-fmri.0002.jpeg.expected > >> checking 130-fmri.0000.jpeg.expected > >> Skipping exception test due to test configuration > >> checking 130-fmri.0001.jpeg.expected > >> Skipping exception test due to test configuration > >> checking 130-fmri.0002.jpeg.expected > >> Skipping exception test due to test configuration > >> Test passed at Thu Apr 23 12:04:47 CDT 2009 > >> ----------===========================---------- > >> Running test 103-quote at Thu Apr 23 12:04:47 CDT 2009 > >> Swift 0.9rc2 swift-r2860 cog-r2388 > >> > >> RunID: 20090423-1204-sjzpkfd3 > >> Progress: > >> Progress: Stage in:1 > >> Progress: Submitted:1 > >> Progress: Active:1 > >> Progress: Finished successfully:1 > >> Final status: Finished successfully:1 > >> Cleaning up... > >> Shutting down service at https://128.135.125.118:40813 > >> Got channel MetaChannel: 28500325 -> GSSSChannel-null(1) > >> - Done > >> expecting 103-quote.out.expected > >> checking 103-quote.out.expected > >> Skipping exception test due to test configuration > >> Test passed at Thu Apr 23 12:05:05 CDT 2009 > >> ----------===========================---------- > >> Running test 1032-singlequote at Thu Apr 23 12:05:05 CDT 2009 > >> Swift 0.9rc2 swift-r2860 cog-r2388 > >> > >> RunID: 20090423-1205-x2d55af3 > >> Progress: > >> Progress: Stage in:1 > >> Progress: Submitted:1 > >> Progress: Active:1 > >> Progress: Finished successfully:1 > >> Final status: Finished successfully:1 > >> Cleaning up... 
> >> Shutting down service at https://128.135.125.118:44126 > >> Got channel MetaChannel: 18100302 -> GSSSChannel-null(1) > >> - Done > >> expecting 1032-singlequote.out.expected > >> checking 1032-singlequote.out.expected > >> Skipping exception test due to test configuration > >> Test passed at Thu Apr 23 12:05:22 CDT 2009 > >> ----------===========================---------- > >> Running test 1031-quote at Thu Apr 23 12:05:22 CDT 2009 > >> Swift 0.9rc2 swift-r2860 cog-r2388 > >> > >> RunID: 20090423-1205-5aa1ko4e > >> Progress: > >> Progress: Stage in:1 > >> Progress: Submitted:1 > >> Progress: Active:1 > >> Final status: Finished successfully:1 > >> Cleaning up... > >> Shutting down service at https://128.135.125.118:43759 > >> Got channel MetaChannel: 19002607 -> GSSSChannel-null(1) > >> - Done > >> expecting 1031-quote.*.expected > >> No expected output files specified for this test case - not checking > >> output. > >> Skipping exception test due to test configuration > >> Test passed at Thu Apr 23 12:05:38 CDT 2009 > >> ----------===========================---------- > >> Running test 1033-singlequote at Thu Apr 23 12:05:38 CDT 2009 > >> Swift 0.9rc2 swift-r2860 cog-r2388 > >> > >> RunID: 20090423-1205-8nopyujc > >> Progress: > >> Progress: Stage in:1 > >> Progress: Submitted:1 > >> Progress: Active:1 > >> Progress: Finished successfully:1 > >> Final status: Finished successfully:1 > >> Cleaning up... 
> >> Shutting down service at https://128.135.125.118:39924 > >> Got channel MetaChannel: 31196317 -> GSSSChannel-null(1) > >> - Done > >> expecting 1033-singlequote.out.expected > >> checking 1033-singlequote.out.expected > >> Skipping exception test due to test configuration > >> Test passed at Thu Apr 23 12:05:56 CDT 2009 > >> ----------===========================---------- > >> Running test 141-space-in-filename at Thu Apr 23 12:05:56 CDT 2009 > >> Swift 0.9rc2 swift-r2860 cog-r2388 > >> > >> RunID: 20090423-1205-aalqz1c4 > >> Progress: > >> Progress: Stage in:1 > >> Progress: Submitted:1 > >> Progress: Active:1 > >> Progress: Finished successfully:1 > >> Final status: Finished successfully:1 > >> Cleaning up... > >> Shutting down service at https://128.135.125.118:60177 > >> Got channel MetaChannel: 4728458 -> GSSSChannel-null(1) > >> - Done > >> expecting 141-space-in-filename.space here.out.expected > >> checking 141-space-in-filename.space here.out.expected > >> Skipping exception test due to test configuration > >> Test passed at Thu Apr 23 12:06:15 CDT 2009 > >> ----------===========================---------- > >> Running test 142-space-and-quotes at Thu Apr 23 12:06:15 CDT 2009 > >> Swift 0.9rc2 swift-r2860 cog-r2388 > >> > >> RunID: 20090423-1206-8617gag1 > >> Progress: > >> Progress: Selecting site:2 Initializing site shared directory:1 > >> Stage in:1 > >> Progress: Selecting site:2 Submitting:1 Submitted:1 > >> Progress: Selecting site:2 Submitted:1 Active:1 > >> Progress: Selecting site:2 Active:1 Finished successfully:1 > >> Progress: Stage out:1 Finished successfully:3 > >> Final status: Finished successfully:4 > >> Cleaning up... > >> Shutting down service at https://128.135.125.118:57945 > >> Got channel MetaChannel: 16387060 -> GSSSChannel-null(1) > >> - Done > >> expecting 142-space-and-quotes.2" space ".out.expected > >> 142-space-and-quotes.3' space '.out.expected > >> 142-space-and-quotes.out.expected 142-space-and-quotes. 
space > >> .out.expected > >> checking 142-space-and-quotes.2" space ".out.expected > >> Skipping exception test due to test configuration > >> checking 142-space-and-quotes.3' space '.out.expected > >> Skipping exception test due to test configuration > >> checking 142-space-and-quotes.out.expected > >> Skipping exception test due to test configuration > >> checking 142-space-and-quotes. space .out.expected > >> Skipping exception test due to test configuration > >> Test passed at Thu Apr 23 12:06:35 CDT 2009 > >> ----------===========================---------- > >> All language behaviour tests passed > >> > >> > >> > >> Zhao Zhang wrote: > >>> Hi, Ben > >>> > >>> Ben Clifford wrote: > >>>> On Thu, 23 Apr 2009, Zhao Zhang wrote: > >>>> > >>>> > >>>>> Error 1: This is related to CI network setting, > >>>>> /etc/grid-security/hostcert.pem. Could anyone help on this? Who > >>>>> should I > >>>>> contact? > >>>>> > >>>> > >>>> fletch is broken. But try changing those sites files to use > >>>> gwynn.bsd.uchicago.edu instead. > >>>> > >>>> > >>>>> Error 2: My certificate is not enabled on teraport, As Mike and I > >>>>> talked last > >>>>> night, "certificate revocation list" on CI network is misconfigured. > >>>>> > >>>> > >>>> This looks more like a permissions problem - the directory being > >>>> used in the sites.xml file for that test does not exist and you do > >>>> not have permission to create it. > >>>> > >>>> In r2874 I have changes tests/sites/coaster/teraport-gt2-gt2-pbs.xml > >>>> to use a different path that should work for you now. > >>>> > >>> I tried this out, it failed, then I increased the wall-time to 15 > >>> minutes in the coaster/teraport-gt2-gt2-pbs.xml file. > >>> And I am waiting now. 
> >>> > >>> zhao > >>> > >>> [zzhang at communicado sites]$ ./run-site coaster/teraport-gt2-gt2-pbs.xml > >>> testing site configuration: coaster/teraport-gt2-gt2-pbs.xml > >>> Removing files from previous runs > >>> Running test 061-cattwo at Thu Apr 23 11:12:09 CDT 2009 > >>> Swift 0.9rc2 swift-r2860 cog-r2388 > >>> > >>> RunID: 20090423-1112-6jqlxfcf > >>> Progress: > >>> Progress: Stage in:1 > >>> Progress: Submitted:1 > >>> Failed to transfer wrapper log from > >>> 061-cattwo-20090423-1112-6jqlxfcf/info/q on teraport > >>> Failed to transfer wrapper log from > >>> 061-cattwo-20090423-1112-6jqlxfcf/info/s on teraport > >>> Progress: Stage in:1 > >>> Failed to transfer wrapper log from > >>> 061-cattwo-20090423-1112-6jqlxfcf/info/u on teraport > >>> Progress: Failed:1 > >>> Execution failed: > >>> Exception in cat: > >>> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > >>> Host: teraport > >>> Directory: 061-cattwo-20090423-1112-6jqlxfcf/jobs/u/cat-umlmrs9j > >>> stderr.txt: > >>> > >>> stdout.txt: > >>> > >>> ---- > >>> > >>> Caused by: > >>> Job cannot be run with the given max walltime worker > >>> constraint (task: 600, maxwalltime: 300s) > >>> Cleaning up... > >>> Shutting down service at https://128.135.125.118:58204 > >>> Got channel MetaChannel: 1297642 -> GSSSChannel-null(1) > >>> - Done > >>> SWIFT RETURN CODE NON-ZERO - test 061-cattwo > >>> > >>>> > >>>>> Error 3 & Error 4: I am not active on tgncsa site. Mike said he > >>>>> needed to add > >>>>> me to another group. > >>>>> > >>>> > >>>> yes. > >>>> > >>>> Do you have the list from the end of your test run about which sites > >>>> worked and which did not? 
> >>>> > >>>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Apr 24 15:50:48 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Apr 2009 15:50:48 -0500 Subject: [Swift-devel] feature request In-Reply-To: <1240603707.10463.16.camel@localhost> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <1240603707.10463.16.camel@localhost> Message-ID: <49F22628.5080005@mcs.anl.gov> On 4/24/09 3:08 PM, Mihael Hategan wrote: > On Fri, 2009-04-24 at 14:30 -0500, Michael Wilde wrote: >> Great, Zhao. >> >> What's next? >> >> Testing of the new condor provider features is important. > > The block allocator and coasters+condor-g are mutually useless. I need to think that through. I see this set of values: - The plain condor-g provider is high-value to OSG and TG immediately. 
It overcomes the GRAM2 overhead, and seems like it should open up many doors. That's what I was asking Zhao to test. - coasters+condor-g (unless you state otherwise) looks like it needs a sanity test by you first, and maybe some adjustments to work. That combination has the same high value. - the block allocator has value in addition to the above for systems and cases where: a) there are limits on #jobs per user that can be queued or run at the same time b) it's desirable for scheduling reasons to do the allocation as one big job c) it's just plain more efficient to allocate in chunks For this block allocator, I never envisioned (or wanted) something fancy: just a parameter hostsPerJob, akin to coastersPerCore, that would set the allocation unit, permitting users to set this to 1 (default perhaps), big (all on one job), or some modest number to get CPUs in chunks. It seems to me that there is value in all the combinations, but it certainly merits discussion to pick the smallest number of features and interactions we can, to keep development, testing, and above all usage as simple as possible. But no simpler. - Mike > > So if we were otherwise planning to stabilize both at the same time, > then it would probably be better to focus on just one. > >> For failures, it would be good to see if you can go further in at least >> lifting out the errors so that developers can either tell you if there >> was an error in the testing (which you should fix) or an error in the >> code (which they should fix, and you can help get them the info they >> need, and identify faster which errors might need more immediate attention). >> >> Can you suggest a methodical approach to testing, in terms of: >> >> - what tests you need to and plan to run on what systems? 
>> - how the reports are organized >> - how errors are listed and diagnosed >> >> I want this to be a more interactive process between you and the >> developers, not just "it broke, see dir X" >> >> Thanks, >> >> Mike >> >> >> On 4/24/09 12:52 PM, Zhao Zhang wrote: >>> Hi, All >>> >>> As I got mapped on gwynn, I redid the tests. The results are >>> >>> All language behaviour tests passed >>> These sites failed: coaster/tgncsa-hg-coaster-pbs-gram2.xml >>> coaster/tgncsa-hg-coaster-pbs-gram4.xml >>> These sites worked: coaster/coaster-local.xml >>> coaster/gwynn-coaster-gram2-gram2-condor.xml >>> coaster/gwynn-coaster-gram2-gram2-fork.xml >>> coaster/renci-engage-coaster.xml coaster/teraport-gt2-gt2-pbs.xml >>> coaster/uj-pbs-gram2.xml >>> >>> Logs could be found at >>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/log_all on CI >>> network. >>> >>> zhao >>> >>> [remainder of the quoted test transcript and error discussion snipped; it is identical to the copy quoted in full earlier in this thread] 
>>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri Apr 24 16:25:00 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Apr 2009 16:25:00 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F22628.5080005@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <1240603707.10463.16.camel@localhost> <49F22628.5080005@mcs.anl.gov> Message-ID: <1240608300.13592.23.camel@localhost> On Fri, 2009-04-24 at 15:50 -0500, Michael Wilde wrote: > > On 4/24/09 3:08 PM, Mihael Hategan wrote: > > On Fri, 2009-04-24 at 14:30 -0500, Michael Wilde wrote: > >> Great, Zhao. > >> > >> What's next? > >> > >> Testing of the new condor provider features is important. > > > > The block allocator and coasters+condor-g are mutually useless. > > I need to think that through. 
I see this set of values: > > - The plain condor-g provider is high-value to OSG and TG immediately. Immediately pending testing and stabilization. > It overcomes the GRAM2 overhead, and seems like it should open up many > doors. It does some things to overcome GRAM2 job manager overhead. > Thats what I was asking Zhao to test. I think it's worth testing by itself, but not in combination with something that will slowly go away. It's up to you guys. I didn't want Zhao to waste his time. From hategan at mcs.anl.gov Fri Apr 24 16:33:28 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Apr 2009 16:33:28 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <49EF52C9.9040707@mcs.anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <1240419946.7967.4.camel@localhost> <49EF52C9.9040707@mcs.anl.gov> Message-ID: <1240608808.13592.32.camel@localhost> On Wed, 2009-04-22 at 12:24 -0500, Michael Wilde wrote: > The problem with VMs at the moment is that most of the resources our > users need to run on at the moment are not VM-enabled, and wont be in > say the next 12 months or more. Well, I think that if everybody thinks like that, TG will see no need to do anything about it. If we think this is the right way to go (and I do, and you seem to), I believe we should at least ask for it. Ian definitely has the skills to put this nicely in words, and the reputation that would make TG consider them. 
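The hostsPerJob allocation parameter Wilde proposes earlier in the thread reduces to simple chunking arithmetic: split a request for N hosts into block jobs of at most hostsPerJob hosts each. A minimal illustrative sketch of that policy (the function and its name are hypothetical, not actual Swift or coaster code):

```python
def plan_blocks(total_hosts, hosts_per_job):
    """Split a request for total_hosts into block-job sizes of at most
    hosts_per_job hosts each. hosts_per_job=1 gives one job per host;
    hosts_per_job >= total_hosts gives a single big allocation."""
    if total_hosts < 1 or hosts_per_job < 1:
        raise ValueError("both arguments must be >= 1")
    blocks = []
    remaining = total_hosts
    while remaining > 0:
        size = min(hosts_per_job, remaining)  # last block may be smaller
        blocks.append(size)
        remaining -= size
    return blocks

# The three settings Wilde describes:
print(plan_blocks(10, 1))   # 1 (default): ten single-host jobs
print(plan_blocks(10, 10))  # big: one 10-host job
print(plan_blocks(10, 4))   # modest: CPUs in chunks of 4
```

Under a queue-imposed per-user job limit (case a in Wilde's list), this shrinks the number of queued jobs from total_hosts to ceil(total_hosts / hosts_per_job).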
From wilde at mcs.anl.gov Fri Apr 24 16:32:40 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 24 Apr 2009 16:32:40 -0500 Subject: [Swift-devel] feature request In-Reply-To: <1240608300.13592.23.camel@localhost> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <1240603707.10463.16.camel@localhost> <49F22628.5080005@mcs.anl.gov> <1240608300.13592.23.camel@localhost> Message-ID: <49F22FF8.7000900@mcs.anl.gov> On 4/24/09 4:25 PM, Mihael Hategan wrote: ... > It does some things to overcome GRAM2 job manager overhead. > >> That's what I was asking Zhao to test. > > I think it's worth testing by itself, but not in combination with > something that will slowly go away. It's up to you guys. I didn't want > Zhao to waste his time. Cool. That's what I was asking him to test: the condor-g provider by itself. So, Zhao, since we agree on this, go for it. Since it's a new feature, just added, there is no urgency to do it before the Sunday release; you should start this first thing next week, though. Mihael, regarding the "slowly go away" part: that's where we need a roadmap on what the stages are. I need to digest the information you supplied earlier in the week at my urging, and you should set aside time, once your current coding push is done, to do the same, and present a coaster work roadmap. 
Thanks, Mike From hategan at mcs.anl.gov Fri Apr 24 16:38:10 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Apr 2009 16:38:10 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F22FF8.7000900@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <1240603707.10463.16.camel@localhost> <49F22628.5080005@mcs.anl.gov> <1240608300.13592.23.camel@localhost> <49F22FF8.7000900@mcs.anl.gov> Message-ID: <1240609090.14275.0.camel@localhost> On Fri, 2009-04-24 at 16:32 -0500, Michael Wilde wrote: > > On 4/24/09 4:25 PM, Mihael Hategan wrote: > ... > > It does some things to overcome GRAM2 job manager overhead. > > > >> Thats what I was asking Zhao to test. > > > > I think it's worth testing by itself, but not in combination with > > something that will slowly go away. It's up to you guys. I didn't want > > Zhao to waste his time. > > Cool. Thats what I was asking him to test: the condor-g provider by itself. > > So, Zhao, since we agree on this, go for it. > > Since its a new feature, just added, there is no urgency to do it before > the Sunday release; you should start this as first thing next week, though. > > Mihael, regarding the "slowly go away" part: thats where we need a > roadmap on what the stages are. I need to digest the information you > supplied earlier in the week at my urging, and you should set aside > time, once your current coding push is done, to do the same, and present > a coaster work roadmap. I've got none. Besides block allocations that is. 
From foster at anl.gov Fri Apr 24 17:08:13 2009 From: foster at anl.gov (Ian Foster) Date: Fri, 24 Apr 2009 17:08:13 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <1240608808.13592.32.camel@localhost> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <1240419946.7967.4.camel@localhost> <49EF52C9.9040707@mcs.anl.gov> <1240608808.13592.32.camel@localhost> Message-ID: It has been a persistent problem getting TG and OSG to consider VMs. The reasons are mixed: in my view, some good, some bad. The good reasons include the difficulties inherent in getting Xen/VMware support for the sometimes odd software found on high-end systems. E.g., on TG-UC, both 64-bit hardware and GPFS have been sources of problems. The bad reason is a persistent conservatism. The good news is that there are a growing number of so-called "cloud" testbeds appearing (albeit small ones), based on Nimbus and Eucalyptus. Also, it seems that Argonne will have a 6,000-core "cloud" system in the near future--a perfect place for such applications. Perhaps that will be the first of many. Ian. On Apr 24, 2009, at 4:33 PM, Mihael Hategan wrote: > On Wed, 2009-04-22 at 12:24 -0500, Michael Wilde wrote: >> The problem with VMs at the moment is that most of the resources our >> users need to run on at the moment are not VM-enabled, and wont be in >> say the next 12 months or more. > > Well, I think that if everybody thinks like that, TG will see no > need to > do anything about it. > > If we think this is the right way to go (and I do, and you seem to), I > believe we should at least ask for it. Ian definitely has the skills > to > put this nicely in words, and the reputation that would make TG > consider > them. 
> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Fri Apr 24 17:56:19 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Apr 2009 17:56:19 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <1240419946.7967.4.camel@localhost> <49EF52C9.9040707@mcs.anl.gov> <1240608808.13592.32.camel@localhost> Message-ID: <1240613779.15379.5.camel@localhost> On Fri, 2009-04-24 at 17:08 -0500, Ian Foster wrote: > It has been a persistent problem getting TG and OSG to consider VMs. > The reasons are mixed: in my view, some good, some bad. The good > reasons include the difficulties inherent in getting Xen/VMware > support for the sometimes odd software found on high-end systems. > E.g., on TG-UC, both 64-bit hardware and GPFS have been sources of > problems. We (swift) can probably plan around GPFS, and possibly other such "difficulties". > The bad reason is a persistent conservatism. Is there any way that the problem can be formulated in such a way as to satisfy both worlds? In my naive view, a node could be used both in a traditional way, and as a VM host when needed. 
From foster at anl.gov Fri Apr 24 18:21:05 2009 From: foster at anl.gov (Ian Foster) Date: Fri, 24 Apr 2009 18:21:05 -0500 Subject: [Swift-devel] Planning next set of Swift features and releases In-Reply-To: <1240613779.15379.5.camel@localhost> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <1240419946.7967.4.camel@localhost> <49EF52C9.9040707@mcs.anl.gov> <1240608808.13592.32.camel@localhost> <1240613779.15379.5.camel@localhost> Message-ID: <4299524E-FE70-43E9-A2EB-F0E8432B4C76@anl.gov> The GPFS issue was quite fundamental at the time--IBM would not guarantee that a VM node running GPFS would not corrupt GPFS. They got past the issue eventually, I don't know how. Anyway, we should keep pushing people on this. In the meantime, EC2 and Nimbus are good targets. On Apr 24, 2009, at 5:56 PM, Mihael Hategan wrote: > On Fri, 2009-04-24 at 17:08 -0500, Ian Foster wrote: >> It has been a persistent problem getting TG and OSG to consider VMs. >> The reasons are mixed: in my view, some good, some bad. The good >> reasons include the difficulties inherent in getting Xen/VMware >> support for the sometimes odd software found on high-end systems. >> E.g., on TG-UC, both 64-bit hardware and GPFS have been sources of >> problems. > > We (swift) can probably plan around GPFS, and possibly other such > "difficulties". > >> The bad reason is a persistent conservatism. > > Is there any way that the problem can be formulated in such a way as > to > satisfy both worlds? In my naive view, a node could be used both in a > traditional way, and as a VM host when needed. > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From support at ci.uchicago.edu Sun Apr 26 11:01:32 2009 From: support at ci.uchicago.edu (Ti Leggett) Date: Sun, 26 Apr 2009 11:01:32 -0500 Subject: [Swift-devel] [CI Ticketing System #565] swift not working In-Reply-To: References: Message-ID: If you were submitting this to gwynn for the HNL condor pool, globus had been broken on gwynn. Globus has been fixed, so can you try this again and let us know if this problem is fixed as well. On Tue Apr 21 10:05:00 2009, andric at uchicago.edu wrote: > Normally, I would hit up Sarah for a fix on this, but since she's on > vacation I'm hoping someone else out there could help with this. I'm unable > to get swift jobs submitted. I've tried submitting to both the ucanl64 and > bsd clusters. The run dir (with log files) is here: > /disks/ci-gpfs/fmri/cnari/swift/projects/andric/SNR/RFL2 > > > Here's what I get from ucanl: > > [...]Progress: Submitting:1 Submitted:1 Failed but can retry:1 > Failed to transfer wrapper log from AFNIsnr-20090421-0930-q403bn99/info/e on > ANLUCTERAGRID64 > Progress: Submitting:1 Failed but can retry:2 > Progress: Submitting:1 Failed but can retry:2 > Progress: Submitting:1 Failed but can retry:2 > Progress: Stage in:1 Submitting:1 Failed but can retry:1 > Progress: Submitting:2 Failed but can retry:1 > Progress: Submitting:1 Submitted:1 Failed but can retry:1 > Failed to transfer wrapper log from AFNIsnr-20090421-0930-q403bn99/info/g on > ANLUCTERAGRID64 > Progress: Submitting:1 Failed:1 Failed but can retry:1 > Execution failed: > Exception in AFNI_3dvolreg: > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run6_trim, -base, > ts.6_trim+orig[92], -prefix, volreg.RFL2.run6_trim, ts.6_trim+orig.BRIK] > Host: ANLUCTERAGRID64 > Directory: AFNIsnr-20090421-0930-q403bn99/jobs/g/AFNI_3dvolreg-gxefdp9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job: ; nested exception is: > java.net.SocketTimeoutException: Read timed out > gwynn 5% > > > > and this is what I get on bsd: > 
> RunID: 20090421-0943-o1bb0081 > Progress: > Progress: Selecting site:1 Initializing site shared directory:1 Stage > in:1 > Progress: Stage in:2 Submitting:1 > 2009.04.21 09:44:04.848 CDT: [ERROR] Parsing profiles on line 1800 Illegal > character ':'at position 60 :Illegal character ':' > Progress: Submitted:1 Failed but can retry:2 > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/f on > BSD > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/g on > BSD > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/e on > BSD > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/i on > BSD > Progress: Failed but can retry:3 > Progress: Stage in:1 Failed but can retry:2 > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/m on > BSD > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Stage in:1 Failed but can retry:2 > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/o on > BSD > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Failed but can retry:3 > Progress: Stage in:1 Failed but can retry:2 > Failed to transfer wrapper log from AFNIsnr-20090421-0943-o1bb0081/info/q on > BSD > Progress: Failed:1 Failed but can retry:2 > Execution failed: > Exception in AFNI_3dvolreg: > Arguments: [-twopass, -twodup, -dfile, mot_RFL2.run5_trim, -base, > ts.5_trim+orig[92], -prefix, volreg.RFL2.run5_trim, ts.5_trim+orig.BRIK] > Host: BSD > Directory: AFNIsnr-20090421-0943-o1bb0081/jobs/q/AFNI_3dvolreg-q0rydp9j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job > Caused by: > Data transfer to the server failed [Caused by: Token length > 1248813600 > 33554432] From keahey at mcs.anl.gov Mon Apr 27 07:48:21 2009 From: keahey at mcs.anl.gov (Kate Keahey) Date: Mon, 27 Apr 2009 07:48:21 -0500 Subject: 
[Swift-devel] Planning next set of Swift features and releases In-Reply-To: <4299524E-FE70-43E9-A2EB-F0E8432B4C76@anl.gov> References: <49EE1156.3090107@mcs.anl.gov> <1240418443.7115.8.camel@localhost> <1240419946.7967.4.camel@localhost> <49EF52C9.9040707@mcs.anl.gov> <1240608808.13592.32.camel@localhost> <1240613779.15379.5.camel@localhost> <4299524E-FE70-43E9-A2EB-F0E8432B4C76@anl.gov> Message-ID: <49F5A995.4000401@mcs.anl.gov> Ian Foster wrote: > The GPFS issue was quite fundamental at the time--IBM would not > guarantee that a VM node running GPFS would not corrupt GPFS. They got > past the issue eventually, I don't know how. IBM released GPFS including Xen support a while ago. > > Anyway, we should keep pushing people on this. In the meantime, EC2 and > Nimbus are good targets. > > > On Apr 24, 2009, at 5:56 PM, Mihael Hategan wrote: > >> On Fri, 2009-04-24 at 17:08 -0500, Ian Foster wrote: >>> It has been a persistent problem getting TG and OSG to consider VMs. >>> The reasons are mixed: in my view, some good, some bad. The good >>> reasons include the difficulties inherent in getting Xen/VMware >>> support for the sometimes odd software found on high-end systems. >>> E.g., on TG-UC, both 64-bit hardware and GPFS have been sources of >>> problems. >> >> We (swift) can probably plan around GPFS, and possibly other such >> "difficulties". >> >>> The bad reason is a persistent conservatism. >> >> Is there any way that the problem can be formulated in such a way as to >> satisfy both worlds? In my naive view, a node could be used both in a >> traditional way, and as a VM host when needed. 
>> >> > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago From benc at hawaga.org.uk Mon Apr 27 08:24:05 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 27 Apr 2009 13:24:05 +0000 (GMT) Subject: [Swift-devel] Swift 0.9 released. Message-ID: Swift 0.9 is released. Download it at http://www.ci.uchicago.edu/swift/downloads/ The release notes, with more information on new features and bugfixes, are available at: http://www.ci.uchicago.edu/swift/downloads/release-notes-0.9.txt -- From hockyg at uchicago.edu Mon Apr 27 14:29:04 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Mon, 27 Apr 2009 14:29:04 -0500 Subject: [Swift-devel] odd scheduling behavior in swift-0.9 Message-ID: <49F60780.902@uchicago.edu> Hi Everyone, I had been having problems from communicado to ranger so this morning I tried running on ranger from ranger as the submit host with the following sites file. This time I had >600 jobs go active while only 3-4 jobs in the queue were active i.e. only 48-64 coasters should have been active. 
> > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > > > > key="SWIFT_JOBDIR_PATH">/tmp/hockyg/jobdir > key="SWIFT_JOBDIR_PATH">/tmp/hockyg/jobdir > TG-MCB080099N > 16 > key="coasterWorkerMaxwalltime">05:00:00 > 60 > 50 > 10 > /share/home/01021/hockyg/swiftwork > Log file is transfered to here: ci home: /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090427-0845-pc8celi7.log web: /http://www.ci.uchicago.edu/~hockyg/swift_logs/ranger_from_ranger/oops-20090427-0845-pc8celi7.log coaster logs are in: ci home: /home/hockyg/oops/swift/coaster_logs/ranger From hategan at mcs.anl.gov Mon Apr 27 15:08:20 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 27 Apr 2009 15:08:20 -0500 Subject: [Swift-devel] odd scheduling behavior in swift-0.9 In-Reply-To: <49F60780.902@uchicago.edu> References: <49F60780.902@uchicago.edu> Message-ID: <1240862900.8258.5.camel@localhost> [hategan at communicado ~]$ less /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090427-0845-pc8celi7.log /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090427-0845-pc8celi7.log: Permission denied On Mon, 2009-04-27 at 14:29 -0500, Glen Hocky wrote: > Hi Everyone, > I had been having problems from communicado to ranger so this morning I > tried running on ranger from ranger as the submit host with the > following sites file. This time I had >600 jobs go active while only 3-4 > jobs in the queue were active i.e. only 48-64 coasters should have been > active. 
> > > > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > > > > > > > > > key="SWIFT_JOBDIR_PATH">/tmp/hockyg/jobdir > > > key="SWIFT_JOBDIR_PATH">/tmp/hockyg/jobdir > > TG-MCB080099N > > 16 > > > key="coasterWorkerMaxwalltime">05:00:00 > > 60 > > 50 > > 10 > > /share/home/01021/hockyg/swiftwork > > > Log file is transfered to here: > ci home: > /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090427-0845-pc8celi7.log > web: > /http://www.ci.uchicago.edu/~hockyg/swift_logs/ranger_from_ranger/oops-20090427-0845-pc8celi7.log > > coaster logs are in: > ci home: /home/hockyg/oops/swift/coaster_logs/ranger > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Apr 27 17:09:28 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 27 Apr 2009 17:09:28 -0500 Subject: [Swift-devel] odd scheduling behavior in swift-0.9 In-Reply-To: <49F60780.902@uchicago.edu> References: <49F60780.902@uchicago.edu> Message-ID: <1240870168.30339.2.camel@localhost> On Mon, 2009-04-27 at 14:29 -0500, Glen Hocky wrote: > Hi Everyone, > I had been having problems from communicado to ranger so this morning I > tried running on ranger from ranger as the submit host with the > following sites file. This time I had >600 jobs go active I did a plot of the run (http://www.mcs.anl.gov/~hategan/oops-04-27-1/), and I don't see 600 jobs active anywhere. Most I see is between 60 and 70 (http://www.mcs.anl.gov/~hategan/oops-04-27-1/karatasks.JOB_SUBMISSION.Active-total.png). > while only 3-4 > jobs in the queue were active i.e. only 48-64 coasters should have been > active. 
> > > > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > > > > > > > > > key="SWIFT_JOBDIR_PATH">/tmp/hockyg/jobdir > > > key="SWIFT_JOBDIR_PATH">/tmp/hockyg/jobdir > > TG-MCB080099N > > 16 > > > key="coasterWorkerMaxwalltime">05:00:00 > > 60 > > 50 > > 10 > > /share/home/01021/hockyg/swiftwork > > > Log file is transfered to here: > ci home: > /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090427-0845-pc8celi7.log > web: > /http://www.ci.uchicago.edu/~hockyg/swift_logs/ranger_from_ranger/oops-20090427-0845-pc8celi7.log > > coaster logs are in: > ci home: /home/hockyg/oops/swift/coaster_logs/ranger > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Apr 27 17:49:46 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 27 Apr 2009 17:49:46 -0500 Subject: [Swift-devel] progress stats lying Message-ID: <1240872586.30696.8.camel@localhost> The "Active" events are processed immediately, while for stageout events to register to the ticker, it takes some karajan execution to happen in possibly many threads and possibly involving scheduled tasks. This causes the ticker to claim that more jobs are active than there actually are. A possible solution is to add another state ("Checking status" or something), that is triggered in Execute by a task completed event. From zhaozhang at uchicago.edu Tue Apr 28 00:46:38 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 28 Apr 2009 00:46:38 -0500 Subject: [Swift-devel] several questions about coaster Message-ID: <49F6983E.7030301@uchicago.edu> Hi, Mihael As I am going to test coaster deeply on various systems. I got several questions regarding to the coaster infrastructure. 1. 
Scalability From the source code, I can tell that coaster uses TCP for task transmission (correct me if I am wrong). What is the largest-scale test we have done with coaster? I mean the ratio between the coaster dispatcher and the number of workers. 2. Usability All tests I have tried with coaster were run together with swift, so that is a black-box test for me. Is there any interface in coaster through which I could specify the number of workers and the wall time? Also, is there a way for me to start the coaster service and workers separately and independently? Besides, I am guessing coaster uses a dynamic provisioning approach to request resources; is this correct? That would mean coaster decides how many compute nodes to request according to the number of jobs and the length of the jobs. If I run coaster in a supercomputer context, can I ask coaster to hold a certain number of compute nodes for a certain amount of time? (This somewhat overlaps the first question in section 2.) 3. Performance Does coaster provide an alternative interface other than the coaster provider? Say I want to test the dispatch rate of coaster but don't want to introduce swift overhead; what is a good way to start coaster? Is there a coaster log that shows the number of active workers currently registered with the coaster service, how many jobs are running, how many jobs returned successfully, etc.? 4. Dispatch Algorithm Does coaster use a scoring algorithm for dispatching jobs? That is, does the coaster service keep a score for every worker and dispatch jobs based on those scores? Is there an alternative way, say a FIFO algorithm? 5. Reliability I know that if a job fails, swift can resend the same job. But does coaster have any error recovery mechanism built in? So far so many questions; thanks for your patience. I will have more as I go on with the testing. 
:-) best wishes zhao From benc at hawaga.org.uk Tue Apr 28 01:05:59 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 06:05:59 +0000 (GMT) Subject: [Swift-devel] progress stats lying In-Reply-To: <1240872586.30696.8.camel@localhost> References: <1240872586.30696.8.camel@localhost> Message-ID: On Mon, 27 Apr 2009, Mihael Hategan wrote: > The "Active" events are processed immediately, while for stageout events > to register to the ticker, it takes some karajan execution to happen in > possibly many threads and possibly involving scheduled tasks. This > causes the ticker to claim that more jobs are active than there actually > are. > > A possible solution is to add another state ("Checking status" or > something), that is triggered in Execute by a task completed event. Yes. Probably all the status updates could do with a brief review to see how well they really tie in with what they claim. -- From benc at hawaga.org.uk Tue Apr 28 02:15:03 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 07:15:03 +0000 (GMT) Subject: [Swift-devel] progress stats lying In-Reply-To: References: <1240872586.30696.8.camel@localhost> Message-ID: On Tue, 28 Apr 2009, Ben Clifford wrote: > > A possible solution is to add another state ("Checking status" or > > something), that is triggered in Execute by a task completed event. r2887 adds a checking status event in execute2, right after the vdl:execute call. I think that is sufficient? or are you claiming that something deeper in the execute layer is delaying things? -- From benc at hawaga.org.uk Tue Apr 28 07:31:22 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 12:31:22 +0000 (GMT) Subject: [Swift-devel] Re: metrics In-Reply-To: References: Message-ID: (I added swift-devel to this because others might read and comment.) On Fri, 24 Apr 2009, Jon Roelofs wrote: > What kind of hardware is this typically run on? About how many machines > would there be? Are there multicore machines? 
If so, would swift treat a > quad core as if it were 4 sites? I have the idea that swift is targeted > somewhere in-between a dedicated supercomputer and something like what > Berkeley is doing with BOINC, but more on the dedicated side of things. Is > that a fair assumption? There is quite a lot of variation in what Swift gets run on. The traditional target platform is something like the Open Science Grid, which has multiple sites, all of which vary significantly in characteristics from each other (in terms of node count and network connectivity). In there, each site consists of multiple compute cores (either in the same machine or in a single cluster with a shared file system). The sites are generally dedicated to this use, but not to a particular user; so when Swift wants to run jobs on those sites, those jobs usually must sit in a queue for some period before execution, sometimes for a very long time (for example, I've seen one cluster pretty much fully loaded for 2 weeks solid, whilst other sites will run jobs within seconds). From a site selection and scheduling perspective, the difficult thing is mostly in choosing the right site for a job (what does "the right site" mean? how to determine that? what happens when we choose the wrong site?) On TeraGrid, where individual sites are quite large, users tend to run only on one site at once, so site selection issues are less important. However, part of the motivation to run on only one site is because of deficiencies in site selection (for example, the data affinity stuff that we have talked about is intended to solve a problem which causes a lot of data transfer). On TeraGrid people are commonly running with the coaster execution provider, which allocates nodes separately from submitting jobs to them. So there is the possibility (not used at the moment) to do some scheduling based on the knowledge that those nodes will likely be dedicated to a particular Swift run for some duration. 
A third mode of Swift usage is on something like the Blue Gene/P. There, the machine has enough cores that when used with Swift it is regarded as several sites, each site being some chunk of the machine. So then site selection issues start coming back into play again. On BG/P people run through Falkon, which, similar to coasters, separates out allocation of compute nodes from assigning tasks to those nodes. In all of the above, it is often the case that the expected duration of a job is not known; so it is hard to do very tight planning ahead of time. The various execution systems and sites have very different performance characteristics, in terms of how many jobs can be sent to a site at once, and how much time overhead each job has. > Is there a way to specify that certain jobs need sites with specific > hardware/software requirements? For example, maybe one of the apps > needs hardware that can run CUDA code, so it doesn't make sense to send > this job to a site without a GPGPU. The tc.data file lets you list which sites support which applications, as a list of sites and applications. Mostly at the moment that is due to applications being installed or not installed rather than hardware requirements. 
-- From benc at hawaga.org.uk Tue Apr 28 07:49:30 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 12:49:30 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49F21369.1040307@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> Message-ID: On Fri, 24 Apr 2009, Michael Wilde wrote: > Testing of the new condor provider features is important. I would be interested in seeing the condor-g changes tested against a large set of OSG sites. However, the swift/ress/osg site file generator does not generate sites files to use that. If you (Zhao) want to modify the sites file generator to give sites.xml files that use the condor provider in condor-g mode, instead of using the gt2 provider, then you would be able to get some testing done on OSG; or you could generate the sites file by hand based on the output of swift-osg-ress-site-catalog. -- From benc at hawaga.org.uk Tue Apr 28 09:32:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 14:32:26 +0000 (GMT) Subject: [Swift-devel] Avoid copy of local files to work directory? 
In-Reply-To: References: <49EB3432.4090901@mcs.anl.gov> <49EB53D9.8050600@mcs.anl.gov> <9f808f850904222151s6155e6a6icf1330ce3d3a938b@mail.gmail.com> Message-ID: > Try http://www.ci.uchicago.edu/~benc/provider-ln-2009-0423.tar.gz > > As I have said in previous mails, I am fairly unconvinced that this will > be useful in your present situation, and I invite you to provide more > details (in terms of high level description, and concrete log files) of > what you think your problem is. Hi. Do you have any results from this? I am interested in both negative and positive results. -- From hategan at mcs.anl.gov Tue Apr 28 12:59:38 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 28 Apr 2009 12:59:38 -0500 Subject: [Swift-devel] Re: several questions about coaster In-Reply-To: <49F6983E.7030301@uchicago.edu> References: <49F6983E.7030301@uchicago.edu> Message-ID: <1240941578.4742.10.camel@localhost> On Tue, 2009-04-28 at 00:46 -0500, Zhao Zhang wrote: > Hi, Mihael > > As I am going to test coaster deeply on various systems. I got several > questions regarding to the coaster infrastructure. > 1. Scalability > From source code, I could tell that coaster is using TCP for task > transmission.(correct me if I am wrong) What is the largest > scale test we have done with coaster? I mean the ratio between > coaster dispatcher and the number of workers. Typical stuff on ranger. Around 1000 nodes. > > 2. Usability > All tests I have tried with coaster were running together with > swift. That is a black box test for me. Is there any interface in coaster > that I could specify the number of workers and the wall time? Like all profile entries, they are task attributes. > > Also, is there a way for me to start coaster service and workers > separately and independently? The coaster provider is a provider like any other. Use the "job submission example" in cog and change the provider to "coaster". 
> > Besides, I am guessing coaster is using a dynamic provisioning > approach to request resources, is this correct? Which means coaster > will decide how many compute nodes to request according to the > number of jobs, and the length of jobs. No. For each job, it will try to find the worker with the least time left that can still run the job. If no such worker can be found, it will try to start one with 10 times larger walltime up to a maximum of 250 workers I think. > If I run coaster in a super computer > context, can I ask coaster to hold a certain number of compute nodes > for a certain amount of time? (This somehow overlaps the first Q in > section 2) You don't seem to have tuned in much into recent discussion on this mailing list. No, you can't. > > 3. Performance > Does coaster provide alternative interface other than the coaster > provider? Say if I want to test the dispatch rate of coaster, but don't > want to > introduce swift overhead, which is a good way to start coaster? The abstraction api in cog has nothing to do with swift, so use that. > > Is there a coaster log that could show the number of active workers > currently registered with the coaster service, how many jobs are running, > how many jobs returned successful, and etc.? Yes. On the remote site, in ~/.globus/coasters > > 4. Dispatch Algorithm > Does coaster use a scoring algorithm for dispatching jobs? No scoring algorithm. Read appropriate answer in (2). > Which > means coaster service keeps scores for every workers, and dispatch jobs > based on those scores? Is there an alternative way, say FIFO algorithm? > > 5. Reliability > I know that if a job failed, swift could resend the same job. But > does coaster have any error recovery mechanism built in? No. It deliberately has none, in order to avoid obscuring errors. 
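Mihael's description of the worker-allocation policy above (pick the worker with the least time left that can still fit the job; if none fits, start a new worker with a ten-times-larger walltime, up to a cap of about 250) can be sketched roughly like this. This is only an illustration of the heuristic as described in this email; the names are invented, and the numbers come from the email, not from the actual coaster source (which is Java, inside the coaster service):

```python
# Rough sketch of the coaster allocation heuristic as described above.
# NOT the real implementation; names and structure are illustrative only.

MAX_WORKERS = 250       # cap mentioned in the email ("I think")
WALLTIME_FACTOR = 10    # a new worker gets 10x the job's walltime

class Worker:
    def __init__(self, time_left):
        self.time_left = time_left  # seconds of walltime remaining

def allocate(job_walltime, workers):
    """Best-fit: the worker with the least remaining time that can still
    run the job; otherwise request a new, longer-lived worker."""
    fitting = [w for w in workers if w.time_left >= job_walltime]
    if fitting:
        return min(fitting, key=lambda w: w.time_left)
    if len(workers) < MAX_WORKERS:
        w = Worker(job_walltime * WALLTIME_FACTOR)
        workers.append(w)
        return w
    return None  # at the cap: the job has to wait

workers = [Worker(30), Worker(120), Worker(500)]
w = allocate(100, workers)  # tightest fit: the worker with 120s left
```

The point of the best-fit choice is to drain short-lived workers first, so the long-lived ones stay free for jobs that actually need them.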
From hategan at mcs.anl.gov Tue Apr 28 13:04:01 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 28 Apr 2009 13:04:01 -0500 Subject: [Swift-devel] progress stats lying In-Reply-To: References: <1240872586.30696.8.camel@localhost> Message-ID: <1240941841.4742.16.camel@localhost> On Tue, 2009-04-28 at 07:15 +0000, Ben Clifford wrote: > On Tue, 28 Apr 2009, Ben Clifford wrote: > > > > A possible solution is to add another state ("Checking status" or > > > something), that is triggered in Execute by a task completed event. > > r2887 adds a checking status event in execute2, right after the > vdl:execute call. I think that is sufficient? or are you claiming that > something deeper in the execute layer is delaying things? It's probably better but not sufficient. With many threads there may be delays between the vdl:execute call completing and the next thing running becoming visible. The problem is that swift may say that more jobs are active than is known to be possible, which will raise eyebrows. From hockyg at uchicago.edu Tue Apr 28 13:05:37 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Tue, 28 Apr 2009 13:05:37 -0500 Subject: [Swift-devel] progress stats lying In-Reply-To: <1240941841.4742.16.camel@localhost> References: <1240872586.30696.8.camel@localhost> <1240941841.4742.16.camel@localhost> Message-ID: <49F74571.4080703@uchicago.edu> As a user, it seems very important to have accurate summary stats so you can tell if anything is going wrong. Mihael Hategan wrote: > On Tue, 2009-04-28 at 07:15 +0000, Ben Clifford wrote: > >> On Tue, 28 Apr 2009, Ben Clifford wrote: >> >> >>>> A possible solution is to add another state ("Checking status" or >>>> something), that is triggered in Execute by a task completed event. >>>> >> r2887 adds a checking status event in execute2, right after the >> vdl:execute call. I think that is sufficient? or are you claiming that >> something deeper in the execute layer is delaying things? 
>> > > It's probably better but not sufficient. With many threads there may be > delays between the vdl:execute call completing and the next thing > running may be visible. > > The problem is that swift may say that more jobs are active than known > possible, which will raise eyebrows. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Tue Apr 28 13:14:03 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 28 Apr 2009 13:14:03 -0500 Subject: [Swift-devel] progress stats lying In-Reply-To: <49F74571.4080703@uchicago.edu> References: <1240872586.30696.8.camel@localhost> <1240941841.4742.16.camel@localhost> <49F74571.4080703@uchicago.edu> Message-ID: <1240942443.5245.4.camel@localhost> On Tue, 2009-04-28 at 13:05 -0500, Glen Hocky wrote: > as a user, it seems very important to have accurate summary stats so you > can tell if anything is going wrong. You're overgeneralizing. You can infer some things from the summary stats, but many things can go wrong even though the summary stats are right. > > Mihael Hategan wrote: > > On Tue, 2009-04-28 at 07:15 +0000, Ben Clifford wrote: > > > >> On Tue, 28 Apr 2009, Ben Clifford wrote: > >> > >> > >>>> A possible solution is to add another state ("Checking status" or > >>>> something), that is triggered in Execute by a task completed event. > >>>> > >> r2887 adds a checking status event in execute2, right after the > >> vdl:execute call. I think that is sufficient? or are you claiming that > >> something deeper in the execute layer is delaying things? > >> > > > > It's probably better but not sufficient. With many threads there may be > > delays between the vdl:execute call completing and the next thing > > running may be visible. > > > > The problem is that swift may say that more jobs are active than known > > possible, which will raise eyebrows. 
> > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From hockyg at uchicago.edu Tue Apr 28 13:25:21 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Tue, 28 Apr 2009 13:25:21 -0500 Subject: [Swift-devel] progress stats lying In-Reply-To: <1240942443.5245.4.camel@localhost> References: <1240872586.30696.8.camel@localhost> <1240941841.4742.16.camel@localhost> <49F74571.4080703@uchicago.edu> <1240942443.5245.4.camel@localhost> Message-ID: <49F74A11.4070702@uchicago.edu> Agreed. but if they are inaccurate then it is as bad as not having them at all. Mihael Hategan wrote: > On Tue, 2009-04-28 at 13:05 -0500, Glen Hocky wrote: > >> as a user, it seems very important to have accurate summary stats so you >> can tell if anything is going wrong. >> > > You're overgeneralizing. You can infer some things from the summary > stats, but many things can go wrong even though the summary stats are > right. > > >> Mihael Hategan wrote: >> >>> On Tue, 2009-04-28 at 07:15 +0000, Ben Clifford wrote: >>> >>> >>>> On Tue, 28 Apr 2009, Ben Clifford wrote: >>>> >>>> >>>> >>>>>> A possible solution is to add another state ("Checking status" or >>>>>> something), that is triggered in Execute by a task completed event. >>>>>> >>>>>> >>>> r2887 adds a checking status event in execute2, right after the >>>> vdl:execute call. I think that is sufficient? or are you claiming that >>>> something deeper in the execute layer is delaying things? >>>> >>>> >>> It's probably better but not sufficient. With many threads there may be >>> delays between the vdl:execute call completing and the next thing >>> running may be visible. >>> >>> The problem is that swift may say that more jobs are active than known >>> possible, which will raise eyebrows. 
>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> > > From hategan at mcs.anl.gov Tue Apr 28 13:30:41 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 28 Apr 2009 13:30:41 -0500 Subject: [Swift-devel] progress stats lying In-Reply-To: <49F74A11.4070702@uchicago.edu> References: <1240872586.30696.8.camel@localhost> <1240941841.4742.16.camel@localhost> <49F74571.4080703@uchicago.edu> <1240942443.5245.4.camel@localhost> <49F74A11.4070702@uchicago.edu> Message-ID: <1240943441.5605.0.camel@localhost> On Tue, 2009-04-28 at 13:25 -0500, Glen Hocky wrote: > Agreed. but if they are inaccurate then it is as bad as not having them > at all. Maybe worse even. > > Mihael Hategan wrote: > > On Tue, 2009-04-28 at 13:05 -0500, Glen Hocky wrote: > > > >> as a user, it seems very important to have accurate summary stats so you > >> can tell if anything is going wrong. > >> > > > > You're overgeneralizing. You can infer some things from the summary > > stats, but many things can go wrong even though the summary stats are > > right. > > > > > >> Mihael Hategan wrote: > >> > >>> On Tue, 2009-04-28 at 07:15 +0000, Ben Clifford wrote: > >>> > >>> > >>>> On Tue, 28 Apr 2009, Ben Clifford wrote: > >>>> > >>>> > >>>> > >>>>>> A possible solution is to add another state ("Checking status" or > >>>>>> something), that is triggered in Execute by a task completed event. > >>>>>> > >>>>>> > >>>> r2887 adds a checking status event in execute2, right after the > >>>> vdl:execute call. I think that is sufficient? or are you claiming that > >>>> something deeper in the execute layer is delaying things? > >>>> > >>>> > >>> It's probably better but not sufficient. With many threads there may be > >>> delays between the vdl:execute call completing and the next thing > >>> running may be visible. 
> >>> > >>> The problem is that swift may say that more jobs are active than known > >>> possible, which will raise eyebrows. > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > > > > > From zhaozhang at uchicago.edu Tue Apr 28 15:00:40 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 28 Apr 2009 15:00:40 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> Message-ID: <49F76068.4020600@uchicago.edu> Hi, Ben Is this the following a correct condor-G sites.xml ? zhao /nfs/osg-data/osgedu/benc/swift Ben Clifford wrote: > On Fri, 24 Apr 2009, Michael Wilde wrote: > > >> Testing of the new condor provider features is important. >> > > I would be interested in seeing the condor-g changes tested against a > large set of OSG sites. However, the swift/ress/osg site file generator > does not generate sites files to use that. If you (Zhao) want to modify > the sites file generator to give sites.xml files that use the condor > provider in condor-g mode, instead of using the gt2 provider, then you > would be able to get some testing done on OSG; or you could generate the > sites file by hand based on the output of swift-osg-ress-site-catalog. 
> > From benc at hawaga.org.uk Tue Apr 28 15:23:57 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 20:23:57 +0000 (GMT) Subject: [Swift-devel] progress stats lying In-Reply-To: <1240943441.5605.0.camel@localhost> References: <1240872586.30696.8.camel@localhost> <1240941841.4742.16.camel@localhost> <49F74571.4080703@uchicago.edu> <1240942443.5245.4.camel@localhost> <49F74A11.4070702@uchicago.edu> <1240943441.5605.0.camel@localhost> Message-ID: On Tue, 28 Apr 2009, Mihael Hategan wrote: > On Tue, 2009-04-28 at 13:25 -0500, Glen Hocky wrote: > > Agreed. but if they are inaccurate then it is as bad as not having them > > at all. > > Maybe worse even. Very few of Swift's execution systems give highly temporally-correct status change events - no amount of fixing up Swift client side stuff will cure that. -- From benc at hawaga.org.uk Tue Apr 28 15:40:28 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 20:40:28 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49F76068.4020600@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> Message-ID: On Tue, 28 Apr 2009, Zhao Zhang wrote: > Is this the following a correct condor-G sites.xml ? No, that will try to use gram4. 
Have a look at this message for an example that I have successfully used: http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-April/005115.html -- From zhaozhang at uchicago.edu Tue Apr 28 15:42:08 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 28 Apr 2009 15:42:08 -0500 Subject: [Swift-devel] Re: several questions about coaster In-Reply-To: <1240941578.4742.10.camel@localhost> References: <49F6983E.7030301@uchicago.edu> <1240941578.4742.10.camel@localhost> Message-ID: <49F76A20.3000201@uchicago.edu> Hi, Mihael Mihael Hategan wrote: > On Tue, 2009-04-28 at 00:46 -0500, Zhao Zhang wrote: > >> Hi, Mihael >> >> As I am going to test coaster deeply on various systems. I got several >> questions regarding to the coaster infrastructure. >> 1. Scalability >> From source code, I could tell that coaster is using TCP for task >> transmission.(correct me if I am wrong) What is the largest >> scale test we have done with coaster? I mean the ratio between >> coaster dispatcher and the number of workers. >> > > Typical stuff on ranger. Around 1000 nodes. > Btw, could coaster use all cores on a single compute node? I mean in a multi-core context. > >> 2. Usability >> All tests I have tried with coaster were running together with >> swift. That is a black box test for me. Is there any interface in coaster >> that I could specify the number of workers and the wall time? >> > > Like all profile entries, they are task attributes. > > >> Also, is there a way for me to start coaster service and workers >> separately and independently? >> > > The coaster provider is a provider like any other. Use the "job > submission example" in cog and change the provider to "coaster". > I found this link, but it is empty. http://wiki.cogkit.org/wiki/Java_Cog_Kit_Examples_Guide#Job_Submission > >> Besides, I am guessing coaster is using a dynamic provisioning >> approach to request resources, is this correct? 
Which means coaster >> will decide how many compute nodes to request according to the >> number of jobs, and the length of jobs. >> > > No. > > For each job, it will try to find the worker with the least time left > that can still run the job. If no such worker can be found, it will try > to start one with 10 times larger walltime up to a maximum of 250 > workers I think. > > >> If I run coaster in a super computer >> context, can I ask coaster to hold a certain number of compute nodes >> for a certain amount of time? (This somehow overlaps the first Q in >> section 2) >> > > You don't seem to have tuned in much into recent discussion on this > mailing list. > > No, you can't. > I knew this from the discussion between you and Mike. So is this a feature that we are going to implement soon, or we haven't decided? > >> 3. Performance >> Does coaster provide alternative interface other than the coaster >> provider? Say if I want to test the dispatch rate of coaster, but don't >> want to >> introduce swift overhead, which is a good way to start coaster? >> > > The abstraction api in cog has nothing to do with swift, so use that. > err, I tried to find it on the cog kit wiki, but could not find it. Could you point me to somewhere handy that I could try it out? > >> Is there a coaster log that could show the number of active workers >> currently registered with the coaster service, how many jobs are running, >> how many jobs returned successful, and etc.? >> > > Yes. On the remote site, in ~/.globus/coasters > I found those. zhao > >> 4. Dispatch Algorithm >> Does coaster use a scoring algorithm for dispatching jobs? >> > > No scoring algorithm. Read appropriate answer in (2). > > >> Which >> means coaster service keeps scores for every workers, and dispatch jobs >> based on those scores? Is there an alternative way, say FIFO algorithm? >> >> 5. Reliability >> I know that if a job failed, swift could resend the same job. 
But >> does coaster have any error recovery mechanism built in? >> > > No. It deliberately has none, in order to avoid obscuring errors. > > > > From zhaozhang at uchicago.edu Tue Apr 28 15:51:03 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 28 Apr 2009 15:51:03 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> Message-ID: <49F76C37.90808@uchicago.edu> Hi, Ben Here is my sites.xml [zzhang at communicado sites]$ cat condor-g/renci-condor.xml /nfs/home/osgedu/benc grid gt2 belhaven-1.renci.org/jobmanager-fork And, I started run-site to test this, [zzhang at communicado sites]$ ./run-site condor-g/renci-condor.xml testing site configuration: condor-g/renci-condor.xml Removing files from previous runs Running test 061-cattwo at Tue Apr 28 15:49:03 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090428-1549-rken3vo4 Progress: Progress: Stage in:1 Progress: Submitting:1 Failed to transfer wrapper log from 061-cattwo-20090428-1549-rken3vo4/info/y on renci-engage Progress: Stage in:1 Failed to transfer wrapper log from 061-cattwo-20090428-1549-rken3vo4/info/0 on renci-engage Progress: Stage in:1 Failed to transfer wrapper log from 061-cattwo-20090428-1549-rken3vo4/info/2 on renci-engage Progress: Failed:1 Execution failed: Exception in cat: Arguments: [061-cattwo.1.in, 061-cattwo.2.in] Host: renci-engage Directory: 061-cattwo-20090428-1549-rken3vo4/jobs/2/cat-26xkb2aj stderr.txt: stdout.txt: ---- Caused by: Cannot submit job: Could 
not submit job (condor_submit reported an exit code of 1). no error output SWIFT RETURN CODE NON-ZERO - test 061-cattwo Any ideas on the failures? Do you need more information? zhao Ben Clifford wrote: > On Tue, 28 Apr 2009, Zhao Zhang wrote: > > >> Is this the following a correct condor-G sites.xml ? >> > > No, that will try to use gram4. > > Have a look at this message for an example that I have successfully used: > > http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-April/005115.html > > > > From benc at hawaga.org.uk Tue Apr 28 15:54:00 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 20:54:00 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49F76C37.90808@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> Message-ID: The version of condor that is on communicado probably won't work. i have only seen this work with condor 7.0.x. 
I think the version on teraport's tp-osg node should work: ssh tp-osg.ci.uchicago.edu $ source /opt/osg/setup.sh $ condor_version $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $ $CondorPlatform: X86_64-LINUX_RHEL5 $ On Tue, 28 Apr 2009, Zhao Zhang wrote: > Hi, Ben > > Here is my sites.xml > > [zzhang at communicado sites]$ cat condor-g/renci-condor.xml > > > > > /nfs/home/osgedu/benc > grid > gt2 > belhaven-1.renci.org/jobmanager-fork > > > > And, I started run-site to test this, > [zzhang at communicado sites]$ ./run-site condor-g/renci-condor.xml > testing site configuration: condor-g/renci-condor.xml > Removing files from previous runs > Running test 061-cattwo at Tue Apr 28 15:49:03 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090428-1549-rken3vo4 > Progress: > Progress: Stage in:1 > Progress: Submitting:1 > Failed to transfer wrapper log from 061-cattwo-20090428-1549-rken3vo4/info/y > on renci-engage > Progress: Stage in:1 > Failed to transfer wrapper log from 061-cattwo-20090428-1549-rken3vo4/info/0 > on renci-engage > Progress: Stage in:1 > Failed to transfer wrapper log from 061-cattwo-20090428-1549-rken3vo4/info/2 > on renci-engage > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [061-cattwo.1.in, 061-cattwo.2.in] > Host: renci-engage > Directory: 061-cattwo-20090428-1549-rken3vo4/jobs/2/cat-26xkb2aj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job: Could not submit job (condor_submit reported an exit > code of 1). no error output > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > Any ideas on the failures? Do you need more information? > > zhao > > > Ben Clifford wrote: > > On Tue, 28 Apr 2009, Zhao Zhang wrote: > > > > > > > Is this the following a correct condor-G sites.xml ? > > > > > > > No, that will try to use gram4. 
> > > > Have a look at this message for an example that I have successfully used: > > > > http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-April/005115.html > > > > > > > > > > From benc at hawaga.org.uk Tue Apr 28 15:55:56 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 20:55:56 +0000 (GMT) Subject: [Swift-devel] Re: several questions about coaster In-Reply-To: <49F76A20.3000201@uchicago.edu> References: <49F6983E.7030301@uchicago.edu> <1240941578.4742.10.camel@localhost> <49F76A20.3000201@uchicago.edu> Message-ID: On Tue, 28 Apr 2009, Zhao Zhang wrote: > Btw, could coaster use all cores on a single compute node? I mean in a > multi-core context. That was the intention of the coastersPerNode configuration option (although as I wrote it, it is buggy and causes too many coaster workers to be allocated - I'm not sure how it behaves at the moment) -- From zhaozhang at uchicago.edu Tue Apr 28 16:13:21 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 28 Apr 2009 16:13:21 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> Message-ID: <49F77171.9060307@uchicago.edu> Thanks, Ben It worked, but there is an exception. 
zhao [zzhang at tp-grid1 sites]$ ./run-site condor-g/renci-condor.xml testing site configuration: condor-g/renci-condor.xml Removing files from previous runs Running test 061-cattwo at Tue Apr 28 16:04:38 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090428-1604-cge2klv7 Progress: uninitialized:1 Progress: Stage in:1 Progress: Submitting:1 Progress: Active:1 Progress: Active:1 Progress: Stage out:1 Final status: Finished successfully:1 The following warnings have occurred: 1. Cleanup on renci-engage failed Caused by: Cannot submit job: Could not submit job (condor_submit reported an exit code of 1). no error output expecting 061-cattwo.out.expected checking 061-cattwo.out.expected Skipping exception test due to test configuration Test passed at Tue Apr 28 16:05:25 CDT 2009 Ben Clifford wrote: > The version of condor that is on communicado probably won't work. i have > only seen this work with condor 7.0.x. > > I think the version on teraport's tp-osg node should work: > > ssh tp-osg.ci.uchicago.edu > $ source /opt/osg/setup.sh > $ condor_version > $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $ > $CondorPlatform: X86_64-LINUX_RHEL5 $ > > > On Tue, 28 Apr 2009, Zhao Zhang wrote: > > >> Hi, Ben >> >> Here is my sites.xml >> >> [zzhang at communicado sites]$ cat condor-g/renci-condor.xml >> >> >> >> >> /nfs/home/osgedu/benc >> grid >> gt2 >> belhaven-1.renci.org/jobmanager-fork >> >> >> >> And, I started run-site to test this, >> [zzhang at communicado sites]$ ./run-site condor-g/renci-condor.xml >> testing site configuration: condor-g/renci-condor.xml >> Removing files from previous runs >> Running test 061-cattwo at Tue Apr 28 15:49:03 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090428-1549-rken3vo4 >> Progress: >> Progress: Stage in:1 >> Progress: Submitting:1 >> Failed to transfer wrapper log from 061-cattwo-20090428-1549-rken3vo4/info/y >> on renci-engage >> Progress: Stage in:1 >> Failed to transfer wrapper log from 
061-cattwo-20090428-1549-rken3vo4/info/0 >> on renci-engage >> Progress: Stage in:1 >> Failed to transfer wrapper log from 061-cattwo-20090428-1549-rken3vo4/info/2 >> on renci-engage >> Progress: Failed:1 >> Execution failed: >> Exception in cat: >> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] >> Host: renci-engage >> Directory: 061-cattwo-20090428-1549-rken3vo4/jobs/2/cat-26xkb2aj >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Cannot submit job: Could not submit job (condor_submit reported an exit >> code of 1). no error output >> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >> >> Any ideas on the failures? Do you need more information? >> >> zhao >> >> >> Ben Clifford wrote: >> >>> On Tue, 28 Apr 2009, Zhao Zhang wrote: >>> >>> >>> >>>> Is this the following a correct condor-G sites.xml ? >>>> >>>> >>> No, that will try to use gram4. >>> >>> Have a look at this message for an example that I have successfully used: >>> >>> http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-April/005115.html >>> >>> >>> >>> >>> >> > > From benc at hawaga.org.uk Tue Apr 28 16:20:47 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 28 Apr 2009 21:20:47 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49F77171.9060307@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> Message-ID: On Tue, 28 Apr 2009, Zhao Zhang wrote: > It worked, but there is an exception. OK. 
That's a known issue (or known to me and Mats at least) with the way that cleanup jobs are handled. I need to fix it, but it should not stop you testing lots of sites. -- From hategan at mcs.anl.gov Tue Apr 28 16:25:26 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 28 Apr 2009 16:25:26 -0500 Subject: [Swift-devel] Re: several questions about coaster In-Reply-To: <49F76A20.3000201@uchicago.edu> References: <49F6983E.7030301@uchicago.edu> <1240941578.4742.10.camel@localhost> <49F76A20.3000201@uchicago.edu> Message-ID: <1240953926.7799.7.camel@localhost> On Tue, 2009-04-28 at 15:42 -0500, Zhao Zhang wrote: > Hi, Mihael > > Mihael Hategan wrote: > What is the largest > >> scale test we have done with coaster? I mean the ratio between > >> coaster dispatcher and the number of workers. > >> > > > > Typical stuff on ranger. Around 1000 nodes. > > > Btw, could coaster use all cores on a single compute node? I mean in a > multi-core context. Using "coastersPerNode" attribute, yes. > > > >> 2. Usability > >> All tests I have tried with coaster were running together with > >> swift. That is a black box test for me. Is there any interface in coaster > >> that I could specify the number of workers and the wall time? > >> > > > > Like all profile entries, they are task attributes. > > > > > >> Also, is there a way for me to start coaster service and workers > >> separately and independently? > >> > > > > The coaster provider is a provider like any other. Use the "job > > submission example" in cog and change the provider to "coaster". > > > I found this link, but it is empty. > http://wiki.cogkit.org/wiki/Java_Cog_Kit_Examples_Guide#Job_Submission Use http://wiki.cogkit.org/wiki/Java_CoG_Kit_Abstraction_Guide#How_to_execute_a_remote_job_execution_task > > > >> If I run coaster in a super computer > >> context, can I ask coaster to hold a certain number of compute > nodes > >> for a certain amount of time? 
(This somehow overlaps the first Q > in > >> section 2) > >> > > > > You don't seem to have tuned in much into recent discussion on this > > mailing list. > > > > No, you can't. > > > I knew this from the discussion between you and Mike. So is this a > feature that we are going to implement soon, or we haven't decided? It was the highest priority up until last night. Right now it isn't, so I can't say when I think it will be done. I'll try to squeeze it in. From zhaozhang at uchicago.edu Tue Apr 28 16:52:39 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 28 Apr 2009 16:52:39 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F77171.9060307@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDF19D.7030305@uchicago.edu> <49EDF456.5050202@mcs.anl.gov> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> Message-ID: <49F77AA7.5070703@uchicago.edu> Hi, Again So here is the OSG site list of the VO "OSG", I am double checking that I need to test all of those sites, right? http://vors.grid.iu.edu/cgi-bin/index.cgi?VO=24&grid=1®ion=0&res=0&dtype=0 zhao Zhao Zhang wrote: > Thanks, Ben > > It worked, but there is an exception. 
> > zhao > > [zzhang at tp-grid1 sites]$ ./run-site condor-g/renci-condor.xml > testing site configuration: condor-g/renci-condor.xml > Removing files from previous runs > Running test 061-cattwo at Tue Apr 28 16:04:38 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090428-1604-cge2klv7 > Progress: uninitialized:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Active:1 > Progress: Active:1 > Progress: Stage out:1 > Final status: Finished successfully:1 > The following warnings have occurred: > 1. Cleanup on renci-engage failed > Caused by: > Cannot submit job: Could not submit job (condor_submit reported > an exit code of 1). no error output > expecting 061-cattwo.out.expected > checking 061-cattwo.out.expected > Skipping exception test due to test configuration > Test passed at Tue Apr 28 16:05:25 CDT 2009 > > > Ben Clifford wrote: >> The version of condor that is on communicado probably won't work. i >> have only seen this work with condor 7.0.x. >> >> I think the version on teraport's tp-osg node should work: >> >> ssh tp-osg.ci.uchicago.edu >> $ source /opt/osg/setup.sh $ condor_version >> $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846 $ >> $CondorPlatform: X86_64-LINUX_RHEL5 $ >> >> >> On Tue, 28 Apr 2009, Zhao Zhang wrote: >> >> >>> Hi, Ben >>> >>> Here is my sites.xml >>> >>> [zzhang at communicado sites]$ cat condor-g/renci-condor.xml >>> >>> >>> >>> >>> /nfs/home/osgedu/benc >>> grid >>> gt2 >>> belhaven-1.renci.org/jobmanager-fork >>> >>> >>> >>> And, I started run-site to test this, >>> [zzhang at communicado sites]$ ./run-site condor-g/renci-condor.xml >>> testing site configuration: condor-g/renci-condor.xml >>> Removing files from previous runs >>> Running test 061-cattwo at Tue Apr 28 15:49:03 CDT 2009 >>> Swift 0.9rc2 swift-r2860 cog-r2388 >>> >>> RunID: 20090428-1549-rken3vo4 >>> Progress: >>> Progress: Stage in:1 >>> Progress: Submitting:1 >>> Failed to transfer wrapper log from >>> 
061-cattwo-20090428-1549-rken3vo4/info/y >>> on renci-engage >>> Progress: Stage in:1 >>> Failed to transfer wrapper log from >>> 061-cattwo-20090428-1549-rken3vo4/info/0 >>> on renci-engage >>> Progress: Stage in:1 >>> Failed to transfer wrapper log from >>> 061-cattwo-20090428-1549-rken3vo4/info/2 >>> on renci-engage >>> Progress: Failed:1 >>> Execution failed: >>> Exception in cat: >>> Arguments: [061-cattwo.1.in, 061-cattwo.2.in] >>> Host: renci-engage >>> Directory: 061-cattwo-20090428-1549-rken3vo4/jobs/2/cat-26xkb2aj >>> stderr.txt: >>> >>> stdout.txt: >>> >>> ---- >>> >>> Caused by: >>> Cannot submit job: Could not submit job (condor_submit >>> reported an exit >>> code of 1). no error output >>> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >>> >>> Any ideas on the failures? Do you need more information? >>> >>> zhao >>> >>> >>> Ben Clifford wrote: >>> >>>> On Tue, 28 Apr 2009, Zhao Zhang wrote: >>>> >>>> >>>>> Is this the following a correct condor-G sites.xml ? >>>>> >>>> No, that will try to use gram4. 
>>>> >>>> Have a look at this message for an example that I have successfully >>>> used: >>>> >>>> http://mail.ci.uchicago.edu/pipermail/swift-devel/2009-April/005115.html >>>> >>>> >>>> >>>> >>>> >>> >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From zhaozhang at uchicago.edu Tue Apr 28 16:59:35 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 28 Apr 2009 16:59:35 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <49EDF8C1.7050201@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> Message-ID: <49F77C47.7030106@uchicago.edu> Hi, Ben Unfortunately, the last swift workflow failed, I am attaching the log. zhao Ben Clifford wrote: > On Tue, 28 Apr 2009, Zhao Zhang wrote: > > >> It worked, but there is an exception. >> > > OK. That's a known issue (or known to me and Mats at least) with the way > that cleanup jobs are handled. I need to fix it, but it should not stop > you testing lots of sites. > > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: condor_log_renci_0428 URL: From hockyg at uchicago.edu Tue Apr 28 17:19:31 2009 From: hockyg at uchicago.edu (Glen Max Hocky) Date: Tue, 28 Apr 2009 17:19:31 -0500 (CDT) Subject: [Swift-devel] reproducible problem running under coasters on ranger from communicado Message-ID: <20090428171931.AEM63965@m4500-00.uchicago.edu> I had the following problem this morning and just recreated under mike's login. (showing him how to run the latest stuff and i wanted to see if this problem could be recreated) This is all with the latest svn version Hundreds of jobs were in the active state and running on an equiv number of cpus on ranger. All of the sudden, all but 100 switched to a failed state. Then the run proceeded fairly normally until it crashed with a "coaster failed to start" error. clips of errors below all logs in /home/wilde/oops/swift/output/rangeroutdir.20 coaster logs in /home/wilde/oops/swift/output/rangeroutdir.20/coaster_logs ------------------------------------ Progress: Selecting site:3 Submitted:784 Active:113 Finished successfully:9 Progress: Selecting site:3 Submitted:512 Active:385 Finished successfully:9 Progress: Selecting site:3 Submitted:379 Active:518 Finished successfully:9 Progress: Selecting site:3 Submitted:337 Active:560 Finished successfully:9 Progress: Selecting site:3 Submitted:337 Active:560 Finished successfully:9 Progress: Selecting site:3 Submitted:337 Active:560 Finished successfully:9 Progress: Selecting site:3 Submitted:337 Active:559 Finished successfully:9 Failed but can retry:1 Progress: Selecting site:3 Submitted:335 Active:559 Finished successfully:9 Failed but can retry:3 Progress: Selecting site:3 Submitted:335 Active:543 Finished successfully:9 Failed but can retry:19 Progress: Selecting site:3 Submitted:333 Active:543 Finished successfully:9 Failed but can retry:21 Progress: Selecting site:3 Submitted:333 Active:527 Finished successfully:9 Failed but can retry:37 Progress: Selecting site:3 Submitted:333 Active:495 
Finished successfully:9 Failed but can retry:69 Progress: Selecting site:3 Submitted:332 Active:481 Finished successfully:9 Failed but can retry:84 Progress: Selecting site:3 Submitted:332 Active:479 Finished successfully:9 Failed but can retry:86 Progress: Selecting site:3 Submitted:331 Active:465 Finished successfully:9 Failed but can retry:101 Progress: Selecting site:3 Submitted:331 Active:463 Finished successfully:9 Failed but can retry:103 Progress: Selecting site:3 Submitted:330 Active:447 Finished successfully:9 Failed but can retry:120 Progress: Selecting site:3 Submitted:329 Active:433 Finished successfully:9 Failed but can retry:135 Progress: Selecting site:3 Submitted:329 Active:415 Finished successfully:9 Failed but can retry:153 Progress: Selecting site:3 Submitted:329 Active:399 Finished successfully:9 Failed but can retry:169 Progress: Selecting site:3 Submitted:329 Active:383 Finished successfully:9 Failed but can retry:185 Progress: Selecting site:3 Submitted:328 Active:367 Finished successfully:9 Failed but can retry:202 Progress: Selecting site:3 Submitted:327 Active:351 Finished successfully:9 Failed but can retry:219 Progress: Selecting site:3 Submitted:326 Active:336 Finished successfully:9 Failed but can retry:235 Progress: Selecting site:3 Submitted:326 Active:319 Finished successfully:9 Failed but can retry:252 Progress: Selecting site:3 Submitted:220 Active:408 Finished successfully:9 Failed but can retry:269 Progress: Selecting site:3 Submitted:219 Active:363 Finished successfully:9 Failed but can retry:315 Progress: Selecting site:3 Submitted:216 Active:334 Finished successfully:9 Failed but can retry:347 Progress: Selecting site:3 Submitted:214 Active:303 Finished successfully:9 Failed but can retry:380 Progress: Selecting site:3 Submitted:214 Active:287 Finished successfully:9 Failed but can retry:396 Progress: Selecting site:3 Submitted:214 Active:271 Finished successfully:9 Failed but can retry:412 Progress: Selecting site:3 
Submitted:213 Active:255 Finished successfully:9 Failed but can retry:429 Progress: Selecting site:3 Submitted:213 Active:239 Finished successfully:9 Failed but can retry:445 Progress: Selecting site:3 Submitted:213 Active:223 Finished successfully:9 Failed but can retry:461 Progress: Selecting site:3 Submitted:213 Active:207 Finished successfully:9 Failed but can retry:477 Progress: Selecting site:3 Submitted:212 Active:207 Finished successfully:9 Failed but can retry:478 Progress: Selecting site:3 Submitted:212 Active:175 Finished successfully:9 Failed but can retry:510 Progress: Selecting site:3 Submitted:211 Active:143 Finished successfully:9 Failed but can retry:543 Progress: Selecting site:3 Submitted:211 Active:112 Finished successfully:9 Failed but can retry:574 Progress: Selecting site:3 Submitted:211 Active:111 Finished successfully:9 Failed but can retry:575 Progress: Selecting site:3 Submitted:211 Active:96 Finished successfully:9 Failed but can retry:590 Progress: Selecting site:3 Submitted:211 Active:96 Finished successfully:9 Failed but can retry:590 ----------------------------------- Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/a on ranger Progress: Submitted:801 Active:44 Finished successfully:61 Failed but can retry:3 Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/x on ranger Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/l on ranger Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/v on ranger Progress: Stage in:1 Submitted:802 Active:42 Checking status:1 Finished successfully:61 Failed but can retry:2 Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/3 on ranger Execution failed: Exception in runramaSpeed: Arguments: [input/fasta/T1af7.fasta, home/wilde/oops/swift/output/rangeroutdir.20/T1af7/T1af7.ST25.TU200.000 0.secseq, input/native/T1af7.pdb, input/rama/T1af7.rama_map, home/wi 
lde/oops/swift/output/rangeroutdir.20/T1af7//ST25.TU200/0000/01/64/T1af 7.ST25.TU200.0000.0164.pdt, home/wilde/oops/swift/output/rangeroutdir.20/T1af7//ST25.TU200/0000/01/ 64/T1a f7.ST25.TU200.0000.0164.rmsd, 164, DEFAULT_INIT_TEMP_=_25, TEMP_UPDATE_INTERVAL_=_200, MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_30] Host: ranger Directory: oops-20090428-1642-ils1yrj8/jobs/3/runramaSpeed-383qd2aj stderr.txt: stdout.txt: ---- Caused by: Failed to start worker: Worker ended prematurely Cleaning up... Shutting down service at https://129.114.50.163:49375 Got channel MetaChannel: 3994917 -> GSSSChannel-null(1) - Done ------------------------------------------ From hategan at mcs.anl.gov Tue Apr 28 17:31:07 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 28 Apr 2009 17:31:07 -0500 Subject: [Swift-devel] Re: reproducible problem running under coasters on ranger from communicado In-Reply-To: <20090428171931.AEM63965@m4500-00.uchicago.edu> References: <20090428171931.AEM63965@m4500-00.uchicago.edu> Message-ID: <1240957867.10076.0.camel@localhost> [hategan at communicado coaster_logs]$ pwd /home/wilde/oops/swift/output/rangeroutdir.20/coaster_logs [hategan at communicado coaster_logs]$ cd coasters/ -bash: cd: coasters/: Permission denied On Tue, 2009-04-28 at 17:19 -0500, Glen Max Hocky wrote: > I had the following problem this morning and just recreated under mike's login. > (showing him how to run the latest stuff and i wanted to see if this problem > could be recreated) > > This is all with the latest svn version > Hundreds of jobs were in the active state and running on an equiv number of > cpus on ranger. All of the sudden, all but 100 switched to a failed state. Then > the run proceeded fairly normally until it crashed with a "coaster failed to start" > error. 
>
> clips of errors below
>
> all logs in /home/wilde/oops/swift/output/rangeroutdir.20
> coaster logs in /home/wilde/oops/swift/output/rangeroutdir.20/coaster_logs
>
> [progress log and error clips identical to Glen's message above; trimmed]
>
> Caused by:
> Failed to start worker: Worker ended prematurely
> Cleaning up...
> Shutting down service at https://129.114.50.163:49375 > Got channel MetaChannel: 3994917 -> GSSSChannel-null(1) > - Done > ------------------------------------------ From wilde at mcs.anl.gov Tue Apr 28 18:08:47 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 28 Apr 2009 18:08:47 -0500 Subject: [Swift-devel] Re: reproducible problem running under coasters on ranger from communicado In-Reply-To: <1240957867.10076.0.camel@localhost> References: <20090428171931.AEM63965@m4500-00.uchicago.edu> <1240957867.10076.0.camel@localhost> Message-ID: <49F78C7F.1050302@mcs.anl.gov> Should be readable now - sorry. I think when Glen copied the files they retained their perms from the originals on Ranger. Glen - need to make the files readable when you copy 'em. On 4/28/09 5:31 PM, Mihael Hategan wrote: > [hategan at communicado coaster_logs]$ pwd > /home/wilde/oops/swift/output/rangeroutdir.20/coaster_logs > [hategan at communicado coaster_logs]$ cd coasters/ > -bash: cd: coasters/: Permission denied > > > On Tue, 2009-04-28 at 17:19 -0500, Glen Max Hocky wrote: >> I had the following problem this morning and just recreated under mike's login. >> (showing him how to run the latest stuff and i wanted to see if this problem >> could be recreated) >> >> This is all with the latest svn version >> Hundreds of jobs were in the active state and running on an equiv number of >> cpus on ranger. All of the sudden, all but 100 switched to a failed state. Then >> the run proceeded fairly normally until it crashed with a "coaster failed to start" >> error. 
>>
>> clips of errors below
>>
>> all logs in /home/wilde/oops/swift/output/rangeroutdir.20
>> coaster logs in /home/wilde/oops/swift/output/rangeroutdir.20/coaster_logs
>>
>> [progress log and error clips identical to Glen's original message above; trimmed]
>>
>> Caused by:
>> Failed to start worker: Worker ended
prematurely >> Cleaning up... >> Shutting down service at https://129.114.50.163:49375 >> Got channel MetaChannel: 3994917 -> GSSSChannel-null(1) >> - Done >> ------------------------------------------ > From benc at hawaga.org.uk Wed Apr 29 01:27:31 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 29 Apr 2009 06:27:31 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49F77C47.7030106@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> Message-ID: OK. The tests with spaces/quotes/strange characters will fail whenever condor is used. As long as 061-cattwo and 130-fmri are successful, that is ok for now. Bug 200 in the Swift bugzilla may allow all the tests to work, when it is implemented (I am part way through). On Tue, 28 Apr 2009, Zhao Zhang wrote: > Hi, Ben > > Unfortunately, the last swift workflow failed, I am attaching the log. > > zhao > > Ben Clifford wrote: > > On Tue, 28 Apr 2009, Zhao Zhang wrote: > > > > > > > It worked, but there is an exception. > > > > > > > OK. That's a known issue (or known to me and Mats at least) with the way > > that cleanup jobs are handled. I need to fix it, but it should not stop you > > testing lots of sites. 
From zhaozhang at uchicago.edu Wed Apr 29 09:44:56 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Wed, 29 Apr 2009 09:44:56 -0500
Subject: [Swift-devel] feature request
In-Reply-To:
References: <49E62C55.5080107@uchicago.edu> <49EDFB93.10902@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu>
Message-ID: <49F867E8.9050808@uchicago.edu>

Hi, Ben

I am figuring out how to convert from the VO Resource Selector web page to a valid sites.xml. Take renci-engagement as an example; the page is here:

http://vors.grid.iu.edu/cgi-bin/index.cgi?region=0&VO=24&grid=1&dtype=0&res=387#SITELISTING

How could I infer the value of the following fields?

/nfs/home/osgedu/benc
grid
gt2
belhaven-1.renci.org/jobmanager-fork

Also, as I remember, condor should use GT4, right? But we specify gt2 in the above sites.xml; is this correct? Thanks.

zhao

Ben Clifford wrote:
> OK. The tests with spaces/quotes/strange characters will fail whenever
> condor is used. As long as 061-cattwo and 130-fmri are successful, that is
> ok for now.
>
> Bug 200 in the Swift bugzilla may allow all the tests to work, when it is
> implemented (I am part way through).
>
> On Tue, 28 Apr 2009, Zhao Zhang wrote:
>
>> Hi, Ben
>>
>> Unfortunately, the last swift workflow failed; I am attaching the log.
>>
>> zhao
>>
>> Ben Clifford wrote:
>>> On Tue, 28 Apr 2009, Zhao Zhang wrote:
>>>
>>>> It worked, but there is an exception.
>>>>
>>> OK. That's a known issue (or known to me and Mats at least) with the way
>>> that cleanup jobs are handled.
>>> I need to fix it, but it should not stop you
>>> testing lots of sites.

From benc at hawaga.org.uk Wed Apr 29 09:49:00 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 29 Apr 2009 14:49:00 +0000 (GMT)
Subject: [Swift-devel] feature request
In-Reply-To: <49F867E8.9050808@uchicago.edu>
References: <49E62C55.5080107@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> <49F867E8.9050808@uchicago.edu>
Message-ID:

On Wed, 29 Apr 2009, Zhao Zhang wrote:

> I am figuring out how to convert from the VO Resource Selector web page to valid
> sites.xml.

There is already a tool, swift-osg-ress-site-catalog, to take OSG site information and make it into a sites.xml. That is a much better place for you to start at, I think. Look in the bin/ directory of a swift distribution. That already collects all of the information you need.

> Also, as I remembered, condor should use GT4, right? But we specify gt2 in the
> above sites.xml, is this correct?

No, you should use gt2 (that means gram2).
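[For readers reconstructing the configuration under discussion: the archiver has stripped the XML markup from the sites.xml fragments in this thread, leaving only bare values such as /nfs/home/osgedu/benc, grid, gt2, and belhaven-1.renci.org/jobmanager-fork. A sketch of how such a pool entry is typically spelled in the Swift site catalog of this era follows; the element and profile names are an editor's reconstruction of that format, not recovered from the original message, so treat them as assumptions.]

```xml
<pool handle="RENCI-Engagement">
  <!-- submit through Condor-G: "grid" job type, GRAM2 (gt2) contact string -->
  <execution provider="condor" />
  <profile namespace="globus" key="jobType">grid</profile>
  <profile namespace="globus" key="gridResource">gt2 belhaven-1.renci.org/jobmanager-fork</profile>
  <!-- scratch directory Swift uses on the remote site; must be writable -->
  <workdirectory>/nfs/home/osgedu/benc</workdirectory>
</pool>
```

[Note that the pool handle ("RENCI-Engagement") must also match the site column of the tc.data transformation catalog, e.g. a line like `RENCI-Engagement cat /bin/cat INSTALLED INTEL32::LINUX null` — a mismatch there produces the "Could not find any valid host" error that Mike diagnoses in this thread.]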
-- From zhaozhang at uchicago.edu Wed Apr 29 10:06:20 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 29 Apr 2009 10:06:20 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <49EE02DC.5060405@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> <49F867E8.9050808@uchicago.edu> Message-ID: <49F86CEC.50507@uchicago.edu> Hi, Ben I modified the existing swift-osg-ress-site-catalog, and generate a sample sites.xml at CI network /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/condor-g/sample-sites.xml Could you help me check if there is any known error in there? thanks. zhao Ben Clifford wrote: > On Wed, 29 Apr 2009, Zhao Zhang wrote: > > >> I am figuring out how to convert from VO Resouce Selector web page to valid >> sites.xml. >> > > There is already a tool, swift-osg-ress-site-catalog to take OSG site > information and make it into a sites.xml. That is a much better place for > you to start at, I think. Look in the bin/ directory of a swift > distribution. That already collects of the information you need. > > >> Also, as I remembered, condor should use GT4, right? But we specify gt2 in the >> above sites.xml, is this correct? >> > > No, you should use gt2 (that means gram2). 
> > From benc at hawaga.org.uk Wed Apr 29 10:08:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 29 Apr 2009 15:08:26 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49F86CEC.50507@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> <49F867E8.9050808@uchicago.edu> <49F86CEC.50507@uchicago.edu> Message-ID: On Wed, 29 Apr 2009, Zhao Zhang wrote: > I modified the existing swift-osg-ress-site-catalog, and generate a sample > sites.xml at CI network > /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/condor-g/sample-sites.xml > > Could you help me check if there is any known error in there? thanks. I looked briefly and it seems ok -- From zhaozhang at uchicago.edu Wed Apr 29 10:35:07 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 29 Apr 2009 10:35:07 -0500 Subject: [Swift-devel] feature request In-Reply-To: References: <49E62C55.5080107@uchicago.edu> <49EE0883.4070908@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> <49F867E8.9050808@uchicago.edu> <49F86CEC.50507@uchicago.edu> Message-ID: <49F873AB.4080409@uchicago.edu> Hi, again. Now I have two sites.xml for renci-engagement. The first one is working, while the second is not. 
The second sites.xml is generated by the script. Since the only difference between the first and the second is the <workdirectory> value, I am assuming the error is caused by the <workdirectory>. If this is true, how could I find valid <workdirectory> values for all sites on OSG?

zhao

First:
/nfs/home/osgedu/benc
grid
gt2
belhaven-1.renci.org/jobmanager-fork

Second:
/nfs/osg-data/engage/tmp/RENCI-Engagement
grid
gt2
belhaven-1.renci.org/jobmanager-fork

The error message is:
[zzhang at tp-grid1 sites]$ ./run-site condor-g/RENCI-Engagement.xml
testing site configuration: condor-g/RENCI-Engagement.xml
Removing files from previous runs
Running test 061-cattwo at Wed Apr 29 10:21:31 CDT 2009
Swift 0.9rc2 swift-r2860 cog-r2388

RunID: 20090429-1021-8ayf7v75
Progress:
Execution failed:
Could not find any valid host for task "Task(type=UNKNOWN, identity=urn:cog-1241018496704)" with constraints {filenames=[Ljava.lang.String;@7f38f3d1, trfqn=cat, filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 740f5f97, tr=cat}
SWIFT RETURN CODE NON-ZERO - test 061-cattwo

Ben Clifford wrote:
> On Wed, 29 Apr 2009, Zhao Zhang wrote:
>
>> I modified the existing swift-osg-ress-site-catalog, and generated a sample
>> sites.xml at CI network
>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/condor-g/sample-sites.xml
>>
>> Could you help me check if there is any known error in there? thanks.
>> I looked briefly and it seems ok

From wilde at mcs.anl.gov Wed Apr 29 10:52:49 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 29 Apr 2009 10:52:49 -0500
Subject: [Swift-devel] feature request
In-Reply-To: <49F873AB.4080409@uchicago.edu>
References: <49E62C55.5080107@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> <49F867E8.9050808@uchicago.edu> <49F86CEC.50507@uchicago.edu> <49F873AB.4080409@uchicago.edu>
Message-ID: <49F877D1.2050609@mcs.anl.gov>

The message:

Could not find any valid host for task "Task(type=UNKNOWN, identity=urn:cog-1241018496704)" with constraints {filenames=[Ljava.lang.String;@7f38f3d1, trfqn=cat, filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 740f5f97, tr=cat}

means there was no tc.data entry for cat on any site.

So I suspect this is due to your pool handles not matching your tc.data site names.

On 4/29/09 10:35 AM, Zhao Zhang wrote:
> Hi, again.
>
> Now I have two sites.xml for renci-engagement. The first one is working,
> while the second is not. The second sites.xml is generated by the script.
> Since the only difference between the first and the second is the
> <workdirectory> value, I am assuming the error is caused by the
> <workdirectory>.
> If this is true, how could I find valid <workdirectory> values for all
> sites on OSG?
> zhao
>
> [First/Second sites.xml listings and the "Could not find any valid host" error message, quoted verbatim from Zhao's message above; trimmed]
>
> Ben Clifford wrote:
>> On Wed, 29 Apr 2009, Zhao Zhang wrote:
>>
>>> I modified the existing swift-osg-ress-site-catalog, and generated a sample
>>> sites.xml at CI network
>>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/condor-g/sample-sites.xml
>>>
>>> Could you help me check if there is any known error in there? thanks.
>>> >> >> I looked briefly and it seems ok >> >> From zhaozhang at uchicago.edu Wed Apr 29 11:05:20 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 29 Apr 2009 11:05:20 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F877D1.2050609@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> <49F867E8.9050808@uchicago.edu> <49F86CEC.50507@uchicago.edu> <49F873AB.4080409@uchicago.edu> <49F877D1.2050609@mcs.anl.gov> Message-ID: <49F87AC0.4030207@uchicago.edu> Ha, I see. Thanks. Here comes another issue: [zzhang at tp-grid1 sites]$ ./run-site condor-g/test-RENCI-Engagement.xml testing site configuration: condor-g/test-RENCI-Engagement.xml Removing files from previous runs Running test 061-cattwo at Wed Apr 29 11:02:05 CDT 2009 Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090429-1102-jsx6fqzc Progress: Progress: Initializing site shared directory:1 Execution failed: Could not initialize shared directory on localhost Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Cannot create directory /nfs/osg-data/engage/tmp/RENCI-Engagement/061-cattwo-20090429-1102-jsx6fqzc Caused by: Server refused performing the request. Custom message: Server refused creating directory (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed : globus_gridftp_server_file.c:globus_l_gfs_file_mkdir:554: 500-System error in mkdir: Permission denied 500-A system call failed: Permission denied 500 End.] SWIFT RETURN CODE NON-ZERO - test 061-cattwo So the default work directory doesn't have write access. Is there a pattern on OSG by which I can find out where I am allowed to write? 
zhao Michael Wilde wrote: > The message: > > Could not find any valid host for task "Task(type=UNKNOWN, > identity=urn:cog-1241018496704)" with constraints > {filenames=[Ljava.lang.String;@7f38f3d1, trfqn=cat, > filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 740f5f97, > tr=cat} > > means there was no tc.data entry for cat on any site. > > So I suspect this is due to your pool handles not natching your > tc.data site names. > > On 4/29/09 10:35 AM, Zhao Zhang wrote: >> Hi, again. >> >> Now I have two sites.xml for renci-engagement. The first one is >> working, while the second is not. The second sites.xml is generated >> by the script. >> Since the only difference between the first and the second is the >> value, I am assuming the error is caused by the >> . >> If this is true, how could I find all valid for all >> sites on OSG? >> >> zhao >> >> First: >> >> >> >> >> /nfs/home/osgedu/benc >> grid >> gt2 >> belhaven-1.renci.org/jobmanager-fork >> >> >> >> Second: >> >> >> >> >> >> > >/nfs/osg-data/engage/tmp/RENCI-Engagement >> grid >> gt2 >> belhaven-1.renci.org/jobmanager-fork >> >> >> >> The error message is: >> [zzhang at tp-grid1 sites]$ ./run-site condor-g/RENCI-Engagement.xml >> testing site configuration: condor-g/RENCI-Engagement.xml >> Removing files from previous runs >> Running test 061-cattwo at Wed Apr 29 10:21:31 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090429-1021-8ayf7v75 >> Progress: >> Execution failed: >> Could not find any valid host for task "Task(type=UNKNOWN, >> identity=urn:cog-1241018496704)" with constraints >> {filenames=[Ljava.lang.String;@7f38f3d1, trfqn=cat, >> filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 740f5f97, >> tr=cat} >> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >> >> >> Ben Clifford wrote: >>> On Wed, 29 Apr 2009, Zhao Zhang wrote: >>> >>> >>>> I modified the existing swift-osg-ress-site-catalog, and generate a >>>> sample >>>> sites.xml at CI network >>>> 
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/condor-g/sample-sites.xml >>>> >>>> Could you help me check if there is any known error in there? thanks. >>>> >>> >>> I looked briefly and it seems ok >>> >>> > From wilde at mcs.anl.gov Wed Apr 29 11:05:21 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 29 Apr 2009 11:05:21 -0500 Subject: [Swift-devel] feature request In-Reply-To: <49F877D1.2050609@mcs.anl.gov> References: <49E62C55.5080107@uchicago.edu> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> <49F867E8.9050808@uchicago.edu> <49F86CEC.50507@uchicago.edu> <49F873AB.4080409@uchicago.edu> <49F877D1.2050609@mcs.anl.gov> Message-ID: <49F87AC1.1080105@mcs.anl.gov> Separately, to test on both OSG and TG, you need to have a good understanding of where data and apps live, and how various directories are intended to be used. OSG has a set of generic storage locations (tmp, app, data, etc), and TG has basically home and scratch for each user, with some provision for group directories. We should document the relationships between what the grids specify, how swift users should use the various dirs, and how the tests use those dirs (which should test what we tell the users to do). The site tools and the tests need to be (and I think are) already cognizant of this. But there are a few variations possible, like putting apps under $APP vs $DATA, and how writability is based on your VO, and how and when subdirectories should be used. We should locate the relevant Grid doc pages on these topics, and link to them from the "local details" section of the Swift user guide. 
This isn't very complicated once the following are documented: - each grid's conventions for directories - permissions and multi-user issues - transience and space management issues - group sharing of various directories In the current simple Swift model where permanent data lives on the submit host or a specified server and the workdirectory is always transient, this is not much of an issue. In the future, Swift might support the caching of files on sites between workflows, and then we'll need a somewhat more complex model. What I'm saying here is perhaps not precise enough, but I believe it's important enough, and confusing enough to most users, that we should document it clearly and provide pointers and examples. - Mike On 4/29/09 10:52 AM, Michael Wilde wrote: > The message: > > Could not find any valid host for task "Task(type=UNKNOWN, > identity=urn:cog-1241018496704)" with constraints > {filenames=[Ljava.lang.String;@7f38f3d1, trfqn=cat, > filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 740f5f97, > tr=cat} > > means there was no tc.data entry for cat on any site. > > So I suspect this is due to your pool handles not matching your tc.data > site names. > > On 4/29/09 10:35 AM, Zhao Zhang wrote: >> Hi, again. >> >> Now I have two sites.xml for renci-engagement. The first one is >> working, while the second is not. The second sites.xml is generated by >> the script. >> Since the only difference between the first and the second is the >> value, I am assuming the error is caused by the >> . >> If this is true, how could I find all valid for all >> sites on OSG? 
>> >> zhao >> >> First: >> >> >> >> >> /nfs/home/osgedu/benc >> grid >> gt2 >> belhaven-1.renci.org/jobmanager-fork >> >> >> >> Second: >> >> >> >> >> >> > >/nfs/osg-data/engage/tmp/RENCI-Engagement >> grid >> gt2 >> belhaven-1.renci.org/jobmanager-fork >> >> >> >> The error message is: >> [zzhang at tp-grid1 sites]$ ./run-site condor-g/RENCI-Engagement.xml >> testing site configuration: condor-g/RENCI-Engagement.xml >> Removing files from previous runs >> Running test 061-cattwo at Wed Apr 29 10:21:31 CDT 2009 >> Swift 0.9rc2 swift-r2860 cog-r2388 >> >> RunID: 20090429-1021-8ayf7v75 >> Progress: >> Execution failed: >> Could not find any valid host for task "Task(type=UNKNOWN, >> identity=urn:cog-1241018496704)" with constraints >> {filenames=[Ljava.lang.String;@7f38f3d1, trfqn=cat, >> filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 740f5f97, >> tr=cat} >> SWIFT RETURN CODE NON-ZERO - test 061-cattwo >> >> >> Ben Clifford wrote: >>> On Wed, 29 Apr 2009, Zhao Zhang wrote: >>> >>> >>>> I modified the existing swift-osg-ress-site-catalog, and generate a >>>> sample >>>> sites.xml at CI network >>>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/condor-g/sample-sites.xml >>>> >>>> >>>> Could you help me check if there is any known error in there? thanks. 
>>>> >>> >>> I looked briefly and it seems ok >>> >>> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From rynge at renci.org Wed Apr 29 11:12:14 2009 From: rynge at renci.org (Mats Rynge) Date: Wed, 29 Apr 2009 12:12:14 -0400 Subject: [Swift-devel] feature request In-Reply-To: <49F873AB.4080409@uchicago.edu> References: <49E62C55.5080107@uchicago.edu> <50b07b4b0904211154q7a4f3b67t40027f7f8cfe477b@mail.gmail.com> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> <49F867E8.9050808@uchicago.edu> <49F86CEC.50507@uchicago.edu> <49F873AB.4080409@uchicago.edu> Message-ID: <49F87C5E.9030703@renci.org> Zhao Zhang wrote: > Hi, again. > > Now I have two sites.xml for renci-engagement. The first one is working, > while the second is not. The second sites.xml is generated by the script. > Since the only difference between the first and the second is the > value, I am assuming the error is caused by the > . > If this is true, how could I find all valid for all > sites on OSG? The work directory will depend on what VO you are a member of. If you are using your Engagement certificate, the second config should work. Also note that using $HOME (like in the first example) is not a good idea. Most sites on OSG, RENCI-Engagement included, have quotas for $HOME. So, a workflow using more than 5GB of disk with the first config will fail. 
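[Editorial note on the configs quoted in this thread: the archive stripped the XML tags from the sites.xml fragments, leaving only the element values. The following is a hypothetical reconstruction of what the second (Engage-VO) entry likely looked like — the element names are assumptions based on common Swift sites.xml conventions of the time, not text recovered from the message:

<pool handle="RENCI-Engagement">
  <execution provider="gt2" url="belhaven-1.renci.org/jobmanager-fork"/>
  <!-- VO-writable scratch area; $HOME is quota-limited on most OSG sites -->
  <workdirectory>/nfs/osg-data/engage/tmp/RENCI-Engagement</workdirectory>
</pool>

The first, failing-for-large-runs config differed only in pointing workdirectory at /nfs/home/osgedu/benc.]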
> First: > > > > > /nfs/home/osgedu/benc > grid > gt2 > belhaven-1.renci.org/jobmanager-fork > > > > Second: > > > > > > >/nfs/osg-data/engage/tmp/RENCI-Engagement > grid > gt2 > belhaven-1.renci.org/jobmanager-fork > > > > The error message is: > [zzhang at tp-grid1 sites]$ ./run-site condor-g/RENCI-Engagement.xml > testing site configuration: condor-g/RENCI-Engagement.xml > Removing files from previous runs > Running test 061-cattwo at Wed Apr 29 10:21:31 CDT 2009 > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090429-1021-8ayf7v75 > Progress: > Execution failed: > Could not find any valid host for task "Task(type=UNKNOWN, > identity=urn:cog-1241018496704)" with constraints > {filenames=[Ljava.lang.String;@7f38f3d1, trfqn=cat, > filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 740f5f97, > tr=cat} > SWIFT RETURN CODE NON-ZERO - test 061-cattwo > > > Ben Clifford wrote: >> On Wed, 29 Apr 2009, Zhao Zhang wrote: >> >> >>> I modified the existing swift-osg-ress-site-catalog, and generate a >>> sample >>> sites.xml at CI network >>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/condor-g/sample-sites.xml >>> >>> >>> Could you help me check if there is any known error in there? thanks. >>> >> >> I looked briefly and it seems ok >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Mats Rynge Renaissance Computing Institute From wilde at mcs.anl.gov Wed Apr 29 11:10:14 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 29 Apr 2009 11:10:14 -0500 Subject: [Swift-devel] Tool for TeraGrid site info Message-ID: <49F87BE6.8020603@mcs.anl.gov> Glen, where is the latest version of your TG sites info tool? Ben, what should we do to start testing that, making it uniform with the osg tool, and make it available for comments and heading towards inclusion in Swift? 
(I think this was commented on earlier - I will track down when I get back to the planning list in a few days from now) From benc at hawaga.org.uk Wed Apr 29 11:46:06 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 29 Apr 2009 16:46:06 +0000 (GMT) Subject: [Swift-devel] Tool for TeraGrid site info In-Reply-To: <49F87BE6.8020603@mcs.anl.gov> References: <49F87BE6.8020603@mcs.anl.gov> Message-ID: On Wed, 29 Apr 2009, Michael Wilde wrote: > Ben, what should we do to start testing that, making it uniform with the osg > tool, and make it available for comments and heading towards inclusion in > Swift? Put it online somewhere and post the link here. For making it uniform, study the user interface of the osg one and remove unnecessary differences. -- From benc at hawaga.org.uk Wed Apr 29 11:50:53 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 29 Apr 2009 16:50:53 +0000 (GMT) Subject: [Swift-devel] feature request In-Reply-To: <49F87C5E.9030703@renci.org> References: <49E62C55.5080107@uchicago.edu> <1240341050.23910.3.camel@localhost> <49F07987.8050908@uchicago.edu> <49F094CA.6070704@uchicago.edu> <49F0A34C.6040306@uchicago.edu> <49F1FC42.1090008@uchicago.edu> <49F21369.1040307@mcs.anl.gov> <49F76068.4020600@uchicago.edu> <49F76C37.90808@uchicago.edu> <49F77171.9060307@uchicago.edu> <49F77C47.7030106@uchicago.edu> <49F867E8.9050808@uchicago.edu> <49F86CEC.50507@uchicago.edu> <49F873AB.4080409@uchicago.edu> <49F87C5E.9030703@renci.org> Message-ID: On Wed, 29 Apr 2009, Mats Rynge wrote: > The work directory will depend on what VO you are a member of. If you are > using your Engagement certificate, the second config should work. 
zhao's credential is in osgedu, I think, in which case he should use the --vo=osgedu parameter to swift-osg-ress-site-catalog -- From aespinosa at cs.uchicago.edu Wed Apr 29 13:03:18 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 29 Apr 2009 13:03:18 -0500 Subject: [Swift-devel] Fwd: voms-proxy-init not generating proxy files Message-ID: <20090429180318.GA5422@origin> Hi, Does anyone have tips on using voms-proxy-init to associate your proxy with certain VOs? I have been filing support requests to support at CI and support at osg for a while but I get minimal response. -Allan ----- Forwarded message from Allan Espinosa ----- Date: Tue, 21 Apr 2009 17:15:23 -0500 From: Allan Espinosa To: CI Support Subject: voms-proxy-init not generating proxy files hi, it seems that there is something wrong with the voms tools in the @osg stack: [aespinosa at tp-login1 ~]$ voms-proxy-init -voms Engage [aespinosa at tp-login1 ~]$ grid-proxy-info ERROR: Couldn't find a valid proxy. Use -debug for further information. [aespinosa at tp-login1 ~]$ voms-proxy-init -debug -voms Engage Detected Globus version: 22 Unspecified proxy version, settling on Globus version: 2 Number of bits in key :512 Using configuration file /soft/osg-client-1.0.0-r1/glite/etc/vomses Using configuration file /soft/osg-client-1.0.0-r1/glite/etc/vomses [aespinosa at tp-login1 ~]$ voms-proxy-info Couldn't find a valid proxy. below is my ~/.soft +swig +apache-ant +maui +torque @osg +mpich-p4 +fftw @default Thanks -Allan -- Allan M. 
Espinosa PhD student, Computer Science University of Chicago ----- End forwarded message ----- From benc at hawaga.org.uk Wed Apr 29 16:43:31 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 29 Apr 2009 21:43:31 +0000 (GMT) Subject: [Swift-devel] Fwd: voms-proxy-init not generating proxy files In-Reply-To: <20090429180318.GA5422@origin> References: <20090429180318.GA5422@origin> Message-ID: On Wed, 29 Apr 2009, Allan Espinosa wrote: > Anyone has tips using voms-proxy-init to associate your proxy with > certain VOs? I have been filing support requests to support at CI and > support at osg for a while but i get minimal response. > > [aespinosa at tp-login1 ~]$ voms-proxy-init -voms Engage You should be seeing more there - you should be asked for your proxy password at this stage, like with grid-proxy-init. No idea why it isn't working on teraport though. Many sites won't pay attention to the extra VOMS information in a credential though; (although maybe some require it too.. hurrah for diversity) (here's sample output I get from a gLite machine: $ voms-proxy-init -key ./doecred.pem -cert ./doecred.pem -voms gilda Cannot find file or dir: /home/benc/.glite/vomses Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Benjamin Clifford 418168 Creating temporary proxy .............................................................................................. Done Contacting voms.ct.infn.it:15001 [/C=IT/O=INFN/OU=Host/L=Catania/CN=voms.ct.infn.it] "gilda" Done Creating proxy ................................................ Done Your proxy is valid until Thu Apr 30 11:43:00 2009 ) -- From hockyg at uchicago.edu Wed Apr 29 16:44:54 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Wed, 29 Apr 2009 16:44:54 -0500 Subject: [Swift-devel] nearly, but not quite optimal performance on ranger from ranger under coasters Message-ID: <49F8CA56.2040904@uchicago.edu> Hi Everyone, Today I tried to run jobs of length ~45 minutes. 
I ran on Ranger from Ranger with 50 5 This submitted 51 jobs to the queue, of which 38 went to the run state, meaning there should be 608 available coasters. I had > Progress: Selecting site:2569 Active:431 Finished successfully:9 For ~30 minutes until some jobs started finishing. Then I had a steady decline of activity, see below. I assume this is happening because the extra 177 coasters that were available timed out during these 45 minutes and new ones did not become available, but that could be wrong. How should I tune my settings to fix this? p.s. logs in CI-HOME: /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090429-1519-03iyqi74.log > Progress: Selecting site:2569 Active:431 Finished successfully:9 > Progress: Selecting site:2569 Active:431 Finished successfully:9 > Progress: Selecting site:2569 Active:431 Finished successfully:9 > Progress: Selecting site:2569 Active:431 Finished successfully:9 > Progress: Selecting site:2569 Active:431 Finished successfully:9 > Progress: Selecting site:2569 Active:431 Finished successfully:9 > Progress: Selecting site:2569 Active:430 Checking status:1 > Finished successfully:9 > Progress: Selecting site:2569 Active:430 Finished successfully:10 > Progress: Selecting site:2569 Active:430 Finished successfully:10 > Progress: Selecting site:2569 Active:430 Finished successfully:10 > Progress: Selecting site:2569 Active:429 Checking status:1 > Finished successfully:10 > Progress: Selecting site:2569 Active:429 Finished successfully:11 > Progress: Selecting site:2569 Active:429 Finished successfully:11 ... 
> Progress: Selecting site:2516 Stage in:53 Active:196 Checking > status:3 Stage out:194 Finished successfully:47 > Progress: Selecting site:2516 Stage in:53 Active:196 Checking > status:3 Stage out:193 Finished successfully:48 > Progress: Selecting site:2516 Stage in:53 Active:196 Checking > status:3 Stage out:192 Finished successfully:49 > Progress: Selecting site:2516 Stage in:53 Active:196 Checking > status:3 Stage out:192 Finished successfully:49 > Progress: Selecting site:2516 Stage in:53 Active:195 Checking > status:4 Stage out:192 Finished successfully:49 > Progress: Selecting site:2516 Stage in:53 Active:166 Checking > status:33 Stage out:190 Finished successfully:51 > Progress: Selecting site:2516 Stage in:53 Active:166 Checking > status:33 Stage out:189 Finished successfully:52 > Progress: Selecting site:2516 Stage in:53 Active:165 Checking > status:34 Stage out:189 Finished successfully:52 > Progress: Selecting site:2516 Stage in:53 Active:155 Checking > status:44 Stage out:188 Finished successfully:53 > Progress: Selecting site:2516 Stage in:53 Active:152 Checking > status:47 Stage out:187 Finished successfully:54 > Progress: Selecting site:2516 Stage in:53 Active:151 Checking > status:48 Stage out:186 Finished successfully:55 > Progress: Selecting site:2516 Stage in:53 Active:99 Checking > status:100 Stage out:184 Finished successfully:57 > Progress: Selecting site:2516 Stage in:53 Active:99 Checking > status:100 Stage out:184 Finished successfully:57 > Progress: Selecting site:2516 Stage in:53 Active:99 Checking > status:100 Stage out:183 Finished successfully:58 > Progress: Selecting site:2510 Stage in:59 Active:90 Checking > status:109 Stage out:182 Finished successfully:59 > Progress: Selecting site:2509 Stage in:60 Active:90 Checking > status:109 Stage out:181 Finished successfully:60 > Progress: Selecting site:2509 Stage in:60 Active:90 Checking > status:109 Stage out:179 Finished successfully:62 > Progress: Selecting site:2509 Stage in:60 
Active:90 Checking > status:109 Stage out:178 Finished successfully:63 > Progress: Selecting site:2509 Stage in:60 Active:90 Checking > status:109 Stage out:177 Finished successfully:64 > Progress: Selecting site:2509 Stage in:60 Active:90 Checking > status:109 Stage out:176 Finished successfully:65 > Progress: Selecting site:2508 Stage in:61 Active:90 Checking > status:109 Stage out:176 Finished successfully:65 > Progress: Selecting site:2508 Stage in:61 Active:90 Checking > status:109 Stage out:174 Finished successfully:67 > Progress: Selecting site:2508 Stage in:61 Active:90 Checking > status:109 Stage out:174 Finished successfully:67 > Progress: Selecting site:2508 Stage in:61 Active:90 Checking > status:109 Stage out:173 Finished successfully:68 > Progress: Selecting site:2508 Stage in:61 Active:90 Checking > status:109 Stage out:172 Finished successfully:69 > Progress: Selecting site:2508 Stage in:61 Active:90 Checking > status:109 Stage out:171 Finished successfully:70 > Progress: Selecting site:2508 Stage in:61 Active:90 Checking > status:109 Stage out:170 Finished successfully:71 > Progress: Selecting site:2503 Stage in:66 Active:40 Checking > status:159 Stage out:169 Finished successfully:72 > Progress: Selecting site:2503 Stage in:66 Active:40 Checking > status:159 Stage out:168 Finished successfully:73 > Progress: Selecting site:2503 Stage in:66 Active:40 Checking > status:159 Stage out:167 Finished successfully:74 From hategan at mcs.anl.gov Wed Apr 29 17:00:00 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 29 Apr 2009 17:00:00 -0500 Subject: [Swift-devel] nearly, but not quite optimal performance on ranger from ranger under coasters In-Reply-To: <49F8CA56.2040904@uchicago.edu> References: <49F8CA56.2040904@uchicago.edu> Message-ID: <1241042400.19073.3.camel@localhost> On Wed, 2009-04-29 at 16:44 -0500, Glen Hocky wrote: > Hi Everyone, > Today I tried to run jobs of length ~45 minutes. 
I ran on Ranger from > Ranger with > 50 > 5 > > This submitted 51 jobs to the queue, of which 38 went to the run state, > meaning there should be 608 available coasters. I had > > Progress: Selecting site:2569 Active:431 Finished successfully:9 > For ~30 minutes until some jobs started finishing. Then I had a stead > decline of activity, see below. > I assume this is happening because the extra 177 coasters that were > available timed out durning this 45 minutes and new ones did not become > available, but that could be wrong. > > > How should I tune my settings to fix this? There is no tuning to fix this. In the current scheme, if coastersPerNode > 1, there will always be more workers requested than jobs submitted. If sufficient jobs are not subsequently submitted by swift (and that requires the first round to finish), some workers will die having done nothing. From hategan at mcs.anl.gov Wed Apr 29 17:10:55 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 29 Apr 2009 17:10:55 -0500 Subject: [Swift-devel] nearly, but not quite optimal performance on ranger from ranger under coasters In-Reply-To: <49F8CA56.2040904@uchicago.edu> References: <49F8CA56.2040904@uchicago.edu> Message-ID: <1241043055.20891.3.camel@localhost> On Wed, 2009-04-29 at 16:44 -0500, Glen Hocky wrote: > Hi Everyone, > Today I tried to run jobs of length ~45 minutes. I ran on Ranger from > Ranger with > 50 > 5 > > This submitted 51 jobs to the queue, of which 38 went to the run state, > meaning there should be 608 available coasters. I had > > Progress: Selecting site:2569 Active:431 Finished successfully:9 > For ~30 minutes until some jobs started finishing. Then I had a stead > decline of activity, Your decline of activity is due to stage-ins and stage-outs, and it's not really a decline of activity. I suggest you let it run instead of interrupting it when you feel it's not working right. 
You should probably interrupt it when you see no active jobs in the queue and if no information is appended to the swift log in more than, say, 10 minutes. From hockyg at uchicago.edu Wed Apr 29 19:15:16 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Wed, 29 Apr 2009 19:15:16 -0500 Subject: [Swift-devel] nearly, but not quite optimal performance on ranger from ranger under coasters In-Reply-To: <1241042400.19073.3.camel@localhost> References: <49F8CA56.2040904@uchicago.edu> <1241042400.19073.3.camel@localhost> Message-ID: <49F8ED94.6010102@uchicago.edu> In the rare case where more coasters become active than you expected, shouldn't you fill them up from the submitted queue? or at least have that as an option? Mihael Hategan wrote: > On Wed, 2009-04-29 at 16:44 -0500, Glen Hocky wrote: > >> Hi Everyone, >> Today I tried to run jobs of length ~45 minutes. I ran on Ranger from >> Ranger with >> 50 >> 5 >> >> This submitted 51 jobs to the queue, of which 38 went to the run state, >> meaning there should be 608 available coasters. I had >> >>> Progress: Selecting site:2569 Active:431 Finished successfully:9 >>> >> For ~30 minutes until some jobs started finishing. Then I had a stead >> decline of activity, see below. >> I assume this is happening because the extra 177 coasters that were >> available timed out durning this 45 minutes and new ones did not become >> available, but that could be wrong. >> >> >> How should I tune my settings to fix this? >> > > There is no tuning to fix this. In the current scheme, if > coastersPerNode > 1, there will always be more workers requested than > jobs submitted. If sufficient jobs are not subsequently submitted by > swift (and that requires the first round to finish), some workers will > die having done nothing. 
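[Editorial note: the over-allocation arithmetic in this exchange can be sketched numerically. This is a hypothetical back-of-envelope calculation, not code from the thread; coasters_per_node = 16 is an inference from Glen's numbers (608 workers / 38 running jobs), not a setting quoted anywhere in the messages.

```python
# Back-of-envelope sketch of the coaster worker over-allocation
# described above. Numbers come from Glen's report: 38 queue jobs
# running and 608 workers available while only 431 tasks were active.
# coasters_per_node = 16 is an inference (608 / 38), not a quoted
# configuration value.

def idle_workers(jobs_running: int, coasters_per_node: int,
                 active_tasks: int) -> int:
    """Workers started by the queued jobs that have no task to run."""
    total_workers = jobs_running * coasters_per_node
    return total_workers - active_tasks

if __name__ == "__main__":
    # 608 - 431 = 177 workers sitting idle, matching the "extra 177
    # coasters" Glen suspects timed out during the 45-minute jobs.
    print(idle_workers(38, 16, 431))
```

This is the failure mode Mihael describes: whenever coastersPerNode > 1 and the task queue drains, the surplus workers burn wall time doing nothing.]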
> > From hategan at mcs.anl.gov Wed Apr 29 19:28:38 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 29 Apr 2009 19:28:38 -0500 Subject: [Swift-devel] nearly, but not quite optimal performance on ranger from ranger under coasters In-Reply-To: <49F8ED94.6010102@uchicago.edu> References: <49F8CA56.2040904@uchicago.edu> <1241042400.19073.3.camel@localhost> <49F8ED94.6010102@uchicago.edu> Message-ID: <1241051318.23017.4.camel@localhost> On Wed, 2009-04-29 at 19:15 -0500, Glen Hocky wrote: > In the rare case where more coasters become active than you expected, > shouldn't you fill them up from the submitted queue? Not if you've already hit the limit of jobs for that site. > or at least have > that as an option? It doesn't matter. If, when you request N CPUs, you always get k*N CPUs, there are going to be problems. Obviously, this smells, and it should be fixed. From aespinosa at cs.uchicago.edu Wed Apr 29 19:36:27 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 29 Apr 2009 19:36:27 -0500 Subject: [Swift-devel] Fwd: voms-proxy-init not generating proxy files In-Reply-To: <49F8B70C.7080504@renci.org> References: <20090429180318.GA5422@origin> <49F8A024.3060908@renci.org> <20090429193841.GB8646@origin> <49F8B70C.7080504@renci.org> Message-ID: <20090430003627.GA4568@origin> Hi Mats, I got my proxy working. [aespinosa at communicado ~]$ voms-proxy-init -voms osg Wrong permissions on file: /soft/osg-client-1.0.0-r1/glite/etc/vomses Writing permissions are allowed only for the owner Enter GRID pass phrase: Your identity: /DC=org/DC=doegrids/OU=People/CN=Allan M. Espinosa 374652 Creating temporary proxy ..................................... Done Contacting voms.opensciencegrid.org:15027 [/DC=org/DC=doegrids/OU=Services/CN=host/voms.opensciencegrid.org] "osg" Done Creating proxy ..................................... Done Your proxy is valid until Thu Apr 30 07:34:43 2009 Thanks a lot! 
-Allan On Wed, Apr 29, 2009 at 04:22:36PM -0400, Mats Rynge wrote: > Allan Espinosa wrote: > >> Here's my X509 environment >> [aespinosa at communicado ~]$ env | grep X509 >> X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA >> X509_CADIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA >> X509_VOMS_DIR=/soft/osg-client-1.0.0-r1/glite/vomsdir > > > I have found the problem, but it basically comes down to some poor > assumptions made by the voms developers. For some reason voms-proxy-init > thinks that the permissions on > /soft/osg-client-1.0.0-r1/glite/etc/vomses are too open. The fix below > should work for you: > > > $ cp /soft/osg-client-1.0.0-r1/glite/etc/vomses ~/.globus/ > $ export VOMS_USERCONF=~/.globus/vomses > $ voms-proxy-init -voms Engage > > > Please notify the admins of the system so we can have this fixed for > other users. > > -- > Mats Rynge > Renaissance Computing Institute > From hategan at mcs.anl.gov Wed Apr 29 19:55:44 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 29 Apr 2009 19:55:44 -0500 Subject: [Swift-devel] nearly, but not quite optimal performance on ranger from ranger under coasters In-Reply-To: <49F8F364.30007@uchicago.edu> References: <49F8CA56.2040904@uchicago.edu> <1241043055.20891.3.camel@localhost> <171BD830-6A23-4E92-A37D-3760571C800E@uchicago.edu> <1241045994.21707.5.camel@localhost> <49F8F364.30007@uchicago.edu> Message-ID: <1241052944.23434.4.camel@localhost> On Wed, 2009-04-29 at 19:40 -0500, Glen Hocky wrote: > Tried to rerun (identical everything) but am now getting the errors in > this logfile > > > /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090429-1920-q6r3oph1.log > > Errors are: org.globus.gram.GramException: The job manager detected an invalid script response I'm getting the same when I do a simple job submission. Apparently gram on ranger is broken again. 
I've said this before, and I'll say it again: Unless TG gets its act straightened out about the reliability of gram installations (and it hasn't for about 5 years now), running on a single site is going to be a problem. The only reasonable solution I can think of is running on multiple sites, and letting the swift fault-management do its job. From wilde at mcs.anl.gov Wed Apr 29 23:10:23 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 29 Apr 2009 23:10:23 -0500 Subject: [Swift-devel] nearly, but not quite optimal performance on ranger from ranger under coasters In-Reply-To: <1241052944.23434.4.camel@localhost> References: <49F8CA56.2040904@uchicago.edu> <1241043055.20891.3.camel@localhost> <171BD830-6A23-4E92-A37D-3760571C800E@uchicago.edu> <1241045994.21707.5.camel@localhost> <49F8F364.30007@uchicago.edu> <1241052944.23434.4.camel@localhost> Message-ID: <49F924AF.20801@mcs.anl.gov> On 4/29/09 7:55 PM, Mihael Hategan wrote: > On Wed, 2009-04-29 at 19:40 -0500, Glen Hocky wrote: >> Tried to rerun (identical everything) but am now getting the errors in >> this logfile >> >>> /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090429-1920-q6r3oph1.log >> Errors are: > > org.globus.gram.GramException: The job manager detected an invalid > script response I vaguely recall this error being caused when the job time exceeded the limits of the queue time, or some similar error. Look to see if you got PBS errors in your home dir, or clues in the gram logs on the target site. One case was that the calculated time needed for the coaster jobs exceeded the queue limit. Of course, it could be something entirely different. > > I'm getting the same when I do a simple job submission. Apparently gram > on ranger is broken again. Did you do this from communicado? I was getting gram err 74 there, but gram 12's everywhere else. When I added an osg client into my .soft the errors went away. If this was communicado, there is something else going on there I think. 
It seems that the globus 4 package may have been damaged in some way. > I've said this before, and I'll say it again: > > Unless TG gets its act straightened out about the reliability of gram > installations (and it hasn't for about 5 years now), running on a single > site is going to be a problem. The only reasonable solution I can think > of is running on multiple sites, and letting the swift fault-management > do its job. I agree that this is what we want to have working well as the default way of using Swift. That's what we were trying a bit ago with ranger, abe, and queenbee, and the intent to expand the list. We've got to get some expiring & ending accounts renewed, which is going to slow us down a bit, but we'll get back to testing in that mode. > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From foster at anl.gov Wed Apr 29 23:30:16 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 29 Apr 2009 23:30:16 -0500 Subject: [Swift-devel] Google Summer of Code Student In-Reply-To: References: Message-ID: Jon: I think it would be very helpful to have a more formal specification of the current scheduler. I don't know if that is in scope. Ian. On Apr 23, 2009, at 3:24 PM, Jon Roelofs wrote: > Hello, > > This is Jon Roelofs. My proposal: "Scheduling Algorithms for Swift" > was accepted for Google Summer of Code 2009, so this summer I'll be > working on developing a better scheduler for the Swift toolkit. > You can see more about what I plan to do here: http://socghop.appspot.com/student_project/show/google/gsoc2009/globus/t124022382550 > and here: http://www.cs.colostate.edu/~roelofs/gsoc_09.php > > I look forward to working with you all. 
> > > Regards, > > Jon > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Apr 29 23:39:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 29 Apr 2009 23:39:23 -0500 Subject: [Swift-devel] nearly, but not quite optimal performance on ranger from ranger under coasters In-Reply-To: <49F924AF.20801@mcs.anl.gov> References: <49F8CA56.2040904@uchicago.edu> <1241043055.20891.3.camel@localhost> <171BD830-6A23-4E92-A37D-3760571C800E@uchicago.edu> <1241045994.21707.5.camel@localhost> <49F8F364.30007@uchicago.edu> <1241052944.23434.4.camel@localhost> <49F924AF.20801@mcs.anl.gov> Message-ID: <1241066363.26872.4.camel@localhost> On Wed, 2009-04-29 at 23:10 -0500, Michael Wilde wrote: > > On 4/29/09 7:55 PM, Mihael Hategan wrote: > > On Wed, 2009-04-29 at 19:40 -0500, Glen Hocky wrote: > >> Tried to rerun (identical everything) but am now getting the errors in > >> this logfile > >> > >>> /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090429-1920-q6r3oph1.log > >> Errors are: > > > > org.globus.gram.GramException: The job manager detected an invalid > > script response > > I vaguely recall this error being causes when the job time exceeded the > limits of the queue time, or some similar error. > > Look to see if you got PBS errors in your home dir, or clues in the gram > logs on the target site. > > One case was that the calculated time needed for the coaster jobs > exceeded the queue limit. Could be. > > Of course, it could be something entirely different. > > > > I'm getting the same when I do a simple job submission. Apparently gram > > on ranger is broken again. > > Did you do this from communicado? No. I did this from my own machine. Seems like I didn't specify a project, and my default got screwed up. 
So Glen, check if you are using an expired project, or whether you specify a project at all in sites.xml. From hategan at mcs.anl.gov Thu Apr 30 00:07:31 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 30 Apr 2009 00:07:31 -0500 Subject: [Swift-devel] nearly, but not quite optimal performance on ranger from ranger under coasters In-Reply-To: <49F9301C.5000101@uchicago.edu> References: <49F8CA56.2040904@uchicago.edu> <1241043055.20891.3.camel@localhost> <171BD830-6A23-4E92-A37D-3760571C800E@uchicago.edu> <1241045994.21707.5.camel@localhost> <49F8F364.30007@uchicago.edu> <1241052944.23434.4.camel@localhost> <49F924AF.20801@mcs.anl.gov> <1241066363.26872.4.camel@localhost> <49F9301C.5000101@uchicago.edu> Message-ID: <1241068051.27593.0.camel@localhost> On Wed, 2009-04-29 at 23:59 -0500, Glen Hocky wrote: > I definitely have a nonexpired project in my sites file How about what Mike says? > > Mihael Hategan wrote: > > On Wed, 2009-04-29 at 23:10 -0500, Michael Wilde wrote: > > > >> On 4/29/09 7:55 PM, Mihael Hategan wrote: > >> > >>> On Wed, 2009-04-29 at 19:40 -0500, Glen Hocky wrote: > >>> > >>>> Tried to rerun (identical everything) but am now getting the errors in > >>>> this logfile > >>>> > >>>> > >>>>> /home/hockyg/public_html/swift_logs/ranger_from_ranger/oops-20090429-1920-q6r3oph1.log > >>>>> > >>>> Errors are: > >>>> > >>> org.globus.gram.GramException: The job manager detected an invalid > >>> script response > >>> > >> I vaguely recall this error being caused when the job time exceeded the > >> limits of the queue time, or some similar error. > >> > >> Look to see if you got PBS errors in your home dir, or clues in the gram > >> logs on the target site. > >> > >> One case was that the calculated time needed for the coaster jobs > >> exceeded the queue limit. > >> > > > > Could be. > > > > > >> Of course, it could be something entirely different. > >> > >>> I'm getting the same when I do a simple job submission. 
Apparently gram > >>> on ranger is broken again. > >>> > >> Did you do this from communicado? > >> > > > > No. I did this from my own machine. Seems like I didn't specify a > > project, and my default got screwed up. So Glen, check if you are using > > an expired project, or whether you specify a project at all in > > sites.xml. > > > > > > > From benc at hawaga.org.uk Thu Apr 30 00:38:22 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 30 Apr 2009 05:38:22 +0000 (GMT) Subject: [Swift-devel] Google Summer of Code Student In-Reply-To: References: Message-ID: On Wed, 29 Apr 2009, Ian Foster wrote: > I think it would be very helpful to have a more formal specification of > the current scheduler. I don't know if that is in scope. I think the project is more likely to end up with a better description of the end result than what exists now. -- From wilde at mcs.anl.gov Thu Apr 30 08:03:35 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 30 Apr 2009 08:03:35 -0500 Subject: [Swift-devel] Google Summer of Code Student In-Reply-To: References: Message-ID: <49F9A1A7.5030501@mcs.anl.gov> Hi Jon, Welcome to the Swift team! We're all looking forward to working with you this summer. I hope we get a chance to meet you, but regardless we hope to chat a lot with you on the swift lists. I hope it's a Summer of *Enjoyable* Code! :) - Mike On 4/23/09 3:24 PM, Jon Roelofs wrote: > Hello, > > This is Jon Roelofs. My proposal: "Scheduling Algorithms for Swift" was > accepted for Google Summer of Code 2009, so this summer I'll be working > on developing a better scheduler for the Swift toolkit. > You can see more about what I plan to do here: > http://socghop.appspot.com/student_project/show/google/gsoc2009/globus/t124022382550 > and here: http://www.cs.colostate.edu/~roelofs/gsoc_09.php > > I look forward to working with you all. 
> > > Regards, > > Jon > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Thu Apr 30 08:15:06 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 30 Apr 2009 08:15:06 -0500 Subject: [Swift-devel] Google Summer of Code Student In-Reply-To: References: Message-ID: <49F9A45A.1090604@mcs.anl.gov> Xi Li wrote many drafts of notes and memos that captured some of this. The material could use review and comment, and I think we should put this and similar logic/design notes into a directory in the doc subtree. I'm sure that Xi would appreciate being kept in the discussion. I have not checked if she is still on this list or not; she sent in a draft paper when she went back to China, but I personally didn't have time to review and comment. The algorithms and issues involved are complex, so the attached needs review and should be taken as a draft design extraction by a reader of the code, not the developer, although Mihael did help Xi quite a bit in gathering this understanding. - Mike On 4/29/09 11:30 PM, Ian Foster wrote: > Jon: > > I think it would be very helpful to have a more formal specification of > the current scheduler. I don't know if that is in scope. > > Ian. > > > On Apr 23, 2009, at 3:24 PM, Jon Roelofs wrote: > >> Hello, >> >> This is Jon Roelofs. My proposal: "Scheduling Algorithms for Swift" >> was accepted for Google Summer of Code 2009, so this summer I'll be >> working on developing a better scheduler for the Swift toolkit. >> You can see more about what I plan to do here: >> http://socghop.appspot.com/student_project/show/google/gsoc2009/globus/t124022382550 >> and here: http://www.cs.colostate.edu/~roelofs/gsoc_09.php >> >> I look forward to working with you all. 
>> >> Regards, >> Jon >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- A non-text attachment was scrubbed... Name: Site Selection in Swift.pdf Type: application/pdf Size: 53569 bytes Desc: not available URL: From wilde at mcs.anl.gov Thu Apr 30 15:02:50 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 30 Apr 2009 15:02:50 -0500 Subject: [Swift-devel] Re: [Swift-user] Execution error In-Reply-To: References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> Message-ID: <49FA03EA.7080807@mcs.anl.gov> Back on list here (I only went off-list to discuss accounts, etc) The problem in the run below is this: 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with the given max walltime worker constraint (task: 3000, maxwalltime: 2400s) You have this on the ptmap app in your tc.data: globus::maxwalltime=50 But you only gave coasters 40 mins per coaster worker. So it's complaining that it can't run a 50 minute job in a 40 minute (max) coaster worker. ;) I mentioned in a prior mail that you need to set the two time vals in your sites.xml entry; that's what you need to do next. Change the coaster time in your sites.xml to: <profile namespace="globus" key="coasterWorkerMaxwalltime">00:51:00</profile> If you have more info on the variability of your ptmap run times, send that to the list, and we can discuss how to handle. 
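The constraint described here is simple arithmetic: the per-job maxwalltime from tc.data (in minutes) must fit inside the coaster worker's walltime. A sketch of the check, using the numbers from the log excerpt above (the variable names are illustrative, not Swift's internals):

```shell
task_min=50     # globus::maxwalltime in tc.data, in minutes
worker_min=40   # coasterWorkerMaxwalltime 00:40:00 in sites.xml, in minutes

task_s=$((task_min * 60))
worker_s=$((worker_min * 60))

# Swift refuses the job when the task walltime exceeds the worker walltime.
if [ "$task_s" -gt "$worker_s" ]; then
    echo "task ${task_s}s cannot fit in a ${worker_s}s worker"
fi
```

This reproduces the "task: 3000, maxwalltime: 2400s" pair from the log; raising the worker walltime above the 50-minute job time (e.g. to 00:51:00, as suggested) clears the check.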
(NOTE: doing grep -i of the log for "except" or scanning for "except" with an editor will often locate the first "exception" that your job encountered. That's how I found the error above). Also, Yue, for testing new sites, or for validating that old sites still work, you should create the smallest possible ptmap workflow - 1 job if that is possible - and verify that this works. Then say 10 jobs to make sure scheduling etc is sane. Then, send in your huge jobs. With only 1 job, it's easier to spot the errors in the log file. - Mike On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: > Hi Michael, > > I run into the same messages again when I use Ranger: > > Progress: Selecting site:146 Stage in:25 Submitting:15 Submitted:821 > Failed but can retry:16 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger > Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 > Failed but can retry:16 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > Failed to transfer wrapper log from > 
PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > The log for the search is at : > /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log > > The sites.xml I have is: > > > url="gatekeeper.ranger.tacc.teragrid.org" > jobManager="gt2:gt2:SGE"/> > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > TG-CCR080022N > 16 > development > key="coasterWorkerMaxwalltime">00:40:00 > 31 > 50 > 10 > /work/01164/yuechen/swiftwork > > The tc.data I have is: > > ranger PTMap2 > /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > INTEL32::LINUX globus::maxwalltime=50 > > I'm using swift 0.9 rc2 > > Thank you very much for help! > > Chen, Yue > > > > ------------------------------------------------------------------------ > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > *Sent:* Thu 4/30/2009 2:05 PM > *To:* Yue, Chen - BMD > *Subject:* Re: [Swift-user] Execution error > > > > On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: > > Hi Michael, > > > > When I tried to activate my account, I encountered the following error: > > > > "Sorry, this account is in an invalid state. You may not activate your > > at this time." > > > > I used the username and password from TG-CDA070002T. Should I use a > > different password? > > If you can already login to Ranger, then you are all set - you must have > done this previously. > > I thought you had *not*, because when I looked up your login on ranger > ("finger yuechen") it said "never logged in". But seems like that info > is incorrect. > > If you have ptmap compiled, seems like you are almost all set. > > Let me know if it works. > > - Mike > > > Thanks! 
> > > > Chen, Yue > > > > > > ------------------------------------------------------------------------ > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > *Sent:* Thu 4/30/2009 1:07 PM > > *To:* Yue, Chen - BMD > > *Cc:* swift user > > *Subject:* Re: [Swift-user] Execution error > > > > Yue, use this XML pool element to access ranger: > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > jobManager="gt2:gt2:SGE"/> > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > TG-CCR080022N > > 16 > > development > > > key="coasterWorkerMaxwalltime">00:40:00 > > 31 > > 50 > > 10 > > /work/00306/tg455797/swiftwork > > > > > > > > You will need to also do these steps: > > > > Go to this web page to enable your Ranger account: > > > > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx > > > > Then login to Ranger via the TeraGrid portal and put your ssh keys in > > place (assuming you use ssh keys, which you should) > > > > While on Ranger, do this: > > > > echo $WORK > > mkdir $work/swiftwork > > > > and put the full path of your $WORK/swiftwork directory in the > > element above. (My login is tg455etc, yours is yuechen) > > > > Then scp your code to Ranger and compile it. > > > > Then create a tc.data entry for your ptmap app > > > > Next, set your time values in the sites.xml entry above to suitable > > values for Ranger. You'll need to measure times, but I think you will > > find Ranger about twice as fast as Mercury for CPU-bound jobs. > > > > The values above were set for one app job per coaster. I think you can > > probably do more. > > > > If you estimate a run time of 5 minutes, use: > > > > > key="coasterWorkerMaxwalltime">00:30:00 > > 5 > > > > Other people on the list - please sanity check what I suggest here. > > > > - Mike > > > > > > On 4/30/09 12:40 PM, Michael Wilde wrote: > > > I just checked - TG-CDA070002T has indeed expired. 
> > > > > > The best for now is to move to use (only) Ranger, under this account: > > > TG-CCR080022N > > > > > > I will locate and send you a sites.xml entry in a moment. > > > > > > You need to go to a web page to activate your Ranger login. > > > > > > Best to contact me in IM and we can work this out. > > > > > > - Mike > > > > > > > > > > > > On 4/30/09 12:23 PM, Michael Wilde wrote: > > >> Also, what account are you running under? We may need to change > you to > > >> a new account - as the OSG Training account expires today. > > >> If that happend at Noon, it *might* be the problem. > > >> > > >> - Mike > > >> > > >> > > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: > > >>> Hi, > > >>> > > >>> I came back to re-run my application on NCSA Mercury which was > tested > > >>> successfully last week after I just set up coasters with swift 0.9, > > >>> but I got many messages like the following: > > >>> > > >>> Progress: Stage in:219 Submitting:803 Submitted:1 > > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed > but can > > >>> retry:1 > > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed but can > > >>> retry:4 > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY > > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can > retry:8 > > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 > > >>> The log file for the 
successful run last week is ; > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log > > >>> > > >>> The log file for the failed run is : > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log > > >>> > > >>> I don't think I did anything different, so I don't know why this > time > > >>> they failed. The sites.xml for Mercury is: > > >>> > > >>> > > >>> > > >>> > >>> jobManager="gt2:PBS"/> > > >>> /gpfs_scratch1/yuechen/swiftwork > > >>> debug > > >>> > > >>> > > >>> Thank you for help! > > >>> > > >>> Chen, Yue > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> This email is intended only for the use of the individual or entity > > >>> to which it is addressed and may contain information that is > > >>> privileged and confidential. If the reader of this email message is > > >>> not the intended recipient, you are hereby notified that any > > >>> dissemination, distribution, or copying of this communication is > > >>> prohibited. If you have received this email in error, please notify > > >>> the sender and destroy/delete all copies of the transmittal. > Thank you. 
> > >>> > > >>> > > >>> > > ------------------------------------------------------------------------ > > >>> > > >>> _______________________________________________ > > >>> Swift-user mailing list > > >>> Swift-user at ci.uchicago.edu > > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > >> _______________________________________________ > > >> Swift-user mailing list > > >> Swift-user at ci.uchicago.edu > > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > _______________________________________________ > > > Swift-user mailing list > > > Swift-user at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > This email is intended only for the use of the individual or entity to > > which it is addressed and may contain information that is privileged and > > confidential. If the reader of this email message is not the intended > > recipient, you are hereby notified that any dissemination, distribution, > > or copying of this communication is prohibited. If you have received > > this email in error, please notify the sender and destroy/delete all > > copies of the transmittal. Thank you. > > > > > This email is intended only for the use of the individual or entity to > which it is addressed and may contain information that is privileged and > confidential. If the reader of this email message is not the intended > recipient, you are hereby notified that any dissemination, distribution, > or copying of this communication is prohibited. If you have received > this email in error, please notify the sender and destroy/delete all > copies of the transmittal. Thank you. 
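The log-scanning tip earlier in this message (grep case-insensitively for "except" to find the first exception) can be exercised against a toy log; both log lines below are fabricated for illustration, with only the APPLICATION_EXCEPTION tag taken from the real excerpt above:

```shell
# Build a two-line sample log (fabricated content).
cat > sample.log <<'EOF'
2009-04-30 14:29:40,100-0500 DEBUG vdl:execute2 JOB_START jobid=PTMap2-abeii5aj
2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=PTMap2-abeii5aj
EOF

# Case-insensitive search; -n prefixes the matching line number, and
# head limits output to the first exception encountered.
grep -in "except" sample.log | head -n 1
```

On a real multi-megabyte Swift run log the same one-liner jumps straight to the first failure instead of paging through thousands of progress lines.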
From zhaozhang at uchicago.edu Thu Apr 30 15:10:07 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 30 Apr 2009 15:10:07 -0500 Subject: [Swift-devel] Rough Swift Test Plan Message-ID: <49FA059F.1070309@uchicago.edu> Hi, All

I am starting to make a Swift test plan for its various features, so that I could put this test into production: run it every day or after a new feature is introduced.

1. Platforms: All OSG sites, All TG sites
For OSG sites, Jing, Xi Li, Zhengxiong and Suchandra have conducted lots of tests over OSG; I will ask them about which sites are most testable. By testable, I mean we could run a sample test without waiting too long in the queue. For TG sites, I am waiting for the account approval to test on queen bee, abe, and other machines. For now, I could test on ranger. Mike Papka and Ti are helping us set up accounts on TG/ANL.

2. Features
We care about two features right now: swift-over-condorG and swift-over-coaster-over-condorG.

3. Automation
Mike mentioned an automatic test tool for many platforms called NMI Test&Build. Ben, can we use the same system to do the tests on real grid sites?

4. Sanity Check
Before we start swift workflow tests, is there a way that we could test if GT2 or GT4 are working there?

5. Test Case
I would just use the cases we have now for coaster; we might need some other test case that could run for a longer time.

So far, I could only get the above thoughts; any comments and corrections will be appreciated. Thanks. zhao

From yuechen at bsd.uchicago.edu Thu Apr 30 16:08:58 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Thu, 30 Apr 2009 16:08:58 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> Message-ID: Hi Michael, Thank you for the advice. I tested ranger with 1 job and new specifications of maxwalltime. 
It shows the following error message. I don't know if there is another problem with my setup. Thank you! ///////////////////////////////////////////////// [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file sites.xml -tc.file tc.data Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090430-1559-2vi6x811 Progress: Progress: Stage in:1 Progress: Submitting:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Active:1 Failed to transfer wrapper log from PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger Progress: Active:1 Failed to transfer wrapper log from PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger Progress: Stage in:1 Progress: Active:1 Failed to transfer wrapper log from PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger Progress: Failed:1 Execution failed: Exception in PTMap2: Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, parameters.txt] Host: ranger Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj stderr.txt: stdout.txt: ---- Caused by: Failed to start worker: null null org.globus.gram.GramException: The job manager detected an invalid script response at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) at org.globus.gram.GramJob.setStatus(GramJob.java:184) at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) at java.lang.Thread.run(Thread.java:619) Cleaning up... 
Shutting down service at https://129.114.50.163:45562 Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) - Done [yuechen at communicado PTMap2]$ /////////////////////////////////////////////////////////// Chen, Yue ________________________________ From: Michael Wilde [mailto:wilde at mcs.anl.gov] Sent: Thu 4/30/2009 3:02 PM To: Yue, Chen - BMD; swift-devel Subject: Re: [Swift-user] Execution error Back on list here (I only went off-list to discuss accounts, etc) The problem in the run below is this: 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with the given max walltime worker constraint (task: 3000, \ maxwalltime: 2400s) You have this on the ptmap app in your tc.data: globus::maxwalltime=50 But you only gave coasters 40 mins per coaster worker. So its complaining that it cant run a 50 minute job in a 40 minute (max) coaster worker. ;) I mentioned in a prior mail that you need to set the two time vals in your sites.xml entry; thats what you need to do next, now. change the coaster time in your sites.xml to: key="coasterWorkerMaxwalltime">00:51:00 If you have more info on the variability of your ptmap run times, send that to the list, and we can discuss how to handle. (NOTE: doing grp -i of the log for "except" or scanning for "except" with an editor will often locate the first "exception" that your job encountered. Thats how I found the error above). Also, Yue, for testing new sites, or for validating that old sites still work, you should create the smallest possible ptmap workflow - 1 job if that is possible - and verify that this works. Then say 10 jobs to make sure scheduling etc is sane. Then, send in your huge jobs. With only 1 job, its easier to spot the errors in the log file. 
- Mike On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: > Hi Michael, > > I run into the same messages again when I use Ranger: > > Progress: Selecting site:146 Stage in:25 Submitting:15 Submitted:821 > Failed but can retry:16 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger > Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 > Failed but can retry:16 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > The log for the search is at : > /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log > > The sites.xml I have is: > > > url="gatekeeper.ranger.tacc.teragrid.org" > 
jobManager="gt2:gt2:SGE"/> > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > TG-CCR080022N > 16 > development > key="coasterWorkerMaxwalltime">00:40:00 > 31 > 50 > 10 > /work/01164/yuechen/swiftwork > > The tc.data I have is: > > ranger PTMap2 > /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > INTEL32::LINUX globus::maxwalltime=50 > > I'm using swift 0.9 rc2 > > Thank you very much for help! > > Chen, Yue > > > > ------------------------------------------------------------------------ > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > *Sent:* Thu 4/30/2009 2:05 PM > *To:* Yue, Chen - BMD > *Subject:* Re: [Swift-user] Execution error > > > > On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: > > Hi Michael, > > > > When I tried to activate my account, I encountered the following error: > > > > "Sorry, this account is in an invalid state. You may not activate your > > at this time." > > > > I used the username and password from TG-CDA070002T. Should I use a > > different password? > > If you can already login to Ranger, then you are all set - you must have > done this previously. > > I thought you had *not*, because when I looked up your login on ranger > ("finger yuechen") it said "never logged in". But seems like that info > is incorrect. > > If you have ptmap compiled, seems like you are almost all set. > > Let me know if it works. > > - Mike > > > Thanks! 
> > > > Chen, Yue > > > > > > ------------------------------------------------------------------------ > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > *Sent:* Thu 4/30/2009 1:07 PM > > *To:* Yue, Chen - BMD > > *Cc:* swift user > > *Subject:* Re: [Swift-user] Execution error > > > > Yue, use this XML pool element to access ranger: > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > jobManager="gt2:gt2:SGE"/> > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > TG-CCR080022N > > 16 > > development > > > key="coasterWorkerMaxwalltime">00:40:00 > > 31 > > 50 > > 10 > > /work/00306/tg455797/swiftwork > > > > > > > > You will need to also do these steps: > > > > Go to this web page to enable your Ranger account: > > > > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx > > > > Then login to Ranger via the TeraGrid portal and put your ssh keys in > > place (assuming you use ssh keys, which you should) > > > > While on Ranger, do this: > > > > echo $WORK > > mkdir $work/swiftwork > > > > and put the full path of your $WORK/swiftwork directory in the > > element above. (My login is tg455etc, yours is yuechen) > > > > Then scp your code to Ranger and compile it. > > > > Then create a tc.data entry for your ptmap app > > > > Next, set your time values in the sites.xml entry above to suitable > > values for Ranger. You'll need to measure times, but I think you will > > find Ranger about twice as fast as Mercury for CPU-bound jobs. > > > > The values above were set for one app job per coaster. I think you can > > probably do more. > > > > If you estimate a run time of 5 minutes, use: > > > > > key="coasterWorkerMaxwalltime">00:30:00 > > 5 > > > > Other people on the list - please sanity check what I suggest here. > > > > - Mike > > > > > > On 4/30/09 12:40 PM, Michael Wilde wrote: > > > I just checked - TG-CDA070002T has indeed expired. 
> > > > > > The best for now is to move to use (only) Ranger, under this account: > > > TG-CCR080022N > > > > > > I will locate and send you a sites.xml entry in a moment. > > > > > > You need to go to a web page to activate your Ranger login. > > > > > > Best to contact me in IM and we can work this out. > > > > > > - Mike > > > > > > > > > > > > On 4/30/09 12:23 PM, Michael Wilde wrote: > > >> Also, what account are you running under? We may need to change > you to > > >> a new account - as the OSG Training account expires today. > > >> If that happend at Noon, it *might* be the problem. > > >> > > >> - Mike > > >> > > >> > > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: > > >>> Hi, > > >>> > > >>> I came back to re-run my application on NCSA Mercury which was > tested > > >>> successfully last week after I just set up coasters with swift 0.9, > > >>> but I got many messages like the following: > > >>> > > >>> Progress: Stage in:219 Submitting:803 Submitted:1 > > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed > but can > > >>> retry:1 > > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed but can > > >>> retry:4 > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY > > >>> Failed to transfer wrapper log from > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY > > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can > retry:8 > > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 > > >>> The log file for the 
successful run last week is ; > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log > > >>> > > >>> The log file for the failed run is : > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log > > >>> > > >>> I don't think I did anything different, so I don't know why this > time > > >>> they failed. The sites.xml for Mercury is: > > >>> > > >>> > > >>> > > >>> > >>> jobManager="gt2:PBS"/> > > >>> /gpfs_scratch1/yuechen/swiftwork > > >>> debug > > >>> > > >>> > > >>> Thank you for help! > > >>> > > >>> Chen, Yue > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> This email is intended only for the use of the individual or entity > > >>> to which it is addressed and may contain information that is > > >>> privileged and confidential. If the reader of this email message is > > >>> not the intended recipient, you are hereby notified that any > > >>> dissemination, distribution, or copying of this communication is > > >>> prohibited. If you have received this email in error, please notify > > >>> the sender and destroy/delete all copies of the transmittal. > Thank you. 
> > >>> _______________________________________________ > > >>> Swift-user mailing list > > >>> Swift-user at ci.uchicago.edu > > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
From hockyg at uchicago.edu Thu Apr 30 16:13:34 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 30 Apr 2009 16:13:34 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> Message-ID: <49FA147E.6070205@uchicago.edu> I have the identical response on ranger. It started yesterday evening. Possibly a problem that the TACC folks need to fix? Glen Yue, Chen - BMD wrote: > Hi Michael, > > Thank you for the advices. I tested ranger with 1 job and new > specifications of maxwalltime. It shows the following error message. I > don't know if there is other problem with my setup. Thank you!
> null > null > org.globus.gram.GramException: The job manager detected an invalid > script response > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > at > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > at java.lang.Thread.run(Thread.java:619) > Cleaning up... > Shutting down service at https://129.114.50.163:45562 > > Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) > - Done > [yuechen at communicado PTMap2]$ > /////////////////////////////////////////////////////////// > > Chen, Yue > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > *Sent:* Thu 4/30/2009 3:02 PM > *To:* Yue, Chen - BMD; swift-devel > *Subject:* Re: [Swift-user] Execution error > > Back on list here (I only went off-list to discuss accounts, etc) > > The problem in the run below is this: > > 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with > the given max walltime worker constraint (task: 3000, \ > maxwalltime: 2400s) > > You have this on the ptmap app in your tc.data: > > globus::maxwalltime=50 > > But you only gave coasters 40 mins per coaster worker. So its > complaining that it cant run a 50 minute job in a 40 minute (max) > coaster worker. ;) > > I mentioned in a prior mail that you need to set the two time vals in > your sites.xml entry; thats what you need to do next, now. > > change the coaster time in your sites.xml to: > key="coasterWorkerMaxwalltime">00:51:00 > > If you have more info on the variability of your ptmap run times, send > that to the list, and we can discuss how to handle. > > > (NOTE: doing grep -i of the log for "except" or scanning for "except" > with an editor will often locate the first "exception" that your job > encountered. Thats how I found the error above).
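[Editorial note: the walltime mismatch Mike diagnoses above is plain unit arithmetic — the tc.data `globus::maxwalltime` is in minutes (50 min = 3000 s) while `coasterWorkerMaxwalltime` is an hh:mm:ss string (00:40:00 = 2400 s), and a job only fits if its walltime is no larger than the worker's. A minimal sketch of that check; the helper names are illustrative, not part of Swift:]

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string (the coasterWorkerMaxwalltime format) to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def job_fits_worker(job_maxwalltime_min: int, worker_walltime_hms: str) -> bool:
    """True if a job with the given globus::maxwalltime (minutes, from tc.data)
    fits inside one coaster worker's wall clock (from sites.xml)."""
    return job_maxwalltime_min * 60 <= hms_to_seconds(worker_walltime_hms)

# The failing configuration from this thread: a 50-minute job, a 40-minute worker.
print(job_fits_worker(50, "00:40:00"))  # False -> "Job cannot be run ..." error
# Mike's suggested fix: raise the worker walltime above the job walltime.
print(job_fits_worker(50, "00:51:00"))  # True
```

The same numbers appear in the log message quoted above (task: 3000, maxwalltime: 2400s), and the log-scanning tip amounts to something like `grep -i except run.log` to find the first exception.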
> > Also, Yue, for testing new sites, or for validating that old sites still > work, you should create the smallest possible ptmap workflow - 1 job if > that is possible - and verify that this works. Then say 10 jobs to make > sure scheduling etc is sane. Then, send in your huge jobs. > > With only 1 job, its easier to spot the errors in the log file. > > - Mike > > > On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: > > Hi Michael, > > > > I run into the same messages again when I use Ranger: > > > > Progress: Selecting site:146 Stage in:25 Submitting:15 Submitted:821 > > Failed but can retry:16 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger > > Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 > > Failed but can retry:16 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > > Failed to transfer wrapper log from > 
> PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > > The log for the search is at : > > /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log > > > > The sites.xml I have is: > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > jobManager="gt2:gt2:SGE"/> > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > TG-CCR080022N > > 16 > > development > > > key="coasterWorkerMaxwalltime">00:40:00 > > 31 > > 50 > > 10 > > /work/01164/yuechen/swiftwork > > > > The tc.data I have is: > > > > ranger PTMap2 > > /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > > INTEL32::LINUX globus::maxwalltime=50 > > > > I'm using swift 0.9 rc2 > > > > Thank you very much for help! > > > > Chen, Yue > > > > > > > > ------------------------------------------------------------------------ > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > *Sent:* Thu 4/30/2009 2:05 PM > > *To:* Yue, Chen - BMD > > *Subject:* Re: [Swift-user] Execution error > > > > > > > > On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: > > > Hi Michael, > > > > > > When I tried to activate my account, I encountered the following > error: > > > > > > "Sorry, this account is in an invalid state. You may not activate > your > > > at this time." > > > > > > I used the username and password from TG-CDA070002T. Should I use a > > > different password? > > > > If you can already login to Ranger, then you are all set - you must have > > done this previously. > > > > I thought you had *not*, because when I looked up your login on ranger > > ("finger yuechen") it said "never logged in". But seems like that info > > is incorrect. > > > > If you have ptmap compiled, seems like you are almost all set. > > > > Let me know if it works. > > > > - Mike > > > > > Thanks! 
> > > > > > Chen, Yue > > > > > > > > > > ------------------------------------------------------------------------ > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > > *Sent:* Thu 4/30/2009 1:07 PM > > > *To:* Yue, Chen - BMD > > > *Cc:* swift user > > > *Subject:* Re: [Swift-user] Execution error > > > > > > Yue, use this XML pool element to access ranger: > > > > > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > > jobManager="gt2:gt2:SGE"/> > > > url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > > key="project">TG-CCR080022N > > > 16 > > > development > > > > > key="coasterWorkerMaxwalltime">00:40:00 > > > 31 > > > 50 > > > 10 > > > /work/00306/tg455797/swiftwork > > > > > > > > > > > > You will need to also do these steps: > > > > > > Go to this web page to enable your Ranger account: > > > > > > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx > > > > > > Then login to Ranger via the TeraGrid portal and put your ssh keys in > > > place (assuming you use ssh keys, which you should) > > > > > > While on Ranger, do this: > > > > > > echo $WORK > > > mkdir $work/swiftwork > > > > > > and put the full path of your $WORK/swiftwork directory in the > > > element above. (My login is tg455etc, yours is > yuechen) > > > > > > Then scp your code to Ranger and compile it. > > > > > > Then create a tc.data entry for your ptmap app > > > > > > Next, set your time values in the sites.xml entry above to suitable > > > values for Ranger. You'll need to measure times, but I think you will > > > find Ranger about twice as fast as Mercury for CPU-bound jobs. > > > > > > The values above were set for one app job per coaster. I think > you can > > > probably do more. > > > > > > If you estimate a run time of 5 minutes, use: > > > > > > > > key="coasterWorkerMaxwalltime">00:30:00 > > > 5 > > > > > > Other people on the list - please sanity check what I suggest here. 
> > > > > > - Mike > > > > > > > > > On 4/30/09 12:40 PM, Michael Wilde wrote: > > > > I just checked - TG-CDA070002T has indeed expired. > > > > > > > > The best for now is to move to use (only) Ranger, under this > account: > > > > TG-CCR080022N > > > > > > > > I will locate and send you a sites.xml entry in a moment. > > > > > > > > You need to go to a web page to activate your Ranger login. > > > > > > > > Best to contact me in IM and we can work this out. > > > > > > > > - Mike > > > > > > > > > > > > > > > > On 4/30/09 12:23 PM, Michael Wilde wrote: > > > >> Also, what account are you running under? We may need to change > > you to > > > >> a new account - as the OSG Training account expires today. > > > >> If that happend at Noon, it *might* be the problem. > > > >> > > > >> - Mike > > > >> > > > >> > > > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: > > > >>> Hi, > > > >>> > > > >>> I came back to re-run my application on NCSA Mercury which was > > tested > > > >>> successfully last week after I just set up coasters with > swift 0.9, > > > >>> but I got many messages like the following: > > > >>> > > > >>> Progress: Stage in:219 Submitting:803 Submitted:1 > > > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed > > but can > > > >>> retry:1 > > > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed > but can > > > >>> retry:4 > > > >>> Failed to transfer wrapper log from > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY > > > >>> Failed to transfer wrapper log from > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY > > > >>> Failed to transfer wrapper log from > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY > > > >>> Failed to transfer wrapper log from > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY > > > >>> Failed to transfer wrapper log from > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY > > > >>> Failed to transfer wrapper log from 
> > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY > > > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can > > retry:8 > > > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 > > > >>> The log file for the successful run last week is ; > > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log > > > >>> > > > >>> The log file for the failed run is : > > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log > > > >>> > > > >>> I don't think I did anything different, so I don't know why this > > time > > > >>> they failed. The sites.xml for Mercury is: > > > >>> > > > >>> > > > >>> url="grid-hg.ncsa.teragrid.org" > > > >>> jobManager="gt2:PBS"/> > > > >>> > /gpfs_scratch1/yuechen/swiftwork > > > >>> debug > > > >>> > > > >>> > > > >>> Thank you for help! > > > >>> > > > >>> Chen, Yue
> ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Thu Apr 30 16:31:13 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 30 Apr 2009 16:31:13 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: <49FA147E.6070205@uchicago.edu> References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> Message-ID: <1241127073.922.1.camel@localhost> Can you guys try to run first.swift on ranger with the settings you have (you'll need to add "echo" to tc.data)? On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote: > I have the identical response on ranger. It started yesterday evening. > Possibly a problem that the TACC folks need to fix? > > Glen > > Yue, Chen - BMD wrote: > > Hi Michael, > > > > Thank you for the advices. I tested ranger with 1 job and new > > specifications of maxwalltime. It shows the following error message. I > > don't know if there is other problem with my setup. Thank you!
> > > > ///////////////////////////////////////////////// > > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > > sites.xml -tc.file tc.data > > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090430-1559-2vi6x811 > > Progress: > > Progress: Stage in:1 > > Progress: Submitting:1 > > Progress: Submitting:1 > > Progress: Submitted:1 > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger > > Progress: Stage in:1 > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger > > Progress: Failed:1 > > Execution failed: > > Exception in PTMap2: > > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, > > parameters.txt] > > Host: ranger > > Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj > > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > > Failed to start worker: > > null > > null > > org.globus.gram.GramException: The job manager detected an invalid > > script response > > at > > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > > at > > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > > at java.lang.Thread.run(Thread.java:619) > > Cleaning up... 
> > Shutting down service at https://129.114.50.163:45562 > > > > Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) > > - Done > > [yuechen at communicado PTMap2]$ > > /////////////////////////////////////////////////////////// > > > > Chen, Yue > > > > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > *Sent:* Thu 4/30/2009 3:02 PM > > *To:* Yue, Chen - BMD; swift-devel > > *Subject:* Re: [Swift-user] Execution error > > > > Back on list here (I only went off-list to discuss accounts, etc) > > > > The problem in the run below is this: > > > > 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with > > the given max walltime worker constraint (task: 3000, \ > > maxwalltime: 2400s) > > > > You have this on the ptmap app in your tc.data: > > > > globus::maxwalltime=50 > > > > But you only gave coasters 40 mins per coaster worker. So its > > complaining that it cant run a 50 minute job in a 40 minute (max) > > coaster worker. ;) > > > > I mentioned in a prior mail that you need to set the two time vals in > > your sites.xml entry; thats what you need to do next, now. > > > > change the coaster time in your sites.xml to: > > key="coasterWorkerMaxwalltime">00:51:00 > > > > If you have more info on the variability of your ptmap run times, send > > that to the list, and we can discuss how to handle. > > > > > > (NOTE: doing grp -i of the log for "except" or scanning for "except" > > with an editor will often locate the first "exception" that your job > > encountered. Thats how I found the error above). > > > > Also, Yue, for testing new sites, or for validating that old sites still > > work, you should create the smallest possible ptmap workflow - 1 job if > > that is possible - and verify that this works. Then say 10 jobs to make > > sure scheduling etc is sane. Then, send in your huge jobs. > > > > With only 1 job, its easier to spot the errors in the log file. 
> > > > - Mike > > > > > > On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: > > > Hi Michael, > > > > > > I run into the same messages again when I use Ranger: > > > > > > Progress: Selecting site:146 Stage in:25 Submitting:15 Submitted:821 > > > Failed but can retry:16 > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger > > > Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 > > > Failed but can retry:16 > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > > > The log 
for the search is at : > > > /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log > > > > > > The sites.xml I have is: > > > > > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > > jobManager="gt2:gt2:SGE"/> > > > > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > > TG-CCR080022N > > > 16 > > > development > > > > > key="coasterWorkerMaxwalltime">00:40:00 > > > 31 > > > 50 > > > 10 > > > /work/01164/yuechen/swiftwork > > > > > > The tc.data I have is: > > > > > > ranger PTMap2 > > > /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > > > INTEL32::LINUX globus::maxwalltime=50 > > > > > > I'm using swift 0.9 rc2 > > > > > > Thank you very much for help! > > > > > > Chen, Yue > > > > > > > > > > > > ------------------------------------------------------------------------ > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > > *Sent:* Thu 4/30/2009 2:05 PM > > > *To:* Yue, Chen - BMD > > > *Subject:* Re: [Swift-user] Execution error > > > > > > > > > > > > On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: > > > > Hi Michael, > > > > > > > > When I tried to activate my account, I encountered the following > > error: > > > > > > > > "Sorry, this account is in an invalid state. You may not activate > > your > > > > at this time." > > > > > > > > I used the username and password from TG-CDA070002T. Should I use a > > > > different password? > > > > > > If you can already login to Ranger, then you are all set - you must have > > > done this previously. > > > > > > I thought you had *not*, because when I looked up your login on ranger > > > ("finger yuechen") it said "never logged in". But seems like that info > > > is incorrect. > > > > > > If you have ptmap compiled, seems like you are almost all set. > > > > > > Let me know if it works. > > > > > > - Mike > > > > > > > Thanks! 
> > > > > > > > Chen, Yue > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > > > *Sent:* Thu 4/30/2009 1:07 PM > > > > *To:* Yue, Chen - BMD > > > > *Cc:* swift user > > > > *Subject:* Re: [Swift-user] Execution error > > > > > > > > Yue, use this XML pool element to access ranger: > > > > > > > > > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > > > jobManager="gt2:gt2:SGE"/> > > > > > url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> > > > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > > > > key="project">TG-CCR080022N > > > > 16 > > > > development > > > > > > > key="coasterWorkerMaxwalltime">00:40:00 > > > > 31 > > > > 50 > > > > 10 > > > > /work/00306/tg455797/swiftwork > > > > > > > > > > > > > > > > You will need to also do these steps: > > > > > > > > Go to this web page to enable your Ranger account: > > > > > > > > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx > > > > > > > > Then login to Ranger via the TeraGrid portal and put your ssh keys in > > > > place (assuming you use ssh keys, which you should) > > > > > > > > While on Ranger, do this: > > > > > > > > echo $WORK > > > > mkdir $work/swiftwork > > > > > > > > and put the full path of your $WORK/swiftwork directory in the > > > > element above. (My login is tg455etc, yours is > > yuechen) > > > > > > > > Then scp your code to Ranger and compile it. > > > > > > > > Then create a tc.data entry for your ptmap app > > > > > > > > Next, set your time values in the sites.xml entry above to suitable > > > > values for Ranger. You'll need to measure times, but I think you will > > > > find Ranger about twice as fast as Mercury for CPU-bound jobs. > > > > > > > > The values above were set for one app job per coaster. I think > > you can > > > > probably do more. 
> > > > > > > > If you estimate a run time of 5 minutes, use: > > > > > > > > > > > key="coasterWorkerMaxwalltime">00:30:00 > > > > 5 > > > > > > > > Other people on the list - please sanity check what I suggest here. > > > > > > > > - Mike > > > > > > > > > > > > On 4/30/09 12:40 PM, Michael Wilde wrote: > > > > > I just checked - TG-CDA070002T has indeed expired. > > > > > > > > > > The best for now is to move to use (only) Ranger, under this > > account: > > > > > TG-CCR080022N > > > > > > > > > > I will locate and send you a sites.xml entry in a moment. > > > > > > > > > > You need to go to a web page to activate your Ranger login. > > > > > > > > > > Best to contact me in IM and we can work this out. > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > On 4/30/09 12:23 PM, Michael Wilde wrote: > > > > >> Also, what account are you running under? We may need to change > > > you to > > > > >> a new account - as the OSG Training account expires today. > > > > >> If that happend at Noon, it *might* be the problem. 
> > > > >> > > > > >> - Mike > > > > >> > > > > >> > > > > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: > > > > >>> Hi, > > > > >>> > > > > >>> I came back to re-run my application on NCSA Mercury which was > > > tested > > > > >>> successfully last week after I just set up coasters with > > swift 0.9, > > > > >>> but I got many messages like the following: > > > > >>> > > > > >>> Progress: Stage in:219 Submitting:803 Submitted:1 > > > > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed > > > but can > > > > >>> retry:1 > > > > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed > > but can > > > > >>> retry:4 > > > > >>> Failed to transfer wrapper log from > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY > > > > >>> Failed to transfer wrapper log from > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY > > > > >>> Failed to transfer wrapper log from > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY > > > > >>> Failed to transfer wrapper log from > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY > > > > >>> Failed to transfer wrapper log from > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY > > > > >>> Failed to transfer wrapper log from > > > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY > > > > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can > > > retry:8 > > > > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 > > > > >>> The log file for the successful run last week is ; > > > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log > > > > >>> > > > > >>> The log file for the failed run is : > > > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log > > > > >>> > > > > >>> I don't think I did anything different, so I don't know why this > > > time > > > > >>> they failed. 
The sites.xml for Mercury is: > > > > >>> > > > > >>> > > > > >>> > > > > >>> > url="grid-hg.ncsa.teragrid.org" > > > > >>> jobManager="gt2:PBS"/> > > > > >>> > > /gpfs_scratch1/yuechen/swiftwork > > > > >>> debug > > > > >>> > > > > >>> > > > > >>> Thank you for help! > > > > >>> > > > > >>> Chen, Yue > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> This email is intended only for the use of the individual or > > entity > > > > >>> to which it is addressed and may contain information that is > > > > >>> privileged and confidential. If the reader of this email > > message is > > > > >>> not the intended recipient, you are hereby notified that any > > > > >>> dissemination, distribution, or copying of this communication is > > > > >>> prohibited. If you have received this email in error, please > > notify > > > > >>> the sender and destroy/delete all copies of the transmittal. > > > Thank you. > > > > >>> > > > > >>> > > > > >>> > > > > > > ------------------------------------------------------------------------ > > > > >>> > > > > >>> _______________________________________________ > > > > >>> Swift-user mailing list > > > > >>> Swift-user at ci.uchicago.edu > > > > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > >> _______________________________________________ > > > > >> Swift-user mailing list > > > > >> Swift-user at ci.uchicago.edu > > > > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > _______________________________________________ > > > > > Swift-user mailing list > > > > > Swift-user at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > > > > > > > > > > > > > > > > > This email is intended only for the use of the individual or > > entity to > > > > which it is addressed and may contain information that is > > privileged and > > > > confidential. 
If the reader of this email message is not the intended > > > > recipient, you are hereby notified that any dissemination, > > distribution, > > > > or copying of this communication is prohibited. If you have received > > > > this email in error, please notify the sender and destroy/delete all > > > > copies of the transmittal. Thank you. 
> > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Thu Apr 30 16:39:02 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 30 Apr 2009 16:39:02 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: <1241127073.922.1.camel@localhost> References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost> Message-ID: <49FA1A76.2010209@mcs.anl.gov> And we should also drill back down to why (at least yesterday) the GT4 softev package failed, but the OSG client worked, for globus-job-run. I guess its possible there is a host or CA cert issue here. - Mike On 4/30/09 4:31 PM, Mihael Hategan wrote: > Can you guys try to run first.swift on ranger with the settings you have > (you'll need to add "echo" to tc.data)? > > > On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote: >> I have the identical response on ranger. It started yesterday evening. >> Possibly a problem that the TACC folks need to fix? >> >> Glen >> >> Yue, Chen - BMD wrote: >>> Hi Michael, >>> >>> Thank you for the advices. I tested ranger with 1 job and new >>> specifications of maxwalltime. It shows the following error message. I >>> don't know if there is other problem with my setup. Thank you! 
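Mihael's suggestion above (run first.swift after adding "echo" to tc.data) amounts to one extra catalog line. The column layout follows the tc.data entry quoted elsewhere in this thread (site, transformation name, executable path, install status, platform, profile entries); the `null` profile field and the tab separation are assumptions.

```shell
# Hypothetical tc.data entry mapping the "echo" transformation to
# /bin/echo on ranger, for the one-job sanity test.
cat >> tc.data <<'EOF'
ranger	echo	/bin/echo	INSTALLED	INTEL32::LINUX	null
EOF

# The sanity run itself would then be:
#   swift first.swift -sites.file sites.xml -tc.file tc.data
```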
>>> >>> ///////////////////////////////////////////////// >>> [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file >>> sites.xml -tc.file tc.data >>> Swift 0.9rc2 swift-r2860 cog-r2388 >>> RunID: 20090430-1559-2vi6x811 >>> Progress: >>> Progress: Stage in:1 >>> Progress: Submitting:1 >>> Progress: Submitting:1 >>> Progress: Submitted:1 >>> Progress: Active:1 >>> Failed to transfer wrapper log from >>> PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger >>> Progress: Active:1 >>> Failed to transfer wrapper log from >>> PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger >>> Progress: Stage in:1 >>> Progress: Active:1 >>> Failed to transfer wrapper log from >>> PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger >>> Progress: Failed:1 >>> Execution failed: >>> Exception in PTMap2: >>> Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, >>> parameters.txt] >>> Host: ranger >>> Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj >>> stderr.txt: >>> stdout.txt: >>> ---- >>> Caused by: >>> Failed to start worker: >>> null >>> null >>> org.globus.gram.GramException: The job manager detected an invalid >>> script response >>> at >>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) >>> at org.globus.gram.GramJob.setStatus(GramJob.java:184) >>> at >>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) >>> at java.lang.Thread.run(Thread.java:619) >>> Cleaning up... 
>>> Shutting down service at https://129.114.50.163:45562 >>> >>> Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) >>> - Done >>> [yuechen at communicado PTMap2]$ >>> /////////////////////////////////////////////////////////// >>> >>> Chen, Yue >>> >>> >>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>> *Sent:* Thu 4/30/2009 3:02 PM >>> *To:* Yue, Chen - BMD; swift-devel >>> *Subject:* Re: [Swift-user] Execution error >>> >>> Back on list here (I only went off-list to discuss accounts, etc) >>> >>> The problem in the run below is this: >>> >>> 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION >>> jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with >>> the given max walltime worker constraint (task: 3000, \ >>> maxwalltime: 2400s) >>> >>> You have this on the ptmap app in your tc.data: >>> >>> globus::maxwalltime=50 >>> >>> But you only gave coasters 40 mins per coaster worker. So its >>> complaining that it cant run a 50 minute job in a 40 minute (max) >>> coaster worker. ;) >>> >>> I mentioned in a prior mail that you need to set the two time vals in >>> your sites.xml entry; thats what you need to do next, now. >>> >>> change the coaster time in your sites.xml to: >>> key="coasterWorkerMaxwalltime">00:51:00 >>> >>> If you have more info on the variability of your ptmap run times, send >>> that to the list, and we can discuss how to handle. >>> >>> >>> (NOTE: doing grep -i of the log for "except" or scanning for "except" >>> with an editor will often locate the first "exception" that your job >>> encountered. Thats how I found the error above). >>> >>> Also, Yue, for testing new sites, or for validating that old sites still >>> work, you should create the smallest possible ptmap workflow - 1 job if >>> that is possible - and verify that this works. Then say 10 jobs to make >>> sure scheduling etc is sane. Then, send in your huge jobs. >>> >>> With only 1 job, its easier to spot the errors in the log file. 
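The log-scanning tip quoted above looks like this in practice. The sample log file is fabricated for the demo (only the exception line is copied from the message above); in a real run you would point grep at your `PTMap2-unmod-*.log` file instead.

```shell
# Build a tiny two-line stand-in log so the sketch is self-contained.
cat > sample-run.log <<'EOF'
(synthetic context line for this demo)
2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with the given max walltime worker constraint (task: 3000, maxwalltime: 2400s)
EOF

# Case-insensitive search, stopping at the first match -- usually the
# first real failure in a Swift log.
grep -i -m 1 'except' sample-run.log
```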
>>> >>> - Mike >>> >>> >>> On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: >>>> Hi Michael, >>>> >>>> I run into the same messages again when I use Ranger: >>>> >>>> Progress: Selecting site:146 Stage in:25 Submitting:15 Submitted:821 >>>> Failed but can retry:16 >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger >>>> Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 >>>> Failed but can retry:16 >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>>> The log for the search is at : >>>> 
/home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log >>>> >>>> The sites.xml I have is: >>>> >>>> >>>> >>> url="gatekeeper.ranger.tacc.teragrid.org" >>>> jobManager="gt2:gt2:SGE"/> >>>> >>>> >>> key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>>> TG-CCR080022N >>>> 16 >>>> development >>>> >>> key="coasterWorkerMaxwalltime">00:40:00 >>>> 31 >>>> 50 >>>> 10 >>>> /work/01164/yuechen/swiftwork >>>> >>>> The tc.data I have is: >>>> >>>> ranger PTMap2 >>>> /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED >>>> INTEL32::LINUX globus::maxwalltime=50 >>>> >>>> I'm using swift 0.9 rc2 >>>> >>>> Thank you very much for help! >>>> >>>> Chen, Yue >>>> >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>> *Sent:* Thu 4/30/2009 2:05 PM >>>> *To:* Yue, Chen - BMD >>>> *Subject:* Re: [Swift-user] Execution error >>>> >>>> >>>> >>>> On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: >>>> > Hi Michael, >>>> > >>>> > When I tried to activate my account, I encountered the following >>> error: >>>> > >>>> > "Sorry, this account is in an invalid state. You may not activate >>> your >>>> > at this time." >>>> > >>>> > I used the username and password from TG-CDA070002T. Should I use a >>>> > different password? >>>> >>>> If you can already login to Ranger, then you are all set - you must have >>>> done this previously. >>>> >>>> I thought you had *not*, because when I looked up your login on ranger >>>> ("finger yuechen") it said "never logged in". But seems like that info >>>> is incorrect. >>>> >>>> If you have ptmap compiled, seems like you are almost all set. >>>> >>>> Let me know if it works. >>>> >>>> - Mike >>>> >>>> > Thanks! 
>>>> > >>>> > Chen, Yue >>>> > >>>> > >>>> > >>> ------------------------------------------------------------------------ >>>> > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>> > *Sent:* Thu 4/30/2009 1:07 PM >>>> > *To:* Yue, Chen - BMD >>>> > *Cc:* swift user >>>> > *Subject:* Re: [Swift-user] Execution error >>>> > >>>> > Yue, use this XML pool element to access ranger: >>>> > >>>> > >>>> > >>> > url="gatekeeper.ranger.tacc.teragrid.org" >>>> > jobManager="gt2:gt2:SGE"/> >>>> > >> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> >>>> > >>> > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>>> > >> key="project">TG-CCR080022N >>>> > 16 >>>> > development >>>> > >>> > key="coasterWorkerMaxwalltime">00:40:00 >>>> > 31 >>>> > 50 >>>> > 10 >>>> > /work/00306/tg455797/swiftwork >>>> > >>>> > >>>> > >>>> > You will need to also do these steps: >>>> > >>>> > Go to this web page to enable your Ranger account: >>>> > >>>> > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx >>>> > >>>> > Then login to Ranger via the TeraGrid portal and put your ssh keys in >>>> > place (assuming you use ssh keys, which you should) >>>> > >>>> > While on Ranger, do this: >>>> > >>>> > echo $WORK >>>> > mkdir $work/swiftwork >>>> > >>>> > and put the full path of your $WORK/swiftwork directory in the >>>> > element above. (My login is tg455etc, yours is >>> yuechen) >>>> > >>>> > Then scp your code to Ranger and compile it. >>>> > >>>> > Then create a tc.data entry for your ptmap app >>>> > >>>> > Next, set your time values in the sites.xml entry above to suitable >>>> > values for Ranger. You'll need to measure times, but I think you will >>>> > find Ranger about twice as fast as Mercury for CPU-bound jobs. >>>> > >>>> > The values above were set for one app job per coaster. I think >>> you can >>>> > probably do more. 
>>>> > >>>> > If you estimate a run time of 5 minutes, use: >>>> > >>>> > >>> > key="coasterWorkerMaxwalltime">00:30:00 >>>> > 5 >>>> > >>>> > Other people on the list - please sanity check what I suggest here. >>>> > >>>> > - Mike >>>> > >>>> > >>>> > On 4/30/09 12:40 PM, Michael Wilde wrote: >>>> > > I just checked - TG-CDA070002T has indeed expired. >>>> > > >>>> > > The best for now is to move to use (only) Ranger, under this >>> account: >>>> > > TG-CCR080022N >>>> > > >>>> > > I will locate and send you a sites.xml entry in a moment. >>>> > > >>>> > > You need to go to a web page to activate your Ranger login. >>>> > > >>>> > > Best to contact me in IM and we can work this out. >>>> > > >>>> > > - Mike >>>> > > >>>> > > >>>> > > >>>> > > On 4/30/09 12:23 PM, Michael Wilde wrote: >>>> > >> Also, what account are you running under? We may need to change >>>> you to >>>> > >> a new account - as the OSG Training account expires today. >>>> > >> If that happend at Noon, it *might* be the problem. 
>>>> > >> >>>> > >> - Mike >>>> > >> >>>> > >> >>>> > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: >>>> > >>> Hi, >>>> > >>> >>>> > >>> I came back to re-run my application on NCSA Mercury which was >>>> tested >>>> > >>> successfully last week after I just set up coasters with >>> swift 0.9, >>>> > >>> but I got many messages like the following: >>>> > >>> >>>> > >>> Progress: Stage in:219 Submitting:803 Submitted:1 >>>> > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed >>>> but can >>>> > >>> retry:1 >>>> > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed >>> but can >>>> > >>> retry:4 >>>> > >>> Failed to transfer wrapper log from >>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY >>>> > >>> Failed to transfer wrapper log from >>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY >>>> > >>> Failed to transfer wrapper log from >>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY >>>> > >>> Failed to transfer wrapper log from >>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY >>>> > >>> Failed to transfer wrapper log from >>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY >>>> > >>> Failed to transfer wrapper log from >>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY >>>> > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can >>>> retry:8 >>>> > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 >>>> > >>> The log file for the successful run last week is ; >>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log >>>> > >>> >>>> > >>> The log file for the failed run is : >>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log >>>> > >>> >>>> > >>> I don't think I did anything different, so I don't know why this >>>> time >>>> > >>> they failed. 
The sites.xml for Mercury is: >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >> url="grid-hg.ncsa.teragrid.org" >>>> > >>> jobManager="gt2:PBS"/> >>>> > >>> >>> /gpfs_scratch1/yuechen/swiftwork >>>> > >>> debug >>>> > >>> >>>> > >>> >>>> > >>> Thank you for help! >>>> > >>> >>>> > >>> Chen, Yue >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> This email is intended only for the use of the individual or >>> entity >>>> > >>> to which it is addressed and may contain information that is >>>> > >>> privileged and confidential. If the reader of this email >>> message is >>>> > >>> not the intended recipient, you are hereby notified that any >>>> > >>> dissemination, distribution, or copying of this communication is >>>> > >>> prohibited. If you have received this email in error, please >>> notify >>>> > >>> the sender and destroy/delete all copies of the transmittal. >>>> Thank you. >>>> > >>> >>>> > >>> >>>> > >>> >>>> > >>> ------------------------------------------------------------------------ >>>> > >>> >>>> > >>> _______________________________________________ >>>> > >>> Swift-user mailing list >>>> > >>> Swift-user at ci.uchicago.edu >>>> > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>>> > >> _______________________________________________ >>>> > >> Swift-user mailing list >>>> > >> Swift-user at ci.uchicago.edu >>>> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>>> > > _______________________________________________ >>>> > > Swift-user mailing list >>>> > > Swift-user at ci.uchicago.edu >>>> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>>> > >>>> > >>>> > >>>> > >>>> > This email is intended only for the use of the individual or >>> entity to >>>> > which it is addressed and may contain information that is >>> privileged and >>>> > confidential. 
If the reader of this email message is not the intended >>>> > recipient, you are hereby notified that any dissemination, >>> distribution, >>>> > or copying of this communication is prohibited. If you have received >>>> > this email in error, please notify the sender and destroy/delete all >>>> > copies of the transmittal. Thank you. 
>>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From zhaozhang at uchicago.edu Thu Apr 30 16:52:43 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 30 Apr 2009 16:52:43 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: <49FA147E.6070205@uchicago.edu> References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> Message-ID: <49FA1DAB.4050306@uchicago.edu> Hi, Glen Can you point me to the working swift on ranger? zhao Glen Hocky wrote: > I have the identical response on ranger. It started yesterday evening. > Possibly a problem that the TACC folks need to fix? > > Glen > > Yue, Chen - BMD wrote: >> Hi Michael, >> >> Thank you for the advices. I tested ranger with 1 job and new >> specifications of maxwalltime. It shows the following error message. >> I don't know if there is other problem with my setup. Thank you! 
>> >> ///////////////////////////////////////////////// >> [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file >> sites.xml -tc.file tc.data >> Swift 0.9rc2 swift-r2860 cog-r2388 >> RunID: 20090430-1559-2vi6x811 >> Progress: >> Progress: Stage in:1 >> Progress: Submitting:1 >> Progress: Submitting:1 >> Progress: Submitted:1 >> Progress: Active:1 >> Failed to transfer wrapper log from >> PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger >> Progress: Active:1 >> Failed to transfer wrapper log from >> PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger >> Progress: Stage in:1 >> Progress: Active:1 >> Failed to transfer wrapper log from >> PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger >> Progress: Failed:1 >> Execution failed: >> Exception in PTMap2: >> Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, >> parameters.txt] >> Host: ranger >> Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj >> stderr.txt: >> stdout.txt: >> ---- >> Caused by: >> Failed to start worker: >> null >> null >> org.globus.gram.GramException: The job manager detected an invalid >> script response >> at >> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) >> >> at org.globus.gram.GramJob.setStatus(GramJob.java:184) >> at >> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) >> at java.lang.Thread.run(Thread.java:619) >> Cleaning up... 
>> Shutting down service at https://129.114.50.163:45562 >> >> Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) >> - Done >> [yuechen at communicado PTMap2]$ >> /////////////////////////////////////////////////////////// >> >> Chen, Yue >> >> >> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >> *Sent:* Thu 4/30/2009 3:02 PM >> *To:* Yue, Chen - BMD; swift-devel >> *Subject:* Re: [Swift-user] Execution error >> >> Back on list here (I only went off-list to discuss accounts, etc) >> >> The problem in the run below is this: >> >> 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION >> jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with >> the given max walltime worker constraint (task: 3000, \ >> maxwalltime: 2400s) >> >> You have this on the ptmap app in your tc.data: >> >> globus::maxwalltime=50 >> >> But you only gave coasters 40 mins per coaster worker. So its >> complaining that it cant run a 50 minute job in a 40 minute (max) >> coaster worker. ;) >> >> I mentioned in a prior mail that you need to set the two time vals in >> your sites.xml entry; thats what you need to do next, now. >> >> change the coaster time in your sites.xml to: >> key="coasterWorkerMaxwalltime">00:51:00 >> >> If you have more info on the variability of your ptmap run times, send >> that to the list, and we can discuss how to handle. >> >> >> (NOTE: doing grp -i of the log for "except" or scanning for "except" >> with an editor will often locate the first "exception" that your job >> encountered. Thats how I found the error above). >> >> Also, Yue, for testing new sites, or for validating that old sites still >> work, you should create the smallest possible ptmap workflow - 1 job if >> that is possible - and verify that this works. Then say 10 jobs to make >> sure scheduling etc is sane. Then, send in your huge jobs. >> >> With only 1 job, its easier to spot the errors in the log file. 
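The mismatch Mike describes above is plain arithmetic: tc.data requests 50 minutes per job (`globus::maxwalltime=50`) while sites.xml grants each coaster worker only `00:40:00`. A small sketch reproducing the numbers in the quoted error (task: 3000, maxwalltime: 2400s):

```shell
job_minutes=50            # globus::maxwalltime in tc.data (minutes)
worker_hms=00:40:00       # coasterWorkerMaxwalltime in sites.xml

job_seconds=$((job_minutes * 60))
worker_seconds=$(echo "$worker_hms" | awk -F: '{print $1*3600 + $2*60 + $3}')

echo "task: ${job_seconds}, maxwalltime: ${worker_seconds}s"
if [ "$job_seconds" -gt "$worker_seconds" ]; then
    # matches the quoted failure; the fix is to raise the worker limit
    # above the job limit, e.g. 00:51:00 as suggested in the email
    echo "job does not fit in one coaster worker"
fi
```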
>> >> - Mike >> >> >> On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: >> > Hi Michael, >> > > I run into the same messages again when I use Ranger: >> > > Progress: Selecting site:146 Stage in:25 Submitting:15 >> Submitted:821 >> > Failed but can retry:16 >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger >> > Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 >> > Failed but can retry:16 >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger >> > Failed to transfer wrapper log from >> > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >> > The log for the search is at : > >> 
/home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log >> > > The sites.xml I have is: >> > > >> > > > url="gatekeeper.ranger.tacc.teragrid.org" >> > jobManager="gt2:gt2:SGE"/> >> > >> > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >> > TG-CCR080022N >> > 16 >> > development >> > > > key="coasterWorkerMaxwalltime">00:40:00 >> > 31 >> > 50 >> > 10 >> > /work/01164/yuechen/swiftwork >> > >> > The tc.data I have is: >> > > ranger PTMap2 > >> /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > >> INTEL32::LINUX globus::maxwalltime=50 >> > >> > I'm using swift 0.9 rc2 >> > >> > Thank you very much for help! >> > >> > Chen, Yue >> > >> > > >> > >> ------------------------------------------------------------------------ >> > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >> > *Sent:* Thu 4/30/2009 2:05 PM >> > *To:* Yue, Chen - BMD >> > *Subject:* Re: [Swift-user] Execution error >> > >> > >> > >> > On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: >> > > Hi Michael, >> > > >> > > When I tried to activate my account, I encountered the following >> error: >> > > >> > > "Sorry, this account is in an invalid state. You may not >> activate your >> > > at this time." >> > > >> > > I used the username and password from TG-CDA070002T. Should I use a >> > > different password? >> > >> > If you can already login to Ranger, then you are all set - you must >> have >> > done this previously. >> > >> > I thought you had *not*, because when I looked up your login on ranger >> > ("finger yuechen") it said "never logged in". But seems like that info >> > is incorrect. >> > >> > If you have ptmap compiled, seems like you are almost all set. >> > >> > Let me know if it works. >> > >> > - Mike >> > >> > > Thanks! 
>> > > >> > > Chen, Yue >> > > >> > > >> > > >> ------------------------------------------------------------------------ >> > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >> > > *Sent:* Thu 4/30/2009 1:07 PM >> > > *To:* Yue, Chen - BMD >> > > *Cc:* swift user >> > > *Subject:* Re: [Swift-user] Execution error >> > > >> > > Yue, use this XML pool element to access ranger: >> > > >> > > >> > > > > > url="gatekeeper.ranger.tacc.teragrid.org" >> > > jobManager="gt2:gt2:SGE"/> >> > > > url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> >> > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >> > > > key="project">TG-CCR080022N >> > > 16 >> > > development >> > > > > > key="coasterWorkerMaxwalltime">00:40:00 >> > > 31 >> > > 50 >> > > 10 >> > > /work/00306/tg455797/swiftwork >> > > >> > > >> > > >> > > You will need to also do these steps: >> > > >> > > Go to this web page to enable your Ranger account: >> > > >> > > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx >> > > >> > > Then login to Ranger via the TeraGrid portal and put your ssh >> keys in >> > > place (assuming you use ssh keys, which you should) >> > > >> > > While on Ranger, do this: >> > > >> > > echo $WORK >> > > mkdir $work/swiftwork >> > > >> > > and put the full path of your $WORK/swiftwork directory in the >> > > element above. (My login is tg455etc, yours is >> yuechen) >> > > >> > > Then scp your code to Ranger and compile it. >> > > >> > > Then create a tc.data entry for your ptmap app >> > > >> > > Next, set your time values in the sites.xml entry above to suitable >> > > values for Ranger. You'll need to measure times, but I think you >> will >> > > find Ranger about twice as fast as Mercury for CPU-bound jobs. >> > > >> > > The values above were set for one app job per coaster. I think >> you can >> > > probably do more. 
>> > > >> > > If you estimate a run time of 5 minutes, use: >> > > >> > > > > > key="coasterWorkerMaxwalltime">00:30:00 >> > > 5 >> > > >> > > Other people on the list - please sanity check what I suggest here. >> > > >> > > - Mike >> > > >> > > >> > > On 4/30/09 12:40 PM, Michael Wilde wrote: >> > > > I just checked - TG-CDA070002T has indeed expired. >> > > > >> > > > The best for now is to move to use (only) Ranger, under this >> account: >> > > > TG-CCR080022N >> > > > >> > > > I will locate and send you a sites.xml entry in a moment. >> > > > >> > > > You need to go to a web page to activate your Ranger login. >> > > > >> > > > Best to contact me in IM and we can work this out. >> > > > >> > > > - Mike >> > > > >> > > > >> > > > >> > > > On 4/30/09 12:23 PM, Michael Wilde wrote: >> > > >> Also, what account are you running under? We may need to change >> > you to >> > > >> a new account - as the OSG Training account expires today. >> > > >> If that happened at Noon, it *might* be the problem. 
>> > > >> >> > > >> - Mike >> > > >> >> > > >> >> > > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: >> > > >>> Hi, >> > > >>> >> > > >>> I came back to re-run my application on NCSA Mercury which was >> > tested >> > > >>> successfully last week after I just set up coasters with >> swift 0.9, >> > > >>> but I got many messages like the following: >> > > >>> >> > > >>> Progress: Stage in:219 Submitting:803 Submitted:1 >> > > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed >> > but can >> > > >>> retry:1 >> > > >>> Progress: Stage in:38 Submitting:425 Submitted:556 >> Failed but can >> > > >>> retry:4 >> > > >>> Failed to transfer wrapper log from >> > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY >> > > >>> Failed to transfer wrapper log from >> > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY >> > > >>> Failed to transfer wrapper log from >> > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY >> > > >>> Failed to transfer wrapper log from >> > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY >> > > >>> Failed to transfer wrapper log from >> > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY >> > > >>> Failed to transfer wrapper log from >> > > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY >> > > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can >> > retry:8 >> > > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 >> > > >>> The log file for the successful run last week is ; >> > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log >> > > >>> >> > > >>> The log file for the failed run is : >> > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log >> > > >>> >> > > >>> I don't think I did anything different, so I don't know why >> this >> > time >> > > >>> they failed. 
The sites.xml for Mercury is: >> > > >>> >> > > >>> >> > > >>> >> > > >>> > url="grid-hg.ncsa.teragrid.org" >> > > >>> jobManager="gt2:PBS"/> >> > > >>> >> /gpfs_scratch1/yuechen/swiftwork >> > > >>> debug >> > > >>> >> > > >>> >> > > >>> Thank you for help! >> > > >>> >> > > >>> Chen, Yue >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> This email is intended only for the use of the individual >> or entity >> > > >>> to which it is addressed and may contain information that is >> > > >>> privileged and confidential. If the reader of this email >> message is >> > > >>> not the intended recipient, you are hereby notified that any >> > > >>> dissemination, distribution, or copying of this >> communication is >> > > >>> prohibited. If you have received this email in error, >> please notify >> > > >>> the sender and destroy/delete all copies of the transmittal. >> > Thank you. >> > > >>> >> > > >>> >> > > >>> >> > > >> ------------------------------------------------------------------------ >> > > >>> >> > > >>> _______________________________________________ >> > > >>> Swift-user mailing list >> > > >>> Swift-user at ci.uchicago.edu >> > > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >> > > >> _______________________________________________ >> > > >> Swift-user mailing list >> > > >> Swift-user at ci.uchicago.edu >> > > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >> > > > _______________________________________________ >> > > > Swift-user mailing list >> > > > Swift-user at ci.uchicago.edu >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >> > > >> > > >> > > >> > > >> > > This email is intended only for the use of the individual or >> entity to >> > > which it is addressed and may contain information that is >> privileged and >> > > confidential. 
If the reader of this email message is not the >> intended >> > > recipient, you are hereby notified that any dissemination, >> distribution, >> > > or copying of this communication is prohibited. If you have >> received >> > > this email in error, please notify the sender and destroy/delete >> all >> > > copies of the transmittal. Thank you. 
>> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Thu Apr 30 17:01:45 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 30 Apr 2009 17:01:45 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: <49FA1A76.2010209@mcs.anl.gov> References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost> <49FA1A76.2010209@mcs.anl.gov> Message-ID: <49FA1FC9.7010108@mcs.anl.gov> A bit more info on this: it *seems* like a cert issue. I last accessed Ranger via globus-job-run perhaps 2 weeks ago, no problem. Yesterday, while debugging with Glen, globus-job-run was giving me GRAM err 74. (and GRAM err 12 to all other sites) So I added +osg-client to my .soft file, and then globus-job-run worked. But I noticed that my globus-job-run was still coming from the GT4 dir, not from an OSG dir. 
Just now I traced this back to X509_CERT_DIR: then I did: com$ unset X509_CERT_DIR com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id GRAM Job submission failed because the job manager failed to open stderr (error code 74) com$ com$ com$ X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id GRAM Job submission failed because the job manager failed to open stderr (error code 74) com$ export X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id uid=455797(tg455797) gid=80243(G-80243) groups=80243(G-80243),81031(G-81031),81411(G-81411),81611(G-81611),81613(G-81613),81621(G-81621),81747(G-81747),81792(G-81792),800744(G-800744),800745(G-800745),800889(G-800889),800981(G-800981),800983(G-800983),801271(G-801271),801364(G-801364) com$ Mihael, does swift honor X509_CERT_DIR? If so, Glen, Yue, that is something to try. You may need to put +osg-client this in your .soft file and re-login: @python-2.5 +java-sun +apache-ant +gx-map +condor +gx-map @globus-4 @default +R +torque +maui +matlab-7.7 +osg-client - Mike On 4/30/09 4:39 PM, Michael Wilde wrote: > And we should also drill back down to why (at least yesterday) the GT4 > softev package failed, but the OSG client worked, for globus-job-run. > > I guess its possible there is a host or CA cert issue here. > > - Mike > > > On 4/30/09 4:31 PM, Mihael Hategan wrote: >> Can you guys try to run first.swift on ranger with the settings you have >> (you'll need to add "echo" to tc.data)? >> >> >> On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote: >>> I have the identical response on ranger. It started yesterday >>> evening. Possibly a problem that the TACC folks need to fix? >>> >>> Glen >>> >>> Yue, Chen - BMD wrote: >>>> Hi Michael, >>>> >>>> Thank you for the advices. I tested ranger with 1 job and new >>>> specifications of maxwalltime. 
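A side note on why, in the globus-job-run transcript above, only the third attempt succeeded: a bare `NAME=value` on its own line sets a shell-local variable, which child processes such as globus-job-run never see; `export` is what copies it into the environment handed to children. A minimal demonstration, with an invented variable name standing in for X509_CERT_DIR:

```shell
# A plain assignment is local to the current shell only.
X509_DEMO=/some/trusted/ca/dir
sh -c 'echo "child sees: ${X509_DEMO:-<unset>}"'   # prints: child sees: <unset>

# export places the variable in the environment of child processes.
export X509_DEMO
sh -c 'echo "child sees: ${X509_DEMO:-<unset>}"'   # prints: child sees: /some/trusted/ca/dir
```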
It shows the following error message. >>>> I don't know if there is other problem with my setup. Thank you! >>>> >>>> ///////////////////////////////////////////////// >>>> [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file >>>> sites.xml -tc.file tc.data >>>> Swift 0.9rc2 swift-r2860 cog-r2388 >>>> RunID: 20090430-1559-2vi6x811 >>>> Progress: >>>> Progress: Stage in:1 >>>> Progress: Submitting:1 >>>> Progress: Submitting:1 >>>> Progress: Submitted:1 >>>> Progress: Active:1 >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger >>>> Progress: Active:1 >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger >>>> Progress: Stage in:1 >>>> Progress: Active:1 >>>> Failed to transfer wrapper log from >>>> PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger >>>> Progress: Failed:1 >>>> Execution failed: >>>> Exception in PTMap2: >>>> Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, >>>> parameters.txt] >>>> Host: ranger >>>> Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj >>>> stderr.txt: >>>> stdout.txt: >>>> ---- >>>> Caused by: >>>> Failed to start worker: >>>> null >>>> null >>>> org.globus.gram.GramException: The job manager detected an invalid >>>> script response >>>> at >>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) >>>> >>>> at org.globus.gram.GramJob.setStatus(GramJob.java:184) >>>> at >>>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) >>>> at java.lang.Thread.run(Thread.java:619) >>>> Cleaning up... 
>>>> Shutting down service at https://129.114.50.163:45562 >>>> >>>> Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) >>>> - Done >>>> [yuechen at communicado PTMap2]$ >>>> /////////////////////////////////////////////////////////// >>>> >>>> Chen, Yue >>>> >>>> >>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>> *Sent:* Thu 4/30/2009 3:02 PM >>>> *To:* Yue, Chen - BMD; swift-devel >>>> *Subject:* Re: [Swift-user] Execution error >>>> >>>> Back on list here (I only went off-list to discuss accounts, etc) >>>> >>>> The problem in the run below is this: >>>> >>>> 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION >>>> jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with >>>> the given max walltime worker constraint (task: 3000, \ >>>> maxwalltime: 2400s) >>>> >>>> You have this on the ptmap app in your tc.data: >>>> >>>> globus::maxwalltime=50 >>>> >>>> But you only gave coasters 40 mins per coaster worker. So it's >>>> complaining that it can't run a 50 minute job in a 40 minute (max) >>>> coaster worker. ;) >>>> >>>> I mentioned in a prior mail that you need to set the two time vals in >>>> your sites.xml entry; that's what you need to do next, now. >>>> >>>> change the coaster time in your sites.xml to: >>>> key="coasterWorkerMaxwalltime">00:51:00 >>>> >>>> If you have more info on the variability of your ptmap run times, send >>>> that to the list, and we can discuss how to handle. >>>> >>>> >>>> (NOTE: doing a grep -i of the log for "except" or scanning for "except" >>>> with an editor will often locate the first "exception" that your job >>>> encountered. That's how I found the error above). >>>> >>>> Also, Yue, for testing new sites, or for validating that old sites >>>> still >>>> work, you should create the smallest possible ptmap workflow - 1 job if >>>> that is possible - and verify that this works. Then say 10 jobs to >>>> make >>>> sure scheduling etc is sane. Then, send in your huge jobs. 
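The log-scanning tip above (`grep -i` for "except") can be tried on a miniature log. The log lines below are fabricated for illustration, modeled on the APPLICATION_EXCEPTION line quoted in the mail; real Swift logs are far larger:

```shell
# Build a tiny fake Swift log (contents invented for the demo).
cat > /tmp/demo-swift.log <<'EOF'
2009-04-30 14:29:40,100-0500 DEBUG vdl:execute2 JOB_START jobid=PTMap2-abeii5aj
2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with the given max walltime worker constraint
2009-04-30 14:29:42,300-0500 INFO  vdl:execute2 JOB_END jobid=PTMap2-abeii5aj
EOF

# -i matches EXCEPTION/exception alike; -m 1 stops at the first hit,
# which in a long log is usually the root-cause error.
grep -i -m 1 'except' /tmp/demo-swift.log
```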
>>>> >>>> With only 1 job, its easier to spot the errors in the log file. >>>> >>>> - Mike >>>> >>>> >>>> On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: >>>>> Hi Michael, >>>>> >>>>> I run into the same messages again when I use Ranger: >>>>> >>>>> Progress: Selecting site:146 Stage in:25 Submitting:15 >>>>> Submitted:821 >>>>> Failed but can retry:16 >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger >>>>> Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 >>>>> Failed but can retry:16 >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger >>>>> Failed to transfer wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger >>>>> Failed to transfer 
wrapper log from >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>>>> The log for the search is at : >>>>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log >>>>> >>>>> The sites.xml I have is: >>>>> >>>>> >>>>> >>>> url="gatekeeper.ranger.tacc.teragrid.org" >>>>> jobManager="gt2:gt2:SGE"/> >>>>> >>>>> >>>> key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>>>> TG-CCR080022N >>>>> 16 >>>>> development >>>>> >>>> key="coasterWorkerMaxwalltime">00:40:00 >>>>> 31 >>>>> 50 >>>>> 10 >>>>> /work/01164/yuechen/swiftwork >>>>> >>>>> The tc.data I have is: >>>>> >>>>> ranger PTMap2 >>>>> /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED >>>>> INTEL32::LINUX globus::maxwalltime=50 >>>>> >>>>> I'm using swift 0.9 rc2 >>>>> >>>>> Thank you very much for help! >>>>> >>>>> Chen, Yue >>>>> >>>>> >>>>> >>>>> ------------------------------------------------------------------------ >>>>> >>>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>>> *Sent:* Thu 4/30/2009 2:05 PM >>>>> *To:* Yue, Chen - BMD >>>>> *Subject:* Re: [Swift-user] Execution error >>>>> >>>>> >>>>> >>>>> On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: >>>>> > Hi Michael, >>>>> > >>>>> > When I tried to activate my account, I encountered the following >>>> error: >>>>> > >>>>> > "Sorry, this account is in an invalid state. You may not activate >>>> your >>>>> > at this time." >>>>> > >>>>> > I used the username and password from TG-CDA070002T. Should I use a >>>>> > different password? >>>>> >>>>> If you can already login to Ranger, then you are all set - you must >>>>> have >>>>> done this previously. >>>>> >>>>> I thought you had *not*, because when I looked up your login on ranger >>>>> ("finger yuechen") it said "never logged in". But seems like that info >>>>> is incorrect. >>>>> >>>>> If you have ptmap compiled, seems like you are almost all set. >>>>> >>>>> Let me know if it works. >>>>> >>>>> - Mike >>>>> >>>>> > Thanks! 
>>>>> > >>>>> > Chen, Yue >>>>> > >>>>> > >>>>> > >>>> ------------------------------------------------------------------------ >>>> >>>>> > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>>> > *Sent:* Thu 4/30/2009 1:07 PM >>>>> > *To:* Yue, Chen - BMD >>>>> > *Cc:* swift user >>>>> > *Subject:* Re: [Swift-user] Execution error >>>>> > >>>>> > Yue, use this XML pool element to access ranger: >>>>> > >>>>> > >>>>> > >>>> > url="gatekeeper.ranger.tacc.teragrid.org" >>>>> > jobManager="gt2:gt2:SGE"/> >>>>> > >>> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> >>>>> > >>>> > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>>>> > >>> key="project">TG-CCR080022N >>>>> > 16 >>>>> > development >>>>> > >>>> > key="coasterWorkerMaxwalltime">00:40:00 >>>>> > 31 >>>>> > 50 >>>>> > 10 >>>>> > /work/00306/tg455797/swiftwork >>>>> > >>>>> > >>>>> > >>>>> > You will need to also do these steps: >>>>> > >>>>> > Go to this web page to enable your Ranger account: >>>>> > >>>>> > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx >>>>> > >>>>> > Then login to Ranger via the TeraGrid portal and put your ssh >>>>> keys in >>>>> > place (assuming you use ssh keys, which you should) >>>>> > >>>>> > While on Ranger, do this: >>>>> > >>>>> > echo $WORK >>>>> > mkdir $work/swiftwork >>>>> > >>>>> > and put the full path of your $WORK/swiftwork directory in the >>>>> > element above. (My login is tg455etc, yours is >>>> yuechen) >>>>> > >>>>> > Then scp your code to Ranger and compile it. >>>>> > >>>>> > Then create a tc.data entry for your ptmap app >>>>> > >>>>> > Next, set your time values in the sites.xml entry above to suitable >>>>> > values for Ranger. You'll need to measure times, but I think you >>>>> will >>>>> > find Ranger about twice as fast as Mercury for CPU-bound jobs. >>>>> > >>>>> > The values above were set for one app job per coaster. I think >>>> you can >>>>> > probably do more. 
>>>>> > >>>>> > If you estimate a run time of 5 minutes, use: >>>>> > >>>>> > >>>> > key="coasterWorkerMaxwalltime">00:30:00 >>>>> > 5 >>>>> > >>>>> > Other people on the list - please sanity check what I suggest here. >>>>> > >>>>> > - Mike >>>>> > >>>>> > >>>>> > On 4/30/09 12:40 PM, Michael Wilde wrote: >>>>> > > I just checked - TG-CDA070002T has indeed expired. >>>>> > > >>>>> > > The best for now is to move to use (only) Ranger, under this >>>> account: >>>>> > > TG-CCR080022N >>>>> > > >>>>> > > I will locate and send you a sites.xml entry in a moment. >>>>> > > >>>>> > > You need to go to a web page to activate your Ranger login. >>>>> > > >>>>> > > Best to contact me in IM and we can work this out. >>>>> > > >>>>> > > - Mike >>>>> > > >>>>> > > >>>>> > > >>>>> > > On 4/30/09 12:23 PM, Michael Wilde wrote: >>>>> > >> Also, what account are you running under? We may need to change >>>>> you to >>>>> > >> a new account - as the OSG Training account expires today. >>>>> > >> If that happend at Noon, it *might* be the problem. 
>>>>> > >> >>>>> > >> - Mike >>>>> > >> >>>>> > >> >>>>> > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: >>>>> > >>> Hi, >>>>> > >>> >>>>> > >>> I came back to re-run my application on NCSA Mercury which was >>>>> tested >>>>> > >>> successfully last week after I just set up coasters with >>>> swift 0.9, >>>>> > >>> but I got many messages like the following: >>>>> > >>> >>>>> > >>> Progress: Stage in:219 Submitting:803 Submitted:1 >>>>> > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed >>>>> but can >>>>> > >>> retry:1 >>>>> > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed >>>> but can >>>>> > >>> retry:4 >>>>> > >>> Failed to transfer wrapper log from >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY >>>>> > >>> Failed to transfer wrapper log from >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY >>>>> > >>> Failed to transfer wrapper log from >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY >>>>> > >>> Failed to transfer wrapper log from >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY >>>>> > >>> Failed to transfer wrapper log from >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY >>>>> > >>> Failed to transfer wrapper log from >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY >>>>> > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can >>>>> retry:8 >>>>> > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 >>>>> > >>> The log file for the successful run last week is ; >>>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log >>>>> > >>> >>>>> > >>> The log file for the failed run is : >>>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log >>>>> > >>> >>>>> > >>> I don't think I did anything different, so I don't know why >>>>> this >>>>> time >>>>> > >>> they failed. 
The sites.xml for Mercury is: >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>> url="grid-hg.ncsa.teragrid.org" >>>>> > >>> jobManager="gt2:PBS"/> >>>>> > >>> >>>> /gpfs_scratch1/yuechen/swiftwork >>>>> > >>> debug >>>>> > >>> >>>>> > >>> >>>>> > >>> Thank you for help! >>>>> > >>> >>>>> > >>> Chen, Yue >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> This email is intended only for the use of the individual or >>>> entity >>>>> > >>> to which it is addressed and may contain information that is >>>>> > >>> privileged and confidential. If the reader of this email >>>> message is >>>>> > >>> not the intended recipient, you are hereby notified that any >>>>> > >>> dissemination, distribution, or copying of this >>>>> communication is >>>>> > >>> prohibited. If you have received this email in error, please >>>> notify >>>>> > >>> the sender and destroy/delete all copies of the transmittal. >>>>> Thank you. >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>>> ------------------------------------------------------------------------ >>>> >>>>> > >>> >>>>> > >>> _______________________________________________ >>>>> > >>> Swift-user mailing list >>>>> > >>> Swift-user at ci.uchicago.edu >>>>> > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>>>> > >> _______________________________________________ >>>>> > >> Swift-user mailing list >>>>> > >> Swift-user at ci.uchicago.edu >>>>> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>>>> > > _______________________________________________ >>>>> > > Swift-user mailing list >>>>> > > Swift-user at ci.uchicago.edu >>>>> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > This email is intended only for the use of the individual or >>>> entity to >>>>> > which it is addressed and may contain information that is >>>> privileged and >>>>> > confidential. 
If the reader of this email message is not the >>>>> intended >>>>> > recipient, you are hereby notified that any dissemination, >>>> distribution, >>>>> > or copying of this communication is prohibited. If you have >>>>> received >>>>> > this email in error, please notify the sender and destroy/delete >>>>> all >>>>> > copies of the transmittal. Thank you. 
>>>> ------------------------------------------------------------------------ >>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Apr 30 17:13:19 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 30 Apr 2009 17:13:19 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: <49FA1FC9.7010108@mcs.anl.gov> References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost> <49FA1A76.2010209@mcs.anl.gov> <49FA1FC9.7010108@mcs.anl.gov> Message-ID: <1241129599.1616.0.camel@localhost> On Thu, 2009-04-30 at 17:01 -0500, Michael Wilde wrote: > A bit more info on this: it *seems* like a cert issue. > > I last accessed Ranger via globus-job-run perhaps 2 weeks ago, no problem. > > Yesterday, while debugging with Glen, globus-job-run was giving me GRAM > err 74. (and GRM err 12 to all other sites) > > So I added +osg-client to my .soft file, and then globus-job-run worked. > > But I noticed that my globus-job-run was still coming from the GT4 dir, > not from an OSG dir. 
> > Just now I traced this back to X509_CERT_DIR: > > then I did: > > com$ unset X509_CERT_DIR > com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id > GRAM Job submission failed because the job manager failed to open stderr > (error code 74) That seems like an IP address problem. Make sure you set GLOBUS_HOSTNAME properly. > com$ > com$ > com$ X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA > com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id > GRAM Job submission failed because the job manager failed to open stderr > (error code 74) > com$ export X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA > com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id > uid=455797(tg455797) gid=80243(G-80243) > groups=80243(G-80243),81031(G-81031),81411(G-81411),81611(G-81611),81613(G-81613),81621(G-81621),81747(G-81747),81792(G-81792),800744(G-800744),800745(G-800745),800889(G-800889),800981(G-800981),800983(G-800983),801271(G-801271),801364(G-801364) > com$ > > Mihael, does swift honor X509_CERT_DIR? If so, Glen, Yue, that is > something to try. > > You may need to put +osg-client this in your .soft file and re-login: > > @python-2.5 > +java-sun > > +apache-ant > +gx-map > +condor > +gx-map > @globus-4 > @default > +R > +torque > +maui > +matlab-7.7 > +osg-client > > - Mike > > > > > > On 4/30/09 4:39 PM, Michael Wilde wrote: > > And we should also drill back down to why (at least yesterday) the GT4 > > softev package failed, but the OSG client worked, for globus-job-run. > > > > I guess its possible there is a host or CA cert issue here. > > > > - Mike > > > > > > On 4/30/09 4:31 PM, Mihael Hategan wrote: > >> Can you guys try to run first.swift on ranger with the settings you have > >> (you'll need to add "echo" to tc.data)? > >> > >> > >> On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote: > >>> I have the identical response on ranger. It started yesterday > >>> evening. 
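On the GLOBUS_HOSTNAME suggestion above: the transcript's error 74 is "job manager failed to open stderr", which happens when GRAM advertises a callback address the remote job manager cannot reach, e.g. on a multi-homed client. A common workaround, sketched here with `hostname -f` as an assumed reasonable default, is to export the externally visible name before submitting:

```shell
# Point GRAM callbacks at this machine's fully qualified name.
# On a multi-homed or NATed host, replace $(hostname -f) with the
# DNS name that remote sites can actually reach.
export GLOBUS_HOSTNAME=$(hostname -f)
echo "GLOBUS_HOSTNAME=$GLOBUS_HOSTNAME"
```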
Possibly a problem that the TACC folks need to fix? > >>> > >>> Glen > >>> > >>> Yue, Chen - BMD wrote: > >>>> Hi Michael, > >>>> > >>>> Thank you for the advices. I tested ranger with 1 job and new > >>>> specifications of maxwalltime. It shows the following error message. > >>>> I don't know if there is other problem with my setup. Thank you! > >>>> > >>>> ///////////////////////////////////////////////// > >>>> [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > >>>> sites.xml -tc.file tc.data > >>>> Swift 0.9rc2 swift-r2860 cog-r2388 > >>>> RunID: 20090430-1559-2vi6x811 > >>>> Progress: > >>>> Progress: Stage in:1 > >>>> Progress: Submitting:1 > >>>> Progress: Submitting:1 > >>>> Progress: Submitted:1 > >>>> Progress: Active:1 > >>>> Failed to transfer wrapper log from > >>>> PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger > >>>> Progress: Active:1 > >>>> Failed to transfer wrapper log from > >>>> PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger > >>>> Progress: Stage in:1 > >>>> Progress: Active:1 > >>>> Failed to transfer wrapper log from > >>>> PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger > >>>> Progress: Failed:1 > >>>> Execution failed: > >>>> Exception in PTMap2: > >>>> Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, > >>>> parameters.txt] > >>>> Host: ranger > >>>> Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj > >>>> stderr.txt: > >>>> stdout.txt: > >>>> ---- > >>>> Caused by: > >>>> Failed to start worker: > >>>> null > >>>> null > >>>> org.globus.gram.GramException: The job manager detected an invalid > >>>> script response > >>>> at > >>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > >>>> > >>>> at org.globus.gram.GramJob.setStatus(GramJob.java:184) > >>>> at > >>>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > >>>> at java.lang.Thread.run(Thread.java:619) > >>>> Cleaning 
up... > >>>> Shutting down service at https://129.114.50.163:45562 > >>>> > >>>> Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) > >>>> - Done > >>>> [yuechen at communicado PTMap2]$ > >>>> /////////////////////////////////////////////////////////// > >>>> > >>>> Chen, Yue > >>>> > >>>> > >>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > >>>> *Sent:* Thu 4/30/2009 3:02 PM > >>>> *To:* Yue, Chen - BMD; swift-devel > >>>> *Subject:* Re: [Swift-user] Execution error > >>>> > >>>> Back on list here (I only went off-list to discuss accounts, etc) > >>>> > >>>> The problem in the run below is this: > >>>> > >>>> 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > >>>> jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with > >>>> the given max walltime worker constraint (task: 3000, \ > >>>> maxwalltime: 2400s) > >>>> > >>>> You have this on the ptmap app in your tc.data: > >>>> > >>>> globus::maxwalltime=50 > >>>> > >>>> But you only gave coasters 40 mins per coaster worker. So its > >>>> complaining that it cant run a 50 minute job in a 40 minute (max) > >>>> coaster worker. ;) > >>>> > >>>> I mentioned in a prior mail that you need to set the two time vals in > >>>> your sites.xml entry; thats what you need to do next, now. > >>>> > >>>> change the coaster time in your sites.xml to: > >>>> key="coasterWorkerMaxwalltime">00:51:00 > >>>> > >>>> If you have more info on the variability of your ptmap run times, send > >>>> that to the list, and we can discuss how to handle. > >>>> > >>>> > >>>> (NOTE: doing grp -i of the log for "except" or scanning for "except" > >>>> with an editor will often locate the first "exception" that your job > >>>> encountered. Thats how I found the error above). > >>>> > >>>> Also, Yue, for testing new sites, or for validating that old sites > >>>> still > >>>> work, you should create the smallest possible ptmap workflow - 1 job if > >>>> that is possible - and verify that this works. 
Then say 10 jobs to > >>>> make > >>>> sure scheduling etc is sane. Then, send in your huge jobs. > >>>> > >>>> With only 1 job, its easier to spot the errors in the log file. > >>>> > >>>> - Mike > >>>> > >>>> > >>>> On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: > >>>>> Hi Michael, > >>>>> > >>>>> I run into the same messages again when I use Ranger: > >>>>> > >>>>> Progress: Selecting site:146 Stage in:25 Submitting:15 > >>>>> Submitted:821 > >>>>> Failed but can retry:16 > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger > >>>>> Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 > >>>>> Failed but can retry:16 > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > >>>>> Failed to transfer wrapper log from > >>>>> 
PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger > >>>>> Failed to transfer wrapper log from > >>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > >>>>> The log for the search is at : > >>>>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log > >>>>> > >>>>> The sites.xml I have is: > >>>>> > >>>>> > >>>>> >>>>> url="gatekeeper.ranger.tacc.teragrid.org" > >>>>> jobManager="gt2:gt2:SGE"/> > >>>>> > >>>>> >>>>> key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > >>>>> TG-CCR080022N > >>>>> 16 > >>>>> development > >>>>> >>>>> key="coasterWorkerMaxwalltime">00:40:00 > >>>>> 31 > >>>>> 50 > >>>>> 10 > >>>>> /work/01164/yuechen/swiftwork > >>>>> > >>>>> The tc.data I have is: > >>>>> > >>>>> ranger PTMap2 > >>>>> /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > >>>>> INTEL32::LINUX globus::maxwalltime=50 > >>>>> > >>>>> I'm using swift 0.9 rc2 > >>>>> > >>>>> Thank you very much for help! > >>>>> > >>>>> Chen, Yue > >>>>> > >>>>> > >>>>> > >>>>> ------------------------------------------------------------------------ > >>>>> > >>>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > >>>>> *Sent:* Thu 4/30/2009 2:05 PM > >>>>> *To:* Yue, Chen - BMD > >>>>> *Subject:* Re: [Swift-user] Execution error > >>>>> > >>>>> > >>>>> > >>>>> On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: > >>>>> > Hi Michael, > >>>>> > > >>>>> > When I tried to activate my account, I encountered the following > >>>> error: > >>>>> > > >>>>> > "Sorry, this account is in an invalid state. You may not activate > >>>> your > >>>>> > at this time." > >>>>> > > >>>>> > I used the username and password from TG-CDA070002T. Should I use a > >>>>> > different password? > >>>>> > >>>>> If you can already login to Ranger, then you are all set - you must > >>>>> have > >>>>> done this previously. 
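[Editorial aside: the constraint behind the failure above is that each job's tc.data maxwalltime must fit inside the coaster worker's wall time. A consistent pair, using the values from this thread, would look roughly like the fragment below; the `<profile>` element is reconstructed, since the archive stripped the XML tags.]

```
<!-- sites.xml: the coaster worker must outlive the longest job it hosts -->
<profile namespace="globus" key="coasterWorkerMaxwalltime">00:51:00</profile>

# tc.data: a 50-minute job limit, which fits inside the 51-minute worker
ranger  PTMap2  /share/home/01164/yuechen/PTMap2/PTMap2  INSTALLED  INTEL32::LINUX  globus::maxwalltime=50
```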
> >>>>> > >>>>> I thought you had *not*, because when I looked up your login on ranger > >>>>> ("finger yuechen") it said "never logged in". But seems like that info > >>>>> is incorrect. > >>>>> > >>>>> If you have ptmap compiled, seems like you are almost all set. > >>>>> > >>>>> Let me know if it works. > >>>>> > >>>>> - Mike > >>>>> > >>>>> > Thanks! > >>>>> > > >>>>> > Chen, Yue > >>>>> > > >>>>> > > >>>>> > > >>>> ------------------------------------------------------------------------ > >>>> > >>>>> > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > >>>>> > *Sent:* Thu 4/30/2009 1:07 PM > >>>>> > *To:* Yue, Chen - BMD > >>>>> > *Cc:* swift user > >>>>> > *Subject:* Re: [Swift-user] Execution error > >>>>> > > >>>>> > Yue, use this XML pool element to access ranger: > >>>>> > > >>>>> > > >>>>> > >>>>> > url="gatekeeper.ranger.tacc.teragrid.org" > >>>>> > jobManager="gt2:gt2:SGE"/> > >>>>> > >>>> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> > >>>>> > >>>>> > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > >>>>> > >>>> key="project">TG-CCR080022N > >>>>> > 16 > >>>>> > development > >>>>> > >>>>> > key="coasterWorkerMaxwalltime">00:40:00 > >>>>> > 31 > >>>>> > 50 > >>>>> > 10 > >>>>> > /work/00306/tg455797/swiftwork > >>>>> > > >>>>> > > >>>>> > > >>>>> > You will need to also do these steps: > >>>>> > > >>>>> > Go to this web page to enable your Ranger account: > >>>>> > > >>>>> > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx > >>>>> > > >>>>> > Then login to Ranger via the TeraGrid portal and put your ssh > >>>>> keys in > >>>>> > place (assuming you use ssh keys, which you should) > >>>>> > > >>>>> > While on Ranger, do this: > >>>>> > > >>>>> > echo $WORK > >>>>> > mkdir $work/swiftwork > >>>>> > > >>>>> > and put the full path of your $WORK/swiftwork directory in the > >>>>> > element above. (My login is tg455etc, yours is > >>>> yuechen) > >>>>> > > >>>>> > Then scp your code to Ranger and compile it. 
> >>>>> > > >>>>> > Then create a tc.data entry for your ptmap app > >>>>> > > >>>>> > Next, set your time values in the sites.xml entry above to suitable > >>>>> > values for Ranger. You'll need to measure times, but I think you > >>>>> will > >>>>> > find Ranger about twice as fast as Mercury for CPU-bound jobs. > >>>>> > > >>>>> > The values above were set for one app job per coaster. I think > >>>> you can > >>>>> > probably do more. > >>>>> > > >>>>> > If you estimate a run time of 5 minutes, use: > >>>>> > > >>>>> > <profile namespace="globus" > >>>>> > key="coasterWorkerMaxwalltime">00:30:00</profile> > >>>>> > 5 > >>>>> > > >>>>> > Other people on the list - please sanity check what I suggest here. > >>>>> > > >>>>> > - Mike > >>>>> > > >>>>> > > >>>>> > On 4/30/09 12:40 PM, Michael Wilde wrote: > >>>>> > > I just checked - TG-CDA070002T has indeed expired. > >>>>> > > > >>>>> > > The best for now is to move to using (only) Ranger, under this > >>>> account: > >>>>> > > TG-CCR080022N > >>>>> > > > >>>>> > > I will locate and send you a sites.xml entry in a moment. > >>>>> > > > >>>>> > > You need to go to a web page to activate your Ranger login. > >>>>> > > > >>>>> > > Best to contact me in IM and we can work this out. > >>>>> > > > >>>>> > > - Mike > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > On 4/30/09 12:23 PM, Michael Wilde wrote: > >>>>> > >> Also, what account are you running under? We may need to change > >>>>> you to > >>>>> > >> a new account - as the OSG Training account expires today. > >>>>> > >> If that happened at noon, it *might* be the problem.
> >>>>> > >> > >>>>> > >> - Mike > >>>>> > >> > >>>>> > >> > >>>>> > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: > >>>>> > >>> Hi, > >>>>> > >>> > >>>>> > >>> I came back to re-run my application on NCSA Mercury which was > >>>>> tested > >>>>> > >>> successfully last week after I just set up coasters with > >>>> swift 0.9, > >>>>> > >>> but I got many messages like the following: > >>>>> > >>> > >>>>> > >>> Progress: Stage in:219 Submitting:803 Submitted:1 > >>>>> > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed > >>>>> but can > >>>>> > >>> retry:1 > >>>>> > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed > >>>> but can > >>>>> > >>> retry:4 > >>>>> > >>> Failed to transfer wrapper log from > >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY > >>>>> > >>> Failed to transfer wrapper log from > >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY > >>>>> > >>> Failed to transfer wrapper log from > >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY > >>>>> > >>> Failed to transfer wrapper log from > >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY > >>>>> > >>> Failed to transfer wrapper log from > >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY > >>>>> > >>> Failed to transfer wrapper log from > >>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY > >>>>> > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can > >>>>> retry:8 > >>>>> > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 > >>>>> > >>> The log file for the successful run last week is ; > >>>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log > >>>>> > >>> > >>>>> > >>> The log file for the failed run is : > >>>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log > >>>>> > >>> > >>>>> > >>> I don't think I did anything different, so I don't know why > >>>>> this > >>>>> time > >>>>> > >>> they 
failed. The sites.xml for Mercury is: > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> >>>> url="grid-hg.ncsa.teragrid.org" > >>>>> > >>> jobManager="gt2:PBS"/> > >>>>> > >>> > >>>> /gpfs_scratch1/yuechen/swiftwork > >>>>> > >>> debug > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> Thank you for help! > >>>>> > >>> > >>>>> > >>> Chen, Yue > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> > >>>>> > >>> This email is intended only for the use of the individual or > >>>> entity > >>>>> > >>> to which it is addressed and may contain information that is > >>>>> > >>> privileged and confidential. If the reader of this email > >>>> message is > >>>>> > >>> not the intended recipient, you are hereby notified that any > >>>>> > >>> dissemination, distribution, or copying of this > >>>>> communication is > >>>>> > >>> prohibited. If you have received this email in error, please > >>>> notify > >>>>> > >>> the sender and destroy/delete all copies of the transmittal. > >>>>> Thank you. 
> >>>>> > >>> > >>>>> > >>> > >>>>> > >>>>> ------------------------------------------------------------------------ > >>>> > >>>>> > >>> > >>>>> > >>> _______________________________________________ > >>>>> > >>> Swift-user mailing list > >>>>> > >>> Swift-user at ci.uchicago.edu > >>>>> > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > >>>>> > >> _______________________________________________ > >>>>> > >> Swift-user mailing list > >>>>> > >> Swift-user at ci.uchicago.edu > >>>>> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > >>>>> > > _______________________________________________ > >>>>> > > Swift-user mailing list > >>>>> > > Swift-user at ci.uchicago.edu > >>>>> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >>>> ------------------------------------------------------------------------ > >>>> > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Thu Apr 30 17:23:56 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 30 Apr 2009 17:23:56 -0500
Subject: [Swift-devel] RE: [Swift-user] Execution error
In-Reply-To: <1241129599.1616.0.camel@localhost>
References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost> <49FA1A76.2010209@mcs.anl.gov> <49FA1FC9.7010108@mcs.anl.gov> <1241129599.1616.0.camel@localhost>
Message-ID: <49FA24FC.4040500@mcs.anl.gov>

On 4/30/09 5:13
PM, Mihael Hategan wrote: >> GRAM Job submission failed because the job manager failed to open stderr >> (error code 74) > > That seems like an IP address problem. Make sure you set GLOBUS_HOSTNAME > properly. OK, I will try that. But in the test below, I caused the error by unsetting X509_CERT_DIR and fixed the error by resetting it - no other changes. I *think* that as recently as a few weeks ago globus-job-run to ranger worked with just @globus in my .soft file. Adding +osg-client seemed to make it work by setting X509_CERT_DIR. So as far as I can tell, at least at the level of globus-job-run, this seems to be related to certs. Given what I'm seeing, do you still think GLOBUS_HOSTNAME is a factor? - Mike > On Thu, 2009-04-30 at 17:01 -0500, Michael Wilde wrote: >> A bit more info on this: it *seems* like a cert issue. >> >> I last accessed Ranger via globus-job-run perhaps 2 weeks ago, no problem. >> >> Yesterday, while debugging with Glen, globus-job-run was giving me GRAM >> err 74 (and GRAM err 12 to all other sites). >> >> So I added +osg-client to my .soft file, and then globus-job-run worked. >> >> But I noticed that my globus-job-run was still coming from the GT4 dir, >> not from an OSG dir. >> >> Just now I traced this back to X509_CERT_DIR: >> >> then I did: >> >> com$ unset X509_CERT_DIR >> com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id >> GRAM Job submission failed because the job manager failed to open stderr >> (error code 74) > > That seems like an IP address problem. Make sure you set GLOBUS_HOSTNAME > properly.
> >> com$ >> com$ >> com$ X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA >> com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id >> GRAM Job submission failed because the job manager failed to open stderr >> (error code 74) >> com$ export X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA >> com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id >> uid=455797(tg455797) gid=80243(G-80243) >> groups=80243(G-80243),81031(G-81031),81411(G-81411),81611(G-81611),81613(G-81613),81621(G-81621),81747(G-81747),81792(G-81792),800744(G-800744),800745(G-800745),800889(G-800889),800981(G-800981),800983(G-800983),801271(G-801271),801364(G-801364) >> com$ >> >> Mihael, does swift honor X509_CERT_DIR? If so, Glen, Yue, that is >> something to try. >> >> You may need to put +osg-client this in your .soft file and re-login: >> >> @python-2.5 >> +java-sun >> >> +apache-ant >> +gx-map >> +condor >> +gx-map >> @globus-4 >> @default >> +R >> +torque >> +maui >> +matlab-7.7 >> +osg-client >> >> - Mike >> >> >> >> >> >> On 4/30/09 4:39 PM, Michael Wilde wrote: >>> And we should also drill back down to why (at least yesterday) the GT4 >>> softev package failed, but the OSG client worked, for globus-job-run. >>> >>> I guess its possible there is a host or CA cert issue here. >>> >>> - Mike >>> >>> >>> On 4/30/09 4:31 PM, Mihael Hategan wrote: >>>> Can you guys try to run first.swift on ranger with the settings you have >>>> (you'll need to add "echo" to tc.data)? >>>> >>>> >>>> On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote: >>>>> I have the identical response on ranger. It started yesterday >>>>> evening. Possibly a problem that the TACC folks need to fix? >>>>> >>>>> Glen >>>>> >>>>> Yue, Chen - BMD wrote: >>>>>> Hi Michael, >>>>>> >>>>>> Thank you for the advices. I tested ranger with 1 job and new >>>>>> specifications of maxwalltime. It shows the following error message. >>>>>> I don't know if there is other problem with my setup. 
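[Editorial aside: note that in the transcript above, `X509_CERT_DIR=...` alone did not fix globus-job-run, but `export X509_CERT_DIR=...` did. That is plain shell semantics, sketched below with a throwaway variable value; globus-job-run, like any child process, only sees exported variables.]

```shell
# A plain assignment is visible only to the current shell; child
# processes (such as globus-job-run) do not inherit it.
X509_CERT_DIR=/tmp/demo-trusted-ca      # set, but NOT exported
sh -c 'echo "child sees: ${X509_CERT_DIR:-<unset>}"'

# After export, child processes inherit the variable.
export X509_CERT_DIR
sh -c 'echo "child sees: ${X509_CERT_DIR:-<unset>}"'
```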
Thank you! >>>>>> >>>>>> ///////////////////////////////////////////////// >>>>>> [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file >>>>>> sites.xml -tc.file tc.data >>>>>> Swift 0.9rc2 swift-r2860 cog-r2388 >>>>>> RunID: 20090430-1559-2vi6x811 >>>>>> Progress: >>>>>> Progress: Stage in:1 >>>>>> Progress: Submitting:1 >>>>>> Progress: Submitting:1 >>>>>> Progress: Submitted:1 >>>>>> Progress: Active:1 >>>>>> Failed to transfer wrapper log from >>>>>> PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger >>>>>> Progress: Active:1 >>>>>> Failed to transfer wrapper log from >>>>>> PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger >>>>>> Progress: Stage in:1 >>>>>> Progress: Active:1 >>>>>> Failed to transfer wrapper log from >>>>>> PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger >>>>>> Progress: Failed:1 >>>>>> Execution failed: >>>>>> Exception in PTMap2: >>>>>> Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, >>>>>> parameters.txt] >>>>>> Host: ranger >>>>>> Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj >>>>>> stderr.txt: >>>>>> stdout.txt: >>>>>> ---- >>>>>> Caused by: >>>>>> Failed to start worker: >>>>>> null >>>>>> null >>>>>> org.globus.gram.GramException: The job manager detected an invalid >>>>>> script response >>>>>> at >>>>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) >>>>>> >>>>>> at org.globus.gram.GramJob.setStatus(GramJob.java:184) >>>>>> at >>>>>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) >>>>>> at java.lang.Thread.run(Thread.java:619) >>>>>> Cleaning up... 
>>>>>> Shutting down service at https://129.114.50.163:45562 >>>>>> >>>>>> Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) >>>>>> - Done >>>>>> [yuechen at communicado PTMap2]$ >>>>>> /////////////////////////////////////////////////////////// >>>>>> >>>>>> Chen, Yue >>>>>> >>>>>> >>>>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>>>> *Sent:* Thu 4/30/2009 3:02 PM >>>>>> *To:* Yue, Chen - BMD; swift-devel >>>>>> *Subject:* Re: [Swift-user] Execution error >>>>>> >>>>>> Back on list here (I only went off-list to discuss accounts, etc) >>>>>> >>>>>> The problem in the run below is this: >>>>>> >>>>>> 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION >>>>>> jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with >>>>>> the given max walltime worker constraint (task: 3000, \ >>>>>> maxwalltime: 2400s) >>>>>> >>>>>> You have this on the ptmap app in your tc.data: >>>>>> >>>>>> globus::maxwalltime=50 >>>>>> >>>>>> But you only gave coasters 40 mins per coaster worker. So its >>>>>> complaining that it cant run a 50 minute job in a 40 minute (max) >>>>>> coaster worker. ;) >>>>>> >>>>>> I mentioned in a prior mail that you need to set the two time vals in >>>>>> your sites.xml entry; thats what you need to do next, now. >>>>>> >>>>>> change the coaster time in your sites.xml to: >>>>>> key="coasterWorkerMaxwalltime">00:51:00 >>>>>> >>>>>> If you have more info on the variability of your ptmap run times, send >>>>>> that to the list, and we can discuss how to handle. >>>>>> >>>>>> >>>>>> (NOTE: doing grp -i of the log for "except" or scanning for "except" >>>>>> with an editor will often locate the first "exception" that your job >>>>>> encountered. Thats how I found the error above). >>>>>> >>>>>> Also, Yue, for testing new sites, or for validating that old sites >>>>>> still >>>>>> work, you should create the smallest possible ptmap workflow - 1 job if >>>>>> that is possible - and verify that this works. 
Then say 10 jobs to >>>>>> make >>>>>> sure scheduling etc is sane. Then, send in your huge jobs. >>>>>> >>>>>> With only 1 job, its easier to spot the errors in the log file. >>>>>> >>>>>> - Mike >>>>>> >>>>>> >>>>>> On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: >>>>>>> Hi Michael, >>>>>>> >>>>>>> I run into the same messages again when I use Ranger: >>>>>>> >>>>>>> Progress: Selecting site:146 Stage in:25 Submitting:15 >>>>>>> Submitted:821 >>>>>>> Failed but can retry:16 >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger >>>>>>> Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 >>>>>>> Failed but can retry:16 >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> 
PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>>>>>> The log for the search is at : >>>>>>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log >>>>>>> >>>>>>> The sites.xml I have is: >>>>>>> >>>>>>> >>>>>>> >>>>>> url="gatekeeper.ranger.tacc.teragrid.org" >>>>>>> jobManager="gt2:gt2:SGE"/> >>>>>>> >>>>>>> >>>>>> key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>>>>>> TG-CCR080022N >>>>>>> 16 >>>>>>> development >>>>>>> >>>>>> key="coasterWorkerMaxwalltime">00:40:00 >>>>>>> 31 >>>>>>> 50 >>>>>>> 10 >>>>>>> /work/01164/yuechen/swiftwork >>>>>>> >>>>>>> The tc.data I have is: >>>>>>> >>>>>>> ranger PTMap2 >>>>>>> /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED >>>>>>> INTEL32::LINUX globus::maxwalltime=50 >>>>>>> >>>>>>> I'm using swift 0.9 rc2 >>>>>>> >>>>>>> Thank you very much for help! >>>>>>> >>>>>>> Chen, Yue >>>>>>> >>>>>>> >>>>>>> >>>>>>> ------------------------------------------------------------------------ >>>>>>> >>>>>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>>>>> *Sent:* Thu 4/30/2009 2:05 PM >>>>>>> *To:* Yue, Chen - BMD >>>>>>> *Subject:* Re: [Swift-user] Execution error >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: >>>>>>> > Hi Michael, >>>>>>> > >>>>>>> > When I tried to activate my account, I encountered the following >>>>>> error: >>>>>>> > >>>>>>> > "Sorry, this account is in an invalid state. You may not activate >>>>>> your >>>>>>> > at this time." >>>>>>> > >>>>>>> > I used the username and password from TG-CDA070002T. Should I use a >>>>>>> > different password? >>>>>>> >>>>>>> If you can already login to Ranger, then you are all set - you must >>>>>>> have >>>>>>> done this previously. 
>>>>>>> >>>>>>> I thought you had *not*, because when I looked up your login on ranger >>>>>>> ("finger yuechen") it said "never logged in". But seems like that info >>>>>>> is incorrect. >>>>>>> >>>>>>> If you have ptmap compiled, seems like you are almost all set. >>>>>>> >>>>>>> Let me know if it works. >>>>>>> >>>>>>> - Mike >>>>>>> >>>>>>> > Thanks! >>>>>>> > >>>>>>> > Chen, Yue >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> ------------------------------------------------------------------------ >>>>>> >>>>>>> > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>>>>> > *Sent:* Thu 4/30/2009 1:07 PM >>>>>>> > *To:* Yue, Chen - BMD >>>>>>> > *Cc:* swift user >>>>>>> > *Subject:* Re: [Swift-user] Execution error >>>>>>> > >>>>>>> > Yue, use this XML pool element to access ranger: >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > url="gatekeeper.ranger.tacc.teragrid.org" >>>>>>> > jobManager="gt2:gt2:SGE"/> >>>>>>> > >>>>> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> >>>>>>> > >>>>>> > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>>>>>> > >>>>> key="project">TG-CCR080022N >>>>>>> > 16 >>>>>>> > development >>>>>>> > >>>>>> > key="coasterWorkerMaxwalltime">00:40:00 >>>>>>> > 31 >>>>>>> > 50 >>>>>>> > 10 >>>>>>> > /work/00306/tg455797/swiftwork >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > You will need to also do these steps: >>>>>>> > >>>>>>> > Go to this web page to enable your Ranger account: >>>>>>> > >>>>>>> > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx >>>>>>> > >>>>>>> > Then login to Ranger via the TeraGrid portal and put your ssh >>>>>>> keys in >>>>>>> > place (assuming you use ssh keys, which you should) >>>>>>> > >>>>>>> > While on Ranger, do this: >>>>>>> > >>>>>>> > echo $WORK >>>>>>> > mkdir $work/swiftwork >>>>>>> > >>>>>>> > and put the full path of your $WORK/swiftwork directory in the >>>>>>> > element above. (My login is tg455etc, yours is >>>>>> yuechen) >>>>>>> > >>>>>>> > Then scp your code to Ranger and compile it. 
>>>>>>> > >>>>>>> > Then create a tc.data entry for your ptmap app >>>>>>> > >>>>>>> > Next, set your time values in the sites.xml entry above to suitable >>>>>>> > values for Ranger. You'll need to measure times, but I think you >>>>>>> will >>>>>>> > find Ranger about twice as fast as Mercury for CPU-bound jobs. >>>>>>> > >>>>>>> > The values above were set for one app job per coaster. I think >>>>>> you can >>>>>>> > probably do more. >>>>>>> > >>>>>>> > If you estimate a run time of 5 minutes, use: >>>>>>> > >>>>>>> > >>>>>> > key="coasterWorkerMaxwalltime">00:30:00 >>>>>>> > 5 >>>>>>> > >>>>>>> > Other people on the list - please sanity check what I suggest here. >>>>>>> > >>>>>>> > - Mike >>>>>>> > >>>>>>> > >>>>>>> > On 4/30/09 12:40 PM, Michael Wilde wrote: >>>>>>> > > I just checked - TG-CDA070002T has indeed expired. >>>>>>> > > >>>>>>> > > The best for now is to move to use (only) Ranger, under this >>>>>> account: >>>>>>> > > TG-CCR080022N >>>>>>> > > >>>>>>> > > I will locate and send you a sites.xml entry in a moment. >>>>>>> > > >>>>>>> > > You need to go to a web page to activate your Ranger login. >>>>>>> > > >>>>>>> > > Best to contact me in IM and we can work this out. >>>>>>> > > >>>>>>> > > - Mike >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > On 4/30/09 12:23 PM, Michael Wilde wrote: >>>>>>> > >> Also, what account are you running under? We may need to change >>>>>>> you to >>>>>>> > >> a new account - as the OSG Training account expires today. >>>>>>> > >> If that happend at Noon, it *might* be the problem. 
>>>>>>> > >> >>>>>>> > >> - Mike >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: >>>>>>> > >>> Hi, >>>>>>> > >>> >>>>>>> > >>> I came back to re-run my application on NCSA Mercury which was >>>>>>> tested >>>>>>> > >>> successfully last week after I just set up coasters with >>>>>> swift 0.9, >>>>>>> > >>> but I got many messages like the following: >>>>>>> > >>> >>>>>>> > >>> Progress: Stage in:219 Submitting:803 Submitted:1 >>>>>>> > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed >>>>>>> but can >>>>>>> > >>> retry:1 >>>>>>> > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed >>>>>> but can >>>>>>> > >>> retry:4 >>>>>>> > >>> Failed to transfer wrapper log from >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY >>>>>>> > >>> Failed to transfer wrapper log from >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY >>>>>>> > >>> Failed to transfer wrapper log from >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY >>>>>>> > >>> Failed to transfer wrapper log from >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY >>>>>>> > >>> Failed to transfer wrapper log from >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY >>>>>>> > >>> Failed to transfer wrapper log from >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY >>>>>>> > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can >>>>>>> retry:8 >>>>>>> > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 >>>>>>> > >>> The log file for the successful run last week is ; >>>>>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log >>>>>>> > >>> >>>>>>> > >>> The log file for the failed run is : >>>>>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log >>>>>>> > >>> >>>>>>> > >>> I don't think I did anything different, so I don't know why >>>>>>> this >>>>>>> time >>>>>>> > >>> they 
failed. The sites.xml for Mercury is: >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>> url="grid-hg.ncsa.teragrid.org" >>>>>>> > >>> jobManager="gt2:PBS"/> >>>>>>> > >>> >>>>>> /gpfs_scratch1/yuechen/swiftwork >>>>>>> > >>> debug >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> Thank you for help! >>>>>>> > >>> >>>>>>> > >>> Chen, Yue >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> This email is intended only for the use of the individual or >>>>>> entity >>>>>>> > >>> to which it is addressed and may contain information that is >>>>>>> > >>> privileged and confidential. If the reader of this email >>>>>> message is >>>>>>> > >>> not the intended recipient, you are hereby notified that any >>>>>>> > >>> dissemination, distribution, or copying of this >>>>>>> communication is >>>>>>> > >>> prohibited. If you have received this email in error, please >>>>>> notify >>>>>>> > >>> the sender and destroy/delete all copies of the transmittal. >>>>>>> Thank you. 
>>>>>>> > >>> _______________________________________________
>>>>>>> > >>> Swift-user mailing list
>>>>>>> > >>> Swift-user at ci.uchicago.edu
>>>>>>> > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user

>>>>>> _______________________________________________
>>>>>> Swift-devel mailing list
>>>>>> Swift-devel at ci.uchicago.edu
>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From yuechen at bsd.uchicago.edu Thu Apr 30 17:32:56 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Thu, 30 Apr 2009 17:32:56 -0500
Subject: [Swift-devel] RE: [Swift-user] Execution error
References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost> <49FA1A76.2010209@mcs.anl.gov> <49FA1FC9.7010108@mcs.anl.gov> <1241129599.1616.0.camel@localhost> <49FA24FC.4040500@mcs.anl.gov>
Message-ID: 

Hi Michael,

I already have +osg-client-1.0.0-r1 in
my .soft file, but I changed it to +osg-client and tried again; "ranger" gave me the same error message. In the meantime, I tested one job each on Abe and Lonestar, and both gave me a qsub error, attached below:

////////////////////////////////////
[yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file sites.xml -tc.file tc.data
Swift 0.9rc2 swift-r2860 cog-r2388
RunID: 20090430-1722-oncfdolb
Progress:
Progress: Stage in:1
Progress: Stage in:1
Progress: Stage in:1
Progress: Submitting:1
Progress: Submitted:1
Failed to transfer wrapper log from PTMap2-unmod-20090430-1722-oncfdolb/info/3 on TACC_LoneStar
Progress: Active:1
Failed to transfer wrapper log from PTMap2-unmod-20090430-1722-oncfdolb/info/5 on TACC_LoneStar
Progress: Stage in:1
Progress: Active:1
Failed to transfer wrapper log from PTMap2-unmod-20090430-1722-oncfdolb/info/7 on TACC_LoneStar
Progress: Failed:1
Execution failed:
Exception in PTMap2:
Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, parameters.txt]
Host: TACC_LoneStar
Directory: PTMap2-unmod-20090430-1722-oncfdolb/jobs/7/PTMap2-7uagp5aj
stderr.txt:
stdout.txt:
----
Caused by:
Cannot submit job: Could not submit job (qsub reported an exit code of -1). no error output
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job: Could not submit job (qsub reported an exit code of -1).
no error output
        at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63)
        at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
        at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43)
        at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221)
        at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145)
Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Could not submit job (qsub reported an exit code of -1). no error output
        at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:94)
        at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53)
        ... 4 more
Cleaning up...
Shutting down service at https://129.114.50.32:34704
Got channel MetaChannel: 2013263 -> GSSSChannel-null(1)
- Done
/////////////////////////////////////////

My sites.xml is at /home/yuechen/PTMap2/sites.xml. I'm wondering if this still relates to my setup. Thanks!

Chen, Yue

________________________________
From: Michael Wilde [mailto:wilde at mcs.anl.gov]
Sent: Thu 4/30/2009 5:23 PM
To: Mihael Hategan
Cc: swift-devel; Yue, Chen - BMD
Subject: Re: [Swift-devel] RE: [Swift-user] Execution error

On 4/30/09 5:13 PM, Mihael Hategan wrote:
>> GRAM Job submission failed because the job manager failed to open stderr
>> (error code 74)
>
> That seems like an IP address problem. Make sure you set GLOBUS_HOSTNAME
> properly.

OK, I will try that. But in the test below, I caused the error by unsetting X509_CERT_DIR and fixed the error by resetting it - no other changes.

I *think* that as recently as a few weeks ago globus-job-run to ranger worked with just @globus in my .soft file.
Adding +osg-client seemed to make it work by setting X509_CERT_DIR.

So as far as I can tell, at least at the level of globus-job-run, this seems to be related to certs.

Given what I'm seeing, do you still think GLOBUS_HOSTNAME is a factor?

- Mike

> On Thu, 2009-04-30 at 17:01 -0500, Michael Wilde wrote:
>> A bit more info on this: it *seems* like a cert issue.
>>
>> I last accessed Ranger via globus-job-run perhaps 2 weeks ago, no problem.
>>
>> Yesterday, while debugging with Glen, globus-job-run was giving me GRAM
>> err 74 (and GRAM err 12 to all other sites).
>>
>> So I added +osg-client to my .soft file, and then globus-job-run worked.
>>
>> But I noticed that my globus-job-run was still coming from the GT4 dir,
>> not from an OSG dir.
>>
>> Just now I traced this back to X509_CERT_DIR:
>>
>> then I did:
>>
>> com$ unset X509_CERT_DIR
>> com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id
>> GRAM Job submission failed because the job manager failed to open stderr
>> (error code 74)
>
> That seems like an IP address problem. Make sure you set GLOBUS_HOSTNAME
> properly.
>
>> com$ X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA
>> com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id
>> GRAM Job submission failed because the job manager failed to open stderr
>> (error code 74)
>> com$ export X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA
>> com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id
>> uid=455797(tg455797) gid=80243(G-80243)
>> groups=80243(G-80243),81031(G-81031),81411(G-81411),81611(G-81611),81613(G-81613),81621(G-81621),81747(G-81747),81792(G-81792),800744(G-800744),800745(G-800745),800889(G-800889),800981(G-800981),800983(G-800983),801271(G-801271),801364(G-801364)
>> com$
>>
>> Mihael, does swift honor X509_CERT_DIR? If so, Glen, Yue, that is
>> something to try.
>> You may need to put this +osg-client in your .soft file and re-login:
>>
>> @python-2.5
>> +java-sun
>>
>> +apache-ant
>> +gx-map
>> +condor
>> +gx-map
>> @globus-4
>> @default
>> +R
>> +torque
>> +maui
>> +matlab-7.7
>> +osg-client
>>
>> - Mike
>>
>> On 4/30/09 4:39 PM, Michael Wilde wrote:
>>> And we should also drill back down to why (at least yesterday) the GT4
>>> softenv package failed, but the OSG client worked, for globus-job-run.
>>>
>>> I guess it's possible there is a host or CA cert issue here.
>>>
>>> - Mike
>>>
>>> On 4/30/09 4:31 PM, Mihael Hategan wrote:
>>>> Can you guys try to run first.swift on ranger with the settings you have
>>>> (you'll need to add "echo" to tc.data)?
>>>>
>>>> On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote:
>>>>> I have the identical response on ranger. It started yesterday
>>>>> evening. Possibly a problem that the TACC folks need to fix?
>>>>>
>>>>> Glen
>>>>>
>>>>> Yue, Chen - BMD wrote:
>>>>>> Hi Michael,
>>>>>>
>>>>>> Thank you for the advice. I tested ranger with 1 job and new
>>>>>> specifications of maxwalltime. It shows the following error message.
>>>>>> I don't know if there is some other problem with my setup. Thank you!
>>>>>> >>>>>> ///////////////////////////////////////////////// >>>>>> [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file >>>>>> sites.xml -tc.file tc.data >>>>>> Swift 0.9rc2 swift-r2860 cog-r2388 >>>>>> RunID: 20090430-1559-2vi6x811 >>>>>> Progress: >>>>>> Progress: Stage in:1 >>>>>> Progress: Submitting:1 >>>>>> Progress: Submitting:1 >>>>>> Progress: Submitted:1 >>>>>> Progress: Active:1 >>>>>> Failed to transfer wrapper log from >>>>>> PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger >>>>>> Progress: Active:1 >>>>>> Failed to transfer wrapper log from >>>>>> PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger >>>>>> Progress: Stage in:1 >>>>>> Progress: Active:1 >>>>>> Failed to transfer wrapper log from >>>>>> PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger >>>>>> Progress: Failed:1 >>>>>> Execution failed: >>>>>> Exception in PTMap2: >>>>>> Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, >>>>>> parameters.txt] >>>>>> Host: ranger >>>>>> Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj >>>>>> stderr.txt: >>>>>> stdout.txt: >>>>>> ---- >>>>>> Caused by: >>>>>> Failed to start worker: >>>>>> null >>>>>> null >>>>>> org.globus.gram.GramException: The job manager detected an invalid >>>>>> script response >>>>>> at >>>>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) >>>>>> >>>>>> at org.globus.gram.GramJob.setStatus(GramJob.java:184) >>>>>> at >>>>>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) >>>>>> at java.lang.Thread.run(Thread.java:619) >>>>>> Cleaning up... 
>>>>>> Shutting down service at https://129.114.50.163:45562
>>>>>> Got channel MetaChannel: 20903429 -> GSSSChannel-null(1)
>>>>>> - Done
>>>>>> [yuechen at communicado PTMap2]$
>>>>>> ///////////////////////////////////////////////////////////
>>>>>>
>>>>>> Chen, Yue
>>>>>>
>>>>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov]
>>>>>> *Sent:* Thu 4/30/2009 3:02 PM
>>>>>> *To:* Yue, Chen - BMD; swift-devel
>>>>>> *Subject:* Re: [Swift-user] Execution error
>>>>>>
>>>>>> Back on list here (I only went off-list to discuss accounts, etc)
>>>>>>
>>>>>> The problem in the run below is this:
>>>>>>
>>>>>> 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>>>>>> jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with
>>>>>> the given max walltime worker constraint (task: 3000, maxwalltime: 2400s)
>>>>>>
>>>>>> You have this on the ptmap app in your tc.data:
>>>>>>
>>>>>> globus::maxwalltime=50
>>>>>>
>>>>>> But you only gave coasters 40 mins per coaster worker. So it's
>>>>>> complaining that it can't run a 50 minute job in a 40 minute (max)
>>>>>> coaster worker. ;)
>>>>>>
>>>>>> I mentioned in a prior mail that you need to set the two time vals in
>>>>>> your sites.xml entry; that's what you need to do next, now.
>>>>>>
>>>>>> Change the coaster time in your sites.xml to:
>>>>>> key="coasterWorkerMaxwalltime">00:51:00
>>>>>>
>>>>>> If you have more info on the variability of your ptmap run times, send
>>>>>> that to the list, and we can discuss how to handle it.
>>>>>>
>>>>>> (NOTE: doing grep -i of the log for "except", or scanning for "except"
>>>>>> with an editor, will often locate the first "exception" that your job
>>>>>> encountered. That's how I found the error above.)
>>>>>>
>>>>>> Also, Yue, for testing new sites, or for validating that old sites still
>>>>>> work, you should create the smallest possible ptmap workflow - 1 job if
>>>>>> that is possible - and verify that this works.
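[Editor's note: the walltime mismatch Mike describes is plain arithmetic. A minimal shell sketch with the values from this thread; the variable names are illustrative, not anything Swift itself defines:]

```shell
# tc.data requests globus::maxwalltime=50 (minutes) per ptmap job.
task_s=$((50 * 60))                         # 3000 s per task

# sites.xml grants each coaster worker coasterWorkerMaxwalltime=00:40:00.
hms="00:40:00"
h=${hms%%:*}; rest=${hms#*:}; m=${rest%%:*}; s=${rest#*:}
worker_s=$((h * 3600 + m * 60 + s))         # 2400 s per worker

# A 3000 s task cannot be placed in a 2400 s worker -- hence the
# "Job cannot be run with the given max walltime worker constraint" error.
if [ "$task_s" -gt "$worker_s" ]; then
  echo "task ${task_s}s does not fit in worker ${worker_s}s"
fi
```

With the suggested 00:51:00 worker limit, the worker grants 3060 s, which covers the 3000 s task request.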
Then say 10 jobs to >>>>>> make >>>>>> sure scheduling etc is sane. Then, send in your huge jobs. >>>>>> >>>>>> With only 1 job, its easier to spot the errors in the log file. >>>>>> >>>>>> - Mike >>>>>> >>>>>> >>>>>> On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: >>>>>>> Hi Michael, >>>>>>> >>>>>>> I run into the same messages again when I use Ranger: >>>>>>> >>>>>>> Progress: Selecting site:146 Stage in:25 Submitting:15 >>>>>>> Submitted:821 >>>>>>> Failed but can retry:16 >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger >>>>>>> Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 >>>>>>> Failed but can retry:16 >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> 
PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger >>>>>>> Failed to transfer wrapper log from >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger >>>>>>> The log for the search is at : >>>>>>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log >>>>>>> >>>>>>> The sites.xml I have is: >>>>>>> >>>>>>> >>>>>>> >>>>>> url="gatekeeper.ranger.tacc.teragrid.org" >>>>>>> jobManager="gt2:gt2:SGE"/> >>>>>>> >>>>>>> >>>>>> key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>>>>>> TG-CCR080022N >>>>>>> 16 >>>>>>> development >>>>>>> >>>>>> key="coasterWorkerMaxwalltime">00:40:00 >>>>>>> 31 >>>>>>> 50 >>>>>>> 10 >>>>>>> /work/01164/yuechen/swiftwork >>>>>>> >>>>>>> The tc.data I have is: >>>>>>> >>>>>>> ranger PTMap2 >>>>>>> /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED >>>>>>> INTEL32::LINUX globus::maxwalltime=50 >>>>>>> >>>>>>> I'm using swift 0.9 rc2 >>>>>>> >>>>>>> Thank you very much for help! >>>>>>> >>>>>>> Chen, Yue >>>>>>> >>>>>>> >>>>>>> >>>>>>> ------------------------------------------------------------------------ >>>>>>> >>>>>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>>>>> *Sent:* Thu 4/30/2009 2:05 PM >>>>>>> *To:* Yue, Chen - BMD >>>>>>> *Subject:* Re: [Swift-user] Execution error >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: >>>>>>> > Hi Michael, >>>>>>> > >>>>>>> > When I tried to activate my account, I encountered the following >>>>>> error: >>>>>>> > >>>>>>> > "Sorry, this account is in an invalid state. You may not activate >>>>>> your >>>>>>> > at this time." >>>>>>> > >>>>>>> > I used the username and password from TG-CDA070002T. Should I use a >>>>>>> > different password? >>>>>>> >>>>>>> If you can already login to Ranger, then you are all set - you must >>>>>>> have >>>>>>> done this previously. 
>>>>>>> >>>>>>> I thought you had *not*, because when I looked up your login on ranger >>>>>>> ("finger yuechen") it said "never logged in". But seems like that info >>>>>>> is incorrect. >>>>>>> >>>>>>> If you have ptmap compiled, seems like you are almost all set. >>>>>>> >>>>>>> Let me know if it works. >>>>>>> >>>>>>> - Mike >>>>>>> >>>>>>> > Thanks! >>>>>>> > >>>>>>> > Chen, Yue >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> ------------------------------------------------------------------------ >>>>>> >>>>>>> > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] >>>>>>> > *Sent:* Thu 4/30/2009 1:07 PM >>>>>>> > *To:* Yue, Chen - BMD >>>>>>> > *Cc:* swift user >>>>>>> > *Subject:* Re: [Swift-user] Execution error >>>>>>> > >>>>>>> > Yue, use this XML pool element to access ranger: >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > url="gatekeeper.ranger.tacc.teragrid.org" >>>>>>> > jobManager="gt2:gt2:SGE"/> >>>>>>> > >>>>> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> >>>>>>> > >>>>>> > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir >>>>>>> > >>>>> key="project">TG-CCR080022N >>>>>>> > 16 >>>>>>> > development >>>>>>> > >>>>>> > key="coasterWorkerMaxwalltime">00:40:00 >>>>>>> > 31 >>>>>>> > 50 >>>>>>> > 10 >>>>>>> > /work/00306/tg455797/swiftwork >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > You will need to also do these steps: >>>>>>> > >>>>>>> > Go to this web page to enable your Ranger account: >>>>>>> > >>>>>>> > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx >>>>>>> > >>>>>>> > Then login to Ranger via the TeraGrid portal and put your ssh >>>>>>> keys in >>>>>>> > place (assuming you use ssh keys, which you should) >>>>>>> > >>>>>>> > While on Ranger, do this: >>>>>>> > >>>>>>> > echo $WORK >>>>>>> > mkdir $work/swiftwork >>>>>>> > >>>>>>> > and put the full path of your $WORK/swiftwork directory in the >>>>>>> > element above. (My login is tg455etc, yours is >>>>>> yuechen) >>>>>>> > >>>>>>> > Then scp your code to Ranger and compile it. 
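[Editor's note: the one-time work-directory setup in Mike's checklist can be sketched as a shell snippet. On Ranger, $WORK is set by the system; the local fallback here only makes the sketch self-contained. Note the checklist says "mkdir $work/swiftwork" -- shell variables are case-sensitive, so $WORK is presumably intended:]

```shell
# One-time Swift work-directory setup (sketch of the checklist above).
# On Ranger, $WORK points at the work filesystem; fall back to a local
# directory so this runs anywhere for illustration.
WORK=${WORK:-$PWD/ranger-work-demo}
echo "$WORK"
mkdir -p "$WORK/swiftwork"
# The resulting full path goes into the <workdirectory> element of sites.xml.
echo "workdirectory: $WORK/swiftwork"
```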
>>>>>>> > >>>>>>> > Then create a tc.data entry for your ptmap app >>>>>>> > >>>>>>> > Next, set your time values in the sites.xml entry above to suitable >>>>>>> > values for Ranger. You'll need to measure times, but I think you >>>>>>> will >>>>>>> > find Ranger about twice as fast as Mercury for CPU-bound jobs. >>>>>>> > >>>>>>> > The values above were set for one app job per coaster. I think >>>>>> you can >>>>>>> > probably do more. >>>>>>> > >>>>>>> > If you estimate a run time of 5 minutes, use: >>>>>>> > >>>>>>> > >>>>>> > key="coasterWorkerMaxwalltime">00:30:00 >>>>>>> > 5 >>>>>>> > >>>>>>> > Other people on the list - please sanity check what I suggest here. >>>>>>> > >>>>>>> > - Mike >>>>>>> > >>>>>>> > >>>>>>> > On 4/30/09 12:40 PM, Michael Wilde wrote: >>>>>>> > > I just checked - TG-CDA070002T has indeed expired. >>>>>>> > > >>>>>>> > > The best for now is to move to use (only) Ranger, under this >>>>>> account: >>>>>>> > > TG-CCR080022N >>>>>>> > > >>>>>>> > > I will locate and send you a sites.xml entry in a moment. >>>>>>> > > >>>>>>> > > You need to go to a web page to activate your Ranger login. >>>>>>> > > >>>>>>> > > Best to contact me in IM and we can work this out. >>>>>>> > > >>>>>>> > > - Mike >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > On 4/30/09 12:23 PM, Michael Wilde wrote: >>>>>>> > >> Also, what account are you running under? We may need to change >>>>>>> you to >>>>>>> > >> a new account - as the OSG Training account expires today. >>>>>>> > >> If that happend at Noon, it *might* be the problem. 
>>>>>>> > >> - Mike

_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From hategan at mcs.anl.gov Thu Apr 30 17:36:32 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 30 Apr 2009 17:36:32 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: <49FA24FC.4040500@mcs.anl.gov> References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost> <49FA1A76.2010209@mcs.anl.gov> <49FA1FC9.7010108@mcs.anl.gov> <1241129599.1616.0.camel@localhost> <49FA24FC.4040500@mcs.anl.gov> Message-ID: <1241130992.1616.4.camel@localhost> On Thu, 2009-04-30 at 17:23 -0500, Michael Wilde wrote: > > On 4/30/09 5:13 PM, Mihael Hategan wrote: > >> GRAM Job submission failed because the job manager failed to open > stderr > >> (error code 74) > > > > That seems like an IP address problem. Make sure you set GLOBUS_HOSTNAME > > properly. > > OK, I will try that. But in the test below, I caused the error by > unsetting X509_CERT_DIR and fixed the error by resetting it - no other > changes. > > I *think* that as recently as a few weeks ago globus-job-run to ranger > worked with just @globus in my .soft file. > > Adding +osg-client seemed to make it work by setting X509_CERT_DIR. > > So as far as I can tell, at least at the level of globus-job-run, these > seems to be related to certs. > > Given what Im seeing, do you still think GLOBUS_HOSTNAME is a factor? I find it hard to tell what the osg client is doing or not doing. The gold standard is the plain globus gram client. Besides that, the cog globusrun is the next step (because that's what swift uses). 
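[Editor's note: the layered check Mihael describes (plain GRAM client first, then the CoG client that Swift actually uses) might look like the sketch below. The gatekeeper hostname is the one from this thread; the real probes are left as comments because they need a Globus/CoG install and a valid grid proxy:]

```shell
# Probe the submission stack bottom-up (sketch, not a definitive recipe).
GK=gatekeeper.ranger.tacc.teragrid.org
for tool in globus-job-run globusrun; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: installed"
    # With the tools present, run in order of increasing similarity to Swift:
    #   globus-job-run "$GK" /usr/bin/id     # plain GT2 GRAM client
    #   globusrun -a -r "$GK"                # GRAM authentication-only test
  else
    echo "$tool: not installed, skipping"
  fi
done
echo "probe script done"
```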
From hategan at mcs.anl.gov Thu Apr 30 17:37:53 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 30 Apr 2009 17:37:53 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost> <49FA1A76.2010209@mcs.anl.gov> <49FA1FC9.7010108@mcs.anl.gov> <1241129599.1616.0.camel@localhost> <49FA24FC.4040500@mcs.anl.gov> Message-ID: <1241131073.2060.0.camel@localhost> On Thu, 2009-04-30 at 17:32 -0500, Yue, Chen - BMD wrote: > Hi Michael, > > I already have +osg-client-1.0.0-r1 in my .soft file. But I change it > to +osg-client and tried again. "ranger" gave me the same error > message. In the meantime, I tested one job on both Abe and Lonestar > and they both gave me qsub error. Can you try "gt2:gt2:PBS" instead of "gt2:PBS" on all sites? > I attached as following: > > //////////////////////////////////// > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > sites.xml -tc.file tc.data > Swift 0.9rc2 swift-r2860 cog-r2388 > RunID: 20090430-1722-oncfdolb > Progress: > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1722-oncfdolb/info/3 on TACC_LoneStar > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1722-oncfdolb/info/5 on TACC_LoneStar > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1722-oncfdolb/info/7 on TACC_LoneStar > Progress: Failed:1 > Execution failed: > Exception in PTMap2: > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, > parameters.txt] > Host: TACC_LoneStar > Directory: PTMap2-unmod-20090430-1722-oncfdolb/jobs/7/PTMap2-7uagp5aj > stderr.txt: > 
stdout.txt: > ---- > Caused by: > Cannot submit job: Could not submit job (qsub reported an exit > code of -1). no error output > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job: Could not submit job (qsub reported an exit code of > -1). no error output > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > Caused by: > org.globus.cog.abstraction.impl.scheduler.common.ProcessException: > Could not submit job (qsub reported an exit code of -1). no error > output > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:94) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > ... 4 more > Cleaning up... > Shutting down service at https://129.114.50.32:34704 > Got channel MetaChannel: 2013263 -> GSSSChannel-null(1) > - Done > ///////////////////////////////////////// > > My sites.xml is at : /home/yuechen/PTMap2/sites.xml. I'm wondering if > this still relates to my setup. Thanks! 
> > Chen, Yue > > > > > ______________________________________________________________________ > From: Michael Wilde [mailto:wilde at mcs.anl.gov] > Sent: Thu 4/30/2009 5:23 PM > To: Mihael Hategan > Cc: swift-devel; Yue, Chen - BMD > Subject: Re: [Swift-devel] RE: [Swift-user] Execution error > > > > > On 4/30/09 5:13 PM, Mihael Hategan wrote: > >> GRAM Job submission failed because the job manager failed to open > stderr > >> (error code 74) > > > > That seems like an IP address problem. Make sure you set > GLOBUS_HOSTNAME > > properly. > > OK, I will try that. But in the test below, I caused the error by > unsetting X509_CERT_DIR and fixed the error by resetting it - no other > changes. > > I *think* that as recently as a few weeks ago globus-job-run to ranger > worked with just @globus in my .soft file. > > Adding +osg-client seemed to make it work by setting X509_CERT_DIR. > > So as far as I can tell, at least at the level of globus-job-run, > these > seems to be related to certs. > > Given what Im seeing, do you still think GLOBUS_HOSTNAME is a factor? > > - Mike > > > > On Thu, 2009-04-30 at 17:01 -0500, Michael Wilde wrote: > >> A bit more info on this: it *seems* like a cert issue. > >> > >> I last accessed Ranger via globus-job-run perhaps 2 weeks ago, no > problem. > >> > >> Yesterday, while debugging with Glen, globus-job-run was giving me > GRAM > >> err 74. (and GRM err 12 to all other sites) > >> > >> So I added +osg-client to my .soft file, and then globus-job-run > worked. > >> > >> But I noticed that my globus-job-run was still coming from the GT4 > dir, > >> not from an OSG dir. > >> > >> Just now I traced this back to X509_CERT_DIR: > >> > >> then I did: > >> > >> com$ unset X509_CERT_DIR > >> com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id > >> GRAM Job submission failed because the job manager failed to open > stderr > >> (error code 74) > > > > That seems like an IP address problem. 
Make sure you set > GLOBUS_HOSTNAME > > properly. > > > >> com$ > >> com$ > >> com$ X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA > >> com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id > >> GRAM Job submission failed because the job manager failed to open > stderr > >> (error code 74) > >> com$ export > X509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA > >> com$ globus-job-run gatekeeper.ranger.tacc.teragrid.org /usr/bin/id > >> uid=455797(tg455797) gid=80243(G-80243) > >> > groups=80243(G-80243),81031(G-81031),81411(G-81411),81611(G-81611),81613(G-81613),81621(G-81621),81747(G-81747),81792(G-81792),800744(G-800744),800745(G-800745),800889(G-800889),800981(G-800981),800983(G-800983),801271(G-801271),801364(G-801364) > >> com$ > >> > >> Mihael, does swift honor X509_CERT_DIR? If so, Glen, Yue, that is > >> something to try. > >> > >> You may need to put +osg-client this in your .soft file and > re-login: > >> > >> @python-2.5 > >> +java-sun > >> > >> +apache-ant > >> +gx-map > >> +condor > >> +gx-map > >> @globus-4 > >> @default > >> +R > >> +torque > >> +maui > >> +matlab-7.7 > >> +osg-client > >> > >> - Mike > >> > >> > >> > >> > >> > >> On 4/30/09 4:39 PM, Michael Wilde wrote: > >>> And we should also drill back down to why (at least yesterday) the > GT4 > >>> softev package failed, but the OSG client worked, for > globus-job-run. > >>> > >>> I guess its possible there is a host or CA cert issue here. > >>> > >>> - Mike > >>> > >>> > >>> On 4/30/09 4:31 PM, Mihael Hategan wrote: > >>>> Can you guys try to run first.swift on ranger with the settings > you have > >>>> (you'll need to add "echo" to tc.data)? > >>>> > >>>> > >>>> On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote: > >>>>> I have the identical response on ranger. It started yesterday > >>>>> evening. Possibly a problem that the TACC folks need to fix? 
> >>>>> > >>>>> Glen > >>>>> > >>>>> Yue, Chen - BMD wrote: > >>>>>> Hi Michael, > >>>>>> > >>>>>> Thank you for the advices. I tested ranger with 1 job and new > >>>>>> specifications of maxwalltime. It shows the following error > message. > >>>>>> I don't know if there is other problem with my setup. Thank > you! > >>>>>> > >>>>>> ///////////////////////////////////////////////// > >>>>>> [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift > -sites.file > >>>>>> sites.xml -tc.file tc.data > >>>>>> Swift 0.9rc2 swift-r2860 cog-r2388 > >>>>>> RunID: 20090430-1559-2vi6x811 > >>>>>> Progress: > >>>>>> Progress: Stage in:1 > >>>>>> Progress: Submitting:1 > >>>>>> Progress: Submitting:1 > >>>>>> Progress: Submitted:1 > >>>>>> Progress: Active:1 > >>>>>> Failed to transfer wrapper log from > >>>>>> PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger > >>>>>> Progress: Active:1 > >>>>>> Failed to transfer wrapper log from > >>>>>> PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger > >>>>>> Progress: Stage in:1 > >>>>>> Progress: Active:1 > >>>>>> Failed to transfer wrapper log from > >>>>>> PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger > >>>>>> Progress: Failed:1 > >>>>>> Execution failed: > >>>>>> Exception in PTMap2: > >>>>>> Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, > inputs-unmod.txt, > >>>>>> parameters.txt] > >>>>>> Host: ranger > >>>>>> Directory: > PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj > >>>>>> stderr.txt: > >>>>>> stdout.txt: > >>>>>> ---- > >>>>>> Caused by: > >>>>>> Failed to start worker: > >>>>>> null > >>>>>> null > >>>>>> org.globus.gram.GramException: The job manager detected an > invalid > >>>>>> script response > >>>>>> at > >>>>>> > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > >>>>>> > >>>>>> at org.globus.gram.GramJob.setStatus(GramJob.java:184) > >>>>>> at > >>>>>> > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) 
> >>>>>> at java.lang.Thread.run(Thread.java:619) > >>>>>> Cleaning up... > >>>>>> Shutting down service at https://129.114.50.163:45562 > >>>>>> > >>>>>> Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) > >>>>>> - Done > >>>>>> [yuechen at communicado PTMap2]$ > >>>>>> /////////////////////////////////////////////////////////// > >>>>>> > >>>>>> Chen, Yue > >>>>>> > >>>>>> > >>>>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > >>>>>> *Sent:* Thu 4/30/2009 3:02 PM > >>>>>> *To:* Yue, Chen - BMD; swift-devel > >>>>>> *Subject:* Re: [Swift-user] Execution error > >>>>>> > >>>>>> Back on list here (I only went off-list to discuss accounts, > etc) > >>>>>> > >>>>>> The problem in the run below is this: > >>>>>> > >>>>>> 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 > APPLICATION_EXCEPTION > >>>>>> jobid=PTMap2-abeii5aj - Application exception: Job cannot be > run with > >>>>>> the given max walltime worker constraint (task: 3000, \ > >>>>>> maxwalltime: 2400s) > >>>>>> > >>>>>> You have this on the ptmap app in your tc.data: > >>>>>> > >>>>>> globus::maxwalltime=50 > >>>>>> > >>>>>> But you only gave coasters 40 mins per coaster worker. So its > >>>>>> complaining that it cant run a 50 minute job in a 40 minute > (max) > >>>>>> coaster worker. ;) > >>>>>> > >>>>>> I mentioned in a prior mail that you need to set the two time > vals in > >>>>>> your sites.xml entry; thats what you need to do next, now. > >>>>>> > >>>>>> change the coaster time in your sites.xml to: > >>>>>> key="coasterWorkerMaxwalltime">00:51:00 > >>>>>> > >>>>>> If you have more info on the variability of your ptmap run > times, send > >>>>>> that to the list, and we can discuss how to handle. > >>>>>> > >>>>>> > >>>>>> (NOTE: doing grp -i of the log for "except" or scanning for > "except" > >>>>>> with an editor will often locate the first "exception" that > your job > >>>>>> encountered. Thats how I found the error above). 
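The arithmetic behind that APPLICATION_EXCEPTION is simply that a job's declared maxwalltime (in minutes) must fit inside the lifetime of the coaster worker that would run it. Spelled out as the two settings involved — profile syntax sketched here, since the mail quoting stripped the XML tags from the original:

```xml
<!-- sites.xml: each coaster worker lives 51 minutes ... -->
<profile namespace="globus" key="coasterWorkerMaxwalltime">00:51:00</profile>
<!-- ... so the 50-minute job declared in tc.data (globus::maxwalltime=50)
     now fits; with the earlier 00:40:00 limit the scheduler rightly
     refused to place a 50-minute task in a 40-minute worker -->
```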
> >>>>>> > >>>>>> Also, Yue, for testing new sites, or for validating that old > sites > >>>>>> still > >>>>>> work, you should create the smallest possible ptmap workflow - > 1 job if > >>>>>> that is possible - and verify that this works. Then say 10 > jobs to > >>>>>> make > >>>>>> sure scheduling etc is sane. Then, send in your huge jobs. > >>>>>> > >>>>>> With only 1 job, its easier to spot the errors in the log file. > >>>>>> > >>>>>> - Mike > >>>>>> > >>>>>> > >>>>>> On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: > >>>>>>> Hi Michael, > >>>>>>> > >>>>>>> I run into the same messages again when I use Ranger: > >>>>>>> > >>>>>>> Progress: Selecting site:146 Stage in:25 Submitting:15 > >>>>>>> Submitted:821 > >>>>>>> Failed but can retry:16 > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger > >>>>>>> Progress: Selecting site:146 Stage in:3 Submitting:1 > Submitted:857 > >>>>>>> Failed but can retry:16 > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > >>>>>>> Failed to transfer wrapper log from 
> >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger > >>>>>>> Failed to transfer wrapper log from > >>>>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > >>>>>>> The log for the search is at : > >>>>>>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log > >>>>>>> > >>>>>>> The sites.xml I have is: > >>>>>>> > >>>>>>> > >>>>>>> >>>>>>> url="gatekeeper.ranger.tacc.teragrid.org" > >>>>>>> jobManager="gt2:gt2:SGE"/> > >>>>>>> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> > >>>>>>> >>>>>>> > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > >>>>>>> key="project">TG-CCR080022N > >>>>>>> key="coastersPerNode">16 > >>>>>>> key="queue">development > >>>>>>> >>>>>>> > key="coasterWorkerMaxwalltime">00:40:00 > >>>>>>> key="maxwalltime">31 > >>>>>>> key="initialScore">50 > >>>>>>> key="jobThrottle">10 > >>>>>>> > /work/01164/yuechen/swiftwork > >>>>>>> > >>>>>>> The tc.data I have is: > >>>>>>> > >>>>>>> ranger PTMap2 > >>>>>>> /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > >>>>>>> INTEL32::LINUX globus::maxwalltime=50 > >>>>>>> > >>>>>>> I'm using swift 0.9 rc2 > >>>>>>> > >>>>>>> Thank you very much for help! 
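The sites.xml Yue pastes above has had its XML tags eaten by the mail quoting. Reassembled from the attributes and keys that survive, the ranger pool entry was presumably of this shape; the element names and profile namespaces are reconstructed from typical Swift 0.9 site catalogs, not from the mangled quote, so treat this as a sketch:

```xml
<pool handle="ranger">
  <execution provider="coaster" url="gatekeeper.ranger.tacc.teragrid.org"
             jobManager="gt2:gt2:SGE"/>
  <gridftp url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/"/>
  <profile namespace="env" key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir</profile>
  <profile namespace="globus" key="project">TG-CCR080022N</profile>
  <profile namespace="globus" key="coastersPerNode">16</profile>
  <profile namespace="globus" key="queue">development</profile>
  <profile namespace="globus" key="coasterWorkerMaxwalltime">00:40:00</profile>
  <profile namespace="globus" key="maxwalltime">31</profile>
  <profile namespace="karajan" key="initialScore">50</profile>
  <profile namespace="karajan" key="jobThrottle">10</profile>
  <workdirectory>/work/01164/yuechen/swiftwork</workdirectory>
</pool>
```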
> >>>>>>> > >>>>>>> Chen, Yue > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > ------------------------------------------------------------------------ > >>>>>>> > >>>>>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > >>>>>>> *Sent:* Thu 4/30/2009 2:05 PM > >>>>>>> *To:* Yue, Chen - BMD > >>>>>>> *Subject:* Re: [Swift-user] Execution error > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: > >>>>>>> > Hi Michael, > >>>>>>> > > >>>>>>> > When I tried to activate my account, I encountered the > following > >>>>>> error: > >>>>>>> > > >>>>>>> > "Sorry, this account is in an invalid state. You may not > activate > >>>>>> your > >>>>>>> > at this time." > >>>>>>> > > >>>>>>> > I used the username and password from TG-CDA070002T. Should > I use a > >>>>>>> > different password? > >>>>>>> > >>>>>>> If you can already login to Ranger, then you are all set - you > must > >>>>>>> have > >>>>>>> done this previously. > >>>>>>> > >>>>>>> I thought you had *not*, because when I looked up your login > on ranger > >>>>>>> ("finger yuechen") it said "never logged in". But seems like > that info > >>>>>>> is incorrect. > >>>>>>> > >>>>>>> If you have ptmap compiled, seems like you are almost all set. > >>>>>>> > >>>>>>> Let me know if it works. > >>>>>>> > >>>>>>> - Mike > >>>>>>> > >>>>>>> > Thanks! 
> >>>>>>> > > >>>>>>> > Chen, Yue > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>> > ------------------------------------------------------------------------ > >>>>>> > >>>>>>> > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > >>>>>>> > *Sent:* Thu 4/30/2009 1:07 PM > >>>>>>> > *To:* Yue, Chen - BMD > >>>>>>> > *Cc:* swift user > >>>>>>> > *Subject:* Re: [Swift-user] Execution error > >>>>>>> > > >>>>>>> > Yue, use this XML pool element to access ranger: > >>>>>>> > > >>>>>>> > > >>>>>>> > >>>>>>> > url="gatekeeper.ranger.tacc.teragrid.org" > >>>>>>> > jobManager="gt2:gt2:SGE"/> > >>>>>>> > >>>>>> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> > >>>>>>> > >>>>>>> > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > >>>>>>> > >>>>>> key="project">TG-CCR080022N > >>>>>>> > key="coastersPerNode">16 > >>>>>>> > key="queue">development > >>>>>>> > >>>>>>> > > key="coasterWorkerMaxwalltime">00:40:00 > >>>>>>> > key="maxwalltime">31 > >>>>>>> > key="initialScore">50 > >>>>>>> > key="jobThrottle">10 > >>>>>>> > > /work/00306/tg455797/swiftwork > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > You will need to also do these steps: > >>>>>>> > > >>>>>>> > Go to this web page to enable your Ranger account: > >>>>>>> > > >>>>>>> > > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx > >>>>>>> > > >>>>>>> > Then login to Ranger via the TeraGrid portal and put your > ssh > >>>>>>> keys in > >>>>>>> > place (assuming you use ssh keys, which you should) > >>>>>>> > > >>>>>>> > While on Ranger, do this: > >>>>>>> > > >>>>>>> > echo $WORK > >>>>>>> > mkdir $work/swiftwork > >>>>>>> > > >>>>>>> > and put the full path of your $WORK/swiftwork directory in > the > >>>>>>> > element above. (My login is tg455etc, yours > is > >>>>>> yuechen) > >>>>>>> > > >>>>>>> > Then scp your code to Ranger and compile it. 
> >>>>>>> > > >>>>>>> > Then create a tc.data entry for your ptmap app > >>>>>>> > > >>>>>>> > Next, set your time values in the sites.xml entry above to > suitable > >>>>>>> > values for Ranger. You'll need to measure times, but I > think you > >>>>>>> will > >>>>>>> > find Ranger about twice as fast as Mercury for CPU-bound > jobs. > >>>>>>> > > >>>>>>> > The values above were set for one app job per coaster. I > think > >>>>>> you can > >>>>>>> > probably do more. > >>>>>>> > > >>>>>>> > If you estimate a run time of 5 minutes, use: > >>>>>>> > > >>>>>>> > >>>>>>> > > key="coasterWorkerMaxwalltime">00:30:00 > >>>>>>> > key="maxwalltime">5 > >>>>>>> > > >>>>>>> > Other people on the list - please sanity check what I > suggest here. > >>>>>>> > > >>>>>>> > - Mike > >>>>>>> > > >>>>>>> > > >>>>>>> > On 4/30/09 12:40 PM, Michael Wilde wrote: > >>>>>>> > > I just checked - TG-CDA070002T has indeed expired. > >>>>>>> > > > >>>>>>> > > The best for now is to move to use (only) Ranger, under > this > >>>>>> account: > >>>>>>> > > TG-CCR080022N > >>>>>>> > > > >>>>>>> > > I will locate and send you a sites.xml entry in a > moment. > >>>>>>> > > > >>>>>>> > > You need to go to a web page to activate your Ranger > login. > >>>>>>> > > > >>>>>>> > > Best to contact me in IM and we can work this out. > >>>>>>> > > > >>>>>>> > > - Mike > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > On 4/30/09 12:23 PM, Michael Wilde wrote: > >>>>>>> > >> Also, what account are you running under? We may need > to change > >>>>>>> you to > >>>>>>> > >> a new account - as the OSG Training account expires > today. > >>>>>>> > >> If that happend at Noon, it *might* be the problem. 
> >>>>>>> > >> > >>>>>>> > >> - Mike > >>>>>>> > >> > >>>>>>> > >> > >>>>>>> > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: > >>>>>>> > >>> Hi, > >>>>>>> > >>> > >>>>>>> > >>> I came back to re-run my application on NCSA Mercury > which was > >>>>>>> tested > >>>>>>> > >>> successfully last week after I just set up coasters > with > >>>>>> swift 0.9, > >>>>>>> > >>> but I got many messages like the following: > >>>>>>> > >>> > >>>>>>> > >>> Progress: Stage in:219 Submitting:803 Submitted:1 > >>>>>>> > >>> Progress: Stage in:129 Submitting:703 Submitted:190 > Failed > >>>>>>> but can > >>>>>>> > >>> retry:1 > >>>>>>> > >>> Progress: Stage in:38 Submitting:425 Submitted:556 > Failed > >>>>>> but can > >>>>>>> > >>> retry:4 > >>>>>>> > >>> Failed to transfer wrapper log from > >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on > NCSA_MERCURY > >>>>>>> > >>> Failed to transfer wrapper log from > >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on > NCSA_MERCURY > >>>>>>> > >>> Failed to transfer wrapper log from > >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on > NCSA_MERCURY > >>>>>>> > >>> Failed to transfer wrapper log from > >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on > NCSA_MERCURY > >>>>>>> > >>> Failed to transfer wrapper log from > >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on > NCSA_MERCURY > >>>>>>> > >>> Failed to transfer wrapper log from > >>>>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on > NCSA_MERCURY > >>>>>>> > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed > but can > >>>>>>> retry:8 > >>>>>>> > >>> Progress: Submitted:1011 Active:1 Failed but can > retry:11 > >>>>>>> > >>> The log file for the successful run last week is ; > >>>>>>> > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log > >>>>>>> > >>> > >>>>>>> > >>> The log file for the failed run is : > >>>>>>> > > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log > >>>>>>> > >>> > >>>>>>> 
> >>> I don't think I did anything different, so I don't > know why > >>>>>>> this > >>>>>>> time > >>>>>>> > >>> they failed. The sites.xml for Mercury is: > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> url="gsiftp://gridftp-hg.ncsa.teragrid.org"/> > >>>>>>> > >>> >>>>>> url="grid-hg.ncsa.teragrid.org" > >>>>>>> > >>> jobManager="gt2:PBS"/> > >>>>>>> > >>> > >>>>>> /gpfs_scratch1/yuechen/swiftwork > >>>>>>> > >>> key="queue">debug > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> Thank you for help! > >>>>>>> > >>> > >>>>>>> > >>> Chen, Yue > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> This email is intended only for the use of the > individual or > >>>>>> entity > >>>>>>> > >>> to which it is addressed and may contain information > that is > >>>>>>> > >>> privileged and confidential. If the reader of this > email > >>>>>> message is > >>>>>>> > >>> not the intended recipient, you are hereby notified > that any > >>>>>>> > >>> dissemination, distribution, or copying of this > >>>>>>> communication is > >>>>>>> > >>> prohibited. If you have received this email in error, > please > >>>>>> notify > >>>>>>> > >>> the sender and destroy/delete all copies of the > transmittal. > >>>>>>> Thank you. 
> >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > >>> > >>>>>>> > > >>>>>> > ------------------------------------------------------------------------ > >>>>>> > >>>>>>> > >>> > >>>>>>> > >>> _______________________________________________ > >>>>>>> > >>> Swift-user mailing list > >>>>>>> > >>> Swift-user at ci.uchicago.edu > >>>>>>> > >>> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > >>>>>>> > >> _______________________________________________ > >>>>>>> > >> Swift-user mailing list > >>>>>>> > >> Swift-user at ci.uchicago.edu > >>>>>>> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > >>>>>>> > > _______________________________________________ > >>>>>>> > > Swift-user mailing list > >>>>>>> > > Swift-user at ci.uchicago.edu > >>>>>>> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > This email is intended only for the use of the individual > or > >>>>>> entity to > >>>>>>> > which it is addressed and may contain information that is > >>>>>> privileged and > >>>>>>> > confidential. If the reader of this email message is not > the > >>>>>>> intended > >>>>>>> > recipient, you are hereby notified that any dissemination, > >>>>>> distribution, > >>>>>>> > or copying of this communication is prohibited. If you have > >>>>>>> received > >>>>>>> > this email in error, please notify the sender and > destroy/delete > >>>>>>> all > >>>>>>> > copies of the transmittal. Thank you. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> This email is intended only for the use of the individual or > entity to > >>>>>>> which it is addressed and may contain information that is > >>>>>>> privileged and > >>>>>>> confidential. If the reader of this email message is not the > intended > >>>>>>> recipient, you are hereby notified that any dissemination, > >>>>>>> distribution, > >>>>>>> or copying of this communication is prohibited. 
If you have > received > >>>>>>> this email in error, please notify the sender and > destroy/delete all > >>>>>>> copies of the transmittal. Thank you. > >>>>>> > >>>>>> > >>>>>> > >>>>>> This email is intended only for the use of the individual or > entity > >>>>>> to which it is addressed and may contain information that is > >>>>>> privileged and confidential. If the reader of this email > message is > >>>>>> not the intended recipient, you are hereby notified that any > >>>>>> dissemination, distribution, or copying of this communication > is > >>>>>> prohibited. If you have received this email in error, please > notify > >>>>>> the sender and destroy/delete all copies of the transmittal. > Thank you. > >>>>>> > ------------------------------------------------------------------------ > >>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-devel mailing list > >>>>>> Swift-devel at ci.uchicago.edu > >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>> > >>>>> _______________________________________________ > >>>>> Swift-devel mailing list > >>>>> Swift-devel at ci.uchicago.edu > >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > This email is intended only for the use of the individual or entity to > which it is addressed and may contain information that is privileged > and confidential. If the reader of this email message is not the > intended recipient, you are hereby notified that any dissemination, > distribution, or copying of this communication is prohibited. 
If you > have received this email in error, please notify the sender and > destroy/delete all copies of the transmittal. Thank you. From yuechen at bsd.uchicago.edu Thu Apr 30 17:57:40 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Thu, 30 Apr 2009 17:57:40 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov><49F9E8FB.9020500@mcs.anl.gov><49F9F680.6040503@mcs.anl.gov><49FA03EA.7080807@mcs.anl.gov><49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost><49FA1A76.2010209@mcs.anl.gov> <49FA1FC9.7010108@mcs.anl.gov><1241129599.1616.0.camel@localhost> <49FA24FC.4040500@mcs.anl.gov> <1241131073.2060.0.camel@localhost> Message-ID: Hi Mihael, The error message is different after I change from gt2:PBS to gt2:gt2:PBS on TACC LoneStar. I attached the trace as following. I don't know if there is still setup problem. Thank you very much! ///////////////////////// [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file sites.xml -tc.file tc.data Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090430-1742-tfabq6q4 Progress: Progress: Stage in:1 Progress: Stage in:1 Progress: Stage in:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Active:1 Failed to transfer wrapper log from PTMap2-unmod-20090430-1742-tfabq6q4/info/r on TACC_LoneStar Progress: Active:1 Failed to transfer wrapper log from PTMap2-unmod-20090430-1742-tfabq6q4/info/t on TACC_LoneStar Progress: Stage in:1 Progress: Active:1 Failed to transfer wrapper log from PTMap2-unmod-20090430-1742-tfabq6q4/info/v on TACC_LoneStar Progress: Failed:1 Execution failed: Exception in PTMap2: Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, parameters.txt] Host: TACC_LoneStar Directory: PTMap2-unmod-20090430-1742-tfabq6q4/jobs/v/PTMap2-v7baq5aj stderr.txt: stdout.txt: ---- Caused by: Cannot submit job org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job at 
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:145) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) Caused by: org.globus.gram.GramException: The gatekeeper failed to find the requested service at org.globus.gram.Gram.checkHttpReply(Gram.java:137) at org.globus.gram.Gram.request(Gram.java:342) at org.globus.gram.GramJob.request(GramJob.java:262) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) ... 5 more Cleaning up... Shutting down service at https://129.114.50.32:35257 Got channel MetaChannel: 13391897 -> GSSSChannel-null(1) - Done //////////////////////// Chen, Yue ________________________________ From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Thu 4/30/2009 5:37 PM To: Yue, Chen - BMD Cc: Michael Wilde; swift-devel Subject: RE: [Swift-devel] RE: [Swift-user] Execution error On Thu, 2009-04-30 at 17:32 -0500, Yue, Chen - BMD wrote: > Hi Michael, > > I already have +osg-client-1.0.0-r1 in my .soft file. But I change it > to +osg-client and tried again. "ranger" gave me the same error > message. In the meantime, I tested one job on both Abe and Lonestar > and they both gave me qsub error. Can you try "gt2:gt2:PBS" instead of "gt2:PBS" on all sites? 
> I attached as following: > > //////////////////////////////////// > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > sites.xml -tc.file tc.data > Swift 0.9rc2 swift-r2860 cog-r2388 > RunID: 20090430-1722-oncfdolb > Progress: > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1722-oncfdolb/info/3 on TACC_LoneStar > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1722-oncfdolb/info/5 on TACC_LoneStar > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1722-oncfdolb/info/7 on TACC_LoneStar > Progress: Failed:1 > Execution failed: > Exception in PTMap2: > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, > parameters.txt] > Host: TACC_LoneStar > Directory: PTMap2-unmod-20090430-1722-oncfdolb/jobs/7/PTMap2-7uagp5aj > stderr.txt: > stdout.txt: > ---- > Caused by: > Cannot submit job: Could not submit job (qsub reported an exit > code of -1). no error output > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job: Could not submit job (qsub reported an exit code of > -1). no error output > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > Caused by: > org.globus.cog.abstraction.impl.scheduler.common.ProcessException: > Could not submit job (qsub reported an exit code of -1). 
no error > output > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:94) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > ... 4 more > Cleaning up... > Shutting down service at https://129.114.50.32:34704 > Got channel MetaChannel: 2013263 -> GSSSChannel-null(1) > - Done > ///////////////////////////////////////// > > My sites.xml is at : /home/yuechen/PTMap2/sites.xml. I'm wondering if > this still relates to my setup. Thanks! > > Chen, Yue > > > > This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Apr 30 18:08:26 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 30 Apr 2009 18:08:26 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost> <49FA1A76.2010209@mcs.anl.gov> <49FA1FC9.7010108@mcs.anl.gov> <1241129599.1616.0.camel@localhost> <49FA24FC.4040500@mcs.anl.gov> <1241131073.2060.0.camel@localhost> Message-ID: <1241132906.2816.0.camel@localhost> Odd. Try "gt2:gt2:pbs" instead of "gt2:gt2:PBS". 
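The string Mihael keeps adjusting is the three-part coaster job-manager specification: roughly provider-used-to-reach-the-gatekeeper : provider-the-coaster-service-uses-to-start-workers : local-scheduler. The lowercase suggestion matters because the last component selects a provider by name ("pbs"), not the scheduler's display name. As a sketch — the execution element is reconstructed since the quotes stripped its tags, and the gatekeeper hostname here is illustrative, not the real LoneStar contact string:

```xml
<!-- coasters bootstrapped over GT2 GRAM; workers then submitted by the
     coaster service through the local pbs provider -->
<execution provider="coaster" url="gatekeeper.example.teragrid.org"
           jobManager="gt2:gt2:pbs"/>
```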
On Thu, 2009-04-30 at 17:57 -0500, Yue, Chen - BMD wrote: > Hi Mihael, > > The error message is different after I change from gt2:PBS to > gt2:gt2:PBS on TACC LoneStar. I attached the trace as following. I > don't know if there is still setup problem. Thank you very much! > > ///////////////////////// > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > sites.xml -tc.file tc.data > Swift 0.9rc2 swift-r2860 cog-r2388 > RunID: 20090430-1742-tfabq6q4 > Progress: > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1742-tfabq6q4/info/r on TACC_LoneStar > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1742-tfabq6q4/info/t on TACC_LoneStar > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1742-tfabq6q4/info/v on TACC_LoneStar > Progress: Failed:1 > Execution failed: > Exception in PTMap2: > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, > parameters.txt] > Host: TACC_LoneStar > Directory: PTMap2-unmod-20090430-1742-tfabq6q4/jobs/v/PTMap2-v7baq5aj > stderr.txt: > stdout.txt: > ---- > Caused by: > Cannot submit job > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:145) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > at > 
org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > Caused by: org.globus.gram.GramException: The gatekeeper failed to > find the requested service > at org.globus.gram.Gram.checkHttpReply(Gram.java:137) > at org.globus.gram.Gram.request(Gram.java:342) > at org.globus.gram.GramJob.request(GramJob.java:262) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) > ... 5 more > Cleaning up... > Shutting down service at https://129.114.50.32:35257 > Got channel MetaChannel: 13391897 -> GSSSChannel-null(1) > - Done > //////////////////////// > > Chen, Yue > > > ______________________________________________________________________ > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Thu 4/30/2009 5:37 PM > To: Yue, Chen - BMD > Cc: Michael Wilde; swift-devel > Subject: RE: [Swift-devel] RE: [Swift-user] Execution error > > > On Thu, 2009-04-30 at 17:32 -0500, Yue, Chen - BMD wrote: > > Hi Michael, > > > > I already have +osg-client-1.0.0-r1 in my .soft file. But I change > it > > to +osg-client and tried again. "ranger" gave me the same error > > message. In the meantime, I tested one job on both Abe and Lonestar > > and they both gave me qsub error. > > Can you try "gt2:gt2:PBS" instead of "gt2:PBS" on all sites? 
> > > I attached as following: > > > > //////////////////////////////////// > > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > > sites.xml -tc.file tc.data > > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090430-1722-oncfdolb > > Progress: > > Progress: Stage in:1 > > Progress: Stage in:1 > > Progress: Stage in:1 > > Progress: Submitting:1 > > Progress: Submitted:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1722-oncfdolb/info/3 on TACC_LoneStar > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1722-oncfdolb/info/5 on TACC_LoneStar > > Progress: Stage in:1 > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1722-oncfdolb/info/7 on TACC_LoneStar > > Progress: Failed:1 > > Execution failed: > > Exception in PTMap2: > > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, > > parameters.txt] > > Host: TACC_LoneStar > > Directory: > PTMap2-unmod-20090430-1722-oncfdolb/jobs/7/PTMap2-7uagp5aj > > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > > Cannot submit job: Could not submit job (qsub reported an > exit > > code of -1). no error output > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Cannot submit job: Could not submit job (qsub reported an exit code > of > > -1). 
no error output > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) > > at > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > > Caused by: > > org.globus.cog.abstraction.impl.scheduler.common.ProcessException: > > Could not submit job (qsub reported an exit code of -1). no error > > output > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:94) > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > > ... 4 more > > Cleaning up... > > Shutting down service at https://129.114.50.32:34704 > > Got channel MetaChannel: 2013263 -> GSSSChannel-null(1) > > - Done > > ///////////////////////////////////////// > > > > My sites.xml is at : /home/yuechen/PTMap2/sites.xml. I'm wondering > if > > this still relates to my setup. Thanks! > > > > Chen, Yue > > > > > > > > > > > > > > > This email is intended only for the use of the individual or entity to > which it is addressed and may contain information that is privileged > and confidential. If the reader of this email message is not the > intended recipient, you are hereby notified that any dissemination, > distribution, or copying of this communication is prohibited. If you > have received this email in error, please notify the sender and > destroy/delete all copies of the transmittal. Thank you. 
From yuechen at bsd.uchicago.edu Thu Apr 30 18:12:19 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Thu, 30 Apr 2009 18:12:19 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov><49F9E8FB.9020500@mcs.anl.gov><49F9F680.6040503@mcs.anl.gov><49FA03EA.7080807@mcs.anl.gov><49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost><49FA1A76.2010209@mcs.anl.gov> <49FA1FC9.7010108@mcs.anl.gov><1241129599.1616.0.camel@localhost> <49FA24FC.4040500@mcs.anl.gov><1241131073.2060.0.camel@localhost> <1241132906.2816.0.camel@localhost> Message-ID: Hi Mihael, I tried and the error message is the same. I attached the following trace. /////////////////////////////// [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file sites.xml -tc.file tc.data Swift 0.9rc2 swift-r2860 cog-r2388 RunID: 20090430-1809-ppue84jc Progress: Progress: Stage in:1 Progress: Stage in:1 Progress: Stage in:1 Progress: Submitting:1 Progress: Submitting:1 Progress: Submitted:1 Progress: Active:1 Failed to transfer wrapper log from PTMap2-unmod-20090430-1809-ppue84jc/info/m on TACC_LoneStar Progress: Active:1 Failed to transfer wrapper log from PTMap2-unmod-20090430-1809-ppue84jc/info/o on TACC_LoneStar Progress: Stage in:1 Progress: Active:1 Failed to transfer wrapper log from PTMap2-unmod-20090430-1809-ppue84jc/info/q on TACC_LoneStar Progress: Failed:1 Execution failed: Exception in PTMap2: Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, parameters.txt] Host: TACC_LoneStar Directory: PTMap2-unmod-20090430-1809-ppue84jc/jobs/q/PTMap2-q3acr5aj stderr.txt: stdout.txt: ---- Caused by: Cannot submit job org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:145) at 
org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) at org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) Caused by: org.globus.gram.GramException: The gatekeeper failed to find the requested service at org.globus.gram.Gram.checkHttpReply(Gram.java:137) at org.globus.gram.Gram.request(Gram.java:342) at org.globus.gram.GramJob.request(GramJob.java:262) at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) ... 5 more Cleaning up... Shutting down service at https://129.114.50.32:35904 Got channel MetaChannel: 16618296 -> GSSSChannel-null(1) - Done ////////////////////////////////// My sites.xml for TACC Lonestar is : /tmp/scratch/yc/swiftwork Thanks! Chen, Yue ________________________________ From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Thu 4/30/2009 6:08 PM To: Yue, Chen - BMD Cc: Michael Wilde; swift-devel Subject: RE: [Swift-devel] RE: [Swift-user] Execution error Odd. Try "gt2:gt2:pbs" instead of "gt2:gt2:PBS". On Thu, 2009-04-30 at 17:57 -0500, Yue, Chen - BMD wrote: > Hi Mihael, > > The error message is different after I change from gt2:PBS to > gt2:gt2:PBS on TACC LoneStar. I attached the trace as following. I > don't know if there is still setup problem. Thank you very much! 
> > ///////////////////////// > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > sites.xml -tc.file tc.data > Swift 0.9rc2 swift-r2860 cog-r2388 > RunID: 20090430-1742-tfabq6q4 > Progress: > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1742-tfabq6q4/info/r on TACC_LoneStar > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1742-tfabq6q4/info/t on TACC_LoneStar > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1742-tfabq6q4/info/v on TACC_LoneStar > Progress: Failed:1 > Execution failed: > Exception in PTMap2: > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, > parameters.txt] > Host: TACC_LoneStar > Directory: PTMap2-unmod-20090430-1742-tfabq6q4/jobs/v/PTMap2-v7baq5aj > stderr.txt: > stdout.txt: > ---- > Caused by: > Cannot submit job > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:145) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > Caused by: org.globus.gram.GramException: The gatekeeper failed to > find the requested service > at org.globus.gram.Gram.checkHttpReply(Gram.java:137) > at org.globus.gram.Gram.request(Gram.java:342) > 
at org.globus.gram.GramJob.request(GramJob.java:262) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) > ... 5 more > Cleaning up... > Shutting down service at https://129.114.50.32:35257 > Got channel MetaChannel: 13391897 -> GSSSChannel-null(1) > - Done > //////////////////////// > > Chen, Yue > > This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Apr 30 18:36:21 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 30 Apr 2009 18:36:21 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> <1241127073.922.1.camel@localhost> <49FA1A76.2010209@mcs.anl.gov> <49FA1FC9.7010108@mcs.anl.gov> <1241129599.1616.0.camel@localhost> <49FA24FC.4040500@mcs.anl.gov> <1241131073.2060.0.camel@localhost> <1241132906.2816.0.camel@localhost> Message-ID: <1241134581.3266.0.camel@localhost> On Thu, 2009-04-30 at 18:12 -0500, Yue, Chen - BMD wrote: > Hi Mihael, > > I tried and the error message is the same. I attached the following > trace. Well, lonestar doesn't use PBS, but LSF. So for lonestar, use gt2:gt2:lsf. 
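[Editor's note] In other words, only the last component of the jobManager string changes; the gt2:gt2 prefix (coaster service bootstrapped and submitted via GRAM2) stays the same. A hedged sketch of the corrected Lonestar entry, reusing the gatekeeper and work directory from the sites.xml fragment quoted below (element layout assumed from those fragments):

```
<!-- Hedged sketch: same pool as before, with LSF as the local scheduler,
     since Lonestar runs LSF rather than PBS. -->
<pool handle="TACC_LoneStar">
  <execution provider="coaster"
             url="gatekeeper.lonestar.tacc.teragrid.org"
             jobManager="gt2:gt2:lsf"/>
  <workdirectory>/tmp/scratch/yc/swiftwork</workdirectory>
</pool>
```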
> > /////////////////////////////// > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > sites.xml -tc.file tc.data > Swift 0.9rc2 swift-r2860 cog-r2388 > RunID: 20090430-1809-ppue84jc > Progress: > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Stage in:1 > Progress: Submitting:1 > Progress: Submitting:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1809-ppue84jc/info/m on TACC_LoneStar > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1809-ppue84jc/info/o on TACC_LoneStar > Progress: Stage in:1 > Progress: Active:1 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1809-ppue84jc/info/q on TACC_LoneStar > Progress: Failed:1 > Execution failed: > Exception in PTMap2: > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, > parameters.txt] > Host: TACC_LoneStar > Directory: PTMap2-unmod-20090430-1809-ppue84jc/jobs/q/PTMap2-q3acr5aj > stderr.txt: > stdout.txt: > ---- > Caused by: > Cannot submit job > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:145) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > at > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > Caused by: org.globus.gram.GramException: The gatekeeper failed to > find the requested service > at org.globus.gram.Gram.checkHttpReply(Gram.java:137) > at 
org.globus.gram.Gram.request(Gram.java:342) > at org.globus.gram.GramJob.request(GramJob.java:262) > at > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) > ... 5 more > Cleaning up... > Shutting down service at https://129.114.50.32:35904 > Got channel MetaChannel: 16618296 -> GSSSChannel-null(1) > - Done > ////////////////////////////////// > > My sites.xml for TACC Lonestar is : > > > > url="gatekeeper.lonestar.tacc.teragrid.org" jobManager="gt2:gt2:pbs"/> > /tmp/scratch/yc/swiftwork > > > Thanks! > > Chen, Yue > > > > ______________________________________________________________________ > From: Mihael Hategan [mailto:hategan at mcs.anl.gov] > Sent: Thu 4/30/2009 6:08 PM > To: Yue, Chen - BMD > Cc: Michael Wilde; swift-devel > Subject: RE: [Swift-devel] RE: [Swift-user] Execution error > > > Odd. Try "gt2:gt2:pbs" instead of "gt2:gt2:PBS". > > On Thu, 2009-04-30 at 17:57 -0500, Yue, Chen - BMD wrote: > > Hi Mihael, > > > > The error message is different after I change from gt2:PBS to > > gt2:gt2:PBS on TACC LoneStar. I attached the trace as following. I > > don't know if there is still setup problem. Thank you very much! 
> > > > ///////////////////////// > > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > > sites.xml -tc.file tc.data > > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090430-1742-tfabq6q4 > > Progress: > > Progress: Stage in:1 > > Progress: Stage in:1 > > Progress: Stage in:1 > > Progress: Submitting:1 > > Progress: Submitted:1 > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1742-tfabq6q4/info/r on TACC_LoneStar > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1742-tfabq6q4/info/t on TACC_LoneStar > > Progress: Stage in:1 > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1742-tfabq6q4/info/v on TACC_LoneStar > > Progress: Failed:1 > > Execution failed: > > Exception in PTMap2: > > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta02, inputs-unmod.txt, > > parameters.txt] > > Host: TACC_LoneStar > > Directory: > PTMap2-unmod-20090430-1742-tfabq6q4/jobs/v/PTMap2-v7baq5aj > > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > > Cannot submit job > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Cannot submit job > > at > > > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:145) > > at > > > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:99) > > at > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:43) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.startWorker(WorkerManager.java:221) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.run(WorkerManager.java:145) > > Caused by: org.globus.gram.GramException: The gatekeeper failed to > > find the requested 
service > > at org.globus.gram.Gram.checkHttpReply(Gram.java:137) > > at org.globus.gram.Gram.request(Gram.java:342) > > at org.globus.gram.GramJob.request(GramJob.java:262) > > at > > > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:133) > > ... 5 more > > Cleaning up... > > Shutting down service at https://129.114.50.32:35257 > > Got channel MetaChannel: 13391897 -> GSSSChannel-null(1) > > - Done > > //////////////////////// > > > > Chen, Yue > > > > > > > > > > > This email is intended only for the use of the individual or entity to > which it is addressed and may contain information that is privileged > and confidential. If the reader of this email message is not the > intended recipient, you are hereby notified that any dissemination, > distribution, or copying of this communication is prohibited. If you > have received this email in error, please notify the sender and > destroy/delete all copies of the transmittal. Thank you. From hategan at mcs.anl.gov Thu Apr 30 18:54:26 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 30 Apr 2009 18:54:26 -0500 Subject: [Swift-devel] RE: [Swift-user] Execution error In-Reply-To: <49FA147E.6070205@uchicago.edu> References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> <49F9E8FB.9020500@mcs.anl.gov> <49F9F680.6040503@mcs.anl.gov> <49FA03EA.7080807@mcs.anl.gov> <49FA147E.6070205@uchicago.edu> Message-ID: <1241135666.3603.1.camel@localhost> Mystery solved: Thu Apr 30 18:19:13 2009 JM_SCRIPT: ERROR: job submission failed: Thu Apr 30 18:19:13 2009 JM_SCRIPT: ------------------------------------------------------------------------ Welcome to TACC's Ranger System, an NSF TeraGrid Resource ------------------------------------------------------------------------ --> Submitting 16 tasks... --> Submitting 16 tasks/host... --> Submitting exclusive job to 1 hosts... --> Verifying HOME file-system availability... 
--> Verifying WORK file-system availability... --> Verifying SCRATCH file-system availability... --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits... --> Requesting valid memory configuration (mt=31.3G)... --> Checking ssh keys... --> Checking file existence and permissions for passwordless ssh... --> Verifying accounting... ---------------------------------------------------------------- ERROR: You have exceeded the max submitted job count. Maximum allowed is 50 jobs. Please contact TACC Consulting if you believe you have received this message in error. ---------------------------------------------------------------- Job aborted by esub. I'll add a limit for the number of jobs allowed to the current coaster code. On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote: > I have the identical response on ranger. It started yesterday evening. > Possibly a problem that the TACC folks need to fix? > > Glen > > Yue, Chen - BMD wrote: > > Hi Michael, > > > > Thank you for the advices. I tested ranger with 1 job and new > > specifications of maxwalltime. It shows the following error message. I > > don't know if there is other problem with my setup. Thank you! 
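[Editor's note] Until such a limit lands in the coaster code, the number of queued jobs can be throttled from the Swift side. The property names below follow Swift's configuration conventions of this era but should be treated as assumptions and checked against etc/swift.properties in your installation; roughly, the job-score factor times 100 bounds the concurrent jobs per site:

```
# swift.properties -- hedged sketch; verify these property names against
# the swift.properties shipped with your Swift installation.
# Goal: stay under Ranger's reported 50-queued-jobs cap.
throttle.submit=4              # concurrent job submissions in flight
throttle.score.job.factor=0.4  # roughly factor x 100 concurrent jobs per site
```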
> > > > ///////////////////////////////////////////////// > > [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file > > sites.xml -tc.file tc.data > > Swift 0.9rc2 swift-r2860 cog-r2388 > > RunID: 20090430-1559-2vi6x811 > > Progress: > > Progress: Stage in:1 > > Progress: Submitting:1 > > Progress: Submitting:1 > > Progress: Submitted:1 > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger > > Progress: Stage in:1 > > Progress: Active:1 > > Failed to transfer wrapper log from > > PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger > > Progress: Failed:1 > > Execution failed: > > Exception in PTMap2: > > Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt, > > parameters.txt] > > Host: ranger > > Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj > > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > > Failed to start worker: > > null > > null > > org.globus.gram.GramException: The job manager detected an invalid > > script response > > at > > org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530) > > at org.globus.gram.GramJob.setStatus(GramJob.java:184) > > at > > org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176) > > at java.lang.Thread.run(Thread.java:619) > > Cleaning up... 
> > Shutting down service at https://129.114.50.163:45562 > > > > Got channel MetaChannel: 20903429 -> GSSSChannel-null(1) > > - Done > > [yuechen at communicado PTMap2]$ > > /////////////////////////////////////////////////////////// > > > > Chen, Yue > > > > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > *Sent:* Thu 4/30/2009 3:02 PM > > *To:* Yue, Chen - BMD; swift-devel > > *Subject:* Re: [Swift-user] Execution error > > > > Back on list here (I only went off-list to discuss accounts, etc) > > > > The problem in the run below is this: > > > > 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with > > the given max walltime worker constraint (task: 3000, \ > > maxwalltime: 2400s) > > > > You have this on the ptmap app in your tc.data: > > > > globus::maxwalltime=50 > > > > But you only gave coasters 40 mins per coaster worker. So its > > complaining that it cant run a 50 minute job in a 40 minute (max) > > coaster worker. ;) > > > > I mentioned in a prior mail that you need to set the two time vals in > > your sites.xml entry; thats what you need to do next, now. > > > > change the coaster time in your sites.xml to: > > key="coasterWorkerMaxwalltime">00:51:00 > > > > If you have more info on the variability of your ptmap run times, send > > that to the list, and we can discuss how to handle. > > > > > > (NOTE: doing grp -i of the log for "except" or scanning for "except" > > with an editor will often locate the first "exception" that your job > > encountered. Thats how I found the error above). > > > > Also, Yue, for testing new sites, or for validating that old sites still > > work, you should create the smallest possible ptmap workflow - 1 job if > > that is possible - and verify that this works. Then say 10 jobs to make > > sure scheduling etc is sane. Then, send in your huge jobs. > > > > With only 1 job, its easier to spot the errors in the log file. 
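[Editor's note] The constraint Mike describes can be read directly off the two settings: globus::maxwalltime in tc.data (minutes, per task) must fit inside coasterWorkerMaxwalltime in sites.xml (HH:MM:SS, per coaster worker). A matched pair using the values from this thread (tc.data path as quoted later in the message):

```
# tc.data: each PTMap2 task may run up to 50 minutes
ranger  PTMap2  /share/home/01164/yuechen/PTMap2/PTMap2  INSTALLED  INTEL32::LINUX  globus::maxwalltime=50

# sites.xml: give each coaster worker at least 51 minutes, so a 50-minute
# task always fits inside one worker's lifetime
<profile namespace="globus" key="coasterWorkerMaxwalltime">00:51:00</profile>
```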
> > > > - Mike > > > > > > On 4/30/09 2:34 PM, Yue, Chen - BMD wrote: > > > Hi Michael, > > > > > > I run into the same messages again when I use Ranger: > > > > > > Progress: Selecting site:146 Stage in:25 Submitting:15 Submitted:821 > > > Failed but can retry:16 > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger > > > Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857 > > > Failed but can retry:16 > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger > > > Failed to transfer wrapper log from > > > PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger > > > The log 
for the search is at : > > > /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log > > > > > > The sites.xml I have is: > > > > > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > > jobManager="gt2:gt2:SGE"/> > > > > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > > TG-CCR080022N > > > 16 > > > development > > > > > key="coasterWorkerMaxwalltime">00:40:00 > > > 31 > > > 50 > > > 10 > > > /work/01164/yuechen/swiftwork > > > > > > The tc.data I have is: > > > > > > ranger PTMap2 > > > /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED > > > INTEL32::LINUX globus::maxwalltime=50 > > > > > > I'm using swift 0.9 rc2 > > > > > > Thank you very much for help! > > > > > > Chen, Yue > > > > > > > > > > > > ------------------------------------------------------------------------ > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > > *Sent:* Thu 4/30/2009 2:05 PM > > > *To:* Yue, Chen - BMD > > > *Subject:* Re: [Swift-user] Execution error > > > > > > > > > > > > On 4/30/09 1:51 PM, Yue, Chen - BMD wrote: > > > > Hi Michael, > > > > > > > > When I tried to activate my account, I encountered the following > > error: > > > > > > > > "Sorry, this account is in an invalid state. You may not activate > > your > > > > at this time." > > > > > > > > I used the username and password from TG-CDA070002T. Should I use a > > > > different password? > > > > > > If you can already login to Ranger, then you are all set - you must have > > > done this previously. > > > > > > I thought you had *not*, because when I looked up your login on ranger > > > ("finger yuechen") it said "never logged in". But seems like that info > > > is incorrect. > > > > > > If you have ptmap compiled, seems like you are almost all set. > > > > > > Let me know if it works. > > > > > > - Mike > > > > > > > Thanks! 
> > > > > > > > Chen, Yue > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov] > > > > *Sent:* Thu 4/30/2009 1:07 PM > > > > *To:* Yue, Chen - BMD > > > > *Cc:* swift user > > > > *Subject:* Re: [Swift-user] Execution error > > > > > > > > Yue, use this XML pool element to access ranger: > > > > > > > > > > > > > > > url="gatekeeper.ranger.tacc.teragrid.org" > > > > jobManager="gt2:gt2:SGE"/> > > > > > url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" /> > > > > > > > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir > > > > > key="project">TG-CCR080022N > > > > 16 > > > > development > > > > > > > key="coasterWorkerMaxwalltime">00:40:00 > > > > 31 > > > > 50 > > > > 10 > > > > /work/00306/tg455797/swiftwork > > > > > > > > > > > > > > > > You will need to also do these steps: > > > > > > > > Go to this web page to enable your Ranger account: > > > > > > > > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx > > > > > > > > Then login to Ranger via the TeraGrid portal and put your ssh keys in > > > > place (assuming you use ssh keys, which you should) > > > > > > > > While on Ranger, do this: > > > > > > > > echo $WORK > > > > mkdir $work/swiftwork > > > > > > > > and put the full path of your $WORK/swiftwork directory in the > > > > element above. (My login is tg455etc, yours is > > yuechen) > > > > > > > > Then scp your code to Ranger and compile it. > > > > > > > > Then create a tc.data entry for your ptmap app > > > > > > > > Next, set your time values in the sites.xml entry above to suitable > > > > values for Ranger. You'll need to measure times, but I think you will > > > > find Ranger about twice as fast as Mercury for CPU-bound jobs. > > > > > > > > The values above were set for one app job per coaster. I think > > you can > > > > probably do more. 
> > > > > > > > If you estimate a run time of 5 minutes, use: > > > > > > > > > > > key="coasterWorkerMaxwalltime">00:30:00 > > > > 5 > > > > > > > > Other people on the list - please sanity check what I suggest here. > > > > > > > > - Mike > > > > > > > > > > > > On 4/30/09 12:40 PM, Michael Wilde wrote: > > > > > I just checked - TG-CDA070002T has indeed expired. > > > > > > > > > > The best for now is to move to use (only) Ranger, under this > > account: > > > > > TG-CCR080022N > > > > > > > > > > I will locate and send you a sites.xml entry in a moment. > > > > > > > > > > You need to go to a web page to activate your Ranger login. > > > > > > > > > > Best to contact me in IM and we can work this out. > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > On 4/30/09 12:23 PM, Michael Wilde wrote: > > > > >> Also, what account are you running under? We may need to change > > > you to > > > > >> a new account - as the OSG Training account expires today. > > > > >> If that happend at Noon, it *might* be the problem. 
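[Editor's note] The Ranger setup steps above reduce to a few shell commands once logged in; note that the environment variable is $WORK in uppercase (the quoted "mkdir $work/swiftwork" has a lowercase typo). A sketch, assuming the TACC login environment sets $WORK:

```shell
# On Ranger, after activating the account via the TACC portal and logging in:
echo "$WORK"                  # the work filesystem root for your login
mkdir -p "$WORK/swiftwork"    # create the Swift work directory (uppercase $WORK)
echo "$WORK/swiftwork"        # put this full path in <workdirectory> in sites.xml
```

The printed path is what goes into the workdirectory element of the pool entry for Ranger.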
> >> - Mike
> >>
> >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote:
> >>> Hi,
> >>>
> >>> I came back to re-run my application on NCSA Mercury, which was
> >>> tested successfully last week after I set up coasters with swift
> >>> 0.9, but I got many messages like the following:
> >>>
> >>> Progress:  Stage in:219  Submitting:803  Submitted:1
> >>> Progress:  Stage in:129  Submitting:703  Submitted:190  Failed but can retry:1
> >>> Progress:  Stage in:38  Submitting:425  Submitted:556  Failed but can retry:4
> >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY
> >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY
> >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY
> >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY
> >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY
> >>> Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY
> >>> Progress:  Stage in:1  Submitted:1013  Active:1  Failed but can retry:8
> >>> Progress:  Submitted:1011  Active:1  Failed but can retry:11
> >>>
> >>> The log file for the successful run last week is:
> >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log
> >>>
> >>> The log file for the failed run is:
> >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log
> >>>
> >>> I don't think I did anything different, so I don't know why this
> >>> time they failed.
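[Editor's note: failure lines like those above can be tallied with standard shell tools to see how many wrapper-log transfers failed per site. The two sample lines in the here-document are copied verbatim from the output quoted above:]

```shell
# Count 'Failed to transfer wrapper log' messages per site
# (last field of each message is the site name).
grep 'Failed to transfer wrapper log' <<'EOF' | awk '{print $NF}' | sort | uniq -c
Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY
Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY
EOF
```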
> >>> The sites.xml for Mercury is:
> >>>
> >>>     url="grid-hg.ncsa.teragrid.org"
> >>>     jobManager="gt2:PBS"/>
> >>>     /gpfs_scratch1/yuechen/swiftwork
> >>>     debug
> >>>
> >>> Thank you for help!
> >>>
> >>> Chen, Yue
> >>>
> >>> This email is intended only for the use of the individual or entity
> >>> to which it is addressed and may contain information that is
> >>> privileged and confidential. If the reader of this email message is
> >>> not the intended recipient, you are hereby notified that any
> >>> dissemination, distribution, or copying of this communication is
> >>> prohibited. If you have received this email in error, please notify
> >>> the sender and destroy/delete all copies of the transmittal. Thank you.
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>> _______________________________________________
> >>> Swift-user mailing list
> >>> Swift-user at ci.uchicago.edu
> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> ------------------------------------------------------------------------
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel