From bugzilla-daemon at mcs.anl.gov Sun Mar 1 15:47:05 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 1 Mar 2009 15:47:05 -0600 (CST)
Subject: [Swift-devel] [Bug 86] recompilation should not be suppressed if compiler version has changed
In-Reply-To:
Message-ID: <20090301214705.B633B164B3@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=86

benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from benc at hawaga.org.uk 2009-03-01 15:47 -------
Implemented in r2583.

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From bugzilla-daemon at mcs.anl.gov Sun Mar 1 15:47:53 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 1 Mar 2009 15:47:53 -0600 (CST)
Subject: [Swift-devel] [Bug 180] multi-node coasters?
In-Reply-To:
Message-ID: <20090301214753.04AE7164CF@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=180

benc at hawaga.org.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|benc at hawaga.org.uk       |hategan at mcs.anl.gov

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From wilde at mcs.anl.gov Sun Mar 1 22:53:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 01 Mar 2009 22:53:39 -0600
Subject: [Swift-devel] continued questions on iterate
Message-ID: <49AB6653.6010306@mcs.anl.gov>

This program:

string s[];
s[0]="hi ";
iterate i {
  s[i+1] = @strcat(s[i],"hi ");
  trace(s[i]);
} until(i==5);

Gives:

com$ swift it4.swift
Could not start execution.
        variable s has multiple writers.

--
It's similar to the tutorial example:

counterfile a[] ;

a[0] = echo("793578934574893");

iterate v {
  a[v+1] = countstep(a[v]);
  print("extract int value ", @extractint(a[v+1]));
} until (@extractint(a[v+1]) <= 1);

--

...which I reported earlier as having problems (I think in addition to
the one above?)

This is using the latest swift, rev 2631, and latest cog.

I thought I had issues like this licked, but then updated the code to
get closer to what the user needs.

In this example, I don't see any violation of single-assignment, but
apparently swift does.

The full example that the test case above is for is at:
www.ci.uchicago.edu/~wilde/oops8.swift, which encounters the same
multiple-writer problem.

I start with an initial "secondary structure" string of all A's, same
length as the protein sequence. After each folding round, a new
structure is derived for analysis and used as the starting point for the
next round. This has the same data access pattern as array s[] above:

foreach p, pn in protein {
  OOPSOut result[][] ;
  SecSeq secseq[] ;
  OOPSIn oopsin ;
  secseq[0] = sedfasta(oopsin.fasta, ["-e","s/./A/g"]);
  boolean converged[];
  iterate i {
    SecSeq s;
    result[i] = doRound(p,oopsin,secseq[i],i);
    (converged[i],s) = analyzeResult(result[i], p, i, secseq[i]);
    secseq[i+1] = s;
  } until (converged[i] || (i==3));
}

In this case, I get the same message for array secseq (variable has
multiple writers).

I

From foster at uchicago.edu Sun Mar 1 23:25:11 2009
From: foster at uchicago.edu (Ian Foster)
Date: Sun, 1 Mar 2009 23:25:11 -0600
Subject: [Swift-devel] Swift user guide
Message-ID: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>

Reading a recent email about "iterate" got me looking at the Swift
manual. This is looking very nice. I had a couple of comments:

1) "Conceptually, a parallel can be drawn between Swift mapped variables
and Java reference types. In both cases there is no syntactic
distinction between primitive types and mapped types or reference types
respectively. Additionally, the semantic distinction is also kept to a
minimum."

--> I don't think we should assume that readers know what a Java
reference type is. Most will not.

2) Arrays: I gather from below that the size of an array is defined by
assignments to it. This seems confusing and dangerous to me: doesn't it
require a global analysis, which must ultimately be undecidable, to
determine whether an array is closed?

Statements which deal with the array as a whole will often wait for the
array to be closed before executing (thus, a closed array is the
equivalent of a non-array type being assigned). However, a foreach
statement will apply its body to elements of an array as they become
known. It will not wait until the array is closed.

Consider this script:

file a[];
file b[];
foreach v,i in a {
  b[i] = p(v);
}
a[0] = r();
a[1] = s();

Initially, the foreach statement will have nothing to execute, as the
array a has not been assigned any values. The procedures r and s will
execute. As soon as either of them is finished, the corresponding
invocation of procedure p will occur. After both r and s have completed,
the array a will be closed since no other statements in the script make
an assignment to a.

3) In the following text, the (,index) is presumably meant to indicate
an optional element.
But as you don't use a different font, or indeed indicate what
conventions you are using, readers may not realize that.

foreach statements have the general form:

foreach controlvariable (,index) in expression {
    statements
}

From wilde at mcs.anl.gov Sun Mar 1 23:39:42 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 01 Mar 2009 23:39:42 -0600
Subject: [Swift-devel] continued questions on iterate
In-Reply-To: <49AB6653.6010306@mcs.anl.gov>
References: <49AB6653.6010306@mcs.anl.gov>
Message-ID: <49AB711E.4010208@mcs.anl.gov>

I'm able to work around this by moving the s[0] assignments inside the
iterate block, in an if(i==0) {} else {} construct.

Still, it seems the restriction is not intended.

- Mike

On 3/1/09 10:53 PM, Michael Wilde wrote:
> This program:
>
> string s[];
> s[0]="hi ";
> iterate i {
>   s[i+1] = @strcat(s[i],"hi ");
>   trace(s[i]);
> } until(i==5);
>
> Gives:
>
> com$ swift it4.swift
> Could not start execution.
>         variable s has multiple writers.
>
> --
> It's similar to the tutorial example:
>
> counterfile a[] ;
>
> a[0] = echo("793578934574893");
>
> iterate v {
>   a[v+1] = countstep(a[v]);
>   print("extract int value ", @extractint(a[v+1]));
> } until (@extractint(a[v+1]) <= 1);
>
> --
>
> ...which I reported earlier as having problems (I think in addition to
> the one above?)
>
> This is using the latest swift, rev 2631, and latest cog.
>
> I thought I had issues like this licked, but then updated the code to
> get closer to what the user needs.
>
> In this example, I don't see any violation of single-assignment, but
> apparently swift does.
>
> The full example that the test case above is for is at:
> www.ci.uchicago.edu/~wilde/oops8.swift, which encounters the same
> multiple-writer problem.
>
> I start with an initial "secondary structure" string of all A's, same
> length as the protein sequence.
After each folding round, a new > structure is derived for analysis and used as the starting point for the > next round. This has the same data access pattern as array s[] above: > > foreach p, pn in protein { > OOPSOut result[][] ; > SecSeq secseq[] prefix=@strcat("seqseq/",p,"/"),suffix=".secseq">; > OOPSIn oopsin ; > secseq[0] = sedfasta(oopsin.fasta, ["-e","s/./A/g"]); > boolean converged[]; > iterate i { > SecSeq s; > result[i] = doRound(p,oopsin,secseq[i],i); > (converged[i],s) = analyzeResult(result[i], p, i, secseq[i]); > secseq[i+1] = s; > } until (converged[i] || (i==3)); > } > > In this case, I get the same message for array secseq (varable has > multiple writers). > > I > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Mar 2 00:27:38 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 02 Mar 2009 00:27:38 -0600 Subject: [Swift-devel] Swift user guide In-Reply-To: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu> References: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu> Message-ID: <1235975258.20059.39.camel@localhost> On Sun, 2009-03-01 at 23:25 -0600, Ian Foster wrote: > 2) Arrays: I gather from below that the size of an array is defined by > assignments to it. This seems confusing and dangerous to me: doesn't > it require a global analysis, which must ultimately be undecidable, to > determine whether an array is closed? > You don't need to add a level of indirection :) Iterate() makes swift termination and array closing undecidable (and so does recursion*). And I think we want it that way in order to support problems of the kind "repeat until sufficiently good results are obtained". 
(*) except with recursion you're guaranteed to run out of memory
eventually

From benc at hawaga.org.uk Mon Mar 2 01:40:56 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Mar 2009 07:40:56 +0000 (GMT)
Subject: [Swift-devel] Swift user guide
In-Reply-To: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>
References: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu>
Message-ID:

On Sun, 1 Mar 2009, Ian Foster wrote:

> 2) Arrays: I gather from below that the size of an array is defined by
> assignments to it. This seems confusing and dangerous to me: doesn't it
> require a global analysis, which must ultimately be undecidable, to determine
> whether an array is closed?

yes.

They're a very unpleasant structure at the moment. As I've written in
other mails in more depth, I'd prefer to see them behave as
single-assignment structures constructed by something that looks similar
to, but slightly different from, the present loop constructs, so that
you'd say:

array = foreach ....

rather than

foreach ... {
  array[i]=...
}

There would be no requirement for this to actually execute as some
atomic operation, and it could happen over time interleaved with other
tasks, as happens for foreach at the moment; but from a code analysis
perspective, it's much clearer when the array is closed - after the
single statement that assigns to it has fully completed, like non-array
variables.

--

From benc at hawaga.org.uk Mon Mar 2 01:45:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 2 Mar 2009 07:45:01 +0000 (GMT)
Subject: [Swift-devel] continued questions on iterate
In-Reply-To: <49AB6653.6010306@mcs.anl.gov>
References: <49AB6653.6010306@mcs.anl.gov>
Message-ID:

On Sun, 1 Mar 2009, Michael Wilde wrote:

> com$ swift it4.swift
> Could not start execution.
> variable s has multiple writers.

> In this example, I don't see any violation of single-assignment, but apparently
> swift does.

yes, it's a bug in the "let's try to guess who is allowed to write to
array" code.

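The distinction at issue in this thread is between what single
assignment requires (no array element is written twice) and what the
checker apparently counts (how many statements write to the array). A
minimal Python model may make that concrete. It is only a sketch and not
how Swift is implemented: the class SingleAssignmentArray, the writer
labels "init" and "iterate", and the loop bounds are assumptions made
for illustration, and the loop only loosely mirrors how iterate
evaluates its until condition.

```python
class SingleAssignmentArray:
    """Toy write-once array that closes when all of its writers finish.

    Hypothetical model for illustration; not Swift's data structure.
    """

    def __init__(self, writers):
        self._items = {}
        # Writers that may still assign elements; the array is open
        # for as long as this set is non-empty.
        self._open_writers = set(writers)

    def assign(self, writer, index, value):
        if writer not in self._open_writers:
            raise RuntimeError(f"{writer!r} is not an open writer")
        if index in self._items:
            # This is the actual single-assignment violation.
            raise ValueError(f"element {index} assigned twice")
        self._items[index] = value

    def finish(self, writer):
        # Called when a writing statement has fully completed.
        self._open_writers.discard(writer)

    @property
    def closed(self):
        return not self._open_writers

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


def run_iterate_example():
    """Two writers fill s[], but no element is written more than once."""
    s = SingleAssignmentArray(writers={"init", "iterate"})

    # Corresponds to: s[0] = "hi ";
    s.assign("init", 0, "hi ")
    s.finish("init")

    # Loosely corresponds to the iterate block (loop bounds assumed).
    i = 0
    while True:
        s.assign("iterate", i + 1, s[i] + "hi ")
        if i == 5:
            break
        i += 1
    s.finish("iterate")
    return s
```

In this model the array has two writers, yet every index is assigned
exactly once, and the array becomes closed only after both writers have
finished. A check that rejects the program merely because there is more
than one writer is therefore stricter than single assignment itself.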
-- From benc at hawaga.org.uk Mon Mar 2 05:08:45 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 2 Mar 2009 11:08:45 +0000 (GMT) Subject: [Swift-devel] provenance challenge 3 participation Message-ID: The 3rd Provenance Challenge starts today. Mike and I entered the VDS VDC in the first provenance challenge; we did not participate in the second challenge. Luiz M. R. Gadelha Jr. and I intend to work on an entry based around the provenance database prototype that lives in the provenancedb/ directory of the SVN, with the goals of making that code more useable and useful and of making some presentation at the provenance challenge workshop in the summer. Information about the challenge is here: http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge We've had some discussion in private correspondence but will continue our discussions on this list. -- From foster at uchicago.edu Mon Mar 2 07:09:38 2009 From: foster at uchicago.edu (Ian Foster) Date: Mon, 2 Mar 2009 07:09:38 -0600 Subject: [Swift-devel] Swift user guide In-Reply-To: References: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu> Message-ID: <4358C77B-E378-45BE-8C2D-116C411E4CE7@uchicago.edu> Ben: The example I saw in the code had the input array (e.g., "a", in "foreach v in a") constructed elsewhere in the program, with a[0] = v1, a[1] = v2. That seemed particularly challenging. Ian. On Mar 2, 2009, at 1:40 AM, Ben Clifford wrote: > > On Sun, 1 Mar 2009, Ian Foster wrote: > >> 2) Arrays: I gather from below that the size of an array is defined >> by >> assignments to it. This seems confusing and dangerous to me: >> doesn't it >> require a global analysis, which must ultimately be undecidable, to >> determine >> whether an array is closed? > > yes. > > They're a very unpleasant structure at the moment. 
As I've written in > other mails in more depth, I'd prefer to see them behave as > single-assignment structures constructued by something like looks > similar > to but slightly different to the present loop constructs, so that > you'd > say: > > array = foreach .... > > rather than > > foreach ... { > array[i]=... > } > > There would be no requirement for this to actually execute as some > atomic > operation, and could happen over time interleaved with other tasks, as > happens for foreach at the moment; but from a code analysis > perspective, > its much clearer when the array is closed - after the single statement > that assigns to it has fully completed, like non-array variables. > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at anl.gov Mon Mar 2 08:24:34 2009 From: foster at anl.gov (Ian Foster) Date: Mon, 2 Mar 2009 08:24:34 -0600 Subject: [Swift-devel] provenance challenge 3 participation In-Reply-To: References: Message-ID: Ben: That's great news. Are we in a position whereby most Swift runs are recorded in the database? Ian. On Mar 2, 2009, at 5:08 AM, Ben Clifford wrote: > > The 3rd Provenance Challenge starts today. Mike and I entered the > VDS VDC > in the first provenance challenge; we did not participate in the > second > challenge. Luiz M. R. Gadelha Jr. and I intend to > work > on an entry based around the provenance database prototype that > lives in > the provenancedb/ directory of the SVN, with the goals of making > that code > more useable and useful and of making some presentation at the > provenance > challenge workshop in the summer. 
> > Information about the challenge is here: > > http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge > > We've had some discussion in private correspondence but will > continue our > discussions on this list. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Mon Mar 2 08:33:14 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 2 Mar 2009 14:33:14 +0000 (GMT) Subject: [Swift-devel] provenance challenge 3 participation In-Reply-To: References: Message-ID: On Mon, 2 Mar 2009, Ian Foster wrote: > Are we in a position whereby most Swift runs are recorded in the > database? Lots of Swift runs are going into my log repository, though I don't really know what proportion of every run in the whole universe. "The database" doesn't exist - there is a database implementation in SVN which you can deploy and import data to. However, its still at the stage where ongoing development changes the database schema often(*) which usually requires a reimport of all the logs; or modifies the data that Swift is producing so that old logs no longer provide all the needed data. Hopefully it will converge on something fairly stable. (*) not much in recent months but thats because the provenance project has been dead. -- From benc at hawaga.org.uk Mon Mar 2 09:43:50 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 2 Mar 2009 15:43:50 +0000 (GMT) Subject: [Swift-devel] Swift user guide In-Reply-To: <4358C77B-E378-45BE-8C2D-116C411E4CE7@uchicago.edu> References: <08A89F3C-222A-4B5F-B497-75435EE6CB0A@uchicago.edu> <4358C77B-E378-45BE-8C2D-116C411E4CE7@uchicago.edu> Message-ID: On Mon, 2 Mar 2009, Ian Foster wrote: > Ben: > > The example I saw in the code had the input array (e.g., "a", in "foreach v in > a") constructed elsewhere in the program, with a[0] = v1, a[1] = v2. That > seemed particularly challenging. 
You can have single assignment with an explicit list of array contents
like this:

a = [v1, v2];

That works in the present implementation in some, but not all, cases -
it depends on whether the expressions v1 and v2 are primitive types like
int (in which case it does work, and has worked since before Swift was
called Swift) or if they are mapped to files (in which case it doesn't
work in the present implementation, but could be made to).

--

From wilde at mcs.anl.gov Mon Mar 2 17:26:49 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 02 Mar 2009 17:26:49 -0600
Subject: [Swift-devel] Re: Fault tolerance in "many task computing"?
In-Reply-To: <49AAC1DF.7090502@cs.uchicago.edu>
References: <21662876.222571235010770270.JavaMail.root@zimbra> <499CC801.3090304@cs.uchicago.edu> <25796E93-7DB0-44FF-8884-0201B4EC620F@anl.gov> <499CEF6C.1020700@mcs.anl.gov> <49AAC1DF.7090502@cs.uchicago.edu>
Message-ID: <49AC6B39.8010407@mcs.anl.gov>

All,

Pete suggested we take a look at CIFTS's message logging system and
consider integrating it into our stack. Rinku gave me, Allan, and Zhao
an excellent overview and demo of the system. (Thanks, Rinku!)

Here are my notes from this meeting. My intent is just to start a
discussion for longer-term consideration, not any near-term action.
(Although Jing Tie may find some of these concepts fruitful for her
troubleshooting research).

CIFTS is the DOE SciDAC project "Coordinated and Improved Fault
Tolerance for High Performance Computing Systems", PI'd by Pete:
http://www.mcs.anl.gov/research/cifts/index.php

It produces "FTB", a backplane for distributing logging information
within a distributed system:

http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf

I pointed Rinku to Swift and Falkon info, as well as Netlogger and
activities related to it in the CEDPS project, and we have a joint
action item to understand the possible overlap and integration issues
and possibilities between these two systems.

Netlogger and CEDPS info is at:

http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
http://dev.globus.org/wiki/Incubator/NetLogger
http://www.cedps.net/index.php/Troubleshooting#Work-in-progress

I mentioned that we have invested a small bit of effort in integrating
Netlogger log publishing capabilities into Swift.

Potential overlap notwithstanding, CIFTS (and in particular the Fault
Tolerant Backplane, FTB), could serve as a very nice consolidation
service for log information originating in the many different components
involved in executing a Swift program:

- the application program wrapper script
- the Falkon or Coaster worker agent
- the Globus job manager and/or local scheduler
- the worker node
- the remote site fileserver/filesystem
- a site system management facility like BG/P's RAS service
- Falkon and Coaster servers and bootstrappers
- the swift client-side engine
- GridFTP and other transport protocols and services
- etc

FTB would enable us to readily capture and consolidate all these
information sources and funnel the data into streams related to specific
Swift program executions. It has the infrastructure to route messages
out of distributed systems, and to permit publication of and
subscription to message streams. Its agents, it seems, can help messages
traverse firewalls and deal with other transport and delivery issues.

FTB is implemented as a C API, and comes with a set of example clients.
From this a simple set of command line interfaces could be derived to
permit low-cost experimentation with the system in, e.g., Falkon on the
BG/P, where Rinku and others are implementing collectors to gather log
information from different parts of ZeptoOS and the BG/P hardware
complex.

It's not clear that any of us have the cycles within the next two months
to explore this, but it would make an interesting student project, to
compare CIFTS and NetLogger, and to test some initial integrations into
Swift, Falkon, and Coasters.
(I feel it's a good Summer of Code project).

My initial question is whether some CIFTS/FTB hooks could be planted in
a lightweight Swift experiment, and we could try to get a feel for
whether the infrastructure gives us something that we can't readily get
today. My gut feel is that it does.

I think it would be a great research/development topic to explore how
close this could bring us to the point where all distributed errors are
cleanly routed back to the centralized user to more quickly pinpoint the
cause of remote and distributed failures.

Swift does a *pretty* good job of this today, albeit in a somewhat
ad-hoc fashion. FTB would make it easier to integrate information from
additional sources like the remote scheduler and BGP RAS logs into the
debugging process.

And all that is before we even consider the goals of automating fault
tolerance, which I think is the ultimate vision of CIFTS.

Thoughts and discussion welcome. Once any of us get a day or so to play
with FTB, we'll know more about the possibilities.

Regards,

Mike

On 3/1/09 11:11 AM, Ioan Raicu wrote:
> Hi Rinku,
> It looks like I am not going to be able to make the meeting tomorrow. On
> Friday, another interview opportunity came up, and the only open slot
> for the next 2 weeks was this Monday. Sorry about the short notice. Go
> ahead and meet without me, and I'll catch up with what was discussed at
> the meeting from Mike.
>
> Thanks,
> Ioan
>
> Michael Wilde wrote:
>> Zhao, Allan, you're welcome to join this discussion, at Argonne Mar 2,
>> or by phone.
>>
>> - Mike
>>
>>
>> On 2/18/09 10:30 PM, Ian Foster wrote:
>>> Hi,
>>>
>>> This sounds like a really fun project. Maybe we should involve Zhao
>>> and Allen as well, given that Ioan has (sadly) graduated, and will
>>> leave us?
>>> I'd love to participate, I will need to do so by phone--could we do
>>> that? I'll just listen in, and see what I can learn.
>>>
>>> Ian.
>>>
>>>
>>> On Feb 18, 2009, at 8:46 PM, Ioan Raicu wrote:
>>>
>>>> Great!
>>>> >>>> I added Ian as a cc, maybe he wants to come to this meeting as well. >>>> Ian, the original message from Pete was: >>>>> Ioan and Mike, >>>>> >>>>> The CIFTS project is a DOE project to provide a "fault tolerant >>>>> backplane". I'm the PI of the project which involved ORNL, LBL, >>>>> IU, Ohio State, and UTK. Below is a suggestion to hook CIFTS to >>>>> Falkon, so faults could be monitored. Rinku (on the cc: line) is >>>>> the lead developer for CIFTS. Maybe when one of you is on campus >>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way >>>>> to link the two systems efficiently. Email below is from an ORNL >>>>> participant in the CIFTS framework. >>>>> >>>>> -Pete >>>> The meeting is scheduled with Rinku, Mike, Pete (?), and I for March >>>> 2nd, at 11AM in Rinku's office (ANL, D-231 in the MCS building). >>>> >>>> Ioan >>>> >>>> Rinku Gupta wrote: >>>>> We can meet at my office (D-231 in the MCS building) and then sneak >>>>> into Pete's room, if it is empty. >>>>> >>>>> Rinku >>>>> >>>>> >>>>> >>>>> ----- "Ioan Raicu" wrote: >>>>> >>>>> >>>>>> Works for me! I assume we are meeting at ANL. Whose office are we >>>>>> meeting in? >>>>>> >>>>>> Ioan >>>>>> >>>>>> Rinku Gupta wrote: >>>>>> >>>>>> Based on everyones availability, how does 11:00am on March 2nd sound? >>>>>> >>>>>> Thanks >>>>>> Rinku >>>>>> >>>>>> >>>>>> ----- "Michael Wilde" wrote: >>>>>> >>>>>> Rinku, Ioan, >>>>>> >>>>>> I can do Thu Feb 26 10:30 (I assume you meant AM not PM). >>>>>> >>>>>> But if Rinku is just arriving back in the US that morning, it seems >>>>>> better to postpone to the week after. >>>>>> >>>>>> I can be at Argonne any time week of March 2. Mornings are free, >>>>>> Mon-Thu >>>>>> are best. >>>>>> >>>>>> Can we tentatively then meet at 11AM Mon Mar 2? >>>>>> >>>>>> Regards, >>>>>> >>>>>> Mike >>>>>> >>>>>> >>>>>> On 2/18/09 9:37 AM, Ioan Raicu wrote: >>>>>> >>>>>> Hi Rinku, >>>>>> Next Thursday (February 26th) at 10:30AM would work for me. 
>>>>>> If we need to meet the following week, I could meet Monday (March 2nd)
>>>>>> and Thursday (March 5th) any time.
>>>>>>
>>>>>> Cheers,
>>>>>> Ioan
>>>>>>
>>>>>> Rinku Gupta wrote:
>>>>>>
>>>>>> Hi Michael, Ioan
>>>>>>
>>>>>> I am currently on travel and will arrive back to the USA only Thursday
>>>>>> (Feb 26th) early morning. Will you be available anytime the
>>>>>> week after next? If not, then we can try to schedule a meeting
>>>>>> sometime around 10:30/11pm next Thursday at ANL.
>>>>>>
>>>>>> Thanks
>>>>>> Rinku
>>>>>>
>>>>>> ----- "Ioan Raicu" wrote:
>>>>>>
>>>>>> Hi Rinku,
>>>>>> I can meet next week on Wednesday any time, and Thursday morning before
>>>>>> noon, as I have a flight to catch early afternoon from O'Hare. I can
>>>>>> meet either at UC or ANL. Let me know what works best for everyone.
>>>>>>
>>>>>> Thanks,
>>>>>> Ioan
>>>>>>
>>>>>> Michael Wilde wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Rinku, let's set up a meeting for next week to discuss. I can meet Wed
>>>>>> or Thu, at Argonne or UChicago.
>>>>>>
>>>>>> Do either of those dates work for you, and which place is best?
>>>>>>
>>>>>> In the meantime I'll read up on CIFTS at
>>>>>> http://www.mcs.anl.gov/research/cifts/docs/index.php and the wiki that
>>>>>> this refers to.
>>>>>>
>>>>>> If you have any other docs we should read, please send them.
>>>>>>
>>>>>> Thanks and regards,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On 2/17/09 12:42 PM, Pete Beckman wrote:
>>>>>>
>>>>>> Ioan and Mike,
>>>>>>
>>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
>>>>>> backplane".
>>>>>> I'm the PI of the project which involved ORNL, LBL, IU,
>>>>>> Ohio State, and UTK. Below is a suggestion to hook CIFTS to Falkon,
>>>>>> so faults could be monitored. Rinku (on the cc: line) is the lead
>>>>>> developer for CIFTS. Maybe when one of you is on campus (ANL) you
>>>>>> can meet with Rinku, and brainstorm if there is any way to link the
>>>>>> two systems efficiently. Email below is from an ORNL participant in
>>>>>> the CIFTS framework.
>>>>>>
>>>>>> -Pete
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>> From: bernholdtde at ornl.gov
>>>>>> Date: February 12, 2009 10:29:47 AM CST
>>>>>> To: cifts at googlegroups.com
>>>>>> Cc: bernholdtde at ornl.gov
>>>>>> Subject: Fault tolerance in "many task computing"?
>>>>>> Reply-To: cifts at googlegroups.com
>>>>>>
>>>>>> Pete (and other ANL folks),
>>>>>>
>>>>>> I recently read the SC08 paper on many task computing on which you're
>>>>>> a co-author. (
>>>>>> http://portal.acm.org/citation.cfm?doid=1413370.1413393
>>>>>> )
>>>>>> I wonder if it would be viable to build a CIFTS demonstration scenario
>>>>>> around the software system described in this paper?
>>>>>>
>>>>>> In the paper, there's a paragraph discussing reliability that
>>>>>> discusses some of the issues at a high level.
>>>>>> It strikes me as both
>>>>>> interesting and challenging because you have both system components
>>>>>> (i.e. Cobalt) and multiple user-space components (Falkon, Swift,
>>>>>> application tasks) interacting.
>>>>>>
>>>>>> It might also be worth looking at this environment to help understand
>>>>>> the use cases and requirements for the policy/control channels (as
>>>>>> opposed to the FTB's informational channel).
>>>>>>
>>>>>> Just some ideas, db
>>>>>> --
>>>>>> David E. Bernholdt | Email: bernholdtde at ornl.gov
>>>>>> Oak Ridge National Laboratory | Phone: +1 (865) 574 3147
>>>>>> http://www.csm.ornl.gov/~bernhold/ | Fax: +1 (865) 576 5491
>>>>>>
>>>>>> --~--~---------~--~----~------------~-------~--~----~
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "CIFTS" group.
>>>>>> To post to this group, send email to cifts at googlegroups.com
>>>>>> To unsubscribe from this group, send email to
>>>>>> cifts+unsubscribe at googlegroups.com
>>>>>> For more options, visit this group
>>>>>> at http://groups.google.com/group/cifts?hl=en
>>>>>> -~----------~----~----~----~------~----~------~--~---
>>>>>>
>>>>>> --
>>>>>> ===================================================
>>>>>> Ioan Raicu, Ph.D.
>>>>>> ===================================================
>>>>>> Distributed Systems Laboratory
>>>>>> Computer Science Department
>>>>>> University of Chicago
>>>>>> 1100 E.
>>>>>> 58th Street, Ryerson Hall
>>>>>> Chicago, IL 60637
>>>>>> ===================================================
>>>>>> Email: iraicu at cs.uchicago.edu
>>>>>> Web: http://www.cs.uchicago.edu/~iraicu
>>>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>>>> ===================================================
>>>>>> ===================================================
>>>>>>
>>>>>
>>>>
>>>> --
>>>> ===================================================
>>>> Ioan Raicu, Ph.D.
>>>> ===================================================
>>>> Distributed Systems Laboratory
>>>> Computer Science Department
>>>> University of Chicago
>>>> 1100 E. 58th Street, Ryerson Hall
>>>> Chicago, IL 60637
>>>> ===================================================
>>>> Email: iraicu at cs.uchicago.edu
>>>> Web: http://www.cs.uchicago.edu/~iraicu
>>>> http://dev.globus.org/wiki/Incubator/Falkon
>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>>> ===================================================
>>>> ===================================================
>>>
>>
>

From hategan at mcs.anl.gov Mon Mar 2 17:50:42 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 02 Mar 2009 17:50:42 -0600
Subject: [Swift-devel] Re: Fault tolerance in "many task computing"?
In-Reply-To: <49AC6B39.8010407@mcs.anl.gov>
References: <21662876.222571235010770270.JavaMail.root@zimbra>
	<499CC801.3090304@cs.uchicago.edu>
	<25796E93-7DB0-44FF-8884-0201B4EC620F@anl.gov>
	<499CEF6C.1020700@mcs.anl.gov>
	<49AAC1DF.7090502@cs.uchicago.edu>
	<49AC6B39.8010407@mcs.anl.gov>
Message-ID: <1236037842.2575.7.camel@localhost>

Is there a Java library for FTB?

What does FTB bring new to the table compared to a distributed messaging
system?

Mihael

On Mon, 2009-03-02 at 17:26 -0600, Michael Wilde wrote:
> All,
>
> Pete suggested we take a look at CIFTS's message logging system and
> consider integrating it into our stack. Rinku gave me, Allan, and Zhao
> and excellent overview and demo of the system. (Thanks, Rinku!)
>
> Here's my notes from this meeting. My intent is just to start a
> discussion for longer-term consideration, not any near-term action.
> (Although Jing Tie may find some of these concepts fruitful for er
> troubleshooting research).
>
> CIFTS is the DOE SciDAC project "Coordinated and Improved Fault
> Tolerance for High Performance Computing Systems", PI'd by Pete:
> http://www.mcs.anl.gov/research/cifts/index.php
>
> It produces "FTB", a backplane for distributing logging information
> within a distributed system:
>
> http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf
>
> I pointed Rinku to Swift and Falkon info, as well as Netlogger and
> activities related to it in the CEDPS project, and we have a joint
> action item to understand the possible overlap and integration issues
> and possibilities between these two systems.
>
> Netlogger and CEDPS info is at:
>
> http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
> http://dev.globus.org/wiki/Incubator/NetLogger
> http://www.cedps.net/index.php/Troubleshooting#Work-in-progress
>
> I mentioned that we have invested a small bit of effort in integrating
> Netlogger log publishing capabilities into Swift.
>
> Potential overlap notwithstanding, CIFTS (and in particular the Fault
> Tolerant Backplane, FTB), could serve as a very nice consolidation
> service for log information originating in the many different components
> involved in executing a Swift program:
>
> - the application program wrapper script
> - the Falkon or Coaster worker agent
> - the Globus job manager and/or local scheduler
> - the worker node
> - the remote site fileserver/filesystem
> - a site system management facility like BG/P's RAS service
> - Falkon and Coaster servers and bootstrappers
> - the swift client-side engine
> - GrifFTP and other transport protocols and services
> - etc
>
> FTB would enable us to readily capture and consolidate all these
> information sources and funnel the data into streams related to specific
> Swift program executions. It has the infrastructure to route messages
> out of distributed systems, and to permit publication of and
> subscription to message streams. Its agents, it seems, can help messages
> traverse firewalls and deal with other transport and delivery issues.
>
> FTB is implemented as a C API, and comes with a set of example clients.
> From this a simple set of command line interfaces could be derived to
> permit low-cost experimentation with the system in, eg, Falkon on the
> BG/P, where Rinku and others are implementing collectors to gather log
> information from different parts of ZeptoOS and the BG/P hardware complex.
>
> Its not clear that any of us have the cycles within the next two months
> to explore this, but it would make an interesting student project, to
> compare CIFTS and NetLogger, and to test some initial integrations into
> Swift, Falkon, and Coasters. (I feel its a good Summer of Code project).
>
> My initial question is whether some CIFTS/FTB hooks could be planted in
> a lightweight Swift experiment, and we could try to get a feel for
> whether the infrastructure gives us something that we cant readily get
> today. My gut feel is that is does.
>
> I think it would be a great research/development topic to explore how
> close this could bring us to the point where all distributed errors are
> cleanly routed back to the centralized user to more quickly pinpoint the
> cause of remote and distributed failures. Swift does a *pretty* good
> job of this today, albeit in a somewhat ad-hoc fashion. FTB would make
> it easier to integrate information from additional sources like the
> remote scheduler and BGP RAS logs into the debugging process.
>
> And all that is before we even consider the goals of automating fault
> tolerance, which I think is the ultimate vision of CIFTS.
>
> Thoughts and discussion welcome. Once any of us get a day or so to play
> with FTB, we'll know more about the possibilities.
>
> Regards,
>
> Mike
>
>
> On 3/1/09 11:11 AM, Ioan Raicu wrote:
> > Hi Rinku,
> > It looks like I am not going to be able to make the meeting tomorrow.
> > On Friday, another interview opportunity came up, and the only open slot
> > for the next 2 weeks was this Monday. Sorry about the short notice. Go
> > ahead and meet without me, and I'll catch up with what was discussed at
> > the meeting from Mike.
> >
> > Thanks,
> > Ioan
> >
> > Michael Wilde wrote:
> >> Zhao, Allan, you're welcome to join this discussion, at Argonne Mar 2,
> >> or by phone.
> >>
> >> - Mike
> >>
> >>
> >> On 2/18/09 10:30 PM, Ian Foster wrote:
> >>> Hi,
> >>>
> >>> This sounds like a really fun project. Maybe we should involve Zhao
> >>> and Allen as well, given that Ioan has (sadly) graduated, and will
> >>> leave us?
> >>> I'd love to participate, I will need to do so by phone--could we do
> >>> that? I'll just listen in, and see what I can learn.
> >>>
> >>> Ian.
> >>>
> >>>
> >>> On Feb 18, 2009, at 8:46 PM, Ioan Raicu wrote:
> >>>
> >>>> Great!
> >>>>
> >>>> I added Ian as a cc, maybe he wants to come to this meeting as well.
> >>>> Ian, the original message from Pete was:
> >>>>> Ioan and Mike,
> >>>>>
> >>>>> The CIFTS project is a DOE project to provide a "fault tolerant
> >>>>> backplane". I'm the PI of the project which involved ORNL, LBL,
> >>>>> IU, Ohio State, and UTK. Below is a suggestion to hook CIFTS to
> >>>>> Falkon, so faults could be monitored. Rinku (on the cc: line) is
> >>>>> the lead developer for CIFTS. Maybe when one of you is on campus
> >>>>> (ANL) you can meet with Rinku, and brainstorm if there is any way
> >>>>> to link the two systems efficiently. Email below is from an ORNL
> >>>>> participant in the CIFTS framework.
> >>>>>
> >>>>> -Pete
> >>>> The meeting is scheduled with Rinku, Mike, Pete (?), and I for March
> >>>> 2nd, at 11AM in Rinku's office (ANL, D-231 in the MCS building).
> >>>>
> >>>> Ioan
> >>>>
> >>>> Rinku Gupta wrote:
> >>>>> We can meet at my office (D-231 in the MCS building) and then sneak
> >>>>> into Pete's room, if it is empty.
> >>>>>
> >>>>> Rinku
> >>>>>
> >>>>> ----- "Ioan Raicu" wrote:
> >>>>>
> >>>>>> Works for me! I assume we are meeting at ANL. Whose office are we
> >>>>>> meeting in?
> >>>>>>
> >>>>>> Ioan
> >>>>>>
> >>>>>> Rinku Gupta wrote:
> >>>>>>
> >>>>>> Based on everyones availability, how does 11:00am on March 2nd sound?
> >>>>>>
> >>>>>> Thanks
> >>>>>> Rinku
> >>>>>>
> >>>>>> ----- "Michael Wilde" wrote:
> >>>>>>
> >>>>>> Rinku, Ioan,
> >>>>>>
> >>>>>> I can do Thu Feb 26 10:30 (I assume you meant AM not PM).
> >>>>>>
> >>>>>> But if Rinku is just arriving back in the US that morning, it seems
> >>>>>> better to postpone to the week after.
> >>>>>>
> >>>>>> I can be at Argonne any time week of March 2. Mornings are free,
> >>>>>> Mon-Thu are best.
> >>>>>>
> >>>>>> Can we tentatively then meet at 11AM Mon Mar 2?
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Mike
> >>>>>>
> >>>>>> On 2/18/09 9:37 AM, Ioan Raicu wrote:
> >>>>>>
> >>>>>> Hi Rinku,
> >>>>>> Next Thursday (February 26th) at 10:30AM would work for me. If we
> >>>>>> need to meet the following week, I could meet Monday (March 2nd) and
> >>>>>> Thursday (March 5th) any time.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Ioan
> >>>>>>
> >>>>>> Rinku Gupta wrote:
> >>>>>>
> >>>>>> Hi Michael, Ioan
> >>>>>>
> >>>>>> I am currently on travel and will arrive back to the USA only
> >>>>>> Thursday (Feb 26th) early morning. Will you be available anytime the
> >>>>>> week after next? If not, then we can try to schedule a meeting
> >>>>>> sometime around 10:30/11pm next Thursday at ANL.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Rinku
> >>>>>>
> >>>>>> ----- "Ioan Raicu" wrote:
> >>>>>>
> >>>>>> Hi Rinku,
> >>>>>> I can meet next week on Wednesday any time, and Thursday morning
> >>>>>> before noon, as I have a flight to catch early afternoon from O'Hare.
> >>>>>> I can meet either at UC or ANL. Let me know what works best for
> >>>>>> everyone.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ioan
> >>>>>>
> >>>>>> Michael Wilde wrote:
> >>>>>>
> >>>>>> Hi All,
> >>>>>>
> >>>>>> Rinku, lets set up a meeting for next week to discuss. I can meet
> >>>>>> Wed of Thu, at Argonne or UChicago.
> >>>>>>
> >>>>>> Do either of those dates work for you, and which place is best?
> >>>>>>
> >>>>>> In the meantime I'll read up on CIFTS at
> >>>>>> http://www.mcs.anl.gov/research/cifts/docs/index.php and the wiki
> >>>>>> that this refers to.
> >>>>>>
> >>>>>> If you have any other docs we should read, please send them.
> >>>>>>
> >>>>>> Thanks and regards,
> >>>>>>
> >>>>>> Mike
> >>>>>>
> >>>>>> On 2/17/09 12:42 PM, Pete Beckman wrote:
> >>>>>>
> >>>>>> Ioan and Mike,
> >>>>>>
> >>>>>> The CIFTS project is a DOE project to provide a "fault tolerant
> >>>>>> backplane". I'm the PI of the project which involved ORNL, LBL,
> >>>>>> IU, Ohio State, and UTK. Below is a suggestion to hook CIFTS to
> >>>>>> Falkon, so faults could be monitored. Rinku (on the cc: line) is the
> >>>>>> lead developer for CIFTS. Maybe when one of you is on campus (ANL)
> >>>>>> you can meet with Rinku, and brainstorm if there is any way to link
> >>>>>> the two systems efficiently. Email below is from an ORNL participant
> >>>>>> in the CIFTS framework.
> >>>>>>
> >>>>>> -Pete
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>> From: bernholdtde at ornl.gov
> >>>>>> Date: February 12, 2009 10:29:47 AM CST
> >>>>>> To: cifts at googlegroups.com
> >>>>>> Cc: bernholdtde at ornl.gov
> >>>>>> Subject: Fault tolerance in "many task computing"?
> >>>>>> Reply-To: cifts at googlegroups.com
> >>>>>>
> >>>>>> Pete (and other ANL folks),
> >>>>>>
> >>>>>> I recently read the SC08 paper on many task computing on which you're
> >>>>>> a co-author. (
> >>>>>> http://portal.acm.org/citation.cfm?doid=1413370.1413393
> >>>>>> )
> >>>>>>
> >>>>>> I wonder if it would be viable to build a CIFTS demonstration
> >>>>>> scenario around the software system described in this paper?
> >>>>>>
> >>>>>> In the paper, there's a paragraph discussing reliability that
> >>>>>> discusses some of the issues at a high level. It strikes me as both
> >>>>>> interesting and challenging because you have both system components
> >>>>>> (i.e. Cobalt) and multiple user-space components (Falken, Swift,
> >>>>>> application tasks) interacting.
> >>>
> >>
> >
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Mon Mar 2 18:10:11 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 02 Mar 2009 18:10:11 -0600
Subject: [Swift-devel] Re: Fault tolerance in "many task computing"?
In-Reply-To: <1236037842.2575.7.camel@localhost>
References: <21662876.222571235010770270.JavaMail.root@zimbra>
	<499CC801.3090304@cs.uchicago.edu>
	<25796E93-7DB0-44FF-8884-0201B4EC620F@anl.gov>
	<499CEF6C.1020700@mcs.anl.gov>
	<49AAC1DF.7090502@cs.uchicago.edu>
	<49AC6B39.8010407@mcs.anl.gov>
	<1236037842.2575.7.camel@localhost>
Message-ID: <49AC7563.2070209@mcs.anl.gov>

On 3/2/09 5:50 PM, Mihael Hategan wrote:
> Is there a Java library for FTB?

No, my understanding is that it's only C at the moment.

>
> What does FTB bring new to the table compared to a distributed messaging
> system?

Pete and Rinku (and a bit of reading) can certainly make a better case,
but this is my general impression:

To me, it seems simple, lightweight, and well-structured for pub-sub of
messages that pertain to system/application operation. I think it
defines a nice model of endpoints, priorities, message codes, etc. while
leaving a payload for the user to send message-specific details. Its
agents implement a spanning tree to route messages from distributed
components, so the user doesn't need to worry about this. I think it has
some redundancy in this delivery model. It seems to be designed to be
lightweight, so that it can handle high traffic (e.g. from errant system
components).

It just seems well-tailored to the log message routing job.

- Mike

>
> Mihael
>
> On Mon, 2009-03-02 at 17:26 -0600, Michael Wilde wrote:
>> All,
>>
>> Pete suggested we take a look at CIFTS's message logging system and
>> consider integrating it into our stack. Rinku gave me, Allan, and Zhao
>> and excellent overview and demo of the system. (Thanks, Rinku!)
>>
>> Here's my notes from this meeting. My intent is just to start a
>> discussion for longer-term consideration, not any near-term action.
>> (Although Jing Tie may find some of these concepts fruitful for er
>> troubleshooting research).
>>
>> CIFTS is the DOE SciDAC project "Coordinated and Improved Fault
>> Tolerance for High Performance Computing Systems", PI'd by Pete:
>> http://www.mcs.anl.gov/research/cifts/index.php
>>
>> It produces "FTB", a backplane for distributing logging information
>> within a distributed system:
>>
>> http://www.mcs.anl.gov/research/cifts/docs/files/ftb_developers_guide.pdf
>>
>> I pointed Rinku to Swift and Falkon info, as well as Netlogger and
>> activities related to it in the CEDPS project, and we have a joint
>> action item to understand the possible overlap and integration issues
>> and possibilities between these two systems.
>>
>> Netlogger and CEDPS info is at:
>>
>> http://acs.lbl.gov/NetLoggerWiki/index.php/Main_Page
>> http://dev.globus.org/wiki/Incubator/NetLogger
>> http://www.cedps.net/index.php/Troubleshooting#Work-in-progress
>>
>> I mentioned that we have invested a small bit of effort in integrating
>> Netlogger log publishing capabilities into Swift.
58th Street, Ryerson Hall >>>>>> Chicago, IL 60637 >>>>>> =================================================== >>>>>> Email: iraicu at cs.uchicago.edu >>>>>> Web: http://www.cs.uchicago.edu/~iraicu >>>>>> http://dev.globus.org/wiki/Incubator/Falkon >>>>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >>>>>> =================================================== >>>>>> =================================================== >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Mar 3 07:31:33 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 3 Mar 2009 13:31:33 +0000 (GMT) Subject: [Swift-devel] Re: Fault tolerance in "many task computing"? In-Reply-To: <49AC6B39.8010407@mcs.anl.gov> References: <21662876.222571235010770270.JavaMail.root@zimbra> <499CC801.3090304@cs.uchicago.edu> <25796E93-7DB0-44FF-8884-0201B4EC620F@anl.gov> <499CEF6C.1020700@mcs.anl.gov> <49AAC1DF.7090502@cs.uchicago.edu> <49AC6B39.8010407@mcs.anl.gov> Message-ID: Sounds interesting as a research project. How to hook different logging systems in is fairly well defined in the submit side (through log4j) and in the worker code (through a single bash function). Integration into the globus toolkit stack is something that ties in with CEDPS, not Swift. I would not be adverse to more pluggable logging mechanisms in the swift core code, although I am as always resistant to adding in unnecessary dependencies. 
--

From hategan at mcs.anl.gov Wed Mar 4 14:04:41 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 04 Mar 2009 14:04:41 -0600
Subject: [Swift-devel] different host CN expectations in gram and gridftp server
In-Reply-To: <1235510279.7676.0.camel@localhost>
References: <50b07b4b0902221621s52239835xf920e665e8cfce5f@mail.gmail.com> <1235361343.1273.6.camel@localhost> <50b07b4b0902230848v15e1394dh829fcb2bbf94a578@mail.gmail.com> <1235408072.10242.0.camel@localhost> <50b07b4b0902230856g18e11118v5f5a27d2d5eb7afc@mail.gmail.com> <50b07b4b0902241314t7ea23b28g832c70e26877c5f6@mail.gmail.com> <1235510279.7676.0.camel@localhost>
Message-ID: <1236197081.24081.2.camel@localhost>

http://bugzilla.mcs.anl.gov/globus/show_bug.cgi?id=6678

Use login3.ranger.tacc.utexas.edu instead of gatekeeper.ranger.tacc.teragrid.org if your submit host is a node on ranger. Though I'd recommend against running swift on one of ranger's head nodes.

On Tue, 2009-02-24 at 15:17 -0600, Mihael Hategan wrote:
> Ok. I'll look into this.
>
> On Tue, 2009-02-24 at 15:14 -0600, Allan Espinosa wrote:
> > I still get the same gram authentication error message:
> >
> > Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> > Cannot submit job
> > Caused by: org.globus.gram.GramException: Data transfer to the server
> > failed [Caused by: Authentication failed [Caused by: Operation
> > unauthorized (Mechanism level: [JGLOBUS-56] Authorization failed.
> > Expected "/CN=host/129.114.50.163" target but received
> > "/C=US/O=UTAustin/OU=TACC/CN=login3.ranger.tacc.utexas.edu")]]
> > 2009-02-24 15:12:07,215-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
> > jobid=hostname-8tx7p37j - Application exception: Cannot submit job
> >
> > This is using both the fork and sge job manager via gram2-only
> >
> > -Allan
> >
> > On Mon, Feb 23, 2009 at 10:58 AM, Ben Clifford wrote:
> > >
> > > If you use gram2 instead of coasters+gram2, what happens?
> > >

From benc at hawaga.org.uk Thu Mar 5 08:51:24 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Mar 2009 14:51:24 +0000 (GMT)
Subject: [Swift-devel] Open Provenance Model log exporter
Message-ID:

Part of Provenance Challenge 3 (PC3) is to export data into the open provenance model (OPM).

I've committed a crude exporter for that into provenancedb/ in the SVN in r2633.

I've also added details in the provenance.xml docbook page in that directory about how open provenance model vocabulary relates to Swift concepts.

My experience so far has been that OPM is quite in sympathy with the thoughts that I've had so far about Swift provenance. I'll be interested to see (in PC3) if any meaningful interop between implementations can be achieved.

At present, my thoughts tend towards us being able to export Swift data into some other provenance system, rather than importing other people's data from not-Swift into our database.

--

From foster at anl.gov Thu Mar 5 12:56:47 2009
From: foster at anl.gov (Ian Foster)
Date: Thu, 5 Mar 2009 12:56:47 -0600
Subject: [Swift-devel] Open Provenance Model log exporter
In-Reply-To:
References:
Message-ID: <73AE54CA-2F68-4AE4-BC7E-44F753C4A1B4@anl.gov>

Ben:

That sounds interesting. Are there any decent tools for analyzing/processing OPM logs that we can make use of?

One issue that I recall from past discussions was that Swift's functional model makes provenance in some ways "simpler" than in other systems. Do we lose that simplicity when we export to OPM?

Maybe we can discuss these issues when you get some experience with OPM.

Ian.

On Mar 5, 2009, at 8:51 AM, Ben Clifford wrote:
> Part of Provenance Challenge 3 (PC3) is to export data into the open
> provenance model (OPM).
>
> I've committed a crude exporter for that into provenancedb/ in the
> SVN in r2633
>
> I've also added details in the provenance.xml docbook page in that
> directory about how open provenance model vocabulary relates to Swift
> concepts.
>
> My experience so far has been that OPM is quite in sympathy with the
> thoughts that I've had so far about Swift provenance. I'll be
> interested to see (in PC3) if any meaningful interop between
> implementations can be achieved.
>
> At present, my thoughts tend towards us being able to export Swift
> data into some other provenance system, rather than importing other
> peoples data from not-Swift into our database.
>
> --

From benc at hawaga.org.uk Thu Mar 5 16:45:21 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 5 Mar 2009 22:45:21 +0000 (GMT)
Subject: [Swift-devel] Open Provenance Model log exporter
In-Reply-To: <73AE54CA-2F68-4AE4-BC7E-44F753C4A1B4@anl.gov>
References: <73AE54CA-2F68-4AE4-BC7E-44F753C4A1B4@anl.gov>
Message-ID:

On Thu, 5 Mar 2009, Ian Foster wrote:

> That sounds interesting. Are there any decent tools for
> analyzing/processing OPM logs that we can make use of?

Not that I'm aware of, as it's all very new - a (perhaps vain - cf CEDPS) hope in participating in PC3 is that we'll entangle ourselves with things that consume what we produce.

> One issue that I recall from past discussions was that Swift's
> functional model makes provenance in some ways "simpler" than in other
> systems. Do we lose that simplicity when we export to OPM?

I think that OPM doesn't make things any more complicated. What looks like a basic relation in the Swift provenance work so far pretty much maps into a basic relation in OPM. So I am (surprisingly?) not very cynical about the model.

--

From bugzilla-daemon at mcs.anl.gov Thu Mar 5 20:55:17 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 5 Mar 2009 20:55:17 -0600 (CST)
Subject: [Swift-devel] [Bug 61] semantics of [*] and multi-return-values need clarifying
In-Reply-To:
Message-ID: <20090306025517.CA7E3164CE@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=61

gabri.turcu at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|benc at hawaga.org.uk        |gabri.turcu at gmail.com
             Status|ASSIGNED                    |NEW

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.
You are the assignee for the bug, or are watching the assignee.

From wilde at mcs.anl.gov Thu Mar 5 21:33:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Mar 2009 21:33:47 -0600
Subject: [Swift-devel] Quick Start Guide examples are mangled
Message-ID: <49B0999B.7060706@mcs.anl.gov>

I don't know when this appeared, but some of the example text in the Swift Quick Start Guide at:

http://www.ci.uchicago.edu/swift/guides/quickstartguide.php#install

is getting rendered wrong - it's full of HTML tags and unreadable. Looks like:

> tar -xzvf swift-.tar.gz

And one of our new users was scratching his head saying, "wow, this is really rather cryptic!" ;) (seriously...)

- Mike

From wilde at mcs.anl.gov Thu Mar 5 21:36:41 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 05 Mar 2009 21:36:41 -0600
Subject: [Swift-devel] Quick Start Guide examples are mangled
In-Reply-To: <49B0999B.7060706@mcs.anl.gov>
References: <49B0999B.7060706@mcs.anl.gov>
Message-ID: <49B09A49.1010800@mcs.anl.gov>

Same with the Really Quick version:

http://www.ci.uchicago.edu/swift/guides/reallyquickstartguide.php

and it's all the example text, on both pages, that's gone bad.

On 3/5/09 9:33 PM, Michael Wilde wrote:
> I dont know when this appeared, but some of the example text in the
> Swift Quick Start Guide at:
>
> http://www.ci.uchicago.edu/swift/guides/quickstartguide.php#install
>
> Is getting rendered wrong - its full of html tags and unreadable. Looks
> like:
>
> class="command">tar -xzvf
> swift-.tar.gz
>
> And one of our new users was scratching his head saying, "wow, this is
> really rather cryptic!" ;) (seriously...)
>
> - Mike

From benc at hawaga.org.uk Fri Mar 6 02:41:11 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Mar 2009 08:41:11 +0000 (GMT)
Subject: [Swift-devel] Quick Start Guide examples are mangled
In-Reply-To: <49B0999B.7060706@mcs.anl.gov>
References: <49B0999B.7060706@mcs.anl.gov>
Message-ID:

On Thu, 5 Mar 2009, Michael Wilde wrote:

> I dont know when this appeared, but some of the example text in the Swift
> Quick Start Guide at:
>
> http://www.ci.uchicago.edu/swift/guides/quickstartguide.php#install
>
> Is getting rendered wrong - its full of html tags and unreadable. Looks like:

I changed a style sheet there the other day, which likely caused it. Oops.

--

From benc at hawaga.org.uk Fri Mar 6 07:42:32 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 6 Mar 2009 13:42:32 +0000 (GMT)
Subject: [Swift-devel] Quick Start Guide examples are mangled
In-Reply-To:
References: <49B0999B.7060706@mcs.anl.gov>
Message-ID:

On Fri, 6 Mar 2009, Ben Clifford wrote:

> > I dont know when this appeared, but some of the example text in the Swift
> > Quick Start Guide at:
> >
> > http://www.ci.uchicago.edu/swift/guides/quickstartguide.php#install
> >
> > Is getting rendered wrong - its full of html tags and unreadable. Looks like:
>
> I changed a style sheet there the other day, which likely caused it. oops.

fixed in r2646:

Change use of elements for console interactions into elements. elements undergo magic syntax highlighting under the swiftsh_html.xsl style sheet that the quickstart guides were switched to in r2614

--

From bugzilla-daemon at mcs.anl.gov Fri Mar 6 18:34:07 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Fri, 6 Mar 2009 18:34:07 -0600 (CST)
Subject: [Swift-devel] [Bug 181] New: Poor error message for sites.xml syntax error
Message-ID:

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=181

Summary: Poor error message for sites.xml syntax error
Product: Swift
Version: unspecified
Platform: All
OS/Version: Linux
Status: NEW
Severity: minor
Priority: P4
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov

A missing / on a /> closing bracket on the tag below yields a cryptic (and duplicated) error message:

fast 05:00:00 /home/wilde/swiftwork

Gives:

Execution failed:
Could not load file teraport.xml: com.thoughtworks.xstream.converters.ConversionException: : end tag name must match start tag name from line 5 (position: TEXT seen ...\n... @8:8) : : end tag name must match start tag name from line 5 (position: TEXT seen ...\n...
@8:8)

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug, or are watching the reporter.

From wilde at mcs.anl.gov Fri Mar 6 18:45:37 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 06 Mar 2009 18:45:37 -0600
Subject: [Swift-devel] PBS time-exceeds-queue conflict yields cryptic error
Message-ID: <49B1C3B1.9000309@mcs.anl.gov>

A low prio issue:

When I ask for more time than the selected PBS queue allows, I get a cryptic error. The fact that this condition yields a PBS error is known and has been discussed on the list.

Is it tracked as bug 133, or does that refer to exceeding the allotted time at runtime? If so, I can add this note; else I can file a new bug.

In my case, I gave:

fast 05:00:00 /home/wilde/swiftwork

My error was asking for 5 hours (05:00:00) instead of 5 minutes. I got the unhelpful error:

tp$ swift -tc.file tc.data -sites.file teraport.xml floop.swift
Swift svn swift-r2631 cog-r2306

RunID: 20090306-1822-cerzn3y8
Progress:
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/0 on teraport
Execution failed:
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/r on teraport
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/y on teraport
Progress: Submitting:7 Failed:3
Exception in echo:
Arguments: [42]
Host: teraport
Directory: floop-20090306-1822-cerzn3y8/jobs/0/echo-0hikdk7j
stderr.txt:
stdout.txt:
----
Caused by:
Cannot submit job: Could not submit job (qsub reported an exit code of 188). no error output
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/s on teraport
Failed to transfer wrapper log from floop-20090306-1822-cerzn3y8/info/q on teraport
tp$ fg

From hategan at mcs.anl.gov Fri Mar 6 21:24:06 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Mar 2009 21:24:06 -0600
Subject: [Swift-devel] PBS time-exceeds-queue conflict yields cryptic error
In-Reply-To: <49B1C3B1.9000309@mcs.anl.gov>
References: <49B1C3B1.9000309@mcs.anl.gov>
Message-ID: <1236396246.14742.4.camel@localhost>

On Fri, 2009-03-06 at 18:45 -0600, Michael Wilde wrote:
> A low prio issue:
>
> When I ask for more time than the selected PBS queue allows, I get a
> cryptic error. The fact that this condition yields a PBS error is known
> and has been discussed on the list.

On a quick glance, I couldn't find a list of qsub exit codes and their meanings.

So I'm thinking whether it's reasonable to assume some portability for them (assuming I can find out from the Torque sources what they are).

From hategan at mcs.anl.gov Fri Mar 6 21:33:47 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 06 Mar 2009 21:33:47 -0600
Subject: [Swift-devel] PBS time-exceeds-queue conflict yields cryptic error
In-Reply-To: <1236396246.14742.4.camel@localhost>
References: <49B1C3B1.9000309@mcs.anl.gov> <1236396246.14742.4.camel@localhost>
Message-ID: <1236396827.14742.6.camel@localhost>

On Fri, 2009-03-06 at 21:24 -0600, Mihael Hategan wrote:
> On Fri, 2009-03-06 at 18:45 -0600, Michael Wilde wrote:
> > A low prio issue:
> >
> > When I ask for more time than the selected PBS queue allows, I get a
> > cryptic error. The fact that this condition yields a PBS error is known
> > and has been discussed on the list.
>
> On a quick glance, I couldn't find a list of qsub exit codes and their
> meanings.
>
> So I'm thinking whether it's reasonable to assume some portability for
> them (assuming I can find out from the Torque sources what they are).

There's also the possibility that stderr from qsub isn't displayed properly. Could you add

log4j.logger.org.globus.cog.abstraction.impl=DEBUG

to etc/log4j.properties, run it again, and send me the log?

Mihael

From benc at hawaga.org.uk Sat Mar 7 03:49:01 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 7 Mar 2009 09:49:01 +0000 (GMT)
Subject: [Swift-devel] PBS time-exceeds-queue conflict yields cryptic error
In-Reply-To: <1236396246.14742.4.camel@localhost>
References: <49B1C3B1.9000309@mcs.anl.gov> <1236396246.14742.4.camel@localhost>
Message-ID:

On Fri, 6 Mar 2009, Mihael Hategan wrote:

> On a quick glance, I couldn't find a list of qsub exit codes and their
> meanings.

When I've been googling round previously for such, I've not had much luck.

> So I'm thinking whether it's reasonable to assume some portability for
> them (assuming I can find out from the Torque sources what they are).

leading to delightful error messages like "this *might* be a walltime violation", which I suppose is better than no error at all.

--

From bugzilla-daemon at mcs.anl.gov Sun Mar 8 17:13:15 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 8 Mar 2009 17:13:15 -0500 (CDT)
Subject: [Swift-devel] [Bug 109] Change default max heap size to 256M
In-Reply-To:
Message-ID: <20090308221315.3A3CD164CE@foxtrot.mcs.anl.gov>

http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=109

hategan at mcs.anl.gov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from hategan at mcs.anl.gov 2009-03-08 17:13 -------
Fixed in r2647.

--
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
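
The qsub exit-code idea from the PBS thread above (prefer qsub's own stderr, and fall back to a cautious hint keyed on the exit code) could be sketched roughly as follows. This is a minimal Python illustration, not code from Swift or the CoG PBS provider. The function name and the hint table are hypothetical, and the single table entry is based only on the one report in this thread (exit code 188 when the requested walltime exceeded the queue limit), not on Torque documentation.

```python
# Hypothetical hint table. The only entry comes from the single report in
# this thread (exit code 188 after requesting more walltime than the queue
# allowed on teraport). It is NOT taken from Torque documentation, and exit
# codes may not be portable across PBS implementations.
HYPOTHETICAL_QSUB_HINTS = {
    188: "this *might* be a walltime or queue-limit violation",
}


def describe_qsub_failure(exit_code, stderr_text=""):
    """Build a submission error message from qsub's exit code and stderr."""
    parts = [
        "Could not submit job (qsub reported an exit code of %d)." % exit_code
    ]
    stderr_text = stderr_text.strip()
    # Prefer whatever qsub itself printed; it is more reliable than a guess.
    if stderr_text:
        parts.append("qsub stderr: " + stderr_text)
    else:
        parts.append("no error output")
    # Only add a hint when the table has one, and keep it clearly tentative.
    hint = HYPOTHETICAL_QSUB_HINTS.get(exit_code)
    if hint:
        parts.append("Hint: " + hint)
    return " ".join(parts)


if __name__ == "__main__":
    print(describe_qsub_failure(188))
```

Keeping the hint separate from the raw exit code and stderr means a wrong guess in the table never hides the information qsub actually returned.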

From benc at hawaga.org.uk Mon Mar 9 04:56:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 9 Mar 2009 09:56:05 +0000 (GMT)
Subject: [Swift-devel] #caller in karajan (fwd)
Message-ID:

yet again I demonstrate my ability to typo Chicago...

---------- Forwarded message ----------
Date: Mon, 9 Mar 2009 09:53:44 +0000 (GMT)
From: Ben Clifford
To: hategan at mcs.anl.gov
Cc: swift-devel at ci.uchciago.edu
Subject: #caller in karajan

This failed in the daily tests

http://nmi-s005.cs.wisc.edu:80/rundir/benc/2009/03/benc_nmi-s005.batlab.cs.wisc.edu_1236561792_24345/userdir/nmi:x86_64_rhas_3/remote_task.out

with an error that I haven't seen before that I think is something karajan related.

The nmi platform description is x86_64_rhas_3

The same version has worked on other platforms.

I haven't really investigated it for reproducibility. The same test passed yesterday, so it has the air of some race condition.

Running test 065-delay
Swift svn which: no svn in (/prereq/java-1.4.2_05/bin:/prereq/apache-ant-1.7.0/bin:/bin:/usr/bin:/home/condor/execute/dir_27780/userdir)
which: no svn in (/prereq/java-1.4.2_05/bin:/prereq/apache-ant-1.7.0/bin:/bin:/usr/bin:/home/condor/execute/dir_27780/userdir)
swift-unknown cog-unknown

RunID: 20090309-0236-rv96a6pg
Uncaught exception: java.util.EmptyStackException in org.globus.cog.karajan.workflow.nodes.Sequential @ execute-default.k, line: 1
java.util.EmptyStackException
        at org.globus.cog.karajan.stack.LinkedStack.leave(LinkedStack.java:54)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:127)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:151)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:366)
        at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
        at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
        at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
Event was No #caller found on stack for sys:element @ execute-default.k, line: 1
sys:element @ execute-default.k, line: 1
Exception is: java.util.EmptyStackException
Cannot fail element
java.util.EmptyStackException
        at org.globus.cog.karajan.stack.LinkedStack.leave(LinkedStack.java:54)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:127)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:147)
        at org.globus.cog.karajan.workflow.events.EventBus.failElement(EventBus.java:189)
        at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:155)
        at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
        at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
SWIFT RETURN CODE NON-ZERO - test 065-delay.swift

From zhaozhang at uchicago.edu Mon Mar 9 14:03:46 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 09 Mar 2009 14:03:46 -0500
Subject: [Swift-devel] how to write a data provider
Message-ID: <49B56812.2020308@uchicago.edu>

Hi,

Is there any documentation online, from which I could learn how to write a data provider? Thanks

best regards
zhao

From aespinosa at cs.uchicago.edu Mon Mar 9 14:12:36 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 9 Mar 2009 14:12:36 -0500
Subject: [Swift-devel] how to write a data provider
In-Reply-To: <49B56812.2020308@uchicago.edu>
References: <49B56812.2020308@uchicago.edu>
Message-ID: <50b07b4b0903091212w159c8fc9r92466348001037a5@mail.gmail.com>

Hi Zhao,

I am also not familiar with how to write one but I'm sort of starting to have an idea.

http://wiki.cogkit.org/wiki/Java_CoG_Kit_Abstraction_Guide

Looking at the SSH provider, my guess is that these are the entry points from Swift (or CoG in general) to the provider:

*TaskHandler.java
*TaskImpl.java

Let's set up a reading/study group to get to know the internals of this.

-Allan

On Mon, Mar 9, 2009 at 2:03 PM, Zhao Zhang wrote:
> Hi,
>
> Is there any documentation online, from which I could learn how to write a
> data provider? Thanks
>
> best regards
> zhao

From hategan at mcs.anl.gov Mon Mar 9 15:00:25 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Mar 2009 15:00:25 -0500
Subject: [Swift-devel] how to write a data provider
In-Reply-To: <50b07b4b0903091212w159c8fc9r92466348001037a5@mail.gmail.com>
References: <49B56812.2020308@uchicago.edu> <50b07b4b0903091212w159c8fc9r92466348001037a5@mail.gmail.com>
Message-ID: <1236628825.15244.4.camel@localhost>

On Mon, 2009-03-09 at 14:12 -0500, Allan Espinosa wrote:
> HI Zhao,
>
> I am also not familiar on how to write one but I'm sort of starting to
> have an idea.
>
> http://wiki.cogkit.org/wiki/Java_CoG_Kit_Abstraction_Guide

Yes, that's useful reading material.

>
> Looking at the SSH provider, my guess is that these are the entry
> points from Swift (or CoG in general) to the provider:
>
> *TaskHandler.java
> *TaskImpl.java
>

For code examples, take a look at org.globus.cog.abstraction.impl.file.local.FileResourceImpl.java in provider-local.

I'd start by making a copy of provider-local, updating project.properties and resources/cog-provider.properties, and then hacking at FileResourceImpl.java.

From hategan at mcs.anl.gov Mon Mar 9 15:25:00 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 09 Mar 2009 15:25:00 -0500
Subject: [Swift-devel] scalability updates
Message-ID: <1236630300.16421.17.camel@localhost>

I've committed two main things today:

1. A foreach thread limiting patch, which limits the maximum number of concurrent threads that a foreach can have at any time. The default is at 1024, but it is configurable in swift.properties. For scripts whose main memory hog is large numbers of iterations in foreach loops, this should allow things to run with considerably less memory.

2. A lazy range function (the [x:y] operator). The previous one was silly. Simply writing [0:1000000] would cause swift to run out of memory because it was trying to create a swift array with 1000000 elements before running a single iteration on it.

In principle, these two would roughly translate into the following:
- a likely demise to several of our swift-runs-out-of-memory scenarios. Though there's still a bit to go here, because arrays in general in swift keep too many things in memory.
- skenny type scripts (foreach i in [1:65535] { doStuff(); }) will not see that 5 minute delay before the first job is submitted.
- Ben's provenance stuff may break if it relies on items returned by range() reporting a path-from-root containing the array itself (as array elements are roots themselves).

From wilde at mcs.anl.gov Tue Mar 10 00:01:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2009 00:01:53 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236630300.16421.17.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
Message-ID: <49B5F441.7060502@mcs.anl.gov>

Very nice! These look very promising. One interesting test would be doing a million localhost echos in a simple foreach loop on a range-initialized array, and looking at the memory needs.

These 2 enhancements seem to pave the way to making streamed (or on-demand) mappers useful. For those, I think we need a mapper paradigm design adjustment discussion.

But I think the next thing to work on in scalability would be the Condor-G provider, so we can run large coaster runs with more concurrency. The multi-cpu coaster allocator might be a workaround to re-consider if a condor-G provider is too far off.

Assuming (or when) there's agreement that this is the best solution for coaster scalability, I'd like to propose that as the next big task on your to-do list.

On 3/9/09 3:25 PM, Mihael Hategan wrote:
> I've committed two main things today:
>
> 1. A foreach thread limiting patch, which limits the maximum number of
> concurrent threads that a foreach can have at any time. The default is
> at 1024, but it is configurable in swift.properties. For scripts whose
> main memory hog is large numbers of iterations in foreach loops, this
> should allow things to run with considerably less memory.
>
> 2. A lazy range function (the [x:y] operator). The previous one was
> silly. Simply writing [0:1000000] would cause swift to run out of memory
> because it was trying to create a swift array with 1000000 elements
> before running a single iteration on it.
>
> In principle, these two would roughly translate into the following:
> - a likely demise to several of our swift-runs-out-of-memory scenarios.
> Though there's still a bit to go here, because arrays in general in
> swift keep too many things in memory.
> - skenny type scripts (foreach i in [1:65535] { doStuff(); }) will not
> see that 5 minute delay before the first job is submitted.
> - Ben's provenance stuff may break if it relies on items returned by
> range() reporting a path-from-root containing the array itself (as array
> elements are roots themselves).

From hategan at mcs.anl.gov Tue Mar 10 00:07:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 10 Mar 2009 00:07:58 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <49B5F441.7060502@mcs.anl.gov>
References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov>
Message-ID: <1236661678.26700.3.camel@localhost>

On Tue, 2009-03-10 at 00:01 -0500, Michael Wilde wrote:
> Very nice! These look very promising. One interesting test would be
> doing a million localhost echos in a simple foreach loop on a
> range-initialized array, and looking at the memory needs.
>
> These 2 enhancements seem to pave the way to making streamed (or
> on-demand) mappers useful. For those, I think we need a mapper paradigm
> design adjustment discussion.

I think we thought out the mappers to be procedural (i.e. not hold state) from the beginning, so the problem does not seem to be in the design of the mappers. Rather, it's the implementation of some of the mappers and the implementation of swift data structures (parts of an array cannot be garbage-collected independently).

>
> But I think the next thing to work on in scalability would be the
> Condor-G provider, so we can run large coaster runs with more
> concurrency. The multi-cpu coaster allocator might be a workaround to
> re-consider if a condor-G provider is too far off.
>
> Assuming (or when) there's agreement that this is the best solution for
> coaster scalability, I'd like to propose that as the next big task on
> your to-do list.

Yes. Sounds reasonable. Now, only if I could find a condor/condor-g installation that I have access to...
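
The two scalability changes described in this thread (a lazy [x:y] range and a cap on concurrent foreach threads) can be illustrated with a small Python sketch. It is not Swift's or Karajan's implementation; `lazy_range` and `bounded_foreach` are hypothetical names, the inclusive upper bound is an assumption about the [x:y] operator, and the 1024 default simply mirrors the figure quoted above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def lazy_range(first, last):
    """Yield first..last one value at a time instead of building an array.

    The inclusive upper bound is an assumption about Swift's [x:y] operator.
    """
    i = first
    while i <= last:
        yield i
        i += 1


def bounded_foreach(items, body, max_threads=1024):
    """Run body(item) for every item with at most max_threads in flight.

    The next item is pulled from the iterator only once a slot is free, so
    a lazy source such as lazy_range() is never fully materialised.
    Results are returned in completion order, not input order.
    """
    slots = threading.BoundedSemaphore(max_threads)
    results = []
    results_lock = threading.Lock()

    def run(item):
        try:
            value = body(item)
            with results_lock:
                results.append(value)
        finally:
            # Free the slot even if body() raised, so the loop cannot stall.
            slots.release()

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for item in items:
            slots.acquire()  # blocks while max_threads iterations are running
            pool.submit(run, item)
    return results


if __name__ == "__main__":
    doubled = bounded_foreach(lazy_range(1, 20), lambda i: i * 2, max_threads=4)
    print(sorted(doubled))
```

Gating the loop on the semaphore is what keeps memory bounded in this sketch: without it, every iteration would be queued up front, which is the eager behaviour the lazy range was meant to avoid.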

From wilde at mcs.anl.gov Tue Mar 10 00:18:31 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2009 00:18:31 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236661678.26700.3.camel@localhost>
References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost>
Message-ID: <49B5F827.2050003@mcs.anl.gov>

On 3/10/09 12:07 AM, Mihael Hategan wrote:
> On Tue, 2009-03-10 at 00:01 -0500, Michael Wilde wrote:
>> Very nice! These look very promising. One interesting test would be
>> doing a million localhost echos in a simple foreach loop on a
>> range-initialized array, and looking at the memory needs.
>>
>> These 2 enhancements seem to pave the way to making streamed (or
>> on-demand) mappers useful. For those, I think we need a mapper paradigm
>> design adjustment discussion.
>
> I think we thought out the mappers to be procedural (i.e. not hold
> state) from the beginning, so the problem does not seem to be in the
> design of the mappers. Rather, it's the implementation of some of the
> mappers and the implementation of swift data structures (parts of an
> array cannot be garbage-collected independently).

By that I meant the functionality issues in mappers (i.e., ability to map various user patterns easily, like the things I ran into in the OOPS app). Other than the streaming thing, I wasn't concerned with the performance of the mappers.

>
>> But I think the next thing to work on in scalability would be the
>> Condor-G provider, so we can run large coaster runs with more
>> concurrency. The multi-cpu coaster allocator might be a workaround to
>> re-consider if a condor-G provider is too far off.
>>
>> Assuming (or when) there's agreement that this is the best solution for
>> coaster scalability, I'd like to propose that as the next big task on
>> your to-do list.
>
> Yes. Sounds reasonable. Now, only if I could find a condor/condor-g
> installation that I have access to...

Great, we'll find you one. The local ITB site that Suchandra maintains is a good choice, and I think we can get Greg to install the client package on TeraGrid if it's not already there, or Ti on communicado.

From benc at hawaga.org.uk Tue Mar 10 02:55:18 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Mar 2009 07:55:18 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236630300.16421.17.camel@localhost>
References: <1236630300.16421.17.camel@localhost>
Message-ID:

On Mon, 9 Mar 2009, Mihael Hategan wrote:

> - Ben's provenance stuff may break if it relies on items returned by
> range() reporting a path-from-root containing the array itself (as array
> elements are roots themselves).

Perhaps. But the path-from-root stuff leaves a slightly bad taste in my mouth anyway and is I think broken in other places due to aliasing (eg this code fragment:

int i = 3;
j = [i];

is a little ambiguous in what should be the root of j[1])

--

From benc at hawaga.org.uk Tue Mar 10 02:59:45 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 10 Mar 2009 07:59:45 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1236661678.26700.3.camel@localhost>
References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost>
Message-ID:

On Tue, 10 Mar 2009, Mihael Hategan wrote:

> Yes. Sounds reasonable. Now, only if I could find a condor/condor-g
> installation that I have access to...

communicado has Condor running on it. That should be enough.

--

From wilde at mcs.anl.gov Tue Mar 10 08:28:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2009 08:28:53 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To:
References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost>
Message-ID: <49B66B15.3010908@mcs.anl.gov>

Yes, provided it's configured to send jobs to OSG and TG, and that's working. It *should* be, to support the OSG hands-on tutorial, but has occasional problems as I recall.

On 3/10/09 2:59 AM, Ben Clifford wrote:
> On Tue, 10 Mar 2009, Mihael Hategan wrote:
>
>> Yes. Sounds reasonable. Now, only if I could find a condor/condor-g
>> installation that I have access to...
>
> communicado has Condor running on it. That should be enough.
>

From wilde at mcs.anl.gov Tue Mar 10 08:35:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 10 Mar 2009 08:35:53 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <49B66B15.3010908@mcs.anl.gov>
References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost> <49B66B15.3010908@mcs.anl.gov>
Message-ID: <49B66CB9.4010103@mcs.anl.gov>

But yes, I agree: communicado is the best place to test from. This is a good motivation to keep its Condor in working state. Can you test it and keep on maintaining it? Should this be your job, Ben, or CI Support's?

On 3/10/09 8:28 AM, Michael Wilde wrote:
> Yes, provided its configured to send jobs to OSG and TG, and thats
> working. It *should* be, to support the OSG hands-on tutorial, but has
> occasional problems as I recall.
>
> On 3/10/09 2:59 AM, Ben Clifford wrote:
>> On Tue, 10 Mar 2009, Mihael Hategan wrote:
>>
>>> Yes. Sounds reasonable. Now, only if I could find a condor/condor-g
>>> installation that I have access to...
>>
>> communicado has Condor running on it. That should be enough.
From benc at hawaga.org.uk Tue Mar 10 08:37:43 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Mar 2009 13:37:43 +0000 (GMT) Subject: [Swift-devel] scalability updates In-Reply-To: <49B66CB9.4010103@mcs.anl.gov> References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost> <49B66B15.3010908@mcs.anl.gov> <49B66CB9.4010103@mcs.anl.gov> Message-ID: On Tue, 10 Mar 2009, Michael Wilde wrote: > But yes, I agree: communicado is the best place to test from. This is a good > motivation to keep its Condor in working state. Can you test it and keep on > maintaining it? Should this be your job, Ben, or CI Support's? Maintenance of that is CI support's job. -- From benc at hawaga.org.uk Tue Mar 10 09:22:04 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 10 Mar 2009 14:22:04 +0000 (GMT) Subject: [Swift-devel] scalability updates In-Reply-To: <1236630300.16421.17.camel@localhost> References: <1236630300.16421.17.camel@localhost> Message-ID: I ran 066-many with 30000 jobs (10 x more than is in the SVN at the moment). This eventually died, and it is not entirely clear to me why. Throughout the run I was seeing null pointer exceptions being thrown. I also noticed a slowdown during the run - at the start more than 1000 jobs were waiting for site selection; towards the end this was down to about 980. This is on communicado. 
The log for that run is http://www.ci.uchicago.edu/~benc/tmp/066-many-20090310-0853-ezpsjt20.log -- From hategan at mcs.anl.gov Tue Mar 10 09:24:04 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Mar 2009 09:24:04 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: <49B5F827.2050003@mcs.anl.gov> References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov> <1236661678.26700.3.camel@localhost> <49B5F827.2050003@mcs.anl.gov> Message-ID: <1236695045.3324.0.camel@localhost> On Tue, 2009-03-10 at 00:18 -0500, Michael Wilde wrote: > > On 3/10/09 12:07 AM, Mihael Hategan wrote: > By that I meant the functionality issues in mappers (ie, ability to map > various user patterns easily, like the things I ran into in the OOPS > app). Other than the streaming thing, I wasnt concerned with the > performance of the mappers. My misunderstanding. Sorry. From hategan at mcs.anl.gov Tue Mar 10 09:27:12 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Mar 2009 09:27:12 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: <49B5F441.7060502@mcs.anl.gov> References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov> Message-ID: <1236695232.3324.4.camel@localhost> On Tue, 2009-03-10 at 00:01 -0500, Michael Wilde wrote: > Very nice! These look very promising. One interesting test would be > doing a million localhost echos in a simple foreach loop on a > range-initialized array, and looking at the memory needs. It depends. Should the echos return anything, and should the result be put in an array without being used, that won't work. A 1M swift integer array takes more than 300MB of memory. 
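For scale, the figure quoted above implies a per-element cost that can be estimated with simple arithmetic. The following is a rough sketch only: it assumes decimal megabytes and treats "more than 300MB" as exactly 300 MB, so the result is a lower bound. The function name `bytes_per_element` is made up for illustration and is not part of Swift.

```shell
#!/bin/sh
# Rough lower bound on memory used per array element.
# Assumption: 1 MB = 1000000 bytes.
# bytes_per_element is a hypothetical helper, not part of Swift.
bytes_per_element() {
    total_mb=$1    # total memory reported, in MB
    elements=$2    # number of array elements
    echo $(( total_mb * 1000000 / elements ))
}

# 300 MB spread over a 1M-element array
bytes_per_element 300 1000000
```

Under these assumptions each Swift integer costs at least about 300 bytes, which is why collecting the results of a million jobs in an array is expensive even before any job-related state is counted.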
From wilde at mcs.anl.gov Tue Mar 10 11:13:20 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 10 Mar 2009 11:13:20 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: <1236695232.3324.4.camel@localhost> References: <1236630300.16421.17.camel@localhost> <49B5F441.7060502@mcs.anl.gov> <1236695232.3324.4.camel@localhost> Message-ID: <49B691A0.50601@mcs.anl.gov> I think 1M echos would be a good milestone, even if it takes several GB of RAM. Communicado has 14GB total, so its a good place for such a test. I realize that it will take time to work up to that level. But even more important than 1M is just to know how the system scales, and document it in the user guide along with resource needs, whatever the level. On 3/10/09 9:27 AM, Mihael Hategan wrote: > On Tue, 2009-03-10 at 00:01 -0500, Michael Wilde wrote: >> Very nice! These look very promising. One interesting test would be >> doing a million localhost echos in a simple foreach loop on a >> range-initialized array, and looking at the memory needs. > > It depends. Should the echos return anything, and should the result be > put in an array without being used, that won't work. A 1M swift integer > array takes more than 300MB of memory. > From hategan at mcs.anl.gov Tue Mar 10 14:24:01 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 10 Mar 2009 14:24:01 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> Message-ID: <1236713041.22637.0.camel@localhost> Should be fixed in cog r2325 & swift r2677. On Tue, 2009-03-10 at 14:22 +0000, Ben Clifford wrote: > I ran 066-many with 30000 jobs (10 x more than is in the SVN at the > moment). > > This eventually died, with it not being entirely clear to me why. > > Throughout the run I was null pointer exceptions being thrown. > > I also noticed a slowdown during the run - at the start more than 1000 > jobs are waiting for site selection; towards the end this was down to > about 980. 
> > This is on communicado. > > The log for that run is > http://www.ci.uchicago.edu/~benc/tmp/066-many-20090310-0853-ezpsjt20.log > From benc at hawaga.org.uk Wed Mar 11 03:47:53 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 11 Mar 2009 08:47:53 +0000 (GMT) Subject: [Swift-devel] scalability updates In-Reply-To: <1236713041.22637.0.camel@localhost> References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> Message-ID: Here's a different looking one. Trying to run 066-many with 30000 iterations, in one run I got the below error within a couple of seconds; and in a subsequent run it happened after 20s or so. http://www.ci.uchicago.edu/~benc/tmp/066-many-20090311-0336-toznl2a0.log Full exception is in the log but in summary: > No named arguments on current frame > at org.globus.cog.karajan.arguments.Arg.getNamed(Arg.java:52) > at org.globus.cog.karajan.arguments.Arg.getValue0(Arg.java:66) > at org.globus.cog.karajan.arguments.Arg.getValue(Arg.java:62) > at > org.globus.cog.karajan.arguments.Arg$Positional.getValue(Arg.java:131) > at org.griphyn.vdl.karajan.lib.Log.post(Log.java:71) I got what looks like its a similar error when I tried running with 3000 iterations, in http://www.ci.uchicago.edu/~benc/tmp/066-many-20090311-0343-tmpq92td.log > Execution failed: > Missing argument level -- From hategan at mcs.anl.gov Wed Mar 11 08:38:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Mar 2009 08:38:11 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> Message-ID: <1236778692.20470.0.camel@localhost> Yes, I saw those. Did you get this with pre or post r2677? On Wed, 2009-03-11 at 08:47 +0000, Ben Clifford wrote: > Here's a different looking one. > > Trying to run 066-many with 30000 iterations, in one run I got the below > error within a couple of seconds; and in a subsequent run it happened > after 20s or so. 
> > http://www.ci.uchicago.edu/~benc/tmp/066-many-20090311-0336-toznl2a0.log > > Full exception is in the log but in summary: > > > No named arguments on current frame > > at org.globus.cog.karajan.arguments.Arg.getNamed(Arg.java:52) > > at org.globus.cog.karajan.arguments.Arg.getValue0(Arg.java:66) > > at org.globus.cog.karajan.arguments.Arg.getValue(Arg.java:62) > > at > > org.globus.cog.karajan.arguments.Arg$Positional.getValue(Arg.java:131) > > at org.griphyn.vdl.karajan.lib.Log.post(Log.java:71) > > > I got what looks like its a similar error when I tried running with 3000 > iterations, in > http://www.ci.uchicago.edu/~benc/tmp/066-many-20090311-0343-tmpq92td.log > > > Execution failed: > > Missing argument level > > From benc at hawaga.org.uk Wed Mar 11 10:05:04 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 11 Mar 2009 15:05:04 +0000 (GMT) Subject: [Swift-devel] scalability updates In-Reply-To: <1236778692.20470.0.camel@localhost> References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> Message-ID: On Wed, 11 Mar 2009, Mihael Hategan wrote: > Yes, I saw those. Did you get this with pre or post r2677? 2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift modified locally) cog-r2325 I think I have occasionally seen this error in the past, but never enough to recreate it. -- From hategan at mcs.anl.gov Wed Mar 11 10:07:49 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Mar 2009 10:07:49 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> Message-ID: <1236784069.21789.0.camel@localhost> On Wed, 2009-03-11 at 15:05 +0000, Ben Clifford wrote: > On Wed, 11 Mar 2009, Mihael Hategan wrote: > > > Yes, I saw those. Did you get this with pre or post r2677? Hmm. I need to dig more. 
> > 2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift > modified locally) cog-r2325 > > I think I have occasionally seen this error in the past, but never enough > to recreate it. > From hategan at mcs.anl.gov Wed Mar 11 13:18:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Mar 2009 13:18:23 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> Message-ID: <1236795503.15465.1.camel@localhost> On Wed, 2009-03-11 at 15:05 +0000, Ben Clifford wrote: > On Wed, 11 Mar 2009, Mihael Hategan wrote: > > > Yes, I saw those. Did you get this with pre or post r2677? > > 2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift > modified locally) cog-r2325 > > I think I have occasionally seen this error in the past, but never enough > to recreate it. Right. It doesn't seem specific to the foreach limiting changes. Are you getting this on any machine I have access to? From hategan at mcs.anl.gov Wed Mar 11 15:57:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Mar 2009 15:57:11 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: <1236795503.15465.1.camel@localhost> References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> <1236795503.15465.1.camel@localhost> Message-ID: <1236805031.7892.0.camel@localhost> On Wed, 2009-03-11 at 13:18 -0500, Mihael Hategan wrote: > On Wed, 2009-03-11 at 15:05 +0000, Ben Clifford wrote: > > On Wed, 11 Mar 2009, Mihael Hategan wrote: > > > > > Yes, I saw those. Did you get this with pre or post r2677? > > > > 2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift > > modified locally) cog-r2325 > > > > I think I have occasionally seen this error in the past, but never enough > > to recreate it. > > Right. 
It doesn't seem specific to the foreach limiting changes. > > Are you getting this on any machine I have access to? Never mind. I can reproduce it (them). From hategan at mcs.anl.gov Wed Mar 11 21:20:04 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Mar 2009 21:20:04 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> Message-ID: <1236824404.12524.0.camel@localhost> I got 4 successful runs with cog r2326. On Wed, 2009-03-11 at 15:05 +0000, Ben Clifford wrote: > On Wed, 11 Mar 2009, Mihael Hategan wrote: > > > Yes, I saw those. Did you get this with pre or post r2677? > > 2009-03-11 03:43:54,179-0500 INFO unknown Swift svn swift-r2677 (swift > modified locally) cog-r2325 > > I think I have occasionally seen this error in the past, but never enough > to recreate it. > From benc at hawaga.org.uk Thu Mar 12 06:09:29 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 12 Mar 2009 11:09:29 +0000 (GMT) Subject: [Swift-devel] provenance 'processes' Message-ID: The provenance work that I've done so far links datasets (DSHandles) by saying they are inputs or outputs to app procedure invocations; it does not represent the processing of data by @functions or operators. Separately there is a containment graph to link DSHandles that represent arrays or structs with their members; in that graph there's no causal information - if you construct an array, and then extract a member, the representation in this graph is the same as if you construct the members, then use [] syntax to make an array of those members. OPM has a different representation of containment, explicitly representing 'processes' that construct collections or extract collections. 
Having mulled that over, and tried to do some other things, I think that the provenance representation in Swift should record @functions and operators (and probably composite procedures) in the same way that it records app procedure executions. Mapper parameters should also be recorded in some more sensible way. This is likely to generate much more provenance logging information, but I think will give decent information about every single DSHandle. I have a niggling fear that this will generate enough information to cause undesirable slowdown; but I think making provenance recording turn-off-and-onable can reduce that problem for people who don't care about provenance. -- From bugzilla-daemon at mcs.anl.gov Thu Mar 12 23:08:08 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 12 Mar 2009 23:08:08 -0500 (CDT) Subject: [Swift-devel] [Bug 182] New: Error messages summarized at end of Swift output should also be printed when they occur Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=182 Summary: Error messages summarized at end of Swift output should also be printed when they occur Product: Swift Version: unspecified Platform: PC OS/Version: All Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: wilde at mcs.anl.gov Based on the premise that we want most users to get most of the debugging info they need from just looking at Swift output, rather than the .log file, job execution error messages should be listed when they occur, rather than summarized at the end. This is especially true for long workflows. Currently, all the user sees in the output is the running success/failure count, e.g.: Progress: Submitted:1 Active:1 Failed:5 Finished successfully:29 But at the end, they see: Final status: Failed:6 Finished successfully:32 The following errors have occurred: 1. 
Application "render_round" failed (Job failed with an exit code of 254) Arguments: "output/T1af7/0001_0000.SecStr, viz/T1af7/T1af7.round.0001.result.png, T1af7, 1" Host: localhost Directory: oops8-20090312-2254-hrdvk4lg/jobs/r/render_round-rzcihu7j STDERR: STDOUT: -- These errors should also be listed when they occur. From bugzilla-daemon at mcs.anl.gov Thu Mar 12 23:16:10 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 12 Mar 2009 23:16:10 -0500 (CDT) Subject: [Swift-devel] [Bug 183] New: Print better error message when app executable is not found Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=183 Summary: Print better error message when app executable is not found Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: wilde at mcs.anl.gov Currently, when an app listed in tc.data doesn't exist on the site, one gets this error: (Job failed with an exit code of 254) Swift should generate an error that says exactly what happened, e.g.: Application render_round not found on site localhost at path "/home/wilde/oops/oops-r026/bin/render_round.sh". From zhaozhang at uchicago.edu Fri Mar 13 17:07:34 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 13 Mar 2009 17:07:34 -0500 Subject: [Swift-devel] How does swift know if a task is successful Message-ID: <49BAD926.1030607@uchicago.edu> Hi, All I have a question on how swift knows if a task is successful. 
In my case, I am using a status notification instead of a status file. So my question is: is this status notification the only thing swift is waiting for, or is swift also waiting for the output data to appear to say that a job is successful? Thanks. zhao From hategan at mcs.anl.gov Fri Mar 13 17:13:33 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Mar 2009 17:13:33 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <49BAD926.1030607@uchicago.edu> References: <49BAD926.1030607@uchicago.edu> Message-ID: <1236982413.13026.1.camel@localhost> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote: > Hi, All > > I have a question on how swift knows if a task is successful. > In my case, I am using a status notification instead of a status file. > > So my question is: is this status notification the only thing swift is > waiting for, or is swift also waiting for the output data to appear to > say that a job is successful? Once the job is done, swift will attempt to stage out all the files that it expects the job to have produced. Should one of those files not be there, there will be failures. From hategan at mcs.anl.gov Fri Mar 13 22:17:52 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Mar 2009 22:17:52 -0500 Subject: [Swift-devel] condor provider Message-ID: <1237000672.21050.7.camel@localhost> I committed an update to the local schedulers. This includes a bit of refactoring of the existing providers and the addition of a condor provider. The condor provider is not a globus-through-condor thing. It submits to the default condor queue in the vanilla (or mpi) universe. I've tested it on communicado with loads like 256 parallel jobs. Seems to behave. It needs more work, but it's a start. In the refactoring process, I probably screwed the other two, and I'm not in the mood to test them now. There's one thing I haven't figured out though, and that is whether condor has any notion of a walltime limit. 
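To make "the default condor queue in the vanilla universe" concrete, here is a minimal sketch of a plain Condor submission done by hand. It is an assumption-laden illustration, not the submit file the new provider generates (the thread does not show that). The helper name `write_submit_file`, the file names, and the `/bin/echo` job are all hypothetical.

```shell
#!/bin/sh
# Write a minimal vanilla-universe Condor submit description.
# write_submit_file and all file names are illustrative assumptions;
# this is not what the Swift/CoG condor provider produces.
write_submit_file() {
    submit_file=$1   # path of the submit description to create
    count=$2         # number of identical jobs to queue
    # \$(Process) is escaped so the shell leaves it for Condor to expand.
    cat > "$submit_file" <<EOF
universe   = vanilla
executable = /bin/echo
arguments  = hello
output     = echo.\$(Process).out
error      = echo.\$(Process).err
log        = echo.log
queue $count
EOF
}

write_submit_file echo.submit 256
# Submitting needs a working Condor installation, so it is left commented out:
# condor_submit echo.submit
```

The `queue 256` line mirrors the 256 parallel jobs mentioned in the message; each job gets its own output file through Condor's `$(Process)` macro.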
From benc at hawaga.org.uk Sun Mar 15 12:29:23 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 15 Mar 2009 17:29:23 +0000 (GMT) Subject: [Swift-devel] scalability updates In-Reply-To: <1236805031.7892.0.camel@localhost> References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> <1236795503.15465.1.camel@localhost> <1236805031.7892.0.camel@localhost> Message-ID: I tried to run 066-many with a million (as a suitably large number) iterations to see where it got and then promptly forgot that I'd left it running. It got to about 200000 (2*10^5) jobs, and then died. If you're interested, the log is in http://www.ci.uchciago.edu/~benc/066-many-20090312-0550-z7jxehb4.log The restart log for that is empty apart from the timestamp (s/.log/.0.rlog on the above URL for that). I think it should contain much more than that - one line for each of the 2*10^5 jobs that are alleged by the log files to have been completed. From hategan at mcs.anl.gov Sun Mar 15 19:41:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 15 Mar 2009 19:41:22 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> <1236795503.15465.1.camel@localhost> <1236805031.7892.0.camel@localhost> Message-ID: <1237164082.20046.0.camel@localhost> On Sun, 2009-03-15 at 17:29 +0000, Ben Clifford wrote: > I tried to run 066-many with a million (as a suitably large number) > iterations to see where it got and then promptly forgot that I'd left > it running. > > It got to about 200000 (2*10^5) jobs, and then died. > > If you're interested, the log is in > http://www.ci.uchciago.edu/~benc/066-many-20090312-0550-z7jxehb4.log > > The restart log for that is empty apart from the timestamp > (s/.log/.0.rlog on the above URL for that). 
I think it should contain > much more than that - one line for each of the 2*10^5 jobs that are > alleged by the log files to have been completed. The jobs produce no data. What exactly should be in the restart log you'd say? From hategan at mcs.anl.gov Sun Mar 15 19:44:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 15 Mar 2009 19:44:11 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> <1236795503.15465.1.camel@localhost> <1236805031.7892.0.camel@localhost> Message-ID: <1237164251.20046.2.camel@localhost> On Sun, 2009-03-15 at 17:29 +0000, Ben Clifford wrote: > I tried to run 066-many with a million (as a suitable large number) > iterations to see where it got and then promptly forogt that I'd left > it running > > It got to about 200000 (2*10^5) jobs, and then died. > > If you're interested, the log is in > http://www.ci.uchciago.edu/~benc/066-many-20090312-0550-z7jxehb4.log mike at blabla tmp$ wget http://www.ci.uchicago.edu/~benc/066-many-20090312-0550-z7jxehb4.log --2009-03-15 19:43:28-- http://www.ci.uchicago.edu/~benc/066-many-20090312-0550-z7jxehb4.log Resolving www.ci.uchicago.edu... 128.135.125.142 Connecting to www.ci.uchicago.edu|128.135.125.142|:80... connected. HTTP request sent, awaiting response... 404 Not Found 2009-03-15 19:43:28 ERROR 404: Not Found. 
mike at blabla tmp$ From benc at hawaga.org.uk Mon Mar 16 03:54:36 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 16 Mar 2009 08:54:36 +0000 (GMT) Subject: [Swift-devel] scalability updates In-Reply-To: <1237164251.20046.2.camel@localhost> References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> <1236795503.15465.1.camel@localhost> <1236805031.7892.0.camel@localhost> <1237164251.20046.2.camel@localhost> Message-ID: On Sun, 15 Mar 2009, Mihael Hategan wrote: > mike at blabla tmp$ wget > http://www.ci.uchicago.edu/~benc/066-many-20090312-0550-z7jxehb4.log > --2009-03-15 19:43:28-- Oops. Try ~benc/tmp/ -- From benc at hawaga.org.uk Mon Mar 16 03:56:11 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 16 Mar 2009 08:56:11 +0000 (GMT) Subject: [Swift-devel] scalability updates In-Reply-To: <1237164082.20046.0.camel@localhost> References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> <1236795503.15465.1.camel@localhost> <1236805031.7892.0.camel@localhost> <1237164082.20046.0.camel@localhost> Message-ID: On Sun, 15 Mar 2009, Mihael Hategan wrote: > The jobs produce no data. What exactly should be in the restart log > you'd say? Oh yes, I forgot that they need an output variable for that. Procedures with no outputs are weird - they aren't restartable and don't get optimised away as already having all (0 of) their outputs there. -- From benc at hawaga.org.uk Mon Mar 16 07:50:09 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 16 Mar 2009 12:50:09 +0000 (GMT) Subject: [Swift-devel] Swift 0.9 release for ~2nd April Message-ID: I'd like to put out the Swift 0.9 release on the 2nd of April, with the release candidate being made from SVN on the 23rd of March. Things that have been screwed around with since 0.8 that aren't getting substantial automated testing are coasters, the PBS provider and the log-processing code. 
So effort to test (or automate tests for) those in the next few weeks would be good. -- From hategan at mcs.anl.gov Mon Mar 16 10:18:54 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 16 Mar 2009 10:18:54 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> <1236713041.22637.0.camel@localhost> <1236778692.20470.0.camel@localhost> <1236795503.15465.1.camel@localhost> <1236805031.7892.0.camel@localhost> <1237164082.20046.0.camel@localhost> Message-ID: <1237216734.3397.2.camel@localhost> On Mon, 2009-03-16 at 08:56 +0000, Ben Clifford wrote: > On Sun, 15 Mar 2009, Mihael Hategan wrote: > > > The jobs produce no data. What exactly should be in the restart log > > you'd say? > > oh yes, I forgot that they need an output variable for that. > > Procedures with no outputs are wierd - they aren't restartable and don't > get optimised away as already having all (0 of) their outputs there. Do they even exist? :) > From wilde at mcs.anl.gov Mon Mar 16 13:21:28 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 16 Mar 2009 13:21:28 -0500 Subject: [Swift-devel] Coaster run fails from communicado to ranger Message-ID: <49BE98A8.5060109@mcs.anl.gov> I'm trying a simple test script to ranger, from communicado, using latest svn rev (swift-r2692 cog-r2329). 
I get the error below (/bin/bash: line 39: eval: --: invalid option) Sites file is: /share/home/00306/tg455797/swiftwork 1 8 00:01:00 TG-CCR080022N 16 -- Output is: Swift svn swift-r2692 cog-r2329 RunID: 20090316-1220-kfipom0f Progress: Progress: Stage in:1 Progress: Submitted:1 Failed to transfer wrapper log from cat-20090316-1220-kfipom0f/info/r on ranger Progress: Failed:1 Execution failed: Exception in cat: Arguments: [data.txt] Host: ranger Directory: cat-20090316-1220-kfipom0f/jobs/r/cat-rsrhc18j stderr.txt: stdout.txt: ---- Caused by: Could not submit job Caused by: Could not start coaster service Caused by: Task ended before registration was received. STDOUT: /bin/bash: line 39: eval: --: invalid option eval: usage: eval [arg ...] STDERR: null Cleaning up... Done From aespinosa at cs.uchicago.edu Mon Mar 16 14:13:04 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 16 Mar 2009 14:13:04 -0500 Subject: [Swift-devel] "any valid host for task" in Swift + deef provider Message-ID: <50b07b4b0903161213r292fb025o1bc52a34608192d@mail.gmail.com> Hi, I'm using swift r2682, cogkit 2326 and provider-deef 2507 RunID: 20090316-1354-ocn573c3 Progress: Execution failed: Could not find any valid host for task "Task(type=UNKNOWN, identity=urn:cog-1237229648327)" with constraints {tr=hostname, filenames=[Ljava.lang.String;@14aa453, trfqn=hostname, filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 16a4aef} Sites.xml: /work/01035/tg802895/swift-runs The run did not initialize the work directory. -Allan From hategan at mcs.anl.gov Mon Mar 16 14:24:35 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 16 Mar 2009 14:24:35 -0500 Subject: [Swift-devel] "any valid host for task" in Swift + deef provider In-Reply-To: <50b07b4b0903161213r292fb025o1bc52a34608192d@mail.gmail.com> References: <50b07b4b0903161213r292fb025o1bc52a34608192d@mail.gmail.com> Message-ID: <1237231475.8617.12.camel@localhost> Can you post your tc.data? 
On Mon, 2009-03-16 at 14:13 -0500, Allan Espinosa wrote: > Hi, > > I'm using swift r2682, cogkit 2326 and provider-deef 2507 > > RunID: 20090316-1354-ocn573c3 > Progress: > Execution failed: > Could not find any valid host for task "Task(type=UNKNOWN, > identity=urn:cog-1237229648327)" with constraints {tr=hostname, > filenames=[Ljava.lang.String;@14aa453, trfqn=hostname, > filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 16a4aef} > > > Sites.xml: > > > > url="http://129.114.102.179:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/> > /work/01035/tg802895/swift-runs > > > > The run did not initialize the work directory. > > -Allan > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Mar 16 14:32:19 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 16 Mar 2009 14:32:19 -0500 Subject: [Swift-devel] Coaster run fails from communicado to ranger In-Reply-To: <49BE98A8.5060109@mcs.anl.gov> References: <49BE98A8.5060109@mcs.anl.gov> Message-ID: <1237231939.10524.1.camel@localhost> On Mon, 2009-03-16 at 13:21 -0500, Michael Wilde wrote: > I'm trying a simple test script to ranger, from communicado, using > latest svn rev (swift-r2692 cog-r2329). > > I get the error below (/bin/bash: line 39: eval: --: invalid option) That's unfortunately because "/bin/bash -l -c 'which wget'" returns: --------------------- Project balances for user tg455678 ---------------------- | Name Avail SUs Expires | Name Avail SUs Expires | | TG-CCR080022N 7008 2009-05-31 | TG-DBS080004N 999508 2009-12-31 | ------------------------ Disk quotas for user tg455678 ---------- -------------- | Disk Usage (GB) Limit %Used File Usage Limit %Used | | /share 3.8 6 64.13 3086 100000 3.09 | ------------------------ ------------------------------------------------------- /usr/bin/wget (with slight variations). 
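The failure described here comes from a login shell printing a banner before the real answer, so the caller receives several lines where it expects one path. One possible way to cope, sketched below purely as an assumption (it is not what the coaster bootstrap script does, and `last_executable_path` is a made-up helper name), is to scan the output and keep only the last line that names an executable file.

```shell
#!/bin/sh
# Keep the last line of the input that is an absolute path to an executable file.
# last_executable_path is a hypothetical helper, not coaster code.
last_executable_path() {
    result=""
    while IFS= read -r line; do
        # Banner lines do not start with "/" or do not name an executable file.
        case $line in
            /*) [ -f "$line" ] && [ -x "$line" ] && result=$line ;;
        esac
    done
    # Print the match, or return non-zero if nothing qualified.
    [ -n "$result" ] && printf '%s\n' "$result"
}

# Simulated noisy output: a banner line followed by the real answer.
printf '%s\n' '--- Project balances for user ---' '/bin/sh' | last_executable_path
```

With real output the pipeline would look like `/bin/bash -l -c 'which wget' | last_executable_path`. The filter only helps when the banner goes to standard output and the wanted path is printed last; hard-wiring the tool paths avoids the login shell altogether.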

I guess another strategy is needed here.

>
> Sites file is:
>
>
> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>
> /share/home/00306/tg455797/swiftwork
> 1
> 8
> 00:01:00
> TG-CCR080022N
> 16

From aespinosa at cs.uchicago.edu Mon Mar 16 14:44:03 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 16 Mar 2009 14:44:03 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
In-Reply-To: <1237231939.10524.1.camel@localhost>
References: <49BE98A8.5060109@mcs.anl.gov> <1237231939.10524.1.camel@localhost>
Message-ID: <50b07b4b0903161244p311666d2jb9764897ca92b187@mail.gmail.com>

What I did before was hardwire wget, md5sum and other binaries needed
for coasters because the environment does not like you doing a "bash
-l". You get access to TTY errors.

-Allan

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From hategan at mcs.anl.gov Mon Mar 16 15:08:56 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 16 Mar 2009 15:08:56 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
In-Reply-To: <50b07b4b0903161244p311666d2jb9764897ca92b187@mail.gmail.com>
References: <49BE98A8.5060109@mcs.anl.gov> <1237231939.10524.1.camel@localhost> <50b07b4b0903161244p311666d2jb9764897ca92b187@mail.gmail.com>
Message-ID: <1237234136.10524.9.camel@localhost>

On Mon, 2009-03-16 at 14:44 -0500, Allan Espinosa wrote:
> What I did before was hardwire wget, md5sum and other binaries needed
> for coasters because the environment does not like you doing a "bash
> -l". You get access to TTY errors.

That's stty. Something that is used to configure the terminal, and
doesn't work well in non-terminals. But by itself it is a benign issue.

From hategan at mcs.anl.gov Mon Mar 16 16:50:11 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 16 Mar 2009 16:50:11 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
In-Reply-To: <1237231939.10524.1.camel@localhost>
References: <49BE98A8.5060109@mcs.anl.gov> <1237231939.10524.1.camel@localhost>
Message-ID: <1237240211.14096.1.camel@localhost>

I committed a fix in cog r2330. It uses a temporary file for the output
of which from bash -l -c.

From wilde at mcs.anl.gov Mon Mar 16 21:51:01 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 16 Mar 2009 21:51:01 -0500
Subject: [Swift-devel] Coaster run fails from communicado to ranger
In-Reply-To: <1237240211.14096.1.camel@localhost>
References: <49BE98A8.5060109@mcs.anl.gov> <1237231939.10524.1.camel@localhost> <1237240211.14096.1.camel@localhost>
Message-ID: <49BF1015.2010400@mcs.anl.gov>

With this rev I was able to submit a simple test job to ranger from a
swift script. The job never started, though, so I suspect I got some
queuing parameter wrong (or ranger is very congested), and I need to
debug further.
But it certainly got past the problem that started this thread. Thanks.

From benc at hawaga.org.uk Tue Mar 17 05:29:57 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 17 Mar 2009 10:29:57 +0000 (GMT)
Subject: [Swift-devel] another attempt at a getting started with provenance section (fwd)
Message-ID:

I've fiddled a lot with the provenance db code to make installation and
configuration easier; and the associated docbook page at
http://www.ci.uchicago.edu/~benc/provenance.html to have more of a focus
on running your own db (in either sqlite3 for a personal-sized db or in
postgres for a larger db).

Section 2 of the above web page gives notes on importing your own log
files into a database of your choosing, and section 3 gives some notes on
query commands that I implemented a while ago and fixed up yesterday.

I also added some more commentary in the SQL schema, prov-init.sql, in the
provenancedb checkout, to help with creating your own queries.

If you're going to play with this, I recommend starting with the sqlite3
mode - that provides a substantially easier to administer low-end database
compared to postgres.

I think this is basically the form I want the provenance db to look for
the next few months. I plan on adding more information (i.e. more tables
and more columns) and functionality, but largely in a backwards compatible
way.

--

From hategan at mcs.anl.gov Tue Mar 17 11:30:10 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 11:30:10 -0500
Subject: [Swift-devel] testing
Message-ID: <1237307410.26969.8.camel@localhost>

So I think that, at this point, if we're serious about running things
production-style on teragrid or osg, an appropriate testing effort is
required.

In the past our hands were tied due to our inability to fix and deploy
gram issues on both those places. With the shift towards coasters and
local providers, we have, at least in theory, overcome the issue.
However, in order for it to also be in practice, we need to make sure
that things actually work, and that can only be done with testing or a
magic wand. I don't have the latter, so we'll have to do testing.

There are probably a few issues still left to address, one of which is
to make sure that coasters are an acceptable way of running things on
OSG. I suspect this would require some negotiation with the right people
from OSG, and I don't know who those people are.

From wilde at mcs.anl.gov Tue Mar 17 12:04:12 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 12:04:12 -0500
Subject: [Swift-devel] testing
In-Reply-To: <1237307410.26969.8.camel@localhost>
References: <1237307410.26969.8.camel@localhost>
Message-ID: <49BFD80C.7080903@mcs.anl.gov>

On 3/17/09 11:30 AM, Mihael Hategan wrote:
> So I think that, at this point, if we're serious about running things
> production-style on teragrid or osg, an appropriate testing effort is
> required.

I agree.

>
> In the past our hands were tied due to our inability to fix and deploy
> gram issues on both those places. With the shift towards coasters and
> local providers, we have, at least in theory, overcome the issue.

I think your new Condor provider provides the hopefully final missing piece.

> However, in order for it to also be in practice, we need to make sure
> that things actually work, and that can only be done with testing or a
> magic wand. I don't have the latter, so we'll have to do testing.

yes.

> There are probably a few issues still left to address, one of which is
> to make sure that coasters are an acceptable way of running things on
> OSG. I suspect this would require some negotiation with the right people
> from OSG, and I don't know who those people are.

I do: It's Ruth, and several people in various working groups whose names
I can gather and send out. I'll make the initial contacts.

What I need are test data from various scale runs that prove Swift is
fast, scalable, and "safe" (i.e. doesn't harm things).

This is all coming together well. For example, a user (Glen Hocky) was
able to run ZHangiong's "ADEM" installer to push OOPS to 5-8 OSG sites
and then run a swift workflow using them. A good test effort would allow
us to expand that to a great success story.

I'd like to see this effort build on the OSG site list scripts, and
equivalent REST-based scripts that are now available at info.teragrid.org.

I think you should make this your next focus, Mihael, and just get
started; then as you go we can gradually through discussion align this
into a testing effort that really opens up OSG and TG to Swift users.
That'll be a great step.

- Mike

From zhaozhang at uchicago.edu Tue Mar 17 12:14:29 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 12:14:29 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1236982413.13026.1.camel@localhost>
References: <49BAD926.1030607@uchicago.edu> <1236982413.13026.1.camel@localhost>
Message-ID: <49BFDA75.9070803@uchicago.edu>

Here comes another question, is there any place that I could set to
disable swift's waiting for data feature?
Or is there any way for me to cheat swift that the data is already
there? thanks.

zhao

Mihael Hategan wrote:
> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
>
>> Hi, All
>>
>> I have a question on how swift knows if a task is successful.
>> In my case, I am using a status notification instead of a status file.
>>
>> So my question is is this status notification the only thing swift is
>> waiting for, or is swift also waiting for the output data to appear to
>> say that a job is successful?
>>
>
> Once the job is done, swift will attempt to stage out all the files that
> it expects the job to have produced.
>
> Should one of those files not be there, there will be failures.
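
The behaviour quoted above (a job only counts as successful if every expected output file can be staged out) can be pictured as a check run after the job finishes. The following is a loose sketch of that idea in shell, not Swift's actual stage-out code: the function name check_outputs and its arguments are hypothetical.

```shell
#!/bin/bash
# check_outputs DIR FILE...: succeed only if every expected FILE exists in DIR.
# (Hypothetical helper for illustration; not part of Swift.)
check_outputs() {
    local dir="$1" missing=0 f
    shift
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            # Report each missing file, but keep checking the rest.
            echo "missing expected output: $dir/$f" >&2
            missing=1
        fi
    done
    # Status 0 when all files are present, 1 when at least one is missing.
    return "$missing"
}
```

Under this picture, a status notification alone is not enough: the job is treated as failed as soon as one declared output is absent from the shared directory.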

From hategan at mcs.anl.gov Tue Mar 17 12:18:55 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 12:18:55 -0500
Subject: [Swift-devel] testing
In-Reply-To: <49BFD80C.7080903@mcs.anl.gov>
References: <1237307410.26969.8.camel@localhost> <49BFD80C.7080903@mcs.anl.gov>
Message-ID: <1237310335.29738.8.camel@localhost>

On Tue, 2009-03-17 at 12:04 -0500, Michael Wilde wrote:
> > In the past our hands were tied due to our inability to fix and deploy
> > gram issues on both those places. With the shift towards coasters and
> > local providers, we have, at least in theory, overcome the issue.
>
> I think your new Condor provider provides the hopefully final missing piece.

There are the LSF and SGE providers still to be done ;)

> > There are probably a few issues still left to address, one of which is
> > to make sure that coasters are an acceptable way of running things on
> > OSG. I suspect this would require some negotiation with the right people
> > from OSG, and I don't know who those people are.
>
> I do: It's Ruth, and several people in various working groups whose names
> I can gather and send out. I'll make the initial contacts.

Ok.

>
> What I need are test data from various scale runs that prove Swift is
> fast, scalable, and "safe" (i.e. doesn't harm things).

This isn't in particular a swift issue, but a coaster issue. There is no
proof of safety, and swift being fast, scalable, and safe comes after
this testing, not before.

But I would like to "negotiate" the ability to:
- have one process on the head node, hopefully one that doesn't hog it.
- the ability to submit from the head node to the queuing system
directly (as if running qsub manually - something that isn't exactly
"the way" on OSG).

From hategan at mcs.anl.gov Tue Mar 17 12:20:55 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 12:20:55 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49BFDA75.9070803@uchicago.edu>
References: <49BAD926.1030607@uchicago.edu> <1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu>
Message-ID: <1237310455.29738.11.camel@localhost>

On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
> Here comes another question, is there any place that I could set to
> disable swift's waiting for data feature?

Do you mean disable the stage-outs?

> Or is there any way for me to cheat swift that the data is already
> there? thanks.

From zhaozhang at uchicago.edu Tue Mar 17 12:23:04 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 12:23:04 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1237310455.29738.11.camel@localhost>
References: <49BAD926.1030607@uchicago.edu> <1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu> <1237310455.29738.11.camel@localhost>
Message-ID: <49BFDC78.8040506@uchicago.edu>

Hi, Mihael

yes, can I do that?

zhao

From hategan at mcs.anl.gov Tue Mar 17 12:29:07 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Mar 2009 12:29:07 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <49BFDC78.8040506@uchicago.edu>
References: <49BAD926.1030607@uchicago.edu> <1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu> <1237310455.29738.11.camel@localhost> <49BFDC78.8040506@uchicago.edu>
Message-ID: <1237310948.30064.2.camel@localhost>

On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote:
> Hi, Mihael
>
> yes, can I do that?

You should know this by now:
in vdl-int.k, in doStageout, comment out the task:transfer invocation
(and dir:make).

From zhaozhang at uchicago.edu Tue Mar 17 12:31:31 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 12:31:31 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1237310948.30064.2.camel@localhost>
References: <49BAD926.1030607@uchicago.edu> <1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu> <1237310455.29738.11.camel@localhost> <49BFDC78.8040506@uchicago.edu> <1237310948.30064.2.camel@localhost>
Message-ID: <49BFDE73.2070600@uchicago.edu>

ok, thanks, I will try it out.

zhao

From zhaozhang at uchicago.edu Tue Mar 17 13:36:32 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 17 Mar 2009 13:36:32 -0500
Subject: [Swift-devel] How does swift know if a task is successful
In-Reply-To: <1237310948.30064.2.camel@localhost>
References: <49BAD926.1030607@uchicago.edu> <1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu> <1237310455.29738.11.camel@localhost> <49BFDC78.8040506@uchicago.edu> <1237310948.30064.2.camel@localhost>
Message-ID: <49BFEDB0.5070409@uchicago.edu>

Hi, Mihael

I commented the following lines
/*dir:make(ldir)
restartOnError(".*", 2
    task:transfer(srchost=host, srcfile=bname,
        srcdir=rdir, destdir=ldir, desthost=dhost, destprovider=provider)
)*/

Then I modified wrapper.sh to not copy the output file back, but I still
got an error.
The log file is at
http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log
Thanks

zhao

zzhang@login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 64 first.swift
waiting for at least 64 nodes to register before submitting workload...
waiting to find at least 1 services in file
/home/falkon/users/zzhang/1117/config/Client-service-URIs.config...
all done, file has found at least 1 services
found at least 64 registered, submitting workload...
Swift svn swift-r2676 (swift modified locally) cog-r2305

RunID: 20090317-1327-oqgttus8
Progress:
Progress:  Selecting site:1  Stage in:1
Progress:  Submitting:1  Submitted:1
Progress:  Submitted:1  Failed but can retry:1
Failed to transfer wrapper log from
first-20090317-1327-oqgttus8/info/b/n/bgp000
Progress:  Submitted:1  Active:1
Failed to transfer wrapper log from
first-20090317-1327-oqgttus8/info/e/n/bgp000
Progress:  Submitted:1  Active:1
Failed to transfer wrapper log from
first-20090317-1327-oqgttus8/info/g/n/bgp000
Execution failed:
        Exception in echo:
Arguments: [Hello, world!]
Host: bgp000
Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j
stderr.txt:

stdout.txt:

----

Caused by:
        Cannot transfer
"/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to
"/gpfs/home/zzhang/new_dock6/./hello.txt"
Caused by:
        No such file
>>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>> >>> > > > From hategan at mcs.anl.gov Tue Mar 17 13:40:30 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 17 Mar 2009 13:40:30 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <49BFEDB0.5070409@uchicago.edu> References: <49BAD926.1030607@uchicago.edu> <1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu> <1237310455.29738.11.camel@localhost> <49BFDC78.8040506@uchicago.edu> <1237310948.30064.2.camel@localhost> <49BFEDB0.5070409@uchicago.edu> Message-ID: <1237315230.31264.1.camel@localhost> On Tue, 2009-03-17 at 13:36 -0500, Zhao Zhang wrote: > Hi, Mihael > > I commented the following lines > /*dir:make(ldir) > restartOnError(".*", 2 > task:transfer(srchost=host, srcfile=bname, > srcdir=rdir, destdir=ldir, desthost=dhost, destprovider=provider) > )*/ > Did you modify this file in dist/?/libexec? If not, did you re-compile swift after the modification? Put an echo or a log message in place, to see if your change is picked up by swift next time. > Then I modified wrapper.sh to not to copy output file back, but I still > got an error. > The log file is at > http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log > Thanks > > zhao > > zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 64 first.swift > waiting for at least 64 nodes to register before submitting workload... > waiting to find at least 1 services in file > /home/falkon/users/zzhang/1117/config/Client-service-URIs.config... > all done, file has found at least 1 services > found at least 64 registered, submitting workload... 
> Swift svn swift-r2676 (swift modified locally) cog-r2305 > > RunID: 20090317-1327-oqgttus8 > Progress: > Progress: Selecting site:1 Stage in:1 > Progress: Submitting:1 Submitted:1 > Progress: Submitted:1 Failed but can retry:1 > Failed to transfer wrapper log from > first-20090317-1327-oqgttus8/info/b/n/bgp000 > Progress: Submitted:1 Active:1 > Failed to transfer wrapper log from > first-20090317-1327-oqgttus8/info/e/n/bgp000 > Progress: Submitted:1 Active:1 > Failed to transfer wrapper log from > first-20090317-1327-oqgttus8/info/g/n/bgp000 > Execution failed: > Exception in echo: > Arguments: [Hello, world!] > Host: bgp000 > Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot transfer > "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to > "/gpfs/home/zzhang/new_dock6/./hello.txt" > Caused by: > No such file > > > Mihael Hategan wrote: > > On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote: > > > >> Hi, Mihael > >> > >> yes, can I do that? > >> > > > > You should know this by now: > > in vdl-int.k, in doStageout, comment out the task:transfer invocation > > (and dir:make). > > > > > >> zhao > >> > >> Mihael Hategan wrote: > >> > >>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote: > >>> > >>> > >>>> Here comes another question, is there any place that I could set to > >>>> disable swift's waiting for data feature? > >>>> > >>>> > >>> Do you mean disable the stage-outs? > >>> > >>> > >>> > >>>> Or is there any way for me to cheat swift that the data is already > >>>> there? thanks. > >>>> > >>>> zhao > >>>> > >>>> Mihael Hategan wrote: > >>>> > >>>> > >>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote: > >>>>> > >>>>> > >>>>> > >>>>>> Hi, All > >>>>>> > >>>>>> I have a question on how swift knows if a task is successful. > >>>>>> In my case, I am using a status notification instead of a status file. 
> >>>>>> > >>>>>> So my question is is this status notification the only thing swift is > >>>>>> waiting for, or is swift also waiting for the output data to appear to > >>>>>> say that a job is successful? > >>>>>> > >>>>>> > >>>>>> > >>>>> Once the job is done, swift will attempt to stage out all the files that > >>>>> it expects the job to have produced. > >>>>> > >>>>> Should one of those files not be there, there will be failures. > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>> > >>> > > > > > > From wilde at mcs.anl.gov Tue Mar 17 16:07:10 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 17 Mar 2009 16:07:10 -0500 Subject: [Swift-devel] null pointer exception from nested loops Message-ID: <49C010FE.4070503@mcs.anl.gov> I just expanded my oops protein folding script to add another level of parameter sweep. This script is getting pretty complex now (at least, for a swift script). I got the following NPE on my first two tries. I'm going to start debugging, but any clues as to the cause would be helpful. The outer loops are: main() { string protein[] = readData(@arg("plist")); string startTemp[] = ["10","20"]; string tempUpdate[] = ["1","2","3"]; foreach p in protein { foreach st in startTemp { foreach tu in tempUpdate { doRoundSet(p,st,tu); } } } } There are two levels of inner loops further down below doRoundSet(). The script, output, command line args and log are in: http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz I suspect it will take a while to narrow the cause to a simpler test case that's easy to reproduce without a lot of setup. I'll try on a vanilla swift on local execution; this is on bgp with Falkon. Thanks. -- ...
Progress: uninitialized:1 Selecting site:2 SwiftScript trace: T1af7, Round, 0, Sim, 7 SwiftScript trace: T1af7, Round, 0, Sim, 2 SwiftScript trace: T1af7, Round, 0, Sim, 8 SwiftScript trace: T1af7, Round, 0, Sim, 0 SwiftScript trace: T1af7, Round, 0, Sim, 5 SwiftScript trace: T1af7, Round, 0, Sim, 9 SwiftScript trace: T1af7, Round, 0, Sim, 1 SwiftScript trace: T1af7, Round, 0, Sim, 6 SwiftScript trace: T1af7, Round, 0, Sim, 3 SwiftScript trace: T1af7, Round, 0, Sim, 4 Ex098 java.lang.NullPointerException at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) at 
org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) Execution failed: java.lang.NullPointerException at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) at 
org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) Ex098 java.lang.NullPointerException at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) at 
org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) SwiftScript trace: T1af7, Round, 0, Sim, 7 From wilde at mcs.anl.gov Tue Mar 17 16:17:05 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 17 Mar 2009 16:17:05 -0500 Subject: [Swift-devel] null pointer exception from nested loops In-Reply-To: <49C010FE.4070503@mcs.anl.gov> References: <49C010FE.4070503@mcs.anl.gov> Message-ID: <49C01351.1050807@mcs.anl.gov> It seems not related to scale or Falkon. It occurs when running on localhost (but on bgp) and when I cut all the loops down to a single iteration. I'm still debugging. On 3/17/09 4:07 PM, Michael Wilde wrote: > I just expanded my oops protein folding script to add another level of > parameter sweep. 
This script is getting pretty complex now (at least, > for a swift script). > > I got the following NPE on my first two tries. I'm going to start > debugging, but any clues as to the cause would be helpful. > > The outer loops are: > > main() > { > string protein[] = readData(@arg("plist")); > string startTemp[] = ["10","20"]; > string tempUpdate[] = ["1","2","3"]; > > foreach p in protein { > foreach st in startTemp { > foreach tu in tempUpdate { > doRoundSet(p,st,tu); > } > } > } > } > > There are two levels of inner loops further down below doRoundSet(). > > The script, output, command line args and log are in: > http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz > > I suspect it will take a while to narrow the cause to a simpler test > case that's easy to reproduce without a lot of setup. > > I'll try on a vanilla swift on local execution; this is on bgp with Falkon. > > Thanks. > > -- > > ... > Progress: uninitialized:1 Selecting site:2 > SwiftScript trace: T1af7, Round, 0, Sim, 7 > SwiftScript trace: T1af7, Round, 0, Sim, 2 > SwiftScript trace: T1af7, Round, 0, Sim, 8 > SwiftScript trace: T1af7, Round, 0, Sim, 0 > SwiftScript trace: T1af7, Round, 0, Sim, 5 > SwiftScript trace: T1af7, Round, 0, Sim, 9 > SwiftScript trace: T1af7, Round, 0, Sim, 1 > SwiftScript trace: T1af7, Round, 0, Sim, 6 > SwiftScript trace: T1af7, Round, 0, Sim, 3 > SwiftScript trace: T1af7, Round, 0, Sim, 4 > Ex098 > java.lang.NullPointerException > at > org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > at > org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > at > org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > at > org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > at >
org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > Execution failed: > java.lang.NullPointerException > at > org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > at > org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > at > 
org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > at > org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > 
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > > Ex098 > java.lang.NullPointerException > at > org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > at > org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > at > org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > at > org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > 
at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > SwiftScript trace: T1af7, Round, 0, Sim, 7 > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Mar 17 16:25:32 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 17 Mar 2009 16:25:32 -0500 Subject: [Swift-devel] null pointer exception from nested loops In-Reply-To: <49C01351.1050807@mcs.anl.gov> References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov> Message-ID: <49C0154C.7000608@mcs.anl.gov> The log contains this just before the NPE, including the suspicious message: WARN FlowNode Ex098: That's giving me a clue as to the offending statements.
--- 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle listener "F/org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)" to "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)" 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed SwiftScript value (closed) 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 path=$ 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 VALUE=s/@DIT@/10/ 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed SwiftScript value (closed) 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 path=$ 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 VALUE=s/@TUI@/1/ 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 type string value=params.tloop dataset=unnamed SwiftScript value (closed) 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at
ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 path=$ 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 VALUE=params.tloop 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098 java.lang.NullPointerException at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) On 3/17/09 4:17 PM, Michael Wilde wrote: > It seems not related to scale or Falkon. > > It occurs when running on localhost (but on bgp) and when I cut all the > loops down to a single iteration. > > I'm still debugging. > > On 3/17/09 4:07 PM, Michael Wilde wrote: >> I just expanded my oops protein folding script to add another level of >> parameter sweep. This script is getting pretty complex now (at least, >> for a swift script). >> >> I got the following NPE on my first two tries. I'm going to start >> debugging, but any clues as to the cause would be helpful. >> >> The outer loops are: >> >> main() >> { >> string protein[] = readData(@arg("plist")); >> string startTemp[] = ["10","20"]; >> string tempUpdate[] = ["1","2","3"]; >> >> foreach p in protein { >> foreach st in startTemp { >> foreach tu in tempUpdate { >> doRoundSet(p,st,tu); >> } >> } >> } >> } >> >> There are two levels of inner loops further down below doRoundSet(). >> >> The script, output, command line args and log are in: >> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz >> >> I suspect it will take a while to narrow the cause to a simpler test >> case that's easy to reproduce without a lot of setup. >> >> I'll try on a vanilla swift on local execution; this is on bgp with >> Falkon. >> >> Thanks.
>> >> -- >> >> ... >> Progress: uninitialized:1 Selecting site:2 >> SwiftScript trace: T1af7, Round, 0, Sim, 7 >> SwiftScript trace: T1af7, Round, 0, Sim, 2 >> SwiftScript trace: T1af7, Round, 0, Sim, 8 >> SwiftScript trace: T1af7, Round, 0, Sim, 0 >> SwiftScript trace: T1af7, Round, 0, Sim, 5 >> SwiftScript trace: T1af7, Round, 0, Sim, 9 >> SwiftScript trace: T1af7, Round, 0, Sim, 1 >> SwiftScript trace: T1af7, Round, 0, Sim, 6 >> SwiftScript trace: T1af7, Round, 0, Sim, 3 >> SwiftScript trace: T1af7, Round, 0, Sim, 4 >> Ex098 >> java.lang.NullPointerException >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >> >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >> at >> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >> >> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >> at >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >> >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >> >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >> >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >> >> at >> 
org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >> >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >> at >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >> >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >> >> at >> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >> >> Execution failed: >> java.lang.NullPointerException >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >> >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >> at >> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >> >> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >> at >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >> >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >> >> at >> 
org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >> >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >> >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >> >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >> >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >> at >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >> >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >> >> at >> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >> >> >> Ex098 >> java.lang.NullPointerException >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >> >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >> at >> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >> >> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >> at >> 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >> >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >> >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >> >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >> >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >> >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >> at >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >> >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >> >> at >> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >> >> SwiftScript trace: T1af7, Round, 0, 
Sim, 7
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From foster at anl.gov Tue Mar 17 16:26:25 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 17 Mar 2009 16:26:25 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C0154C.7000608@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov> <49C0154C.7000608@mcs.anl.gov>
Message-ID: <17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov>

Just curious, is the whole thing working with just Falkon?

On Mar 17, 2009, at 4:25 PM, Michael Wilde wrote:

> The log contains this just before the NPE, including the suspicious
> message: WARN FlowNode Ex098:
>
> That's giving me a clue as to the offending statements.

From wilde at mcs.anl.gov Tue Mar 17 16:46:21 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 16:46:21 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C0154C.7000608@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov> <49C0154C.7000608@mcs.anl.gov>
Message-ID: <49C01A2D.6050408@mcs.anl.gov>

I got the error narrowed down to this example:

type file;

app (file editedParams) setTemps ( file inParams )
{
   echo @inParams stdout=@editedParams;
}

file inParams;

string config [] = readData( setTemps(inParams ) );
trace(0,config[0]);
trace(1,config[1]);
trace(2,config[2]);

--

params.tloops contains:

DEFAULT_INIT_TEMP_=_@DTI@
TEMP_UPDATE_INTERVAL_=_@TUI@
KILL_TIME_=_3
MAX_NUMBER_OF_ANNEALING_STEPS_=_0

--

In other code, readData() seemed happy to take a file *or* a filename
string as input, but I wonder if it was not as happy as it seemed.

I'd been taking advantage of the flexibility with good results (luck?)
so far, though.

On 3/17/09 4:25 PM, Michael Wilde wrote:
> The log contains this just before the NPE, including the suspicious
> message: WARN FlowNode Ex098:
>
> That's giving me a clue as to the offending statements.
>
> ---
>
> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle listener "F/org.griphyn.vdl.mapping.DataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)" to "org.griphyn.vdl.mapping.DataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)"
> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed SwiftScript value (closed)
> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 path=$
> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 VALUE=s/@DIT@/10/
> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed SwiftScript value (closed)
> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 path=$
> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 VALUE=s/@TUI@/1/
> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 type string value=params.tloop dataset=unnamed SwiftScript value (closed)
> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 path=$
> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE dataset=tag:benc@ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 VALUE=params.tloop
> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098
> java.lang.NullPointerException
>     at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>     at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>     at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>     at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>     at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>
> On 3/17/09 4:17 PM, Michael Wilde wrote:
>> It seems not related to scale or Falkon.
>>
>> It occurs when running on localhost (but on bgp) and when I cut all
>> the loops down to a single iteration.
>>
>> I'm still debugging.
>>
>> On 3/17/09 4:07 PM, Michael Wilde wrote:
>>> I just expanded my oops protein folding script to add another level
>>> of parameter sweep. This script is getting pretty complex now (at
>>> least, for a swift script).
>>>
>>> I got the following npe on my first two tries. I'm going to start
>>> debugging, but any clues as to the cause would be helpful.
>>>
>>> The outer loops are:
>>>
>>> main()
>>> {
>>>   string protein[] = readData(@arg("plist"));
>>>   string startTemp[] = ["10","20"];
>>>   string tempUpdate[] = ["1","2","3"];
>>>
>>>   foreach p in protein {
>>>     foreach st in startTemp {
>>>       foreach tu in tempUpdate {
>>>         doRoundSet(p,st,tu);
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> There are two levels of inner loops further down below doRoundSet().
>>>
>>> The script, output, command line args and log are in:
>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz
>>>
>>> I suspect it will take a while to narrow the cause to a simpler test
>>> case that's easy to reproduce without a lot of setup.
>>>
>>> I'll try on a vanilla swift on local execution; this is on bgp with
>>> Falkon.
>>>
>>> Thanks.
>>>
>>> --
>>>
>>> ...
>>> Progress: uninitialized:1 Selecting site:2
>>> SwiftScript trace: T1af7, Round, 0, Sim, 7
>>> SwiftScript trace: T1af7, Round, 0, Sim, 2
>>> SwiftScript trace: T1af7, Round, 0, Sim, 8
>>> SwiftScript trace: T1af7, Round, 0, Sim, 0
>>> SwiftScript trace: T1af7, Round, 0, Sim, 5
>>> SwiftScript trace: T1af7, Round, 0, Sim, 9
>>> SwiftScript trace: T1af7, Round, 0, Sim, 1
>>> SwiftScript trace: T1af7, Round, 0, Sim, 6
>>> SwiftScript trace: T1af7, Round, 0, Sim, 3
>>> SwiftScript trace: T1af7, Round, 0, Sim, 4
>>> Ex098
>>> java.lang.NullPointerException
>>>     at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285)
>>>     at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201)
>>>     at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182)
>>>     at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19)
>>>     at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67)
>>>     at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192)
>>>     at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
>>>     at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
>>>     at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>     at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>     at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176)
>>>     at
org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>> at >>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>> >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>> >>> at >>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>> >>> Execution failed: >>> java.lang.NullPointerException >>> at >>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>> >>> at >>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>> at >>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>> at >>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>> >>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>> at >>> 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>> at >>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>> >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>> >>> at >>> 
org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>> >>> >>> Ex098 >>> java.lang.NullPointerException >>> at >>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>> >>> at >>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>> at >>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>> at >>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>> >>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>> at >>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>> >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>> >>> at >>> 
org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278)
>>>     at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>     at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>     at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>     at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>     at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>     at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>
>>> SwiftScript trace: T1af7, Round, 0, Sim, 7

From wilde at mcs.anl.gov Tue Mar 17 17:02:31 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 17:02:31 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C01A2D.6050408@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov> <49C0154C.7000608@mcs.anl.gov> <49C01A2D.6050408@mcs.anl.gov>
Message-ID: <49C01DF7.4000204@mcs.anl.gov>

OK, I think this is it:

The following script *works* with the commented-out code and fails with
the currently enabled alternative statement:

--

type file;

app (file editedParams) setTemps ( file inParams )
{
   sed "-e" "s/@DTI@/123/" @inParams stdout=@editedParams;
}

file inParams;

/* works:

file o<"pout">;
o = setTemps(inParams);
string config [] = readData(o);

*/

/* Fails: */

string config [] = readData( setTemps(inParams ) );

trace(0,config[0]);
trace(1,config[1]);
trace(2,config[2]);

--

So readData is indeed happy to take a file-valued variable as an argument,
but not a file-valued expression (a procedure return, in this case).

On 3/17/09 4:46 PM, Michael Wilde wrote:
> I got the error narrowed down to this example:
>
> type file;
>
> app (file editedParams) setTemps ( file inParams )
> {
>    echo @inParams stdout=@editedParams;
> }
>
> file inParams;
>
> string config [] = readData( setTemps(inParams ) );
> trace(0,config[0]);
> trace(1,config[1]);
> trace(2,config[2]);
>
> --
>
> params.tloops contains:
>
> DEFAULT_INIT_TEMP_=_@DTI@
> TEMP_UPDATE_INTERVAL_=_@TUI@
> KILL_TIME_=_3
> MAX_NUMBER_OF_ANNEALING_STEPS_=_0
>
> --
>
> In other code, readData() seemed happy to take a file *or* a filename
> string as input, but I wonder if it was not as happy as it seemed.
>
> I'd been taking advantage of the flexibility with good results (luck?)
> so far, though.
>
> On 3/17/09 4:25 PM, Michael Wilde wrote:
>> The log contains this just before the NPE, including the suspicious
>> message: WARN FlowNode Ex098:
>>
>> That's giving me a clue as to the offending statements.
>> >> --- >> >> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle >> listener "F/org.griphyn.vdl.mapping.DataNode identifier >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090\ >> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at >> dataset=secseq path=[0] (not closed)" to >> "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,\ >> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq >> with no value at dataset=secseq path=[0] (not closed)" >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed >> org.griphyn.vdl.mapping.RootDataNode identifier >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ >> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed >> SwiftScript value (closed) >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 >> path=$ >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 >> VALUE=s/@DIT@/10/ >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed >> org.griphyn.vdl.mapping.RootDataNode identifier >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ >> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed >> SwiftScript value (closed) >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 >> path=$ >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 >> VALUE=s/@TUI@/1/ >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed >> org.griphyn.vdl.mapping.RootDataNode identifier >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ >> 3g:720000000094 type string value=params.tloop dataset=unnamed 
>> SwiftScript value (closed) >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 >> path=$ >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 >> VALUE=params.tloop >> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098 >> java.lang.NullPointerException >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >> >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >> at >> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >> >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >> >> >> >> On 3/17/09 4:17 PM, Michael Wilde wrote: >>> It seems not related to scale or Falkon. >>> >>> It occurs when running on localhost (but on BGP) and when I cut all >>> the loops down to a single iteration. >>> >>> I'm still debugging. >>> >>> On 3/17/09 4:07 PM, Michael Wilde wrote: >>>> I just expanded my oops protein folding script to add another level >>>> of parameter sweep. This script is getting pretty complex now (at >>>> least, for a swift script). >>>> >>>> I got the following NPE on my first two tries. I'm going to start >>>> debugging, but any clues as to the cause would be helpful. >>>> >>>> The outer loops are: >>>> >>>> main() >>>> { >>>> string protein[] = readData(@arg("plist")); >>>> string startTemp[] = ["10","20"]; >>>> string tempUpdate[] = ["1","2","3"]; >>>> >>>> foreach p in protein { >>>> foreach st in startTemp { >>>> foreach tu in tempUpdate { >>>> doRoundSet(p,st,tu); >>>> } >>>> } >>>> } >>>> } >>>> >>>> There are two levels of inner loops further down below doRoundSet(). 
>>>> >>>> The script, output, command line args and log are in: >>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz >>>> >>>> I suspect it will take a while to narrow the cause to a simpler test >>>> case that's easy to reproduce without a lot of setup. >>>> >>>> I'll try on a vanilla swift on local execution; this is on BGP with >>>> Falkon. >>>> >>>> Thanks. >>>> >>>> -- >>>> >>>> ... >>>> Progress: uninitialized:1 Selecting site:2 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 2 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 8 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 0 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 5 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 9 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 1 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 6 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 3 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 4 >>>> Ex098 >>>> java.lang.NullPointerException >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>> >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>> at >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>> >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> 
org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>>> at >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>> >>>> Execution failed: >>>> java.lang.NullPointerException >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>> >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>> at >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>> >>>> at >>>> 
org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>>> at >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> 
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>> >>>> >>>> Ex098 >>>> java.lang.NullPointerException >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>> >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>> at >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>> >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>>> >>>> at >>>> 
org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>>> at >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Mar 17 17:05:37 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 17 Mar 2009 17:05:37 -0500 Subject: [Swift-devel] null pointer exception from nested loops In-Reply-To: <17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov> References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov> <49C0154C.7000608@mcs.anl.gov> <17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov> Message-ID: 
<49C01EB1.5060801@mcs.anl.gov> Not quite sure what you're asking, Ian. The latest tests have been on BGP w/ Falkon. Earlier tests were on other clusters. The script has grown in the last week or so, on BGP, and grew some more today to explore some new science code algorithm questions. It's not yet running at full desired scale on the BGP; we are now scaling up carefully so as not to impact other users. This is a test case for the "cio" work as well. - Mike On 3/17/09 4:26 PM, Ian Foster wrote: > Just curious, is the whole thing working with just Falkon? > > > On Mar 17, 2009, at 4:25 PM, Michael Wilde wrote: > >> The log contains this just before the NPE, including the suspicious >> message: WARN FlowNode Ex098: >> >> That's giving me a clue as to the offending statements. >> >> --- >> >> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle >> listener "F/org.griphyn.vdl.mapping.DataNode identifier >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090\ >> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at >> dataset=secseq path=[0] (not closed)" to >> "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,\ >> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq >> with no value at dataset=secseq path=[0] (not closed)" >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed >> org.griphyn.vdl.mapping.RootDataNode identifier >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ >> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed >> SwiftScript value (closed) >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 >> path=$ >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 >> VALUE=s/@DIT@/10/ >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed >> 
org.griphyn.vdl.mapping.RootDataNode identifier >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ >> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed >> SwiftScript value (closed) >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 >> path=$ >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 >> VALUE=s/@TUI@/1/ >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed >> org.griphyn.vdl.mapping.RootDataNode identifier >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ >> 3g:720000000094 type string value=params.tloop dataset=unnamed >> SwiftScript value (closed) >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 >> path=$ >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 >> VALUE=params.tloop >> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098 >> java.lang.NullPointerException >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >> >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >> at >> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >> >> at >> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >> >> >> >> On 3/17/09 4:17 PM, Michael Wilde wrote: >>> It seems not related to scale or Falkon. >>> It occurs when running on localhost (but on bgp) and when I cut all >>> the loops down to a single iteration. >>> I'm still debugging. 
>>> On 3/17/09 4:07 PM, Michael Wilde wrote: >>>> I just expanded my oops protein folding script to add another level >>>> of parameter sweep. This script is getting pretty complex now (at >>>> least, for a swift script). >>>> >>>> I got the following NPE on my first two tries. I'm going to start >>>> debugging, but any clues as to the cause would be helpful. >>>> >>>> The outer loops are: >>>> >>>> main() >>>> { >>>> string protein[] = readData(@arg("plist")); >>>> string startTemp[] = ["10","20"]; >>>> string tempUpdate[] = ["1","2","3"]; >>>> >>>> foreach p in protein { >>>> foreach st in startTemp { >>>> foreach tu in tempUpdate { >>>> doRoundSet(p,st,tu); >>>> } >>>> } >>>> } >>>> } >>>> >>>> There are two levels of inner loops further down below doRoundSet(). >>>> >>>> The script, output, command line args and log are in: >>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz >>>> >>>> I suspect it will take a while to narrow the cause to a simpler test >>>> case that's easy to reproduce without a lot of setup. >>>> >>>> I'll try on a vanilla swift on local execution; this is on BGP with >>>> Falkon. >>>> >>>> Thanks. >>>> >>>> -- >>>> >>>> ... 
>>>> Progress: uninitialized:1 Selecting site:2 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 2 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 8 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 0 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 5 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 9 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 1 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 6 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 3 >>>> SwiftScript trace: T1af7, Round, 0, Sim, 4 >>>> Ex098 >>>> java.lang.NullPointerException >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>> >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>> at >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>> >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>> >>>> at >>>> 
org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>>> at >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>> >>>> Execution failed: >>>> java.lang.NullPointerException >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>> >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>> at >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>> >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>>> at >>>> 
org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>>> at >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>> >>>> >>>> Ex098 >>>> java.lang.NullPointerException >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>> >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>> at >>>> 
org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>> at >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>> >>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>>> at >>>> 
org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>> at >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>> >>>> at >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>> >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Tue Mar 17 17:10:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 17 Mar 2009 17:10:02 -0500 Subject: [Swift-devel] null pointer exception from nested loops In-Reply-To: <49C01DF7.4000204@mcs.anl.gov> References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov> <49C0154C.7000608@mcs.anl.gov> <49C01A2D.6050408@mcs.anl.gov> <49C01DF7.4000204@mcs.anl.gov> Message-ID: <1237327802.2668.0.camel@localhost> Looks like it might be the nested procedure call interacting badly with readData. 
On Tue, 2009-03-17 at 17:02 -0500, Michael Wilde wrote: > OK, I think this is it: > > The following script *works* with the commented-out code and fails with > the currently enabled alternative statement: > > -- > > type file; > > app (file editedParams) setTemps ( file inParams ) > { > sed "-e" "s/@DTI@/123/" @inParams stdout=@editedParams; > } > > file inParams; > > /* works: > file o<"pout">; > o = setTemps(inParams); > string config [] = readData(o); > */ > > /* Fails: */ > string config [] = readData( setTemps(inParams ) ); > > trace(0,config[0]); > trace(1,config[1]); > trace(2,config[2]); > > -- > > So readData is indeed happy to take a file-valued variable as an arg but not a > file-valued expression (procedure return in this case). > > > > On 3/17/09 4:46 PM, Michael Wilde wrote: > > I got the error narrowed down to this example: > > > > type file; > > > > app (file editedParams) setTemps ( file inParams ) > > { > > echo @inParams stdout=@editedParams; > > } > > > > file inParams; > > > > string config [] = readData( setTemps(inParams ) ); > > trace(0,config[0]); > > trace(1,config[1]); > > trace(2,config[2]); > > > > -- > > > > params.tloops contains: > > > > DEFAULT_INIT_TEMP_=_@DTI@ > > TEMP_UPDATE_INTERVAL_=_@TUI@ > > KILL_TIME_=_3 > > MAX_NUMBER_OF_ANNEALING_STEPS_=_0 > > > > -- > > > > In other code, readData() seemed happy to take a file *or* a filename > > string as input, but I wonder if it was not as happy as it seemed. > > > > I'd been taking advantage of the flexibility with good results (luck?) > > so far, though. > > > > On 3/17/09 4:25 PM, Michael Wilde wrote: > >> The log contains this just before the NPE, including the suspicious > >> message: WARN FlowNode Ex098: > >> > >> That's giving me a clue as to the offending statements. 
> >> > >> --- > >> > >> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle > >> listener "F/org.griphyn.vdl.mapping.DataNode identifier > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090\ > >> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at > >> dataset=secseq path=[0] (not closed)" to > >> "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,\ > >> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq > >> with no value at dataset=secseq path=[0] (not closed)" > >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed > >> org.griphyn.vdl.mapping.RootDataNode identifier > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ > >> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed > >> SwiftScript value (closed) > >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 > >> path=$ > >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 > >> VALUE=s/@DIT@/10/ > >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed > >> org.griphyn.vdl.mapping.RootDataNode identifier > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ > >> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed > >> SwiftScript value (closed) > >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 > >> path=$ > >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 > >> VALUE=s/@TUI@/1/ > >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed > >> org.griphyn.vdl.mapping.RootDataNode identifier > >> tag:benc at 
ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ > >> 3g:720000000094 type string value=params.tloop dataset=unnamed > >> SwiftScript value (closed) > >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 > >> path=$ > >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 > >> VALUE=params.tloop > >> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098 > >> java.lang.NullPointerException > >> at > >> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > >> > >> at > >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > >> at > >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > >> at > >> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > >> > >> at > >> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > >> > >> > >> > >> On 3/17/09 4:17 PM, Michael Wilde wrote: > >>> It seems not related to scale or Falkon. > >>> > >>> It occurs when running on localhost (but on BGP) and when I cut all > >>> the loops down to a single iteration. > >>> > >>> I'm still debugging. > >>> > >>> On 3/17/09 4:07 PM, Michael Wilde wrote: > >>>> I just expanded my oops protein folding script to add another level > >>>> of parameter sweep. This script is getting pretty complex now (at > >>>> least, for a swift script). > >>>> > >>>> I got the following NPE on my first two tries. I'm going to start > >>>> debugging, but any clues as to the cause would be helpful. 
> >>>> > >>>> The outer loops are: > >>>> > >>>> main() > >>>> { > >>>> string protein[] = readData(@arg("plist")); > >>>> string startTemp[] = ["10","20"]; > >>>> string tempUpdate[] = ["1","2","3"]; > >>>> > >>>> foreach p in protein { > >>>> foreach st in startTemp { > >>>> foreach tu in tempUpdate { > >>>> doRoundSet(p,st,tu); > >>>> } > >>>> } > >>>> } > >>>> } > >>>> > >>>> There are two levels of inner loops further down below doRoundSet(). > >>>> > >>>> The script, output, command line args and log are in: > >>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz > >>>> > >>>> I suspect it will take a while to narrow the cause to a simpler test > >>>> case that's easy to reproduce without a lot of setup. > >>>> > >>>> I'll try on a vanilla swift on local execution; this is on BGP with > >>>> Falkon. > >>>> > >>>> Thanks. > >>>> > >>>> -- > >>>> > >>>> ... > >>>> Progress: uninitialized:1 Selecting site:2 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 2 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 8 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 0 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 5 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 9 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 1 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 6 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 3 > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 4 > >>>> Ex098 > >>>> java.lang.NullPointerException > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > >>>> > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > >>>> at > >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > >>>> > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > >>>> at > >>>> 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > >>>> at > >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > >>>> at > 
>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > >>>> > >>>> Execution failed: > >>>> java.lang.NullPointerException > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > >>>> > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > >>>> at > >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > >>>> > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > >>>> > >>>> at > >>>> 
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > >>>> at > >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > >>>> > >>>> > >>>> Ex098 > >>>> java.lang.NullPointerException > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > >>>> > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > >>>> at > >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > >>>> > >>>> at > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > >>>> at > >>>> 
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > >>>> at > >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > >>>> > >>>> at > >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > >>>> > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> 
From foster at anl.gov Tue Mar 17 17:22:07 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 17 Mar 2009 17:22:07 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <49C01EB1.5060801@mcs.anl.gov>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov> <49C0154C.7000608@mcs.anl.gov> <17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov> <49C01EB1.5060801@mcs.anl.gov>
Message-ID:

I wasn't sure if this ran without Swift--with just Falkon

On Mar 17, 2009, at 5:05 PM, Michael Wilde wrote:

> Not quite sure what you're asking, Ian.
>
> The latest tests have been on BGP w/ Falkon.
> Earlier tests were on other clusters.
>
> The script has grown in the last week or so, on BGP, and grew some more
> today to explore some new science code algorithm questions.
>
> It's not yet running at full desired scale on the BGP; we are now
> scaling up carefully so as not to impact other users.
>
> This is a test case for the "cio" work as well.
>
> - Mike
>
>
> On 3/17/09 4:26 PM, Ian Foster wrote:
>> Just curious, is the whole thing working with just Falkon?
>> On Mar 17, 2009, at 4:25 PM, Michael Wilde wrote: >>> The log contains this just before the NPE, including the >>> suspicious message: WARN FlowNode Ex098: >>> >>> Thats giving me a clue as to the offending statements. >>> >>> --- >>> >>> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle >>> listener "F/org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu >>> ,2008:swift:dataset:20090\ >>> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at >>> dataset=secseq path=[0] (not closed)" to >>> "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu >>> ,\ >>> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq >>> with no value at dataset=secseq path=[0] (not closed)" >>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed >>> org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu >>> ,2008:swift:dataset:20090317-1620-e1n1bz\ >>> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed >>> SwiftScript value (closed) >>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH >>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620- >>> e1n1bz3g:720000000092 path=$ >>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE >>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620- >>> e1n1bz3g:720000000092 VALUE=s/@DIT@/10/ >>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed >>> org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu >>> ,2008:swift:dataset:20090317-1620-e1n1bz\ >>> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed >>> SwiftScript value (closed) >>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH >>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620- >>> e1n1bz3g:720000000093 path=$ >>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE >>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620- >>> e1n1bz3g:720000000093 
VALUE=s/@TUI@/1/ >>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed >>> org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu >>> ,2008:swift:dataset:20090317-1620-e1n1bz\ >>> 3g:720000000094 type string value=params.tloop dataset=unnamed >>> SwiftScript value (closed) >>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH >>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620- >>> e1n1bz3g:720000000094 path=$ >>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE >>> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620- >>> e1n1bz3g:720000000094 VALUE=params.tloop >>> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098 >>> java.lang.NullPointerException >>> at >>> org >>> .griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java: >>> 285) >>> at >>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java: >>> 201) >>> at >>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java: >>> 182) >>> at >>> org >>> .griphyn >>> .vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>> at >>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>> >>> >>> >>> On 3/17/09 4:17 PM, Michael Wilde wrote: >>>> It seems not related to scale or Falkon. >>>> It occurs when running on localhost (but on bgp) and when I cut >>>> all the loops down to a single iteration. >>>> I'm still debugging. >>>> On 3/17/09 4:07 PM, Michael Wilde wrote: >>>>> I just expanded my oops protein folding script to add another >>>>> level of parameter sweep. This script is getting pretty complex >>>>> now (at least, for a swift script). >>>>> >>>>> I got the following npe on my first two tries. Im going to start >>>>> debugging, but any clues as to the cause would be helpful. 
>>>>> >>>>> The outer loops are: >>>>> >>>>> main() >>>>> { >>>>> string protein[] = readData(@arg("plist")); >>>>> string startTemp[] = ["10","20"]; >>>>> string tempUpdate[] = ["1","2","3"]; >>>>> >>>>> foreach p in protein { >>>>> foreach st in startTemp { >>>>> foreach tu in tempUpdate { >>>>> doRoundSet(p,st,tu); >>>>> } >>>>> } >>>>> } >>>>> } >>>>> >>>>> There are two levels of inner loops further down below >>>>> doRoundSet(). >>>>> >>>>> The script, output, command line args and log are in: >>>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz >>>>> >>>>> I suspect it will take a while to narrow the cause to a simpler >>>>> test case thats easy tp reproduce without a lot of setup. >>>>> >>>>> I'll try on a vanilla swift on local execution; this is on bgp >>>>> with Falkon. >>>>> >>>>> Thanks. >>>>> >>>>> -- >>>>> >>>>> ... >>>>> Progress: uninitialized:1 Selecting site:2 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 2 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 8 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 0 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 5 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 9 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 1 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 6 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 3 >>>>> SwiftScript trace: T1af7, Round, 0, Sim, 4 >>>>> Ex098 >>>>> java.lang.NullPointerException >>>>> at >>>>> org >>>>> .griphyn >>>>> .vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>>> at >>>>> org >>>>> .griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java: >>>>> 201) >>>>> at >>>>> org >>>>> .griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java: >>>>> 182) >>>>> at >>>>> org >>>>> .griphyn >>>>> .vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>>> at >>>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow 
>>>>> .nodes >>>>> .AbstractSequentialWithArguments >>>>> .childCompleted(AbstractSequentialWithArguments.java:192) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java: >>>>> 332) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.events.EventBus.send(EventBus.java: >>>>> 125) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow >>>>> .nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java: >>>>> 51) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow >>>>> .nodes >>>>> .functions >>>>> .AbstractFunction.executeChildren(AbstractFunction.java:40) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java: >>>>> 63) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java: >>>>> 278) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java: >>>>> 391) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java: 
>>>>> 329) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.events.EventBus.send(EventBus.java: >>>>> 125) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>>> Execution failed: >>>>> java.lang.NullPointerException >>>>> at >>>>> org >>>>> .griphyn >>>>> .vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>>> at >>>>> org >>>>> .griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java: >>>>> 201) >>>>> at >>>>> org >>>>> .griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java: >>>>> 182) >>>>> at >>>>> org >>>>> .griphyn >>>>> .vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>>> at >>>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow >>>>> .nodes >>>>> .AbstractSequentialWithArguments >>>>> .childCompleted(AbstractSequentialWithArguments.java:192) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java: >>>>> 332) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.events.EventBus.send(EventBus.java: >>>>> 125) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>> at >>>>> 
org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow >>>>> .nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java: >>>>> 51) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow >>>>> .nodes >>>>> .functions >>>>> .AbstractFunction.executeChildren(AbstractFunction.java:40) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java: >>>>> 63) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java: >>>>> 278) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java: >>>>> 391) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java: >>>>> 329) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.events.EventBus.send(EventBus.java: >>>>> 125) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>>> >>>>> Ex098 >>>>> java.lang.NullPointerException >>>>> at >>>>> org >>>>> .griphyn >>>>> .vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>>> at >>>>> org >>>>> .griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java: >>>>> 201) >>>>> at >>>>> org >>>>> .griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java: >>>>> 182) >>>>> at >>>>> org >>>>> .griphyn >>>>> .vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) 
>>>>> at >>>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow >>>>> .nodes >>>>> .AbstractSequentialWithArguments >>>>> .childCompleted(AbstractSequentialWithArguments.java:192) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java: >>>>> 332) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.events.EventBus.send(EventBus.java: >>>>> 125) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow >>>>> .nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java: >>>>> 51) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan >>>>> .workflow >>>>> .nodes >>>>> .functions >>>>> .AbstractFunction.executeChildren(AbstractFunction.java:40) >>>>> at >>>>> org >>>>> .globus >>>>> .cog >>>>> .karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java: >>>>> 63) >>>>> at >>>>> org >>>>> .globus >>>>> .cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>>> at >>>>> org >>>>> .globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java: >>>>> 278) >>>>> at >>>>> org >>>>> .globus >>>>> 
.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391)
>>>>> at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329)
>>>>> at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
>>>>> at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
>>>>> at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7

From wilde at mcs.anl.gov Tue Mar 17 18:03:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 17 Mar 2009 18:03:34 -0500
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To:
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov> <49C0154C.7000608@mcs.anl.gov> <17AE2924-B021-46EC-9649-C80B1EBFFECD@anl.gov> <49C01EB1.5060801@mcs.anl.gov>
Message-ID: <49C02C46.2070106@mcs.anl.gov>

I see. No, we've been doing all tests through Swift, but testing the app standalone at various points on both local hosts and bgp compute nodes.

On 3/17/09 5:22 PM, Ian Foster wrote:
> I wasn't sure if this ran without Swift--with just Falkon
>
>
> On Mar 17, 2009, at 5:05 PM, Michael Wilde wrote:
>
>> Not quite sure what you're asking, Ian.
>>
>> The latest tests have been on BGP w/ Falkon.
>> Earlier tests were on other clusters.
>>
>> The script has grown in the last week or so, on BGP, and grew some more
>> today to explore some new science code algorithm questions.
>>
>> It's not yet running at full desired scale on the BGP; we are now
>> scaling up carefully so as not to impact other users.
>>
>> This is a test case for the "cio" work as well.
>>
>> - Mike
>>
>>
>> On 3/17/09 4:26 PM, Ian Foster wrote:
>>> Just curious, is the whole thing working with just Falkon?
>>> On Mar 17, 2009, at 4:25 PM, Michael Wilde wrote:
>>>> The log contains this just before the NPE, including the suspicious
>>>> message: WARN FlowNode Ex098:
>>>>
>>>> That's giving me a clue as to the offending statements.
>>>>
>>>> ---
>>>>
>>>> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle listener "F/org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)" to "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq with no value at dataset=secseq path=[0] (not closed)"
>>>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed SwiftScript value (closed)
>>>> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 path=$
>>>> 2009-03-17
16:20:34,724-0500 INFO AbstractDataNode VALUE >>>> dataset=tag:benc at ci.uchicago.edu >>>> ,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 >>>> VALUE=s/@DIT@/10/ >>>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed >>>> org.griphyn.vdl.mapping.RootDataNode identifier >>>> tag:benc at ci.uchicago.edu >>>> ,2008:swift:dataset:20090317-1620-e1n1bz\ >>>> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed >>>> SwiftScript value (closed) >>>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH >>>> dataset=tag:benc at ci.uchicago.edu >>>> ,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 >>>> path=$ >>>> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE >>>> dataset=tag:benc at ci.uchicago.edu >>>> ,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 >>>> VALUE=s/@TUI@/1/ >>>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed >>>> org.griphyn.vdl.mapping.RootDataNode identifier >>>> tag:benc at ci.uchicago.edu >>>> ,2008:swift:dataset:20090317-1620-e1n1bz\ >>>> 3g:720000000094 type string value=params.tloop dataset=unnamed >>>> SwiftScript value (closed) >>>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH >>>> dataset=tag:benc at ci.uchicago.edu >>>> ,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 >>>> path=$ >>>> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE >>>> dataset=tag:benc at ci.uchicago.edu >>>> ,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 >>>> VALUE=params.tloop >>>> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098 >>>> java.lang.NullPointerException >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>> >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>> at >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>> at >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>> >>>> at >>>> 
org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>> >>>> >>>> >>>> On 3/17/09 4:17 PM, Michael Wilde wrote: >>>>> It seems not related to scale or Falkon. >>>>> It occurs when running on localhost (but on bgp) and when I cut all >>>>> the loops down to a single iteration. >>>>> I'm still debugging. >>>>> On 3/17/09 4:07 PM, Michael Wilde wrote: >>>>>> I just expanded my oops protein folding script to add another >>>>>> level of parameter sweep. This script is getting pretty complex >>>>>> now (at least, for a swift script). >>>>>> >>>>>> I got the following npe on my first two tries. Im going to start >>>>>> debugging, but any clues as to the cause would be helpful. >>>>>> >>>>>> The outer loops are: >>>>>> >>>>>> main() >>>>>> { >>>>>> string protein[] = readData(@arg("plist")); >>>>>> string startTemp[] = ["10","20"]; >>>>>> string tempUpdate[] = ["1","2","3"]; >>>>>> >>>>>> foreach p in protein { >>>>>> foreach st in startTemp { >>>>>> foreach tu in tempUpdate { >>>>>> doRoundSet(p,st,tu); >>>>>> } >>>>>> } >>>>>> } >>>>>> } >>>>>> >>>>>> There are two levels of inner loops further down below doRoundSet(). >>>>>> >>>>>> The script, output, command line args and log are in: >>>>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz >>>>>> >>>>>> I suspect it will take a while to narrow the cause to a simpler >>>>>> test case thats easy tp reproduce without a lot of setup. >>>>>> >>>>>> I'll try on a vanilla swift on local execution; this is on bgp >>>>>> with Falkon. >>>>>> >>>>>> Thanks. >>>>>> >>>>>> -- >>>>>> >>>>>> ... 
>>>>>> Progress: uninitialized:1 Selecting site:2 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 2 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 8 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 0 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 5 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 9 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 1 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 6 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 3 >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 4 >>>>>> Ex098 >>>>>> java.lang.NullPointerException >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>>>> >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>>>> >>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>>> >>>>>> at >>>>>> 
org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>>>> >>>>>> Execution failed: >>>>>> java.lang.NullPointerException >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>>>> >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>>>> >>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>>>> 
>>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>>> >>>>>> at >>>>>> 
org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>>>> >>>>>> >>>>>> Ex098 >>>>>> java.lang.NullPointerException >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) >>>>>> >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) >>>>>> at >>>>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) >>>>>> >>>>>> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) >>>>>> >>>>>> at >>>>>> 
org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >>>>>> >>>>>> at >>>>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >>>>>> >>>>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From skenny at uchicago.edu Tue Mar 17 22:14:40 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 17 Mar 2009 22:14:40 -0500 (CDT) Subject: [Swift-devel] How does swift know if a task is successful Message-ID: <20090317221440.BUF44237@m4500-02.uchicago.edu> hey zhao, did you get this to work? was thinking i might try it on ranger, but i was wondering if you also then have to hack something else to prevent swift from cleaning up your work directory? 
that is, i assume you actually DO want the output, you just don't want to have to wait for the stageouts (?) ---- Original message ---- >Date: Tue, 17 Mar 2009 13:40:30 -0500 >From: Mihael Hategan >Subject: Re: [Swift-devel] How does swift know if a task is successful >To: Zhao Zhang >Cc: swift-devel > >On Tue, 2009-03-17 at 13:36 -0500, Zhao Zhang wrote: >> Hi, Mihael >> >> I commented the following lines >> /*dir:make(ldir) >> restartOnError(".*", 2 >> task:transfer(srchost=host, srcfile=bname, >> srcdir=rdir, destdir=ldir, desthost=dhost, destprovider=provider) >> )*/ >> > >Did you modify this file in dist/?/libexec? If not, did you re-compile >swift after the modification? > >Put an echo or a log message in place, to see if your change is picked >up by swift next time. > >> Then I modified wrapper.sh to not to copy output file back, but I still >> got an error. >> The log file is at >> http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log >> Thanks >> >> zhao >> >> zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 64 first.swift >> waiting for at least 64 nodes to register before submitting workload... >> waiting to find at least 1 services in file >> /home/falkon/users/zzhang/1117/config/Client-service-URIs.config... >> all done, file has found at least 1 services >> found at least 64 registered, submitting workload... 
>> Swift svn swift-r2676 (swift modified locally) cog-r2305 >> >> RunID: 20090317-1327-oqgttus8 >> Progress: >> Progress: Selecting site:1 Stage in:1 >> Progress: Submitting:1 Submitted:1 >> Progress: Submitted:1 Failed but can retry:1 >> Failed to transfer wrapper log from >> first-20090317-1327-oqgttus8/info/b/n/bgp000 >> Progress: Submitted:1 Active:1 >> Failed to transfer wrapper log from >> first-20090317-1327-oqgttus8/info/e/n/bgp000 >> Progress: Submitted:1 Active:1 >> Failed to transfer wrapper log from >> first-20090317-1327-oqgttus8/info/g/n/bgp000 >> Execution failed: >> Exception in echo: >> Arguments: [Hello, world!] >> Host: bgp000 >> Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Cannot transfer >> "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to >> "/gpfs/home/zzhang/new_dock6/./hello.txt" >> Caused by: >> No such file >> >> >> Mihael Hategan wrote: >> > On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote: >> > >> >> Hi, Mihael >> >> >> >> yes, can I do that? >> >> >> > >> > You should know this by now: >> > in vdl-int.k, in doStageout, comment out the task:transfer invocation >> > (and dir:make). >> > >> > >> >> zhao >> >> >> >> Mihael Hategan wrote: >> >> >> >>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote: >> >>> >> >>> >> >>>> Here comes another question, is there any place that I could set to >> >>>> disable swift's waiting for data feature? >> >>>> >> >>>> >> >>> Do you mean disable the stage-outs? >> >>> >> >>> >> >>> >> >>>> Or is there any way for me to cheat swift that the data is already >> >>>> there? thanks. >> >>>> >> >>>> zhao >> >>>> >> >>>> Mihael Hategan wrote: >> >>>> >> >>>> >> >>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote: >> >>>>> >> >>>>> >> >>>>> >> >>>>>> Hi, All >> >>>>>> >> >>>>>> I have a question on how swift knows if a task is successful. 
>> >>>>>> In my case, I am using a status notification instead of a status file. >> >>>>>> >> >>>>>> So my question is is this status notification the only thing swift is >> >>>>>> waiting for, or is swift also waiting for the output data to appear to >> >>>>>> say that a job is successful? >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>> Once the job is done, swift will attempt to stage out all the files that >> >>>>> it expects the job to have produced. >> >>>>> >> >>>>> Should one of those files not be there, there will be failures. >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>> >> >>> >> > >> > >> > > >_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From zhaozhang at uchicago.edu Tue Mar 17 23:04:59 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 17 Mar 2009 23:04:59 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <20090317221440.BUF44237@m4500-02.uchicago.edu> References: <20090317221440.BUF44237@m4500-02.uchicago.edu> Message-ID: <49C072EB.2060109@uchicago.edu> Hi, Sarah skenny at uchicago.edu wrote: > hey zhao, did you get this to work? Not yet, I am still working on it. > was thinking i might try > it on ranger, but i was wondering if you also then have to > hack something else to prevent swift from cleaning up your > work directory? I think to prevent swift cleaning up work dir is just an option in swift.properties. > that is, i assume you actually DO want the > output, you just don't want to have to wait for the stageouts (?) > exactly, currently, we are building a collective IO system on BGP, so CIO will take care of stage out results. 
zhao > ---- Original message ---- > >> Date: Tue, 17 Mar 2009 13:40:30 -0500 >> From: Mihael Hategan >> Subject: Re: [Swift-devel] How does swift know if a task is >> > successful > >> To: Zhao Zhang >> Cc: swift-devel >> >> On Tue, 2009-03-17 at 13:36 -0500, Zhao Zhang wrote: >> >>> Hi, Mihael >>> >>> I commented the following lines >>> /*dir:make(ldir) >>> restartOnError(".*", 2 >>> task:transfer(srchost=host, srcfile=bname, >>> srcdir=rdir, destdir=ldir, desthost=dhost, >>> > destprovider=provider) > >>> )*/ >>> >>> >> Did you modify this file in dist/?/libexec? If not, did you >> > re-compile > >> swift after the modification? >> >> Put an echo or a log message in place, to see if your change >> > is picked > >> up by swift next time. >> >> >>> Then I modified wrapper.sh to not to copy output file back, >>> > but I still > >>> got an error. >>> The log file is at >>> >>> > http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log > >>> Thanks >>> >>> zhao >>> >>> zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 >>> > 64 first.swift > >>> waiting for at least 64 nodes to register before submitting >>> > workload... > >>> waiting to find at least 1 services in file >>> >>> > /home/falkon/users/zzhang/1117/config/Client-service-URIs.config... > >>> all done, file has found at least 1 services >>> found at least 64 registered, submitting workload... 
>>> Swift svn swift-r2676 (swift modified locally) cog-r2305 >>> >>> RunID: 20090317-1327-oqgttus8 >>> Progress: >>> Progress: Selecting site:1 Stage in:1 >>> Progress: Submitting:1 Submitted:1 >>> Progress: Submitted:1 Failed but can retry:1 >>> Failed to transfer wrapper log from >>> first-20090317-1327-oqgttus8/info/b/n/bgp000 >>> Progress: Submitted:1 Active:1 >>> Failed to transfer wrapper log from >>> first-20090317-1327-oqgttus8/info/e/n/bgp000 >>> Progress: Submitted:1 Active:1 >>> Failed to transfer wrapper log from >>> first-20090317-1327-oqgttus8/info/g/n/bgp000 >>> Execution failed: >>> Exception in echo: >>> Arguments: [Hello, world!] >>> Host: bgp000 >>> Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j >>> stderr.txt: >>> >>> stdout.txt: >>> >>> ---- >>> >>> Caused by: >>> Cannot transfer >>> "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to >>> "/gpfs/home/zzhang/new_dock6/./hello.txt" >>> Caused by: >>> No such file >>> >>> >>> Mihael Hategan wrote: >>> >>>> On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote: >>>> >>>> >>>>> Hi, Mihael >>>>> >>>>> yes, can I do that? >>>>> >>>>> >>>> You should know this by now: >>>> in vdl-int.k, in doStageout, comment out the >>>> > task:transfer invocation > >>>> (and dir:make). >>>> >>>> >>>> >>>>> zhao >>>>> >>>>> Mihael Hategan wrote: >>>>> >>>>> >>>>>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Here comes another question, is there any place that I >>>>>>> > could set to > >>>>>>> disable swift's waiting for data feature? >>>>>>> >>>>>>> >>>>>>> >>>>>> Do you mean disable the stage-outs? >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Or is there any way for me to cheat swift that the >>>>>>> > data is already > >>>>>>> there? thanks. 
>>>>>>> >>>>>>> zhao >>>>>>> >>>>>>> Mihael Hategan wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Hi, All >>>>>>>>> >>>>>>>>> I have a question on how swift knows if a task is >>>>>>>>> > successful. > >>>>>>>>> In my case, I am using a status notification instead >>>>>>>>> > of a status file. > >>>>>>>>> So my question is is this status notification the >>>>>>>>> > only thing swift is > >>>>>>>>> waiting for, or is swift also waiting for the output >>>>>>>>> > data to appear to > >>>>>>>>> say that a job is successful? >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> Once the job is done, swift will attempt to stage out >>>>>>>> > all the files that > >>>>>>>> it expects the job to have produced. >>>>>>>> >>>>>>>> Should one of those files not be there, there will be >>>>>>>> > failures. > >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>>>> >>>> >>>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From zhaozhang at uchicago.edu Tue Mar 17 23:58:10 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 17 Mar 2009 23:58:10 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <1237315230.31264.1.camel@localhost> References: <49BAD926.1030607@uchicago.edu> <1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu> <1237310455.29738.11.camel@localhost> <49BFDC78.8040506@uchicago.edu> <1237310948.30064.2.camel@localhost> <49BFEDB0.5070409@uchicago.edu> <1237315230.31264.1.camel@localhost> Message-ID: <49C07F62.3000309@uchicago.edu> Hi, Mihael I modified the vdl-int.k in cog/module/swift/libexec, and rebuilt swift, and I used my customized wrapper.sh. I ran the first.swift as a test, the job returned successful, and the output file was still staged out. Any ideas? Thanks. 
zhao Mihael Hategan wrote: > On Tue, 2009-03-17 at 13:36 -0500, Zhao Zhang wrote: > >> Hi, Mihael >> >> I commented the following lines >> /*dir:make(ldir) >> restartOnError(".*", 2 >> task:transfer(srchost=host, srcfile=bname, >> srcdir=rdir, destdir=ldir, desthost=dhost, destprovider=provider) >> )*/ >> >> > > Did you modify this file in dist/?/libexec? If not, did you re-compile > swift after the modification? > > Put an echo or a log message in place, to see if your change is picked > up by swift next time. > > >> Then I modified wrapper.sh to not to copy output file back, but I still >> got an error. >> The log file is at >> http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log >> Thanks >> >> zhao >> >> zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 64 first.swift >> waiting for at least 64 nodes to register before submitting workload... >> waiting to find at least 1 services in file >> /home/falkon/users/zzhang/1117/config/Client-service-URIs.config... >> all done, file has found at least 1 services >> found at least 64 registered, submitting workload... >> Swift svn swift-r2676 (swift modified locally) cog-r2305 >> >> RunID: 20090317-1327-oqgttus8 >> Progress: >> Progress: Selecting site:1 Stage in:1 >> Progress: Submitting:1 Submitted:1 >> Progress: Submitted:1 Failed but can retry:1 >> Failed to transfer wrapper log from >> first-20090317-1327-oqgttus8/info/b/n/bgp000 >> Progress: Submitted:1 Active:1 >> Failed to transfer wrapper log from >> first-20090317-1327-oqgttus8/info/e/n/bgp000 >> Progress: Submitted:1 Active:1 >> Failed to transfer wrapper log from >> first-20090317-1327-oqgttus8/info/g/n/bgp000 >> Execution failed: >> Exception in echo: >> Arguments: [Hello, world!] 
>> Host: bgp000 >> Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> Cannot transfer >> "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to >> "/gpfs/home/zzhang/new_dock6/./hello.txt" >> Caused by: >> No such file >> >> >> Mihael Hategan wrote: >> >>> On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote: >>> >>> >>>> Hi, Mihael >>>> >>>> yes, can I do that? >>>> >>>> >>> You should know this by now: >>> in vdl-int.k, in doStageout, comment out the task:transfer invocation >>> (and dir:make). >>> >>> >>> >>>> zhao >>>> >>>> Mihael Hategan wrote: >>>> >>>> >>>>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote: >>>>> >>>>> >>>>> >>>>>> Here comes another question, is there any place that I could set to >>>>>> disable swift's waiting for data feature? >>>>>> >>>>>> >>>>>> >>>>> Do you mean disable the stage-outs? >>>>> >>>>> >>>>> >>>>> >>>>>> Or is there any way for me to cheat swift that the data is already >>>>>> there? thanks. >>>>>> >>>>>> zhao >>>>>> >>>>>> Mihael Hategan wrote: >>>>>> >>>>>> >>>>>> >>>>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Hi, All >>>>>>>> >>>>>>>> I have a question on how swift knows if a task is successful. >>>>>>>> In my case, I am using a status notification instead of a status file. >>>>>>>> >>>>>>>> So my question is is this status notification the only thing swift is >>>>>>>> waiting for, or is swift also waiting for the output data to appear to >>>>>>>> say that a job is successful? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> Once the job is done, swift will attempt to stage out all the files that >>>>>>> it expects the job to have produced. >>>>>>> >>>>>>> Should one of those files not be there, there will be failures. 
>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> >>>>> >>> >>> > > >

From benc at hawaga.org.uk Wed Mar 18 05:29:36 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 10:29:36 +0000 (GMT)
Subject: [Swift-devel] testing
In-Reply-To: <1237307410.26969.8.camel@localhost>
References: <1237307410.26969.8.camel@localhost>
Message-ID:

On Tue, 17 Mar 2009, Mihael Hategan wrote:

> There are probably a few issues still left to address, one of which is
> to make sure that coasters are an acceptable way of running things on
> OSG. I suspect this would require some negotiation with the right people
> from OSG, and I don't know who those people are.

Mats can probably make comment on what people in OSG are going to say.

My sense is:
i) coasters as a general concept is fine - the big VOs do stuff like that.
ii) running anything on the head nodes is bad
iii) running anything through gram2 is bad - any base job submissions
need to be through condor-g using its hybrid gram2+gridmanager system.

--

From benc at hawaga.org.uk Wed Mar 18 05:31:02 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 10:31:02 +0000 (GMT)
Subject: [Swift-devel] testing
In-Reply-To: <1237310335.29738.8.camel@localhost>
References: <1237307410.26969.8.camel@localhost> <49BFD80C.7080903@mcs.anl.gov> <1237310335.29738.8.camel@localhost>
Message-ID:

On Tue, 17 Mar 2009, Mihael Hategan wrote:

> this testing, not before. But I would like to "negotiate" the ability
> to:
>
> - have one process on the head node, hopefully one that doesn't hog it.
> - the ability to submit from the head node to the queuing system
> directly (as if running qsub manually - something that isn't exactly
> "the way" on OSG).

I think you're likely to get 'no' to both of those.
--

From wilde at mcs.anl.gov Wed Mar 18 07:27:55 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 07:27:55 -0500
Subject: [Swift-devel] testing
In-Reply-To:
References: <1237307410.26969.8.camel@localhost>
Message-ID: <49C0E8CB.3080600@mcs.anl.gov>

On 3/18/09 5:29 AM, Ben Clifford wrote:
> On Tue, 17 Mar 2009, Mihael Hategan wrote:
>
>> There are probably a few issues still left to address, one of which is
>> to make sure that coasters are an acceptable way of running things on
>> OSG. I suspect this would require some negotiation with the right people
>> from OSG, and I don't know who those people are.
>
> Mats can probably make comment on what people in OSG are going to say.
>
> My sense is:
> i) coasters as a general concept is fine - the big VOs do stuff like that.
> ii) running anything on the head nodes is bad

I agree in principle. The immediate issue is whether the load we place on
head nodes will be light and not cause trouble, or whether it will be yet
another obstacle for us. We need to cope with managed head nodes and their
time limiter. I don't know that that's been tested yet.

Can coasters architecturally cope with no head node access, if we use a
worker node for this function and it connects back to the submitting swift
process? This is on the assumption that outbound connectivity from workers
will be more commonly found than inbound.

> iii) running anything through gram2 is bad - any base job submissions
> need to be through condor-g using its hybrid gram2+gridmanager system.

I agree, and was assuming that on OSG we would only use the new Condor
provider, and run jobs in this manner.

From benc at hawaga.org.uk Wed Mar 18 07:35:35 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 18 Mar 2009 12:35:35 +0000 (GMT)
Subject: [Swift-devel] null pointer exception from nested loops
In-Reply-To: <1237327802.2668.0.camel@localhost>
References: <49C010FE.4070503@mcs.anl.gov> <49C01351.1050807@mcs.anl.gov> <49C0154C.7000608@mcs.anl.gov> <49C01A2D.6050408@mcs.anl.gov> <49C01DF7.4000204@mcs.anl.gov> <1237327802.2668.0.camel@localhost>
Message-ID:

r2707 fixes this. It was the nested procedure call interacting badly with
anything that returns mapped content rather than in-memory values.

On Tue, 17 Mar 2009, Mihael Hategan wrote:

> Looks like it might be the nested procedure call interacting badly with
> readData.
>
> On Tue, 2009-03-17 at 17:02 -0500, Michael Wilde wrote:
> > OK, I think this is it:
> >
> > The following script *works* with the commented out code and fails with
> > the currently enabled alternative statement:
> >
> > --
> >
> > type file;
> >
> > app (file editedParams) setTemps ( file inParams )
> > {
> >   sed "-e" "s/@DTI@/123/" @inParams stdout=@editedParams;
> > }
> >
> > file inParams;
> >
> > /* works:
> > file o<"pout">;
> > o = setTemps(inParams);
> > string config [] = readData(o);
> > */
> >
> > /* Fails: */
> > string config [] = readData( setTemps(inParams ) );
> >
> > trace(0,config[0]);
> > trace(1,config[1]);
> > trace(2,config[2]);
> >
> > --
> >
> > So readData is indeed happy to take a file-value var as an arg but not a
> > file-valued expression (procedure return in this case).
> >
> >
> >
> > On 3/17/09 4:46 PM, Michael Wilde wrote:
> > > I got the error narrowed down to this example:
> > >
> > > type file;
> > >
> > > app (file editedParams) setTemps ( file inParams )
> > > {
> > >   echo @inParams stdout=@editedParams;
> > > }
> > >
> > > file inParams;
> > >
> > > string config [] = readData( setTemps(inParams ) );
> > > trace(0,config[0]);
> > > trace(1,config[1]);
> > > trace(2,config[2]);
> > >
> > > --
> > >
> > > params.tloops contains:
> > >
> > > DEFAULT_INIT_TEMP_=_@DTI@
> > > TEMP_UPDATE_INTERVAL_=_@TUI@
> > > KILL_TIME_=_3
> > > MAX_NUMBER_OF_ANNEALING_STEPS_=_0
> > >
> > > --
> > >
> > > In other code, readData() seemed happy to take a file *or* a filename
> > > string as input, but I wonder if it was not as happy as it seemed.
> > >
> > > I'd been taking advantage of the flexibility with good results (luck?)
> > > so far, though.
> > >
> > > On 3/17/09 4:25 PM, Michael Wilde wrote:
> > >> The log contains this just before the NPE, including the suspicious
> > >> message: WARN FlowNode Ex098:
> > >>
> > >> That's giving me a clue as to the offending statements.
> > >> > > >> --- > > >> > > >> 2009-03-17 16:20:34,723-0500 INFO AbstractDataNode Adding handle > > >> listener "F/org.griphyn.vdl.mapping.DataNode identifier > > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090\ > > >> 317-1620-e1n1bz3g:720000000071 type SecSeq with no value at > > >> dataset=secseq path=[0] (not closed)" to > > >> "org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uchicago.edu,\ > > >> 2008:swift:dataset:20090317-1620-e1n1bz3g:720000000071 type SecSeq > > >> with no value at dataset=secseq path=[0] (not closed)" > > >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode closed > > >> org.griphyn.vdl.mapping.RootDataNode identifier > > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ > > >> 3g:720000000092 type string value=s/@DIT@/10/ dataset=unnamed > > >> SwiftScript value (closed) > > >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode ROOTPATH > > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 > > >> path=$ > > >> 2009-03-17 16:20:34,724-0500 INFO AbstractDataNode VALUE > > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000092 > > >> VALUE=s/@DIT@/10/ > > >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode closed > > >> org.griphyn.vdl.mapping.RootDataNode identifier > > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ > > >> 3g:720000000093 type string value=s/@TUI@/1/ dataset=unnamed > > >> SwiftScript value (closed) > > >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode ROOTPATH > > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 > > >> path=$ > > >> 2009-03-17 16:20:34,725-0500 INFO AbstractDataNode VALUE > > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000093 > > >> VALUE=s/@TUI@/1/ > > >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode closed > > >> org.griphyn.vdl.mapping.RootDataNode 
identifier > > >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz\ > > >> 3g:720000000094 type string value=params.tloop dataset=unnamed > > >> SwiftScript value (closed) > > >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode ROOTPATH > > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 > > >> path=$ > > >> 2009-03-17 16:20:34,726-0500 INFO AbstractDataNode VALUE > > >> dataset=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090317-1620-e1n1bz3g:720000000094 > > >> VALUE=params.tloop > > >> 2009-03-17 16:20:34,727-0500 WARN FlowNode Ex098 > > >> java.lang.NullPointerException > > >> at > > >> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > > >> > > >> at > > >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > > >> at > > >> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > > >> at > > >> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > > >> > > >> at > > >> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > > >> > > >> > > >> > > >> On 3/17/09 4:17 PM, Michael Wilde wrote: > > >>> It seems not related to scale or Falkon. > > >>> > > >>> It occurs when running on localhost (but on bgp) and when I cut all > > >>> the loops down to a single iteration. > > >>> > > >>> I'm still debugging. > > >>> > > >>> On 3/17/09 4:07 PM, Michael Wilde wrote: > > >>>> I just expanded my oops protein folding script to add another level > > >>>> of parameter sweep. This script is getting pretty complex now (at > > >>>> least, for a swift script). > > >>>> > > >>>> I got the following npe on my first two tries. Im going to start > > >>>> debugging, but any clues as to the cause would be helpful. 
> > >>>> > > >>>> The outer loops are: > > >>>> > > >>>> main() > > >>>> { > > >>>> string protein[] = readData(@arg("plist")); > > >>>> string startTemp[] = ["10","20"]; > > >>>> string tempUpdate[] = ["1","2","3"]; > > >>>> > > >>>> foreach p in protein { > > >>>> foreach st in startTemp { > > >>>> foreach tu in tempUpdate { > > >>>> doRoundSet(p,st,tu); > > >>>> } > > >>>> } > > >>>> } > > >>>> } > > >>>> > > >>>> There are two levels of inner loops further down below doRoundSet(). > > >>>> > > >>>> The script, output, command line args and log are in: > > >>>> http://ww.ci.uchicago.edu/~wilde/swift3.tar.gz > > >>>> > > >>>> I suspect it will take a while to narrow the cause to a simpler test > > >>>> case thats easy tp reproduce without a lot of setup. > > >>>> > > >>>> I'll try on a vanilla swift on local execution; this is on bgp with > > >>>> Falkon. > > >>>> > > >>>> Thanks. > > >>>> > > >>>> -- > > >>>> > > >>>> ... > > >>>> Progress: uninitialized:1 Selecting site:2 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 2 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 8 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 0 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 5 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 9 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 1 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 6 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 3 > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 4 > > >>>> Ex098 > > >>>> java.lang.NullPointerException > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > > >>>> > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > > >>>> > > >>>> at > > >>>> 
org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > > >>>> at > > >>>> 
org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > > >>>> > > >>>> Execution failed: > > >>>> java.lang.NullPointerException > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > > >>>> > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > > >>>> > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > >>>> > > >>>> at > > >>>> 
org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > > >>>> > > >>>> > > >>>> Ex098 > > >>>> java.lang.NullPointerException > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:285) > > >>>> > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:201) > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:182) > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:19) > > >>>> > > >>>> at > > >>>> org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > >>>> > > >>>> at > > >>>> 
org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > > >>>> at > > >>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > >>>> at > > >>>> 
org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > >>>> > > >>>> at > > >>>> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > > >>>> > > >>>> SwiftScript trace: T1af7, Round, 0, Sim, 7 > > >>>> _______________________________________________ > > >>>> Swift-devel mailing list > > >>>> Swift-devel at ci.uchicago.edu > > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Wed Mar 18 07:39:42 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Mar 2009 12:39:42 +0000 (GMT) Subject: [Swift-devel] testing In-Reply-To: References: <1237307410.26969.8.camel@localhost> Message-ID: On Wed, 18 Mar 2009, Ben Clifford wrote: > My sense is: By that I mean, my sense of what OSG people will accept, not my personal opinions.
-- From benc at hawaga.org.uk Wed Mar 18 07:51:33 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Mar 2009 12:51:33 +0000 (GMT) Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <49C072EB.2060109@uchicago.edu> References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> Message-ID: On Tue, 17 Mar 2009, Zhao Zhang wrote: > exactly, currently, we are building a collective IO system on BGP, so CIO will > take > care of stage out results. Is it possible to interface between Swift and your work more cleanly? (maybe, for example, by doing something with the cog file transfer provider API) Hacking essential pieces of the Swift code out feels really unpleasant, and will pretty much definitely break some functionality, which will cause you trouble later on if you try to run arbitrary SwiftScript programs. When we sat down in November/December, it sounded like you wouldn't need to do anything like this to make the CIO stuff work with Swift; so I'd be interested in more explanation/discussion about what the CIO work looks like now. -- From foster at anl.gov Wed Mar 18 08:38:36 2009 From: foster at anl.gov (Ian Foster) Date: Wed, 18 Mar 2009 08:38:36 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> Message-ID: I think that at this point we are experimenting to see what can be done. Not to say that we shouldn't do this, but the first focus is on seeing what might work at all. On Mar 18, 2009, at 7:51 AM, Ben Clifford wrote: > > On Tue, 17 Mar 2009, Zhao Zhang wrote: > >> exactly, currently, we are building a collective IO system on BGP, >> so CIO will >> take >> care of stage out results. > > Is it possible to interface between Swift and your work more cleanly? 
> > (maybe, for example, by doing something with the cog file transfer > provider API) > > Hacking essential pieces of the Swift code out feels really > unpleasant, > and will pretty much definitely break some functionality, which will > cause > you trouble later on if you try to run arbitrary SwiftScript programs. > > When we sat down in November/December, it sounded like you wouldn't > need > to do anything like this to make the CIO stuff work with Swift; so > I'd be > interested in more explanation/discussion about what the CIO work > looks > like now. > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From zhaozhang at uchicago.edu Wed Mar 18 09:04:13 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 18 Mar 2009 09:04:13 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> Message-ID: <49C0FF5D.6000407@uchicago.edu> Hi, Ben Ben Clifford wrote: > On Tue, 17 Mar 2009, Zhao Zhang wrote: > > >> exactly, currently, we are building a collective IO system on BGP, so CIO will >> take >> care of stage out results. >> > > Is it possible to interface between Swift and your work more cleanly? > > (maybe, for example, by doing something with the cog file transfer > provider API) > Yes, make a new provider for swift is another way to do this. > Hacking essential pieces of the Swift code out feels really unpleasant, > and will pretty much definitely break some functionality, which will cause > you trouble later on if you try to run arbitrary SwiftScript programs. > Well, I agree with your point for production use. But things we are doing now is a research for a better architecture of swift on BGP. 
> When we sat down in November/December, it sounded like you wouldn't need > to do anything like this to make the CIO stuff work with Swift; As I implemented what we discussed last time, a new problem came up. Consider a 2-stage computation in which the second stage takes the output of the first as its input. With either the ssh provider or the gridftp provider, this intermediate data has to be copied back to GPFS, since the job that consumes it could be sent to any "site (pset)". To solve this problem, we built a P2P data network on BGP over the torus network. The basic logic is that when a wrapper.sh finds a piece of intermediate data, it registers this data as (name, rank of the CN) in a Centralized Hash Table (CHT). Later, when a job needs this data, it first looks the data up in the CHT, gets the rank of the remote node, converts the rank to an IP, and fetches the data directly. With this solution, intermediate data does not have to be copied back to GPFS, but Swift is waiting for that intermediate data to determine whether the first-stage jobs succeeded. In this case, Swift won't send out the 2nd-stage jobs; that's why we need to disable Swift's stage-out and let Swift determine job status only from the provider status notification. zhao > so I'd be > interested in more explanation/discussion about what the CIO work looks > like now. > > From wilde at mcs.anl.gov Wed Mar 18 09:04:46 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 18 Mar 2009 09:04:46 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> Message-ID: <49C0FF7E.6040405@mcs.anl.gov> I'll talk to Zhao and Allan about this; I haven't been following this thread because I thought it was a mechanical detail, and in fact thought Swift already did what's needed. But I'll read up and work with Zhao and Allan to see if we can avoid unnecessary changes.
Not yet sure, we'll see. On 3/18/09 8:38 AM, Ian Foster wrote: > I think that at this point we are experimenting to see what can be done. > Not to say that we shouldn't do this, but the first focus is on seeing > what might work at all. > > > On Mar 18, 2009, at 7:51 AM, Ben Clifford wrote: > >> >> On Tue, 17 Mar 2009, Zhao Zhang wrote: >> >>> exactly, currently, we are building a collective IO system on BGP, so >>> CIO will >>> take >>> care of stage out results. >> >> Is it possible to interface between Swift and your work more cleanly? >> >> (maybe, for example, by doing something with the cog file transfer >> provider API) >> >> Hacking essential pieces of the Swift code out feels really unpleasant, >> and will pretty much definitely break some functionality, which will >> cause >> you trouble later on if you try to run arbitrary SwiftScript programs. >> >> When we sat down in November/December, it sounded like you wouldn't need >> to do anything like this to make the CIO stuff work with Swift; so I'd be >> interested in more explanation/discussion about what the CIO work looks >> like now. 
>> >> -- >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Wed Mar 18 09:13:56 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Mar 2009 14:13:56 +0000 (GMT) Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <49C0FF5D.6000407@uchicago.edu> References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> Message-ID: So if Swift could remove the dependency between staging out and starting subsequent jobs (a subset of what has been talked about before), would you still need to hack out the stageout code? > To solve this problem, we built a P2P data network on BGP over torus > network. So the basic logic for this is that if a wrapper.sh found a > piece of intermediate data, it registered this data with (name, rank of > the CN) to a Centralized Hash Table(CHT). Next time, when a job needs > this data, first it looks this data up in CHT, gets the rank of the > remote node, convert the RANK to IP, fetch the data directly. When we talked in December, I think this bit was done with posix filesystem access. But it sounds like you are doing something different now. I've looked at abstracting that worker<->site shared filesystem code in the past (and have some patches floating round in half-written state) - can you send me your modified wrapper.sh so I can see how you do things? 
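A minimal sketch of the decoupling asked about above, written in Python rather than taken from Swift's own code. The task names, the `ready` helper and the two dependency tables are illustrative assumptions, not Swift internals. Each task maps to the set of tasks it must wait for; the second table drops the edge that makes job B's stage-in wait on job A's stage-out.

```python
# Hypothetical model of task ordering; this is not Swift's scheduler.
# Each key may start only after every task in its value set has finished.

COUPLED = {
    "stagein(A)": set(),
    "run(A)": {"stagein(A)"},
    "stageout(A)": {"run(A)"},
    "stagein(B)": {"stageout(A)"},   # B waits for A's stage-out
    "run(B)": {"stagein(B)"},
}

DECOUPLED = {
    "stagein(A)": set(),
    "run(A)": {"stagein(A)"},
    "stageout(A)": {"run(A)"},       # still happens, but blocks nothing
    "stagein(B)": {"run(A)"},        # B waits only for A's run
    "run(B)": {"stagein(B)"},
}


def ready(deps, done):
    """Return the tasks that are not done and whose prerequisites are all done."""
    return sorted(t for t, pre in deps.items() if t not in done and pre <= done)


if __name__ == "__main__":
    done = {"stagein(A)", "run(A)"}
    print(ready(COUPLED, done))    # only stageout(A) can start
    print(ready(DECOUPLED, done))  # stageout(A) and stagein(B) can both start
```

In the decoupled table, job B can begin staging in as soon as job A's run has finished, while A's stage-out proceeds on its own.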
-- From zhaozhang at uchicago.edu Wed Mar 18 09:21:15 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 18 Mar 2009 09:21:15 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> Message-ID: <49C1035B.3070606@uchicago.edu> Hi, Ben Ben Clifford wrote: > So if Swift could remove the dependency between staging out and starting > subsequent jobs (a subset of what has been talked about before), would you > still need to hack out the stageout code? > I think swift still needs to hold the 2nd stage computation until the 1st completes. If we simply remove the dependency, swift would send all jobs (both 1st and 2nd) out, right? > >> To solve this problem, we built a P2P data network on BGP over torus >> network. So the basic logic for this is that if a wrapper.sh found a >> piece of intermediate data, it registered this data with (name, rank of >> the CN) to a Centralized Hash Table(CHT). Next time, when a job needs >> this data, first it looks this data up in CHT, gets the rank of the >> remote node, convert the RANK to IP, fetch the data directly. >> > > When we talked in December, I think this bit was done with posix > filesystem access. We missed this point in last talk. > But it sounds like you are doing something different > now. > > I've looked at abstracting that worker<->site shared filesystem code in > the past (and have some patches floating round in half-written state) - > can you send me your modified wrapper.sh so I can see how you do things? 
> Here it is: http://www.ci.uchicago.edu/~zzhang/wrapper.sh zhao From wilde at mcs.anl.gov Wed Mar 18 09:29:51 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 18 Mar 2009 09:29:51 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> Message-ID: <49C1055F.8040508@mcs.anl.gov> I just reviewed the thread and think I understand the issue now. To re-iterate (for my benefit): The status.mode=provider property works but does not do all that Zhao needs here: Swift still insists that all the expected app output files were placed in the workdirectory (and then copies them back to their mapped destination directory). Zhao is experimenting with a "pull model" where data transfer can be done by the compute nodes pulling their input files from where those files were left by the previous job, rather than the swift engine pushing their input data to the shared work directory. So, Ben, I think your solution below *might* work. Zhao, Allan, and I should document the data flow changes that we're testing, to help us all discuss this. On 3/18/09 9:13 AM, Ben Clifford wrote: > So if Swift could remove the dependency between staging out and starting > subsequent jobs (a subset of what has been talked about before), would you > still need to hack out the stageout code? > >> To solve this problem, we built a P2P data network on BGP over torus >> network. So the basic logic for this is that if a wrapper.sh found a >> piece of intermediate data, it registered this data with (name, rank of >> the CN) to a Centralized Hash Table(CHT). Next time, when a job needs >> this data, first it looks this data up in CHT, gets the rank of the >> remote node, convert the RANK to IP, fetch the data directly. > > When we talked in December, I think this bit was done with posix > filesystem access. 
But it sounds like you are doing something different > now. > > I've looked at abstracting that worker<->site shared filesystem code in > the past (and have some patches floating round in half-written state) - > can you send me your modified wrapper.sh so I can see how you do things? > From benc at hawaga.org.uk Wed Mar 18 09:30:51 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Mar 2009 14:30:51 +0000 (GMT) Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <49C1035B.3070606@uchicago.edu> References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> <49C1035B.3070606@uchicago.edu> Message-ID: On Wed, 18 Mar 2009, Zhao Zhang wrote: > I think swift still needs to hold the 2nd stage computation until the 1st > completes. If we simply remove > the dependency, swift would send all jobs (both 1st and 2nd) out, right? I don't mean the dependency between Swift jobs. That would still exist. I mean make Swift so that it can start the next job when it has determined that the first job has completed successfully, with stageout happening separately. At the moment, the dependencies are: stagein(job A) < run(job A) < stageout(job A) < stagein(job B) < run(job B) But they could become more like these two chains: stagein(job A) < run(job A) < stagein(job B) < run(job B); run(job A) < stageout(job A) -- From benc at hawaga.org.uk Wed Mar 18 09:37:01 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Mar 2009 14:37:01 +0000 (GMT) Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <49C0FF5D.6000407@uchicago.edu> References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> Message-ID: On Wed, 18 Mar 2009, Zhao Zhang wrote: > To solve this problem, we built a P2P data network on BGP over torus network.
> So the basic logic for this is > that if a wrapper.sh found a piece of intermediate data, it registered this > data with (name, rank of the CN) to a > Centralized Hash Table(CHT). Next time, when a job needs this data, first it > looks this data up in CHT, gets > the rank of the remote node, convert the RANK to IP, fetch the data directly. I don't see any of that in the wrapper.sh that you just sent me. I see input and output files moved around with cp and dd using posix fs access. -- From zhaozhang at uchicago.edu Wed Mar 18 09:59:53 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 18 Mar 2009 09:59:53 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> Message-ID: <49C10C69.7070801@uchicago.edu> Oh, you want the wrapper.sh working with P2P data transfer? It is not implemented yet. I am still working on basic APT for this wrapper.sh. zhao Ben Clifford wrote: > On Wed, 18 Mar 2009, Zhao Zhang wrote: > > >> To solve this problem, we built a P2P data network on BGP over torus network. >> So the basic logic for this is >> that if a wrapper.sh found a piece of intermediate data, it registered this >> data with (name, rank of the CN) to a >> Centralized Hash Table(CHT). Next time, when a job needs this data, first it >> looks this data up in CHT, gets >> the rank of the remote node, convert the RANK to IP, fetch the data directly. >> > > I don't see any of that in the wrapper.sh that you just sent me. I see > input and output files moved around with cp and dd using posix fs access. 
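The register/lookup logic quoted above can be sketched roughly as follows. This is an illustration only, since the thread says the P2P path is not yet implemented in wrapper.sh. The class name, the method names, the in-memory dict and the rank-to-IP rule are all assumptions made for the example, not the CIO code.

```python
# Hypothetical sketch of the Centralized Hash Table (CHT) described in the
# thread. All names and the rank-to-IP mapping are invented for illustration.


class CentralHashTable:
    """Maps a data file name to the rank of the compute node (CN) holding it."""

    def __init__(self):
        self._owner = {}

    def register(self, name, rank):
        # Called by the wrapper after a job produces intermediate data.
        self._owner[name] = rank

    def lookup(self, name):
        # Returns the owning rank, or None if the data was never registered.
        return self._owner.get(name)


def rank_to_ip(rank, base=(10, 0, 0, 0)):
    """Assumed mapping: treat the rank as an offset from a base IPv4 address."""
    value = (base[0] << 24 | base[1] << 16 | base[2] << 8 | base[3]) + rank
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def locate(cht, name):
    """Return the IP to fetch `name` from, or None if it is not registered."""
    rank = cht.lookup(name)
    return None if rank is None else rank_to_ip(rank)


if __name__ == "__main__":
    cht = CentralHashTable()
    cht.register("round0.out", 260)    # first-stage job ran on rank 260
    print(locate(cht, "round0.out"))   # peer address for the second stage
    print(locate(cht, "missing.out"))  # None: nothing registered
```

A second-stage job would call `locate` before running and fetch the file from the returned address; how the fetch itself is done is left open here, as it is in the thread.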
> > From hategan at mcs.anl.gov Wed Mar 18 10:11:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Mar 2009 10:11:11 -0500 Subject: [Swift-devel] testing In-Reply-To: <49C0E8CB.3080600@mcs.anl.gov> References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> Message-ID: <1237389071.5032.1.camel@localhost> On Wed, 2009-03-18 at 07:27 -0500, Michael Wilde wrote: > > iii) running anything through gram2 is bad - any base job submissions > > need to be through condor-g using its hybrid gram2+gridmanager system. > > I agree, and was assuming that on OSG we would only use the new Condor > provider, and run jobs in this manner. There seems to be some confusion here. Ben, the point is to run with one of the scheduler providers, not gram2. Mike, the condor provider is not a condor-through-gram provider. It only submits to the local condor queue. From hategan at mcs.anl.gov Wed Mar 18 10:16:00 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Mar 2009 10:16:00 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> Message-ID: <1237389360.5032.4.camel@localhost> On Wed, 2009-03-18 at 14:13 +0000, Ben Clifford wrote: > So if Swift could remove the dependency between staging out and starting > subsequent jobs (a subset of what has been talked about before), would you > still need to hack out the stageout code? Or the yet to be CIO provider could, without doing much, say that the files were staged out. 
From zhaozhang at uchicago.edu Wed Mar 18 10:16:50 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 18 Mar 2009 10:16:50 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <1237389360.5032.4.camel@localhost> References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> <1237389360.5032.4.camel@localhost> Message-ID: <49C11062.9010304@uchicago.edu> Yes, that is another plan. I think Allan is hacking this. zhao Mihael Hategan wrote: > On Wed, 2009-03-18 at 14:13 +0000, Ben Clifford wrote: > >> So if Swift could remove the dependency between staging out and starting >> subsequent jobs (a subset of what has been talked about before), would you >> still need to hack out the stageout code? >> > > Or the yet to be CIO provider could, without doing much, say that the > files were staged out. > > > > From benc at hawaga.org.uk Wed Mar 18 10:20:07 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Mar 2009 15:20:07 +0000 (GMT) Subject: [Swift-devel] testing In-Reply-To: <1237389071.5032.1.camel@localhost> References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost> Message-ID: On Wed, 18 Mar 2009, Mihael Hategan wrote: > Ben, the point is to run with one of the scheduler providers, not gram2. That is also bad for OSG, as far as I can tell, because it screws up accounting, which is done in GRAM, not in the LRM. Pretty much what I'm asserting is that both the head job and the worker jobs need to be run through Condor-G/gridmanager/gram2 (in my understanding of what OSG requires).
-- From benc at hawaga.org.uk Wed Mar 18 10:24:42 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 18 Mar 2009 15:24:42 +0000 (GMT) Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <1237389360.5032.4.camel@localhost> References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> <1237389360.5032.4.camel@localhost> Message-ID: On Wed, 18 Mar 2009, Mihael Hategan wrote: > Or the yet to be CIO provider could, without doing much, say that the > files were staged out. That would still be giving mistruths to Swift about what files had been copied where, and so would break when Swift relied on those mistruths (such as when trying to run a job on a different site). What I was talking about is the ever-in-the-future in-Swift management of which files are where. -- From hategan at mcs.anl.gov Wed Mar 18 10:40:07 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Mar 2009 10:40:07 -0500 Subject: [Swift-devel] testing In-Reply-To: References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost> Message-ID: <1237390807.5032.19.camel@localhost> On Wed, 2009-03-18 at 15:20 +0000, Ben Clifford wrote: > On Wed, 18 Mar 2009, Mihael Hategan wrote: > > > Ben, the point is to run with one of the scheduler providers, not gram2. > > That is also bad for OSG, as far as I can tell because it screws up > accounting which is done in GRAM, not in the LRM. > > Pretty much what I'm asserting is that both the head job and the worker > jobs need to be run through Condor-G/gridmanager/gram2 (in my > understanding of what OSG requires). > Ok. Back to the drawing board.
From zhaozhang at uchicago.edu Wed Mar 18 11:00:27 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 18 Mar 2009 11:00:27 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: References: <20090317221440.BUF44237@m4500-02.uchicago.edu> <49C072EB.2060109@uchicago.edu> <49C0FF5D.6000407@uchicago.edu> <49C1035B.3070606@uchicago.edu> Message-ID: <49C11A9B.8040007@uchicago.edu> Yes, I think that is what we need. Thanks. zhao Ben Clifford wrote: > On Wed, 18 Mar 2009, Zhao Zhang wrote: > > >> I think swift still needs to hold the 2nd stage computation until the 1st >> completes. If we simply remove >> the dependency, swift would send all jobs (both 1st and 2nd) out, right? >> > > I don't mean the dependency between Swift jobs. That would still exist. > > I mean make Swift so that it can start the next job when it has determined > that the first job has completed successfully, with stageout happening > separately. > > At the moment, the dependencies are: > > stagein(job A) < run(job A) < stageout(job A) < stagein(job B) < run(job B) > > But they could become more like these two chains: > > stagein(job A) < run(job A) < stagein(job B) < run(job B) > run(job A) < stageout(job A) > > From wilde at mcs.anl.gov Wed Mar 18 12:05:40 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 18 Mar 2009 12:05:40 -0500 Subject: [Swift-devel] testing In-Reply-To: <1237389071.5032.1.camel@localhost> References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost> Message-ID: <49C129E4.9020003@mcs.anl.gov> On 3/18/09 10:11 AM, Mihael Hategan wrote: > On Wed, 2009-03-18 at 07:27 -0500, Michael Wilde wrote: > >>> iii) running anything through gram2 is bad - any base job submissions >>> need to be through condor-g using its hybrid gram2+gridmanager system. >> I agree, and was assuming that on OSG we would only use the new Condor >> provider, and run jobs in this manner.
>
> There seems to be some confusion here.
>
> Ben, the point is to run with one of the scheduler providers, not gram2.
>
> Mike, the condor provider is not a condor-through-gram provider. It only
> submits to the local condor queue.

I was thinking/hoping that the condor provider would have a setting that submitted swift apps as condor-g jobs to N grid sites, *via* the local condor queue.

Isn't that how condor-g works? I send my local condor (via condor_submit) a .sub file that says e.g.:

universe=grid
grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor

If I can't do that yet through the condor provider, was it your intent that users eventually be able to do that, or was that not what you were implementing?

From hategan at mcs.anl.gov Wed Mar 18 12:11:53 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 12:11:53 -0500
Subject: [Swift-devel] testing
In-Reply-To: <49C129E4.9020003@mcs.anl.gov>
References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost> <49C129E4.9020003@mcs.anl.gov>
Message-ID: <1237396313.9356.1.camel@localhost>

On Wed, 2009-03-18 at 12:05 -0500, Michael Wilde wrote:
> On 3/18/09 10:11 AM, Mihael Hategan wrote:
> > On Wed, 2009-03-18 at 07:27 -0500, Michael Wilde wrote:
> >
> >>> iii) running anything through gram2 is bad - any base job submissions
> >>> need to be through condor-g using its hybrid gram2+gridmanager system.
> >> I agree, and was assuming that on OSG we would only use the new Condor
> >> provider, and run jobs in this manner.
> >
> > There seems to be some confusion here.
> >
> > Ben, the point is to run with one of the scheduler providers, not gram2.
> >
> > Mike, the condor provider is not a condor-through-gram provider. It only
> > submits to the local condor queue.
>
> I was thinking/hoping that the condor provider would have a setting that
> submitted swift apps as condor-g jobs to N grid sites, *via* the local
> condor queue.
>
> Isn't that how condor-g works? I send my local condor (via condor_submit)
> a .sub file that says e.g.:
>
> universe=grid
> grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor
>
> If I can't do that yet through the condor provider, was it your intent
> that users eventually be able to do that, or was that not what you were
> implementing?

That was not what I was implementing. What I was aiming for was a local condor provider, similar to the PBS provider, that would address the scalability issues with gram2 for sites using condor as a queuing system.

From wilde at mcs.anl.gov Wed Mar 18 12:18:45 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 12:18:45 -0500
Subject: [Swift-devel] testing
In-Reply-To: <1237390807.5032.19.camel@localhost>
References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost> <1237390807.5032.19.camel@localhost>
Message-ID: <49C12CF5.8090708@mcs.anl.gov>

Mihael, please create a design note on how coaster bootstrap and communication works, and use that as the basis for getting agreement on the approach and the range of options needed, and for getting input.

That design description probably exists in various material you have, and can be brief.

On 3/18/09 10:40 AM, Mihael Hategan wrote:
> On Wed, 2009-03-18 at 15:20 +0000, Ben Clifford wrote:
>> On Wed, 18 Mar 2009, Mihael Hategan wrote:
>>
>>> Ben, the point is to run with one of the scheduler providers, not gram2.
>> That is also bad for OSG, as far as I can tell because it screws up
>> accounting which is done in GRAM, not in the LRM.
>>
>> Pretty much what I'm asserting is that both the head job and the worker
>> jobs need to be run through Condor-G/gridmanager/gram2 (in my
>> understanding of what OSG requires).
>>
>
> Ok. Back to the drawing board.
>
>

From skenny at uchicago.edu Wed Mar 18 12:28:15 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Wed, 18 Mar 2009 12:28:15 -0500 (CDT)
Subject: [Swift-devel] log-processing tools
Message-ID: <20090318122815.BUG11518@m4500-02.uchicago.edu>

hi, i was trying to grab the log processing tools, but getting this:

login3% svn co https://svn.ci.uchicago.edu/svn/vdl2/log-processing
svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
doesn't exist

should i look elsewhere?

thanks
~skenny

From hategan at mcs.anl.gov Wed Mar 18 12:29:47 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 12:29:47 -0500
Subject: [Swift-devel] log-processing tools
In-Reply-To: <20090318122815.BUG11518@m4500-02.uchicago.edu>
References: <20090318122815.BUG11518@m4500-02.uchicago.edu>
Message-ID: <1237397387.9622.0.camel@localhost>

On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote:
> hi, i was trying to grab the log processing tools, but getting
> this:
>
> login3% svn co https://svn.ci.uchicago.edu/svn/vdl2/log-processing
> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
> doesn't exist
>
> should i look elsewhere?

Yes, change "vdl2" to "swift" in the above url.
>
> thanks
> ~skenny
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From skenny at uchicago.edu Wed Mar 18 12:35:51 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Wed, 18 Mar 2009 12:35:51 -0500 (CDT)
Subject: [Swift-devel] log-processing tools
Message-ID: <20090318123551.BUG12524@m4500-02.uchicago.edu>

[skenny at login swift]$ svn co https://svn.ci.uchicago.edu/svn/swift/log-processing
Authentication realm: SVN Login
Password for 'skenny':
svn: PROPFIND request failed on '/svn/swift/log-processing'
svn: PROPFIND of '/svn/swift/log-processing': 403 Forbidden (https://svn.ci.uchicago.edu)
[skenny at login swift]$

do i (or you) need to request access from support?

---- Original message ----
>Date: Wed, 18 Mar 2009 12:29:47 -0500
>From: Mihael Hategan
>Subject: Re: [Swift-devel] log-processing tools
>To: skenny at uchicago.edu
>Cc: Ben Clifford , swift-devel at ci.uchicago.edu
>
>On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote:
>> hi, i was trying to grab the log processing tools, but getting
>> this:
>>
>> login3% svn co https://svn.ci.uchicago.edu/svn/vdl2/log-processing
>> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
>> doesn't exist
>>
>> should i look elsewhere?
>
>Yes, change "vdl2" to "swift" in the above url.
>
>>
>> thanks
>> ~skenny
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From wilde at mcs.anl.gov Wed Mar 18 12:38:17 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 12:38:17 -0500
Subject: [Swift-devel] testing
In-Reply-To: <49C12CF5.8090708@mcs.anl.gov>
References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost> <1237390807.5032.19.camel@localhost> <49C12CF5.8090708@mcs.anl.gov>
Message-ID: <49C13189.4090001@mcs.anl.gov>

Also, please put in this description the requirements that we're working under:

- node IP connectivity
- security
- central (head) node load limits
- central (head) node job time limits (eg managed headnodes)
- accounting info for all resources used (ie generates OSG accounting records)
- etc

That will help the design converge.

On 3/18/09 12:18 PM, Michael Wilde wrote:
> Mihael, please create a design note on how coaster bootstrap and
> communication works, and use that as the basis for getting agreement on
> the approach and the range of options needed, and for getting input.
>
> That design description probably exists in various material you have,
> and can be brief.
>
> On 3/18/09 10:40 AM, Mihael Hategan wrote:
>> On Wed, 2009-03-18 at 15:20 +0000, Ben Clifford wrote:
>>> On Wed, 18 Mar 2009, Mihael Hategan wrote:
>>>
>>>> Ben, the point is to run with one of the scheduler providers, not
>>>> gram2.
>>> That is also bad for OSG, as far as I can tell because it screws up
>>> accounting which is done in GRAM, not in the LRM.
>>>
>>> Pretty much what I'm asserting is that both the head job and the
>>> worker jobs need to be run through Condor-G/gridmanager/gram2 (in my
>>> understanding of what OSG requires).
>>>
>>
>> Ok. Back to the drawing board.
>>
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Wed Mar 18 12:51:13 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 18 Mar 2009 12:51:13 -0500
Subject: [Swift-devel] log-processing tools
In-Reply-To: <20090318123551.BUG12524@m4500-02.uchicago.edu>
References: <20090318123551.BUG12524@m4500-02.uchicago.edu>
Message-ID: <49C13491.6060107@mcs.anl.gov>

There is no svn/swift as far as I can tell (although I'd be happy to see svn/vdl2 renamed to svn/swift)

I think log-processing is under trunk as per:
http://www.ci.uchicago.edu/trac/swift/changeset/2683

I'm about to look for it.

On 3/18/09 12:35 PM, skenny at uchicago.edu wrote:
> [skenny at login swift]$ svn co
> https://svn.ci.uchicago.edu/svn/swift/log-processing
> Authentication realm: SVN Login
> Password for 'skenny':
> svn: PROPFIND request failed on '/svn/swift/log-processing'
> svn: PROPFIND of '/svn/swift/log-processing': 403 Forbidden
> (https://svn.ci.uchicago.edu)
> [skenny at login swift]$
>
> do i (or you) need to request access from support?
>
>
> ---- Original message ----
>> Date: Wed, 18 Mar 2009 12:29:47 -0500
>> From: Mihael Hategan
>> Subject: Re: [Swift-devel] log-processing tools
>> To: skenny at uchicago.edu
>> Cc: Ben Clifford ,
> swift-devel at ci.uchicago.edu
>> On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote:
>>> hi, i was trying to grab the log processing tools, but getting
>>> this:
>>>
>>> login3% svn co
> https://svn.ci.uchicago.edu/svn/vdl2/log-processing
>>> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing'
>>> doesn't exist
>>>
>>> should i look elsewhere?
>> Yes, change "vdl2" to "swift" in the above url.
>> >>> thanks >>> ~skenny >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Mar 18 12:53:03 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 18 Mar 2009 12:53:03 -0500 Subject: [Swift-devel] log-processing tools In-Reply-To: <49C13491.6060107@mcs.anl.gov> References: <20090318123551.BUG12524@m4500-02.uchicago.edu> <49C13491.6060107@mcs.anl.gov> Message-ID: <49C134FF.208@mcs.anl.gov> indeed, its under libexec/log-processing as that change node describes. On 3/18/09 12:51 PM, Michael Wilde wrote: > There is no svn/swift as far as I can tell (although I'd be happy to see > svn/vld2 renamed to svn/swift) > > I think log-processing is under trunk as per: > http://www.ci.uchicago.edu/trac/swift/changeset/2683 > > I'm about to look for it. > > On 3/18/09 12:35 PM, skenny at uchicago.edu wrote: >> [skenny at login swift]$ svn co >> https://svn.ci.uchicago.edu/svn/swift/log-processing >> Authentication realm: SVN Login >> Password for 'skenny': >> svn: PROPFIND request failed on '/svn/swift/log-processing' >> svn: PROPFIND of '/svn/swift/log-processing': 403 Forbidden >> (https://svn.ci.uchicago.edu) >> [skenny at login swift]$ >> >> do i (or you) need to request access from support? 
>> >> >> ---- Original message ---- >>> Date: Wed, 18 Mar 2009 12:29:47 -0500 >>> From: Mihael Hategan Subject: Re: >>> [Swift-devel] log-processing tools To: skenny at uchicago.edu >>> Cc: Ben Clifford , >> swift-devel at ci.uchicago.edu >>> On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote: >>>> hi, i was trying to grab the log processing tools, but getting >>>> this: >>>> >>>> login3% svn co >> https://svn.ci.uchicago.edu/svn/vdl2/log-processing >>>> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing' >>>> doesn't exist >>>> >>>> should i look elsewhere? >>> Yes, change "vdl2" to "swift" in the above url. >>> >>>> thanks >>>> ~skenny >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Mar 18 12:54:42 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Mar 2009 12:54:42 -0500 Subject: [Swift-devel] testing In-Reply-To: <49C12CF5.8090708@mcs.anl.gov> References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost> <1237390807.5032.19.camel@localhost> <49C12CF5.8090708@mcs.anl.gov> Message-ID: <1237398882.9622.2.camel@localhost> On Wed, 2009-03-18 at 12:18 -0500, Michael Wilde wrote: > Mihael, please create a design note on how coaster bootstrap and > communication works, and use that as the basis for getting agreement on > the approach and the range of options needed, and for getting input. 
http://wiki.cogkit.org/wiki/Coasters

>
> That design description probably exists in various material you have,
> and can be brief.
>
> On 3/18/09 10:40 AM, Mihael Hategan wrote:
> > On Wed, 2009-03-18 at 15:20 +0000, Ben Clifford wrote:
> >> On Wed, 18 Mar 2009, Mihael Hategan wrote:
> >>
> >>> Ben, the point is to run with one of the scheduler providers, not gram2.
> >> That is also bad for OSG, as far as I can tell because it screws up
> >> accounting which is done in GRAM, not in the LRM.
> >>
> >> Pretty much what I'm asserting is that both the head job and the worker
> >> jobs need to be run through Condor-G/gridmanager/gram2 (in my
> >> understanding of what OSG requires).
> >>
> >
> > Ok. Back to the drawing board.
> >
> >

From hategan at mcs.anl.gov Wed Mar 18 12:56:49 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 18 Mar 2009 12:56:49 -0500
Subject: [Swift-devel] log-processing tools
In-Reply-To: <49C13491.6060107@mcs.anl.gov>
References: <20090318123551.BUG12524@m4500-02.uchicago.edu> <49C13491.6060107@mcs.anl.gov>
Message-ID: <1237399009.9622.5.camel@localhost>

On Wed, 2009-03-18 at 12:51 -0500, Michael Wilde wrote:
> There is no svn/swift as far as I can tell (although I'd be happy to see
> svn/vdl2 renamed to svn/swift)

Oh, sorry. I was confusing the swift directory with the SVN module.

>
> I think log-processing is under trunk as per:
> http://www.ci.uchicago.edu/trac/swift/changeset/2683
>
> I'm about to look for it.
>
> On 3/18/09 12:35 PM, skenny at uchicago.edu wrote:
> > [skenny at login swift]$ svn co
> > https://svn.ci.uchicago.edu/svn/swift/log-processing
> > Authentication realm: SVN Login
> > Password for 'skenny':
> > svn: PROPFIND request failed on '/svn/swift/log-processing'
> > svn: PROPFIND of '/svn/swift/log-processing': 403 Forbidden
> > (https://svn.ci.uchicago.edu)
> > [skenny at login swift]$
> >
> > do i (or you) need to request access from support?
> > > > > > ---- Original message ---- > >> Date: Wed, 18 Mar 2009 12:29:47 -0500 > >> From: Mihael Hategan > >> Subject: Re: [Swift-devel] log-processing tools > >> To: skenny at uchicago.edu > >> Cc: Ben Clifford , > > swift-devel at ci.uchicago.edu > >> On Wed, 2009-03-18 at 12:28 -0500, skenny at uchicago.edu wrote: > >>> hi, i was trying to grab the log processing tools, but getting > >>> this: > >>> > >>> login3% svn co > > https://svn.ci.uchicago.edu/svn/vdl2/log-processing > >>> svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing' > >>> doesn't exist > >>> > >>> should i look elsewhere? > >> Yes, change "vdl2" to "swift" in the above url. > >> > >>> thanks > >>> ~skenny > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Mar 18 17:14:28 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Mar 2009 17:14:28 -0500 Subject: [Swift-devel] testing In-Reply-To: <1237398882.9622.2.camel@localhost> References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost> <1237390807.5032.19.camel@localhost> <49C12CF5.8090708@mcs.anl.gov> <1237398882.9622.2.camel@localhost> Message-ID: <1237414468.15019.0.camel@localhost> On Wed, 2009-03-18 at 12:54 -0500, Mihael Hategan wrote: > On Wed, 2009-03-18 at 12:18 -0500, Michael Wilde wrote: > > Mihael, please create a design note on how coaster bootstrap and > > communication works, and use that as the basis for getting agreement on > > the approach and the range of options needed, and for getting input. > > http://wiki.cogkit.org/wiki/Coasters And now, with pretty pictures! 
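
Note on the Condor-G submission mentioned in the "testing" thread: Mike describes handing condor_submit a .sub file with `universe=grid` and a `grid_resource` line. A rough sketch of what generating such a submit description could look like follows, written in Python. It is not the Swift or CoG condor provider; the function name, the executable, and the output/error/log file names are assumptions chosen for illustration, and only the `universe` and `grid_resource` values come from the thread.

```python
def render_condor_g_submit(executable, gatekeeper, arguments="", tag="job"):
    """Build the text of a Condor-G submit description file.

    `gatekeeper` is the GRAM2 contact string, for example
    "osg-edu.cs.wisc.edu/jobmanager-condor" as quoted in the thread.
    `tag` is a hypothetical prefix for the output, error and log files.
    """
    lines = [
        "universe = grid",
        f"grid_resource = gt2 {gatekeeper}",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments = {arguments}")
    lines += [
        f"output = {tag}.out",
        f"error = {tag}.err",
        f"log = {tag}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Illustrative use: the executable is a placeholder, not taken from the thread.
submit_text = render_condor_g_submit(
    "/bin/hostname", "osg-edu.cs.wisc.edu/jobmanager-condor"
)
print(submit_text)
```

The resulting text would be saved to a file and passed to condor_submit on the submit host; whether the condor provider should ever generate such files is exactly the open question in the thread.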
From aespinosa at cs.uchicago.edu Wed Mar 18 22:19:27 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 18 Mar 2009 22:19:27 -0500
Subject: [Swift-devel] "any valid host for task" in Swift + deef provider
In-Reply-To: <1237231475.8617.12.camel@localhost>
References: <50b07b4b0903161213r292fb025o1bc52a34608192d@mail.gmail.com> <1237231475.8617.12.camel@localhost>
Message-ID: <50b07b4b0903182019x695fb817ua9a05cc45b7fd8c8@mail.gmail.com>

Oops, I made the mistake of not making the corresponding entries in tc.data for each site description. Sorry.

On Mon, Mar 16, 2009 at 2:24 PM, Mihael Hategan wrote:
> Can you post your tc.data?
>
> On Mon, 2009-03-16 at 14:13 -0500, Allan Espinosa wrote:
>> Hi,
>>
>> I'm using swift r2682, cogkit 2326 and provider-deef 2507
>>
>> RunID: 20090316-1354-ocn573c3
>> Progress:
>> Execution failed:
>>         Could not find any valid host for task "Task(type=UNKNOWN,
>> identity=urn:cog-1237229648327)" with constraints {tr=hostname,
>> filenames=[Ljava.lang.String;@14aa453, trfqn=hostname,
>> filecache=org.griphyn.vdl.karajan.lib.cache.CacheMapAdapter at 16a4aef}
>>
>>
>> Sites.xml:
>>
>>
>>
>> > url="http://129.114.102.179:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/>
>>   /work/01035/tg802895/swift-runs
>>
>>
>>
>> The run did not initialize the work directory.
>> a>

From wilde at mcs.anl.gov Thu Mar 19 07:13:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 19 Mar 2009 07:13:53 -0500
Subject: [Swift-devel] Small issues in coasters on local:pbs
Message-ID: <49C23701.9060802@mcs.anl.gov>

Regarding: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?]

I'm retesting coasters on local:pbs (on teraport), as I think this may partially alleviate Andrew's problem.

A simple foreach() works nice and fast, but I see two things:

1) I first tested without a valid proxy. I forgot that coasters requires a proxy (presumably for its secure channels) even when it's not using GRAM to reach its "RRM".
The error returned if you don't have a proxy is cryptic and buried in the coaster bootstrap log.

So 3 things:

(a) do a check for a proxy early on and print a nice message if there's not a valid proxy;
(b) bring the errors from the bootstrap log back to the user (unless that's not possible, in which case point the user to where to look for that log);
(c) document that you need a proxy.

2) When the script finishes you get this message on stdout/err which looks like a leftover debugging message:

--
Swift svn swift-r2701 cog-r2332

RunID: 20090319-0658-3ejpl9xc
Progress:
Progress: Submitting:9 Submitted:1
Progress: Submitted:9 Active:1
Progress: Submitted:4 Active:3 Stage out:1 Finished successfully:2
Final status: Finished successfully:10
Cleaning up...
Shutting down service at https://128.135.125.117:50002
Got channel MetaChannel: 101224864 -> GSSSChannel-null(1)
- Done
--

- Mike

The errors you get when you don't have a proxy are:

tp$ swift hellos.swift -sites.file sites.xml -tc.file tc.data
Swift svn swift-r2701 cog-r2332

RunID: 20090319-0655-9ufl1r2g
Progress:
Progress: Submitting:9 Submitted:1
Failed to transfer wrapper log from hellos-20090319-0655-9ufl1r2g/info/l on teraport
Execution failed:
Exception in echo:
Arguments: [Output of run, 6]
Host: teraport
Directory: hellos-20090319-0655-9ufl1r2g/jobs/l/echo-lde8x58j
stderr.txt:

stdout.txt:

----

Caused by:
Could not submit job
Caused by:
Could not start coaster service
Caused by:
Task ended before registration was received.
STDOUT:
STDERR:
Caused by:
Job failed with an exit code of 1
Cleaning up...
Done tp$ tp$ cat /home/wilde/coaster-bootstrap-01709350024.log BS: http://tp-login2.ci.uchicago.edu:50001 find wget = /usr/bin/wget -->/usr/bin/wget -c -q http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O /tmp/bootstrap.YJ4129 >>/home/wilde/coaster-bootstrap-01709350024.log 2>&1<-- which: no gmd5sum in (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin) which: no gmd5sum in 
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin) find gmd5sum = find md5sum = /usr/bin/md5sum Expected checksum: 33170989491a2e007a1c7c68eb907832 Computed checksum: 33170989491a2e007a1c7c68eb907832 find java = /soft/java-1.6.0_11-sun-r1/bin/java JAVA=/soft/java-1.6.0_11-sun-r1/bin/java /soft/java-1.6.0_11-sun-r1/bin/java -Djava=/soft/java-1.6.0_11-sun-r1/bin/java -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY= 
-DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA -DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.YJ4129 http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000 01709350024 java.lang.RuntimeException: Failed to register service at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111) at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226) Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://128.135.125.117:50000(1) at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104) at org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63) at org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43) at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115) at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186) at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100) ... 1 more Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u1031) not found. at org.globus.gsi.GlobusCredential.(GlobusCredential.java:114) at org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590) at org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575) at org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99) at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77) ... 
9 more EC: 1 BS: http://tp-login2.ci.uchicago.edu:50001 find wget = /usr/bin/wget -->/usr/bin/wget -c -q http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O /tmp/bootstrap.DS4363 >>/home/wilde/coaster-bootstrap-01709350024.log 2>&1<-- which: no gmd5sum in (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin) which: no gmd5sum in 
(/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/peg asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin) find gmd5sum = find md5sum = /usr/bin/md5sum Expected checksum: 33170989491a2e007a1c7c68eb907832 Computed checksum: 33170989491a2e007a1c7c68eb907832 find java = /soft/java-1.6.0_11-sun-r1/bin/java JAVA=/soft/java-1.6.0_11-sun-r1/bin/java /soft/java-1.6.0_11-sun-r1/bin/java -Djava=/soft/java-1.6.0_11-sun-r1/bin/java -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY= 
-DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA -DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.DS4363 http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000 01709350024 java.lang.RuntimeException: Failed to register service at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111) at org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226) Caused by: org.globus.cog.karajan.workflow.service.channels.ChannelException: Failed to start channel GSSCChannel-https://128.135.125.117:50000(1) at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104) at org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63) at org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43) at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115) at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186) at org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100) ... 1 more Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u1031) not found. at org.globus.gsi.GlobusCredential.(GlobusCredential.java:114) at org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590) at org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575) at org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99) at org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77) ... 
9 more EC: 1 tp$ -------- Original Message -------- Subject: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"? Date: Thu, 19 Mar 2009 02:42:25 -0500 From: Andrew Boyce To: swift-user at ci.uchicago.edu Hello, I am currently running Swift in conjunction with the PBS scheduler. My annoyance at the moment is this: When running any script, even a simple script such as first.swift (which normally finishes almost instantaneously), Swift always takes precisely five minutes to tell me that my job Finished successfully and copy the files back to the appropriate folder. It is always almost exactly five minutes; I've checked many logs - it polls the scheduler for five minutes. When I run a script (like first.swift) without using the PBS scheduler, everything happens as normal; execution and "Finished successfully" are nearly immediate. I think I know what the problem is: even after the scheduler says that the job is 'completed,' (which is generally right away) the scheduler keeps the job up on qstat and such for 5 minutes after (this setting is a PBS server attribute known as 'keep_completed', and I have checked that it is indeed set to 300 seconds; unfortunately I don't have permissions to change it). So when Swift polls the scheduler, the job is still up on qstat, and Swift must think that the task has not yet "Finished successfully." My question is this: Am I indeed right that Swift does not "understand" that when the PBS scheduler says a job is 'completed', the job really has "Finished successfully"? Can this be changed so that Swift does "understand" that a 'completed' job has "Finished successfully"? I have not included any files because I think I have narrowed the problem down to a question that does not require those that I would usually provide, but if I am wrong, then I can provide. Thank you and sorry for the length. 
Regards, Andrew Boyce _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Thu Mar 19 09:22:35 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 19 Mar 2009 09:22:35 -0500 Subject: [Swift-devel] User script gets null pointer exception Message-ID: <49C2552B.20409@mcs.anl.gov> Yue, I should have clarified: this is a Swift bug. It's perhaps caused by some error in your program, but Swift should never get a null pointer exception. Ben, Mihael - The log is in: /home/yuechen/PTMapZhao/PTMap/PTMap-20090310-1746-zszi94b6.log The error seems to be related to some boundary condition on the array "texts" which maps Yue's 375 fasta files, fast01 .. fasta375 This message is in the log: 2009-03-10 17:46:36,188-0500 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382 \ type Inputfile with no value at dataset=parameter (closed).$ 2009-03-10 17:46:36,188-0500 INFO New NEW id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382 2009-03-10 17:46:36,190-0500 DEBUG VDL2ExecutionContext Getting value for array texts.$[]/375 which is not permitted. Getting value for array texts.$[]/375 which is not permitted. vdl:getfieldvalue @ PTMap.kml, line: 124 sys:parallelfor @ PTMap.kml, line: 124 sys:sequential @ PTMap.kml, line: 123 doall @ PTMap.kml, line: 170 sys:sequential @ PTMap.kml, line: 169 vdl:mainp @ PTMap.kml, line: 168 mainp @ vdl.k, line: 165 vdl:mains @ PTMap.kml, line: 166 vdl:mains @ PTMap.kml, line: 166 rlog:restartlog @ PTMap.kml, line: 164 kernel:project @ PTMap.kml, line: 2 PTMap-20090310-1746-zszi94b6 Caused by: java.lang.RuntimeException: Getting value for array texts.$[]/375 which is not permitted. 
at org.griphyn.vdl.karajan.lib.GetFieldValue.function(GetFieldValue.java:53) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) -------- Original Message -------- Subject: Re: [Swift-user] swift execution problem Date: Thu, 19 Mar 2009 08:27:02 -0500 From: Michael Wilde To: Yue, Chen - BMD CC: swift-user at ci.uchicago.edu References: Yue, what version of Swift are you using? Please send the first few lines of your output file, where it says something like: Swift svn swift-r2701 cog-r2332 RunID: 20090319-0820-19zttiq9 (in fact please send the whole output file, stdout/err) I've tried to run your code in a near-identical test and I can't reproduce the failure. I've tried with both Swift 0.8 and the latest svn rev, and both seem to work. Also please can you post the pathname of the directory in which you are testing (I assume you are running this on a CI machine?) so that I can look at your logfile? And make it publicly accessible? Thanks, - Mike On 3/18/09 6:32 PM, Yue, Chen - BMD wrote: > Hi, > > I'm new to Swift programming. I was able to run a swift script before, > but I couldn't run it now. I'm wondering if someone can help me figure > out why. The swift script, sites.xml, tc.data, and all the error > messages are copied in this email. Thank you! 
> > Regards, > > Chen, Yue > > ********************* > Swift script > ********************* > type Fasta {} > type PTMapOut {} > type Solution {} > type Inputfile {} > app (PTMapOut ofile) PTMap (Solution sfile, Fasta fastafile, Inputfile > input, Inputfile parameter) > { > PTMap @filename(sfile) @filename(fastafile) @filename(input) > @filename(parameter) stdout=@filename(ofile > ); > } > Fasta texts[] ; > > doall(Fasta texts[]) > { > Solution sfile <"BSASolution.mzXML">; > Inputfile input <"inputs.txt">; > Inputfile parameter <"parameters.txt">; > foreach p in texts { > PTMapOut r source=@p , > match="fasta(.*)", > transform="\\1.out " > >; > r = PTMap(sfile, p, input, parameter); > } > } > // Main > doall(texts); > ************** > sites.xml > ************** > > > > /var/tmp > 0 > > ************** > tc.data > ************** > localhost echo /bin/echo INSTALLED > INTEL32::LINUX null > localhost cat /bin/cat INSTALLED > INTEL32::LINUX null > localhost ls /bin/ls INSTALLED > INTEL32::LINUX null > localhost grep /bin/grep INSTALLED > INTEL32::LINUX null > localhost sort /bin/sort INSTALLED > INTEL32::LINUX null > localhost paste /bin/paste INSTALLED > INTEL32::LINUX null > localhost PTMap /home/yuechen/PTMap/PTMap INSTALLED > INTEL32::LINUX null > ************** > Error messages > ************** > [yuechen at communicado PTMap]$ swift PTMap.swift > Execution failed: > java.lang.NullPointerException > at > org.globus.cog.abstraction.impl.common.task.ServiceImpl.toString(ServiceImpl.java:156) > at java.lang.String.valueOf(String.java:2577) > at java.lang.StringBuffer.append(StringBuffer.java:220) > at > org.globus.cog.karajan.workflow.nodes.grid.GridNode.function(GridNode.java:31) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > at > 
org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > at > org.globus.cog.karajan.workflow.nodes.ExecuteFile.notificationEvent(ExecuteFile.java:163) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > at > org.globus.cog.karajan.workflow.nodes.Sequential.childCompleted(Sequential.java:45) > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at > 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > at > org.globus.cog.karajan.workflow.nodes.user.UserDefinedElement.childCompleted(UserDefinedElement.java:283) > at > org.globus.cog.karajan.workflow.nodes.user.SequentialImplicitExecutionUDE.childCompleted(SequentialImplicitExecutionUDE.java:85) > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > at > org.globus.cog.karajan.workflow.nodes.If.childCompleted(If.java:30) > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > at > 
org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:240) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:281) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:393) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > at > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > > > > > > This email is intended only for the use of the individual or entity to > which it is addressed and may contain information that is privileged and > confidential. If the reader of this email message is not the intended > recipient, you are hereby notified that any dissemination, distribution, > or copying of this communication is prohibited. 
If you have received > this email in error, please notify the sender and destroy/delete all > copies of the transmittal. Thank you. > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From benc at hawaga.org.uk Thu Mar 19 09:24:08 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 19 Mar 2009 14:24:08 +0000 (GMT) Subject: [Swift-devel] Re: log-processing tools In-Reply-To: <20090318122815.BUG11518@m4500-02.uchicago.edu> References: <20090318122815.BUG11518@m4500-02.uchicago.edu> Message-ID: On Wed, 18 Mar 2009, skenny at uchicago.edu wrote: > hi, i was trying to grab the log processing tools, but getting > this: > > login3% svn co https://svn.ci.uchicago.edu/svn/vdl2/log-processing > svn: URL 'https://svn.ci.uchicago.edu/svn/vdl2/log-processing' > doesn't exist > > should i look elsewhere? Hopefully you figured this out based on what others said in this thread. If not - the log-processing stuff is now part of the main swift build. You should get a swift-plot-log command in the same place as the base swift command (i.e. if you can run swift, you should also be able to run swift-plot-log) -- From yuechen at bsd.uchicago.edu Thu Mar 19 10:05:30 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Thu, 19 Mar 2009 10:05:30 -0500 Subject: [Swift-devel] RE: User script gets null pointer exception References: <49C2552B.20409@mcs.anl.gov> Message-ID: Hi Michael, I used swift version 0.8. I was able to run the same swift script last week, but I don't know why I cannot run it now. Yesterday, I was trying to set up sites.xml and tc.data for NCSA and SDSC clusters, but it didn't run. 
So I removed those configurations and just used all the original files to see if I can still run swift. That's when I ran into errors. The sites.xml and tc.data were in the /home/yuechen/swift-0.8/etc/. PTMap run by itself seems normal. To test PTMap, I use the following command in PTMap directory at /home/yuechen/PTMap/: $ ./PTMap BSASolution.mzXML fasta172 inputs.txt parameters.txt Thank you very much for help. Regards, Chen, Yue 
From benc at hawaga.org.uk Thu Mar 19 10:12:51 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 19 Mar 2009 15:12:51 +0000 (GMT) Subject: [Swift-devel] User script gets null pointer exception In-Reply-To: <49C2552B.20409@mcs.anl.gov> References: <49C2552B.20409@mcs.anl.gov> Message-ID: Note that this is a different error to the original message which Yue posted (and which I just answered) on swift-user. The log that you show below is 9 days old and is not a log for the error that Yue reported. On Thu, 19 Mar 2009, Michael Wilde wrote: > Yue, I should have clarified: this is a Swift bug. > > Its perhaps caused by some error in your program, but Swift should never get a > null pointer exception. 
> > Ben, Mihael - The log is in: > > /home/yuechen/PTMapZhao/PTMap/PTMap-20090310-1746-zszi94b6.log > > The error seems to be related to some boundary condition on the array "texts" > which maps Yue's 375 fasta files, fast01 .. fasta375 > > This message is in the log: > > 2009-03-10 17:46:36,188-0500 INFO AbstractDataNode Found data > org.griphyn.vdl.mapping.RootDataNode identifier > tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382 > \ > type Inputfile with no value at dataset=parameter (closed).$ > 2009-03-10 17:46:36,188-0500 INFO New NEW > id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382 > 2009-03-10 17:46:36,190-0500 DEBUG VDL2ExecutionContext Getting value for > array texts.$[]/375 which is not permitted. > Getting value for array texts.$[]/375 which is not permitted. > vdl:getfieldvalue @ PTMap.kml, line: 124 > sys:parallelfor @ PTMap.kml, line: 124 > sys:sequential @ PTMap.kml, line: 123 > doall @ PTMap.kml, line: 170 > sys:sequential @ PTMap.kml, line: 169 > vdl:mainp @ PTMap.kml, line: 168 > mainp @ vdl.k, line: 165 > vdl:mains @ PTMap.kml, line: 166 > vdl:mains @ PTMap.kml, line: 166 > rlog:restartlog @ PTMap.kml, line: 164 > kernel:project @ PTMap.kml, line: 2 > PTMap-20090310-1746-zszi94b6 > Caused by: java.lang.RuntimeException: Getting value for array texts.$[]/375 > which is not permitted. 
> at > org.griphyn.vdl.karajan.lib.GetFieldValue.function(GetFieldValue.java:53) > at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > > > -------- Original Message -------- > Subject: Re: [Swift-user] swift execution problem > Date: Thu, 19 Mar 2009 08:27:02 -0500 > From: Michael Wilde > To: Yue, Chen - BMD > CC: swift-user at ci.uchicago.edu > References: > > > Yue, what version of Swift are you using? > > Please send the first few lines of your output file, where it says > something like: > > Swift svn swift-r2701 cog-r2332 > > RunID: 20090319-0820-19zttiq9 > > (in fact please send the whole output file, stdout/err) > > Ive tried to run your code in a near-identical test and I cant reproduce > the failure. Ive tried with both swift0.8 and the latest svn rev, and > both seem to work. > > Also please can you post the pathname of the directory in which you are > testing (I assume you are running this on a CI machine?) so that I can > look at your logfile? And make it publicly accessible? > > Thanks, > > - Mike > > > On 3/18/09 6:32 PM, Yue, Chen - BMD wrote: > > Hi, > > I'm new to Swift programming. I was able to run a swift script before, but > > I couldn't run it now. I'm wondering if someone can help me figure out why. > > The swift script, sites.xml, tc.data, and all the error messages are copied > > in this email. Thank you! 
> > Regards, > > Chen, Yue > > ********************* > > Swift script > > ********************* > > type Fasta {} > > type PTMapOut {} > > type Solution {} > > type Inputfile {} > > app (PTMapOut ofile) PTMap (Solution sfile, Fasta fastafile, Inputfile > > input, Inputfile parameter) > > { > > PTMap @filename(sfile) @filename(fastafile) @filename(input) > > @filename(parameter) stdout=@filename(ofile > > ); > > } > > Fasta texts[] ; > > > > doall(Fasta texts[]) > > { > > Solution sfile <"BSASolution.mzXML">; > > Inputfile input <"inputs.txt">; > > Inputfile parameter <"parameters.txt">; > > foreach p in texts { > > PTMapOut r > source=@p , > > match="fasta(.*)", > > transform="\\1.out " > > >; > > r = PTMap(sfile, p, input, parameter); > > } > > } > > // Main > > doall(texts); > > ************** > > sites.xml > > ************** > > > > > > > > /var/tmp > > 0 > > > > ************** > > tc.data > > ************** > > localhost echo /bin/echo INSTALLED > > INTEL32::LINUX null > > localhost cat /bin/cat INSTALLED > > INTEL32::LINUX null > > localhost ls /bin/ls INSTALLED > > INTEL32::LINUX null > > localhost grep /bin/grep INSTALLED > > INTEL32::LINUX null > > localhost sort /bin/sort INSTALLED > > INTEL32::LINUX null > > localhost paste /bin/paste INSTALLED > > INTEL32::LINUX null > > localhost PTMap /home/yuechen/PTMap/PTMap INSTALLED > > INTEL32::LINUX null > > ************** > > Error messages > > ************** > > [yuechen at communicado PTMap]$ swift PTMap.swift > > Execution failed: > > java.lang.NullPointerException > > at > > org.globus.cog.abstraction.impl.common.task.ServiceImpl.toString(ServiceImpl.java:156) > > at java.lang.String.valueOf(String.java:2577) > > at java.lang.StringBuffer.append(StringBuffer.java:220) > > at > > org.globus.cog.karajan.workflow.nodes.grid.GridNode.function(GridNode.java:31) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) > > at > > 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > > at > > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > at > > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > > at > > org.globus.cog.karajan.workflow.nodes.ExecuteFile.notificationEvent(ExecuteFile.java:163) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > > at > > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > at > > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.childCompleted(Sequential.java:45) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > > at > > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > at > > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > > at > > 
org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > at > > org.globus.cog.karajan.workflow.nodes.user.UserDefinedElement.childCompleted(UserDefinedElement.java:283) > > at > > org.globus.cog.karajan.workflow.nodes.user.SequentialImplicitExecutionUDE.childCompleted(SequentialImplicitExecutionUDE.java:85) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > > at > > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > at > > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > > at > > org.globus.cog.karajan.workflow.nodes.If.childCompleted(If.java:30) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > > at > > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > at > > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > at > > 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) > > at > > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > at > > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > > at > > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) > > at > > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) > > at > > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:240) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:281) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:393) > > at > > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) > > at > > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > > at > > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > > at > > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > > at > > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > > > > > > This email is intended only for 
the use of the individual or entity to which > > it is addressed and may contain information that is privileged and > > confidential. If the reader of this email message is not the intended > > recipient, you are hereby notified that any dissemination, distribution, or > > copying of this communication is prohibited. If you have received this email > > in error, please notify the sender and destroy/delete all copies of the > > transmittal. Thank you. > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From bugzilla-daemon at mcs.anl.gov Thu Mar 19 10:18:59 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 19 Mar 2009 10:18:59 -0500 (CDT) Subject: [Swift-devel] [Bug 184] New: missing element in sites.xml causes extremely useless error message Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=184 Summary: missing element in sites.xml causes extremely useless error message Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CC: swift-devel at ci.uchicago.edu Omitting and using at the top level in sites.xml causes the following error message on the console: $ swift examples/first.swift Ex098 java.lang.NullPointerException at org.globus.cog.abstraction.impl.common.task.ServiceImpl.toString(ServiceImpl.java:156) at 
java.lang.String.valueOf(String.java:2615) at java.lang.StringBuffer.append(StringBuffer.java:220) at org.globus.cog.karajan.workflow.nodes.grid.GridNode.function(GridNode.java:31) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) [further stack trace omitted] This could be made much more helpful (at least, it should indicate a problem with sites.xml). -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Thu Mar 19 10:43:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 19 Mar 2009 10:43:55 -0500 Subject: [Swift-devel] User script gets null pointer exception In-Reply-To: References: <49C2552B.20409@mcs.anl.gov> Message-ID: <49C2683B.2030609@mcs.anl.gov> Ooops, sorry. Thanks for both explanations :) On 3/19/09 10:12 AM, Ben Clifford wrote: > Note that this is a different error to the original message which > Yue posted (and which I just answered) on swift-user. > > The log that you show below is 9 days old and is not a log for the error > that Yue reported. > > On Thu, 19 Mar 2009, Michael Wilde wrote: > >> Yue, I should have clarified: this is a Swift bug. >> >> Its perhaps caused by some error in your program, but Swift should never get a >> null pointer exception. >> >> Ben, Mihael - The log is in: >> >> /home/yuechen/PTMapZhao/PTMap/PTMap-20090310-1746-zszi94b6.log >> >> The error seems to be related to some boundary condition on the array "texts" >> which maps Yue's 375 fasta files, fast01 .. 
fasta375 >> >> This message is in the log: >> >> 2009-03-10 17:46:36,188-0500 INFO AbstractDataNode Found data >> org.griphyn.vdl.mapping.RootDataNode identifier >> tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382 >> \ >> type Inputfile with no value at dataset=parameter (closed).$ >> 2009-03-10 17:46:36,188-0500 INFO New NEW >> id=tag:benc at ci.uchicago.edu,2008:swift:dataset:20090310-1746-f12dnzdd:720000000382 >> 2009-03-10 17:46:36,190-0500 DEBUG VDL2ExecutionContext Getting value for >> array texts.$[]/375 which is not permitted. >> Getting value for array texts.$[]/375 which is not permitted. >> vdl:getfieldvalue @ PTMap.kml, line: 124 >> sys:parallelfor @ PTMap.kml, line: 124 >> sys:sequential @ PTMap.kml, line: 123 >> doall @ PTMap.kml, line: 170 >> sys:sequential @ PTMap.kml, line: 169 >> vdl:mainp @ PTMap.kml, line: 168 >> mainp @ vdl.k, line: 165 >> vdl:mains @ PTMap.kml, line: 166 >> vdl:mains @ PTMap.kml, line: 166 >> rlog:restartlog @ PTMap.kml, line: 164 >> kernel:project @ PTMap.kml, line: 2 >> PTMap-20090310-1746-zszi94b6 >> Caused by: java.lang.RuntimeException: Getting value for array texts.$[]/375 >> which is not permitted. 
>> at >> org.griphyn.vdl.karajan.lib.GetFieldValue.function(GetFieldValue.java:53) >> at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) >> at >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >> >> >> >> -------- Original Message -------- >> Subject: Re: [Swift-user] swift execution problem >> Date: Thu, 19 Mar 2009 08:27:02 -0500 >> From: Michael Wilde >> To: Yue, Chen - BMD >> CC: swift-user at ci.uchicago.edu >> References: >> >> >> Yue, what version of Swift are you using? >> >> Please send the first few lines of your output file, where it says >> something like: >> >> Swift svn swift-r2701 cog-r2332 >> >> RunID: 20090319-0820-19zttiq9 >> >> (in fact please send the whole output file, stdout/err) >> >> Ive tried to run your code in a near-identical test and I cant reproduce >> the failure. Ive tried with both swift0.8 and the latest svn rev, and >> both seem to work. >> >> Also please can you post the pathname of the directory in which you are >> testing (I assume you are running this on a CI machine?) so that I can >> look at your logfile? And make it publicly accessible? >> >> Thanks, >> >> - Mike >> >> >> On 3/18/09 6:32 PM, Yue, Chen - BMD wrote: >>> Hi, >>> I'm new to Swift programming. I was able to run a swift script before, but >>> I couldn't run it now. I'm wondering if someone can help me figure out why. >>> The swift script, sites.xml, tc.data, and all the error messages are copied >>> in this email. Thank you! 
From zhaozhang at uchicago.edu Thu Mar 19 12:21:19 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 19 Mar 2009 12:21:19 -0500 Subject: [Swift-devel] How does swift know if a task is successful In-Reply-To: <1237315230.31264.1.camel@localhost> References: <49BAD926.1030607@uchicago.edu> <1236982413.13026.1.camel@localhost> <49BFDA75.9070803@uchicago.edu> <1237310455.29738.11.camel@localhost> <49BFDC78.8040506@uchicago.edu> <1237310948.30064.2.camel@localhost> <49BFEDB0.5070409@uchicago.edu> <1237315230.31264.1.camel@localhost> Message-ID: <49C27F0F.20604@uchicago.edu> Hi, Mihael Things you suggested was working, I misplaced other files in my previous dir. Thank you.
best wishes
zhangzhao

Mihael Hategan wrote:
> On Tue, 2009-03-17 at 13:36 -0500, Zhao Zhang wrote:
>
>> Hi, Mihael
>>
>> I commented the following lines
>> /*dir:make(ldir)
>> restartOnError(".*", 2
>> task:transfer(srchost=host, srcfile=bname,
>> srcdir=rdir, destdir=ldir, desthost=dhost, destprovider=provider)
>> )*/
>>
>
> Did you modify this file in dist/?/libexec? If not, did you re-compile
> swift after the modification?
>
> Put an echo or a log message in place, to see if your change is picked
> up by swift next time.
>
>> Then I modified wrapper.sh to not to copy output file back, but I still
>> got an error.
>> The log file is at
>> http://www.ci.uchicago.edu/~zzhang/first-20090317-1327-oqgttus8.log
>> Thanks
>>
>> zhao
>>
>> zzhang at login6.surveyor:~/new_dock6> ./run_swift_ssh.sh 1117 64 first.swift
>> waiting for at least 64 nodes to register before submitting workload...
>> waiting to find at least 1 services in file
>> /home/falkon/users/zzhang/1117/config/Client-service-URIs.config...
>> all done, file has found at least 1 services
>> found at least 64 registered, submitting workload...
>> Swift svn swift-r2676 (swift modified locally) cog-r2305
>>
>> RunID: 20090317-1327-oqgttus8
>> Progress:
>> Progress: Selecting site:1 Stage in:1
>> Progress: Submitting:1 Submitted:1
>> Progress: Submitted:1 Failed but can retry:1
>> Failed to transfer wrapper log from
>> first-20090317-1327-oqgttus8/info/b/n/bgp000
>> Progress: Submitted:1 Active:1
>> Failed to transfer wrapper log from
>> first-20090317-1327-oqgttus8/info/e/n/bgp000
>> Progress: Submitted:1 Active:1
>> Failed to transfer wrapper log from
>> first-20090317-1327-oqgttus8/info/g/n/bgp000
>> Execution failed:
>> Exception in echo:
>> Arguments: [Hello, world!]
>> Host: bgp000
>> Directory: first-20090317-1327-oqgttus8/jobs/g/n/echo-gnlq238j
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Cannot transfer
>> "/tmp/first-20090317-1327-oqgttus8/shared/hello.txt" to
>> "/gpfs/home/zzhang/new_dock6/./hello.txt"
>> Caused by:
>> No such file
>>
>> Mihael Hategan wrote:
>>
>>> On Tue, 2009-03-17 at 12:23 -0500, Zhao Zhang wrote:
>>>
>>>> Hi, Mihael
>>>>
>>>> yes, can I do that?
>>>>
>>> You should know this by now:
>>> in vdl-int.k, in doStageout, comment out the task:transfer invocation
>>> (and dir:make).
>>>
>>>> zhao
>>>>
>>>> Mihael Hategan wrote:
>>>>
>>>>> On Tue, 2009-03-17 at 12:14 -0500, Zhao Zhang wrote:
>>>>>
>>>>>> Here comes another question, is there any place that I could set to
>>>>>> disable swift's waiting for data feature?
>>>>>>
>>>>> Do you mean disable the stage-outs?
>>>>>
>>>>>> Or is there any way for me to cheat swift that the data is already
>>>>>> there? thanks.
>>>>>>
>>>>>> zhao
>>>>>>
>>>>>> Mihael Hategan wrote:
>>>>>>
>>>>>>> On Fri, 2009-03-13 at 17:07 -0500, Zhao Zhang wrote:
>>>>>>>
>>>>>>>> Hi, All
>>>>>>>>
>>>>>>>> I have a question on how swift knows if a task is successful.
>>>>>>>> In my case, I am using a status notification instead of a status file.
>>>>>>>>
>>>>>>>> So my question is is this status notification the only thing swift is
>>>>>>>> waiting for, or is swift also waiting for the output data to appear to
>>>>>>>> say that a job is successful?
>>>>>>>>
>>>>>>> Once the job is done, swift will attempt to stage out all the files that
>>>>>>> it expects the job to have produced.
>>>>>>>
>>>>>>> Should one of those files not be there, there will be failures.
From hategan at mcs.anl.gov Thu Mar 19 16:49:40 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 19 Mar 2009 16:49:40 -0500
Subject: [Swift-devel] Small issues in coasters on local:pbs
In-Reply-To: <49C23701.9060802@mcs.anl.gov>
References: <49C23701.9060802@mcs.anl.gov>
Message-ID: <1237499380.31133.8.camel@localhost>

On Thu, 2009-03-19 at 07:13 -0500, Michael Wilde wrote:
> Regarding: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?]
>
> I'm retesting coasters on local:pbs (on teraport), as I think this may
> partially alleviate Andrew's problem.
>
> A simple foreach() works nice and fast, but I see two things:
>
> 1) I first tested without a valid proxy. I forgot that coasters requires
> a proxy (presumably for its secure channels)

Yes. For its secure channels.

> even when it's not using
> GRAM to reach its "RRM". The error returned if you don't have a proxy is
> cryptic and buried in the coaster bootstrap log. So 3 things: (a) do a
> check for proxy early on and print a nice message if there's not a valid
> proxy; (b) bring the errors from the bootstrap log back to the user
> (unless that's not possible) in which case point the user to look for
> that.

I would favor that. Currently swift seems to report too little of the
underlying errors, which often contain essential information for solving
the problem. But in this particular instance, it's mostly a matter of
nicely propagating error messages through a handful of layers. I'll see
what I can do, though I suspect that for now the too little vs. too much
output conflict will exist.

> (c) document that you need a proxy.
>
> 2) When the script finishes you get this message on stdout/err which
> looks like a leftover debugging message:

It is. I will remove that.
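
Suggestion (a) above, checking for a proxy before any coaster job is submitted, could be sketched in shell roughly as follows. This is only an illustration under stated assumptions, not what Swift or the coaster bootstrap actually does. The function names proxy_path and check_proxy are made up for the example. The only facts taken from this thread are that the bootstrap passes X509_USER_PROXY through to Java and that the default credential it looked for was /tmp/x509up_u1031, which is assumed here to follow the pattern /tmp/x509up_u<uid>. The check only tests that the file exists and is non-empty; it does not verify that the proxy is still valid.

```shell
#!/bin/sh
# Hypothetical pre-flight check; the names below are illustrative and
# are not part of Swift or the coaster bootstrap script.

# Print the proxy file that would be used: $X509_USER_PROXY if it is
# set and non-empty, otherwise the assumed default /tmp/x509up_u<uid>.
proxy_path() {
    if [ -n "${X509_USER_PROXY:-}" ]; then
        printf '%s\n' "$X509_USER_PROXY"
    else
        printf '/tmp/x509up_u%s\n' "$(id -u)"
    fi
}

# Return 0 if the proxy file exists and is non-empty. Otherwise print a
# readable message on stderr and return 1.
check_proxy() {
    p=$(proxy_path)
    if [ -s "$p" ]; then
        return 0
    fi
    echo "No proxy found at $p; coasters need one even on local:pbs." >&2
    return 1
}
```

A wrapper could then run something like `check_proxy && swift hellos.swift -sites.file sites.xml -tc.file tc.data`, so that a missing proxy is reported on the console instead of only in the coaster bootstrap log.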
>
> --
> Swift svn swift-r2701 cog-r2332
>
> RunID: 20090319-0658-3ejpl9xc
> Progress:
> Progress: Submitting:9 Submitted:1
> Progress: Submitted:9 Active:1
> Progress: Submitted:4 Active:3 Stage out:1 Finished successfully:2
> Final status: Finished successfully:10
> Cleaning up...
> Shutting down service at https://128.135.125.117:50002
> Got channel MetaChannel: 101224864 -> GSSSChannel-null(1)
> - Done
> --
>
> - Mike
>
> The errors you get when you don't have a proxy are:
>
> tp$ swift hellos.swift -sites.file sites.xml -tc.file tc.data
> Swift svn swift-r2701 cog-r2332
>
> RunID: 20090319-0655-9ufl1r2g
> Progress:
> Progress: Submitting:9 Submitted:1
> Failed to transfer wrapper log from hellos-20090319-0655-9ufl1r2g/info/l
> on teraport
> Execution failed:
> Exception in echo:
> Arguments: [Output of run, 6]
> Host: teraport
> Directory: hellos-20090319-0655-9ufl1r2g/jobs/l/echo-lde8x58j
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Could not submit job
> Caused by:
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT:
> STDERR:
>
> Caused by:
> Job failed with an exit code of 1
> Cleaning up...
> Done
> tp$
>
> tp$ cat /home/wilde/coaster-bootstrap-01709350024.log
> BS: http://tp-login2.ci.uchicago.edu:50001
> find wget = /usr/bin/wget
> -->/usr/bin/wget -c -q
> http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O
> /tmp/bootstrap.YJ4129 >>/home/wilde/coaster-bootstrap-01709350024.log
> 2>&1<--
> which: no gmd5sum in
> (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/pegasus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
> which: no gmd5sum in
> (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/pegasus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin)
> find gmd5sum =
> find md5sum = /usr/bin/md5sum
> Expected checksum: 33170989491a2e007a1c7c68eb907832
> Computed checksum: 33170989491a2e007a1c7c68eb907832
> find java = /soft/java-1.6.0_11-sun-r1/bin/java
> JAVA=/soft/java-1.6.0_11-sun-r1/bin/java
> /soft/java-1.6.0_11-sun-r1/bin/java
> -Djava=/soft/java-1.6.0_11-sun-r1/bin/java
> -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY= >
-DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA > -DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.YJ4129 > http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000 > 01709350024 > java.lang.RuntimeException: Failed to register service > at > org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111) > at > org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226) > Caused by: > org.globus.cog.karajan.workflow.service.channels.ChannelException: > Failed to start channel GSSCChannel-https://128.135.125.117:50000(1) > at > org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104) > at > org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63) > at > org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43) > at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115) > at org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186) > at > org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100) > ... 1 more > Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy > file (/tmp/x509up_u1031) not found. 
> at org.globus.gsi.GlobusCredential.(GlobusCredential.java:114) > at > org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590) > at > org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575) > at > org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99) > at > org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77) > ... 9 more > > EC: 1 > BS: http://tp-login2.ci.uchicago.edu:50001 > find wget = /usr/bin/wget > -->/usr/bin/wget -c -q > http://tp-login2.ci.uchicago.edu:50001/coaster-bootstrap.jar -O > /tmp/bootstrap.DS4363 >>/home/wilde/coaster-bootstrap-01709350024.log > 2>&1<-- > which: no gmd5sum in > (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/p eg > 
asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin) > which: no gmd5sum in > (/autonfs/home/wilde/tutorials/osgedu/build/docbook-xsl/tools/bin:/home/wilde/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/soft/apache-ant-1.7.1-r1/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r1/bin:/soft/globus-4.2.1-r1/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/wilde/bin/linux-rhel5-x86_64:/home/wilde/bin:/soft/torque-2.3.6-r1/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/matlab-7.7-r1/bin:/soft/osg-client-1.0.0-r1/lcg/bin:/soft/osg-client-1.0.0-r1/srm-client-lbnl/bin:/soft/osg-client-1.0.0-r1/srm-client-fermi/sbin:/soft/osg-client-1.0.0-r1/srm-client-fermi/bin:/soft/osg-client-1.0.0-r1/curl/bin:/soft/osg-client-1.0.0-r1/wget/bin:/soft/osg-client-1.0.0-r1/cert-scripts/bin:/soft/osg-client-1.0.0-r1/glite/sbin:/soft/osg-client-1.0.0-r1/glite/bin:/soft/osg-client-1.0.0-r1/pyglobus-url-copy/bin:/soft/osg-client-1.0.0-r1/p eg > 
asus/bin:/soft/osg-client-1.0.0-r1/ant/bin:/soft/osg-client-1.0.0-r1/gpt/sbin:/soft/osg-client-1.0.0-r1/globus/bin:/soft/osg-client-1.0.0-r1/globus/sbin:/soft/osg-client-1.0.0-r1/jdk1.5/bin:/soft/osg-client-1.0.0-r1/condor/sbin:/soft/osg-client-1.0.0-r1/condor/bin:/soft/osg-client-1.0.0-r1/logrotate/sbin:/software/common/pacman-3.26-r1/bin:/soft/osg-client-1.0.0-r1/vdt/sbin:/soft/osg-client-1.0.0-r1/vdt/bin:/home/grog/bin/linux-rhel4-ia32:/home/grog/bin:/sw/bin:/sbin:/usr/sbin:/home/wilde/swift/tools:/home/wilde/swift/rev/latest/bin:/home/wilde/blast/ncbi/bin:/home/wilde/adem/adem-osg/bin) > find gmd5sum = > find md5sum = /usr/bin/md5sum > Expected checksum: 33170989491a2e007a1c7c68eb907832 > Computed checksum: 33170989491a2e007a1c7c68eb907832 > find java = /soft/java-1.6.0_11-sun-r1/bin/java > JAVA=/soft/java-1.6.0_11-sun-r1/bin/java > /soft/java-1.6.0_11-sun-r1/bin/java > -Djava=/soft/java-1.6.0_11-sun-r1/bin/java > -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DX509_USER_PROXY= > -DX509_CERT_DIR=/soft/osg-client-1.0.0-r1/globus/TRUSTED_CA > -DGLOBUS_HOSTNAME=none -jar /tmp/bootstrap.DS4363 > http://tp-login2.ci.uchicago.edu:50001 https://128.135.125.117:50000 > 01709350024 > java.lang.RuntimeException: Failed to register service > at > org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:111) > at > org.globus.cog.abstraction.coaster.service.CoasterService.main(CoasterService.java:226) > Caused by: > org.globus.cog.karajan.workflow.service.channels.ChannelException: > Failed to start channel GSSCChannel-https://128.135.125.117:50000(1) > at > org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:104) > at > org.globus.cog.karajan.workflow.service.channels.GSSChannel.start(GSSChannel.java:63) > at > org.globus.cog.karajan.workflow.service.ChannelFactory.newChannel(ChannelFactory.java:43) > at org.globus.cog.karajan.workflow.service.Client.connect(Client.java:115) > at 
org.globus.cog.karajan.workflow.service.Client.newClient(Client.java:72) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:211) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:230) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:186) > at > org.globus.cog.abstraction.coaster.service.CoasterService.start(CoasterService.java:100) > ... 1 more > Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy > file (/tmp/x509up_u1031) not found. > at org.globus.gsi.GlobusCredential.(GlobusCredential.java:114) > at > org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:590) > at > org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575) > at > org.globus.cog.karajan.workflow.service.GSSService.initializeCredentials(GSSService.java:99) > at > org.globus.cog.karajan.workflow.service.channels.GSSChannel.reconnect(GSSChannel.java:77) > ... 9 more > > EC: 1 > tp$ > > > -------- Original Message -------- > Subject: [Swift-user] Swift/PBS Scheduler Slow to Report "Finished"? > Date: Thu, 19 Mar 2009 02:42:25 -0500 > From: Andrew Boyce > To: swift-user at ci.uchicago.edu > > Hello, > > I am currently running Swift in conjunction with the PBS scheduler. My > annoyance at the moment is this: > > When running any script, even a simple script such as first.swift > (which normally finishes almost instantaneously), Swift always takes > precisely five minutes to tell me that my job Finished successfully > and copy the files back to the appropriate folder. It is always almost > exactly five minutes; I've checked many logs - it polls the scheduler > for five minutes. When I run a script (like first.swift) without using > the PBS scheduler, everything happens as normal; execution and > "Finished successfully" are nearly immediate. 
> > I think I know what the problem is: even after the scheduler says that > the job is 'completed,' (which is generally right away) the scheduler > keeps the job up on qstat and such for 5 minutes after (this setting > is a PBS server attribute known as 'keep_completed', and I have > checked that it is indeed set to 300 seconds; unfortunately I don't > have permissions to change it). So when Swift polls the scheduler, the > job is still up on qstat, and Swift must think that the task has not > yet "Finished successfully." > > My question is this: > Am I indeed right that Swift does not "understand" that when the PBS > scheduler says a job is 'completed', the job really has "Finished > successfully"? > Can this be changed so that Swift does "understand" that a 'completed' > job has "Finished successfully"? > > I have not included any files because I think I have narrowed the > problem down to a question that does not require those that I would > usually provide, but if I am wrong, then I can provide. > > Thank you and sorry for the length. > > Regards, > > Andrew Boyce > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Thu Mar 19 18:23:35 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 19 Mar 2009 18:23:35 -0500 Subject: [Swift-devel] testing In-Reply-To: <1237396313.9356.1.camel@localhost> References: <1237307410.26969.8.camel@localhost> <49C0E8CB.3080600@mcs.anl.gov> <1237389071.5032.1.camel@localhost> <49C129E4.9020003@mcs.anl.gov> <1237396313.9356.1.camel@localhost> Message-ID: <49C2D3F7.2050402@mcs.anl.gov> I was writing the following yesterday before you posted the Coaster design notes. Those were very helpful, exactly what I was looking for. 
Now the changes being discussed can be couched in terms of deltas to that spec. So I'm just going to post thoughts I had below before I lose them, to try to nudge this issue forward. On 3/18/09 12:11 PM, Mihael Hategan wrote: > On Wed, 2009-03-18 at 12:05 -0500, Michael Wilde wrote: >> On 3/18/09 10:11 AM, Mihael Hategan wrote: >>> On Wed, 2009-03-18 at 07:27 -0500, Michael Wilde wrote: >>> >>>>> iii) running anything through gram2 is bad - any base job submissions >>>>> need to be through condor-g using its hybrid gram2+gridmanager system. >>>> I agree, and was assuming that on OSG we would only use the new Condor >>>> provider, and run jobs in this manner. >>> There seems to be some confusion here. >>> >>> Ben, the point is to run with one of the scheduler providers, not gram2. >>> >>> Mike, the condor provider is not a condor-through-gram provider. It only >>> submits to the local condor queue. >> I was thinking/hoping that the condor provider would have a setting that >> submitted swift apps as condor-g jobs to N grid sites, *via* the local >> condor queue. >> >> Isn't that how condor-g works? I send my local condor (via condor_submit) >> a .sub file that says e.g.: >> >> universe=grid >> grid_resource=gt2 osg-edu.cs.wisc.edu/jobmanager-condor >> >> If I can't do that yet through the condor provider, was it your intent >> that users eventually be able to do that, or was that not what you were >> implementing? > > That was not what I was implementing. OK. But (a) is the Condor-G provider worth implementing and (b) how far is it, in effort, from what you were implementing? > What I was aiming for was a local condor provider, similar to the PBS > provider, that would address the scalability issues with gram2 for sites > using condor as a queuing system. But one uses the pbs provider by running swift directly on a system that has the pbs tools (i.e. qsub) installed. There are very few systems that users have direct login access to which have a Condor LRM.
(The TG Purdue systems being one exception). But, could a swift user use the "local condor provider", the one you are implementing, to have the coaster service launch its workers? Alternatively, if you *were* to implement a Condor-G provider as above, could coaster workers be submitted via that provider via Swift, rather than having them started by the coaster service? And then there is Ben's important point about putting the Coaster service on a cluster worker node rather than the head node - wherever possible. And accepting the limitation that on some systems, if the coaster service can't run on a worker node, it can't run on that site. So sites where the workers cannot connect back to the Swift submit host could not run coasters. But the basic limitations are, I believe, as Ben stated them. We have to, for the time being and probably quite a while, submit both the worker and the service via Condor-G to pre-WS-GRAM with the grid_monitor enabled. From skenny at uchicago.edu Thu Mar 19 21:53:04 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Thu, 19 Mar 2009 21:53:04 -0500 (CDT) Subject: [Swift-devel] estranged on ranger Message-ID: <20090319215304.BUH98753@m4500-02.uchicago.edu> hey there, i'm having some trouble figuring out why my gigantic workflow is failing :) all the details are below...i should mention also that i ran 10k jobs with the same configs and it completed w/o err in about 28min. so i'm trying to run the 65k workflow with the latest build from svn. the workflow completes 244 of the jobs and then begins failing. it never returns an error but seems to hang for quite some time (though all jobs have left the q).
from the properties file: lazy.errors=false caching.algorithm=LRU pgraph=false pgraph.graph.options=splines="compound", rankdir="TB" pgraph.node.options=color="seagreen", style="filled" clustering.enabled=false clustering.queue.delay=4 clustering.min.time=60 kickstart.enabled=maybe kickstart.always.transfer=false wrapperlog.always.transfer=false throttle.submit=6 throttle.host.submit=3 throttle.score.job.factor=8 throttle.transfers=16 throttle.file.operations=16 sitedir.keep=true execution.retries=2 replication.enabled=false replication.min.queue.time=60 replication.limit=3 foreach.max.threads=1024 from sites: 1 8 TG-DBS090006 16 /scratch/projects/tg/SIDGrid/sidgrid_out/{username} ran the log plot: http://www.ci.uchicago.edu/~skenny/sem/report-modgenproc-20090319-2002-b0nthqyg/index.html the log itself is here on ranger: /scratch/projects/tg/SIDGrid/swift-logs/skenny/modgenproc-20090319-1513-m5tlihce.log thoughts? ideas of what i might try to tweak? thanks! ~skenny From hategan at mcs.anl.gov Thu Mar 19 21:56:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Mar 2009 21:56:11 -0500 Subject: [Swift-devel] estranged on ranger In-Reply-To: <20090319215304.BUH98753@m4500-02.uchicago.edu> References: <20090319215304.BUH98753@m4500-02.uchicago.edu> Message-ID: <1237517771.6388.0.camel@localhost> On Thu, 2009-03-19 at 21:53 -0500, skenny at uchicago.edu wrote: > the log itself is here on ranger: > /scratch/projects/tg/SIDGrid/swift-logs/skenny/modgenproc-20090319-1513-m5tlihce.log login3% less /scratch/projects/tg/SIDGrid/swift-logs/skenny/modgenproc-20090319-1513-m5tlihce.log /scratch/projects/tg/SIDGrid/swift-logs/skenny/modgenproc-20090319-1513-m5tlihce.log: Permission denied From hategan at mcs.anl.gov Thu Mar 19 22:40:12 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 19 Mar 2009 22:40:12 -0500 Subject: [Swift-devel] estranged on ranger In-Reply-To: <20090319215304.BUH98753@m4500-02.uchicago.edu> References: 
<20090319215304.BUH98753@m4500-02.uchicago.edu> Message-ID: <1237520412.6648.5.camel@localhost> For the record, the issue seems to be some FS problems. This is suggested by the file operation tasks in http://www.ci.uchicago.edu/~skenny/sem/report-modgenproc-20090319-2002-b0nthqyg/karajan.html where it can be seen that before being stopped, swift was running some very slow (166 seconds each?) file tasks. My suspicion is that running swift on ranger's head node is bound to cause problems (due to doubled FS load) and/or suffer from problems caused by the myriad of other people running stuff there. The recommendation would be to run swift from a machine @CI, preferably on a local disk if many files are involved. On Thu, 2009-03-19 at 21:53 -0500, skenny at uchicago.edu wrote: > hey there, i'm having some trouble figuring out why my > gigantic workflow is failing :) all the details are below...i > should mention also that i ran 10k jobs with the same configs > and it completed w/o err in about 28min. > > so i'm trying to run the 65k workflow with the latest > build from svn. the workflow completes 244 of the jobs and > then begins failing. it never returns an error but seems to > hang for quite some time (though all jobs have left the q). 
> > from the properties file: > > lazy.errors=false > caching.algorithm=LRU > pgraph=false > pgraph.graph.options=splines="compound", rankdir="TB" > pgraph.node.options=color="seagreen", style="filled" > clustering.enabled=false > clustering.queue.delay=4 > clustering.min.time=60 > > kickstart.enabled=maybe > kickstart.always.transfer=false > wrapperlog.always.transfer=false > > throttle.submit=6 > throttle.host.submit=3 > > throttle.score.job.factor=8 > throttle.transfers=16 > > throttle.file.operations=16 > sitedir.keep=true > execution.retries=2 > > replication.enabled=false > replication.min.queue.time=60 > replication.limit=3 > foreach.max.threads=1024 > > from sites: > > > 1 > 8 > key="project">TG-DBS090006 > url="gt2://gatekeeper.ranger.tacc.teragrid.org"/> > 16 > url="gatekeeper.ranger.tacc.teragrid.org" > jobManager="gt2:gt2:SGE"/> > > /scratch/projects/tg/SIDGrid/sidgrid_out/{username} > > > ran the log plot: > > http://www.ci.uchicago.edu/~skenny/sem/report-modgenproc-20090319-2002-b0nthqyg/index.html > > the log itself is here on ranger: > /scratch/projects/tg/SIDGrid/swift-logs/skenny/modgenproc-20090319-1513-m5tlihce.log > > thoughts? ideas of what i might try to tweak? > > thanks! 
> > ~skenny > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From bugzilla-daemon at mcs.anl.gov Fri Mar 20 15:15:49 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 20 Mar 2009 15:15:49 -0500 (CDT) Subject: [Swift-devel] [Bug 185] New: expose concurrent mapper's automatic hierarchical directory naming for GPFS-friendliness in other mappers Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=185 Summary: expose concurrent mapper's automatic hierarchical directory naming for GPFS-friendliness in other mappers Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: enhancement Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CC: swift-devel at ci.uchicago.edu concurrent mapper makes complicated looking directory hierarchies so that large structures mapped with concurrent mapper are GPFS-friendly. An option could be added to other mappers to allow such hierarchies to be appended to the directories used in (for example) simple_mapper, whilst still allowing prefix, suffix specification and the existing filename construction rules based on array and structure membership. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. You are watching the assignee of the bug. You are watching the reporter. From benc at hawaga.org.uk Sun Mar 22 06:28:19 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 22 Mar 2009 11:28:19 +0000 (GMT) Subject: [Swift-devel] lots of variable not found Message-ID: I've seen these sporadically in the past but there seem to be more of them happening in the nmi build and test recently. 
On http://nmi-s005.cs.wisc.edu:80/nmi/index.php?page=results/runDetails&runid=140131 the 066-many test fails with Progress: Selecting site:1021 Active:1 Stage out:1 Finished successfully:1596 Execution failed: Variable not found: #channel#restartout and again in here: http://nmi-s005.cs.wisc.edu:80/nmi/index.php?page=results/runDetails&rows=20&&runid=140127&offset=20&rows=20 on nmi_cent_4.2 and a Progress: Selecting site:1021 Active:1 Stage out:1 Finished successfully:1867 Execution failed: Variable not found: #threadcount in 066-many on x86_suse_10 in that same URL above. I've also seen: Event was No #caller found on stack for sys:sequential @ 066-many.kml, line: 35 on NMI's x85_slf_3 platform, again in 066-many. These all have the feel of something racy in Karajan, and are hopefully relatively easy to reproduce (i.e. run 066-many tens of times and you should get one). On the jdk 1.4 test that ran last night, 3 out of the 13 runs that have completed as of my writing this mail failed. That's over 20% which is an uncomfortably high failure rate. -- From benc at hawaga.org.uk Sun Mar 22 07:38:13 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 22 Mar 2009 12:38:13 +0000 (GMT) Subject: [Swift-devel] scalability updates In-Reply-To: <1236630300.16421.17.camel@localhost> References: <1236630300.16421.17.camel@localhost> Message-ID: On Mon, 9 Mar 2009, Mihael Hategan wrote: > 2. A lazy range function (the [x:y] operator). The previous one was > silly. Simply writing [0:1000000] would cause swift to run out of memory > because it was trying to create a swift array with 1000000 elements > before running a single iteration on it. This (r2674) breaks some array-related functionality - tests/language-behaviour/1101-array-range added in r2726 works with Swift 0.8 (and also in my head) but does not work with Swift r2522.
Backing out r2674 (which I did in r2725) makes this work again (bringing back more correct behaviour but losing the scalability introduced there). r2674 also makes exceptions appear in the provenance logging lines. I think both of the above problems are to do with the messy internal numerical datatype handling - in the log lines that are showing exceptions, paths are rendered as [1.0] (or [n.0] in general) rather than [1] or [n]. I guess whoever sorts this out and recommits first gets a cookie. I also started playing with a test to check for exception stack traces in the log files for any of the language-behaviour tests, on the assumption that when running locally there should never be stack traces in the log file; however it discovered some other places unrelated to the above where the tests have been getting exceptions for some time (they're present in 0.8). So I have not committed that until I've looked at those cases and tried to resolve them. -- From benc at hawaga.org.uk Sun Mar 22 08:31:17 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 22 Mar 2009 13:31:17 +0000 (GMT) Subject: [Swift-devel] Re: lots of variable not found In-Reply-To: References: Message-ID: On Sun, 22 Mar 2009, Ben Clifford wrote: > On the jdk 1.4 test that ran last night, 3 out of the 13 runs that have > completed as of my writing this mail failed. That's over 20% which is an > uncomfortably high failure rate. and for that set of tests, the final outcome is 7 out of 18 (39%) failed one way or another. On this page, the ones that are Failed(-32) failed due to an nmi timeout (which in two cases appears to be swift hanging, giving a long sequence of progress tickers not advancing; in the other two cases may be because I have nmi timeouts set too low - I'll increase them); the ones that are Failed(3) or Failed(4) failed because Swift hit an error during testing. http://nmi-s005.cs.wisc.edu:80/nmi/index.php?page=results%2FrunDetails&runid=140127&rows=200 ick.
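The failure rates quoted in this thread, and the suggestion to run 066-many tens of times, can be sketched in a few lines. This is only an illustration: `failure_rate`, `rerun`, and the stand-in `flaky` callable are hypothetical names, not part of Swift or the NMI harness, and the stand-in does not launch a real test.

```python
import random


def failure_rate(failed, total):
    """Return the fraction of runs that failed; total must be positive."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def rerun(test, times):
    """Call test() `times` times and return (failures, times).

    `test` is any zero-argument callable returning True on success.
    An exception raised by the test is counted as a failure.
    """
    failures = 0
    for _ in range(times):
        try:
            ok = test()
        except Exception:
            ok = False
        if not ok:
            failures += 1
    return failures, times


if __name__ == "__main__":
    # Figures quoted in the thread: 3 of 13 runs, later 7 of 18 runs.
    print(f"{failure_rate(3, 13):.1%}")  # 23.1%
    print(f"{failure_rate(7, 18):.1%}")  # 38.9%

    # Hypothetical stand-in for a racy test; a real harness would
    # launch the test process here and check its exit status.
    rng = random.Random(0)

    def flaky():
        return rng.random() > 0.3

    failed, total = rerun(flaky, 50)
    print(f"{failed} of {total} stand-in runs failed")
```

Counting an exception as a failure keeps one crashed run from aborting the whole batch, which matters when the point is to measure how often the race shows up.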
-- From benc at hawaga.org.uk Sun Mar 22 09:18:59 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 22 Mar 2009 14:18:59 +0000 (GMT) Subject: [Swift-devel] Re: lots of variable not found In-Reply-To: References: Message-ID: I just moved the nmi per-commit build to another platform and on the first build there I saw this: java.util.ConcurrentModificationException at java.util.HashMap$HashIterator.nextEntry(HashMap.java:841) at java.util.HashMap$KeyIterator.next(HashMap.java:877) at org.globus.cog.karajan.arguments.ArgUtil.initializeChannelBuffers(ArgUtil.java:191) at org.globus.cog.karajan.workflow.nodes.Parallel.initializeChannelBuffers(Parallel.java:57) at org.globus.cog.karajan.workflow.nodes.Parallel.executeChildren(Parallel.java:36) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) running 066-many with swift r2729 and cog r2333. The 066-many test does not seem to be doing very well these days. I'm running that test a few more times to see how reproducible it is. 
http://nmi-s005.cs.wisc.edu:80/nmi/index.php?page=results/runDetails&runid=140151&opt_project=swift -- From hategan at mcs.anl.gov Sun Mar 22 10:17:07 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 22 Mar 2009 10:17:07 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> Message-ID: <1237735027.23923.1.camel@localhost> On Sun, 2009-03-22 at 12:38 +0000, Ben Clifford wrote: > On Mon, 9 Mar 2009, Mihael Hategan wrote: > > > 2. A lazy range function (the [x:y] operator). The previous one was > > silly. Simply writing [0:1000000] would cause swift to run out of memory > > because it was trying to create a swift array with 1000000 elements > > before running a single iteration on it. > > This (r2674) breaks some array-related functionality - > tests/language-behaviour/1101-array-range added in r2726 works with Swift > 0.8 (and also in my head) but does not work with Swift r2522. Backing out > r2674 (which I did in r2725) makes this work again Stop backing things out in trunk if they break some obscure test! We try to fix them first! 
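The memory argument for a lazy [x:y] operator can be illustrated outside Swift. The following is a minimal Python sketch of the general technique, not of how r2674 implements it; `eager_range` and `lazy_range` are hypothetical names, and the inclusive upper bound is an assumption about the operator's semantics.

```python
import itertools


def eager_range(start, stop):
    """Build every element up front (assumed inclusive of stop)."""
    return [i for i in range(start, stop + 1)]


def lazy_range(start, stop):
    """Yield elements one at a time; nothing is stored between steps."""
    i = start
    while i <= stop:
        # Integer arithmetic keeps each index an int (1, not 1.0).
        yield i
        i += 1


if __name__ == "__main__":
    # Only the first three values are computed, although the range
    # nominally covers 0 through 1000000.
    first = list(itertools.islice(lazy_range(0, 1000000), 3))
    print(first)  # [0, 1, 2]
```

A consumer that walks the range once never needs the full array, so memory stays constant however large the upper bound is.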
From bugzilla-daemon at mcs.anl.gov Sun Mar 22 11:34:10 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sun, 22 Mar 2009 11:34:10 -0500 (CDT) Subject: [Swift-devel] [Bug 186] New: File-not-found errors in swift log should be sent to stdout/err Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=186 Summary: File-not-found errors in swift log should be sent to stdout/err Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Keywords: error-handling Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: wilde at mcs.anl.gov In a case when the files that a dataset is mapped to are not found, the details of this error seem to be left buried in the .log file, and what shows up on stdout/err is the much more cryptic error that the app did not produce an expected output file. It's not yet clear to me if swift even got to the point of executing the app, though. I get this on stdout/err: -- Swift svn swift-r2724 (swift modified locally) cog-r2333 RunID: 20090322-1102-rveraq3f Progress: Failed to transfer wrapper log from t1-20090322-1102-rveraq3f/info/m on localhost Execution failed: java.io.FileNotFoundException: _concurrent/results-c6c862ba-4992-4726-b193-c92753858e0e-7 (No such file or directory) sur$ -- Yet the log is filled with errors, like these, which would have more immediately told me what was wrong: -- sur$ ./checklog Errors found in swift log t1-20090322-1102-rveraq3f.log: ( -rw-rw-r-- 1 wilde users 168177 Mar 22 11:02 t1-20090322-1102-rveraq3f.log ): 2009-03-22 11:02:29,988-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-15-1237737749283) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/03/T1af7.0000.0003.rmsd 2009-03-22 11:02:29,997-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-6-1237737749285) setting status to Failed File not found:
/gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/00/T1af7.0000.0000.rmsd 2009-03-22 11:02:30,000-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-7-1237737749288) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/01/T1af7.0000.0001.log 2009-03-22 11:02:30,009-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-8-1237737749292) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/01/T1af7.0000.0001.pdt 2009-03-22 11:02:30,012-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-15-1237737749297) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/03/T1af7.0000.0003.rmsd 2009-03-22 11:02:30,015-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-9-1237737749303) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/01/T1af7.0000.0001.rmsd 2009-03-22 11:02:30,018-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-5-1237737749300) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/00/T1af7.0000.0000.pdt 2009-03-22 11:02:30,026-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-1-1237737749305) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/04/T1af7.0000.0004.log 2009-03-22 11:02:30,027-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-4-1237737749307) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/00/T1af7.0000.0000.log 2009-03-22 11:02:30,030-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-6-1237737749309) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/00/T1af7.0000.0000.rmsd 2009-03-22 11:02:30,031-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-12-1237737749290) setting status to Failed 
File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/02/T1af7.0000.0002.rmsd 2009-03-22 11:02:30,041-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-14-1237737749315) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/03/T1af7.0000.0003.pdt 2009-03-22 11:02:30,039-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-13-1237737749312) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/03/T1af7.0000.0003.log 2009-03-22 11:02:30,041-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-10-1237737749321) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/02/T1af7.0000.0002.log 2009-03-22 11:02:30,044-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-2-1237737749319) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/04/T1af7.0000.0004.pdt 2009-03-22 11:02:30,047-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-15-1237737749326) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/03/T1af7.0000.0003.rmsd 2009-03-22 11:02:30,049-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-3-1237737749323) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/04/T1af7.0000.0004.rmsd 2009-03-22 11:02:30,053-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-1-1237737749335) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/04/T1af7.0000.0004.log 2009-03-22 11:02:30,053-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-5-1237737749333) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/00/T1af7.0000.0000.pdt 2009-03-22 11:02:30,056-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-6-1237737749337) setting 
status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/00/T1af7.0000.0000.rmsd 2009-03-22 11:02:30,056-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-12-1237737749341) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/02/T1af7.0000.0002.rmsd 2009-03-22 11:02:30,057-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-4-1237737749339) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/00/T1af7.0000.0000.log 2009-03-22 11:02:30,060-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-8-1237737749350) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/01/T1af7.0000.0001.pdt 2009-03-22 11:02:30,060-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-10-1237737749353) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/02/T1af7.0000.0002.log 2009-03-22 11:02:30,060-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-7-1237737749330) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/01/T1af7.0000.0001.log 2009-03-22 11:02:30,063-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-11-1237737749344) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/02/T1af7.0000.0002.pdt 2009-03-22 11:02:30,064-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-2-1237737749355) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/04/T1af7.0000.0004.pdt 2009-03-22 11:02:30,067-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-3-1237737749357) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/04/T1af7.0000.0004.rmsd 2009-03-22 11:02:30,072-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, 
identity=urn:0-7-0-1-5-1237737749370) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/00/T1af7.0000.0000.pdt 2009-03-22 11:02:30,073-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-1-1237737749364) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/04/T1af7.0000.0004.log 2009-03-22 11:02:30,073-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-12-1237737749368) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/02/T1af7.0000.0002.rmsd 2009-03-22 11:02:30,076-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-9-1237737749379) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/01/T1af7.0000.0001.rmsd 2009-03-22 11:02:30,076-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-13-1237737749366) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/03/T1af7.0000.0003.log 2009-03-22 11:02:30,077-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-4-1237737749372) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/00/T1af7.0000.0000.log 2009-03-22 11:02:30,082-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-3-1237737749386) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/04/T1af7.0000.0004.rmsd 2009-03-22 11:02:30,083-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-8-1237737749381) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/01/T1af7.0000.0001.pdt 2009-03-22 11:02:30,084-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-7-1237737749391) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/01/T1af7.0000.0001.log 2009-03-22 11:02:30,083-0500 DEBUG TaskImpl 
Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-1237737749398) setting status to Failed File not found: /home/wilde/oops/swift/work/t1-20090322-1102-rveraq3f/jobs/m/analyze_round-m9us4b8j/stderr.txt 2009-03-22 11:02:30,085-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-9-1237737749400) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/01/T1af7.0000.0001.rmsd 2009-03-22 11:02:30,086-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-2-1237737749395) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/04/T1af7.0000.0004.pdt 2009-03-22 11:02:30,087-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-11-1237737749402) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/02/T1af7.0000.0002.pdt 2009-03-22 11:02:30,090-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-7-0-1-1237737749414) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: analyze_round-m9us4b8j-stderr.txt not found. 
2009-03-22 11:02:30,090-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-10-1237737749404) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/02/T1af7.0000.0002.log 2009-03-22 11:02:30,091-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-13-1237737749406) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/03/T1af7.0000.0003.log 2009-03-22 11:02:30,093-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-11-1237737749417) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/02/T1af7.0000.0002.pdt 2009-03-22 11:02:30,095-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-14-1237737749384) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/03/T1af7.0000.0003.pdt 2009-03-22 11:02:30,095-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-1237737749421) setting status to Failed File not found: /home/wilde/oops/swift/work/t1-20090322-1102-rveraq3f/jobs/m/analyze_round-m9us4b8j/_concurrent/results-c6c862ba-4992-4726-b193-c92753858e0e-7 2009-03-22 11:02:30,097-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-7-0-1-1237737749426) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: analyze_round-m9us4b8j-_concurrent/results-c6c862ba-4992-4726-b193-c92753858e0e-7 not found. 
2009-03-22 11:02:30,099-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-14-1237737749428) setting status to Failed File not found: /gpfs/home/wilde/oops/swift/./out.n7b/T1af7/0000/00/03/T1af7.0000.0003.pdt 2009-03-22 11:02:30,105-0500 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-7-0-1-1237737749435) setting status to Failed File not found: /home/wilde/oops/swift/work/t1-20090322-1102-rveraq3f/info/m/analyze_round-m9us4b8j-info 2009-03-22 11:02:30,107-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-7-0-1-1237737749438) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: analyze_round-m9us4b8j-info not found. 2009-03-22 11:02:30,108-0500 WARN vdl:transferwrapperlog Failed to transfer wrapper log from t1-20090322-1102-rveraq3f/info/m on localhost 2009-03-22 11:02:30,108-0500 DEBUG vdl:transferwrapperlog Exception for wrapper log failure from t1-20090322-1102-rveraq3f/info/m on localhost: File not found: /home/wilde/oops/swift/work/t1-20090322-1102-rveraq3f/info/m/analyze_round-m9us4b8j-info 2009-03-22 11:02:30,113-0500 INFO vdl:execute END_FAILURE thread=0-7-0 tr=analyze_round 2009-03-22 11:02:30,117-0500 INFO SetFutureFault Failing org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090322-1102-3yoo96q3:720000000020 type file with no value at dataset=results (not closed) (mapping=false) 2009-03-22 11:02:30,121-0500 INFO SetFutureFault Failing org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20090322-1102-3yoo96q3:720000000018 type SecSeq with no value at dataset=nsecseq (not closed) (mapping=false) 2009-03-22 11:02:30,140-0500 DEBUG Loader Swift finished with errors sur$ -- Possibly relevant, my properties were: lazy.errors=true wrapperlog.always.transfer=true # remove all limits on job submit rates throttle.submit=off throttle.host.submit=off throttle.score.job.factor=off # set data transfer and data 
management rate limits very high
throttle.transfers=1000
throttle.file.operations=1000

# Keep the workflow work directories intact on the execution sites
sitedir.keep=true

# Don't retry any job failures (while we are debugging. for production =2 is better)
execution.retries=0
sur$

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Sun Mar 22 14:13:02 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 22 Mar 2009 14:13:02 -0500 (CDT)
Subject: [Swift-devel] [Bug 187] New: Invalid mappings returned by an ext mapper are silently ignored
Message-ID:

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=187

Summary: Invalid mappings returned by an ext mapper are silently ignored
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Keywords: error-handling
Severity: normal
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov

When an ext mapper returns erroneous results of the form:

sur$ ./OOPSOutPdt.map.sh -d out.n12 -p T1af7 -r 1 -s 1000 -t "" -u ""
[0].pdt out.n12/T1af7/0000/00/00/T1af7.0000.0000.pdt
[1].pdt out.n12/T1af7/0000/00/01/T1af7.0000.0001.pdt
[2].pdt out.n12/T1af7/0000/00/02/T1af7.0000.0002.pdt

when it should be returning, e.g.:

[0] out.n12/T1af7/0000/00/00/T1af7.0000.0000.pdt
[2] out.n12/T1af7/0000/00/00/T1af7.0000.0000.pdt
[2] out.n12/T1af7/0000/00/00/T1af7.0000.0000.pdt

then it seems that no error is flagged in either output or the log file, and the script hangs.

I would have expected something like:

"mapping returned by ext mapper OOPSOutPdt.map.sh for variable resultpdt[] is for element not valid for target variable: returned value was: [0].pdt out.n12/T1af7/0000/00/00/T1af7.0000.0000.pdt"

The log and swift test script are attached.
From wilde at mcs.anl.gov Sun Mar 22 14:15:46 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 22 Mar 2009 14:15:46 -0500
Subject: [Swift-devel] Can't add attachments to swift bugzilla
Message-ID: <49C68E62.6050805@mcs.anl.gov>

When I try to add an attachment I get:

"The bug was created successfully, but attachment creation failed. Please add your attachment by clicking the "Add an Attachment" link below."

The link suggested gives the same error.

Has anyone reported this to Systems yet? Or know of a workaround?

From bugzilla-daemon at mcs.anl.gov Sun Mar 22 14:58:27 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 22 Mar 2009 14:58:27 -0500 (CDT)
Subject: [Swift-devel] [Bug 188] New: "Cache already contains..." error should be printed on stdout/err
Message-ID:

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=188

Summary: "Cache already contains..." error should be printed on stdout/err
Product: Swift
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Keywords: error-handling
Severity: normal
Priority: P2
Component: SwiftScript language
AssignedTo: benc at hawaga.org.uk
ReportedBy: wilde at mcs.anl.gov

The message:

2009-03-22 14:49:24,712-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=runrama-eecudb8j - Application exception: The cache already contains bgp003:oops-20090322-1448-xh2ef4hg/shared/out.n13/T1af7/0000/00/00/T1af7.0000.0000.pdt.

indicates that an app is trying to produce a file name that a prior app in the same script already produced.

This message could perhaps be worded more clearly, but more importantly should be sent to stdout/err. (I keep saying out/err because I can never recall which of the two swift errors are sent to...)
Also - not sure it's related: I also get these errors on stdout/err in these cases. I was not sure if these are some kind of Falkon interaction or these are coming from Swift:

Could not find task for jobID urn:0-1-1-15-1-5-1-0-1-1237750894565
Could not find task for jobID urn:0-1-1-15-1-5-5-0-1-1237750894567

These sound more like debug messages that should go to the log. Or do they indicate a related error?

--

The stdout/err file contained:

Swift svn swift-r2724 (swift modified locally) cog-r2333

RunID: 20090322-1448-xh2ef4hg
Progress:
SwiftScript trace: T1af7, Round, 0, Sim, 0, StartTemp, , TempUpdate,
Progress: Selecting site:1 Stage out:1
Progress: Stage in:1 Finished successfully:1
Waiting for notification for 0 ms
Received notification with 2 messages
Progress: Submitted:1 Finished successfully:2
Could not find task for jobID urn:0-1-1-15-1-5-1-0-1-1237750894565
Could not find task for jobID urn:0-1-1-15-1-5-5-0-1-1237750894567
Progress: Submitted:1 Finished successfully:2
Waiting for notification for 0 ms
Received notification with 1 messages
Progress: Active:1 Finished successfully:2
Execution failed:
java.io.FileNotFoundException: _concurrent/results-c8f0277b-8412-41c8-916e-57f1510f7863-1-0-15-0-3 (No such file or directory)
Command exited with non-zero status 2

From benc at hawaga.org.uk Sun Mar 22 16:00:43 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 22 Mar 2009 21:00:43 +0000 (GMT)
Subject: [Swift-devel] Can't add attachments to swift bugzilla
In-Reply-To: <49C68E62.6050805@mcs.anl.gov>
References: <49C68E62.6050805@mcs.anl.gov>
Message-ID:

On Sun, 22 Mar 2009, Michael Wilde wrote:

> When I try to add an attachment I get:
>
> "The bug was created successfully, but attachment creation failed.
Please add > your attachment by clicking the "Add an Attachment" link below." > > The link suggested gives the same error. > > Has anyone reported this to Systems yet? Or know of a workaround? Last I knew (was probably several years ago) the MCS bugzillas wouldn't take attachments at all because they were being used as a cheap way to host other material. The workaround for that was/is to host your attachment somewhere else and paste in a URL. -- From benc at hawaga.org.uk Sun Mar 22 16:01:16 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 22 Mar 2009 21:01:16 +0000 (GMT) Subject: [Swift-devel] scalability updates In-Reply-To: <1237735027.23923.1.camel@localhost> References: <1236630300.16421.17.camel@localhost> <1237735027.23923.1.camel@localhost> Message-ID: On Sun, 22 Mar 2009, Mihael Hategan wrote: > Stop backing things out in trunk if they break some obscure test! We try > to fix them first! Are you using the management-we there? -- From hategan at mcs.anl.gov Sun Mar 22 16:05:27 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 22 Mar 2009 16:05:27 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> <1237735027.23923.1.camel@localhost> Message-ID: <1237755927.17314.1.camel@localhost> On Sun, 2009-03-22 at 21:01 +0000, Ben Clifford wrote: > On Sun, 22 Mar 2009, Mihael Hategan wrote: > > > Stop backing things out in trunk if they break some obscure test! We try > > to fix them first! > > Are you using the management-we there? No. I'm using the "try to see what's wrong or let me see what's wrong before reverting" "we". 
From benc at hawaga.org.uk Sun Mar 22 16:19:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 22 Mar 2009 21:19:05 +0000 (GMT)
Subject: [Swift-devel] scalability updates
In-Reply-To: <1237755927.17314.1.camel@localhost>
References: <1236630300.16421.17.camel@localhost> <1237735027.23923.1.camel@localhost> <1237755927.17314.1.camel@localhost>
Message-ID:

On Sun, 22 Mar 2009, Mihael Hategan wrote:

> I'm using the "try to see what's wrong or let me see what's wrong before
> reverting" "we".

Reverting is no big deal, just like unreverting at fixup time is also no big deal.

--

From wilde at mcs.anl.gov Sun Mar 22 16:28:12 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 22 Mar 2009 16:28:12 -0500
Subject: [Swift-devel] problems with external dependencies
Message-ID: <49C6AD6C.8020509@mcs.anl.gov>

I'm trying to adapt the example of external dependencies in the user guide to enable me to wait for a procedure that does a lot of parallel work to complete, and then analyze its results.

I'm trying this approach, instead of ordinary dependencies, because the list of datasets produced in the parallel operations exceeds the command line length limits of wrapper.sh.

So I'm trying to use externals to indicate that the parallel work has completed, then call a local analysis script with just one item from the parallel outputs, and this script will then locate all the other outputs with similar names by perusing a local directory passed as a string. (There are a few more elegant variants on this, but this simple hack will suffice for the current need.)
Problem is that in the following test example to see if this will work, the script does not seem to wait for the echo app invocations to complete before running the final trace() function:

type file;

app (file o) echo (int i) { echo i stdout=@o; }

(external o) populateDatabase() {
    trace("I am populateDatabase");
    int j[] = [0:9];
    file r[];
    foreach i in j {
        r[i] = echo(i);
    }
}

analyseDatabase(external i) {
    trace("i am analyseDatabase");
}

external database;

database = populateDatabase();
analyseDatabase(database);

---

Instead, I get:

Swift svn swift-r2724 (swift modified locally) cog-r2333

RunID: 20090322-1553-ltauz86a
Progress:
SwiftScript trace: i am analyseDatabase
SwiftScript trace: I am populateDatabase
Progress: Selecting site:7 Stage in:1 Finished successfully:2
Progress: Selecting site:2 Stage in:1 Active:1 Finished successfully:6
Final status: Finished successfully:10
sur$

---
while I was expecting:

SwiftScript trace: i am analyseDatabase
SwiftScript trace: I am populateDatabase
Progress: Selecting site:7 Stage in:1 Finished successfully:2
Progress: Selecting site:2 Stage in:1 Active:1 Finished successfully:6
Final status: Finished successfully:10
sur$

--

I've been trying the same with an array of externals, to no avail (yet).

Can you offer any guidance on how to do this, or whether it's possible with the current implementation of external?

From wilde at mcs.anl.gov Sun Mar 22 16:30:14 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 22 Mar 2009 16:30:14 -0500 Subject: [Swift-devel] problems with external dependencies In-Reply-To: <49C6AD6C.8020509@mcs.anl.gov> References: <49C6AD6C.8020509@mcs.anl.gov> Message-ID: <49C6ADE6.4060300@mcs.anl.gov> I meant I was expecting: SwiftScript trace: I am populateDatabase Progress: Selecting site:7 Stage in:1 Finished successfully:2 Progress: Selecting site:2 Stage in:1 Active:1 Finished successfully:6 SwiftScript trace: i am analyseDatabase Final status: Finished successfully:10 sur$ On 3/22/09 4:28 PM, Michael Wilde wrote: > Im trying to adapt the example of external dependencies in the user > guide to enable me to wait for a procedure that does lot of parallel > work to complete, and then analyze its results. > > Im trying this approach, instead of ordinary dependencies, because the > list of datasets produced in the parallel operations exceeds the command > line length limits of wrapper.sh. > > So Im trying to use externals to indicate that the parallel work has > completed, then call a local analysis script with just one item from the > parallel outputs, and this script will then locate all the other outputs > with similar names by perusing a local directory passed as a string. > (theres a few more elegant variants on this, but this simple hack will > suffice for the current need). 
> > Problem is that in the following test example to see if this will work, > the script does not seem to wait for the echo app invocations to > complete before running the final trace() function: > > type file; > > app (file o) echo (int i) { echo i stdout=@o; } > > (external o) populateDatabase() { > trace("I am populateDatabase"); > int j[] = [0:9]; > file r[]; > foreach i in j { > r[i] = echo(i); > } > } > > analyseDatabase(external i) { > trace("i am analyseDatabase"); > } > > external database; > > database = populateDatabase(); > analyseDatabase(database); > > --- > > Instead, I get: > > Swift svn swift-r2724 (swift modified locally) cog-r2333 > > RunID: 20090322-1553-ltauz86a > Progress: > SwiftScript trace: i am analyseDatabase > SwiftScript trace: I am populateDatabase > Progress: Selecting site:7 Stage in:1 Finished successfully:2 > Progress: Selecting site:2 Stage in:1 Active:1 Finished successfully:6 > Final status: Finished successfully:10 > sur$ > > --- > while I was expecting: > > > SwiftScript trace: i am analyseDatabase > SwiftScript trace: I am populateDatabase > Progress: Selecting site:7 Stage in:1 Finished successfully:2 > Progress: Selecting site:2 Stage in:1 Active:1 Finished successfully:6 > Final status: Finished successfully:10 > sur$ > > > -- > > Ive been trying the same with an array of externals, to no avail (yet). > > Can you offer any guidance on how to do this, or whether its possible > with the current implementation of external? 
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Sun Mar 22 16:39:56 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 22 Mar 2009 16:39:56 -0500 Subject: [Swift-devel] scalability updates In-Reply-To: References: <1236630300.16421.17.camel@localhost> <1237735027.23923.1.camel@localhost> <1237755927.17314.1.camel@localhost> Message-ID: <1237757996.17975.1.camel@localhost> On Sun, 2009-03-22 at 21:19 +0000, Ben Clifford wrote: > On Sun, 22 Mar 2009, Mihael Hategan wrote: > > > I'm using the "try to see what's wrong or let me see what's wrong before > > reverting" "we". > > Reverting is no big deal, just like unreverting at fixup time is also no > big deal. > Well, it's additional and pointless effort. From benc at hawaga.org.uk Sun Mar 22 16:47:45 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 22 Mar 2009 21:47:45 +0000 (GMT) Subject: [Swift-devel] problems with external dependencies In-Reply-To: <49C6AD6C.8020509@mcs.anl.gov> References: <49C6AD6C.8020509@mcs.anl.gov> Message-ID: As far as I can tell from a brief poke around, this is what is happening for you: Compound procedures do not themselves wait for their input parameters to all be ready to use. instead, they start trying to run all component pieces. If some data necessary for some component piece is not ready yet, that component piece will wait, so the compound procedure doesn't need to (and indeed shouldn't, because that reduces potential parallelism in some cases) You say this: analyseDatabase(external i) { trace("i am analyseDatabase"); } The trace call does not have any need to wait for i to be ready. So it doesn't wait for i to be ready. 
If you say this:

analyseDatabase(external i) {
    trace("i am analyseDatabase", i);
}

then the trace call must wait for i to be ready (and fortuitously in the present implementation doesn't explode even though i cannot be meaningfully traced).

With that change, you'll see the behaviour you want.

--

From bugzilla-daemon at mcs.anl.gov Sun Mar 22 17:06:44 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Sun, 22 Mar 2009 17:06:44 -0500 (CDT)
Subject: [Swift-devel] [Bug 188] "Cache already contains..." error should be printed on stdout/err
In-Reply-To:
References:
Message-ID: <20090322220644.621D82CDE9@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=188

--- Comment #1 from Ben Clifford 2009-03-22 17:06:44 ---
'Could not find task for ' messages are likely coming from ./src/org/globus/cog/abstraction/impl/execution/deef/StatusThread.java in provider-deef, not from Swift - that string does not exist in Swift head at the moment, and exists in provider-deef in the above named file.

From foster at anl.gov Sun Mar 22 18:35:03 2009
From: foster at anl.gov (Ian Foster)
Date: Sun, 22 Mar 2009 18:35:03 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To:
References: <49C6AD6C.8020509@mcs.anl.gov>
Message-ID:

Ben:

I can't recall whether we talked about this before, but if we could choose to run in a mode whereby compound procedures wait for all input parameters, that could simplify debugging. But maybe the semantics are now rich enough that this would not necessarily be correct.

Ian.

On Mar 22, 2009, at 4:47 PM, Ben Clifford wrote:

>
> As far as I can tell from a brief poke around, this is what is happening
> for you:
>
> Compound procedures do not themselves wait for their input parameters
> to all be ready to use. instead, they start trying to run all component
> pieces.
>
> If some data necessary for some component piece is not ready yet, that
> component piece will wait, so the compound procedure doesn't need to (and
> indeed shouldn't, because that reduces potential parallelism in some
> cases)
>
> You say this:
>
> analyseDatabase(external i) {
> trace("i am analyseDatabase");
> }
>
> The trace call does not have any need to wait for i to be ready. So it
> doesn't wait for i to be ready.
>
> If you say this:
>
> analyseDatabase(external i) {
> trace("i am analyseDatabase", i);
> }
>
> then the trace call must wait for i to be ready (and fortuitously in the
> present implementation doesn't explode even though i cannot be
> meaningfully traced).
>
> With that change, you'll see the behaviour you want.
>
> --

From wilde at mcs.anl.gov Sun Mar 22 19:38:31 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 22 Mar 2009 19:38:31 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To:
References: <49C6AD6C.8020509@mcs.anl.gov>
Message-ID: <49C6DA07.7030809@mcs.anl.gov>

I got the example from my previous email working using this technique (passing the external var to trace). But a script that simulates more closely what I really need to do is still eluding me.

In the real code, I need to wait till a set of nested procedures that involve nested foreach and iterate statements complete.
So I'm trying to create a simple simulation of the needed synchronization with the following script:

--
type file;

app (file o) echo (int i) { echo i stdout=@o; }

(file r[]) generate() {
    int j[] = [0:10];
    foreach i in j {
        r[i] = echo(i*i);
    }
}

(external w) wait(file dir[]) {
    trace("in wait: dir",dir);
}

app (file o) ls (string dir, external w) { ls "-l" dir stdout=@o; }

file datadir[];
datadir = generate();

external w1 = wait(datadir);

trace( "generate done", w1);

file out <"ls.out">;
out = ls("/home/wilde/oops/swift/datadir/", w1);
--

In this script the proc "generate()" simulates the production of the data directory. I want the proc "ls" which simulates the processing of the data directory, to wait until the directory is produced. As the directory has too many files to pass to "ls" as an array, I pass a string with the dir's path to ls, and want external vars to cause it to wait till the directory is complete.

But in the case above, returning the dataset (file array) "datadir" from generate() does not wait for the array to be "closed". Nor does passing it to wait(), nor does passing it by name to trace(). The script gives:

--
Swift svn swift-r2724 (swift modified locally) cog-r2333

RunID: 20090322-1922-o4ibjxac
Progress:

SwiftScript trace: in wait: dir, org.griphyn.vdl.karajan.FuturePairIterator@1e671e67

SwiftScript trace: generate done, null

Progress: Selecting site:8 Active:1 Stage out:1 Failed:1 Finished successfully:1
Progress: Selecting site:4 Active:1 Stage out:1 Failed:1 Finished successfully:5
Progress: Active:1 Stage out:1 Failed:1 Finished successfully:9
Final status: Failed:1 Finished successfully:11

The following errors have occurred:
1.
Application "ls" failed (Exit code 2) Arguments: "-l, /home/wilde/oops/swift/datadir/" Host: localhost Directory: ex5-20090322-1922-o4ibjxac/jobs/l/ls-ljbsob8j STDERR: /bin/ls: /home/wilde/oops/swift/datadir/: No such file or directory STDOUT: -- It seems there's only 2 kinds of constructs or behaviors that can give me this behavior, neither of which I can find a way to cause: - something that waits for the whole array to get its values - something that waits for an entire array of externals to all be set This note in the users guide suggests a possible way to do what I need: "Statements which deal with the array as a whole will often wait for the array to be closed before executing (thus, a closed array is the equivalent of a non-array type being assigned). However, a foreach statement will apply its body to elements of an array as they become known. It will not wait until the array is closed. What statement can I use to "wait for the array to be closed before executing"? On 3/22/09 4:47 PM, Ben Clifford wrote: > As far as I can tell from a brief poke around, this is what is happening > for you: > > Compound procedures do not themselves wait for their input parameters > to all be ready to use. instead, they start trying to run all component > pieces. > > If some data necessary for some component piece is not ready yet, that > component piece will wait, so the compound procedure doesn't need to (and > indeed shouldn't, because that reduces potential parallelism in some > cases) > > You say this: > > analyseDatabase(external i) { > trace("i am analyseDatabase"); > } > > The trace call does not have any need to wait for i to be ready. So it > doesn't wait for i to be ready. > > If you say this: > > analyseDatabase(external i) { > trace("i am analyseDatabase", i); > } > > then the trace call must wait for i to be ready (and fortuitously in the > present implementation doesn't explode even though i cannot be > meaningfully traced). 
> > With that change, you'll see the behaviour you want. > From wilde at mcs.anl.gov Sun Mar 22 20:07:26 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 22 Mar 2009 20:07:26 -0500 Subject: [Swift-devel] problems with external dependencies In-Reply-To: <49C6DA07.7030809@mcs.anl.gov> References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> Message-ID: <49C6E0CE.4010101@mcs.anl.gov> This approach looks promising: I generate an external token for each app I call, filling an array of external "wait" vars. Then I use iterate and trace to walk through the array, waiting to make sure that each element has "fired". If I can replace the trace() with something fast and silent, and this scales to a few thousand wait vars, I think it might work. What I have now is this script: type file; app (external w, file o) echo (int i) { echo i stdout=@o; } (external waits[], file r[]) generate() { int j[] = [0:10]; foreach i in j { (waits[i],r[i]) = echo(i*i); } } (external w) wait(external waits[]) { iterate i { trace("in wait: i", i, "wait", waits[i]); } until(i==9); // FIXME } app (file o) ls (string dir, external w) { ls "-l" dir stdout=@o; } file datadir[]; external waits[]; (waits,datadir) = generate(); external wf = wait(waits); trace( "generate and wait done", wf); file out <"ls.out">; out = ls("/home/wilde/oops/swift/datadir/", wf); -- which gives: Swift svn swift-r2724 (swift modified locally) cog-r2333 RunID: 20090322-2005-j3c5zk04 Progress: Progress: Selecting site:8 Stage in:1 Finished successfully:2 SwiftScript trace: in wait: i, 0, wait, null Progress: Selecting site:3 Stage in:1 Active:1 Finished successfully:6 SwiftScript trace: in wait: i, 1, wait, null SwiftScript trace: in wait: i, 2, wait, null SwiftScript trace: in wait: i, 3, wait, null SwiftScript trace: in wait: i, 4, wait, null SwiftScript trace: in wait: i, 5, wait, null SwiftScript trace: in wait: i, 6, wait, null SwiftScript trace: in wait: i, 7, wait, null SwiftScript trace: in 
wait: i, 8, wait, null SwiftScript trace: in wait: i, 9, wait, null getValue called in an external dataset getValue called in an external dataset SwiftScript trace: generate and wait done, null <<<<<<<< Progress: Stage out:1 Finished successfully:11 Final status: Finished successfully:12 -- and ls succeeds, showing all expected files in ls.out. On 3/22/09 7:38 PM, Michael Wilde wrote: > I got the example from my previous email working using this technique > (passing the external var to trace). But a script that simulates more > closely what I really need to do is still eluding me. > > In the real code, I need to wait till a set of nested procedures that > involve nested foreach and iterate statements complete. So Im trying > create a simple simulation of the needed synchronization with the > following script: > > -- > type file; > > app (file o) echo (int i) { echo i stdout=@o; } > > (file r[]) generate() { > int j[] = [0:10]; > foreach i in j { > r[i] = echo(i*i); > } > } > > (external w) wait(file dir[]) { > trace("in wait: dir",dir); > } > > app (file o) ls (string dir, external w) { ls "-l" dir stdout=@o; } > > file datadir[]; > datadir = generate(); > > external w1 = wait(datadir); > > trace( "generate done", w1); > > file out <"ls.out">; > out = ls("/home/wilde/oops/swift/datadir/", w1); > -- > > In this script the proc "generate()" simulates the production of the > data directory. I want the proc "ls" which simulates the processing of > the data directory, to wait until the directory is produced. As the > directory has too many files to pass to "ls" as an array, I pass a > string with the dir's path to ls, and want external vars to cause it to > wait till the directory is complete. > > But in the case above, returning the dataset (file array) "datadir" from > generate() does not wait for the array to be "closed". Nor does passing > it to wait(), nor does passing it by name to trace(). 
The script gives: > > -- > Swift svn swift-r2724 (swift modified locally) cog-r2333 > > RunID: 20090322-1922-o4ibjxac > Progress: > > SwiftScript trace: in wait: dir, > org.griphyn.vdl.karajan.FuturePairIterator at 1e671e67 > > SwiftScript trace: generate done, null > > Progress: Selecting site:8 Active:1 Stage out:1 Failed:1 Finished > successfully:1 > Progress: Selecting site:4 Active:1 Stage out:1 Failed:1 Finished > successfully:5 > Progress: Active:1 Stage out:1 Failed:1 Finished successfully:9 > Final status: Failed:1 Finished successfully:11 > > The following errors have occurred: > 1. Application "ls" failed (Exit code 2) > Arguments: "-l, /home/wilde/oops/swift/datadir/" > Host: localhost > Directory: ex5-20090322-1922-o4ibjxac/jobs/l/ls-ljbsob8j > STDERR: /bin/ls: /home/wilde/oops/swift/datadir/: No such file > or directory > STDOUT: > -- > > > It seems there's only 2 kinds of constructs or behaviors that can give > me this behavior, neither of which I can find a way to cause: > - something that waits for the whole array to get its values > - something that waits for an entire array of externals to all be set > > This note in the users guide suggests a possible way to do what I need: > > "Statements which deal with the array as a whole will often wait for the > array to be closed before executing (thus, a closed array is the > equivalent of a non-array type being assigned). However, a foreach > statement will apply its body to elements of an array as they become > known. It will not wait until the array is closed. > > What statement can I use to "wait for the array to be closed before > executing"? > > > On 3/22/09 4:47 PM, Ben Clifford wrote: >> As far as I can tell from a brief poke around, this is what is >> happening for you: >> >> Compound procedures do not themselves wait for their input parameters >> to all be ready to use. instead, they start trying to run all >> component pieces. 
>> >> If some data necessary for some component piece is not ready yet, that >> component piece will wait, so the compound procedure doesn't need to >> (and indeed shouldn't, because that reduces potential parallelism in >> some cases) >> >> You say this: >> >> analyseDatabase(external i) >> { trace("i am >> analyseDatabase"); } >> The trace call does not have any need to wait for i to be ready. So it >> doesn't wait for i to be ready. >> >> If you say this: >> >> analyseDatabase(external i) { >> trace("i am analyseDatabase", i); >> } >> >> then the trace call must wait for i to be ready (and fortuitously in >> the present implementation doesn't explode even though i cannot be >> meaningfully traced). >> >> With that change, you'll see the behaviour you want. >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Sun Mar 22 20:16:00 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 22 Mar 2009 20:16:00 -0500 Subject: [Swift-devel] problems with external dependencies In-Reply-To: <49C6E0CE.4010101@mcs.anl.gov> References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <49C6E0CE.4010101@mcs.anl.gov> Message-ID: <49C6E2D0.5040601@mcs.anl.gov> This works, to replace trace(): (external w) wait(external waits[]) { external val[]; iterate i { val[i] = waits[i]; } until(i==9); } Apparently I can assign a external value to an array of them. The assignment presumably waits until the source external value is set. On 3/22/09 8:07 PM, Michael Wilde wrote: > This approach looks promising: I generate an external token for each app > I call, filling an array of external "wait" vars. Then I use iterate > and trace to walk through the array, waiting to make sure that each > element has "fired". 
> > If I can replace the trace() with something fast and silent, and this > scales to a few thousand wait vars, I think it might work. > > What I have now is this script: > > type file; > > app (external w, file o) echo (int i) { echo i stdout=@o; } > > (external waits[], file r[]) generate() { > int j[] = [0:10]; > foreach i in j { > (waits[i],r[i]) = echo(i*i); > } > } > > (external w) wait(external waits[]) { > iterate i { > trace("in wait: i", i, "wait", waits[i]); > } until(i==9); // FIXME > } > > app (file o) ls (string dir, external w) { ls "-l" dir stdout=@o; } > > file datadir[]; > external waits[]; > > (waits,datadir) = generate(); > > external wf = wait(waits); > > trace( "generate and wait done", wf); > > file out <"ls.out">; > out = ls("/home/wilde/oops/swift/datadir/", wf); > > -- > > which gives: > > Swift svn swift-r2724 (swift modified locally) cog-r2333 > > RunID: 20090322-2005-j3c5zk04 > Progress: > Progress: Selecting site:8 Stage in:1 Finished successfully:2 > SwiftScript trace: in wait: i, 0, wait, null > Progress: Selecting site:3 Stage in:1 Active:1 Finished successfully:6 > SwiftScript trace: in wait: i, 1, wait, null > SwiftScript trace: in wait: i, 2, wait, null > SwiftScript trace: in wait: i, 3, wait, null > SwiftScript trace: in wait: i, 4, wait, null > SwiftScript trace: in wait: i, 5, wait, null > SwiftScript trace: in wait: i, 6, wait, null > SwiftScript trace: in wait: i, 7, wait, null > SwiftScript trace: in wait: i, 8, wait, null > SwiftScript trace: in wait: i, 9, wait, null > getValue called in an external dataset > getValue called in an external dataset > SwiftScript trace: generate and wait done, null <<<<<<<< > Progress: Stage out:1 Finished successfully:11 > Final status: Finished successfully:12 > > -- > and ls succeeds, showing all expected files in ls.out. > > > On 3/22/09 7:38 PM, Michael Wilde wrote: >> I got the example from my previous email working using this technique >> (passing the external var to trace). 
But a script that simulates more >> closely what I really need to do is still eluding me. >> >> In the real code, I need to wait till a set of nested procedures that >> involve nested foreach and iterate statements complete. So Im trying >> create a simple simulation of the needed synchronization with the >> following script: >> >> -- >> type file; >> >> app (file o) echo (int i) { echo i stdout=@o; } >> >> (file r[]) generate() { >> int j[] = [0:10]; >> foreach i in j { >> r[i] = echo(i*i); >> } >> } >> >> (external w) wait(file dir[]) { >> trace("in wait: dir",dir); >> } >> >> app (file o) ls (string dir, external w) { ls "-l" dir stdout=@o; } >> >> file datadir[]; >> datadir = generate(); >> >> external w1 = wait(datadir); >> >> trace( "generate done", w1); >> >> file out <"ls.out">; >> out = ls("/home/wilde/oops/swift/datadir/", w1); >> -- >> >> In this script the proc "generate()" simulates the production of the >> data directory. I want the proc "ls" which simulates the processing of >> the data directory, to wait until the directory is produced. As the >> directory has too many files to pass to "ls" as an array, I pass a >> string with the dir's path to ls, and want external vars to cause it >> to wait till the directory is complete. >> >> But in the case above, returning the dataset (file array) "datadir" >> from generate() does not wait for the array to be "closed". Nor does >> passing it to wait(), nor does passing it by name to trace(). 
The >> script gives: >> >> -- >> Swift svn swift-r2724 (swift modified locally) cog-r2333 >> >> RunID: 20090322-1922-o4ibjxac >> Progress: >> >> SwiftScript trace: in wait: dir, >> org.griphyn.vdl.karajan.FuturePairIterator at 1e671e67 >> >> SwiftScript trace: generate done, null >> >> Progress: Selecting site:8 Active:1 Stage out:1 Failed:1 Finished >> successfully:1 >> Progress: Selecting site:4 Active:1 Stage out:1 Failed:1 Finished >> successfully:5 >> Progress: Active:1 Stage out:1 Failed:1 Finished successfully:9 >> Final status: Failed:1 Finished successfully:11 >> >> The following errors have occurred: >> 1. Application "ls" failed (Exit code 2) >> Arguments: "-l, /home/wilde/oops/swift/datadir/" >> Host: localhost >> Directory: ex5-20090322-1922-o4ibjxac/jobs/l/ls-ljbsob8j >> STDERR: /bin/ls: /home/wilde/oops/swift/datadir/: No such file >> or directory >> STDOUT: >> -- >> >> >> It seems there's only 2 kinds of constructs or behaviors that can give >> me this behavior, neither of which I can find a way to cause: >> - something that waits for the whole array to get its values >> - something that waits for an entire array of externals to all be set >> >> This note in the users guide suggests a possible way to do what I need: >> >> "Statements which deal with the array as a whole will often wait for >> the array to be closed before executing (thus, a closed array is the >> equivalent of a non-array type being assigned). However, a foreach >> statement will apply its body to elements of an array as they become >> known. It will not wait until the array is closed. >> >> What statement can I use to "wait for the array to be closed before >> executing"? >> >> >> On 3/22/09 4:47 PM, Ben Clifford wrote: >>> As far as I can tell from a brief poke around, this is what is >>> happening for you: >>> >>> Compound procedures do not themselves wait for their input parameters >>> to all be ready to use. instead, they start trying to run all >>> component pieces. 
>>> >>> If some data necessary for some component piece is not ready yet, >>> that component piece will wait, so the compound procedure doesn't >>> need to (and indeed shouldn't, because that reduces potential >>> parallelism in some cases) >>> >>> You say this: >>> >>> analyseDatabase(external i) >>> { trace("i am >>> analyseDatabase"); } >>> The trace call does not have any need to wait for i to be ready. So >>> it doesn't wait for i to be ready. >>> >>> If you say this: >>> >>> analyseDatabase(external i) { >>> trace("i am analyseDatabase", i); >>> } >>> >>> then the trace call must wait for i to be ready (and fortuitously in >>> the present implementation doesn't explode even though i cannot be >>> meaningfully traced). >>> >>> With that change, you'll see the behaviour you want. >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Sun Mar 22 22:14:08 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 22 Mar 2009 22:14:08 -0500 Subject: [Swift-devel] problems with external dependencies In-Reply-To: <49C6DA07.7030809@mcs.anl.gov> References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> Message-ID: <1237778048.11219.2.camel@localhost> I'm curious what you are doing with the directory after you wait for it. On a different note, should there be an array size function (e.g. size(array)), it would have to wait for the array to be closed before giving an accurate answer. On Sun, 2009-03-22 at 19:38 -0500, Michael Wilde wrote: > I got the example from my previous email working using this technique > (passing the external var to trace). But a script that simulates more > closely what I really need to do is still eluding me. 
> > In the real code, I need to wait till a set of nested procedures that > involve nested foreach and iterate statements complete. So Im trying > create a simple simulation of the needed synchronization with the > following script: > > -- > type file; > > app (file o) echo (int i) { echo i stdout=@o; } > > (file r[]) generate() { > int j[] = [0:10]; > foreach i in j { > r[i] = echo(i*i); > } > } > > (external w) wait(file dir[]) { > trace("in wait: dir",dir); > } > > app (file o) ls (string dir, external w) { ls "-l" dir stdout=@o; } > > file datadir[]; > datadir = generate(); > > external w1 = wait(datadir); > > trace( "generate done", w1); > > file out <"ls.out">; > out = ls("/home/wilde/oops/swift/datadir/", w1); > -- > > In this script the proc "generate()" simulates the production of the > data directory. I want the proc "ls" which simulates the processing of > the data directory, to wait until the directory is produced. As the > directory has too many files to pass to "ls" as an array, I pass a > string with the dir's path to ls, and want external vars to cause it to > wait till the directory is complete. > > But in the case above, returning the dataset (file array) "datadir" from > generate() does not wait for the array to be "closed". Nor does passing > it to wait(), nor does passing it by name to trace(). The script gives: > > -- > Swift svn swift-r2724 (swift modified locally) cog-r2333 > > RunID: 20090322-1922-o4ibjxac > Progress: > > SwiftScript trace: in wait: dir, > org.griphyn.vdl.karajan.FuturePairIterator at 1e671e67 > > SwiftScript trace: generate done, null > > Progress: Selecting site:8 Active:1 Stage out:1 Failed:1 Finished > successfully:1 > Progress: Selecting site:4 Active:1 Stage out:1 Failed:1 Finished > successfully:5 > Progress: Active:1 Stage out:1 Failed:1 Finished successfully:9 > Final status: Failed:1 Finished successfully:11 > > The following errors have occurred: > 1. 
Application "ls" failed (Exit code 2) > Arguments: "-l, /home/wilde/oops/swift/datadir/" > Host: localhost > Directory: ex5-20090322-1922-o4ibjxac/jobs/l/ls-ljbsob8j > STDERR: /bin/ls: /home/wilde/oops/swift/datadir/: No such file > or directory > STDOUT: > -- > > > It seems there's only 2 kinds of constructs or behaviors that can give > me this behavior, neither of which I can find a way to cause: > - something that waits for the whole array to get its values > - something that waits for an entire array of externals to all be set > > This note in the users guide suggests a possible way to do what I need: > > "Statements which deal with the array as a whole will often wait for the > array to be closed before executing (thus, a closed array is the > equivalent of a non-array type being assigned). However, a foreach > statement will apply its body to elements of an array as they become > known. It will not wait until the array is closed. > > What statement can I use to "wait for the array to be closed before > executing"? > > > On 3/22/09 4:47 PM, Ben Clifford wrote: > > As far as I can tell from a brief poke around, this is what is happening > > for you: > > > > Compound procedures do not themselves wait for their input parameters > > to all be ready to use. instead, they start trying to run all component > > pieces. > > > > If some data necessary for some component piece is not ready yet, that > > component piece will wait, so the compound procedure doesn't need to (and > > indeed shouldn't, because that reduces potential parallelism in some > > cases) > > > > You say this: > > > > analyseDatabase(external i) { > > trace("i am analyseDatabase"); > > } > > > > The trace call does not have any need to wait for i to be ready. So it > > doesn't wait for i to be ready. 
> > > > If you say this: > > > > analyseDatabase(external i) { > > trace("i am analyseDatabase", i); > > } > > > > then the trace call must wait for i to be ready (and fortuitously in the > > present implementation doesn't explode even though i cannot be > > meaningfully traced). > > > > With that change, you'll see the behaviour you want. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Sun Mar 22 22:35:31 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 22 Mar 2009 22:35:31 -0500 Subject: [Swift-devel] problems with external dependencies In-Reply-To: <1237778048.11219.2.camel@localhost> References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> Message-ID: <49C70383.6050802@mcs.anl.gov> On 3/22/09 10:14 PM, Mihael Hategan wrote: > I'm curious what you are doing with the directory after you wait for it. Walking through the files in that dir, checking simulation outputs for convergence, and determining the new protein secondary structure for the next round of simulations if they have not yet converged. > On a different note, should there be an array size function (e.g. > size(array)), it would have to wait for the array to be closed before > giving an accurate answer. Ive run into the need/desire for a len(array) function several times. That might indeed one way to code the wait, albeit a tad idiomatic. A waitForClose() function may also be useful, on which you could base length(). Also note that the limitation that forced me into this approach was the inability to pass a dataset without invoking all the semantics of data transport and the attendant cmd line args required. Some way to do that may also be worth discussing. 
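
A toy sketch of the point above, that a size/length function can only give an accurate answer once the array is closed. This is a Python model, not Swift's implementation; the names `ClosableArray`, `wait_for_close` and `size` are hypothetical and chosen only to mirror the waitForClose()/length() idea discussed here:

```python
import threading


class ClosableArray:
    """Toy model of a single-assignment array that is filled in
    concurrently and then closed once no more writers remain."""

    def __init__(self):
        self._items = {}
        self._cond = threading.Condition()
        self._closed = False

    def set(self, index, value):
        # Each element may be written once, and only before close().
        with self._cond:
            if self._closed:
                raise RuntimeError("array is already closed")
            if index in self._items:
                raise RuntimeError(f"element {index} has multiple writers")
            self._items[index] = value
            self._cond.notify_all()

    def close(self):
        # Signals that no further elements will be added.
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def wait_for_close(self, timeout=None):
        # Blocks until close() has been called; returns False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._closed, timeout)

    def size(self, timeout=None):
        # The size is only meaningful after closure, so wait for it first.
        if not self.wait_for_close(timeout):
            raise TimeoutError("array was not closed in time")
        with self._cond:
            return len(self._items)


def fill(array, n):
    # Stand-in for a foreach body that produces elements, then closes.
    for i in range(n):
        array.set(i, i * i)
    array.close()


def demo(n=11):
    arr = ClosableArray()
    writer = threading.Thread(target=fill, args=(arr, n))
    writer.start()
    count = arr.size()  # blocks until the writer has closed the array
    writer.join()
    return count
```

In this model, anything that needs the whole array (here `size`) is built on `wait_for_close`, while a consumer that only needs one element would not have to wait for closure.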

From wilde at mcs.anl.gov Sun Mar 22 23:21:29 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 22 Mar 2009 23:21:29 -0500
Subject: [Swift-devel] Null pointer exception in scheduling?
Message-ID: <49C70E49.4010001@mcs.anl.gov>

I just saw this one for what I think is the first time:

Progress: uninitialized:1 Finished successfully:2
Progress: Initializing:999 Selecting site:1 Finished successfully:2
Progress: Selecting site:999 Finished successfully:2 Initializing site shared directory:1
Progress: Stage in:999 Submitting:1 Finished successfully:2
Exception caught while processing event
java.lang.NullPointerException
	at org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:609)
	at org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:421)
	at org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410)
	at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
	at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
	at org.globus.cog.abstraction.impl.file.CachingDelegatedFileOperationHandler.setTaskStatus(CachingDelegatedFileOperationHandler.java:68)
	at org.globus.cog.abstraction.impl.file.CachingDelegatedFileOperationHandler.submit(CachingDelegatedFileOperationHandler.java:42)
	at org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:28)
	at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86)
	at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
	at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
	at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
	at java.lang.Thread.run(Thread.java:801)
Progress: Stage in:867 Submitting:133 Finished successfully:2
Progress: Stage in:256 Submitting:742 Submitted:2 Finished successfully:2

The script ran almost to the end; it looks like 1 job is hung in "stage in" and thus the final analysis job didn't run.

This was my first run with the new wait logic at a scale of 1000 jobs. It ran ok (once) at 500 jobs, as I've been scaling up the testing.

I'll see if this is reproducible. Not sure if it's related to the wait logic or not.

The log is at: www.ci.uchicago.edu/~wilde/oops-20090322-2312-ubvg3su6.log

From benc at hawaga.org.uk Mon Mar 23 02:13:57 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 07:13:57 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C70383.6050802@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov>
Message-ID:

On Sun, 22 Mar 2009, Michael Wilde wrote:

> Also note that the limitation that forced me into this approach was the
> inability to pass a dataset without invoking all the semantics of data
> transport and the attendant cmd line args required. Some way to do that may
> also be worth discussing.

That is what the external type is intended to be - you get the dependency processing but you don't get Swift data management. Semantics of external can be changed to help achieve that goal.

--

From benc at hawaga.org.uk Mon Mar 23 02:17:17 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 07:17:17 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C6DA07.7030809@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov>
Message-ID:

On Sun, 22 Mar 2009, Michael Wilde wrote:

> SwiftScript trace: in wait: dir,
> org.griphyn.vdl.karajan.FuturePairIterator@1e671e67

That is probably a bug in trace - it is listing the name of an internal future class rather than some SwiftScript description of the array, and apparently not waiting for close. I'll fix that.

--

From benc at hawaga.org.uk Mon Mar 23 02:26:54 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 07:26:54 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C6DA07.7030809@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov>
Message-ID:

On Sun, 22 Mar 2009, Michael Wilde wrote:

> What statement can I use to "wait for the array to be closed before
> executing"?

At the moment, I think the only thing that will wait for an array to be closed is an app procedure taking that array as a parameter. When that's a file array, you will then get the file handling that you don't want.

However, it's not clear from your example that you need it:

type file;

app (file o) echo (int i) { echo i stdout=@o; }

(external s, file r[]) generate() {
  int j[] = [0:10];
  foreach i in j {
    r[i] = echo(i*i);
  }
}

app (file o) ls (string dir, external w) { ls "-l" dir stdout=@o; }

file datadir[];
external sync;

(sync,datadir) = generate();

file out <"ls.out">;
out = ls("/Users/benc/work/cog/modules/swift/datadir/", sync);

--

From benc at hawaga.org.uk Mon Mar 23 03:02:32 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 08:02:32 +0000 (GMT)
Subject: [Swift-devel] Swift 0.9 release for ~2nd April
In-Reply-To:
References:
Message-ID:

On Mon, 16 Mar 2009, Ben Clifford wrote:

> I'd like to put out the Swift 0.9 release on the 2nd of April, with the
> release candidate being made from SVN on the 23rd of March.

The present trunk seems way too unstable for a release candidate, so not today.
-- From wilde at mcs.anl.gov Mon Mar 23 08:30:06 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 23 Mar 2009 08:30:06 -0500 Subject: [Swift-devel] swift compilation error on compound expressions Message-ID: <49C78EDE.7030801@mcs.anl.gov> This script: -- string st = "foo"; string tu = "bar"; if ((st != "") || (tu != "")) { if ((st != "") && (tu != "")) { trace("sweep"); } else { trace("Error"); } } else { trace("simple_loop"); } sur$ gives: -- Could not start execution. Error reading source: : unexpected character in markup > (position: START_TAG seen ...\n\t\t <>... @69:17) : : unexpected character in markup > (position: START_TAG seen ...\n\t\t <>... @69:17) sur$ cat tparse.swift -- tparse.xml is: -- st foo tu bar st tu st tu sweep Error simple_loop -- tparse.kml is: st-f2f01a13-f36f-458b-8cd9-2b8a8ee80c95 tu-91daec5d-486a-421f-8cac-58469d72f0ba st swift#string#17000 tu swift#string#17001 <> st swift#string#17002 <> tu swift#string#17002 <> st swift#string#17002 <> tu swift#string#17002 swift#string#17003 swift#string#17004 swift#string#17005 sur$ From wilde at mcs.anl.gov Mon Mar 23 08:46:20 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 23 Mar 2009 08:46:20 -0500 Subject: [Swift-devel] swift compilation error on compound expressions In-Reply-To: <49C78EDE.7030801@mcs.anl.gov> References: <49C78EDE.7030801@mcs.anl.gov> Message-ID: <49C792AC.7050506@mcs.anl.gov> This occurs on an even simpler example. It doesnt seem to like: cond = (st != ""); This script: string st = "foo"; boolean cond; cond = (st != ""); gives: sur$ swift tp2.swift Could not start execution. Error reading source: : unexpected character in markup > (position: START_TAG seen ...\n\t\t <>... @49:22) : : unexpected character in markup > (position: START_TAG seen ...\n\t\t <>... 
@49:22) On 3/23/09 8:30 AM, Michael Wilde wrote: > This script: > > -- > > string st = "foo"; > string tu = "bar"; > if ((st != "") || (tu != "")) { > if ((st != "") && (tu != "")) { > trace("sweep"); > } > else { > trace("Error"); > } > } > else { > trace("simple_loop"); > } > sur$ > > > gives: > > -- > > Could not start execution. > Error reading source: : unexpected character in markup > > (position: START_TAG seen ...\n\t\t > <>... @69:17) : : unexpected character in markup > (position: > START_TAG seen ...\n\t\t <>... > @69:17) > sur$ cat tparse.swift > > > -- > > tparse.xml is: > > -- > > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" > xmlns:xs="http://www.w3.org/2001/XMLSchema"> > > > st > foo > > > > tu > bar > > > > > st > > > > tu > > > > > > > > st > > > > tu > > > > > > sweep > > > > > Error > > > > > > simple_loop > > > > > -- > > tparse.kml is: > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="prefix">st-f2f01a13-f36f-458b-8cd9-2b8a8ee80c95 > > > > > > > name="prefix">tu-91daec5d-486a-421f-8cac-58469d72f0ba > > > > > > > > > > > > > st > > > swift#string#17000 > > > closeID="88000" /> > > > > > tu > > > swift#string#17001 > > > closeID="88001" /> > > > > > <> > st > swift#string#17002 > > <> > tu > swift#string#17002 > > > > > > > > <> > st > > swift#string#17002 > > <> > tu > > swift#string#17002 > > > > > > > > > swift#string#17003 > > > > > > > > > > > swift#string#17004 > > > > > > > > > > > > > > swift#string#17005 > > > > > > > > > > > > > sur$ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Mon Mar 23 09:05:06 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 23 Mar 2009 09:05:06 -0500 Subject: [Swift-devel] swift compilation error on compound expressions In-Reply-To: <49C792AC.7050506@mcs.anl.gov> References: <49C78EDE.7030801@mcs.anl.gov> <49C792AC.7050506@mcs.anl.gov> 
Message-ID: <49C79712.7080505@mcs.anl.gov>

It's simpler still: "!=" seems to be broken:

int i = 1;
if ( i != 2 ) {
    trace("ne");
}

gives:

sur$ swift tparse.swift
Could not start execution.
Error reading source: : unexpected character in markup >
(position: START_TAG seen ...<>... @41:36) : : unexpected character in
markup > (position: START_TAG seen ...<>... @41:36)

--

Is it safe to update my swift svn, or is trunk broken in other ways at the moment?

I am at:

Swift svn swift-r2724 (swift modified locally) cog-r2333

My local mods are in wrapper and vdl-int to support falkon.

On 3/23/09 8:46 AM, Michael Wilde wrote:
> This occurs on an even simpler example. It doesn't seem to like:
> cond = (st != "");
>
> This script:
>
> string st = "foo";
> boolean cond;
> cond = (st != "");
>
> gives:
>
> sur$ swift tp2.swift
> Could not start execution.
> Error reading source: : unexpected character in markup >
> (position: START_TAG seen ...\n\t\t <>...
> @49:22) : : unexpected character in markup > (position: START_TAG seen
> ...\n\t\t <>... @49:22)

From hategan at mcs.anl.gov Mon Mar 23 09:55:46 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Mar 2009 09:55:46 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C70383.6050802@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov>
Message-ID: <1237820146.3410.0.camel@localhost>

On Sun, 2009-03-22 at 22:35
-0500, Michael Wilde wrote:

> Also note that the limitation that forced me into this approach was the
> inability to pass a dataset without invoking all the semantics of data
> transport and the attendant cmd line args required. Some way to do that
> may also be worth discussing.
>

I can't quite parse that. Can you be more specific?

From wilde at mcs.anl.gov Mon Mar 23 09:56:03 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Mar 2009 09:56:03 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To:
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov>
Message-ID: <49C7A303.5060005@mcs.anl.gov>

On 3/23/09 2:26 AM, Ben Clifford wrote:
> On Sun, 22 Mar 2009, Michael Wilde wrote:
>
>> What statement can I use to "wait for the array to be closed before
>> executing"?
>
> At the moment, I think the only thing that will wait for an array to be
> closed is an app procedure taking that array as a parameter.

Ah - I keep missing this critical point: that only *app()* procs wait for arrays to be closed. Compound procs don't.

> When that's a
> file array, you will then get the file handling that you don't want.

Right, so of no use.

I don't think what you have below works. Perhaps it was fooled by a prior existing datadir? When I change your path back to my local datadir, I get a failure in ls, because it runs before datadir exists.

I think the problem is that generate returns as soon as the foreach launches its echo apps, but before they run.

I solved this problem in my later examples by putting a sync external on each app() invocation and waiting for them all in an iterate() loop, which forces the generate procedure to not return until the iterate is complete (as far as I can tell).

This is all subtle, so my assessment above may still be flawed. However, as far as I can tell, the iterate-wait technique seems to work; I've tested it up to 2000 jobs on the bgp.
> However, it's not clear from your example that you need it:
>
> type file;
>
> app (file o) echo (int i) { echo i stdout=@o; }
>
> (external s, file r[]) generate() {
>     int j[] = [0:10];
>     foreach i in j {
>         r[i] = echo(i*i);
>     }
> }
>
> app (file o) ls (string dir, external w) { ls "-l" dir stdout=@o; }
>
> file datadir[];
> external sync;
> (sync,datadir) = generate();
>
> file out <"ls.out">;
> out = ls("/Users/benc/work/cog/modules/swift/datadir/", sync);
>

From wilde at mcs.anl.gov Mon Mar 23 10:01:18 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Mar 2009 10:01:18 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To:
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov>
Message-ID: <49C7A43E.5020600@mcs.anl.gov>

On 3/23/09 2:13 AM, Ben Clifford wrote:
> On Sun, 22 Mar 2009, Michael Wilde wrote:
>
>> Also note that the limitation that forced me into this approach was the
>> inability to pass a dataset without invoking all the semantics of data
>> transport and the attendant cmd line args required. Some way to do that may
>> also be worth discussing.
>
> That is what the external type is intended to be - you get the dependency
> processing but you don't get Swift data management.

But if the user is forced to do this at fine granularity, it adds much effort and code (and head scratching ;)

> Semantics of external can be changed to help achieve that goal.

I see where you are going; some way to force a proc() to wait for its foreaches (or any async child procs) to complete might make the coding much cleaner/simpler. That might be a handy construct to place in any {} block including the one that defines a proc(){}.

Then the only piece missing might be a nice construct to grab a mapped directory name, so the user can pass something like @dirname(). That needs more thought; it's a vague notion.
From hategan at mcs.anl.gov Mon Mar 23 10:02:10 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Mar 2009 10:02:10 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <1237820146.3410.0.camel@localhost>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov> <1237820146.3410.0.camel@localhost>
Message-ID: <1237820530.3410.5.camel@localhost>

On Mon, 2009-03-23 at 09:55 -0500, Mihael Hategan wrote:
> On Sun, 2009-03-22 at 22:35 -0500, Michael Wilde wrote:
>
> > Also note that the limitation that forced me into this approach was the
> > inability to pass a dataset without invoking all the semantics of data
> > transport and the attendant cmd line args required. Some way to do that
> > may also be worth discussing.
> >
>
> I can't quite parse that. Can you be more specific?

Ok, I get it. Have you tried ls(waits)?

From wilde at mcs.anl.gov Mon Mar 23 10:07:59 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Mar 2009 10:07:59 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <1237820146.3410.0.camel@localhost>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov> <1237820146.3410.0.camel@localhost>
Message-ID: <49C7A5CF.4010803@mcs.anl.gov>

On 3/23/09 9:55 AM, Mihael Hategan wrote:
> On Sun, 2009-03-22 at 22:35 -0500, Michael Wilde wrote:
>
>> Also note that the limitation that forced me into this approach was the
>> inability to pass a dataset without invoking all the semantics of data
>> transport and the attendant cmd line args required. Some way to do that
>> may also be worth discussing.
>>
>
> I can't quite parse that. Can you be more specific?

In the oops application, the desired number of simulations to run at a time, per protein, is 2000.
After each "round" of 2000 runs, the application needs to analyze the outputs of the 2000, and in some cases decide whether to start another round or not, and if so, with what starting configuration (of the protein molecule being simulated).

Passing 2000 outputs from the simulation to the analysis routine causes wrapper.sh to be invoked with values of DIRS and/or INF that exceed the max allowed command line length (which I think is 128KB on the bgp).

It's primarily the error caused by this length limit that I'm trying to work around here.

Since analyze() is run on localhost anyway, and the data is all back there by the time the parallel simulations are done, it's also lower overhead to say "just run analyze() on the data in this known location". So it's handy, but not essential, to avoid the overhead of linking the dataset members to the localhost workdirectory shared/ subdir.

From hategan at mcs.anl.gov Mon Mar 23 11:02:02 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Mar 2009 11:02:02 -0500
Subject: [Swift-devel] scalability updates
In-Reply-To: <1237757996.17975.1.camel@localhost>
References: <1236630300.16421.17.camel@localhost> <1237735027.23923.1.camel@localhost> <1237755927.17314.1.camel@localhost> <1237757996.17975.1.camel@localhost>
Message-ID: <1237824122.5059.2.camel@localhost>

On Sun, 2009-03-22 at 16:39 -0500, Mihael Hategan wrote:
> On Sun, 2009-03-22 at 21:19 +0000, Ben Clifford wrote:
> > On Sun, 22 Mar 2009, Mihael Hategan wrote:
> >
> > > I'm using the "try to see what's wrong or let me see what's wrong before
> > > reverting" "we".
> >
> > Reverting is no big deal, just like unreverting at fixup time is also no
> > big deal.
> >
>
> Well, it's additional and pointless effort.

I forgot to mention I fixed this yesterday. When copying arrays, the code that did it would create the indices by calling String.valueOf(sourceIndex), which, when the source index was a Double, would produce "n.0".
I think the array index handling in swift lacks clarity.

From wilde at mcs.anl.gov Mon Mar 23 11:05:20 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Mar 2009 11:05:20 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To:
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov>
Message-ID: <49C7B340.1040705@mcs.anl.gov>

On 3/23/09 2:26 AM, Ben Clifford wrote:

> At the moment, I think the only thing that will wait for an array to be
> closed is an app procedure taking that array as a parameter. When that's a
> file array, you will then get the file handling that you don't want.

Ah, but following up on Mihael's latest point: if on the other hand it's an array of externals, *then* a dummy app() like a localhost "sleep 0" might give the desired effect of "wait on all members of a *set* of externals".

Then you might be able to say in some code:

if(wait(events)) {
    stuff;
}

where wait() returns true. Since an app() can't currently return true, it needs to be wrapped in a compound that converts a file from wait() into a true via readData(). I've used that technique in oops. That could be streamlined.

Did we establish that a compound proc() will, or will not, wait for all of its args that are external() to be set before either starting, or exiting?

I *think* it was either stated or implied earlier in this thread that it will *not* wait to start, and only wait to exit if the value is needed in an assignment statement or a call to an app, but at this point I need to write all this down, with examples, to remember it.

I'm going to retire from this thread until I get a chance to try some of these refinements. It's not urgent at the moment, but interesting.

It would be good to capture in one place in the users guide, "All the things a Swift coder needs to know about parallelism". 2.3 and 2.3 try to do that, but there are many subtleties and the treatment needs more details and examples.
I'd like to contribute to that but need to find time.

From hategan at mcs.anl.gov Mon Mar 23 11:12:36 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 23 Mar 2009 11:12:36 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C7B340.1040705@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <49C7B340.1040705@mcs.anl.gov>
Message-ID: <1237824756.5059.8.camel@localhost>

On Mon, 2009-03-23 at 11:05 -0500, Michael Wilde wrote:
> On 3/23/09 2:26 AM, Ben Clifford wrote:
>
> > At the moment, I think the only thing that will wait for an array to be
> > closed is an app procedure taking that array as a parameter. When that's a
> > file array, you will then get the file handling that you don't want.
>
> Ah, but following up on Mihael's latest point: if on the other hand it's
> an array of externals, *then* a dummy app() like a localhost "sleep 0"
> might give the desired effect of "wait on all members of a *set* of
> externals".

I'm not sure why you need the extra app. You can pass it directly to the step that computes the next set of parameters.

>
> Then you might be able to say in some code:
>
> if(wait(events)) {
>     stuff;
> }
>
> where wait() returns true. Since an app() can't currently return true, it
> needs to be wrapped in a compound that converts a file from wait() into
> a true via readData(). I've used that technique in oops. That could be
> streamlined.
>
> Did we establish that a compound proc() will, or will not, wait for all
> of its args that are external() to be set before either starting, or
> exiting?

There is no difference, from the "who waits for what" perspective, between external and non-external data (or at least there shouldn't be). The only thing that external data does is to tell an app not to do any file staging. So no.
From benc at hawaga.org.uk Mon Mar 23 12:41:17 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 17:41:17 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C7A303.5060005@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <49C7A303.5060005@mcs.anl.gov>
Message-ID:

On Mon, 23 Mar 2009, Michael Wilde wrote:

> I don't think what you have below works. Perhaps it was fooled by a prior
> existing datadir? When I change your path back to my local datadir, I get a
> failure in ls, because it runs before datadir exists.

It works on my laptop repeatably with present head, removing ls.out and datadir/ in between runs. Send me a run log and also a copy of the SwiftScript source code you are running.

> I think the problem is that generate returns as soon as the foreach
> launches its echo apps, but before they run.

It shouldn't. If you have Swift >=r2734, you should see log lines with the tokens STARTCOMPOUND and ENDCOMPOUND in them giving the timings.

-- 

From benc at hawaga.org.uk Mon Mar 23 12:52:19 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 17:52:19 +0000 (GMT)
Subject: [Swift-devel] swift compilation error on compound expressions
In-Reply-To: <49C79712.7080505@mcs.anl.gov>
References: <49C78EDE.7030801@mcs.anl.gov> <49C792AC.7050506@mcs.anl.gov> <49C79712.7080505@mcs.anl.gov>
Message-ID:

On Mon, 23 Mar 2009, Michael Wilde wrote:

> It's simpler still: "!=" seems to be broken:

Indeed it is broken. It looks like it's never worked (at least since r462, a couple of years ago). I'll fix and add a test.

Stuff like this makes me laugh.
-- 

From benc at hawaga.org.uk Mon Mar 23 13:08:55 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 18:08:55 +0000 (GMT)
Subject: [Swift-devel] swift compilation error on compound expressions
In-Reply-To:
References: <49C78EDE.7030801@mcs.anl.gov> <49C792AC.7050506@mcs.anl.gov> <49C79712.7080505@mcs.anl.gov>
Message-ID:

On Mon, 23 Mar 2009, Ben Clifford wrote:

>> It's simpler still: "!=" seems to be broken:

> Indeed it is broken.

fixed in r2737

-- 

From benc at hawaga.org.uk Mon Mar 23 13:27:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 18:27:39 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C7B340.1040705@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <49C7B340.1040705@mcs.anl.gov>
Message-ID:

On Mon, 23 Mar 2009, Michael Wilde wrote:

> I'm going to retire from this thread until I get a chance to try some of these
> refinements. It's not urgent at the moment, but interesting.

It would be useful if you would send me the log file that I asked for before retiring from this thread - you appear to be experiencing different behaviour from my install and I would like to investigate.

-- 

From benc at hawaga.org.uk Mon Mar 23 14:18:44 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 19:18:44 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C7A5CF.4010803@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov> <1237820146.3410.0.camel@localhost> <49C7A5CF.4010803@mcs.anl.gov>
Message-ID:

On Mon, 23 Mar 2009, Michael Wilde wrote:

> Passing 2000 outputs from the simulation to the analysis routine causes
> wrapper.sh to be invoked with values of DIRS and/or INF that exceed the max
> allowed command line length (which I think is 128KB on the bgp).
>
> It's primarily the error caused by this length limit that I'm trying to work
> around here.

It might not be too hard to make wrapper.sh take its inputs from a file, which would then scale much better. I've certainly thought about it before.

Would your app be able to cope ok if wrapper.sh did not have this problem? (or would it also have the same command-line length problems?)

-- 

From wilde at mcs.anl.gov Mon Mar 23 14:57:56 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Mar 2009 14:57:56 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To:
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov> <1237820146.3410.0.camel@localhost> <49C7A5CF.4010803@mcs.anl.gov>
Message-ID: <49C7E9C4.3020908@mcs.anl.gov>

On 3/23/09 2:18 PM, Ben Clifford wrote:
> On Mon, 23 Mar 2009, Michael Wilde wrote:
>
>> Passing 2000 outputs from the simulation to the analysis routine causes
>> wrapper.sh to be invoked with values of DIRS and/or INF that exceed the max
>> allowed command line length (which I think is 128KB on the bgp).
>>
>> It's primarily the error caused by this length limit that I'm trying to work
>> around here.
>
> It might not be too hard to make wrapper.sh take its inputs from a file,
> which would then scale much better. I've certainly thought about it
> before.

Cool. Can you do the best-of-both-worlds, passing short to modest size lists as we do today, and only using files for INF and DIRS when they are very long?

E.g., if the first char of dirs is ` as in `listofdirs` then wrapper.sh reads them from file "listofdirs". Otherwise, I would fear that we're stepping backwards on the path of eliminating file operations in the app-calling path.

>
> Would your app be able to cope ok if wrapper.sh did not have this problem?

Yes, I think so, but surprises are possible.

> (or would it also have the same command-line length problems?)
No, it avoided command-line length limitations from the start by only placing the first file name in the struct-array on the command line, to serve as a pattern for the script, which knows how to find the full set of files, by naming convention.

From benc at hawaga.org.uk Mon Mar 23 15:51:03 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 23 Mar 2009 20:51:03 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C7E9C4.3020908@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov> <1237820146.3410.0.camel@localhost> <49C7A5CF.4010803@mcs.anl.gov> <49C7E9C4.3020908@mcs.anl.gov>
Message-ID:

ok I'll have a play with the below. But not until you send me the log file where the code I sent you doesn't work for you but does work for me.

On Mon, 23 Mar 2009, Michael Wilde wrote:

>
> On 3/23/09 2:18 PM, Ben Clifford wrote:
> > On Mon, 23 Mar 2009, Michael Wilde wrote:
> >
> > > Passing 2000 outputs from the simulation to the analysis routine causes
> > > wrapper.sh to be invoked with values of DIRS and/or INF that exceed the
> > > max
> > > allowed command line length (which I think is 128KB on the bgp).
> > >
> > > It's primarily the error caused by this length limit that I'm trying to work
> > > around here.
> >
> > It might not be too hard to make wrapper.sh take its inputs from a file,
> > which would then scale much better. I've certainly thought about it before.
>
> Cool. Can you do the best-of-both-worlds, passing short to modest size lists
> as we do today, and only using files for INF and DIRS when they are very long?
>
> E.g., if the first char of dirs is ` as in `listofdirs` then wrapper.sh reads
> them from file "listofdirs". Otherwise, I would fear that we're stepping
> backwards on the path of eliminating file operations in the app-calling path.
> >
> > > Would your app be able to cope ok if wrapper.sh did not have this problem?
> >
> > Yes, I think so, but surprises are possible.
> >
> > > (or would it also have the same command-line length problems?)
> >
> > No, it avoided command-line length limitations from the start by only placing
> > the first file name in the struct-array on the command line, to serve as a
> > pattern for the script, which knows how to find the full set of files, by
> > naming convention.
> >
> >
> >

From bugzilla-daemon at mcs.anl.gov Mon Mar 23 16:42:49 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 23 Mar 2009 16:42:49 -0500 (CDT)
Subject: [Swift-devel] [Bug 189] New: If .kml file is empty, swift exits swiftly with cryptic error message
Message-ID:

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=189

           Summary: If .kml file is empty, swift exits swiftly with cryptic
                    error message
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: wilde at mcs.anl.gov

If the .kml file for a previously compiled swift script gets somehow set to length 0 (perhaps from a ^C on a prior run, not sure...) then swift gives the user a very confusing message:

com$ swift t.swift
Swift svn swift-r2737 cog-r2335

RunID: 20090323-1638-kg6fi441
SwiftScript trace: Hi
Progress:
Final status:
com$ ls -lt
total 20
-rw-r--r-- 1 wilde ci-users 1641 Mar 23 16:38 t-20090323-1638-kg6fi441.log
-rw-r--r-- 1 wilde ci-users  676 Mar 23 16:38 t.kml
-rw-r--r-- 1 wilde ci-users  294 Mar 23 16:38 t.xml
-rw-r--r-- 1 wilde ci-users   57 Mar 23 16:38 swift.log
-rw-r--r-- 1 wilde ci-users   13 Mar 23 16:38 t.swift
com$ >t.kml
com$ swift t.swift
Could not start execution.
        null
com$ ls -lt *log
-rw-r--r-- 1 wilde ci-users  114 Mar 23 16:38 swift.log
-rw-r--r-- 1 wilde ci-users  210 Mar 23 16:38 t-20090323-1638-q8vrpyr4.log
-rw-r--r-- 1 wilde ci-users 1641 Mar 23 16:38 t-20090323-1638-kg6fi441.log
com$ cat *pyr4.log
2009-03-23 16:38:45,645-0500 DEBUG Loader Detailed exception:
java.lang.NullPointerException
        at org.griphyn.vdl.karajan.Loader.compile(Loader.java:224)
        at org.griphyn.vdl.karajan.Loader.main(Loader.java:126)

com$

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From bugzilla-daemon at mcs.anl.gov Mon Mar 23 16:47:13 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 23 Mar 2009 16:47:13 -0500 (CDT)
Subject: [Swift-devel] [Bug 189] If .kml file is empty, swift exits swiftly with cryptic error message
In-Reply-To:
References:
Message-ID: <20090323214713.2E9932CB9B@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=189

Ben Clifford changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #1 from Ben Clifford 2009-03-23 16:47:13 ---
That's been a problem for a while. I thought I'd fixed it with the version comparison that I implemented in r2583, but I guess it doesn't deal with the empty-kml case.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From wilde at mcs.anl.gov Mon Mar 23 16:47:24 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Mar 2009 16:47:24 -0500
Subject: [Swift-devel] [Fwd: [Bug 189] New: If .kml file is empty, swift exits swiftly with cryptic error message]
Message-ID: <49C8036C.2050107@mcs.anl.gov>

This is the problem, Glen.
Somehow oops.kml got zeroed out; it's a zero-length file.

So rm oops.kml oops.xml and try again.

If you do an ls -lt on your swift dir, you can see where things stopped working around 15:41, the time that oops.kml was last written to.

I filed a bug report to clean up this error handling. It should say "oh, empty kml file, I gotta recompile".

- Mike

-------- Original Message --------
Subject: [Bug 189] New: If .kml file is empty, swift exits swiftly with cryptic error message
Date: Mon, 23 Mar 2009 16:42:49 -0500 (CDT)
From: bugzilla-daemon at mcs.anl.gov
To: wilde at mcs.anl.gov

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=189

           Summary: If .kml file is empty, swift exits swiftly with cryptic
                    error message
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: SwiftScript language
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: wilde at mcs.anl.gov

If the .kml file for a previously compiled swift script gets somehow set to length 0 (perhaps from a ^C on a prior run, not sure...) then swift gives the user a very confusing message:

com$ swift t.swift
Swift svn swift-r2737 cog-r2335

RunID: 20090323-1638-kg6fi441
SwiftScript trace: Hi
Progress:
Final status:
com$ ls -lt
total 20
-rw-r--r-- 1 wilde ci-users 1641 Mar 23 16:38 t-20090323-1638-kg6fi441.log
-rw-r--r-- 1 wilde ci-users  676 Mar 23 16:38 t.kml
-rw-r--r-- 1 wilde ci-users  294 Mar 23 16:38 t.xml
-rw-r--r-- 1 wilde ci-users   57 Mar 23 16:38 swift.log
-rw-r--r-- 1 wilde ci-users   13 Mar 23 16:38 t.swift
com$ >t.kml
com$ swift t.swift
Could not start execution.
        null
com$ ls -lt *log
-rw-r--r-- 1 wilde ci-users  114 Mar 23 16:38 swift.log
-rw-r--r-- 1 wilde ci-users  210 Mar 23 16:38 t-20090323-1638-q8vrpyr4.log
-rw-r--r-- 1 wilde ci-users 1641 Mar 23 16:38 t-20090323-1638-kg6fi441.log
com$ cat *pyr4.log
2009-03-23 16:38:45,645-0500 DEBUG Loader Detailed exception:
java.lang.NullPointerException
        at org.griphyn.vdl.karajan.Loader.compile(Loader.java:224)
        at org.griphyn.vdl.karajan.Loader.main(Loader.java:126)

com$

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug.

From wilde at mcs.anl.gov Mon Mar 23 22:28:22 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 23 Mar 2009 22:28:22 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To:
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov> <1237820146.3410.0.camel@localhost> <49C7A5CF.4010803@mcs.anl.gov> <49C7E9C4.3020908@mcs.anl.gov>
Message-ID: <49C85356.8070807@mcs.anl.gov>

On 3/23/09 3:51 PM, Ben Clifford wrote:
> ok I'll have a play with the below. But not until you send me the log file
> where the code I sent you doesn't work for you but does work for me.

Deal.

http://www.ci.uchicago.edu/~wilde/ben-20090323-2222-ubxsbh83.log

and below.

Did I run the right version of your test?

Close, but not the latest trunk. I will update and see if it makes a difference.
I get:

sur$ rm ben.xml ben.kml
sur$
sur$ cat ben.swift
type file;

app (file o) echo (int i) { echo i stdout=@o; }

(external s, file r[]) generate() {
    int j[] = [0:10];
    foreach i in j {
        r[i] = echo(i*i);
    }
}

app (file o) ls (string dir, external w) { ls "-l" dir stdout=@o; }

file datadir[];
external sync;
(sync,datadir) = generate();

file out <"ls.out">;
out = ls("datadir/", sync);

sur$ rm -rf datadir ls.out
sur$ ls -l datadir ls.out
/bin/ls: datadir: No such file or directory
/bin/ls: ls.out: No such file or directory
sur$ swift ben.swift
Swift svn swift-r2724 (swift modified locally) cog-r2333

RunID: 20090323-2222-ubxsbh83
Progress:
Progress: Selecting site:8 Stage in:1 Finished successfully:2
Progress: Selecting site:2 Submitted:1 Finished successfully:8
getValue called in an external dataset
Final status: Failed:1 Finished successfully:11
The following errors have occurred:
1. Application "ls" failed (Job failed with an exit code of 2)
        Arguments: "-l, datadir/"
        Host: localhost
        Directory: ben-20090323-2222-ubxsbh83/jobs/s/ls-sheljd8j
        STDERR: /bin/ls: datadir/: No such file or directory
        STDOUT:

sur$

> On Mon, 23 Mar 2009, Michael Wilde wrote:
>
>> On 3/23/09 2:18 PM, Ben Clifford wrote:
>>> On Mon, 23 Mar 2009, Michael Wilde wrote:
>>>
>>>> Passing 2000 outputs from the simulation to the analysis routine causes
>>>> wrapper.sh to be invoked with values of DIRS and/or INF that exceed the
>>>> max
>>>> allowed command line length (which I think is 128KB on the bgp).
>>>>
>>>> It's primarily the error caused by this length limit that I'm trying to work
>>>> around here.
>>> It might not be too hard to make wrapper.sh take its inputs from a file,
>>> which would then scale much better. I've certainly thought about it before.
>> Cool. Can you do the best-of-both-worlds, passing short to modest size lists
>> as we do today, and only using files for INF and DIRS when they are very long?
>>
>> E.g., if the first char of dirs is ` as in `listofdirs` then wrapper.sh reads
>> them from file "listofdirs". Otherwise, I would fear that we're stepping
>> backwards on the path of eliminating file operations in the app-calling path.
>>
>>> Would your app be able to cope ok if wrapper.sh did not have this problem?
>> Yes, I think so, but surprises are possible.
>>
>>> (or would it also have the same command-line length problems?)
>> No, it avoided command-line length limitations from the start by only placing
>> the first file name in the struct-array on the command line, to serve as a
>> pattern for the script, which knows how to find the full set of files, by
>> naming convention.
>>
>>
>>
>>

From benc at hawaga.org.uk Tue Mar 24 04:30:27 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 24 Mar 2009 09:30:27 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C85356.8070807@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov> <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov> <1237820146.3410.0.camel@localhost> <49C7A5CF.4010803@mcs.anl.gov> <49C7E9C4.3020908@mcs.anl.gov> <49C85356.8070807@mcs.anl.gov>
Message-ID:

On Mon, 23 Mar 2009, Michael Wilde wrote:

> out = ls("datadir/", sync);

This ls needs an absolute path - when it runs, it is in its own jobdir which (deliberately) has none of the other files staged in.

You need to specify the absolute path to your data directory there.

One way to think of it is: rather than calling the directory 'sync', you could call it 'myDataDir' - that external variable represents the external directory being created and passed around.
But, because you have declared that data 'external' that means Swift will
never attempt to put you in the right directory for access to that data or
prepare any input files for you or anything like that (indeed, your
external data might not even be file based - this was originally
implemented for someone who wanted to do database access, I think)

--

From wilde at mcs.anl.gov Tue Mar 24 08:06:22 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Mar 2009 08:06:22 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To:
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov>
 <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov>
 <1237820146.3410.0.camel@localhost> <49C7A5CF.4010803@mcs.anl.gov>
 <49C7E9C4.3020908@mcs.anl.gov> <49C85356.8070807@mcs.anl.gov>
Message-ID: <49C8DACE.9010202@mcs.anl.gov>

On 3/24/09 4:30 AM, Ben Clifford wrote:
> On Mon, 23 Mar 2009, Michael Wilde wrote:
>
>> out = ls("datadir/", sync);
>
> This ls needs an absolute path - when it runs, it is in its own jobdir
> which (deliberately) has none of the other files staged in.

Yes, makes perfect sense. I did exactly that in the production code, but
neglected to do it in this test code, and was fooled by the answer. I
understand this part completely.

I then need to correct this and go back and experiment more with the
synchronization semantics, mainly related to array closing and procedure
entry/exit semantics. Despite the fact that I can use these pretty
effectively, I still don't have a solid, comfortable mental model of how
some of these aspects work.

Back to the main point: I need to dissect this example to better understand
how it lets me wait for all the apps to finish without putting an external
var on each one, and hence avoids the array of externals. This violates my
(most recent) understanding of procedure return, in that generate() would
return to the calls *before* all the ls() procs called from its foreach
loop returns.
That seems to be the crux of the matter.

I need to go back through this thread, but I thought Mihael confirmed that
this return would indeed happen.

I also need to look at what I did wrong in my prior tests on this, as I
thought I tried this approach first. Maybe the full-path mistake misled me
from the start. Very subtle stuff.

> You need to specify the absolute path to your data directory there.
>
> One way to think of it is: rather than calling the directory 'sync', you
> could call it 'myDataDir' - that external variable represents the external
> directory being created and passed around. But, because you have declared
> that data 'external' that means Swift will never attempt to put you in the
> right directory for access to that data or prepare any input files for you
> or anything like that (indeed, your external data might not even be file
> based - this was originally implemented for someone who wanted to do
> database access, I think)

Yes, that part is clear, and was clear from the start. I just goofed in
this test. In the production oops code, I pass in "cwd" as an arg and
prepend that to the "outdir" path, which is also an arg.
Btw, oops is in svn at: https://svn.ci.uchicago.edu/svn/oops
the swift code and supporting scripts (mappers etc) are in there under
swift/

From benc at hawaga.org.uk Tue Mar 24 08:15:25 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 24 Mar 2009 13:15:25 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C8DACE.9010202@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov>
 <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov>
 <1237820146.3410.0.camel@localhost> <49C7A5CF.4010803@mcs.anl.gov>
 <49C7E9C4.3020908@mcs.anl.gov> <49C85356.8070807@mcs.anl.gov>
 <49C8DACE.9010202@mcs.anl.gov>
Message-ID:

On Tue, 24 Mar 2009, Michael Wilde wrote:

> Back to the main point: I need to dissect this example to better understand
> how it lets me wait for all the apps to finish without putting an external var
> on each one, and hence avoids the array of externals. This violates my (most
> recent) understanding of procedure return, in that generate() would return to
> the calls *before* all the ls() procs called from its foreach loop returns.
> That seems to be the crux of the matter.

no, they return when everything in them has finished executing.

Thats how the unpleasant hack of putting foreach loops inside a procedure
to get closing to happen at the end of the foreach happens. Its the same
here - the external variable is closed when everything inside its
returning procedure is finished.

--

From zhangzhao0718 at gmail.com Tue Mar 24 08:53:23 2009
From: zhangzhao0718 at gmail.com (Zhao Zhang)
Date: Tue, 24 Mar 2009 08:53:23 -0500
Subject: [Swift-devel] could we set the job sequence without a file dependency?
Message-ID: <49C8E5D3.5080207@gmail.com>

Hi,

I have such a scenario: Job B has to wait until Job A is done. But there
is no data dependency between A and B. Also, involving data dependency
here will cause a heavy work load.
So do we have a way, maybe a signal, to inform B that A is completed and B
could start? Thanks.

best wishes
zhangzhao

From benc at hawaga.org.uk Tue Mar 24 08:55:20 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 24 Mar 2009 13:55:20 +0000 (GMT)
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To: <49C8E5D3.5080207@gmail.com>
References: <49C8E5D3.5080207@gmail.com>
Message-ID:

On Tue, 24 Mar 2009, Zhao Zhang wrote:

> I have such a scenario: Job B has to wait until Job A is done. But there is no
> data dependency between A and B. Also, involving data dependency here will
> cause a heavy work load. So do we have a way, maybe a signal, to inform B that
> A is completed and B could start? Thanks.

What is the dependency between B and A then, if not data?

--

From zhaozhang at uchicago.edu Tue Mar 24 08:59:10 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 24 Mar 2009 08:59:10 -0500
Subject: [Swift-devel] could we set the job sequence without a file dependency?
Message-ID: <49C8E72E.9010907@uchicago.edu>

Hi,

I have such a scenario: Job B has to wait until Job A is done. But there
is no data dependency between A and B. Also, involving data dependency
here will cause a heavy work load. So do we have a way, maybe a signal, to
inform B that A is completed and B could start? Thanks.

best wishes
zhangzhao

From zhangzhao0718 at gmail.com Tue Mar 24 09:02:35 2009
From: zhangzhao0718 at gmail.com (Zhao Zhang)
Date: Tue, 24 Mar 2009 09:02:35 -0500
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To:
References: <49C8E5D3.5080207@gmail.com>
Message-ID: <49C8E7FB.7090807@gmail.com>

Ben Clifford wrote:
> On Tue, 24 Mar 2009, Zhao Zhang wrote:
>
>
>> I have such a scenario: Job B has to wait until Job A is done. But there is no
>> data dependency between A and B. Also, involving data dependency here will
>> cause a heavy work load.
>> So do we have a way, maybe a signal, to inform B that
>> A is completed and B could start? Thanks.
>>
>
> What is the dependency between B and A then, if not data?
>
>
Say, Job A is broadcasting common data shared for all jobs. And Job B only
needs to know that Job A is done, so he could read the common data.

zhao

From wilde at mcs.anl.gov Tue Mar 24 09:02:44 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Mar 2009 09:02:44 -0500
Subject: [Swift-devel] problems with external dependencies
In-Reply-To:
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov>
 <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov>
 <1237820146.3410.0.camel@localhost> <49C7A5CF.4010803@mcs.anl.gov>
 <49C7E9C4.3020908@mcs.anl.gov> <49C85356.8070807@mcs.anl.gov>
 <49C8DACE.9010202@mcs.anl.gov>
Message-ID: <49C8E804.6020607@mcs.anl.gov>

...a digression for better understanding the language:

On 3/24/09 8:15 AM, Ben Clifford wrote:
> On Tue, 24 Mar 2009, Michael Wilde wrote:
>
>> Back to the main point: I need to dissect this example to better understand
>> how it lets me wait for all the apps to finish without putting an external var
>> on each one, and hence avoids the array of externals. This violates my (most
>> recent) understanding of procedure return, in that generate() would return to
>> the calls *before* all the ls() procs called from its foreach loop returns.
>> That seems to be the crux of the matter.
>
> no, they return when everything in them has finished executing.

That makes more sense - it was my *initial* understanding. Some
mis-interpretation of info from a prior email thread led me astray into
thinking that proc return was somehow magically more asynchronous. This
helps return some comforting simplicity to my mental model.
So no matter how many parallel procedure calls a procedure triggers,
directly or indirectly, through multiple levels of calls, the procedure
does not return until all the calls below it on the stack have returned.
Is that correct?

Is it technically the same for the foreach(), and in fact for every other
statement (eg, if() ) - that each statement essentially starts a "block"
that may invoke parallel procs, and again, each statement waits for its
child statements and procs to complete before it completes?

> Thats how the unpleasant hack of putting foreach loops inside a procedure
> to get closing to happen at the end of the foreach happens. Its the same
> here - the external variable is closed when everything inside its
> returning procedure is finished.

Is it further true that the term "closing" applies to all vars, of all
types, whether atomic or compound? Where "closed" is the same as "set" for
atomic vars?

And that:

- the "setting" of external type vars is tied to the procedure that
  *returns* them?

- the "setting" of other atomic vars happens when they are assigned a
  value in an assignment statement.

- the setting of arrays is based on flow analysis (as per below, from a
  prior thread)

- the setting of structures is ??? (when all fields are set or when that
  structure goes out of scope?)

- the setting of file vars means that the file has been created by an app.
  The var contains only a mapping (which is set before the file is
  created) and a state, set/unset, somewhat like externals. So externals
  are almost? exactly like file vars, but have no mapping?

From an earlier thread on array closing, by Ben:

"...there is static analysis of source code, and when no more assignments
are left to make to an array, it's regarded as closed. However, in the case
of multidimensional arrays, this only happens when the entire top level
array has no more assignments at all, not as each subarray happens to
become finished."
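
The procedure-return rule stated earlier in this thread (a procedure returns, and the external it hands back is closed, only when everything started inside it has finished) can be modelled outside Swift. What follows is a minimal Python sketch offered as an analogy only, not as Swift's or Karajan's implementation. The names generate, echo and ls are borrowed from ben.swift above; the thread pool, the log list, and the use of a bare object() as a stand-in for an 'external' value are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait
import threading

log = []                       # records the order in which calls finish
log_lock = threading.Lock()    # protects log across worker threads


def echo(i):
    """Stands in for one app call made inside the foreach loop."""
    with log_lock:
        log.append(f"echo {i}")
    return i * i


def generate(pool, n):
    """Models a compound procedure: its calls run in parallel, and it
    returns (closing its outputs) only after all of them have finished."""
    futures = [pool.submit(echo, i) for i in range(n)]
    wait(futures)                       # every child call completes here
    squares = [f.result() for f in futures]
    sync = object()                     # value-less token, like an 'external'
    return sync, squares


def ls(sync):
    """Consumer that needs only the token; holding it expresses the
    dependency, the token itself carries no data."""
    with log_lock:
        log.append("ls")
    return "listing"


with ThreadPoolExecutor(max_workers=4) as pool:
    sync, squares = generate(pool, 5)
    out = ls(sync)
```

In this model ls cannot be called before generate has handed over its token, so "ls" is always the last entry in log, while the echo entries may appear in any order.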
From benc at hawaga.org.uk Tue Mar 24 09:06:42 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 24 Mar 2009 14:06:42 +0000 (GMT)
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To: <49C8E7FB.7090807@gmail.com>
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
Message-ID:

On Tue, 24 Mar 2009, Zhao Zhang wrote:

> Say, Job A is broadcasting common data shared for all jobs. And Job B only
> needs to know that Job A is done, so he could read the common data.

ok, so there is a data dependency.

You can use externals (like Mike has been using on the swift-user list) to
represent data dependencies like that.

app (external commonData) A() {
  ...
}

app B(external commonData) {
  ...
}

external d = A();
B(d);

d represents your shared data set - by declaring it as 'external' you say
that Swift should do data dependency handling, but should not attempt to
manage the data itself.

d is mapped to the data, but in your head, rather than in Swift.

--

From benc at hawaga.org.uk Tue Mar 24 09:13:25 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 24 Mar 2009 14:13:25 +0000 (GMT)
Subject: [Swift-devel] problems with external dependencies
In-Reply-To: <49C8E804.6020607@mcs.anl.gov>
References: <49C6AD6C.8020509@mcs.anl.gov> <49C6DA07.7030809@mcs.anl.gov>
 <1237778048.11219.2.camel@localhost> <49C70383.6050802@mcs.anl.gov>
 <1237820146.3410.0.camel@localhost> <49C7A5CF.4010803@mcs.anl.gov>
 <49C7E9C4.3020908@mcs.anl.gov> <49C85356.8070807@mcs.anl.gov>
 <49C8DACE.9010202@mcs.anl.gov> <49C8E804.6020607@mcs.anl.gov>
Message-ID:

On Tue, 24 Mar 2009, Michael Wilde wrote:

> So no matter how many parallel procedure calls a procedure triggers, directly
> or indirectly, through multiple levels of calls, the procedure does not return
> until all the calls below it on the stack have returned. is that correct?

yes.

> Is it technically the same for the foreach(), and in fact for every other
> statement (eg, if() ) - that each statement essentially starts a "block" that
> may invoke parallel procs, and again, each statement waits for its child
> statements and procs to complete before it completes?

at an implementation level, yes. though 'completes' doesn't have much
effect in almost all circumstances - compound procedures are the only
place (I think) where completion of all the statements then causes some
other effect (the explicit closing of all returned values). everywhere
else, I think you cannot notice that.

> Is it further true that the term "closing" applies to all vars, of all types,
> whether atomic or compound?

yes.

> Where "closed" is the same as "set" for atomic
> vars?

yes

> And that:
> - the "setting" of external type vars is tied to the procedure that *returns*
> them?

yes

> - the "setting" of other atomic vars happens when they are assigned a value in
> an assignment statement.

yes

> - the setting of arrays is based on flow analysis (as per below, from a prior
> thread)

arrays get closed in two places: when they are returned from a procedure
call, and when static analysis has determined that all expressions that
can assign to an array have been evaluated. As you discovered, this second
only happens for the root of a multidimensional array, so is not
sufficient...

> - the setting of structures is ??? (when all fields are set or when that
> structure goes out of scope?

Arrays need closing to indicate that all of their members are known. A
structure has predefined members so closing doesn't come into play there.

> - the setting of file vars means that the file has been created by an app. The
> var contains only a mapping (which is set before the file is created) and a
> state, set/unset, somewhat like externals. So externals are almost? exactly
> like file vars, but have no mapping?

yes.
The best analogy is that they are mapped but in your head rather than in
SwiftScript.

--

From foster at anl.gov Tue Mar 24 11:59:25 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 24 Mar 2009 11:59:25 -0500
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To:
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
Message-ID: <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>

These discussions are very reminiscent of PCN, where we had many of the
same issues.

A reason then for wanting sequencing was for output.

We introduced a sequential operator, which had the advantage that people
could say more directly what they meant.


On Mar 24, 2009, at 9:06 AM, Ben Clifford wrote:

>
> On Tue, 24 Mar 2009, Zhao Zhang wrote:
>
>> Say, Job A is broadcasting common data shared for all jobs. And Job B only
>> needs to know that Job A is done, so he could read the common data.
>
> ok, so there is a data dependency.
>
> You can use externals (like Mike has been using on the swift-user list) to
> represent data dependencies like that.
>
> app (external commonData) A() {
>   ...
> }
>
> app B(external commonData) {
>   ...
> }
>
> external d = A();
> B(d);
>
> d represents your shared data set - by declaring it as 'external' you say
> that Swift should do data dependency handling, but should not attempt to
> manage the data itself.
>
> d is mapped to the data, but in your head, rather than in Swift.
>
> --
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Tue Mar 24 12:02:25 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 24 Mar 2009 17:02:25 +0000 (GMT)
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To: <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
 <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>
Message-ID:

you can express sequencing. that's what extern lets you do.

From foster at anl.gov Tue Mar 24 12:08:48 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 24 Mar 2009 12:08:48 -0500
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To:
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
 <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>
Message-ID: <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>

Only in a round-about way.
Maybe that is ok, it is just less intuitive than say:

seq {
   A();
   B();
}

(IMHO)

Ian.

From hategan at mcs.anl.gov Tue Mar 24 12:11:27 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 24 Mar 2009 12:11:27 -0500
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To: <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
 <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>
 <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>
Message-ID: <1237914687.22653.0.camel@localhost>

You seem to want swift to be PCN.

From benc at hawaga.org.uk Tue Mar 24 12:14:41 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 24 Mar 2009 17:14:41 +0000 (GMT)
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To: <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
 <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>
 <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>
Message-ID:

On Tue, 24 Mar 2009, Ian Foster wrote:

> Only in a round-about way. Maybe that is ok, it is just less intuitive
> than say:

I think if a user has intuitions that suggest they should be specifying
things in terms of execution order rather than describing dataflow
dependencies, they're probably going to get into other trouble programming
in SwiftScript, and those intuitions probably need to change.

> seq {
> A();
> B();
> }

--

From foster at anl.gov Tue Mar 24 12:19:31 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 24 Mar 2009 12:19:31 -0500
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To: <1237914687.22653.0.camel@localhost>
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
 <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>
 <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>
 <1237914687.22653.0.camel@localhost>
Message-ID: <338CA536-C657-4B78-839C-F500B2F97276@anl.gov>

I wish we could talk about the technical issue at hand, rather than
preconceptions of my motives.

I raised a technical issue: people sometimes want to express sequencing
that relates to external side effects (e.g., output). How do we do that?
We can, as Ben suggests, introduce an artificial data dependency. Or, we
can provide an operator that expresses the sequencing. Each have pros and
cons:

- The artificial data dependency is less clear in its semantics than the
  sequential operator
- The artificial data dependency requires that you modify procedures to
  sequence them
- The artificial data dependency is a real pain to write if you have to do
  it a lot
+ The artificial data dependency does not introduce a new operator
+ The artificial data dependency may allow for finer control over what is
  sequenced

I couldn't care less whether Swift looks like PCN. But I do have that
experience, and recall the considerable frustration of some users who
wanted to ensure that activity X happened before activity Y (whether for
side effects, or debugging) and had to go to great lengths to introduce
artificial data dependencies to achieve that goal.

Ian.

From wilde at mcs.anl.gov Tue Mar 24 12:23:30 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 24 Mar 2009 12:23:30 -0500
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To: <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
 <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>
 <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>
Message-ID: <49C91712.1060903@mcs.anl.gov>

I too think that it would be good to be able to say "sequential" without
explicitly wiring in the synchronization primitives.
Since we can do what we need in this regard for the moment, this is less
urgent but reasonable to discuss to put in the hopper for language
evolution discussion.

Towards that end, a few questions:

- does Karajan already support the necessary constructs to easily generate
  a seq{} block or proc?

- if not, does it make sense to provide a seq construct that essential
  compiles into the same as a user would generate doing this manually
  today with externals? Somewhat of a syntactic sugaring approach?

> seq {
> A();
> B();
> }

"compiles" to:

(ext1, r1) = A(v1);
(ext2, r2) = B(v2,ext1);

I am somewhat delighted with the externals, in the sense that we can
express what needs to be done for some real applications. But if the
simpler syntax is "low cost" to implement, it's worth considering (in my
opinion) if it doesn't break things or open a can of messy worms. But it's
low priority.

I don't want to get too deep into this just yet; let's treat it as food
for language-evolution discussion.

From foster at anl.gov Tue Mar 24 12:25:21 2009
From: foster at anl.gov (Ian Foster)
Date: Tue, 24 Mar 2009 12:25:21 -0500
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To: <49C91712.1060903@mcs.anl.gov>
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
 <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>
 <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>
 <49C91712.1060903@mcs.anl.gov>
Message-ID: <583C827A-79BA-413F-9A7E-5B2581181E02@anl.gov>

interestingly, Mike's suggestion of compiling to dummy sequencing
variables is how we implemented SEQ in PCN (I believe)

From benc at hawaga.org.uk Tue Mar 24 12:27:00 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 24 Mar 2009 17:27:00 +0000 (GMT)
Subject: [Swift-devel] could we set the job sequence without a file dependency?
In-Reply-To: <338CA536-C657-4B78-839C-F500B2F97276@anl.gov>
References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com>
 <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov>
 <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov>
 <1237914687.22653.0.camel@localhost>
 <338CA536-C657-4B78-839C-F500B2F97276@anl.gov>
Message-ID:

On Tue, 24 Mar 2009, Ian Foster wrote:

> I raised a technical issue: people sometimes want to express sequencing that
> relates to external side effects (e.g., output). How do we do that? We can, as
> Ben suggests, introduce an artificial data dependency. Or, we can provide an
> operator that expresses the sequencing.
Each have pros and cons: I don't really think (a lot of the time) that it is an artificial data dependency - in both Zhao's and Mike's cases, they have data dependencies. > - The artificial data dependency is less clear in its semantics than the > sequential operator I think they both make one thing happen before the other. I don't think either way is less clear. > - The artificial data dependency requires that you modify procedures to > sequence them > - The artificial data dependency is a real pain to write if you have to do it > a lot disagree > + The artificial data dependency may allow for finer control over what is > sequenced I think they're probably about equivalent. > I couldn't care less whether Swift looks like PCN. But I do have that > experience, and recall the considerable frustration of some users who wanted > to ensure that activity X happened before activity Y (whether for side > effects, or debugging) and had to go to great lengths to introduce artificial > data dependencies to achieve that goal. The sequencing-for-debugging thing is something I think that external dependencies are perhaps wrong for. Not sure if putting an operator into the language is the right way - we've talked about different execution modes too, which might be a better thing to do. -- From benc at hawaga.org.uk Tue Mar 24 12:27:46 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 24 Mar 2009 17:27:46 +0000 (GMT) Subject: [Swift-devel] could we set the job sequence without a file dependency? In-Reply-To: <49C91712.1060903@mcs.anl.gov> References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <49C91712.1060903@mcs.anl.gov> Message-ID: On Tue, 24 Mar 2009, Michael Wilde wrote: > - if not, does it make sense to provide a seq construct that essential > compiles into the same as a user would generate doing this manually today with > externals?
Somewhat of a syntactic sugaring approach? that's monads in haskell pretty much! woo! -- From foster at anl.gov Tue Mar 24 12:38:27 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 24 Mar 2009 12:38:27 -0500 Subject: [Swift-devel] could we set the job sequence without a file dependency? In-Reply-To: References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <1237914687.22653.0.camel@localhost> <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> Message-ID: <548AF8BE-2DE3-4820-A1A7-F82533AA8AA5@anl.gov> > >> - The artificial data dependency is a real pain to write if you >> have to do it >> a lot > > disagree This was the case in PCN when people were trying to do printf-like output. So you introduced a version of printf that bound a variable when it was done, and then wrote something like: printf("Line 1",X); wait(X) -> printf("Line 2", Y); wait(Y) -> ... which is already a bit messy. But if you had some procedures that you wanted to call, say print-a-bunch() and print-a-bunch-more(), then you had to modify them to produce dummy variables, and things got really messy relative to: seq { printf("Line 1"); print-a-bunch(); printf("Line 2"); ... } BUT -- maybe we don't really do output like this in Swift programs (I don't think it is common, is it)? Enough of my reminiscing ... > The sequencing-for-debugging thing is something I think that external > dependencies is perhaps wrong for. Not sure if putting an operator > into > the language is the right way - we've talked about different > execution modes too, which might be a better thing to do. I agree. From hategan at mcs.anl.gov Tue Mar 24 12:49:37 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 24 Mar 2009 12:49:37 -0500 Subject: [Swift-devel] could we set the job sequence without a file dependency? 
In-Reply-To: <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <1237914687.22653.0.camel@localhost> <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> Message-ID: <1237916977.22993.26.camel@localhost> On Tue, 2009-03-24 at 12:19 -0500, Ian Foster wrote: > I wish we could talk about the technical issue at hand, rather than > preconceptions of my motives. Then perhaps it may be useful not to justify a technical solution based on the barely explained needs that arose in different circumstances which are inseparable for us from your preconceptions or motives. In other words, the argument "swift should do it because it was done in PCN" without proper justification is not a very good argument. And if there is proper justification, then the mentioning of PCN becomes unnecessary. > > I raised a technical issue: people sometimes want to express > sequencing that relates to external side effects (e.g., output). How > do we do that? We can, as Ben suggests, introduce an artificial data > dependency. The data dependency is not artificial. It exists, as can be seen from the email exchanges between Ben and Zhao (and Mike). The issue at hand is instructing swift not to manage that data dependency. > Or, we can provide an operator that expresses the > sequencing. Each have pros and cons: > > - The artificial data dependency is less clear in its semantics than > the sequential operator Perhaps so, if you only look at this problem separately from the rest of swift. I believe that if one is to describe a problem in terms of dependencies (something necessary with swift), any kind of explicit sequencing becomes unusual, requiring the user to think about what happens behind the scenes.
May I also point out that some (relatively) successful languages based on strong theoretical models (such as Haskell) have made the effort to devise ways of dealing with sequencing that are consistent with the basic model of that language, instead of choosing the seemingly easy solution. > - The artificial data dependency requires that you modify procedures > to sequence them No. Like the rest of swift, it requires you to specify what your application has as input and what your application has as output. > - The artificial data dependency is a real pain to write if you have > to do it a lot Anything is a real pain to write if you have to do it a lot. > + The artificial data dependency does not introduce a new operator > + The artificial data dependency may allow for finer control over what > is sequenced > > I couldn't care less whether Swift looks like PCN. But I do have that > experience, and recall the considerable frustration of some users who > wanted to ensure that activity X happened before activity Y (whether > for side effects, or debugging) and had to go to great lengths to > introduce artificial data dependencies to achieve that goal. I can also see considerable frustration of some users who try to interpret swift programs in their heads, and then the subsequent disappearance of the frustration when they realize that writing down a formal description of their problem takes care of both correctness and efficiency. From foster at anl.gov Tue Mar 24 12:54:04 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 24 Mar 2009 12:54:04 -0500 Subject: [Swift-devel] could we set the job sequence without a file dependency? 
In-Reply-To: <1237916977.22993.26.camel@localhost> References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <1237914687.22653.0.camel@localhost> <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> <1237916977.22993.26.camel@localhost> Message-ID: That was not what I said. But enough of this. On Mar 24, 2009, at 12:49 PM, Mihael Hategan wrote: > In other words, the argument "swift should do it because it was done > in > PCN" without proper justification is not a very good argument. And if > there is proper justification, then the mentioning of PCN becomes > unnecessary. From hategan at mcs.anl.gov Tue Mar 24 13:05:49 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 24 Mar 2009 13:05:49 -0500 Subject: [Swift-devel] could we set the job sequence without a file dependency? In-Reply-To: References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <1237914687.22653.0.camel@localhost> <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> <1237916977.22993.26.camel@localhost> Message-ID: <1237917949.23612.9.camel@localhost> On Tue, 2009-03-24 at 12:54 -0500, Ian Foster wrote: > That was not what I said. But enough of this. Right. What you said was ambiguous, but you seem to have implied that seq solved many problems in PCN, out of which you mentioned one that is concrete (output). It would follow that seq would also solve many problems in swift. It would have been more helpful to mention the issue of output by itself, in more detail. But as you say, enough of this :) From foster at anl.gov Tue Mar 24 13:08:55 2009 From: foster at anl.gov (Ian Foster) Date: Tue, 24 Mar 2009 13:08:55 -0500 Subject: [Swift-devel] could we set the job sequence without a file dependency? 
In-Reply-To: <1237917949.23612.9.camel@localhost> References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <1237914687.22653.0.camel@localhost> <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> <1237916977.22993.26.camel@localhost> <1237917949.23612.9.camel@localhost> Message-ID: I agree on both fronts! :) On Mar 24, 2009, at 1:05 PM, Mihael Hategan wrote: > On Tue, 2009-03-24 at 12:54 -0500, Ian Foster wrote: >> That was not what I said. But enough of this. > > Right. What you said was ambiguous, but you seem to have implied that > seq solved many problems in PCN, out of which you mentioned one that > is > concrete (output). It would follow that seq would also solve many > problems in swift. > > It would have been more helpful to mention the issue of output by > itself, in more detail. > > But as you say, enough of this :) > > From zhaozhang at uchicago.edu Tue Mar 24 15:59:15 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 24 Mar 2009 15:59:15 -0500 Subject: [Swift-devel] "getValue called in an external dataset" issue Message-ID: <49C949A3.2040409@uchicago.edu> Hi, I keep getting this message "getValue called in an external dataset" when I run the following swift script. I know it comes from the external data type. Could anyone point me to the error in the script? The "external bout" is the external data file. Thanks.
type Mol {}
type Bin {}
type Common {}
type DOCKOut {}
type Molout{}
type DockRunSummary {}
type file {}

(external bout)bcast (Bin binary, Common flex_def, Common flex_tbl,
                      Common grid_bmp, Common grid_in, Common grid_nrg,
                      Common grid_out, Common rec_box, Common receptor,
                      Common sample_grid, Common selected_spheres,
                      Common template_in, Common vdw, Common awkscript) {
  app {
    bcast @filename(binary) @filename(flex_def) @filename(flex_tbl)
          @filename(grid_bmp) @filename(grid_in) @filename(grid_nrg)
          @filename(grid_out) @filename(rec_box) @filename(receptor)
          @filename(sample_grid) @filename(selected_spheres)
          @filename(template_in) @filename(vdw) @filename(awkscript);
  }
}

app (DOCKOut ofile) rundock_zhao (Mol molfile, Bin binary, Common flex_def,
                                  Common flex_tbl, Common grid_bmp, Common grid_in,
                                  Common grid_nrg, Common grid_out, Common rec_box,
                                  Common receptor, Common sample_grid,
                                  Common selected_spheres, Common template_in,
                                  Common vdw, Common awkscript, external bout) {
  rundock_zhao @filename(molfile) @filename(binary) @filename(flex_def)
               @filename(flex_tbl) @filename(grid_bmp) @filename(grid_in)
               @filename(grid_nrg) @filename(grid_out) @filename(rec_box)
               @filename(receptor) @filename(sample_grid)
               @filename(selected_spheres) @filename(template_in)
               @filename(vdw) @filename(awkscript) stdout=@filename(ofile);
}

app (DockRunSummary finalfile) sumdockresults(DOCKOut r[] ) {
  summary @filenames(r) stdout=@filename(finalfile);
}

Mol texts[] ;

(DOCKOut result[])doall(Mol texts[]) {
  Bin binary <"common/dock6.O3.cn">;
  Common flex_def <"common/flex.defn">;
  Common flex_tbl <"common/flex_drive.tbl">;
  Common grid_bmp <"common/grid.bmp">;
  Common grid_in <"common/grid.in">;
  Common grid_nrg <"common/grid.nrg">;
  Common grid_out <"common/grid.out">;
  Common rec_box <"common/rec_box.pdb">;
  Common receptor <"common/receptor_charged.mol2">;
  Common sample_grid <"common/sample_grid.in">;
  Common selected_spheres <"common/selected_spheres.sph">;
  Common template_in <"common/template.in">;
  Common vdw <"common/vdw_AMBER_parm99.defn">;
  Common awkscript <"common/awkscript">;
  external bout;
  bout=bcast(binary, flex_def, flex_tbl, grid_bmp, grid_in, grid_nrg,
             grid_out, rec_box, receptor, sample_grid, selected_spheres,
             template_in, vdw, awkscript);
  foreach p,i in texts {
    result[i] = rundock_zhao(p, binary, flex_def, flex_tbl, grid_bmp, grid_in,
                             grid_nrg, grid_out, rec_box, receptor, sample_grid,
                             selected_spheres, template_in, vdw, awkscript, bout);
  }
}

// Main
DockRunSummary summary <"summary.txt">;
DOCKOut result[];
result=doall(texts);
summary = sumdockresults(result);

From benc at hawaga.org.uk Tue Mar 24 16:05:50 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 24 Mar 2009 21:05:50 +0000 (GMT) Subject: [Swift-devel] "getValue called in an external dataset" issue In-Reply-To: <49C949A3.2040409@uchicago.edu> References: <49C949A3.2040409@uchicago.edu> Message-ID: On Tue, 24 Mar 2009, Zhao Zhang wrote: > Hi, I keep getting this message "getValue called in an external dataset" when > I run the following swift script. I know it comes from the external data type, > could anyone point me the error in the script? The "external bout" is the > external data file. Thanks. It is coming from something trying to make that external into a string. I think it probably is not breaking your run, though (?) I've seen it in some circumstances myself but have not investigated more. -- From zhaozhang at uchicago.edu Tue Mar 24 16:16:51 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 24 Mar 2009 16:16:51 -0500 Subject: [Swift-devel] "getValue called in an external dataset" issue In-Reply-To: References: <49C949A3.2040409@uchicago.edu> Message-ID: <49C94DC3.5080604@uchicago.edu> yep, it doesn't harm, just annoying there. I will just leave it as is. zhao Ben Clifford wrote: > On Tue, 24 Mar 2009, Zhao Zhang wrote: > > >> Hi, I keep getting this message "getValue called in an external dataset" when >> I run the following swift script.
I know it comes from the external data type, >> could anyone point me the error in the script? The "external bout" is the >> external data file. Thanks. >> > > It is coming from something trying to make that external into a string. > > I think it probably is not breaking your run, though (?) > > I've seen it in some circumstances myself but have not investigated more. > > > From benc at hawaga.org.uk Tue Mar 24 16:37:35 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 24 Mar 2009 21:37:35 +0000 (GMT) Subject: [Swift-devel] could we set the job sequence without a file dependency? In-Reply-To: <548AF8BE-2DE3-4820-A1A7-F82533AA8AA5@anl.gov> References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <1237914687.22653.0.camel@localhost> <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> <548AF8BE-2DE3-4820-A1A7-F82533AA8AA5@anl.gov> Message-ID: On Tue, 24 Mar 2009, Ian Foster wrote: > This was the case in PCN when people were trying to do printf-like output. So > you introduced a version of printf that bound a variable when it was done, and > then wrote something like: > BUT -- maybe we don't really do output like this in Swift programs (I don't > think it is common, is it)? Its not common to do "meat output" like that, no - all of that comes from procedures which generate files, which tend to not need sequencing of output. Most of what you've talked about seems debugging oriented - none of the debugging I've done with people has really felt like I've wanted that kind of sequence operator. But I think some debugging-oriented something (be it language features, runtime options, something) would be useful - but its not clear to me what. 
-- From bugzilla-daemon at mcs.anl.gov Tue Mar 24 16:42:42 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 24 Mar 2009 16:42:42 -0500 (CDT) Subject: [Swift-devel] [Bug 190] New: "getValue called in an external dataset" appearing on console Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=190 Summary: "getValue called in an external dataset" appearing on console Product: Swift Version: unspecified Platform: PC OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: benc at hawaga.org.uk The message "getValue called in an external dataset" is reported on the console when external datasets are used, which is disturbing. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Tue Mar 24 22:00:02 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 24 Mar 2009 22:00:02 -0500 Subject: [Swift-devel] printing and library functions In-Reply-To: References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <1237914687.22653.0.camel@localhost> <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> <548AF8BE-2DE3-4820-A1A7-F82533AA8AA5@anl.gov> Message-ID: <49C99E32.3090405@mcs.anl.gov> [was: Re: [Swift-devel] could we set the job sequence without a file dependency?] [digression on] From what Ive done with oops over the past month I observe a need to print the status of the workflow as it proceeds. I guess thats a kind of "debugging" - watching to make sure a workflow is making progress. I *used* to tail the debug log; then I started tailing that log to watch and count specific status messages. 
Now Ive taken to entirely watching just stdout, and only looking in the log when things fail and the output file doesnt clearly say why. I feel its a good step forward that I can do this. But users will, I believe, want some way beyond just the periodic status line to show how their script is progressing. I used trace, which is OK. I think this need will grow as I hand oops over to the people who will be running it to do science (and by the way, thanks to all of you, that is happening as we speak!). This comment, btw, is not on the topic of "seq{}", but rather on the need for output primitives. So, while we're on this topic and digressing: I often missed various string and data handling operations in Swift. At a few points in developing oops, I was delighted to find that I could implement almost any library function I wanted (including string functions like sprintf/fprintf, intToStr(), etc.) using a 2-step call: an app to call the external function, and if needed a proc to wrap it and use readData() or readdata2() to extract back the return value(s). It turned out to be pretty reasonable to define a nice sprintf at the moment by passing a varying number of arguments as an array of strings using an array constructor: [ s1, s1, itos(i1), ftos(f1), btos(b1) ] using one layer of primitives to convert all objects to strings, and then passing the string array to sprintf(), and back to trace(). Discovering how to do this eliminated a request I had long been repressing for "easier to implement built-in functions". The fact that readdata accepted a file in addition to a file name was critical. A few things that do come to mind on this, to permit more flexible (or aesthetic) library functions, are: - varying length arg lists (or) - arrays of type Object (ie heterogeneous arrays) - \ escapes in strings (\n \t etc) - sizeof(array) - isSet() etc: See if a var is set or not. cant recall now where I wanted this, need to check my notes.
Not very important, I think this was to hack around some other issue, cant recall what. Im sure its a functional and synchronization no-no. One might define the first two in various ways related to each other. So out of all these things, the only one that might be nice to do sooner is the \ escapes in strings, iff thats easy. Just thought I would toss those out in case others have seen similar needs and see easy ways to implement them. Im excited about the prospects of seeing a Swift library grow. I'll try to package up the few functions Ive played with to spark some ideas and library development. We should start batting around ideas for libraries, and for naming, etc, but a simple include mechanism for now will get us far. - Mike (the astute will notice that I have neglected to set [digression off]. That may or may not have been intentional... ;) On 3/24/09 4:37 PM, Ben Clifford wrote: > On Tue, 24 Mar 2009, Ian Foster wrote: > >> This was the case in PCN when people were trying to do printf-like output. So >> you introduced a version of printf that bound a variable when it was done, and >> then wrote something like: > >> BUT -- maybe we don't really do output like this in Swift programs (I don't >> think it is common, is it)? > > Its not common to do "meat output" like that, no - all of that comes from > procedures which generate files, which tend to not need sequencing of > output. > > Most of what you've talked about seems debugging oriented - none of the > debugging I've done with people has really felt like I've wanted that kind > of sequence operator. But I think some debugging-oriented something (be it > language features, runtime options, something) would be useful - but its > not clear to me what.
> From benc at hawaga.org.uk Wed Mar 25 07:13:49 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Mar 2009 12:13:49 +0000 (GMT) Subject: [Swift-devel] printing and library functions In-Reply-To: <49C99E32.3090405@mcs.anl.gov> References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <1237914687.22653.0.camel@localhost> <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> <548AF8BE-2DE3-4820-A1A7-F82533AA8AA5@anl.gov> <49C99E32.3090405@mcs.anl.gov> Message-ID: the need for better progress monitoring of a run doesn't necessarily correspond with a need for more output primitives in the language. there are lots of other possibilities. For example (not that I particularly like it, but its a radical departure from seq/printf so illustrates my point), imagine your source code appearing in a window with every for loop and dataset declaration annoted (in colour or text) by its state/progress. -- From wilde at mcs.anl.gov Wed Mar 25 07:24:09 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 25 Mar 2009 07:24:09 -0500 Subject: [Swift-devel] printing and library functions In-Reply-To: References: <49C8E5D3.5080207@gmail.com> <49C8E7FB.7090807@gmail.com> <43FF9D6D-0B79-452B-8D8C-24102E76D9B0@anl.gov> <9E0FDB93-DC52-4FF5-8D7E-1043B0FDB908@anl.gov> <1237914687.22653.0.camel@localhost> <338CA536-C657-4B78-839C-F500B2F97276@anl.gov> <548AF8BE-2DE3-4820-A1A7-F82533AA8AA5@anl.gov> <49C99E32.3090405@mcs.anl.gov> Message-ID: <49CA2269.4060808@mcs.anl.gov> On 3/25/09 7:13 AM, Ben Clifford wrote: > the need for better progress monitoring of a run doesn't necessarily > correspond with a need for more output primitives in the language. I think at most we need a plain print, like trace but unadorned. And we can certainly live with just trace (as primitives). > there are lots of other possibilities. 
For example (not that I > particularly like it, but its a radical departure from seq/printf so > illustrates my point), imagine your source code appearing in a window with > every for loop and dataset declaration annoted (in colour or text) by its > state/progress. Yes, that would be very cool. Another simpler step is that progress monitoring shows a top-like running tally by procedure of how many have executed, how many are pending, how many are yet un-entered, what each proc's total and average run time was, etc. We can give a lot more useful information in place simple text. This would eliminate the need for much printf'ing, leaving only application specific status to print (eg: in my convergence loop, printing the current vs target value each time around the iterate). From benc at hawaga.org.uk Wed Mar 25 08:43:07 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 25 Mar 2009 13:43:07 +0000 (GMT) Subject: [Swift-devel] problems with external dependencies In-Reply-To: References: <49C6AD6C.8020509@mcs.anl.gov> Message-ID: (this belongs on swift-devel - i moved it) On Sun, 22 Mar 2009, Ian Foster wrote: > I can't recall whether we talked about this before, but if we could > choose to run in a mode whereby compound procedures wait for all input > parameters, that could simplify debugging. But maybe the semantics are > now rich enough that this would not necessarily be correct. There are a few things in the above paragraph that I think can be considered separately: * should it be possible to make compound procedures wait for all of their inputs to be closed, rather than letting components do this? perhaps. you suggest debugging. I can't think of other use cases. So see other stuff I have written saying that debugging can perhaps be approached in other ways. * does this simplify debugging unsure. I've never come across the need for it when I've been doing things. I'm unsure if others have. 
It feels to me a bit like "heres something to jiggle dependency handling" but without being clearly more useful. * is an execution mode the way to do this? I think no. It would make the language more symmetrical to allow compound procedures to wait for their inputs (at the expense of parallelism). I think making this a "mode" is the wrong thing to do. It might be better expressed as a modifier on procedure declarations, eg: (making up terms strict and ready): strict (int i)myproc(external ds) { .... } or (int i)myproc(ready external ds) { ... } signifying (respectively) that all inputs must be ready, or that the specified input must be ready. That would not be hard to implement. -- From wilde at mcs.anl.gov Wed Mar 25 13:55:42 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 25 Mar 2009 13:55:42 -0500 Subject: [Swift-devel] Provenance use case Message-ID: <49CA7E2E.5070106@mcs.anl.gov> As the OOPS guys are close to running, they would like and would benefit from provenance recording. Their (and our) requirements are something like below. Im sending this out now because its fresh in my mind after talking to Glen, but probably cant get deep into a discussion of it for the next month or so. I hope its of some use in steering and setting priorities for provenance work. I'd like to do the same for other groups, focusing on CNARI where provenance is a funded, committed deliverable. - Mike Swift project needs: - record all their runs so we can track the (hopeful) growth in their usage of swift - collect all in one place and report runs and usage by system, user, etc.
OOPS User needs:
- track all their runs so they can readily find all their generated data
- know what input parameters were used, both swift level and what config file settings were passed in
- exactly what (svn) code rev was run: unlike users of canned apps, these guys run their own code and are constantly changing it
- how fast did each run go as a function of code rev, input args, and target system type
- record in an annotation database some science characteristics about each run: both on the individual output files and on a set of simulations (called a "round"). These attributes are gleaned or computed by examining output, possibly doing some computations on it, and in many cases averaging and finding ranges and other stats from a round. (A round is 100 to 2000 runs of their simulation program. Their simulation has a notion of "goodness" - ie how close to a known protein structure did the simulation get. Thats a key attribute they compute and track.)

A note: the svn code rev tracking issue raises interesting needs. Presumably the oops swift script will change little across oops code revs, but you want some kind of traceability from:
  the app() proc name
  the tc.data entry name
  the tc.data entry path
  what svn rev that path was pointing to or symlinked to at the moment of execution
How this is managed will vary from group to group, but we can set and suggest some standards that make tracking practical. In the case of oops, the SVN rev is placed in a REVISION file near the top of the dist tree (and could be placed anywhere where provenance recording might be able to pick it up from). In recent oops testing, we make every src dir on every site an svn checkout, and do svn update to bring the tree up to date. Hence the path in tc.data doesnt change as the code evolves.
An earlier strategy we tried is that we generated src distros on a central host like communicado, put the svn rev in the distro's tarball name, and top-level dir, and extracted it to that dir on each target site, and built the code. Then a symlink was adjusted to point to the "latest" rev, and thus the value of the symlink contained the svn rev. I think some way to track this is a real provenance requirement. Its easy to do in arbitrary ad-hoc ways, but rather harder to implement a single uniform way to grab such information at the time that a binary executable is chosen. Theres also the major issue of info available on the submit host where tc.data resides vs that which can be captured at runtime eg in wrapper.sh. And how to make it low-overhead, etc. From benc at hawaga.org.uk Thu Mar 26 14:17:24 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 26 Mar 2009 19:17:24 +0000 (GMT) Subject: [Swift-devel] r2743 introduces redundant ress-to-site-catalog Message-ID: in r2743 you committed usertools/swift/ress-to-site-catalog which appears to be a slightly divergent version of the existing swift-osg-ress-site-catalog that is in the main Swift bin directory. that seems a bad thing to do. -- From benc at hawaga.org.uk Thu Mar 26 16:05:36 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 26 Mar 2009 21:05:36 +0000 (GMT) Subject: [Swift-devel] Re: Provenance use case In-Reply-To: <49CA7E2E.5070106@mcs.anl.gov> References: <49CA7E2E.5070106@mcs.anl.gov> Message-ID: On Wed, 25 Mar 2009, Michael Wilde wrote: > - record all their runs so we can track the (hopeful) growth in their usage of > swift so that sounds like you want number of jobs and duration of jobs? > - collect all in one place and report runs and usage by system, user, etc. by 'system', you mean site as in what is defined in sites.xml? user is determinable by submit-side unix user? what is etc?
> - track all their runs so they can readily find all their generated data elaborate on this - you want to get a list of every data file generated with details of what was run to create it? or different to that? > - exactly what (svn) code rev was run: unlike users of canned apps, these guys > run their own code and are constantly changing it This could be determined for every invocation remotely and collected. That might be expensive. Less reliably it could be measured once per site, but that would not detect someone changing the software in the middle of a run. > - record in an annotation database some science characteristics about each > run: both on the individual output files and on a set of simulations (called a > "rsound"). These attributes are gleaned or computed by examining output, > posibly doing some computations on it, and in many cases averaging and finding > ranges and other stats from a round. > (A round is 100 to 2000 runs of their simulation program. Their simulation has > a notion of "goodness" - ie how close to a known protein structure did the > simulation get. Thats a key attribute they compute and track.) most of that sounds like simple database work that does not need swift to be involved. where it does tie into Swift is getting SQL keys to relate stuff in the provenance database to this annotation data. The following could be useful there: a globally unique URI to identify a particular run (something like the run-id, packaged as a URI) For data, there are two different ways you can label. One is a unique dataset identifier that is different per-run (that is, you run first.swift twice and the output dataset has a different URI in each run, even though it has the same filename, hello.txt, in both runs); and second is the filename, which is easy to look at but doesn't take into account that files are mutable on the filesystem. For storing annotations about data, presumably you would want to use one of those two. 
Filename you can see easily without interacting with Swift.
If you want to use the dataset ID, then probably you would need some way
to give you those dataset IDs (eg feed in a filename and a run ID and get
told the dataset ID).

> I think some way to track this is a real provenance requirement.

yes

> Its easy to do in arbitrary ad-hoc ways, but rather harder to implement
> a single uniform way to grab such information at the time that a binary
> executable is chosen. Theres also the major issue of info available on
> the submit host where tc.data resides vs that which can be captured at
> runtime eg in wrapper.sh. And how to make it low-overhead, etc.

yes

--

From benc at hawaga.org.uk Thu Mar 26 16:27:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 26 Mar 2009 21:27:04 +0000 (GMT)
Subject: [Swift-devel] Re: Provenance use case
In-Reply-To: <49CBF13D.9090708@uchicago.edu>
References: <49CA7E2E.5070106@mcs.anl.gov> <49CBF13D.9090708@uchicago.edu>
Message-ID:

On Thu, 26 Mar 2009, Glen Hocky wrote:

> Suggestion for this. What about if we wrote a script that echo's any relevant
> data (e.g. svn-code-rev) in whatever format you thing best. If this is
> installed allong with the rest of the code, we could just stick it into the
> tc.data file and write an app wrapper for it. Would that make it easier to
> incorporate this information into a provenance collection feature of swift?

That is one way it could work, yes. Then Swift might need some way of
saying "run this once per site" rather than its present execution model.

Another way is to gather the information during each run, by allowing
users to specify something like "run this command at the start of each
job, and store its text output somewhere".

The second way above would be closer to what is actually being run, so
would probably be more trustworthy, though more expensive too.
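
The second approach ("run this command at the start of each job, and store its text output somewhere") might look roughly like this inside a job wrapper. It is a sketch only: JOB_INFO_CMD, record_job_info and the INFO= log line are hypothetical names chosen for illustration, not anything Swift defines.

```shell
#!/bin/sh
# Hypothetical sketch: run a user-supplied command once at the start of a job
# and append its text output to a log file. JOB_INFO_CMD and the "INFO=" line
# format are assumptions made for this example.
record_job_info() {
    logfile=$1
    # Nothing to record if the user did not configure a command.
    if [ -z "${JOB_INFO_CMD:-}" ]; then
        return 0
    fi
    # Capture stdout of the command; a failing command must not kill the job.
    info=$(sh -c "$JOB_INFO_CMD" 2>/dev/null) || info="(command failed)"
    printf 'INFO=%s\n' "$info" >> "$logfile"
}
```

Because the command runs once per job, the cost is paid on every invocation, which is the trade-off noted above: closer to what actually ran, but more expensive than measuring once per site.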
-- From hockyg at uchicago.edu Thu Mar 26 16:18:53 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 26 Mar 2009 16:18:53 -0500 Subject: [Swift-devel] Re: Provenance use case In-Reply-To: References: <49CA7E2E.5070106@mcs.anl.gov> Message-ID: <49CBF13D.9090708@uchicago.edu> > >> > - exactly what (svn) code rev was run: unlike users of canned apps, these guys >> > run their own code and are constantly changing it >> > > This could be determined for every invocation remotely and collected. That > might be expensive. Less reliably it could be measured once per site, but > that would not detect someone changing the software in the middle of a > run. > Suggestion for this. What about if we wrote a script that echo's any relevant data (e.g. svn-code-rev) in whatever format you thing best. If this is installed allong with the rest of the code, we could just stick it into the tc.data file and write an app wrapper for it. Would that make it easier to incorporate this information into a provenance collection feature of swift? Ben Clifford wrote: > On Wed, 25 Mar 2009, Michael Wilde wrote: > > >> - record all their runs so we can track the (hopeful) growth in their usage of >> swift >> > > so that sounds like you want number of jobs and duration of jobs? > > >> - collect all in one place and report runs and usage by system, user, etc. >> > > by 'system', you mean site as in what is defined in sites.xml? > > user is determinable by submit-side unix user? > > what is etc? > > >> - track all their runs so they can readily find all their generated data >> > > elaborate on this - you want to get a list of every data file generated > with details of what was run to create it? or different to that? > > > >> - exactly what (svn) code rev was run: unlike users of canned apps, these guys >> run their own code and are constantly changing it >> > > This could be determined for every invocation remotely and collected. That > might be expensive. 
Less reliably it could be measured once per site, but > that would not detect someone changing the software in the middle of a > run. > > > >> - record in an annotation database some science characteristics about each >> run: both on the individual output files and on a set of simulations (called a >> "rsound"). These attributes are gleaned or computed by examining output, >> posibly doing some computations on it, and in many cases averaging and finding >> ranges and other stats from a round. >> (A round is 100 to 2000 runs of their simulation program. Their simulation has >> a notion of "goodness" - ie how close to a known protein structure did the >> simulation get. Thats a key attribute they compute and track.) >> > > most of that sounds like simple database work that does not need swift to > be involved. where it does tie into Swift is getting SQL keys to relate > stuff in the provenance database to this annotation data. > > The following could be useful there: a globally unique URI to identify a > particular run (something like the run-id, packaged as a URI) > > For data, there are two different ways you can label. One is a unique > dataset identifier that is different per-run (that is, you run first.swift > twice and the output dataset has a different URI in each run, even though > it has the same filename, hello.txt, in both runs); and second is the > filename, which is easy to look at but doesn't take into account that > files are mutable on the filesystem. > > For storing annotations about data, presumably you would want to use one > of those two. Filename you can see easily without interacting with Swift. > If you want to use the dataset ID, then proabably you would need some way > to give you those dataset IDs (eg feed in a filename and a run ID and get > told the dataset ID). > > >> I think some way to track this is a real provenance requirement. 
>> > > yes > > >> Its easy to do in arbitrary ad-hoc ways, but rather harder to implement >> a single uniform way to grab such information at the time that a binary >> executable is chosen. Theres also the major issue of info available on >> the submit host where tc.data resides vs that which can be captured at >> runtime eg in wrapper.sh. And how to make it low-overhead, etc. >> > > yes > > From bugzilla-daemon at mcs.anl.gov Fri Mar 27 03:52:16 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 27 Mar 2009 03:52:16 -0500 (CDT) Subject: [Swift-devel] [Bug 189] If .kml file is empty, swift exits swiftly with cryptic error message In-Reply-To: References: Message-ID: <20090327085216.D42FE2CCDC@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=189 Ben Clifford changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED --- Comment #2 from Ben Clifford 2009-03-27 03:52:16 --- should be fixed in r2746 -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From benc at hawaga.org.uk Fri Mar 27 04:54:46 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Mar 2009 09:54:46 +0000 (GMT) Subject: [Swift-devel] exit codes treated as output data rather than status indicators Message-ID: I played with this patch a few weeks ago and never did anything with it. http://www.ci.uchicago.edu/~benc/tmp/lcc-feature-exit-codes-nonfatal-1.patch It allows status codes for application executables to be staged out as a file, rather than treated as success/failure, as demonstrated in the below fragment. If exitcode= is not specified, then the exitcode is handle as before. Other causes of failure (such as missing files) continue to cause execution errors as before. 
    (messagefile t, messagefile ec) greeting() {
        app {
            _false stdout=@filename(t) exitcode=@ec;
        }
    }

    (outfile, ecfile) = greeting();

    int ec = readData(ecfile);
    trace("Exit code",ec);

--

From skenny at uchicago.edu Fri Mar 27 10:23:36 2009
From: skenny at uchicago.edu (skenny at uchicago.edu)
Date: Fri, 27 Mar 2009 10:23:36 -0500 (CDT)
Subject: [Swift-devel] swift is rad
Message-ID: <20090327102336.BUP29729@m4500-02.uchicago.edu>

Progress: Stage out:12 Finished successfully:196596
Final status: Finished successfully:196608
Cleaning up...

From hategan at mcs.anl.gov Fri Mar 27 10:34:04 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 27 Mar 2009 10:34:04 -0500
Subject: [Swift-devel] swift is rad
In-Reply-To: <20090327102336.BUP29729@m4500-02.uchicago.edu>
References: <20090327102336.BUP29729@m4500-02.uchicago.edu>
Message-ID: <1238168044.3922.0.camel@localhost>

Good news. I assume the fix worked. What was the total time?

On Fri, 2009-03-27 at 10:23 -0500, skenny at uchicago.edu wrote:
> Progress: Stage out:12 Finished successfully:196596
> Final status: Finished successfully:196608
> Cleaning up...
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Fri Mar 27 10:38:20 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 27 Mar 2009 10:38:20 -0500
Subject: [Swift-devel] swift is rad
In-Reply-To: <20090327102336.BUP29729@m4500-02.uchicago.edu>
References: <20090327102336.BUP29729@m4500-02.uchicago.edu>
Message-ID: <49CCF2EC.7050100@mcs.anl.gov>

Awesome! Very nice work, Sarah and everyone!!!

Eager to see the rest of the stats and the plots.

Time to update the Web page...

- Mike

On 3/27/09 10:23 AM, skenny at uchicago.edu wrote:
> Progress: Stage out:12 Finished successfully:196596
> Final status: Finished successfully:196608
> Cleaning up...
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From benc at hawaga.org.uk Fri Mar 27 15:25:16 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Mar 2009 20:25:16 +0000 (GMT) Subject: [Swift-devel] if you are a wrapper.sh hacker... Message-ID: if you hack at wrapper.sh you should study r2747 to see why your changes no longer seem to do anything. -- From benc at hawaga.org.uk Fri Mar 27 16:16:28 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Mar 2009 21:16:28 +0000 (GMT) Subject: [Swift-devel] Re: Provenance use case In-Reply-To: References: <49CA7E2E.5070106@mcs.anl.gov> <49CBF13D.9090708@uchicago.edu> Message-ID: On Thu, 26 Mar 2009, Ben Clifford wrote: > Another way is to gather the information during each run, by allowing > users to specify something like "run this command at the start of each > job, and store its text output somewhere". Swift r2748 adds a SWIFT_EXTRA_INFO environment setting, that you can use like this: /var/tmp 0 echo monkey foo to get a line like this in your wrapper log: EXTRAINFO=monkey foo See if you can get the version information you want passed back through that mechanism. -- From wilde at mcs.anl.gov Fri Mar 27 16:22:15 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 27 Mar 2009 16:22:15 -0500 Subject: [Swift-devel] if you are a wrapper.sh hacker... In-Reply-To: References: Message-ID: <49CD4387.2090406@mcs.anl.gov> OK, so we should rename the cio-modified version of wrapper.sh to _swiftwrap and pick up the corresponding name change from vdl-int.k. Anything else to worry about? On 3/27/09 3:25 PM, Ben Clifford wrote: > if you hack at wrapper.sh you should study r2747 to see why your changes > no longer seem to do anything. 
> From benc at hawaga.org.uk Fri Mar 27 16:26:38 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Mar 2009 21:26:38 +0000 (GMT) Subject: [Swift-devel] if you are a wrapper.sh hacker... In-Reply-To: <49CD4387.2090406@mcs.anl.gov> References: <49CD4387.2090406@mcs.anl.gov> Message-ID: On Fri, 27 Mar 2009, Michael Wilde wrote: > OK, so we should rename the cio-modified version of wrapper.sh to _swiftwrap > and pick up the corresponding name change from vdl-int.k. > > Anything else to worry about? Nope, name change is all that matters. -- From wilde at mcs.anl.gov Fri Mar 27 16:28:58 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 27 Mar 2009 16:28:58 -0500 Subject: [Swift-devel] Re: Provenance use case In-Reply-To: References: <49CA7E2E.5070106@mcs.anl.gov> <49CBF13D.9090708@uchicago.edu> Message-ID: <49CD451A.20208@mcs.anl.gov> cool, thanks. so instead of echo monkey foo, this profile entry should go on the tc.data entry, and do something like: getAppVersion.sh appName appPath which can either specify the version info directly in the tc.data entry (for one style of usage) or dynamically hunt it down in the app's directory, symlink, etc. Sounds like a reasonable and flexible experiment to start out with. It begs for a bit of automation in the generation of tc.data, which we can experiment with at the same time. Next we need to work on collection and summarization of the info from wrapper.sh. Are you doing any of that in your current provenance work? On 3/27/09 4:16 PM, Ben Clifford wrote: > On Thu, 26 Mar 2009, Ben Clifford wrote: > >> Another way is to gather the information during each run, by allowing >> users to specify something like "run this command at the start of each >> job, and store its text output somewhere". 
> > Swift r2748 adds a SWIFT_EXTRA_INFO environment setting, that you can use > like this: > > > > > /var/tmp > 0 > echo monkey foo > > > to get a line like this in your wrapper log: > > EXTRAINFO=monkey foo > > See if you can get the version information you want passed back through > that mechanism. > From benc at hawaga.org.uk Fri Mar 27 16:33:28 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 27 Mar 2009 21:33:28 +0000 (GMT) Subject: [Swift-devel] Re: Provenance use case In-Reply-To: <49CD451A.20208@mcs.anl.gov> References: <49CA7E2E.5070106@mcs.anl.gov> <49CBF13D.9090708@uchicago.edu> <49CD451A.20208@mcs.anl.gov> Message-ID: On Fri, 27 Mar 2009, Michael Wilde wrote: > so instead of echo monkey foo, this profile entry should go on the tc.data > entry, and do something like: > > getAppVersion.sh appName appPath It doesn't have any access to tc.data - it runs remotely. It probably has access the internal variables of the remote wrapper (but those shouldn't be relied on in general) > Next we need to work on collection and summarization of the info from > wrapper.sh. Are you doing any of that in your current provenance work? there has been code in the past to do pull stuff from wrapper logs - not a huge amount at the moment but it should be straightforward to add if there is useful information being collected. -- From foster at anl.gov Sat Mar 28 06:20:31 2009 From: foster at anl.gov (Ian Foster) Date: Sat, 28 Mar 2009 06:20:31 -0500 Subject: [Swift-devel] Provenance use case In-Reply-To: <49CA7E2E.5070106@mcs.anl.gov> References: <49CA7E2E.5070106@mcs.anl.gov> Message-ID: <70C828E5-8C25-4CFF-B16C-EF28D7C78399@anl.gov> Hi, Would it be good to (re)start the practice of storing every run log? Having information on all Swift runs performed will be very useful for when we request additional funding. Ian. 
From benc at hawaga.org.uk Sat Mar 28 06:32:31 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 28 Mar 2009 11:32:31 +0000 (GMT) Subject: [Swift-devel] Provenance use case In-Reply-To: <70C828E5-8C25-4CFF-B16C-EF28D7C78399@anl.gov> References: <49CA7E2E.5070106@mcs.anl.gov> <70C828E5-8C25-4CFF-B16C-EF28D7C78399@anl.gov> Message-ID: On Sat, 28 Mar 2009, Ian Foster wrote: > Would it be good to (re)start the practice of storing every run log? > Having information on all Swift runs performed will be very useful for > when we request additional funding. I see run logs from recent days in /disks/ci-gpfs/swift/swift-logs so some people are definitely doing this still (and have been for some time). Swift developers can't really force application people to play along though. -- From wilde at mcs.anl.gov Sat Mar 28 10:56:36 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 28 Mar 2009 10:56:36 -0500 Subject: [Swift-devel] Provenance use case In-Reply-To: References: <49CA7E2E.5070106@mcs.anl.gov> <70C828E5-8C25-4CFF-B16C-EF28D7C78399@anl.gov> Message-ID: <49CE48B4.4070209@mcs.anl.gov> Thoughts and comments on this: Ive stressed the need for log collection to every group I work with, long hard and often, but making it happen is slow and makes little sense till the user is really up and running, which takes huge effort and comes first. Ive used on occasion a personal "swiftrun" command that bundled log reporting (and run-directory conventions and cleanup) with swift execution. Ive never had time to polish that for general use, but recently placed these tools under svn so we could start doing soon. Regarding specific users, the main ones we can work with this on at the moment are the OOPS and PtMap groups. These two groups are just now nearing the point where their runs will worth saving. Runs to date have been mostly in the learning category. Putting all logs in a central repo is OK for now. Someday we'll want to have per-collaboration repositories. 
Ideally we should place the central one under a ~swift dir rather than ~benc - it sounds to users more official and less than a personal project. So bottom line: its happening but takes time. On 3/28/09 6:32 AM, Ben Clifford wrote: > On Sat, 28 Mar 2009, Ian Foster wrote: > >> Would it be good to (re)start the practice of storing every run log? >> Having information on all Swift runs performed will be very useful for >> when we request additional funding. > > I see run logs from recent days in /disks/ci-gpfs/swift/swift-logs so some > people are definitely doing this still (and have been for some time). > > Swift developers can't really force application people to play along > though. > From wilde at mcs.anl.gov Sat Mar 28 10:58:20 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 28 Mar 2009 10:58:20 -0500 Subject: [Swift-devel] Re: Provenance use case In-Reply-To: References: <49CA7E2E.5070106@mcs.anl.gov> <49CBF13D.9090708@uchicago.edu> <49CD451A.20208@mcs.anl.gov> <49CD4D7C.9040109@mcs.anl.gov> Message-ID: <49CE491C.5040303@mcs.anl.gov> That makes sense; I need to start using that feature. On 3/28/09 3:13 AM, Ben Clifford wrote: > I was imagining more that you would have the execution $PATH pointing to > the suite of remote software (using Swift's PATHPREFIX functionality), > rather than specifying the path individually to each one in tc.data; and > then run a version command with (unparameterised) gives the version for > the entire suite. > > > On Fri, 27 Mar 2009, Michael Wilde wrote: > >> >> On 3/27/09 4:33 PM, Ben Clifford wrote: >>> On Fri, 27 Mar 2009, Michael Wilde wrote: >>> >>>> so instead of echo monkey foo, this profile entry should go on the tc.data >>>> entry, and do something like: >>>> >>>> getAppVersion.sh appName appPath >>> It doesn't have any access to tc.data - it runs remotely. >> No, I realized that. 
I meant, since the string in the key="SWIFT_EXTRA_INFO" >> env profile entry is a command (you showed "echo") and since (I was assuming) >> you can specify a unique string as a profile on each tc.data entry, then one >> could generate tc.data entries that passed enough info to this command that it >> could find the actual app at runtime on the execution host, and echo the >> appropriate text to stdout where it would then go to the wrapper log. >> >> Is this logic feasible? (Will try when I get a chance...) >> >> Seems that it only depends on the ability to specify a unique value for the >> env key EXTRAINFO=$($SWIFT_EXTRA_INFO) on each tc.data entry. >>> It probably has access the internal variables of the remote wrapper (but >>> those shouldn't be relied on in general) >> Right. It would get its info by using its arg to hunt down the command, or >> take its info right from the arg. >> >> eg: >> >> sitex runrama /home/oops/trunk/bin/runrama.bgp.sh INSTALLED INTEL32::LINUX >> \ >> env::SWIFT_EXTRA_INFO="getVersion.sh runrama >> /home/oops/trunk/bin/runrama.bgp.sh" >> >> or: >> >> env::SWIFT_EXTRA_INFO="getVersion.sh runrama >> /home/oops/trunk/bin/runrama.bgp.sh oops-2.4.1" >> >> would echo: >> >> running runrama revision 2474 -or- >> running runrama version 2.4.1 >> >> In the first case it can walk up the path /home/oops/trunk/bin looking for >> application- or collaboration-specific release info (eg /home/oops/RELEASE in >> oops's case has the svn revision) or it can look to see if the path is >> symlinked to a version directory. >> >> I.e, if its not going to poke around in the remote wrapper's environment, it >> needs to know what's being run. >> >> Anyways, this seems cool and I think it will work. >> >>>> Next we need to work on collection and summarization of the info from >>>> wrapper.sh. Are you doing any of that in your current provenance work? 
>>> there has been code in the past to do pull stuff from wrapper logs - not a
>>> huge amount at the moment but it should be straightforward to add if there
>>> is useful information being collected.
>>>
>>

From foster at anl.gov Sun Mar 29 04:59:11 2009
From: foster at anl.gov (Ian Foster)
Date: Sun, 29 Mar 2009 04:59:11 -0500
Subject: [Swift-devel] Fwd: [gt-chatter] The Snowtide Blog » Blog Archive » Why MIT now uses python instead of scheme for its undergraduate CS program
References: <27E372DB-B3B8-49DA-B247-C7C560E050D6@mcs.anl.gov>
Message-ID: <12CC600D-D6E1-457B-90E9-E1BA8A6BB325@anl.gov>

an interesting article on the use of Python at MIT

Begin forwarded message:

> From: Frank Siebenlist
> Date: March 28, 2009 9:19:14 PM CDT
> To: gt-chatter at googlegroups.com
> Cc: Frank Siebenlist
> Subject: [gt-chatter] The Snowtide Blog » Blog Archive » Why MIT now
> uses python instead of scheme for its undergraduate CS program
> Reply-To: gt-chatter at googlegroups.com
>
>
> ...mixed emotions about this switch...
>
> My personal bible has been "Structure and Interpretation of Computer
> Programs" (SICP) ever since that book came out many eons ago, and with
> that, scheme/lisp has been near and dear...
>
> However, python isn't too bad either ;-) ...and you can use it for
> functional programming.
>
> http://blog.snowtide.com/2009/03/24/why-mit-now-uses-python-instead-of-scheme-for-its-undergraduate-cs-program
>
> No word yet about a possible refactoring of SICP onto python... now
> that would make the switch almost painless and would raise python even
> further up in the stratosphere of coolness...
>
> -Frank.
>
>
> ---
> Frank Siebenlist - franks at mcs.anl.gov
> The Globus Alliance | Argonne National Laboratory | University of
> Chicago
>
>
> --~--~---------~--~----~------------~-------~--~----~
> You received this message because you are subscribed to the Google
> Groups "globus" group.
> To post to this group, send email to gt-chatter at googlegroups.com > To unsubscribe from this group, send email to gt-chatter+unsubscribe at googlegroups.com > For more options, visit this group at http://groups.google.com/group/gt-chatter?hl=en > -~----------~----~----~----~------~----~------~--~--- > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sun Mar 29 15:44:07 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 29 Mar 2009 15:44:07 -0500 Subject: [Swift-devel] console monitor Message-ID: <1238359448.30198.2.camel@localhost> I think there was talk about a monitor for swift runs for a long time. Swift r2769 has a prototype of such a monitor. You can enable it with "-tui" on the command line. I would suggest something like 066-many.swift. From foster at anl.gov Sun Mar 29 21:52:05 2009 From: foster at anl.gov (Ian Foster) Date: Sun, 29 Mar 2009 21:52:05 -0500 Subject: [Swift-devel] =?iso-8859-1?q?Fwd=3A_=5Bgt-chatter=5D_Re=3A_The_S?= =?iso-8859-1?q?nowtide_Blog_=BB_Blog_Archive_=BB_Why_MIT_now_uses_python_?= =?iso-8859-1?q?instead_of_scheme_for_its_undergraduate_CS_program?= References: <939b5a0e0903291917i591ef674m495f02fe327df86b@mail.gmail.com> Message-ID: Begin forwarded message: > From: Wei Tan > Date: March 29, 2009 9:17:57 PM CDT > To: gt-chatter at googlegroups.com > Cc: Frank Siebenlist > Subject: [gt-chatter] Re: The Snowtide Blog ? Blog Archive ? Why MIT > now uses python instead of scheme for its undergraduate CS program > Reply-To: gt-chatter at googlegroups.com > > another article regarding functional programming "welcome to the > functional web". > It argues that functional language will be more popular in Web-scale > distributed > computing paradigm (like cloud?) > > Best regards, > > Wei > > > > On Sat, Mar 28, 2009 at 9:19 PM, Frank Siebenlist > wrote: > > ...mixed emotions about this switch... 
> > My personal bible has been "Structure and Interpretation of Computer > Programs" (SICP) ever since that book came out many eons ago, and with > that, scheme/lisp has been near and dear... > > However, python isn't too bad either ;-) ...and you can use it for > functional programming. > > http://blog.snowtide.com/2009/03/24/why-mit-now-uses-python-instead-of-scheme-for-its-undergraduate-cs-program > > No word yet about a possible refactoring of SICP onto python... now > that would make the switch almost painless and would raise python even > further up in the stratosphere of coolness... > > -Frank. > > > --- > Frank Siebenlist - franks at mcs.anl.gov > The Globus Alliance | Argonne National Laboratory | University of > Chicago > > > > > > > -- > ------------------------------------------------------------------------- > Ph.D. Wei Tan > the University of Chicago | Argonne National Laboratory > http://www.mcs.anl.gov/~wtan/ > -------------------------------------------------------------------------- > > --~--~---------~--~----~------------~-------~--~----~ > You received this message because you are subscribed to the Google > Groups "globus" group. > To post to this group, send email to gt-chatter at googlegroups.com > To unsubscribe from this group, send email to gt-chatter+unsubscribe at googlegroups.com > For more options, visit this group at http://groups.google.com/group/gt-chatter?hl=en > -~----------~----~----~----~------~----~------~--~--- > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1.pdf Type: application/pdf Size: 124631 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Sun Mar 29 23:17:10 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Mar 2009 04:17:10 +0000 (GMT) Subject: [Swift-devel] Provenance use case In-Reply-To: <49CE48B4.4070209@mcs.anl.gov> References: <49CA7E2E.5070106@mcs.anl.gov> <70C828E5-8C25-4CFF-B16C-EF28D7C78399@anl.gov> <49CE48B4.4070209@mcs.anl.gov> Message-ID: On Sat, 28 Mar 2009, Michael Wilde wrote: > Putting all logs in a central repo is OK for now. Someday we'll want to have > per-collaboration repositories. Ideally we should place the central one under > a ~swift dir rather than ~benc - it sounds to users more official and less > than a personal project. Its not under ~benc. Its under: ... > > I see run logs from recent days in /disks/ci-gpfs/swift/swift-logs so some -- From benc at hawaga.org.uk Mon Mar 30 00:01:39 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Mar 2009 05:01:39 +0000 (GMT) Subject: [Swift-devel] console monitor In-Reply-To: <1238359448.30198.2.camel@localhost> References: <1238359448.30198.2.camel@localhost> Message-ID: Looks ok from what I've seen so far. At exit time its dumping cruft into my commandline input buffer, though (the string ";25;8") which needs deleting before running the next command. At some point I ctrl-Ced out and ended up with my cursor being invisible. Slightly annoyingly for me, my terminal f-keys don't seem to be bound right to get beyond F5, but F1-F5 are bound ok - I think that looks like its strange stuff in my terminal resulting from an OS X upgrade, though. -- From wilde at mcs.anl.gov Mon Mar 30 00:30:22 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 30 Mar 2009 00:30:22 -0500 Subject: [Swift-devel] Provenance use case In-Reply-To: References: <49CA7E2E.5070106@mcs.anl.gov> <70C828E5-8C25-4CFF-B16C-EF28D7C78399@anl.gov> <49CE48B4.4070209@mcs.anl.gov> Message-ID: <49D058EE.7070901@mcs.anl.gov> ok. perhaps good to move them to something like /home/swift till ci-gpfs is backed up. 
also /home/swift might be a good place to maintain a public swift release. On 3/29/09 11:17 PM, Ben Clifford wrote: > > On Sat, 28 Mar 2009, Michael Wilde wrote: > >> Putting all logs in a central repo is OK for now. Someday we'll want to have >> per-collaboration repositories. Ideally we should place the central one under >> a ~swift dir rather than ~benc - it sounds to users more official and less >> than a personal project. > > Its not under ~benc. Its under: ... > >>> I see run logs from recent days in /disks/ci-gpfs/swift/swift-logs so some > From benc at hawaga.org.uk Mon Mar 30 00:32:44 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Mar 2009 05:32:44 +0000 (GMT) Subject: [Swift-devel] Provenance use case In-Reply-To: <49D058EE.7070901@mcs.anl.gov> References: <49CA7E2E.5070106@mcs.anl.gov> <70C828E5-8C25-4CFF-B16C-EF28D7C78399@anl.gov> <49CE48B4.4070209@mcs.anl.gov> <49D058EE.7070901@mcs.anl.gov> Message-ID: On Mon, 30 Mar 2009, Michael Wilde wrote: > ok. perhaps good to move them to something like /home/swift till ci-gpfs is > backed up. they're very very large (individual run logs are sometimes a gigabyte). summary information is perhaps backupable, but I doubt the raw logs are worth it. -- From benc at hawaga.org.uk Mon Mar 30 00:36:22 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Mar 2009 05:36:22 +0000 (GMT) Subject: [Swift-devel] Provenance use case In-Reply-To: <49D058EE.7070901@mcs.anl.gov> References: <49CA7E2E.5070106@mcs.anl.gov> <70C828E5-8C25-4CFF-B16C-EF28D7C78399@anl.gov> <49CE48B4.4070209@mcs.anl.gov> <49D058EE.7070901@mcs.anl.gov> Message-ID: On Mon, 30 Mar 2009, Michael Wilde wrote: > also /home/swift might be a good place to maintain a public swift release. At some point, Swift could go in the CI softenv system. Perhaps that should be done with 0.9. However, I think lots of people we work closely with are still using development code from the SVN rather than the main releases, so it may not be worth much. 
-- From benc at hawaga.org.uk Mon Mar 30 06:58:34 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Mar 2009 11:58:34 +0000 (GMT) Subject: [Swift-devel] Re: Provenance use case In-Reply-To: References: <49CA7E2E.5070106@mcs.anl.gov> <49CBF13D.9090708@uchicago.edu> <49CD451A.20208@mcs.anl.gov> Message-ID: On Fri, 27 Mar 2009, Ben Clifford wrote: > there has been code in the past to pull stuff from wrapper logs - not a > huge amount at the moment but it should be straightforward to add if there > is useful information being collected. provenancedb as of r2779 has tables in it to associate executions with extrainfo as gathered from the wrapper logs. -- From hategan at mcs.anl.gov Mon Mar 30 10:13:24 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Mar 2009 10:13:24 -0500 Subject: [Swift-devel] console monitor In-Reply-To: References: <1238359448.30198.2.camel@localhost> Message-ID: <1238426004.13535.3.camel@localhost> On Mon, 2009-03-30 at 05:01 +0000, Ben Clifford wrote: > Looks ok from what I've seen so far. > > At exit time it's dumping cruft into my commandline input buffer, though > (the string ";25;8") which needs deleting before running the next command. The OS X terminal is a bit strange. > > At some point I ctrl-Ced out and ended up with my cursor being invisible. Right. That may happen. I'll see what I can do. But you do have File>Abort (except probably ALT+F will not be sent to the app). > > Slightly annoyingly for me, my terminal f-keys don't seem to be bound > right to get beyond F5, but F1-F5 are bound ok - I think that looks like > it's strange stuff in my terminal resulting from an OS X upgrade, though. If you look at the terminal settings, it will tell you what escape sequence it will send for those keys. Paste them in an email.
> From benc at hawaga.org.uk Mon Mar 30 10:18:10 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Mar 2009 15:18:10 +0000 (GMT) Subject: [Swift-devel] console monitor In-Reply-To: <1238426004.13535.3.camel@localhost> References: <1238359448.30198.2.camel@localhost> <1238426004.13535.3.camel@localhost> Message-ID: On Mon, 30 Mar 2009, Mihael Hategan wrote: > If you look at the terminal settings, it will tell you what escape > sequence it will send for those keys. Paste them in an email. It's more a problem that those keys are remapped in the OS to other stuff and never reach the terminal at all... Maybe one way to deal with terminal variation (of which I'm sure there will be plenty) is to constrain input to obvious ASCII input characters. -- From hategan at mcs.anl.gov Mon Mar 30 10:24:25 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Mar 2009 10:24:25 -0500 Subject: [Swift-devel] console monitor In-Reply-To: References: <1238359448.30198.2.camel@localhost> <1238426004.13535.3.camel@localhost> Message-ID: <1238426665.13535.10.camel@localhost> On Mon, 2009-03-30 at 15:18 +0000, Ben Clifford wrote: > On Mon, 30 Mar 2009, Mihael Hategan wrote: > > > If you look at the terminal settings, it will tell you what escape > > sequence it will send for those keys. Paste them in an email. > > It's more a problem that those keys are remapped in the OS to other stuff > and never reach the terminal at all... You can change that in the settings (Terminal > Window(s) I think). > > Maybe one way to deal with terminal variation (of which I'm sure there > will be plenty) is to constrain input to obvious ASCII input characters. There is also the choice of emulating those (with, say, ctrl + 0-9).
From benc at hawaga.org.uk Mon Mar 30 10:29:50 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Mar 2009 15:29:50 +0000 (GMT) Subject: [Swift-devel] console monitor In-Reply-To: <1238426665.13535.10.camel@localhost> References: <1238359448.30198.2.camel@localhost> <1238426004.13535.3.camel@localhost> <1238426665.13535.10.camel@localhost> Message-ID: On Mon, 30 Mar 2009, Mihael Hategan wrote: > You can change that in the settings (Terminal > Window(s) I think). yes, I know how to reconfigure. However, I then lose the more useful functionality to which I have them bound. -- From hategan at mcs.anl.gov Mon Mar 30 10:41:38 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Mar 2009 10:41:38 -0500 Subject: [Swift-devel] console monitor In-Reply-To: References: <1238359448.30198.2.camel@localhost> <1238426004.13535.3.camel@localhost> <1238426665.13535.10.camel@localhost> Message-ID: <1238427698.14161.0.camel@localhost> On Mon, 2009-03-30 at 15:29 +0000, Ben Clifford wrote: > On Mon, 30 Mar 2009, Mihael Hategan wrote: > > > You can change that in the settings (Terminal > Window(s) I think). > > yes, I know how to reconfigure. However, I then lose the more useful > functionality to which I have them bound. Right. So I'll add ALT+0-9 as emulation for the function keys. From hategan at mcs.anl.gov Mon Mar 30 11:02:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Mar 2009 11:02:23 -0500 Subject: [Swift-devel] whitespace Message-ID: <1238428943.14161.21.camel@localhost> Ben, We seem to have different formatting for source code. Let's agree on a single one so that we don't have to tweak each other's whitespace and that our commits are cleaner. 
From benc at hawaga.org.uk Mon Mar 30 11:06:55 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Mar 2009 16:06:55 +0000 (GMT) Subject: [Swift-devel] whitespace In-Reply-To: <1238428943.14161.21.camel@localhost> References: <1238428943.14161.21.camel@localhost> Message-ID: On Mon, 30 Mar 2009, Mihael Hategan wrote: > We seem to have different formatting for source code. Let's agree on a > single one so that we don't have to tweak each other's whitespace and > that our commits are cleaner. I've tended to use whatever the source file in question has already, so that it remains locally consistent. There is no established convention at all (either in indentation characters or line endings) in the codebase, dating presumably from lack of any consensus when the original prototype was made. In the absence of wholesale reformatting of the codebase, that's the convention I'd prefer. -- From hategan at mcs.anl.gov Mon Mar 30 11:24:13 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Mar 2009 11:24:13 -0500 Subject: [Swift-devel] whitespace In-Reply-To: References: <1238428943.14161.21.camel@localhost> Message-ID: <1238430253.14721.10.camel@localhost> On Mon, 2009-03-30 at 16:06 +0000, Ben Clifford wrote: > On Mon, 30 Mar 2009, Mihael Hategan wrote: > > > We seem to have different formatting for source code. Let's agree on a > > single one so that we don't have to tweak each other's whitespace and > > that our commits are cleaner. > > I've tended to use whatever the source file in question has already, so > that it remains locally consistent. There is no established convention at > all (either in indentation characters or line endings) in the codebase, > dating presumably from lack of any consensus when the original prototype > was made. > > In the absence of wholesale reformatting of the codebase, that's the > convention I'd prefer. > r2774? 
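The "use whatever the source file already has" convention quoted above can be checked mechanically before editing a file. What follows is only a minimal sketch of that idea, not anything from the Swift codebase; the function name `predominant_indent` is hypothetical, and it only looks at the first character of each line.

```python
from collections import Counter


def predominant_indent(source: str) -> str:
    """Report which indentation style most indented lines in `source` use.

    Returns "tabs", "spaces", "mixed" (exact tie), or "none" (no indented lines).
    """
    counts = Counter()
    for line in source.splitlines():
        if line.startswith("\t"):
            counts["tabs"] += 1
        elif line.startswith(" "):
            counts["spaces"] += 1
    if not counts:
        return "none"
    if counts["tabs"] == counts["spaces"]:
        return "mixed"
    return "tabs" if counts["tabs"] > counts["spaces"] else "spaces"
```

Running this over a file before committing would tell you which style keeps the file locally consistent; it says nothing about line endings, which would need a separate check.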
From benc at hawaga.org.uk Mon Mar 30 11:29:17 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 30 Mar 2009 16:29:17 +0000 (GMT) Subject: [Swift-devel] whitespace In-Reply-To: <1238430253.14721.10.camel@localhost> References: <1238428943.14161.21.camel@localhost> <1238430253.14721.10.camel@localhost> Message-ID: On Mon, 30 Mar 2009, Mihael Hategan wrote: > r2774? RuntimeStats has a majority history of using tab-indentation. r2774 makes it more so. Thats what I mean by: > I've tended to use whatever the source file in question has already -- From hategan at mcs.anl.gov Mon Mar 30 16:48:46 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Mar 2009 16:48:46 -0500 Subject: [Swift-devel] console monitor In-Reply-To: <1238427698.14161.0.camel@localhost> References: <1238359448.30198.2.camel@localhost> <1238426004.13535.3.camel@localhost> <1238426665.13535.10.camel@localhost> <1238427698.14161.0.camel@localhost> Message-ID: <1238449726.30616.1.camel@localhost> On Mon, 2009-03-30 at 10:41 -0500, Mihael Hategan wrote: > On Mon, 2009-03-30 at 15:29 +0000, Ben Clifford wrote: > > On Mon, 30 Mar 2009, Mihael Hategan wrote: > > > > > You can change that in the settings (Terminal > Window(s) I think). > > > > yes, I know how to reconfigure. However, I then lose the more useful > > functionality to which I have them bound. > > Right. So I'll add ALT+0-9 as emulation for the function keys. Turns out my xterm like terminal does not bother to send any sequence for ALT+n. Midnight Commander has ESC+n emulation for function keys, which is what I also put in and will commit as soon as svn comes back to life. 
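The ESC+n emulation described above can be sketched roughly as follows. This is an illustration only, not the code that was committed: the function name `translate_keys` is hypothetical, and the mapping of ESC+1 through ESC+9 to F1 through F9 and ESC+0 to F10 is an assumption modelled on the Midnight Commander convention rather than something stated in this thread.

```python
ESC = "\x1b"
DIGITS = "0123456789"


def translate_keys(chars: str) -> list:
    """Turn raw terminal input into key names.

    ESC immediately followed by a digit becomes a function key
    (assumed mapping: 1-9 -> F1-F9, 0 -> F10). A lone ESC is reported
    as "ESC"; every other character is passed through unchanged.
    """
    keys = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch == ESC and i + 1 < len(chars) and chars[i + 1] in DIGITS:
            digit = int(chars[i + 1])
            keys.append("F10" if digit == 0 else "F%d" % digit)
            i += 2  # consume both ESC and the digit
        else:
            keys.append("ESC" if ch == ESC else ch)
            i += 1
    return keys
```

Because ESC followed by a digit is plain ASCII, this kind of emulation does not depend on the terminal sending anything for ALT+n or on the OS leaving the real function keys unmapped.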
From benc at hawaga.org.uk Tue Mar 31 02:22:10 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 31 Mar 2009 07:22:10 +0000 (GMT) Subject: [Swift-devel] console monitor In-Reply-To: <1238449726.30616.1.camel@localhost> References: <1238359448.30198.2.camel@localhost> <1238426004.13535.3.camel@localhost> <1238426665.13535.10.camel@localhost> <1238427698.14161.0.camel@localhost> <1238449726.30616.1.camel@localhost> Message-ID: On Mon, 30 Mar 2009, Mihael Hategan wrote: > Turns out my xterm-like terminal does not bother to send any sequence > for ALT+n. Midnight Commander has ESC+n emulation for function keys, > which is what I also put in and will commit as soon as svn comes back to > life. Yep, that works here. Now I can see the magical "Ben's view". -- From wilde at mcs.anl.gov Tue Mar 31 07:23:58 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 31 Mar 2009 07:23:58 -0500 Subject: [Swift-devel] Problem on oops run: Java out of memory In-Reply-To: <49D1B816.1030603@uchicago.edu> References: <49D1B816.1030603@uchicago.edu> Message-ID: <49D20B5E.8070008@mcs.anl.gov> There is a new way to limit the number of jobs that a for-loop will allow to be active at once. That's in the user guide, and I think it defaults to 1024. I don't know if there is a per-site limit. So it sounds like there may be a Swift bug here. You should report this one to swift-devel after rsyncing your log file to the swift log repository as per the user guide. Sounds like nice progress! - Mike On 3/31/09 1:28 AM, Glen Hocky wrote: > Hi Mike, > I was able to successfully modify everything so that temperature sweeps > work correctly on teraport.
> the main problem i've run up against is that i would like to test this > out with a lot of jobs, but i got >> Uncaught exception: java.lang.OutOfMemoryError: Java heap space in >> kernel:named @ vdl-int.k >> java.lang.OutOfMemoryError: Java heap space >> at java.util.HashMap.addEntry(HashMap.java:753) >> at java.util.HashMap.put(HashMap.java:385) >> at >> org.globus.cog.karajan.arguments.NamedArgumentsImpl.add(NamedArgumentsImpl.java:83) >> >> at >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.initializeArgs(AbstractSequentialWithArguments.java:158) >> >> at >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.pre(AbstractSequentialWithArguments.java:138) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:62) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) >> >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) >> at >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) >> >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) >> >> at >> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) >> > which makes sense since I was submitting thousands of jobs > > i don't see a param to limit the max on a given site to N. do you know > of one? ... From wilde at mcs.anl.gov Tue Mar 31 07:27:06 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 31 Mar 2009 07:27:06 -0500 Subject: [Swift-devel] Re: coasters problem - identified? 
In-Reply-To: <49D1C033.50806@uchicago.edu> References: <49D1C033.50806@uchicago.edu> Message-ID: <49D20C1A.3080802@mcs.anl.gov> It's possible that, with only 5 jobs and the throttle settings you have (with the sites.xml that you used from Sarah), the 5 jobs get 5 coasters started before the provider or Swift realizes that there are already running coasters there that can be used. I suspect we need to test at a large scale to see if this is really happening or not. I further suspect that in Sarah's tests, the system was indeed using the 16 coasters per node. So it's most likely that when all is well, that feature is working. In your earlier tests, when you observed that coasters started way more coaster jobs than you had jobs in your workflow, I think the cause there was that the coasters were failing quickly. We saw this once when you had the wrong project number specified, but then I think you saw this again after that was corrected, when a 1-job workflow ran OK (confirming that the project id was correct) but a slightly larger workflow seemed to be spawning quickly-dying coasters. On 3/31/09 2:03 AM, Glen Hocky wrote: > it seems the problem with coasters is that it is not properly using the > 16 cores on ranger > I'm running a swift script which should (and does) run 5 invocations of > runoops and swift asked for 5x16 cores even with coastersPerNode at 16 From hategan at mcs.anl.gov Tue Mar 31 10:11:39 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 31 Mar 2009 10:11:39 -0500 Subject: [Swift-devel] Problem on oops run: Java out of memory In-Reply-To: <49D20B5E.8070008@mcs.anl.gov> References: <49D1B816.1030603@uchicago.edu> <49D20B5E.8070008@mcs.anl.gov> Message-ID: <1238512299.7923.0.camel@localhost> What's your workflow, and how much memory are you running swift with? On Tue, 2009-03-31 at 07:23 -0500, Michael Wilde wrote: > There is a new way to limit the number of jobs that a for-loop will > allow to be active at once.
Thats in the user guide, and it defaults I > think to 1024. > > I dont know if there is a per-site limit. > > So it sounds like there may be a swift bug here. You should report this > one to swift-devel after rsyncying your log file to the swift log > repository as per the user guide. > > Sounds like nice progress! > > - Mike > > > On 3/31/09 1:28 AM, Glen Hocky wrote: > > Hi Mike, > > I was able to successfully modify everything so that temperature sweeps > > work correctly on teraport. > > the main problem i've run up against is that i would like to test this > > out with a lot of jobs, but i got > >> Uncaught exception: java.lang.OutOfMemoryError: Java heap space in > >> kernel:named @ vdl-int.k > >> java.lang.OutOfMemoryError: Java heap space > >> at java.util.HashMap.addEntry(HashMap.java:753) > >> at java.util.HashMap.put(HashMap.java:385) > >> at > >> org.globus.cog.karajan.arguments.NamedArgumentsImpl.add(NamedArgumentsImpl.java:83) > >> > >> at > >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.initializeArgs(AbstractSequentialWithArguments.java:158) > >> > >> at > >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.pre(AbstractSequentialWithArguments.java:138) > >> > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:62) > >> > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) > >> > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) > >> at > >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227) > >> > >> at > >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125) > >> at > >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99) > >> > >> at > >> 
org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69) > >> > > which makes sense since I was submitting thousands of jobs > > > > i don't see a param to limit the max on a given site to N. do you know > > of one? > ... > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
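The idea of limiting how many loop iterations are active at once, as mentioned in the quoted reply, can be illustrated in general terms. The sketch below is not how Swift or Karajan implements its throttle and does not use any Swift property names; `run_throttled` and its parameters are hypothetical. It shows one common approach: a semaphore that blocks the submitting loop, so that no more than `max_active` jobs exist at any moment.

```python
import threading


def run_throttled(job, job_args, max_active):
    """Call job(arg) for each arg, with at most max_active calls running at once.

    The submitting loop blocks on a semaphore, so job threads are created
    lazily rather than thousands at a time. Results keep the input order.
    """
    job_args = list(job_args)
    slots = threading.BoundedSemaphore(max_active)
    results = [None] * len(job_args)
    threads = []

    def worker(index, arg):
        try:
            results[index] = job(arg)
        finally:
            slots.release()  # free the slot even if the job raised

    for index, arg in enumerate(job_args):
        slots.acquire()  # waits here while max_active jobs are running
        thread = threading.Thread(target=worker, args=(index, arg))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    return results
```

The design point is that the throttle sits in front of job creation rather than behind it: queueing every job up front and only limiting execution would still hold per-job state for the whole sweep in memory.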