From hategan at mcs.anl.gov Mon Nov 2 14:51:04 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 02 Nov 2009 14:51:04 -0600 Subject: [Swift-devel] more swift and bgp plots Message-ID: <1257195064.24600.1.camel@localhost> http://www.mcs.anl.gov/~hategan/report-8k-60s/ http://www.mcs.anl.gov/~hategan/report-8k-120s/ http://www.mcs.anl.gov/~hategan/report-12k-120s/ http://www.mcs.anl.gov/~hategan/report-20k-300s/ From wilde at mcs.anl.gov Tue Nov 3 08:03:46 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 03 Nov 2009 08:03:46 -0600 Subject: [Swift-devel] Problems with Swift trunk on abe Message-ID: <4AF03842.70300@mcs.anl.gov> What seems like the a build of the same Swift & CoG svn revision is working on communicado but giving strange and transient results on Abe. A series of 5 runs of the same code (with .xml and .kml files left in place) give the set of errors below. I will try to use the release build on communicado, to see if this is being caused by some Java compiler difference on Abe. Suggestions welcome. Mihael, I can get you the logs - they are on abe in ~wilde/run.58, along with the source: psim.basicex1.swift - Mike Output with a few progress and blank lines removed: (full output is in: http://www.ci.uchicago.edu/~wilde/abe.swiftproblem.txt ) honest4$ ls Protein.map* Protein.map~* etc input@ logs/ psim.basicex1.swift* sites.xml t.swift* t.swift~* tc tc~ honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift Swift svn swift-r3186 cog-r2572 RunID: 20091103-0750-p7z85z61 Execution failed: Argument size mismatch. Got 5 names and 0 values honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift Swift svn swift-r3186 cog-r2572 RunID: 20091103-0751-sfxthnne Execution failed: Argument size mismatch. Got 5 names and 0 values honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift Swift svn swift-r3186 cog-r2572 RunID: 20091103-0751-oxnhztrc Progress: Execution failed: Argument size mismatch. Got 5 names and 0 values honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift Swift svn swift-r3186 cog-r2572 RunID: 20091103-0751-amtfuvq9 Execution failed: Variable not found: path honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift Swift svn swift-r3186 cog-r2572 RunID: 20091103-0751-b4sr44x2 Execution failed: Ex098 org.globus.cog.karajan.workflow.KarajanRuntimeException: Invalid identifier: [$, org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uch\ icago.edu,2008:swift:dataset:20091103-0751-b65c863d:720000000020 type ProtGeo with no value at dataset=structure path=[1] (not closed)] at org.globus.cog.karajan.util.TypeUtil.toIdentifier(TypeUtil.java:286) ... at java.lang.Thread.run(Thread.java:595) Variable not found: path honest4$ From wilde at mcs.anl.gov Tue Nov 3 08:18:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 03 Nov 2009 08:18:55 -0600 Subject: [Swift-devel] Problems with Swift trunk on abe In-Reply-To: <4AF03842.70300@mcs.anl.gov> References: <4AF03842.70300@mcs.anl.gov> Message-ID: <4AF03BCF.3020502@mcs.anl.gov> Using the Swift build (and swift.properties) that works on communicado, I still get the same failures on Abe. I'm going to try a few other systems to see if the problem is widespread or local to Abe. - Mike On 11/3/09 8:03 AM, Michael Wilde wrote: > What seems like the a build of the same Swift & CoG svn revision is > working on communicado but giving strange and transient results on Abe. 
> > A series of 5 runs of the same code (with .xml and .kml files left in > place) give the set of errors below. > > I will try to use the release build on communicado, to see if this is > being caused by some Java compiler difference on Abe. > > Suggestions welcome. Mihael, I can get you the logs - they are on abe in > ~wilde/run.58, along with the source: psim.basicex1.swift > > - Mike > > Output with a few progress and blank lines removed: > > (full output is in: http://www.ci.uchicago.edu/~wilde/abe.swiftproblem.txt ) > > honest4$ ls > > Protein.map* Protein.map~* etc input@ logs/ psim.basicex1.swift* > sites.xml t.swift* t.swift~* tc tc~ > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0750-p7z85z61 > Execution failed: > Argument size mismatch. Got 5 names and 0 values > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0751-sfxthnne > Execution failed: > Argument size mismatch. Got 5 names and 0 values > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0751-oxnhztrc > Progress: > Execution failed: > Argument size mismatch. Got 5 names and 0 values > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0751-amtfuvq9 > Execution failed: > Variable not found: path > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0751-b4sr44x2 > Execution failed: > Ex098 > org.globus.cog.karajan.workflow.KarajanRuntimeException: Invalid > identifier: [$, org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uch\ > icago.edu,2008:swift:dataset:20091103-0751-b65c863d:720000000020 type > ProtGeo with no value at dataset=structure path=[1] (not closed)] > at > org.globus.cog.karajan.util.TypeUtil.toIdentifier(TypeUtil.java:286) > ... > at java.lang.Thread.run(Thread.java:595) > Variable not found: path > honest4$ > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Nov 3 10:19:47 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 03 Nov 2009 10:19:47 -0600 Subject: [Fwd: Re: [Swift-devel] Problems with Swift trunk on abe] Message-ID: <4AF05823.2070709@mcs.anl.gov> Similar problems happen on Ranger. I used an older Swift build initially, and it ran. Then I updated to the latest, and I get the following similar failure: RunID: 20091103-1016-khfn59yf Progress: Ex098 java.lang.NullPointerException at org.globus.cog.karajan.workflow.nodes.functions.Variable.function(Variable.java:38) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) ... edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) at java.lang.Thread.run(Thread.java:619) Execution failed: Argument size mismatch. Got 1 names and 2 values login3$ Im going try the bgp next. If it works there, we can leave this issue for after SC. Else, I'll need to resolve it. 
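[Note: a quick way to test the "Java compiler difference" theory raised in the Abe thread is to compare the JVMs that Swift actually picks up on the working and failing hosts; a minimal sketch, assuming Swift simply uses whatever java is first on the PATH:]

    # run on communicado, then again on Abe/Ranger, and diff the output
    which java
    java -version 2>&1
    echo "JAVA_HOME=$JAVA_HOME"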
- Mike -------- Original Message -------- Subject: Re: [Swift-devel] Problems with Swift trunk on abe Date: Tue, 03 Nov 2009 08:18:55 -0600 From: Michael Wilde To: swift-devel References: <4AF03842.70300 at mcs.anl.gov> Using the Swift build (and swift.properties) that works on communicado, I still get the same failures on Abe. I'm going to try a few other systems to see if the problem is widespread or local to Abe. - Mike On 11/3/09 8:03 AM, Michael Wilde wrote: > What seems like the a build of the same Swift & CoG svn revision is > working on communicado but giving strange and transient results on Abe. > > A series of 5 runs of the same code (with .xml and .kml files left in > place) give the set of errors below. > > I will try to use the release build on communicado, to see if this is > being caused by some Java compiler difference on Abe. > > Suggestions welcome. Mihael, I can get you the logs - they are on abe in > ~wilde/run.58, along with the source: psim.basicex1.swift > > - Mike > > Output with a few progress and blank lines removed: > > (full output is in: http://www.ci.uchicago.edu/~wilde/abe.swiftproblem.txt ) > > honest4$ ls > > Protein.map* Protein.map~* etc input@ logs/ psim.basicex1.swift* > sites.xml t.swift* t.swift~* tc tc~ > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0750-p7z85z61 > Execution failed: > Argument size mismatch. Got 5 names and 0 values > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0751-sfxthnne > Execution failed: > Argument size mismatch. Got 5 names and 0 values > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0751-oxnhztrc > Progress: > Execution failed: > Argument size mismatch. Got 5 names and 0 values > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0751-amtfuvq9 > Execution failed: > Variable not found: path > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > Swift svn swift-r3186 cog-r2572 > RunID: 20091103-0751-b4sr44x2 > Execution failed: > Ex098 > org.globus.cog.karajan.workflow.KarajanRuntimeException: Invalid > identifier: [$, org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uch\ > icago.edu,2008:swift:dataset:20091103-0751-b65c863d:720000000020 type > ProtGeo with no value at dataset=structure path=[1] (not closed)] > at > org.globus.cog.karajan.util.TypeUtil.toIdentifier(TypeUtil.java:286) > ... > at java.lang.Thread.run(Thread.java:595) > Variable not found: path > honest4$ > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Nov 3 11:05:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Nov 2009 11:05:23 -0600 Subject: [Fwd: Re: [Swift-devel] Problems with Swift trunk on abe] In-Reply-To: <4AF05823.2070709@mcs.anl.gov> References: <4AF05823.2070709@mcs.anl.gov> Message-ID: <1257267923.8801.1.camel@localhost> Thanks. I'm looking into it. On Tue, 2009-11-03 at 10:19 -0600, Michael Wilde wrote: > Similar problems happen on Ranger. 
I used an older Swift build > initially, and it ran. > > Then I updated to the latest, and I get the following similar failure: > > RunID: 20091103-1016-khfn59yf > Progress: > Ex098 > java.lang.NullPointerException > at > org.globus.cog.karajan.workflow.nodes.functions.Variable.function(Variable.java:38) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > ... > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > at > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > at java.lang.Thread.run(Thread.java:619) > Execution failed: > Argument size mismatch. Got 1 names and 2 values > login3$ > > Im going try the bgp next. If it works there, we can leave this issue > for after SC. Else, I'll need to resolve it. > > - Mike > > -------- Original Message -------- > Subject: Re: [Swift-devel] Problems with Swift trunk on abe > Date: Tue, 03 Nov 2009 08:18:55 -0600 > From: Michael Wilde > To: swift-devel > References: <4AF03842.70300 at mcs.anl.gov> > > Using the Swift build (and swift.properties) that works on communicado, > I still get the same failures on Abe. > > I'm going to try a few other systems to see if the problem is widespread > or local to Abe. > > - Mike > > > On 11/3/09 8:03 AM, Michael Wilde wrote: > > What seems like the a build of the same Swift & CoG svn revision is > > working on communicado but giving strange and transient results on Abe. > > > > A series of 5 runs of the same code (with .xml and .kml files left in > > place) give the set of errors below. > > > > I will try to use the release build on communicado, to see if this is > > being caused by some Java compiler difference on Abe. > > > > Suggestions welcome. Mihael, I can get you the logs - they are on abe in > > ~wilde/run.58, along with the source: psim.basicex1.swift > > > > - Mike > > > > Output with a few progress and blank lines removed: > > > > (full output is in: http://www.ci.uchicago.edu/~wilde/abe.swiftproblem.txt ) > > > > honest4$ ls > > > > Protein.map* Protein.map~* etc input@ logs/ psim.basicex1.swift* > > sites.xml t.swift* t.swift~* tc tc~ > > > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > > > Swift svn swift-r3186 cog-r2572 > > RunID: 20091103-0750-p7z85z61 > > Execution failed: > > Argument size mismatch. Got 5 names and 0 values > > > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > > > Swift svn swift-r3186 cog-r2572 > > RunID: 20091103-0751-sfxthnne > > Execution failed: > > Argument size mismatch. Got 5 names and 0 values > > > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > > > Swift svn swift-r3186 cog-r2572 > > RunID: 20091103-0751-oxnhztrc > > Progress: > > Execution failed: > > Argument size mismatch. 
Got 5 names and 0 values > > > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > > > Swift svn swift-r3186 cog-r2572 > > RunID: 20091103-0751-amtfuvq9 > > Execution failed: > > Variable not found: path > > > > honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > > > > Swift svn swift-r3186 cog-r2572 > > RunID: 20091103-0751-b4sr44x2 > > Execution failed: > > Ex098 > > org.globus.cog.karajan.workflow.KarajanRuntimeException: Invalid > > identifier: [$, org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uch\ > > icago.edu,2008:swift:dataset:20091103-0751-b65c863d:720000000020 type > > ProtGeo with no value at dataset=structure path=[1] (not closed)] > > at > > org.globus.cog.karajan.util.TypeUtil.toIdentifier(TypeUtil.java:286) > > ... > > at java.lang.Thread.run(Thread.java:595) > > Variable not found: path > > honest4$ > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Nov 3 11:15:42 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 03 Nov 2009 11:15:42 -0600 Subject: [Fwd: Re: [Swift-devel] Problems with Swift trunk on abe] In-Reply-To: <1257267923.8801.1.camel@localhost> References: <4AF05823.2070709@mcs.anl.gov> <1257267923.8801.1.camel@localhost> Message-ID: <4AF0653E.7010008@mcs.anl.gov> Im running the Swift test battery on Ranger and Abe. Tests on Abe just got this, which is similar to what I was seeing: ... Running test 0024-compound at Tue Nov 3 11:11:33 CST 2009 Swift svn swift-r3186 cog-r2572 RunID: 20091103-1111-as1yb7s3 Progress: Execution failed: First argument must be an identifier or a list of identifiers SWIFT RETURN CODE NON-ZERO - test 0024-compound.swift ... --- Ranger passed all syntax tests but is hanging on 001-echo. Not sure why yet. I'll keep trying to zero in on something small thats reproducble. 0024-compound might be a good starting point. Let me know if you need something specific from me, or have suggestions on what to try, where. Thanks, Mike On 11/3/09 11:05 AM, Mihael Hategan wrote: > Thanks. I'm looking into it. > > On Tue, 2009-11-03 at 10:19 -0600, Michael Wilde wrote: >> Similar problems happen on Ranger. I used an older Swift build >> initially, and it ran. >> >> Then I updated to the latest, and I get the following similar failure: >> >> RunID: 20091103-1016-khfn59yf >> Progress: >> Ex098 >> java.lang.NullPointerException >> at >> org.globus.cog.karajan.workflow.nodes.functions.Variable.function(Variable.java:38) >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >> ... >> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >> at >> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >> at java.lang.Thread.run(Thread.java:619) >> Execution failed: >> Argument size mismatch. Got 1 names and 2 values >> login3$ >> >> Im going try the bgp next. If it works there, we can leave this issue >> for after SC. Else, I'll need to resolve it. 
>> >> - Mike >> >> -------- Original Message -------- >> Subject: Re: [Swift-devel] Problems with Swift trunk on abe >> Date: Tue, 03 Nov 2009 08:18:55 -0600 >> From: Michael Wilde >> To: swift-devel >> References: <4AF03842.70300 at mcs.anl.gov> >> >> Using the Swift build (and swift.properties) that works on communicado, >> I still get the same failures on Abe. >> >> I'm going to try a few other systems to see if the problem is widespread >> or local to Abe. >> >> - Mike >> >> >> On 11/3/09 8:03 AM, Michael Wilde wrote: >>> What seems like the a build of the same Swift & CoG svn revision is >>> working on communicado but giving strange and transient results on Abe. >>> >>> A series of 5 runs of the same code (with .xml and .kml files left in >>> place) give the set of errors below. >>> >>> I will try to use the release build on communicado, to see if this is >>> being caused by some Java compiler difference on Abe. >>> >>> Suggestions welcome. Mihael, I can get you the logs - they are on abe in >>> ~wilde/run.58, along with the source: psim.basicex1.swift >>> >>> - Mike >>> >>> Output with a few progress and blank lines removed: >>> >>> (full output is in: http://www.ci.uchicago.edu/~wilde/abe.swiftproblem.txt ) >>> >>> honest4$ ls >>> >>> Protein.map* Protein.map~* etc input@ logs/ psim.basicex1.swift* >>> sites.xml t.swift* t.swift~* tc tc~ >>> >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>> >>> Swift svn swift-r3186 cog-r2572 >>> RunID: 20091103-0750-p7z85z61 >>> Execution failed: >>> Argument size mismatch. Got 5 names and 0 values >>> >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>> >>> Swift svn swift-r3186 cog-r2572 >>> RunID: 20091103-0751-sfxthnne >>> Execution failed: >>> Argument size mismatch. Got 5 names and 0 values >>> >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>> >>> Swift svn swift-r3186 cog-r2572 >>> RunID: 20091103-0751-oxnhztrc >>> Progress: >>> Execution failed: >>> Argument size mismatch. Got 5 names and 0 values >>> >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>> >>> Swift svn swift-r3186 cog-r2572 >>> RunID: 20091103-0751-amtfuvq9 >>> Execution failed: >>> Variable not found: path >>> >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>> >>> Swift svn swift-r3186 cog-r2572 >>> RunID: 20091103-0751-b4sr44x2 >>> Execution failed: >>> Ex098 >>> org.globus.cog.karajan.workflow.KarajanRuntimeException: Invalid >>> identifier: [$, org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uch\ >>> icago.edu,2008:swift:dataset:20091103-0751-b65c863d:720000000020 type >>> ProtGeo with no value at dataset=structure path=[1] (not closed)] >>> at >>> org.globus.cog.karajan.util.TypeUtil.toIdentifier(TypeUtil.java:286) >>> ... 
>>> at java.lang.Thread.run(Thread.java:595) >>> Variable not found: path >>> honest4$ >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Tue Nov 3 11:23:54 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 03 Nov 2009 11:23:54 -0600 Subject: [Fwd: Re: [Swift-devel] Problems with Swift trunk on abe] In-Reply-To: <4AF0653E.7010008@mcs.anl.gov> References: <4AF05823.2070709@mcs.anl.gov> <1257267923.8801.1.camel@localhost> <4AF0653E.7010008@mcs.anl.gov> Message-ID: <4AF0672A.1090401@mcs.anl.gov> Indeed, taking 0024-compound, and just copying it to a local test directory, gives very similar *and random, various* errors. See below (abe: ~wilde/swiftbug for logs) Still digging. - Mike honest4$ which swift ~/swift/src/cog/modules/swift/dist/swift-svn/bin/swift honest4$ cat >t.swift type messagefile {} (messagefile t) greeting(string m) { app { echo m stdout=@filename(t); } } (messagefile first, messagefile second) compound() { first = greeting("f"); second = greeting("s"); } messagefile a <"0024-compound.Q.out">; messagefile b <"0024-compound.R.out">; (a,b) = compound(); honest4$ honest4$ pwd /u/ac/wilde/swiftbug honest4$ honest4$ which swift ~/swift/src/cog/modules/swift/dist/swift-svn/bin/swift honest4$ swift t.swift Swift svn swift-r3186 cog-r2572 RunID: 20091103-1120-xnfyl1i5 Progress: Failed to transfer wrapper log from t-20091103-1120-xnfyl1i5/info/k on localhost Progress: Stage in:1 Finished successfully:1 Final status: Finished successfully:2 honest4$ swift t.swift Swift svn swift-r3186 cog-r2572 RunID: 20091103-1121-2pgjnht8 Progress: Execution failed: First argument must be an identifier or a list of identifiers honest4$ swift t.swift Swift svn swift-r3186 cog-r2572 RunID: 20091103-1121-m64zk3p8 Progress: Execution failed: Variable not found: r Failed to transfer wrapper log from t-20091103-1121-m64zk3p8/info/p on localhost honest4$ swift t.swift Swift svn swift-r3186 cog-r2572 RunID: 20091103-1121-effamg5e Progress: Ex098 org.globus.cog.karajan.workflow.KarajanRuntimeException: Invalid identifier: [$, org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20091103-1121-5rjnv1z5:720000000004 type messagefile with no value at dataset=b (not closed)] at org.globus.cog.karajan.util.TypeUtil.toIdentifier(TypeUtil.java:286) at org.globus.cog.karajan.workflow.nodes.ChannelTo.partialArgumentsEvaluated(ChannelTo.java:32) at org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.childCompleted(PartialArgumentsContainer.java:81) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:134) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:108) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialIterator.iterationCompleted(AbstractSequentialIterator.java:90) at 
org.globus.cog.karajan.workflow.nodes.AbstractSequentialIterator.nonArgChildCompleted(AbstractSequentialIterator.java:75) at org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.childCompleted(PartialArgumentsContainer.java:85) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialIterator.notificationEvent(AbstractSequentialIterator.java:132) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:134) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:108) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:134) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:108) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:176) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:296) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:46) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.executeChildren(AbstractFunction.java:40) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:233) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:278) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:391) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:329) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:229) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:134) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:108) at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:43) at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) at java.lang.Thread.run(Thread.java:595) Execution failed: Invalid identifier: [$, org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20091103-1121-5rjnv1z5:720000000004 type messagefile with no value at dataset=b (not closed)] honest4$ On 11/3/09 11:15 AM, 
Michael Wilde wrote: > Im running the Swift test battery on Ranger and Abe. > > Tests on Abe just got this, which is similar to what I was seeing: > > ... > Running test 0024-compound at Tue Nov 3 11:11:33 CST 2009 > Swift svn swift-r3186 cog-r2572 > > RunID: 20091103-1111-as1yb7s3 > Progress: > Execution failed: > First argument must be an identifier or a list of identifiers > SWIFT RETURN CODE NON-ZERO - test 0024-compound.swift > ... > > --- > > Ranger passed all syntax tests but is hanging on 001-echo. Not sure why yet. > > I'll keep trying to zero in on something small thats reproducble. > 0024-compound might be a good starting point. > > Let me know if you need something specific from me, or have suggestions > on what to try, where. > > Thanks, > > Mike > > > > > > On 11/3/09 11:05 AM, Mihael Hategan wrote: >> Thanks. I'm looking into it. >> >> On Tue, 2009-11-03 at 10:19 -0600, Michael Wilde wrote: >>> Similar problems happen on Ranger. I used an older Swift build >>> initially, and it ran. >>> >>> Then I updated to the latest, and I get the following similar failure: >>> >>> RunID: 20091103-1016-khfn59yf >>> Progress: >>> Ex098 >>> java.lang.NullPointerException >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.Variable.function(Variable.java:38) >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) >>> ... >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >>> at >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >>> at java.lang.Thread.run(Thread.java:619) >>> Execution failed: >>> Argument size mismatch. Got 1 names and 2 values >>> login3$ >>> >>> Im going try the bgp next. If it works there, we can leave this issue >>> for after SC. Else, I'll need to resolve it. >>> >>> - Mike >>> >>> -------- Original Message -------- >>> Subject: Re: [Swift-devel] Problems with Swift trunk on abe >>> Date: Tue, 03 Nov 2009 08:18:55 -0600 >>> From: Michael Wilde >>> To: swift-devel >>> References: <4AF03842.70300 at mcs.anl.gov> >>> >>> Using the Swift build (and swift.properties) that works on communicado, >>> I still get the same failures on Abe. >>> >>> I'm going to try a few other systems to see if the problem is widespread >>> or local to Abe. >>> >>> - Mike >>> >>> >>> On 11/3/09 8:03 AM, Michael Wilde wrote: >>>> What seems like the a build of the same Swift & CoG svn revision is >>>> working on communicado but giving strange and transient results on Abe. >>>> >>>> A series of 5 runs of the same code (with .xml and .kml files left in >>>> place) give the set of errors below. >>>> >>>> I will try to use the release build on communicado, to see if this is >>>> being caused by some Java compiler difference on Abe. >>>> >>>> Suggestions welcome. 
Mihael, I can get you the logs - they are on abe in >>>> ~wilde/run.58, along with the source: psim.basicex1.swift >>>> >>>> - Mike >>>> >>>> Output with a few progress and blank lines removed: >>>> >>>> (full output is in: http://www.ci.uchicago.edu/~wilde/abe.swiftproblem.txt ) >>>> >>>> honest4$ ls >>>> >>>> Protein.map* Protein.map~* etc input@ logs/ psim.basicex1.swift* >>>> sites.xml t.swift* t.swift~* tc tc~ >>>> >>>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>>> >>>> Swift svn swift-r3186 cog-r2572 >>>> RunID: 20091103-0750-p7z85z61 >>>> Execution failed: >>>> Argument size mismatch. Got 5 names and 0 values >>>> >>>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>>> >>>> Swift svn swift-r3186 cog-r2572 >>>> RunID: 20091103-0751-sfxthnne >>>> Execution failed: >>>> Argument size mismatch. Got 5 names and 0 values >>>> >>>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>>> >>>> Swift svn swift-r3186 cog-r2572 >>>> RunID: 20091103-0751-oxnhztrc >>>> Progress: >>>> Execution failed: >>>> Argument size mismatch. Got 5 names and 0 values >>>> >>>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>>> >>>> Swift svn swift-r3186 cog-r2572 >>>> RunID: 20091103-0751-amtfuvq9 >>>> Execution failed: >>>> Variable not found: path >>>> >>>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift >>>> >>>> Swift svn swift-r3186 cog-r2572 >>>> RunID: 20091103-0751-b4sr44x2 >>>> Execution failed: >>>> Ex098 >>>> org.globus.cog.karajan.workflow.KarajanRuntimeException: Invalid >>>> identifier: [$, org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uch\ >>>> icago.edu,2008:swift:dataset:20091103-0751-b65c863d:720000000020 type >>>> ProtGeo with no value at dataset=structure path=[1] (not closed)] >>>> at >>>> org.globus.cog.karajan.util.TypeUtil.toIdentifier(TypeUtil.java:286) >>>> ... >>>> at java.lang.Thread.run(Thread.java:595) >>>> Variable not found: path >>>> honest4$ >>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Nov 3 11:24:43 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Nov 2009 11:24:43 -0600 Subject: [Fwd: Re: [Swift-devel] Problems with Swift trunk on abe] In-Reply-To: <4AF0653E.7010008@mcs.anl.gov> References: <4AF05823.2070709@mcs.anl.gov> <1257267923.8801.1.camel@localhost> <4AF0653E.7010008@mcs.anl.gov> Message-ID: <1257269083.9525.0.camel@localhost> I'll let you know as soon as I have something. In the mean-time, you probably don't need to run more tests. On Tue, 2009-11-03 at 11:15 -0600, Michael Wilde wrote: > Im running the Swift test battery on Ranger and Abe. > > Tests on Abe just got this, which is similar to what I was seeing: > > ... > Running test 0024-compound at Tue Nov 3 11:11:33 CST 2009 > Swift svn swift-r3186 cog-r2572 > > RunID: 20091103-1111-as1yb7s3 > Progress: > Execution failed: > First argument must be an identifier or a list of identifiers > SWIFT RETURN CODE NON-ZERO - test 0024-compound.swift > ... 
> > --- > > Ranger passed all syntax tests but is hanging on 001-echo. Not sure why yet. > > I'll keep trying to zero in on something small thats reproducble. > 0024-compound might be a good starting point. > > Let me know if you need something specific from me, or have suggestions > on what to try, where. > > Thanks, > > Mike > > > > > > On 11/3/09 11:05 AM, Mihael Hategan wrote: > > Thanks. I'm looking into it. > > > > On Tue, 2009-11-03 at 10:19 -0600, Michael Wilde wrote: > >> Similar problems happen on Ranger. I used an older Swift build > >> initially, and it ran. > >> > >> Then I updated to the latest, and I get the following similar failure: > >> > >> RunID: 20091103-1016-khfn59yf > >> Progress: > >> Ex098 > >> java.lang.NullPointerException > >> at > >> org.globus.cog.karajan.workflow.nodes.functions.Variable.function(Variable.java:38) > >> at > >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:45) > >> at > >> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51) > >> ... > >> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > >> at > >> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > >> at java.lang.Thread.run(Thread.java:619) > >> Execution failed: > >> Argument size mismatch. Got 1 names and 2 values > >> login3$ > >> > >> Im going try the bgp next. If it works there, we can leave this issue > >> for after SC. Else, I'll need to resolve it. > >> > >> - Mike > >> > >> -------- Original Message -------- > >> Subject: Re: [Swift-devel] Problems with Swift trunk on abe > >> Date: Tue, 03 Nov 2009 08:18:55 -0600 > >> From: Michael Wilde > >> To: swift-devel > >> References: <4AF03842.70300 at mcs.anl.gov> > >> > >> Using the Swift build (and swift.properties) that works on communicado, > >> I still get the same failures on Abe. > >> > >> I'm going to try a few other systems to see if the problem is widespread > >> or local to Abe. > >> > >> - Mike > >> > >> > >> On 11/3/09 8:03 AM, Michael Wilde wrote: > >>> What seems like the a build of the same Swift & CoG svn revision is > >>> working on communicado but giving strange and transient results on Abe. > >>> > >>> A series of 5 runs of the same code (with .xml and .kml files left in > >>> place) give the set of errors below. > >>> > >>> I will try to use the release build on communicado, to see if this is > >>> being caused by some Java compiler difference on Abe. > >>> > >>> Suggestions welcome. Mihael, I can get you the logs - they are on abe in > >>> ~wilde/run.58, along with the source: psim.basicex1.swift > >>> > >>> - Mike > >>> > >>> Output with a few progress and blank lines removed: > >>> > >>> (full output is in: http://www.ci.uchicago.edu/~wilde/abe.swiftproblem.txt ) > >>> > >>> honest4$ ls > >>> > >>> Protein.map* Protein.map~* etc input@ logs/ psim.basicex1.swift* > >>> sites.xml t.swift* t.swift~* tc tc~ > >>> > >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > >>> > >>> Swift svn swift-r3186 cog-r2572 > >>> RunID: 20091103-0750-p7z85z61 > >>> Execution failed: > >>> Argument size mismatch. Got 5 names and 0 values > >>> > >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > >>> > >>> Swift svn swift-r3186 cog-r2572 > >>> RunID: 20091103-0751-sfxthnne > >>> Execution failed: > >>> Argument size mismatch. 
Got 5 names and 0 values > >>> > >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > >>> > >>> Swift svn swift-r3186 cog-r2572 > >>> RunID: 20091103-0751-oxnhztrc > >>> Progress: > >>> Execution failed: > >>> Argument size mismatch. Got 5 names and 0 values > >>> > >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > >>> > >>> Swift svn swift-r3186 cog-r2572 > >>> RunID: 20091103-0751-amtfuvq9 > >>> Execution failed: > >>> Variable not found: path > >>> > >>> honest4$ swift -tc.file tc -sites.file sites.xml psim.basicex1.swift > >>> > >>> Swift svn swift-r3186 cog-r2572 > >>> RunID: 20091103-0751-b4sr44x2 > >>> Execution failed: > >>> Ex098 > >>> org.globus.cog.karajan.workflow.KarajanRuntimeException: Invalid > >>> identifier: [$, org.griphyn.vdl.mapping.DataNode identifier tag:benc at ci.uch\ > >>> icago.edu,2008:swift:dataset:20091103-0751-b65c863d:720000000020 type > >>> ProtGeo with no value at dataset=structure path=[1] (not closed)] > >>> at > >>> org.globus.cog.karajan.util.TypeUtil.toIdentifier(TypeUtil.java:286) > >>> ... > >>> at java.lang.Thread.run(Thread.java:595) > >>> Variable not found: path > >>> honest4$ > >>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From skenny at uchicago.edu Tue Nov 3 13:40:08 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Tue, 3 Nov 2009 13:40:08 -0600 (CST) Subject: [Swift-devel] more swift and bgp plots In-Reply-To: <1257195064.24600.1.camel@localhost> References: <1257195064.24600.1.camel@localhost> Message-ID: <20091103134008.CEX28656@m4500-02.uchicago.edu> can you post the sites file(s) used for your runs on bgp? thnx ~sk ---- Original message ---- >Date: Mon, 02 Nov 2009 14:51:04 -0600 >From: Mihael Hategan >Subject: [Swift-devel] more swift and bgp plots >To: swift-devel at ci.uchicago.edu > >http://www.mcs.anl.gov/~hategan/report-8k-60s/ >http://www.mcs.anl.gov/~hategan/report-8k-120s/ >http://www.mcs.anl.gov/~hategan/report-12k-120s/ >http://www.mcs.anl.gov/~hategan/report-20k-300s/ > > >_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Nov 3 13:51:24 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 03 Nov 2009 13:51:24 -0600 Subject: [Swift-devel] more swift and bgp plots In-Reply-To: <20091103134008.CEX28656@m4500-02.uchicago.edu> References: <1257195064.24600.1.camel@localhost> <20091103134008.CEX28656@m4500-02.uchicago.edu> Message-ID: <1257277884.14655.0.camel@localhost> 10 512 4 512 HTCScienceApps zeptoos 3000 true 100000 /intrepid-fs0/users/hategan/scratch /scratch On Tue, 2009-11-03 at 13:40 -0600, skenny at uchicago.edu wrote: > can you post the sites file(s) used for your runs on bgp? 
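[Note: the sites file quoted just above lost its XML markup in the archive; only the values survive (10, 512, 4, 512, HTCScienceApps, zeptoos, 3000, true, 100000, and the two paths). Below is a rough sketch of the kind of coaster pool entry those values come from. The element and profile key names are assumptions based on typical Swift coaster BG/P site definitions of this period, and the value-to-key mapping is a guess, not the original file; the remaining "true" and "100000" values belong to further profile keys that are not guessed at here.]

    <pool handle="bgp">
      <execution provider="coaster" url="localhost" jobmanager="local:cobalt"/>
      <profile namespace="globus" key="slots">10</profile>
      <profile namespace="globus" key="maxNodes">512</profile>
      <profile namespace="globus" key="workersPerNode">4</profile>
      <profile namespace="globus" key="nodeGranularity">512</profile>
      <profile namespace="globus" key="project">HTCScienceApps</profile>
      <profile namespace="globus" key="kernelprofile">zeptoos</profile>
      <profile namespace="globus" key="maxtime">3000</profile>
      <workdirectory>/intrepid-fs0/users/hategan/scratch</workdirectory>
      <scratch>/scratch</scratch>
    </pool>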
> > thnx > ~sk > > ---- Original message ---- > >Date: Mon, 02 Nov 2009 14:51:04 -0600 > >From: Mihael Hategan > >Subject: [Swift-devel] more swift and bgp plots > >To: swift-devel at ci.uchicago.edu > > > >http://www.mcs.anl.gov/~hategan/report-8k-60s/ > >http://www.mcs.anl.gov/~hategan/report-8k-120s/ > >http://www.mcs.anl.gov/~hategan/report-12k-120s/ > >http://www.mcs.anl.gov/~hategan/report-20k-300s/ > > > > > >_______________________________________________ > >Swift-devel mailing list > >Swift-devel at ci.uchicago.edu > >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Nov 4 12:19:05 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 04 Nov 2009 12:19:05 -0600 Subject: [Swift-devel] swift cmd args to select execution site(s)? Message-ID: <4AF1C599.2030202@mcs.anl.gov> Mihael, A while back you mentioned as an aside in an email that this capability is now in the trunk. Did I understand you right, and if so can you point me at how to use it? (Ive been hunting and cant find it...) It would have been something Ben did in his last round of improvements. Ben? Thanks, Mike From hategan at mcs.anl.gov Wed Nov 4 12:24:33 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Nov 2009 12:24:33 -0600 Subject: [Swift-devel] Re: swift cmd args to select execution site(s)? In-Reply-To: <4AF1C599.2030202@mcs.anl.gov> References: <4AF1C599.2030202@mcs.anl.gov> Message-ID: <1257359073.4902.1.camel@localhost> On Wed, 2009-11-04 at 12:19 -0600, Michael Wilde wrote: > Mihael, > > A while back you mentioned as an aside in an email that this capability > is now in the trunk. Did I understand you right, and if so can you point > me at how to use it? (Ive been hunting and cant find it...) I have no recollection of such a feature or of having committed code related to such a feature. From hategan at mcs.anl.gov Wed Nov 4 12:36:36 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 04 Nov 2009 12:36:36 -0600 Subject: [Swift-devel] Re: swift cmd args to select execution site(s)? In-Reply-To: <1257359073.4902.1.camel@localhost> References: <4AF1C599.2030202@mcs.anl.gov> <1257359073.4902.1.camel@localhost> Message-ID: <1257359796.5161.0.camel@localhost> On Wed, 2009-11-04 at 12:24 -0600, Mihael Hategan wrote: > On Wed, 2009-11-04 at 12:19 -0600, Michael Wilde wrote: > > Mihael, > > > > A while back you mentioned as an aside in an email that this capability > > is now in the trunk. Did I understand you right, and if so can you point > > me at how to use it? (Ive been hunting and cant find it...) > > I have no recollection of such a feature or of having committed code > related to such a feature. I also looked at the commit logs and the code and didn't find anything that would do that. From benc at hawaga.org.uk Wed Nov 4 19:39:35 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 5 Nov 2009 01:39:35 +0000 (GMT) Subject: [Swift-devel] swift cmd args to select execution site(s)? In-Reply-To: <4AF1C599.2030202@mcs.anl.gov> References: <4AF1C599.2030202@mcs.anl.gov> Message-ID: On Wed, 4 Nov 2009, Michael Wilde wrote: > A while back you mentioned as an aside in an email that this capability is now > in the trunk. Did I understand you right, and if so can you point me at how to > use it? (Ive been hunting and cant find it...) > > It would have been something Ben did in his last round of improvements. > > Ben? I don't remember implementing this. 
-- From aespinosa at cs.uchicago.edu Thu Nov 5 20:37:39 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 5 Nov 2009 20:37:39 -0600 Subject: [Swift-devel] using cobalt-script mode on the bluegene/p Message-ID: <50b07b4b0911051837q13658947wf1f348de3446f129@mail.gmail.com> it is interesting how I caught stdouts from the echo of my hello_wrapper.sh scripts but not the stdout of the mpi program. It was directed to a file in my ~/.globus directory as specified by the cqsub command in the cobalt provider. This is due to a known cobalt bug (https://trac.mcs.anl.gov/projects/cobalt/ticket/310). But the workflow finished successfully. As of now, either we should explicitly indicate output files in our mpi jobs or manually grab stdout and stderr outputs in the depths of your ~/.globus directory. my sites.xml: 64 HTCScienceApps 20 script prod-devel /intrepid-fs0/users/espinosa/scratch/mpi_runs my hello_wrapper.sh for the app function hello(): #!/bin/bash echo "Cobalt script job" cobalt-mpirun -np 256 -mode vn -cwd `pwd` \ /home/espinosa/experiments/mpitest/hello echo "Finished script" workflow: type file; app (file out) hello() { hello stdout=@filename(out); } file output<"stdout.file">; output = hello(); -- Allan M. Espinosa PhD student, Computer Science University of Chicago From aespinosa at cs.uchicago.edu Mon Nov 9 02:54:56 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 9 Nov 2009 02:54:56 -0600 Subject: [Swift-devel] disable clean-up of ~/.globus/scripts Message-ID: <50b07b4b0911090054v50f5977cgf14f6def5182d787@mail.gmail.com> Hi guys, where in the source tree should i comment out if I want to disable the cleanup of my ~/.globus/scripts in job submissions using the localscheduler provider? I'm trying to debug some cobalt-bluegene stuff. thanks, -Allan -- Allan M. Espinosa PhD student, Computer Science University of Chicago From aespinosa at cs.uchicago.edu Mon Nov 9 03:04:41 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 9 Nov 2009 03:04:41 -0600 Subject: [Swift-devel] Re: disable clean-up of ~/.globus/scripts In-Reply-To: <50b07b4b0911090054v50f5977cgf14f6def5182d787@mail.gmail.com> References: <50b07b4b0911090054v50f5977cgf14f6def5182d787@mail.gmail.com> Message-ID: <50b07b4b0911090104gdf0ccb1wdd5d063f1e919475@mail.gmail.com> nevermind. I think i found it (yay!) :D src/org/globus/cog/abstraction/impl/scheduler/cobalt/CobaltExecutor.java lines 172, and 173 right? 2009/11/9 Allan Espinosa : > Hi guys, > > where in the source tree should i comment out if I want to disable the > cleanup of my ~/.globus/scripts in job submissions using the > localscheduler provider? ?I'm trying to debug some cobalt-bluegene > stuff. > > thanks, > -Allan > > From hategan at mcs.anl.gov Mon Nov 9 11:21:44 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Nov 2009 11:21:44 -0600 Subject: [Swift-devel] Re: disable clean-up of ~/.globus/scripts In-Reply-To: <50b07b4b0911090104gdf0ccb1wdd5d063f1e919475@mail.gmail.com> References: <50b07b4b0911090054v50f5977cgf14f6def5182d787@mail.gmail.com> <50b07b4b0911090104gdf0ccb1wdd5d063f1e919475@mail.gmail.com> Message-ID: <1257787304.24003.1.camel@localhost> On Mon, 2009-11-09 at 03:04 -0600, Allan Espinosa wrote: > nevermind. I think i found it (yay!) :D > > src/org/globus/cog/abstraction/impl/scheduler/cobalt/CobaltExecutor.java > lines 172, and 173 > > right? Right. 
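[Note: an alternative to patching CobaltExecutor is to snapshot the generated scripts while a run is in progress; a minimal sketch, assuming the default ~/.globus/scripts location mentioned above:]

    # keep copies of the generated Cobalt submit/wrapper scripts
    # before the provider cleans them up after the job completes
    mkdir -p ~/cobalt-script-copies
    while true; do
        rsync -a ~/.globus/scripts/ ~/cobalt-script-copies/
        sleep 2
    done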
> > 2009/11/9 Allan Espinosa : > > Hi guys, > > > > where in the source tree should i comment out if I want to disable the > > cleanup of my ~/.globus/scripts in job submissions using the > > localscheduler provider? I'm trying to debug some cobalt-bluegene > > stuff. > > > > thanks, > > -Allan > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From skenny at uchicago.edu Mon Nov 9 12:43:56 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 9 Nov 2009 12:43:56 -0600 (CST) Subject: [Swift-devel] exception caught while writing to log file Message-ID: <20091109124356.CFF64631@m4500-02.uchicago.edu> org.globus.cog.karajan.workflow.KarajanRuntimeException: Exception caught while writing to log file at org.globus.cog.karajan.workflow.nodes.restartLog.LogVargOperator.update(LogVargOperator.java:40) at org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.append(VariableArgumentsOperator.java:38) at org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.appendAll(VariableArgumentsOperator.java:44) at org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.merge(VariableArgumentsOperator.java:34) has anyone gotten this error before? i'm able to write to files in the directory, so i'm not totally out of space. log file is here on ci: /ci/projects/cnari/logs/skenny/importall-20091109-0133-y7prc38b.log From wilde at mcs.anl.gov Mon Nov 9 13:05:42 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 09 Nov 2009 13:05:42 -0600 Subject: [Swift-devel] exception caught while writing to log file In-Reply-To: <20091109124356.CFF64631@m4500-02.uchicago.edu> References: <20091109124356.CFF64631@m4500-02.uchicago.edu> Message-ID: <4AF86806.7090601@mcs.anl.gov> I've not seen it. The log below is 7.8GB in size, but it looks like its failing on a write to the restart log. How big is your .rlog file? You dont by any chance have a ulimit file size set, do you? - Mike On 11/9/09 12:43 PM, skenny at uchicago.edu wrote: > org.globus.cog.karajan.workflow.KarajanRuntimeException: > Exception caught while writing to log file > at > org.globus.cog.karajan.workflow.nodes.restartLog.LogVargOperator.update(LogVargOperator.java:40) > at > org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.append(VariableArgumentsOperator.java:38) > at > org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.appendAll(VariableArgumentsOperator.java:44) > at > org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.merge(VariableArgumentsOperator.java:34) > > has anyone gotten this error before? i'm able to write to > files in the directory, so i'm not totally out of space. > > log file is here on ci: > > /ci/projects/cnari/logs/skenny/importall-20091109-0133-y7prc38b.log > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Nov 9 13:06:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Nov 2009 13:06:02 -0600 Subject: [Swift-devel] exception caught while writing to log file In-Reply-To: <20091109124356.CFF64631@m4500-02.uchicago.edu> References: <20091109124356.CFF64631@m4500-02.uchicago.edu> Message-ID: <1257793562.27365.1.camel@localhost> Yes. Somebody else reported that. 
Seems to be part of a clan of bugs that I started fixing last week. On Mon, 2009-11-09 at 12:43 -0600, skenny at uchicago.edu wrote: > org.globus.cog.karajan.workflow.KarajanRuntimeException: > Exception caught while writing to log file > at > org.globus.cog.karajan.workflow.nodes.restartLog.LogVargOperator.update(LogVargOperator.java:40) > at > org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.append(VariableArgumentsOperator.java:38) > at > org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.appendAll(VariableArgumentsOperator.java:44) > at > org.globus.cog.karajan.workflow.nodes.functions.VariableArgumentsOperator.merge(VariableArgumentsOperator.java:34) > > has anyone gotten this error before? i'm able to write to > files in the directory, so i'm not totally out of space. > > log file is here on ci: > > /ci/projects/cnari/logs/skenny/importall-20091109-0133-y7prc38b.log > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Nov 9 13:10:11 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Nov 2009 13:10:11 -0600 Subject: [Swift-devel] exception caught while writing to log file In-Reply-To: <4AF86806.7090601@mcs.anl.gov> References: <20091109124356.CFF64631@m4500-02.uchicago.edu> <4AF86806.7090601@mcs.anl.gov> Message-ID: <1257793811.27365.3.camel@localhost> On Mon, 2009-11-09 at 13:05 -0600, Michael Wilde wrote: > I've not seen it. The log below is 7.8GB in size, but it looks like its > failing on a write to the restart log. Justin reported the same on a small workflow. It seems to be a case of writing to the log after it has been closed due to some improper synchronization issue. From hategan at mcs.anl.gov Mon Nov 9 16:45:49 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 09 Nov 2009 16:45:49 -0600 Subject: [Swift-devel] exception caught while writing to log file In-Reply-To: <1257793811.27365.3.camel@localhost> References: <20091109124356.CFF64631@m4500-02.uchicago.edu> <4AF86806.7090601@mcs.anl.gov> <1257793811.27365.3.camel@localhost> Message-ID: <1257806749.32125.0.camel@localhost> On Mon, 2009-11-09 at 13:10 -0600, Mihael Hategan wrote: > On Mon, 2009-11-09 at 13:05 -0600, Michael Wilde wrote: > > I've not seen it. The log below is 7.8GB in size, but it looks like its > > failing on a write to the restart log. > > Justin reported the same on a small workflow. > > It seems to be a case of writing to the log after it has been closed due > to some improper synchronization issue. Should be fixed in svn. In any event, it isn't generally consequential. It happens as a consequence of the run failing, and does not by itself cause the failure. From wilde at mcs.anl.gov Wed Nov 11 10:38:31 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 11 Nov 2009 10:38:31 -0600 Subject: [Swift-devel] Unexpected messages from coasters on BG/P Message-ID: <4AFAE887.8060703@mcs.anl.gov> Mihael, can you tell me what the messages below mean? - the block ended prematurely message - the long java tracebacks (seems like one per each of 256 jobs? Will send more logs later if needed. 
Thanks, Mike -- Progress: uninitialized:3 Progress: Stage in:240 Submitting:16 Progress: Stage in:169 Submitting:87 Progress: Stage in:61 Submitting:195 Progress: Submitted:256 Progress: Submitted:256 Progress: Submitted:256 Progress: Submitted:256 Progress: Submitted:256 Progress: Submitted:256 Worker task failed: 1111-271019-000000Block task ended prematurely Progress: Submitted:255 Active:1 Failed to transfer wrapper log from raxmlex1-20091111-1027-26gklim9/info/z on surveyor Progress: Submitted:256 Progress: Submitted:256 Progress: Submitted:256 Progress: Submitted:256 Progress: Submitted:256 Progress: Submitted:255 Active:1 Progress: Active:256 Unknown handler: CANCELJOB. Available handlers: {CHMOD=class org.globus.cog.abstraction.impl.file.coaster.handlers.ChmodHandler, ISDIR=class org.globus.cog.abstraction.impl.file.coaster.handlers.IsDirectoryHandler, LIST=class org.globus.cog.abstraction.impl.file.coaster.handlers.ListHandler, SUBMITJOB=class org.globus.cog.abstraction.coaster.service.SubmitJobHandler, MKDIR=class org.globus.cog.abstraction.impl.file.coaster.handlers.MkdirHandler, PUT=class org.globus.cog.abstraction.impl.file.coaster.handlers.PutFileHandler, DEL=class org.globus.cog.abstraction.impl.file.coaster.handlers.DeleteHandler, CONFIGSERVICE=class org.globus.cog.abstraction.coaster.service.ServiceConfigurationHandler, FILEINFO=class org.globus.cog.abstraction.impl.file.coaster.handlers.FileInfoHandler, SHUTDOWNSERVICE=class org.globus.cog.abstraction.coaster.service.ServiceShutdownHandler, SHUTDOWN=class org.globus.cog.karajan.workflow.service.handlers.ShutdownHandler, EXISTS=class org.globus.cog.abstraction.impl.file.coaster.handlers.ExistsHandler, CHANNELCONFIG=class org.globus.cog.karajan.workflow.service.handlers.ChannelConfigurationHandler, RMDIR=class org.globus.cog.abstraction.impl.file.coaster.handlers.RmdirHandler, VERSION=class org.globus.cog.karajan.workflow.service.handlers.VersionHandler, RENAME=class org.globus.cog.abstraction.impl.file.coaster.handlers.RenameHandler, WORKERSHELLCMD=class org.globus.cog.abstraction.coaster.service.WorkerShellHandler, GET=class org.globus.cog.abstraction.impl.file.coaster.handlers.GetFileHandler} Unknown handler: CANCELJOB. Available handlers: {CHMOD=class org.globus.cog.abstraction.impl.file.coaster.handlers.ChmodHandler, ISDIR=class org.globus.cog.abstraction.impl.file.coaster.handlers.IsDirectoryHandler, LIST=class org.globus.cog.abstraction.impl.file.coaster.handlers.ListHandler, SUBMITJOB=class org.globus.cog.abstraction.coaster.service.SubmitJobHandler, MKDIR=class org.globus.cog.abstraction.impl.file.coaster.handlers.MkdirHandler, PUT=class org.globus.cog.abstraction.impl.file.coaster.handlers.PutFileHandler, DEL=class org.globus.cog.abstraction.impl.file.coaster.handlers.DeleteHandler, CONFIGSERVICE=class org.globus.cog.abstraction. From hategan at mcs.anl.gov Wed Nov 11 12:40:21 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Nov 2009 12:40:21 -0600 Subject: [Swift-devel] Unexpected messages from coasters on BG/P In-Reply-To: <4AFAE887.8060703@mcs.anl.gov> References: <4AFAE887.8060703@mcs.anl.gov> Message-ID: <1257964821.24801.8.camel@localhost> On Wed, 2009-11-11 at 10:38 -0600, Michael Wilde wrote: > Mihael, can you tell me what the messages below mean? > > - the block ended prematurely message That says that the block job completed before being commanded to shut down. It's very likely that workers didn't even get started. 
It usually indicates a problem with the queue parameters (maybe you forgot kernel=zeptoos), but it's hard to tell without looking at cobalt logs. It is also not a problem that cqsub would complain about, since this only happens when the job is successfully queued. > - the long java tracebacks (seems like one per each of 256 jobs? That tells that the coaster provider doesn't yet implement job canceling. Normally, this doesn't pop up. But if you have replication enabled, and jobs get a chance to get replicated, you will see these when the copies start to run. You should disable replication. It's useless if only running on the BG/P. In fact, the system should disable it automatically for applications that are only present on one site. From wilde at mcs.anl.gov Wed Nov 11 13:06:44 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 11 Nov 2009 13:06:44 -0600 Subject: [Swift-devel] Unexpected messages from coasters on BG/P In-Reply-To: <1257964821.24801.8.camel@localhost> References: <4AFAE887.8060703@mcs.anl.gov> <1257964821.24801.8.camel@localhost> Message-ID: <4AFB0B44.8010204@mcs.anl.gov> On 11/11/09 12:40 PM, Mihael Hategan wrote: > On Wed, 2009-11-11 at 10:38 -0600, Michael Wilde wrote: >> Mihael, can you tell me what the messages below mean? >> >> - the block ended prematurely message > > That says that the block job completed before being commanded to shut > down. It's very likely that workers didn't even get started. It usually > indicates a problem with the queue parameters (maybe you forgot > kernel=zeptoos), but it's hard to tell without looking at cobalt logs. > It is also not a problem that cqsub would complain about, since this > only happens when the job is successfully queued. Here is my sites.xml pool element: 10 64 4 64 HTCScienceApps zeptoos 3000 true 100000 /home/wilde/swiftwork /scratch I copied it from one you posted and have not yet tuned it for a small test. > >> - the long java tracebacks (seems like one per each of 256 jobs? > > That tells that the coaster provider doesn't yet implement job > canceling. Normally, this doesn't pop up. But if you have replication > enabled, and jobs get a chance to get replicated, you will see these > when the copies start to run. I'll do my next runs with replication and retry off and try to shrink and replicate the problem with less noise. Thanks, Mike > > You should disable replication. It's useless if only running on the > BG/P. In fact, the system should disable it automatically for > applications that are only present on one site. > From wilde at mcs.anl.gov Wed Nov 11 13:22:12 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 11 Nov 2009 13:22:12 -0600 Subject: [Swift-devel] suspected error in simple mapper Message-ID: <4AFB0EE4.9060104@mcs.anl.gov> When I use this mapper: Phylip keggfiles[] ; and have files in the dir like this: K00123.phylip I seem to get mapped filenames of K0123.phylip. Its loosing the leading "0". Its as if the mapper is assuming there is a separator character after the "K". Yet if I use a 2-character prefix "K0": Phylip keggfiles[] ; then I get the desired mapping of "K00123.phylip". I will try to create a test case for this later. Anyone seen anything similar? 
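For reference, a minimal sketch of the two declarations in question, assuming the simple_mapper syntax with prefix and suffix parameters (any location parameter is left out here):

  Phylip keggfiles[] <simple_mapper; prefix="K", suffix=".phylip">;
  Phylip keggfiles[] <simple_mapper; prefix="K0", suffix=".phylip">;

The first form is the one that drops the leading "0"; the second is the workaround described above.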
- Mike From hategan at mcs.anl.gov Wed Nov 11 13:26:21 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 11 Nov 2009 13:26:21 -0600 Subject: [Swift-devel] suspected error in simple mapper In-Reply-To: <4AFB0EE4.9060104@mcs.anl.gov> References: <4AFB0EE4.9060104@mcs.anl.gov> Message-ID: <1257967581.26219.1.camel@localhost> On Wed, 2009-11-11 at 13:22 -0600, Michael Wilde wrote: > When I use this mapper: > > Phylip keggfiles[] ; > > and have files in the dir like this: K00123.phylip > > I seem to get mapped filenames of K0123.phylip. Its loosing the leading > "0". Its as if the mapper is assuming there is a separator character > after the "K". By default, indices are 4-digit. You could add padding=5 to the mapper parameters to get what you want. > > Yet if I use a 2-character prefix "K0": > > Phylip keggfiles[] ; > > then I get the desired mapping of "K00123.phylip". > > I will try to create a test case for this later. Anyone seen anything > similar? > > - Mike > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Nov 12 23:53:50 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 12 Nov 2009 23:53:50 -0600 Subject: [Swift-devel] ssh, intrepid, etc. Message-ID: <1258091630.7477.22.camel@localhost> Folk requested the ability to use the BG/Ps remotely. Right now, the only path into intrepid is ssh with a cryptocard. The ssh provider did not support keyboard-interactive authentication, so the authentication stuff had to be cleaned a little. In most cases, you won't see many differences. However, some exist: - if nothing is specified in auth.defaults, the client will prompt for a username and then try all authentication methods supported by the server and the client (pubkey, pwd, kbd-interactive) - if some type of authentication is specified in auth.defaults, the client will only try that method. - it is not necessary to specify all parameters (such as username, key path, etc.). If you don't you will be prompted for them. If you do, the prompt will be pre-populated with the info - the graphical prompts have gone up a notch in usability. I think. Running swift through this pretty much means you have to run with coasters (unless you want to keep typing tokens from the crypto card). Here are some details on how to run this on intrepid: 1. Make sure you set GLOBUS_HOSTNAME to the external IP of your submit machine. 2. Hack around the following sample sites.xml: /home/hategan/work /scratch HTCScienceApps prod-devel zeptoos true 10000 4 8 3000 64 64 172.17.5.144 3. Unfortunately coasters need GSI credentials for security reasons. You need a proxy on the submit side. Since SSH doesn't support GSI delegation, you also need a valid proxy on intrepid. I'm thinking of ways of solving this issue, but until then this is needed. What will happen is that you will see a prompt once for the username and one for the password. You can put the username in auth.defaults and the auth type to "interactive", and then you'll only get one prompt for the password. I have only tried this in an environment where Swing graphical apps can run. The prompts should also work in text-mode, but it needs testing. 
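A minimal shell sketch of steps 1 and 3 above, assuming the standard Globus proxy tools (grid-proxy-init, grid-proxy-info) are on the PATH on both the submit host and the intrepid login node; the IP value is a placeholder for whatever your submit machine's external address actually is:

  # on the submit host
  export GLOBUS_HOSTNAME=your.external.ip.here   # external IP of the submit machine
  grid-proxy-init -valid 24:00                   # proxy on the client side

  # on intrepid (ssh does not delegate GSI credentials, so a separate proxy is needed there)
  grid-proxy-init -valid 24:00
  grid-proxy-info -timeleft                      # sanity check that the proxy is still valid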
From wozniak at mcs.anl.gov Fri Nov 13 11:42:36 2009 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 13 Nov 2009 11:42:36 -0600 (CST) Subject: [Swift-devel] Issues with dependency completion Message-ID: I'm currently getting some new errors in my swift output (for the PTMap workflow). Might be related to what Mike was seeing. I have replication off, zeptoos on. I'm going to keep looking for what the problem might be but I thought I should post that I'm seeing something similar. Progress: Submitted:10 Finished successfully:16 Progress: Submitted:10 Finished successfully:16 ... Progress: Submitted:10 Finished successfully:16 Progress: Submitted:10 Finished successfully:16 Worker task failed: 1113-451015-000000Block task ended prematurely Statement unlikely to be reached at /home/wozniak/.globus/coasters/cscript31418.pl line 592. (Maybe you meant system() when you said exec()?) Statement unlikely to be reached at /home/wozniak/.globus/coasters/cscript31418.pl line 592. (Maybe you meant system() when you said exec()?) ... Statement unlikely to be reached at /home/wozniak/.globus/coasters/cscript31418.pl line 592. (Maybe you meant system() when you said exec()?) ... Statement unlikely to be reached at /home/wozniak/.globus/coasters/cscript31418.pl line 592. (Maybe you meant system() when you said exec()?) Failed to connect: Connection timed out at /home/wozniak/.globus/coasters/cscript31418.pl line 129. Failed to connect: Connection timed out at /home/wozniak/.globus/coasters/cscript31418.pl line 129. Failed to connect: Connection timed out at /home/wozniak/.globus/coasters/cscript31418.pl line 129. Failed to connect: Connection timed out at /home/wozniak/.globus/coasters/cscript31418.pl line 129. -- Justin M Wozniak From wilde at mcs.anl.gov Fri Nov 13 12:31:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 13 Nov 2009 12:31:55 -0600 Subject: [Swift-devel] Intrepid status monitor Message-ID: <4AFDA61B.2050603@mcs.anl.gov> Very useful to Swift developers on Intrepid I suspect: http://status.alcf.anl.gov/intrepid/activity From hategan at mcs.anl.gov Fri Nov 13 12:45:09 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 13 Nov 2009 12:45:09 -0600 Subject: [Swift-devel] Issues with dependency completion In-Reply-To: References: Message-ID: <1258137909.12225.1.camel@localhost> The connection time-out would suggest that something isn't right with the network config. Usual suspects are NAT not working or a missing GLOBUS_HOSTNAME. Btw, in order to avoid setting GLOBUS_HOSTNAME every time, you could put it in sites.xml as this: 172.17.5.144 On Fri, 2009-11-13 at 11:42 -0600, Justin M Wozniak wrote: > I'm currently getting some new errors in my swift output (for the PTMap > workflow). Might be related to what Mike was seeing. I have replication > off, zeptoos on. I'm going to keep looking for what the problem might be > but I thought I should post that I'm seeing something similar. > > Progress: Submitted:10 Finished successfully:16 > Progress: Submitted:10 Finished successfully:16 > > ... > > Progress: Submitted:10 Finished successfully:16 > Progress: Submitted:10 Finished successfully:16 > Worker task failed: 1113-451015-000000Block task ended prematurely > > Statement unlikely to be reached at > /home/wozniak/.globus/coasters/cscript31418.pl line 592. > (Maybe you meant system() when you said exec()?) > Statement unlikely to be reached at > /home/wozniak/.globus/coasters/cscript31418.pl line 592. > (Maybe you meant system() when you said exec()?) > > ... 
> > Statement unlikely to be reached at > /home/wozniak/.globus/coasters/cscript31418.pl line 592. > (Maybe you meant system() when you said exec()?) > > ... > > Statement unlikely to be reached at > /home/wozniak/.globus/coasters/cscript31418.pl line 592. > (Maybe you meant system() when you said exec()?) > Failed to connect: Connection timed out at > /home/wozniak/.globus/coasters/cscript31418.pl line 129. > Failed to connect: Connection timed out at > /home/wozniak/.globus/coasters/cscript31418.pl line 129. > Failed to connect: Connection timed out at > /home/wozniak/.globus/coasters/cscript31418.pl line 129. > Failed to connect: Connection timed out at > /home/wozniak/.globus/coasters/cscript31418.pl line 129. > From wozniak at mcs.anl.gov Fri Nov 13 13:32:44 2009 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 13 Nov 2009 13:32:44 -0600 (CST) Subject: [Swift-devel] Issues with dependency completion In-Reply-To: <1258137909.12225.1.camel@localhost> References: <1258137909.12225.1.camel@localhost> Message-ID: That was it. oops On Fri, 13 Nov 2009, Mihael Hategan wrote: > The connection time-out would suggest that something isn't right with > the network config. Usual suspects are NAT not working or a missing > GLOBUS_HOSTNAME. > > Btw, in order to avoid setting GLOBUS_HOSTNAME every time, you could put > it in sites.xml as this: > key="internalHostname">172.17.5.144 > > > On Fri, 2009-11-13 at 11:42 -0600, Justin M Wozniak wrote: >> I'm currently getting some new errors in my swift output (for the PTMap >> workflow). Might be related to what Mike was seeing. I have replication >> off, zeptoos on. I'm going to keep looking for what the problem might be >> but I thought I should post that I'm seeing something similar. -- Justin M Wozniak From foster at anl.gov Fri Nov 13 13:59:31 2009 From: foster at anl.gov (Ian Foster) Date: Fri, 13 Nov 2009 13:59:31 -0600 Subject: [Swift-devel] Intrepid status monitor In-Reply-To: <4AFDA61B.2050603@mcs.anl.gov> References: <4AFDA61B.2050603@mcs.anl.gov> Message-ID: lovely On Nov 13, 2009, at 12:31 PM, Michael Wilde wrote: > Very useful to Swift developers on Intrepid I suspect: > > http://status.alcf.anl.gov/intrepid/activity > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From bugzilla-daemon at mcs.anl.gov Sat Nov 14 16:30:11 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sat, 14 Nov 2009 16:30:11 -0600 (CST) Subject: [Swift-devel] [Bug 221] New: -config option doesnt seem to use etc/swift.properties as a base Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=221 Summary: -config option doesnt seem to use etc/swift.properties as a base Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: minor Priority: P3 Component: SwiftScript language AssignedTo: hategan at mcs.anl.gov ReportedBy: wilde at mcs.anl.gov When the -config option on the swift command line is given a file with just a few properties to override, then it sees like etc/swift.properties is not read in first as a base (as the user guides indicates it should be). Its not clear if this is intentional (and a user guide error) or inadvertent. 
This behavior is seen in this example: com$ swift psim.itfixex1.swift Swift svn swift-r3187 cog-r2579 RunID: 20091114-1623-10i1k2g3 com$ swift -config set.properties psim.itfixex1.swift Execution failed: File not found ${vds.home}/etc/sites.xml com$ cat set.properties execution.retries=2 sitedir.keep=false com$ -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From iraicu at cs.uchicago.edu Mon Nov 16 14:18:31 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 16 Nov 2009 14:18:31 -0600 Subject: [Swift-devel] 100 million cores in 2018 Message-ID: <4B01B397.9030604@cs.uchicago.edu> Hi all, Here is an article I just read on the expectation that supercomputers in 2018 will have 100 million cores. http://www.computerworld.com/s/article/9140928/Supercomputers_with_100_million_cores_coming_by_2018 I suppose this gives us new targets to shoot for when thinking about scalability of our work. Cheers, Ioan -- ================================================================= Ioan Raicu, Ph.D. NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= ================================================================= From wilde at mcs.anl.gov Tue Nov 17 23:08:00 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 17 Nov 2009 21:08:00 -0800 Subject: [Swift-devel] Shutting down BG/P jobs after swift script completes Message-ID: <4B038130.1040007@mcs.anl.gov> Mihael, it seems like jobs linger, and new jobs start, after a swift script completes (on surveyor with coasters) Info below. 
- Mike I saw this in qstat: sur$ qstat JobID User WallTime Nodes State Location ================================================== 137824 wilde 00:29:00 64 queued None sur$ qstat JobID User WallTime Nodes State Location ============================================================ 137824 wilde 00:29:00 64 running ANL-R00-M0-N12-64 After the script completed, I saw this: sur$ qstat JobID User WallTime Nodes State Location ============================================================= 137824 wilde 00:29:00 64 running ANL-R00-M0-N12-64 137825 wilde 00:10:00 1 starting ANL-R00-M0-N14-64 sur$ sur$ qstat JobID User WallTime Nodes State Location ============================================================ 137824 wilde 00:29:00 64 running ANL-R00-M0-N12-64 137825 wilde 00:10:00 1 running ANL-R00-M0-N14-64 --- for this script activity: sur$ run.itfixex1.sh Running from host with compute-node reachable address of 172.17.3.16 Running in /home/wilde/protests/run.itfix.49 protlib2 home is /home/wilde/protlib2 Swift svn swift-r3190 cog-r2605 RunID: 20091117-2257-3hsazpy8 Progress: Progress: Checking status:1 Progress: Submitting:3 Submitted:1 Finished successfully:1 Progress: Submitted:4 Finished successfully:1 Progress: Submitted:4 Finished successfully:1 Progress: Submitted:4 Finished successfully:1 Progress: Submitted:4 Finished successfully:1 Progress: Submitted:3 Active:1 Finished successfully:1 Progress: Active:4 Finished successfully:1 Progress: Active:4 Finished successfully:1 Progress: Active:3 Checking status:1 Finished successfully:1 Progress: Checking status:1 Finished successfully:5 Progress: Active:4 Finished successfully:6 Progress: Active:3 Checking status:1 Finished successfully:6 Progress: Submitting:1 Finished successfully:10 Progress: Active:1 Finished successfully:10 Progress: Checking status:1 Finished successfully:10 Final status: Finished successfully:11 Cleaning up... Shutting down service at https://172.17.3.16:50002 Got channel MetaChannel: 177867418 -> null + Done sur$ --- With these settings: cat >tc <sites.xml < $rundir 0.01 10000 1 64 4 64 JGI-Pilot zeptoos 1200 true 2.55 100000 $rundir EOF # Put this back in for performance # /scratch # Copy in swift script and mappers cp $p2home/swift/{psim.itfixex1.swift,swift.properties,Protein.map,ItFixProtein.map,ItFixProtSim.map,plist2} . swiftdir=$(dirname $(dirname $(which swift))) cp $swiftdir/etc/swift.properties . cat >>$HOME/.swift/swift.properties < References: <1256576209.1135.46.camel@localhost> <1256590127.9206.6.camel@localhost> Message-ID: <20091118163625.CFT31158@m4500-02.uchicago.edu> >On Mon, 2009-10-26 at 11:56 -0500, Mihael Hategan wrote: >> So here's how one would go with this on intrepid: >> - determine the maximum number of workers (avg-exec-time * 100) >> - set the nodeGranularity to 512 nodes, 4 workers per node. Also set >> maxWorkers to 512 so that only 512 node blocks are requested. For some >> reason 512 node partitions start almost instantly (even if you have 6 of >> them) while 1024 node partitions you have to wait for. >> - set the total number of blocks ("slots" parameter) to >> no-of-workers/2048. >> - set the jobThrottle to 2*no-of-workers/100 >> - make sure you also have foreach,max.threads set to 2*no-of-workers >> (though that depends on the structure of the program). >> - run on login6. There is no point in using the normal login machines >> since they have a limit of 1024 file descriptors per process. 
>> so, am i correct in understanding that currently swift can only run on login6 when running on intrepid? i ask because i'm currently not able to get on login6, but decided to try a 512-job workflow on login3 and got this: Progress: Submitted:56 Active:456 Server died: Too many open files java.net.SocketException: Too many open files at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:457) at java.net.ServerSocket.implAccept(ServerSocket.java:473) at java.net.ServerSocket.accept(ServerSocket.java:444) at org.globus.net.BaseServer.run(BaseServer.java:226) at java.lang.Thread.run(Thread.java:810) Worker task failed: Progress: Active:511 Failed but can retry:1 Worker task failed: Failed to transfer wrapper log from test-20091118-1622-lx3lzyx5/info/r on localhost Execution failed: Failed to transfer wrapper log from test-20091118-1622-lx3lzyx5/info/t on localhost Exception in RInvoke: Arguments: [scripts/4reg_dummy.R, matrices/net1_gestspeech.cov, 440, 0.5, gestspeech, net1] Host: localhost Directory: test-20091118-1622-lx3lzyx5/jobs/r/RInvoke-rznt0njj stderr.txt: stdout.txt: here's my swift.properties file: skenny at login3.intrepid:~/swift_runs/exhaustive_sem> cat config/swift.properties sites.file=/home/skenny/cnari/config/local_sites.xml tc.file=/home/skenny/cnari/config/tc.data lazy.errors=false caching.algorithm=LRU pgraph=false pgraph.graph.options=splines="compound", rankdir="TB" pgraph.node.options=color="seagreen", style="filled" clustering.enabled=false clustering.queue.delay=4 clustering.min.time=60 kickstart.enabled=maybe kickstart.always.transfer=false wrapperlog.always.transfer=false throttle.submit=3 throttle.host.submit=8 throttle.score.job.factor=64 throttle.transfers=16 throttle.file.operations=16 sitedir.keep=true execution.retries=0 replication.enabled=false replication.min.queue.time=60 replication.limit=3 foreach.max.threads=2048 skenny at login3.intrepid:~/scratch/g/RInvoke-g8ot0njj> cat ~/cnari/config/local_sites.xml 10 512 4 512 HTCScienceApps zeptoos 3000 true 100000 /intrepid-fs0/users/skenny/scratch /home/skenny/scratch i guess i'm wondering if there's a configuration i can use that will allow me to run on other logins besides 6 (?) thnx ~sk From hategan at mcs.anl.gov Wed Nov 18 16:39:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Nov 2009 16:39:22 -0600 Subject: [Swift-devel] Swift and BGP In-Reply-To: <20091118163625.CFT31158@m4500-02.uchicago.edu> References: <1256576209.1135.46.camel@localhost> <1256590127.9206.6.camel@localhost> <20091118163625.CFT31158@m4500-02.uchicago.edu> Message-ID: <1258583962.19550.0.camel@localhost> On Wed, 2009-11-18 at 16:36 -0600, skenny at uchicago.edu wrote: > >On Mon, 2009-10-26 at 11:56 -0500, Mihael Hategan wrote: > >> So here's how one would go with this on intrepid: > >> - determine the maximum number of workers (avg-exec-time * 100) > >> - set the nodeGranularity to 512 nodes, 4 workers per node. > Also set > >> maxWorkers to 512 so that only 512 node blocks are > requested. For some > >> reason 512 node partitions start almost instantly (even if > you have 6 of > >> them) while 1024 node partitions you have to wait for. > >> - set the total number of blocks ("slots" parameter) to > >> no-of-workers/2048. > >> - set the jobThrottle to 2*no-of-workers/100 > >> - make sure you also have foreach,max.threads set to > 2*no-of-workers > >> (though that depends on the structure of the program). > >> - run on login6. 
There is no point in using the normal > login machines > >> since they have a limit of 1024 file descriptors per process. > >> > > so, am i correct in understanding that currently swift can > only run on login6 when running on intrepid? i ask because i'm > currently not able to get on login6, but decided to try a > 512-job workflow on login3 and got this: Right. The error below is precisely the reason why login6 is needed. > > > Progress: Submitted:56 Active:456 > Server died: Too many open files > java.net.SocketException: Too many open files > at From skenny at uchicago.edu Wed Nov 18 18:00:59 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 18 Nov 2009 18:00:59 -0600 (CST) Subject: [Swift-devel] Swift and BGP In-Reply-To: <1258583962.19550.0.camel@localhost> References: <1256576209.1135.46.camel@localhost> <1256590127.9206.6.camel@localhost> <20091118163625.CFT31158@m4500-02.uchicago.edu> <1258583962.19550.0.camel@localhost> Message-ID: <20091118180059.CFT45039@m4500-02.uchicago.edu> >> >> (though that depends on the structure of the program). >> >> - run on login6. There is no point in using the normal >> login machines >> >> since they have a limit of 1024 file descriptors per process. >> >> >> >> so, am i correct in understanding that currently swift can >> only run on login6 when running on intrepid? i ask because i'm >> currently not able to get on login6, but decided to try a >> 512-job workflow on login3 and got this: > >Right. The error below is precisely the reason why login6 is needed. do you think it's worth writing to support to see if they'd be willing to up the limit on the other login nodes since login6 is down? From hategan at mcs.anl.gov Wed Nov 18 19:52:59 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 18 Nov 2009 19:52:59 -0600 Subject: [Swift-devel] Swift and BGP In-Reply-To: <20091118180059.CFT45039@m4500-02.uchicago.edu> References: <1256576209.1135.46.camel@localhost> <1256590127.9206.6.camel@localhost> <20091118163625.CFT31158@m4500-02.uchicago.edu> <1258583962.19550.0.camel@localhost> <20091118180059.CFT45039@m4500-02.uchicago.edu> Message-ID: <1258595579.25228.0.camel@localhost> On Wed, 2009-11-18 at 18:00 -0600, skenny at uchicago.edu wrote: > >> >> (though that depends on the structure of the program). > >> >> - run on login6. There is no point in using the normal > >> login machines > >> >> since they have a limit of 1024 file descriptors per > process. > >> >> > >> > >> so, am i correct in understanding that currently swift can > >> only run on login6 when running on intrepid? i ask because i'm > >> currently not able to get on login6, but decided to try a > >> 512-job workflow on login3 and got this: > > > >Right. The error below is precisely the reason why login6 is > needed. > > do you think it's worth writing to support to see if they'd be > willing to up the limit on the other login nodes since login6 > is down? > > Might be. But I think it's also worth mentioning that login6 is down. From wilde at mcs.anl.gov Thu Nov 19 08:48:12 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 19 Nov 2009 08:48:12 -0600 Subject: [Swift-devel] Swift and BGP In-Reply-To: <20091118180059.CFT45039@m4500-02.uchicago.edu> References: <1256576209.1135.46.camel@localhost> <1256590127.9206.6.camel@localhost> <20091118163625.CFT31158@m4500-02.uchicago.edu> <1258583962.19550.0.camel@localhost> <20091118180059.CFT45039@m4500-02.uchicago.edu> Message-ID: <4B055AAC.203@mcs.anl.gov> Good idea - I'll discuss it with them. 
We should also look at running Swift on other machines line communicado and ssh'ing to the BG/P, so that large workflows dont impact the login hosts. Mihael, in that mode, does the fd limit still impact us? - Mike On 11/18/09 6:00 PM, skenny at uchicago.edu wrote: >>>>> (though that depends on the structure of the program). >>>>> - run on login6. There is no point in using the normal >>> login machines >>>>> since they have a limit of 1024 file descriptors per > process. >>> so, am i correct in understanding that currently swift can >>> only run on login6 when running on intrepid? i ask because i'm >>> currently not able to get on login6, but decided to try a >>> 512-job workflow on login3 and got this: >> Right. The error below is precisely the reason why login6 is > needed. > > do you think it's worth writing to support to see if they'd be > willing to up the limit on the other login nodes since login6 > is down? > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From bugzilla-daemon at mcs.anl.gov Thu Nov 19 14:34:59 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 19 Nov 2009 14:34:59 -0600 (CST) Subject: [Swift-devel] [Bug 222] New: file corruption using coasters filesystem provider Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=222 Summary: file corruption using coasters filesystem provider Product: Swift Version: unspecified Platform: PC OS/Version: Windows Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: skenny at uchicago.edu when using coasters as the filesystem provider we've had several instances of the input files being corrupted when they reach the remote site. this causes the remote application to fail when trying to read the file. previously we were seeing this with large output files and i believe it was discovered to be a buffering issue (and i think mihael fixed it). so far i haven't seen it again on the output (though the output for recent workflows has been small). it has happened with input files of the following size: 342K 240M 77M -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From skenny at uchicago.edu Thu Nov 19 14:59:46 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Thu, 19 Nov 2009 14:59:46 -0600 (CST) Subject: [Swift-devel] more swift and bgp plots In-Reply-To: <1257277884.14655.0.camel@localhost> References: <1257195064.24600.1.camel@localhost> <20091103134008.CEX28656@m4500-02.uchicago.edu> <1257277884.14655.0.camel@localhost> Message-ID: <20091119145946.CFU86659@m4500-02.uchicago.edu> could you also post your swift.properties file? thnx ~sk ---- Original message ---- >Date: Tue, 03 Nov 2009 13:51:24 -0600 >From: Mihael Hategan >Subject: Re: [Swift-devel] more swift and bgp plots >To: skenny at uchicago.edu >Cc: swift-devel at ci.uchicago.edu > > > > > 10 > 512 > 4 > 512 > HTCScienceApps > zeptoos > 3000 > true > 100000 > > /intrepid-fs0/users/hategan/scratch > /scratch > > > >On Tue, 2009-11-03 at 13:40 -0600, skenny at uchicago.edu wrote: >> can you post the sites file(s) used for your runs on bgp? 
>> >> thnx >> ~sk >> >> ---- Original message ---- >> >Date: Mon, 02 Nov 2009 14:51:04 -0600 >> >From: Mihael Hategan >> >Subject: [Swift-devel] more swift and bgp plots >> >To: swift-devel at ci.uchicago.edu >> > >> >http://www.mcs.anl.gov/~hategan/report-8k-60s/ >> >http://www.mcs.anl.gov/~hategan/report-8k-120s/ >> >http://www.mcs.anl.gov/~hategan/report-12k-120s/ >> >http://www.mcs.anl.gov/~hategan/report-20k-300s/ >> > >> > >> >_______________________________________________ >> >Swift-devel mailing list >> >Swift-devel at ci.uchicago.edu >> >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From skenny at uchicago.edu Fri Nov 20 13:01:13 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Fri, 20 Nov 2009 13:01:13 -0600 (CST) Subject: [Swift-devel] process hogging memory on ranger login Message-ID: <20091120130113.CFW23623@m4500-02.uchicago.edu> so, using the latest swift (swift-r3116 cog-r2482) i submitted a 1,179,647-job workflow to ranger...got this far: Progress: Submitted:16378 Active:1 Finished successfully:412482 was in that state for a while (~24hrs) showing this in the queue: 1143906 data tg457040 Waiting 2944 01:10:00 Thu Nov 19 07:22:48 which seemed relatively normal (a job requesting that many cores on ranger can sometimes be in the queue for quite some time). however, i then got an email from tacc: I have noticed that you have a java process running on Ranger (login3). The process is consuming quite some resources. Since it is running already for over a day, I assume that this is a run-away process that is left over from something you did earlier. so, i killed the process on the login, but i wasn't sure what to make of it...the job remains in the queue...i wonder if it's ok to try a resume on this workflow or if that might spawn another such process on the login (?) From skenny at uchicago.edu Fri Nov 20 15:10:34 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Fri, 20 Nov 2009 15:10:34 -0600 (CST) Subject: [Swift-devel] process hogging memory on ranger login In-Reply-To: <20091120130113.CFW23623@m4500-02.uchicago.edu> References: <20091120130113.CFW23623@m4500-02.uchicago.edu> Message-ID: <20091120151034.CFW44967@m4500-02.uchicago.edu> >so, using the latest swift (swift-r3116 cog-r2482) i submitted >a 1,179,647-job workflow to ranger sorry, my mistake, i just realized that is NOT the latest cog/swift (despite having been recently compiled)...i'll try a resume with the latest code and see if the problem persists... >Progress: Submitted:16378 Active:1 Finished successfully:412482 > >was in that state for a while (~24hrs) showing this in the queue: > >1143906 data tg457040 Waiting 2944 01:10:00 >Thu Nov 19 07:22:48 > >which seemed relatively normal (a job requesting that many >cores on ranger can sometimes be in the queue for quite some >time). however, i then got an email from tacc: > >I have noticed that you have a java process running on Ranger >(login3). The process is consuming quite some resources. Since >it is running already for over a day, I assume that this is a >run-away process that is left over from something you did >earlier. > >so, i killed the process on the login, but i wasn't sure what >to make of it...the job remains in the queue...i wonder if >it's ok to try a resume on this workflow or if that might >spawn another such process on the login (?) 
>_______________________________________________ >Swift-devel mailing list >Swift-devel at ci.uchicago.edu >http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Sat Nov 21 13:22:13 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 21 Nov 2009 13:22:13 -0600 Subject: [Swift-devel] process hogging memory on ranger login In-Reply-To: <20091120130113.CFW23623@m4500-02.uchicago.edu> References: <20091120130113.CFW23623@m4500-02.uchicago.edu> Message-ID: <1258831333.6242.6.camel@localhost> On Fri, 2009-11-20 at 13:01 -0600, skenny at uchicago.edu wrote: > so, using the latest swift (swift-r3116 cog-r2482) i submitted > a 1,179,647-job workflow to ranger...got this far: > > Progress: Submitted:16378 Active:1 Finished successfully:412482 > > was in that state for a while (~24hrs) showing this in the queue: > > 1143906 data tg457040 Waiting 2944 01:10:00 > Thu Nov 19 07:22:48 Logs would help. > > which seemed relatively normal (a job requesting that many > cores on ranger can sometimes be in the queue for quite some > time). however, i then got an email from tacc: > > I have noticed that you have a java process running on Ranger > (login3). The process is consuming quite some resources. Please ask them to be more specific. If it was consuming virtual memory, then that's a non-issue. If it was consuming lots of CPU than that's a problem. > Since > it is running already for over a day, I assume that this is a > run-away process that is left over from something you did > earlier. > > so, i killed the process on the login, but i wasn't sure what > to make of it...the job remains in the queue...i wonder if > it's ok to try a resume on this workflow or if that might > spawn another such process on the login (?) It will spawn another such process on the login. That's the coaster service. From skenny at uchicago.edu Mon Nov 23 19:01:30 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 23 Nov 2009 19:01:30 -0600 (CST) Subject: [Swift-devel] process hogging memory on ranger login Message-ID: <20091123190130.CFZ45811@m4500-02.uchicago.edu> >On Fri, 2009-11-20 at 13:01 -0600, skenny at uchicago.edu wrote: >> so, using the latest swift (swift-r3116 cog-r2482) i submitted >> a 1,179,647-job workflow to ranger...got this far: >> >> Progress: Submitted:16378 Active:1 Finished successfully:412482 >> >> was in that state for a while (~24hrs) showing this in the queue: >> >> 1143906 data tg457040 Waiting 2944 01:10:00 >> Thu Nov 19 07:22:48 > >Logs would help. so, i was trying to re-run this workflow with the latest swift (swift-r3191 cog-r2620) to try and replicate the error. however, a new error has surfaced...the environment, as specified in my tc.data file, is no-longer being set by swift on the remote end. is it possible this is due to recent changes in swift? i am running the same workflow, same tc & sites files with the newer swift and am getting errors (from the app) due to my LD_LIBRARY_PATH not being set. if i switch back to swift-r3116 cog-r2482, the error goes away. 
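For reference, a sketch of the kind of tc.data entry being discussed, assuming the usual tc.data columns (site, transformation, path, status, platform, profiles) and an env:: profile namespace for remote environment variables; the paths and platform string are placeholders, not the actual entry:

  #site   transformation  path              status     platform        profiles
  ranger  RInvoke         /path/to/RInvoke  INSTALLED  INTEL64::LINUX  env::LD_LIBRARY_PATH=/path/to/R/lib

The symptom reported above is the app on ranger seeing an empty LD_LIBRARY_PATH even though the profiles column carries it.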
From hategan at mcs.anl.gov Mon Nov 23 19:25:55 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 23 Nov 2009 19:25:55 -0600 Subject: [Swift-devel] process hogging memory on ranger login In-Reply-To: <20091123190130.CFZ45811@m4500-02.uchicago.edu> References: <20091123190130.CFZ45811@m4500-02.uchicago.edu> Message-ID: <1259025955.27366.0.camel@localhost> On Mon, 2009-11-23 at 19:01 -0600, skenny at uchicago.edu wrote: > >On Fri, 2009-11-20 at 13:01 -0600, skenny at uchicago.edu wrote: > >> so, using the latest swift (swift-r3116 cog-r2482) i submitted > >> a 1,179,647-job workflow to ranger...got this far: > >> > >> Progress: Submitted:16378 Active:1 Finished > successfully:412482 > >> > >> was in that state for a while (~24hrs) showing this in the > queue: > >> > >> 1143906 data tg457040 Waiting 2944 01:10:00 > >> Thu Nov 19 07:22:48 > > > >Logs would help. > > so, i was trying to re-run this workflow with the latest swift > (swift-r3191 cog-r2620) to try and replicate the error. > however, a new error has surfaced...the environment, as > specified in my tc.data file, is no-longer being set by swift > on the remote end. is it possible this is due to recent > changes in swift? i am running the same workflow, same tc & > sites files with the newer swift and am getting errors (from > the app) due to my LD_LIBRARY_PATH not being set. if i switch > back to swift-r3116 cog-r2482, the error goes away. Yes. That is possible. I will check. Can you post your sites file? From skenny at uchicago.edu Mon Nov 23 19:53:06 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Mon, 23 Nov 2009 19:53:06 -0600 (CST) Subject: [Swift-devel] process hogging memory on ranger login In-Reply-To: <1259025955.27366.0.camel@localhost> References: <20091123190130.CFZ45811@m4500-02.uchicago.edu> <1259025955.27366.0.camel@localhost> Message-ID: <20091123195306.CFZ49743@m4500-02.uchicago.edu> >On Mon, 2009-11-23 at 19:01 -0600, skenny at uchicago.edu wrote: >> >On Fri, 2009-11-20 at 13:01 -0600, skenny at uchicago.edu wrote: >> >> so, using the latest swift (swift-r3116 cog-r2482) i submitted >> >> a 1,179,647-job workflow to ranger...got this far: >> >> >> >> Progress: Submitted:16378 Active:1 Finished >> successfully:412482 >> >> >> >> was in that state for a while (~24hrs) showing this in the >> queue: >> >> >> >> 1143906 data tg457040 Waiting 2944 01:10:00 >> >> Thu Nov 19 07:22:48 >> > >> >Logs would help. >> >> so, i was trying to re-run this workflow with the latest swift >> (swift-r3191 cog-r2620) to try and replicate the error. >> however, a new error has surfaced...the environment, as >> specified in my tc.data file, is no-longer being set by swift >> on the remote end. is it possible this is due to recent >> changes in swift? i am running the same workflow, same tc & >> sites files with the newer swift and am getting errors (from >> the app) due to my LD_LIBRARY_PATH not being set. if i switch >> back to swift-r3116 cog-r2482, the error goes away. > >Yes. That is possible. > >I will check. > >Can you post your sites file? 
sure, i tried on both ranger and mercury: 1000.0 normal 32 1 16 256 72000 TG-DBS080004N /work/00043/tg457040/sidgrid_out/{username} 1000.0 2 1 2 200 72000 TG-DBS080005N /usr/projects/tg-community/SIDGrid/sidgrid_out/{username} From hategan at mcs.anl.gov Tue Nov 24 16:14:37 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 24 Nov 2009 16:14:37 -0600 Subject: [Swift-devel] process hogging memory on ranger login In-Reply-To: <20091123190130.CFZ45811@m4500-02.uchicago.edu> References: <20091123190130.CFZ45811@m4500-02.uchicago.edu> Message-ID: <1259100877.5951.0.camel@localhost> On Mon, 2009-11-23 at 19:01 -0600, skenny at uchicago.edu wrote: > so, i was trying to re-run this workflow with the latest swift > (swift-r3191 cog-r2620) to try and replicate the error. > however, a new error has surfaced...the environment, as > specified in my tc.data file, is no-longer being set by swift > on the remote end. is it possible this is due to recent > changes in swift? i am running the same workflow, same tc & > sites files with the newer swift and am getting errors (from > the app) due to my LD_LIBRARY_PATH not being set. if i switch > back to swift-r3116 cog-r2482, the error goes away. > > This was a bug introduced earlier during some changes to how the profile stuff was handled. Should be fixed in swift r3192. From skenny at uchicago.edu Wed Nov 25 11:33:08 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 25 Nov 2009 11:33:08 -0600 (CST) Subject: [Swift-devel] process hogging memory on ranger login In-Reply-To: <1259100877.5951.0.camel@localhost> References: <20091123190130.CFZ45811@m4500-02.uchicago.edu> <1259100877.5951.0.camel@localhost> Message-ID: <20091125113308.CGB97214@m4500-02.uchicago.edu> >> so, i was trying to re-run this workflow with the latest swift >> (swift-r3191 cog-r2620) to try and replicate the error. >> however, a new error has surfaced...the environment, as >> specified in my tc.data file, is no-longer being set by swift >> on the remote end. is it possible this is due to recent >> changes in swift? i am running the same workflow, same tc & >> sites files with the newer swift and am getting errors (from >> the app) due to my LD_LIBRARY_PATH not being set. if i switch >> back to swift-r3116 cog-r2482, the error goes away. >> >> > >This was a bug introduced earlier during some changes to how the profile >stuff was handled. Should be fixed in swift r3192. cool, this seems to be fixed, thanks mihael! now i'm able to replicate the error with the latest swift. so, my (~1 million job) workflow, submitted to ranger hangs in this state: Progress: Submitted:16383 Finished successfully:55681 Progress: Submitted:16383 Finished successfully:55681 on ranger i have nothing in the queue. but i am showing a process still running on login3: 8825 tg457040 28 12 472m 232m 5660 S 15.8 0.7 130:56.41 java i am showing some errors in the stderr.txt of the jobs that were running (they access our database which apparently went down at some point). however, it seems troubling that when the app fails that coaster job is still running on the remote site and the workflow hangs w/o reporting anything... the log is too large to attach, but is here on ci: /ci/projects/cnari/logs/skenny/importDTI-20091124-1655-agj0mze1.log let me know if you need the coaster log as well. 
~sk From hategan at mcs.anl.gov Wed Nov 25 11:38:09 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Nov 2009 11:38:09 -0600 Subject: [Swift-devel] process hogging memory on ranger login In-Reply-To: <20091125113308.CGB97214@m4500-02.uchicago.edu> References: <20091123190130.CFZ45811@m4500-02.uchicago.edu> <1259100877.5951.0.camel@localhost> <20091125113308.CGB97214@m4500-02.uchicago.edu> Message-ID: <1259170689.18426.2.camel@localhost> On Wed, 2009-11-25 at 11:33 -0600, skenny at uchicago.edu wrote: > so, my (~1 million job) workflow, submitted to ranger hangs in > this state: > > Progress: Submitted:16383 Finished successfully:55681 > Progress: Submitted:16383 Finished successfully:55681 > > on ranger i have nothing in the queue. but i am showing a > process still running on login3: > > 8825 tg457040 28 12 472m 232m 5660 S 15.8 0.7 130:56.41 java > > i am showing some errors in the stderr.txt of the jobs that > were running (they access our database which apparently went > down at some point). however, it seems troubling that when the > app fails that coaster job is still running on the remote site > and the workflow hangs w/o reporting anything... > > the log is too large to attach, but is here on ci: > > /ci/projects/cnari/logs/skenny/importDTI-20091124-1655-agj0mze1.log > > let me know if you need the coaster log as well. That may occur at times, such as when the service runs out of memory. So yes, I do need the coaster log. Regardless of the exact reason, I think that there needs to be extra logic in there to ensure liveness. In other words a lost state should not be interpreted as "still in last state" but "failure". From skenny at uchicago.edu Wed Nov 25 11:50:25 2009 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Wed, 25 Nov 2009 11:50:25 -0600 (CST) Subject: [Swift-devel] process hogging memory on ranger login In-Reply-To: <1259170689.18426.2.camel@localhost> References: <20091123190130.CFZ45811@m4500-02.uchicago.edu> <1259100877.5951.0.camel@localhost> <20091125113308.CGB97214@m4500-02.uchicago.edu> <1259170689.18426.2.camel@localhost> Message-ID: <20091125115025.CGB99648@m4500-02.uchicago.edu> >> so, my (~1 million job) workflow, submitted to ranger hangs in >> this state: >> >> Progress: Submitted:16383 Finished successfully:55681 >> Progress: Submitted:16383 Finished successfully:55681 >> >> on ranger i have nothing in the queue. but i am showing a >> process still running on login3: >> >> 8825 tg457040 28 12 472m 232m 5660 S 15.8 0.7 130:56.41 java >> >> i am showing some errors in the stderr.txt of the jobs that >> were running (they access our database which apparently went >> down at some point). however, it seems troubling that when the >> app fails that coaster job is still running on the remote site >> and the workflow hangs w/o reporting anything... >> >> the log is too large to attach, but is here on ci: >> >> /ci/projects/cnari/logs/skenny/importDTI-20091124-1655-agj0mze1.log >> >> let me know if you need the coaster log as well. > >That may occur at times, such as when the service runs out of memory. So >yes, I do need the coaster log. > >Regardless of the exact reason, I think that there needs to be extra >logic in there to ensure liveness. In other words a lost state should >not be interpreted as "still in last state" but "failure". 
ok, just added the coasters.log to that same dir: /ci/projects/cnari/logs/skenny/coasters.log From hategan at mcs.anl.gov Wed Nov 25 13:24:52 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 25 Nov 2009 13:24:52 -0600 Subject: [Swift-devel] process hogging memory on ranger login In-Reply-To: <20091125115025.CGB99648@m4500-02.uchicago.edu> References: <20091123190130.CFZ45811@m4500-02.uchicago.edu> <1259100877.5951.0.camel@localhost> <20091125113308.CGB97214@m4500-02.uchicago.edu> <1259170689.18426.2.camel@localhost> <20091125115025.CGB99648@m4500-02.uchicago.edu> Message-ID: <1259177092.22405.1.camel@localhost> On Wed, 2009-11-25 at 11:50 -0600, skenny at uchicago.edu wrote: > >> > >> > /ci/projects/cnari/logs/skenny/importDTI-20091124-1655-agj0mze1.log > >> > >> let me know if you need the coaster log as well. > > > >That may occur at times, such as when the service runs out of > memory. So > >yes, I do need the coaster log. > > > >Regardless of the exact reason, I think that there needs to > be extra > >logic in there to ensure liveness. In other words a lost > state should > >not be interpreted as "still in last state" but "failure". > > ok, just added the coasters.log to that same dir: > > /ci/projects/cnari/logs/skenny/coasters.log Looks like the jobs go way over their walltime and that causes badness. I'll investigate. In the mean time, you could try to figure out why the jobs take much longer than expected or adjust the expectation (walltime) accordingly. From hategan at mcs.anl.gov Mon Nov 30 14:55:02 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 30 Nov 2009 14:55:02 -0600 Subject: [Swift-devel] code branch Message-ID: <1259614502.26099.26.camel@localhost> Hello, I branched the cog and swift codes. This was done in order to meet both the needs of users who use Swift on a regular basis as well as our needs to commit "researchy" code that may not be as stable. I added a note on the downloads page (http://www.ci.uchicago.edu/swift/downloads/index.php) which contains information on how to access the stable branch(es). Here's the short version: https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.7/src/cog https://svn.ci.uchicago.edu/svn/vdl2/branches/1.0 swift The development code continues to be available at the previous locations in the repositories. Mihael From iraicu at cs.uchicago.edu Mon Nov 30 17:17:43 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 30 Nov 2009 17:17:43 -0600 Subject: [Swift-devel] CFP: IEEE Transactions on Parallel and Distributed Systems, Special Issue on Many-Task Computing on Grids and Supercomputers Message-ID: <4B145297.6050400@cs.uchicago.edu> Call for Papers --------------------------------------------------------------------------------------- IEEE Transactions on Parallel and Distributed Systems Special Issue on Many-Task Computing on Grids and Supercomputers http://dsl.cs.uchicago.edu/TPDS_MTC/ ======================================================================================= The Special Issue on Many-Task Computing (MTC) will provide the scientific community a dedicated forum, within the prestigious IEEE Transactions on Parallel and Distributed Systems Journal, for presenting new research, development, and deployment efforts of loosely coupled large scale applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. 
MTC, the focus of the special issue, encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This special issue will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of the raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions on all topics related to MTC on large scale systems. For more information on this special issue, please see http://dsl.cs.uchicago.edu/TPDS_MTC/. Scope --------------------------------------------------------------------------------------- This special issue will focus on the ability to manage and execute large scale applications on today's largest clusters, Grids, and Supercomputers. Clusters with tens of thousands of processor cores are readily available, Grids (i.e. TeraGrid) with a dozen sites and 100K+ processors, and supercomputers with up to 200K processors (i.e. IBM BlueGene/L and BlueGene/P, Cray XT5, Sun Constellation), are all now available to the broader scientific community for open science research. Large clusters and supercomputers have traditionally been high performance computing (HPC) systems, as they are efficient at executing tightly coupled parallel jobs within a particular machine with low-latency interconnects; the applications typically use message passing interface (MPI) to achieve the needed inter-process communication. On the other hand, Grids have been the preferred platform for more loosely coupled applications that tend to be managed and executed through workflow systems, commonly known to fit in the high-throughput computing (HTC) paradigm. Many-task computing (MTC) aims to bridge the gap between two computing paradigms, HTC and HPC. MTC is reminiscent to HTC, but it differs in the emphasis of using many computing resources over short periods of time to accomplish many computational tasks (i.e. including both dependent and independent tasks), where the primary metrics are measured in seconds (e.g. FLOPS, tasks/s, MB/s I/O rates), as opposed to operations (e.g. jobs) per month. MTC denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. MTC includes loosely coupled applications that are generally communication-intensive but not naturally expressed using standard message passing interface commonly found in HPC, drawing attention to the many computations that are heterogeneous but not "happily" parallel. There is more to HPC than tightly coupled MPI, and more to HTC than embarrassingly parallel long running jobs. Like HPC applications, and science itself, applications are becoming increasingly complex opening new doors for many opportunities to apply HPC in new ways if we broaden our perspective. Some applications have just so many simple tasks that managing them is hard. Applications that operate on or produce large amounts of data need sophisticated data management in order to scale. 
There exist applications that involve many tasks, each composed of tightly coupled MPI tasks. Loosely coupled applications often have dependencies among tasks, and typically use files for inter-process communication. Efficient support for these sorts of applications on existing large scale systems will involve substantial technical challenges and will have big impact on science. Today's existing HPC systems are a viable platform to host MTC applications. However, some challenges arise in large scale applications when run on large scale systems, which can hamper the efficiency and utilization of these large scale systems. These challenges vary from local resource manager scalability and granularity, efficient utilization of the raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, application scalability, and understanding the limitations of the HPC systems in order to identify good candidate MTC applications. Furthermore, the MTC paradigm can be naturally applied to the emerging Cloud Computing paradigm due to its loosely coupled nature, which is being adopted by industry as the next wave of technological advancement to reduce operational costs while improving efficiencies in large scale infrastructures. For an interesting discussion in a blog by Ian Foster on the difference between MTC and HTC, please see his blog at http://ianfoster.typepad.com/blog/2008/07/many-tasks-comp.html. The proposed editors also published several papers highly relevant to this special issue. One paper is titled "Toward Loosely Coupled Programming on Petascale Systems", and was published in IEEE/ACM Supercomputing 2008 (SC08) Conference; the second paper is titled "Many-Task Computing for Grids and Supercomputers", which was published in the IEEE Workshop on Many-Task Computing on Grids and Supercomputers 2008 (MTAGS08). To see last year's workshop program agenda, and accepted papers and presentations, please see http://dsl.cs.uchicago.edu/MTAGS08/. To see this year's workshop web site, see http://dsl.cs.uchicago.edu/MTAGS09/. 
Topics --------------------------------------------------------------------------------------- Topics of interest include, but are not limited to: * Compute Resource Management in large scale clusters, large Grids, Supercomputers, or Cloud Computing infrastructure o Scheduling o Job execution frameworks o Local resource manager extensions o Performance evaluation of resource managers in use on large scale systems o Challenges and opportunities in running many-task workloads on HPC systems o Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Data Management in large scale Grid and Supercomputer environments: o Data-Aware Scheduling o Parallel File System performance and scalability in large deployments o Distributed file systems o Data caching frameworks and techniques * Large-Scale Workflow Systems o Workflow system performance and scalability analysis o Scalability of workflow systems o Workflow infrastructure and e-Science middleware o Programming Paradigms and Models * Large-Scale Many-Task Applications o Large-scale many-task applications o Large-scale many-task data-intensive applications o Large-scale high throughput computing (HTC) applications o Quasi-supercomputing applications, deployments, and experiences Paper Submission and Publication --------------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 14 pages of double column text using single spaced 9.5 point size on 8.5 x 11 inch pages and 0.5 inch margins (http://www2.computer.org/portal/c/document_library/get_file?uuid=02e1509b-5526-4658-afb2-fe8b35044552&groupId=525767). Papers will be peer-reviewed, and accepted papers will be published in the IEEE digital library. Submitted articles must not have been previously published or currently submitted for journal publication elsewhere. As an author, you are responsible for understanding and adhering to our submission guidelines. You can access them by clicking on the following web link: http://www.computer.org/mc/tpds/author.htm. Please thoroughly read these before submitting your manuscript. Please submit your paper to Manuscript Central at http://cs-ieee.manuscriptcentral.com/. Please feel free to contact the Peer Review Publications Coordinator, Annissia Bryant at tpds at computer.org or the guest editors at foster at anl.gov, iraicu at cs.uchicago.edu, or yozha at microsoft.com if you have any questions. For more information on this special issue, please see http://dsl.cs.uchicago.edu/TPDS_MTC/. Important Dates --------------------------------------------------------------------------------------- * Abstract Due: December 14th, 2009 * Papers Due: December 21st, 2009 * First Round Decisions: February 22nd, 2010 * Major Revisions if needed: April 19th, 2010 * Second Round Decisions: May 24th, 2010 * Minor Revisions if needed: June 7th, 2010 * Final Decision: June 21st, 2010 * Publication Date: November, 2010 Guest Editors and Potential Reviewers --------------------------------------------------------------------------------------- Special Issue Guest Editors * Ian Foster, University of Chicago & Argonne National Laboratory * Ioan Raicu, Northwestern University * Yong Zhao, Microsoft -- ================================================================= Ioan Raicu, Ph.D. 
NSF/CRA Computing Innovation Fellow ================================================================= Center for Ultra-scale Computing and Information Security (CUCIS) Department of Electrical Engineering and Computer Science Northwestern University 2145 Sheridan Rd, Tech M384 Evanston, IL 60208-3118 ================================================================= Cel: 1-847-722-0876 Tel: 1-847-491-8163 Email: iraicu at eecs.northwestern.edu Web: http://www.eecs.northwestern.edu/~iraicu/ https://wiki.cucis.eecs.northwestern.edu/ ================================================================= =================================================================