From hategan at mcs.anl.gov Sun Jan 2 16:32:41 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 02 Jan 2011 16:32:41 -0600 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: References: <1293646515.31270.0.camel@blabla2.none> Message-ID: <1294007561.2942.0.camel@blabla2.none> Done. On Wed, 2010-12-29 at 10:16 -0800, Sarah Kenny wrote: > ah, ok, no problem. > > On Wed, Dec 29, 2010 at 10:15 AM, Mihael Hategan > wrote: > I have yet to merge the stable branch to trunk. This may > involve some > manual work and it might take a while. > > Mihael > > > On Wed, 2010-12-29 at 10:11 -0800, Sarah Kenny wrote: > > hey all, i was planning to branch the current trunk tomorrow > so it can > > be stabilized for release .95 unless anyone thinks there's a > reason to > > hold off on this (?) > > > > ~sk > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From wilde at mcs.anl.gov Sun Jan 2 16:46:55 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 2 Jan 2011 16:46:55 -0600 (CST) Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: <1294007561.2942.0.camel@blabla2.none> Message-ID: <1336793907.16984.1294008415940.JavaMail.root@zimbra.anl.gov> Happy New Year, All! This sounds great, but wasn't the plan to call the current stable branch 0.91 and the current trunk 0.92? - Mike ----- Original Message ----- > Done. > > On Wed, 2010-12-29 at 10:16 -0800, Sarah Kenny wrote: > > ah, ok, no problem. > > > > On Wed, Dec 29, 2010 at 10:15 AM, Mihael Hategan > > > > wrote: > > I have yet to merge the stable branch to trunk. This may > > involve some > > manual work and it might take a while. > > > > Mihael > > > > > > On Wed, 2010-12-29 at 10:11 -0800, Sarah Kenny wrote: > > > hey all, i was planning to branch the current trunk > > > tomorrow > > so it can > > > be stabilized for release .95 unless anyone thinks there's > > > a > > reason to > > > hold off on this (?) > > > > > > ~sk > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Jan 2 18:09:02 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 02 Jan 2011 18:09:02 -0600 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: <1336793907.16984.1294008415940.JavaMail.root@zimbra.anl.gov> References: <1336793907.16984.1294008415940.JavaMail.root@zimbra.anl.gov> Message-ID: <1294013342.6271.1.camel@blabla2.none> On Sun, 2011-01-02 at 16:46 -0600, Michael Wilde wrote: > Happy New Year, All! And to you, too! > > This sounds great, but wasn't the plan to call the current stable branch 0.91 and the current trunk 0.92? Irrespective of that, bug fixes from the branch should be merged to trunk. And better to do so before trunk is branched into another release branch. 
Mihael From wilde at mcs.anl.gov Sun Jan 2 19:20:02 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 2 Jan 2011 19:20:02 -0600 (CST) Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: <1294013342.6271.1.camel@blabla2.none> Message-ID: <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> Indeed - that was the "great" part :) I was just asking so we that we get the release number right when we create the release branch. - Mike ----- Original Message ----- > On Sun, 2011-01-02 at 16:46 -0600, Michael Wilde wrote: > > Happy New Year, All! > > And to you, too! > > > > This sounds great, but wasn't the plan to call the current stable > > branch 0.91 and the current trunk 0.92? > > Irrespective of that, bug fixes from the branch should be merged to > trunk. And better to do so before trunk is branched into another > release > branch. > > Mihael -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From skenny at uchicago.edu Sun Jan 2 22:23:04 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Sun, 2 Jan 2011 20:23:04 -0800 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> Message-ID: sorry, forgot where we landed on the naming...as long as it's somewhere btwn .91 and 1.0 we should be all right :) but yeah, i can branch the release as .92. however, i just checked out trunk and am getting some errors compiling: compile: [echo] [provider-coaster]: COMPILE [javac] Compiling 124 source files to /home/skenny/builds/cog/modules/provider-coaster/build [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: illegal start of type [javac] if (shutdown) { [javac] ^ [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected [javac] if (shutdown) { [javac] ^ [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: not a statement [javac] if (shutdown) { [javac] ^ [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected [javac] if (shutdown) { [javac] ^ [javac] 4 errors p.s. happy new year to you too! On Sun, Jan 2, 2011 at 5:20 PM, Michael Wilde wrote: > Indeed - that was the "great" part :) > I was just asking so we that we get the release number right when we create > the release branch. > > - Mike > > > ----- Original Message ----- > > On Sun, 2011-01-02 at 16:46 -0600, Michael Wilde wrote: > > > Happy New Year, All! > > > > And to you, too! > > > > > > This sounds great, but wasn't the plan to call the current stable > > > branch 0.91 and the current trunk 0.92? > > > > Irrespective of that, bug fixes from the branch should be merged to > > trunk. And better to do so before trunk is branched into another > > release > > branch. > > > > Mihael > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Mon Jan 3 04:35:07 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 03 Jan 2011 04:35:07 -0600 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> Message-ID: <1294050907.9079.0.camel@blabla2.none> I'll fix that tomorrow. Actually that's later today. Mihael On Sun, 2011-01-02 at 20:23 -0800, Sarah Kenny wrote: > sorry, forgot where we landed on the naming...as long as it's > somewhere btwn .91 and 1.0 we should be all right :) but yeah, i can > branch the release as .92. however, i just checked out trunk and am > getting some errors compiling: > > compile: > [echo] [provider-coaster]: COMPILE > [javac] Compiling 124 source files > to /home/skenny/builds/cog/modules/provider-coaster/build > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: illegal start of type > [javac] if (shutdown) { > [javac] ^ > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > [javac] if (shutdown) { > [javac] ^ > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: not a statement > [javac] if (shutdown) { > [javac] ^ > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > [javac] if (shutdown) { > [javac] ^ > [javac] 4 errors > > > p.s. happy new year to you too! > > > > On Sun, Jan 2, 2011 at 5:20 PM, Michael Wilde > wrote: > Indeed - that was the "great" part :) > I was just asking so we that we get the release number right > when we create the release branch. > > - Mike > > > > ----- Original Message ----- > > On Sun, 2011-01-02 at 16:46 -0600, Michael Wilde wrote: > > > Happy New Year, All! > > > > And to you, too! > > > > > > This sounds great, but wasn't the plan to call the current > stable > > > branch 0.91 and the current trunk 0.92? > > > > Irrespective of that, bug fixes from the branch should be > merged to > > trunk. And better to do so before trunk is branched into > another > > release > > branch. > > > > Mihael > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > From wozniak at mcs.anl.gov Mon Jan 3 09:38:06 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 3 Jan 2011 09:38:06 -0600 (Central Standard Time) Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> Message-ID: On Sun, 2 Jan 2011, Sarah Kenny wrote: > sorry, forgot where we landed on the naming...as long as it's somewhere btwn > .91 and 1.0 we should be all right :) but yeah, i can branch the release as > .92. however, i just checked out trunk and am getting some errors compiling: The notes from our discussions are at the link below- let's try to keep that up to date on what we're actually doing... http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans Happy New Year! 
-- Justin M Wozniak From wozniak at mcs.anl.gov Mon Jan 3 09:43:13 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 3 Jan 2011 09:43:13 -0600 (Central Standard Time) Subject: [Swift-devel] [Swift-commit] r3835 - trunk/src/org/griphyn/vdl/karajan/functions (fwd) Message-ID: This looks right for end users, but for development it really helps to get a full stack trace. Is it possible to turn on the full trace with a runtime option? Justin -- Justin M Wozniak ---------- Forwarded message ---------- Date: Fri, 31 Dec 2010 17:39:13 From: noreply at svn.ci.uchicago.edu To: swift-commit at ci.uchicago.edu Subject: [Swift-commit] r3835 - trunk/src/org/griphyn/vdl/karajan/functions Author: skenny Date: 2010-12-31 17:39:12 -0600 (Fri, 31 Dec 2010) New Revision: 3835 Modified: trunk/src/org/griphyn/vdl/karajan/functions/ProcessBulkErrors.java Log: don't show java exceptions, only the swift error Modified: trunk/src/org/griphyn/vdl/karajan/functions/ProcessBulkErrors.java =================================================================== --- trunk/src/org/griphyn/vdl/karajan/functions/ProcessBulkErrors.java 2010-12-30 21:53:08 UTC (rev 3834) +++ trunk/src/org/griphyn/vdl/karajan/functions/ProcessBulkErrors.java 2010-12-31 23:39:12 UTC (rev 3835) @@ -90,31 +90,39 @@ } public static String getMessageChain(Throwable e) { + Throwable orig = e; StringBuffer sb = new StringBuffer(); String prev = null; - boolean first = true; + String lastmsg = null; + boolean first = true; while (e != null) { String msg; if (e instanceof NullPointerException || e instanceof ClassCastException) { CharArrayWriter caw = new CharArrayWriter(); e.printStackTrace(new PrintWriter(caw)); msg = caw.toString(); + } else { msg = e.getMessage(); + if(msg != null){ + lastmsg = msg; + } + } if (msg != null && (prev == null || prev.indexOf(msg) == -1)) { - if (!first) { - sb.append("\nCaused by:\n\t"); - } - else { - first = false; - } - sb.append(msg); - prev = msg; + if (!first) { + sb.append("\nCaused by:\n\t"); + } + else { + first = false; + } + sb.append(msg); + lastmsg = msg; + prev = msg; } - e = e.getCause(); + e = e.getCause(); } - return sb.toString(); + return lastmsg; } } _______________________________________________ Swift-commit mailing list Swift-commit at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-commit From skenny at uchicago.edu Mon Jan 3 10:26:31 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Mon, 3 Jan 2011 08:26:31 -0800 Subject: [Swift-devel] [Swift-commit] r3835 - trunk/src/org/griphyn/vdl/karajan/functions (fwd) In-Reply-To: References: Message-ID: sure, you want to re-open the ticket? On Mon, Jan 3, 2011 at 7:43 AM, Justin M Wozniak wrote: > > This looks right for end users, but for development it really helps to get > a full stack trace. Is it possible to turn on the full trace with a runtime > option? 
> Justin > > -- > Justin M Wozniak > > ---------- Forwarded message ---------- > Date: Fri, 31 Dec 2010 17:39:13 > From: noreply at svn.ci.uchicago.edu > To: swift-commit at ci.uchicago.edu > Subject: [Swift-commit] r3835 - trunk/src/org/griphyn/vdl/karajan/functions > > Author: skenny > Date: 2010-12-31 17:39:12 -0600 (Fri, 31 Dec 2010) > New Revision: 3835 > > Modified: > trunk/src/org/griphyn/vdl/karajan/functions/ProcessBulkErrors.java > Log: > don't show java exceptions, only the swift error > > Modified: > trunk/src/org/griphyn/vdl/karajan/functions/ProcessBulkErrors.java > =================================================================== > --- trunk/src/org/griphyn/vdl/karajan/functions/ProcessBulkErrors.java > 2010-12-30 21:53:08 UTC (rev 3834) > +++ trunk/src/org/griphyn/vdl/karajan/functions/ProcessBulkErrors.java > 2010-12-31 23:39:12 UTC (rev 3835) > @@ -90,31 +90,39 @@ > } > > public static String getMessageChain(Throwable e) { > + Throwable orig = e; > StringBuffer sb = new StringBuffer(); > String prev = null; > - boolean first = true; > + String lastmsg = null; > + boolean first = true; > while (e != null) { > String msg; > if (e instanceof NullPointerException || e > instanceof ClassCastException) { > CharArrayWriter caw = new CharArrayWriter(); > e.printStackTrace(new PrintWriter(caw)); > msg = caw.toString(); > + > } > else { > msg = e.getMessage(); > + if(msg != null){ > + lastmsg = msg; > + } > + > } > if (msg != null && (prev == null || > prev.indexOf(msg) == -1)) { > - if (!first) { > - sb.append("\nCaused by:\n\t"); > - } > - else { > - first = false; > - } > - sb.append(msg); > - prev = msg; > + if (!first) { > + sb.append("\nCaused by:\n\t"); > + } > + else { > + first = false; > + } > + sb.append(msg); > + lastmsg = msg; > + prev = msg; > } > - e = e.getCause(); > + e = e.getCause(); > } > - return sb.toString(); > + return lastmsg; > } > } > > _______________________________________________ > Swift-commit mailing list > Swift-commit at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-commit > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Mon Jan 3 13:30:31 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 03 Jan 2011 13:30:31 -0600 Subject: [Swift-devel] [Swift-commit] r3835 - trunk/src/org/griphyn/vdl/karajan/functions (fwd) In-Reply-To: References: Message-ID: <1294083031.9533.0.camel@blabla2.none> If the full trace was logged to the... log then it should please both sides. Mihael On Mon, 2011-01-03 at 09:43 -0600, Justin M Wozniak wrote: > This looks right for end users, but for development it really helps to get > a full stack trace. Is it possible to turn on the full trace with a > runtime option? > Justin > From jon.monette at gmail.com Mon Jan 3 13:34:02 2011 From: jon.monette at gmail.com (jon.monette at gmail.com) Date: Mon, 3 Jan 2011 19:34:02 +0000 Subject: [Swift-devel] [Swift-commit] r3835 -trunk/src/org/griphyn/vdl/karajan/functions (fwd) In-Reply-To: <1294083031.9533.0.camel@blabla2.none> References: <1294083031.9533.0.camel@blabla2.none> Message-ID: <340416958-1294083244-cardhu_decombobulator_blackberry.rim.net-1410445782-@bda090.bisx.prod.on.blackberry> I agree with Mihael. 
The full java stack trace for the error should be in the log and Swift should just report the Swift error during the execution. Sent on the Sprint? Now Network from my BlackBerry? -----Original Message----- From: Mihael Hategan Sender: swift-devel-bounces at ci.uchicago.edu Date: Mon, 03 Jan 2011 13:30:31 To: Justin M Wozniak Cc: Subject: Re: [Swift-devel] [Swift-commit] r3835 - trunk/src/org/griphyn/vdl/karajan/functions (fwd) If the full trace was logged to the... log then it should please both sides. Mihael On Mon, 2011-01-03 at 09:43 -0600, Justin M Wozniak wrote: > This looks right for end users, but for development it really helps to get > a full stack trace. Is it possible to turn on the full trace with a > runtime option? > Justin > _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wozniak at mcs.anl.gov Mon Jan 3 13:43:07 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 3 Jan 2011 13:43:07 -0600 (Central Standard Time) Subject: [Swift-devel] [Swift-commit] r3835 -trunk/src/org/griphyn/vdl/karajan/functions (fwd) In-Reply-To: <340416958-1294083244-cardhu_decombobulator_blackberry.rim.net-1410445782-@bda090.bisx.prod.on.blackberry> References: <1294083031.9533.0.camel@blabla2.none> <340416958-1294083244-cardhu_decombobulator_blackberry.rim.net-1410445782-@bda090.bisx.prod.on.blackberry> Message-ID: I would actually prefer that the default Swift output, even in error cases, not contain Java stack details. On Mon, 3 Jan 2011, jon.monette at gmail.com wrote: > I agree with Mihael. The full java stack trace for the error should be > in the log and Swift should just report the Swift error during the > execution. > > -----Original Message----- > From: Mihael Hategan > Sender: swift-devel-bounces at ci.uchicago.edu > Date: Mon, 03 Jan 2011 13:30:31 > To: Justin M Wozniak > Cc: > Subject: Re: [Swift-devel] [Swift-commit] r3835 - > trunk/src/org/griphyn/vdl/karajan/functions (fwd) > > If the full trace was logged to the... log then it should please both > sides. > > Mihael > > On Mon, 2011-01-03 at 09:43 -0600, Justin M Wozniak wrote: >> This looks right for end users, but for development it really helps to get >> a full stack trace. Is it possible to turn on the full trace with a >> runtime option? >> Justin >> > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Justin M Wozniak From hategan at mcs.anl.gov Mon Jan 3 15:31:46 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 03 Jan 2011 15:31:46 -0600 Subject: [Swift-devel] [Swift-commit] r3835 -trunk/src/org/griphyn/vdl/karajan/functions (fwd) In-Reply-To: References: <1294083031.9533.0.camel@blabla2.none> <340416958-1294083244-cardhu_decombobulator_blackberry.rim.net-1410445782-@bda090.bisx.prod.on.blackberry> Message-ID: <1294090306.10236.0.camel@blabla2.none> Exactly. Those would go to the log. The point is that there is no need for a runtime flag. Nice error messages go to stderr and detailed error messages (with stack traces) go to logs. Mihael On Mon, 2011-01-03 at 13:43 -0600, Justin M Wozniak wrote: > I would actually prefer that the default Swift output, even in error > cases, not contain Java stack details. > > On Mon, 3 Jan 2011, jon.monette at gmail.com wrote: > > > I agree with Mihael. 
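[For illustration only: a minimal, self-contained sketch of the stderr-versus-log split Mihael describes above. This is not the Swift implementation; the class and method names (ErrorReporting, messageChain, report) are made up for the example, and it assumes a log4j Logger is available on the classpath, as Swift's logging already is.]

    import org.apache.log4j.Logger;

    public class ErrorReporting {
        private static final Logger logger = Logger.getLogger(ErrorReporting.class);

        // Walk the cause chain and keep only the human-readable messages,
        // skipping duplicates -- roughly what getMessageChain() above does.
        public static String messageChain(Throwable t) {
            StringBuilder sb = new StringBuilder();
            String prev = null;
            while (t != null) {
                String msg = t.getMessage();
                if (msg != null && (prev == null || !prev.contains(msg))) {
                    if (sb.length() > 0) {
                        sb.append("\nCaused by:\n\t");
                    }
                    sb.append(msg);
                    prev = msg;
                }
                t = t.getCause();
            }
            return sb.toString();
        }

        // The split discussed in this thread: the short message chain goes to
        // stderr for users, the full stack trace goes to the log for developers,
        // so no runtime flag is needed.
        public static void report(Throwable t) {
            System.err.println(messageChain(t));
            logger.debug("Full stack trace for the error reported above", t);
        }

        public static void main(String[] args) {
            try {
                throw new RuntimeException("File not found: input.txt",
                        new java.io.FileNotFoundException("input.txt"));
            } catch (RuntimeException e) {
                report(e);
            }
        }
    }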
The full java stack trace for the error should be > > in the log and Swift should just report the Swift error during the > > execution. > > > > -----Original Message----- > > From: Mihael Hategan > > Sender: swift-devel-bounces at ci.uchicago.edu > > Date: Mon, 03 Jan 2011 13:30:31 > > To: Justin M Wozniak > > Cc: > > Subject: Re: [Swift-devel] [Swift-commit] r3835 - > > trunk/src/org/griphyn/vdl/karajan/functions (fwd) > > > > If the full trace was logged to the... log then it should please both > > sides. > > > > Mihael > > > > On Mon, 2011-01-03 at 09:43 -0600, Justin M Wozniak wrote: > >> This looks right for end users, but for development it really helps to get > >> a full stack trace. Is it possible to turn on the full trace with a > >> runtime option? > >> Justin > >> > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From hategan at mcs.anl.gov Mon Jan 3 15:37:49 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 03 Jan 2011 15:37:49 -0600 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: <1294050907.9079.0.camel@blabla2.none> References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> <1294050907.9079.0.camel@blabla2.none> Message-ID: <1294090669.10236.1.camel@blabla2.none> Should be fixed in cog trunk/2990. On Mon, 2011-01-03 at 04:35 -0600, Mihael Hategan wrote: > I'll fix that tomorrow. Actually that's later today. > > Mihael > > On Sun, 2011-01-02 at 20:23 -0800, Sarah Kenny wrote: > > sorry, forgot where we landed on the naming...as long as it's > > somewhere btwn .91 and 1.0 we should be all right :) but yeah, i can > > branch the release as .92. however, i just checked out trunk and am > > getting some errors compiling: > > > > compile: > > [echo] [provider-coaster]: COMPILE > > [javac] Compiling 124 source files > > to /home/skenny/builds/cog/modules/provider-coaster/build > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: illegal start of type > > [javac] if (shutdown) { > > [javac] ^ > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > > [javac] if (shutdown) { > > [javac] ^ > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: not a statement > > [javac] if (shutdown) { > > [javac] ^ > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > > [javac] if (shutdown) { > > [javac] ^ > > [javac] 4 errors > > > > > > p.s. happy new year to you too! > > > > > > > > On Sun, Jan 2, 2011 at 5:20 PM, Michael Wilde > > wrote: > > Indeed - that was the "great" part :) > > I was just asking so we that we get the release number right > > when we create the release branch. > > > > - Mike > > > > > > > > ----- Original Message ----- > > > On Sun, 2011-01-02 at 16:46 -0600, Michael Wilde wrote: > > > > Happy New Year, All! > > > > > > And to you, too! > > > > > > > > This sounds great, but wasn't the plan to call the current > > stable > > > > branch 0.91 and the current trunk 0.92? > > > > > > Irrespective of that, bug fixes from the branch should be > > merged to > > > trunk. 
And better to do so before trunk is branched into > > another > > > release > > > branch. > > > > > > Mihael > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From skenny at uchicago.edu Mon Jan 3 23:09:43 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Mon, 3 Jan 2011 21:09:43 -0800 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: <1294090669.10236.1.camel@blabla2.none> References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> <1294050907.9079.0.camel@blabla2.none> <1294090669.10236.1.camel@blabla2.none> Message-ID: still getting some complaints from the compiler on the merged files: compile: [echo] [swift]: COMPILE [javac] Compiling 374 source files to /home/skenny/builds/cog/modules/swift/build [javac] /home/skenny/builds/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:557: closeArraySizes() is already defined in org.griphyn.vdl.mapping.AbstractDataNode [javac] public void closeArraySizes() { [javac] ^ [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 1 error BUILD FAILED On Mon, Jan 3, 2011 at 1:37 PM, Mihael Hategan wrote: > Should be fixed in cog trunk/2990. > > On Mon, 2011-01-03 at 04:35 -0600, Mihael Hategan wrote: > > I'll fix that tomorrow. Actually that's later today. > > > > Mihael > > > > On Sun, 2011-01-02 at 20:23 -0800, Sarah Kenny wrote: > > > sorry, forgot where we landed on the naming...as long as it's > > > somewhere btwn .91 and 1.0 we should be all right :) but yeah, i can > > > branch the release as .92. however, i just checked out trunk and am > > > getting some errors compiling: > > > > > > compile: > > > [echo] [provider-coaster]: COMPILE > > > [javac] Compiling 124 source files > > > to /home/skenny/builds/cog/modules/provider-coaster/build > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > illegal start of type > > > [javac] if (shutdown) { > > > [javac] ^ > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > ';' expected > > > [javac] if (shutdown) { > > > [javac] ^ > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > not a statement > > > [javac] if (shutdown) { > > > [javac] ^ > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > ';' expected > > > [javac] if (shutdown) { > > > [javac] ^ > > > [javac] 4 errors > > > > > > > > > p.s. happy new year to you too! > > > > > > > > > > > > On Sun, Jan 2, 2011 at 5:20 PM, Michael Wilde > > > wrote: > > > Indeed - that was the "great" part :) > > > I was just asking so we that we get the release number right > > > when we create the release branch. > > > > > > - Mike > > > > > > > > > > > > ----- Original Message ----- > > > > On Sun, 2011-01-02 at 16:46 -0600, Michael Wilde wrote: > > > > > Happy New Year, All! > > > > > > > > And to you, too! 
> > > > > > > > > > This sounds great, but wasn't the plan to call the current > > > stable > > > > > branch 0.91 and the current trunk 0.92? > > > > > > > > Irrespective of that, bug fixes from the branch should be > > > merged to > > > > trunk. And better to do so before trunk is branched into > > > another > > > > release > > > > branch. > > > > > > > > Mihael > > > > > > > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Mon Jan 3 23:18:28 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 03 Jan 2011 21:18:28 -0800 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> <1294050907.9079.0.camel@blabla2.none> <1294090669.10236.1.camel@blabla2.none> Message-ID: <1294118308.19868.0.camel@blabla2.none> Sorry about that. Swift trunk/r3837. On Mon, 2011-01-03 at 21:09 -0800, Sarah Kenny wrote: > still getting some complaints from the compiler on the merged files: > > compile: > [echo] [swift]: COMPILE > [javac] Compiling 374 source files > to /home/skenny/builds/cog/modules/swift/build > > [javac] /home/skenny/builds/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:557: closeArraySizes() is already defined in org.griphyn.vdl.mapping.AbstractDataNode > [javac] public void closeArraySizes() { > [javac] ^ > [javac] Note: Some input files use unchecked or unsafe operations. > [javac] Note: Recompile with -Xlint:unchecked for details. > [javac] 1 error > > BUILD FAILED > > > On Mon, Jan 3, 2011 at 1:37 PM, Mihael Hategan > wrote: > Should be fixed in cog trunk/2990. > > > On Mon, 2011-01-03 at 04:35 -0600, Mihael Hategan wrote: > > I'll fix that tomorrow. Actually that's later today. > > > > Mihael > > > > On Sun, 2011-01-02 at 20:23 -0800, Sarah Kenny wrote: > > > sorry, forgot where we landed on the naming...as long as > it's > > > somewhere btwn .91 and 1.0 we should be all right :) but > yeah, i can > > > branch the release as .92. 
however, i just checked out > trunk and am > > > getting some errors compiling: > > > > > > compile: > > > [echo] [provider-coaster]: COMPILE > > > [javac] Compiling 124 source files > > > to /home/skenny/builds/cog/modules/provider-coaster/build > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: illegal start of type > > > [javac] if (shutdown) { > > > [javac] ^ > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > > > [javac] if (shutdown) { > > > [javac] ^ > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: not a statement > > > [javac] if (shutdown) { > > > [javac] ^ > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > > > [javac] if (shutdown) { > > > [javac] ^ > > > [javac] 4 errors > > > > > > > > > p.s. happy new year to you too! > > > > > > > > > > > > On Sun, Jan 2, 2011 at 5:20 PM, Michael Wilde > > > > wrote: > > > Indeed - that was the "great" part :) > > > I was just asking so we that we get the release > number right > > > when we create the release branch. > > > > > > - Mike > > > > > > > > > > > > ----- Original Message ----- > > > > On Sun, 2011-01-02 at 16:46 -0600, Michael Wilde > wrote: > > > > > Happy New Year, All! > > > > > > > > And to you, too! > > > > > > > > > > This sounds great, but wasn't the plan to call > the current > > > stable > > > > > branch 0.91 and the current trunk 0.92? > > > > > > > > Irrespective of that, bug fixes from the branch > should be > > > merged to > > > > trunk. And better to do so before trunk is > branched into > > > another > > > > release > > > > branch. > > > > > > > > Mihael > > > > > > > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From skenny at uchicago.edu Mon Jan 3 23:45:02 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Mon, 3 Jan 2011 21:45:02 -0800 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: <1294118308.19868.0.camel@blabla2.none> References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> <1294050907.9079.0.camel@blabla2.none> <1294090669.10236.1.camel@blabla2.none> <1294118308.19868.0.camel@blabla2.none> Message-ID: alrighty...branching done :) On Mon, Jan 3, 2011 at 9:18 PM, Mihael Hategan wrote: > Sorry about that. Swift trunk/r3837. 
> > On Mon, 2011-01-03 at 21:09 -0800, Sarah Kenny wrote: > > still getting some complaints from the compiler on the merged files: > > > > compile: > > [echo] [swift]: COMPILE > > [javac] Compiling 374 source files > > to /home/skenny/builds/cog/modules/swift/build > > > > [javac] > /home/skenny/builds/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:557: > closeArraySizes() is already defined in > org.griphyn.vdl.mapping.AbstractDataNode > > [javac] public void closeArraySizes() { > > [javac] ^ > > [javac] Note: Some input files use unchecked or unsafe operations. > > [javac] Note: Recompile with -Xlint:unchecked for details. > > [javac] 1 error > > > > BUILD FAILED > > > > > > On Mon, Jan 3, 2011 at 1:37 PM, Mihael Hategan > > wrote: > > Should be fixed in cog trunk/2990. > > > > > > On Mon, 2011-01-03 at 04:35 -0600, Mihael Hategan wrote: > > > I'll fix that tomorrow. Actually that's later today. > > > > > > Mihael > > > > > > On Sun, 2011-01-02 at 20:23 -0800, Sarah Kenny wrote: > > > > sorry, forgot where we landed on the naming...as long as > > it's > > > > somewhere btwn .91 and 1.0 we should be all right :) but > > yeah, i can > > > > branch the release as .92. however, i just checked out > > trunk and am > > > > getting some errors compiling: > > > > > > > > compile: > > > > [echo] [provider-coaster]: COMPILE > > > > [javac] Compiling 124 source files > > > > to /home/skenny/builds/cog/modules/provider-coaster/build > > > > > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > illegal start of type > > > > [javac] if (shutdown) { > > > > [javac] ^ > > > > > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > ';' expected > > > > [javac] if (shutdown) { > > > > [javac] ^ > > > > > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > not a statement > > > > [javac] if (shutdown) { > > > > [javac] ^ > > > > > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > ';' expected > > > > [javac] if (shutdown) { > > > > [javac] ^ > > > > [javac] 4 errors > > > > > > > > > > > > p.s. happy new year to you too! > > > > > > > > > > > > > > > > On Sun, Jan 2, 2011 at 5:20 PM, Michael Wilde > > > > > > wrote: > > > > Indeed - that was the "great" part :) > > > > I was just asking so we that we get the release > > number right > > > > when we create the release branch. > > > > > > > > - Mike > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > On Sun, 2011-01-02 at 16:46 -0600, Michael Wilde > > wrote: > > > > > > Happy New Year, All! > > > > > > > > > > And to you, too! > > > > > > > > > > > > This sounds great, but wasn't the plan to call > > the current > > > > stable > > > > > > branch 0.91 and the current trunk 0.92? > > > > > > > > > > Irrespective of that, bug fixes from the branch > > should be > > > > merged to > > > > > trunk. And better to do so before trunk is > > branched into > > > > another > > > > > release > > > > > branch. 
> > > > > > > > > > Mihael > > > > > > > > > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of Chicago > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aespinosa at cs.uchicago.edu Tue Jan 4 04:53:18 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 4 Jan 2011 04:53:18 -0600 Subject: [Swift-devel] Some remote workers + provider staging logs (ReplyTimeouts on large workflows) In-Reply-To: <763241375.13842.1293735103760.JavaMail.root@zimbra.anl.gov> References: <763241375.13842.1293735103760.JavaMail.root@zimbra.anl.gov> Message-ID: <20110104105318.GA16482@morse.cs.uchicago.edu> I was finally able to replicate things in a non OSG setting (well most of it).I ran 100 workers on bridled and it produced the errors I was expecting. I'm binary searching to determine what number it started failing (10 worked). Attached is the trace of 100 workers . client and service logs are also included -Allan 2010/12/30 Michael Wilde : > Hi Allan, > > It would be good to get client, service and worker logs for a reasonably small > failing case - I suspect Mihael could diagnose the problem from that. > > I will try to join you by Skype at 2PM if thats convenient for you and Dan . > > - Mike > > > ----- Original Message ----- >> I redid the OSG run with only 1 worker per coaster service and the same >> workflow finished without problems. I'll investigate if there are problems on >> multiple workers by making a testbed case in PADS as well. >> >> 2010/12/30 Mihael Hategan : >> > On Wed, 2010-12-29 at 15:28 -0600, Allan Espinosa wrote: >> > >> >> Does the timeout occur from the jobs being to long in the coaster >> >> service queue? >> > >> > No. The coaster protocol requires each command sent on a channel to be >> > acknowledged (pretty much like TCP does). Either the worker was very busy >> > (unlikely by design) or it has a fault that disturbed its main event loop >> > or there was an actual networking problem (also unlikely). >> > >> >> >> >> >> >> I did the same workflow on PADS only (site throttle makes it receive only >> >> a maximum of 400 jobs). I got the same errors at some point when my >> >> workers failed at a time less than the timeout period: >> >> >> >> The last line shows the worker.pl message when it exited: >> >> >> >> rmdir >> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111/5 >> >> rmdir >> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111 >> >> rmdir >> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations >> >> unlink >> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/wrapper.log >> >> unlink >> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/stdout.txt >> >> rmdir >> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k >> >> Failed to process data: at >> >> /home/aespinosa/swift/cogkit/modules/provider-coaster/resources/worker.pl >> >> line 639. >> > >> > I wish perl had a stack trace. 
Can you enable TRACE on the worker >> > and >> > re-run and send me the log for the failing worker? >> > >> > Mihael >> > >> > >> > >> > >> -- Allan M. Espinosa PhD student, Computer Science University of Chicago -------------- next part -------------- A non-text attachment was scrubbed... Name: timer-bug.tar.bz2 Type: application/octet-stream Size: 6684316 bytes Desc: not available URL: From benc at hawaga.org.uk Tue Jan 4 14:13:43 2011 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 4 Jan 2011 20:13:43 +0000 (GMT) Subject: [Swift-devel] [Haskell] Call for Papers: PAPP 2011 (deadline extended) (fwd) Message-ID: I saw this on haskell list -- http://www.hawaga.org.uk/ben/ ---------- Forwarded message ---------- Date: Tue, 04 Jan 2011 20:13:41 +0100 From: Clemens Grelck To: haskell at haskell.org Subject: [Haskell] Call for Papers: PAPP 2011 (deadline extended) New deadline: January 15, 2011 !! Eighth International Workshop on Practical Aspects of High-Level Parallel Programming (PAPP 2011) part of The International Conference on Computational Science June 1-3, 2011, Tsukuba, Japan http://www.papp-workshop.org AIMS AND SCOPE Computational Science applications are more and more complex to develop and require more and more computing power. Sequential computing cannot go further. Major companies in the computing industry now recognise the urgency of re-orienting an entire industry towards massively parallel computing. Parallel and grid computing are solutions to the increasing need for computing power. The trend is towards the increase of cores in processors, the number of processors and the need for scalable computing everywhere. But parallel and distributed programming is still dominated by low-level techniques such as send/receive message passing. Thus high-level approaches should play a key role in the shift to scalable computing in every computer. Algorithmic skeletons, parallel extensions of functional languages such as Haskell and ML, parallel logic and constraint programming, parallel execution of declarative programs such as SQL queries, genericity and meta-programming in object-oriented languages, etc. have produced methods and tools that improve the price/performance ratio of parallel software, and broaden the range of target applications. Also, high level languages offer a high degree of abstraction which ease the development of complex systems. Moreover, being based on formal semantics, it is possible to certify the correctness of critical parts of the applications. The PAPP workshop focuses on practical aspects of high-level parallel programming: design, implementation and optimisation of high-level programming languages, semantics of parallel languages, formal verification, design or certification of libraries, middle-wares and tools (performance predictors working on high-level parallel/grid source code, visualisations of abstract behaviour, automatic hot-spot detectors, high-level GRID resource managers, compilers, automatic generators, etc.), application of proof assistants to parallel applications, applications in all fields of computational science, benchmarks and experiments. Research on high-level grid programming is particularly relevant as well as domain specific parallel software. The aim of all these languages and tools is to improve and ease the development of applications (safety, expressivity, efficiency, etc.). Thus the PAPP workshop focuses on applications. 
The PAPP workshop is aimed both at researchers involved in the development of high level approaches for parallel and grid computing and computational science researchers who are potential users of these languages and tools. Topics We welcome submission of original, unpublished papers in English on topics including: * applications in all fields of high-performance computing and visualisation (using high-level tools) * high-level models (CGM, BSP, MPM, LogP, etc.) and tools for parallel and grid computing * high-level parallel language design, implementation and optimisation * practical aspects of computer assisted verification for high-level parallel languages * modular, object-oriented, functional, logic, constraint programming for parallel, distributed and grid computing systems * algorithmic skeletons, patterns and high-level parallel libraries * generative (e.g. template-based) programming with algorithmic skeletons, patterns and high-level parallel libraries * benchmarks and experiments using such languages and tools PAPER SUBMISSION AND PUBLICATION Prospective authors are invited to submit full papers in English presenting original research. Submitted papers must be unpublished and not submitted for publication elsewhere. Papers will go through a rigorous reviewing process. Each paper will be reviewed by at least three referees. The accepted papers will be published in the Procedia Computer Science series, as part of the ICCS proceedings. Submission must be done through the ICCS website. We invite you to submit a full paper of at most 10 pages describing new and original results, no later than January 8, 2011. Submission implies the willingness of at least one of the authors to register and present the paper. Accepted papers should be presented at the workshop. IMPORTANT DATES * January 15, 2011: Full paper due * February 20, 2011: Notification * March 7, 2011: Camera-ready paper due PROGRAMME COMMITTEE * Marco Aldinucci (University of Torino, Italy) * Jost Berthold (University of Copenhagen, Denmark) * Kento Emoto (University of Tokyo, Japan) * Fr?d?ric Gava (University Paris-East, France) * Alexandros Gerbessiotis (NJIT, USA) * Clemens Grelck (University of Amsterdam, Netherlands) * Hideya Iwasaki (The University of Electro-communications, Japan) * Roman Leshchinskiy (Standard Chartered Bank, UK) * Fr?d?ric Loulergue, chair (University of Orl?ans, France) * Bruno Raffin (INRIA, France) * Aamir Shafi (NUST, Pakistan) -- ---------------------------------------------------------------------- Dr Clemens Grelck Science Park 904 Universitair Docent 1098 XH Amsterdam Nederland Universiteit van Amsterdam Instituut voor Informatica T +31 (0) 20 525 8683 F +31 (0) 20 525 7490 Office C3.105 www.science.uva.nl/~grelck ---------------------------------------------------------------------- _______________________________________________ Haskell mailing list Haskell at haskell.org http://www.haskell.org/mailman/listinfo/haskell From hategan at mcs.anl.gov Tue Jan 4 15:44:07 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 04 Jan 2011 13:44:07 -0800 Subject: [Swift-devel] web pages Message-ID: <1294177447.30928.1.camel@blabla2.none> I made the swift pages use relative links so it can be deployed in an arbitrary place for development. However, you still need a properly configured php for your web server. 
Mihael From wilde at mcs.anl.gov Tue Jan 4 16:07:12 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 4 Jan 2011 16:07:12 -0600 (CST) Subject: [Swift-devel] web pages In-Reply-To: <1294177447.30928.1.camel@blabla2.none> Message-ID: <1059946327.24581.1294178832925.JavaMail.root@zimbra.anl.gov> Great- thanks! That works nice for me from public_html on www.ci I noticed that if you comment out the body text in index.html, you dont see the "redirecting" message, and its a bit more aesthetic. I dont know if you want that message there in case the redirect fails? Otherwise lets remove it. - Mike ----- Original Message ----- > I made the swift pages use relative links so it can be deployed in an > arbitrary place for development. However, you still need a properly > configured php for your web server. > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Jan 4 16:13:41 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 04 Jan 2011 14:13:41 -0800 Subject: [Swift-devel] web pages In-Reply-To: <1059946327.24581.1294178832925.JavaMail.root@zimbra.anl.gov> References: <1059946327.24581.1294178832925.JavaMail.root@zimbra.anl.gov> Message-ID: <1294179221.31480.0.camel@blabla2.none> On Tue, 2011-01-04 at 16:07 -0600, Michael Wilde wrote: > Great- thanks! That works nice for me from public_html on www.ci > > I noticed that if you comment out the body text in index.html, you > dont see the "redirecting" message, and its a bit more aesthetic. I > dont know if you want that message there in case the redirect fails? > Otherwise lets remove it. I want that message in there in case the redirect fails. But that's unlikely. All reasonable browsers implement that. Mihael From dk0966 at cs.ship.edu Tue Jan 4 16:15:26 2011 From: dk0966 at cs.ship.edu (David Kelly) Date: Tue, 4 Jan 2011 17:15:26 -0500 Subject: [Swift-devel] Testing for new releases Message-ID: Hello all, During today's conference call, a matrix of configurations was mentioned which contained a preliminary list of what needed to be supported. Where is this list located? So the overall plan, as I understand it: 1) Finish writing tests for configurations listed in the matrix 2) Verify the tests work with the 0.92 release and report any issues 3) Automate the tests with meta.sh 4) Make these verified configurations available to users via a well documented swift road map In terms of directory structure for site tests.. if I want to create a specific test for PADS, for example, should I create it within tests/providers/local-pbs/PADS/? David From wozniak at mcs.anl.gov Tue Jan 4 16:29:47 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 4 Jan 2011 16:29:47 -0600 (CST) Subject: [Swift-devel] Testing for new releases In-Reply-To: References: Message-ID: On Tue, 4 Jan 2011, David Kelly wrote: > During today's conference call, a matrix of configurations was > mentioned which contained a preliminary list of what needed to be > supported. Where is this list located? 
http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans > So the overall plan, as I understand it: > > 1) Finish writing tests for configurations listed in the matrix > 2) Verify the tests work with the 0.92 release and report any issues > 3) Automate the tests with meta.sh Full automation of meta.sh is not required for 0.92. Let's start by using it to create an outline of what it could do via ssh. Once we see what that looks like we can turn it into a real tool. > 4) Make these verified configurations available to users via a well > documented swift road map > > In terms of directory structure for site tests.. if I want to create a > specific test for PADS, for example, should I create it within > tests/providers/local-pbs/PADS/? This looks good. -- Justin M Wozniak From skenny at uchicago.edu Tue Jan 4 20:00:04 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Tue, 4 Jan 2011 18:00:04 -0800 Subject: [Swift-devel] Testing for new releases In-Reply-To: References: Message-ID: trying to make a matrix out of the matrix...first pass feel free to correct :) http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans#site_specific_tests On Tue, Jan 4, 2011 at 2:29 PM, Justin M Wozniak wrote: > On Tue, 4 Jan 2011, David Kelly wrote: > > During today's conference call, a matrix of configurations was >> mentioned which contained a preliminary list of what needed to be >> supported. Where is this list located? >> > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans > > > So the overall plan, as I understand it: >> >> 1) Finish writing tests for configurations listed in the matrix >> 2) Verify the tests work with the 0.92 release and report any issues >> 3) Automate the tests with meta.sh >> > > Full automation of meta.sh is not required for 0.92. Let's start by using > it to create an outline of what it could do via ssh. Once we see what that > looks like we can turn it into a real tool. > > > 4) Make these verified configurations available to users via a well >> documented swift road map >> >> In terms of directory structure for site tests.. if I want to create a >> specific test for PADS, for example, should I create it within >> tests/providers/local-pbs/PADS/? >> > > This looks good. > > -- > Justin M Wozniak > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aespinosa at cs.uchicago.edu Wed Jan 5 07:17:34 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 5 Jan 2011 07:17:34 -0600 Subject: [Swift-devel] Some remote workers + provider staging logs (ReplyTimeouts on large workflows) In-Reply-To: <20110104105318.GA16482@morse.cs.uchicago.edu> References: <763241375.13842.1293735103760.JavaMail.root@zimbra.anl.gov> <20110104105318.GA16482@morse.cs.uchicago.edu> Message-ID: ok, somehow the magic number is something between 55-62 workers to cause failures. -Allan 2011/1/4 Allan Espinosa : > I was finally able to replicate things in a non OSG setting (well most of it).I > ran 100 workers on bridled and it produced the errors I was expecting. ?I'm > binary searching to determine what number it started failing (10 worked). > > > Attached is the trace of 100 workers . 
client and service logs are also included > > -Allan > > 2010/12/30 Michael Wilde : >> Hi Allan, >> >> It would be good to get client, service and worker logs for a reasonably small >> failing case - I suspect Mihael could diagnose the problem from that. >> >> I will try to join you by Skype at 2PM if thats convenient for you and Dan . >> >> - Mike >> >> >> ----- Original Message ----- >>> I redid the OSG run with only 1 worker per coaster service and the same >>> workflow finished without problems. I'll investigate if there are problems on >>> multiple workers by making a testbed case in PADS as well. >>> >>> 2010/12/30 Mihael Hategan : >>> > On Wed, 2010-12-29 at 15:28 -0600, Allan Espinosa wrote: >>> > >>> >> Does the timeout occur from the jobs being to long in the coaster >>> >> service queue? >>> > >>> > No. The coaster protocol requires each command sent on a channel to be >>> > acknowledged (pretty much like TCP does). Either the worker was very busy >>> > (unlikely by design) or it has a fault that disturbed its main event loop >>> > or there was an actual networking problem (also unlikely). >>> > >>> >> >>> >> >>> >> I did the same workflow on PADS only (site throttle makes it receive only >>> >> a maximum of 400 jobs). I got the same errors at some point when my >>> >> workers failed at a time less than the timeout period: >>> >> >>> >> The last line shows the worker.pl message when it exited: >>> >> >>> >> rmdir >>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111/5 >>> >> rmdir >>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations/111 >>> >> rmdir >>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/RuptureVariations >>> >> unlink >>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/wrapper.log >>> >> unlink >>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k/stdout.txt >>> >> rmdir >>> >> /gpfs/pads/swift/aespinosa/swift-runs/catsall-20101229-1501-x92u64yc-0-cat-0asfsp3k >>> >> Failed to process data: at >>> >> /home/aespinosa/swift/cogkit/modules/provider-coaster/resources/worker.pl >>> >> line 639. >>> > >>> > I wish perl had a stack trace. Can you enable TRACE on the worker >>> > and >>> > re-run and send me the log for the failing worker? >>> > >>> > Mihael From jon.monette at gmail.com Wed Jan 5 13:50:38 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Wed, 05 Jan 2011 13:50:38 -0600 Subject: [Swift-devel] Swift hang Message-ID: <4D24CB8E.6060304@gmail.com> Hello, I have encountered swift hanging. The deadlock appears to be in the same place every time. This deadlock does seem to be intermittent since smaller work sizes does complete. This job size is with approximately 1200 files. The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will try to recreate the problem using simple cat jobs. -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. 
- Albert Einstein From aespinosa at cs.uchicago.edu Wed Jan 5 14:42:51 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 5 Jan 2011 14:42:51 -0600 Subject: Fwd: [Swift-devel] Swift hang In-Reply-To: <4D24D515.5060008@gmail.com> References: <4D24CB8E.6060304@gmail.com> <4D24D515.5060008@gmail.com> Message-ID: forgot to include the listhost in the earlier thread. ---------- Forwarded message ---------- From: Jonathan Monette Date: 2011/1/5 Subject: Re: [Swift-devel] Swift hang To: Allan Espinosa Here is the jstack track --(14:29:%)-- jstack -l 10232 ???????????? --(Wed,Jan05)-- 2011-01-05 14:29:28 Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode): "Attach Listener" daemon prio=10 tid=0x0000000048490800 nid=0x3d25 waiting on condition [0x0000000000000000] ?? java.lang.Thread.State: RUNNABLE ?? Locked ownable synchronizers: ??? - None "Sender" daemon prio=10 tid=0x000000004823f000 nid=0x2a0b in Object.wait() [0x00000000446c5000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:241) ??? - locked <0x00002aaab5490a50> (a org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender) ?? Locked ownable synchronizers: ??? - None "PullThread" daemon prio=10 tid=0x0000000048240800 nid=0x2a0a in Object.wait() [0x00000000445c4000] ?? java.lang.Thread.State: TIMED_WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.mwait(PullThread.java:86) ??? at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:57) ??? - locked <0x00002aaab5490d28> (a org.globus.cog.abstraction.coaster.service.job.manager.PullThread) ?? Locked ownable synchronizers: ??? - None "Channel multiplexer 1" daemon prio=10 tid=0x00000000479a2800 nid=0x2a08 sleeping[0x00000000443c2000] ?? java.lang.Thread.State: TIMED_WAITING (sleeping) ??? at java.lang.Thread.sleep(Native Method) ??? at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) ?? Locked ownable synchronizers: ??? - None "Channel multiplexer 0" daemon prio=10 tid=0x0000000047a62800 nid=0x2a07 sleeping[0x00000000444c3000] ?? java.lang.Thread.State: TIMED_WAITING (sleeping) ??? at java.lang.Thread.sleep(Native Method) ??? at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) ?? Locked ownable synchronizers: ??? - None "Timer-3" daemon prio=10 tid=0x0000000047a62000 nid=0x2a06 in Object.wait() [0x0000000043ebd000] ?? java.lang.Thread.State: TIMED_WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.util.TimerThread.mainLoop(Timer.java:509) ??? - locked <0x00002aaab54afbf0> (a java.util.TaskQueue) ??? at java.util.TimerThread.run(Timer.java:462) ?? Locked ownable synchronizers: ??? - None "PBS provider queue poller" daemon prio=10 tid=0x0000000048405000 nid=0x29bd sleeping[0x0000000043dbc000] ?? java.lang.Thread.State: TIMED_WAITING (sleeping) ??? at java.lang.Thread.sleep(Native Method) ??? at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:76) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? 
- None "Block Submitter" daemon prio=10 tid=0x00002aacc4016800 nid=0x2978 in Object.wait() [0x00000000441c0000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:54) ??? - locked <0x00002aaab54d4510> (a java.util.LinkedList) ?? Locked ownable synchronizers: ??? - None "Timer-2" daemon prio=10 tid=0x00000000483e7800 nid=0x2952 in Object.wait() [0x0000000043cbb000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at java.util.TimerThread.mainLoop(Timer.java:483) ??? - locked <0x00002aaab54caa88> (a java.util.TaskQueue) ??? at java.util.TimerThread.run(Timer.java:462) ?? Locked ownable synchronizers: ??? - None "Piped Channel Sender" daemon prio=10 tid=0x0000000048403800 nid=0x2951 in Object.wait() [0x0000000043bba000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? - locked <0x00002aaab54acd38> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) ?? Locked ownable synchronizers: ??? - None "Piped Channel Sender" daemon prio=10 tid=0x0000000047a04800 nid=0x2950 in Object.wait() [0x0000000043ab9000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? - locked <0x00002aaab54ac848> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) ?? Locked ownable synchronizers: ??? - None "Local Queue Processor" daemon prio=10 tid=0x0000000048407000 nid=0x294f in Object.wait() [0x00000000439b8000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? - waiting on <0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) ??? at java.lang.Object.wait(Object.java:485) ??? at org.globus.cog.karajan.util.Queue.take(Queue.java:46) ??? - locked <0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) ??? at org.globus.cog.abstraction.coaster.service.job.manager.AbstractQueueProcessor.take(AbstractQueueProcessor.java:51) ??? at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:37) ?? Locked ownable synchronizers: ??? - None "Server: http://192.5.86.6:46247" daemon prio=10 tid=0x0000000047b66000 nid=0x294e runnable [0x00000000438b7000] ?? java.lang.Thread.State: RUNNABLE ??? at java.net.PlainSocketImpl.socketAccept(Native Method) ??? at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) ??? - locked <0x00002aaab5492b68> (a java.net.SocksSocketImpl) ??? at java.net.ServerSocket.implAccept(ServerSocket.java:453) ??? at java.net.ServerSocket.accept(ServerSocket.java:421) ??? at org.globus.net.BaseServer.run(BaseServer.java:226) ??? at java.lang.Thread.run(Thread.java:662) ?? 
Locked ownable synchronizers: ??? - None "Timer-1" daemon prio=10 tid=0x00000000487ab000 nid=0x294c in Object.wait() [0x00000000436b5000] ?? java.lang.Thread.State: TIMED_WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.util.TimerThread.mainLoop(Timer.java:509) ??? - locked <0x00002aaab5518710> (a java.util.TaskQueue) ??? at java.util.TimerThread.run(Timer.java:462) ?? Locked ownable synchronizers: ??? - None "Coaster Bootstrap Service Connection Processor" daemon prio=10 tid=0x0000000047c99000 nid=0x294a runnable [0x00000000435b4000] ?? java.lang.Thread.State: RUNNABLE ??? at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) ??? at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) ??? at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) ??? at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) ??? - locked <0x00002aaab5474d40> (a sun.nio.ch.Util$1) ??? - locked <0x00002aaab5474d28> (a java.util.Collections$UnmodifiableSet) ??? - locked <0x00002aaab5474998> (a sun.nio.ch.EPollSelectorImpl) ??? at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) ??? at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84) ??? at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService$ConnectionProcessor.run(BootstrapService.java:231) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "Coaster Bootstrap Service Thread" daemon prio=10 tid=0x0000000047c30800 nid=0x2949 runnable [0x00000000434b3000] ?? java.lang.Thread.State: RUNNABLE ??? at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) ??? at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) ??? - locked <0x00002aaab54746f8> (a java.lang.Object) ??? at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService.run(BootstrapService.java:184) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "Local service" daemon prio=10 tid=0x0000000047c49800 nid=0x2948 runnable [0x00000000433b2000] ?? java.lang.Thread.State: RUNNABLE ??? at java.net.PlainSocketImpl.socketAccept(Native Method) ??? at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) ??? - locked <0x00002aaab5489de8> (a java.net.SocksSocketImpl) ??? at java.net.ServerSocket.implAccept(ServerSocket.java:453) ??? at java.net.ServerSocket.accept(ServerSocket.java:421) ??? at org.globus.net.BaseServer.run(BaseServer.java:226) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "Scheduler" prio=10 tid=0x00002aacc01ae800 nid=0x28b5 in Object.wait() [0x0000000042dac000] ?? java.lang.Thread.State: TIMED_WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at org.globus.cog.karajan.scheduler.LateBindingScheduler.sleep(LateBindingScheduler.java:305) ??? at org.globus.cog.karajan.scheduler.LateBindingScheduler.run(LateBindingScheduler.java:289) ??? - locked <0x00002aaab500e070> (a org.griphyn.vdl.karajan.VDSAdaptiveScheduler) ?? Locked ownable synchronizers: ??? - None "Progress ticker" daemon prio=10 tid=0x000000004821d000 nid=0x281e waiting on condition [0x0000000042cab000] ?? java.lang.Thread.State: TIMED_WAITING (sleeping) ??? at java.lang.Thread.sleep(Native Method) ??? at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.run(RuntimeStats.java:137) ?? Locked ownable synchronizers: ??? - None "Restart Log Sync" daemon prio=10 tid=0x0000000048219800 nid=0x281d in Object.wait() [0x0000000042baa000] ?? 
java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread.run(SyncThread.java:45) ??? - locked <0x00002aaab4b71708> (a org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread) ?? Locked ownable synchronizers: ??? - None "Overloaded Host Monitor" daemon prio=10 tid=0x00000000486b7000 nid=0x2819 waiting on condition [0x0000000042aa9000] ?? java.lang.Thread.State: TIMED_WAITING (sleeping) ??? at java.lang.Thread.sleep(Native Method) ??? at org.globus.cog.karajan.scheduler.OverloadedHostMonitor.run(OverloadedHostMonitor.java:47) ?? Locked ownable synchronizers: ??? - None "Timer-0" daemon prio=10 tid=0x0000000048632000 nid=0x2816 in Object.wait() [0x00000000429a8000] ?? java.lang.Thread.State: TIMED_WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.util.TimerThread.mainLoop(Timer.java:509) ??? - locked <0x00002aaab51e2ea0> (a java.util.TaskQueue) ??? at java.util.TimerThread.run(Timer.java:462) ?? Locked ownable synchronizers: ??? - None "pool-1-thread-8" prio=10 tid=0x000000004849c000 nid=0x2814 in Object.wait() [0x00000000427a6000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? - locked <0x00002aaab405e590> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "pool-1-thread-7" prio=10 tid=0x000000004807e800 nid=0x2813 in Object.wait() [0x00000000426a5000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? - locked <0x00002aaab405e590> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "pool-1-thread-6" prio=10 tid=0x000000004855f000 nid=0x2812 in Object.wait() [0x00000000425a4000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? - locked <0x00002aaab405e590> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? 
- None "pool-1-thread-5" prio=10 tid=0x00000000486c9800 nid=0x2811 in Object.wait() [0x00000000424a3000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? - locked <0x00002aaab405e590> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "pool-1-thread-4" prio=10 tid=0x00000000486c8000 nid=0x2810 in Object.wait() [0x00000000423a2000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? - locked <0x00002aaab405e590> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "pool-1-thread-3" prio=10 tid=0x0000000048491800 nid=0x280f in Object.wait() [0x00000000422a1000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? - locked <0x00002aaab405e590> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "pool-1-thread-2" prio=10 tid=0x00000000482d8800 nid=0x280e in Object.wait() [0x00000000412dd000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? - locked <0x00002aaab405e590> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "pool-1-thread-1" prio=10 tid=0x00002aacc0018000 nid=0x280d in Object.wait() [0x000000004104d000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) ??? 
- locked <0x00002aaab405e590> (a edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) ??? at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) ??? at java.lang.Thread.run(Thread.java:662) ?? Locked ownable synchronizers: ??? - None "Low Memory Detector" daemon prio=10 tid=0x00002aacb8026000 nid=0x2808 runnable [0x0000000000000000] ?? java.lang.Thread.State: RUNNABLE ?? Locked ownable synchronizers: ??? - None "CompilerThread1" daemon prio=10 tid=0x00002aacb8023800 nid=0x2807 waiting on condition [0x0000000000000000] ?? java.lang.Thread.State: RUNNABLE ?? Locked ownable synchronizers: ??? - None "CompilerThread0" daemon prio=10 tid=0x00002aacb8020800 nid=0x2806 waiting on condition [0x0000000000000000] ?? java.lang.Thread.State: RUNNABLE ?? Locked ownable synchronizers: ??? - None "Signal Dispatcher" daemon prio=10 tid=0x00002aacb801e000 nid=0x2805 runnable [0x0000000000000000] ?? java.lang.Thread.State: RUNNABLE ?? Locked ownable synchronizers: ??? - None "Finalizer" daemon prio=10 tid=0x000000004796c000 nid=0x2804 in Object.wait() [0x0000000041f9e000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) ??? - locked <0x00002aaab3e096b8> (a java.lang.ref.ReferenceQueue$Lock) ??? at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) ??? at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) ?? Locked ownable synchronizers: ??? - None "Reference Handler" daemon prio=10 tid=0x0000000047965000 nid=0x2803 in Object.wait() [0x0000000041c29000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? at java.lang.Object.wait(Object.java:485) ??? at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) ??? - locked <0x00002aaab3e09630> (a java.lang.ref.Reference$Lock) ?? Locked ownable synchronizers: ??? - None "main" prio=10 tid=0x00000000478fb800 nid=0x27f9 in Object.wait() [0x0000000040dd9000] ?? java.lang.Thread.State: WAITING (on object monitor) ??? at java.lang.Object.wait(Native Method) ??? - waiting on <0x00002aaab47b9dc0> (a org.griphyn.vdl.karajan.VDL2ExecutionContext) ??? at java.lang.Object.wait(Object.java:485) ??? at org.globus.cog.karajan.workflow.ExecutionContext.waitFor(ExecutionContext.java:261) ??? - locked <0x00002aaab47b9dc0> (a org.griphyn.vdl.karajan.VDL2ExecutionContext) ??? at org.griphyn.vdl.karajan.Loader.main(Loader.java:197) ?? Locked ownable synchronizers: ??? 
- None "VM Thread" prio=10 tid=0x0000000047960800 nid=0x2802 runnable "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004790e800 nid=0x27fa runnable "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000047910000 nid=0x27fb runnable "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000047912000 nid=0x27fc runnable "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000047914000 nid=0x27fd runnable "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000047915800 nid=0x27fe runnable "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000047917800 nid=0x27ff runnable "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000047919800 nid=0x2800 runnable "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004791b000 nid=0x2801 runnable "VM Periodic Task Thread" prio=10 tid=0x00002aacb8038800 nid=0x2809 waiting on condition JNI global references: 1451 On 1/5/11 2:06 PM, Allan Espinosa wrote: Hi jon, Could you post a jstack trace? It should indicate if the code has deadlocks. -Allan (mobile) On Jan 5, 2011 4:50 PM, "Jonathan Monette" wrote: > > Hello, > ? I have encountered swift hanging. ?The deadlock appears to be in the same place every time. ?This deadlock does seem to be intermittent since smaller work sizes does complete. ?This job size is with approximately 1200 files. ?The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. ?The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. ?The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] ?I will try to recreate the problem using simple cat jobs. > > -- > Jon > > Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. > - Albert Einstein > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein -- Allan M. Espinosa PhD student, Computer Science University of Chicago From jon.monette at gmail.com Wed Jan 5 15:48:50 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Wed, 05 Jan 2011 15:48:50 -0600 Subject: Fwd: [Swift-devel] Swift hang In-Reply-To: References: <4D24CB8E.6060304@gmail.com> <4D24D515.5060008@gmail.com> Message-ID: <4D24E742.1070009@gmail.com> if you mean site file here it is as well. .05 1 /gpfs/pads/swift/jonmon/Swift/work/localhost 3600 192.5.86.6 1 100 1 1 fast 1 10000 1 /gpfs/pads/swift/jonmon/Swift/work/pads On 1/5/11 2:42 PM, Allan Espinosa wrote: > forgot to include the listhost in the earlier thread. 
> > > ---------- Forwarded message ---------- > From: Jonathan Monette > Date: 2011/1/5 > Subject: Re: [Swift-devel] Swift hang > To: Allan Espinosa > > > Here is the jstack track > > --(14:29:%)-- jstack -l 10232 > > --(Wed,Jan05)-- > 2011-01-05 14:29:28 > Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode): > > "Attach Listener" daemon prio=10 tid=0x0000000048490800 nid=0x3d25 > waiting on condition [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "Sender" daemon prio=10 tid=0x000000004823f000 nid=0x2a0b in > Object.wait() [0x00000000446c5000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:241) > - locked<0x00002aaab5490a50> (a > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender) > > Locked ownable synchronizers: > - None > > "PullThread" daemon prio=10 tid=0x0000000048240800 nid=0x2a0a in > Object.wait() [0x00000000445c4000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.mwait(PullThread.java:86) > at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:57) > - locked<0x00002aaab5490d28> (a > org.globus.cog.abstraction.coaster.service.job.manager.PullThread) > > Locked ownable synchronizers: > - None > > "Channel multiplexer 1" daemon prio=10 tid=0x00000000479a2800 > nid=0x2a08 sleeping[0x00000000443c2000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) > > Locked ownable synchronizers: > - None > > "Channel multiplexer 0" daemon prio=10 tid=0x0000000047a62800 > nid=0x2a07 sleeping[0x00000000444c3000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) > > Locked ownable synchronizers: > - None > > "Timer-3" daemon prio=10 tid=0x0000000047a62000 nid=0x2a06 in > Object.wait() [0x0000000043ebd000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.util.TimerThread.mainLoop(Timer.java:509) > - locked<0x00002aaab54afbf0> (a java.util.TaskQueue) > at java.util.TimerThread.run(Timer.java:462) > > Locked ownable synchronizers: > - None > > "PBS provider queue poller" daemon prio=10 tid=0x0000000048405000 > nid=0x29bd sleeping[0x0000000043dbc000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:76) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Block Submitter" daemon prio=10 tid=0x00002aacc4016800 nid=0x2978 in > Object.wait() [0x00000000441c0000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:54) > - 
locked<0x00002aaab54d4510> (a java.util.LinkedList) > > Locked ownable synchronizers: > - None > > "Timer-2" daemon prio=10 tid=0x00000000483e7800 nid=0x2952 in > Object.wait() [0x0000000043cbb000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at java.util.TimerThread.mainLoop(Timer.java:483) > - locked<0x00002aaab54caa88> (a java.util.TaskQueue) > at java.util.TimerThread.run(Timer.java:462) > > Locked ownable synchronizers: > - None > > "Piped Channel Sender" daemon prio=10 tid=0x0000000048403800 > nid=0x2951 in Object.wait() [0x0000000043bba000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab54acd38> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) > > Locked ownable synchronizers: > - None > > "Piped Channel Sender" daemon prio=10 tid=0x0000000047a04800 > nid=0x2950 in Object.wait() [0x0000000043ab9000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab54ac848> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) > > Locked ownable synchronizers: > - None > > "Local Queue Processor" daemon prio=10 tid=0x0000000048407000 > nid=0x294f in Object.wait() [0x00000000439b8000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on<0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.karajan.util.Queue.take(Queue.java:46) > - locked<0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) > at org.globus.cog.abstraction.coaster.service.job.manager.AbstractQueueProcessor.take(AbstractQueueProcessor.java:51) > at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:37) > > Locked ownable synchronizers: > - None > > "Server: http://192.5.86.6:46247" daemon prio=10 > tid=0x0000000047b66000 nid=0x294e runnable [0x00000000438b7000] > java.lang.Thread.State: RUNNABLE > at java.net.PlainSocketImpl.socketAccept(Native Method) > at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) > - locked<0x00002aaab5492b68> (a java.net.SocksSocketImpl) > at java.net.ServerSocket.implAccept(ServerSocket.java:453) > at java.net.ServerSocket.accept(ServerSocket.java:421) > at org.globus.net.BaseServer.run(BaseServer.java:226) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Timer-1" daemon prio=10 tid=0x00000000487ab000 nid=0x294c in > Object.wait() [0x00000000436b5000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.util.TimerThread.mainLoop(Timer.java:509) > - locked<0x00002aaab5518710> (a java.util.TaskQueue) > at java.util.TimerThread.run(Timer.java:462) > > Locked ownable synchronizers: > - None 
> > "Coaster Bootstrap Service Connection Processor" daemon prio=10 > tid=0x0000000047c99000 nid=0x294a runnable [0x00000000435b4000] > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) > - locked<0x00002aaab5474d40> (a sun.nio.ch.Util$1) > - locked<0x00002aaab5474d28> (a java.util.Collections$UnmodifiableSet) > - locked<0x00002aaab5474998> (a sun.nio.ch.EPollSelectorImpl) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84) > at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService$ConnectionProcessor.run(BootstrapService.java:231) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Coaster Bootstrap Service Thread" daemon prio=10 > tid=0x0000000047c30800 nid=0x2949 runnable [0x00000000434b3000] > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) > - locked<0x00002aaab54746f8> (a java.lang.Object) > at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService.run(BootstrapService.java:184) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Local service" daemon prio=10 tid=0x0000000047c49800 nid=0x2948 > runnable [0x00000000433b2000] > java.lang.Thread.State: RUNNABLE > at java.net.PlainSocketImpl.socketAccept(Native Method) > at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) > - locked<0x00002aaab5489de8> (a java.net.SocksSocketImpl) > at java.net.ServerSocket.implAccept(ServerSocket.java:453) > at java.net.ServerSocket.accept(ServerSocket.java:421) > at org.globus.net.BaseServer.run(BaseServer.java:226) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Scheduler" prio=10 tid=0x00002aacc01ae800 nid=0x28b5 in Object.wait() > [0x0000000042dac000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at org.globus.cog.karajan.scheduler.LateBindingScheduler.sleep(LateBindingScheduler.java:305) > at org.globus.cog.karajan.scheduler.LateBindingScheduler.run(LateBindingScheduler.java:289) > - locked<0x00002aaab500e070> (a > org.griphyn.vdl.karajan.VDSAdaptiveScheduler) > > Locked ownable synchronizers: > - None > > "Progress ticker" daemon prio=10 tid=0x000000004821d000 nid=0x281e > waiting on condition [0x0000000042cab000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.run(RuntimeStats.java:137) > > Locked ownable synchronizers: > - None > > "Restart Log Sync" daemon prio=10 tid=0x0000000048219800 nid=0x281d in > Object.wait() [0x0000000042baa000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread.run(SyncThread.java:45) > - locked<0x00002aaab4b71708> (a > org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread) > > Locked ownable synchronizers: > - None > > "Overloaded Host Monitor" daemon prio=10 tid=0x00000000486b7000 > nid=0x2819 waiting on condition [0x0000000042aa9000] > java.lang.Thread.State: 
TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.globus.cog.karajan.scheduler.OverloadedHostMonitor.run(OverloadedHostMonitor.java:47) > > Locked ownable synchronizers: > - None > > "Timer-0" daemon prio=10 tid=0x0000000048632000 nid=0x2816 in > Object.wait() [0x00000000429a8000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.util.TimerThread.mainLoop(Timer.java:509) > - locked<0x00002aaab51e2ea0> (a java.util.TaskQueue) > at java.util.TimerThread.run(Timer.java:462) > > Locked ownable synchronizers: > - None > > "pool-1-thread-8" prio=10 tid=0x000000004849c000 nid=0x2814 in > Object.wait() [0x00000000427a6000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-7" prio=10 tid=0x000000004807e800 nid=0x2813 in > Object.wait() [0x00000000426a5000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-6" prio=10 tid=0x000000004855f000 nid=0x2812 in > Object.wait() [0x00000000425a4000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-5" prio=10 tid=0x00000000486c9800 nid=0x2811 in > Object.wait() [0x00000000424a3000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at 
edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-4" prio=10 tid=0x00000000486c8000 nid=0x2810 in > Object.wait() [0x00000000423a2000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-3" prio=10 tid=0x0000000048491800 nid=0x280f in > Object.wait() [0x00000000422a1000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-2" prio=10 tid=0x00000000482d8800 nid=0x280e in > Object.wait() [0x00000000412dd000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-1" prio=10 tid=0x00002aacc0018000 nid=0x280d in > Object.wait() [0x000000004104d000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked<0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Low Memory Detector" daemon prio=10 tid=0x00002aacb8026000 nid=0x2808 > runnable [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "CompilerThread1" daemon prio=10 tid=0x00002aacb8023800 
nid=0x2807 > waiting on condition [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "CompilerThread0" daemon prio=10 tid=0x00002aacb8020800 nid=0x2806 > waiting on condition [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "Signal Dispatcher" daemon prio=10 tid=0x00002aacb801e000 nid=0x2805 > runnable [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "Finalizer" daemon prio=10 tid=0x000000004796c000 nid=0x2804 in > Object.wait() [0x0000000041f9e000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) > - locked<0x00002aaab3e096b8> (a java.lang.ref.ReferenceQueue$Lock) > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) > at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) > > Locked ownable synchronizers: > - None > > "Reference Handler" daemon prio=10 tid=0x0000000047965000 nid=0x2803 > in Object.wait() [0x0000000041c29000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) > - locked<0x00002aaab3e09630> (a java.lang.ref.Reference$Lock) > > Locked ownable synchronizers: > - None > > "main" prio=10 tid=0x00000000478fb800 nid=0x27f9 in Object.wait() > [0x0000000040dd9000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on<0x00002aaab47b9dc0> (a > org.griphyn.vdl.karajan.VDL2ExecutionContext) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.karajan.workflow.ExecutionContext.waitFor(ExecutionContext.java:261) > - locked<0x00002aaab47b9dc0> (a > org.griphyn.vdl.karajan.VDL2ExecutionContext) > at org.griphyn.vdl.karajan.Loader.main(Loader.java:197) > > Locked ownable synchronizers: > - None > > "VM Thread" prio=10 tid=0x0000000047960800 nid=0x2802 runnable > > "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004790e800 > nid=0x27fa runnable > > "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000047910000 > nid=0x27fb runnable > > "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000047912000 > nid=0x27fc runnable > > "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000047914000 > nid=0x27fd runnable > > "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000047915800 > nid=0x27fe runnable > > "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000047917800 > nid=0x27ff runnable > > "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000047919800 > nid=0x2800 runnable > > "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004791b000 > nid=0x2801 runnable > > "VM Periodic Task Thread" prio=10 tid=0x00002aacb8038800 nid=0x2809 > waiting on condition > > JNI global references: 1451 > > > > On 1/5/11 2:06 PM, Allan Espinosa wrote: > > Hi jon, > > Could you post a jstack trace? It should indicate if the code has deadlocks. > > -Allan (mobile) > > On Jan 5, 2011 4:50 PM, "Jonathan Monette" wrote: >> Hello, >> I have encountered swift hanging. The deadlock appears to be in the same place every time. This deadlock does seem to be intermittent since smaller work sizes does complete. This job size is with approximately 1200 files. The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. 
The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will try to recreate the problem using simple cat jobs. >> >> -- >> Jon >> >> Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. >> - Albert Einstein >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Thu Jan 6 12:50:36 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 06 Jan 2011 10:50:36 -0800 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> <1294050907.9079.0.camel@blabla2.none> <1294090669.10236.1.camel@blabla2.none> <1294118308.19868.0.camel@blabla2.none> Message-ID: <1294339836.2909.33.camel@blabla2.none> I also branched cog to 4.1.8. Skenny, you should probably have write access to the cog svn. On Mon, 2011-01-03 at 21:45 -0800, Sarah Kenny wrote: > alrighty...branching done :) > > On Mon, Jan 3, 2011 at 9:18 PM, Mihael Hategan > wrote: > Sorry about that. Swift trunk/r3837. > > > On Mon, 2011-01-03 at 21:09 -0800, Sarah Kenny wrote: > > still getting some complaints from the compiler on the > merged files: > > > > compile: > > [echo] [swift]: COMPILE > > [javac] Compiling 374 source files > > to /home/skenny/builds/cog/modules/swift/build > > > > > [javac] /home/skenny/builds/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:557: closeArraySizes() is already defined in org.griphyn.vdl.mapping.AbstractDataNode > > [javac] public void closeArraySizes() { > > [javac] ^ > > [javac] Note: Some input files use unchecked or unsafe > operations. > > [javac] Note: Recompile with -Xlint:unchecked for > details. > > [javac] 1 error > > > > BUILD FAILED > > > > > > On Mon, Jan 3, 2011 at 1:37 PM, Mihael Hategan > > > wrote: > > Should be fixed in cog trunk/2990. > > > > > > On Mon, 2011-01-03 at 04:35 -0600, Mihael Hategan > wrote: > > > I'll fix that tomorrow. Actually that's later > today. > > > > > > Mihael > > > > > > On Sun, 2011-01-02 at 20:23 -0800, Sarah Kenny > wrote: > > > > sorry, forgot where we landed on the naming...as > long as > > it's > > > > somewhere btwn .91 and 1.0 we should be all > right :) but > > yeah, i can > > > > branch the release as .92. 
however, i just > checked out > > trunk and am > > > > getting some errors compiling: > > > > > > > > compile: > > > > [echo] [provider-coaster]: COMPILE > > > > [javac] Compiling 124 source files > > > > > to /home/skenny/builds/cog/modules/provider-coaster/build > > > > > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: illegal start of type > > > > [javac] if (shutdown) { > > > > [javac] ^ > > > > > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > > > > [javac] if (shutdown) { > > > > [javac] ^ > > > > > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: not a statement > > > > [javac] if (shutdown) { > > > > [javac] ^ > > > > > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > > > > [javac] if (shutdown) { > > > > [javac] ^ > > > > [javac] 4 errors > > > > > > > > > > > > p.s. happy new year to you too! > > > > > > > > > > > > > > > > On Sun, Jan 2, 2011 at 5:20 PM, Michael Wilde > > > > > > wrote: > > > > Indeed - that was the "great" part :) > > > > I was just asking so we that we get the > release > > number right > > > > when we create the release branch. > > > > > > > > - Mike > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > On Sun, 2011-01-02 at 16:46 -0600, > Michael Wilde > > wrote: > > > > > > Happy New Year, All! > > > > > > > > > > And to you, too! > > > > > > > > > > > > This sounds great, but wasn't the > plan to call > > the current > > > > stable > > > > > > branch 0.91 and the current trunk > 0.92? > > > > > > > > > > Irrespective of that, bug fixes from > the branch > > should be > > > > merged to > > > > > trunk. And better to do so before > trunk is > > branched into > > > > another > > > > > release > > > > > branch. > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of > Chicago > > > > Mathematics and Computer Science > Division > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > From skenny at uchicago.edu Thu Jan 6 13:06:43 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Thu, 6 Jan 2011 11:06:43 -0800 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: <1294339836.2909.33.camel@blabla2.none> References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> <1294050907.9079.0.camel@blabla2.none> <1294090669.10236.1.camel@blabla2.none> <1294118308.19868.0.camel@blabla2.none> <1294339836.2909.33.camel@blabla2.none> Message-ID: ummm....i branched 4.1.8 when i did the swift branch (?) On Thu, Jan 6, 2011 at 10:50 AM, Mihael Hategan wrote: > I also branched cog to 4.1.8. Skenny, you should probably have write > access to the cog svn. > > On Mon, 2011-01-03 at 21:45 -0800, Sarah Kenny wrote: > > alrighty...branching done :) > > > > On Mon, Jan 3, 2011 at 9:18 PM, Mihael Hategan > > wrote: > > Sorry about that. Swift trunk/r3837. 
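[As a hedged illustration, not from the original thread: the Cpu.java errors quoted above are javac's standard diagnostics for a statement that sits directly in a class body rather than inside a method, usually the result of a closing brace ending up in the wrong place while editing or merging. A stand-in sketch of the broken and the corrected shape; this is hypothetical code, not the actual provider-coaster Cpu class.

// A statement stranded at class level -- for example by a misplaced closing
// brace -- produces "illegal start of type", "';' expected" and "not a
// statement" from javac:
//
//     public class Cpu {
//         private boolean shutdown;
//         if (shutdown) {       // not inside any method
//             ...
//         }
//     }
//
// The same logic compiles once it is back inside a method body:
public class Cpu {
    private boolean shutdown;

    public void checkShutdown() {
        if (shutdown) {
            System.out.println("shutting down");
        }
    }
}]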
> > > > > > On Mon, 2011-01-03 at 21:09 -0800, Sarah Kenny wrote: > > > still getting some complaints from the compiler on the > > merged files: > > > > > > compile: > > > [echo] [swift]: COMPILE > > > [javac] Compiling 374 source files > > > to /home/skenny/builds/cog/modules/swift/build > > > > > > > > [javac] > /home/skenny/builds/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:557: > closeArraySizes() is already defined in > org.griphyn.vdl.mapping.AbstractDataNode > > > [javac] public void closeArraySizes() { > > > [javac] ^ > > > [javac] Note: Some input files use unchecked or unsafe > > operations. > > > [javac] Note: Recompile with -Xlint:unchecked for > > details. > > > [javac] 1 error > > > > > > BUILD FAILED > > > > > > > > > On Mon, Jan 3, 2011 at 1:37 PM, Mihael Hategan > > > > > wrote: > > > Should be fixed in cog trunk/2990. > > > > > > > > > On Mon, 2011-01-03 at 04:35 -0600, Mihael Hategan > > wrote: > > > > I'll fix that tomorrow. Actually that's later > > today. > > > > > > > > Mihael > > > > > > > > On Sun, 2011-01-02 at 20:23 -0800, Sarah Kenny > > wrote: > > > > > sorry, forgot where we landed on the naming...as > > long as > > > it's > > > > > somewhere btwn .91 and 1.0 we should be all > > right :) but > > > yeah, i can > > > > > branch the release as .92. however, i just > > checked out > > > trunk and am > > > > > getting some errors compiling: > > > > > > > > > > compile: > > > > > [echo] [provider-coaster]: COMPILE > > > > > [javac] Compiling 124 source files > > > > > > > to /home/skenny/builds/cog/modules/provider-coaster/build > > > > > > > > > > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > illegal start of type > > > > > [javac] if (shutdown) { > > > > > [javac] ^ > > > > > > > > > > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > ';' expected > > > > > [javac] if (shutdown) { > > > > > [javac] ^ > > > > > > > > > > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > not a statement > > > > > [javac] if (shutdown) { > > > > > [javac] ^ > > > > > > > > > > > > > > > [javac] > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > ';' expected > > > > > [javac] if (shutdown) { > > > > > [javac] ^ > > > > > [javac] 4 errors > > > > > > > > > > > > > > > p.s. happy new year to you too! > > > > > > > > > > > > > > > > > > > > On Sun, Jan 2, 2011 at 5:20 PM, Michael Wilde > > > > > > > > wrote: > > > > > Indeed - that was the "great" part :) > > > > > I was just asking so we that we get the > > release > > > number right > > > > > when we create the release branch. > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > On Sun, 2011-01-02 at 16:46 -0600, > > Michael Wilde > > > wrote: > > > > > > > Happy New Year, All! > > > > > > > > > > > > And to you, too! > > > > > > > > > > > > > > This sounds great, but wasn't the > > plan to call > > > the current > > > > > stable > > > > > > > branch 0.91 and the current trunk > > 0.92? > > > > > > > > > > > > Irrespective of that, bug fixes from > > the branch > > > should be > > > > > merged to > > > > > > trunk. 
And better to do so before > > trunk is > > > branched into > > > > > another > > > > > > release > > > > > > branch. > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > -- > > > > > Michael Wilde > > > > > Computation Institute, University of > > Chicago > > > > > Mathematics and Computer Science > > Division > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skenny at uchicago.edu Thu Jan 6 13:30:39 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Thu, 6 Jan 2011 11:30:39 -0800 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> <1294050907.9079.0.camel@blabla2.none> <1294090669.10236.1.camel@blabla2.none> <1294118308.19868.0.camel@blabla2.none> <1294339836.2909.33.camel@blabla2.none> Message-ID: it was revision 2991...but looks like you and justin added some things before you branched again so i guess no harm. i'm a bit surprised that didn't give you an error from svn though. On Thu, Jan 6, 2011 at 11:06 AM, Sarah Kenny wrote: > ummm....i branched 4.1.8 when i did the swift branch (?) > > > On Thu, Jan 6, 2011 at 10:50 AM, Mihael Hategan wrote: > >> I also branched cog to 4.1.8. Skenny, you should probably have write >> access to the cog svn. >> >> On Mon, 2011-01-03 at 21:45 -0800, Sarah Kenny wrote: >> > alrighty...branching done :) >> > >> > On Mon, Jan 3, 2011 at 9:18 PM, Mihael Hategan >> > wrote: >> > Sorry about that. Swift trunk/r3837. >> > >> > >> > On Mon, 2011-01-03 at 21:09 -0800, Sarah Kenny wrote: >> > > still getting some complaints from the compiler on the >> > merged files: >> > > >> > > compile: >> > > [echo] [swift]: COMPILE >> > > [javac] Compiling 374 source files >> > > to /home/skenny/builds/cog/modules/swift/build >> > > >> > > >> > [javac] >> /home/skenny/builds/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:557: >> closeArraySizes() is already defined in >> org.griphyn.vdl.mapping.AbstractDataNode >> > > [javac] public void closeArraySizes() { >> > > [javac] ^ >> > > [javac] Note: Some input files use unchecked or unsafe >> > operations. >> > > [javac] Note: Recompile with -Xlint:unchecked for >> > details. >> > > [javac] 1 error >> > > >> > > BUILD FAILED >> > > >> > > >> > > On Mon, Jan 3, 2011 at 1:37 PM, Mihael Hategan >> > >> > > wrote: >> > > Should be fixed in cog trunk/2990. >> > > >> > > >> > > On Mon, 2011-01-03 at 04:35 -0600, Mihael Hategan >> > wrote: >> > > > I'll fix that tomorrow. Actually that's later >> > today. >> > > > >> > > > Mihael >> > > > >> > > > On Sun, 2011-01-02 at 20:23 -0800, Sarah Kenny >> > wrote: >> > > > > sorry, forgot where we landed on the naming...as >> > long as >> > > it's >> > > > > somewhere btwn .91 and 1.0 we should be all >> > right :) but >> > > yeah, i can >> > > > > branch the release as .92. 
however, i just >> > checked out >> > > trunk and am >> > > > > getting some errors compiling: >> > > > > >> > > > > compile: >> > > > > [echo] [provider-coaster]: COMPILE >> > > > > [javac] Compiling 124 source files >> > > > > >> > to /home/skenny/builds/cog/modules/provider-coaster/build >> > > > > >> > > > > >> > > >> > [javac] >> /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: >> illegal start of type >> > > > > [javac] if (shutdown) { >> > > > > [javac] ^ >> > > > > >> > > > > >> > > >> > [javac] >> /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: >> ';' expected >> > > > > [javac] if (shutdown) { >> > > > > [javac] ^ >> > > > > >> > > > > >> > > >> > [javac] >> /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: >> not a statement >> > > > > [javac] if (shutdown) { >> > > > > [javac] ^ >> > > > > >> > > > > >> > > >> > [javac] >> /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: >> ';' expected >> > > > > [javac] if (shutdown) { >> > > > > [javac] ^ >> > > > > [javac] 4 errors >> > > > > >> > > > > >> > > > > p.s. happy new year to you too! >> > > > > >> > > > > >> > > > > >> > > > > On Sun, Jan 2, 2011 at 5:20 PM, Michael Wilde >> > > >> > > > > wrote: >> > > > > Indeed - that was the "great" part :) >> > > > > I was just asking so we that we get the >> > release >> > > number right >> > > > > when we create the release branch. >> > > > > >> > > > > - Mike >> > > > > >> > > > > >> > > > > >> > > > > ----- Original Message ----- >> > > > > > On Sun, 2011-01-02 at 16:46 -0600, >> > Michael Wilde >> > > wrote: >> > > > > > > Happy New Year, All! >> > > > > > >> > > > > > And to you, too! >> > > > > > > >> > > > > > > This sounds great, but wasn't the >> > plan to call >> > > the current >> > > > > stable >> > > > > > > branch 0.91 and the current trunk >> > 0.92? >> > > > > > >> > > > > > Irrespective of that, bug fixes from >> > the branch >> > > should be >> > > > > merged to >> > > > > > trunk. And better to do so before >> > trunk is >> > > branched into >> > > > > another >> > > > > > release >> > > > > > branch. >> > > > > > >> > > > > > Mihael >> > > > > >> > > > > >> > > > > >> > > > > -- >> > > > > Michael Wilde >> > > > > Computation Institute, University of >> > Chicago >> > > > > Mathematics and Computer Science >> > Division >> > > > > Argonne National Laboratory >> > > > > >> > > > > >> > > > > >> > > > >> > > > >> > > >> > > >> > > > _______________________________________________ >> > > > Swift-devel mailing list >> > > > Swift-devel at ci.uchicago.edu >> > > > >> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > >> > > >> > > >> > > >> > >> > >> > >> > >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Jan 6 14:44:18 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 06 Jan 2011 12:44:18 -0800 Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: References: <1294013342.6271.1.camel@blabla2.none> <1995645965.17104.1294017602363.JavaMail.root@zimbra.anl.gov> <1294050907.9079.0.camel@blabla2.none> <1294090669.10236.1.camel@blabla2.none> <1294118308.19868.0.camel@blabla2.none> <1294339836.2909.33.camel@blabla2.none> Message-ID: <1294346658.6240.4.camel@blabla2.none> On Thu, 2011-01-06 at 11:30 -0800, Sarah Kenny wrote: > it was revision 2991...but looks like you and justin added some things > before you branched again so i guess no harm. i'm a bit surprised that > didn't give you an error from svn though. That explains it! I was wondering why the branch looked like 4.1.8/current/src instead of what I was expecting (4.1.8/src). I thought I mistyped something and I deleted it and rebranched. Sorry. In any event, attached are a couple of scripts I use for this. You may find them useful in the future. Mihael > > On Thu, Jan 6, 2011 at 11:06 AM, Sarah Kenny > wrote: > ummm....i branched 4.1.8 when i did the swift branch (?) > > > > On Thu, Jan 6, 2011 at 10:50 AM, Mihael Hategan > wrote: > I also branched cog to 4.1.8. Skenny, you should > probably have write > access to the cog svn. > > > On Mon, 2011-01-03 at 21:45 -0800, Sarah Kenny wrote: > > alrighty...branching done :) > > > > On Mon, Jan 3, 2011 at 9:18 PM, Mihael Hategan > > > wrote: > > Sorry about that. Swift trunk/r3837. > > > > > > On Mon, 2011-01-03 at 21:09 -0800, Sarah > Kenny wrote: > > > still getting some complaints from the > compiler on the > > merged files: > > > > > > compile: > > > [echo] [swift]: COMPILE > > > [javac] Compiling 374 source files > > > > to /home/skenny/builds/cog/modules/swift/build > > > > > > > > > [javac] /home/skenny/builds/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:557: closeArraySizes() is already defined in org.griphyn.vdl.mapping.AbstractDataNode > > > [javac] public void > closeArraySizes() { > > > [javac] ^ > > > [javac] Note: Some input files use > unchecked or unsafe > > operations. > > > [javac] Note: Recompile with > -Xlint:unchecked for > > details. > > > [javac] 1 error > > > > > > BUILD FAILED > > > > > > > > > On Mon, Jan 3, 2011 at 1:37 PM, Mihael > Hategan > > > > > wrote: > > > Should be fixed in cog trunk/2990. > > > > > > > > > On Mon, 2011-01-03 at 04:35 -0600, > Mihael Hategan > > wrote: > > > > I'll fix that tomorrow. Actually > that's later > > today. > > > > > > > > Mihael > > > > > > > > On Sun, 2011-01-02 at 20:23 > -0800, Sarah Kenny > > wrote: > > > > > sorry, forgot where we landed > on the naming...as > > long as > > > it's > > > > > somewhere btwn .91 and 1.0 we > should be all > > right :) but > > > yeah, i can > > > > > branch the release as .92. 
> however, i just > > checked out > > > trunk and am > > > > > getting some errors compiling: > > > > > > > > > > compile: > > > > > [echo] > [provider-coaster]: COMPILE > > > > > [javac] Compiling 124 > source files > > > > > > > > to /home/skenny/builds/cog/modules/provider-coaster/build > > > > > > > > > > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: illegal start of type > > > > > [javac] if > (shutdown) { > > > > > [javac] ^ > > > > > > > > > > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > > > > > [javac] if > (shutdown) { > > > > > [javac] ^ > > > > > > > > > > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: not a statement > > > > > [javac] if > (shutdown) { > > > > > [javac] ^ > > > > > > > > > > > > > > > > [javac] /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: ';' expected > > > > > [javac] if > (shutdown) { > > > > > [javac] > ^ > > > > > [javac] 4 errors > > > > > > > > > > > > > > > p.s. happy new year to you > too! > > > > > > > > > > > > > > > > > > > > On Sun, Jan 2, 2011 at 5:20 > PM, Michael Wilde > > > > > > > > wrote: > > > > > Indeed - that was the > "great" part :) > > > > > I was just asking so > we that we get the > > release > > > number right > > > > > when we create the > release branch. > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > ----- Original Message > ----- > > > > > > On Sun, 2011-01-02 > at 16:46 -0600, > > Michael Wilde > > > wrote: > > > > > > > Happy New Year, > All! > > > > > > > > > > > > And to you, too! > > > > > > > > > > > > > > This sounds great, > but wasn't the > > plan to call > > > the current > > > > > stable > > > > > > > branch 0.91 and > the current trunk > > 0.92? > > > > > > > > > > > > Irrespective of > that, bug fixes from > > the branch > > > should be > > > > > merged to > > > > > > trunk. And better to > do so before > > trunk is > > > branched into > > > > > another > > > > > > release > > > > > > branch. > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > -- > > > > > Michael Wilde > > > > > Computation Institute, > University of > > Chicago > > > > > Mathematics and > Computer Science > > Division > > > > > Argonne National > Laboratory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: branch.sh Type: application/x-shellscript Size: 377 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: tag.sh Type: application/x-shellscript Size: 602 bytes Desc: not available URL: From wilde at mcs.anl.gov Thu Jan 6 14:52:02 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 6 Jan 2011 14:52:02 -0600 (CST) Subject: [Swift-devel] branching for stabilization of release .95 In-Reply-To: <1294346658.6240.4.camel@blabla2.none> Message-ID: <144644706.33954.1294347122580.JavaMail.root@zimbra.anl.gov> Sarah, it would be great if in the process of release building you can start a wiki page to capture the tools and techniques. :) Thanks, Mike ----- Original Message ----- > On Thu, 2011-01-06 at 11:30 -0800, Sarah Kenny wrote: > > it was revision 2991...but looks like you and justin added some > > things > > before you branched again so i guess no harm. i'm a bit surprised > > that > > didn't give you an error from svn though. > > That explains it! > > I was wondering why the branch looked like 4.1.8/current/src instead > of > what I was expecting (4.1.8/src). I thought I mistyped something and I > deleted it and rebranched. Sorry. > > In any event, attached are a couple of scripts I use for this. You may > find them useful in the future. > > Mihael > > > > > On Thu, Jan 6, 2011 at 11:06 AM, Sarah Kenny > > wrote: > > ummm....i branched 4.1.8 when i did the swift branch (?) > > > > > > > > On Thu, Jan 6, 2011 at 10:50 AM, Mihael Hategan > > wrote: > > I also branched cog to 4.1.8. Skenny, you should > > probably have write > > access to the cog svn. > > > > > > On Mon, 2011-01-03 at 21:45 -0800, Sarah Kenny > > wrote: > > > alrighty...branching done :) > > > > > > On Mon, Jan 3, 2011 at 9:18 PM, Mihael Hategan > > > > > wrote: > > > Sorry about that. Swift trunk/r3837. > > > > > > > > > On Mon, 2011-01-03 at 21:09 -0800, Sarah > > Kenny wrote: > > > > still getting some complaints from the > > compiler on the > > > merged files: > > > > > > > > compile: > > > > [echo] [swift]: COMPILE > > > > [javac] Compiling 374 source files > > > > > > to /home/skenny/builds/cog/modules/swift/build > > > > > > > > > > > > > [javac] > > /home/skenny/builds/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:557: > > closeArraySizes() is already defined in > > org.griphyn.vdl.mapping.AbstractDataNode > > > > [javac] public void > > closeArraySizes() { > > > > [javac] ^ > > > > [javac] Note: Some input files use > > unchecked or unsafe > > > operations. > > > > [javac] Note: Recompile with > > -Xlint:unchecked for > > > details. > > > > [javac] 1 error > > > > > > > > BUILD FAILED > > > > > > > > > > > > On Mon, Jan 3, 2011 at 1:37 PM, Mihael > > Hategan > > > > > > > wrote: > > > > Should be fixed in cog > > > > trunk/2990. > > > > > > > > > > > > On Mon, 2011-01-03 at 04:35 > > > > -0600, > > Mihael Hategan > > > wrote: > > > > > I'll fix that tomorrow. > > > > > Actually > > that's later > > > today. > > > > > > > > > > Mihael > > > > > > > > > > On Sun, 2011-01-02 at 20:23 > > -0800, Sarah Kenny > > > wrote: > > > > > > sorry, forgot where we > > > > > > landed > > on the naming...as > > > long as > > > > it's > > > > > > somewhere btwn .91 and 1.0 > > > > > > we > > should be all > > > right :) but > > > > yeah, i can > > > > > > branch the release as .92. 
> > however, i just > > > checked out > > > > trunk and am > > > > > > getting some errors > > > > > > compiling: > > > > > > > > > > > > compile: > > > > > > [echo] > > [provider-coaster]: COMPILE > > > > > > [javac] Compiling 124 > > source files > > > > > > > > > > > to > > /home/skenny/builds/cog/modules/provider-coaster/build > > > > > > > > > > > > > > > > > > > > > [javac] > > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > > illegal start of type > > > > > > [javac] if > > (shutdown) { > > > > > > [javac] ^ > > > > > > > > > > > > > > > > > > > > > [javac] > > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > > ';' expected > > > > > > [javac] if > > (shutdown) { > > > > > > [javac] ^ > > > > > > > > > > > > > > > > > > > > > [javac] > > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > > not a statement > > > > > > [javac] if > > (shutdown) { > > > > > > [javac] ^ > > > > > > > > > > > > > > > > > > > > > [javac] > > /home/skenny/builds/cog/modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Cpu.java:242: > > ';' expected > > > > > > [javac] if > > (shutdown) { > > > > > > [javac] > > ^ > > > > > > [javac] 4 errors > > > > > > > > > > > > > > > > > > p.s. happy new year to you > > too! > > > > > > > > > > > > > > > > > > > > > > > > On Sun, Jan 2, 2011 at 5:20 > > PM, Michael Wilde > > > > > > > > > > wrote: > > > > > > Indeed - that was > > > > > > the > > "great" part :) > > > > > > I was just asking so > > we that we get the > > > release > > > > number right > > > > > > when we create the > > release branch. > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > ----- Original > > > > > > Message > > ----- > > > > > > > On Sun, 2011-01-02 > > at 16:46 -0600, > > > Michael Wilde > > > > wrote: > > > > > > > > Happy New Year, > > All! > > > > > > > > > > > > > > And to you, too! > > > > > > > > > > > > > > > > This sounds > > > > > > > > great, > > but wasn't the > > > plan to call > > > > the current > > > > > > stable > > > > > > > > branch 0.91 and > > the current trunk > > > 0.92? > > > > > > > > > > > > > > Irrespective of > > that, bug fixes from > > > the branch > > > > should be > > > > > > merged to > > > > > > > trunk. And better > > > > > > > to > > do so before > > > trunk is > > > > branched into > > > > > > another > > > > > > > release > > > > > > > branch. 
> > > > > > >
> > > > > > > Mihael
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Michael Wilde
> > > > > > Computation Institute, University of Chicago
> > > > > > Mathematics and Computer Science Division
> > > > > > Argonne National Laboratory
> > > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >
> > >
> >
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From dk0966 at cs.ship.edu Thu Jan 6 21:39:32 2011
From: dk0966 at cs.ship.edu (David Kelly)
Date: Thu, 6 Jan 2011 22:39:32 -0500
Subject: [Swift-devel] Testing for new releases
In-Reply-To:
References:
Message-ID:

In the configuration matrix, an entry exists for "CI machines." Are
these the thwomp, stomp, crush, sneezy, grumpy, and doc machines
@mcs.anl.gov?

Thanks,
David

On Tue, Jan 4, 2011 at 9:00 PM, Sarah Kenny wrote:
> trying to make a matrix out of the matrix...first pass feel free to correct
> :)
>
> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans#site_specific_tests
>
> On Tue, Jan 4, 2011 at 2:29 PM, Justin M Wozniak
> wrote:
>>
>> On Tue, 4 Jan 2011, David Kelly wrote:
>>
>>> During today's conference call, a matrix of configurations was
>>> mentioned which contained a preliminary list of what needed to be
>>> supported. Where is this list located?
>>
>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans
>>
>>> So the overall plan, as I understand it:
>>>
>>> 1) Finish writing tests for configurations listed in the matrix
>>> 2) Verify the tests work with the 0.92 release and report any issues
>>> 3) Automate the tests with meta.sh
>>
>> Full automation of meta.sh is not required for 0.92. Let's start by using
>> it to create an outline of what it could do via ssh. Once we see what that
>> looks like we can turn it into a real tool.
>>
>>> 4) Make these verified configurations available to users via a well
>>> documented swift road map
>>>
>>> In terms of directory structure for site tests.. if I want to create a
>>> specific test for PADS, for example, should I create it within
>>> tests/providers/local-pbs/PADS/?
>>
>> This looks good.
>>
>> --
>> Justin M Wozniak
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>

From wilde at mcs.anl.gov Thu Jan 6 22:51:54 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 6 Jan 2011 22:51:54 -0600 (CST)
Subject: [Swift-devel] Testing for new releases
In-Reply-To:
Message-ID: <1421488636.35430.1294375914234.JavaMail.root@zimbra.anl.gov>

No, but they could/should be. I think we originally meant to go from say pads login to bridled and communicado. But going from mcs login to crush etc (x10) would be good too, or even better. It's up to you and Sarah; try something easy first, then increase the test intensity and diversity.
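
(As a rough illustration, an ssh-driven pass over those machines could start out as small as the sketch below. The host names are the ones David lists above; the checkout path, the per-host test command, and the ssh options are assumptions made up for illustration; this is not meta.sh.)

#!/bin/bash
# Hypothetical outline of an ssh-driven test sweep over the CI machines.
# Host names come from this thread; the path and test command are placeholders.
# Assumes it is run from a host that can reach them (e.g. login.mcs).
HOSTS="thwomp stomp crush sneezy grumpy doc"
TESTDIR="cog/modules/swift/tests"   # assumed location of a Swift checkout

for h in $HOSTS; do
    echo "=== $h.mcs.anl.gov ==="
    # ControlMaster reuses one authenticated connection per host, so
    # repeated runs do not prompt for credentials every time.
    ssh -o ControlMaster=auto \
        -o ControlPath="$HOME/.ssh/cm-%r@%h:%p" \
        "$h.mcs.anl.gov" "cd $TESTDIR && ./run-site-tests.sh" \
        > "$h.test.log" 2>&1 || echo "FAILED: $h"
done

Something along these lines could later grow into the ssh outline Justin describes for meta.sh.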
For the "bag of workstations" tests your scripts might have to deal with issues of getting the ssh environment set up reasonably for automation.

Note that you can only reach the 10x MCS servers from login.mcs, not directly from outside. But you can get around this either by running the test from bridled to mcs, or by setting up an SSH master channel.

- Mike

----- Original Message -----
> In the configuration matrix, an entry exists for "CI machines." Are
> these the thwomp, stomp, crush, sneezy, grumpy, and doc machines
> @mcs.anl.gov?
>
> Thanks,
> David
>
> On Tue, Jan 4, 2011 at 9:00 PM, Sarah Kenny
> wrote:
> > trying to make a matrix out of the matrix...first pass feel free to
> > correct
> > :)
> >
> > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans#site_specific_tests
> >
> > On Tue, Jan 4, 2011 at 2:29 PM, Justin M Wozniak
> >
> > wrote:
> >>
> >> On Tue, 4 Jan 2011, David Kelly wrote:
> >>
> >>> During today's conference call, a matrix of configurations was
> >>> mentioned which contained a preliminary list of what needed to be
> >>> supported. Where is this list located?
> >>
> >> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans
> >>
> >>> So the overall plan, as I understand it:
> >>>
> >>> 1) Finish writing tests for configurations listed in the matrix
> >>> 2) Verify the tests work with the 0.92 release and report any
> >>> issues
> >>> 3) Automate the tests with meta.sh
> >>
> >> Full automation of meta.sh is not required for 0.92. Let's start by
> >> using
> >> it to create an outline of what it could do via ssh. Once we see
> >> what that
> >> looks like we can turn it into a real tool.
> >>
> >>> 4) Make these verified configurations available to users via a
> >>> well
> >>> documented swift road map
> >>>
> >>> In terms of directory structure for site tests.. if I want to
> >>> create a
> >>> specific test for PADS, for example, should I create it within
> >>> tests/providers/local-pbs/PADS/?
> >>
> >> This looks good.
> >>
> >> --
> >> Justin M Wozniak
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> >
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Fri Jan 7 14:03:30 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 7 Jan 2011 14:03:30 -0600 (CST)
Subject: [Swift-devel] Re: Swift on Eureka
In-Reply-To: <4D276EBC.6020605@mcs.anl.gov>
Message-ID: <1851945446.38303.1294430610736.JavaMail.root@zimbra.anl.gov>

Hi Rob and Sheri,

I don't know the status of Swift on Eureka, but I'm eager to see it running there, so we'll make sure it works. A long while back I tried Swift there, and at the time we had a minor bug in the Cobalt provider. Justin may have fixed that recently on the BG/P's. So I'm hoping it either works or has only some readily-fixable issues in the way. We'll try it and get back to you.

In the meantime, Sheri, you might want to try a simple hello-world test on Eureka, and see if you can progress to replicating what John Dennis had done so far. It's best to send any errors you get to the swift-user list (which you should join) so that everyone on the Swift team is aware of any issues you encounter and can offer help.
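
(For concreteness, the hello-world check can be as small as the sketch below. It is untested and only a starting point: the SwiftScript follows the style of the 0.92-era tutorial examples, and it assumes the default localhost site and tc entries that ship with the release, so it exercises Swift itself before any Cobalt/PBS configuration comes into play.)

# Untested sketch of a minimal hello-world check, run from an Eureka login node.
# Writes a first.swift-style example and runs it on the default local site.
cat > first.swift <<'EOF'
type messagefile;

(messagefile t) greeting() {
    app {
        echo "Hello, world!" stdout=@filename(t);
    }
}

messagefile outfile <"hello.txt">;

outfile = greeting();
EOF

swift first.swift   # should leave "Hello, world!" in hello.txt
cat hello.txt

Once that much works, the natural next step is a sites.xml entry for the Cobalt provider, which is where the provider bug mentioned above would show up if it is still there.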
You should meet with Justin at Argonne (3rd floor, 240) who can serve as your Swift mentor. Sarah, David - lets add Eureka to the test matrix for release 0.92. Cobalt is very very close to PBS's interface, but there is a separate Swift execution provider that handles the differences. Regards, Mike ----- Original Message ----- > Hi Mike, > > Sheri is going to take over some of the development work John Dennis > was > doing on using swift with the AMWG diag package. > > Our platform is Eureka. Is there a development version of Swift > installed there? > > Rob -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sat Jan 8 13:44:29 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 08 Jan 2011 11:44:29 -0800 Subject: Fwd: [Swift-devel] Swift hang In-Reply-To: References: <4D24CB8E.6060304@gmail.com> <4D24D515.5060008@gmail.com> Message-ID: <1294515869.13304.1.camel@blabla2.none> I don't see a deadlock in the thread dump. Do you? On Wed, 2011-01-05 at 14:42 -0600, Allan Espinosa wrote: > forgot to include the listhost in the earlier thread. > > > ---------- Forwarded message ---------- > From: Jonathan Monette > Date: 2011/1/5 > Subject: Re: [Swift-devel] Swift hang > To: Allan Espinosa > > > Here is the jstack track > > --(14:29:%)-- jstack -l 10232 > > --(Wed,Jan05)-- > 2011-01-05 14:29:28 > Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode): > > "Attach Listener" daemon prio=10 tid=0x0000000048490800 nid=0x3d25 > waiting on condition [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "Sender" daemon prio=10 tid=0x000000004823f000 nid=0x2a0b in > Object.wait() [0x00000000446c5000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:241) > - locked <0x00002aaab5490a50> (a > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender) > > Locked ownable synchronizers: > - None > > "PullThread" daemon prio=10 tid=0x0000000048240800 nid=0x2a0a in > Object.wait() [0x00000000445c4000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.mwait(PullThread.java:86) > at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:57) > - locked <0x00002aaab5490d28> (a > org.globus.cog.abstraction.coaster.service.job.manager.PullThread) > > Locked ownable synchronizers: > - None > > "Channel multiplexer 1" daemon prio=10 tid=0x00000000479a2800 > nid=0x2a08 sleeping[0x00000000443c2000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) > > Locked ownable synchronizers: > - None > > "Channel multiplexer 0" daemon prio=10 tid=0x0000000047a62800 > nid=0x2a07 sleeping[0x00000000444c3000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) > > Locked ownable synchronizers: > - None > > 
"Timer-3" daemon prio=10 tid=0x0000000047a62000 nid=0x2a06 in > Object.wait() [0x0000000043ebd000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.util.TimerThread.mainLoop(Timer.java:509) > - locked <0x00002aaab54afbf0> (a java.util.TaskQueue) > at java.util.TimerThread.run(Timer.java:462) > > Locked ownable synchronizers: > - None > > "PBS provider queue poller" daemon prio=10 tid=0x0000000048405000 > nid=0x29bd sleeping[0x0000000043dbc000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:76) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Block Submitter" daemon prio=10 tid=0x00002aacc4016800 nid=0x2978 in > Object.wait() [0x00000000441c0000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:54) > - locked <0x00002aaab54d4510> (a java.util.LinkedList) > > Locked ownable synchronizers: > - None > > "Timer-2" daemon prio=10 tid=0x00000000483e7800 nid=0x2952 in > Object.wait() [0x0000000043cbb000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at java.util.TimerThread.mainLoop(Timer.java:483) > - locked <0x00002aaab54caa88> (a java.util.TaskQueue) > at java.util.TimerThread.run(Timer.java:462) > > Locked ownable synchronizers: > - None > > "Piped Channel Sender" daemon prio=10 tid=0x0000000048403800 > nid=0x2951 in Object.wait() [0x0000000043bba000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab54acd38> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) > > Locked ownable synchronizers: > - None > > "Piped Channel Sender" daemon prio=10 tid=0x0000000047a04800 > nid=0x2950 in Object.wait() [0x0000000043ab9000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab54ac848> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) > > Locked ownable synchronizers: > - None > > "Local Queue Processor" daemon prio=10 tid=0x0000000048407000 > nid=0x294f in Object.wait() [0x00000000439b8000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.karajan.util.Queue.take(Queue.java:46) > - locked <0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) > at 
org.globus.cog.abstraction.coaster.service.job.manager.AbstractQueueProcessor.take(AbstractQueueProcessor.java:51) > at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:37) > > Locked ownable synchronizers: > - None > > "Server: http://192.5.86.6:46247" daemon prio=10 > tid=0x0000000047b66000 nid=0x294e runnable [0x00000000438b7000] > java.lang.Thread.State: RUNNABLE > at java.net.PlainSocketImpl.socketAccept(Native Method) > at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) > - locked <0x00002aaab5492b68> (a java.net.SocksSocketImpl) > at java.net.ServerSocket.implAccept(ServerSocket.java:453) > at java.net.ServerSocket.accept(ServerSocket.java:421) > at org.globus.net.BaseServer.run(BaseServer.java:226) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Timer-1" daemon prio=10 tid=0x00000000487ab000 nid=0x294c in > Object.wait() [0x00000000436b5000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.util.TimerThread.mainLoop(Timer.java:509) > - locked <0x00002aaab5518710> (a java.util.TaskQueue) > at java.util.TimerThread.run(Timer.java:462) > > Locked ownable synchronizers: > - None > > "Coaster Bootstrap Service Connection Processor" daemon prio=10 > tid=0x0000000047c99000 nid=0x294a runnable [0x00000000435b4000] > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) > - locked <0x00002aaab5474d40> (a sun.nio.ch.Util$1) > - locked <0x00002aaab5474d28> (a java.util.Collections$UnmodifiableSet) > - locked <0x00002aaab5474998> (a sun.nio.ch.EPollSelectorImpl) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84) > at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService$ConnectionProcessor.run(BootstrapService.java:231) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Coaster Bootstrap Service Thread" daemon prio=10 > tid=0x0000000047c30800 nid=0x2949 runnable [0x00000000434b3000] > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) > at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) > - locked <0x00002aaab54746f8> (a java.lang.Object) > at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService.run(BootstrapService.java:184) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Local service" daemon prio=10 tid=0x0000000047c49800 nid=0x2948 > runnable [0x00000000433b2000] > java.lang.Thread.State: RUNNABLE > at java.net.PlainSocketImpl.socketAccept(Native Method) > at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) > - locked <0x00002aaab5489de8> (a java.net.SocksSocketImpl) > at java.net.ServerSocket.implAccept(ServerSocket.java:453) > at java.net.ServerSocket.accept(ServerSocket.java:421) > at org.globus.net.BaseServer.run(BaseServer.java:226) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Scheduler" prio=10 tid=0x00002aacc01ae800 nid=0x28b5 in Object.wait() > [0x0000000042dac000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at 
org.globus.cog.karajan.scheduler.LateBindingScheduler.sleep(LateBindingScheduler.java:305) > at org.globus.cog.karajan.scheduler.LateBindingScheduler.run(LateBindingScheduler.java:289) > - locked <0x00002aaab500e070> (a > org.griphyn.vdl.karajan.VDSAdaptiveScheduler) > > Locked ownable synchronizers: > - None > > "Progress ticker" daemon prio=10 tid=0x000000004821d000 nid=0x281e > waiting on condition [0x0000000042cab000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.run(RuntimeStats.java:137) > > Locked ownable synchronizers: > - None > > "Restart Log Sync" daemon prio=10 tid=0x0000000048219800 nid=0x281d in > Object.wait() [0x0000000042baa000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread.run(SyncThread.java:45) > - locked <0x00002aaab4b71708> (a > org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread) > > Locked ownable synchronizers: > - None > > "Overloaded Host Monitor" daemon prio=10 tid=0x00000000486b7000 > nid=0x2819 waiting on condition [0x0000000042aa9000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at org.globus.cog.karajan.scheduler.OverloadedHostMonitor.run(OverloadedHostMonitor.java:47) > > Locked ownable synchronizers: > - None > > "Timer-0" daemon prio=10 tid=0x0000000048632000 nid=0x2816 in > Object.wait() [0x00000000429a8000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.util.TimerThread.mainLoop(Timer.java:509) > - locked <0x00002aaab51e2ea0> (a java.util.TaskQueue) > at java.util.TimerThread.run(Timer.java:462) > > Locked ownable synchronizers: > - None > > "pool-1-thread-8" prio=10 tid=0x000000004849c000 nid=0x2814 in > Object.wait() [0x00000000427a6000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-7" prio=10 tid=0x000000004807e800 nid=0x2813 in > Object.wait() [0x00000000426a5000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-6" prio=10 tid=0x000000004855f000 nid=0x2812 in > Object.wait() [0x00000000425a4000] > 
java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-5" prio=10 tid=0x00000000486c9800 nid=0x2811 in > Object.wait() [0x00000000424a3000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-4" prio=10 tid=0x00000000486c8000 nid=0x2810 in > Object.wait() [0x00000000423a2000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-3" prio=10 tid=0x0000000048491800 nid=0x280f in > Object.wait() [0x00000000422a1000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-2" prio=10 tid=0x00000000482d8800 nid=0x280e in > Object.wait() [0x00000000412dd000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at 
edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "pool-1-thread-1" prio=10 tid=0x00002aacc0018000 nid=0x280d in > Object.wait() [0x000000004104d000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > - locked <0x00002aaab405e590> (a > edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > at java.lang.Thread.run(Thread.java:662) > > Locked ownable synchronizers: > - None > > "Low Memory Detector" daemon prio=10 tid=0x00002aacb8026000 nid=0x2808 > runnable [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "CompilerThread1" daemon prio=10 tid=0x00002aacb8023800 nid=0x2807 > waiting on condition [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "CompilerThread0" daemon prio=10 tid=0x00002aacb8020800 nid=0x2806 > waiting on condition [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "Signal Dispatcher" daemon prio=10 tid=0x00002aacb801e000 nid=0x2805 > runnable [0x0000000000000000] > java.lang.Thread.State: RUNNABLE > > Locked ownable synchronizers: > - None > > "Finalizer" daemon prio=10 tid=0x000000004796c000 nid=0x2804 in > Object.wait() [0x0000000041f9e000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) > - locked <0x00002aaab3e096b8> (a java.lang.ref.ReferenceQueue$Lock) > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) > at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) > > Locked ownable synchronizers: > - None > > "Reference Handler" daemon prio=10 tid=0x0000000047965000 nid=0x2803 > in Object.wait() [0x0000000041c29000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:485) > at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) > - locked <0x00002aaab3e09630> (a java.lang.ref.Reference$Lock) > > Locked ownable synchronizers: > - None > > "main" prio=10 tid=0x00000000478fb800 nid=0x27f9 in Object.wait() > [0x0000000040dd9000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x00002aaab47b9dc0> (a > org.griphyn.vdl.karajan.VDL2ExecutionContext) > at java.lang.Object.wait(Object.java:485) > at org.globus.cog.karajan.workflow.ExecutionContext.waitFor(ExecutionContext.java:261) > - locked <0x00002aaab47b9dc0> (a > org.griphyn.vdl.karajan.VDL2ExecutionContext) > at org.griphyn.vdl.karajan.Loader.main(Loader.java:197) > > Locked ownable synchronizers: > - None > > "VM Thread" prio=10 tid=0x0000000047960800 nid=0x2802 runnable > > "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004790e800 > nid=0x27fa runnable > > "GC task thread#1 
(ParallelGC)" prio=10 tid=0x0000000047910000 > nid=0x27fb runnable > > "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000047912000 > nid=0x27fc runnable > > "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000047914000 > nid=0x27fd runnable > > "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000047915800 > nid=0x27fe runnable > > "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000047917800 > nid=0x27ff runnable > > "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000047919800 > nid=0x2800 runnable > > "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004791b000 > nid=0x2801 runnable > > "VM Periodic Task Thread" prio=10 tid=0x00002aacb8038800 nid=0x2809 > waiting on condition > > JNI global references: 1451 > > > > On 1/5/11 2:06 PM, Allan Espinosa wrote: > > Hi jon, > > Could you post a jstack trace? It should indicate if the code has deadlocks. > > -Allan (mobile) > > On Jan 5, 2011 4:50 PM, "Jonathan Monette" wrote: > > > > Hello, > > I have encountered swift hanging. The deadlock appears to be in the same place every time. This deadlock does seem to be intermittent since smaller work sizes does complete. This job size is with approximately 1200 files. The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will try to recreate the problem using simple cat jobs. > > > > -- > > Jon > > > > Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. > > - Albert Einstein > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > -- > Jon > > Computers are incredibly fast, accurate, and stupid. Human beings are > incredibly slow, inaccurate, and brilliant. Together they are powerful > beyond imagination. > - Albert Einstein > > > From jon.monette at gmail.com Sat Jan 8 14:02:19 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Sat, 08 Jan 2011 14:02:19 -0600 Subject: Fwd: [Swift-devel] Swift hang In-Reply-To: <1294515869.13304.1.camel@blabla2.none> References: <4D24CB8E.6060304@gmail.com> <4D24D515.5060008@gmail.com> <1294515869.13304.1.camel@blabla2.none> Message-ID: <4D28C2CB.40509@gmail.com> I did not in the thread dump. The log showed that the files had been staged in but the coaster queue was empty. I assumed this meant that Swift was hung since coasters had no jobs to run. After seeing the thread dump though i saw this did not seem to be the case. On 1/8/11 1:44 PM, Mihael Hategan wrote: > I don't see a deadlock in the thread dump. Do you? > > On Wed, 2011-01-05 at 14:42 -0600, Allan Espinosa wrote: >> forgot to include the listhost in the earlier thread. 
>> >> >> ---------- Forwarded message ---------- >> From: Jonathan Monette >> Date: 2011/1/5 >> Subject: Re: [Swift-devel] Swift hang >> To: Allan Espinosa >> >> >> Here is the jstack track >> >> --(14:29:%)-- jstack -l 10232 >> >> --(Wed,Jan05)-- >> 2011-01-05 14:29:28 >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode): >> >> "Attach Listener" daemon prio=10 tid=0x0000000048490800 nid=0x3d25 >> waiting on condition [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> Locked ownable synchronizers: >> - None >> >> "Sender" daemon prio=10 tid=0x000000004823f000 nid=0x2a0b in >> Object.wait() [0x00000000446c5000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:241) >> - locked<0x00002aaab5490a50> (a >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender) >> >> Locked ownable synchronizers: >> - None >> >> "PullThread" daemon prio=10 tid=0x0000000048240800 nid=0x2a0a in >> Object.wait() [0x00000000445c4000] >> java.lang.Thread.State: TIMED_WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.mwait(PullThread.java:86) >> at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:57) >> - locked<0x00002aaab5490d28> (a >> org.globus.cog.abstraction.coaster.service.job.manager.PullThread) >> >> Locked ownable synchronizers: >> - None >> >> "Channel multiplexer 1" daemon prio=10 tid=0x00000000479a2800 >> nid=0x2a08 sleeping[0x00000000443c2000] >> java.lang.Thread.State: TIMED_WAITING (sleeping) >> at java.lang.Thread.sleep(Native Method) >> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) >> >> Locked ownable synchronizers: >> - None >> >> "Channel multiplexer 0" daemon prio=10 tid=0x0000000047a62800 >> nid=0x2a07 sleeping[0x00000000444c3000] >> java.lang.Thread.State: TIMED_WAITING (sleeping) >> at java.lang.Thread.sleep(Native Method) >> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) >> >> Locked ownable synchronizers: >> - None >> >> "Timer-3" daemon prio=10 tid=0x0000000047a62000 nid=0x2a06 in >> Object.wait() [0x0000000043ebd000] >> java.lang.Thread.State: TIMED_WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.util.TimerThread.mainLoop(Timer.java:509) >> - locked<0x00002aaab54afbf0> (a java.util.TaskQueue) >> at java.util.TimerThread.run(Timer.java:462) >> >> Locked ownable synchronizers: >> - None >> >> "PBS provider queue poller" daemon prio=10 tid=0x0000000048405000 >> nid=0x29bd sleeping[0x0000000043dbc000] >> java.lang.Thread.State: TIMED_WAITING (sleeping) >> at java.lang.Thread.sleep(Native Method) >> at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:76) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "Block Submitter" daemon prio=10 tid=0x00002aacc4016800 nid=0x2978 in >> Object.wait() [0x00000000441c0000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at 
org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:54) >> - locked<0x00002aaab54d4510> (a java.util.LinkedList) >> >> Locked ownable synchronizers: >> - None >> >> "Timer-2" daemon prio=10 tid=0x00000000483e7800 nid=0x2952 in >> Object.wait() [0x0000000043cbb000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at java.util.TimerThread.mainLoop(Timer.java:483) >> - locked<0x00002aaab54caa88> (a java.util.TaskQueue) >> at java.util.TimerThread.run(Timer.java:462) >> >> Locked ownable synchronizers: >> - None >> >> "Piped Channel Sender" daemon prio=10 tid=0x0000000048403800 >> nid=0x2951 in Object.wait() [0x0000000043bba000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab54acd38> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) >> >> Locked ownable synchronizers: >> - None >> >> "Piped Channel Sender" daemon prio=10 tid=0x0000000047a04800 >> nid=0x2950 in Object.wait() [0x0000000043ab9000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab54ac848> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) >> >> Locked ownable synchronizers: >> - None >> >> "Local Queue Processor" daemon prio=10 tid=0x0000000048407000 >> nid=0x294f in Object.wait() [0x00000000439b8000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> - waiting on<0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) >> at java.lang.Object.wait(Object.java:485) >> at org.globus.cog.karajan.util.Queue.take(Queue.java:46) >> - locked<0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) >> at org.globus.cog.abstraction.coaster.service.job.manager.AbstractQueueProcessor.take(AbstractQueueProcessor.java:51) >> at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:37) >> >> Locked ownable synchronizers: >> - None >> >> "Server: http://192.5.86.6:46247" daemon prio=10 >> tid=0x0000000047b66000 nid=0x294e runnable [0x00000000438b7000] >> java.lang.Thread.State: RUNNABLE >> at java.net.PlainSocketImpl.socketAccept(Native Method) >> at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) >> - locked<0x00002aaab5492b68> (a java.net.SocksSocketImpl) >> at java.net.ServerSocket.implAccept(ServerSocket.java:453) >> at java.net.ServerSocket.accept(ServerSocket.java:421) >> at org.globus.net.BaseServer.run(BaseServer.java:226) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "Timer-1" daemon prio=10 tid=0x00000000487ab000 nid=0x294c in >> Object.wait() [0x00000000436b5000] >> java.lang.Thread.State: TIMED_WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at 
java.util.TimerThread.mainLoop(Timer.java:509) >> - locked<0x00002aaab5518710> (a java.util.TaskQueue) >> at java.util.TimerThread.run(Timer.java:462) >> >> Locked ownable synchronizers: >> - None >> >> "Coaster Bootstrap Service Connection Processor" daemon prio=10 >> tid=0x0000000047c99000 nid=0x294a runnable [0x00000000435b4000] >> java.lang.Thread.State: RUNNABLE >> at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) >> at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) >> at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) >> at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) >> - locked<0x00002aaab5474d40> (a sun.nio.ch.Util$1) >> - locked<0x00002aaab5474d28> (a java.util.Collections$UnmodifiableSet) >> - locked<0x00002aaab5474998> (a sun.nio.ch.EPollSelectorImpl) >> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) >> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84) >> at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService$ConnectionProcessor.run(BootstrapService.java:231) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "Coaster Bootstrap Service Thread" daemon prio=10 >> tid=0x0000000047c30800 nid=0x2949 runnable [0x00000000434b3000] >> java.lang.Thread.State: RUNNABLE >> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) >> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) >> - locked<0x00002aaab54746f8> (a java.lang.Object) >> at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService.run(BootstrapService.java:184) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "Local service" daemon prio=10 tid=0x0000000047c49800 nid=0x2948 >> runnable [0x00000000433b2000] >> java.lang.Thread.State: RUNNABLE >> at java.net.PlainSocketImpl.socketAccept(Native Method) >> at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) >> - locked<0x00002aaab5489de8> (a java.net.SocksSocketImpl) >> at java.net.ServerSocket.implAccept(ServerSocket.java:453) >> at java.net.ServerSocket.accept(ServerSocket.java:421) >> at org.globus.net.BaseServer.run(BaseServer.java:226) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "Scheduler" prio=10 tid=0x00002aacc01ae800 nid=0x28b5 in Object.wait() >> [0x0000000042dac000] >> java.lang.Thread.State: TIMED_WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at org.globus.cog.karajan.scheduler.LateBindingScheduler.sleep(LateBindingScheduler.java:305) >> at org.globus.cog.karajan.scheduler.LateBindingScheduler.run(LateBindingScheduler.java:289) >> - locked<0x00002aaab500e070> (a >> org.griphyn.vdl.karajan.VDSAdaptiveScheduler) >> >> Locked ownable synchronizers: >> - None >> >> "Progress ticker" daemon prio=10 tid=0x000000004821d000 nid=0x281e >> waiting on condition [0x0000000042cab000] >> java.lang.Thread.State: TIMED_WAITING (sleeping) >> at java.lang.Thread.sleep(Native Method) >> at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.run(RuntimeStats.java:137) >> >> Locked ownable synchronizers: >> - None >> >> "Restart Log Sync" daemon prio=10 tid=0x0000000048219800 nid=0x281d in >> Object.wait() [0x0000000042baa000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread.run(SyncThread.java:45) >> - 
locked<0x00002aaab4b71708> (a >> org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread) >> >> Locked ownable synchronizers: >> - None >> >> "Overloaded Host Monitor" daemon prio=10 tid=0x00000000486b7000 >> nid=0x2819 waiting on condition [0x0000000042aa9000] >> java.lang.Thread.State: TIMED_WAITING (sleeping) >> at java.lang.Thread.sleep(Native Method) >> at org.globus.cog.karajan.scheduler.OverloadedHostMonitor.run(OverloadedHostMonitor.java:47) >> >> Locked ownable synchronizers: >> - None >> >> "Timer-0" daemon prio=10 tid=0x0000000048632000 nid=0x2816 in >> Object.wait() [0x00000000429a8000] >> java.lang.Thread.State: TIMED_WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.util.TimerThread.mainLoop(Timer.java:509) >> - locked<0x00002aaab51e2ea0> (a java.util.TaskQueue) >> at java.util.TimerThread.run(Timer.java:462) >> >> Locked ownable synchronizers: >> - None >> >> "pool-1-thread-8" prio=10 tid=0x000000004849c000 nid=0x2814 in >> Object.wait() [0x00000000427a6000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab405e590> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "pool-1-thread-7" prio=10 tid=0x000000004807e800 nid=0x2813 in >> Object.wait() [0x00000000426a5000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab405e590> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "pool-1-thread-6" prio=10 tid=0x000000004855f000 nid=0x2812 in >> Object.wait() [0x00000000425a4000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab405e590> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "pool-1-thread-5" prio=10 tid=0x00000000486c9800 nid=0x2811 in >> Object.wait() [0x00000000424a3000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at 
edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab405e590> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "pool-1-thread-4" prio=10 tid=0x00000000486c8000 nid=0x2810 in >> Object.wait() [0x00000000423a2000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab405e590> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "pool-1-thread-3" prio=10 tid=0x0000000048491800 nid=0x280f in >> Object.wait() [0x00000000422a1000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab405e590> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "pool-1-thread-2" prio=10 tid=0x00000000482d8800 nid=0x280e in >> Object.wait() [0x00000000412dd000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab405e590> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "pool-1-thread-1" prio=10 tid=0x00002aacc0018000 nid=0x280d in >> Object.wait() [0x000000004104d000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) >> - locked<0x00002aaab405e590> (a >> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) >> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) >> at 
edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) >> at java.lang.Thread.run(Thread.java:662) >> >> Locked ownable synchronizers: >> - None >> >> "Low Memory Detector" daemon prio=10 tid=0x00002aacb8026000 nid=0x2808 >> runnable [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> Locked ownable synchronizers: >> - None >> >> "CompilerThread1" daemon prio=10 tid=0x00002aacb8023800 nid=0x2807 >> waiting on condition [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> Locked ownable synchronizers: >> - None >> >> "CompilerThread0" daemon prio=10 tid=0x00002aacb8020800 nid=0x2806 >> waiting on condition [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> Locked ownable synchronizers: >> - None >> >> "Signal Dispatcher" daemon prio=10 tid=0x00002aacb801e000 nid=0x2805 >> runnable [0x0000000000000000] >> java.lang.Thread.State: RUNNABLE >> >> Locked ownable synchronizers: >> - None >> >> "Finalizer" daemon prio=10 tid=0x000000004796c000 nid=0x2804 in >> Object.wait() [0x0000000041f9e000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) >> - locked<0x00002aaab3e096b8> (a java.lang.ref.ReferenceQueue$Lock) >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) >> at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) >> >> Locked ownable synchronizers: >> - None >> >> "Reference Handler" daemon prio=10 tid=0x0000000047965000 nid=0x2803 >> in Object.wait() [0x0000000041c29000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> at java.lang.Object.wait(Object.java:485) >> at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) >> - locked<0x00002aaab3e09630> (a java.lang.ref.Reference$Lock) >> >> Locked ownable synchronizers: >> - None >> >> "main" prio=10 tid=0x00000000478fb800 nid=0x27f9 in Object.wait() >> [0x0000000040dd9000] >> java.lang.Thread.State: WAITING (on object monitor) >> at java.lang.Object.wait(Native Method) >> - waiting on<0x00002aaab47b9dc0> (a >> org.griphyn.vdl.karajan.VDL2ExecutionContext) >> at java.lang.Object.wait(Object.java:485) >> at org.globus.cog.karajan.workflow.ExecutionContext.waitFor(ExecutionContext.java:261) >> - locked<0x00002aaab47b9dc0> (a >> org.griphyn.vdl.karajan.VDL2ExecutionContext) >> at org.griphyn.vdl.karajan.Loader.main(Loader.java:197) >> >> Locked ownable synchronizers: >> - None >> >> "VM Thread" prio=10 tid=0x0000000047960800 nid=0x2802 runnable >> >> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004790e800 >> nid=0x27fa runnable >> >> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000047910000 >> nid=0x27fb runnable >> >> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000047912000 >> nid=0x27fc runnable >> >> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000047914000 >> nid=0x27fd runnable >> >> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000047915800 >> nid=0x27fe runnable >> >> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000047917800 >> nid=0x27ff runnable >> >> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000047919800 >> nid=0x2800 runnable >> >> "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004791b000 >> nid=0x2801 runnable >> >> "VM Periodic Task Thread" prio=10 tid=0x00002aacb8038800 nid=0x2809 >> waiting on condition >> >> JNI global references: 1451 >> >> >> >> On 1/5/11 2:06 PM, Allan Espinosa wrote: >> >> Hi jon, 
>>
>> Could you post a jstack trace? It should indicate if the code has deadlocks.
>>
>> -Allan (mobile)
>>
>> On Jan 5, 2011 4:50 PM, "Jonathan Monette" wrote:
>>> Hello,
>>> I have encountered swift hanging. The deadlock appears to be in the same place every time. This deadlock does seem to be intermittent since smaller work sizes does complete. This job size is with approximately 1200 files. The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will try to recreate the problem using simple cat jobs.
>>>
>>> --
>>> Jon
>>>
>>> Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
>>> - Albert Einstein
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>> --
>> Jon
>>
>> Computers are incredibly fast, accurate, and stupid. Human beings are
>> incredibly slow, inaccurate, and brilliant. Together they are powerful
>> beyond imagination.
>> - Albert Einstein
>>
>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

--
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are
incredibly slow, inaccurate, and brilliant. Together they are powerful
beyond imagination.
- Albert Einstein

From hategan at mcs.anl.gov Sat Jan 8 14:09:05 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 08 Jan 2011 12:09:05 -0800
Subject: Fwd: [Swift-devel] Swift hang
In-Reply-To: <4D28C2CB.40509@gmail.com>
References: <4D24CB8E.6060304@gmail.com> <4D24D515.5060008@gmail.com> <1294515869.13304.1.camel@blabla2.none> <4D28C2CB.40509@gmail.com>
Message-ID: <1294517345.15275.0.camel@blabla2.none>

I'd check the logs, but:
[hategan at login ~]$ cd
~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.0001
[blinking cursor for 30 minutes now]


On Sat, 2011-01-08 at 14:02 -0600, Jonathan Monette wrote:
> I did not in the thread dump. The log showed that the files had been
> staged in but the coaster queue was empty. I assumed this meant that
> Swift was hung since coasters had no jobs to run. After seeing the
> thread dump though i saw this did not seem to be the case.
>
> On 1/8/11 1:44 PM, Mihael Hategan wrote:
> > I don't see a deadlock in the thread dump. Do you?
> >
> > On Wed, 2011-01-05 at 14:42 -0600, Allan Espinosa wrote:
> >> forgot to include the listhost in the earlier thread.
> >>
> >>
> >> ---------- Forwarded message ----------
> >> From: Jonathan Monette
> >> Date: 2011/1/5
> >> Subject: Re: [Swift-devel] Swift hang
> >> To: Allan Espinosa
> >>
> >>
> >> Here is the jstack track
> >>
> >> --(14:29:%)-- jstack -l 10232
> >>
> >> --(Wed,Jan05)--
> >> 2011-01-05 14:29:28
> >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode):
> >>
> >> [... jstack thread dump snipped; identical to the dump quoted earlier in this thread ...]
> >>
> >> JNI global references: 1451
> >>
> >>
> >> On 1/5/11 2:06 PM, Allan Espinosa wrote:
> >>
> >> Hi jon,
> >>
> >> Could you post a jstack trace? It should indicate if the code has deadlocks.
> >>
> >> -Allan (mobile)
> >>
> >> On Jan 5, 2011 4:50 PM, "Jonathan Monette" wrote:
> >>> Hello,
> >>> I have encountered swift hanging. The deadlock appears to be in the same place every time. This deadlock does seem to be intermittent since smaller work sizes does complete. This job size is with approximately 1200 files. The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will try to recreate the problem using simple cat jobs.
> >>>
> >>> --
> >>> Jon
> >>>
> >>> Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
> >>> - Albert Einstein
> >>>
> >>> _______________________________________________
> >>> Swift-devel mailing list
> >>> Swift-devel at ci.uchicago.edu
> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>
> >> --
> >> Jon
> >>
> >> Computers are incredibly fast, accurate, and stupid. Human beings are
> >> incredibly slow, inaccurate, and brilliant. Together they are powerful
> >> beyond imagination.
> >> - Albert Einstein
> >>
> >>
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
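A note on the deadlock check discussed above: jstack -l runs the JVM's built-in deadlock detector and, when it finds one, typically appends a "Found ... Java-level deadlock" section to the dump, so a dump that ends without such a section is already evidence against a monitor deadlock. The same detector can be queried programmatically through the standard java.lang.management API. The sketch below is purely illustrative and is not part of Swift or the coaster service; the class name DeadlockCheck is made up for the example.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Minimal sketch: ask the running JVM whether any threads are deadlocked
    // on monitors or ownable synchronizers (the same check jstack -l performs).
    public class DeadlockCheck {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
            if (ids == null) {
                System.out.println("No Java-level deadlock detected");
                return;
            }
            // Report each deadlocked thread, the lock it is blocked on,
            // and the thread currently holding that lock.
            for (ThreadInfo ti : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
                if (ti == null) {
                    continue;
                }
                System.out.println(ti.getThreadName() + " blocked on "
                        + ti.getLockName() + " held by " + ti.getLockOwnerName());
            }
        }
    }

Run inside (or attached via JMX to) the JVM being diagnosed, this reports the same cycles jstack -l would flag; a hang with no reported cycle, as in the dumps quoted in this thread, points toward a logical stall (for example, a queue that never receives work) rather than a lock-ordering bug.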
From jon.monette at gmail.com Sat Jan 8 14:11:37 2011
From: jon.monette at gmail.com (Jonathan Monette)
Date: Sat, 08 Jan 2011 14:11:37 -0600
Subject: Fwd: [Swift-devel] Swift hang
In-Reply-To: <1294517345.15275.0.camel@blabla2.none>
References: <4D24CB8E.6060304@gmail.com> <4D24D515.5060008@gmail.com> <1294515869.13304.1.camel@blabla2.none> <4D28C2CB.40509@gmail.com> <1294517345.15275.0.camel@blabla2.none>
Message-ID: <4D28C4F9.8050603@gmail.com>

Yea. I am not sure what is going on. I don't know what ci machine you
are logged into but login.pads has been slow for me.

On 1/8/11 2:09 PM, Mihael Hategan wrote:
> I'd check the logs, but:
> [hategan at login ~]$ cd
> ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.0001
> [blinking cursor for 30 minutes now]
>
>
> On Sat, 2011-01-08 at 14:02 -0600, Jonathan Monette wrote:
>> I did not in the thread dump. The log showed that the files had been
>> staged in but the coaster queue was empty. I assumed this meant that
>> Swift was hung since coasters had no jobs to run. After seeing the
>> thread dump though i saw this did not seem to be the case.
>>
>> On 1/8/11 1:44 PM, Mihael Hategan wrote:
>>> I don't see a deadlock in the thread dump. Do you?
>>>
>>> On Wed, 2011-01-05 at 14:42 -0600, Allan Espinosa wrote:
>>>> forgot to include the listhost in the earlier thread.
>>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Jonathan Monette
>>>> Date: 2011/1/5
>>>> Subject: Re: [Swift-devel] Swift hang
>>>> To: Allan Espinosa
>>>>
>>>>
>>>> Here is the jstack track
>>>>
>>>> --(14:29:%)-- jstack -l 10232
>>>>
>>>> --(Wed,Jan05)--
>>>> 2011-01-05 14:29:28
>>>> Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode):
>>>>
>>>> [... jstack thread dump snipped; identical to the dump quoted earlier in this thread ...]
>>>>
>>>> JNI global references: 1451
>>>>
>>>>
>>>>
>>>> On 1/5/11 2:06 PM, Allan Espinosa wrote:
>>>>
>>>> Hi jon,
>>>>
>>>> Could you post a jstack trace? It should indicate if the code has deadlocks.
>>>>
>>>> -Allan (mobile)
>>>>
>>>> On Jan 5, 2011 4:50 PM, "Jonathan Monette" wrote:
>>>>> Hello,
>>>>> I have encountered swift hanging. The deadlock appears to be in the same place every time. This deadlock does seem to be intermittent since smaller work sizes does complete. This job size is with approximately 1200 files. The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will try to recreate the problem using simple cat jobs.
>>>>>
>>>>> --
>>>>> Jon
>>>>>
>>>>> Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
>>>>> - Albert Einstein
>>>>>
>>>>> _______________________________________________
>>>>> Swift-devel mailing list
>>>>> Swift-devel at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>
>>>> --
>>>> Jon
>>>>
>>>> Computers are incredibly fast, accurate, and stupid. Human beings are
>>>> incredibly slow, inaccurate, and brilliant. Together they are powerful
>>>> beyond imagination.
>>>> - Albert Einstein
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
--
Jon

Computers are incredibly fast, accurate, and stupid. Human beings are
incredibly slow, inaccurate, and brilliant. Together they are powerful
beyond imagination.
- Albert Einstein From hategan at mcs.anl.gov Sat Jan 8 14:13:55 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 08 Jan 2011 12:13:55 -0800 Subject: Fwd: [Swift-devel] Swift hang In-Reply-To: <4D28C4F9.8050603@gmail.com> References: <4D24CB8E.6060304@gmail.com> <4D24D515.5060008@gmail.com> <1294515869.13304.1.camel@blabla2.none> <4D28C2CB.40509@gmail.com> <1294517345.15275.0.camel@blabla2.none> <4D28C4F9.8050603@gmail.com> Message-ID: <1294517635.15532.0.camel@blabla2.none> login.ci... I think there have been some problems with the CI-wide GPFS. On Sat, 2011-01-08 at 14:11 -0600, Jonathan Monette wrote: > Yea. I am not sure what is going on. I don't know what ci machine you > are logged into but login.pads has been slow for me. > > On 1/8/11 2:09 PM, Mihael Hategan wrote: > > I'd check the logs, but: > > [hategan at login ~]$ cd > > ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.0001 > > [blinking cursor for 30 minutes now] > > > > > > On Sat, 2011-01-08 at 14:02 -0600, Jonathan Monette wrote: > >> I did not in the thread dump. The log showed that the files had been > >> staged in but the coaster queue was empty. I assumed this meant that > >> Swift was hung since coasters had no jobs to run. After seeing the > >> thread dump though i saw this did not seem to be the case. > >> > >> On 1/8/11 1:44 PM, Mihael Hategan wrote: > >>> I don't see a deadlock in the thread dump. Do you? > >>> > >>> On Wed, 2011-01-05 at 14:42 -0600, Allan Espinosa wrote: > >>>> forgot to include the listhost in the earlier thread. > >>>> > >>>> > >>>> ---------- Forwarded message ---------- > >>>> From: Jonathan Monette > >>>> Date: 2011/1/5 > >>>> Subject: Re: [Swift-devel] Swift hang > >>>> To: Allan Espinosa > >>>> > >>>> > >>>> Here is the jstack track > >>>> > >>>> --(14:29:%)-- jstack -l 10232 > >>>> > >>>> --(Wed,Jan05)-- > >>>> 2011-01-05 14:29:28 > >>>> Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode): > >>>> > >>>> "Attach Listener" daemon prio=10 tid=0x0000000048490800 nid=0x3d25 > >>>> waiting on condition [0x0000000000000000] > >>>> java.lang.Thread.State: RUNNABLE > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Sender" daemon prio=10 tid=0x000000004823f000 nid=0x2a0b in > >>>> Object.wait() [0x00000000446c5000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:241) > >>>> - locked<0x00002aaab5490a50> (a > >>>> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "PullThread" daemon prio=10 tid=0x0000000048240800 nid=0x2a0a in > >>>> Object.wait() [0x00000000445c4000] > >>>> java.lang.Thread.State: TIMED_WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.mwait(PullThread.java:86) > >>>> at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:57) > >>>> - locked<0x00002aaab5490d28> (a > >>>> org.globus.cog.abstraction.coaster.service.job.manager.PullThread) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Channel multiplexer 1" daemon prio=10 tid=0x00000000479a2800 > >>>> nid=0x2a08 sleeping[0x00000000443c2000] > >>>> java.lang.Thread.State: 
TIMED_WAITING (sleeping) > >>>> at java.lang.Thread.sleep(Native Method) > >>>> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Channel multiplexer 0" daemon prio=10 tid=0x0000000047a62800 > >>>> nid=0x2a07 sleeping[0x00000000444c3000] > >>>> java.lang.Thread.State: TIMED_WAITING (sleeping) > >>>> at java.lang.Thread.sleep(Native Method) > >>>> at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:418) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Timer-3" daemon prio=10 tid=0x0000000047a62000 nid=0x2a06 in > >>>> Object.wait() [0x0000000043ebd000] > >>>> java.lang.Thread.State: TIMED_WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.util.TimerThread.mainLoop(Timer.java:509) > >>>> - locked<0x00002aaab54afbf0> (a java.util.TaskQueue) > >>>> at java.util.TimerThread.run(Timer.java:462) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "PBS provider queue poller" daemon prio=10 tid=0x0000000048405000 > >>>> nid=0x29bd sleeping[0x0000000043dbc000] > >>>> java.lang.Thread.State: TIMED_WAITING (sleeping) > >>>> at java.lang.Thread.sleep(Native Method) > >>>> at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:76) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Block Submitter" daemon prio=10 tid=0x00002aacc4016800 nid=0x2978 in > >>>> Object.wait() [0x00000000441c0000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:54) > >>>> - locked<0x00002aaab54d4510> (a java.util.LinkedList) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Timer-2" daemon prio=10 tid=0x00000000483e7800 nid=0x2952 in > >>>> Object.wait() [0x0000000043cbb000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at java.util.TimerThread.mainLoop(Timer.java:483) > >>>> - locked<0x00002aaab54caa88> (a java.util.TaskQueue) > >>>> at java.util.TimerThread.run(Timer.java:462) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Piped Channel Sender" daemon prio=10 tid=0x0000000048403800 > >>>> nid=0x2951 in Object.wait() [0x0000000043bba000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab54acd38> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Piped Channel Sender" daemon prio=10 tid=0x0000000047a04800 > >>>> nid=0x2950 in Object.wait() [0x0000000043ab9000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at 
java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab54ac848> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at org.globus.cog.karajan.workflow.service.channels.AbstractPipedChannel$Sender.run(AbstractPipedChannel.java:113) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Local Queue Processor" daemon prio=10 tid=0x0000000048407000 > >>>> nid=0x294f in Object.wait() [0x00000000439b8000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> - waiting on<0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at org.globus.cog.karajan.util.Queue.take(Queue.java:46) > >>>> - locked<0x00002aaab548bed8> (a org.globus.cog.karajan.util.Queue) > >>>> at org.globus.cog.abstraction.coaster.service.job.manager.AbstractQueueProcessor.take(AbstractQueueProcessor.java:51) > >>>> at org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:37) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Server: http://192.5.86.6:46247" daemon prio=10 > >>>> tid=0x0000000047b66000 nid=0x294e runnable [0x00000000438b7000] > >>>> java.lang.Thread.State: RUNNABLE > >>>> at java.net.PlainSocketImpl.socketAccept(Native Method) > >>>> at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) > >>>> - locked<0x00002aaab5492b68> (a java.net.SocksSocketImpl) > >>>> at java.net.ServerSocket.implAccept(ServerSocket.java:453) > >>>> at java.net.ServerSocket.accept(ServerSocket.java:421) > >>>> at org.globus.net.BaseServer.run(BaseServer.java:226) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Timer-1" daemon prio=10 tid=0x00000000487ab000 nid=0x294c in > >>>> Object.wait() [0x00000000436b5000] > >>>> java.lang.Thread.State: TIMED_WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.util.TimerThread.mainLoop(Timer.java:509) > >>>> - locked<0x00002aaab5518710> (a java.util.TaskQueue) > >>>> at java.util.TimerThread.run(Timer.java:462) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Coaster Bootstrap Service Connection Processor" daemon prio=10 > >>>> tid=0x0000000047c99000 nid=0x294a runnable [0x00000000435b4000] > >>>> java.lang.Thread.State: RUNNABLE > >>>> at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > >>>> at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) > >>>> at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) > >>>> at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) > >>>> - locked<0x00002aaab5474d40> (a sun.nio.ch.Util$1) > >>>> - locked<0x00002aaab5474d28> (a java.util.Collections$UnmodifiableSet) > >>>> - locked<0x00002aaab5474998> (a sun.nio.ch.EPollSelectorImpl) > >>>> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) > >>>> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84) > >>>> at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService$ConnectionProcessor.run(BootstrapService.java:231) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Coaster Bootstrap Service Thread" daemon prio=10 > >>>> 
tid=0x0000000047c30800 nid=0x2949 runnable [0x00000000434b3000] > >>>> java.lang.Thread.State: RUNNABLE > >>>> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) > >>>> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) > >>>> - locked<0x00002aaab54746f8> (a java.lang.Object) > >>>> at org.globus.cog.abstraction.impl.execution.coaster.BootstrapService.run(BootstrapService.java:184) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Local service" daemon prio=10 tid=0x0000000047c49800 nid=0x2948 > >>>> runnable [0x00000000433b2000] > >>>> java.lang.Thread.State: RUNNABLE > >>>> at java.net.PlainSocketImpl.socketAccept(Native Method) > >>>> at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) > >>>> - locked<0x00002aaab5489de8> (a java.net.SocksSocketImpl) > >>>> at java.net.ServerSocket.implAccept(ServerSocket.java:453) > >>>> at java.net.ServerSocket.accept(ServerSocket.java:421) > >>>> at org.globus.net.BaseServer.run(BaseServer.java:226) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Scheduler" prio=10 tid=0x00002aacc01ae800 nid=0x28b5 in Object.wait() > >>>> [0x0000000042dac000] > >>>> java.lang.Thread.State: TIMED_WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at org.globus.cog.karajan.scheduler.LateBindingScheduler.sleep(LateBindingScheduler.java:305) > >>>> at org.globus.cog.karajan.scheduler.LateBindingScheduler.run(LateBindingScheduler.java:289) > >>>> - locked<0x00002aaab500e070> (a > >>>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Progress ticker" daemon prio=10 tid=0x000000004821d000 nid=0x281e > >>>> waiting on condition [0x0000000042cab000] > >>>> java.lang.Thread.State: TIMED_WAITING (sleeping) > >>>> at java.lang.Thread.sleep(Native Method) > >>>> at org.griphyn.vdl.karajan.lib.RuntimeStats$ProgressTicker.run(RuntimeStats.java:137) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Restart Log Sync" daemon prio=10 tid=0x0000000048219800 nid=0x281d in > >>>> Object.wait() [0x0000000042baa000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread.run(SyncThread.java:45) > >>>> - locked<0x00002aaab4b71708> (a > >>>> org.globus.cog.karajan.workflow.nodes.restartLog.SyncThread) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Overloaded Host Monitor" daemon prio=10 tid=0x00000000486b7000 > >>>> nid=0x2819 waiting on condition [0x0000000042aa9000] > >>>> java.lang.Thread.State: TIMED_WAITING (sleeping) > >>>> at java.lang.Thread.sleep(Native Method) > >>>> at org.globus.cog.karajan.scheduler.OverloadedHostMonitor.run(OverloadedHostMonitor.java:47) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Timer-0" daemon prio=10 tid=0x0000000048632000 nid=0x2816 in > >>>> Object.wait() [0x00000000429a8000] > >>>> java.lang.Thread.State: TIMED_WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.util.TimerThread.mainLoop(Timer.java:509) > >>>> - locked<0x00002aaab51e2ea0> (a java.util.TaskQueue) > >>>> at java.util.TimerThread.run(Timer.java:462) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> 
"pool-1-thread-8" prio=10 tid=0x000000004849c000 nid=0x2814 in > >>>> Object.wait() [0x00000000427a6000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab405e590> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "pool-1-thread-7" prio=10 tid=0x000000004807e800 nid=0x2813 in > >>>> Object.wait() [0x00000000426a5000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab405e590> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "pool-1-thread-6" prio=10 tid=0x000000004855f000 nid=0x2812 in > >>>> Object.wait() [0x00000000425a4000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab405e590> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "pool-1-thread-5" prio=10 tid=0x00000000486c9800 nid=0x2811 in > >>>> Object.wait() [0x00000000424a3000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab405e590> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "pool-1-thread-4" prio=10 tid=0x00000000486c8000 nid=0x2810 in > >>>> Object.wait() [0x00000000423a2000] > >>>> java.lang.Thread.State: WAITING 
(on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab405e590> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "pool-1-thread-3" prio=10 tid=0x0000000048491800 nid=0x280f in > >>>> Object.wait() [0x00000000422a1000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab405e590> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "pool-1-thread-2" prio=10 tid=0x00000000482d8800 nid=0x280e in > >>>> Object.wait() [0x00000000412dd000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab405e590> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "pool-1-thread-1" prio=10 tid=0x00002aacc0018000 nid=0x280d in > >>>> Object.wait() [0x000000004104d000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:315) > >>>> - locked<0x00002aaab405e590> (a > >>>> edu.emory.mathcs.backport.java.util.concurrent.LinkedBlockingQueue$SerializableLock) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:470) > >>>> at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:667) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Low Memory Detector" daemon prio=10 tid=0x00002aacb8026000 nid=0x2808 > >>>> runnable [0x0000000000000000] > >>>> java.lang.Thread.State: RUNNABLE > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "CompilerThread1" daemon prio=10 tid=0x00002aacb8023800 nid=0x2807 > 
>>>> waiting on condition [0x0000000000000000] > >>>> java.lang.Thread.State: RUNNABLE > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "CompilerThread0" daemon prio=10 tid=0x00002aacb8020800 nid=0x2806 > >>>> waiting on condition [0x0000000000000000] > >>>> java.lang.Thread.State: RUNNABLE > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Signal Dispatcher" daemon prio=10 tid=0x00002aacb801e000 nid=0x2805 > >>>> runnable [0x0000000000000000] > >>>> java.lang.Thread.State: RUNNABLE > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Finalizer" daemon prio=10 tid=0x000000004796c000 nid=0x2804 in > >>>> Object.wait() [0x0000000041f9e000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) > >>>> - locked<0x00002aaab3e096b8> (a java.lang.ref.ReferenceQueue$Lock) > >>>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) > >>>> at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "Reference Handler" daemon prio=10 tid=0x0000000047965000 nid=0x2803 > >>>> in Object.wait() [0x0000000041c29000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) > >>>> - locked<0x00002aaab3e09630> (a java.lang.ref.Reference$Lock) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "main" prio=10 tid=0x00000000478fb800 nid=0x27f9 in Object.wait() > >>>> [0x0000000040dd9000] > >>>> java.lang.Thread.State: WAITING (on object monitor) > >>>> at java.lang.Object.wait(Native Method) > >>>> - waiting on<0x00002aaab47b9dc0> (a > >>>> org.griphyn.vdl.karajan.VDL2ExecutionContext) > >>>> at java.lang.Object.wait(Object.java:485) > >>>> at org.globus.cog.karajan.workflow.ExecutionContext.waitFor(ExecutionContext.java:261) > >>>> - locked<0x00002aaab47b9dc0> (a > >>>> org.griphyn.vdl.karajan.VDL2ExecutionContext) > >>>> at org.griphyn.vdl.karajan.Loader.main(Loader.java:197) > >>>> > >>>> Locked ownable synchronizers: > >>>> - None > >>>> > >>>> "VM Thread" prio=10 tid=0x0000000047960800 nid=0x2802 runnable > >>>> > >>>> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004790e800 > >>>> nid=0x27fa runnable > >>>> > >>>> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000047910000 > >>>> nid=0x27fb runnable > >>>> > >>>> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000047912000 > >>>> nid=0x27fc runnable > >>>> > >>>> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000047914000 > >>>> nid=0x27fd runnable > >>>> > >>>> "GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000047915800 > >>>> nid=0x27fe runnable > >>>> > >>>> "GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000047917800 > >>>> nid=0x27ff runnable > >>>> > >>>> "GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000047919800 > >>>> nid=0x2800 runnable > >>>> > >>>> "GC task thread#7 (ParallelGC)" prio=10 tid=0x000000004791b000 > >>>> nid=0x2801 runnable > >>>> > >>>> "VM Periodic Task Thread" prio=10 tid=0x00002aacb8038800 nid=0x2809 > >>>> waiting on condition > >>>> > >>>> JNI global references: 1451 > >>>> > >>>> > >>>> > >>>> On 1/5/11 2:06 PM, Allan Espinosa wrote: > >>>> > >>>> Hi jon, > >>>> > >>>> Could you post a jstack trace? 
It should indicate if the code has deadlocks. > >>>> > >>>> -Allan (mobile) > >>>> > >>>> On Jan 5, 2011 4:50 PM, "Jonathan Monette" wrote: > >>>>> Hello, > >>>>> I have encountered swift hanging. The deadlock appears to be in the same place every time. This deadlock does seem to be intermittent since smaller work sizes does complete. This job size is with approximately 1200 files. The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will try to recreate the problem using simple cat jobs. > >>>>> > >>>>> -- > >>>>> Jon > >>>>> > >>>>> Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. > >>>>> - Albert Einstein > >>>>> > >>>>> _______________________________________________ > >>>>> Swift-devel mailing list > >>>>> Swift-devel at ci.uchicago.edu > >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>> > >>>> -- > >>>> Jon > >>>> > >>>> Computers are incredibly fast, accurate, and stupid. Human beings are > >>>> incredibly slow, inaccurate, and brilliant. Together they are powerful > >>>> beyond imagination. > >>>> - Albert Einstein > >>>> > >>>> > >>>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From wilde at mcs.anl.gov Sat Jan 8 16:40:33 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 8 Jan 2011 16:40:33 -0600 (CST) Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: Message-ID: <1019521736.40464.1294526433159.JavaMail.root@zimbra.anl.gov> Thanks, Justin. cc'ing back to the list, Rob, and Sheri. Sheri, maybe you can run on PADS or Fusion till this is fixed? - Mike ----- Original Message ----- > Hello > Right, Swift does not currently run on Eureka due to the following > bug in Cobalt: > > http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > I got about half of a work-around for this done... > > Justin > > On Fri, 7 Jan 2011, Michael Wilde wrote: > > > Hi Rob and Sheri, > > > > I don't know the status of Swift on Eureka, but Im eager to see it > > running there, so we'll make sure it works. > > > > A long while back I tried Swift there, and at the time we had a > > minor > > bug in the Cobalt provider. Justin may have fixed that recently on > > the > > BG/P's. So Im hoping it either works or has only some > > readily-fixable > > issues in the way. > > > > We'll try it and get back to you. > > > > In the mean time, Sheri, you might want to try a simple hello-world > > test > > on Eureka, and see if you can progress to replicating what John > > Dennis > > had done so far. > > > > Its best to send any errors you get to the swift-user list (which > > you > > should join) so that everyone on the Swift team is aware f any > > issues > > you encounter and can offer help. > > > > You should meet with Justin at Argonne (3rd floor, 240) who can > > serve as > > your Swift mentor. > > > > Sarah, David - lets add Eureka to the test matrix for release 0.92. > > Cobalt is very very close to PBS's interface, but there is a > > separate > > Swift execution provider that handles the differences. 
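As an illustrative aside on that last point, the Cobalt/PBS split usually shows up in sites.xml only as the execution provider named in a pool entry. The sketch below is not taken from this thread; the handle, project, queue, and workdirectory values are placeholders, "cobalt" as the provider name and the exact profile keys should be checked against the documentation of whichever Swift release is installed.

# Hypothetical sketch: write a minimal sites.xml pool for a Cobalt-managed
# machine such as Eureka; on a PBS cluster such as Fusion the same entry
# would name provider="pbs" instead, with the rest largely unchanged.
cat > sites.xml <<'EOF'
<config>
  <pool handle="eureka">
    <execution provider="cobalt"/>
    <profile namespace="globus" key="project">MYPROJECT</profile>
    <profile namespace="globus" key="queue">default</profile>
    <filesystem provider="local"/>
    <workdirectory>/home/someuser/swiftwork</workdirectory>
  </pool>
</config>
EOF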
> > > > Regards, > > > > Mike > > > > > > ----- Original Message ----- > >> Hi Mike, > >> > >> Sheri is going to take over some of the development work John > >> Dennis > >> was > >> doing on using swift with the AMWG diag package. > >> > >> Our platform is Eureka. Is there a development version of Swift > >> installed there? > >> > >> Rob > > > > > > -- > Justin M Wozniak -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From dk0966 at cs.ship.edu Sat Jan 8 22:46:57 2011 From: dk0966 at cs.ship.edu (David Kelly) Date: Sat, 8 Jan 2011 23:46:57 -0500 Subject: [Swift-devel] Compilation error in 0.92 Message-ID: Hello, I am getting this compilation error in the latest version of 0.92: compile: [echo] [swift]: COMPILE [mkdir] Created dir: /home/david/cog/modules/swift/build [javac] Compiling 374 source files to /home/david/cog/modules/swift/build [javac] /home/david/cog/modules/swift/src/org/griphyn/vdl/karajan/lib/Execute.java:52: cannot find symbol [javac] symbol : method setStack(org.globus.cog.abstraction.interfaces.Task,org.globus.cog.karajan.stack.VariableStack) [javac] location: class org.griphyn.vdl.karajan.lib.Execute [javac] setStack(task, stack); [javac] ^ [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 1 error From hategan at mcs.anl.gov Sat Jan 8 22:56:28 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 08 Jan 2011 20:56:28 -0800 Subject: [Swift-devel] Compilation error in 0.92 In-Reply-To: References: Message-ID: <1294548988.30003.0.camel@blabla2.none> That was fixed in r3881. On Sat, 2011-01-08 at 23:46 -0500, David Kelly wrote: > Hello, > > I am getting this compilation error in the latest version of 0.92: > > compile: > [echo] [swift]: COMPILE > [mkdir] Created dir: /home/david/cog/modules/swift/build > [javac] Compiling 374 source files to /home/david/cog/modules/swift/build > [javac] /home/david/cog/modules/swift/src/org/griphyn/vdl/karajan/lib/Execute.java:52: > cannot find symbol > [javac] symbol : method > setStack(org.globus.cog.abstraction.interfaces.Task,org.globus.cog.karajan.stack.VariableStack) > [javac] location: class org.griphyn.vdl.karajan.lib.Execute > [javac] setStack(task, stack); > [javac] ^ > [javac] Note: Some input files use unchecked or unsafe operations. > [javac] Note: Recompile with -Xlint:unchecked for details. > [javac] 1 error > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From dk0966 at cs.ship.edu Sat Jan 8 23:54:58 2011 From: dk0966 at cs.ship.edu (David Kelly) Date: Sun, 9 Jan 2011 00:54:58 -0500 Subject: [Swift-devel] Compilation error in 0.92 In-Reply-To: <1294548988.30003.0.camel@blabla2.none> References: <1294548988.30003.0.camel@blabla2.none> Message-ID: Hmm.. I am still getting the error with 3883 and 3881. I am downloading cog from https://cogkit.svn.sourceforge.net/svnroot/cogkit/trunk/current/src/cog and swift from https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92. I tried it on my laptop first, and then on login.mcs. On Sat, Jan 8, 2011 at 11:56 PM, Mihael Hategan wrote: > That was fixed in r3881. > > On Sat, 2011-01-08 at 23:46 -0500, David Kelly wrote: >> Hello, >> >> I am getting this compilation error in the latest version of 0.92: >> >> compile: >> ? ? ?[echo] [swift]: COMPILE >> ? ? 
[mkdir] Created dir: /home/david/cog/modules/swift/build >> ? ? [javac] Compiling 374 source files to /home/david/cog/modules/swift/build >> ? ? [javac] /home/david/cog/modules/swift/src/org/griphyn/vdl/karajan/lib/Execute.java:52: >> cannot find symbol >> ? ? [javac] symbol ?: method >> setStack(org.globus.cog.abstraction.interfaces.Task,org.globus.cog.karajan.stack.VariableStack) >> ? ? [javac] location: class org.griphyn.vdl.karajan.lib.Execute >> ? ? [javac] ? ? ? ? ? ? ? ? ? setStack(task, stack); >> ? ? [javac] ? ? ? ? ? ? ? ? ? ^ >> ? ? [javac] Note: Some input files use unchecked or unsafe operations. >> ? ? [javac] Note: Recompile with -Xlint:unchecked for details. >> ? ? [javac] 1 error >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From hategan at mcs.anl.gov Sun Jan 9 00:58:22 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 08 Jan 2011 22:58:22 -0800 Subject: [Swift-devel] Compilation error in 0.92 In-Reply-To: References: <1294548988.30003.0.camel@blabla2.none> Message-ID: <1294556302.770.0.camel@blabla2.none> https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/ On Sun, 2011-01-09 at 00:54 -0500, David Kelly wrote: > Hmm.. I am still getting the error with 3883 and 3881. > > I am downloading cog from > https://cogkit.svn.sourceforge.net/svnroot/cogkit/trunk/current/src/cog > and swift from https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92. > I tried it on my laptop first, and then on login.mcs. > > On Sat, Jan 8, 2011 at 11:56 PM, Mihael Hategan wrote: > > That was fixed in r3881. > > > > On Sat, 2011-01-08 at 23:46 -0500, David Kelly wrote: > >> Hello, > >> > >> I am getting this compilation error in the latest version of 0.92: > >> > >> compile: > >> [echo] [swift]: COMPILE > >> [mkdir] Created dir: /home/david/cog/modules/swift/build > >> [javac] Compiling 374 source files to /home/david/cog/modules/swift/build > >> [javac] /home/david/cog/modules/swift/src/org/griphyn/vdl/karajan/lib/Execute.java:52: > >> cannot find symbol > >> [javac] symbol : method > >> setStack(org.globus.cog.abstraction.interfaces.Task,org.globus.cog.karajan.stack.VariableStack) > >> [javac] location: class org.griphyn.vdl.karajan.lib.Execute > >> [javac] setStack(task, stack); > >> [javac] ^ > >> [javac] Note: Some input files use unchecked or unsafe operations. > >> [javac] Note: Recompile with -Xlint:unchecked for details. > >> [javac] 1 error > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > From dk0966 at cs.ship.edu Sun Jan 9 01:16:40 2011 From: dk0966 at cs.ship.edu (David Kelly) Date: Sun, 9 Jan 2011 02:16:40 -0500 Subject: [Swift-devel] Compilation error in 0.92 In-Reply-To: <1294556302.770.0.camel@blabla2.none> References: <1294548988.30003.0.camel@blabla2.none> <1294556302.770.0.camel@blabla2.none> Message-ID: Aha, got it working now. Thanks On Sun, Jan 9, 2011 at 1:58 AM, Mihael Hategan wrote: > https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/ > > On Sun, 2011-01-09 at 00:54 -0500, David Kelly wrote: >> Hmm.. I am still getting the error with 3883 and 3881. >> >> I am downloading cog from >> https://cogkit.svn.sourceforge.net/svnroot/cogkit/trunk/current/src/cog >> and swift from https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92. 
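To spell out the fix being pointed to here: the 0.92 Swift branch is meant to be built inside the matching cog 4.1.8 branch rather than cog trunk. The commands below are a rough sketch of that checkout-and-build sequence, not an excerpt from this thread; the src/cog sub-path appended to the branch URL and the ant target name are assumptions to verify against the release's build notes.

# Illustrative only: build the release-0.92 Swift branch against the
# cog 4.1.8 branch instead of cog trunk.
svn co https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/src/cog   # sub-path assumed to mirror trunk/current/src/cog
cd cog/modules
svn co https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92 swift
cd swift
ant redist        # the nightly test harness drives the "dist" target; redist is the usual manual build
export PATH=$PWD/dist/swift-svn/bin:$PATH
swift -version    # quick sanity check of the resulting build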
>> I tried it on my laptop first, and then on login.mcs. >> >> On Sat, Jan 8, 2011 at 11:56 PM, Mihael Hategan wrote: >> > That was fixed in r3881. >> > >> > On Sat, 2011-01-08 at 23:46 -0500, David Kelly wrote: >> >> Hello, >> >> >> >> I am getting this compilation error in the latest version of 0.92: >> >> >> >> compile: >> >> ? ? ?[echo] [swift]: COMPILE >> >> ? ? [mkdir] Created dir: /home/david/cog/modules/swift/build >> >> ? ? [javac] Compiling 374 source files to /home/david/cog/modules/swift/build >> >> ? ? [javac] /home/david/cog/modules/swift/src/org/griphyn/vdl/karajan/lib/Execute.java:52: >> >> cannot find symbol >> >> ? ? [javac] symbol ?: method >> >> setStack(org.globus.cog.abstraction.interfaces.Task,org.globus.cog.karajan.stack.VariableStack) >> >> ? ? [javac] location: class org.griphyn.vdl.karajan.lib.Execute >> >> ? ? [javac] ? ? ? ? ? ? ? ? ? setStack(task, stack); >> >> ? ? [javac] ? ? ? ? ? ? ? ? ? ^ >> >> ? ? [javac] Note: Some input files use unchecked or unsafe operations. >> >> ? ? [javac] Note: Recompile with -Xlint:unchecked for details. >> >> ? ? [javac] 1 error >> >> _______________________________________________ >> >> Swift-devel mailing list >> >> Swift-devel at ci.uchicago.edu >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > >> > >> > > > > From wilde at mcs.anl.gov Mon Jan 10 10:43:49 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 10 Jan 2011 10:43:49 -0600 (CST) Subject: [Swift-devel] Fwd: poor performance In-Reply-To: <8FBE44AC-0E1E-4864-A581-B68ABC9ECA35@uchicago.edu> Message-ID: <513365597.44978.1294677829480.JavaMail.root@zimbra.anl.gov> Mihael, can you take a look at Marc's logs? Do you have access to engage-submit? We should now help Marc move his work to bridled and communicado where we now have COndor-G installed and can more readily assist him in debugging. Marc, I'll try to give this more attention this week, but have to work on a deadline for today, first. - Mike ----- Forwarded Message ----- From: "Marc Parisien" To: "Michael Wilde" Sent: Monday, January 10, 2011 10:25:30 AM Subject: poor performance Hi Mike, I have a campaign running on renci for almost 2 days now; only 529 on 3000 jobs are done. IBI can weed 3000 within 2 days on 128 processors. What is the problem? Why don't I have a decent performance on making use of 10 super-computers? I have the most simplest swift script ever (aside from the "Hello World" one). the (1 Gb) log file is here: /home/parisien/Database/MCSG/ftdock-20110108-1301-xx8evgh7.log could it be because of a bad "site" that throws off all of Swift's scheduling? Or a "badly" set parameter?? Very Best, Marc. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Mon Jan 10 12:05:20 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 10 Jan 2011 12:05:20 -0600 (CST) Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: <4D2B4495.6030805@mcs.anl.gov> Message-ID: <2049325529.45559.1294682720072.JavaMail.root@zimbra.anl.gov> No, we'll either need to build it for you or you can try to build it yourself. Sarah or David, can you do a build and sanity test of Swift on Fusion today? (If not, I will do this later today...) We should get this installed as a softenv package on Fusion, PADS, and MCS machines. Thanks, Mike ----- Original Message ----- > Is swift already installed on fussion? > > -Sheri > > Robert Jacob wrote: > > > > Lets use Fusion. 
> > > > Rob > > > > > > On 1/8/11 4:40 PM, Michael Wilde wrote: > >> Thanks, Justin. cc'ing back to the list, Rob, and Sheri. > >> > >> Sheri, maybe you can run on PADS or Fusion till this is fixed? > >> > >> - Mike > >> > >> ----- Original Message ----- > >>> Hello > >>> Right, Swift does not currently run on Eureka due to the following > >>> bug in Cobalt: > >>> > >>> http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > >>> > >>> I got about half of a work-around for this done... > >>> > >>> Justin > >>> > >>> On Fri, 7 Jan 2011, Michael Wilde wrote: > >>> > >>>> Hi Rob and Sheri, > >>>> > >>>> I don't know the status of Swift on Eureka, but Im eager to see > >>>> it > >>>> running there, so we'll make sure it works. > >>>> > >>>> A long while back I tried Swift there, and at the time we had a > >>>> minor > >>>> bug in the Cobalt provider. Justin may have fixed that recently > >>>> on > >>>> the > >>>> BG/P's. So Im hoping it either works or has only some > >>>> readily-fixable > >>>> issues in the way. > >>>> > >>>> We'll try it and get back to you. > >>>> > >>>> In the mean time, Sheri, you might want to try a simple > >>>> hello-world > >>>> test > >>>> on Eureka, and see if you can progress to replicating what John > >>>> Dennis > >>>> had done so far. > >>>> > >>>> Its best to send any errors you get to the swift-user list (which > >>>> you > >>>> should join) so that everyone on the Swift team is aware f any > >>>> issues > >>>> you encounter and can offer help. > >>>> > >>>> You should meet with Justin at Argonne (3rd floor, 240) who can > >>>> serve as > >>>> your Swift mentor. > >>>> > >>>> Sarah, David - lets add Eureka to the test matrix for release > >>>> 0.92. > >>>> Cobalt is very very close to PBS's interface, but there is a > >>>> separate > >>>> Swift execution provider that handles the differences. > >>>> > >>>> Regards, > >>>> > >>>> Mike > >>>> > >>>> > >>>> ----- Original Message ----- > >>>>> Hi Mike, > >>>>> > >>>>> Sheri is going to take over some of the development work John > >>>>> Dennis > >>>>> was > >>>>> doing on using swift with the AMWG diag package. > >>>>> > >>>>> Our platform is Eureka. Is there a development version of Swift > >>>>> installed there? > >>>>> > >>>>> Rob > >>>> > >>>> > >>> > >>> -- > >>> Justin M Wozniak > >> -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From skenny at uchicago.edu Mon Jan 10 13:44:03 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Mon, 10 Jan 2011 11:44:03 -0800 Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: <2049325529.45559.1294682720072.JavaMail.root@zimbra.anl.gov> References: <4D2B4495.6030805@mcs.anl.gov> <2049325529.45559.1294682720072.JavaMail.root@zimbra.anl.gov> Message-ID: what's the full hostname for fusion? i can see if my new account is active there. On Mon, Jan 10, 2011 at 10:05 AM, Michael Wilde wrote: > No, we'll either need to build it for you or you can try to build it > yourself. > > Sarah or David, can you do a build and sanity test of Swift on Fusion > today? > (If not, I will do this later today...) > > We should get this installed as a softenv package on Fusion, PADS, and MCS > machines. > > Thanks, > > Mike > > > ----- Original Message ----- > > Is swift already installed on fussion? > > > > -Sheri > > > > Robert Jacob wrote: > > > > > > Lets use Fusion. > > > > > > Rob > > > > > > > > > On 1/8/11 4:40 PM, Michael Wilde wrote: > > >> Thanks, Justin. cc'ing back to the list, Rob, and Sheri. 
> > >> > > >> Sheri, maybe you can run on PADS or Fusion till this is fixed? > > >> > > >> - Mike > > >> > > >> ----- Original Message ----- > > >>> Hello > > >>> Right, Swift does not currently run on Eureka due to the following > > >>> bug in Cobalt: > > >>> > > >>> http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > >>> > > >>> I got about half of a work-around for this done... > > >>> > > >>> Justin > > >>> > > >>> On Fri, 7 Jan 2011, Michael Wilde wrote: > > >>> > > >>>> Hi Rob and Sheri, > > >>>> > > >>>> I don't know the status of Swift on Eureka, but Im eager to see > > >>>> it > > >>>> running there, so we'll make sure it works. > > >>>> > > >>>> A long while back I tried Swift there, and at the time we had a > > >>>> minor > > >>>> bug in the Cobalt provider. Justin may have fixed that recently > > >>>> on > > >>>> the > > >>>> BG/P's. So Im hoping it either works or has only some > > >>>> readily-fixable > > >>>> issues in the way. > > >>>> > > >>>> We'll try it and get back to you. > > >>>> > > >>>> In the mean time, Sheri, you might want to try a simple > > >>>> hello-world > > >>>> test > > >>>> on Eureka, and see if you can progress to replicating what John > > >>>> Dennis > > >>>> had done so far. > > >>>> > > >>>> Its best to send any errors you get to the swift-user list (which > > >>>> you > > >>>> should join) so that everyone on the Swift team is aware f any > > >>>> issues > > >>>> you encounter and can offer help. > > >>>> > > >>>> You should meet with Justin at Argonne (3rd floor, 240) who can > > >>>> serve as > > >>>> your Swift mentor. > > >>>> > > >>>> Sarah, David - lets add Eureka to the test matrix for release > > >>>> 0.92. > > >>>> Cobalt is very very close to PBS's interface, but there is a > > >>>> separate > > >>>> Swift execution provider that handles the differences. > > >>>> > > >>>> Regards, > > >>>> > > >>>> Mike > > >>>> > > >>>> > > >>>> ----- Original Message ----- > > >>>>> Hi Mike, > > >>>>> > > >>>>> Sheri is going to take over some of the development work John > > >>>>> Dennis > > >>>>> was > > >>>>> doing on using swift with the AMWG diag package. > > >>>>> > > >>>>> Our platform is Eureka. Is there a development version of Swift > > >>>>> installed there? > > >>>>> > > >>>>> Rob > > >>>> > > >>>> > > >>> > > >>> -- > > >>> Justin M Wozniak > > >> > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skenny at uchicago.edu Mon Jan 10 14:51:43 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Mon, 10 Jan 2011 12:51:43 -0800 Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: <4D2B629F.2060903@mcs.anl.gov> References: <4D2B4495.6030805@mcs.anl.gov> <2049325529.45559.1294682720072.JavaMail.root@zimbra.anl.gov> <4D2B629F.2060903@mcs.anl.gov> Message-ID: sorry, i can't get on...i've emailed support. ~sk On Mon, Jan 10, 2011 at 11:48 AM, Sheri Mickelson wrote: > fusion.lcrc.anl.gov > > -Sheri > > Sarah Kenny wrote: > >> what's the full hostname for fusion? i can see if my new account is active >> there. >> >> On Mon, Jan 10, 2011 at 10:05 AM, Michael Wilde > wilde at mcs.anl.gov>> wrote: >> >> No, we'll either need to build it for you or you can try to build it >> yourself. 
>> >> Sarah or David, can you do a build and sanity test of Swift on >> Fusion today? >> (If not, I will do this later today...) >> >> We should get this installed as a softenv package on Fusion, PADS, >> and MCS machines. >> >> Thanks, >> >> Mike >> >> >> ----- Original Message ----- >> > Is swift already installed on fussion? >> > >> > -Sheri >> > >> > Robert Jacob wrote: >> > > >> > > Lets use Fusion. >> > > >> > > Rob >> > > >> > > >> > > On 1/8/11 4:40 PM, Michael Wilde wrote: >> > >> Thanks, Justin. cc'ing back to the list, Rob, and Sheri. >> > >> >> > >> Sheri, maybe you can run on PADS or Fusion till this is fixed? >> > >> >> > >> - Mike >> > >> >> > >> ----- Original Message ----- >> > >>> Hello >> > >>> Right, Swift does not currently run on Eureka due to the >> following >> > >>> bug in Cobalt: >> > >>> >> > >>> http://trac.mcs.anl.gov/projects/cobalt/ticket/462 >> > >>> >> > >>> I got about half of a work-around for this done... >> > >>> >> > >>> Justin >> > >>> >> > >>> On Fri, 7 Jan 2011, Michael Wilde wrote: >> > >>> >> > >>>> Hi Rob and Sheri, >> > >>>> >> > >>>> I don't know the status of Swift on Eureka, but Im eager to see >> > >>>> it >> > >>>> running there, so we'll make sure it works. >> > >>>> >> > >>>> A long while back I tried Swift there, and at the time we had a >> > >>>> minor >> > >>>> bug in the Cobalt provider. Justin may have fixed that recently >> > >>>> on >> > >>>> the >> > >>>> BG/P's. So Im hoping it either works or has only some >> > >>>> readily-fixable >> > >>>> issues in the way. >> > >>>> >> > >>>> We'll try it and get back to you. >> > >>>> >> > >>>> In the mean time, Sheri, you might want to try a simple >> > >>>> hello-world >> > >>>> test >> > >>>> on Eureka, and see if you can progress to replicating what John >> > >>>> Dennis >> > >>>> had done so far. >> > >>>> >> > >>>> Its best to send any errors you get to the swift-user list >> (which >> > >>>> you >> > >>>> should join) so that everyone on the Swift team is aware f any >> > >>>> issues >> > >>>> you encounter and can offer help. >> > >>>> >> > >>>> You should meet with Justin at Argonne (3rd floor, 240) who can >> > >>>> serve as >> > >>>> your Swift mentor. >> > >>>> >> > >>>> Sarah, David - lets add Eureka to the test matrix for release >> > >>>> 0.92. >> > >>>> Cobalt is very very close to PBS's interface, but there is a >> > >>>> separate >> > >>>> Swift execution provider that handles the differences. >> > >>>> >> > >>>> Regards, >> > >>>> >> > >>>> Mike >> > >>>> >> > >>>> >> > >>>> ----- Original Message ----- >> > >>>>> Hi Mike, >> > >>>>> >> > >>>>> Sheri is going to take over some of the development work John >> > >>>>> Dennis >> > >>>>> was >> > >>>>> doing on using swift with the AMWG diag package. >> > >>>>> >> > >>>>> Our platform is Eureka. Is there a development version of >> Swift >> > >>>>> installed there? >> > >>>>> >> > >>>>> Rob >> > >>>> >> > >>>> >> > >>> >> > >>> -- >> > >>> Justin M Wozniak >> > >> >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wilde at mcs.anl.gov Mon Jan 10 14:59:58 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 10 Jan 2011 14:59:58 -0600 (CST) Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: Message-ID: <1683915090.46717.1294693198908.JavaMail.root@zimbra.anl.gov> I'll build it. What release (s) should we put out there? 0.91 and 0.92 candidate? Justin, if Sheri wants to replicate what John Dennis did, does she need 0.92 and CDM? - Mike ----- Original Message ----- sorry, i can't get on...i've emailed support. ~sk On Mon, Jan 10, 2011 at 11:48 AM, Sheri Mickelson < mickelso at mcs.anl.gov > wrote: fusion.lcrc.anl.gov -Sheri Sarah Kenny wrote: what's the full hostname for fusion? i can see if my new account is active there. On Mon, Jan 10, 2011 at 10:05 AM, Michael Wilde < wilde at mcs.anl.gov > wrote: No, we'll either need to build it for you or you can try to build it yourself. Sarah or David, can you do a build and sanity test of Swift on Fusion today? (If not, I will do this later today...) We should get this installed as a softenv package on Fusion, PADS, and MCS machines. Thanks, Mike ----- Original Message ----- > Is swift already installed on fussion? > > -Sheri > > Robert Jacob wrote: > > > > Lets use Fusion. > > > > Rob > > > > > > On 1/8/11 4:40 PM, Michael Wilde wrote: > >> Thanks, Justin. cc'ing back to the list, Rob, and Sheri. > >> > >> Sheri, maybe you can run on PADS or Fusion till this is fixed? > >> > >> - Mike > >> > >> ----- Original Message ----- > >>> Hello > >>> Right, Swift does not currently run on Eureka due to the following > >>> bug in Cobalt: > >>> > >>> http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > >>> > >>> I got about half of a work-around for this done... > >>> > >>> Justin > >>> > >>> On Fri, 7 Jan 2011, Michael Wilde wrote: > >>> > >>>> Hi Rob and Sheri, > >>>> > >>>> I don't know the status of Swift on Eureka, but Im eager to see > >>>> it > >>>> running there, so we'll make sure it works. > >>>> > >>>> A long while back I tried Swift there, and at the time we had a > >>>> minor > >>>> bug in the Cobalt provider. Justin may have fixed that recently > >>>> on > >>>> the > >>>> BG/P's. So Im hoping it either works or has only some > >>>> readily-fixable > >>>> issues in the way. > >>>> > >>>> We'll try it and get back to you. > >>>> > >>>> In the mean time, Sheri, you might want to try a simple > >>>> hello-world > >>>> test > >>>> on Eureka, and see if you can progress to replicating what John > >>>> Dennis > >>>> had done so far. > >>>> > >>>> Its best to send any errors you get to the swift-user list (which > >>>> you > >>>> should join) so that everyone on the Swift team is aware f any > >>>> issues > >>>> you encounter and can offer help. > >>>> > >>>> You should meet with Justin at Argonne (3rd floor, 240) who can > >>>> serve as > >>>> your Swift mentor. > >>>> > >>>> Sarah, David - lets add Eureka to the test matrix for release > >>>> 0.92. > >>>> Cobalt is very very close to PBS's interface, but there is a > >>>> separate > >>>> Swift execution provider that handles the differences. > >>>> > >>>> Regards, > >>>> > >>>> Mike > >>>> > >>>> > >>>> ----- Original Message ----- > >>>>> Hi Mike, > >>>>> > >>>>> Sheri is going to take over some of the development work John > >>>>> Dennis > >>>>> was > >>>>> doing on using swift with the AMWG diag package. > >>>>> > >>>>> Our platform is Eureka. Is there a development version of Swift > >>>>> installed there? 
> >>>>> > >>>>> Rob > >>>> > >>>> > >>> > >>> -- > >>> Justin M Wozniak > >> -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: From skenny at uchicago.edu Mon Jan 10 17:17:36 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Mon, 10 Jan 2011 15:17:36 -0800 Subject: [Swift-devel] running nightly.sh on pads Message-ID: so, i'm trying to get nightly.sh to run on pads with coasters and i'm not quite sure where this is falling apart. so far the only thing i've edited is providers/ssh-pbs-coasters/sites.template.xml (allowing it to take the PROJECT and QUEUE variables). from what i can tell the sites.xml file does get generated correctly but then according to the test output it times out during submission: [skenny at login1 tests]$ ./nightly.sh -c -g -s groups/group-ssh.sh RUNNING_IN: /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10 HTML_OUTPUT: /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10/tests-2011-01-10.html which: no ifconfig in (/ci/projects/cnari/apps/freesurfer64/bin:/ci/projects/cnari/apps/freesurfer64/fsfast/bin:/ci/projects/cnari/apps/freesurfer64/mni/bin:/ci/projects/cnari/usr/bin:/ci/projects/cnari/apps/afni:/ci/projects/cnari/apps/swift/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/apache-ant-1.7.1-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r2/bin:/soft/globus-4.2.1-r2/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/skenny/bin/linux-rhel5-x86_64:/home/skenny/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/openmpi-1.4.2-gcc4.1-r1/bin) GROUPLISTFILE: groups/group-ssh.sh Prolog: Build Executing (part 1) /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests Executing (part 2) /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift 14815 pts/27 00:00:00 nightly.sh monitor(1): killing test process... touch: cannot touch `killed_test': Stale NFS file handle monitor(1): killed process_exec (TERM) process_exec_trap() killing all swifts... ++ echo 13685 13685 ++ ps -f UID PID PPID C STIME TTY TIME CMD skenny 14815 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g -s groups/group-ssh.sh skenny 14816 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g -s groups/group-ssh.sh skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g -s groups/group-ssh.sh skenny 15503 15473 7 15:55 pts/27 00:00:08 /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar -Dant.home=/soft/apache-ant-1.7.1-r1 -Dant. 
skenny 15890 14815 0 15:57 pts/27 00:00:00 ps -f skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash ./nightly.sh: line 588: 14819 Killed "$@" > $OUTPUT 2>&1 +++ ps -f +++ grep '.*java' +++ grep -v grep ++ kill_this skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. -Djava.security.egd=file:///dev/urandom -classpath /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-pro
vider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo skenny 15503 15473 7 15:55 pts/27 00:00:08 /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar -Dant.home=/soft/apache-ant-1.7.1-r1 -Dant.library.dir=/soft/apache-ant-1.7.1-r1/lib org.apache.tools.ant.launch.Launcher -cp :./ -quiet dist ++ '[' -n 14879 ']' ++ /bin/kill -KILL 14879 ++ set +x Executing Package (part 3) /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift Executing Package (part 4) /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib Executing Package (part 5) /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib Executing Package (part 6) /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/swift Part 1: SSH with PBS and Coasters Configuration Test Using: /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/sites.template.xml Using: /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/tc.template.data `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/etc/swift.properties' -> `./swift.properties' `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift' -> `./001-catsn-ssh-pbs-coasters.swift' Executing /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift (part 1) 16623 pts/27 00:00:00 nightly.sh monitor(1): killing test process... monitor(1): killed process_exec (TERM) process_exec_trap() killing all swifts... 
++ echo 15473 15473 ++ ps -f UID PID PPID C STIME TTY TIME CMD skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g -s groups/group-ssh.sh skenny 16623 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g -s groups/group-ssh.sh skenny 16624 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g -s groups/group-ssh.sh skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ skenny 17414 16623 0 16:06 pts/27 00:00:00 ps -f skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash ./nightly.sh: line 588: 16627 Killed "$@" > $OUTPUT 2>&1 +++ ps -f +++ grep '.*java' +++ grep -v grep ++ kill_this skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. -Djava.security.egd=file:///dev/urandom -classpath /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt
4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo ++ '[' -n 16687 ']' ++ /bin/kill -KILL 16687 ++ set +x kill 16624: No such process TOOK: 500 FAILED Swift svn swift-r3921 (swift modified locally) cog-r3013 RunID: 20110110-1558-ojtlnxfb Progress: Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 Progress: Selecting site:9 Initializing site shared directory:1 nightly.sh: monitor(1): killed: exceeded 500 seconds FAILED ++ seq --format %04.f 1 1 10 + for count in '`seq --format "%04.f" 1 1 10`' + '[' -f catsn.0001.out ']' + exit 1 ---------------------------------------------------------------- i'm running this directly on the pads login and seeing this in the swift log: 2011-01-10 16:33:18,539-0600 INFO TransportProtocolCommon The Transport Protocol thread failed java.io.IOException: The socket is EOF at com.sshtools.j2ssh.transport.TransportProtocolInputStream.readBufferedData(TransportProtocolInputStream.java:183) at com.sshtools.j2ssh.transport.TransportProtocolInputStream.readMessage(TransportProtocolInputStream.java:226) at com.sshtools.j2ssh.transport.TransportProtocolCommon.processMessages(TransportProtocolCommon.java:1440) at com.sshtools.j2ssh.transport.TransportProtocolCommon.startBinaryPacketProtocol(TransportProtocolCommon.java:1034) at com.sshtools.j2ssh.transport.TransportProtocolCommon.run(TransportProtocolCommon.java:393) at 
java.lang.Thread.run(Thread.java:619) you can view the test output here: http://www.ci.uchicago.edu/~skenny/swift_tests/run-2011-01-10/tests-2011-01-10.html anyway, thought i'd post this in case there's something that might jump out at any of you that i can tweak... ~sk -------------- next part -------------- An HTML attachment was scrubbed... URL: From dk0966 at cs.ship.edu Mon Jan 10 18:27:18 2011 From: dk0966 at cs.ship.edu (David Kelly) Date: Mon, 10 Jan 2011 19:27:18 -0500 Subject: [Swift-devel] running nightly.sh on pads In-Reply-To: References: Message-ID: Maybe try increasing the time in the .timeout file? I usually see something similar when the job exceeds the timeout value On Jan 10, 2011 6:17 PM, "Sarah Kenny" wrote: > so, i'm trying to get nightly.sh to run on pads with coasters and i'm not > quite sure where this is falling apart. so far the only thing i've edited is > providers/ssh-pbs-coasters/sites.template.xml (allowing it to take the > PROJECT and QUEUE variables). from what i can tell the sites.xml file does > get generated correctly but then according to the test output it times out > during submission: > > [skenny at login1 tests]$ ./nightly.sh -c -g -s groups/group-ssh.sh > RUNNING_IN: > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10 > HTML_OUTPUT: > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10/tests-2011-01-10.html > which: no ifconfig in > (/ci/projects/cnari/apps/freesurfer64/bin:/ci/projects/cnari/apps/freesurfer64/fsfast/bin:/ci/projects/cnari/apps/freesurfer64/mni/bin:/ci/projects/cnari/usr/bin:/ci/projects/cnari/apps/afni:/ci/projects/cnari/apps/swift/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/apache-ant-1.7.1-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r2/bin:/soft/globus-4.2.1-r2/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/skenny/bin/linux-rhel5-x86_64:/home/skenny/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/openmpi-1.4.2-gcc4.1-r1/bin) > GROUPLISTFILE: groups/group-ssh.sh > > Prolog: Build > > Executing (part 1) > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests > Executing (part 2) > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift > 14815 pts/27 00:00:00 nightly.sh > monitor(1): killing test process... > touch: cannot touch `killed_test': Stale NFS file handle > monitor(1): killed process_exec (TERM) > process_exec_trap() > killing all swifts... > ++ echo 13685 > 13685 > ++ ps -f > UID PID PPID C STIME TTY TIME CMD > skenny 14815 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g > -s groups/group-ssh.sh > skenny 14816 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g > -s groups/group-ssh.sh > skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M > -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ > skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g > -s groups/group-ssh.sh > skenny 15503 15473 7 15:55 pts/27 00:00:08 > /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath > /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar > -Dant.home=/soft/apache-ant-1.7.1-r1 -Dant. 
> skenny 15890 14815 0 15:57 pts/27 00:00:00 ps -f > skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash > ./nightly.sh: line 588: 14819 Killed "$@" > $OUTPUT 2>&1 > +++ ps -f > +++ grep '.*java' > +++ grep -v grep > ++ kill_this skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M > -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed > -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= > login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. > -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. > -Djava.security.egd=file:///dev/urandom -classpath > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/sw
ift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo > skenny 15503 15473 7 15:55 pts/27 00:00:08 > /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath > /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar > -Dant.home=/soft/apache-ant-1.7.1-r1 > -Dant.library.dir=/soft/apache-ant-1.7.1-r1/lib > org.apache.tools.ant.launch.Launcher -cp :./ -quiet dist > ++ '[' -n 14879 ']' > ++ /bin/kill -KILL 14879 > ++ set +x > Executing Package (part 3) > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift > Executing Package (part 4) > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib > Executing Package (part 5) > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib > Executing Package (part 6) > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/swift > > Part 1: SSH with PBS and Coasters Configuration Test > > Using: > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/sites.template.xml > Using: > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/tc.template.data > `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/etc/swift.properties' > -> `./swift.properties' > `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift' > -> `./001-catsn-ssh-pbs-coasters.swift' > Executing > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift > (part 1) > 16623 pts/27 00:00:00 nightly.sh > monitor(1): killing test process... > monitor(1): killed process_exec (TERM) > process_exec_trap() > killing all swifts... 
> ++ echo 15473 > 15473 > ++ ps -f > UID PID PPID C STIME TTY TIME CMD > skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g > -s groups/group-ssh.sh > skenny 16623 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g > -s groups/group-ssh.sh > skenny 16624 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g > -s groups/group-ssh.sh > skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M > -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ > skenny 17414 16623 0 16:06 pts/27 00:00:00 ps -f > skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash > ./nightly.sh: line 588: 16627 Killed "$@" > $OUTPUT 2>&1 > +++ ps -f > +++ grep '.*java' > +++ grep -v grep > ++ kill_this skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M > -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed > -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= > login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. > -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. > -Djava.security.egd=file:///dev/urandom -classpath > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modul
es/swift/dist/swift-svn/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo > ++ '[' -n 16687 ']' > ++ /bin/kill -KILL 16687 > ++ set +x > kill 16624: No such process > TOOK: 500 > FAILED > Swift svn swift-r3921 (swift modified locally) cog-r3013 > > RunID: 20110110-1558-ojtlnxfb > Progress: > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > Progress: Selecting site:9 Initializing site shared directory:1 > nightly.sh: monitor(1): killed: exceeded 500 seconds > FAILED > ++ seq --format %04.f 1 1 10 > + for count in '`seq --format "%04.f" 1 1 10`' > + '[' -f catsn.0001.out ']' > + exit 1 > > ---------------------------------------------------------------- > > i'm running this directly on the pads login and seeing this in the swift > log: > > 2011-01-10 16:33:18,539-0600 INFO TransportProtocolCommon The Transport > Protocol thread > failed > > java.io.IOException: The socket is > EOF > > at > com.sshtools.j2ssh.transport.TransportProtocolInputStream.readBufferedData(TransportProtocolInputStream.java:183) > > at > com.sshtools.j2ssh.transport.TransportProtocolInputStream.readMessage(TransportProtocolInputStream.java:226) > > at > com.sshtools.j2ssh.transport.TransportProtocolCommon.processMessages(TransportProtocolCommon.java:1440) > > at > 
com.sshtools.j2ssh.transport.TransportProtocolCommon.startBinaryPacketProtocol(TransportProtocolCommon.java:1034) > > at > com.sshtools.j2ssh.transport.TransportProtocolCommon.run(TransportProtocolCommon.java:393) > > at > java.lang.Thread.run(Thread.java:619) > > > > you can view the test output here: > > http://www.ci.uchicago.edu/~skenny/swift_tests/run-2011-01-10/tests-2011-01-10.html > > anyway, thought i'd post this in case there's something that might jump out > at any of you that i can tweak... > > ~sk -------------- next part -------------- An HTML attachment was scrubbed... URL: From wozniak at mcs.anl.gov Mon Jan 10 18:59:18 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 10 Jan 2011 18:59:18 -0600 (CST) Subject: [Swift-devel] running nightly.sh on pads In-Reply-To: References: Message-ID: Hopefully that will do it- the default timeout is only 30 seconds. I recommend running with -a to skip the ant build, and -p to skip something else that is in there. I will also take a look at why you might be getting the error messages you are and try to clean some of that up. Justin On Mon, 10 Jan 2011, David Kelly wrote: > Maybe try increasing the time in the .timeout file? I usually see something > similar when the job exceeds the timeout value > On Jan 10, 2011 6:17 PM, "Sarah Kenny" wrote: >> so, i'm trying to get nightly.sh to run on pads with coasters and i'm not >> quite sure where this is falling apart. so far the only thing i've edited > is >> providers/ssh-pbs-coasters/sites.template.xml (allowing it to take the >> PROJECT and QUEUE variables). from what i can tell the sites.xml file does >> get generated correctly but then according to the test output it times out >> during submission: >> >> [skenny at login1 tests]$ ./nightly.sh -c -g -s groups/group-ssh.sh >> RUNNING_IN: >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10 >> HTML_OUTPUT: >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10/tests-2011-01-10.html >> which: no ifconfig in >> > (/ci/projects/cnari/apps/freesurfer64/bin:/ci/projects/cnari/apps/freesurfer64/fsfast/bin:/ci/projects/cnari/apps/freesurfer64/mni/bin:/ci/projects/cnari/usr/bin:/ci/projects/cnari/apps/afni:/ci/projects/cnari/apps/swift/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/apache-ant-1.7.1-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r2/bin:/soft/globus-4.2.1-r2/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/skenny/bin/linux-rhel5-x86_64:/home/skenny/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/openmpi-1.4.2-gcc4.1-r1/bin) >> GROUPLISTFILE: groups/group-ssh.sh >> >> Prolog: Build >> >> Executing (part 1) >> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests >> Executing (part 2) >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift >> 14815 pts/27 00:00:00 nightly.sh >> monitor(1): killing test process... >> touch: cannot touch `killed_test': Stale NFS file handle >> monitor(1): killed process_exec (TERM) >> process_exec_trap() >> killing all swifts... 
>> ++ echo 13685 >> 13685 >> ++ ps -f >> UID PID PPID C STIME TTY TIME CMD >> skenny 14815 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >> -s groups/group-ssh.sh >> skenny 14816 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >> -s groups/group-ssh.sh >> skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M >> > -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ >> skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >> -s groups/group-ssh.sh >> skenny 15503 15473 7 15:55 pts/27 00:00:08 >> /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath >> /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar >> -Dant.home=/soft/apache-ant-1.7.1-r1 -Dant. >> skenny 15890 14815 0 15:57 pts/27 00:00:00 ps -f >> skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash >> ./nightly.sh: line 588: 14819 Killed "$@" > $OUTPUT 2>&1 >> +++ ps -f >> +++ grep '.*java' >> +++ grep -v grep >> ++ kill_this skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M >> > -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed >> -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= >> > login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. >> > -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. >> -Djava.security.egd=file:///dev/urandom -classpath >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest 
/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provi der-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/mo dules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo >> skenny 15503 15473 7 15:55 pts/27 00:00:08 >> /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath >> /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar >> -Dant.home=/soft/apache-ant-1.7.1-r1 >> -Dant.library.dir=/soft/apache-ant-1.7.1-r1/lib >> org.apache.tools.ant.launch.Launcher -cp :./ -quiet dist >> ++ '[' -n 14879 ']' >> ++ /bin/kill -KILL 14879 >> ++ set +x >> Executing Package (part 3) >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift >> Executing Package (part 4) >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib >> Executing Package (part 5) >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib >> Executing Package (part 6) >> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/swift >> >> Part 1: SSH with 
PBS and Coasters Configuration Test >> >> Using: >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/sites.template.xml >> Using: >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/tc.template.data >> > `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/etc/swift.properties' >> -> `./swift.properties' >> > `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift' >> -> `./001-catsn-ssh-pbs-coasters.swift' >> Executing >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift >> (part 1) >> 16623 pts/27 00:00:00 nightly.sh >> monitor(1): killing test process... >> monitor(1): killed process_exec (TERM) >> process_exec_trap() >> killing all swifts... >> ++ echo 15473 >> 15473 >> ++ ps -f >> UID PID PPID C STIME TTY TIME CMD >> skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >> -s groups/group-ssh.sh >> skenny 16623 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >> -s groups/group-ssh.sh >> skenny 16624 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >> -s groups/group-ssh.sh >> skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M >> > -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ >> skenny 17414 16623 0 16:06 pts/27 00:00:00 ps -f >> skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash >> ./nightly.sh: line 588: 16627 Killed "$@" > $OUTPUT 2>&1 >> +++ ps -f >> +++ grep '.*java' >> +++ grep -v grep >> ++ kill_this skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M >> > -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed >> -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= >> > login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. >> > -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. 
>> -Djava.security.egd=file:///dev/urandom -classpath >> > /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest /cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provi der-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/mo 
dules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo >> ++ '[' -n 16687 ']' >> ++ /bin/kill -KILL 16687 >> ++ set +x >> kill 16624: No such process >> TOOK: 500 >> FAILED >> Swift svn swift-r3921 (swift modified locally) cog-r3013 >> >> RunID: 20110110-1558-ojtlnxfb >> Progress: >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> Progress: Selecting site:9 Initializing site shared directory:1 >> nightly.sh: monitor(1): killed: exceeded 500 seconds >> FAILED >> ++ seq --format %04.f 1 1 10 >> + for count in '`seq --format "%04.f" 1 1 10`' >> + '[' -f catsn.0001.out ']' >> + exit 1 >> >> ---------------------------------------------------------------- >> >> i'm running this directly on the pads login and seeing this in the swift >> log: >> >> 2011-01-10 16:33:18,539-0600 INFO TransportProtocolCommon The Transport >> Protocol thread >> failed >> >> java.io.IOException: The socket is >> EOF >> >> at >> > com.sshtools.j2ssh.transport.TransportProtocolInputStream.readBufferedData(TransportProtocolInputStream.java:183) >> >> at >> > com.sshtools.j2ssh.transport.TransportProtocolInputStream.readMessage(TransportProtocolInputStream.java:226) >> >> at >> > com.sshtools.j2ssh.transport.TransportProtocolCommon.processMessages(TransportProtocolCommon.java:1440) >> >> at >> > com.sshtools.j2ssh.transport.TransportProtocolCommon.startBinaryPacketProtocol(TransportProtocolCommon.java:1034) >> >> at >> > com.sshtools.j2ssh.transport.TransportProtocolCommon.run(TransportProtocolCommon.java:393) >> >> at >> java.lang.Thread.run(Thread.java:619) >> >> >> >> you can view the test output here: >> >> > http://www.ci.uchicago.edu/~skenny/swift_tests/run-2011-01-10/tests-2011-01-10.html >> >> anyway, thought i'd post this in case there's something that might jump > out >> at any of you that i can tweak... 
>> >> ~sk
>
--
Justin M Wozniak

From bugzilla-daemon at mcs.anl.gov Mon Jan 10 21:37:11 2011
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Mon, 10 Jan 2011 21:37:11 -0600 (CST)
Subject: [Swift-devel] [Bug 243] New: Block Submitter error when using SGE and coasters
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=243

           Summary: Block Submitter error when using SGE and coasters
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: hategan at mcs.anl.gov
        ReportedBy: dk0966 at cs.ship.edu

When using a configuration with SGE and coasters, an exception is thrown related to Block Submitter.

RunID: 20110110-2105-nuakb162
Progress:
Exception in thread "Block Submitter" java.lang.NullPointerException
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.taskFailed(Cpu.java:302)
    at org.globus.cog.abstraction.coaster.service.job.manager.Block.taskFailed(Block.java:330)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:76)
Failed to shut down block: Block 0110-050925-000000 (4x3600.000s)
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Can only cancel an active task
    at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:179)
    at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85)
    at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69)
    at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102)
    at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:45)
    at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:302)
    at org.globus.cog.abstraction.coaster.service.job.manager.Block.shutdown(Block.java:282)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.cleanDoneBlocks(BlockQueueProcessor.java:177)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:496)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:100)

Previous versions of swift also sometimes threw exceptions with this configuration. They were usually related to changes in the formatting of qstat or the "pe" settings not being interpreted correctly. Two patches exist which should fix these - my patch for reading qstat information as xml, and Mike's patch for the pe settings. This particular error seems unrelated - showing up before and after the other patches are applied.
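For reference, a run of this kind would be launched roughly as follows; the exact command line is not recorded in this report, so the flags shown are only the usual Swift 0.92 options, and -n simply overrides the @arg("n","10") default in the script listed below:

    swift -sites.file sites.xml -tc.file tc.data catsn.swift -n=10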
Tested with swift 0.92 on ibicluster using the following configuration files

sites.xml:

threaded 4 128 1 1 5.11 10000 /cchome/dkelly/swiftwork

tc.data:

sge-coasters  echo   /bin/echo    INSTALLED  INTEL32::LINUX
sge-coasters  cat    /bin/cat     INSTALLED  INTEL32::LINUX
sge-coasters  ls     /bin/ls      INSTALLED  INTEL32::LINUX
sge-coasters  grep   /bin/grep    INSTALLED  INTEL32::LINUX
sge-coasters  sort   /bin/sort    INSTALLED  INTEL32::LINUX
sge-coasters  paste  /bin/paste   INSTALLED  INTEL32::LINUX
sge-coasters  wc     /usr/bin/wc  INSTALLED  INTEL32::LINUX

catsn.swift:

type file;

app (file o) cat (file i) {
  cat @i stdout=@o;
}

string t = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
string char[] = @strsplit(t, "");

file out[];

foreach j in [1:@toint(@arg("n","10"))] {
  file data<"data.txt">;
  out[j] = cat(data);
}

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

From bugzilla-daemon at mcs.anl.gov Tue Jan 11 00:33:48 2011
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 11 Jan 2011 00:33:48 -0600 (CST)
Subject: [Swift-devel] [Bug 243] Block Submitter error when using SGE and coasters
In-Reply-To: 
References: 
Message-ID: <20110111063413.E4CB7563FE@wind-2.mcs.anl.gov>
Thanks, Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From bugzilla-daemon at mcs.anl.gov Tue Jan 11 10:11:58 2011 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 11 Jan 2011 10:11:58 -0600 (CST) Subject: [Swift-devel] [Bug 245] New: Coasters/Eureka does not work Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=245 Summary: Coasters/Eureka does not work Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Specific site issues AssignedTo: benc at hawaga.org.uk ReportedBy: wozniak at mcs.anl.gov Coasters does not work on Eureka due to the following Cobalt issue: http://trac.mcs.anl.gov/projects/cobalt/ticket/462 Possible work-around: substitute arguments in worker.pl: my $URISTR=$ARGV[0]; my $BLOCKID=$ARGV[1]; my $LOGDIR=$ARGV[2]; #WORKER_CODE_SUBST defined $URISTR || die "Not given: URI\n"; defined $BLOCKID || die "Not given: BLOCKID\n"; defined $LOGDIR || die "Not given: LOGDIR\n"; via ScriptManager.writeScript() and provide this to users via site profile. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From bugzilla-daemon at mcs.anl.gov Tue Jan 11 10:21:53 2011 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 11 Jan 2011 10:21:53 -0600 (CST) Subject: [Swift-devel] [Bug 245] Coasters/Eureka does not work In-Reply-To: References: Message-ID: <20110111162153.2DF90563FE@wind-2.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=245 Justin Wozniak changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|benc at hawaga.org.uk |nobody at mcs.anl.gov -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From bugzilla-daemon at mcs.anl.gov Tue Jan 11 10:20:30 2011 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 11 Jan 2011 10:20:30 -0600 (CST) Subject: [Swift-devel] [Bug 31] error message should not refer to java exception classes In-Reply-To: References: Message-ID: <20110111162147.26D47563F9@wind-2.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=31 Justin Wozniak changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED CC| |wozniak at mcs.anl.gov Resolution|FIXED | --- Comment #4 from Justin Wozniak 2011-01-11 10:20:30 --- Should provide developer setting (via log4j?) to enable full Java stack output. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. You are watching the reporter. From wozniak at mcs.anl.gov Tue Jan 11 10:27:14 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 11 Jan 2011 10:27:14 -0600 (CST) Subject: [Swift-devel] devel prios for Mihael In-Reply-To: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> References: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> Message-ID: On Tue, 11 Jan 2011, Michael Wilde wrote: > - fix Cobalt provider for general Swift use on Eureka > (Justin: please file info on this as a bugzilla) Filed. 
> - assess coaster time gap issue with Justin; fix if needed I am going to run a couple cases and post the observed gaps from worker.pl PROFILE_EVENTS. > - assess coaster polling issue with Justin; fix if needed I'm going to run a quick test for this as well. -- Justin M Wozniak From wozniak at mcs.anl.gov Tue Jan 11 10:29:24 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 11 Jan 2011 10:29:24 -0600 (CST) Subject: [Swift-devel] Bugzilla admin request Message-ID: Can someone add me as a Bugzilla admin? Thanks -- Justin M Wozniak From wilde at mcs.anl.gov Tue Jan 11 10:49:12 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 10:49:12 -0600 (CST) Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: <2049325529.45559.1294682720072.JavaMail.root@zimbra.anl.gov> Message-ID: <1357754317.49737.1294764552708.JavaMail.root@zimbra.anl.gov> Sheri, All, I have built Swift on Fusion (using the 0.92 branches of swift and cog) To use: PATH=/home/wilde/swift/rev/0.92/bin:$PATH swift -etc The source was built from /home/wilde/swift/src/0.92 on fusion I have yet to test, but Sheri this should help you get started. Please bear with us if the going starts out rough here. We will also work on making Swift run on Eureka. Sample (but untested) config files to use on PBS here are at: /home/wilde/swift/lab/{pbs.xml,tc,cf} Please join and send all problems to swift-user: http://www.ci.uchicago.edu/swift/support/index.php Im hoping that Sarah and David will soon have logins on Fusion and can help certify this release on that system. Regards, Mike ----- Original Message ----- > No, we'll either need to build it for you or you can try to build it > yourself. > > Sarah or David, can you do a build and sanity test of Swift on Fusion > today? > (If not, I will do this later today...) > > We should get this installed as a softenv package on Fusion, PADS, and > MCS machines. > > Thanks, > > Mike > > > ----- Original Message ----- > > Is swift already installed on fussion? > > > > -Sheri > > > > Robert Jacob wrote: > > > > > > Lets use Fusion. > > > > > > Rob > > > > > > > > > On 1/8/11 4:40 PM, Michael Wilde wrote: > > >> Thanks, Justin. cc'ing back to the list, Rob, and Sheri. > > >> > > >> Sheri, maybe you can run on PADS or Fusion till this is fixed? > > >> > > >> - Mike > > >> > > >> ----- Original Message ----- > > >>> Hello > > >>> Right, Swift does not currently run on Eureka due to the > > >>> following > > >>> bug in Cobalt: > > >>> > > >>> http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > >>> > > >>> I got about half of a work-around for this done... > > >>> > > >>> Justin > > >>> > > >>> On Fri, 7 Jan 2011, Michael Wilde wrote: > > >>> > > >>>> Hi Rob and Sheri, > > >>>> > > >>>> I don't know the status of Swift on Eureka, but Im eager to see > > >>>> it > > >>>> running there, so we'll make sure it works. > > >>>> > > >>>> A long while back I tried Swift there, and at the time we had a > > >>>> minor > > >>>> bug in the Cobalt provider. Justin may have fixed that recently > > >>>> on > > >>>> the > > >>>> BG/P's. So Im hoping it either works or has only some > > >>>> readily-fixable > > >>>> issues in the way. > > >>>> > > >>>> We'll try it and get back to you. > > >>>> > > >>>> In the mean time, Sheri, you might want to try a simple > > >>>> hello-world > > >>>> test > > >>>> on Eureka, and see if you can progress to replicating what John > > >>>> Dennis > > >>>> had done so far. 
> > >>>> > > >>>> Its best to send any errors you get to the swift-user list > > >>>> (which > > >>>> you > > >>>> should join) so that everyone on the Swift team is aware f any > > >>>> issues > > >>>> you encounter and can offer help. > > >>>> > > >>>> You should meet with Justin at Argonne (3rd floor, 240) who can > > >>>> serve as > > >>>> your Swift mentor. > > >>>> > > >>>> Sarah, David - lets add Eureka to the test matrix for release > > >>>> 0.92. > > >>>> Cobalt is very very close to PBS's interface, but there is a > > >>>> separate > > >>>> Swift execution provider that handles the differences. > > >>>> > > >>>> Regards, > > >>>> > > >>>> Mike > > >>>> > > >>>> > > >>>> ----- Original Message ----- > > >>>>> Hi Mike, > > >>>>> > > >>>>> Sheri is going to take over some of the development work John > > >>>>> Dennis > > >>>>> was > > >>>>> doing on using swift with the AMWG diag package. > > >>>>> > > >>>>> Our platform is Eureka. Is there a development version of > > >>>>> Swift > > >>>>> installed there? > > >>>>> > > >>>>> Rob > > >>>> > > >>>> > > >>> > > >>> -- > > >>> Justin M Wozniak > > >> > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 12:07:57 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 12:07:57 -0600 (CST) Subject: [Swift-devel] Re: poor performance In-Reply-To: <513365597.44978.1294677829480.JavaMail.root@zimbra.anl.gov> Message-ID: <831327385.50297.1294769277319.JavaMail.root@zimbra.anl.gov> Mihael, I moved the log below to ~wilde/ftdock on the CI net. I tried to run swift-plot-log on it, and that failed ( i got nothing). I'll file the latter issue in bugzilla. - Mike ----- Original Message ----- > Mihael, can you take a look at Marc's logs? Do you have access to > engage-submit? > > We should now help Marc move his work to bridled and communicado where > we now have COndor-G installed and can more readily assist him in > debugging. > > Marc, I'll try to give this more attention this week, but have to work > on a deadline for today, first. > > - Mike > > > ----- Forwarded Message ----- > From: "Marc Parisien" > To: "Michael Wilde" > Sent: Monday, January 10, 2011 10:25:30 AM > Subject: poor performance > > Hi Mike, > > I have a campaign running on renci for almost 2 days now; only 529 on > 3000 jobs are done. IBI can weed 3000 within 2 days on 128 processors. > What is the problem? Why don't I have a decent performance on making > use of 10 super-computers? I have the most simplest swift script ever > (aside from the "Hello World" one). > > the (1 Gb) log file is here: > /home/parisien/Database/MCSG/ftdock-20110108-1301-xx8evgh7.log > > could it be because of a bad "site" that throws off all of Swift's > scheduling? Or a "badly" set parameter?? > > > Very Best, > Marc. 
> > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From skenny at uchicago.edu Tue Jan 11 12:40:36 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Tue, 11 Jan 2011 10:40:36 -0800 Subject: [Swift-devel] running nightly.sh on pads In-Reply-To: References: Message-ID: so i tried again changing the timeout from 500 to 900 and got a slightly different error in the log. all the files are here: http://www.ci.uchicago.edu/~skenny/swift_tests/run-2011-01-11/tests-2011-01-11.html thought i would go ahead and post to the list in case we wanted to look at it during the meeting. ~sk On Mon, Jan 10, 2011 at 4:59 PM, Justin M Wozniak wrote: > > Hopefully that will do it- the default timeout is only 30 seconds. > > I recommend running with -a to skip the ant build, and -p to skip something > else that is in there. > > I will also take a look at why you might be getting the error messages you > are and try to clean some of that up. > > Justin > > > On Mon, 10 Jan 2011, David Kelly wrote: > > Maybe try increasing the time in the .timeout file? I usually see >> something >> similar when the job exceeds the timeout value >> On Jan 10, 2011 6:17 PM, "Sarah Kenny" wrote: >> >>> so, i'm trying to get nightly.sh to run on pads with coasters and i'm not >>> quite sure where this is falling apart. so far the only thing i've edited >>> >> is >> >>> providers/ssh-pbs-coasters/sites.template.xml (allowing it to take the >>> PROJECT and QUEUE variables). from what i can tell the sites.xml file >>> does >>> get generated correctly but then according to the test output it times >>> out >>> during submission: >>> >>> [skenny at login1 tests]$ ./nightly.sh -c -g -s groups/group-ssh.sh >>> RUNNING_IN: >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10 >> >>> HTML_OUTPUT: >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10/tests-2011-01-10.html >> >>> which: no ifconfig in >>> >>> (/ci/projects/cnari/apps/freesurfer64/bin:/ci/projects/cnari/apps/freesurfer64/fsfast/bin:/ci/projects/cnari/apps/freesurfer64/mni/bin:/ci/projects/cnari/usr/bin:/ci/projects/cnari/apps/afni:/ci/projects/cnari/apps/swift/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/apache-ant-1.7.1-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r2/bin:/soft/globus-4.2.1-r2/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/skenny/bin/linux-rhel5-x86_64:/home/skenny/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/openmpi-1.4.2-gcc4.1-r1/bin) >> >>> GROUPLISTFILE: groups/group-ssh.sh >>> >>> Prolog: Build >>> >>> Executing (part 1) >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests >>> Executing (part 2) >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift >> >>> 14815 pts/27 00:00:00 nightly.sh >>> monitor(1): killing test process... >>> touch: cannot touch `killed_test': Stale NFS file handle >>> monitor(1): killed process_exec (TERM) >>> process_exec_trap() >>> killing all swifts... 
>>> ++ echo 13685 >>> 13685 >>> ++ ps -f >>> UID PID PPID C STIME TTY TIME CMD >>> skenny 14815 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>> -s groups/group-ssh.sh >>> skenny 14816 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>> -s groups/group-ssh.sh >>> skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M >>> >>> -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ >> >>> skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>> -s groups/group-ssh.sh >>> skenny 15503 15473 7 15:55 pts/27 00:00:08 >>> /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath >>> /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar >>> -Dant.home=/soft/apache-ant-1.7.1-r1 -Dant. >>> skenny 15890 14815 0 15:57 pts/27 00:00:00 ps -f >>> skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash >>> ./nightly.sh: line 588: 14819 Killed "$@" > $OUTPUT 2>&1 >>> +++ ps -f >>> +++ grep '.*java' >>> +++ grep -v grep >>> ++ kill_this skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M >>> >>> -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed >> >>> -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= >>> >>> login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. >> >>> >>> -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. >> >>> -Djava.security.egd=file:///dev/urandom -classpath >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest >> > > 
/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provi > > der-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/mo > > dules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo > >> skenny 15503 15473 7 15:55 pts/27 00:00:08 >>> /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath >>> /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar >>> -Dant.home=/soft/apache-ant-1.7.1-r1 >>> -Dant.library.dir=/soft/apache-ant-1.7.1-r1/lib >>> org.apache.tools.ant.launch.Launcher -cp :./ -quiet dist >>> ++ '[' -n 14879 ']' >>> ++ /bin/kill -KILL 14879 >>> ++ set +x >>> Executing Package (part 3) >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift >> >>> Executing Package (part 4) >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib >> >>> Executing Package (part 5) >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib >> >>> Executing Package (part 6) >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests >>> >>> 
/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/swift >> >>> >>> Part 1: SSH with PBS and Coasters Configuration Test >>> >>> Using: >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/sites.template.xml >> >>> Using: >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/tc.template.data >> >>> >>> `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/etc/swift.properties' >> >>> -> `./swift.properties' >>> >>> `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift' >> >>> -> `./001-catsn-ssh-pbs-coasters.swift' >>> Executing >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift >> >>> (part 1) >>> 16623 pts/27 00:00:00 nightly.sh >>> monitor(1): killing test process... >>> monitor(1): killed process_exec (TERM) >>> process_exec_trap() >>> killing all swifts... >>> ++ echo 15473 >>> 15473 >>> ++ ps -f >>> UID PID PPID C STIME TTY TIME CMD >>> skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>> -s groups/group-ssh.sh >>> skenny 16623 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>> -s groups/group-ssh.sh >>> skenny 16624 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>> -s groups/group-ssh.sh >>> skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M >>> >>> -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ >> >>> skenny 17414 16623 0 16:06 pts/27 00:00:00 ps -f >>> skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash >>> ./nightly.sh: line 588: 16627 Killed "$@" > $OUTPUT 2>&1 >>> +++ ps -f >>> +++ grep '.*java' >>> +++ grep -v grep >>> ++ kill_this skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M >>> >>> -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed >> >>> -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= >>> >>> login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. >> >>> >>> -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. 
>> >>> -Djava.security.egd=file:///dev/urandom -classpath >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest >> > > /cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provi > > der-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/mo > > 
dules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo > >> ++ '[' -n 16687 ']' >>> ++ /bin/kill -KILL 16687 >>> ++ set +x >>> kill 16624: No such process >>> TOOK: 500 >>> FAILED >>> Swift svn swift-r3921 (swift modified locally) cog-r3013 >>> >>> RunID: 20110110-1558-ojtlnxfb >>> Progress: >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> Progress: Selecting site:9 Initializing site shared directory:1 >>> nightly.sh: monitor(1): killed: exceeded 500 seconds >>> FAILED >>> ++ seq --format %04.f 1 1 10 >>> + for count in '`seq --format "%04.f" 1 1 10`' >>> + '[' -f catsn.0001.out ']' >>> + exit 1 >>> >>> ---------------------------------------------------------------- >>> >>> i'm running this directly on the pads login and seeing this in the swift >>> log: >>> >>> 2011-01-10 16:33:18,539-0600 INFO TransportProtocolCommon The Transport >>> Protocol thread >>> failed >>> >>> java.io.IOException: The socket is >>> EOF >>> >>> at >>> >>> com.sshtools.j2ssh.transport.TransportProtocolInputStream.readBufferedData(TransportProtocolInputStream.java:183) >> >>> >>> at >>> >>> com.sshtools.j2ssh.transport.TransportProtocolInputStream.readMessage(TransportProtocolInputStream.java:226) >> >>> >>> at >>> >>> com.sshtools.j2ssh.transport.TransportProtocolCommon.processMessages(TransportProtocolCommon.java:1440) >> >>> >>> at >>> >>> com.sshtools.j2ssh.transport.TransportProtocolCommon.startBinaryPacketProtocol(TransportProtocolCommon.java:1034) >> >>> >>> at >>> >>> com.sshtools.j2ssh.transport.TransportProtocolCommon.run(TransportProtocolCommon.java:393) >> >>> >>> at >>> java.lang.Thread.run(Thread.java:619) >>> >>> >>> >>> you can view the test output here: >>> >>> >>> >> http://www.ci.uchicago.edu/~skenny/swift_tests/run-2011-01-10/tests-2011-01-10.html >> >>> >>> anyway, thought i'd post this in case there's something that might jump >>> >> out >> >>> at any of you that i can tweak... >>> >>> ~sk >>> >> >> > -- > Justin M Wozniak -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dk0966 at cs.ship.edu Tue Jan 11 12:57:54 2011 From: dk0966 at cs.ship.edu (David Kelly) Date: Tue, 11 Jan 2011 13:57:54 -0500 Subject: [Swift-devel] running nightly.sh on pads In-Reply-To: References: Message-ID: When testing with ssh on the mcs machines, I noticed that if the URL used in sites.xml doesn't match the machine name exactly in ~/.ssh/auth.defaults, a window would pop up via X windows and ask me for credentials. I'm guessing this might be the "CredentialsDialog" mentioned in the log? Here is an example of how I have mine set up on mcs: .ssh/auth.defaults: vanquish.mcs.anl.gov.type=key vanquish.mcs.anl.gov.username=davidk vanquish.mcs.anl.gov.key=/home/davidk/.ssh/id_rsa vanquish.mcs.anl.gov.passphrase=my secret phrase is here Then the url field in sites.xml is also "vanquish.mcs.anl.gov" (if I just use "vanquish" here, even though it's the same machine it doesn't work and I get a popup) On Tue, Jan 11, 2011 at 1:40 PM, Sarah Kenny wrote: > so i tried again changing the timeout from 500 to 900 and got a slightly > different error in the log. > > all the files are here: > > http://www.ci.uchicago.edu/~skenny/swift_tests/run-2011-01-11/tests-2011-01-11.html > > thought i would go ahead and post to the list in case we wanted to look at > it during the meeting. > > ~sk > > On Mon, Jan 10, 2011 at 4:59 PM, Justin M Wozniak > wrote: >> >> Hopefully that will do it- the default timeout is only 30 seconds. >> >> I recommend running with -a to skip the ant build, and -p to skip >> something else that is in there. >> >> I will also take a look at why you might be getting the error messages you >> are and try to clean some of that up. >> >> ? ? ? ?Justin >> >> On Mon, 10 Jan 2011, David Kelly wrote: >> >>> Maybe try increasing the time in the .timeout file? I usually see >>> something >>> similar when the job exceeds the timeout value >>> On Jan 10, 2011 6:17 PM, "Sarah Kenny" wrote: >>>> >>>> so, i'm trying to get nightly.sh to run on pads with coasters and i'm >>>> not >>>> quite sure where this is falling apart. so far the only thing i've >>>> edited >>> >>> is >>>> >>>> providers/ssh-pbs-coasters/sites.template.xml (allowing it to take the >>>> PROJECT and QUEUE variables). 
from what i can tell the sites.xml file >>>> does >>>> get generated correctly but then according to the test output it times >>>> out >>>> during submission: >>>> >>>> [skenny at login1 tests]$ ./nightly.sh -c -g -s groups/group-ssh.sh >>>> RUNNING_IN: >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10 >>>> >>>> HTML_OUTPUT: >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/run-2011-01-10/tests-2011-01-10.html >>>> >>>> which: no ifconfig in >>>> >>> >>> (/ci/projects/cnari/apps/freesurfer64/bin:/ci/projects/cnari/apps/freesurfer64/fsfast/bin:/ci/projects/cnari/apps/freesurfer64/mni/bin:/ci/projects/cnari/usr/bin:/ci/projects/cnari/apps/afni:/ci/projects/cnari/apps/swift/bin:/soft/java-1.6.0_11-sun-r1/bin:/soft/java-1.6.0_11-sun-r1/jre/bin:/software/common/gx-map-0.5.3.3-r1/bin:/soft/apache-ant-1.7.1-r1/bin:/soft/condor-7.0.5-r1/bin:/soft/globus-4.2.1-r2/bin:/soft/globus-4.2.1-r2/sbin:/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/skenny/bin/linux-rhel5-x86_64:/home/skenny/bin:/soft/maui-3.2.6p21-r1/bin:/soft/maui-3.2.6p21-r1/sbin:/soft/openmpi-1.4.2-gcc4.1-r1/bin) >>>> >>>> GROUPLISTFILE: groups/group-ssh.sh >>>> >>>> Prolog: Build >>>> >>>> Executing (part 1) >>>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests >>>> Executing (part 2) >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift >>>> >>>> 14815 pts/27 00:00:00 nightly.sh >>>> monitor(1): killing test process... >>>> touch: cannot touch `killed_test': Stale NFS file handle >>>> monitor(1): killed process_exec (TERM) >>>> process_exec_trap() >>>> killing all swifts... >>>> ++ echo 13685 >>>> 13685 >>>> ++ ps -f >>>> UID PID PPID C STIME TTY TIME CMD >>>> skenny 14815 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>>> -s groups/group-ssh.sh >>>> skenny 14816 1 0 15:49 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>>> -s groups/group-ssh.sh >>>> skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M >>>> >>> >>> -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ >>>> >>>> skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>>> -s groups/group-ssh.sh >>>> skenny 15503 15473 7 15:55 pts/27 00:00:08 >>>> /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath >>>> /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar >>>> -Dant.home=/soft/apache-ant-1.7.1-r1 -Dant. >>>> skenny 15890 14815 0 15:57 pts/27 00:00:00 ps -f >>>> skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash >>>> ./nightly.sh: line 588: 14819 Killed "$@" > $OUTPUT 2>&1 >>>> +++ ps -f >>>> +++ grep '.*java' >>>> +++ grep -v grep >>>> ++ kill_this skenny 14879 1 0 15:49 pts/27 00:00:04 java -Xmx2048M >>>> >>> >>> -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed >>>> >>>> -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= >>>> >>> >>> login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. >>>> >>> >>> -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. 
>>>> >>>> -Djava.security.egd=file:///dev/urandom -classpath >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest >> >> >> /cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provi >> >> der-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/mo >> >> 
dules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo >>>> >>>> skenny 15503 15473 7 15:55 pts/27 00:00:08 >>>> /soft/java-1.6.0_11-sun-r1/jre/bin/java -classpath >>>> /soft/apache-ant-1.7.1-r1/lib/ant-launcher.jar >>>> -Dant.home=/soft/apache-ant-1.7.1-r1 >>>> -Dant.library.dir=/soft/apache-ant-1.7.1-r1/lib >>>> org.apache.tools.ant.launch.Launcher -cp :./ -quiet dist >>>> ++ '[' -n 14879 ']' >>>> ++ /bin/kill -KILL 14879 >>>> ++ set +x >>>> Executing Package (part 3) >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift >>>> >>>> Executing Package (part 4) >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib >>>> >>>> Executing Package (part 5) >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/lib >>>> >>>> Executing Package (part 6) >>>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/swift >>>> >>>> Part 1: SSH with PBS and Coasters Configuration Test >>>> >>>> Using: >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/sites.template.xml >>>> >>>> Using: >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/tc.template.data >>>> >>> >>> `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/etc/swift.properties' >>>> >>>> -> `./swift.properties' >>>> >>> >>> `/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift' >>>> >>>> -> `./001-catsn-ssh-pbs-coasters.swift' >>>> Executing >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/tests/providers/ssh-pbs-coasters/001-catsn-ssh-pbs-coasters.swift >>>> >>>> (part 1) >>>> 16623 pts/27 00:00:00 nightly.sh >>>> monitor(1): killing test process... >>>> monitor(1): killed process_exec (TERM) >>>> process_exec_trap() >>>> killing all swifts... 
>>>> ++ echo 15473 >>>> 15473 >>>> ++ ps -f >>>> UID PID PPID C STIME TTY TIME CMD >>>> skenny 15473 23767 0 15:55 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>>> -s groups/group-ssh.sh >>>> skenny 16623 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>>> -s groups/group-ssh.sh >>>> skenny 16624 15473 0 15:58 pts/27 00:00:00 /bin/bash ./nightly.sh -c -g >>>> -s groups/group-ssh.sh >>>> skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M >>>> >>> >>> -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/ >>>> >>>> skenny 17414 16623 0 16:06 pts/27 00:00:00 ps -f >>>> skenny 23767 23760 0 13:53 pts/27 00:00:00 -bash >>>> ./nightly.sh: line 588: 16627 Killed "$@" > $OUTPUT 2>&1 >>>> +++ ps -f >>>> +++ grep '.*java' >>>> +++ grep -v grep >>>> ++ kill_this skenny 16687 1 0 15:58 pts/27 00:00:04 java -Xmx2048M >>>> >>> >>> -Djava.endorsed.dirs=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/endorsed >>>> >>>> -DUID=1195 -DGLOBUS_TCP_PORT_RANGE=50000,51000 -DGLOBUS_HOSTNAME= >>>> >>> >>> login1.pads.ci.uchicago.edu-DCOG_INSTALL_PATH=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. >>>> >>> >>> -Dswift.home=/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/.. >>>> >>>> -Djava.security.egd=file:///dev/urandom -classpath >>>> >>> >>> /ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../etc:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../libexec:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/addressing-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/antlr-2.7.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/axis-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/backport-util-concurrent.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/coaster-bootstrap.jar:/ci/projects/cnari/soft/swift_latest >> >> >> /cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-abstraction-common-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-axis.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-grapheditor-0.47.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-jglobus-1.7.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-karajan-0.36-dev.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-clref-gt4_0_0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-coaster-0.3.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provi >> >> 
der-dcache-0.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt2-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-local-2.2.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-localscheduler-0.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-ssh-2.4.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-provider-webdav-2.1.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-resources-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/mo >> >> dules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-swift-svn.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-trap-1.0.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-url.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/cog-util-0.92.jar:/ci/projects/cnari/soft/swift_latest/cog/modules/swift/tests/cog/modules/swift/dist/swift-svn/bin/../lib/commonj.jar:/ci/projects/cnari/soft/swift_latest/cog/mo >>>> >>>> ++ '[' -n 16687 ']' >>>> ++ /bin/kill -KILL 16687 >>>> ++ set +x >>>> kill 16624: No such process >>>> TOOK: 500 >>>> FAILED >>>> Swift svn swift-r3921 (swift modified locally) cog-r3013 >>>> >>>> RunID: 20110110-1558-ojtlnxfb >>>> Progress: >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> Progress: Selecting site:9 Initializing site shared directory:1 >>>> nightly.sh: monitor(1): killed: exceeded 500 seconds >>>> FAILED >>>> ++ seq --format %04.f 1 1 10 >>>> + for count in '`seq --format "%04.f" 1 1 10`' >>>> + '[' -f catsn.0001.out ']' >>>> + exit 1 >>>> >>>> ---------------------------------------------------------------- >>>> >>>> i'm running this directly on the pads login and seeing this in the swift >>>> log: >>>> >>>> 2011-01-10 16:33:18,539-0600 INFO TransportProtocolCommon The Transport >>>> Protocol thread >>>> failed >>>> >>>> java.io.IOException: The socket is >>>> EOF >>>> >>>> at >>>> >>> >>> 
com.sshtools.j2ssh.transport.TransportProtocolInputStream.readBufferedData(TransportProtocolInputStream.java:183) >>>> >>>> at >>>> >>> >>> com.sshtools.j2ssh.transport.TransportProtocolInputStream.readMessage(TransportProtocolInputStream.java:226) >>>> >>>> at >>>> >>> >>> com.sshtools.j2ssh.transport.TransportProtocolCommon.processMessages(TransportProtocolCommon.java:1440) >>>> >>>> at >>>> >>> >>> com.sshtools.j2ssh.transport.TransportProtocolCommon.startBinaryPacketProtocol(TransportProtocolCommon.java:1034) >>>> >>>> at >>>> >>> >>> com.sshtools.j2ssh.transport.TransportProtocolCommon.run(TransportProtocolCommon.java:393) >>>> >>>> at >>>> java.lang.Thread.run(Thread.java:619) >>>> >>>> >>>> >>>> you can view the test output here: >>>> >>>> >>> >>> http://www.ci.uchicago.edu/~skenny/swift_tests/run-2011-01-10/tests-2011-01-10.html >>>> >>>> anyway, thought i'd post this in case there's something that might jump >>> >>> out >>>> >>>> at any of you that i can tweak... >>>> >>>> ~sk >>> >> >> -- >> Justin M Wozniak > From hategan at mcs.anl.gov Tue Jan 11 15:35:36 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 13:35:36 -0800 Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> References: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> Message-ID: <1294781736.20443.0.camel@blabla2.none> Just wanted to comment that at this point 0.92 will not get any new features. Mihael On Tue, 2011-01-11 at 07:48 -0600, Michael Wilde wrote: > Mihael, all, > > Below are the main prios I know of (for Mihael) to work on. > We'll need to refine this to agree on the prios within this list. > > These have shifted since we met late in Dec but I think they are now: > > - fix Cobalt provider for general Swift use on Eureka > (Justin: please file info on this as a bugzilla) > > - fix general Swift job scheduling to OSG > > -- fix swift-plot-log to give useful reports on runs > -- target is to support runs for Marc, Glen, and Aashish > > - assist in test and fix of SGE and PBS issues > > -- issues to be gathered and filed > > - support Allan to achieve ExTENCI data transfer goals > > - assess coaster time gap issue with Justin; fix if needed > > - assess coaster polling issue with Justin; fix if needed > > - work on runs and plots for Coaster paper, maybe for Swift paper > > - add option to set coaster-service to passive mode > > - anything else needed for 0.92? > > Please pipe in with whats missing. > > Thanks, > > Mike > From hategan at mcs.anl.gov Tue Jan 11 15:37:15 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 13:37:15 -0800 Subject: [Swift-devel] devel prios for Mihael In-Reply-To: References: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> Message-ID: <1294781835.20443.1.camel@blabla2.none> On Tue, 2011-01-11 at 10:27 -0600, Justin M Wozniak wrote: > On Tue, 11 Jan 2011, Michael Wilde wrote: > > > - fix Cobalt provider for general Swift use on Eureka > > (Justin: please file info on this as a bugzilla) > > Filed. So is cobalt on Eureka ignoring command line arguments or what is the actual problem? > > > - assess coaster time gap issue with Justin; fix if needed > > I am going to run a couple cases and post the observed gaps from worker.pl > PROFILE_EVENTS. > > > - assess coaster polling issue with Justin; fix if needed > > I'm going to run a quick test for this as well. 
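A note on the coaster polling issue referenced just above: the coaster workers (worker.pl) keep track of their running jobs by polling child processes with Perl's waitpid rather than blocking on them. The fragment below is a minimal, self-contained sketch of that style of non-blocking polling loop; it is illustrative only and is not the actual worker.pl code (the dummy jobs, the bookkeeping table, and the poll interval are invented for the example).

use strict;
use warnings;
use POSIX qw(WNOHANG);

my %jobs;                                   # pid => job name (hypothetical bookkeeping)
for my $n (1 .. 3) {                        # launch a few dummy jobs
    my $pid = fork();
    defined $pid or die "fork failed\n";
    if ($pid == 0) { sleep $n; exit $n; }   # child: pretend to do some work
    $jobs{$pid} = "job$n";
}

while (%jobs) {
    for my $pid (keys %jobs) {
        my $r = waitpid($pid, WNOHANG);     # returns 0 while the child is still running
        if ($r > 0) {
            printf "%s (pid %d) exited with status %d\n", $jobs{$pid}, $pid, $? >> 8;
            delete $jobs{$pid};
        }
    }
    select(undef, undef, undef, 0.1);       # short delay between polls instead of spinning
}

The per-call cost of waitpid in a loop like this is what the Perl waitpid mini-benchmark mentioned later in this thread measures.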
> From wozniak at mcs.anl.gov Tue Jan 11 15:39:42 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 11 Jan 2011 15:39:42 -0600 (CST) Subject: [Swift-devel] devel prios for Mihael In-Reply-To: <1294781835.20443.1.camel@blabla2.none> References: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> <1294781835.20443.1.camel@blabla2.none> Message-ID: On Tue, 11 Jan 2011, Mihael Hategan wrote: > On Tue, 2011-01-11 at 10:27 -0600, Justin M Wozniak wrote: >> On Tue, 11 Jan 2011, Michael Wilde wrote: >> >>> - fix Cobalt provider for general Swift use on Eureka >>> (Justin: please file info on this as a bugzilla) >> >> Filed. > > So is cobalt on Eureka ignoring command line arguments or what is the > actual problem? Right. >>> - assess coaster time gap issue with Justin; fix if needed >> >> I am going to run a couple cases and post the observed gaps from worker.pl >> PROFILE_EVENTS. >> >>> - assess coaster polling issue with Justin; fix if needed >> >> I'm going to run a quick test for this as well. Ok- I just posted my waitpid results and am now moving to get some worker.pl profiles over the next half hour. -- Justin M Wozniak From hategan at mcs.anl.gov Tue Jan 11 15:47:45 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 13:47:45 -0800 Subject: [Swift-devel] devel prios for Mihael In-Reply-To: References: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> <1294781835.20443.1.camel@blabla2.none> Message-ID: <1294782465.20443.10.camel@blabla2.none> On Tue, 2011-01-11 at 15:39 -0600, Justin M Wozniak wrote: > On Tue, 11 Jan 2011, Mihael Hategan wrote: > > > On Tue, 2011-01-11 at 10:27 -0600, Justin M Wozniak wrote: > >> On Tue, 11 Jan 2011, Michael Wilde wrote: > >> > >>> - fix Cobalt provider for general Swift use on Eureka > >>> (Justin: please file info on this as a bugzilla) > >> > >> Filed. > > > > So is cobalt on Eureka ignoring command line arguments or what is the > > actual problem? > > Right. Shouldn't we, in the long run, address this by asking for Cobalt to be fixed? From wozniak at mcs.anl.gov Tue Jan 11 15:56:22 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 11 Jan 2011 15:56:22 -0600 (CST) Subject: [Swift-devel] devel prios for Mihael In-Reply-To: <1294782465.20443.10.camel@blabla2.none> References: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> <1294781835.20443.1.camel@blabla2.none> <1294782465.20443.10.camel@blabla2.none> Message-ID: On Tue, 11 Jan 2011, Mihael Hategan wrote: >>> So is cobalt on Eureka ignoring command line arguments or what is the >>> actual problem? >> >> Right. > > Shouldn't we, in the long run, address this by asking for Cobalt to be > fixed? Yes. I even spoke to the developer in person several months ago. Now we have some potential users on that system so it's more of a priority. -- Justin M Wozniak From wilde at mcs.anl.gov Tue Jan 11 15:59:28 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 15:59:28 -0600 (CST) Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <1478810696.51886.1294783149285.JavaMail.root@zimbra.anl.gov> Message-ID: <911560247.51888.1294783168605.JavaMail.root@zimbra.anl.gov> ----- Original Message ----- > Just wanted to comment that at this point 0.92 will not get any new > features. Yup. I agree, thats the rules. We can decide if "fixing the cobalt provider to work on cobalt" is new feature or not. Perhaps depends on the size of the fix. 
- Mike > Mihael > > On Tue, 2011-01-11 at 07:48 -0600, Michael Wilde wrote: > > Mihael, all, > > > > Below are the main prios I know of (for Mihael) to work on. > > We'll need to refine this to agree on the prios within this list. > > > > These have shifted since we met late in Dec but I think they are > > now: > > > > - fix Cobalt provider for general Swift use on Eureka > > (Justin: please file info on this as a bugzilla) > > > > - fix general Swift job scheduling to OSG > > > > -- fix swift-plot-log to give useful reports on runs > > -- target is to support runs for Marc, Glen, and Aashish > > > > - assist in test and fix of SGE and PBS issues > > > > -- issues to be gathered and filed > > > > - support Allan to achieve ExTENCI data transfer goals > > > > - assess coaster time gap issue with Justin; fix if needed > > > > - assess coaster polling issue with Justin; fix if needed > > > > - work on runs and plots for Coaster paper, maybe for Swift paper > > > > - add option to set coaster-service to passive mode > > > > - anything else needed for 0.92? > > > > Please pipe in with whats missing. > > > > Thanks, > > > > Mike > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 16:02:41 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 16:02:41 -0600 (CST) Subject: [Swift-devel] devel prios for Mihael In-Reply-To: <1294782465.20443.10.camel@blabla2.none> Message-ID: <1198962053.51911.1294783361869.JavaMail.root@zimbra.anl.gov> ----- Original Message ----- > Shouldn't we, in the long run, address this by asking for Cobalt to be > fixed? We should ask right now. I think Justin has already done so, but no action was taken. Seems like a Swift fix would enable Swift users to run on Eureka, which is desirable. - Mike From hategan at mcs.anl.gov Tue Jan 11 16:15:49 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 14:15:49 -0800 Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> References: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> Message-ID: <1294784149.21301.0.camel@blabla2.none> On Tue, 2011-01-11 at 07:48 -0600, Michael Wilde wrote: > -- fix swift-plot-log to give useful reports on runs Are you sure you want me to do that? I'll just rewrite the whole thing... From hategan at mcs.anl.gov Tue Jan 11 16:17:36 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 14:17:36 -0800 Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <911560247.51888.1294783168605.JavaMail.root@zimbra.anl.gov> References: <911560247.51888.1294783168605.JavaMail.root@zimbra.anl.gov> Message-ID: <1294784256.21301.1.camel@blabla2.none> On Tue, 2011-01-11 at 15:59 -0600, Michael Wilde wrote: > > ----- Original Message ----- > > Just wanted to comment that at this point 0.92 will not get any new > > features. > > Yup. I agree, thats the rules. > > We can decide if "fixing the cobalt provider to work on cobalt" is new feature or not. Perhaps depends on the size of the fix. The cobalt provider works just fine on cobalt. It doesn't work on a broken cobalt though. 
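To make the proposed work-around from bug 245 concrete: because Cobalt on Eureka drops the arguments passed to the job script, the idea is for ScriptManager.writeScript() to rewrite the head of worker.pl so that the service URI, block id and log directory are baked into the script text at the #WORKER_CODE_SUBST marker instead of arriving via @ARGV. The fragment below is a hedged sketch of what that substituted header might look like; the variable names and the "Not given" checks come from the snippet quoted in the bug report, but the three values are placeholders and the substitution mechanics shown are an assumption, not the actual implementation.

my $URISTR=$ARGV[0];
my $BLOCKID=$ARGV[1];
my $LOGDIR=$ARGV[2];
# (lines below stand in for #WORKER_CODE_SUBST after substitution by the
#  service; the values are placeholders, not real settings)
$URISTR  = "http://192.0.2.10:45000" unless defined $URISTR;
$BLOCKID = "0000-000001"             unless defined $BLOCKID;
$LOGDIR  = "/tmp/worker-logs"        unless defined $LOGDIR;
defined $URISTR  || die "Not given: URI\n";
defined $BLOCKID || die "Not given: BLOCKID\n";
defined $LOGDIR  || die "Not given: LOGDIR\n";

With the settings baked in this way the worker no longer depends on the queuing system forwarding command-line arguments at all, which is why the bug report suggests delivering the substituted script to users via a site profile.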
From wozniak at mcs.anl.gov Tue Jan 11 16:23:43 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 11 Jan 2011 16:23:43 -0600 (CST) Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <1294784256.21301.1.camel@blabla2.none> References: <911560247.51888.1294783168605.JavaMail.root@zimbra.anl.gov> <1294784256.21301.1.camel@blabla2.none> Message-ID: On Tue, 11 Jan 2011, Mihael Hategan wrote: > On Tue, 2011-01-11 at 15:59 -0600, Michael Wilde wrote: >> >> ----- Original Message ----- >>> Just wanted to comment that at this point 0.92 will not get any new >>> features. >> >> Yup. I agree, thats the rules. >> >> We can decide if "fixing the cobalt provider to work on cobalt" is new feature or not. Perhaps depends on the size of the fix. > > The cobalt provider works just fine on cobalt. It doesn't work on a > broken cobalt though. Right- this apparently has to do with Cobalt on a cluster. -- Justin M Wozniak From hategan at mcs.anl.gov Tue Jan 11 16:25:15 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 14:25:15 -0800 Subject: [Swift-devel] Re: poor performance In-Reply-To: <831327385.50297.1294769277319.JavaMail.root@zimbra.anl.gov> References: <831327385.50297.1294769277319.JavaMail.root@zimbra.anl.gov> Message-ID: <1294784715.21653.0.camel@blabla2.none> I thought I mentioned what the problem was there and that it should be fixed in cog 4.1.8/swift 0.92. Mihael On Tue, 2011-01-11 at 12:07 -0600, Michael Wilde wrote: > Mihael, I moved the log below to ~wilde/ftdock on the CI net. > > I tried to run swift-plot-log on it, and that failed ( i got nothing). > I'll file the latter issue in bugzilla. > > - Mike > > ----- Original Message ----- > > Mihael, can you take a look at Marc's logs? Do you have access to > > engage-submit? > > > > We should now help Marc move his work to bridled and communicado where > > we now have COndor-G installed and can more readily assist him in > > debugging. > > > > Marc, I'll try to give this more attention this week, but have to work > > on a deadline for today, first. > > > > - Mike > > > > > > ----- Forwarded Message ----- > > From: "Marc Parisien" > > To: "Michael Wilde" > > Sent: Monday, January 10, 2011 10:25:30 AM > > Subject: poor performance > > > > Hi Mike, > > > > I have a campaign running on renci for almost 2 days now; only 529 on > > 3000 jobs are done. IBI can weed 3000 within 2 days on 128 processors. > > What is the problem? Why don't I have a decent performance on making > > use of 10 super-computers? I have the most simplest swift script ever > > (aside from the "Hello World" one). > > > > the (1 Gb) log file is here: > > /home/parisien/Database/MCSG/ftdock-20110108-1301-xx8evgh7.log > > > > could it be because of a bad "site" that throws off all of Swift's > > scheduling? Or a "badly" set parameter?? > > > > > > Very Best, > > Marc. 
> > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > From hategan at mcs.anl.gov Tue Jan 11 16:26:04 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 14:26:04 -0800 Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: References: <911560247.51888.1294783168605.JavaMail.root@zimbra.anl.gov> <1294784256.21301.1.camel@blabla2.none> Message-ID: <1294784764.21653.1.camel@blabla2.none> On Tue, 2011-01-11 at 16:23 -0600, Justin M Wozniak wrote: > On Tue, 11 Jan 2011, Mihael Hategan wrote: > > > On Tue, 2011-01-11 at 15:59 -0600, Michael Wilde wrote: > >> > >> ----- Original Message ----- > >>> Just wanted to comment that at this point 0.92 will not get any new > >>> features. > >> > >> Yup. I agree, thats the rules. > >> > >> We can decide if "fixing the cobalt provider to work on cobalt" is new feature or not. Perhaps depends on the size of the fix. > > > > The cobalt provider works just fine on cobalt. It doesn't work on a > > broken cobalt though. > > Right- this apparently has to do with Cobalt on a cluster. Apart from Intrepid you must mean. Or am I missing something? From wozniak at mcs.anl.gov Tue Jan 11 16:29:33 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 11 Jan 2011 16:29:33 -0600 (CST) Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <1294784764.21653.1.camel@blabla2.none> References: <911560247.51888.1294783168605.JavaMail.root@zimbra.anl.gov> <1294784256.21301.1.camel@blabla2.none> <1294784764.21653.1.camel@blabla2.none> Message-ID: On Tue, 11 Jan 2011, Mihael Hategan wrote: > On Tue, 2011-01-11 at 16:23 -0600, Justin M Wozniak wrote: >> On Tue, 11 Jan 2011, Mihael Hategan wrote: >> >>> On Tue, 2011-01-11 at 15:59 -0600, Michael Wilde wrote: >>>> >>> >>> The cobalt provider works just fine on cobalt. It doesn't work on a >>> broken cobalt though. >> >> Right- this apparently has to do with Cobalt on a cluster. > > Apart from Intrepid you must mean. Or am I missing something? I can only speculate that the system API that they have to use as a back-end is sufficiently different to manifest this issue. -- Justin M Wozniak From wilde at mcs.anl.gov Tue Jan 11 16:32:16 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 16:32:16 -0600 (CST) Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <1294784149.21301.0.camel@blabla2.none> Message-ID: <1846301574.52136.1294785136141.JavaMail.root@zimbra.anl.gov> No, not *sure*. A point for discussion. ----- Original Message ----- > On Tue, 2011-01-11 at 07:48 -0600, Michael Wilde wrote: > > -- fix swift-plot-log to give useful reports on runs > > Are you sure you want me to do that? I'll just rewrite the whole > thing... -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 16:48:25 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 16:48:25 -0600 (CST) Subject: [Swift-devel] Can Cobalt command-line bug on Eureka be fixed? 
Message-ID: <1484991023.52267.1294786105485.JavaMail.root@zimbra.anl.gov> Hi ALCF Team, The following known issue in Cobalt is currently preventing us from running Swift on Eureka: http://trac.mcs.anl.gov/projects/cobalt/ticket/462 With some additional development effort we can work around this, but it would be much cleaner and better if this were fixed in Cobalt, instead, as suggested in ticket 462 above. Is there any chance that can be done in the next few days? If not, please let me know, and we will implement the work-around instead. This is holding up work on the DOE ParVis project (Rob Jacob, PI) and we've had to move some work we want to run on Eureka to other platforms in the meantime. Thanks very much, Mike 462 is: Ticket #462 (new defect) Opened 7 months ago Cobalt on clusters ignores job script arguments Reported by: acherry Priority: major Component: clients Description It appears that cobalt-launcher.py does not support running a job script or executable with command arguments, even though qsub will accept the arguments, and the man page and help for qsub indicates that arguments are accepted. I'm filing this as a bug rather than a feature request, since the behavior isn't consistent with the documentation. But I'd rather the fix for this to be adding support for args, rather than changing the docs to say they aren't accepted. :-) -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 16:50:51 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 16:50:51 -0600 (CST) Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: Message-ID: <2014962521.52271.1294786251514.JavaMail.root@zimbra.anl.gov> I will try to contact ALCF and see if they can fix it. Also note that in today's Release 0.92 telecon we set David to work on this. Lets see what ALCF says, and decide then, David, if you should continue. This is filed as Swift bug 245: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=245 - Mike ----- Original Message ----- > On Tue, 11 Jan 2011, Mihael Hategan wrote: > > > On Tue, 2011-01-11 at 16:23 -0600, Justin M Wozniak wrote: > >> On Tue, 11 Jan 2011, Mihael Hategan wrote: > >> > >>> On Tue, 2011-01-11 at 15:59 -0600, Michael Wilde wrote: > >>>> > >>> > >>> The cobalt provider works just fine on cobalt. It doesn't work on > >>> a > >>> broken cobalt though. > >> > >> Right- this apparently has to do with Cobalt on a cluster. > > > > Apart from Intrepid you must mean. Or am I missing something? > > I can only speculate that the system API that they have to use as a > back-end is sufficiently different to manifest this issue. > > -- > Justin M Wozniak -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 16:54:57 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 16:54:57 -0600 (CST) Subject: [Swift-devel] Re: poor performance In-Reply-To: <1294784715.21653.0.camel@blabla2.none> Message-ID: <806088913.52286.1294786497331.JavaMail.root@zimbra.anl.gov> Its likely that I lost track of this. I *thought* that when Marc reported the bug, that he had already applied the fix, but now that you remind me of the thread, its very likely that he did not. I will do this with him (move him to 0.92 RC), change the throttle values, and re-test the workflow. 
- Mike ----- Original Message ----- > I thought I mentioned what the problem was there and that it should be > fixed in cog 4.1.8/swift 0.92. > > Mihael > > On Tue, 2011-01-11 at 12:07 -0600, Michael Wilde wrote: > > Mihael, I moved the log below to ~wilde/ftdock on the CI net. > > > > I tried to run swift-plot-log on it, and that failed ( i got > > nothing). > > I'll file the latter issue in bugzilla. > > > > - Mike > > > > ----- Original Message ----- > > > Mihael, can you take a look at Marc's logs? Do you have access to > > > engage-submit? > > > > > > We should now help Marc move his work to bridled and communicado > > > where > > > we now have COndor-G installed and can more readily assist him in > > > debugging. > > > > > > Marc, I'll try to give this more attention this week, but have to > > > work > > > on a deadline for today, first. > > > > > > - Mike > > > > > > > > > ----- Forwarded Message ----- > > > From: "Marc Parisien" > > > To: "Michael Wilde" > > > Sent: Monday, January 10, 2011 10:25:30 AM > > > Subject: poor performance > > > > > > Hi Mike, > > > > > > I have a campaign running on renci for almost 2 days now; only 529 > > > on > > > 3000 jobs are done. IBI can weed 3000 within 2 days on 128 > > > processors. > > > What is the problem? Why don't I have a decent performance on > > > making > > > use of 10 super-computers? I have the most simplest swift script > > > ever > > > (aside from the "Hello World" one). > > > > > > the (1 Gb) log file is here: > > > /home/parisien/Database/MCSG/ftdock-20110108-1301-xx8evgh7.log > > > > > > could it be because of a bad "site" that throws off all of Swift's > > > scheduling? Or a "badly" set parameter?? > > > > > > > > > Very Best, > > > Marc. > > > > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 17:01:08 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 17:01:08 -0600 (CST) Subject: [Swift-devel] Re: waitpid mini-benchmark In-Reply-To: <1294782406.20443.9.camel@blabla2.none> Message-ID: <1222744546.52320.1294786868334.JavaMail.root@zimbra.anl.gov> Agreed. That was the purpose of measuring it. Probably should make some easy way to pass command line options to the worker. I think we felt all along that the current logic was OK for workersPerNode <16 or so. The concern was only related to running workersPerNode much higher, like O(100s), which would only be done for experiments (what we've been calling "over-clocking the nodes" with sleep jobs). Ie, to use one rack of the BG/P to simulate the client load of 40 racks. So I agree - this task is off the list. - Mike ----- Original Message ----- > That's 0.3ms for 100 jobs? If yes, I don't think we should worry about > it beyond making sure that we don't poll very often (e.g. more than 10 > times/s). 
> > On Tue, 2011-01-11 at 15:34 -0600, Justin M Wozniak wrote: > > Just posted the results of a Perl waitpid() mini-benchmark: > > > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/MiniBenchmarks > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 17:15:10 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 17:15:10 -0600 (CST) Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> Message-ID: <725770737.52387.1294787710456.JavaMail.root@zimbra.anl.gov> Consolidating what we discussed on the list into an update on these issues: > - fix Cobalt provider for general Swift use on Eureka > (Justin: please file info on this as a bugzilla) pending word from ALCF. > - fix general Swift job scheduling to OSG > > -- fix swift-plot-log to give useful reports on runs > -- target is to support runs for Marc, Glen, and Aashish pending re-tests by these users on 0.92 w/ changed sites params > - assist in test and fix of SGE and PBS issues > > -- issues to be gathered and filed - pending documentation of the issues > - support Allan to achieve ExTENCI data transfer goals This is two issues: HIGH: 1. coasters fails at ~ >60 workers per service ^^^ bubbles up to higher prio HIGH: 2. throughput w/ provider staging is about 2jobs/sec/service with 3MB data input file per job. Allan needs at least 10 jobs/sec/service for ExTENCI runs. Its possible that issue #2 is the same as the next issue below, "coaster time gap issue". ^^^ Justin and Allan are both working to isolate this at the moment. > - assess coaster time gap issue with Justin; fix if needed - pending measurement and reproduction of the issue > - assess coaster polling issue with Justin; fix if needed - dropped > - work on runs and plots for Coaster paper, maybe for Swift paper - on hold pending discussion > - add option to set coaster-service to passive mode - low prio. Is its trivial to add? > - anything else needed for 0.92? Other support/debug work: MED Fix problems blocking Jon from reliable large runs. > > Please pipe in with whats missing. > > Thanks, > > Mike > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 17:24:40 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 17:24:40 -0600 (CST) Subject: [Swift-devel] Re: poor performance In-Reply-To: <8FBE44AC-0E1E-4864-A581-B68ABC9ECA35@uchicago.edu> Message-ID: <997022408.52423.1294788280104.JavaMail.root@zimbra.anl.gov> Hi Marc, Mihael's reply to this was to remind me that he applied a Swift fix to try to address this. That was done while I was focused on writing, and I forgot to follow up with you. What we needed to do based on Mihael's last analysis of your logs was: - rerun using the new Swift release candidate "0.92" which I need to build for you. Can you confirm that you did *not* build this yourself, and that the latest run is still on the old, un-fixed Swift release? - change your throttle and score values to something lower than 1,10000 (which we will need some advice from Mihael on). I suggested we try the new code with the throttles removed. 
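For reference, the throttle and score in question are per-site profile entries in the sites file. A sketch of the kind of pool entry being discussed, written as a shell here-document (the handle, gatekeeper URL, paths and values are placeholders, not taken from Marc's actual configuration):

    # fragment to paste inside the <config> element of the sites file
    cat > osg-site-fragment.xml <<'EOF'
    <pool handle="SOME-OSG-SITE">
      <execution provider="gt2" url="gatekeeper.example.org/jobmanager-pbs"/>
      <profile namespace="karajan" key="jobThrottle">1</profile>
      <profile namespace="karajan" key="initialScore">10000</profile>
      <workdirectory>/tmp/swiftwork</workdirectory>
    </pool>
    EOF

Dropping the two karajan profile lines falls back to the scheduler's default (slow) ramp-up, while a high initialScore with a modest jobThrottle is the usual way to let a site start many jobs immediately. I *think* the throttle then caps concurrent jobs per site at roughly jobThrottle*100+1, but check the user guide before relying on that number.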
So while I wait to hear from you I will build Swift 0.92 on engage. And also start setting up for running on OSG form the CI network instead of engage-submit. - Mike ----- Original Message ----- > Hi Mike, > > I have a campaign running on renci for almost 2 days now; only 529 on > 3000 jobs are done. IBI can weed 3000 within 2 days on 128 processors. > What is the problem? Why don't I have a decent performance on making > use of 10 super-computers? I have the most simplest swift script ever > (aside from the "Hello World" one). > > the (1 Gb) log file is here: > /home/parisien/Database/MCSG/ftdock-20110108-1301-xx8evgh7.log > > could it be because of a bad "site" that throws off all of Swift's > scheduling? Or a "badly" set parameter?? > > > Very Best, > Marc. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 17:33:57 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 17:33:57 -0600 (CST) Subject: [Swift-devel] Fwd: poor performance In-Reply-To: <797BC4D8-BE0E-46D9-9BCD-75466F5EB8E0@uchicago.edu> Message-ID: <1180898859.52477.1294788837654.JavaMail.root@zimbra.anl.gov> Marc confirms that he was using an un-fixed release. Im building an 0.92 on engage submit now. If you have suggestions on score/throttle values to accelerate the scheduler's default ramp up, would be good to try (in addition to a test of the defaults). - Mike ----- Forwarded Message ----- From: "Marc Parisien" To: "Michael Wilde" Sent: Tuesday, January 11, 2011 5:28:37 PM Subject: Re: poor performance Hi Mike, > Can you confirm that you did *not* build this yourself, and that the latest run is still on the old, un-fixed Swift release? I ain't building that :-D So I must be using the one in your renci account: /home/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/bin/ > - change your throttle and score values to something lower than 1,10000 (which we will need some advice from Mihael on). I suggested we try the new code with the throttles removed. You already suggest to remove them... > So while I wait to hear from you I will build Swift 0.92 on engage. And also start setting up for running on OSG form the CI network instead of engage-submit. alright, Thanks, A+ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 17:35:48 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 17:35:48 -0600 (CST) Subject: [Swift-devel] Re: [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? In-Reply-To: <4D2CE575.2090506@alcf.anl.gov> Message-ID: <2039357798.52486.1294788948030.JavaMail.root@zimbra.anl.gov> Thanks, Rich and Andrew, for the very fast responses. We'll try the work-around, then. Regards, - Mike ----- Original Message ----- > Michael, > > Unfortunately a fix for this will, at this point in time, take a > minimum > of four weeks to deploy to a production resource like Eureka, due to > our > testing, upgrade and maintenance procedures. > > As a workaround for this on Eureka, since every job effectively runs > in > script mode, you should be able to set environment variables within > the > script that you submit to Cobalt. > > We apologize for the inconvenience. Let us know if you have any other > questions. 
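Concretely, the suggested workaround amounts to baking the would-be command-line arguments into the script that gets submitted, since cobalt-launcher.py drops them (ticket 462). A minimal sketch of the pattern, with hypothetical names and paths:

    #!/bin/bash
    # myjob.sh - submitted to Cobalt with *no* positional arguments.
    # What would have been $1 and $2 is set here instead (or exported into
    # the job's environment, if qsub on Eureka supports forwarding it).
    INPUT=${INPUT:-/home/someuser/data/input.dat}
    NSTEPS=${NSTEPS:-100}
    exec /home/someuser/bin/myapp --input "$INPUT" --steps "$NSTEPS"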
> > -- > Paul Rich > ALCF Operations -- AIG > richp at alcf.anl.gov > > > On 1/11/11 4:48 PM, Michael Wilde wrote: > > User info for wilde at mcs.anl.gov > > ================================= > > Username: wilde > > Full Name: Michael Wilde > > Projects: > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde > > ('*' denotes INCITE projects) > > ================================= > > > > > > Hi ALCF Team, > > > > The following known issue in Cobalt is currently preventing us from > > running Swift on Eureka: > > > > http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > > > With some additional development effort we can work around this, but > > it would be much cleaner and better if this were fixed in Cobalt, > > instead, as suggested in ticket 462 above. > > > > Is there any chance that can be done in the next few days? > > If not, please let me know, and we will implement the work-around > > instead. > > > > This is holding up work on the DOE ParVis project (Rob Jacob, PI) > > and we've had to move some work we want to run on Eureka to other > > platforms in the meantime. > > > > Thanks very much, > > > > Mike > > > > 462 is: > > > > Ticket #462 (new defect) > > Opened 7 months ago > > Cobalt on clusters ignores job script arguments > > > > Reported by: acherry > > Priority: major > > Component: clients > > > > Description > > > > It appears that cobalt-launcher.py does not support running a job > > script or executable with command arguments, even though qsub will > > accept the arguments, and the man page and help for qsub indicates > > that arguments are accepted. > > > > I'm filing this as a bug rather than a feature request, since the > > behavior isn't consistent with the documentation. But I'd rather the > > fix for this to be adding support for args, rather than changing the > > docs to say they aren't accepted. :-) > > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 17:40:31 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 17:40:31 -0600 (CST) Subject: [Swift-devel] Re: [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? In-Reply-To: <2039357798.52486.1294788948030.JavaMail.root@zimbra.anl.gov> Message-ID: <413995418.52513.1294789231634.JavaMail.root@zimbra.anl.gov> One workaround we can try here, which may be more valuable than a temp fix, would be to make a more user-ready script to launch manual coasters (persistent/passive) on any cluster. We have several such scripts floating around; probably Sheri could use one if it were only slightly polished. That would be a good project for you, David. Such a script would be useful on any cluster, and would need only slight flexibility to specify the batch jobs for various PBS, SGE, Cobalt, and Slurm systems. It has all the drawbacks of manual coasters (which some folks like) and is a usage mode we want to support. Justin, you noted yesterday that its hard to make such a script general. Maybe if we split the script into 2 variants (one for clusters, and one for sets of workstations) that would ake the resultant scripts more maintainable and testable? - Mike ----- Original Message ----- > Thanks, Rich and Andrew, for the very fast responses. > > We'll try the work-around, then. 
> > Regards, > > - Mike > > > ----- Original Message ----- > > Michael, > > > > Unfortunately a fix for this will, at this point in time, take a > > minimum > > of four weeks to deploy to a production resource like Eureka, due to > > our > > testing, upgrade and maintenance procedures. > > > > As a workaround for this on Eureka, since every job effectively runs > > in > > script mode, you should be able to set environment variables within > > the > > script that you submit to Cobalt. > > > > We apologize for the inconvenience. Let us know if you have any > > other > > questions. > > > > -- > > Paul Rich > > ALCF Operations -- AIG > > richp at alcf.anl.gov > > > > > > On 1/11/11 4:48 PM, Michael Wilde wrote: > > > User info for wilde at mcs.anl.gov > > > ================================= > > > Username: wilde > > > Full Name: Michael Wilde > > > Projects: > > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde > > > ('*' denotes INCITE projects) > > > ================================= > > > > > > > > > Hi ALCF Team, > > > > > > The following known issue in Cobalt is currently preventing us > > > from > > > running Swift on Eureka: > > > > > > http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > > > > > With some additional development effort we can work around this, > > > but > > > it would be much cleaner and better if this were fixed in Cobalt, > > > instead, as suggested in ticket 462 above. > > > > > > Is there any chance that can be done in the next few days? > > > If not, please let me know, and we will implement the work-around > > > instead. > > > > > > This is holding up work on the DOE ParVis project (Rob Jacob, PI) > > > and we've had to move some work we want to run on Eureka to other > > > platforms in the meantime. > > > > > > Thanks very much, > > > > > > Mike > > > > > > 462 is: > > > > > > Ticket #462 (new defect) > > > Opened 7 months ago > > > Cobalt on clusters ignores job script arguments > > > > > > Reported by: acherry > > > Priority: major > > > Component: clients > > > > > > Description > > > > > > It appears that cobalt-launcher.py does not support running a job > > > script or executable with command arguments, even though qsub will > > > accept the arguments, and the man page and help for qsub indicates > > > that arguments are accepted. > > > > > > I'm filing this as a bug rather than a feature request, since the > > > behavior isn't consistent with the documentation. But I'd rather > > > the > > > fix for this to be adding support for args, rather than changing > > > the > > > docs to say they aren't accepted. :-) > > > > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Jan 11 18:08:32 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 18:08:32 -0600 (CST) Subject: [Swift-devel] Manual start script for persistent coasters on Cobalt and other schedulers In-Reply-To: <413995418.52513.1294789231634.JavaMail.root@zimbra.anl.gov> Message-ID: <758452773.52563.1294790912749.JavaMail.root@zimbra.anl.gov> was: Re: [Swift-devel] Re: [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? 
David, the evolving Swift R package has a start-swift command in this directory: https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SwiftR/Swift/exec which has the logic needed to start a manual persistent passive coaster pool on both clusters and workstations. You'll need to pick up the files that start-swift sources from that same directory, and remove the final stage of the script where it actually launches Swift (that part is just for the Swift R service). You'll want to keep the part where it launches the Swift script "passivate.swift" to force the persistent service into passive mode. I think that with some cleanup and much testing, this script could be adapted to launch all means of manual coaster configurations. Justin has expressed the view that perhaps this whole process can not be scripted cleanly, and that we instead should provide tools for the user to do this manually. I would like to try, though, to see if this script can be made clean and reliable, and then we could place it in Swift and factor it out of SwiftR. I'm willing to help you get this set up and tested. - Mike ----- Original Message ----- > One workaround we can try here, which may be more valuable than a temp > fix, would be to make a more user-ready script to launch manual > coasters (persistent/passive) on any cluster. > > We have several such scripts floating around; probably Sheri could use > one if it were only slightly polished. > > That would be a good project for you, David. > > Such a script would be useful on any cluster, and would need only > slight flexibility to specify the batch jobs for various PBS, SGE, > Cobalt, and Slurm systems. > > It has all the drawbacks of manual coasters (which some folks like) > and is a usage mode we want to support. > > Justin, you noted yesterday that its hard to make such a script > general. Maybe if we split the script into 2 variants (one for > clusters, and one for sets of workstations) that would ake the > resultant scripts more maintainable and testable? > > - Mike > > > ----- Original Message ----- > > Thanks, Rich and Andrew, for the very fast responses. > > > > We'll try the work-around, then. > > > > Regards, > > > > - Mike > > > > > > ----- Original Message ----- > > > Michael, > > > > > > Unfortunately a fix for this will, at this point in time, take a > > > minimum > > > of four weeks to deploy to a production resource like Eureka, due > > > to > > > our > > > testing, upgrade and maintenance procedures. > > > > > > As a workaround for this on Eureka, since every job effectively > > > runs > > > in > > > script mode, you should be able to set environment variables > > > within > > > the > > > script that you submit to Cobalt. > > > > > > We apologize for the inconvenience. Let us know if you have any > > > other > > > questions. 
> > > > > > -- > > > Paul Rich > > > ALCF Operations -- AIG > > > richp at alcf.anl.gov > > > > > > > > > On 1/11/11 4:48 PM, Michael Wilde wrote: > > > > User info for wilde at mcs.anl.gov > > > > ================================= > > > > Username: wilde > > > > Full Name: Michael Wilde > > > > Projects: > > > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde > > > > ('*' denotes INCITE projects) > > > > ================================= > > > > > > > > > > > > Hi ALCF Team, > > > > > > > > The following known issue in Cobalt is currently preventing us > > > > from > > > > running Swift on Eureka: > > > > > > > > http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > > > > > > > With some additional development effort we can work around this, > > > > but > > > > it would be much cleaner and better if this were fixed in > > > > Cobalt, > > > > instead, as suggested in ticket 462 above. > > > > > > > > Is there any chance that can be done in the next few days? > > > > If not, please let me know, and we will implement the > > > > work-around > > > > instead. > > > > > > > > This is holding up work on the DOE ParVis project (Rob Jacob, > > > > PI) > > > > and we've had to move some work we want to run on Eureka to > > > > other > > > > platforms in the meantime. > > > > > > > > Thanks very much, > > > > > > > > Mike > > > > > > > > 462 is: > > > > > > > > Ticket #462 (new defect) > > > > Opened 7 months ago > > > > Cobalt on clusters ignores job script arguments > > > > > > > > Reported by: acherry > > > > Priority: major > > > > Component: clients > > > > > > > > Description > > > > > > > > It appears that cobalt-launcher.py does not support running a > > > > job > > > > script or executable with command arguments, even though qsub > > > > will > > > > accept the arguments, and the man page and help for qsub > > > > indicates > > > > that arguments are accepted. > > > > > > > > I'm filing this as a bug rather than a feature request, since > > > > the > > > > behavior isn't consistent with the documentation. But I'd rather > > > > the > > > > fix for this to be adding support for args, rather than changing > > > > the > > > > docs to say they aren't accepted. :-) > > > > > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Jan 11 19:08:21 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 17:08:21 -0800 Subject: [Swift-devel] Re: waitpid mini-benchmark In-Reply-To: <1222744546.52320.1294786868334.JavaMail.root@zimbra.anl.gov> References: <1222744546.52320.1294786868334.JavaMail.root@zimbra.anl.gov> Message-ID: <1294794501.21653.2.camel@blabla2.none> On Tue, 2011-01-11 at 17:01 -0600, Michael Wilde wrote: > Agreed. That was the purpose of measuring it. And it's how things are properly done. 
I'm not complaining. > Probably should make some easy way to pass command line options to the worker. > > I think we felt all along that the current logic was OK for workersPerNode <16 or so. > > The concern was only related to running workersPerNode much higher, like O(100s), which would only be done for experiments (what we've been calling "over-clocking the nodes" with sleep jobs). Ie, to use one rack of the BG/P to simulate the client load of 40 racks. > > So I agree - this task is off the list. > > - Mike > > > ----- Original Message ----- > > That's 0.3ms for 100 jobs? If yes, I don't think we should worry about > > it beyond making sure that we don't poll very often (e.g. more than 10 > > times/s). > > > > On Tue, 2011-01-11 at 15:34 -0600, Justin M Wozniak wrote: > > > Just posted the results of a Perl waitpid() mini-benchmark: > > > > > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/MiniBenchmarks > > > > From hategan at mcs.anl.gov Tue Jan 11 19:10:57 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 17:10:57 -0800 Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <725770737.52387.1294787710456.JavaMail.root@zimbra.anl.gov> References: <725770737.52387.1294787710456.JavaMail.root@zimbra.anl.gov> Message-ID: <1294794657.21653.4.camel@blabla2.none> On Tue, 2011-01-11 at 17:15 -0600, Michael Wilde wrote: > Consolidating what we discussed on the list into an update on these issues: [...] > > - support Allan to achieve ExTENCI data transfer goals > > This is two issues: > > HIGH: 1. coasters fails at ~ >60 workers per service > > ^^^ bubbles up to higher prio > > HIGH: 2. throughput w/ provider staging is about 2jobs/sec/service with 3MB data input file per job. Allan needs at least 10 jobs/sec/service for ExTENCI runs. Right. I need to look at these. > > Its possible that issue #2 is the same as the next issue below, "coaster time gap issue". > > ^^^ Justin and Allan are both working to isolate this at the moment. > > > - assess coaster time gap issue with Justin; fix if needed I am not familiar with this. [...] > > - add option to set coaster-service to passive mode > > - low prio. Is its trivial to add? > What is this? From hategan at mcs.anl.gov Tue Jan 11 19:12:43 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 17:12:43 -0800 Subject: [Swift-devel] Fwd: poor performance In-Reply-To: <1180898859.52477.1294788837654.JavaMail.root@zimbra.anl.gov> References: <1180898859.52477.1294788837654.JavaMail.root@zimbra.anl.gov> Message-ID: <1294794763.21653.6.camel@blabla2.none> On Tue, 2011-01-11 at 17:33 -0600, Michael Wilde wrote: > Marc confirms that he was using an un-fixed release. Im building an 0.92 on engage submit now. > > If you have suggestions on score/throttle values to accelerate the > scheduler's default ramp up, would be good to try (in addition to a > test of the defaults). I'd say start with, say, 20 jobs on each site by changing the initial score (don't know what the value would be though). From hategan at mcs.anl.gov Tue Jan 11 19:14:16 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 11 Jan 2011 17:14:16 -0800 Subject: [Swift-devel] Re: [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? In-Reply-To: <2039357798.52486.1294788948030.JavaMail.root@zimbra.anl.gov> References: <2039357798.52486.1294788948030.JavaMail.root@zimbra.anl.gov> Message-ID: <1294794856.21653.7.camel@blabla2.none> We should insist that the issue be fixed. 
The fact that it can't be done quickly doesn't mean that it shouldn't be done. But they do have their priorities. On Tue, 2011-01-11 at 17:35 -0600, Michael Wilde wrote: > Thanks, Rich and Andrew, for the very fast responses. > > We'll try the work-around, then. > > Regards, > > - Mike > > > ----- Original Message ----- > > Michael, > > > > Unfortunately a fix for this will, at this point in time, take a > > minimum > > of four weeks to deploy to a production resource like Eureka, due to > > our > > testing, upgrade and maintenance procedures. > > > > As a workaround for this on Eureka, since every job effectively runs > > in > > script mode, you should be able to set environment variables within > > the > > script that you submit to Cobalt. > > > > We apologize for the inconvenience. Let us know if you have any other > > questions. > > > > -- > > Paul Rich > > ALCF Operations -- AIG > > richp at alcf.anl.gov > > > > > > On 1/11/11 4:48 PM, Michael Wilde wrote: > > > User info for wilde at mcs.anl.gov > > > ================================= > > > Username: wilde > > > Full Name: Michael Wilde > > > Projects: > > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde > > > ('*' denotes INCITE projects) > > > ================================= > > > > > > > > > Hi ALCF Team, > > > > > > The following known issue in Cobalt is currently preventing us from > > > running Swift on Eureka: > > > > > > http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > > > > > With some additional development effort we can work around this, but > > > it would be much cleaner and better if this were fixed in Cobalt, > > > instead, as suggested in ticket 462 above. > > > > > > Is there any chance that can be done in the next few days? > > > If not, please let me know, and we will implement the work-around > > > instead. > > > > > > This is holding up work on the DOE ParVis project (Rob Jacob, PI) > > > and we've had to move some work we want to run on Eureka to other > > > platforms in the meantime. > > > > > > Thanks very much, > > > > > > Mike > > > > > > 462 is: > > > > > > Ticket #462 (new defect) > > > Opened 7 months ago > > > Cobalt on clusters ignores job script arguments > > > > > > Reported by: acherry > > > Priority: major > > > Component: clients > > > > > > Description > > > > > > It appears that cobalt-launcher.py does not support running a job > > > script or executable with command arguments, even though qsub will > > > accept the arguments, and the man page and help for qsub indicates > > > that arguments are accepted. > > > > > > I'm filing this as a bug rather than a feature request, since the > > > behavior isn't consistent with the documentation. But I'd rather the > > > fix for this to be adding support for args, rather than changing the > > > docs to say they aren't accepted. :-) > > > > > > > From wilde at mcs.anl.gov Tue Jan 11 19:30:30 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 11 Jan 2011 19:30:30 -0600 (CST) Subject: [Swift-devel] Re: [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? In-Reply-To: <2039357798.52486.1294788948030.JavaMail.root@zimbra.anl.gov> Message-ID: <1715669240.52741.1294795830606.JavaMail.root@zimbra.anl.gov> Paul, Andrew, What I think we're going to do on this from the Swift side is temporarily try to use Eureka in a mode where we manually start Swift workers on the cluster using a batch job. 
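Roughly, that means leaving a coaster service running on a login node and submitting a Cobalt job whose only task is to connect workers back to it. A sketch of such a worker job, with placeholder host, port, paths and counts (the worker.pl argument order is recalled from memory, so check it against the script itself):

    #!/bin/bash
    # workers.sh - started by Cobalt with no arguments (see ticket 462);
    # everything the workers need is hard-wired here.
    SERVICE=http://login1.eureka.example.anl.gov:50100   # persistent coaster service
    LOGDIR=$HOME/coaster-worker-logs
    mkdir -p "$LOGDIR"
    for i in $(seq 1 8); do                              # workersPerNode, tune as needed
        perl /path/to/swift/bin/worker.pl "$SERVICE" block-$i "$LOGDIR" &
    done
    wait

It would be submitted along the lines of "qsub -n 1 -t 60 -A <project> ./workers.sh" (Cobalt options from memory; check qsub -h on Eureka), and fanning out to more than one node would need an ssh or mpirun step inside the script.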
We'll wait on testing the Swift Cobolt interface (which is different than the above) until we hear from you that the bug is fixed and ready for testing. So even though it may be many weeks or more away, we'd like to put in our vote for fixing this issue (realizing that you have many other priorities :) Thanks, MIke ----- Original Message ----- > Thanks, Rich and Andrew, for the very fast responses. > > We'll try the work-around, then. > > Regards, > > - Mike > > > ----- Original Message ----- > > Michael, > > > > Unfortunately a fix for this will, at this point in time, take a > > minimum > > of four weeks to deploy to a production resource like Eureka, due to > > our > > testing, upgrade and maintenance procedures. > > > > As a workaround for this on Eureka, since every job effectively runs > > in > > script mode, you should be able to set environment variables within > > the > > script that you submit to Cobalt. > > > > We apologize for the inconvenience. Let us know if you have any > > other > > questions. > > > > -- > > Paul Rich > > ALCF Operations -- AIG > > richp at alcf.anl.gov > > > > > > On 1/11/11 4:48 PM, Michael Wilde wrote: > > > User info for wilde at mcs.anl.gov > > > ================================= > > > Username: wilde > > > Full Name: Michael Wilde > > > Projects: > > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde > > > ('*' denotes INCITE projects) > > > ================================= > > > > > > > > > Hi ALCF Team, > > > > > > The following known issue in Cobalt is currently preventing us > > > from > > > running Swift on Eureka: > > > > > > http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > > > > > With some additional development effort we can work around this, > > > but > > > it would be much cleaner and better if this were fixed in Cobalt, > > > instead, as suggested in ticket 462 above. > > > > > > Is there any chance that can be done in the next few days? > > > If not, please let me know, and we will implement the work-around > > > instead. > > > > > > This is holding up work on the DOE ParVis project (Rob Jacob, PI) > > > and we've had to move some work we want to run on Eureka to other > > > platforms in the meantime. > > > > > > Thanks very much, > > > > > > Mike > > > > > > 462 is: > > > > > > Ticket #462 (new defect) > > > Opened 7 months ago > > > Cobalt on clusters ignores job script arguments > > > > > > Reported by: acherry > > > Priority: major > > > Component: clients > > > > > > Description > > > > > > It appears that cobalt-launcher.py does not support running a job > > > script or executable with command arguments, even though qsub will > > > accept the arguments, and the man page and help for qsub indicates > > > that arguments are accepted. > > > > > > I'm filing this as a bug rather than a feature request, since the > > > behavior isn't consistent with the documentation. But I'd rather > > > the > > > fix for this to be adding support for args, rather than changing > > > the > > > docs to say they aren't accepted. 
:-) > > > > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Tue Jan 11 19:48:03 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 11 Jan 2011 19:48:03 -0600 (Central Standard Time) Subject: [Swift-devel] Re: [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? In-Reply-To: <413995418.52513.1294789231634.JavaMail.root@zimbra.anl.gov> References: <413995418.52513.1294789231634.JavaMail.root@zimbra.anl.gov> Message-ID: On Tue, 11 Jan 2011, Michael Wilde wrote: > Justin, you noted yesterday that its hard to make such a script general. > Maybe if we split the script into 2 variants (one for clusters, and one > for sets of workstations) that would ake the resultant scripts more > maintainable and testable? Yeah, let's get it running there and see where it goes. I'll move my script into the Swift repo and put a test for it in nightly.sh . -- Justin M Wozniak From dk0966 at cs.ship.edu Tue Jan 11 20:11:34 2011 From: dk0966 at cs.ship.edu (David Kelly) Date: Tue, 11 Jan 2011 21:11:34 -0500 Subject: [Swift-devel] Re: Manual start script for persistent coasters on Cobalt and other schedulers In-Reply-To: <758452773.52563.1294790912749.JavaMail.root@zimbra.anl.gov> References: <413995418.52513.1294789231634.JavaMail.root@zimbra.anl.gov> <758452773.52563.1294790912749.JavaMail.root@zimbra.anl.gov> Message-ID: Mike, I will give it a try. Would the configuration for this be similar to the persistent passive coaster configuration used on the MCS machines? For example: passive With each of the 4 worker nodes having it's own entry? Do you happen to know the names of the workers for Gadzooks? Thanks, David On Tue, Jan 11, 2011 at 7:08 PM, Michael Wilde wrote: > was: Re: [Swift-devel] Re: > ?[alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? > > David, the evolving Swift R package has a start-swift command in this directory: > > ?https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SwiftR/Swift/exec > > which has the logic needed to start a manual persistent passive coaster pool on both clusters and workstations. > > You'll need to pick up the files that start-swift sources from that same directory, and remove the final stage of the script where it actually launches Swift (that part is just for the Swift R service). > > You'll want to keep the part where it launches the Swift script "passivate.swift" to force the persistent service into passive mode. > > I think that with some cleanup and much testing, this script could be adapted to launch all means of manual coaster configurations. > > Justin has expressed the view that perhaps this whole process can not be scripted cleanly, and that we instead should provide tools for the user to do this manually. > > I would like to try, though, to see if this script can be made clean and reliable, and then we could place it in Swift and factor it out of SwiftR. > > I'm willing to help you get this set up and tested. > > - Mike > > ----- Original Message ----- >> One workaround we can try here, which may be more valuable than a temp >> fix, would be to make a more user-ready script to launch manual >> coasters (persistent/passive) on any cluster. >> >> We have several such scripts floating around; probably Sheri could use >> one if it were only slightly polished. 
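The pool entry David is asking about is along these lines, shown as the here-document a start-swift-style script might emit (handle, URL, port and paths are placeholders, and the workerManager profile key is recalled from memory, so verify it against a working MCS config):

    cat > sites.xml <<'EOF'
    <config>
      <pool handle="persistent-coasters">
        <execution provider="coaster-persistent"
                   url="http://localhost:50100"
                   jobmanager="local:local"/>
        <profile namespace="globus" key="workerManager">passive</profile>
        <filesystem provider="local"/>
        <workdirectory>/home/someuser/swiftwork</workdirectory>
      </pool>
    </config>
    EOF

One such pool per service should be enough: in passive mode the workers register with the service when worker.pl starts, so the individual worker nodes do not each need a sites entry.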
>> >> That would be a good project for you, David. >> >> Such a script would be useful on any cluster, and would need only >> slight flexibility to specify the batch jobs for various PBS, SGE, >> Cobalt, and Slurm systems. >> >> It has all the drawbacks of manual coasters (which some folks like) >> and is a usage mode we want to support. >> >> Justin, you noted yesterday that its hard to make such a script >> general. Maybe if we split the script into 2 variants (one for >> clusters, and one for sets of workstations) that would ake the >> resultant scripts more maintainable and testable? >> >> - Mike >> >> >> ----- Original Message ----- >> > Thanks, Rich and Andrew, for the very fast responses. >> > >> > We'll try the work-around, then. >> > >> > Regards, >> > >> > - Mike >> > >> > >> > ----- Original Message ----- >> > > Michael, >> > > >> > > Unfortunately a fix for this will, at this point in time, take a >> > > minimum >> > > of four weeks to deploy to a production resource like Eureka, due >> > > to >> > > our >> > > testing, upgrade and maintenance procedures. >> > > >> > > As a workaround for this on Eureka, since every job effectively >> > > runs >> > > in >> > > script mode, you should be able to set environment variables >> > > within >> > > the >> > > script that you submit to Cobalt. >> > > >> > > We apologize for the inconvenience. Let us know if you have any >> > > other >> > > questions. >> > > >> > > -- >> > > Paul Rich >> > > ALCF Operations -- AIG >> > > richp at alcf.anl.gov >> > > >> > > >> > > On 1/11/11 4:48 PM, Michael Wilde wrote: >> > > > User info for wilde at mcs.anl.gov >> > > > ================================= >> > > > Username: wilde >> > > > Full Name: Michael Wilde >> > > > Projects: >> > > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde >> > > > ? ? ? ? ? ? ?('*' denotes INCITE projects) >> > > > ================================= >> > > > >> > > > >> > > > Hi ALCF Team, >> > > > >> > > > The following known issue in Cobalt is currently preventing us >> > > > from >> > > > running Swift on Eureka: >> > > > >> > > > ? ?http://trac.mcs.anl.gov/projects/cobalt/ticket/462 >> > > > >> > > > With some additional development effort we can work around this, >> > > > but >> > > > it would be much cleaner and better if this were fixed in >> > > > Cobalt, >> > > > instead, as suggested in ticket 462 above. >> > > > >> > > > Is there any chance that can be done in the next few days? >> > > > If not, please let me know, and we will implement the >> > > > work-around >> > > > instead. >> > > > >> > > > This is holding up work on the DOE ParVis project (Rob Jacob, >> > > > PI) >> > > > and we've had to move some work we want to run on Eureka to >> > > > other >> > > > platforms in the meantime. >> > > > >> > > > Thanks very much, >> > > > >> > > > Mike >> > > > >> > > > 462 is: >> > > > >> > > > Ticket #462 (new defect) >> > > > Opened 7 months ago >> > > > Cobalt on clusters ignores job script arguments >> > > > >> > > > Reported by: acherry >> > > > Priority: major >> > > > Component: clients >> > > > >> > > > Description >> > > > >> > > > It appears that cobalt-launcher.py does not support running a >> > > > job >> > > > script or executable with command arguments, even though qsub >> > > > will >> > > > accept the arguments, and the man page and help for qsub >> > > > indicates >> > > > that arguments are accepted. 
>> > > > >> > > > I'm filing this as a bug rather than a feature request, since >> > > > the >> > > > behavior isn't consistent with the documentation. But I'd rather >> > > > the >> > > > fix for this to be adding support for args, rather than changing >> > > > the >> > > > docs to say they aren't accepted. :-) >> > > > >> > > > >> > >> > -- >> > Michael Wilde >> > Computation Institute, University of Chicago >> > Mathematics and Computer Science Division >> > Argonne National Laboratory >> > >> > _______________________________________________ >> > Swift-devel mailing list >> > Swift-devel at ci.uchicago.edu >> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > From benc at hawaga.org.uk Wed Jan 12 07:47:04 2011 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 12 Jan 2011 13:47:04 +0000 (GMT) Subject: [Swift-devel] Re: devel prios for Mihael In-Reply-To: <1294784149.21301.0.camel@blabla2.none> References: <2131184763.48562.1294753690160.JavaMail.root@zimbra.anl.gov> <1294784149.21301.0.camel@blabla2.none> Message-ID: > > -- fix swift-plot-log to give useful reports on runs > > Are you sure you want me to do that? I'll just rewrite the whole > thing... It could probably do with a good rewrite - it was a prototype-gone-wild when I last touched it, when I wasn't sure what it should do or how useful it would be. -- From wilde at mcs.anl.gov Wed Jan 12 08:54:48 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 08:54:48 -0600 (CST) Subject: [Swift-devel] Re: Manual start script for persistent coasters on Cobalt and other schedulers In-Reply-To: Message-ID: <1182240605.53668.1294844088426.JavaMail.root@zimbra.anl.gov> David, lets do a skype call in a few hours to discuss. I *think* this command should "just work" to a large extent if you make sure that the helper script is accessible and the "R"-specific stuff is commented out. I last tested it on SGE but it has worked on PADS/PBS. - Mike ----- Original Message ----- > Mike, > > I will give it a try. Would the configuration for this be similar to > the persistent passive coaster configuration used on the MCS machines? > > For example: > jobmanager="local:local"/> > passive > > With each of the 4 worker nodes having it's own entry? Do you happen > to know the names of the workers for Gadzooks? > > Thanks, > David > > On Tue, Jan 11, 2011 at 7:08 PM, Michael Wilde > wrote: > > was: Re: [Swift-devel] Re: > > ?[alcf-support #60887] Can Cobalt command-line bug on Eureka be > > ?fixed? > > > > David, the evolving Swift R package has a start-swift command in > > this directory: > > > > ?https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SwiftR/Swift/exec > > > > which has the logic needed to start a manual persistent passive > > coaster pool on both clusters and workstations. > > > > You'll need to pick up the files that start-swift sources from that > > same directory, and remove the final stage of the script where it > > actually launches Swift (that part is just for the Swift R service). 
> > > > You'll want to keep the part where it launches the Swift script > > "passivate.swift" to force the persistent service into passive mode. > > > > I think that with some cleanup and much testing, this script could > > be adapted to launch all means of manual coaster configurations. > > > > Justin has expressed the view that perhaps this whole process can > > not be scripted cleanly, and that we instead should provide tools > > for the user to do this manually. > > > > I would like to try, though, to see if this script can be made clean > > and reliable, and then we could place it in Swift and factor it out > > of SwiftR. > > > > I'm willing to help you get this set up and tested. > > > > - Mike > > > > ----- Original Message ----- > >> One workaround we can try here, which may be more valuable than a > >> temp > >> fix, would be to make a more user-ready script to launch manual > >> coasters (persistent/passive) on any cluster. > >> > >> We have several such scripts floating around; probably Sheri could > >> use > >> one if it were only slightly polished. > >> > >> That would be a good project for you, David. > >> > >> Such a script would be useful on any cluster, and would need only > >> slight flexibility to specify the batch jobs for various PBS, SGE, > >> Cobalt, and Slurm systems. > >> > >> It has all the drawbacks of manual coasters (which some folks like) > >> and is a usage mode we want to support. > >> > >> Justin, you noted yesterday that its hard to make such a script > >> general. Maybe if we split the script into 2 variants (one for > >> clusters, and one for sets of workstations) that would ake the > >> resultant scripts more maintainable and testable? > >> > >> - Mike > >> > >> > >> ----- Original Message ----- > >> > Thanks, Rich and Andrew, for the very fast responses. > >> > > >> > We'll try the work-around, then. > >> > > >> > Regards, > >> > > >> > - Mike > >> > > >> > > >> > ----- Original Message ----- > >> > > Michael, > >> > > > >> > > Unfortunately a fix for this will, at this point in time, take > >> > > a > >> > > minimum > >> > > of four weeks to deploy to a production resource like Eureka, > >> > > due > >> > > to > >> > > our > >> > > testing, upgrade and maintenance procedures. > >> > > > >> > > As a workaround for this on Eureka, since every job effectively > >> > > runs > >> > > in > >> > > script mode, you should be able to set environment variables > >> > > within > >> > > the > >> > > script that you submit to Cobalt. > >> > > > >> > > We apologize for the inconvenience. Let us know if you have any > >> > > other > >> > > questions. > >> > > > >> > > -- > >> > > Paul Rich > >> > > ALCF Operations -- AIG > >> > > richp at alcf.anl.gov > >> > > > >> > > > >> > > On 1/11/11 4:48 PM, Michael Wilde wrote: > >> > > > User info for wilde at mcs.anl.gov > >> > > > ================================= > >> > > > Username: wilde > >> > > > Full Name: Michael Wilde > >> > > > Projects: > >> > > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde > >> > > > ? ? ? ? ? ? ?('*' denotes INCITE projects) > >> > > > ================================= > >> > > > > >> > > > > >> > > > Hi ALCF Team, > >> > > > > >> > > > The following known issue in Cobalt is currently preventing > >> > > > us > >> > > > from > >> > > > running Swift on Eureka: > >> > > > > >> > > > ? 
?http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > >> > > > > >> > > > With some additional development effort we can work around > >> > > > this, > >> > > > but > >> > > > it would be much cleaner and better if this were fixed in > >> > > > Cobalt, > >> > > > instead, as suggested in ticket 462 above. > >> > > > > >> > > > Is there any chance that can be done in the next few days? > >> > > > If not, please let me know, and we will implement the > >> > > > work-around > >> > > > instead. > >> > > > > >> > > > This is holding up work on the DOE ParVis project (Rob Jacob, > >> > > > PI) > >> > > > and we've had to move some work we want to run on Eureka to > >> > > > other > >> > > > platforms in the meantime. > >> > > > > >> > > > Thanks very much, > >> > > > > >> > > > Mike > >> > > > > >> > > > 462 is: > >> > > > > >> > > > Ticket #462 (new defect) > >> > > > Opened 7 months ago > >> > > > Cobalt on clusters ignores job script arguments > >> > > > > >> > > > Reported by: acherry > >> > > > Priority: major > >> > > > Component: clients > >> > > > > >> > > > Description > >> > > > > >> > > > It appears that cobalt-launcher.py does not support running a > >> > > > job > >> > > > script or executable with command arguments, even though qsub > >> > > > will > >> > > > accept the arguments, and the man page and help for qsub > >> > > > indicates > >> > > > that arguments are accepted. > >> > > > > >> > > > I'm filing this as a bug rather than a feature request, since > >> > > > the > >> > > > behavior isn't consistent with the documentation. But I'd > >> > > > rather > >> > > > the > >> > > > fix for this to be adding support for args, rather than > >> > > > changing > >> > > > the > >> > > > docs to say they aren't accepted. :-) > >> > > > > >> > > > > >> > > >> > -- > >> > Michael Wilde > >> > Computation Institute, University of Chicago > >> > Mathematics and Computer Science Division > >> > Argonne National Laboratory > >> > > >> > _______________________________________________ > >> > Swift-devel mailing list > >> > Swift-devel at ci.uchicago.edu > >> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > >> -- > >> Michael Wilde > >> Computation Institute, University of Chicago > >> Mathematics and Computer Science Division > >> Argonne National Laboratory > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Jan 12 09:01:29 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 09:01:29 -0600 (CST) Subject: [Swift-devel] coaster-service -passive option In-Reply-To: <1294794657.21653.4.camel@blabla2.none> Message-ID: <105200131.53689.1294844489717.JavaMail.root@zimbra.anl.gov> was: Re: devel prios for Mihael > > > - add option to set coaster-service to passive mode > > > > - low prio. Is its trivial to add? > > > > What is this? The coaster-service command needs an option to set it into passive mode. A primary use case for this command is to stand up a persistent service with persistent workers which are started by external means (i.e. outside of coaster-service). 
To do that, the service needs to be in passive mode. We have worked around that to date by running a dummy Swift script that runs a single dummy app() call, using a sites entry with a coaster execution provider with passive mode set. Thats the only way I know of to force the service to passive mode. The command line option suggested above to coaster-service would eliminate the need for this hack. - Mike From wilde at mcs.anl.gov Wed Jan 12 09:07:03 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 09:07:03 -0600 (CST) Subject: [Swift-devel] Re: Manual start script for persistent coasters on Cobalt and other schedulers In-Reply-To: <1182240605.53668.1294844088426.JavaMail.root@zimbra.anl.gov> Message-ID: <1886729092.53698.1294844823654.JavaMail.root@zimbra.anl.gov> David, my apologies, I posted the wrong script. The start-swift command no longer starts a coaster service, because it uses a persistent swift command that reads requests from R and runs them. So the coaster service is embedded in the persistent swift. Lets look at the one Justin posted. I suspect you can merge the logic in the SwiftR start-swift command that starts the workers with Justin's logic that start the service. - Mike ----- Original Message ----- > David, lets do a skype call in a few hours to discuss. > > I *think* this command should "just work" to a large extent if you > make sure that the helper script is accessible and the "R"-specific > stuff is commented out. > > I last tested it on SGE but it has worked on PADS/PBS. > > - Mike > > > ----- Original Message ----- > > Mike, > > > > I will give it a try. Would the configuration for this be similar to > > the persistent passive coaster configuration used on the MCS > > machines? > > > > For example: > > > jobmanager="local:local"/> > > passive > > > > With each of the 4 worker nodes having it's own entry? Do you happen > > to know the names of the workers for Gadzooks? > > > > Thanks, > > David > > > > On Tue, Jan 11, 2011 at 7:08 PM, Michael Wilde > > wrote: > > > was: Re: [Swift-devel] Re: > > > ?[alcf-support #60887] Can Cobalt command-line bug on Eureka be > > > ?fixed? > > > > > > David, the evolving Swift R package has a start-swift command in > > > this directory: > > > > > > ?https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SwiftR/Swift/exec > > > > > > which has the logic needed to start a manual persistent passive > > > coaster pool on both clusters and workstations. > > > > > > You'll need to pick up the files that start-swift sources from > > > that > > > same directory, and remove the final stage of the script where it > > > actually launches Swift (that part is just for the Swift R > > > service). > > > > > > You'll want to keep the part where it launches the Swift script > > > "passivate.swift" to force the persistent service into passive > > > mode. > > > > > > I think that with some cleanup and much testing, this script could > > > be adapted to launch all means of manual coaster configurations. > > > > > > Justin has expressed the view that perhaps this whole process can > > > not be scripted cleanly, and that we instead should provide tools > > > for the user to do this manually. > > > > > > I would like to try, though, to see if this script can be made > > > clean > > > and reliable, and then we could place it in Swift and factor it > > > out > > > of SwiftR. > > > > > > I'm willing to help you get this set up and tested. 
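For the record, the dummy-script workaround described earlier in this thread (pending a -passive style option on coaster-service) is roughly this sequence; command options and file names are placeholders, so check coaster-service -help and the SwiftR exec directory for the real ones:

    # 1. leave a persistent coaster service running on the login node
    nohup coaster-service ... > coaster-service.log 2>&1 &

    # 2. run one trivial job against it, using a coaster-persistent sites
    #    entry with workerManager=passive; this flips the service to passive
    swift -sites.file passive-sites.xml -tc.file tc.data passivate.swift

    # 3. start workers by hand (batch job, ssh, etc.), pointing worker.pl
    #    at the service URL; real runs then use the same sites entry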
> > > > > > - Mike > > > > > > ----- Original Message ----- > > >> One workaround we can try here, which may be more valuable than a > > >> temp > > >> fix, would be to make a more user-ready script to launch manual > > >> coasters (persistent/passive) on any cluster. > > >> > > >> We have several such scripts floating around; probably Sheri > > >> could > > >> use > > >> one if it were only slightly polished. > > >> > > >> That would be a good project for you, David. > > >> > > >> Such a script would be useful on any cluster, and would need only > > >> slight flexibility to specify the batch jobs for various PBS, > > >> SGE, > > >> Cobalt, and Slurm systems. > > >> > > >> It has all the drawbacks of manual coasters (which some folks > > >> like) > > >> and is a usage mode we want to support. > > >> > > >> Justin, you noted yesterday that its hard to make such a script > > >> general. Maybe if we split the script into 2 variants (one for > > >> clusters, and one for sets of workstations) that would ake the > > >> resultant scripts more maintainable and testable? > > >> > > >> - Mike > > >> > > >> > > >> ----- Original Message ----- > > >> > Thanks, Rich and Andrew, for the very fast responses. > > >> > > > >> > We'll try the work-around, then. > > >> > > > >> > Regards, > > >> > > > >> > - Mike > > >> > > > >> > > > >> > ----- Original Message ----- > > >> > > Michael, > > >> > > > > >> > > Unfortunately a fix for this will, at this point in time, > > >> > > take > > >> > > a > > >> > > minimum > > >> > > of four weeks to deploy to a production resource like Eureka, > > >> > > due > > >> > > to > > >> > > our > > >> > > testing, upgrade and maintenance procedures. > > >> > > > > >> > > As a workaround for this on Eureka, since every job > > >> > > effectively > > >> > > runs > > >> > > in > > >> > > script mode, you should be able to set environment variables > > >> > > within > > >> > > the > > >> > > script that you submit to Cobalt. > > >> > > > > >> > > We apologize for the inconvenience. Let us know if you have > > >> > > any > > >> > > other > > >> > > questions. > > >> > > > > >> > > -- > > >> > > Paul Rich > > >> > > ALCF Operations -- AIG > > >> > > richp at alcf.anl.gov > > >> > > > > >> > > > > >> > > On 1/11/11 4:48 PM, Michael Wilde wrote: > > >> > > > User info for wilde at mcs.anl.gov > > >> > > > ================================= > > >> > > > Username: wilde > > >> > > > Full Name: Michael Wilde > > >> > > > Projects: > > >> > > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde > > >> > > > ? ? ? ? ? ? ?('*' denotes INCITE projects) > > >> > > > ================================= > > >> > > > > > >> > > > > > >> > > > Hi ALCF Team, > > >> > > > > > >> > > > The following known issue in Cobalt is currently preventing > > >> > > > us > > >> > > > from > > >> > > > running Swift on Eureka: > > >> > > > > > >> > > > ? ?http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > >> > > > > > >> > > > With some additional development effort we can work around > > >> > > > this, > > >> > > > but > > >> > > > it would be much cleaner and better if this were fixed in > > >> > > > Cobalt, > > >> > > > instead, as suggested in ticket 462 above. > > >> > > > > > >> > > > Is there any chance that can be done in the next few days? > > >> > > > If not, please let me know, and we will implement the > > >> > > > work-around > > >> > > > instead. 
> > >> > > > > > >> > > > This is holding up work on the DOE ParVis project (Rob > > >> > > > Jacob, > > >> > > > PI) > > >> > > > and we've had to move some work we want to run on Eureka to > > >> > > > other > > >> > > > platforms in the meantime. > > >> > > > > > >> > > > Thanks very much, > > >> > > > > > >> > > > Mike > > >> > > > > > >> > > > 462 is: > > >> > > > > > >> > > > Ticket #462 (new defect) > > >> > > > Opened 7 months ago > > >> > > > Cobalt on clusters ignores job script arguments > > >> > > > > > >> > > > Reported by: acherry > > >> > > > Priority: major > > >> > > > Component: clients > > >> > > > > > >> > > > Description > > >> > > > > > >> > > > It appears that cobalt-launcher.py does not support running > > >> > > > a > > >> > > > job > > >> > > > script or executable with command arguments, even though > > >> > > > qsub > > >> > > > will > > >> > > > accept the arguments, and the man page and help for qsub > > >> > > > indicates > > >> > > > that arguments are accepted. > > >> > > > > > >> > > > I'm filing this as a bug rather than a feature request, > > >> > > > since > > >> > > > the > > >> > > > behavior isn't consistent with the documentation. But I'd > > >> > > > rather > > >> > > > the > > >> > > > fix for this to be adding support for args, rather than > > >> > > > changing > > >> > > > the > > >> > > > docs to say they aren't accepted. :-) > > >> > > > > > >> > > > > > >> > > > >> > -- > > >> > Michael Wilde > > >> > Computation Institute, University of Chicago > > >> > Mathematics and Computer Science Division > > >> > Argonne National Laboratory > > >> > > > >> > _______________________________________________ > > >> > Swift-devel mailing list > > >> > Swift-devel at ci.uchicago.edu > > >> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >> > > >> -- > > >> Michael Wilde > > >> Computation Institute, University of Chicago > > >> Mathematics and Computer Science Division > > >> Argonne National Laboratory > > >> > > >> _______________________________________________ > > >> Swift-devel mailing list > > >> Swift-devel at ci.uchicago.edu > > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Jan 12 09:19:29 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 09:19:29 -0600 (CST) Subject: [Swift-devel] Re: Manual start script for persistent coasters on Cobalt and other schedulers In-Reply-To: Message-ID: <2079880929.53784.1294845569109.JavaMail.root@zimbra.anl.gov> ----- Original Message ----- > Mike, > > I will give it a try. Would the configuration for this be similar to > the persistent passive coaster configuration used on the MCS machines? Similar code to start the service and put it in passive mode. Different code to start the workers - that you can take from the Swift R package start-swift script. 
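Concretely, the sites entry for such a persistent, passive pool is usually a single entry for the whole pool rather than one per worker node. A rough sketch of what it typically looks like is below; the handle, host, port and work directory are placeholders (not the real MCS or Gadzooks values), the exact URL form and profile key names can differ between Swift versions, and the service and workers are still started separately by the start-swift logic discussed above:

  # Sketch only: write a sites.xml for a persistent, passive coaster pool.
  # SERVICE_HOST, SERVICE_PORT and the work directory are placeholders.
  SERVICE_HOST=localhost
  SERVICE_PORT=50000

  cat > sites.xml <<EOF
  <config>
    <pool handle="persistent-coasters">
      <execution provider="coaster-persistent"
                 url="http://$SERVICE_HOST:$SERVICE_PORT"
                 jobmanager="local:local"/>
      <!-- profile key names may differ slightly between Swift versions -->
      <profile namespace="globus" key="workerManager">passive</profile>
      <filesystem provider="local"/>
      <workdirectory>/tmp/$USER/swiftwork</workdirectory>
    </pool>
  </config>
  EOF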
> For example: > jobmanager="local:local"/> > passive > > With each of the 4 worker nodes having it's own entry? Do you happen > to know the names of the workers for Gadzooks? Gadzooks and its big sister Eureka are clusters, driven by the PBS-like "Cobalt" scheduler. The sites entry, however, will be exactly what you have above. For passive, I *think*, the second "local" in "local:local" is ignored. And for coaster-persistent, the first "local" is ignored. I *think* Cobalt is more like PBS than SGE is. So whats needed is to take the PBS branch from start-swift and adjust the command line options and scheduler directives to work for Cobalt. We can discuss the details when we talk. - Mike > > Thanks, > David > > On Tue, Jan 11, 2011 at 7:08 PM, Michael Wilde > wrote: > > was: Re: [Swift-devel] Re: > > ?[alcf-support #60887] Can Cobalt command-line bug on Eureka be > > ?fixed? > > > > David, the evolving Swift R package has a start-swift command in > > this directory: > > > > ?https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SwiftR/Swift/exec > > > > which has the logic needed to start a manual persistent passive > > coaster pool on both clusters and workstations. > > > > You'll need to pick up the files that start-swift sources from that > > same directory, and remove the final stage of the script where it > > actually launches Swift (that part is just for the Swift R service). > > > > You'll want to keep the part where it launches the Swift script > > "passivate.swift" to force the persistent service into passive mode. > > > > I think that with some cleanup and much testing, this script could > > be adapted to launch all means of manual coaster configurations. > > > > Justin has expressed the view that perhaps this whole process can > > not be scripted cleanly, and that we instead should provide tools > > for the user to do this manually. > > > > I would like to try, though, to see if this script can be made clean > > and reliable, and then we could place it in Swift and factor it out > > of SwiftR. > > > > I'm willing to help you get this set up and tested. > > > > - Mike > > > > ----- Original Message ----- > >> One workaround we can try here, which may be more valuable than a > >> temp > >> fix, would be to make a more user-ready script to launch manual > >> coasters (persistent/passive) on any cluster. > >> > >> We have several such scripts floating around; probably Sheri could > >> use > >> one if it were only slightly polished. > >> > >> That would be a good project for you, David. > >> > >> Such a script would be useful on any cluster, and would need only > >> slight flexibility to specify the batch jobs for various PBS, SGE, > >> Cobalt, and Slurm systems. > >> > >> It has all the drawbacks of manual coasters (which some folks like) > >> and is a usage mode we want to support. > >> > >> Justin, you noted yesterday that its hard to make such a script > >> general. Maybe if we split the script into 2 variants (one for > >> clusters, and one for sets of workstations) that would ake the > >> resultant scripts more maintainable and testable? > >> > >> - Mike > >> > >> > >> ----- Original Message ----- > >> > Thanks, Rich and Andrew, for the very fast responses. > >> > > >> > We'll try the work-around, then. 
> >> > > >> > Regards, > >> > > >> > - Mike > >> > > >> > > >> > ----- Original Message ----- > >> > > Michael, > >> > > > >> > > Unfortunately a fix for this will, at this point in time, take > >> > > a > >> > > minimum > >> > > of four weeks to deploy to a production resource like Eureka, > >> > > due > >> > > to > >> > > our > >> > > testing, upgrade and maintenance procedures. > >> > > > >> > > As a workaround for this on Eureka, since every job effectively > >> > > runs > >> > > in > >> > > script mode, you should be able to set environment variables > >> > > within > >> > > the > >> > > script that you submit to Cobalt. > >> > > > >> > > We apologize for the inconvenience. Let us know if you have any > >> > > other > >> > > questions. > >> > > > >> > > -- > >> > > Paul Rich > >> > > ALCF Operations -- AIG > >> > > richp at alcf.anl.gov > >> > > > >> > > > >> > > On 1/11/11 4:48 PM, Michael Wilde wrote: > >> > > > User info for wilde at mcs.anl.gov > >> > > > ================================= > >> > > > Username: wilde > >> > > > Full Name: Michael Wilde > >> > > > Projects: > >> > > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde > >> > > > ? ? ? ? ? ? ?('*' denotes INCITE projects) > >> > > > ================================= > >> > > > > >> > > > > >> > > > Hi ALCF Team, > >> > > > > >> > > > The following known issue in Cobalt is currently preventing > >> > > > us > >> > > > from > >> > > > running Swift on Eureka: > >> > > > > >> > > > ? ?http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > >> > > > > >> > > > With some additional development effort we can work around > >> > > > this, > >> > > > but > >> > > > it would be much cleaner and better if this were fixed in > >> > > > Cobalt, > >> > > > instead, as suggested in ticket 462 above. > >> > > > > >> > > > Is there any chance that can be done in the next few days? > >> > > > If not, please let me know, and we will implement the > >> > > > work-around > >> > > > instead. > >> > > > > >> > > > This is holding up work on the DOE ParVis project (Rob Jacob, > >> > > > PI) > >> > > > and we've had to move some work we want to run on Eureka to > >> > > > other > >> > > > platforms in the meantime. > >> > > > > >> > > > Thanks very much, > >> > > > > >> > > > Mike > >> > > > > >> > > > 462 is: > >> > > > > >> > > > Ticket #462 (new defect) > >> > > > Opened 7 months ago > >> > > > Cobalt on clusters ignores job script arguments > >> > > > > >> > > > Reported by: acherry > >> > > > Priority: major > >> > > > Component: clients > >> > > > > >> > > > Description > >> > > > > >> > > > It appears that cobalt-launcher.py does not support running a > >> > > > job > >> > > > script or executable with command arguments, even though qsub > >> > > > will > >> > > > accept the arguments, and the man page and help for qsub > >> > > > indicates > >> > > > that arguments are accepted. > >> > > > > >> > > > I'm filing this as a bug rather than a feature request, since > >> > > > the > >> > > > behavior isn't consistent with the documentation. But I'd > >> > > > rather > >> > > > the > >> > > > fix for this to be adding support for args, rather than > >> > > > changing > >> > > > the > >> > > > docs to say they aren't accepted. 
:-) > >> > > > > >> > > > > >> > > >> > -- > >> > Michael Wilde > >> > Computation Institute, University of Chicago > >> > Mathematics and Computer Science Division > >> > Argonne National Laboratory > >> > > >> > _______________________________________________ > >> > Swift-devel mailing list > >> > Swift-devel at ci.uchicago.edu > >> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > >> -- > >> Michael Wilde > >> Computation Institute, University of Chicago > >> Mathematics and Computer Science Division > >> Argonne National Laboratory > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Jan 12 10:10:13 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 10:10:13 -0600 (CST) Subject: [Swift-devel] bugzilla admin privileges In-Reply-To: Message-ID: <1623196513.54117.1294848613476.JavaMail.root@zimbra.anl.gov> Justin, Sarah, Mihael - you now all have bugzilla admin privileges. Justin, I assume you wanted this to tune up the defaults to better fit the current system, release numbers, process, etc. That sounds good. - Mike ----- Forwarded Message ----- From: "Ken Raffenetti" To: wilde at mcs.anl.gov Sent: Wednesday, January 12, 2011 9:54:37 AM Subject: [mcs-systems #60883] Request for bugzilla admin privileges This is all set. Ken On Tue Jan 11 16:39:04 2011, wilde wrote: > Hi Systems, > > Please give the following people admin access to Swift bugzilla: > - skenny, wozniak, hategan, wilde > > Thanks, > > Mike > > > ----- Forwarded Message ----- > From: "Justin M Wozniak" > To: swift-devel at ci.uchicago.edu > Sent: Tuesday, January 11, 2011 10:29:24 AM > Subject: [Swift-devel] Bugzilla admin request > > > Can someone add me as a Bugzilla admin? > Thanks > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Wed Jan 12 10:15:22 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 12 Jan 2011 10:15:22 -0600 (CST) Subject: [Swift-devel] bugzilla admin privileges In-Reply-To: <1623196513.54117.1294848613476.JavaMail.root@zimbra.anl.gov> References: <1623196513.54117.1294848613476.JavaMail.root@zimbra.anl.gov> Message-ID: On Wed, 12 Jan 2011, Michael Wilde wrote: > Justin, Sarah, Mihael - you now all have bugzilla admin privileges. > > Justin, I assume you wanted this to tune up the defaults to better fit > the current system, release numbers, process, etc. That sounds good. Right. 
-- 
Justin M Wozniak

From bugzilla-daemon at mcs.anl.gov Wed Jan 12 12:57:07 2011
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 12 Jan 2011 12:57:07 -0600 (CST)
Subject: [Swift-devel] [Bug 247] New: Add passive mode to persistent coaster service
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=247

           Summary: Add passive mode to persistent coaster service
           Product: Swift
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: hategan at mcs.anl.gov
        ReportedBy: hategan at mcs.anl.gov

When passive worker management is used, the coaster service needs to
initialize a local service for the workers to connect to. When the coaster
service is started automatically, this initialization also happens
automatically. However, when the stand-alone service is used, the user would
have to wait until the first job is submitted, which is undesirable.

This should be fixed by adding a "-passive" flag to the coaster service which
would initialize the passive worker manager when the service is started.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From hategan at mcs.anl.gov Wed Jan 12 12:57:08 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 12 Jan 2011 10:57:08 -0800
Subject: [Swift-devel] Re: coaster-service -passive option
In-Reply-To: <105200131.53689.1294844489717.JavaMail.root@zimbra.anl.gov>
References: <105200131.53689.1294844489717.JavaMail.root@zimbra.anl.gov>
Message-ID: <1294858628.3835.4.camel@blabla2.none>

On Wed, 2011-01-12 at 09:01 -0600, Michael Wilde wrote:
> was: Re: devel prios for Mihael
> 
> > > > - add option to set coaster-service to passive mode
> > > 
> > > - low prio. Is it trivial to add?
> > > 
> > 
> > What is this?
> 
> The coaster-service command needs an option to set it into passive mode.

I see the problem. Bug filed.

From bugzilla-daemon at mcs.anl.gov Wed Jan 12 13:39:47 2011
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Wed, 12 Jan 2011 13:39:47 -0600 (CST)
Subject: [Swift-devel] [Bug 247] Add passive mode to persistent coaster service
In-Reply-To: 
References: 
Message-ID: <20110112193947.B0A7A1BD89@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=247

Michael Wilde changed:

           What    |Removed  |Added
----------------------------------------------------------------------------
       Priority    |P2       |P4
         Status    |NEW      |ASSIGNED
             CC    |         |wilde at mcs.anl.gov
       Severity    |normal   |minor

--- Comment #1 from Michael Wilde 2011-01-12 13:39:47 ---
Justin has observed a Java exception in the service when workers try to
connect before the "dummy" job has initialized the coaster provider and put
the service into passive mode.

-- 
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.
You are watching the reporter.
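Until bug 247 is implemented, the workaround described earlier in the thread still applies: run a trivial one-app Swift script against the persistent service so the coaster provider initializes it and flips it into passive mode. A rough sketch is below, modeled on the catsn example used elsewhere in these threads; the app name, tc.data entry and file names are illustrative only, and the "-passive" flag itself does not exist yet:

  # Sketch of the "dummy job" workaround: one trivial app() call run against
  # the persistent service forces it into passive mode.
  # Assumes sites.xml describes the coaster-persistent/passive pool and that
  # "true" is defined in tc.data (e.g. mapped to /bin/true on the site).
  cat > passivate.swift <<'EOF'
  type file;

  app (file o) passivate()
  {
      true stdout=@o;
  }

  file out<"passivate.out">;
  out = passivate();
  EOF

  swift -sites.file sites.xml -tc.file tc.data passivate.swift

  # Once bug 247 is done, starting the service with the proposed -passive
  # flag should make this extra run unnecessary.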
From aespinosa at cs.uchicago.edu Wed Jan 12 15:37:12 2011
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 12 Jan 2011 15:37:12 -0600
Subject: [Swift-devel] provider staging stage-in rate on localhost and PADS
Message-ID: 

setup1:

N workers on localhost (10, 20, 40)
coaster service on localhost
swift client on localhost

10,000 2.3MB files on local disk (/var/tmp/RuptureVariations on communicado)
Jobs are simple 'cat inputfile' jobs. No data is explicitly staged
out in the app() invocation.

Result:
Max transfer rate is 7MB/s with 10 workers, resulting in a rate of about
2.5 jobs staged per second.

I attached the plot of bandwidth (red = 10 workers, green = 20
workers, blue = 40 workers).

We may be CPU bound at this point. I'll try testing on
communicado->bridled next.

setup2: (communicado->PADS)

N workers on PADS workers (10, 40)
coaster service on communicado
swift client on communicado

pads.png shows the plot of bandwidth (max is around 5.5MB/s).
Legend: black = 10 workers, red = 40 workers.

This translates to a job rate of 2 jobs/second.

-Allan

-- 
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
-------------- next part --------------
A non-text attachment was scrubbed...
Name: local.png
Type: image/png
Size: 35236 bytes
Desc: not available
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pads.png
Type: image/png
Size: 28595 bytes
Desc: not available
URL: 

From dsk at ci.uchicago.edu Wed Jan 12 15:57:26 2011
From: dsk at ci.uchicago.edu (Daniel S. Katz)
Date: Wed, 12 Jan 2011 15:57:26 -0600
Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS
In-Reply-To: 
References: 
Message-ID: <39FC459D-BD6C-47AC-AE66-9BA8BC4C7D98@ci.uchicago.edu>

As I read this, it appears that there is some limit in Swift, rather than
in the hardware, that is causing these numbers to be very low.

Mihael, do you agree? Can you help us figure out what's going on?

Dan

On Jan 12, 2011, at 3:37 PM, Allan Espinosa wrote:

> setup1:
> 
> N workers on localhost (10, 20, 40)
> coaster service on localhost
> swift client on localhost
> 
> 10,000 2.3MB files on local disk (/var/tmp/RuptureVariations on communicado)
> Jobs are simple 'cat inputfile' jobs. No data is explicitly staged
> out in the app() invocation.
> 
> Result:
> Max transfer rate is 7MB/s with 10 workers, resulting in a rate of about
> 2.5 jobs staged per second.
> 
> I attached the plot of bandwidth (red = 10 workers, green = 20
> workers, blue = 40 workers).
> 
> We may be CPU bound at this point. I'll try testing on
> communicado->bridled next.
> 
> setup2: (communicado->PADS)
> 
> N workers on PADS workers (10, 40)
> coaster service on communicado
> swift client on communicado
> 
> pads.png shows the plot of bandwidth (max is around 5.5MB/s).
> Legend: black = 10 workers, red = 40 workers.
> 
> This translates to a job rate of 2 jobs/second.
> 
> -Allan
> 
> -- 
> Allan M. Espinosa
> PhD student, Computer Science
> University of Chicago

-- 
Daniel S. Katz
University of Chicago
(773) 834-7186 (voice)
(773) 834-3700 (fax)
d.katz at ieee.org or dsk at ci.uchicago.edu
http://www.ci.uchicago.edu/~dsk/

From wozniak at mcs.anl.gov Wed Jan 12 17:06:16 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Wed, 12 Jan 2011 17:06:16 -0600 (CST)
Subject: [Swift-devel] provider staging stage-in rate on localhost and PADS
In-Reply-To: 
References: 
Message-ID: 

I'm actually trying to isolate this as well - can you try running this with
very small files?
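For instance, something along these lines would build a small-file version of the same input set (a sketch only; the directory, count and file size are arbitrary):

  # Generate 10,000 tiny input files for a small-file rerun of the test.
  mkdir -p /var/tmp/smallfiles
  for i in $(seq -w 1 10000); do
      echo x > /var/tmp/smallfiles/data.$i
  done
  # With ~1-byte inputs, whatever jobs/s remains is essentially pure per-job
  # overhead (submission and staging round trips), with no data-movement cost.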
On Wed, 12 Jan 2011, Allan Espinosa wrote: > setup1: > > N workers on localhost (10, 20, 40) > coaster service on localhost > swift client on localhost > > 10,000 2.3MB files in Localdisk (/var/tmp/RuptureVariations in communicado) > jobs are simple 'cat inputfile' jobs. no data is explicitly staged > out in the app() invocation > > Result: > max transfer rate is 7MB/s with 10 workers resulting to a 2.5 jobs > staged per second rate. > > I attached the plot of bandwidth (red = 10 workers, green = 20 > workers, blue = 40 workers). > > We maybe CPU bound at this point. i'll try testing on > communicado->bridled next. > > setup2: (communicado->PADS) > > N workers on PADS workers (10, 40) > coaster service on communicadp > swift client on communicado > > pads.png shows the plot of bandwidth (max is around 5.5MB/s). > legend: black = 10 workers, red = 40 workers > > this translates to a jobrate of 2 jobs/ second > > -Allan > > -- Justin M Wozniak From hategan at mcs.anl.gov Wed Jan 12 19:01:15 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 12 Jan 2011 17:01:15 -0800 Subject: [Swift-devel] provider staging stage-in rate on localhost and PADS In-Reply-To: References: Message-ID: <1294880475.3835.5.camel@blabla2.none> And then try to run this with very large files. Essentially that will tell whether the limit is in the coaster i/o bandwidth or the job throughput. On Wed, 2011-01-12 at 17:06 -0600, Justin M Wozniak wrote: > I'm actually trying to isolate this as well- can you try running this with > very small files? > > On Wed, 12 Jan 2011, Allan Espinosa wrote: > > > setup1: > > > > N workers on localhost (10, 20, 40) > > coaster service on localhost > > swift client on localhost > > > > 10,000 2.3MB files in Localdisk (/var/tmp/RuptureVariations in communicado) > > jobs are simple 'cat inputfile' jobs. no data is explicitly staged > > out in the app() invocation > > > > Result: > > max transfer rate is 7MB/s with 10 workers resulting to a 2.5 jobs > > staged per second rate. > > > > I attached the plot of bandwidth (red = 10 workers, green = 20 > > workers, blue = 40 workers). > > > > We maybe CPU bound at this point. i'll try testing on > > communicado->bridled next. > > > > setup2: (communicado->PADS) > > > > N workers on PADS workers (10, 40) > > coaster service on communicadp > > swift client on communicado > > > > pads.png shows the plot of bandwidth (max is around 5.5MB/s). > > legend: black = 10 workers, red = 40 workers > > > > this translates to a jobrate of 2 jobs/ second > > > > -Allan > > > > > From hategan at mcs.anl.gov Wed Jan 12 19:05:32 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 12 Jan 2011 17:05:32 -0800 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <39FC459D-BD6C-47AC-AE66-9BA8BC4C7D98@ci.uchicago.edu> References: <39FC459D-BD6C-47AC-AE66-9BA8BC4C7D98@ci.uchicago.edu> Message-ID: <1294880732.3835.7.camel@blabla2.none> On Wed, 2011-01-12 at 15:57 -0600, Daniel S. Katz wrote: > As I read this, it appears that there is some limit in Swift, rather than in the hardware, that is causing these numbers to be very low. > > Mihael, do you agree? Yes, but that does not exclude a limit in hardware in other configurations. > Can you help us figure out what's going on? Of course. 
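A large-file variant of the same test can be generated the same way (again only a sketch; sizes and counts are arbitrary):

  # Generate a few hundred ~100 MB input files for the large-file comparison.
  mkdir -p /var/tmp/bigfiles
  for i in $(seq -w 1 200); do
      dd if=/dev/zero of=/var/tmp/bigfiles/data.$i bs=1M count=100 2>/dev/null
  done
  # If aggregate MB/s climbs well above the ~7 MB/s seen with 2.3 MB files
  # while jobs/s drops, the earlier numbers were limited by per-job overhead
  # rather than by coaster I/O bandwidth; if MB/s stays flat, the bottleneck
  # is in the data path itself.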
From wilde at mcs.anl.gov Wed Jan 12 19:53:54 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 19:53:54 -0600 (CST) Subject: [Swift-devel] Errors compiling 0.92 with Java 1.5 Message-ID: <1846653395.57370.1294883634798.JavaMail.root@zimbra.anl.gov> I get the errors below when I build 0.92 with this Java and ant on RENCI engage-submit: -- e$ which java /opt/osg/1.0.0/jdk1.5/bin/java e$ java -version java version "1.5.0_14" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) e$ which javac /opt/osg/1.0.0/jdk1.5/bin/javac e$ ant -version Apache Ant version 1.6.5 compiled on June 2 2005 e$ -- The same svn rev seems to uild clean with 1.6 on PADS. I dont think this necessarily needs to be fixed; but at least documented if we no longer work with 1.5. (I recall we said we wont support 1.4 -- or was that 1.5?) - Mike Errors: compile: [echo] [util]: COMPILE [mkdir] Created dir: /home/wilde/swift/src/0.92/cog/modules/util/build [javac] Compiling 53 source files to /home/wilde/swift/src/0.92/cog/modules/util/build [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:71: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:105: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:144: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:227: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:232: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:237: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:242: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:247: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:252: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:257: method does not override a method from its superclass [javac] @Override [javac] ^ [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. 
[javac] 10 errors BUILD FAILED /home/wilde/swift/src/0.92/cog/modules/swift/build.xml:73: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/modules/swift/dependencies.xml:4: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/modules/karajan/build.xml:59: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/modules/karajan/dependencies.xml:4: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/modules/abstraction/build.xml:58: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/modules/abstraction/dependencies.xml:4: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/modules/abstraction-common/build.xml:63: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/modules/abstraction-common/dependencies.xml:7: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/modules/util/build.xml:59: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:465: The following error occurred while executing this line: /home/wilde/swift/src/0.92/cog/mbuild.xml:228: Compile failed; see the compiler error output for details. 
Total time: 20 seconds e$ which java /opt/osg/1.0.0/jdk1.5/bin/java e$ java -version java version "1.5.0_14" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) e$ cd e$ ls bin ftdock glenslogs.tar.gz osgcat.sh osgfact out.ps out.qstat swift t1.stderr t1.sub t2.sub buildnrun.sh glasslogs.tgz mygroup osgcat.sh~ out out.ps2 pfgroup t1.log t1.stdout t1.sub~ e$ which javac /opt/osg/1.0.0/jdk1.5/bin/javac e$ ant -version Apache Ant version 1.6.5 compiled on June 2 2005 e$ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From dk0966 at cs.ship.edu Wed Jan 12 19:58:49 2011 From: dk0966 at cs.ship.edu (David Kelly) Date: Wed, 12 Jan 2011 20:58:49 -0500 Subject: [Swift-devel] Errors compiling 0.92 with Java 1.5 In-Reply-To: <1846653395.57370.1294883634798.JavaMail.root@zimbra.anl.gov> References: <1846653395.57370.1294883634798.JavaMail.root@zimbra.anl.gov> Message-ID: I ran into a similar issue. Here is the patch I wrote for it (there should also be a bug report) On Wed, Jan 12, 2011 at 8:53 PM, Michael Wilde wrote: > I get the errors below when I build 0.92 with this Java and ant on RENCI engage-submit: > > -- > e$ which java > /opt/osg/1.0.0/jdk1.5/bin/java > e$ java -version > java version "1.5.0_14" > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) > Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > e$ which javac > /opt/osg/1.0.0/jdk1.5/bin/javac > e$ ant -version > Apache Ant version 1.6.5 compiled on June 2 2005 > e$ > -- > > The same svn rev seems to uild clean with 1.6 on PADS. > > I dont think this necessarily needs to be fixed; but at least documented if we no longer work with 1.5. (I recall we said we wont support 1.4 -- or was that 1.5?) > > - Mike > > Errors: > > compile: > ? ? [echo] [util]: COMPILE > ? ?[mkdir] Created dir: /home/wilde/swift/src/0.92/cog/modules/util/build > ? ?[javac] Compiling 53 source files to /home/wilde/swift/src/0.92/cog/modules/util/build > ? ?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:71: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ?^ > ? ?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:105: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ? ? ?^ > ? ?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:144: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ? ? ?^ > ? ?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:227: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ?^ > ? ?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:232: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ?^ > ? ?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:237: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ?^ > ? 
?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:242: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ?^ > ? ?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:247: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ?^ > ? ?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:252: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ?^ > ? ?[javac] /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:257: method does not override a method from its superclass > ? ?[javac] ? ? @Override > ? ?[javac] ? ? ?^ > ? ?[javac] Note: Some input files use or override a deprecated API. > ? ?[javac] Note: Recompile with -Xlint:deprecation for details. > ? ?[javac] Note: Some input files use unchecked or unsafe operations. > ? ?[javac] Note: Recompile with -Xlint:unchecked for details. > ? ?[javac] 10 errors > > BUILD FAILED > /home/wilde/swift/src/0.92/cog/modules/swift/build.xml:73: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/modules/swift/dependencies.xml:4: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/modules/karajan/build.xml:59: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/modules/karajan/dependencies.xml:4: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/modules/abstraction/build.xml:58: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/modules/abstraction/dependencies.xml:4: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/modules/abstraction-common/build.xml:63: The following 
error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/modules/abstraction-common/dependencies.xml:7: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/modules/util/build.xml:59: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:465: The following error occurred while executing this line: > /home/wilde/swift/src/0.92/cog/mbuild.xml:228: Compile failed; see the compiler error output for details. > > Total time: 20 seconds > e$ which java > /opt/osg/1.0.0/jdk1.5/bin/java > e$ java -version > java version "1.5.0_14" > Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03) > Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > e$ cd > e$ ls > bin ? ? ? ? ? ftdock ? ? ? ? glenslogs.tar.gz ?osgcat.sh ? osgfact ?out.ps ? out.qstat ?swift ? t1.stderr ?t1.sub ? t2.sub > buildnrun.sh ?glasslogs.tgz ?mygroup ? ? ? ? ? osgcat.sh~ ?out ? ? ?out.ps2 ?pfgroup ? ?t1.log ?t1.stdout ?t1.sub~ > e$ which javac > /opt/osg/1.0.0/jdk1.5/bin/javac > e$ ant -version > Apache Ant version 1.6.5 compiled on June 2 2005 > e$ > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -------------- next part -------------- A non-text attachment was scrubbed... Name: overridepatch.diff Type: text/x-patch Size: 2244 bytes Desc: not available URL: From wilde at mcs.anl.gov Wed Jan 12 20:24:30 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 20:24:30 -0600 (CST) Subject: [Swift-devel] Errors compiling 0.92 with Java 1.5 In-Reply-To: Message-ID: <806374967.57404.1294885470290.JavaMail.root@zimbra.anl.gov> Thanks, David. Heres a process question for the group: - should we fix this in 0.92? - by saying 1.6 or above is required (with a poiter to the patch for 1.5 users?) - by applying the patch to 0.92? - if so, how should we mark 0.92 showstoppers in bugzilla? - lets agree on a test-fix-test-release approach for sealing and releasing 0.92 Ie, identify the showstoppers, fix 'em, test, add showstoppers only as necessary, and repeat till no more showstoppers. - Mike ----- Original Message ----- > Yep, it's #239. I saw it first on sisboombah. > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=239 > > On Wed, Jan 12, 2011 at 8:58 PM, David Kelly > wrote: > > I ran into a similar issue. 
Here is the patch I wrote for it (there > > should also be a bug report) > > > > On Wed, Jan 12, 2011 at 8:53 PM, Michael Wilde > > wrote: > >> I get the errors below when I build 0.92 with this Java and ant on > >> RENCI engage-submit: > >> > >> -- > >> e$ which java > >> /opt/osg/1.0.0/jdk1.5/bin/java > >> e$ java -version > >> java version "1.5.0_14" > >> Java(TM) 2 Runtime Environment, Standard Edition (build > >> 1.5.0_14-b03) > >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > >> e$ which javac > >> /opt/osg/1.0.0/jdk1.5/bin/javac > >> e$ ant -version > >> Apache Ant version 1.6.5 compiled on June 2 2005 > >> e$ > >> -- > >> > >> The same svn rev seems to uild clean with 1.6 on PADS. > >> > >> I dont think this necessarily needs to be fixed; but at least > >> documented if we no longer work with 1.5. (I recall we said we wont > >> support 1.4 -- or was that 1.5?) > >> > >> - Mike > >> > >> Errors: > >> > >> compile: > >> ? ? [echo] [util]: COMPILE > >> ? ?[mkdir] Created dir: > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/build > >> ? ?[javac] Compiling 53 source files to > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/build > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:71: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:105: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:144: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:227: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:232: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:237: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:242: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:247: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:252: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] > >> ? ?/home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:257: > >> ? ?method does not override a method from its superclass > >> ? ?[javac] @Override > >> ? ?[javac] ^ > >> ? ?[javac] Note: Some input files use or override a deprecated API. > >> ? 
?[javac] Note: Recompile with -Xlint:deprecation for details. > >> ? ?[javac] Note: Some input files use unchecked or unsafe > >> ? ?operations. > >> ? ?[javac] Note: Recompile with -Xlint:unchecked for details. > >> ? ?[javac] 10 errors > >> > >> BUILD FAILED > >> /home/wilde/swift/src/0.92/cog/modules/swift/build.xml:73: The > >> following error occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/modules/swift/dependencies.xml:4: > >> The following error occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/modules/karajan/build.xml:59: The > >> following error occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/modules/karajan/dependencies.xml:4: > >> The following error occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/modules/abstraction/build.xml:58: > >> The following error occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/modules/abstraction/dependencies.xml:4: > >> The following error occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/modules/abstraction-common/build.xml:63: > >> The following error occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/modules/abstraction-common/dependencies.xml:7: > >> The following error occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error > >> occurred 
while executing this line: > >> /home/wilde/swift/src/0.92/cog/modules/util/build.xml:59: The > >> following error occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:465: The following error > >> occurred while executing this line: > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:228: Compile failed; see > >> the compiler error output for details. > >> > >> Total time: 20 seconds > >> e$ which java > >> /opt/osg/1.0.0/jdk1.5/bin/java > >> e$ java -version > >> java version "1.5.0_14" > >> Java(TM) 2 Runtime Environment, Standard Edition (build > >> 1.5.0_14-b03) > >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > >> e$ cd > >> e$ ls > >> bin ftdock glenslogs.tar.gz osgcat.sh osgfact out.ps out.qstat > >> swift t1.stderr t1.sub t2.sub > >> buildnrun.sh glasslogs.tgz mygroup osgcat.sh~ out out.ps2 pfgroup > >> t1.log t1.stdout t1.sub~ > >> e$ which javac > >> /opt/osg/1.0.0/jdk1.5/bin/javac > >> e$ ant -version > >> Apache Ant version 1.6.5 compiled on June 2 2005 > >> e$ > >> > >> > >> -- > >> Michael Wilde > >> Computation Institute, University of Chicago > >> Mathematics and Computer Science Division > >> Argonne National Laboratory > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Wed Jan 12 20:33:09 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 12 Jan 2011 18:33:09 -0800 Subject: [Swift-devel] Errors compiling 0.92 with Java 1.5 In-Reply-To: <806374967.57404.1294885470290.JavaMail.root@zimbra.anl.gov> References: <806374967.57404.1294885470290.JavaMail.root@zimbra.anl.gov> Message-ID: <1294885989.9297.0.camel@blabla2.none> On Wed, 2011-01-12 at 20:24 -0600, Michael Wilde wrote: > Thanks, David. > > Heres a process question for the group: > > - should we fix this in 0.92? Yes. > - by saying 1.6 or above is required (with a poiter to the patch for 1.5 users?) > - by applying the patch to 0.92? I vote for the latter. > > - if so, how should we mark 0.92 showstoppers in bugzilla? Yes. > > - lets agree on a test-fix-test-release approach for sealing and releasing 0.92 > > Ie, identify the showstoppers, fix 'em, test, add showstoppers only as necessary, and repeat till no more showstoppers. > > - Mike > > > ----- Original Message ----- > > Yep, it's #239. I saw it first on sisboombah. > > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=239 > > > > On Wed, Jan 12, 2011 at 8:58 PM, David Kelly > > wrote: > > > I ran into a similar issue. Here is the patch I wrote for it (there > > > should also be a bug report) > > > > > > On Wed, Jan 12, 2011 at 8:53 PM, Michael Wilde > > > wrote: > > >> I get the errors below when I build 0.92 with this Java and ant on > > >> RENCI engage-submit: > > >> > > >> -- > > >> e$ which java > > >> /opt/osg/1.0.0/jdk1.5/bin/java > > >> e$ java -version > > >> java version "1.5.0_14" > > >> Java(TM) 2 Runtime Environment, Standard Edition (build > > >> 1.5.0_14-b03) > > >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > > >> e$ which javac > > >> /opt/osg/1.0.0/jdk1.5/bin/javac > > >> e$ ant -version > > >> Apache Ant version 1.6.5 compiled on June 2 2005 > > >> e$ > > >> -- > > >> > > >> The same svn rev seems to uild clean with 1.6 on PADS. 
> > >> > > >> I dont think this necessarily needs to be fixed; but at least > > >> documented if we no longer work with 1.5. (I recall we said we wont > > >> support 1.4 -- or was that 1.5?) > > >> > > >> - Mike > > >> > > >> Errors: > > >> > > >> compile: > > >> [echo] [util]: COMPILE > > >> [mkdir] Created dir: > > >> /home/wilde/swift/src/0.92/cog/modules/util/build > > >> [javac] Compiling 53 source files to > > >> /home/wilde/swift/src/0.92/cog/modules/util/build > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:71: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:105: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:144: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:227: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:232: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:237: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:242: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:247: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:252: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] > > >> /home/wilde/swift/src/0.92/cog/modules/util/src/org/globus/cog/util/CopyOnWriteArrayList.java:257: > > >> method does not override a method from its superclass > > >> [javac] @Override > > >> [javac] ^ > > >> [javac] Note: Some input files use or override a deprecated API. > > >> [javac] Note: Recompile with -Xlint:deprecation for details. > > >> [javac] Note: Some input files use unchecked or unsafe > > >> operations. > > >> [javac] Note: Recompile with -Xlint:unchecked for details. 
> > >> [javac] 10 errors > > >> > > >> BUILD FAILED > > >> /home/wilde/swift/src/0.92/cog/modules/swift/build.xml:73: The > > >> following error occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/modules/swift/dependencies.xml:4: > > >> The following error occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/modules/karajan/build.xml:59: The > > >> following error occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/modules/karajan/dependencies.xml:4: > > >> The following error occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/modules/abstraction/build.xml:58: > > >> The following error occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/modules/abstraction/dependencies.xml:4: > > >> The following error occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/modules/abstraction-common/build.xml:63: > > >> The following error occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:444: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:79: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:52: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/modules/abstraction-common/dependencies.xml:7: > > >> The following error occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:163: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:168: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/modules/util/build.xml:59: The > > 
>> following error occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:465: The following error > > >> occurred while executing this line: > > >> /home/wilde/swift/src/0.92/cog/mbuild.xml:228: Compile failed; see > > >> the compiler error output for details. > > >> > > >> Total time: 20 seconds > > >> e$ which java > > >> /opt/osg/1.0.0/jdk1.5/bin/java > > >> e$ java -version > > >> java version "1.5.0_14" > > >> Java(TM) 2 Runtime Environment, Standard Edition (build > > >> 1.5.0_14-b03) > > >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode) > > >> e$ cd > > >> e$ ls > > >> bin ftdock glenslogs.tar.gz osgcat.sh osgfact out.ps out.qstat > > >> swift t1.stderr t1.sub t2.sub > > >> buildnrun.sh glasslogs.tgz mygroup osgcat.sh~ out out.ps2 pfgroup > > >> t1.log t1.stdout t1.sub~ > > >> e$ which javac > > >> /opt/osg/1.0.0/jdk1.5/bin/javac > > >> e$ ant -version > > >> Apache Ant version 1.6.5 compiled on June 2 2005 > > >> e$ > > >> > > >> > > >> -- > > >> Michael Wilde > > >> Computation Institute, University of Chicago > > >> Mathematics and Computer Science Division > > >> Argonne National Laboratory > > >> > > >> _______________________________________________ > > >> Swift-devel mailing list > > >> Swift-devel at ci.uchicago.edu > > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >> > > > > From wilde at mcs.anl.gov Wed Jan 12 23:35:32 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 23:35:32 -0600 (CST) Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <211116763.57671.1294896702027.JavaMail.root@zimbra.anl.gov> Message-ID: <1819085888.57673.1294896932483.JavaMail.root@zimbra.anl.gov> Im trying my first tests of 0.92 on engage-submit, sending 100 trivial cat jobs to 10 OSG sites. My jobs seem to be all dying with the error "Found illegal unescaped double-quote" (see below). Has anyone successfully run a Condor-G job on OSG with 0.92? I'll dig deeper and try the same test with the older version of trunk that Marc has been using here with better success. Will also try a single job run and capture a simpler log and the condor-g submit file. Allan, have you tried 0.92 against COndor-G? If not, could you? Sarah, we should add some Condor-G-to-GT2 testing to 0.92 validation I think. - Mike -- Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job: Could not submit job (condor_submit reported an exit code of 1). 
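For what it's worth, the underlying cause is that the Java 5 compiler rejects @Override on methods that implement an interface method (exactly the "method does not override a method from its superclass" errors above); that usage is only accepted from Java 6 on. So until the patch is applied, 0.92 needs a 1.6 JDK to build. A sketch of both options (the JDK path and the patch level are examples only):

  # Option 1: point the build at a 1.6 JDK (path is an example).
  export JAVA_HOME=/usr/java/jdk1.6.0
  export PATH=$JAVA_HOME/bin:$PATH
  java -version          # should now report 1.6.x
  ant -version

  # Option 2: stay on 1.5 and apply David's patch from bug 239 first
  # (adjust -p0/-p1 to however the diff was generated).
  cd cog
  patch -p0 < overridepatch.diff

  # Then rebuild Swift as usual.
  cd modules/swift
  ant redist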
Submitting job(s) Found illegal unescaped double-quote: "" -e /bin/cat -out outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0065.out -k "" -cdmfile "" -status file -a data.txtThe full arguments you specified were: /osg/data/engage/tmp/osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0065.out -k "" -cdmfile "" -status file -a data.txt Script is: e$ cat catsn.swift type file; app (file o) cat (file i) { cat @i stdout=@o; } file out[]; foreach j in [1:@toint(@arg("n","1"))] { file data<"data.txt">; out[j] = cat(data); } -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Jan 12 23:43:14 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 23:43:14 -0600 (CST) Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <1819085888.57673.1294896932483.JavaMail.root@zimbra.anl.gov> Message-ID: <775856762.57677.1294897394831.JavaMail.root@zimbra.anl.gov> An initial test with an older trunk (~mid-december, swift-r3703 cog-r2925 cog modified locally) seems to work fine with the same tc, sites, and properties file. I need to check what local mods I had applied, but I think its more likely that some Condor submit file quoting fix fell off in 0.92 integration. So Marc, sorry - this release is not usable for you yet. - Mike ----- Original Message ----- > Im trying my first tests of 0.92 on engage-submit, sending 100 trivial > cat jobs to 10 OSG sites. > > My jobs seem to be all dying with the error "Found illegal unescaped > double-quote" (see below). > > Has anyone successfully run a Condor-G job on OSG with 0.92? > > I'll dig deeper and try the same test with the older version of trunk > that Marc has been using here with better success. Will also try a > single job run and capture a simpler log and the condor-g submit file. > > Allan, have you tried 0.92 against COndor-G? If not, could you? > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 validation I > think. > > - Mike > > -- > > Caused by: > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job: Could not submit job (condor_submit reported an > exit code of 1). 
Submitting job(s) > Found illegal unescaped double-quote: "" -e /bin/cat -out > outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt -of > outdir/f.0065.out -k "" -cdmfile "" -status file -a data.txtThe full > arguments you specified were: > /osg/data/engage/tmp/osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out outdir/f.0065.out > -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0065.out -k "" > -cdmfile "" -status file -a data.txt > > > Script is: > > e$ cat catsn.swift > type file; > > app (file o) cat (file i) > { > cat @i stdout=@o; > } > > file out[] prefix="f.",suffix=".out">; > foreach j in [1:@toint(@arg("n","1"))] { > file data<"data.txt">; > out[j] = cat(data); > } > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From bugzilla-daemon at mcs.anl.gov Wed Jan 12 23:49:46 2011 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 12 Jan 2011 23:49:46 -0600 (CST) Subject: [Swift-devel] [Bug 249] New: Condor-G provider gives quoting error Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=249 Summary: Condor-G provider gives quoting error Product: Swift Version: unspecified Platform: All OS/Version: Linux Status: NEW Severity: major Priority: P1 Component: SwiftScript language AssignedTo: hategan at mcs.anl.gov ReportedBy: wilde at mcs.anl.gov My jobs seem to be all dying with the error "Found illegal unescaped double-quote" (see below). Has anyone successfully run a Condor-G job on OSG with 0.92? I'll dig deeper and try the same test with the older version of trunk that Marc has been using here with better success. Will also try a single job run and capture a simpler log and the condor-g submit file. Allan, have you tried 0.92 against COndor-G? If not, could you? Sarah, we should add some Condor-G-to-GT2 testing to 0.92 validation I think. - Mike -- Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job: Could not submit job (condor_submit reported an exit code of 1). Submitting job(s) Found illegal unescaped double-quote: "" -e /bin/cat -out outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0065.out -k "" -cdmfile "" -status file -a data.txtThe full arguments you specified were: /osg/data/engage/tmp/osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0065.out -k "" -cdmfile "" -status file -a data.txt Script is: e$ cat catsn.swift type file; app (file o) cat (file i) { cat @i stdout=@o; } file out[]; foreach j in [1:@toint(@arg("n","1"))] { file data<"data.txt">; out[j] = cat(data); } --- An initial test with an older trunk (~mid-december, swift-r3703 cog-r2925 cog modified locally) seems to work fine with the same tc, sites, and properties file. I need to check what local mods I had applied, but I think its more likely that some Condor submit file quoting fix fell off in 0.92 integration. 
-- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From bugzilla-daemon at mcs.anl.gov Wed Jan 12 23:51:49 2011 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Wed, 12 Jan 2011 23:51:49 -0600 (CST) Subject: [Swift-devel] [Bug 249] Condor-G provider gives quoting error In-Reply-To: References: Message-ID: <20110113055149.C5D9F1BD89@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=249 --- Comment #1 from Michael Wilde 2011-01-12 23:51:49 --- Initial description was missing this first line: on engage-submit, sending 100 trivial cat jobs to 10 OSG sites. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. You are watching the reporter. From wilde at mcs.anl.gov Wed Jan 12 23:55:50 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 12 Jan 2011 23:55:50 -0600 (CST) Subject: [Swift-devel] Did usage tracking fall off of 0.92? Message-ID: <1886902128.57685.1294898150654.JavaMail.root@zimbra.anl.gov> David, I suspect that 0.92 may have lost your usage tracking mods to the swift command. Can you check? I say this because I noticed that on engage-submit, the older ~Dec-15 trunk I was using gives the (expected) complaint that /dev/udp is not found, while 0.92 does not complain. (Or has that message just been more cleanly suppressed, or a work-around added for the missing /dev/udp?) -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Jan 13 00:32:53 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 12 Jan 2011 22:32:53 -0800 Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <775856762.57677.1294897394831.JavaMail.root@zimbra.anl.gov> References: <775856762.57677.1294897394831.JavaMail.root@zimbra.anl.gov> Message-ID: <1294900373.11403.0.camel@blabla2.none> On Wed, 2011-01-12 at 23:43 -0600, Michael Wilde wrote: > An initial test with an older trunk (~mid-december, swift-r3703 cog-r2925 cog modified locally) seems to work fine with the same tc, sites, and properties file. > > I need to check what local mods I had applied, but I think its more likely that some Condor submit file quoting fix fell off in 0.92 integration. Yeah. A svn diff > somefile would help. > > So Marc, sorry - this release is not usable for you yet. > > - Mike > > > ----- Original Message ----- > > Im trying my first tests of 0.92 on engage-submit, sending 100 trivial > > cat jobs to 10 OSG sites. > > > > My jobs seem to be all dying with the error "Found illegal unescaped > > double-quote" (see below). > > > > Has anyone successfully run a Condor-G job on OSG with 0.92? > > > > I'll dig deeper and try the same test with the older version of trunk > > that Marc has been using here with better success. Will also try a > > single job run and capture a simpler log and the condor-g submit file. > > > > Allan, have you tried 0.92 against COndor-G? If not, could you? > > > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 validation I > > think. 
> > > > - Mike > > > > -- > > > > Caused by: > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Cannot submit job: Could not submit job (condor_submit reported an > > exit code of 1). Submitting job(s) > > Found illegal unescaped double-quote: "" -e /bin/cat -out > > outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt -of > > outdir/f.0065.out -k "" -cdmfile "" -status file -a data.txtThe full > > arguments you specified were: > > /osg/data/engage/tmp/osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out outdir/f.0065.out > > -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0065.out -k "" > > -cdmfile "" -status file -a data.txt > > > > > > Script is: > > > > e$ cat catsn.swift > > type file; > > > > app (file o) cat (file i) > > { > > cat @i stdout=@o; > > } > > > > file out[] > prefix="f.",suffix=".out">; > > foreach j in [1:@toint(@arg("n","1"))] { > > file data<"data.txt">; > > out[j] = cat(data); > > } > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Thu Jan 13 08:17:26 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 08:17:26 -0600 (CST) Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <1294900373.11403.0.camel@blabla2.none> Message-ID: <1791121314.57953.1294928246779.JavaMail.root@zimbra.anl.gov> ----- Original Message ----- > > I need to check what local mods I had applied, but I think its more > > likely that some Condor submit file quoting fix fell off in 0.92 > > integration. > > Yeah. A svn diff > somefile would help. Hmmm. So far svn diffs show no changes within provider-condor, neither between trunk and 0.92 branch nor within my working copies of those two on engage-submit, which seem to behave differently regarding Condor quoting. Could the change(s) that were made a long time ago to fix Condor quoting be in a different module than provider-condor? If so, whats a likely place to look? I'll check vdl-int.k next. - Mike > > > > So Marc, sorry - this release is not usable for you yet. > > > > - Mike > > > > > > ----- Original Message ----- > > > Im trying my first tests of 0.92 on engage-submit, sending 100 > > > trivial > > > cat jobs to 10 OSG sites. > > > > > > My jobs seem to be all dying with the error "Found illegal > > > unescaped > > > double-quote" (see below). > > > > > > Has anyone successfully run a Condor-G job on OSG with 0.92? > > > > > > I'll dig deeper and try the same test with the older version of > > > trunk > > > that Marc has been using here with better success. Will also try a > > > single job run and capture a simpler log and the condor-g submit > > > file. > > > > > > Allan, have you tried 0.92 against COndor-G? If not, could you? > > > > > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 > > > validation I > > > think. > > > > > > - Mike > > > > > > -- > > > > > > Caused by: > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > Cannot submit job: Could not submit job (condor_submit reported an > > > exit code of 1). 
Submitting job(s) > > > Found illegal unescaped double-quote: "" -e /bin/cat -out > > > outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt -of > > > outdir/f.0065.out -k "" -cdmfile "" -status file -a data.txtThe > > > full > > > arguments you specified were: > > > /osg/data/engage/tmp/osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > > > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out > > > outdir/f.0065.out > > > -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0065.out -k > > > "" > > > -cdmfile "" -status file -a data.txt > > > > > > > > > Script is: > > > > > > e$ cat catsn.swift > > > type file; > > > > > > app (file o) cat (file i) > > > { > > > cat @i stdout=@o; > > > } > > > > > > file out[] > > prefix="f.",suffix=".out">; > > > foreach j in [1:@toint(@arg("n","1"))] { > > > file data<"data.txt">; > > > out[j] = cat(data); > > > } > > > > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Thu Jan 13 08:24:03 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 08:24:03 -0600 (CST) Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <1791121314.57953.1294928246779.JavaMail.root@zimbra.anl.gov> Message-ID: <842221165.57992.1294928643781.JavaMail.root@zimbra.anl.gov> So far I see no diffs between 0.92 swift/libexec and the trunk working copy I was using on engage-submit. Where else should I look? (Its very possible Im missing something; but I tried creating an artificial change in at least one case, a few dirs down, and svn diff picks it up) - Mike ----- Original Message ----- > ----- Original Message ----- > > > I need to check what local mods I had applied, but I think its > > > more > > > likely that some Condor submit file quoting fix fell off in 0.92 > > > integration. > > > > Yeah. A svn diff > somefile would help. > > Hmmm. So far svn diffs show no changes within provider-condor, neither > between trunk and 0.92 branch nor within my working copies of those > two on engage-submit, which seem to behave differently regarding > Condor quoting. > > Could the change(s) that were made a long time ago to fix Condor > quoting be in a different module than provider-condor? If so, whats a > likely place to look? > > I'll check vdl-int.k next. > > - Mike > > > > > > > So Marc, sorry - this release is not usable for you yet. > > > > > > - Mike > > > > > > > > > ----- Original Message ----- > > > > Im trying my first tests of 0.92 on engage-submit, sending 100 > > > > trivial > > > > cat jobs to 10 OSG sites. > > > > > > > > My jobs seem to be all dying with the error "Found illegal > > > > unescaped > > > > double-quote" (see below). > > > > > > > > Has anyone successfully run a Condor-G job on OSG with 0.92? > > > > > > > > I'll dig deeper and try the same test with the older version of > > > > trunk > > > > that Marc has been using here with better success. Will also try > > > > a > > > > single job run and capture a simpler log and the condor-g submit > > > > file. > > > > > > > > Allan, have you tried 0.92 against COndor-G? 
If not, could you? > > > > > > > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 > > > > validation I > > > > think. > > > > > > > > - Mike > > > > > > > > -- > > > > > > > > Caused by: > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > Cannot submit job: Could not submit job (condor_submit reported > > > > an > > > > exit code of 1). Submitting job(s) > > > > Found illegal unescaped double-quote: "" -e /bin/cat -out > > > > outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt -of > > > > outdir/f.0065.out -k "" -cdmfile "" -status file -a data.txtThe > > > > full > > > > arguments you specified were: > > > > /osg/data/engage/tmp/osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > > > > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out > > > > outdir/f.0065.out > > > > -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0065.out > > > > -k > > > > "" > > > > -cdmfile "" -status file -a data.txt > > > > > > > > > > > > Script is: > > > > > > > > e$ cat catsn.swift > > > > type file; > > > > > > > > app (file o) cat (file i) > > > > { > > > > cat @i stdout=@o; > > > > } > > > > > > > > file out[] > > > prefix="f.",suffix=".out">; > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > file data<"data.txt">; > > > > out[j] = cat(data); > > > > } > > > > > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of Chicago > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Thu Jan 13 08:45:48 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 08:45:48 -0600 (CST) Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <842221165.57992.1294928643781.JavaMail.root@zimbra.anl.gov> Message-ID: <221819054.58120.1294929948306.JavaMail.root@zimbra.anl.gov> This is very strange. I diffed the entire cog and modules/swift dir trees between my 0.92 and trunk working copies. The only changes I see are: - my (one) local mod to remove Time:HiRes for Ranger in worker.pl - David's patch to override (which *does* touch JobSpecification, suspiciously! The latter I will investigate, but Im very surprised to see so few differences between 0.92 and a trunk working copy. Mihael, I *thought* you integrated changes from stable-branch ( ~ 0.91) into the 0.92 branch. Or did you integrate those into trunk, and perhaps I took a later copy of trunk? If you did indeed integrate the stable changes into trunk, *and* its possible that what I test on was not re-built after I did an svn update some time in the past few weeks, *then* perhaps the Condor bug crept in between recent changes to trunk. 
Looking at svn log on my trunk copy, it seems almost as if one of these two adjacent revisions are somehow reverted or suddenly not working: ------------------------------------------------------------------------ r2021 | b_z_c | 2008-05-16 10:01:46 -0400 (Fri, 16 May 2008) | 1 line JDK1.4.2 compatible string mangling ------------------------------------------------------------------------ r2020 | b_z_c | 2008-05-16 08:04:01 -0400 (Fri, 16 May 2008) | 1 line double-quote symbols in arguments are now escaped ------------------------------------------------------------------------ I will also try to back off the Override patch and compile with 1.6. I cant understand how my trunk got compiled, previously, without the override patch. So I'll try to check my update vs build dates, how I did my older (working) build, Let me know if you have other ideas of how to diagnose this. - Mike ----- Original Message ----- > So far I see no diffs between 0.92 swift/libexec and the trunk working > copy I was using on engage-submit. > > Where else should I look? > > (Its very possible Im missing something; but I tried creating an > artificial change in at least one case, a few dirs down, and svn diff > picks it up) > > - Mike > > ----- Original Message ----- > > ----- Original Message ----- > > > > I need to check what local mods I had applied, but I think its > > > > more > > > > likely that some Condor submit file quoting fix fell off in 0.92 > > > > integration. > > > > > > Yeah. A svn diff > somefile would help. > > > > Hmmm. So far svn diffs show no changes within provider-condor, > > neither > > between trunk and 0.92 branch nor within my working copies of those > > two on engage-submit, which seem to behave differently regarding > > Condor quoting. > > > > Could the change(s) that were made a long time ago to fix Condor > > quoting be in a different module than provider-condor? If so, whats > > a > > likely place to look? > > > > I'll check vdl-int.k next. > > > > - Mike > > > > > > > > > > So Marc, sorry - this release is not usable for you yet. > > > > > > > > - Mike > > > > > > > > > > > > ----- Original Message ----- > > > > > Im trying my first tests of 0.92 on engage-submit, sending 100 > > > > > trivial > > > > > cat jobs to 10 OSG sites. > > > > > > > > > > My jobs seem to be all dying with the error "Found illegal > > > > > unescaped > > > > > double-quote" (see below). > > > > > > > > > > Has anyone successfully run a Condor-G job on OSG with 0.92? > > > > > > > > > > I'll dig deeper and try the same test with the older version > > > > > of > > > > > trunk > > > > > that Marc has been using here with better success. Will also > > > > > try > > > > > a > > > > > single job run and capture a simpler log and the condor-g > > > > > submit > > > > > file. > > > > > > > > > > Allan, have you tried 0.92 against COndor-G? If not, could > > > > > you? > > > > > > > > > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 > > > > > validation I > > > > > think. > > > > > > > > > > - Mike > > > > > > > > > > -- > > > > > > > > > > Caused by: > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > > Cannot submit job: Could not submit job (condor_submit > > > > > reported > > > > > an > > > > > exit code of 1). 
Submitting job(s) > > > > > Found illegal unescaped double-quote: "" -e /bin/cat -out > > > > > outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt > > > > > -of > > > > > outdir/f.0065.out -k "" -cdmfile "" -status file -a > > > > > data.txtThe > > > > > full > > > > > arguments you specified were: > > > > > /osg/data/engage/tmp/osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > > > > > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out > > > > > outdir/f.0065.out > > > > > -err stderr.txt -i -d outdir -if data.txt -of > > > > > outdir/f.0065.out > > > > > -k > > > > > "" > > > > > -cdmfile "" -status file -a data.txt > > > > > > > > > > > > > > > Script is: > > > > > > > > > > e$ cat catsn.swift > > > > > type file; > > > > > > > > > > app (file o) cat (file i) > > > > > { > > > > > cat @i stdout=@o; > > > > > } > > > > > > > > > > file out[] > > > > prefix="f.",suffix=".out">; > > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > > file data<"data.txt">; > > > > > out[j] = cat(data); > > > > > } > > > > > > > > > > > > > > > -- > > > > > Michael Wilde > > > > > Computation Institute, University of Chicago > > > > > Mathematics and Computer Science Division > > > > > Argonne National Laboratory > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From aespinosa at cs.uchicago.edu Thu Jan 13 09:23:28 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 13 Jan 2011 09:23:28 -0600 Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <1791121314.57953.1294928246779.JavaMail.root@zimbra.anl.gov> References: <1294900373.11403.0.camel@blabla2.none> <1791121314.57953.1294928246779.JavaMail.root@zimbra.anl.gov> Message-ID: Shouldn't we be looking at the diffs in provider-localscheduler? -Allan (mobile) On Jan 13, 2011 11:17 AM, "Michael Wilde" wrote: > > > > ----- Original Message ----- > > > I need to check what local mods I had applied, but I think its more > > > likely that some Condor submit file quoting fix fell off in 0.92 > > > integration. > > > > Yeah. A svn diff > somefile would help. > > Hmmm. So far svn diffs show no changes within provider-condor, neither between trunk and 0.92 branch nor within my working copies of those two on engage-submit, which seem to behave differently regarding Condor quoting. > > Could the change(s) that were made a long time ago to fix Condor quoting be in a different module than provider-condor? If so, whats a likely place to look? > > I'll check vdl-int.k next. > > - Mike > > > > > > > So Marc, sorry - this release is not usable for you yet. 
> > > > > > - Mike > > > > > > > > > ----- Original Message ----- > > > > Im trying my first tests of 0.92 on engage-submit, sending 100 > > > > trivial > > > > cat jobs to 10 OSG sites. > > > > > > > > My jobs seem to be all dying with the error "Found illegal > > > > unescaped > > > > double-quote" (see below). > > > > > > > > Has anyone successfully run a Condor-G job on OSG with 0.92? > > > > > > > > I'll dig deeper and try the same test with the older version of > > > > trunk > > > > that Marc has been using here with better success. Will also try a > > > > single job run and capture a simpler log and the condor-g submit > > > > file. > > > > > > > > Allan, have you tried 0.92 against COndor-G? If not, could you? > > > > > > > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 > > > > validation I > > > > think. > > > > > > > > - Mike > > > > > > > > -- > > > > > > > > Caused by: > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > Cannot submit job: Could not submit job (condor_submit reported an > > > > exit code of 1). Submitting job(s) > > > > Found illegal unescaped double-quote: "" -e /bin/cat -out > > > > outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt -of > > > > outdir/f.0065.out -k "" -cdmfile "" -status file -a data.txtThe > > > > full > > > > arguments you specified were: > > > > /osg/data/engage/tmp/ osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > > > > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out > > > > outdir/f.0065.out > > > > -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0065.out -k > > > > "" > > > > -cdmfile "" -status file -a data.txt > > > > > > > > > > > > Script is: > > > > > > > > e$ cat catsn.swift > > > > type file; > > > > > > > > app (file o) cat (file i) > > > > { > > > > cat @i stdout=@o; > > > > } > > > > > > > > file out[] > > > prefix="f.",suffix=".out">; > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > file data<"data.txt">; > > > > out[j] = cat(data); > > > > } > > > > > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of Chicago > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jan 13 09:31:58 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 09:31:58 -0600 (CST) Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: Message-ID: <624341900.58517.1294932718459.JavaMail.root@zimbra.anl.gov> ----- Original Message ----- > Shouldn't we be looking at the diffs in provider-localscheduler? I don't *think* so - my tests were using COndor-G directly: grid gt2 ff-grid3.unl.edu/jobmanager-pbs But in any case, I diff'ed the entire cog and swift trees, and saw almost *no* diffs (see later msg). The only one I am suspicious of at the moment is the @Override patch. 
I need to find when that change was made and whether I somehow compiled *with* the Overrides in place in the older working copy. - Mike > > -Allan (mobile) > > On Jan 13, 2011 11:17 AM, "Michael Wilde" < wilde at mcs.anl.gov > wrote: > > > > > > > > ----- Original Message ----- > > > > I need to check what local mods I had applied, but I think its > > > > more > > > > likely that some Condor submit file quoting fix fell off in 0.92 > > > > integration. > > > > > > Yeah. A svn diff > somefile would help. > > > > Hmmm. So far svn diffs show no changes within provider-condor, > > neither between trunk and 0.92 branch nor within my working copies > > of those two on engage-submit, which seem to behave differently > > regarding Condor quoting. > > > > Could the change(s) that were made a long time ago to fix Condor > > quoting be in a different module than provider-condor? If so, whats > > a likely place to look? > > > > I'll check vdl-int.k next. > > > > - Mike > > > > > > > > > > So Marc, sorry - this release is not usable for you yet. > > > > > > > > - Mike > > > > > > > > > > > > ----- Original Message ----- > > > > > Im trying my first tests of 0.92 on engage-submit, sending 100 > > > > > trivial > > > > > cat jobs to 10 OSG sites. > > > > > > > > > > My jobs seem to be all dying with the error "Found illegal > > > > > unescaped > > > > > double-quote" (see below). > > > > > > > > > > Has anyone successfully run a Condor-G job on OSG with 0.92? > > > > > > > > > > I'll dig deeper and try the same test with the older version > > > > > of > > > > > trunk > > > > > that Marc has been using here with better success. Will also > > > > > try a > > > > > single job run and capture a simpler log and the condor-g > > > > > submit > > > > > file. > > > > > > > > > > Allan, have you tried 0.92 against COndor-G? If not, could > > > > > you? > > > > > > > > > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 > > > > > validation I > > > > > think. > > > > > > > > > > - Mike > > > > > > > > > > -- > > > > > > > > > > Caused by: > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > > Cannot submit job: Could not submit job (condor_submit > > > > > reported an > > > > > exit code of 1). 
Submitting job(s) > > > > > Found illegal unescaped double-quote: "" -e /bin/cat -out > > > > > outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt > > > > > -of > > > > > outdir/f.0065.out -k "" -cdmfile "" -status file -a > > > > > data.txtThe > > > > > full > > > > > arguments you specified were: > > > > > /osg/data/engage/tmp/ > > > > > osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > > > > > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out > > > > > outdir/f.0065.out > > > > > -err stderr.txt -i -d outdir -if data.txt -of > > > > > outdir/f.0065.out -k > > > > > "" > > > > > -cdmfile "" -status file -a data.txt > > > > > > > > > > > > > > > Script is: > > > > > > > > > > e$ cat catsn.swift > > > > > type file; > > > > > > > > > > app (file o) cat (file i) > > > > > { > > > > > cat @i stdout=@o; > > > > > } > > > > > > > > > > file out[] > > > > prefix="f.",suffix=".out">; > > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > > file data<"data.txt">; > > > > > out[j] = cat(data); > > > > > } > > > > > > > > > > > > > > > -- > > > > > Michael Wilde > > > > > Computation Institute, University of Chicago > > > > > Mathematics and Computer Science Division > > > > > Argonne National Laboratory > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Thu Jan 13 09:37:02 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 09:37:02 -0600 (CST) Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <624341900.58517.1294932718459.JavaMail.root@zimbra.anl.gov> Message-ID: <219989897.58570.1294933022424.JavaMail.root@zimbra.anl.gov> I think my diffs were wrong. Please ignore this thread till I re-do them. - Mike ----- Original Message ----- > ----- Original Message ----- > > Shouldn't we be looking at the diffs in provider-localscheduler? > > I don't *think* so - my tests were using COndor-G directly: > > grid > gt2 > ff-grid3.unl.edu/jobmanager-pbs > > But in any case, I diff'ed the entire cog and swift trees, and saw > almost *no* diffs (see later msg). The only one I am suspicious of at > the moment is the @Override patch. > > I need to find when that change was made and whether I somehow > compiled *with* the Overrides in place in the older working copy. > > - Mike > > > > > -Allan (mobile) > > > > On Jan 13, 2011 11:17 AM, "Michael Wilde" < wilde at mcs.anl.gov > > > wrote: > > > > > > > > > > > > ----- Original Message ----- > > > > > I need to check what local mods I had applied, but I think its > > > > > more > > > > > likely that some Condor submit file quoting fix fell off in > > > > > 0.92 > > > > > integration. > > > > > > > > Yeah. A svn diff > somefile would help. > > > > > > Hmmm. 
So far svn diffs show no changes within provider-condor, > > > neither between trunk and 0.92 branch nor within my working copies > > > of those two on engage-submit, which seem to behave differently > > > regarding Condor quoting. > > > > > > Could the change(s) that were made a long time ago to fix Condor > > > quoting be in a different module than provider-condor? If so, > > > whats > > > a likely place to look? > > > > > > I'll check vdl-int.k next. > > > > > > - Mike > > > > > > > > > > > > > So Marc, sorry - this release is not usable for you yet. > > > > > > > > > > - Mike > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > Im trying my first tests of 0.92 on engage-submit, sending > > > > > > 100 > > > > > > trivial > > > > > > cat jobs to 10 OSG sites. > > > > > > > > > > > > My jobs seem to be all dying with the error "Found illegal > > > > > > unescaped > > > > > > double-quote" (see below). > > > > > > > > > > > > Has anyone successfully run a Condor-G job on OSG with 0.92? > > > > > > > > > > > > I'll dig deeper and try the same test with the older version > > > > > > of > > > > > > trunk > > > > > > that Marc has been using here with better success. Will also > > > > > > try a > > > > > > single job run and capture a simpler log and the condor-g > > > > > > submit > > > > > > file. > > > > > > > > > > > > Allan, have you tried 0.92 against COndor-G? If not, could > > > > > > you? > > > > > > > > > > > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 > > > > > > validation I > > > > > > think. > > > > > > > > > > > > - Mike > > > > > > > > > > > > -- > > > > > > > > > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > > > Cannot submit job: Could not submit job (condor_submit > > > > > > reported an > > > > > > exit code of 1). 
Submitting job(s) > > > > > > Found illegal unescaped double-quote: "" -e /bin/cat -out > > > > > > outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt > > > > > > -of > > > > > > outdir/f.0065.out -k "" -cdmfile "" -status file -a > > > > > > data.txtThe > > > > > > full > > > > > > arguments you specified were: > > > > > > /osg/data/engage/tmp/ > > > > > > osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > > > > > > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out > > > > > > outdir/f.0065.out > > > > > > -err stderr.txt -i -d outdir -if data.txt -of > > > > > > outdir/f.0065.out -k > > > > > > "" > > > > > > -cdmfile "" -status file -a data.txt > > > > > > > > > > > > > > > > > > Script is: > > > > > > > > > > > > e$ cat catsn.swift > > > > > > type file; > > > > > > > > > > > > app (file o) cat (file i) > > > > > > { > > > > > > cat @i stdout=@o; > > > > > > } > > > > > > > > > > > > file out[] > > > > > prefix="f.",suffix=".out">; > > > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > > > file data<"data.txt">; > > > > > > out[j] = cat(data); > > > > > > } > > > > > > > > > > > > > > > > > > -- > > > > > > Michael Wilde > > > > > > Computation Institute, University of Chicago > > > > > > Mathematics and Computer Science Division > > > > > > Argonne National Laboratory > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Thu Jan 13 10:20:34 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 10:20:34 -0600 (CST) Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <219989897.58570.1294933022424.JavaMail.root@zimbra.anl.gov> Message-ID: <2127159625.58925.1294935634836.JavaMail.root@zimbra.anl.gov> Allan, you are right! So the code in provide-condor is an obsolete fossil? 
My earlier diffs were wrong because I diffed trunk against 0.92, but the problem occurred in the merge of stable *to* trunk (obviously now ;)

The error I think is in rev 2989:

--- modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/CondorExecutor.java (revision 2988)
+++ modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/CondorExecutor.java (working copy)

The working trunk version generates this Condor submit file:

universe = grid
grid_resource = gt2 ff-grid3.unl.edu/jobmanager-pbs
stream_output = False
stream_error = False
Transfer_Executable = false
output = /home/wilde/.globus/scripts/Condor50896.submit.stdout
error = /home/wilde/.globus/scripts/Condor50896.submit.stderr
remote_initialdir = /panfs/panasas/CMS/data/engage/tmp/ff-grid3.unl.edu/catsn-20110113-1059-4xb6b31h
executable = /bin/bash
arguments = /panfs/panasas/CMS/data/engage/tmp/ff-grid3.unl.edu/catsn-20110113-1059-4xb6b31h/shared/_swiftwrap cat-fmk15f4k -jobdir f -scratch -e /bin/cat -out outdir/f.0001.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0001.out -k -cdmfile -status file -a data.txt
notification = Never
leave_in_queue = TRUE
queue

while the failing 0.92 version generates this:

universe = grid
grid_resource = gt2 belhaven-1.renci.org/jobmanager-condor
stream_output = False
stream_error = False
Transfer_Executable = false
output = /home/wilde/.globus/scripts/Condor43688.submit.stdout
error = /home/wilde/.globus/scripts/Condor43688.submit.stderr
remote_initialdir = /nfs/osg-data/engage/tmp/belhaven-1.renci.org/catsn-20110113-1050-eskyjcb5
executable = /bin/bash
arguments = /nfs/osg-data/engage/tmp/belhaven-1.renci.org/catsn-20110113-1050-eskyjcb5/shared/_swiftwrap cat-kbmn4f4k -jobdir k -scratch "" -e /bin/cat -out outdir/f.0001.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0001.out -k "" -cdmfile "" -status file -a data.txt
notification = Never
leave_in_queue = TRUE
queue

It is not yet clear to me if the older code is working because it *failed* to escape the quotes on the arguments line with \", or because it *omitted* the "". I need to look more closely to see if I'm being fooled by the .submit file text I pasted above (i.e. whether the \" is really there, or the "" is missing entirely).

At any rate - Mihael, can you sync up with me on this (i.e. whichever of us gets to it first should fix it). Or Sarah, David, Justin, or Allan?

Mihael, I think your top priority should be the coaster staging timing issue that Allan and Justin are both encountering (we think).

We need to add a test for how this works and verify that it's creating a valid submit file.
Thanks,

- Mike

The diffs are below:

--- modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/CondorExecutor.java (revision 2988)
+++ modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/condor/CondorExecutor.java (working copy)
@@ -116,97 +116,6 @@
         wr.close();
     }
 
-    private static final boolean[] TRIGGERS;
-
-    static {
-        TRIGGERS = new boolean[128];
-        TRIGGERS[' '] = true;
-        TRIGGERS['\n'] = true;
-        TRIGGERS['\t'] = true;
-        TRIGGERS['\\'] = true;
-        TRIGGERS['>'] = true;
-        TRIGGERS['<'] = true;
-        TRIGGERS['"'] = true;
-    }
-
-    protected String quote(String s) {
-        if ("".equals(s)) {
-            return "";
-        }
-        boolean quotes = false;
-        for (int i = 0; i < s.length(); i++) {
-            char c = s.charAt(i);
-            if (c < 128 && TRIGGERS[c]) {
-                quotes = true;
-                break;
-            }
-        }
-        if (!quotes) {
-            return s;
-        }
-        StringBuffer sb = new StringBuffer();
-        if (quotes) {
-            sb.append('\\');
-            sb.append('"');
-        }
-        for (int i = 0; i < s.length(); i++) {
-            char c = s.charAt(i);
-            if (c == '"' || c == '\\') {
-                sb.append('\\');
-            }
-            sb.append(c);
-        }
-        if (quotes) {
-            sb.append('\\');
-            sb.append('"');
-        }
-        return sb.toString();
-    }
-
-    protected String replaceVars(String str) {
-        StringBuffer sb = new StringBuffer();
-        boolean escaped = false;
-        for (int i = 0; i < str.length(); i++) {
-            char c = str.charAt(i);
-            if (c == '\\') {
-                if (escaped) {
-                    sb.append('\\');
-                }
-                else {
-                    escaped = true;
-                }
-            }
-            else {
-                if (c == '$' && !escaped) {
-                    if (i == str.length() - 1) {
-                        sb.append('$');
-                    }
-                    else {
-                        int e = str.indexOf(' ', i);
-                        if (e == -1) {
-                            e = str.length();
-                        }
-                        String name = str.substring(i + 1, e);
-                        Object attr = getSpec().getAttribute(name);
-                        if (attr != null) {
-                            sb.append(attr.toString());
-                        }
-                        else {
-                            sb.append('$');
-                            sb.append(name);
-                        }
-                        i = e;
-                    }
-                }
-                else {
-                    sb.append(c);
-                }
-                escaped = false;
-            }
-        }
-        return sb.toString();
-    }
-
     protected String getName() {
         return "Condor";
     }
login1$

----- Original Message ----- > I think my diffs were wrong. Please ignore this thread till I re-do > them. > > - Mike > > ----- Original Message ----- > > ----- Original Message ----- > > > Shouldn't we be looking at the diffs in provider-localscheduler? > > > > I don't *think* so - my tests were using COndor-G directly: > > > > grid > > gt2 > > ff-grid3.unl.edu/jobmanager-pbs > > > > But in any case, I diff'ed the entire cog and swift trees, and saw > > almost *no* diffs (see later msg). The only one I am suspicious of > > at > > the moment is the @Override patch. > > > > I need to find when that change was made and whether I somehow > > compiled *with* the Overrides in place in the older working copy. > > > > - Mike > > > > > > > > -Allan (mobile) > > > > > > On Jan 13, 2011 11:17 AM, "Michael Wilde" < wilde at mcs.anl.gov > > > > wrote: > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > I need to check what local mods I had applied, but I think > > > > > > its > > > > > > more > > > > > > likely that some Condor submit file quoting fix fell off in > > > > > > 0.92 > > > > > > integration. > > > > > > > > > > Yeah. A svn diff > somefile would help. > > > > > > > > Hmmm.
> > > > > > > > Could the change(s) that were made a long time ago to fix Condor > > > > quoting be in a different module than provider-condor? If so, > > > > whats > > > > a likely place to look? > > > > > > > > I'll check vdl-int.k next. > > > > > > > > - Mike > > > > > > > > > > > > > > > > So Marc, sorry - this release is not usable for you yet. > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > Im trying my first tests of 0.92 on engage-submit, sending > > > > > > > 100 > > > > > > > trivial > > > > > > > cat jobs to 10 OSG sites. > > > > > > > > > > > > > > My jobs seem to be all dying with the error "Found illegal > > > > > > > unescaped > > > > > > > double-quote" (see below). > > > > > > > > > > > > > > Has anyone successfully run a Condor-G job on OSG with > > > > > > > 0.92? > > > > > > > > > > > > > > I'll dig deeper and try the same test with the older > > > > > > > version > > > > > > > of > > > > > > > trunk > > > > > > > that Marc has been using here with better success. Will > > > > > > > also > > > > > > > try a > > > > > > > single job run and capture a simpler log and the condor-g > > > > > > > submit > > > > > > > file. > > > > > > > > > > > > > > Allan, have you tried 0.92 against COndor-G? If not, could > > > > > > > you? > > > > > > > > > > > > > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 > > > > > > > validation I > > > > > > > think. > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > Caused by: > > > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > > > > Cannot submit job: Could not submit job (condor_submit > > > > > > > reported an > > > > > > > exit code of 1). Submitting job(s) > > > > > > > Found illegal unescaped double-quote: "" -e /bin/cat -out > > > > > > > outdir/f.0065.out -err stderr.txt -i -d outdir -if > > > > > > > data.txt > > > > > > > -of > > > > > > > outdir/f.0065.out -k "" -cdmfile "" -status file -a > > > > > > > data.txtThe > > > > > > > full > > > > > > > arguments you specified were: > > > > > > > /osg/data/engage/tmp/ > > > > > > > osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > > > > > > > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out > > > > > > > outdir/f.0065.out > > > > > > > -err stderr.txt -i -d outdir -if data.txt -of > > > > > > > outdir/f.0065.out -k > > > > > > > "" > > > > > > > -cdmfile "" -status file -a data.txt > > > > > > > > > > > > > > > > > > > > > Script is: > > > > > > > > > > > > > > e$ cat catsn.swift > > > > > > > type file; > > > > > > > > > > > > > > app (file o) cat (file i) > > > > > > > { > > > > > > > cat @i stdout=@o; > > > > > > > } > > > > > > > > > > > > > > file out[] > > > > > > prefix="f.",suffix=".out">; > > > > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > > > > file data<"data.txt">; > > > > > > > out[j] = cat(data); > > > > > > > } > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Michael Wilde > > > > > > > Computation Institute, University of Chicago > > > > > > > Mathematics and Computer Science Division > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Swift-devel mailing list > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of Chicago > > > > Mathematics and 
Computer Science Division > > > > Argonne National Laboratory > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Jan 13 11:16:53 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 13 Jan 2011 09:16:53 -0800 Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <221819054.58120.1294929948306.JavaMail.root@zimbra.anl.gov> References: <221819054.58120.1294929948306.JavaMail.root@zimbra.anl.gov> Message-ID: <1294939013.19109.1.camel@blabla2.none> On Thu, 2011-01-13 at 08:45 -0600, Michael Wilde wrote: > This is very strange. I diffed the entire cog and modules/swift dir trees between my 0.92 and trunk working copies. The only changes I see are: > - my (one) local mod to remove Time:HiRes for Ranger in worker.pl > - David's patch to override (which *does* touch JobSpecification, suspiciously! > > The latter I will investigate, but Im very surprised to see so few differences between 0.92 and a trunk working copy. > > Mihael, I *thought* you integrated changes from stable-branch ( ~ > 0.91) into the 0.92 branch. Or did you integrate those into trunk, and > perhaps I took a later copy of trunk? If you did indeed integrate the > stable changes into trunk, *and* its possible that what I test on was > not re-built after I did an svn update some time in the past few > weeks, *then* perhaps the Condor bug crept in between recent changes > to trunk. I merged the 0.91 branch back to trunk. That might have gone wrong. That's why it's relevant to do a diff between the version that works and 0.92. [...] From hategan at mcs.anl.gov Thu Jan 13 11:18:10 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 13 Jan 2011 09:18:10 -0800 Subject: [Swift-devel] Problem with 0.92 sending jobs to OSG via Condor-G In-Reply-To: <624341900.58517.1294932718459.JavaMail.root@zimbra.anl.gov> References: <624341900.58517.1294932718459.JavaMail.root@zimbra.anl.gov> Message-ID: <1294939091.19109.2.camel@blabla2.none> On Thu, 2011-01-13 at 09:31 -0600, Michael Wilde wrote: > > ----- Original Message ----- > > Shouldn't we be looking at the diffs in provider-localscheduler? > > I don't *think* so - my tests were using COndor-G directly: > > grid > gt2 ff-grid3.unl.edu/jobmanager-pbs > > But in any case, I diff'ed the entire cog and swift trees, and saw > almost *no* diffs (see later msg). The only one I am suspicious of at > the moment is the @Override patch. I highly doubt that the problem is with the @Override(s). 
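For context on why an @Override patch can matter for the build at all: javac 1.5 only accepts @Override on a method that overrides a superclass method, while javac 1.6 also accepts it on a method that implements an interface method, so a tree annotated for 1.6 can fail to compile on the 1.5 JDK shown earlier in the thread. The snippet below only illustrates that compiler difference; the Task and CondorTask names are invented for the example and are not classes from the CoG/Swift tree.

interface Task {
    void submit();
}

class CondorTask implements Task {
    // javac 1.6 and later accept @Override here; javac 1.5 rejects it
    // because submit() implements an interface method rather than
    // overriding a superclass method.
    @Override
    public void submit() {
    }
}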
> > I need to find when that change was made and whether I somehow compiled *with* the Overrides in place in the older working copy. > > - Mike > > > > > -Allan (mobile) > > > > On Jan 13, 2011 11:17 AM, "Michael Wilde" < wilde at mcs.anl.gov > wrote: > > > > > > > > > > > > ----- Original Message ----- > > > > > I need to check what local mods I had applied, but I think its > > > > > more > > > > > likely that some Condor submit file quoting fix fell off in 0.92 > > > > > integration. > > > > > > > > Yeah. A svn diff > somefile would help. > > > > > > Hmmm. So far svn diffs show no changes within provider-condor, > > > neither between trunk and 0.92 branch nor within my working copies > > > of those two on engage-submit, which seem to behave differently > > > regarding Condor quoting. > > > > > > Could the change(s) that were made a long time ago to fix Condor > > > quoting be in a different module than provider-condor? If so, whats > > > a likely place to look? > > > > > > I'll check vdl-int.k next. > > > > > > - Mike > > > > > > > > > > > > > So Marc, sorry - this release is not usable for you yet. > > > > > > > > > > - Mike > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > Im trying my first tests of 0.92 on engage-submit, sending 100 > > > > > > trivial > > > > > > cat jobs to 10 OSG sites. > > > > > > > > > > > > My jobs seem to be all dying with the error "Found illegal > > > > > > unescaped > > > > > > double-quote" (see below). > > > > > > > > > > > > Has anyone successfully run a Condor-G job on OSG with 0.92? > > > > > > > > > > > > I'll dig deeper and try the same test with the older version > > > > > > of > > > > > > trunk > > > > > > that Marc has been using here with better success. Will also > > > > > > try a > > > > > > single job run and capture a simpler log and the condor-g > > > > > > submit > > > > > > file. > > > > > > > > > > > > Allan, have you tried 0.92 against COndor-G? If not, could > > > > > > you? > > > > > > > > > > > > Sarah, we should add some Condor-G-to-GT2 testing to 0.92 > > > > > > validation I > > > > > > think. > > > > > > > > > > > > - Mike > > > > > > > > > > > > -- > > > > > > > > > > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > > > Cannot submit job: Could not submit job (condor_submit > > > > > > reported an > > > > > > exit code of 1). 
Submitting job(s) > > > > > > Found illegal unescaped double-quote: "" -e /bin/cat -out > > > > > > outdir/f.0065.out -err stderr.txt -i -d outdir -if data.txt > > > > > > -of > > > > > > outdir/f.0065.out -k "" -cdmfile "" -status file -a > > > > > > data.txtThe > > > > > > full > > > > > > arguments you specified were: > > > > > > /osg/data/engage/tmp/ > > > > > > osg.hpc.ufl.edu/catsn-20110113-0025-vv4p4up3/shared/_swiftwrap > > > > > > cat-ajxnee4k -jobdir a -scratch "" -e /bin/cat -out > > > > > > outdir/f.0065.out > > > > > > -err stderr.txt -i -d outdir -if data.txt -of > > > > > > outdir/f.0065.out -k > > > > > > "" > > > > > > -cdmfile "" -status file -a data.txt > > > > > > > > > > > > > > > > > > Script is: > > > > > > > > > > > > e$ cat catsn.swift > > > > > > type file; > > > > > > > > > > > > app (file o) cat (file i) > > > > > > { > > > > > > cat @i stdout=@o; > > > > > > } > > > > > > > > > > > > file out[] > > > > > prefix="f.",suffix=".out">; > > > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > > > file data<"data.txt">; > > > > > > out[j] = cat(data); > > > > > > } > > > > > > > > > > > > > > > > > > -- > > > > > > Michael Wilde > > > > > > Computation Institute, University of Chicago > > > > > > Mathematics and Computer Science Division > > > > > > Argonne National Laboratory > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From dk0966 at cs.ship.edu Thu Jan 13 11:23:47 2011 From: dk0966 at cs.ship.edu (David Kelly) Date: Thu, 13 Jan 2011 12:23:47 -0500 Subject: [Swift-devel] Re: Did usage tracking fall off of 0.92? In-Reply-To: <1886902128.57685.1294898150654.JavaMail.root@zimbra.anl.gov> References: <1886902128.57685.1294898150654.JavaMail.root@zimbra.anl.gov> Message-ID: Mike, I checked the swift shell script for 0.92 and everything looks good. I'm not sure what changed in terms of the udp message, but I ran 0.92 from my machine and verified that it was recorded in the database. David On Thu, Jan 13, 2011 at 12:55 AM, Michael Wilde wrote: > David, > > I suspect that 0.92 may have lost your usage tracking mods to the swift command. Can you check? > > I say this because I noticed that on engage-submit, the older ~Dec-15 trunk I was using gives the (expected) complaint that /dev/udp is not found, while 0.92 does not complain. ?(Or has that message just been more cleanly suppressed, or a work-around added for the missing /dev/udp?) > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > From wilde at mcs.anl.gov Thu Jan 13 11:33:57 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 11:33:57 -0600 (CST) Subject: [Swift-devel] Re: Did usage tracking fall off of 0.92? In-Reply-To: Message-ID: <1470481739.59495.1294940037341.JavaMail.root@zimbra.anl.gov> OK, I'll need to double check my 0.92 build on engage-submit. This is the second case of unexpected results, so maybe I goofed in my checkouts. 
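As background on the /dev/udp complaint discussed here: the usage ping is a plain shell mechanism, since bash can send a UDP datagram by redirecting output to its /dev/udp/<host>/<port> pseudo-path. The sketch below only illustrates that mechanism; the host, port, and message format are invented for the example and are not what the swift launcher script actually uses.

#!/bin/bash
# Illustrative usage ping over /dev/udp; host, port, and payload are made up.
USAGE_HOST=usage.example.org
USAGE_PORT=9999

# bash interprets /dev/udp/<host>/<port> itself; no such file exists on disk.
# On a bash built without net redirections (or when run under another shell)
# the redirect fails with a "not found"-style complaint like the one seen on
# engage-submit, and the "|| true" keeps the failed ping from affecting the run.
echo "swift-run $(hostname) $(date +%s)" \
    > "/dev/udp/${USAGE_HOST}/${USAGE_PORT}" || true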
- Mike ----- Original Message ----- > Mike, > > I checked the swift shell script for 0.92 and everything looks good. > I'm not sure what changed in terms of the udp message, but I ran 0.92 > from my machine and verified that it was recorded in the database. > > David > > On Thu, Jan 13, 2011 at 12:55 AM, Michael Wilde > wrote: > > David, > > > > I suspect that 0.92 may have lost your usage tracking mods to the > > swift command. Can you check? > > > > I say this because I noticed that on engage-submit, the older > > ~Dec-15 trunk I was using gives the (expected) complaint that > > /dev/udp is not found, while 0.92 does not complain. (Or has that > > message just been more cleanly suppressed, or a work-around added > > for the missing /dev/udp?) > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Thu Jan 13 11:37:45 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 11:37:45 -0600 (CST) Subject: [Swift-devel] Please verify 0.92 checkout & build instructions Message-ID: <190529563.59534.1294940265863.JavaMail.root@zimbra.anl.gov> I updated the ReleasePlans page with the checkout procedure that I used. http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans Could everyone verify that this is correct: svn co https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/src/cog cd cog/modules svn co https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92 swift cd swift ant redist Thanks, Mike From hategan at mcs.anl.gov Thu Jan 13 11:41:28 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 13 Jan 2011 09:41:28 -0800 Subject: [Swift-devel] Please verify 0.92 checkout & build instructions In-Reply-To: <190529563.59534.1294940265863.JavaMail.root@zimbra.anl.gov> References: <190529563.59534.1294940265863.JavaMail.root@zimbra.anl.gov> Message-ID: <1294940488.19109.5.camel@blabla2.none> On Thu, 2011-01-13 at 11:37 -0600, Michael Wilde wrote: > I updated the ReleasePlans page with the checkout procedure that I used. > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans > > Could everyone verify that this is correct: > > svn co https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/src/cog > cd cog/modules > svn co https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92 swift > cd swift > ant redist This is correct. From wilde at mcs.anl.gov Thu Jan 13 12:24:13 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 12:24:13 -0600 (CST) Subject: [Swift-devel] Please verify 0.92 checkout & build instructions In-Reply-To: <1294940488.19109.5.camel@blabla2.none> Message-ID: <1070793442.59932.1294943053889.JavaMail.root@zimbra.anl.gov> OK. I think the behavior Im seeing is due to 2 errors: - the new quote logic (from Ben) was (inadvertently, I think) dropped out of trunk at rev 2989 - the new quote logic itself has an error, in that instead of quoting a zero-length argument as \"\" it instead inserts nothing into the arguments= line for this case. For the examples Ive seen, this *seems* to cause no harm, and was enabling jobs to run. I think its incorrect, though. I will put this back in trunk, test on Condor-G, and report back. 
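To make the second error above concrete, here is a minimal sketch of what the corrected quoting should produce on the Condor submit file's arguments= line, with zero-length arguments escaped as \"\" rather than dropped. The gatekeeper, paths, and file names are placeholders taken loosely from the failing example earlier in the thread, not a real site or Swift's actual submit-file generator.

# Write and submit a toy Condor-G submit file; only the treatment of
# empty arguments (-scratch, -k, -cdmfile) is the point here.
cat > catsn.submit <<'EOF'
universe      = grid
grid_resource = gt2 gatekeeper.example.org/jobmanager-fork
executable    = /path/to/shared/_swiftwrap
# Empty arguments must appear as \"\"; a bare "" makes condor_submit
# fail with "Found illegal unescaped double-quote".
arguments     = cat-jaxfbf4k -jobdir j -scratch \"\" -e /bin/cat -out outdir/f.0001.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0001.out -k \"\" -cdmfile \"\" -status file -a data.txt
output        = catsn.out
error         = catsn.err
log           = catsn.log
queue
EOF
condor_submit catsn.submit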
- Mike ----- Original Message ----- > On Thu, 2011-01-13 at 11:37 -0600, Michael Wilde wrote: > > I updated the ReleasePlans page with the checkout procedure that I > > used. > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans > > > > Could everyone verify that this is correct: > > > > svn co > > https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/src/cog > > cd cog/modules > > svn co https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92 > > swift > > cd swift > > ant redist > > This is correct. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Thu Jan 13 12:43:44 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 12:43:44 -0600 (CST) Subject: [Swift-devel] Please verify 0.92 checkout & build instructions In-Reply-To: <1070793442.59932.1294943053889.JavaMail.root@zimbra.anl.gov> Message-ID: <2057375353.60063.1294944224582.JavaMail.root@zimbra.anl.gov> OK, that *seems* to work. The argument string is now: arguments = /osg/data/engage/tmp/ce02.cmsaf.mit.edu/catsn-20110113-1240-vn86uhoc/shared/_swiftwrap cat-jaxfbf4k -jobdir j -scratch \"\" -e /bin/cat -out outdir/f.0001.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0001.out -k \"\" -cdmfile \"\" -status file -a data.txt with \"\" for empty args. Im going to commit this and test, then get a build of trunk to Marc. Sarah, how/when do you want this and other fixes committed to the 0.92 branch? Do you want to do these mods? - Mike ----- Original Message ----- > OK. I think the behavior Im seeing is due to 2 errors: > > - the new quote logic (from Ben) was (inadvertently, I think) dropped > out of trunk at rev 2989 > > - the new quote logic itself has an error, in that instead of quoting > a zero-length argument as \"\" it instead inserts nothing into the > arguments= line for this case. For the examples Ive seen, this *seems* > to cause no harm, and was enabling jobs to run. I think its incorrect, > though. > > I will put this back in trunk, test on Condor-G, and report back. > > - Mike > > > ----- Original Message ----- > > On Thu, 2011-01-13 at 11:37 -0600, Michael Wilde wrote: > > > I updated the ReleasePlans page with the checkout procedure that I > > > used. > > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans > > > > > > Could everyone verify that this is correct: > > > > > > svn co > > > https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/src/cog > > > cd cog/modules > > > svn co https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92 > > > swift > > > cd swift > > > ant redist > > > > This is correct. 
> > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Jan 13 17:02:21 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 13 Jan 2011 15:02:21 -0800 Subject: [Swift-devel] Please verify 0.92 checkout & build instructions In-Reply-To: <2057375353.60063.1294944224582.JavaMail.root@zimbra.anl.gov> References: <2057375353.60063.1294944224582.JavaMail.root@zimbra.anl.gov> Message-ID: <1294959741.24746.3.camel@blabla2.none> So now for the mystery of why this patch didn't make it to trunk in the merge: the quoting methods were extracted in an abstract class. They were overridden in the condor executor, but I thought that was just a consequence of the refactoring, so I removed that code. The truth was somewhere in the middle: the abstract replaceVars is ok, but not the abstract quote() which needs to be made more specific for condor. Mihael On Thu, 2011-01-13 at 12:43 -0600, Michael Wilde wrote: > OK, that *seems* to work. The argument string is now: > > arguments = /osg/data/engage/tmp/ce02.cmsaf.mit.edu/catsn-20110113-1240-vn86uhoc/shared/_swiftwrap cat-jaxfbf4k -jobdir j -scratch \"\" -e /bin/cat -out outdir/f.0001.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0001.out -k \"\" -cdmfile \"\" -status file -a data.txt > > with \"\" for empty args. > > Im going to commit this and test, then get a build of trunk to Marc. > > Sarah, how/when do you want this and other fixes committed to the 0.92 branch? > Do you want to do these mods? > > - Mike > > > ----- Original Message ----- > > OK. I think the behavior Im seeing is due to 2 errors: > > > > - the new quote logic (from Ben) was (inadvertently, I think) dropped > > out of trunk at rev 2989 > > > > - the new quote logic itself has an error, in that instead of quoting > > a zero-length argument as \"\" it instead inserts nothing into the > > arguments= line for this case. For the examples Ive seen, this *seems* > > to cause no harm, and was enabling jobs to run. I think its incorrect, > > though. > > > > I will put this back in trunk, test on Condor-G, and report back. > > > > - Mike > > > > > > ----- Original Message ----- > > > On Thu, 2011-01-13 at 11:37 -0600, Michael Wilde wrote: > > > > I updated the ReleasePlans page with the checkout procedure that I > > > > used. > > > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans > > > > > > > > Could everyone verify that this is correct: > > > > > > > > svn co > > > > https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/src/cog > > > > cd cog/modules > > > > svn co https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92 > > > > swift > > > > cd swift > > > > ant redist > > > > > > This is correct. 
> > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Thu Jan 13 17:43:27 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Jan 2011 17:43:27 -0600 (CST) Subject: [Swift-devel] Re: Persistent coasters In-Reply-To: Message-ID: <1020593283.62333.1294962207063.JavaMail.root@zimbra.anl.gov> You should configure the tools for this mode of operation on PADS (and any PBS system): - run the commands on a login node (but should work on any PADS node that you are ssh'ed into) - use qsub to obtain nodes -- mode 1: 1 N-node M-core job -- mode 2: N 1-core jobs Do mode 1 first: Job script (the script you use as an arg to qsub) should use a foreach loop to start one worker.pl on each node of the job. You can adapt the code below from Swift R start-swift: make-pbs-submit-file() { if [ $queue != default ]; then queueDirective="#PBS -q $queue" else queueDirective="" fi cat >pbs.sub <$pbsjobidfile Mike ----- Original Message ----- > How should I proceed in testing the persistent coasters scripts on > PADS? Should I use workers-ssh from the login node to pads? Should I > copy the format of workers-cobalt and modify it to use qsub parameters > that work with pbs? > > David -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From aespinosa at cs.uchicago.edu Thu Jan 13 23:01:31 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 13 Jan 2011 23:01:31 -0600 Subject: [Swift-devel] Re: Persistent coasters In-Reply-To: <1020593283.62333.1294962207063.JavaMail.root@zimbra.anl.gov> References: <1020593283.62333.1294962207063.JavaMail.root@zimbra.anl.gov> Message-ID: Since PBS on PADS doesn't care if we requested for multiple nodes or multiple jobs, i just send multiple jobs of worker.pl ... ... 2011/1/13 Michael Wilde : > You should configure the tools for this mode of operation on PADS (and any PBS system): > > - run the commands on a login node (but should work on any PADS node that you are ssh'ed into) > > - use qsub to obtain nodes > ?-- mode 1: 1 N-node M-core job > ?-- mode 2: N 1-core jobs > > Do mode 1 first: > > Job script (the script you use as an arg to qsub) should use a foreach loop to start one worker.pl on each node of the job. You can adapt the code below from Swift R start-swift: > > make-pbs-submit-file() > { > ?if [ $queue != default ]; then > ? ?queueDirective="#PBS -q $queue" > ?else > ? 
?queueDirective="" > ?fi > cat >pbs.sub < #PBS -S /bin/sh > #PBS -N SwiftR-workers > #PBS -m n > #PBS -l nodes=$nodes > #PBS -l walltime=$time > #PBS -o $HOME > #PBS -e $HOME > $queueDirective > WORKER_LOGGING_ENABLED=true # FIXME: parameterize; fix w PBS -v > #cd / && /usr/bin/perl $SWIFTBIN/worker.pl $CONTACT SwiftR-workers $HOME/.globus/coasters $IDLETIMEOUT > HOST=\$(echo $CONTACT | sed -e 's,^http://,,' -e 's/:.*//') > PORT=\$(echo $CONTACT | sed -e 's,^.*:,,') > echo '***' PBS_NODEFILE file: \$PBS_NODEFILE CONTACT:$CONTACT > cat \$PBS_NODEFILE > echo '***' unique nodes are: > sort < \$PBS_NODEFILE|uniq > for h in \$(sort < \$PBS_NODEFILE|uniq); do > ?ssh \$h "echo Swift R startup running on host; hostname; cd /; /usr/bin/perl $SWIFTBIN/worker.pl $CONTACT SwiftR-\$h $HOME/.globus/\ > coasters $IDLETIMEOUT" & > done > wait > END > } > > then: > > ?make-${server}-submit-file > ?qsub pbs.sub >$pbsjobidfile > > > Mike > > ----- Original Message ----- >> How should I proceed in testing the persistent coasters scripts on >> PADS? Should I use workers-ssh from the login node to pads? Should I >> copy the format of workers-cobalt and modify it to use qsub parameters >> that work with pbs? >> >> David -- Allan M. Espinosa PhD student, Computer Science University of Chicago From wilde at mcs.anl.gov Fri Jan 14 08:25:26 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Jan 2011 08:25:26 -0600 (CST) Subject: [Swift-devel] Re: Persistent coasters In-Reply-To: Message-ID: <3998804.63169.1295015126800.JavaMail.root@zimbra.anl.gov> Both methods have their pros and cons - mainly based on what size jobs the scheduler will favor under a given load. There's possibly an advantage at the moment for having multiple workers on a node instead of one, in light of the problem of gaps in worker dispatch of user apps. But I expect that will be resolved. There are also schedulers, in particular on TeraGrid, that will favor mode 1 below, of requesting a set of nodes in a single scheduler job (or a small number of them limited by policy). - Mike ----- Original Message ----- > Since PBS on PADS doesn't care if we requested for multiple nodes or > multiple jobs, i just send multiple jobs of > > worker.pl ... ... > > 2011/1/13 Michael Wilde : > > You should configure the tools for this mode of operation on PADS > > (and any PBS system): > > > > - run the commands on a login node (but should work on any PADS node > > that you are ssh'ed into) > > > > - use qsub to obtain nodes > > ?-- mode 1: 1 N-node M-core job > > ?-- mode 2: N 1-core jobs > > > > Do mode 1 first: > > > > Job script (the script you use as an arg to qsub) should use a > > foreach loop to start one worker.pl on each node of the job. You can > > adapt the code below from Swift R start-swift: > > > > make-pbs-submit-file() > > { > > ?if [ $queue != default ]; then > > ? ?queueDirective="#PBS -q $queue" > > ?else > > ? 
?queueDirective="" > > ?fi > > cat >pbs.sub < > #PBS -S /bin/sh > > #PBS -N SwiftR-workers > > #PBS -m n > > #PBS -l nodes=$nodes > > #PBS -l walltime=$time > > #PBS -o $HOME > > #PBS -e $HOME > > $queueDirective > > WORKER_LOGGING_ENABLED=true # FIXME: parameterize; fix w PBS -v > > #cd / && /usr/bin/perl $SWIFTBIN/worker.pl $CONTACT SwiftR-workers > > $HOME/.globus/coasters $IDLETIMEOUT > > HOST=\$(echo $CONTACT | sed -e 's,^http://,,' -e 's/:.*//') > > PORT=\$(echo $CONTACT | sed -e 's,^.*:,,') > > echo '***' PBS_NODEFILE file: \$PBS_NODEFILE CONTACT:$CONTACT > > cat \$PBS_NODEFILE > > echo '***' unique nodes are: > > sort < \$PBS_NODEFILE|uniq > > for h in \$(sort < \$PBS_NODEFILE|uniq); do > > ?ssh \$h "echo Swift R startup running on host; hostname; cd /; > > ?/usr/bin/perl $SWIFTBIN/worker.pl $CONTACT SwiftR-\$h > > ?$HOME/.globus/\ > > coasters $IDLETIMEOUT" & > > done > > wait > > END > > } > > > > then: > > > > ?make-${server}-submit-file > > ?qsub pbs.sub >$pbsjobidfile > > > > > > Mike > > > > ----- Original Message ----- > >> How should I proceed in testing the persistent coasters scripts on > >> PADS? Should I use workers-ssh from the login node to pads? Should > >> I > >> copy the format of workers-cobalt and modify it to use qsub > >> parameters > >> that work with pbs? > >> > >> David > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Jan 16 17:21:10 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 16 Jan 2011 15:21:10 -0800 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <1294880732.3835.7.camel@blabla2.none> References: <39FC459D-BD6C-47AC-AE66-9BA8BC4C7D98@ci.uchicago.edu> <1294880732.3835.7.camel@blabla2.none> Message-ID: <1295220070.5713.19.camel@blabla2.none> So I'm running some tests. So far here's how the stage-ins look like: -site: local (my laptop) -job input size: 32MB -256 jobs -job output size 0B -measurement is made at the interface between service and worker (the TCP connection). It is aggregated for all workers. -2 workers, 4 jobs per worker -this is the fast branch, but the job throughput is pretty irrelevant here since this is I/O bound. What I get is this: file: [IN]: Total transferred: 641.93 KB, current rate: 0 B/s, average rate: 4.62 KB/s [OUT] Total transferred: 8 GB, current rate: 40.2 MB/s, average rate: 58.92 MB/s Final status: time:140732 Finished successfully:256 Time: 142.13, rate: 1 j/s proxy: Final status: time:113915 Finished successfully:256 Time: 115.393, rate: 2 j/s [IN]: Total transferred: 705.62 KB, current rate: 6.44 KB/s, average rate: 6.24 KB/s [OUT] Total transferred: 8 GB, current rate: 36.08 MB/s, average rate: 72.54 MB/s (It is funny that proxy is faster than "file", but for now I'll ignore that). For comparison: mike at blabla2 coasters$ time dd if=/dev/zero of=~/tmp/8g bs=32KB count=262144 262144+0 records in 262144+0 records out 8388608000 bytes (8.4 GB) copied, 49.9836 s, 168 MB/s Things could be improved, but staging in does not seem to be the bottleneck. Next, stage-outs... Mihael On Wed, 2011-01-12 at 17:05 -0800, Mihael Hategan wrote: > On Wed, 2011-01-12 at 15:57 -0600, Daniel S. Katz wrote: > > As I read this, it appears that there is some limit in Swift, rather than in the hardware, that is causing these numbers to be very low. > > > > Mihael, do you agree? 
> > Yes, but that does not exclude a limit in hardware in other > configurations. > > > Can you help us figure out what's going on? > > Of course. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Sun Jan 16 18:03:55 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 16 Jan 2011 16:03:55 -0800 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <1295220070.5713.19.camel@blabla2.none> References: <39FC459D-BD6C-47AC-AE66-9BA8BC4C7D98@ci.uchicago.edu> <1294880732.3835.7.camel@blabla2.none> <1295220070.5713.19.camel@blabla2.none> Message-ID: <1295222635.5713.20.camel@blabla2.none> Right. So stageouts: Progress: time:86039 Selecting site:231 Submitted:1 Active:8 Finished successfully:16 [IN]: Total transferred: 597.1 MB, current rate: 8.99 MB/s, average rate: 7.02 MB/s [OUT] Total transferred: 161.33 KB, current rate: 0 B/s, average rate: 1.9 KB/s That's probably because, as opposed to when the Java side reads files, there is no read-ahead done by the worker. I'll see if I can add that. On Sun, 2011-01-16 at 15:21 -0800, Mihael Hategan wrote: > So I'm running some tests. > > So far here's how the stage-ins look like: > -site: local (my laptop) > -job input size: 32MB > -256 jobs > -job output size 0B > -measurement is made at the interface between service and worker (the > TCP connection). It is aggregated for all workers. > -2 workers, 4 jobs per worker > -this is the fast branch, but the job throughput is pretty irrelevant > here since this is I/O bound. > > What I get is this: > file: > [IN]: Total transferred: 641.93 KB, current rate: 0 B/s, average rate: > 4.62 KB/s > [OUT] Total transferred: 8 GB, current rate: 40.2 MB/s, average rate: > 58.92 MB/s > Final status: time:140732 Finished successfully:256 > Time: 142.13, rate: 1 j/s > > proxy: > Final status: time:113915 Finished successfully:256 > Time: 115.393, rate: 2 j/s > [IN]: Total transferred: 705.62 KB, current rate: 6.44 KB/s, average > rate: 6.24 KB/s > [OUT] Total transferred: 8 GB, current rate: 36.08 MB/s, average rate: > 72.54 MB/s > > (It is funny that proxy is faster than "file", but for now I'll ignore > that). > > For comparison: > mike at blabla2 coasters$ time dd if=/dev/zero of=~/tmp/8g bs=32KB > count=262144 > 262144+0 records in > 262144+0 records out > 8388608000 bytes (8.4 GB) copied, 49.9836 s, 168 MB/s > > Things could be improved, but staging in does not seem to be the > bottleneck. > Next, stage-outs... > > Mihael > > > > > On Wed, 2011-01-12 at 17:05 -0800, Mihael Hategan wrote: > > On Wed, 2011-01-12 at 15:57 -0600, Daniel S. Katz wrote: > > > As I read this, it appears that there is some limit in Swift, rather than in the hardware, that is causing these numbers to be very low. > > > > > > Mihael, do you agree? > > > > Yes, but that does not exclude a limit in hardware in other > > configurations. > > > > > Can you help us figure out what's going on? > > > > Of course. 
> > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Sun Jan 16 18:56:04 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 16 Jan 2011 18:56:04 -0600 (CST) Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <1295222635.5713.20.camel@blabla2.none> Message-ID: <1518689411.68929.1295225764952.JavaMail.root@zimbra.anl.gov> Mihael, If I follow this right, your stageout rate (7MB/sec) is about what Allan saw for stage ins (~ 6MB/sec as I recall). But your stagein rate is 59MB/sec and 73MB/sec depending on the mode (file vs proxy). So that does not yet replicate what he's seeing (on the WAN). Allan, I cant recall if you sent numbers for local-host tests? - Mike ----- Original Message ----- > Right. So stageouts: > Progress: time:86039 Selecting site:231 Submitted:1 Active:8 > Finished successfully:16 > [IN]: Total transferred: 597.1 MB, current rate: 8.99 MB/s, average > rate: 7.02 MB/s > [OUT] Total transferred: 161.33 KB, current rate: 0 B/s, average rate: > 1.9 KB/s > > That's probably because, as opposed to when the Java side reads files, > there is no read-ahead done by the worker. I'll see if I can add that. > > On Sun, 2011-01-16 at 15:21 -0800, Mihael Hategan wrote: > > So I'm running some tests. > > > > So far here's how the stage-ins look like: > > -site: local (my laptop) > > -job input size: 32MB > > -256 jobs > > -job output size 0B > > -measurement is made at the interface between service and worker > > (the > > TCP connection). It is aggregated for all workers. > > -2 workers, 4 jobs per worker > > -this is the fast branch, but the job throughput is pretty > > irrelevant > > here since this is I/O bound. > > > > What I get is this: > > file: > > [IN]: Total transferred: 641.93 KB, current rate: 0 B/s, average > > rate: > > 4.62 KB/s > > [OUT] Total transferred: 8 GB, current rate: 40.2 MB/s, average > > rate: > > 58.92 MB/s > > Final status: time:140732 Finished successfully:256 > > Time: 142.13, rate: 1 j/s > > > > proxy: > > Final status: time:113915 Finished successfully:256 > > Time: 115.393, rate: 2 j/s > > [IN]: Total transferred: 705.62 KB, current rate: 6.44 KB/s, average > > rate: 6.24 KB/s > > [OUT] Total transferred: 8 GB, current rate: 36.08 MB/s, average > > rate: > > 72.54 MB/s > > > > (It is funny that proxy is faster than "file", but for now I'll > > ignore > > that). > > > > For comparison: > > mike at blabla2 coasters$ time dd if=/dev/zero of=~/tmp/8g bs=32KB > > count=262144 > > 262144+0 records in > > 262144+0 records out > > 8388608000 bytes (8.4 GB) copied, 49.9836 s, 168 MB/s > > > > Things could be improved, but staging in does not seem to be the > > bottleneck. > > Next, stage-outs... > > > > Mihael > > > > > > > > > > On Wed, 2011-01-12 at 17:05 -0800, Mihael Hategan wrote: > > > On Wed, 2011-01-12 at 15:57 -0600, Daniel S. Katz wrote: > > > > As I read this, it appears that there is some limit in Swift, > > > > rather than in the hardware, that is causing these numbers to be > > > > very low. > > > > > > > > Mihael, do you agree? > > > > > > Yes, but that does not exclude a limit in hardware in other > > > configurations. > > > > > > > Can you help us figure out what's going on? 
> > > > > > Of course. > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Jan 16 19:28:22 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 16 Jan 2011 17:28:22 -0800 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <1518689411.68929.1295225764952.JavaMail.root@zimbra.anl.gov> References: <1518689411.68929.1295225764952.JavaMail.root@zimbra.anl.gov> Message-ID: <1295227702.23556.7.camel@blabla2.none> On Sun, 2011-01-16 at 18:56 -0600, Michael Wilde wrote: > Mihael, > > If I follow this right, your stageout rate (7MB/sec) is about what Allan saw for stage ins (~ 6MB/sec as I recall). That is also close to what I get if I disable read-ahead for stage-ins. > > But your stagein rate is 59MB/sec and 73MB/sec depending on the mode (file vs proxy). That was experiment error. They both average around 60MB/s after repeated runs. Though in the local/local case "file" has considerably lower memory consumption that "proxy". > > So that does not yet replicate what he's seeing (on the WAN). No. Just local. I wanted to see if the code was introducing a significant bottleneck, but it does not seem so. Though keep in mind that this was on an SSD. An HDD may not get the same performance. I gave the raw dd numbers for comparison. In the local/local case, one should probably divide that by two since with swift you are both reading and writing the data at the same time, while the dd test was from /dev/zero. > > Allan, I cant recall if you sent numbers for local-host tests? > > - Mike > > > > > ----- Original Message ----- > > Right. So stageouts: > > Progress: time:86039 Selecting site:231 Submitted:1 Active:8 > > Finished successfully:16 > > [IN]: Total transferred: 597.1 MB, current rate: 8.99 MB/s, average > > rate: 7.02 MB/s > > [OUT] Total transferred: 161.33 KB, current rate: 0 B/s, average rate: > > 1.9 KB/s > > > > That's probably because, as opposed to when the Java side reads files, > > there is no read-ahead done by the worker. I'll see if I can add that. > > > > On Sun, 2011-01-16 at 15:21 -0800, Mihael Hategan wrote: > > > So I'm running some tests. > > > > > > So far here's how the stage-ins look like: > > > -site: local (my laptop) > > > -job input size: 32MB > > > -256 jobs > > > -job output size 0B > > > -measurement is made at the interface between service and worker > > > (the > > > TCP connection). It is aggregated for all workers. > > > -2 workers, 4 jobs per worker > > > -this is the fast branch, but the job throughput is pretty > > > irrelevant > > > here since this is I/O bound. 
> > > > > > What I get is this: > > > file: > > > [IN]: Total transferred: 641.93 KB, current rate: 0 B/s, average > > > rate: > > > 4.62 KB/s > > > [OUT] Total transferred: 8 GB, current rate: 40.2 MB/s, average > > > rate: > > > 58.92 MB/s > > > Final status: time:140732 Finished successfully:256 > > > Time: 142.13, rate: 1 j/s > > > > > > proxy: > > > Final status: time:113915 Finished successfully:256 > > > Time: 115.393, rate: 2 j/s > > > [IN]: Total transferred: 705.62 KB, current rate: 6.44 KB/s, average > > > rate: 6.24 KB/s > > > [OUT] Total transferred: 8 GB, current rate: 36.08 MB/s, average > > > rate: > > > 72.54 MB/s > > > > > > (It is funny that proxy is faster than "file", but for now I'll > > > ignore > > > that). > > > > > > For comparison: > > > mike at blabla2 coasters$ time dd if=/dev/zero of=~/tmp/8g bs=32KB > > > count=262144 > > > 262144+0 records in > > > 262144+0 records out > > > 8388608000 bytes (8.4 GB) copied, 49.9836 s, 168 MB/s > > > > > > Things could be improved, but staging in does not seem to be the > > > bottleneck. > > > Next, stage-outs... > > > > > > Mihael > > > > > > > > > > > > > > > On Wed, 2011-01-12 at 17:05 -0800, Mihael Hategan wrote: > > > > On Wed, 2011-01-12 at 15:57 -0600, Daniel S. Katz wrote: > > > > > As I read this, it appears that there is some limit in Swift, > > > > > rather than in the hardware, that is causing these numbers to be > > > > > very low. > > > > > > > > > > Mihael, do you agree? > > > > > > > > Yes, but that does not exclude a limit in hardware in other > > > > configurations. > > > > > > > > > Can you help us figure out what's going on? > > > > > > > > Of course. > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From aespinosa at cs.uchicago.edu Sun Jan 16 19:38:31 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Sun, 16 Jan 2011 19:38:31 -0600 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <1518689411.68929.1295225764952.JavaMail.root@zimbra.anl.gov> References: <1295222635.5713.20.camel@blabla2.none> <1518689411.68929.1295225764952.JavaMail.root@zimbra.anl.gov> Message-ID: So for the measurement interface, are you measuring the total data received as the data arrives or when the received file is completely written to the job directory. I was measuring from the logs from JOB_START to JOB_END. I assumed the actualy job execution to be 0. The 7MB/s probably corresponds to Mihael's stage out results. the cat jobs dump to stdout (redirected to a file in the swift wrapper) probably shows the same behavior as the stageout. -Allan 2011/1/16 Michael Wilde : > Mihael, > > If I follow this right, your stageout rate (7MB/sec) is about what Allan saw > for stage ins (~ 6MB/sec as I recall). > > But your stagein rate is 59MB/sec and 73MB/sec depending on the mode (file vs > proxy). > > So that does not yet replicate what he's seeing (on the WAN). > > Allan, I cant recall if you sent numbers for local-host tests? yes. 
The rates are still at 5-7 MB/s > > - Mike > > > > > ----- Original Message ----- >> Right. So stageouts: >> Progress: time:86039 Selecting site:231 Submitted:1 Active:8 >> Finished successfully:16 >> [IN]: Total transferred: 597.1 MB, current rate: 8.99 MB/s, average >> rate: 7.02 MB/s >> [OUT] Total transferred: 161.33 KB, current rate: 0 B/s, average rate: >> 1.9 KB/s >> >> That's probably because, as opposed to when the Java side reads files, >> there is no read-ahead done by the worker. I'll see if I can add that. >> >> On Sun, 2011-01-16 at 15:21 -0800, Mihael Hategan wrote: >> > So I'm running some tests. >> > >> > So far here's how the stage-ins look like: >> > -site: local (my laptop) >> > -job input size: 32MB >> > -256 jobs >> > -job output size 0B >> > -measurement is made at the interface between service and worker >> > (the TCP connection). It is aggregated for all workers. >> > -2 workers, 4 jobs per worker >> > -this is the fast branch, but the job throughput is pretty >> > irrelevant here since this is I/O bound. >> > >> > What I get is this: >> > file: >> > [IN]: Total transferred: 641.93 KB, current rate: 0 B/s, average rate: >> > 4.62 KB/s >> > [OUT] Total transferred: 8 GB, current rate: 40.2 MB/s, average rate: >> > 58.92 MB/s >> > Final status: time:140732 Finished successfully:256 >> > Time: 142.13, rate: 1 j/s >> > >> > proxy: >> > Final status: time:113915 Finished successfully:256 >> > Time: 115.393, rate: 2 j/s >> > [IN]: Total transferred: 705.62 KB, current rate: 6.44 KB/s, average rate: >> > 6.24 KB/s >> > [OUT] Total transferred: 8 GB, current rate: 36.08 MB/s, average >> > rate: >> > 72.54 MB/s >> > >> > (It is funny that proxy is faster than "file", but for now I'll >> > ignore >> > that). >> > >> > For comparison: >> > mike at blabla2 coasters$ time dd if=/dev/zero of=~/tmp/8g bs=32KB >> > count=262144 >> > 262144+0 records in >> > 262144+0 records out >> > 8388608000 bytes (8.4 GB) copied, 49.9836 s, 168 MB/s >> > >> > Things could be improved, but staging in does not seem to be the >> > bottleneck. >> > Next, stage-outs... >> > >> > Mihael >> > >> > >> > >> > >> > On Wed, 2011-01-12 at 17:05 -0800, Mihael Hategan wrote: >> > > On Wed, 2011-01-12 at 15:57 -0600, Daniel S. Katz wrote: >> > > > As I read this, it appears that there is some limit in Swift, >> > > > rather than in the hardware, that is causing these numbers to be >> > > > very low. >> > > > >> > > > Mihael, do you agree? >> > > >> > > Yes, but that does not exclude a limit in hardware in other >> > > configurations. >> > > >> > > > ? Can you help us figure out what's going on? >> > > >> > > Of course. From hategan at mcs.anl.gov Sun Jan 16 20:02:26 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 16 Jan 2011 18:02:26 -0800 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: References: <1295222635.5713.20.camel@blabla2.none> <1518689411.68929.1295225764952.JavaMail.root@zimbra.anl.gov> Message-ID: <1295229746.25413.12.camel@blabla2.none> On Sun, 2011-01-16 at 19:38 -0600, Allan Espinosa wrote: > So for the measurement interface, are you measuring the total data received as > the data arrives or when the received file is completely written to the job > directory. The average is all the bytes that go from client to all the workers over the entire time spent to run the jobs. > > I was measuring from the logs from JOB_START to JOB_END. I assumed the actualy > job execution to be 0. 
The 7MB/s probably corresponds to Mihael's stage out > results. the cat jobs dump to stdout (redirected to a file in the swift > wrapper) probably shows the same behavior as the stageout. I'm becoming less surprised about 7MB/s in the local case. You have to multiply that by 6 to get the real disk I/O bandwidth: 1. client reads from disk 2. worker writes to disk 3. cat reads from disk 4. cat writes to disk 5. worker reads from disk 6. client writes to disk If it all happens on a single disk, then it adds up to about 42 MB/s, which is a reasonable fraction of what a normal disk can do. It would be useful to do a dd from /dev/zero to see what the actual disk performance is. From hategan at mcs.anl.gov Sun Jan 16 20:58:16 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 16 Jan 2011 18:58:16 -0800 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <1295229746.25413.12.camel@blabla2.none> References: <1295222635.5713.20.camel@blabla2.none> <1518689411.68929.1295225764952.JavaMail.root@zimbra.anl.gov> <1295229746.25413.12.camel@blabla2.none> Message-ID: <1295233096.28926.21.camel@blabla2.none> Ok, so I committed a fix to make the worker send files a bit faster and adjusted the buffer sizes a bit. There is a trade-off between per worker performance and number of workers, so this should probably be a setting of some sort (since when there are many workers, the client bandwidth becomes the bottleneck). With a plain cat, 4 workers, 1 job/w, and 32M files I get this: [IN]: Total transferred: 7.99 GB, current rate: 23.6 MB/s, average rate: 16.47 MB/s [MEM] Heap total: 155.31 MMB, Heap used: 104.2 MMB [OUT] Total transferred: 8 GB, current rate: 0 B/s, average rate: 16.49 MB/s Final status: time:498988 Finished successfully:256 Time: 500.653, rate: 0 j/s So the system probably sees 96 MB/s combined reads and writes. I'd be curious how this looks without caching, but during the run the computer became laggy, so it's saturating something in the OS and/or hardware. I'll test on a cluster next. On Sun, 2011-01-16 at 18:02 -0800, Mihael Hategan wrote: > On Sun, 2011-01-16 at 19:38 -0600, Allan Espinosa wrote: > > So for the measurement interface, are you measuring the total data received as > > the data arrives or when the received file is completely written to the job > > directory. > > The average is all the bytes that go from client to all the workers over > the entire time spent to run the jobs. > > > > > I was measuring from the logs from JOB_START to JOB_END. I assumed the actualy > > job execution to be 0. The 7MB/s probably corresponds to Mihael's stage out > > results. the cat jobs dump to stdout (redirected to a file in the swift > > wrapper) probably shows the same behavior as the stageout. > > I'm becoming less surprised about 7MB/s in the local case. You have to > multiply that by 6 to get the real disk I/O bandwidth: > 1. client reads from disk > 2. worker writes to disk > 3. cat reads from disk > 4. cat writes to disk > 5. worker reads from disk > 6. client writes to disk > > If it all happens on a single disk, then it adds up to about 42 MB/s, > which is a reasonable fraction of what a normal disk can do. It would be > useful to do a dd from /dev/zero to see what the actual disk performance > is. 
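For the raw-disk baseline suggested above, a quick way to separate write and read throughput on a single disk is a pair of dd runs with the page cache dropped in between. This is only a minimal sketch; the file path, size, and block size are arbitrary, and the cache-dropping step is Linux-specific and needs root.

# Write throughput; conv=fdatasync makes dd flush before reporting a rate.
TESTFILE=/tmp/ddtest.8g
dd if=/dev/zero of=$TESTFILE bs=1M count=8192 conv=fdatasync

# Drop the page cache so the read test actually hits the disk.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

# Read throughput.
dd if=$TESTFILE of=/dev/null bs=1M

rm -f $TESTFILE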
> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From skenny at uchicago.edu Sun Jan 16 22:15:44 2011 From: skenny at uchicago.edu (Sarah Kenny) Date: Sun, 16 Jan 2011 20:15:44 -0800 Subject: [Swift-devel] Please verify 0.92 checkout & build instructions In-Reply-To: <2057375353.60063.1294944224582.JavaMail.root@zimbra.anl.gov> References: <1070793442.59932.1294943053889.JavaMail.root@zimbra.anl.gov> <2057375353.60063.1294944224582.JavaMail.root@zimbra.anl.gov> Message-ID: On Thu, Jan 13, 2011 at 10:43 AM, Michael Wilde wrote: > OK, that *seems* to work. The argument string is now: > > arguments = /osg/data/engage/tmp/ > ce02.cmsaf.mit.edu/catsn-20110113-1240-vn86uhoc/shared/_swiftwrapcat-jaxfbf4k -jobdir j -scratch \"\" -e /bin/cat -out outdir/f.0001.out -err > stderr.txt -i -d outdir -if data.txt -of outdir/f.0001.out -k \"\" -cdmfile > \"\" -status file -a data.txt > > with \"\" for empty args. > > Im going to commit this and test, then get a build of trunk to Marc. > > Sarah, how/when do you want this and other fixes committed to the 0.92 > branch? > Do you want to do these mods? > i thought the plan was that anything fixes for .92 go into .92, and then we merge back with trunk after the release (?) however, i'm not quite sure what's broken here (or rather how you came across it)...is there a bug filed for this? > > - Mike > > > ----- Original Message ----- > > OK. I think the behavior Im seeing is due to 2 errors: > > > > - the new quote logic (from Ben) was (inadvertently, I think) dropped > > out of trunk at rev 2989 > > > > - the new quote logic itself has an error, in that instead of quoting > > a zero-length argument as \"\" it instead inserts nothing into the > > arguments= line for this case. For the examples Ive seen, this *seems* > > to cause no harm, and was enabling jobs to run. I think its incorrect, > > though. > > > > I will put this back in trunk, test on Condor-G, and report back. > > > > - Mike > > > > > > ----- Original Message ----- > > > On Thu, 2011-01-13 at 11:37 -0600, Michael Wilde wrote: > > > > I updated the ReleasePlans page with the checkout procedure that I > > > > used. > > > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans > > > > > > > > Could everyone verify that this is correct: > > > > > > > > svn co > > > > > https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/src/cog > > > > cd cog/modules > > > > svn co https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92 > > > > swift > > > > cd swift > > > > ant redist > > > > > > This is correct. > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wilde at mcs.anl.gov Sun Jan 16 22:28:53 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 16 Jan 2011 22:28:53 -0600 (CST) Subject: [Swift-devel] Please verify 0.92 checkout & build instructions In-Reply-To: Message-ID: <996487700.69010.1295238533878.JavaMail.root@zimbra.anl.gov> Sarah, this was files as bug 249: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=249 Mihael applied a fix for it a few days ago to trunk and 0.92. I came across it trying to run a workload for a UChicago user. - Mike ----- Original Message ----- On Thu, Jan 13, 2011 at 10:43 AM, Michael Wilde < wilde at mcs.anl.gov > wrote: OK, that *seems* to work. The argument string is now: arguments = /osg/data/engage/tmp/ ce02.cmsaf.mit.edu/catsn-20110113-1240-vn86uhoc/shared/_swiftwrap cat-jaxfbf4k -jobdir j -scratch \"\" -e /bin/cat -out outdir/f.0001.out -err stderr.txt -i -d outdir -if data.txt -of outdir/f.0001.out -k \"\" -cdmfile \"\" -status file -a data.txt with \"\" for empty args. Im going to commit this and test, then get a build of trunk to Marc. Sarah, how/when do you want this and other fixes committed to the 0.92 branch? Do you want to do these mods? i thought the plan was that anything fixes for .92 go into .92, and then we merge back with trunk after the release (?) however, i'm not quite sure what's broken here (or rather how you came across it)...is there a bug filed for this? - Mike ----- Original Message ----- > OK. I think the behavior Im seeing is due to 2 errors: > > - the new quote logic (from Ben) was (inadvertently, I think) dropped > out of trunk at rev 2989 > > - the new quote logic itself has an error, in that instead of quoting > a zero-length argument as \"\" it instead inserts nothing into the > arguments= line for this case. For the examples Ive seen, this *seems* > to cause no harm, and was enabling jobs to run. I think its incorrect, > though. > > I will put this back in trunk, test on Condor-G, and report back. > > - Mike > > > ----- Original Message ----- > > On Thu, 2011-01-13 at 11:37 -0600, Michael Wilde wrote: > > > I updated the ReleasePlans page with the checkout procedure that I > > > used. > > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ReleasePlans > > > > > > Could everyone verify that this is correct: > > > > > > svn co > > > https://cogkit.svn.sourceforge.net/svnroot/cogkit/branches/4.1.8/src/cog > > > cd cog/modules > > > svn co https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.92 > > > swift > > > cd swift > > > ant redist > > > > This is correct. > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aespinosa at cs.uchicago.edu Mon Jan 17 14:35:05 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 17 Jan 2011 14:35:05 -0600 Subject: [Swift-devel] provider staging vdl-int.staging.k doesn't honor wrapperlog.transfer.always Message-ID: vdl-int.staging.k line 239 There's no check for the swift config property. hence i'm alwaysgetting the wrapper logs in provider staging. -Allanb -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Mon Jan 17 14:41:46 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 17 Jan 2011 12:41:46 -0800 Subject: [Swift-devel] provider staging vdl-int.staging.k doesn't honor wrapperlog.transfer.always In-Reply-To: References: Message-ID: <1295296906.16323.1.camel@blabla2.none> There is no way to conditionally stage something out with provider staging. I.e., you can't say "transfer only if error occurs". But I suppose the flag could be used to disable wrapper log transfer entirely. On Mon, 2011-01-17 at 14:35 -0600, Allan Espinosa wrote: > vdl-int.staging.k line 239 > > There's no check for the swift config property. hence i'm > alwaysgetting the wrapper logs in provider staging. > > -Allanb > > From wozniak at mcs.anl.gov Mon Jan 17 14:49:30 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Mon, 17 Jan 2011 14:49:30 -0600 (Central Standard Time) Subject: [Swift-devel] provider staging vdl-int.staging.k doesn't honor wrapperlog.transfer.always In-Reply-To: References: Message-ID: Try the attached patch- I will test it a bit more before committing. Thanks for the report. Justin On Mon, 17 Jan 2011, Allan Espinosa wrote: > vdl-int.staging.k line 239 > > There's no check for the swift config property. hence i'm > alwaysgetting the wrapper logs in provider staging. > > -Allanb > > > -- Justin M Wozniak -------------- next part -------------- A non-text attachment was scrubbed... Name: wrapperlog-fix.diff Type: application/octet-stream Size: 755 bytes Desc: URL: From iraicu at cs.iit.edu Mon Jan 17 16:02:49 2011 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Mon, 17 Jan 2011 16:02:49 -0600 Subject: [Swift-devel] Final CFP: ACM HPDC 2011, deadline January 24th, 2011 Message-ID: <4D34BC89.5070400@cs.iit.edu> Call For Papers The 20th International ACM Symposium on High-Performance Parallel and Distributed Computing http://www.hpdc.org/2011/ San Jose, California, June 8-11, 2011 The ACM International Symposium on High-Performance Parallel and Distributed Computing is the premier conference for presenting the latest research on the design, implementation, evaluation, and use of parallel and distributed systems for high end computing. The 20th installment of HPDC will take place in San Jose, California, in the heart of Silicon Valley. This year, HPDC is affiliated with the ACM Federated Computing Research Conference, consisting of fifteen leading ACM conferences all in one week. HPDC will be held on June 9-11 (Thursday through Saturday) with affiliated workshops taking place on June 8th (Wednesday). Submissions are welcomed on all forms of high performance parallel and distributed computing, including but not limited to clusters, clouds, grids, utility computing, data-intensive computing, multicore and parallel computing. All papers will be reviewed by a distinguished program committee, with a strong preference for rigorous results obtained in operational parallel and distributed systems. 
All papers will be evaluated for correctness, originality, potential impact, quality of presentation, and interest and relevance to the conference. In addition to traditional technical papers, we also invite experience papers. Such papers should present operational details of a production high end system or application, and draw out conclusions gained from operating the system or application. The evaluation of experience papers will place a greater weight on the real-world impact of the system and the value of conclusions to future system designs. Topics of interest include, but are not limited to: ------------------------------------------------------------------------------- # Applications of parallel and distributed computing. # Systems, networks, and architectures for high end computing. # Parallel and multicore issues and opportunities. # Virtualization of machines, networks, and storage. # Programming languages and environments. # I/O, file systems, and data management. # Data intensive computing. # Resource management, scheduling, and load-balancing. # Performance modeling, simulation, and prediction. # Fault tolerance, reliability and availability. # Security, configuration, policy, and management issues. # Models and use cases for utility, grid, and cloud computing. Authors are invited to submit technical papers of at most 12 pages in PDF format, including all figures and references. Papers should be formatted in the ACM Proceedings Style and submitted via the conference web site. Accepted papers will appear in the conference proceedings, and will be incorporated into the ACM Digital Library. Papers must be self-contained and provide the technical substance required for the program committee to evaluate the paper's contribution. Papers should thoughtfully address all related work, particularly work presented at previous HPDC events. Submitted papers must be original work that has not appeared in and is not under consideration for another conference or a journal. See the ACM Prior Publication Policy for more details. Workshops ------------------------------------------------------------------------------- Seven workshops affiliated with HPDC will be held on Wednesday, June 8th. For more information, see the Workshops page at http://www.hpdc.org/2011/workshops.php. 
# ScienceCloud: 2nd Workshop on Scientific Cloud Computing # MapReduce: The Second International Workshop on MapReduce and its Applications # VTDC: Virtual Technologies in Distributed Computing # ECMLS: The Second International Emerging Computational Methods for the Life Sciences Workshop # LSAP: Workshop on Large-Scale System and Application Performance # DIDC: The Fourth International Workshop on Data-Intensive Distributed Computing # 3DAPAS: Workshop on Dynamic Distributed Data-Intensive Applications, Programming Abstractions, and Systems Important Dates ------------------------------------------------------------------------------- Technical Papers Due: 17 January 2011 PAPER DEADLINE EXTENDED: 24 January 2011 at 12:01 PM (NOON) Eastern Time Author Notifications: 28 February 2011 Final Papers Due: 24 March 2011 Conference Dates: 8-11 June 2011 Organization ------------------------------------------------------------------------------- General Chair Barney Maccabe, Oak Ridge National Laboratory Program Chair Douglas Thain, University of Notre Dame Workshops Chair Mike Lewis, Binghamton University Local Arrangements Chair Nick Wright, Lawrence Berkeley National Laboratory Student Activities Chairs Huaiming Song, Illinois Institute of Technology Hui Jin, Illinois Institute of Technology Publicity Chairs Alexandru Iosup, Delft University John Lange, University of Pittsburgh Ioan Raicu, Illinois Institute of Technology Yong Zhao, Microsoft Program Committee Kento Aida, National Institute of Informatics Henri Bal, Vrije Universiteit Roger Barga, Microsoft Jim Basney, NCSA John Bent, Los Alamos National Laboratory Ron Brightwell, Sandia National Laboratories Shawn Brown, Pittsburgh Supercomputer Center Claris Castillo, IBM Andrew A. Chien, UC San Diego and SDSC Ewa Deelman, USC Information Sciences Institute Peter Dinda, Northwestern University Scott Emrich, University of Notre Dame Dick Epema, Delft University of Technology Gilles Fedak, INRIA Renato Figuierdo, University of Florida Ian Foster, University of Chicago and Argonne National Laboratory Gabriele Garzoglio, Fermi National Accelerator Laboratory Rong Ge, Marquette University Sebastien Goasguen, Clemson University Kartik Gopalan, Binghamton University Dean Hildebrand, IBM Almaden Adriana Iamnitchi, University of South Florida Alexandru Iosup, Delft University of Technology Keith Jackson, Lawrence Berkeley Shantenu Jha, Louisiana State University Daniel S. Katz, University of Chicago and Argonne National Laboratory Thilo Kielmann, Vrije Universiteit Charles Killian, Purdue University Tevfik Kosar, Louisiana State University John Lange, University of Pittsburgh Mike Lewis, Binghamton University Barney Maccabe, Oak Ridge National Laboratory Grzegorz Malewicz, Google Satoshi Matsuoka, Tokyo Institute of Technology Jarek Nabrzyski, University of Notre Dame Manish Parashar, Rutgers University Beth Plale, Indiana University Ioan Raicu, Illinois Institute of Technology Philip Rhodes, University of Mississippi Matei Ripeanu, University of British Columbia Philip Roth, Oak Ridge National Laboratory Karsten Schwan, Georgia Tech Martin Swany, University of Delaware Jon Weissman, University of Minnesota Dongyan Xu, Purdue University Ken Yocum, UC San Diego Yong Zhao, Microsoft Steering Committee Henri Bal, Vrije Universiteit Andrew A. 
Chien, UC San Diego and SDSC Peter Dinda, Northwestern University Ian Foster, Argonne National Laboratory and University of Chicago Dennis Gannon, Microsoft Salim Hariri, University of Arizona Dieter Kranzlmueller, Ludwig-Maximilians-Univ. Muenchen Satoshi Matsuoka, Tokyo Institute of Technology Manish Parashar, Rutgers University Karsten Schwan, Georgia Tech Jon Weissman, University of Minnesota (Chair) -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email:iraicu at cs.iit.edu Web:http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.iit.edu Mon Jan 17 16:22:14 2011 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Mon, 17 Jan 2011 16:22:14 -0600 Subject: [Swift-devel] CFP: ACM ScienceCloud 2011, co-located with HPDC, deadline January 25th (abstract) and February 1st (paper) Message-ID: <4D34C116.2080606@cs.iit.edu> --------------------------------------------------------------------------------- * ** Call for Papers *** 2nd Workshop on Scientific Cloud Computing (ScienceCloud) 2011 In conjunction with ACM HPDC 2011, June 8th, 2011, San Jose, California http://www.cs.iit.edu/~iraicu/ScienceCloud2011/ --------------------------------------------------------------------------------- The advent of computation can be compared, in terms of the breadth and depth of its impact on research and scholarship, to the invention of writing and the development of modern mathematics. Scientific Computing has already begun to change how science is done, enabling scientific breakthroughs through new kinds of experiments that would have been impossible only a decade ago. Today's science is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. The support for data intensive computing is critical to advancing modern science as storage systems have experienced an increasing gap between their capacity and bandwidth by more than 10-fold over the last decade. There is an emerging need for advanced techniques to manipulate, visualize and interpret large datasets. Scientific computing involves a broad range of technologies, from high-performance computing (HPC) which is heavily focused on compute-intensive applications, high-throughput computing (HTC) which focuses on using many computing resources over long periods of time to accomplish its computational tasks, many-task computing (MTC) which aims to bridge the gap between HPC and HTC by focusing on using many resources over short periods of time, to data-intensive computing which is heavily focused on data distribution and harnessing data locality by scheduling of computations close to the data. 
The 2nd workshop on Scientific Cloud Computing (ScienceCloud) will provide the scientific community a dedicated forum for discussing new research, development, and deployment efforts in running these kinds of scientific computing workloads on Cloud Computing infrastructures. The ScienceCloud workshop will focus on the use of cloud-based technologies to meet new compute intensive and data intensive scientific challenges that are not well served by the current supercomputers, grids or commercial clouds. What architectural changes to the current cloud frameworks (hardware, operating systems, networking and/or programming models) are needed to support science? Dynamic information derived from remote instruments and coupled simulation and sensor ensembles are both important new science pathways and tremendous challenges for current HPC/HTC/MTC technologies. How can cloud technologies enable these new scientific approaches? How are scientists using clouds? Are there scientific HPC/HTC/MTC workloads that are suitable candidates to take advantage of emerging cloud computing resources with high efficiency? What benefits exist by adopting the cloud model, over clusters, grids, or supercomputers? What factors are limiting clouds use or would make them more usable/efficient? This workshop encourages interaction and cross-pollination between those developing applications, algorithms, software, hardware and networking, emphasizing scientific computing for such cloud platforms. We believe the workshop will be an excellent place to help the community define the current state, determine future goals, and define architectures and services for future science clouds. For more information about the workshop, please see http://www.cs.iit.edu/~iraicu/ScienceCloud2011/. To see last year's workshop program agenda, and accepted papers and presentations, please see http://dsl.cs.uchicago.edu/ScienceCloud2010/. 
TOPICS --------------------------------------------------------------------------------- # scientific computing applications * case studies on public, private and open source cloud computing * case studies comparing between cloud computing and cluster, grids, and/or supercomputers * performance evaluation # performance evaluation * real systems * cloud computing benchmarks * reliability of large systems # programming models and tools * map-reduce and its generalizations * many-task computing middleware and applications * integrating parallel programming frameworks with storage clouds * message passing interface (MPI) * service-oriented science applications # storage cloud architectures and implementations * distributed file systems * content distribution systems for large data * data caching frameworks and techniques * data management within and across data centers * data streaming applications * data-aware scheduling * data-intensive computing applications * eventual-consistency storage usage and management # compute resource management * dynamic resource provisioning * scheduling * techniques to manage many-core resources and/or GPUs # high-performance computing * high-performance I/O systems * interconnect and network interface architectures for HPC * multi-gigabit wide-area networking * scientific computing tradeoffs between clusters/grids/supercomputers and clouds * parallel file systems in dynamic environments # models, frameworks and systems for cloud security * implementation of access control and scalable isolation IMPORTANT DATES --------------------------------------------------------------------------------- Abstract submission: January 25th, 2011 Paper submission: February 1st, 2011 Acceptance notification: February 28th, 2011 Final papers due: March 24th, 2011 Workshop date: June 8th, 2011 PAPER SUBMISSION --------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages (including all text, figures, and references), as per ACM 8.5 x 11 manuscript guidelines (http://www.acm.org/publications/instructions_for_proceedings_volumes); document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/ScienceCloud2011/ before the deadline of January 25th, 2011 at 11:59PM PST; the final 5/10 page papers in PDF format will be due on February 1st, 2011 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library. Notifications of the paper decisions will be sent out by February 28th, 2011. Selected excellent work will be invited to submit extended versions of the workshop paper to a special issue journal. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please visit http://www.cs.iit.edu/~iraicu/ScienceCloud2011/. 
WORKSHOP GENERAL CHAIRS --------------------------------------------------------------------------------- * Ioan Raicu, Illinois Institute of Technology * Pete Beckman, University of Chicago& Argonne National Laboratory * Ian Foster, University of Chicago& Argonne National Laboratory PROGRAM CHAIR --------------------------------------------------------------------------------- Yogesh Simmhan, University of Southern California STEERING COMMITTEE --------------------------------------------------------------------------------- * Dennis Gannon, Microsoft Research, USA * Robert Grossman, University of Chicago, USA * Kate Keahey, Nimbus, University of Chicago, Argonne National Laboratory, USA * Ed Lazowska, University of Washington& Computing Community Consortium, USA * Ignacio Llorente, Open Nebula, Universidad Complutense de Madrid, Spain * David O'Hallaron, Carnegie Mellon University& Intel Labs, USA * Jack Dongarra, University of Tennessee, USA * Geoffrey Fox, Indiana University, USA PROGRAM COMMITTEE --------------------------------------------------------------------------------- * David Abramson, Monash University, Australia * Remzi Arpaci-Dusseau, University of Wisconsin, Madison * Roger Barga, Microsoft Research * Jeff Broughton, Lawrence Berkeley National Lab. * Rajkumar Buyya, University of Melbourne, Australia * Roy Campbell, Univ. of Illinois at Urbana Champaign * Henri Casanova, University of Hawaii at Manoa * Jeff Chase, Duke University * Alok Choudhary, Northwestern University * Peter Dinda, Northwestern University * Bill Howe, University of Washington * Alexandru Iosup, Delft University of Technology, Netherlands * Shantenu Jha, Louisiana State University * Tevfik Kosar, Louisiana State University * Shiyong Lu, Wayne State University * Joe Mambretti, Northwestern University * David Martin, Argonne National Laboratory * Gabriel Mateescu, University of Illinois at Urbana Champaign * Paolo Missier, University of Manchester, UK * Ruben Montero, Univ. Complutense de Madrid, Spain * Reagan Moore, Univ. of North Carolina, Chappel Hill * Jose Moreira, IBM Research * Jim Myers, NCSA * Viktor Prasanna, University of Southern California * Lavanya Ramakrishnan, Lawrence Berkeley Nat. Lab. * Matei Ripeanu, University of British Columbia, Canada * Josh Simons, VMWare * Marc Snir, University of Illinois at Urbana Champaign * Ion Stoica, University of California Berkeley * Yong Zhao, University of Electronic and Science Technology of China * Daniel Zinn, University of California at Davis -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email:iraicu at cs.iit.edu Web:http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iraicu at cs.iit.edu Mon Jan 17 16:50:51 2011 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Mon, 17 Jan 2011 16:50:51 -0600 Subject: [Swift-devel] CFP: Workshop on Large-scale System and Application Performance (LSAP) 2011, co-located with ACM HPDC 2011 Message-ID: <4D34C7CB.1040605@cs.iit.edu> Call for Papers --------------- Workshop on Large-scale System and Application Performance (LSAP2011) in conjunction with the 20-th International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC-20) San Jose, USA, June 8, 2011http://www.lsap2011.org MISSION Over the last decade, computer systems and applications in everyday use have grown to unprecedented scales. Large clusters serving millions of search requests per day, grids executing large workflows and parameter sweeps consisting of thousands of jobs, and supercomputers running complex e-science applications, have now hundreds of thousands of processing cores, and clouds are quickly emerging as a large-scale computing infrastructure. In addition, peer-to-peer systems and centralized video distribution systems that dominate the internet, online social networks, and complicated internet applications such as massive multiplayer online games are used by millions of people every day. In view of this tremendous growth, understanding the performance of large-scale computer systems and applications has become vital to institutional, commercial, and private interests. This workshop solicits original papers on performance evaluation methods, tools, and case studies *explicitly focusing on the challenges of large scale*, such as decentralization, predictable performance, reliability, and scalability. It aims to bring together system designers and researchers involved with the modeling and performance evaluation of large-scale systems and applications. Topics of interest include but are not limited to: - Performance aspects of large-scale systems - Performance aspects of large-scale applications - Performance-oriented properties such as availability, reliability, and scalability - Workload characterization and modeling - Mathematical modeling and analysis methods - Simulation methods and tools - Measurement methods and tools - Performance case studies - Exascale and beyond SUBMISSION GUIDELINES Submitted papers should be limited to 8 pages (including tables, images, and references) and should be formatted according to the ACM SIG Style. Please use the Linklings submission site to submit your paper (see link below); only pdf format is accepted. All papers will receive at least three reviews. Submission implies the willingness of at least one of the authors to register or the workshop and present the paper. The authors of the best paper in the workshop will receive a best-paper award. SUBMISSION SITE http://www.easychair.org/conferences/?conf=lsap2011 PROCEEDINGS Accepted workshop papers will appear in the HPDC conference proceedings and will be incorporated in the ACM Digital Library. IMPORTANT DATES Submission deadline: January 31, 2011 (11:59 PM EST) Author notification: February 28, 2011 Final papers due: March 24, 2011 Workshop: June 8, 2011 WORKSHOP WEBSITE www.lsap2011.org PROGRAM CO-CHAIRS Martin Arlitt, HP Labs, USA, and University of Calgary, CA Dick Epema, Delft University of Technology, NL, d.h.j.epema at tudelft.nl Jose Moreira, IBM T.J. 
Watson Research Lab, USA, jmoreira at us.ibm.com PROGRAM COMMITTEE Peter Buchholz, University of Dortmund, Germany Franck Cappello, INRIA, France/University of Illinois at Urbana-Champaign, USA Niklas Carlsson, University of Calgary, CA Pawel Garbacki, Google, USA Alexandru Iosup, Delft University of Technology, NL Evgenia Smirni, College of William and Mary, USA Swami Sivasubramanian, Amazon, USA Allen Snavely, University of California, San Diego, USA Denis Trystram, Laboratoire d'Informatique de Grenoble, FR CONTACT For further information please contact Dick Epema atd.h.j.epema at tudelft.nl. -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email:iraicu at cs.iit.edu Web:http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= -------------- next part -------------- A non-text attachment was scrubbed... Name: LSAP2011-CFP.pdf Type: application/pdf Size: 64672 bytes Desc: not available URL: From dsk at ci.uchicago.edu Tue Jan 18 10:41:13 2011 From: dsk at ci.uchicago.edu (Daniel S. Katz) Date: Tue, 18 Jan 2011 10:41:13 -0600 Subject: [Swift-devel] Re: Swift hang In-Reply-To: <4D24CB8E.6060304@gmail.com> References: <4D24CB8E.6060304@gmail.com> Message-ID: <1768E971-6A49-4CFB-8427-1FF6BFC5245C@ci.uchicago.edu> Hi Jon, How is this going? Dan On Jan 5, 2011, at 1:50 PM, Jonathan Monette wrote: > Hello, > I have encountered swift hanging. The deadlock appears to be in the same place every time. This deadlock does seem to be intermittent since smaller work sizes does complete. This job size is with approximately 1200 files. The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will try to recreate the problem using simple cat jobs. > > -- > Jon > > Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. > - Albert Einstein > -- Daniel S. Katz University of Chicago (773) 834-7186 (voice) (773) 834-3700 (fax) d.katz at ieee.org or dsk at ci.uchicago.edu http://www.ci.uchicago.edu/~dsk/ From jon.monette at gmail.com Tue Jan 18 17:17:04 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 18 Jan 2011 17:17:04 -0600 Subject: [Swift-devel] Re: Swift hang In-Reply-To: <1768E971-6A49-4CFB-8427-1FF6BFC5245C@ci.uchicago.edu> References: <4D24CB8E.6060304@gmail.com> <1768E971-6A49-4CFB-8427-1FF6BFC5245C@ci.uchicago.edu> Message-ID: <4D361F70.2050401@gmail.com> I have tried several configurations using coasters(using different number of blocks, workers per node, etc.) and tried following coasters code in cog to see if this is where Swift hung. Still haven't found out if coasters is the culprit for the hang. 
This has proved to be hard since the work flows that hang regularly are very large and take awhile to get to the point where they hang. Can anyone help with send me a sample configuration of what using straight pbs would look like? Something that doesn't use coasters just uses pbs. I believe this will help me narrow down if coasters is the culprit or is the problem really in Swift. Mihael, were you ever able to take a look at the log files for my runs to see if you saw anything? I know cleaning up for the next release has been more of the priority just wondering if you ever got a chance. On the plus side, I have also been developing a script that runs Swift that allows me to choose certain configurations so I don't have to be constantly changing files. This I believe will be useful in the overall final product(what ever that may be). On 1/18/11 10:41 AM, Daniel S. Katz wrote: > Hi Jon, > > How is this going? > > Dan > > > On Jan 5, 2011, at 1:50 PM, Jonathan Monette wrote: > >> Hello, >> I have encountered swift hanging. The deadlock appears to be in the same place every time. This deadlock does seem to be intermittent since smaller work sizes does complete. This job size is with approximately 1200 files. The behavior that the logs show is that the files needed for the job submission are staged in but no jobs are submitted. The Coaster heartbeat that appears in the swift logs shows that the job queue is empty. The logs for the runs are in ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will try to recreate the problem using simple cat jobs. >> >> -- >> Jon >> >> Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. >> - Albert Einstein >> From hategan at mcs.anl.gov Tue Jan 18 17:27:34 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 18 Jan 2011 15:27:34 -0800 Subject: [Swift-devel] Re: Swift hang In-Reply-To: <4D361F70.2050401@gmail.com> References: <4D24CB8E.6060304@gmail.com> <1768E971-6A49-4CFB-8427-1FF6BFC5245C@ci.uchicago.edu> <4D361F70.2050401@gmail.com> Message-ID: <1295393254.27847.0.camel@blabla2.none> On Tue, 2011-01-18 at 17:17 -0600, Jonathan Monette wrote: > Mihael, were you ever able to take a look at the log files for my runs > to see if you saw anything? I know cleaning up for the next release has > been more of the priority just wondering if you ever got a chance. Not yet. Can you please remind me where they were? Mihael From jon.monette at gmail.com Tue Jan 18 17:29:53 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 18 Jan 2011 17:29:53 -0600 Subject: [Swift-devel] Re: Swift hang In-Reply-To: <1295393254.27847.0.camel@blabla2.none> References: <4D24CB8E.6060304@gmail.com> <1768E971-6A49-4CFB-8427-1FF6BFC5245C@ci.uchicago.edu> <4D361F70.2050401@gmail.com> <1295393254.27847.0.camel@blabla2.none> Message-ID: <4D362271.8090504@gmail.com> ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3,4] On 1/18/11 5:27 PM, Mihael Hategan wrote: > On Tue, 2011-01-18 at 17:17 -0600, Jonathan Monette wrote: >> Mihael, were you ever able to take a look at the log files for my runs >> to see if you saw anything? I know cleaning up for the next release has >> been more of the priority just wondering if you ever got a chance. > Not yet. Can you please remind me where they were? 
> > Mihael > > From wilde at mcs.anl.gov Tue Jan 18 17:54:52 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 18 Jan 2011 17:54:52 -0600 (CST) Subject: [Swift-devel] Re: Swift hang In-Reply-To: <4D361F70.2050401@gmail.com> Message-ID: <241258457.78042.1295394892851.JavaMail.root@zimbra.anl.gov> PBS: 02:00:00 2.55 10000 short /home/wilde/swiftwork l adjust maxWallTime, throttle, and queue Maybe try running with just localhost, and jobThrottle=0.07 or 0.08 on a pads compute node, with as much of your data as possible, and your workdirectory, on a local disk (/scratch/local). Could also try that on Ranger. For PADS use qsub -I -l walltime=01:00:00 to get a local node to login to to get a quiet node all to yourself. - Mike ----- Original Message ----- > I have tried several configurations using coasters(using different > number of blocks, workers per node, etc.) and tried following coasters > code in cog to see if this is where Swift hung. Still haven't found > out > if coasters is the culprit for the hang. This has proved to be hard > since the work flows that hang regularly are very large and take > awhile > to get to the point where they hang. > > Can anyone help with send me a sample configuration of what using > straight pbs would look like? Something that doesn't use coasters just > uses pbs. I believe this will help me narrow down if coasters is the > culprit or is the problem really in Swift. > > Mihael, were you ever able to take a look at the log files for my runs > to see if you saw anything? I know cleaning up for the next release > has > been more of the priority just wondering if you ever got a chance. > > On the plus side, I have also been developing a script that runs Swift > that allows me to choose certain configurations so I don't have to be > constantly changing files. This I believe will be useful in the > overall > final product(what ever that may be). > > On 1/18/11 10:41 AM, Daniel S. Katz wrote: > > Hi Jon, > > > > How is this going? > > > > Dan > > > > > > On Jan 5, 2011, at 1:50 PM, Jonathan Monette wrote: > > > >> Hello, > >> I have encountered swift hanging. The deadlock appears to be in > >> the same place every time. This deadlock does seem to be > >> intermittent since smaller work sizes does complete. This job > >> size is with approximately 1200 files. The behavior that the > >> logs show is that the files needed for the job submission are > >> staged in but no jobs are submitted. The Coaster heartbeat that > >> appears in the swift logs shows that the job queue is empty. The > >> logs for the runs are in > >> ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will > >> try to recreate the problem using simple cat jobs. > >> > >> -- > >> Jon > >> > >> Computers are incredibly fast, accurate, and stupid. Human beings > >> are incredibly slow, inaccurate, and brilliant. Together they are > >> powerful beyond imagination. > >> - Albert Einstein > >> -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Jan 18 17:57:06 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 18 Jan 2011 15:57:06 -0800 Subject: [Swift-devel] Re: Swift hang In-Reply-To: <4D362271.8090504@gmail.com> References: <4D24CB8E.6060304@gmail.com> <1768E971-6A49-4CFB-8427-1FF6BFC5245C@ci.uchicago.edu> <4D361F70.2050401@gmail.com> <1295393254.27847.0.camel@blabla2.none> <4D362271.8090504@gmail.com> Message-ID: <1295395026.27847.19.camel@blabla2.none> Ok. 
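The sites.xml entry in Mike's message above appears to have lost its XML tags on the way through the list (they read as HTML markup and were stripped), leaving only the values: 02:00:00, 2.55, 10000, short, /home/wilde/swiftwork. Read back against the usual Swift sites.xml conventions, and taking Mike's "adjust maxWallTime, throttle, and queue" as naming the keys involved, a plausible reconstruction of the pool he sent, which is also the coaster-free PBS example Jon asked for, is roughly:

    <pool handle="pbs">
      <execution provider="pbs" url="none"/>
      <profile namespace="globus" key="maxWallTime">02:00:00</profile>
      <profile namespace="karajan" key="jobThrottle">2.55</profile>
      <profile namespace="karajan" key="initialScore">10000</profile>
      <profile namespace="globus" key="queue">short</profile>
      <workdirectory>/home/wilde/swiftwork</workdirectory>
    </pool>

The element names and namespace assignments here are assumptions, not a copy of the original text. With execution provider="pbs" each app invocation is submitted as its own PBS job, so no coaster service is involved. The karajan jobThrottle profile allows roughly 100*N+1 concurrent jobs, so 2.55 corresponds to about 256 jobs in flight, while the 0.07 or 0.08 Mike suggests for a single-node localhost run corresponds to 8 or 9.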
Well, this may not be related, but it looks funny: 2011-01-05 12:32:40,108-0600 INFO Block 0105-311257-000027 stdout: ---------------------------------------- Begin PBS Prologue Wed Jan 5 12:32:10 CST 2011 Job ID: 788859.svc.pads.ci.uchicago.edu Username: jonmon Group: ci-users Nodes: c17.pads.ci.uchicago.edu End PBS Prologue Wed Jan 5 12:32:10 CST 2011 ---------------------------------------- 2011-01-05 12:32:40,109-0600 INFO Block 0105-311257-000027 stderr: syntax error at /var/spool/torque/mom_priv/epilogue line 18, near "my " Global symbol "$setupnode" requires explicit package name at /var/spool/torque/mom_priv/epilogue line 18. Global symbol "$setupnode" requires explicit package name at /var/spool/torque/mom_priv/epilogue line 41. Execution of /var/spool/torque/mom_priv/epilogue aborted due to compilation errors. ======================= And here is the reason for the hang: 2011-01-05 12:48:37,734-0600 INFO CoasterService Idle time: 125670 2011-01-05 12:48:37,735-0600 WARN CoasterService Idle time exceeded. Shutting down service. I thought I fixed that earlier, but I may be imagining things. Anyway, fix in cog/trunk/r3030. Mihael On Tue, 2011-01-18 at 17:29 -0600, Jonathan Monette wrote: > ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3,4] > > On 1/18/11 5:27 PM, Mihael Hategan wrote: > > On Tue, 2011-01-18 at 17:17 -0600, Jonathan Monette wrote: > >> Mihael, were you ever able to take a look at the log files for my runs > >> to see if you saw anything? I know cleaning up for the next release has > >> been more of the priority just wondering if you ever got a chance. > > Not yet. Can you please remind me where they were? > > > > Mihael > > > > From jon.monette at gmail.com Tue Jan 18 17:59:07 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 18 Jan 2011 17:59:07 -0600 Subject: [Swift-devel] Re: Swift hang In-Reply-To: <241258457.78042.1295394892851.JavaMail.root@zimbra.anl.gov> References: <241258457.78042.1295394892851.JavaMail.root@zimbra.anl.gov> Message-ID: <4D36294B.2010602@gmail.com> Thanks for the PBS entry. Will try that right now. I was going to try just localhost last since my work flows already take quite a bit of time(at least 30 mins to get to a hang using 40 workers). Adjusting the parameters for optimal use probably would bring that time down but the hang has been a priority than finding the optimal numbers. On 1/18/11 5:54 PM, Michael Wilde wrote: > PBS: > > > > > > 02:00:00 > 2.55 > 10000 > short > > /home/wilde/swiftwork > > l > > adjust maxWallTime, throttle, and queue > > Maybe try running with just localhost, and jobThrottle=0.07 or 0.08 on a pads compute node, with as much of your data as possible, and your workdirectory, on a local disk (/scratch/local). Could also try that on Ranger. For PADS use qsub -I -l walltime=01:00:00 to get a local node to login to to get a quiet node all to yourself. > > - Mike > > > > ----- Original Message ----- >> I have tried several configurations using coasters(using different >> number of blocks, workers per node, etc.) and tried following coasters >> code in cog to see if this is where Swift hung. Still haven't found >> out >> if coasters is the culprit for the hang. This has proved to be hard >> since the work flows that hang regularly are very large and take >> awhile >> to get to the point where they hang. >> >> Can anyone help with send me a sample configuration of what using >> straight pbs would look like? Something that doesn't use coasters just >> uses pbs. 
I believe this will help me narrow down if coasters is the >> culprit or is the problem really in Swift. >> >> Mihael, were you ever able to take a look at the log files for my runs >> to see if you saw anything? I know cleaning up for the next release >> has >> been more of the priority just wondering if you ever got a chance. >> >> On the plus side, I have also been developing a script that runs Swift >> that allows me to choose certain configurations so I don't have to be >> constantly changing files. This I believe will be useful in the >> overall >> final product(what ever that may be). >> >> On 1/18/11 10:41 AM, Daniel S. Katz wrote: >>> Hi Jon, >>> >>> How is this going? >>> >>> Dan >>> >>> >>> On Jan 5, 2011, at 1:50 PM, Jonathan Monette wrote: >>> >>>> Hello, >>>> I have encountered swift hanging. The deadlock appears to be in >>>> the same place every time. This deadlock does seem to be >>>> intermittent since smaller work sizes does complete. This job >>>> size is with approximately 1200 files. The behavior that the >>>> logs show is that the files needed for the job submission are >>>> staged in but no jobs are submitted. The Coaster heartbeat that >>>> appears in the swift logs shows that the job queue is empty. The >>>> logs for the runs are in >>>> ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3] I will >>>> try to recreate the problem using simple cat jobs. >>>> >>>> -- >>>> Jon >>>> >>>> Computers are incredibly fast, accurate, and stupid. Human beings >>>> are incredibly slow, inaccurate, and brilliant. Together they are >>>> powerful beyond imagination. >>>> - Albert Einstein >>>> From jon.monette at gmail.com Tue Jan 18 18:03:13 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 18 Jan 2011 18:03:13 -0600 Subject: [Swift-devel] Re: Swift hang In-Reply-To: <1295395026.27847.19.camel@blabla2.none> References: <4D24CB8E.6060304@gmail.com> <1768E971-6A49-4CFB-8427-1FF6BFC5245C@ci.uchicago.edu> <4D361F70.2050401@gmail.com> <1295393254.27847.0.camel@blabla2.none> <4D362271.8090504@gmail.com> <1295395026.27847.19.camel@blabla2.none> Message-ID: <4D362A41.8030106@gmail.com> On 1/18/11 5:57 PM, Mihael Hategan wrote: > Ok. Well, this may not be related, but it looks funny: > > 2011-01-05 12:32:40,108-0600 INFO Block 0105-311257-000027 stdout: > ---------------------------------------- > Begin PBS Prologue Wed Jan 5 12:32:10 CST 2011 > Job ID: 788859.svc.pads.ci.uchicago.edu > Username: jonmon > Group: ci-users > Nodes: c17.pads.ci.uchicago.edu > End PBS Prologue Wed Jan 5 12:32:10 CST 2011 > ---------------------------------------- > > > 2011-01-05 12:32:40,109-0600 INFO Block 0105-311257-000027 stderr: > syntax error at /var/spool/torque/mom_priv/epilogue line 18, near "my " > Global symbol "$setupnode" requires explicit package name > at /var/spool/torque/mom_priv/epilogue line 18. > Global symbol "$setupnode" requires explicit package name > at /var/spool/torque/mom_priv/epilogue line 41. > Execution of /var/spool/torque/mom_priv/epilogue aborted due to > compilation errors. > > ======================= > And here is the reason for the hang: > > 2011-01-05 12:48:37,734-0600 INFO CoasterService Idle time: 125670 > 2011-01-05 12:48:37,735-0600 WARN CoasterService Idle time exceeded. > Shutting down service. I think Swift reported this that the service was shutdown due to idle time exceeded but I assumed that the service restarted when needed. At least that is the assumption that I have been going off of. 
> I thought I fixed that earlier, but I may be imagining things. Anyway, > fix in cog/trunk/r3030. > > Mihael > > On Tue, 2011-01-18 at 17:29 -0600, Jonathan Monette wrote: >> ~jonmon/Workspace/Swift/Montage/m101_j_6x6/run.000[1,2,3,4] >> >> On 1/18/11 5:27 PM, Mihael Hategan wrote: >>> On Tue, 2011-01-18 at 17:17 -0600, Jonathan Monette wrote: >>>> Mihael, were you ever able to take a look at the log files for my runs >>>> to see if you saw anything? I know cleaning up for the next release has >>>> been more of the priority just wondering if you ever got a chance. >>> Not yet. Can you please remind me where they were? >>> >>> Mihael >>> >>> > From hategan at mcs.anl.gov Tue Jan 18 18:14:04 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 18 Jan 2011 16:14:04 -0800 Subject: [Swift-devel] Re: Swift hang In-Reply-To: <4D362A41.8030106@gmail.com> References: <4D24CB8E.6060304@gmail.com> <1768E971-6A49-4CFB-8427-1FF6BFC5245C@ci.uchicago.edu> <4D361F70.2050401@gmail.com> <1295393254.27847.0.camel@blabla2.none> <4D362271.8090504@gmail.com> <1295395026.27847.19.camel@blabla2.none> <4D362A41.8030106@gmail.com> Message-ID: <1295396044.29548.2.camel@blabla2.none> On Tue, 2011-01-18 at 18:03 -0600, Jonathan Monette wrote: > > 2011-01-05 12:48:37,734-0600 INFO CoasterService Idle time: 125670 > > 2011-01-05 12:48:37,735-0600 WARN CoasterService Idle time exceeded. > > Shutting down service. > I think Swift reported this that the service was shutdown due to idle > time exceeded but I assumed that the service restarted when needed. At > least that is the assumption that I have been going off of. The reason for having the idle timeout is to prevent services from taking up resources when there is no work being done and to also prevent services from staying up when the client has died. In the local case, since they go down with the JVM and they do not fundamentally eat more resources (the JVM stays up anyway), this mechanism has no good reason to be. Whether it's restarted or not, maybe. But simpler code I trust more. Also, I'll commit the two new fixes to the stable branch. Mihael From jon.monette at gmail.com Tue Jan 18 22:30:19 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Tue, 18 Jan 2011 22:30:19 -0600 Subject: [Swift-devel] Re: Swift hang In-Reply-To: <1295396044.29548.2.camel@blabla2.none> References: <4D24CB8E.6060304@gmail.com> <1768E971-6A49-4CFB-8427-1FF6BFC5245C@ci.uchicago.edu> <4D361F70.2050401@gmail.com> <1295393254.27847.0.camel@blabla2.none> <4D362271.8090504@gmail.com> <1295395026.27847.19.camel@blabla2.none> <4D362A41.8030106@gmail.com> <1295396044.29548.2.camel@blabla2.none> Message-ID: <4D3668DB.4000303@gmail.com> Ok. That work flow just finished. PADS was backed and it took awhile for my jobs to start. I am starting my largest work flow right now. If all completes nicely I can start tuning the parameters in the sites.xml and tc.data files to speed it up. Thanks Mihael. On 1/18/11 6:14 PM, Mihael Hategan wrote: > On Tue, 2011-01-18 at 18:03 -0600, Jonathan Monette wrote: >>> 2011-01-05 12:48:37,734-0600 INFO CoasterService Idle time: 125670 >>> 2011-01-05 12:48:37,735-0600 WARN CoasterService Idle time exceeded. >>> Shutting down service. >> I think Swift reported this that the service was shutdown due to idle >> time exceeded but I assumed that the service restarted when needed. At >> least that is the assumption that I have been going off of. 
> The reason for having the idle timeout is to prevent services from > taking up resources when there is no work being done and to also prevent > services from staying up when the client has died. > > In the local case, since they go down with the JVM and they do not > fundamentally eat more resources (the JVM stays up anyway), this > mechanism has no good reason to be. Whether it's restarted or not, > maybe. But simpler code I trust more. > > Also, I'll commit the two new fixes to the stable branch. > > Mihael > From wilde at mcs.anl.gov Wed Jan 19 10:12:28 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Jan 2011 10:12:28 -0600 (CST) Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <1295233096.28926.21.camel@blabla2.none> Message-ID: <817314590.80100.1295453548072.JavaMail.root@zimbra.anl.gov> Continuing to work on resolving this problem. I think the next step is to methodically test provider staginug moving from the single-node test to multi-node local (pads) and then to multi-node wan tests. Now that the native coaster job rate to a single one-core worker is better understood (and seems to be 4-5 jobs per second) we can now devise tests with a better understanding of more factors involved. I tried a local test on pads login(at a fairly quiet time, unloaded) as follows: - local coasters service (in Swift jvm) - app is "mv" (to avoid extra data movement) - same input data file is used (so its likely in kernel block cache) - unique output file is used - swift and cwd is on /scratch local disk - file is 3MB (to be closer to Allan's 2.3 MB) - mv app stages file to worker and back (no app reads or writes) - workers per node = 8 (on an 8 core host) - throttle of 200 jobs (2.0) - 100 jobs per swift script invocation I get just over 5 apps/sec or 30MB/sec with this setup. Allan, I'd like to suggest you take it from here, but lets talk as soon as possible this morning to make a plan. One approach that may be fruitful is to re-design a remote test that is closer to what a real scec workload would be (basically your prior tests with some adjustment to the concurrency: more workers per site, and more overall files going in parallel. Then, every time we have a new insight or code change, re-test the larger-scale WAN test in parallel with continuing down the micro-test methods. That way, as soon as we hit a breakthrough that reaches your requires WAN data transfer rate, you can restart the full scec workflow, while we continue to analyze swift behavior issues with the simpler micro benchmarks. Regards, Mike ----- Original Message ----- > Ok, so I committed a fix to make the worker send files a bit faster > and > adjusted the buffer sizes a bit. There is a trade-off between per > worker > performance and number of workers, so this should probably be a > setting > of some sort (since when there are many workers, the client bandwidth > becomes the bottleneck). > > With a plain cat, 4 workers, 1 job/w, and 32M files I get this: > [IN]: Total transferred: 7.99 GB, current rate: 23.6 MB/s, average > rate: > 16.47 MB/s > [MEM] Heap total: 155.31 MMB, Heap used: 104.2 MMB > [OUT] Total transferred: 8 GB, current rate: 0 B/s, average rate: > 16.49 > MB/s > Final status: time:498988 Finished successfully:256 > Time: 500.653, rate: 0 j/s > > So the system probably sees 96 MB/s combined reads and writes. I'd be > curious how this looks without caching, but during the run the > computer > became laggy, so it's saturating something in the OS and/or hardware. 
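The single-node test Mike describes above (one 3MB input staged to every job, a unique output staged back, the app a bare mv) corresponds to a very small SwiftScript. The sketch below shows the shape of such a script; it is not the actual test script from the run, and the file names, the "move" app name, and the output directory are made up:

    type file;

    app (file o) move (file f)
    {
      mv @f @o;
    }

    file input <"data-3MB.bin">;

    foreach i in [1:100] {
      file out <single_file_mapper; file=@strcat("out/copy-", i, ".bin")>;
      out = move(input);
    }

The tc.data for the site would also need the usual line mapping the mv transformation to the binary, something like "pads mv /bin/mv INSTALLED INTEL32::LINUX null" (site name assumed).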
> > I'll test on a cluster next. > > On Sun, 2011-01-16 at 18:02 -0800, Mihael Hategan wrote: > > On Sun, 2011-01-16 at 19:38 -0600, Allan Espinosa wrote: > > > So for the measurement interface, are you measuring the total data > > > received as > > > the data arrives or when the received file is completely written > > > to the job > > > directory. > > > > The average is all the bytes that go from client to all the workers > > over > > the entire time spent to run the jobs. > > > > > > > > I was measuring from the logs from JOB_START to JOB_END. I assumed > > > the actualy > > > job execution to be 0. The 7MB/s probably corresponds to Mihael's > > > stage out > > > results. the cat jobs dump to stdout (redirected to a file in the > > > swift > > > wrapper) probably shows the same behavior as the stageout. > > > > I'm becoming less surprised about 7MB/s in the local case. You have > > to > > multiply that by 6 to get the real disk I/O bandwidth: > > 1. client reads from disk > > 2. worker writes to disk > > 3. cat reads from disk > > 4. cat writes to disk > > 5. worker reads from disk > > 6. client writes to disk > > > > If it all happens on a single disk, then it adds up to about 42 > > MB/s, > > which is a reasonable fraction of what a normal disk can do. It > > would be > > useful to do a dd from /dev/zero to see what the actual disk > > performance > > is. > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Jan 19 10:29:37 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Jan 2011 10:29:37 -0600 (CST) Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <817314590.80100.1295453548072.JavaMail.root@zimbra.anl.gov> Message-ID: <699977417.80246.1295454577878.JavaMail.root@zimbra.anl.gov> I forgot to also state regarding the test below: - tried both stagingMethod proxy and file - no significant perf diff - tried workersPerNode > 8 (10 seemed to cause slight degradation, so I went back to 8; this test may have just been noise) - I used this recent swift and cog: Swift svn swift-r3997 cog-r3029 (cog modified locally) I think the following tests would be good to do in the micro-series: - re-do the above on a quiet dedicated PADS cluster node (qsub -I) - try the test with just input and just output - try the test with a 1-byte file (to see what the protocol issues are) - try the test with a 30MB file (to try to replicate Mihaels results) - try testing from one pads node client to say 3-4 other pads nodes (again, with either qsub -I or a swift run with auto coasters and the maxNode and nodeGranularity set to say 4 & 4 (or 5 & 5, etc) This last test will probe the ability of swift to move more tasks/sec when there are more concurrent app-job endpoints (ie when swift is driving more cores). We *think* swift trunk should be able to drive >100 tasks/sec - maybe even 200/sec - when the configuration is optimized: all local disk use; log settings tuned, perhaps; throttles set right; ...) 
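For the "auto coasters with maxNode and nodeGranularity set to 4 & 4" item above, the knobs mentioned in this thread (workersPerNode, jobThrottle, node counts) are sites.xml profiles on a coaster pool. A sketch of such a pool follows; the key names follow the usual coaster profile names, but the handle, walltime and work directory are placeholders rather than anyone's actual configuration:

    <pool handle="pads-coasters">
      <execution provider="coaster" url="none" jobmanager="local:pbs"/>
      <profile namespace="globus" key="maxNodes">4</profile>
      <profile namespace="globus" key="nodeGranularity">4</profile>
      <profile namespace="globus" key="workersPerNode">8</profile>
      <profile namespace="globus" key="maxWallTime">00:10:00</profile>
      <profile namespace="karajan" key="jobThrottle">2.0</profile>
      <profile namespace="karajan" key="initialScore">10000</profile>
      <workdirectory>/scratch/local/swiftwork</workdirectory>
    </pool>

A jobThrottle of 2.0 allows about 201 concurrent jobs, which matches the "throttle of 200 jobs (2.0)" used in the single-node test, and workersPerNode=8 matches the 8-core nodes.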
Then also try swift fast branch, but Mihael needs to post (or you need to check in svn) whether all the latest provider staging improvements have been, or could be, applied to fast branch. Lastly, for the wide area test: 10 OSG sites try to keep say 2 < N < 10 workers active per site (using queue-N COndor script) with most sites having large numbers of workers. That should more closely mimic the load your will need to drive for the actual application. workersPerNode=1 The wan test will likely require more thought. - Mike ----- Original Message ----- > Continuing to work on resolving this problem. > > I think the next step is to methodically test provider staginug moving > from the single-node test to multi-node local (pads) and then to > multi-node wan tests. > > Now that the native coaster job rate to a single one-core worker is > better understood (and seems to be 4-5 jobs per second) we can now > devise tests with a better understanding of more factors involved. > > I tried a local test on pads login(at a fairly quiet time, unloaded) > as follows: > - local coasters service (in Swift jvm) > - app is "mv" (to avoid extra data movement) > - same input data file is used (so its likely in kernel block cache) > - unique output file is used > - swift and cwd is on /scratch local disk > - file is 3MB (to be closer to Allan's 2.3 MB) > - mv app stages file to worker and back (no app reads or writes) > - workers per node = 8 (on an 8 core host) > - throttle of 200 jobs (2.0) > - 100 jobs per swift script invocation > > I get just over 5 apps/sec or 30MB/sec with this setup. > > Allan, I'd like to suggest you take it from here, but lets talk as > soon as possible this morning to make a plan. > > One approach that may be fruitful is to re-design a remote test that > is closer to what a real scec workload would be (basically your prior > tests with some adjustment to the concurrency: more workers per site, > and more overall files going in parallel. > > Then, every time we have a new insight or code change, re-test the > larger-scale WAN test in parallel with continuing down the micro-test > methods. That way, as soon as we hit a breakthrough that reaches your > requires WAN data transfer rate, you can restart the full scec > workflow, while we continue to analyze swift behavior issues with the > simpler micro benchmarks. > > Regards, > > Mike > > > ----- Original Message ----- > > Ok, so I committed a fix to make the worker send files a bit faster > > and > > adjusted the buffer sizes a bit. There is a trade-off between per > > worker > > performance and number of workers, so this should probably be a > > setting > > of some sort (since when there are many workers, the client > > bandwidth > > becomes the bottleneck). > > > > With a plain cat, 4 workers, 1 job/w, and 32M files I get this: > > [IN]: Total transferred: 7.99 GB, current rate: 23.6 MB/s, average > > rate: > > 16.47 MB/s > > [MEM] Heap total: 155.31 MMB, Heap used: 104.2 MMB > > [OUT] Total transferred: 8 GB, current rate: 0 B/s, average rate: > > 16.49 > > MB/s > > Final status: time:498988 Finished successfully:256 > > Time: 500.653, rate: 0 j/s > > > > So the system probably sees 96 MB/s combined reads and writes. I'd > > be > > curious how this looks without caching, but during the run the > > computer > > became laggy, so it's saturating something in the OS and/or > > hardware. > > > > I'll test on a cluster next. 
> > > > On Sun, 2011-01-16 at 18:02 -0800, Mihael Hategan wrote: > > > On Sun, 2011-01-16 at 19:38 -0600, Allan Espinosa wrote: > > > > So for the measurement interface, are you measuring the total > > > > data > > > > received as > > > > the data arrives or when the received file is completely written > > > > to the job > > > > directory. > > > > > > The average is all the bytes that go from client to all the > > > workers > > > over > > > the entire time spent to run the jobs. > > > > > > > > > > > I was measuring from the logs from JOB_START to JOB_END. I > > > > assumed > > > > the actualy > > > > job execution to be 0. The 7MB/s probably corresponds to > > > > Mihael's > > > > stage out > > > > results. the cat jobs dump to stdout (redirected to a file in > > > > the > > > > swift > > > > wrapper) probably shows the same behavior as the stageout. > > > > > > I'm becoming less surprised about 7MB/s in the local case. You > > > have > > > to > > > multiply that by 6 to get the real disk I/O bandwidth: > > > 1. client reads from disk > > > 2. worker writes to disk > > > 3. cat reads from disk > > > 4. cat writes to disk > > > 5. worker reads from disk > > > 6. client writes to disk > > > > > > If it all happens on a single disk, then it adds up to about 42 > > > MB/s, > > > which is a reasonable fraction of what a normal disk can do. It > > > would be > > > useful to do a dd from /dev/zero to see what the actual disk > > > performance > > > is. > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Jan 19 10:45:03 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Jan 2011 10:45:03 -0600 (CST) Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <699977417.80246.1295454577878.JavaMail.root@zimbra.anl.gov> Message-ID: <1517889091.80408.1295455503257.JavaMail.root@zimbra.anl.gov> A few more test results: moving 3 byte files: this runs at about 20 jobs/sec in the single-node 8-core test. moving 30MB files: runs 100 jobs in 143 secs = about 40 MB/sec total in/out Both tests are using a single input file going to all jobs and N unique output files coming back. So the latter job I think is about the same ballpark as Mihael's latest results? And the former job confirms that provider staging does not seem to slow down the job rate unacceptably. 
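(Arithmetic check on the 30MB case: if each of the 100 jobs stages the 30MB input in and a 30MB output back, that is about 6 GB through the client in 143 seconds, i.e. roughly 42 MB/s, consistent with the "about 40 MB/sec total in/out" figure.)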
- Mike ----- Original Message ----- > I forgot to also state regarding the test below: > - tried both stagingMethod proxy and file - no significant perf diff > - tried workersPerNode > 8 (10 seemed to cause slight degradation, so > I went back to 8; this test may have just been noise) > - I used this recent swift and cog: Swift svn swift-r3997 cog-r3029 > (cog modified locally) > > I think the following tests would be good to do in the micro-series: > > - re-do the above on a quiet dedicated PADS cluster node (qsub -I) > > - try the test with just input and just output > - try the test with a 1-byte file (to see what the protocol issues > are) > - try the test with a 30MB file (to try to replicate Mihaels results) > > - try testing from one pads node client to say 3-4 other pads nodes > (again, with either qsub -I or a swift run with auto coasters and the > maxNode and nodeGranularity set to say 4 & 4 (or 5 & 5, etc) > > This last test will probe the ability of swift to move more tasks/sec > when there are more concurrent app-job endpoints (ie when swift is > driving more cores). We *think* swift trunk should be able to drive > >100 tasks/sec - maybe even 200/sec - when the configuration is > optimized: all local disk use; log settings tuned, perhaps; throttles > set right; ...) > > Then also try swift fast branch, but Mihael needs to post (or you need > to check in svn) whether all the latest provider staging improvements > have been, or could be, applied to fast branch. > > Lastly, for the wide area test: > > 10 OSG sites > try to keep say 2 < N < 10 workers active per site (using queue-N > COndor script) with most sites having large numbers of workers. That > should more closely mimic the load your will need to drive for the > actual application. > workersPerNode=1 > > The wan test will likely require more thought. > > - Mike > > ----- Original Message ----- > > Continuing to work on resolving this problem. > > > > I think the next step is to methodically test provider staginug > > moving > > from the single-node test to multi-node local (pads) and then to > > multi-node wan tests. > > > > Now that the native coaster job rate to a single one-core worker is > > better understood (and seems to be 4-5 jobs per second) we can now > > devise tests with a better understanding of more factors involved. > > > > I tried a local test on pads login(at a fairly quiet time, unloaded) > > as follows: > > - local coasters service (in Swift jvm) > > - app is "mv" (to avoid extra data movement) > > - same input data file is used (so its likely in kernel block cache) > > - unique output file is used > > - swift and cwd is on /scratch local disk > > - file is 3MB (to be closer to Allan's 2.3 MB) > > - mv app stages file to worker and back (no app reads or writes) > > - workers per node = 8 (on an 8 core host) > > - throttle of 200 jobs (2.0) > > - 100 jobs per swift script invocation > > > > I get just over 5 apps/sec or 30MB/sec with this setup. > > > > Allan, I'd like to suggest you take it from here, but lets talk as > > soon as possible this morning to make a plan. > > > > One approach that may be fruitful is to re-design a remote test that > > is closer to what a real scec workload would be (basically your > > prior > > tests with some adjustment to the concurrency: more workers per > > site, > > and more overall files going in parallel. > > > > Then, every time we have a new insight or code change, re-test the > > larger-scale WAN test in parallel with continuing down the > > micro-test > > methods. 
That way, as soon as we hit a breakthrough that reaches > > your > > requires WAN data transfer rate, you can restart the full scec > > workflow, while we continue to analyze swift behavior issues with > > the > > simpler micro benchmarks. > > > > Regards, > > > > Mike > > > > > > ----- Original Message ----- > > > Ok, so I committed a fix to make the worker send files a bit > > > faster > > > and > > > adjusted the buffer sizes a bit. There is a trade-off between per > > > worker > > > performance and number of workers, so this should probably be a > > > setting > > > of some sort (since when there are many workers, the client > > > bandwidth > > > becomes the bottleneck). > > > > > > With a plain cat, 4 workers, 1 job/w, and 32M files I get this: > > > [IN]: Total transferred: 7.99 GB, current rate: 23.6 MB/s, average > > > rate: > > > 16.47 MB/s > > > [MEM] Heap total: 155.31 MMB, Heap used: 104.2 MMB > > > [OUT] Total transferred: 8 GB, current rate: 0 B/s, average rate: > > > 16.49 > > > MB/s > > > Final status: time:498988 Finished successfully:256 > > > Time: 500.653, rate: 0 j/s > > > > > > So the system probably sees 96 MB/s combined reads and writes. I'd > > > be > > > curious how this looks without caching, but during the run the > > > computer > > > became laggy, so it's saturating something in the OS and/or > > > hardware. > > > > > > I'll test on a cluster next. > > > > > > On Sun, 2011-01-16 at 18:02 -0800, Mihael Hategan wrote: > > > > On Sun, 2011-01-16 at 19:38 -0600, Allan Espinosa wrote: > > > > > So for the measurement interface, are you measuring the total > > > > > data > > > > > received as > > > > > the data arrives or when the received file is completely > > > > > written > > > > > to the job > > > > > directory. > > > > > > > > The average is all the bytes that go from client to all the > > > > workers > > > > over > > > > the entire time spent to run the jobs. > > > > > > > > > > > > > > I was measuring from the logs from JOB_START to JOB_END. I > > > > > assumed > > > > > the actualy > > > > > job execution to be 0. The 7MB/s probably corresponds to > > > > > Mihael's > > > > > stage out > > > > > results. the cat jobs dump to stdout (redirected to a file in > > > > > the > > > > > swift > > > > > wrapper) probably shows the same behavior as the stageout. > > > > > > > > I'm becoming less surprised about 7MB/s in the local case. You > > > > have > > > > to > > > > multiply that by 6 to get the real disk I/O bandwidth: > > > > 1. client reads from disk > > > > 2. worker writes to disk > > > > 3. cat reads from disk > > > > 4. cat writes to disk > > > > 5. worker reads from disk > > > > 6. client writes to disk > > > > > > > > If it all happens on a single disk, then it adds up to about 42 > > > > MB/s, > > > > which is a reasonable fraction of what a normal disk can do. It > > > > would be > > > > useful to do a dd from /dev/zero to see what the actual disk > > > > performance > > > > is. 
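A concrete form of the dd check suggested above would be the following (the target path is only an example; oflag=direct and iflag=direct bypass the page cache so the numbers reflect the disk rather than RAM):

    dd if=/dev/zero of=/scratch/local/ddtest bs=1M count=2048 oflag=direct
    dd if=/scratch/local/ddtest of=/dev/null bs=1M iflag=direct

The first line gives raw write bandwidth and the second raw read bandwidth, which can then be compared against the roughly 42 MB/s of combined traffic estimated above.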
> > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From aespinosa at cs.uchicago.edu Wed Jan 19 11:48:23 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 19 Jan 2011 11:48:23 -0600 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <1517889091.80408.1295455503257.JavaMail.root@zimbra.anl.gov> References: <699977417.80246.1295454577878.JavaMail.root@zimbra.anl.gov> <1517889091.80408.1295455503257.JavaMail.root@zimbra.anl.gov> Message-ID: Hi Mike, Here's the setup I tested: 18 services on communicado. 1 worker.pl w/ 40 workersPerNode connecting to each. Each worker is in a CS condor pool compute host. using provider staging. no data. just a bunch of strings passed on the app(() function. I got 58k jobs in 1 hr. so the rate is 16.3 jobs/sec. well within your 20 jobs/sec i guess. I haven't experimented with stuff that includes data though. -Allan 2011/1/19 Michael Wilde : > A few more test results: > > moving 3 byte files: this runs at about 20 jobs/sec in the single-node 8-core test. > > moving 30MB files: runs 100 jobs in 143 secs = about 40 MB/sec total in/out > > Both tests are using a single input file going to all jobs and N unique output files coming back. > > So the latter job I think is about the same ballpark as Mihael's latest results? ?And the former job confirms that provider staging does not seem to slow down the job rate unacceptably. 
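The setup Allan describes above (18 coaster services on communicado, each with one separately started worker.pl connecting from a Condor compute node) implies worker invocations of roughly this form; host, port, block id and log directory are placeholders here, and the exact argument order should be checked against the worker.pl in the cog checkout being used:

    perl worker.pl http://communicado.ci.uchicago.edu:<port> 0001 /tmp/worker-logs

The 40-way concurrency he mentions is not a worker.pl option; as described, it comes from the workersPerNode setting on the service/sites.xml side.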
> > - Mike > > > ----- Original Message ----- >> I forgot to also state regarding the test below: >> - tried both stagingMethod proxy and file - no significant perf diff >> - tried workersPerNode > 8 (10 seemed to cause slight degradation, so >> I went back to 8; this test may have just been noise) >> - I used this recent swift and cog: Swift svn swift-r3997 cog-r3029 >> (cog modified locally) >> >> I think the following tests would be good to do in the micro-series: >> >> - re-do the above on a quiet dedicated PADS cluster node (qsub -I) >> >> - try the test with just input and just output >> - try the test with a 1-byte file (to see what the protocol issues >> are) >> - try the test with a 30MB file (to try to replicate Mihaels results) >> >> - try testing from one pads node client to say 3-4 other pads nodes >> (again, with either qsub -I or a swift run with auto coasters and the >> maxNode and nodeGranularity set to say 4 & 4 (or 5 & 5, etc) >> >> This last test will probe the ability of swift to move more tasks/sec >> when there are more concurrent app-job endpoints (ie when swift is >> driving more cores). We *think* swift trunk should be able to drive >> >100 tasks/sec - maybe even 200/sec - when the configuration is >> optimized: all local disk use; log settings tuned, perhaps; throttles >> set right; ...) >> >> Then also try swift fast branch, but Mihael needs to post (or you need >> to check in svn) whether all the latest provider staging improvements >> have been, or could be, applied to fast branch. >> >> Lastly, for the wide area test: >> >> 10 OSG sites >> try to keep say 2 < N < 10 workers active per site (using queue-N >> COndor script) with most sites having large numbers of workers. That >> should more closely mimic the load your will need to drive for the >> actual application. >> workersPerNode=1 >> >> The wan test will likely require more thought. >> >> - Mike >> >> ----- Original Message ----- >> > Continuing to work on resolving this problem. >> > >> > I think the next step is to methodically test provider staginug >> > moving >> > from the single-node test to multi-node local (pads) and then to >> > multi-node wan tests. >> > >> > Now that the native coaster job rate to a single one-core worker is >> > better understood (and seems to be 4-5 jobs per second) we can now >> > devise tests with a better understanding of more factors involved. >> > >> > I tried a local test on pads login(at a fairly quiet time, unloaded) >> > as follows: >> > - local coasters service (in Swift jvm) >> > - app is "mv" (to avoid extra data movement) >> > - same input data file is used (so its likely in kernel block cache) >> > - unique output file is used >> > - swift and cwd is on /scratch local disk >> > - file is 3MB (to be closer to Allan's 2.3 MB) >> > - mv app stages file to worker and back (no app reads or writes) >> > - workers per node = 8 (on an 8 core host) >> > - throttle of 200 jobs (2.0) >> > - 100 jobs per swift script invocation >> > >> > I get just over 5 apps/sec or 30MB/sec with this setup. >> > >> > Allan, I'd like to suggest you take it from here, but lets talk as >> > soon as possible this morning to make a plan. >> > >> > One approach that may be fruitful is to re-design a remote test that >> > is closer to what a real scec workload would be (basically your >> > prior >> > tests with some adjustment to the concurrency: more workers per >> > site, >> > and more overall files going in parallel. 
>> > >> > Then, every time we have a new insight or code change, re-test the >> > larger-scale WAN test in parallel with continuing down the >> > micro-test >> > methods. That way, as soon as we hit a breakthrough that reaches >> > your >> > requires WAN data transfer rate, you can restart the full scec >> > workflow, while we continue to analyze swift behavior issues with >> > the >> > simpler micro benchmarks. >> > >> > Regards, >> > >> > Mike >> > >> > >> > ----- Original Message ----- >> > > Ok, so I committed a fix to make the worker send files a bit >> > > faster >> > > and >> > > adjusted the buffer sizes a bit. There is a trade-off between per >> > > worker >> > > performance and number of workers, so this should probably be a >> > > setting >> > > of some sort (since when there are many workers, the client >> > > bandwidth >> > > becomes the bottleneck). >> > > >> > > With a plain cat, 4 workers, 1 job/w, and 32M files I get this: >> > > [IN]: Total transferred: 7.99 GB, current rate: 23.6 MB/s, average >> > > rate: >> > > 16.47 MB/s >> > > [MEM] Heap total: 155.31 MMB, Heap used: 104.2 MMB >> > > [OUT] Total transferred: 8 GB, current rate: 0 B/s, average rate: >> > > 16.49 >> > > MB/s >> > > Final status: time:498988 Finished successfully:256 >> > > Time: 500.653, rate: 0 j/s >> > > >> > > So the system probably sees 96 MB/s combined reads and writes. I'd >> > > be >> > > curious how this looks without caching, but during the run the >> > > computer >> > > became laggy, so it's saturating something in the OS and/or >> > > hardware. >> > > >> > > I'll test on a cluster next. >> > > >> > > On Sun, 2011-01-16 at 18:02 -0800, Mihael Hategan wrote: >> > > > On Sun, 2011-01-16 at 19:38 -0600, Allan Espinosa wrote: >> > > > > So for the measurement interface, are you measuring the total >> > > > > data >> > > > > received as >> > > > > the data arrives or when the received file is completely >> > > > > written >> > > > > to the job >> > > > > directory. >> > > > >> > > > The average is all the bytes that go from client to all the >> > > > workers >> > > > over >> > > > the entire time spent to run the jobs. >> > > > >> > > > > >> > > > > I was measuring from the logs from JOB_START to JOB_END. I >> > > > > assumed >> > > > > the actualy >> > > > > job execution to be 0. The 7MB/s probably corresponds to >> > > > > Mihael's >> > > > > stage out >> > > > > results. the cat jobs dump to stdout (redirected to a file in >> > > > > the >> > > > > swift >> > > > > wrapper) probably shows the same behavior as the stageout. >> > > > >> > > > I'm becoming less surprised about 7MB/s in the local case. You >> > > > have >> > > > to >> > > > multiply that by 6 to get the real disk I/O bandwidth: >> > > > 1. client reads from disk >> > > > 2. worker writes to disk >> > > > 3. cat reads from disk >> > > > 4. cat writes to disk >> > > > 5. worker reads from disk >> > > > 6. client writes to disk >> > > > >> > > > If it all happens on a single disk, then it adds up to about 42 >> > > > MB/s, >> > > > which is a reasonable fraction of what a normal disk can do. It >> > > > would be >> > > > useful to do a dd from /dev/zero to see what the actual disk >> > > > performance >> > > > is. 
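(As a concrete sketch of that dd check: on the node's local scratch one could run something like

    dd if=/dev/zero of=/scratch/local/$USER/ddtest bs=1M count=2048 oflag=direct
    dd if=/scratch/local/$USER/ddtest of=/dev/null bs=1M iflag=direct

to get the raw sequential write and read bandwidth; oflag=direct/iflag=direct bypass the page cache so the numbers are not inflated by caching. The test path here is only illustrative.)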
>> > > > >> > > > From hategan at mcs.anl.gov Wed Jan 19 12:10:11 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 19 Jan 2011 10:10:11 -0800 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <699977417.80246.1295454577878.JavaMail.root@zimbra.anl.gov> References: <699977417.80246.1295454577878.JavaMail.root@zimbra.anl.gov> Message-ID: <1295460611.4263.2.camel@blabla2.none> On Wed, 2011-01-19 at 10:29 -0600, Michael Wilde wrote: > I forgot to also state regarding the test below: > - tried both stagingMethod proxy and file - no significant perf diff Yes, but the memory overhead is considerably larger for proxy. So don't use it if you can use file. > - tried workersPerNode > 8 (10 seemed to cause slight degradation, so I went back to 8; this test may have just been noise) Probably not noise. If you have more (actual) workers (instead of wpn) the combined tcp buffer sizes are larger and there is somewhat more parallelization. > - I used this recent swift and cog: Swift svn swift-r3997 cog-r3029 (cog modified locally) > > I think the following tests would be good to do in the micro-series: > > - re-do the above on a quiet dedicated PADS cluster node (qsub -I) > > - try the test with just input and just output > - try the test with a 1-byte file (to see what the protocol issues are) > - try the test with a 30MB file (to try to replicate Mihaels results) > > - try testing from one pads node client to say 3-4 other pads nodes (again, with either qsub -I or a swift run with auto coasters and the maxNode and nodeGranularity set to say 4 & 4 (or 5 & 5, etc) > > This last test will probe the ability of swift to move more tasks/sec when there are more concurrent app-job endpoints (ie when swift is driving more cores). We *think* swift trunk should be able to drive >100 tasks/sec - maybe even 200/sec - when the configuration is optimized: all local disk use; log settings tuned, perhaps; throttles set right; ...) > > Then also try swift fast branch, but Mihael needs to post (or you need to check in svn) whether all the latest provider staging improvements have been, or could be, applied to fast branch. > > Lastly, for the wide area test: > > 10 OSG sites > try to keep say 2 < N < 10 workers active per site (using queue-N COndor script) with most sites having large numbers of workers. That should more closely mimic the load your will need to drive for the actual application. > workersPerNode=1 > > The wan test will likely require more thought. > > - Mike > > ----- Original Message ----- > > Continuing to work on resolving this problem. > > > > I think the next step is to methodically test provider staginug moving > > from the single-node test to multi-node local (pads) and then to > > multi-node wan tests. > > > > Now that the native coaster job rate to a single one-core worker is > > better understood (and seems to be 4-5 jobs per second) we can now > > devise tests with a better understanding of more factors involved. 
> > > > I tried a local test on pads login(at a fairly quiet time, unloaded) > > as follows: > > - local coasters service (in Swift jvm) > > - app is "mv" (to avoid extra data movement) > > - same input data file is used (so its likely in kernel block cache) > > - unique output file is used > > - swift and cwd is on /scratch local disk > > - file is 3MB (to be closer to Allan's 2.3 MB) > > - mv app stages file to worker and back (no app reads or writes) > > - workers per node = 8 (on an 8 core host) > > - throttle of 200 jobs (2.0) > > - 100 jobs per swift script invocation > > > > I get just over 5 apps/sec or 30MB/sec with this setup. > > > > Allan, I'd like to suggest you take it from here, but lets talk as > > soon as possible this morning to make a plan. > > > > One approach that may be fruitful is to re-design a remote test that > > is closer to what a real scec workload would be (basically your prior > > tests with some adjustment to the concurrency: more workers per site, > > and more overall files going in parallel. > > > > Then, every time we have a new insight or code change, re-test the > > larger-scale WAN test in parallel with continuing down the micro-test > > methods. That way, as soon as we hit a breakthrough that reaches your > > requires WAN data transfer rate, you can restart the full scec > > workflow, while we continue to analyze swift behavior issues with the > > simpler micro benchmarks. > > > > Regards, > > > > Mike > > > > > > ----- Original Message ----- > > > Ok, so I committed a fix to make the worker send files a bit faster > > > and > > > adjusted the buffer sizes a bit. There is a trade-off between per > > > worker > > > performance and number of workers, so this should probably be a > > > setting > > > of some sort (since when there are many workers, the client > > > bandwidth > > > becomes the bottleneck). > > > > > > With a plain cat, 4 workers, 1 job/w, and 32M files I get this: > > > [IN]: Total transferred: 7.99 GB, current rate: 23.6 MB/s, average > > > rate: > > > 16.47 MB/s > > > [MEM] Heap total: 155.31 MMB, Heap used: 104.2 MMB > > > [OUT] Total transferred: 8 GB, current rate: 0 B/s, average rate: > > > 16.49 > > > MB/s > > > Final status: time:498988 Finished successfully:256 > > > Time: 500.653, rate: 0 j/s > > > > > > So the system probably sees 96 MB/s combined reads and writes. I'd > > > be > > > curious how this looks without caching, but during the run the > > > computer > > > became laggy, so it's saturating something in the OS and/or > > > hardware. > > > > > > I'll test on a cluster next. > > > > > > On Sun, 2011-01-16 at 18:02 -0800, Mihael Hategan wrote: > > > > On Sun, 2011-01-16 at 19:38 -0600, Allan Espinosa wrote: > > > > > So for the measurement interface, are you measuring the total > > > > > data > > > > > received as > > > > > the data arrives or when the received file is completely written > > > > > to the job > > > > > directory. > > > > > > > > The average is all the bytes that go from client to all the > > > > workers > > > > over > > > > the entire time spent to run the jobs. > > > > > > > > > > > > > > I was measuring from the logs from JOB_START to JOB_END. I > > > > > assumed > > > > > the actualy > > > > > job execution to be 0. The 7MB/s probably corresponds to > > > > > Mihael's > > > > > stage out > > > > > results. the cat jobs dump to stdout (redirected to a file in > > > > > the > > > > > swift > > > > > wrapper) probably shows the same behavior as the stageout. 
> > > > > > > > I'm becoming less surprised about 7MB/s in the local case. You > > > > have > > > > to > > > > multiply that by 6 to get the real disk I/O bandwidth: > > > > 1. client reads from disk > > > > 2. worker writes to disk > > > > 3. cat reads from disk > > > > 4. cat writes to disk > > > > 5. worker reads from disk > > > > 6. client writes to disk > > > > > > > > If it all happens on a single disk, then it adds up to about 42 > > > > MB/s, > > > > which is a reasonable fraction of what a normal disk can do. It > > > > would be > > > > useful to do a dd from /dev/zero to see what the actual disk > > > > performance > > > > is. > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Wed Jan 19 12:13:11 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 19 Jan 2011 10:13:11 -0800 Subject: [Swift-devel] Re: provider staging stage-in rate on localhost and PADS In-Reply-To: <1517889091.80408.1295455503257.JavaMail.root@zimbra.anl.gov> References: <1517889091.80408.1295455503257.JavaMail.root@zimbra.anl.gov> Message-ID: <1295460791.4263.3.camel@blabla2.none> On Wed, 2011-01-19 at 10:45 -0600, Michael Wilde wrote: > A few more test results: > > moving 3 byte files: this runs at about 20 jobs/sec in the single-node 8-core test. > > moving 30MB files: runs 100 jobs in 143 secs = about 40 MB/sec total in/out > > Both tests are using a single input file going to all jobs and N unique output files coming back. > > So the latter job I think is about the same ballpark as Mihael's > latest results? And the former job confirms that provider staging > does not seem to slow down the job rate unacceptably. It would be interesting to compare this with the local file provider (i.e. NFS). From wilde at mcs.anl.gov Wed Jan 19 13:12:14 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Jan 2011 13:12:14 -0600 (CST) Subject: [Swift-devel] Error 521 provider-staging files to PADS nodes In-Reply-To: <825902623.81482.1295463019140.JavaMail.root@zimbra.anl.gov> Message-ID: <559440241.81613.1295464334933.JavaMail.root@zimbra.anl.gov> Mihael, The following test on pads failed/hung with an error 521 from worker.pl: --- sub getFileCBDataInIndirect { ... elsif ($timeout) { queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout staging in file")); delete($JOBDATA{$jobid}); --- single foreach loop, doing 1,000 "mv" commands throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job): 8 3500 1 4 4 short 2.0 10000 /scratch/local/wilde/test/swiftwork file /scratch/local/wilde/swiftscratch Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers timed out. Note that the hang may have happened earlier, as no new jobs were starting as the jobs in the first wave were finishing. 
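(The pool definition listed above lost its XML markup in the mail archive; matching the listed values to the usual coaster profile keys, it presumably read roughly as follows — the element and key names, and the local:pbs jobmanager, are inferred rather than copied verbatim from pbscoasters.xml:

    <pool handle="pbscoasters">
      <execution provider="coaster" jobmanager="local:pbs" url="localhost"/>
      <profile namespace="globus" key="workersPerNode">8</profile>
      <profile namespace="globus" key="maxtime">3500</profile>
      <profile namespace="globus" key="slots">1</profile>
      <profile namespace="globus" key="nodeGranularity">4</profile>
      <profile namespace="globus" key="maxNodes">4</profile>
      <profile namespace="globus" key="queue">short</profile>
      <profile namespace="karajan" key="jobThrottle">2.0</profile>
      <profile namespace="karajan" key="initialScore">10000</profile>
      <workdirectory>/scratch/local/wilde/test/swiftwork</workdirectory>
      <profile namespace="swift" key="stagingMethod">file</profile>
      <scratch>/scratch/local/wilde/swiftscratch</scratch>
    </pool>

A jobThrottle of 2.0 corresponds to roughly 200 concurrent jobs (201 slots, if memory serves on the throttle formula), which matches the Submitted:201 figures in the progress output below.)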
time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps mvn.swift -n=1000 >& out & The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net. Swift stdout showed the following after waiting a while for a 4-node PADS coaster allocation to start: Progress: Selecting site:799 Submitted:201 Progress: Selecting site:799 Submitted:201 Progress: Selecting site:799 Submitted:200 Active:1 Progress: Selecting site:798 Submitted:177 Active:24 Finished successfully:1 Progress: Selecting site:796 Submitted:172 Active:28 Finished successfully:4 Progress: Selecting site:792 Submitted:176 Active:24 Finished successfully:8 Progress: Selecting site:788 Submitted:180 Active:20 Finished successfully:12 Progress: Selecting site:784 Submitted:184 Active:16 Finished successfully:16 Progress: Selecting site:780 Submitted:188 Active:12 Finished successfully:20 Progress: Selecting site:777 Submitted:191 Active:9 Finished successfully:23 Progress: Selecting site:773 Submitted:195 Active:5 Finished successfully:27 Progress: Selecting site:770 Submitted:197 Active:3 Finished successfully:30 Progress: Selecting site:767 Submitted:200 Finished successfully:33 Progress: Selecting site:766 Submitted:201 Finished successfully:33 Progress: Selecting site:766 Submitted:201 Finished successfully:33 Progress: Selecting site:766 Submitted:201 Finished successfully:33 Progress: Selecting site:766 Submitted:201 Finished successfully:33 Progress: Selecting site:766 Submitted:201 Finished successfully:33 Progress: Selecting site:766 Submitted:201 Finished successfully:33 Progress: Selecting site:766 Submitted:200 Active:1 Finished successfully:33 Execution failed: Job failed with an exit code of 521 login1$ login1$ login1$ pwd /scratch/local/wilde/lab login1$ ls -lt | head total 51408 -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51 mvn-20110119-0956-s3s8h9h2.log (copied to ~wilde) script was: login1$ cat mvn.swift type file; app (file o) mv (file i) { mv @i @o; } file out[]; foreach j in [1:@toint(@arg("n","1"))] { file data<"data.txt">; out[j] = mv(data); } data.txt was 3MB A look at the outdir gives a clue to where things hung: The files of <= ~3MB from time 10:48 are from this job. 
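(One note on the script as quoted above: the mapper expression on the out[] declaration was swallowed by the mail archiver along with its angle brackets. Judging from the f.NNNN.out file names and the fragment that survives in later quoting of the same script, the declaration was presumably along the lines of

    file out[] <simple_mapper; prefix="f.", suffix=".out">;

plus whatever padding setting produced the four-digit indices, if that is not the mapper's default.)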
Files from 10:39 and earlier are from other manual runs executed on login1, Note that 3 of the 3MB output files have length 0 or <3MB, and were likely in transit back from the worker: -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out login1$ pwd /scratch/local/wilde/lab login1$ cd outdir login1$ ls -lt | head -40 total 2772188 -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out l - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Jan 19 13:37:15 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Jan 2011 13:37:15 -0600 (CST) Subject: [Swift-devel] Error 521 provider-staging files to PADS nodes In-Reply-To: <559440241.81613.1295464334933.JavaMail.root@zimbra.anl.gov> Message-ID: <1112858620.81735.1295465835538.JavaMail.root@zimbra.anl.gov> An interesting observation on the returned output files: there are exactly 33 files in the output dir from this run: the same as the number of jobs Swift reports as Finished successfully. 
But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below). Its surprising and perhaps a bug that the jobs are reported finished before the output file is fully written??? Also this 3-partial plus 1-zero file looks to me like one worker staging op hung (the oldest of the 4 incomplete output files) and then perhaps 3 were cut short when the coaster service data protocol froze? - Mike login1$ pwd /scratch/local/wilde/lab login1$ cd outdir login1$ ls -lt | grep 10:48 -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out login1$ ls -lt | grep 10:48 | wc -l 33 login1$ ----- Original Message ----- > Mihael, > > The following test on pads failed/hung with an error 521 from > worker.pl: > > --- > sub getFileCBDataInIndirect { > ... > elsif ($timeout) { > queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout > staging in file")); > delete($JOBDATA{$jobid}); > --- > > single foreach loop, doing 1,000 "mv" commands > > throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job): > > > > 8 > 3500 > 1 > 4 > 4 > short > 2.0 > 10000 > > /scratch/local/wilde/test/swiftwork > file > /scratch/local/wilde/swiftscratch > > > Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers > timed out. Note that the hang may have happened earlier, as no new > jobs were starting as the jobs in the first wave were finishing. > > time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps > mvn.swift -n=1000 >& out & > > > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net. 
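(On the partial files listed above: the two short-but-nonzero outputs are exact multiples of 64 KiB — 2621440 = 40 x 65536 and 2686976 = 41 x 65536 — which fits the reading that they were cut off mid-transfer on a 64 KiB buffer boundary rather than truncated at some arbitrary point.)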
> > Swift stdout showed the following after waiting a while for a 4-node > PADS coaster allocation to start: > > Progress: Selecting site:799 Submitted:201 > Progress: Selecting site:799 Submitted:201 > Progress: Selecting site:799 Submitted:200 Active:1 > Progress: Selecting site:798 Submitted:177 Active:24 Finished > successfully:1 > Progress: Selecting site:796 Submitted:172 Active:28 Finished > successfully:4 > Progress: Selecting site:792 Submitted:176 Active:24 Finished > successfully:8 > Progress: Selecting site:788 Submitted:180 Active:20 Finished > successfully:12 > Progress: Selecting site:784 Submitted:184 Active:16 Finished > successfully:16 > Progress: Selecting site:780 Submitted:188 Active:12 Finished > successfully:20 > Progress: Selecting site:777 Submitted:191 Active:9 Finished > successfully:23 > Progress: Selecting site:773 Submitted:195 Active:5 Finished > successfully:27 > Progress: Selecting site:770 Submitted:197 Active:3 Finished > successfully:30 > Progress: Selecting site:767 Submitted:200 Finished successfully:33 > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > Progress: Selecting site:766 Submitted:200 Active:1 Finished > successfully:33 > Execution failed: > Job failed with an exit code of 521 > login1$ > login1$ > login1$ pwd > /scratch/local/wilde/lab > login1$ ls -lt | head > total 51408 > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51 > mvn-20110119-0956-s3s8h9h2.log > > (copied to ~wilde) > > script was: > > login1$ cat mvn.swift > type file; > > app (file o) mv (file i) > { > mv @i @o; > } > > file out[] prefix="f.",suffix=".out">; > foreach j in [1:@toint(@arg("n","1"))] { > file data<"data.txt">; > out[j] = mv(data); > } > > > data.txt was 3MB > > A look at the outdir gives a clue to where things hung: The files of > <= ~3MB from time 10:48 are from this job. 
Files from 10:39 and > earlier are from other manual runs executed on login1, Note that 3 of > the 3MB output files have length 0 or <3MB, and were likely in transit > back from the worker: > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > login1$ pwd > /scratch/local/wilde/lab > login1$ cd outdir > login1$ ls -lt | head -40 > total 2772188 > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out > l > > - Mike > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Wed Jan 19 15:12:04 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 19 Jan 2011 13:12:04 -0800 Subject: [Swift-devel] Error 521 provider-staging 
files to PADS nodes In-Reply-To: <1112858620.81735.1295465835538.JavaMail.root@zimbra.anl.gov> References: <1112858620.81735.1295465835538.JavaMail.root@zimbra.anl.gov> Message-ID: <1295471524.6134.0.camel@blabla2.none> might be due to one of the recent patches. you could try to set IOBLOCKSZ to 1 in worker.pl and rerun. On Wed, 2011-01-19 at 13:37 -0600, Michael Wilde wrote: > An interesting observation on the returned output files: there are exactly 33 files in the output dir from this run: the same as the number of jobs Swift reports as Finished successfully. But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below). > > Its surprising and perhaps a bug that the jobs are reported finished before the output file is fully written??? > > Also this 3-partial plus 1-zero file looks to me like one worker staging op hung (the oldest of the 4 incomplete output files) and then perhaps 3 were cut short when the coaster service data protocol froze? > > - Mike > > login1$ pwd > /scratch/local/wilde/lab > login1$ cd outdir > login1$ ls -lt | grep 10:48 > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > login1$ ls -lt | grep 10:48 | wc -l > 33 > login1$ > > > > > ----- Original Message ----- > > Mihael, > > > > The following test on pads failed/hung with an error 521 from > > worker.pl: > > > > --- > > sub getFileCBDataInIndirect { > > ... 
> > elsif ($timeout) { > > queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout > > staging in file")); > > delete($JOBDATA{$jobid}); > > --- > > > > single foreach loop, doing 1,000 "mv" commands > > > > throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job): > > > > > > > > 8 > > 3500 > > 1 > > 4 > > 4 > > short > > 2.0 > > 10000 > > > > /scratch/local/wilde/test/swiftwork > > file > > /scratch/local/wilde/swiftscratch > > > > > > Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers > > timed out. Note that the hang may have happened earlier, as no new > > jobs were starting as the jobs in the first wave were finishing. > > > > time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps > > mvn.swift -n=1000 >& out & > > > > > > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net. > > > > Swift stdout showed the following after waiting a while for a 4-node > > PADS coaster allocation to start: > > > > Progress: Selecting site:799 Submitted:201 > > Progress: Selecting site:799 Submitted:201 > > Progress: Selecting site:799 Submitted:200 Active:1 > > Progress: Selecting site:798 Submitted:177 Active:24 Finished > > successfully:1 > > Progress: Selecting site:796 Submitted:172 Active:28 Finished > > successfully:4 > > Progress: Selecting site:792 Submitted:176 Active:24 Finished > > successfully:8 > > Progress: Selecting site:788 Submitted:180 Active:20 Finished > > successfully:12 > > Progress: Selecting site:784 Submitted:184 Active:16 Finished > > successfully:16 > > Progress: Selecting site:780 Submitted:188 Active:12 Finished > > successfully:20 > > Progress: Selecting site:777 Submitted:191 Active:9 Finished > > successfully:23 > > Progress: Selecting site:773 Submitted:195 Active:5 Finished > > successfully:27 > > Progress: Selecting site:770 Submitted:197 Active:3 Finished > > successfully:30 > > Progress: Selecting site:767 Submitted:200 Finished successfully:33 > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > Progress: Selecting site:766 Submitted:200 Active:1 Finished > > successfully:33 > > Execution failed: > > Job failed with an exit code of 521 > > login1$ > > login1$ > > login1$ pwd > > /scratch/local/wilde/lab > > login1$ ls -lt | head > > total 51408 > > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51 > > mvn-20110119-0956-s3s8h9h2.log > > > > (copied to ~wilde) > > > > script was: > > > > login1$ cat mvn.swift > > type file; > > > > app (file o) mv (file i) > > { > > mv @i @o; > > } > > > > file out[] > prefix="f.",suffix=".out">; > > foreach j in [1:@toint(@arg("n","1"))] { > > file data<"data.txt">; > > out[j] = mv(data); > > } > > > > > > data.txt was 3MB > > > > A look at the outdir gives a clue to where things hung: The files of > > <= ~3MB from time 10:48 are from this job. 
Files from 10:39 and > > earlier are from other manual runs executed on login1, Note that 3 of > > the 3MB output files have length 0 or <3MB, and were likely in transit > > back from the worker: > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > > > login1$ pwd > > /scratch/local/wilde/lab > > login1$ cd outdir > > login1$ ls -lt | head -40 > > total 2772188 > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out > > l > > > > - Mike > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From jon.monette at gmail.com Wed Jan 19 15:44:45 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Wed, 19 Jan 2011 15:44:45 -0600 Subject: [Swift-devel] log4j.properties 
Message-ID: <4D375B4D.70803@gmail.com> Hello, Is there any log4j.properties that need to be set to start getting timing data or numbers out of the log file? -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From wozniak at mcs.anl.gov Wed Jan 19 16:08:32 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 19 Jan 2011 16:08:32 -0600 (CST) Subject: [Swift-devel] Re: log4j.properties In-Reply-To: <4D375B4D.70803@gmail.com> References: <4D375B4D.70803@gmail.com> Message-ID: I have recently been profiling and plotting the worker log. Some examples for a single task type are linked below. If you'd like to extend these scripts, I can check them in. Some easy extensions could be to differentiate multiple task types or multiple logs. http://www.ci.uchicago.edu/wiki/bin/view/SWFT/WorkerProfile On Wed, 19 Jan 2011, Jonathan Monette wrote: > Hello, > Is there any log4j.properties that need to be set to start getting timing > data or numbers out of the log file? -- Justin M Wozniak From jon.monette at gmail.com Wed Jan 19 19:01:48 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Wed, 19 Jan 2011 19:01:48 -0600 Subject: [Swift-devel] Re: log4j.properties In-Reply-To: References: <4D375B4D.70803@gmail.com> Message-ID: <4D37897C.2040103@gmail.com> Yea. If you could. I want to take a look at those see how easy it will be for me to extend them and use them for my stuff. Thanks. On 1/19/11 4:08 PM, Justin M Wozniak wrote: > > I have recently been profiling and plotting the worker log. Some > examples for a single task type are linked below. If you'd like to > extend these scripts, I can check them in. > > Some easy extensions could be to differentiate multiple task types or > multiple logs. > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/WorkerProfile > > On Wed, 19 Jan 2011, Jonathan Monette wrote: > >> Hello, >> Is there any log4j.properties that need to be set to start getting >> timing data or numbers out of the log file? > From wozniak at mcs.anl.gov Thu Jan 20 10:40:23 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Thu, 20 Jan 2011 10:40:23 -0600 (CST) Subject: [Swift-devel] Re: log4j.properties In-Reply-To: <4D37897C.2040103@gmail.com> References: <4D375B4D.70803@gmail.com> <4D37897C.2040103@gmail.com> Message-ID: Ok, the tools are in vdl2/usertools/plotter and vdl2/usertools/worker-profile . On Wed, 19 Jan 2011, Jonathan Monette wrote: > Yea. If you could. I want to take a look at those see how easy it will be > for me to extend them and use them for my stuff. Thanks. > > On 1/19/11 4:08 PM, Justin M Wozniak wrote: >> >> I have recently been profiling and plotting the worker log. Some examples >> for a single task type are linked below. If you'd like to extend these >> scripts, I can check them in. >> >> Some easy extensions could be to differentiate multiple task types or >> multiple logs. >> >> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/WorkerProfile >> >> On Wed, 19 Jan 2011, Jonathan Monette wrote: >> >>> Hello, >>> Is there any log4j.properties that need to be set to start getting >>> timing data or numbers out of the log file? -- Justin M Wozniak From jon.monette at gmail.com Thu Jan 20 11:02:32 2011 From: jon.monette at gmail.com (=?utf-8?B?am9uLm1vbmV0dGVAZ21haWwuY29t?=) Date: Thu, 20 Jan 2011 11:02:32 -0600 Subject: [Swift-devel] Re: log4j.properties Message-ID: <4d386a9d.17a6650a.3840.065b@mx.google.com> Ok. Cool. 
Thanks. I'll take a look at them and see how they work. Sent from my HTC on the Now Network from Sprint! ----- Reply message ----- From: "Justin M Wozniak" Date: Thu, Jan 20, 2011 10:40 am Subject: log4j.properties To: "Jonathan Monette" Cc: "Swift Devel" Ok, the tools are in vdl2/usertools/plotter and vdl2/usertools/worker-profile . On Wed, 19 Jan 2011, Jonathan Monette wrote: > Yea. If you could. I want to take a look at those see how easy it will be > for me to extend them and use them for my stuff. Thanks. > > On 1/19/11 4:08 PM, Justin M Wozniak wrote: >> >> I have recently been profiling and plotting the worker log. Some examples >> for a single task type are linked below. If you'd like to extend these >> scripts, I can check them in. >> >> Some easy extensions could be to differentiate multiple task types or >> multiple logs. >> >> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/WorkerProfile >> >> On Wed, 19 Jan 2011, Jonathan Monette wrote: >> >>> Hello, >>> Is there any log4j.properties that need to be set to start getting >>> timing data or numbers out of the log file? -- Justin M Wozniak -------------- next part -------------- An HTML attachment was scrubbed... URL: From tim.g.armstrong at gmail.com Fri Jan 21 16:55:01 2011 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 21 Jan 2011 16:55:01 -0600 Subject: [Swift-devel] Swift compile issues on IBI cluster Message-ID: Hi, I'm trying to get a Swift script running on the IBI cluster but have been running into some issues. I'm using a recent development version 3029. Running the script fails at the compile stage. Strangely, this script compiles without issue using exactly the same swift on other machines. Even more strangely, it looks like the swift -> intermediate xml step completes without any exceptions being thrown, but when I look at the xml file it contains precisely zero bytes. It appears that this is what actually causes the karajan parser to crash. What would cause the first compile step to complete without causing an exception but not produce any output? Do we maybe have a buffering issue? Any suggestions for how I might solve this? The log contents are: 2011-01-21 16:43:13,448-0600 DEBUG Loader Max heap: 238616576 2011-01-21 16:43:13,449-0600 INFO Loader rserver.swift: source file is new. Recompiling. 2011-01-21 16:43:13,919-0600 DEBUG Loader Detailed exception: org.griphyn.vdl.karajan.CompilationException: Unable to parse intermediate XML at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) at org.griphyn.vdl.karajan.Loader.compile(Loader.java:300) at org.griphyn.vdl.karajan.Loader.main(Loader.java:145) Caused by: org.apache.xmlbeans.XmlException: /tmp/tga/SwiftR/swift.local.2306/rserver.xml:2:1: error: Unexpected end of file after null at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3467) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) at org.globus.swift.language.ProgramDocument$Factory.parse(ProgramDocument.java:499) at org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) ... 
2 more Caused by: org.xml.sax.SAXParseException: Unexpected end of file after null at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435) ... 9 more -------------- next part -------------- An HTML attachment was scrubbed... URL: From aespinosa at cs.uchicago.edu Fri Jan 21 21:07:46 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 21 Jan 2011 21:07:46 -0600 Subject: [Swift-devel] Swift compile issues on IBI cluster In-Reply-To: References: Message-ID: Hi Tim, What's the version of java on IBI that the Swift runtime uses? -Allan 2011/1/21 Tim Armstrong : > Hi, > ? I'm trying to get a Swift script running on the IBI cluster but have been > running into some issues.? I'm using a recent development version 3029. > Running the script fails at the compile stage. > > Strangely, this script compiles without issue using exactly the same swift > on other machines. > Even more strangely, it looks like the swift ->? intermediate xml step > completes without any exceptions being thrown, but when I look at the xml > file it contains precisely zero bytes.? It appears that this is what > actually causes the karajan parser to crash. > What would cause the first compile step to complete without causing an > exception but not produce any output?? Do we maybe have a buffering issue? > Any suggestions for how I might solve this? > > The log contents are: > > 2011-01-21 16:43:13,448-0600 DEBUG Loader Max heap: 238616576 > 2011-01-21 16:43:13,449-0600 INFO? Loader rserver.swift: source file is new. > Recompiling. > 2011-01-21 16:43:13,919-0600 DEBUG Loader Detailed exception: > org.griphyn.vdl.karajan.CompilationException: Unable to parse intermediate > XML > ??????? at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) > ??????? at org.griphyn.vdl.karajan.Loader.compile(Loader.java:300) > ??????? at org.griphyn.vdl.karajan.Loader.main(Loader.java:145) > Caused by: org.apache.xmlbeans.XmlException: > /tmp/tga/SwiftR/swift.local.2306/rserver.xml:2:1: error: Unexpected end of > file after null > ??????? at > org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3467) > ??????? at > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) > ??????? at > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) > ??????? at > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) > ??????? at > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) > ??????? at > org.globus.swift.language.ProgramDocument$Factory.parse(ProgramDocument.java:499) > ??????? at org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) > ??????? at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) > ??????? ... 2 more > Caused by: org.xml.sax.SAXParseException: Unexpected end of file after null > ??????? at > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) > ??????? at > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) > ??????? at > org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435) > ??????? ... 
9 more > > > > From tim.g.armstrong at gmail.com Fri Jan 21 22:20:40 2011 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 21 Jan 2011 22:20:40 -0600 Subject: [Swift-devel] Swift compile issues on IBI cluster In-Reply-To: References: Message-ID: Its Sun Java 6 so I assumed that would be fine. The exact version is 1.6.0_07. - Tim On Fri, Jan 21, 2011 at 9:07 PM, Allan Espinosa wrote: > Hi Tim, > > What's the version of java on IBI that the Swift runtime uses? > > -Allan > > 2011/1/21 Tim Armstrong : > > Hi, > > I'm trying to get a Swift script running on the IBI cluster but have > been > > running into some issues. I'm using a recent development version 3029. > > Running the script fails at the compile stage. > > > > Strangely, this script compiles without issue using exactly the same > swift > > on other machines. > > Even more strangely, it looks like the swift -> intermediate xml step > > completes without any exceptions being thrown, but when I look at the xml > > file it contains precisely zero bytes. It appears that this is what > > actually causes the karajan parser to crash. > > What would cause the first compile step to complete without causing an > > exception but not produce any output? Do we maybe have a buffering > issue? > > Any suggestions for how I might solve this? > > > > The log contents are: > > > > 2011-01-21 16:43:13,448-0600 DEBUG Loader Max heap: 238616576 > > 2011-01-21 16:43:13,449-0600 INFO Loader rserver.swift: source file is > new. > > Recompiling. > > 2011-01-21 16:43:13,919-0600 DEBUG Loader Detailed exception: > > org.griphyn.vdl.karajan.CompilationException: Unable to parse > intermediate > > XML > > at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) > > at org.griphyn.vdl.karajan.Loader.compile(Loader.java:300) > > at org.griphyn.vdl.karajan.Loader.main(Loader.java:145) > > Caused by: org.apache.xmlbeans.XmlException: > > /tmp/tga/SwiftR/swift.local.2306/rserver.xml:2:1: error: Unexpected end > of > > file after null > > at > > org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3467) > > at > > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) > > at > > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) > > at > > > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) > > at > > > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) > > at > > > org.globus.swift.language.ProgramDocument$Factory.parse(ProgramDocument.java:499) > > at > org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) > > at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) > > ... 2 more > > Caused by: org.xml.sax.SAXParseException: Unexpected end of file after > null > > at > > > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) > > at > > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) > > at > > org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435) > > ... 9 more > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Sat Jan 22 06:53:07 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 22 Jan 2011 06:53:07 -0600 (CST) Subject: [Swift-devel] Swift compile issues on IBI cluster In-Reply-To: Message-ID: <549693844.91902.1295700787731.JavaMail.root@zimbra.anl.gov> Tim, I forgot to mention: on IBI you need to use the Java in my home dir (or any recent Java). 
The particular SUn java thats the default there has some weird issue in xml parsing that causes this failure. - Mike ----- Original Message ----- Hi, I'm trying to get a Swift script running on the IBI cluster but have been running into some issues. I'm using a recent development version 3029. Running the script fails at the compile stage. Strangely, this script compiles without issue using exactly the same swift on other machines. Even more strangely, it looks like the swift -> intermediate xml step completes without any exceptions being thrown, but when I look at the xml file it contains precisely zero bytes. It appears that this is what actually causes the karajan parser to crash. What would cause the first compile step to complete without causing an exception but not produce any output? Do we maybe have a buffering issue? Any suggestions for how I might solve this? The log contents are: 2011-01-21 16:43:13,448-0600 DEBUG Loader Max heap: 238616576 2011-01-21 16:43:13,449-0600 INFO Loader rserver.swift: source file is new. Recompiling. 2011-01-21 16:43:13,919-0600 DEBUG Loader Detailed exception: org.griphyn.vdl.karajan.CompilationException: Unable to parse intermediate XML at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) at org.griphyn.vdl.karajan.Loader.compile(Loader.java:300) at org.griphyn.vdl.karajan.Loader.main(Loader.java:145) Caused by: org.apache.xmlbeans.XmlException: /tmp/tga/SwiftR/swift.local.2306/rserver.xml:2:1: error: Unexpected end of file after null at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3467) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) at org.globus.swift.language.ProgramDocument$Factory.parse(ProgramDocument.java:499) at org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) ... 2 more Caused by: org.xml.sax.SAXParseException: Unexpected end of file after null at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435) ... 9 more _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Sat Jan 22 07:09:33 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 22 Jan 2011 07:09:33 -0600 (CST) Subject: [Swift-devel] Swift compile issues on IBI cluster In-Reply-To: <549693844.91902.1295700787731.JavaMail.root@zimbra.anl.gov> Message-ID: <1291420067.91915.1295701773409.JavaMail.root@zimbra.anl.gov> Mihael commented on this problem in this post Nov 3: http://mail.ci.uchicago.edu/mailman/private/swift-devel/2010-November/006990.html (likely SAX incompatibility problem with that particular Java version) ----- Original Message ----- Tim, I forgot to mention: on IBI you need to use the Java in my home dir (or any recent Java). 
The particular SUn java thats the default there has some weird issue in xml parsing that causes this failure. - Mike ----- Original Message ----- Hi, I'm trying to get a Swift script running on the IBI cluster but have been running into some issues. I'm using a recent development version 3029. Running the script fails at the compile stage. Strangely, this script compiles without issue using exactly the same swift on other machines. Even more strangely, it looks like the swift -> intermediate xml step completes without any exceptions being thrown, but when I look at the xml file it contains precisely zero bytes. It appears that this is what actually causes the karajan parser to crash. What would cause the first compile step to complete without causing an exception but not produce any output? Do we maybe have a buffering issue? Any suggestions for how I might solve this? The log contents are: 2011-01-21 16:43:13,448-0600 DEBUG Loader Max heap: 238616576 2011-01-21 16:43:13,449-0600 INFO Loader rserver.swift: source file is new. Recompiling. 2011-01-21 16:43:13,919-0600 DEBUG Loader Detailed exception: org.griphyn.vdl.karajan.CompilationException: Unable to parse intermediate XML at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) at org.griphyn.vdl.karajan.Loader.compile(Loader.java:300) at org.griphyn.vdl.karajan.Loader.main(Loader.java:145) Caused by: org.apache.xmlbeans.XmlException: /tmp/tga/SwiftR/swift.local.2306/rserver.xml:2:1: error: Unexpected end of file after null at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3467) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) at org.globus.swift.language.ProgramDocument$Factory.parse(ProgramDocument.java:499) at org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) ... 2 more Caused by: org.xml.sax.SAXParseException: Unexpected end of file after null at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435) ... 9 more _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: From tim.g.armstrong at gmail.com Sat Jan 22 12:08:03 2011 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Sat, 22 Jan 2011 12:08:03 -0600 Subject: [Swift-devel] Swift compile issues on IBI cluster In-Reply-To: <1291420067.91915.1295701773409.JavaMail.root@zimbra.anl.gov> References: <549693844.91902.1295700787731.JavaMail.root@zimbra.anl.gov> <1291420067.91915.1295701773409.JavaMail.root@zimbra.anl.gov> Message-ID: Thanks, that appears to fix it. It is a very strange issue. 
It almost looks like the output from the .swift -> .xml stage isn't written to the .xml file before the next stage tries to read it. Either that, or there is a silent failure somewhere in that stage. On Sat, Jan 22, 2011 at 7:09 AM, Michael Wilde wrote: > Mihael commented on this problem in this post Nov 3: > > http://mail.ci.uchicago.edu/mailman/private/swift-devel/2010-November/006990.html > (likely SAX incompatibility problem with that particular Java version) > > ------------------------------ > > Tim, I forgot to mention: on IBI you need to use the Java in my home dir > (or any recent Java). The particular SUn java thats the default there has > some weird issue in xml parsing that causes this failure. > > - Mike > > ------------------------------ > > Hi, > I'm trying to get a Swift script running on the IBI cluster but have been > running into some issues. I'm using a recent development version 3029. > Running the script fails at the compile stage. > > Strangely, this script compiles without issue using exactly the same swift > on other machines. > Even more strangely, it looks like the swift -> intermediate xml step > completes without any exceptions being thrown, but when I look at the xml > file it contains precisely zero bytes. It appears that this is what > actually causes the karajan parser to crash. > What would cause the first compile step to complete without causing an > exception but not produce any output? Do we maybe have a buffering issue? > Any suggestions for how I might solve this? > > The log contents are: > > 2011-01-21 16:43:13,448-0600 DEBUG Loader Max heap: 238616576 > 2011-01-21 16:43:13,449-0600 INFO Loader rserver.swift: source file is > new. Recompiling. > 2011-01-21 16:43:13,919-0600 DEBUG Loader Detailed exception: > org.griphyn.vdl.karajan.CompilationException: Unable to parse intermediate > XML > at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) > at org.griphyn.vdl.karajan.Loader.compile(Loader.java:300) > at org.griphyn.vdl.karajan.Loader.main(Loader.java:145) > Caused by: org.apache.xmlbeans.XmlException: > /tmp/tga/SwiftR/swift.local.2306/rserver.xml:2:1: error: Unexpected end of > file after null > at > org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3467) > at > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) > at > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) > at > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) > at > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) > at > org.globus.swift.language.ProgramDocument$Factory.parse(ProgramDocument.java:499) > at org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) > at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) > ... 2 more > Caused by: org.xml.sax.SAXParseException: Unexpected end of file after null > at > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) > at > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) > at > org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435) > ...
9 more > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sat Jan 22 14:29:13 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 22 Jan 2011 12:29:13 -0800 Subject: [Swift-devel] Swift compile issues on IBI cluster In-Reply-To: References: <549693844.91902.1295700787731.JavaMail.root@zimbra.anl.gov> <1291420067.91915.1295701773409.JavaMail.root@zimbra.anl.gov> Message-ID: <1295728153.5253.5.camel@blabla2.none> I don't think I have access to said cluster, but it may be worth getting some recent version of xerces, sticking it into some directory, specifying java.endorsed.dirs and seeing if that fixes the problem with 1.6.0_07-b06. Mihael On Sat, 2011-01-22 at 12:08 -0600, Tim Armstrong wrote: > Thanks, that appears to fix it. It is a very strange issue. It > almost looks like the output from the .swift -> .xml stage isn't > written to the .xml file before the next stage tries to read. Either > that or there is a silent failure somewhere in that stage > > On Sat, Jan 22, 2011 at 7:09 AM, Michael Wilde > wrote: > Mihael commented on this problem in this post Nov 3: > http://mail.ci.uchicago.edu/mailman/private/swift-devel/2010-November/006990.html > (likely SAX incompatibility problem with that particular Java > version) > > > ______________________________________________________________ > > Tim, I forgot to mention: on IBI you need to use the > Java in my home dir (or any recent Java). The > particular SUn java thats the default there has some > weird issue in xml parsing that causes this failure. > > > - Mike > > > ______________________________________________________ > Hi, > I'm trying to get a Swift script running on > the IBI cluster but have been running into > some issues. I'm using a recent development > version 3029. Running the script fails at the > compile stage. > > Strangely, this script compiles without issue > using exactly the same swift on other > machines. > Even more strangely, it looks like the swift > -> intermediate xml step completes without > any exceptions being thrown, but when I look > at the xml file it contains precisely zero > bytes. It appears that this is what actually > causes the karajan parser to crash. > What would cause the first compile step to > complete without causing an exception but not > produce any output? Do we maybe have a > buffering issue? Any suggestions for how I > might solve this? > > The log contents are: > > 2011-01-21 16:43:13,448-0600 DEBUG Loader Max > heap: 238616576 > 2011-01-21 16:43:13,449-0600 INFO Loader > rserver.swift: source file is new. > Recompiling. 
> 2011-01-21 16:43:13,919-0600 DEBUG Loader > Detailed exception: > org.griphyn.vdl.karajan.CompilationException: > Unable to parse intermediate XML > at > org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) > at > org.griphyn.vdl.karajan.Loader.compile(Loader.java:300) > at > org.griphyn.vdl.karajan.Loader.main(Loader.java:145) > Caused by: > org.apache.xmlbeans.XmlException: /tmp/tga/SwiftR/swift.local.2306/rserver.xml:2:1: error: Unexpected end of file after null > at > org.apache.xmlbeans.impl.store.Locale > $SaxLoader.load(Locale.java:3467) > at > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) > at > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) > at > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) > at > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) > at > org.globus.swift.language.ProgramDocument > $Factory.parse(ProgramDocument.java:499) > at > org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) > at > org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) > ... 2 more > Caused by: org.xml.sax.SAXParseException: > Unexpected end of file after null > at > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) > at > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) > at > org.apache.xmlbeans.impl.store.Locale > $SaxLoader.load(Locale.java:3435) > ... 9 more > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Sun Jan 23 15:32:14 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 23 Jan 2011 13:32:14 -0800 Subject: [Swift-devel] Error 521 provider-staging files to PADS nodes In-Reply-To: <1295471524.6134.0.camel@blabla2.none> References: <1112858620.81735.1295465835538.JavaMail.root@zimbra.anl.gov> <1295471524.6134.0.camel@blabla2.none> Message-ID: <1295818334.4211.1.camel@blabla2.none> I'm trying to run tests on pads. The queues aren't quite empty. In the mean time, I committed a bit of a patch to trunk to measure aggregate traffic on TCP channels (those are only used by the workers). You can enable it by setting the "tcp.channel.log.io.performance" system property to "true". Mihael On Wed, 2011-01-19 at 13:12 -0800, Mihael Hategan wrote: > might be due to one of the recent patches. > > you could try to set IOBLOCKSZ to 1 in worker.pl and rerun. > > On Wed, 2011-01-19 at 13:37 -0600, Michael Wilde wrote: > > An interesting observation on the returned output files: there are exactly 33 files in the output dir from this run: the same as the number of jobs Swift reports as Finished successfully. But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below). > > > > Its surprising and perhaps a bug that the jobs are reported finished before the output file is fully written??? 
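The ordering being questioned there, completion reported while the staged-out file is still partial, comes down to when the completion notification is sent relative to closing the output file. A generic sketch of the safe ordering (illustration only, not the actual worker/coaster staging code; the names are made up):

    import java.io.FileOutputStream;

    // Illustration only: report a job finished only after its staged-out
    // data has been flushed to disk and the file has been closed.
    public class WriteThenReport {
        static void stageOutAndReport(byte[] data, String path) throws Exception {
            FileOutputStream out = new FileOutputStream(path);
            try {
                out.write(data);
                out.getFD().sync();   // data on disk...
            } finally {
                out.close();
            }
            System.out.println("JOBSTATUS COMPLETED for " + path);  // ...then notify
        }

        public static void main(String[] args) throws Exception {
            stageOutAndReport("example output".getBytes(), "f.0000.out");  // made-up name
        }
    }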
> > > > Also this 3-partial plus 1-zero file looks to me like one worker staging op hung (the oldest of the 4 incomplete output files) and then perhaps 3 were cut short when the coaster service data protocol froze? > > > > - Mike > > > > login1$ pwd > > /scratch/local/wilde/lab > > login1$ cd outdir > > login1$ ls -lt | grep 10:48 > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > > login1$ ls -lt | grep 10:48 | wc -l > > 33 > > login1$ > > > > > > > > > > ----- Original Message ----- > > > Mihael, > > > > > > The following test on pads failed/hung with an error 521 from > > > worker.pl: > > > > > > --- > > > sub getFileCBDataInIndirect { > > > ... > > > elsif ($timeout) { > > > queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout > > > staging in file")); > > > delete($JOBDATA{$jobid}); > > > --- > > > > > > single foreach loop, doing 1,000 "mv" commands > > > > > > throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job): > > > > > > > > > > > > 8 > > > 3500 > > > 1 > > > 4 > > > 4 > > > short > > > 2.0 > > > 10000 > > > > > > /scratch/local/wilde/test/swiftwork > > > file > > > /scratch/local/wilde/swiftscratch > > > > > > > > > Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers > > > timed out. Note that the hang may have happened earlier, as no new > > > jobs were starting as the jobs in the first wave were finishing. 
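The error path quoted from worker.pl above fires when a staging transfer times out. In Java terms the same general pattern looks roughly like the sketch below (illustration only; the real logic is the Perl code above, and the host and port here are made up):

    import java.io.InputStream;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    // Illustration of the "timeout staging in file" pattern behind error 521;
    // the actual implementation is in worker.pl, not this class.
    public class StageInWithTimeout {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket("example-service-host", 1984);  // made-up endpoint
            s.setSoTimeout(60000);  // give up if no data arrives for 60 s
            try {
                InputStream in = s.getInputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    // append buf[0..n) to the local copy of the staged file
                }
                System.out.println("stage-in complete");
            } catch (SocketTimeoutException e) {
                System.out.println("JOBSTATUS FAILED 521 Timeout staging in file");
            } finally {
                s.close();
            }
        }
    }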
> > > > > > time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps > > > mvn.swift -n=1000 >& out & > > > > > > > > > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net. > > > > > > Swift stdout showed the following after waiting a while for a 4-node > > > PADS coaster allocation to start: > > > > > > Progress: Selecting site:799 Submitted:201 > > > Progress: Selecting site:799 Submitted:201 > > > Progress: Selecting site:799 Submitted:200 Active:1 > > > Progress: Selecting site:798 Submitted:177 Active:24 Finished > > > successfully:1 > > > Progress: Selecting site:796 Submitted:172 Active:28 Finished > > > successfully:4 > > > Progress: Selecting site:792 Submitted:176 Active:24 Finished > > > successfully:8 > > > Progress: Selecting site:788 Submitted:180 Active:20 Finished > > > successfully:12 > > > Progress: Selecting site:784 Submitted:184 Active:16 Finished > > > successfully:16 > > > Progress: Selecting site:780 Submitted:188 Active:12 Finished > > > successfully:20 > > > Progress: Selecting site:777 Submitted:191 Active:9 Finished > > > successfully:23 > > > Progress: Selecting site:773 Submitted:195 Active:5 Finished > > > successfully:27 > > > Progress: Selecting site:770 Submitted:197 Active:3 Finished > > > successfully:30 > > > Progress: Selecting site:767 Submitted:200 Finished successfully:33 > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > Progress: Selecting site:766 Submitted:200 Active:1 Finished > > > successfully:33 > > > Execution failed: > > > Job failed with an exit code of 521 > > > login1$ > > > login1$ > > > login1$ pwd > > > /scratch/local/wilde/lab > > > login1$ ls -lt | head > > > total 51408 > > > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51 > > > mvn-20110119-0956-s3s8h9h2.log > > > > > > (copied to ~wilde) > > > > > > script was: > > > > > > login1$ cat mvn.swift > > > type file; > > > > > > app (file o) mv (file i) > > > { > > > mv @i @o; > > > } > > > > > > file out[] > > prefix="f.",suffix=".out">; > > > foreach j in [1:@toint(@arg("n","1"))] { > > > file data<"data.txt">; > > > out[j] = mv(data); > > > } > > > > > > > > > data.txt was 3MB > > > > > > A look at the outdir gives a clue to where things hung: The files of > > > <= ~3MB from time 10:48 are from this job. 
Files from 10:39 and > > > earlier are from other manual runs executed on login1, Note that 3 of > > > the 3MB output files have length 0 or <3MB, and were likely in transit > > > back from the worker: > > > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > > > > > > login1$ pwd > > > /scratch/local/wilde/lab > > > login1$ cd outdir > > > login1$ ls -lt | head -40 > > > total 2772188 > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out > > > l > > > > > > - Mike > > > > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > 
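For the "tcp.channel.log.io.performance" switch mentioned at the top of this message: it is an ordinary JVM system property, so it is normally supplied as -Dtcp.channel.log.io.performance=true on the java command line that starts Swift (exactly how extra JVM options are threaded through a particular launcher script may differ). Inside the JVM it reads back as a boolean, roughly:

    // Minimal sketch of how such a flag is read; the real logging lives in
    // the cog TCP channel classes, not in this made-up class.
    public class PerfLogFlag {
        public static void main(String[] args) {
            boolean enabled = Boolean.getBoolean("tcp.channel.log.io.performance");
            System.out.println("tcp.channel.log.io.performance = " + enabled);
        }
    }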
_______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From jon.monette at gmail.com Sun Jan 23 22:18:23 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Sun, 23 Jan 2011 22:18:23 -0600 Subject: [Swift-devel] No provider specified error Message-ID: <4D3CFD8F.1070700@gmail.com> Mihael, I just updated to cog-r3043 and swift-r4031. I now get this Swift error: Execution failed: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; Upon inspecting the log file I found this Java error: 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No properties for provider proxy. Using empty properties 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE thread=0-3 tr=mImgtbl 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw @ execute-default.k, line: 45: Exception in mImgtbl: Arguments: [proj_dir, images.tbl] Host: localhost Directory: rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs ---- Exception in mImgtbl: Arguments: [proj_dir, images.tbl] Host: localhost Directory: rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs ---- Caused by: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge].
Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) at java.lang.Thread.run(Thread.java:662) Caused by: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; at org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) at org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) at org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) at org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) at org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) at org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) at org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) at org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) at org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) at org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) at org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) ... 1 more Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; at org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) at org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) at org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) ... 6 more 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed exception: Exception in mImgtbl: Arguments: [proj_dir, images.tbl] Host: localhost Directory: rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs ---- Caused by: No 'proxy' provider or alias found. 
Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) at java.lang.Thread.run(Thread.java:662) Caused by: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. 
Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; at org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) at org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) at org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) at org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) at org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) at org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) at org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) at org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) at org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) at org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) at org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) ... 1 more Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; at org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) at org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) at org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) at org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) ... 
6 more I am running on PADS and here is my sites.xml file: .05 1 /gpfs/pads/swift/jonmon/Swift/work/localhost 3600 192.5.86.6 1 40 1 1 fast 1 10000 1 /gpfs/pads/swift/jonmon/Swift/work/pads Has anyone else seen this error or Mihael have you witnessed this? From hategan at mcs.anl.gov Mon Jan 24 03:41:28 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Jan 2011 01:41:28 -0800 Subject: [Swift-devel] Error 521 provider-staging files to PADS nodes In-Reply-To: <1295818334.4211.1.camel@blabla2.none> References: <1112858620.81735.1295465835538.JavaMail.root@zimbra.anl.gov> <1295471524.6134.0.camel@blabla2.none> <1295818334.4211.1.camel@blabla2.none> Message-ID: <1295862088.29849.6.camel@blabla2.none> Play with buffer sizes and ye shall be rewarded. Turns out that setting TCP buffer sizes to obscene numbers, like 2M, gives you quite a bit: 70MB/s in + 70MB/s out on average. Those pads nodes must have some fast disks (though maybe it's just the cache). This is with 1 worker and 4wpn. I'm assuming that with many workers, the fact that each worker connection has its separate buffer will essentially achieve a similar effect. But then there should be an option for setting the buffer size. The numbers are attached. This all goes from head node local disk to worker node local disk directly, so there is no nfs. I'd be curious to know how that compares, but I am done for the day. Mihael On Sun, 2011-01-23 at 13:32 -0800, Mihael Hategan wrote: > I'm trying to run tests on pads. The queues aren't quite empty. In the > mean time, I committed a bit of a patch to trunk to measure aggregate > traffic on TCP channels (those are only used by the workers). You can > enable it by setting the "tcp.channel.log.io.performance" system > property to "true". > > Mihael > > On Wed, 2011-01-19 at 13:12 -0800, Mihael Hategan wrote: > > might be due to one of the recent patches. > > > > you could try to set IOBLOCKSZ to 1 in worker.pl and rerun. > > > > On Wed, 2011-01-19 at 13:37 -0600, Michael Wilde wrote: > > > An interesting observation on the returned output files: there are exactly 33 files in the output dir from this run: the same as the number of jobs Swift reports as Finished successfully. But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below). > > > > > > Its surprising and perhaps a bug that the jobs are reported finished before the output file is fully written??? > > > > > > Also this 3-partial plus 1-zero file looks to me like one worker staging op hung (the oldest of the 4 incomplete output files) and then perhaps 3 were cut short when the coaster service data protocol froze? 
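The knob being played with at the top of this message is the standard TCP socket buffer size. At the java.net.Socket level the 2M experiment corresponds to something like the sketch below (illustration only; this is not the coaster service or worker code, and the operating system may clamp whatever is requested):

    import java.net.Socket;

    // Sketch only: request 2 MB TCP send/receive buffers on a socket and
    // print what the kernel actually granted (it may round or cap the value).
    public class BigBuffers {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket();
            s.setReceiveBufferSize(2 * 1024 * 1024);
            s.setSendBufferSize(2 * 1024 * 1024);
            System.out.println("receive buffer = " + s.getReceiveBufferSize()
                    + " bytes, send buffer = " + s.getSendBufferSize() + " bytes");
            s.close();
        }
    }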
> > > > > > - Mike > > > > > > login1$ pwd > > > /scratch/local/wilde/lab > > > login1$ cd outdir > > > login1$ ls -lt | grep 10:48 > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > > > login1$ ls -lt | grep 10:48 | wc -l > > > 33 > > > login1$ > > > > > > > > > > > > > > > ----- Original Message ----- > > > > Mihael, > > > > > > > > The following test on pads failed/hung with an error 521 from > > > > worker.pl: > > > > > > > > --- > > > > sub getFileCBDataInIndirect { > > > > ... > > > > elsif ($timeout) { > > > > queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout > > > > staging in file")); > > > > delete($JOBDATA{$jobid}); > > > > --- > > > > > > > > single foreach loop, doing 1,000 "mv" commands > > > > > > > > throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job): > > > > > > > > > > > > > > > > 8 > > > > 3500 > > > > 1 > > > > 4 > > > > 4 > > > > short > > > > 2.0 > > > > 10000 > > > > > > > > /scratch/local/wilde/test/swiftwork > > > > file > > > > /scratch/local/wilde/swiftscratch > > > > > > > > > > > > Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers > > > > timed out. Note that the hang may have happened earlier, as no new > > > > jobs were starting as the jobs in the first wave were finishing. 
> > > > > > > > time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps > > > > mvn.swift -n=1000 >& out & > > > > > > > > > > > > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net. > > > > > > > > Swift stdout showed the following after waiting a while for a 4-node > > > > PADS coaster allocation to start: > > > > > > > > Progress: Selecting site:799 Submitted:201 > > > > Progress: Selecting site:799 Submitted:201 > > > > Progress: Selecting site:799 Submitted:200 Active:1 > > > > Progress: Selecting site:798 Submitted:177 Active:24 Finished > > > > successfully:1 > > > > Progress: Selecting site:796 Submitted:172 Active:28 Finished > > > > successfully:4 > > > > Progress: Selecting site:792 Submitted:176 Active:24 Finished > > > > successfully:8 > > > > Progress: Selecting site:788 Submitted:180 Active:20 Finished > > > > successfully:12 > > > > Progress: Selecting site:784 Submitted:184 Active:16 Finished > > > > successfully:16 > > > > Progress: Selecting site:780 Submitted:188 Active:12 Finished > > > > successfully:20 > > > > Progress: Selecting site:777 Submitted:191 Active:9 Finished > > > > successfully:23 > > > > Progress: Selecting site:773 Submitted:195 Active:5 Finished > > > > successfully:27 > > > > Progress: Selecting site:770 Submitted:197 Active:3 Finished > > > > successfully:30 > > > > Progress: Selecting site:767 Submitted:200 Finished successfully:33 > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > Progress: Selecting site:766 Submitted:200 Active:1 Finished > > > > successfully:33 > > > > Execution failed: > > > > Job failed with an exit code of 521 > > > > login1$ > > > > login1$ > > > > login1$ pwd > > > > /scratch/local/wilde/lab > > > > login1$ ls -lt | head > > > > total 51408 > > > > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51 > > > > mvn-20110119-0956-s3s8h9h2.log > > > > > > > > (copied to ~wilde) > > > > > > > > script was: > > > > > > > > login1$ cat mvn.swift > > > > type file; > > > > > > > > app (file o) mv (file i) > > > > { > > > > mv @i @o; > > > > } > > > > > > > > file out[] > > > prefix="f.",suffix=".out">; > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > file data<"data.txt">; > > > > out[j] = mv(data); > > > > } > > > > > > > > > > > > data.txt was 3MB > > > > > > > > A look at the outdir gives a clue to where things hung: The files of > > > > <= ~3MB from time 10:48 are from this job. 
Files from 10:39 and > > > > earlier are from other manual runs executed on login1, Note that 3 of > > > > the 3MB output files have length 0 or <3MB, and were likely in transit > > > > back from the worker: > > > > > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > > > > > > > > > login1$ pwd > > > > /scratch/local/wilde/lab > > > > login1$ cd outdir > > > > login1$ ls -lt | head -40 > > > > total 2772188 > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out > > > > l > > > > > > > > - Mike > > > > > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of Chicago > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel 
at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- PADS: 1m, cat, file, 3-4: Progress: Submitted:2 Active:2 Finished successfully:252 [IN]: Total transferred: 255.43 MB, current rate: 5.07 MB/s, average rate: 6.39 MB/s [MEM] Heap total: 39.62 MMB, Heap used: 33.19 MMB [OUT] Total transferred: 257.87 MB, current rate: 1.42 MB/s, average rate: 6.45 MB/s 32m, cat, file, 1-4: Progress: Selecting site:209 Stage in:1 Submitted:6 Active:2 Finished successfully:38 [IN]: Total transferred: 1.19 GB, current rate: 2.01 MB/s, average rate: 2.07 MB/s [MEM] Heap total: 27.81 MMB, Heap used: 23.84 MMB [OUT] Total transferred: 1.26 GB, current rate: 1 MB/s, average rate: 2.19 MB/s 32m, cat, file, 1-12: Progress: Selecting site:205 Stage in:1 Submitted:5 Active:3 Finished successfully:42 [IN]: Total transferred: 1.32 GB, current rate: 2.02 MB/s, average rate: 2.01 MB/s [MEM] Heap total: 45 MMB, Heap used: 28.49 MMB [OUT] Total transferred: 1.43 GB, current rate: 672.25 KB/s, average rate: 2.18 MB/s 32m, cat, file, 1-4, ppn=2: Progress: Selecting site:222 Submitted:3 Active:6 Finished successfully:25 [IN]: Total transferred: 860.34 MB, current rate: 4.3 MB/s, average rate: 4.14 MB/s [MEM] Heap total: 87.25 MMB, Heap used: 66.65 MMB [OUT] Total transferred: 1018.62 MB, current rate: 1.07 MB/s, average rate: 4.9 MB/s 32m, cat, file, 1-4, ppn=1, bufsz=2*32768: Progress: Selecting site:218 Stage in:1 Submitted:8 Active:1 Finished successfully:28 [IN]: Total transferred: 928.4 MB, current rate: 3.63 MB/s, average rate: 4.3 MB/s [MEM] Heap total: 49.38 MMB, Heap used: 21.04 MMB [OUT] Total transferred: 1006.27 MB, current rate: 29.05 MB/s, average rate: 4.66 MB/s 32m, cat, file, 1-4, ppn=1, bufsz=4*32768: Progress: Selecting site:212 Submitted:8 Finished successfully:36 [IN]: Total transferred: 1.13 GB, current rate: 6.88 MB/s, average rate: 7.53 MB/s [MEM] Heap total: 33.75 MMB, Heap used: 27.72 MMB [OUT] Total transferred: 1.2 GB, current rate: 32.51 MB/s, average rate: 8.01 MB/s 32m, cat, file, 1-4, ppn=1, bufsz=8*32768: Progress: Selecting site:131 Submitted:5 Active:2 Finished successfully:118 [IN]: Total transferred: 3.75 GB, current rate: 17.63 MB/s, average rate: 17.62 MB/s [MEM] Heap total: 49.06 MMB, Heap used: 31.45 MMB [OUT] Total transferred: 3.78 GB, current rate: 25.04 MB/s, average rate: 17.77 MB/s 32m, cat, file, 1-4, ppn=1, bufsz=16*32768: Progress: Selecting site:125 Submitted:7 Active:1 Finished successfully:123 [IN]: Total transferred: 3.88 GB, current rate: 31.54 MB/s, average rate: 32.81 MB/s [MEM] Heap total: 66.06 MMB, Heap used: 52.44 MMB [OUT] Total transferred: 3.94 GB, current rate: 41.39 MB/s, average rate: 33.34 MB/s 32m, cat, file, 1-4, ppn=1, bufsz=32*32768: Progress: Selecting site:55 Submitted:7 Active:1 Finished successfully:193 [IN]: Total transferred: 6.05 GB, current rate: 61.8 MB/s, average rate: 58.49 MB/s [MEM] Heap total: 88.5 MMB, Heap used: 31.79 MMB [OUT] Total transferred: 6.12 GB, current rate: 64.16 MB/s, average rate: 59.08 MB/s 32m, cat, file, 1-4, ppn=1, bufsz=64*32768: Progress: Submitted:3 Active:2 Finished successfully:251 [IN]: 
Total transferred: 7.93 GB, current rate: 86.94 MB/s, average rate: 70.58 MB/s [MEM] Heap total: 122.75 MMB, Heap used: 66.48 MMB [OUT] Total transferred: 7.99 GB, current rate: 66.47 MB/s, average rate: 71.11 MB/s From tim.g.armstrong at gmail.com Mon Jan 24 08:37:20 2011 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Mon, 24 Jan 2011 08:37:20 -0600 Subject: [Swift-devel] Swift compile issues on IBI cluster In-Reply-To: <1295728153.5253.5.camel@blabla2.none> References: <549693844.91902.1295700787731.JavaMail.root@zimbra.anl.gov> <1291420067.91915.1295701773409.JavaMail.root@zimbra.anl.gov> <1295728153.5253.5.camel@blabla2.none> Message-ID: It doesn't appear that changing the version of xerces helps. I'm not that familiar with the java.endorsed.dirs mechanism, but I believe I'm using it correctly: adding the xerces jars to the class path and then adding the containing directory as an endorsed dir. I still get exactly the same failure. eval java -Xmx256M -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl -Djava.endorsed.dirs=/cchome/tga/xerces-2_11_0 -DUID=667327 -DGLOBUS_HOSTNAME=ibicluster.uchicago.cc-DCOG_INSTALL_PATH=/userhom3/1/tga/R/library/Swift/exec/../swift/bin/.. -Dswift.home=/userhom3/1/tga/R/library/Swift/exec/../swift/bin/.. -Djava.security.egd=file:///dev/urandom -classpath /cchome/tga/xerces-2_11_0/xercesImpl.jar:/cchome/tga/xerces-2_11_0/xercesSamples.jar:/cchome/tga/xerces-2_11_0/xml-apis.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../etc:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../libexec:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/addressing-1.0.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/ant.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/antlr-2.7.5.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/axis.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/axis-url.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/backport-util-concurrent.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/castor-0.9.6.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/coaster-bootstrap.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-abstraction-common-2.4.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-axis.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-grapheditor-0.47.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-jglobus-1.7.0.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-karajan-0.36-dev.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-provider-clref-gt4_0_0.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-provider-coaster-0.3.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-provider-dcache-0.1.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-provider-gt2-2.4.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-provider-gt4_0_0-2.5.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-provider-local-2.2.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-provider-localscheduler-0.4.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-provider-ssh-2.4.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-provider-webdav-2.1.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-resources-1.0.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-swift-svn.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin
/../lib/cog-trap-1.0.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-url.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cog-util-0.92.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/commonj.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/commons-beanutils.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/commons-collections-3.0.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/commons-digester.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/commons-discovery.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/commons-httpclient.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/commons-logging-1.1.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/concurrent.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cryptix32.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cryptix-asn1.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/cryptix.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/globus_delegation_service.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/globus_delegation_stubs.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/globus_wsrf_mds_aggregator_stubs.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/globus_wsrf_rendezvous_service.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/globus_wsrf_rendezvous_stubs.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/globus_wsrf_rft_stubs.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/gram-client.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/gram-stubs.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/gram-utils.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/j2ssh-common-0.2.2.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/j2ssh-core-0.2.2-patch-b.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/jakarta-regexp-1.2.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/jakarta-slide-webdavlib-2.0.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/jaxrpc.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/jce-jdk13-131.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/jgss.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/jline-0.9.94.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/jsr173_1.0_api.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/jug-lgpl-2.0.0.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/junit.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/log4j-1.2.8.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/naming-common.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/naming-factory.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/naming-java.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/naming-resources.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/opensaml.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/puretls.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/resolver.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/saaj.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/stringtemplate.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/vdldefinitions.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/wsdl4j.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/wsrf_core.jar:/userhom3/1/tga/R/libra
ry/Swift/exec/../swift/bin/../lib/wsrf_core_stubs.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/wsrf_mds_index_stubs.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/wsrf_mds_usefulrp_schema_stubs.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/wsrf_provider_jce.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/wsrf_tools.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/wss4j.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/xalan.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/xbean.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/xbean_xpath.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/xercesImpl.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/xml-apis.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/xmlsec.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/xpp3-1.1.3.4d_b4_min.jar:/userhom3/1/tga/R/library/Swift/exec/../swift/bin/../lib/xstream-1.1.1-patched.jar: org.griphyn.vdl.karajan.Loader '-config' 'cf' '-tc.file' 'tc' '-sites.file' 'sites.xml' 'rserver.swift' '-pipedir=/tmp/tga/SwiftR/swift.local' 2011-01-24 08:19:37,519-0600 DEBUG Loader Max heap: 238616576 2011-01-24 08:19:37,540-0600 INFO Loader rserver.swift: source file is new. Recompiling. 2011-01-24 08:19:39,454-0600 DEBUG Loader Detailed exception: org.griphyn.vdl.karajan.CompilationException: Unable to parse intermediate XML at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) at org.griphyn.vdl.karajan.Loader.compile(Loader.java:300) at org.griphyn.vdl.karajan.Loader.main(Loader.java:145) Caused by: org.apache.xmlbeans.XmlException: /tmp/tga/SwiftR/swift.local.4111/rserver.xml:2:1: error: Unexpected end of file after null at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3467) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) at org.globus.swift.language.ProgramDocument$Factory.parse(ProgramDocument.java:499) at org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) ... 2 more Caused by: org.xml.sax.SAXParseException: Unexpected end of file after null at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435) ... 9 more On Sat, Jan 22, 2011 at 2:29 PM, Mihael Hategan wrote: > I don't think I have access to said cluster, but it may be worth getting > some recent version of xerces, sticking it into some directory, > specifying java.endorsed.dirs and seeing if that fixes the problem with > 1.6.0_07-b06. > > Mihael > > On Sat, 2011-01-22 at 12:08 -0600, Tim Armstrong wrote: > > Thanks, that appears to fix it. It is a very strange issue. It > > almost looks like the output from the .swift -> .xml stage isn't > > written to the .xml file before the next stage tries to read. 
Either > > that or there is a silent failure somewhere in that stage > > > > On Sat, Jan 22, 2011 at 7:09 AM, Michael Wilde > > wrote: > > Mihael commented on this problem in this post Nov 3: > > > http://mail.ci.uchicago.edu/mailman/private/swift-devel/2010-November/006990.html > > (likely SAX incompatibility problem with that particular Java > > version) > > > > > > ______________________________________________________________ > > > > Tim, I forgot to mention: on IBI you need to use the > > Java in my home dir (or any recent Java). The > > particular SUn java thats the default there has some > > weird issue in xml parsing that causes this failure. > > > > > > - Mike > > > > > > ______________________________________________________ > > Hi, > > I'm trying to get a Swift script running on > > the IBI cluster but have been running into > > some issues. I'm using a recent development > > version 3029. Running the script fails at the > > compile stage. > > > > Strangely, this script compiles without issue > > using exactly the same swift on other > > machines. > > Even more strangely, it looks like the swift > > -> intermediate xml step completes without > > any exceptions being thrown, but when I look > > at the xml file it contains precisely zero > > bytes. It appears that this is what actually > > causes the karajan parser to crash. > > What would cause the first compile step to > > complete without causing an exception but not > > produce any output? Do we maybe have a > > buffering issue? Any suggestions for how I > > might solve this? > > > > The log contents are: > > > > 2011-01-21 16:43:13,448-0600 DEBUG Loader Max > > heap: 238616576 > > 2011-01-21 16:43:13,449-0600 INFO Loader > > rserver.swift: source file is new. > > Recompiling. > > 2011-01-21 16:43:13,919-0600 DEBUG Loader > > Detailed exception: > > org.griphyn.vdl.karajan.CompilationException: > > Unable to parse intermediate XML > > at > > > org.griphyn.vdl.engine.Karajan.compile(Karajan.java:98) > > at > > > org.griphyn.vdl.karajan.Loader.compile(Loader.java:300) > > at > > > org.griphyn.vdl.karajan.Loader.main(Loader.java:145) > > Caused by: > > org.apache.xmlbeans.XmlException: > /tmp/tga/SwiftR/swift.local.2306/rserver.xml:2:1: error: Unexpected end of > file after null > > at > > org.apache.xmlbeans.impl.store.Locale > > $SaxLoader.load(Locale.java:3467) > > at > > > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270) > > at > > > org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257) > > at > > > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345) > > at > > > org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:252) > > at > > org.globus.swift.language.ProgramDocument > > $Factory.parse(ProgramDocument.java:499) > > at > > > org.griphyn.vdl.engine.Karajan.parseProgramXML(Karajan.java:122) > > at > > > org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96) > > ... 2 more > > Caused by: org.xml.sax.SAXParseException: > > Unexpected end of file after null > > at > > > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportFatalError(Piccolo.java:1038) > > at > > > org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:723) > > at > > org.apache.xmlbeans.impl.store.Locale > > $SaxLoader.load(Locale.java:3435) > > ... 
9 more > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Mon Jan 24 12:07:45 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Jan 2011 10:07:45 -0800 Subject: [Swift-devel] Swift compile issues on IBI cluster In-Reply-To: References: <549693844.91902.1295700787731.JavaMail.root@zimbra.anl.gov> <1291420067.91915.1295701773409.JavaMail.root@zimbra.anl.gov> <1295728153.5253.5.camel@blabla2.none> Message-ID: <1295892465.30587.1.camel@blabla2.none> Then my theory is wrong. I can't do much without being able to reproduce this. So there are two choices: get access to the IBI cluster or let it go. Mihael On Mon, 2011-01-24 at 08:37 -0600, Tim Armstrong wrote: > It doesn't appear that changing the version of xerces helps. I'm not > that familiar with the java.endorsed.dirs mechanism, but I believe I'm > using it correctly: adding the xerces jars to the class path and then > adding the containing directory as an endorsed dir. > > I still get exactly the same failure. From hategan at mcs.anl.gov Mon Jan 24 12:51:35 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Jan 2011 10:51:35 -0800 Subject: [Swift-devel] Re: [Swift-user] pbs ppn count and stuff In-Reply-To: <1295894773.31774.5.camel@blabla2.none> References: <1295894773.31774.5.camel@blabla2.none> Message-ID: <1295895095.31774.6.camel@blabla2.none> Sorry. This was meant for swift-devel. Mihael On Mon, 2011-01-24 at 10:46 -0800, Mihael Hategan wrote: > So I think some of the problems with ppn are as follows: > 1. count in cog means number of processes. count in PBS means number of > nodes. > 2. when the number of nodes requested was 1 but ppn > 1, the multiple > job scheme was not enabled so, despite having multiple lines in > PBS_NODEFILE, only one process would get started. If count was > 1 then > PBS would understand that count*ppn lines should be in PBS_NODEFILE, > which would result in that number of processes be started. In other > words there was no way to tell PBS to start 4 jobs on only one node. > > So: > > - I changed this to be consistent with 1. Count means number of > processes to be started. This imposes the restriction that count % ppn = > 0. If not, the pbs provider will throw an exception. > - I also added mppnppn if USE_MPPWIDTH is enabled. > > This is in trunk. 
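To make the count/ppn rule above concrete, here is a minimal sketch of the arithmetic it describes: count is taken as the total number of processes and ppn as processes per node, so the request asks PBS for count/ppn nodes and count has to be a multiple of ppn. This is only an illustration; the class and method names below are invented for the example and this is not the actual cog PBS provider code.

// Illustrative sketch only; not the org.globus.cog PBS provider source.
// Encodes the rule described above: count = total processes, and count % ppn must be 0.
public final class PbsCountSketch {

    static int nodesFor(int count, int ppn) {
        if (ppn <= 0) {
            throw new IllegalArgumentException("ppn must be positive");
        }
        if (count % ppn != 0) {
            // corresponds to "the pbs provider will throw an exception" above
            throw new IllegalArgumentException(
                "count (" + count + ") must be a multiple of ppn (" + ppn + ")");
        }
        return count / ppn;
    }

    public static void main(String[] args) {
        // 8 processes at 4 per node gives 2 nodes, i.e. a request like "#PBS -l nodes=2:ppn=4"
        System.out.println("#PBS -l nodes=" + nodesFor(8, 4) + ":ppn=4");
    }
}

Read this way, count=4 with ppn=4 asks for four processes on a single node, which is the "4 jobs on only one node" case that the old scheme could not express.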
> > Mihael > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From hategan at mcs.anl.gov Mon Jan 24 18:54:11 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Jan 2011 16:54:11 -0800 Subject: [Swift-devel] Error 521 provider-staging files to PADS nodes In-Reply-To: <1295862088.29849.6.camel@blabla2.none> References: <1112858620.81735.1295465835538.JavaMail.root@zimbra.anl.gov> <1295471524.6134.0.camel@blabla2.none> <1295818334.4211.1.camel@blabla2.none> <1295862088.29849.6.camel@blabla2.none> Message-ID: <1295916851.9232.2.camel@blabla2.none> And then here's the funny thing: 2 workers, 4 wpn. When running with ppn=2 (so both on the same node): [IN]: Total transferred: 7.99 GB, current rate: 13.07 MB/s, average rate: 85.23 MB/s [OUT] Total transferred: 8 GB, current rate: 42 B/s, average rate: 85.38 MB/s Same situation, but with ppn=1 (so the two are on different nodes): [IN]: Total transferred: 5.83 GB, current rate: 20.79 MB/s, average rate: 20.31 MB/s [OUT] Total transferred: 5.97 GB, current rate: 32.01 MB/s, average rate: 20.8 MB/s This, to me, looks fine because it's the opposite of what I'm expecting. The service itself should see no difference between the two, and I suspect it doesn't. But something else is going on. Any ideas? Mihael On Mon, 2011-01-24 at 01:41 -0800, Mihael Hategan wrote: > Play with buffer sizes and ye shall be rewarded. > > Turns out that setting TCP buffer sizes to obscene numbers, like 2M, > gives you quite a bit: 70MB/s in + 70MB/s out on average. Those pads > nodes must have some fast disks (though maybe it's just the cache). > > This is with 1 worker and 4wpn. I'm assuming that with many workers, the > fact that each worker connection has its separate buffer will > essentially achieve a similar effect. But then there should be an option > for setting the buffer size. > > The numbers are attached. This all goes from head node local disk to > worker node local disk directly, so there is no nfs. I'd be curious to > know how that compares, but I am done for the day. > > Mihael > > On Sun, 2011-01-23 at 13:32 -0800, Mihael Hategan wrote: > > I'm trying to run tests on pads. The queues aren't quite empty. In the > > mean time, I committed a bit of a patch to trunk to measure aggregate > > traffic on TCP channels (those are only used by the workers). You can > > enable it by setting the "tcp.channel.log.io.performance" system > > property to "true". > > > > Mihael > > > > On Wed, 2011-01-19 at 13:12 -0800, Mihael Hategan wrote: > > > might be due to one of the recent patches. > > > > > > you could try to set IOBLOCKSZ to 1 in worker.pl and rerun. > > > > > > On Wed, 2011-01-19 at 13:37 -0600, Michael Wilde wrote: > > > > An interesting observation on the returned output files: there are exactly 33 files in the output dir from this run: the same as the number of jobs Swift reports as Finished successfully. But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below). > > > > > > > > Its surprising and perhaps a bug that the jobs are reported finished before the output file is fully written??? > > > > > > > > Also this 3-partial plus 1-zero file looks to me like one worker staging op hung (the oldest of the 4 incomplete output files) and then perhaps 3 were cut short when the coaster service data protocol froze? 
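Returning to the buffer-size point made earlier in this message: the 2M figure refers to per-connection TCP send and receive buffers. The fragment below only illustrates that kind of tuning on a plain java.net.Socket; it is not the coaster channel code, and the class and method names are invented for the example. The 2 MB value is simply the number quoted above.

import java.net.Socket;
import java.net.SocketException;

// Illustration only; not the coaster TCP channel implementation.
// Shows the per-connection buffer sizes being discussed in this thread.
public final class TcpBufferSketch {

    static void widenBuffers(Socket s, int bytes) throws SocketException {
        s.setSendBufferSize(bytes);     // e.g. 2 * 1024 * 1024
        s.setReceiveBufferSize(bytes);  // the OS may cap what it actually grants
    }

    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket()) {  // unconnected socket; options can still be set
            widenBuffers(s, 2 * 1024 * 1024);
            System.out.println("receive buffer now: " + s.getReceiveBufferSize() + " bytes");
        }
    }
}

The aggregate [IN]/[OUT] rates quoted above presumably come from the tcp.channel.log.io.performance switch mentioned earlier in the quoted thread, which is read as a JVM system property.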
> > > > > > > > - Mike > > > > > > > > login1$ pwd > > > > /scratch/local/wilde/lab > > > > login1$ cd outdir > > > > login1$ ls -lt | grep 10:48 > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > > > > login1$ ls -lt | grep 10:48 | wc -l > > > > 33 > > > > login1$ > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > Mihael, > > > > > > > > > > The following test on pads failed/hung with an error 521 from > > > > > worker.pl: > > > > > > > > > > --- > > > > > sub getFileCBDataInIndirect { > > > > > ... > > > > > elsif ($timeout) { > > > > > queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout > > > > > staging in file")); > > > > > delete($JOBDATA{$jobid}); > > > > > --- > > > > > > > > > > single foreach loop, doing 1,000 "mv" commands > > > > > > > > > > throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job): > > > > > > > > > > > > > > > > > > > > 8 > > > > > 3500 > > > > > 1 > > > > > 4 > > > > > 4 > > > > > short > > > > > 2.0 > > > > > 10000 > > > > > > > > > > /scratch/local/wilde/test/swiftwork > > > > > file > > > > > /scratch/local/wilde/swiftscratch > > > > > > > > > > > > > > > Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers > > > > > timed out. 
Note that the hang may have happened earlier, as no new > > > > > jobs were starting as the jobs in the first wave were finishing. > > > > > > > > > > time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps > > > > > mvn.swift -n=1000 >& out & > > > > > > > > > > > > > > > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net. > > > > > > > > > > Swift stdout showed the following after waiting a while for a 4-node > > > > > PADS coaster allocation to start: > > > > > > > > > > Progress: Selecting site:799 Submitted:201 > > > > > Progress: Selecting site:799 Submitted:201 > > > > > Progress: Selecting site:799 Submitted:200 Active:1 > > > > > Progress: Selecting site:798 Submitted:177 Active:24 Finished > > > > > successfully:1 > > > > > Progress: Selecting site:796 Submitted:172 Active:28 Finished > > > > > successfully:4 > > > > > Progress: Selecting site:792 Submitted:176 Active:24 Finished > > > > > successfully:8 > > > > > Progress: Selecting site:788 Submitted:180 Active:20 Finished > > > > > successfully:12 > > > > > Progress: Selecting site:784 Submitted:184 Active:16 Finished > > > > > successfully:16 > > > > > Progress: Selecting site:780 Submitted:188 Active:12 Finished > > > > > successfully:20 > > > > > Progress: Selecting site:777 Submitted:191 Active:9 Finished > > > > > successfully:23 > > > > > Progress: Selecting site:773 Submitted:195 Active:5 Finished > > > > > successfully:27 > > > > > Progress: Selecting site:770 Submitted:197 Active:3 Finished > > > > > successfully:30 > > > > > Progress: Selecting site:767 Submitted:200 Finished successfully:33 > > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33 > > > > > Progress: Selecting site:766 Submitted:200 Active:1 Finished > > > > > successfully:33 > > > > > Execution failed: > > > > > Job failed with an exit code of 521 > > > > > login1$ > > > > > login1$ > > > > > login1$ pwd > > > > > /scratch/local/wilde/lab > > > > > login1$ ls -lt | head > > > > > total 51408 > > > > > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51 > > > > > mvn-20110119-0956-s3s8h9h2.log > > > > > > > > > > (copied to ~wilde) > > > > > > > > > > script was: > > > > > > > > > > login1$ cat mvn.swift > > > > > type file; > > > > > > > > > > app (file o) mv (file i) > > > > > { > > > > > mv @i @o; > > > > > } > > > > > > > > > > file out[] > > > > prefix="f.",suffix=".out">; > > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > > file data<"data.txt">; > > > > > out[j] = mv(data); > > > > > } > > > > > > > > > > > > > > > data.txt was 3MB > > > > > > > > > > A look at the outdir gives a clue to where things hung: The files of > > > > > <= ~3MB from time 10:48 are from this job. 
Files from 10:39 and > > > > > earlier are from other manual runs executed on login1, Note that 3 of > > > > > the 3MB output files have length 0 or <3MB, and were likely in transit > > > > > back from the worker: > > > > > > > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > > > > > > > > > > > > login1$ pwd > > > > > /scratch/local/wilde/lab > > > > > login1$ cd outdir > > > > > login1$ ls -lt | head -40 > > > > > total 2772188 > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out > > > > > l > > > > > > > > > > - Mike > > > > > > > > > > > > > > > -- > > > > > Michael Wilde > > > > > Computation Institute, University of Chicago > > > > > Mathematics and Computer Science Division > > > > > Argonne National 
Laboratory > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From jon.monette at gmail.com Mon Jan 24 19:38:59 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Mon, 24 Jan 2011 19:38:59 -0600 Subject: [Swift-devel] Re: No provider specified error In-Reply-To: <4D3CFD8F.1070700@gmail.com> References: <4D3CFD8F.1070700@gmail.com> Message-ID: <4D3E29B3.8010406@gmail.com> Ok. I went searching through some of the cog code for this error when I took a close look at this line from the log file: Caused by: org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No 'proxy' provider or alias found. Available providers: [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge] I am assuming that all these providers listed are the acceptable entries in the sites.xml for the execution provider entry. I notice that localhost isn't an acceptable execution provider. Was localhost switched to local(which is an acceptable value) in cog-r3043? On 1/23/11 10:18 PM, Jonathan Monette wrote: > Mihael, > I just updated to cog-r3043 and swift-r4031. I now get this swift > error: > > Execution failed: > No 'proxy' provider or alias found. Available providers: [cobalt, > gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > > Upon inspecting the log file I got found this Java error: > > 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No properties > for provider proxy. Using empty properties > 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE thread=0-3 > tr=mImgtbl > 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw @ > execute-default.k, line: 45: Exception in mImgtbl: > Arguments: [proj_dir, images.tbl] > Host: localhost > Directory: > rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > ---- > > Exception in mImgtbl: > Arguments: [proj_dir, images.tbl] > Host: localhost > Directory: > rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > ---- > > Caused by: No 'proxy' provider or alias found. Available providers: > [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. 
Available providers: [cobalt, > gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > at > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > at > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) > at > edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > at > edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > at > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > at java.lang.Thread.run(Thread.java:662) > Caused by: No 'proxy' provider or alias found. Available providers: > [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > at > org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > at > org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) > at > org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) > at > org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > at > org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) > at > org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > at > org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) > at > org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) > at > org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) > at > org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > ... 1 more > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > at > org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) > at > org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) > at > org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) > ... 
6 more > 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed exception: > Exception in mImgtbl: > Arguments: [proj_dir, images.tbl] > Host: localhost > Directory: > rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > ---- > > Caused by: No 'proxy' provider or alias found. Available providers: > [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > at > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > at > org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > at > org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > at > org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > at > org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) > at > org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > at > org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) > at > edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > at > edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > at > edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > at java.lang.Thread.run(Thread.java:662) > Caused by: No 'proxy' provider or alias found. 
Available providers: > [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > at > org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > at > org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) > at > org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) > at > org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > at > org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) > at > org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > at > org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) > at > org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) > at > org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) > at > org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > ... 1 more > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; cobalt > <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; gsiftp-old <-> > gridftp-old; local <-> file; > at > org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) > at > org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) > at > org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) > at > org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) > ... 6 more > > I am running on PADS and here is my sites.xml file: > > > > > > .05 > 1 > /gpfs/pads/swift/jonmon/Swift/work/localhost > > > > > > 3600 > 192.5.86.6 > 1 > 40 > 1 > 1 > fast > 1 > 10000 > 1 > /gpfs/pads/swift/jonmon/Swift/work/pads > > > > Has anyone else seen this error or Mihael have you witnessed this? From jon.monette at gmail.com Mon Jan 24 19:42:16 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Mon, 24 Jan 2011 19:42:16 -0600 Subject: [Swift-devel] Re: No provider specified error In-Reply-To: <4D3E29B3.8010406@gmail.com> References: <4D3CFD8F.1070700@gmail.com> <4D3E29B3.8010406@gmail.com> Message-ID: <4D3E2A78.3070001@gmail.com> Scratch that, I do have execution provider="local" set in my sites.xml. I am not sure where the problem is. Still looking. On 1/24/11 7:38 PM, Jonathan Monette wrote: > Ok. I went searching through some of the cog code for this error when > I took a close look at this line from the log file: > > Caused by: > org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > No 'proxy' provider or alias found. Available providers: [cobalt, > gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > coaster-persistent, pbs, ftp, gsiftp-old, local, sge] > > I am assuming that all these providers listed are the acceptable > entries in the sites.xml for the execution provider entry. I notice > that localhost isn't an acceptable execution provider. Was localhost > switched to local(which is an acceptable value) in cog-r3043? > > On 1/23/11 10:18 PM, Jonathan Monette wrote: >> Mihael, >> I just updated to cog-r3043 and swift-r4031. I now get this swift >> error: >> >> Execution failed: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> >> Upon inspecting the log file I got found this Java error: >> >> 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No >> properties for provider proxy. 
Using empty properties >> 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE thread=0-3 >> tr=mImgtbl >> 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw @ >> execute-default.k, line: 45: Exception in mImgtbl: >> Arguments: [proj_dir, images.tbl] >> Host: localhost >> Directory: >> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >> ---- >> >> Exception in mImgtbl: >> Arguments: [proj_dir, images.tbl] >> Host: localhost >> Directory: >> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >> ---- >> >> Caused by: No 'proxy' provider or alias found. Available providers: >> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: >> gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> at >> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >> at >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >> at >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >> at >> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >> at >> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >> at >> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >> at >> 
edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >> at >> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >> at java.lang.Thread.run(Thread.java:662) >> Caused by: No 'proxy' provider or alias found. Available providers: >> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: >> gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> at >> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >> at >> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >> at >> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >> at >> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >> at >> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >> at >> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >> at >> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >> at >> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >> at >> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >> at >> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >> at >> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >> at >> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >> at >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >> at >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >> ... 1 more >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> at >> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >> at >> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >> at >> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >> ... 6 more >> 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed exception: >> Exception in mImgtbl: >> Arguments: [proj_dir, images.tbl] >> Host: localhost >> Directory: >> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >> ---- >> >> Caused by: No 'proxy' provider or alias found. Available providers: >> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: >> gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> at >> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >> at >> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >> at >> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >> at >> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >> at >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >> at >> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >> at >> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >> at >> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >> at >> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >> at >> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >> at java.lang.Thread.run(Thread.java:662) >> Caused by: No 'proxy' provider or alias found. Available providers: >> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: >> gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> at >> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >> at >> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >> at >> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >> at >> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >> at >> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >> at >> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >> at >> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >> at >> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >> at >> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >> at >> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >> at >> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >> at >> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >> at >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >> at >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >> ... 1 more >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >> gsiftp-old <-> gridftp-old; local <-> file; >> at >> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >> at >> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >> at >> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >> at >> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >> ... 
6 more >> >> I am running on PADS and here is my sites.xml file: >> >> >> >> >> >> .05 >> 1 >> /gpfs/pads/swift/jonmon/Swift/work/localhost >> >> >> >> >> >> 3600 >> 192.5.86.6 >> 1 >> 40 >> 1 >> 1 >> fast >> 1 >> 10000 >> 1 >> /gpfs/pads/swift/jonmon/Swift/work/pads >> >> >> >> Has anyone else seen this error or Mihael have you witnessed this? From jon.monette at gmail.com Mon Jan 24 21:28:08 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Mon, 24 Jan 2011 21:28:08 -0600 Subject: [Swift-devel] Re: No provider specified error In-Reply-To: <4D3E2A78.3070001@gmail.com> References: <4D3CFD8F.1070700@gmail.com> <4D3E29B3.8010406@gmail.com> <4D3E2A78.3070001@gmail.com> Message-ID: <4D3E4348.8080003@gmail.com> Figured out what was wrong. I was using some properties in my sites.xml for provider staging per Mike's suggestion. By commenting those lines out my workflow began working again. I am assuming for provider staging to work a 'proxy' must be set up and I do not believe I have set one up. However I am unfamiliar with provider staging, this was a suggestion by Mike to try next in a list of things for me to do. On 1/24/11 7:42 PM, Jonathan Monette wrote: > Scratch that, I do have execution provider="local" set in my > sites.xml. I am not sure where the problem is. Still looking. > > On 1/24/11 7:38 PM, Jonathan Monette wrote: >> Ok. I went searching through some of the cog code for this error >> when I took a close look at this line from the log file: >> >> Caused by: >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >> No 'proxy' provider or alias found. Available providers: [cobalt, >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge] >> >> I am assuming that all these providers listed are the acceptable >> entries in the sites.xml for the execution provider entry. I notice >> that localhost isn't an acceptable execution provider. Was localhost >> switched to local(which is an acceptable value) in cog-r3043? >> >> On 1/23/11 10:18 PM, Jonathan Monette wrote: >>> Mihael, >>> I just updated to cog-r3043 and swift-r4031. I now get this >>> swift error: >>> >>> Execution failed: >>> No 'proxy' provider or alias found. Available providers: >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; >>> >>> Upon inspecting the log file I got found this Java error: >>> >>> 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No >>> properties for provider proxy. Using empty properties >>> 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE >>> thread=0-3 tr=mImgtbl >>> 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw @ >>> execute-default.k, line: 45: Exception in mImgtbl: >>> Arguments: [proj_dir, images.tbl] >>> Host: localhost >>> Directory: >>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>> ---- >>> >>> Exception in mImgtbl: >>> Arguments: [proj_dir, images.tbl] >>> Host: localhost >>> Directory: >>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>> ---- >>> >>> Caused by: No 'proxy' provider or alias found. 
Available providers: >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; >>> Caused by: >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>> 'proxy' provider or alias found. Available providers: [cobalt, >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >>> gsiftp-old <-> gridftp-old; local <-> file; >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >>> at >>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >>> at >>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>> at >>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >>> at >>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >>> at >>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >>> at >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >>> at >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >>> at java.lang.Thread.run(Thread.java:662) >>> Caused by: No 'proxy' provider or alias found. Available providers: >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
>>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; >>> Caused by: >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>> 'proxy' provider or alias found. Available providers: [cobalt, >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >>> gsiftp-old <-> gridftp-old; local <-> file; >>> at >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >>> at >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >>> at >>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >>> at >>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >>> at >>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >>> at >>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >>> at >>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >>> at >>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >>> at >>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >>> at >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >>> at >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >>> at >>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >>> at >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>> at >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >>> ... 1 more >>> Caused by: >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>> 'proxy' provider or alias found. Available providers: [cobalt, >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >>> gsiftp-old <-> gridftp-old; local <-> file; >>> at >>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >>> at >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >>> at >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >>> ... 6 more >>> 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed exception: >>> Exception in mImgtbl: >>> Arguments: [proj_dir, images.tbl] >>> Host: localhost >>> Directory: >>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>> ---- >>> >>> Caused by: No 'proxy' provider or alias found. Available providers: >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; >>> Caused by: >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>> 'proxy' provider or alias found. Available providers: [cobalt, >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >>> gsiftp-old <-> gridftp-old; local <-> file; >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >>> at >>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>> at >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >>> at >>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >>> at >>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >>> at >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>> at >>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >>> at >>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >>> at >>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >>> at >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >>> at >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >>> at java.lang.Thread.run(Thread.java:662) >>> Caused by: No 'proxy' provider or alias found. Available providers: >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; >>> Caused by: >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>> 'proxy' provider or alias found. Available providers: [cobalt, >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >>> gsiftp-old <-> gridftp-old; local <-> file; >>> at >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >>> at >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >>> at >>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >>> at >>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >>> at >>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >>> at >>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >>> at >>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >>> at >>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >>> at >>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >>> at >>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >>> at >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >>> at >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >>> at >>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >>> at >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>> at >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >>> ... 1 more >>> Caused by: >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>> 'proxy' provider or alias found. Available providers: [cobalt, >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; >>> gsiftp-old <-> gridftp-old; local <-> file; >>> at >>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >>> at >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >>> at >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >>> at >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >>> ... 
6 more >>> >>> I am running on PADS and here is my sites.xml file: >>> >>> >>> >>> >>> >>> .05 >>> 1 >>> /gpfs/pads/swift/jonmon/Swift/work/localhost >>> >>> >>> >>> >>> >>> 3600 >>> 192.5.86.6 >>> 1 >>> 40 >>> 1 >>> 1 >>> fast >>> 1 >>> 10000 >>> 1 >>> /gpfs/pads/swift/jonmon/Swift/work/pads >>> >>> >>> >>> Has anyone else seen this error or Mihael have you witnessed this? From hategan at mcs.anl.gov Mon Jan 24 21:57:30 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Jan 2011 19:57:30 -0800 Subject: [Swift-devel] Re: No provider specified error In-Reply-To: <4D3E4348.8080003@gmail.com> References: <4D3CFD8F.1070700@gmail.com> <4D3E29B3.8010406@gmail.com> <4D3E2A78.3070001@gmail.com> <4D3E4348.8080003@gmail.com> Message-ID: <1295927850.12825.1.camel@blabla2.none> Not quite. Proxy is just a coaster staging method in which the coaster service acts as a network proxy between the client and the worker node(s). I do not know what causes your problem. It looks odd. Can you post your broken configuration (swift.properties and sites file)? Mihael On Mon, 2011-01-24 at 21:28 -0600, Jonathan Monette wrote: > Figured out what was wrong. I was using some properties in my sites.xml > for provider staging per Mike's suggestion. By commenting those lines > out my workflow began working again. I am assuming for provider staging > to work a 'proxy' must be set up and I do not believe I have set one > up. However I am unfamiliar with provider staging, this was a > suggestion by Mike to try next in a list of things for me to do. > > On 1/24/11 7:42 PM, Jonathan Monette wrote: > > Scratch that, I do have execution provider="local" set in my > > sites.xml. I am not sure where the problem is. Still looking. > > > > On 1/24/11 7:38 PM, Jonathan Monette wrote: > >> Ok. I went searching through some of the cog code for this error > >> when I took a close look at this line from the log file: > >> > >> Caused by: > >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > >> No 'proxy' provider or alias found. Available providers: [cobalt, > >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge] > >> > >> I am assuming that all these providers listed are the acceptable > >> entries in the sites.xml for the execution provider entry. I notice > >> that localhost isn't an acceptable execution provider. Was localhost > >> switched to local(which is an acceptable value) in cog-r3043? > >> > >> On 1/23/11 10:18 PM, Jonathan Monette wrote: > >>> Mihael, > >>> I just updated to cog-r3043 and swift-r4031. I now get this > >>> swift error: > >>> > >>> Execution failed: > >>> No 'proxy' provider or alias found. Available providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> > >>> Upon inspecting the log file I got found this Java error: > >>> > >>> 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No > >>> properties for provider proxy. 
Using empty properties > >>> 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE > >>> thread=0-3 tr=mImgtbl > >>> 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw @ > >>> execute-default.k, line: 45: Exception in mImgtbl: > >>> Arguments: [proj_dir, images.tbl] > >>> Host: localhost > >>> Directory: > >>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > >>> ---- > >>> > >>> Exception in mImgtbl: > >>> Arguments: [proj_dir, images.tbl] > >>> Host: localhost > >>> Directory: > >>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > >>> ---- > >>> > >>> Caused by: No 'proxy' provider or alias found. Available providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) > >>> at > >>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) > >>> at > >>> 
edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > >>> at java.lang.Thread.run(Thread.java:662) > >>> Caused by: No 'proxy' provider or alias found. Available providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > >>> at > >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > >>> at > >>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) > >>> at > >>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > >>> at > >>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) > >>> at > >>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) > >>> at > >>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > >>> at > >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) > >>> at > >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) > >>> at > >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > >>> at > >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > >>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) > >>> at > >>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > >>> at > >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > >>> ... 1 more > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>> 'proxy' provider or alias found. 
Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) > >>> ... 6 more > >>> 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed exception: > >>> Exception in mImgtbl: > >>> Arguments: [proj_dir, images.tbl] > >>> Host: localhost > >>> Directory: > >>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > >>> ---- > >>> > >>> Caused by: No 'proxy' provider or alias found. Available providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) > >>> at > >>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > >>> at java.lang.Thread.run(Thread.java:662) > >>> Caused by: No 'proxy' provider or alias found. Available providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > >>> at > >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > >>> at > >>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) > >>> at > >>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > >>> at > >>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) > >>> at > >>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) > >>> at > >>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > >>> at > >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) > >>> at > >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) > >>> at > >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > >>> at > >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > >>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) > >>> at > >>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > >>> at > >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > >>> ... 1 more > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) > >>> ... 6 more > >>> > >>> I am running on PADS and here is my sites.xml file: > >>> > >>> > >>> > >>> > >>> > >>> .05 > >>> 1 > >>> /gpfs/pads/swift/jonmon/Swift/work/localhost > >>> > >>> > >>> > >>> > >>> > >>> 3600 > >>> 192.5.86.6 > >>> 1 > >>> 40 > >>> 1 > >>> 1 > >>> fast > >>> 1 > >>> 10000 > >>> 1 > >>> /gpfs/pads/swift/jonmon/Swift/work/pads > >>> > >>> > >>> > >>> Has anyone else seen this error or Mihael have you witnessed this? From wilde at mcs.anl.gov Mon Jan 24 22:00:42 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Jan 2011 22:00:42 -0600 (CST) Subject: [Swift-devel] Re: No provider specified error In-Reply-To: <4D3E4348.8080003@gmail.com> Message-ID: <994571985.97713.1295928042480.JavaMail.root@zimbra.anl.gov> Perhaps what was missing was this tag in the sites.xml pool entry (which should have specified execution provider=coaster: proxy where the method can be file, proxy, or sfs (shared file system). You dont need to set up a proxy: coasters does this for you. This is explained (only, at the moment) at the end of etc/swift.properties: # Controls whether file staging is done by swift or by the execution # provider. If set to false, the standard swift staging mechanism is # used. If set to true, swift does not stage files. Instead, the # execution provider is instructed to stage files in and out. # # Provider staging is experimental. # # When enabled, and when coasters are used as an execution provider, # a staging mechanism can be selected for each site # using the swift:stagingMethod site profile in sites.xml. The # following is a list of accepted mechanisms: # # * file: Staging is done from a filesystem accessible to the # coaster service (typically running on the head node) # * proxy: Staging is done from a filesystem accessible to the # client machine that swift is running on, and is proxied # through the coaster service # * sfs: (short for "shared filesystem") Staging is done by # copying files to and from a filesystem accessible # by the compute node (such as an NFS or GPFS mount). use.provider.staging=false provider.staging.pin.swiftfiles=false --- - Mike ----- Original Message ----- > Figured out what was wrong. I was using some properties in my > sites.xml > for provider staging per Mike's suggestion. By commenting those lines > out my workflow began working again. I am assuming for provider > staging > to work a 'proxy' must be set up and I do not believe I have set one > up. 
However I am unfamiliar with provider staging, this was a > suggestion by Mike to try next in a list of things for me to do. > > On 1/24/11 7:42 PM, Jonathan Monette wrote: > > Scratch that, I do have execution provider="local" set in my > > sites.xml. I am not sure where the problem is. Still looking. > > > > On 1/24/11 7:38 PM, Jonathan Monette wrote: > >> Ok. I went searching through some of the cog code for this error > >> when I took a close look at this line from the log file: > >> > >> Caused by: > >> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > >> No 'proxy' provider or alias found. Available providers: [cobalt, > >> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >> coaster-persistent, pbs, ftp, gsiftp-old, local, sge] > >> > >> I am assuming that all these providers listed are the acceptable > >> entries in the sites.xml for the execution provider entry. I notice > >> that localhost isn't an acceptable execution provider. Was > >> localhost > >> switched to local(which is an acceptable value) in cog-r3043? > >> > >> On 1/23/11 10:18 PM, Jonathan Monette wrote: > >>> Mihael, > >>> I just updated to cog-r3043 and swift-r4031. I now get this > >>> swift error: > >>> > >>> Execution failed: > >>> No 'proxy' provider or alias found. Available providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> > >>> Upon inspecting the log file I got found this Java error: > >>> > >>> 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No > >>> properties for provider proxy. Using empty properties > >>> 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE > >>> thread=0-3 tr=mImgtbl > >>> 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw > >>> @ > >>> execute-default.k, line: 45: Exception in mImgtbl: > >>> Arguments: [proj_dir, images.tbl] > >>> Host: localhost > >>> Directory: > >>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > >>> ---- > >>> > >>> Exception in mImgtbl: > >>> Arguments: [proj_dir, images.tbl] > >>> Host: localhost > >>> Directory: > >>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > >>> ---- > >>> > >>> Caused by: No 'proxy' provider or alias found. Available > >>> providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > >>> No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: > >>> gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) > >>> at > >>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > >>> at java.lang.Thread.run(Thread.java:662) > >>> Caused by: No 'proxy' provider or alias found. Available > >>> providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > >>> No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: > >>> gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > >>> at > >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > >>> at > >>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) > >>> at > >>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > >>> at > >>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) > >>> at > >>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) > >>> at > >>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > >>> at > >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) > >>> at > >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) > >>> at > >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > >>> at > >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > >>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) > >>> at > >>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > >>> at > >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > >>> ... 1 more > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > >>> No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: > >>> gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) > >>> ... 6 more > >>> 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed > >>> exception: > >>> Exception in mImgtbl: > >>> Arguments: [proj_dir, images.tbl] > >>> Host: localhost > >>> Directory: > >>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > >>> ---- > >>> > >>> Caused by: No 'proxy' provider or alias found. Available > >>> providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > >>> No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: > >>> gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) > >>> at > >>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>> at > >>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > >>> at > >>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > >>> at java.lang.Thread.run(Thread.java:662) > >>> Caused by: No 'proxy' provider or alias found. Available > >>> providers: > >>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>> Aliases: gt4 <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> > >>> condorlocal; cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> > >>> pbslocal; gsiftp-old <-> gridftp-old; local <-> file; > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > >>> No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: > >>> gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > >>> at > >>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) > >>> at > >>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > >>> at > >>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) > >>> at > >>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > >>> at > >>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) > >>> at > >>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) > >>> at > >>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > >>> at > >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) > >>> at > >>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) > >>> at > >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > >>> at > >>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > >>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) > >>> at > >>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > >>> at > >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > >>> ... 1 more > >>> Caused by: > >>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > >>> No > >>> 'proxy' provider or alias found. Available providers: [cobalt, > >>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: > >>> gt4 > >>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor <-> condorlocal; > >>> cobalt <-> cobaltlocal; gsiftp <-> gridftp; pbs <-> pbslocal; > >>> gsiftp-old <-> gridftp-old; local <-> file; > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) > >>> at > >>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) > >>> at > >>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) > >>> ... 6 more > >>> > >>> I am running on PADS and here is my sites.xml file: > >>> > >>> > >>> > >>> > >>> > >>> .05 > >>> 1 > >>> /gpfs/pads/swift/jonmon/Swift/work/localhost > >>> > >>> > >>> > >>> > >>> > >>> 3600 > >>> >>> namespace="globus">192.5.86.6 > >>> 1 > >>> 40 > >>> 1 > >>> 1 > >>> fast > >>> 1 > >>> 10000 > >>> 1 > >>> /gpfs/pads/swift/jonmon/Swift/work/pads > >>> > >>> > >>> > >>> Has anyone else seen this error or Mihael have you witnessed this? > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jon.monette at gmail.com Mon Jan 24 22:03:23 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Mon, 24 Jan 2011 22:03:23 -0600 Subject: [Swift-devel] Re: No provider specified error In-Reply-To: <1295927850.12825.1.camel@blabla2.none> References: <4D3CFD8F.1070700@gmail.com> <4D3E29B3.8010406@gmail.com> <4D3E2A78.3070001@gmail.com> <4D3E4348.8080003@gmail.com> <1295927850.12825.1.camel@blabla2.none> Message-ID: <4D3E4B8B.80500@gmail.com> execution.retries=0 sitedir.keep=true status.mode=provider wrapper.log.always.transfer=true foreach.maxthreads=1024 wrapper.parameter.mode=files use.provider.staging=true provider.staging.pin.swiftfiles=false If the last two lines are commented out my work flow runs. Here is my sites file .05 1 /gpfs/pads/swift/jonmon/Swift/work/localhost 3600 192.5.86.6 1 40 1 1 fast 1 10000 1 /gpfs/pads/swift/jonmon/Swift/work/pads On 1/24/11 9:57 PM, Mihael Hategan wrote: > Not quite. Proxy is just a coaster staging method in which the coaster > service acts as a network proxy between the client and the worker > node(s). > > I do not know what causes your problem. It looks odd. Can you post your > broken configuration (swift.properties and sites file)? > > Mihael > > On Mon, 2011-01-24 at 21:28 -0600, Jonathan Monette wrote: >> Figured out what was wrong. I was using some properties in my sites.xml >> for provider staging per Mike's suggestion. By commenting those lines >> out my workflow began working again. I am assuming for provider staging >> to work a 'proxy' must be set up and I do not believe I have set one >> up. However I am unfamiliar with provider staging, this was a >> suggestion by Mike to try next in a list of things for me to do. 
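For reference, the comments at the end of etc/swift.properties suggest
that provider staging does not need a separate 'proxy' service to be
set up; what it expects is that the pool running the jobs uses the
coaster execution provider and names a staging method through a
swift:stagingMethod profile. A rough sketch of such a localhost pool
(the jobmanager and url values below are illustrative, not taken from
the failing configuration):

    <pool handle="localhost">
      <execution provider="coaster" jobmanager="local:local" url="none"/>
      <profile namespace="swift" key="stagingMethod">file</profile>
      <workdirectory>/gpfs/pads/swift/jonmon/Swift/work/localhost</workdirectory>
    </pool>

With use.provider.staging=true and no stagingMethod profile set, the
method name "proxy" apparently ends up being handed to the plain local
provider as if it were a file resource provider, which is consistent
with the "No 'proxy' provider or alias found" trace above.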
>> >> On 1/24/11 7:42 PM, Jonathan Monette wrote: >>> Scratch that, I do have execution provider="local" set in my >>> sites.xml. I am not sure where the problem is. Still looking. >>> >>> On 1/24/11 7:38 PM, Jonathan Monette wrote: >>>> Ok. I went searching through some of the cog code for this error >>>> when I took a close look at this line from the log file: >>>> >>>> Caused by: >>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >>>> No 'proxy' provider or alias found. Available providers: [cobalt, >>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge] >>>> >>>> I am assuming that all these providers listed are the acceptable >>>> entries in the sites.xml for the execution provider entry. I notice >>>> that localhost isn't an acceptable execution provider. Was localhost >>>> switched to local(which is an acceptable value) in cog-r3043? >>>> >>>> On 1/23/11 10:18 PM, Jonathan Monette wrote: >>>>> Mihael, >>>>> I just updated to cog-r3043 and swift-r4031. I now get this >>>>> swift error: >>>>> >>>>> Execution failed: >>>>> No 'proxy' provider or alias found. Available providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> >>>>> Upon inspecting the log file I got found this Java error: >>>>> >>>>> 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No >>>>> properties for provider proxy. Using empty properties >>>>> 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE >>>>> thread=0-3 tr=mImgtbl >>>>> 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw @ >>>>> execute-default.k, line: 45: Exception in mImgtbl: >>>>> Arguments: [proj_dir, images.tbl] >>>>> Host: localhost >>>>> Directory: >>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>>>> ---- >>>>> >>>>> Exception in mImgtbl: >>>>> Arguments: [proj_dir, images.tbl] >>>>> Host: localhost >>>>> Directory: >>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>>>> ---- >>>>> >>>>> Caused by: No 'proxy' provider or alias found. Available providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >>>>> at >>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >>>>> at java.lang.Thread.run(Thread.java:662) >>>>> Caused by: No 'proxy' provider or alias found. Available providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >>>>> at >>>>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >>>>> at >>>>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >>>>> at >>>>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >>>>> at >>>>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >>>>> at >>>>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >>>>> at >>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>>>> at >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >>>>> ... 1 more >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >>>>> ... 
6 more >>>>> 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed exception: >>>>> Exception in mImgtbl: >>>>> Arguments: [proj_dir, images.tbl] >>>>> Host: localhost >>>>> Directory: >>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>>>> ---- >>>>> >>>>> Caused by: No 'proxy' provider or alias found. Available providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >>>>> at >>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >>>>> at >>>>> 
edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >>>>> at java.lang.Thread.run(Thread.java:662) >>>>> Caused by: No 'proxy' provider or alias found. Available providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >>>>> at >>>>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >>>>> at >>>>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >>>>> at >>>>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >>>>> at >>>>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >>>>> at >>>>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >>>>> at >>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>>>> at >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >>>>> ... 1 more >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >>>>> ... 6 more >>>>> >>>>> I am running on PADS and here is my sites.xml file: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> .05 >>>>> 1 >>>>> /gpfs/pads/swift/jonmon/Swift/work/localhost >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 3600 >>>>> 192.5.86.6 >>>>> 1 >>>>> 40 >>>>> 1 >>>>> 1 >>>>> fast >>>>> 1 >>>>> 10000 >>>>> 1 >>>>> /gpfs/pads/swift/jonmon/Swift/work/pads >>>>> >>>>> >>>>> >>>>> Has anyone else seen this error or Mihael have you witnessed this? > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From jon.monette at gmail.com Mon Jan 24 22:22:30 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Mon, 24 Jan 2011 22:22:30 -0600 Subject: [Swift-devel] Re: No provider specified error In-Reply-To: <994571985.97713.1295928042480.JavaMail.root@zimbra.anl.gov> References: <994571985.97713.1295928042480.JavaMail.root@zimbra.anl.gov> Message-ID: <4D3E5006.7060800@gmail.com> That may have been the problem. I have some jobs running so when they finish I will add that line to see if that fixes it. Because I am running on PADS with all my files stored on gpfs I should use the sfs option correct? On 1/24/11 10:00 PM, Michael Wilde wrote: > Perhaps what was missing was this tag in the sites.xml pool entry (which should have specified execution provider=coaster: > > proxy > > where the method can be file, proxy, or sfs (shared file system). > > You dont need to set up a proxy: coasters does this for you. This is explained (only, at the moment) at the end of etc/swift.properties: > > # Controls whether file staging is done by swift or by the execution > # provider. If set to false, the standard swift staging mechanism is > # used. If set to true, swift does not stage files. Instead, the > # execution provider is instructed to stage files in and out. > # > # Provider staging is experimental. > # > # When enabled, and when coasters are used as an execution provider, > # a staging mechanism can be selected for each site > # using the swift:stagingMethod site profile in sites.xml. 
The > # following is a list of accepted mechanisms: > # > # * file: Staging is done from a filesystem accessible to the > # coaster service (typically running on the head node) > # * proxy: Staging is done from a filesystem accessible to the > # client machine that swift is running on, and is proxied > # through the coaster service > # * sfs: (short for "shared filesystem") Staging is done by > # copying files to and from a filesystem accessible > # by the compute node (such as an NFS or GPFS mount). > > > use.provider.staging=false > provider.staging.pin.swiftfiles=false > > --- > > > - Mike > > > > ----- Original Message ----- >> Figured out what was wrong. I was using some properties in my >> sites.xml >> for provider staging per Mike's suggestion. By commenting those lines >> out my workflow began working again. I am assuming for provider >> staging >> to work a 'proxy' must be set up and I do not believe I have set one >> up. However I am unfamiliar with provider staging, this was a >> suggestion by Mike to try next in a list of things for me to do. >> >> On 1/24/11 7:42 PM, Jonathan Monette wrote: >>> Scratch that, I do have execution provider="local" set in my >>> sites.xml. I am not sure where the problem is. Still looking. >>> >>> On 1/24/11 7:38 PM, Jonathan Monette wrote: >>>> Ok. I went searching through some of the cog code for this error >>>> when I took a close look at this line from the log file: >>>> >>>> Caused by: >>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >>>> No 'proxy' provider or alias found. Available providers: [cobalt, >>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge] >>>> >>>> I am assuming that all these providers listed are the acceptable >>>> entries in the sites.xml for the execution provider entry. I notice >>>> that localhost isn't an acceptable execution provider. Was >>>> localhost >>>> switched to local(which is an acceptable value) in cog-r3043? >>>> >>>> On 1/23/11 10:18 PM, Jonathan Monette wrote: >>>>> Mihael, >>>>> I just updated to cog-r3043 and swift-r4031. I now get this >>>>> swift error: >>>>> >>>>> Execution failed: >>>>> No 'proxy' provider or alias found. Available providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> >>>>> Upon inspecting the log file I got found this Java error: >>>>> >>>>> 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No >>>>> properties for provider proxy. Using empty properties >>>>> 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE >>>>> thread=0-3 tr=mImgtbl >>>>> 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw >>>>> @ >>>>> execute-default.k, line: 45: Exception in mImgtbl: >>>>> Arguments: [proj_dir, images.tbl] >>>>> Host: localhost >>>>> Directory: >>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>>>> ---- >>>>> >>>>> Exception in mImgtbl: >>>>> Arguments: [proj_dir, images.tbl] >>>>> Host: localhost >>>>> Directory: >>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>>>> ---- >>>>> >>>>> Caused by: No 'proxy' provider or alias found. 
Available >>>>> providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >>>>> No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: >>>>> gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >>>>> at >>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >>>>> at java.lang.Thread.run(Thread.java:662) >>>>> Caused by: No 'proxy' provider or alias found. Available >>>>> providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
>>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >>>>> No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. Aliases: >>>>> gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >>>>> at >>>>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >>>>> at >>>>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >>>>> at >>>>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >>>>> at >>>>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >>>>> at >>>>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >>>>> at >>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>>>> at >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >>>>> ... 1 more >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >>>>> No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: >>>>> gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >>>>> ... 6 more >>>>> 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed >>>>> exception: >>>>> Exception in mImgtbl: >>>>> Arguments: [proj_dir, images.tbl] >>>>> Host: localhost >>>>> Directory: >>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>>>> ---- >>>>> >>>>> Caused by: No 'proxy' provider or alias found. Available >>>>> providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >>>>> No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: >>>>> gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >>>>> at >>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >>>>> at >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >>>>> at java.lang.Thread.run(Thread.java:662) >>>>> Caused by: No 'proxy' provider or alias found. Available >>>>> providers: >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >>>>> No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: >>>>> gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >>>>> at >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >>>>> at >>>>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >>>>> at >>>>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >>>>> at >>>>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >>>>> at >>>>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >>>>> at >>>>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >>>>> at >>>>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >>>>> at >>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>>>> at >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >>>>> ... 1 more >>>>> Caused by: >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >>>>> No >>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: >>>>> gt4 >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >>>>> at >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >>>>> at >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >>>>> ... 6 more >>>>> >>>>> I am running on PADS and here is my sites.xml file: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> .05 >>>>> 1 >>>>> /gpfs/pads/swift/jonmon/Swift/work/localhost >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 3600 >>>>> >>>> namespace="globus">192.5.86.6 >>>>> 1 >>>>> 40 >>>>> 1 >>>>> 1 >>>>> fast >>>>> 1 >>>>> 10000 >>>>> 1 >>>>> /gpfs/pads/swift/jonmon/Swift/work/pads >>>>> >>>>> >>>>> >>>>> Has anyone else seen this error or Mihael have you witnessed this? >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From hategan at mcs.anl.gov Mon Jan 24 22:27:27 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Jan 2011 20:27:27 -0800 Subject: [Swift-devel] Re: No provider specified error In-Reply-To: <4D3E4B8B.80500@gmail.com> References: <4D3CFD8F.1070700@gmail.com> <4D3E29B3.8010406@gmail.com> <4D3E2A78.3070001@gmail.com> <4D3E4348.8080003@gmail.com> <1295927850.12825.1.camel@blabla2.none> <4D3E4B8B.80500@gmail.com> Message-ID: <1295929647.13858.1.camel@blabla2.none> I'd suggest removing the entry in sites.xml. And also using "file" as staging method, since the client and coaster service are on the same machine. On Mon, 2011-01-24 at 22:03 -0600, Jonathan Monette wrote: > execution.retries=0 > sitedir.keep=true > status.mode=provider > wrapper.log.always.transfer=true > foreach.maxthreads=1024 > wrapper.parameter.mode=files > use.provider.staging=true > provider.staging.pin.swiftfiles=false > > If the last two lines are commented out my work flow runs. Here is my > sites file > > > > > > .05 > 1 > /gpfs/pads/swift/jonmon/Swift/work/localhost > > > > > 3600 > 192.5.86.6 > 1 > 40 > 1 > 1 > fast > 1 > 10000 > 1 > /gpfs/pads/swift/jonmon/Swift/work/pads > > > > > On 1/24/11 9:57 PM, Mihael Hategan wrote: > > Not quite. Proxy is just a coaster staging method in which the coaster > > service acts as a network proxy between the client and the worker > > node(s). > > > > I do not know what causes your problem. It looks odd. Can you post your > > broken configuration (swift.properties and sites file)? > > > > Mihael > > > > On Mon, 2011-01-24 at 21:28 -0600, Jonathan Monette wrote: > >> Figured out what was wrong. 
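Concretely, for the pads pool that suggestion would come out looking
something like the entry below. The globus profile keys and the
jobmanager string are the usual coasters-over-PBS spelling rather than
anything confirmed from the failing file; the swift:stagingMethod line
is the actual change being suggested, with "file" chosen because the
swift client and the coaster service are on the same machine and can
both read GPFS directly:

    <pool handle="pads">
      <execution provider="coaster" jobmanager="local:pbs" url="none"/>
      <profile namespace="globus" key="maxtime">3600</profile>
      <profile namespace="globus" key="internalHostname">192.5.86.6</profile>
      <profile namespace="globus" key="queue">fast</profile>
      <profile namespace="swift" key="stagingMethod">file</profile>
      <workdirectory>/gpfs/pads/swift/jonmon/Swift/work/pads</workdirectory>
    </pool>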
I was using some properties in my sites.xml > >> for provider staging per Mike's suggestion. By commenting those lines > >> out my workflow began working again. I am assuming for provider staging > >> to work a 'proxy' must be set up and I do not believe I have set one > >> up. However I am unfamiliar with provider staging, this was a > >> suggestion by Mike to try next in a list of things for me to do. > >> > >> On 1/24/11 7:42 PM, Jonathan Monette wrote: > >>> Scratch that, I do have execution provider="local" set in my > >>> sites.xml. I am not sure where the problem is. Still looking. > >>> > >>> On 1/24/11 7:38 PM, Jonathan Monette wrote: > >>>> Ok. I went searching through some of the cog code for this error > >>>> when I took a close look at this line from the log file: > >>>> > >>>> Caused by: > >>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: > >>>> No 'proxy' provider or alias found. Available providers: [cobalt, > >>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge] > >>>> > >>>> I am assuming that all these providers listed are the acceptable > >>>> entries in the sites.xml for the execution provider entry. I notice > >>>> that localhost isn't an acceptable execution provider. Was localhost > >>>> switched to local(which is an acceptable value) in cog-r3043? > >>>> > >>>> On 1/23/11 10:18 PM, Jonathan Monette wrote: > >>>>> Mihael, > >>>>> I just updated to cog-r3043 and swift-r4031. I now get this > >>>>> swift error: > >>>>> > >>>>> Execution failed: > >>>>> No 'proxy' provider or alias found. Available providers: > >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> > >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> > >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; > >>>>> > >>>>> Upon inspecting the log file I got found this Java error: > >>>>> > >>>>> 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No > >>>>> properties for provider proxy. Using empty properties > >>>>> 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE > >>>>> thread=0-3 tr=mImgtbl > >>>>> 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw @ > >>>>> execute-default.k, line: 45: Exception in mImgtbl: > >>>>> Arguments: [proj_dir, images.tbl] > >>>>> Host: localhost > >>>>> Directory: > >>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > >>>>> ---- > >>>>> > >>>>> Exception in mImgtbl: > >>>>> Arguments: [proj_dir, images.tbl] > >>>>> Host: localhost > >>>>> Directory: > >>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > >>>>> ---- > >>>>> > >>>>> Caused by: No 'proxy' provider or alias found. Available providers: > >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> > >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> > >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; > >>>>> Caused by: > >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>>>> 'proxy' provider or alias found. Available providers: [cobalt, > >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; > >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; > >>>>> gsiftp-old<-> gridftp-old; local<-> file; > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) > >>>>> at > >>>>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > >>>>> at > >>>>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > >>>>> at > >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > >>>>> at > >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > >>>>> at java.lang.Thread.run(Thread.java:662) > >>>>> Caused by: No 'proxy' provider or alias found. Available providers: > >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> > >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> > >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; > >>>>> Caused by: > >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>>>> 'proxy' provider or alias found. Available providers: [cobalt, > >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; > >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; > >>>>> gsiftp-old<-> gridftp-old; local<-> file; > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > >>>>> at > >>>>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) > >>>>> at > >>>>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > >>>>> at > >>>>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) > >>>>> at > >>>>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) > >>>>> at > >>>>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) > >>>>> at > >>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > >>>>> at > >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > >>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) > >>>>> at > >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > >>>>> at > >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > >>>>> ... 1 more > >>>>> Caused by: > >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>>>> 'proxy' provider or alias found. Available providers: [cobalt, > >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; > >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; > >>>>> gsiftp-old<-> gridftp-old; local<-> file; > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) > >>>>> ... 6 more > >>>>> 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed exception: > >>>>> Exception in mImgtbl: > >>>>> Arguments: [proj_dir, images.tbl] > >>>>> Host: localhost > >>>>> Directory: > >>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs > >>>>> ---- > >>>>> > >>>>> Caused by: No 'proxy' provider or alias found. Available providers: > >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> > >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> > >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; > >>>>> Caused by: > >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>>>> 'proxy' provider or alias found. Available providers: [cobalt, > >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; > >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; > >>>>> gsiftp-old<-> gridftp-old; local<-> file; > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) > >>>>> at > >>>>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > >>>>> at > >>>>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) > >>>>> at > >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) > >>>>> at > >>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) > >>>>> at java.lang.Thread.run(Thread.java:662) > >>>>> Caused by: No 'proxy' provider or alias found. Available providers: > >>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, > >>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. > >>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> > >>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> > >>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; > >>>>> Caused by: > >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>>>> 'proxy' provider or alias found. Available providers: [cobalt, > >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; > >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; > >>>>> gsiftp-old<-> gridftp-old; local<-> file; > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) > >>>>> at > >>>>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) > >>>>> at > >>>>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) > >>>>> at > >>>>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) > >>>>> at > >>>>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) > >>>>> at > >>>>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) > >>>>> at > >>>>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) > >>>>> at > >>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > >>>>> at > >>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > >>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) > >>>>> at > >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > >>>>> at > >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > >>>>> ... 1 more > >>>>> Caused by: > >>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No > >>>>> 'proxy' provider or alias found. Available providers: [cobalt, > >>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, > >>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 > >>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; > >>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; > >>>>> gsiftp-old<-> gridftp-old; local<-> file; > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) > >>>>> at > >>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) > >>>>> ... 6 more > >>>>> > >>>>> I am running on PADS and here is my sites.xml file: > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> .05 > >>>>> 1 > >>>>> /gpfs/pads/swift/jonmon/Swift/work/localhost > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> 3600 > >>>>> 192.5.86.6 > >>>>> 1 > >>>>> 40 > >>>>> 1 > >>>>> 1 > >>>>> fast > >>>>> 1 > >>>>> 10000 > >>>>> 1 > >>>>> /gpfs/pads/swift/jonmon/Swift/work/pads > >>>>> > >>>>> > >>>>> > >>>>> Has anyone else seen this error or Mihael have you witnessed this? > > > From jon.monette at gmail.com Mon Jan 24 22:29:02 2011 From: jon.monette at gmail.com (Jonathan Monette) Date: Mon, 24 Jan 2011 22:29:02 -0600 Subject: [Swift-devel] Re: No provider specified error In-Reply-To: <1295929647.13858.1.camel@blabla2.none> References: <4D3CFD8F.1070700@gmail.com> <4D3E29B3.8010406@gmail.com> <4D3E2A78.3070001@gmail.com> <4D3E4348.8080003@gmail.com> <1295927850.12825.1.camel@blabla2.none> <4D3E4B8B.80500@gmail.com> <1295929647.13858.1.camel@blabla2.none> Message-ID: <4D3E518E.3030207@gmail.com> Ok. Will give that a try next. Thanks. On 1/24/11 10:27 PM, Mihael Hategan wrote: > I'd suggest removing the entry in sites.xml. And also using > "file" as staging method, since the client and coaster service are on > the same machine. > > On Mon, 2011-01-24 at 22:03 -0600, Jonathan Monette wrote: >> execution.retries=0 >> sitedir.keep=true >> status.mode=provider >> wrapper.log.always.transfer=true >> foreach.maxthreads=1024 >> wrapper.parameter.mode=files >> use.provider.staging=true >> provider.staging.pin.swiftfiles=false >> >> If the last two lines are commented out my work flow runs. Here is my >> sites file >> >> >> >> >> >> .05 >> 1 >> /gpfs/pads/swift/jonmon/Swift/work/localhost >> >> >> >> >> 3600 >> 192.5.86.6 >> 1 >> 40 >> 1 >> 1 >> fast >> 1 >> 10000 >> 1 >> /gpfs/pads/swift/jonmon/Swift/work/pads >> >> >> >> >> On 1/24/11 9:57 PM, Mihael Hategan wrote: >>> Not quite. Proxy is just a coaster staging method in which the coaster >>> service acts as a network proxy between the client and the worker >>> node(s). >>> >>> I do not know what causes your problem. It looks odd. Can you post your >>> broken configuration (swift.properties and sites file)? >>> >>> Mihael >>> >>> On Mon, 2011-01-24 at 21:28 -0600, Jonathan Monette wrote: >>>> Figured out what was wrong. I was using some properties in my sites.xml >>>> for provider staging per Mike's suggestion. 
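[A rough sketch of the configuration being discussed, since the list archiver stripped the XML tags from the quoted sites files. The two properties are copied from the swift.properties excerpt earlier in this thread; the profile element and its namespace are a reconstruction, not Jonathan's exact text.]

    # swift.properties -- the provider-staging lines in question
    use.provider.staging=true
    provider.staging.pin.swiftfiles=false

    # sites.xml -- roughly what "use file as the staging method" looks like
    # when the client and the coaster service run on the same host
    <profile namespace="swift" key="stagingMethod">file</profile>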
By commenting those lines >>>> out my workflow began working again. I am assuming for provider staging >>>> to work a 'proxy' must be set up and I do not believe I have set one >>>> up. However I am unfamiliar with provider staging, this was a >>>> suggestion by Mike to try next in a list of things for me to do. >>>> >>>> On 1/24/11 7:42 PM, Jonathan Monette wrote: >>>>> Scratch that, I do have execution provider="local" set in my >>>>> sites.xml. I am not sure where the problem is. Still looking. >>>>> >>>>> On 1/24/11 7:38 PM, Jonathan Monette wrote: >>>>>> Ok. I went searching through some of the cog code for this error >>>>>> when I took a close look at this line from the log file: >>>>>> >>>>>> Caused by: >>>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: >>>>>> No 'proxy' provider or alias found. Available providers: [cobalt, >>>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge] >>>>>> >>>>>> I am assuming that all these providers listed are the acceptable >>>>>> entries in the sites.xml for the execution provider entry. I notice >>>>>> that localhost isn't an acceptable execution provider. Was localhost >>>>>> switched to local(which is an acceptable value) in cog-r3043? >>>>>> >>>>>> On 1/23/11 10:18 PM, Jonathan Monette wrote: >>>>>>> Mihael, >>>>>>> I just updated to cog-r3043 and swift-r4031. I now get this >>>>>>> swift error: >>>>>>> >>>>>>> Execution failed: >>>>>>> No 'proxy' provider or alias found. Available providers: >>>>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> >>>>>>> Upon inspecting the log file I got found this Java error: >>>>>>> >>>>>>> 2011-01-23 18:31:27,221-0600 INFO AbstractionProperties No >>>>>>> properties for provider proxy. Using empty properties >>>>>>> 2011-01-23 18:31:27,227-0600 INFO vdl:execute END_FAILURE >>>>>>> thread=0-3 tr=mImgtbl >>>>>>> 2011-01-23 18:31:27,234-0600 DEBUG VDL2ExecutionContext sys:throw @ >>>>>>> execute-default.k, line: 45: Exception in mImgtbl: >>>>>>> Arguments: [proj_dir, images.tbl] >>>>>>> Host: localhost >>>>>>> Directory: >>>>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>>>>>> ---- >>>>>>> >>>>>>> Exception in mImgtbl: >>>>>>> Arguments: [proj_dir, images.tbl] >>>>>>> Host: localhost >>>>>>> Directory: >>>>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>>>>>> ---- >>>>>>> >>>>>>> Caused by: No 'proxy' provider or alias found. Available providers: >>>>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> Caused by: >>>>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>>>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >>>>>>> at >>>>>>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >>>>>>> at >>>>>>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >>>>>>> at >>>>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >>>>>>> at >>>>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >>>>>>> at java.lang.Thread.run(Thread.java:662) >>>>>>> Caused by: No 'proxy' provider or alias found. Available providers: >>>>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> Caused by: >>>>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>>>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >>>>>>> at >>>>>>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >>>>>>> at >>>>>>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >>>>>>> at >>>>>>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >>>>>>> at >>>>>>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >>>>>>> at >>>>>>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >>>>>>> at >>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>>>>>> at >>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>>>>>> at >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >>>>>>> at >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >>>>>>> ... 1 more >>>>>>> Caused by: >>>>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>>>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >>>>>>> ... 6 more >>>>>>> 2011-01-23 18:31:27,274-0600 INFO ExecutionContext Detailed exception: >>>>>>> Exception in mImgtbl: >>>>>>> Arguments: [proj_dir, images.tbl] >>>>>>> Host: localhost >>>>>>> Directory: >>>>>>> rectified-20110123-1707-ptqrq9j5/jobs/3/mImgtbl-3krv2x4kTODO: outs >>>>>>> ---- >>>>>>> >>>>>>> Caused by: No 'proxy' provider or alias found. Available providers: >>>>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> Caused by: >>>>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>>>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.childCompleted(AbstractSequentialWithArguments.java:192) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:32) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:340) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:181) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:309) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:50) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:26) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:238) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:289) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:402) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:343) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:230) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:173) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:44) >>>>>>> at >>>>>>> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >>>>>>> at >>>>>>> edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166) >>>>>>> at >>>>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643) >>>>>>> at >>>>>>> edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668) >>>>>>> at java.lang.Thread.run(Thread.java:662) >>>>>>> Caused by: No 'proxy' provider or alias found. Available providers: >>>>>>> [cobalt, gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, >>>>>>> http, coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. >>>>>>> Aliases: gt4<-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> >>>>>>> condorlocal; cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> >>>>>>> pbslocal; gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> Caused by: >>>>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>>>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:36) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.events.FailureNotificationEvent.(FailureNotificationEvent.java:42) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.FlowNode.failImmediately(FlowNode.java:153) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.grid.GridExec.taskFailed(GridExec.java:373) >>>>>>> at >>>>>>> org.globus.cog.karajan.workflow.nodes.grid.AbstractGridNode.statusChanged(AbstractGridNode.java:276) >>>>>>> at >>>>>>> org.griphyn.vdl.karajan.lib.Execute.statusChanged(Execute.java:127) >>>>>>> at >>>>>>> org.globus.cog.karajan.scheduler.AbstractScheduler.fireJobStatusChangeEvent(AbstractScheduler.java:168) >>>>>>> at >>>>>>> org.globus.cog.karajan.scheduler.LateBindingScheduler.statusChanged(LateBindingScheduler.java:675) >>>>>>> at >>>>>>> org.globus.cog.karajan.scheduler.WeightedHostScoreScheduler.statusChanged(WeightedHostScoreScheduler.java:422) >>>>>>> at >>>>>>> org.griphyn.vdl.karajan.VDSAdaptiveScheduler.statusChanged(VDSAdaptiveScheduler.java:410) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:240) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:228) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.AbstractDelegatedTaskHandler.failTask(AbstractDelegatedTaskHandler.java:54) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:220) >>>>>>> at >>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>>>>>> at >>>>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>>>>>> at >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >>>>>>> at >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >>>>>>> ... 1 more >>>>>>> Caused by: >>>>>>> org.globus.cog.abstraction.impl.common.task.InvalidProviderException: No >>>>>>> 'proxy' provider or alias found. Available providers: [cobalt, >>>>>>> gsiftp, coaster, webdav, dcache, ssh, gt4, gt2, condor, http, >>>>>>> coaster-persistent, pbs, ftp, gsiftp-old, local, sge]. 
Aliases: gt4 >>>>>>> <-> gt4.0.2, gt4.0.1, gt3.9.5, gt4.0.0; condor<-> condorlocal; >>>>>>> cobalt<-> cobaltlocal; gsiftp<-> gridftp; pbs<-> pbslocal; >>>>>>> gsiftp-old<-> gridftp-old; local<-> file; >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.AbstractionProperties.getProperties(AbstractionProperties.java:138) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newObject(AbstractionFactory.java:136) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.common.AbstractionFactory.newFileResource(AbstractionFactory.java:112) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.copy(JobSubmissionTaskHandler.java:293) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stage(JobSubmissionTaskHandler.java:272) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.stageIn(JobSubmissionTaskHandler.java:283) >>>>>>> at >>>>>>> org.globus.cog.abstraction.impl.execution.local.JobSubmissionTaskHandler.run(JobSubmissionTaskHandler.java:159) >>>>>>> ... 6 more >>>>>>> >>>>>>> I am running on PADS and here is my sites.xml file: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> .05 >>>>>>> 1 >>>>>>> /gpfs/pads/swift/jonmon/Swift/work/localhost >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> 3600 >>>>>>> 192.5.86.6 >>>>>>> 1 >>>>>>> 40 >>>>>>> 1 >>>>>>> 1 >>>>>>> fast >>>>>>> 1 >>>>>>> 10000 >>>>>>> 1 >>>>>>> /gpfs/pads/swift/jonmon/Swift/work/pads >>>>>>> >>>>>>> >>>>>>> >>>>>>> Has anyone else seen this error or Mihael have you witnessed this? > -- Jon Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination. - Albert Einstein From benc at hawaga.org.uk Tue Jan 25 09:05:23 2011 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 25 Jan 2011 15:05:23 +0000 (GMT) Subject: [Swift-devel] [provenance-challenge] A formal account of the open provenance model (fwd) Message-ID: Work on OPM still proceeds, it seems... -- http://www.hawaga.org.uk/ben/ ---------- Forwarded message ---------- Date: Tue, 25 Jan 2011 14:28:22 +0000 From: Luc Moreau Reply-To: provenance-challenge at ipaw.info To: provenance-challenge at ipaw.info, "public-xg-prov at w3.org" Cc: Jan Van den Bussche Subject: [provenance-challenge] A formal account of the open provenance model To the provenance community, Natalia, Jan and myself are pleased to announce the availability of the following paper, which can be downloaded from http://eprints.ecs.soton.ac.uk/21819/ A formal account of the open provenance model. Natalia Kwasnikowska, Luc Moreau, and Jan Van den Bussche. The Open Provenance Model (OPM) is a community data model for provenance that is designed to facilitate the meaningful interchange of provenance information between systems. Underpinning OPM, is a notion of directed graph, used to represent data products and processes in- volved in past computations, and dependencies between them; it is complemented by inference rules allowing new dependencies to be derived. The Open Provenance Model was designed from requirements captured in two `Provenance Challenges', and tested during the third: these challenges were international, multi-disciplinary activities aiming to exchange provenance information between multiple systems and query it. The design of OPM was mostly driven by practical and pragmatic considerations. 
The purpose of this paper is to formalize the theory underpinning this data model. Specifically, this paper proposes a temporal semantics for OPM graphs, defined in terms of a set of ordering constraints between time-points associated with OPM constructs. OPM inferences are characterized with respect to this temporal semantics, and a novel set of patterns is introduced to establish soundness and completeness properties. Building on this novel foundation, the paper proposes new definitions for graph algebraic operations, graph refinement and the notion of account, by which multiple descriptions of a same execution are allowed to co-exist in a same graph. Overall, this paper provides a strong theoretical underpinning to a data model being adopted by a community of users that help its disambiguation and promote inter-operability. Best regards, Natalia, Jan and Luc -- Professor Luc Moreau Electronics and Computer Science tel: +44 23 8059 4487 University of Southampton fax: +44 23 8059 2865 Southampton SO17 1BJ email: l.moreau at ecs.soton.ac.uk United Kingdom http://www.ecs.soton.ac.uk/~lavm From tim.g.armstrong at gmail.com Tue Jan 25 16:58:22 2011 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 25 Jan 2011 16:58:22 -0600 Subject: [Swift-devel] Running a command on all coasters Message-ID: Hi All, Is there any mechanism in Swift where I could get swift to execute a command on all of the worker machines that were currently attached to the coasters instance? I basically want to stage some data out to all of the workers and run some setup scripts in preparation for running a whole bunch of tasks. I'm not sure I actually want to do this, just assessing the feasibility of the option. - Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: From tim.g.armstrong at gmail.com Tue Jan 25 16:58:45 2011 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 25 Jan 2011 16:58:45 -0600 Subject: [Swift-devel] Swift compile issues on IBI cluster In-Reply-To: <1295892465.30587.1.camel@blabla2.none> References: <549693844.91902.1295700787731.JavaMail.root@zimbra.anl.gov> <1291420067.91915.1295701773409.JavaMail.root@zimbra.anl.gov> <1295728153.5253.5.camel@blabla2.none> <1295892465.30587.1.camel@blabla2.none> Message-ID: I don't know if it pays to worry about it then - Tim On Mon, Jan 24, 2011 at 12:07 PM, Mihael Hategan wrote: > Then my theory is wrong. > > I can't do much without being able to reproduce this. So there are two > choices: get access to the IBI cluster or let it go. > > Mihael > > On Mon, 2011-01-24 at 08:37 -0600, Tim Armstrong wrote: > > It doesn't appear that changing the version of xerces helps. I'm not > > that familiar with the java.endorsed.dirs mechanism, but I believe I'm > > using it correctly: adding the xerces jars to the class path and then > > adding the containing directory as an endorsed dir. > > > > I still get exactly the same failure. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aespinosa at cs.uchicago.edu Tue Jan 25 17:00:47 2011 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 25 Jan 2011 17:00:47 -0600 Subject: [Swift-devel] Running a command on all coasters In-Reply-To: References: Message-ID: Hi Tim, If you send lots of jobs, it should run on all the workers. -Allan 2011/1/25 Tim Armstrong : > Hi All, > ? Is there any mechanism in Swift where I could get swift to execute a > command? 
on all of the worker machines that were currently attached to the > coasters instance?? I basically want to stage some data out to all of the > workers and run some setup scripts in preparation for running a whole bunch > of tasks. > > I'm not sure I actually want to do this, just assessing the feasibility of > the option. > > ?- Tim > From wilde at mcs.anl.gov Tue Jan 25 17:06:10 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Jan 2011 17:06:10 -0600 (CST) Subject: [Swift-devel] Running a command on all coasters In-Reply-To: Message-ID: <326924550.101610.1295996770131.JavaMail.root@zimbra.anl.gov> Other than what Allan says below, there is no mechanism to do this, although such a function has been discussed in the past and there is interest in adding such a feature. It would likely apply to coaster workers only, though, and thus is semantically awkward at the language level. Note that SwiftR uses a simple test to see if the initializer expression needs to be executed by the worker based on whether it has changed since the last R function eval. - Mike ----- Original Message ----- > Hi Tim, > > If you send lots of jobs, it should run on all the workers. > > -Allan > > 2011/1/25 Tim Armstrong : > > Hi All, > > ? Is there any mechanism in Swift where I could get swift to execute > > ? a > > command on all of the worker machines that were currently attached > > to the > > coasters instance? I basically want to stage some data out to all of > > the > > workers and run some setup scripts in preparation for running a > > whole bunch > > of tasks. > > > > I'm not sure I actually want to do this, just assessing the > > feasibility of > > the option. > > > > ?- Tim > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wozniak at mcs.anl.gov Tue Jan 25 17:06:51 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Tue, 25 Jan 2011 17:06:51 -0600 (CST) Subject: [Swift-devel] Running a command on all coasters In-Reply-To: References: Message-ID: Hello Tim I've looked into this a bit. First, there is the WORKERSHELLCMD RPC that is already in there. You could call that from, say, an AllocationHook. Second, there is the WORKER_COPIES functionality (cf. Settings.java and worker.pl). This lets you perform arbitrary file copies at worker start time. I was thinking the easiest thing might be to add a WORKER_SETUP variable that would be similar to WORKER_COPIES and would execute right after the WORKER_COPIES are done. So, say, you could upload a script and a tarball and perform a setup operation. We could also have a WORKER_SHUTDOWN variable. Justin On Tue, 25 Jan 2011, Tim Armstrong wrote: > Hi All, > Is there any mechanism in Swift where I could get swift to execute a > command on all of the worker machines that were currently attached to the > coasters instance? I basically want to stage some data out to all of the > workers and run some setup scripts in preparation for running a whole bunch > of tasks. > > I'm not sure I actually want to do this, just assessing the feasibility of > the option. 
> > - Tim > -- Justin M Wozniak From hategan at mcs.anl.gov Tue Jan 25 18:13:09 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Jan 2011 16:13:09 -0800 Subject: [Swift-devel] Running a command on all coasters In-Reply-To: References: Message-ID: <1296000789.590.1.camel@blabla2.none> Right. I don't think it would be very difficult to have a broadcast of some command to all the workers. It would go along the lines of what Justin is saying. Mihael On Tue, 2011-01-25 at 17:06 -0600, Justin M Wozniak wrote: > Hello Tim > I've looked into this a bit. > First, there is the WORKERSHELLCMD RPC that is already in there. > You could call that from, say, an AllocationHook. > Second, there is the WORKER_COPIES functionality (cf. > Settings.java and worker.pl). This lets you perform arbitrary file copies > at worker start time. > I was thinking the easiest thing might be to add a WORKER_SETUP > variable that would be similar to WORKER_COPIES and would execute right > after the WORKER_COPIES are done. So, say, you could upload a script and > a tarball and perform a setup operation. > We could also have a WORKER_SHUTDOWN variable. > Justin > > On Tue, 25 Jan 2011, Tim Armstrong wrote: > > > Hi All, > > Is there any mechanism in Swift where I could get swift to execute a > > command on all of the worker machines that were currently attached to the > > coasters instance? I basically want to stage some data out to all of the > > workers and run some setup scripts in preparation for running a whole bunch > > of tasks. > > > > I'm not sure I actually want to do this, just assessing the feasibility of > > the option. > > > > - Tim > > > From tim.g.armstrong at gmail.com Tue Jan 25 20:47:53 2011 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 25 Jan 2011 20:47:53 -0600 Subject: [Swift-devel] Running a command on all coasters In-Reply-To: <1296000789.590.1.camel@blabla2.none> References: <1296000789.590.1.camel@blabla2.none> Message-ID: Cool, thanks everyone - that was helpful - Tim On Tue, Jan 25, 2011 at 6:13 PM, Mihael Hategan wrote: > Right. I don't think it would be very difficult to have a broadcast of > some command to all the workers. It would go along the lines of what > Justin is saying. > > Mihael > > On Tue, 2011-01-25 at 17:06 -0600, Justin M Wozniak wrote: > > Hello Tim > > I've looked into this a bit. > > First, there is the WORKERSHELLCMD RPC that is already in there. > > You could call that from, say, an AllocationHook. > > Second, there is the WORKER_COPIES functionality (cf. > > Settings.java and worker.pl). This lets you perform arbitrary file > copies > > at worker start time. > > I was thinking the easiest thing might be to add a WORKER_SETUP > > variable that would be similar to WORKER_COPIES and would execute right > > after the WORKER_COPIES are done. So, say, you could upload a script and > > a tarball and perform a setup operation. > > We could also have a WORKER_SHUTDOWN variable. > > Justin > > > > On Tue, 25 Jan 2011, Tim Armstrong wrote: > > > > > Hi All, > > > Is there any mechanism in Swift where I could get swift to execute a > > > command on all of the worker machines that were currently attached to > the > > > coasters instance? I basically want to stage some data out to all of > the > > > workers and run some setup scripts in preparation for running a whole > bunch > > > of tasks. > > > > > > I'm not sure I actually want to do this, just assessing the feasibility > of > > > the option. 
> > > > > > - Tim > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Jan 26 13:06:43 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 26 Jan 2011 11:06:43 -0800 Subject: [Swift-devel] Error 521 provider-staging files to PADS nodes In-Reply-To: References: <1112858620.81735.1295465835538.JavaMail.root@zimbra.anl.gov> <1295471524.6134.0.camel@blabla2.none> <1295818334.4211.1.camel@blabla2.none> <1295862088.29849.6.camel@blabla2.none> <1295916851.9232.2.camel@blabla2.none> Message-ID: <1296068803.17931.14.camel@blabla2.none> First we ignore wpn, or rather said we calculate the total worker throughput (for all 4 concurrent jobs per worker). In any event that stays a constant, so when I say 1 worker I mean 1 worker with 4 concurrent jobs. I'm doing that to remove the job life-cycle latencies from the picture, and keep I/O at maximum. That said, here's the summary: a. 1 worker (clearly on one node): 80 MB/s in/80 out aggregate b. 2 workers on the same node: 80 MB/s in/80 out aggregate c. 2 workers on different nodes: 20 MB/s in/20 out aggregate I ran these a sufficiently large number of times to not believe that the difference can be attributed to statistical variation. If what you say were true (job scheduled along other jobs on the same node), then I believe that (a) would also have 20 MB/s. Mihael On Wed, 2011-01-26 at 11:02 -0600, Allan Espinosa wrote: > Shouldn't we use ppn=4 to guarantee different nodes? > > It might be the case that the 3 other cores got assigned to other jobs > by PBS. > > -Allan (mobile) > On Jan 24, 2011 6:55 PM, "Mihael Hategan" wrote: > > > > And then here's the funny thing: > > 2 workers, 4 wpn. > > When running with ppn=2 (so both on the same node): > > [IN]: Total transferred: 7.99 GB, current rate: 13.07 MB/s, average > > rate: 85.23 MB/s > > [OUT] Total transferred: 8 GB, current rate: 42 B/s, average rate: > 85.38 > > MB/s > > > > Same situation, but with ppn=1 (so the two are on different nodes): > > [IN]: Total transferred: 5.83 GB, current rate: 20.79 MB/s, average > > rate: 20.31 MB/s > > [OUT] Total transferred: 5.97 GB, current rate: 32.01 MB/s, average > > rate: 20.8 MB/s > > > > This, to me, looks fine because it's the opposite of what I'm > expecting. > > The service itself should see no difference between the two, and I > > suspect it doesn't. But something else is going on. Any ideas? > > > > Mihael > > > > > > On Mon, 2011-01-24 at 01:41 -0800, Mihael Hategan wrote: > > > Play with buffer sizes and ye shall be rewarded. > > > > > > Turns out that setting TCP buffer sizes to obscene numbers, like > 2M, > > > gives you quite a bit: 70MB/s in + 70MB/s out on average. Those > pads > > > nodes must have some fast disks (though maybe it's just the > cache). > > > > > > This is with 1 worker and 4wpn. I'm assuming that with many > workers, the > > > fact that each worker connection has its separate buffer will > > > essentially achieve a similar effect. But then there should be an > option > > > for setting the buffer size. > > > > > > The numbers are attached. This all goes from head node local disk > to > > > worker node local disk directly, so there is no nfs. I'd be > curious to > > > know how that compares, but I am done for the day. > > > > > > Mihael > > > > > > On Sun, 2011-01-23 at 13:32 -0800, Mihael Hategan wrote: > > > > I'm trying to run tests on pads. The queues aren't quite empty. 
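[For reference, a minimal self-contained illustration of the 2 MB TCP buffer setting mentioned above. The committed coaster patch is not shown in this thread, so this is only the standard JDK mechanism such a change would use -- the class name and the connect/report steps are illustrative, not the actual code.]

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class BufferSizeSketch {
        public static void main(String[] args) throws Exception {
            // Request 2 MB socket buffers; setting the receive buffer before
            // connect() lets the larger window be advertised during the handshake.
            Socket s = new Socket();
            s.setReceiveBufferSize(2 * 1024 * 1024);
            s.setSendBufferSize(2 * 1024 * 1024);
            s.connect(new InetSocketAddress(args[0], Integer.parseInt(args[1])));
            // The OS may clamp the request; these calls report what was granted.
            System.out.println("recv buffer: " + s.getReceiveBufferSize());
            System.out.println("send buffer: " + s.getSendBufferSize());
            s.close();
        }
    }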
> In the > > > > mean time, I committed a bit of a patch to trunk to measure > aggregate > > > > traffic on TCP channels (those are only used by the workers). > You can > > > > enable it by setting the "tcp.channel.log.io.performance" system > > > > property to "true". > > > > > > > > Mihael > > > > > > > > On Wed, 2011-01-19 at 13:12 -0800, Mihael Hategan wrote: > > > > > might be due to one of the recent patches. > > > > > > > > > > you could try to set IOBLOCKSZ to 1 in worker.pl and rerun. > > > > > > > > > > On Wed, 2011-01-19 at 13:37 -0600, Michael Wilde wrote: > > > > > > An interesting observation on the returned output files: > there are exactly 33 files in the output dir from this run: the same > as the number of jobs Swift reports as Finished successfully. But of > those 33, the last 4 are only of partial length, and one of the 4 is > length zero (see below). > > > > > > > > > > > > Its surprising and perhaps a bug that the jobs are reported > finished before the output file is fully written??? > > > > > > > > > > > > Also this 3-partial plus 1-zero file looks to me like one > worker staging op hung (the oldest of the 4 incomplete output files) > and then perhaps 3 were cut short when the coaster service data > protocol froze? > > > > > > > > > > > > - Mike > > > > > > > > > > > > login1$ pwd > > > > > > /scratch/local/wilde/lab > > > > > > login1$ cd outdir > > > > > > login1$ ls -lt | grep 10:48 > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out > > > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out > > > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out > > > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out > > > > > > -rw-r--r-- 1 wilde ci-users 
3010301 Jan 19 10:48 f.0016.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out > > > > > > login1$ ls -lt | grep 10:48 | wc -l > > > > > > 33 > > > > > > login1$ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > Mihael, > > > > > > > > > > > > > > The following test on pads failed/hung with an error 521 > from > > > > > > > worker.pl: > > > > > > > > > > > > > > --- > > > > > > > sub getFileCBDataInIndirect { > > > > > > > ... > > > > > > > elsif ($timeout) { > > > > > > > queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", > "Timeout > > > > > > > staging in file")); > > > > > > > delete($JOBDATA{$jobid}); > > > > > > > --- > > > > > > > > > > > > > > single foreach loop, doing 1,000 "mv" commands > > > > > > > > > > > > > > throttle was 200 jobs to this coaster pool (1 4-node > 32-core PBS job): > > > > > > > > > > > > > > > > > > > > > jobmanager="local:pbs"/> > > > > > > > key="workersPerNode">8 > > > > > > > 3500 > > > > > > > 1 > > > > > > > key="nodeGranularity">4 > > > > > > > 4 > > > > > > > short > > > > > > > key="jobThrottle">2.0 > > > > > > > key="initialScore">10000 > > > > > > > > > > > > > > > /scratch/local/wilde/test/swiftwork > > > > > > > key="stagingMethod">file > > > > > > > /scratch/local/wilde/swiftscratch > > > > > > > > > > > > > > > > > > > > > Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or > more workers > > > > > > > timed out. Note that the hang may have happened earlier, > as no new > > > > > > > jobs were starting as the jobs in the first wave were > finishing. > > > > > > > > > > > > > > time swift -tc.file tc -sites.file pbscoasters.xml -config > cf.ps > > > > > > > mvn.swift -n=1000 >& out & > > > > > > > > > > > > > > > > > > > > > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI > net. 
> > > > > > > > > > > > > > Swift stdout showed the following after waiting a while > for a 4-node > > > > > > > PADS coaster allocation to start: > > > > > > > > > > > > > > Progress: Selecting site:799 Submitted:201 > > > > > > > Progress: Selecting site:799 Submitted:201 > > > > > > > Progress: Selecting site:799 Submitted:200 Active:1 > > > > > > > Progress: Selecting site:798 Submitted:177 Active:24 > Finished > > > > > > > successfully:1 > > > > > > > Progress: Selecting site:796 Submitted:172 Active:28 > Finished > > > > > > > successfully:4 > > > > > > > Progress: Selecting site:792 Submitted:176 Active:24 > Finished > > > > > > > successfully:8 > > > > > > > Progress: Selecting site:788 Submitted:180 Active:20 > Finished > > > > > > > successfully:12 > > > > > > > Progress: Selecting site:784 Submitted:184 Active:16 > Finished > > > > > > > successfully:16 > > > > > > > Progress: Selecting site:780 Submitted:188 Active:12 > Finished > > > > > > > successfully:20 > > > > > > > Progress: Selecting site:777 Submitted:191 Active:9 > Finished > > > > > > > successfully:23 > > > > > > > Progress: Selecting site:773 Submitted:195 Active:5 > Finished > > > > > > > successfully:27 > > > > > > > Progress: Selecting site:770 Submitted:197 Active:3 > Finished > > > > > > > successfully:30 > > > > > > > Progress: Selecting site:767 Submitted:200 Finished > successfully:33 > > > > > > > Progress: Selecting site:766 Submitted:201 Finished > successfully:33 > > > > > > > Progress: Selecting site:766 Submitted:201 Finished > successfully:33 > > > > > > > Progress: Selecting site:766 Submitted:201 Finished > successfully:33 > > > > > > > Progress: Selecting site:766 Submitted:201 Finished > successfully:33 > > > > > > > Progress: Selecting site:766 Submitted:201 Finished > successfully:33 > > > > > > > Progress: Selecting site:766 Submitted:201 Finished > successfully:33 > > > > > > > Progress: Selecting site:766 Submitted:200 Active:1 > Finished > > > > > > > successfully:33 > > > > > > > Execution failed: > > > > > > > Job failed with an exit code of 521 > > > > > > > login1$ > > > > > > > login1$ > > > > > > > login1$ pwd > > > > > > > /scratch/local/wilde/lab > > > > > > > login1$ ls -lt | head > > > > > > > total 51408 > > > > > > > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51 > > > > > > > mvn-20110119-0956-s3s8h9h2.log > > > > > > > > > > > > > > (copied to ~wilde) > > > > > > > > > > > > > > script was: > > > > > > > > > > > > > > login1$ cat mvn.swift > > > > > > > type file; > > > > > > > > > > > > > > app (file o) mv (file i) > > > > > > > { > > > > > > > mv @i @o; > > > > > > > } > > > > > > > > > > > > > > file out[] > > > > > > prefix="f.",suffix=".out">; > > > > > > > foreach j in [1:@toint(@arg("n","1"))] { > > > > > > > file data<"data.txt">; > > > > > > > out[j] = mv(data); > > > > > > > } > > > > > > > > > > > > > > > > > > > > > data.txt was 3MB > > > > > > > > > > > > > > A look at the outdir gives a clue to where things hung: > The files of > > > > > > > <= ~3MB from time 10:48 are from this job. 
Files from > 10:39 and > > > > > > > earlier are from other manual runs executed on login1, > Note that 3 of > > > > > > > the 3MB output files have length 0 or <3MB, and were > likely in transit > > > > > > > back from the worker: > > > > > > > > > > > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 > f.0125.out > > > > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 > f.0167.out > > > > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > > > > > > > > > > > > > > > > > > login1$ pwd > > > > > > > /scratch/local/wilde/lab > > > > > > > login1$ cd outdir > > > > > > > login1$ ls -lt | head -40 > > > > > > > total 2772188 > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0023.out > > > > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 > f.0125.out > > > > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 > f.0167.out > > > > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0336.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0380.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0015.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0204.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0379.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0066.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0221.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0281.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0403.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0142.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0187.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0067.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0081.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0134.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0136.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0146.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0254.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0362.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0312.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0370.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0389.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0027.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0094.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0183.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0363.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0016.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0025.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0429.out > > > > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 > f.0239.out > > > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 > f.0024.out > > > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 > f.0037.out > > > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 > f.0001.out > > > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 > f.0042.out > > > > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 > f.0033.out > > > > > 
> > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 > f.0051.out > > > > > > > l > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Michael Wilde > > > > > > > Computation Institute, University of Chicago > > > > > > > Mathematics and Computer Science Division > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Swift-devel mailing list > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From iraicu at cs.iit.edu Fri Jan 28 13:42:28 2011 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 28 Jan 2011 13:42:28 -0600 Subject: [Swift-devel] CFP: IEEE 2011 Fifth International Workshop on Scientific Workflows (SWF 2011) Message-ID: <4D431C24.10408@cs.iit.edu> CALL FOR PAPERS IEEE 2011 Fifth International Workshop on Scientific Workflows (SWF 2011) WashingtonDC, USA, one day between July 5-10, 2011, inconjunction with ICWS 2011 , SCC 2011 , SERVICES , and CLOUD Description Scientific workflows have become an increasingly popular paradigm for scientists to formalize and structure complex scientific processes to enable and accelerate many significant scientific discoveries. A scientific workflow is a formal specification of a scientific process, which represents, streamlines, and automates the analytical and computational steps that a scientist needs to go through from dataset selection and integration, computation and analysis, to final data product presentation and visualization. A scientific workflow management system (SWFMS) is a system that supports the specification, modification, execution, failure handling, and monitoring of a scientific workflow using the workflow logic to control the order of executing workflow tasks. The importance of scientific workflows has been recognized by NSF since 2006 and was reemphasized recently in an science article titled "Beyond the Data Deluge" (Science, Vol. 323. no. 5919, pp. 1297 -- 1298, 2009), which concluded, "In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies." An emerging trend in scientific workflow management research and systems is the convergence of concepts, techniques, and tools from both scientific workflow and enterprise workflow areas. Although scientific workflow systems and enterprise workflow areas have evolved in parallel, each has adopted and incorporated the best practices and ideas from the other area. 
One of the main areas of interest is this emerging convergence. A concrete example is the leverage of enterprise workflow tools and systems in solving scientific/engineering workflow problems, particularly in data centers and cloud computing environments. In response to this trend, this year, we like to expand the scope of SWF to include topics for enterprise workflows as well to foster the interaction between these two areas. The First IEEE International Workshop on Scientific Workflows (SWF 2007) was launched at Salt Lake city, Utah, as part of the First IEEE World Congress on Services (SERVICES 2007), in conjunction with IEEE SCC/ICWS 2007, attracting around 20 attendants including 5 presenters and a dozen of submissions. SWF 2008 was held in Honolulu, Hawaii, in conjunction with IEEE SCC, with around 25 attendants including 9 presenters (3 of them were invited speakers) and a dozen of submissions. SWF 2009 was held in Los Angeles, CA, in conjunction with IEEE ICWS, with around 30 attendants including 20 presenters (10 for regular papers, and 10 for short papers). SWF 2009 also enjoyed the event of the launch of the first IEEE International Conference on Cloud Computing (CLOUD 2009). SWF 2010 was held in Miami, Florida, with around 25 attendants (9 papers), in conjunction with IEEE CLOUD/ICWS/SCC. Authors are invited to submit regular papers (8 pages) and short papers (4 pages) that show original unpublished research results in all areas of scientific workflows and enterprise workflows. Topics of interest are listed below; however, submissions on all aspects of scientific workflows and enterprise workflows are welcome. Accepted papers will be included in the proceedings of IEEE SERVICES 2011, which will be published by IEEE Computer Society Press. List of topics ?Scientific workflow provenance management and analytics ?Scientific workflow data, metadata, service, and task management ?Scientific workflow architectures, models, languages, systems, and algorithms ?Scientific workflow monitoring, debugging, exception handling, and fault tolerance ?Streaming data processing in scientific workflows ?Pipelined, data, workflow, and task parallelism in scientific workflows ?Cloud, Service, Grid, or hybrid scientific workflows ?Data, metadata, compute, user-interaction, or visualization-intensive scientific workflows ?Semantic techniques for scientific workflows ?Scientific workflow composition ?Security issues in scientific workflows ?Data integration and service integration in scientific workflows ?Scientific workflow mapping, optimization, and scheduling ?Scientific workflow modeling, simulation, analysis, and verification ?Scalability, reliability, extensibility, agility, and interoperability ?Scientific workflow applications and case studies ?Enterprise service workflow management and enterprise services computing ?Enterprise workflow cooperation and collaboration Important dates ?Paper SubmissionFebruary 21, 2011 ?Decision Notification (Electronic)March 21, 2011 ?Camera-Ready Submission & Pre-registrationApril 8, 2011 Paper submission Authors are invited to submit full papers (about 8 pages) or short papers (about 4 pages) as per IEEE 8.5 x 11 manuscript guidelines (http://www.computer.org/cspress/instruct.htm). All papers should be in PDF and submitted via the SWF Submission/Review system . First time users need to register with the system first. 
All the accepted papers by the workshops will be included in the Proceedings of the Seventh IEEE 2011 World Congress on Services (SERVICES 2011), which will be published by IEEE Computer Society.

Workshop chairs
- Shiyong Lu, Wayne State University, USA, Email: shiyong at wayne.edu
- Calton Pu, Georgia Tech, USA, Email: calton.pu at cc.gatech.edu

Publicity chairs
- Ilkay Altintas, San Diego Supercomputer Center, USA
- Yong Zhao, University of Electronic Science and Technology of China, PR China
- Paolo Missier, University of Manchester, UK

Publication chair
- Xubo Fei, Wayne State University, USA, Email: xubo at wayne.edu

Program committee
- Jamal Alhiyafi, University of Dammam, Saudi Arabia
- Ilkay Altintas, San Diego Supercomputer Center, U.S.A.
- Roger Barga, Microsoft Research, U.S.A.
- Adam Barker, University of St Andrews, U.K.
- Adam Belloum, University of Amsterdam, the Netherlands
- Shawn Bowers, Gonzaga University, U.S.A.
- Bin Cao, Teradata Corporation, U.S.A.
- Artem Chebotko, University of Texas at Pan American, U.S.A.
- Jinjun Chen, Swinburne University of Technology, Australia
- Susan Davidson, University of Pennsylvania, U.S.A.
- Thomas Fahringer, University of Innsbruck, Austria
- Hasan Jamil, Wayne State University, U.S.A.
- Carole Goble, University of Manchester, U.K.
- Ian Gorton, Pacific Northwest National Laboratory, U.S.A.
- Paul Groth, University of Amsterdam, the Netherlands
- Zoé Lacroix, Arizona State University, U.S.A.
- Cui Lin, Valdosta State University, U.S.A.
- Marta Mattoso, Federal University of Rio de Janeiro, Brazil
- Paolo Missier, University of Manchester, U.K.
- Ioan Raicu, Illinois Institute of Technology, U.S.A.
- Yogesh Simmhan, University of Southern California, U.S.A.
- Wei Tan, IBM T. J. Watson Research Center, U.S.A.
- Ian Taylor, Cardiff University, U.K.
- Liqiang Wang, University of Wyoming, U.S.A.
- Jianwu Wang, San Diego Super Computer Center, U.S.A.
- Ping Yang, Binghamton University, U.S.A.
- Ustun Yildiz, UC Davis, U.S.A.
- Jia Zhang, Northern Illinois University, U.S.A.
- Yong Zhao, University of Electronic Science and Technology of China, P.R. China
- Zhiming Zhao, University of Amsterdam, the Netherlands

-- 
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wilde at mcs.anl.gov Sat Jan 29 13:35:04 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 29 Jan 2011 13:35:04 -0600 (CST)
Subject: [Swift-devel] Slow job processing by SGE execution provider?
Message-ID: <1098751715.503.1296329704930.JavaMail.root@zimbra.anl.gov>

Mihael,

Im running some simple Swift tests on the "siraf" SGE cluster in UC Radiology.

The swift script runs 10 cat jobs with one tiny input file (and in this case, no output files, even though an output dataset was mapped).

I see the following unexpected behavior:

The siraf cluster schedules jobs very quickly.
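A rough, Swift-independent check of the raw SGE turnaround is to time a batch of trivial jobs submitted straight through qsub. This is only a sketch: it assumes qsub and qstat are on the PATH, that the cluster's default queue accepts short jobs, and that no other jobs of yours are queued at the time.

#!/bin/bash
# Submit N trivial jobs directly to SGE and measure how long it takes
# for all of them to drain from the queue. This isolates raw scheduler
# turnaround from Swift's own completion handling.
N=${1:-10}
start=$(date +%s)
for i in $(seq 1 "$N"); do
    echo "/bin/cat data.txt > /dev/null" | qsub -cwd -o /dev/null -e /dev/null
done
# qstat -u prints nothing once the user has no pending or running jobs.
while [ -n "$(qstat -u "$USER" 2>/dev/null)" ]; do
    sleep 1
done
end=$(date +%s)
echo "$N trivial jobs drained from SGE in $((end - start)) seconds"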
The 10 jobs are launched and finished (as seen by a "watch qstat" command) in the first few seconds of the script's execution. But then, swift slowly logs the completions on stdout at a rate of less than one per second, so the overall workflow takes almost 40 seconds when it should finish in say 5-10 seconds. The same behavior occurs when I send in 50 or 100 jobs - the jobs get through SGE *very* quickly, and then swift sluggishly recognizes their completion. Its almost like SGE qstat is polled less than once per second, and only one completed job is recognized per poll. Also note that the progress log on stdout makes it look like Swift thinks only one job is active at a time, when in fact all 10 of the jobs have long finished as seen by an external qstat. The stdout and a sample log is below. You can see from the stdout that the completions were recognized piecemeal rather than in large batches at once. The same workflow finishes very rapidly both on siraf running on the local exec provider and on PADS under PBS. --- stdout and log: sir$ time swift -tc.file tc -sites.file sge.xml -config cf catsin.swift -n=10 Swift svn swift-r4044 cog-r3044 (cog modified locally) RunID: 20110129-1337-00pltm8f Progress: Progress: Submitting:8 Submitted:2 Progress: Submitted:10 Progress: Submitted:9 Active:1 Progress: Submitted:8 Active:1 Finished successfully:1 Progress: Submitted:7 Active:1 Finished successfully:2 Progress: Submitted:6 Active:1 Finished successfully:3 Progress: Submitted:5 Active:1 Finished successfully:4 Progress: Submitted:4 Active:1 Finished successfully:5 Progress: Submitted:3 Active:1 Finished successfully:6 Progress: Submitted:2 Active:1 Finished successfully:7 Progress: Submitted:1 Active:1 Finished successfully:8 Progress: Active:1 Finished successfully:9 Final status: Finished successfully:10 real 0m38.315s user 0m4.890s sys 0m0.583s sir$ ls -lt *.log | head -3 -rw-r--r-- 1 mwilde ml-giger 22865 Jan 29 13:38 catsin-20110129-1337-00pltm8f.log -rw-r--r-- 1 mwilde ml-giger 1653 Jan 29 13:37 swift.log -rw-r--r-- 1 mwilde ml-giger 30749 Jan 29 10:28 catsin-20110129-1027-019p3p32.log sir$ cat *8f.log 2011-01-29 13:37:35,128-0600 DEBUG Loader Max heap: 238616576 2011-01-29 13:37:35,128-0600 DEBUG Loader kmlversion is >a13cb327-aa83-47c5-9340-53e068ae5e33< 2011-01-29 13:37:35,128-0600 DEBUG Loader build version is >a13cb327-aa83-47c5-9340-53e068ae5e33< 2011-01-29 13:37:35,129-0600 DEBUG Loader Recompilation suppressed. 2011-01-29 13:37:35,340-0600 INFO VDL2ExecutionContext Stack dump: Level 1 [iA = 0, iB = 0, bA = false, bB = false] vdl:instanceconfigfile = cf vdl:instanceconfig = Swift configuration [cf] vdl:operation = run PATH_SEPARATOR = / swift.home = /home/mwilde/swift/rev/trunk/bin/.. 2011-01-29 13:37:36,047-0600 INFO unknown Using sites file: sge.xml 2011-01-29 13:37:36,094-0600 INFO unknown Using tc.data: tc 2011-01-29 13:37:36,235-0600 INFO AbstractScheduler Setting resources to: {sge=sge} 2011-01-29 13:37:36,879-0600 INFO unknown Swift svn swift-r4044 cog-r3044 (cog modified locally) 2011-01-29 13:37:36,879-0600 INFO unknown RUNID id=run:20110129-1337-00pltm8f 2011-01-29 13:37:36,967-0600 INFO SetFieldValue Set: swift#mapper#17000=outdir 2011-01-29 13:37:36,967-0600 INFO SetFieldValue Set: swift#mapper#17004=.out 2011-01-29 13:37:36,967-0600 INFO SetFieldValue Set: swift#mapper#17002=f. 
2011-01-29 13:37:36,967-0600 INFO VDLFunction FUNCTION: arg() 2011-01-29 13:37:36,968-0600 INFO VDLFunction FUNCTION: toint() 2011-01-29 13:37:36,979-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000028 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,979-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000032 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,979-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000031 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,979-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000029 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,979-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000030 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,980-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000035 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,980-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000033 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,980-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000034 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,980-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000036 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,980-0600 INFO AbstractDataNode Found data org.griphyn.vdl.mapping.RootDataNode identifier dataset:20110129-1337-qcgf98ff:720000000028 type file with no value at dataset=data (closed).$ 2011-01-29 13:37:36,989-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,990-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,991-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,991-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,992-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,993-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,989-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,990-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,990-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,990-0600 INFO VDLFunction FUNCTION: filename() 2011-01-29 13:37:36,996-0600 INFO vdl:execute START thread=0-3-5-1 tr=cat 2011-01-29 13:37:36,996-0600 INFO vdl:execute START thread=0-3-8-1 tr=cat 2011-01-29 13:37:36,996-0600 INFO vdl:execute START thread=0-3-10-1 tr=cat 2011-01-29 13:37:36,996-0600 INFO vdl:execute START thread=0-3-6-1 tr=cat 2011-01-29 13:37:36,997-0600 INFO vdl:execute START thread=0-3-4-1 tr=cat 2011-01-29 13:37:36,997-0600 INFO vdl:execute START thread=0-3-3-1 tr=cat 2011-01-29 13:37:36,997-0600 INFO vdl:execute START thread=0-3-7-1 tr=cat 2011-01-29 13:37:36,997-0600 INFO vdl:execute START thread=0-3-9-1 tr=cat 2011-01-29 13:37:36,998-0600 INFO vdl:execute START thread=0-3-1-1 tr=cat 2011-01-29 13:37:37,000-0600 INFO 
vdl:execute START thread=0-3-2-1 tr=cat 2011-01-29 13:37:37,015-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,015-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,015-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,016-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,016-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,016-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,016-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,016-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,017-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,017-0600 INFO WeightedHostScoreScheduler CONTACT_SELECTED host=sge, score=99.854 2011-01-29 13:37:37,019-0600 INFO GlobalSubmitQueue No global submit throttle set. Using default (1024) 2011-01-29 13:37:37,020-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,024-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,027-0600 INFO vdl:initshareddir START host=sge - Initializing shared directory 2011-01-29 13:37:37,135-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,154-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,158-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,161-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,166-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,167-0600 INFO vdl:initshareddir END host=sge - Done initializing shared directory 2011-01-29 13:37:37,180-0600 INFO vdl:createdirset START jobid=cat-8gnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,180-0600 INFO vdl:createdirset START jobid=cat-4gnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,181-0600 INFO vdl:createdirset START jobid=cat-agnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,180-0600 INFO vdl:createdirset START jobid=cat-7gnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,180-0600 INFO vdl:createdirset START jobid=cat-5gnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,181-0600 INFO vdl:createdirset START jobid=cat-6gnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset END jobid=cat-8gnyo65k - Done initializing directory structure 2011-01-29 13:37:37,181-0600 INFO vdl:createdirset START jobid=cat-9gnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,181-0600 INFO vdl:createdirset START jobid=cat-2gnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,181-0600 INFO vdl:createdirset START jobid=cat-3gnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset END jobid=cat-6gnyo65k - Done initializing directory structure 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset END jobid=cat-9gnyo65k - Done initializing directory structure 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset END jobid=cat-2gnyo65k - Done initializing directory structure 2011-01-29 13:37:37,183-0600 INFO vdl:dostagein START jobid=cat-8gnyo65k - Staging in files 2011-01-29 13:37:37,183-0600 INFO vdl:dostagein START jobid=cat-9gnyo65k - Staging in files 2011-01-29 
13:37:37,183-0600 INFO vdl:dostagein START jobid=cat-2gnyo65k - Staging in files 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset END jobid=cat-5gnyo65k - Done initializing directory structure 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset START jobid=cat-1gnyo65k host=sge - Initializing directory structure 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset END jobid=cat-4gnyo65k - Done initializing directory structure 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset END jobid=cat-7gnyo65k - Done initializing directory structure 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset END jobid=cat-agnyo65k - Done initializing directory structure 2011-01-29 13:37:37,183-0600 INFO vdl:dostagein START jobid=cat-6gnyo65k - Staging in files 2011-01-29 13:37:37,185-0600 INFO vdl:dostagein START jobid=cat-5gnyo65k - Staging in files 2011-01-29 13:37:37,185-0600 INFO vdl:dostagein START jobid=cat-4gnyo65k - Staging in files 2011-01-29 13:37:37,182-0600 INFO vdl:createdirset END jobid=cat-3gnyo65k - Done initializing directory structure 2011-01-29 13:37:37,185-0600 INFO vdl:dostagein START jobid=cat-7gnyo65k - Staging in files 2011-01-29 13:37:37,185-0600 INFO vdl:createdirset END jobid=cat-1gnyo65k - Done initializing directory structure 2011-01-29 13:37:37,185-0600 INFO vdl:dostagein START jobid=cat-agnyo65k - Staging in files 2011-01-29 13:37:37,186-0600 INFO vdl:dostagein START jobid=cat-3gnyo65k - Staging in files 2011-01-29 13:37:37,187-0600 INFO vdl:dostagein START jobid=cat-1gnyo65k - Staging in files 2011-01-29 13:37:37,194-0600 INFO LateBindingScheduler JobQueue: 2 2011-01-29 13:37:37,194-0600 INFO LateBindingScheduler JobQueue: 2 2011-01-29 13:37:37,195-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,196-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,196-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,197-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,197-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,198-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,198-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,198-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,212-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-agnyo65k - Staging in finished 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-2gnyo65k - Staging in finished 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-4gnyo65k - Staging in finished 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-7gnyo65k - Staging in finished 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-3gnyo65k - Staging in finished 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-8gnyo65k - Staging in finished 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-6gnyo65k - Staging in finished 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-5gnyo65k - Staging in finished 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-9gnyo65k - Staging in finished 2011-01-29 13:37:37,214-0600 INFO vdl:dostagein END jobid=cat-1gnyo65k - Staging in finished 2011-01-29 13:37:37,239-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: /bin/bash shared/_swiftwrap cat-agnyo65k -jobdir a -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,240-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: 
/bin/bash shared/_swiftwrap cat-7gnyo65k -jobdir 7 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,240-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: /bin/bash shared/_swiftwrap cat-3gnyo65k -jobdir 3 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,239-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: /bin/bash shared/_swiftwrap cat-2gnyo65k -jobdir 2 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,239-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: /bin/bash shared/_swiftwrap cat-1gnyo65k -jobdir 1 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,240-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-6-1-1-1296329856121) is /bin/bash shared/_swiftwrap cat-1gnyo65k -jobdir 1 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,241-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: /bin/bash shared/_swiftwrap cat-6gnyo65k -jobdir 6 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,241-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-3-1-1-1296329856126) is /bin/bash shared/_swiftwrap cat-6gnyo65k -jobdir 6 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,239-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: /bin/bash shared/_swiftwrap cat-8gnyo65k -jobdir 8 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,242-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-5-1-1-1296329856124) is /bin/bash shared/_swiftwrap cat-8gnyo65k -jobdir 8 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,242-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: /bin/bash shared/_swiftwrap cat-4gnyo65k -jobdir 4 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,242-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-9-1-1-1296329856128) is /bin/bash shared/_swiftwrap cat-4gnyo65k -jobdir 4 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,239-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: /bin/bash shared/_swiftwrap cat-5gnyo65k -jobdir 5 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,239-0600 INFO Execute Submit: in: catsin-20110129-1337-00pltm8f command: /bin/bash shared/_swiftwrap cat-9gnyo65k -jobdir 9 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,243-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, 
identity=urn:0-3-10-1-1-1296329856122) is /bin/bash shared/_swiftwrap cat-9gnyo65k -jobdir 9 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,243-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-7-1-1-1296329856123) is /bin/bash shared/_swiftwrap cat-5gnyo65k -jobdir 5 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,240-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-2-1-1-1296329856118) is /bin/bash shared/_swiftwrap cat-2gnyo65k -jobdir 2 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,240-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-1-1-1-1296329856117) is /bin/bash shared/_swiftwrap cat-3gnyo65k -jobdir 3 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,240-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-4-1-1-1296329856120) is /bin/bash shared/_swiftwrap cat-agnyo65k -jobdir a -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,240-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=urn:0-3-8-1-1-1296329856119) is /bin/bash shared/_swiftwrap cat-7gnyo65k -jobdir 7 -scratch -e /bin/cat -out stdout.txt -err stderr.txt -i -d -if data.txt -of -k -cdmfile -status provider -a data.txt 2011-01-29 13:37:37,941-0600 INFO SGEExecutor Job id from qsub: 712380 2011-01-29 13:37:38,069-0600 INFO SGEExecutor Job id from qsub: 712381 2011-01-29 13:37:38,188-0600 INFO SGEExecutor Job id from qsub: 712382 2011-01-29 13:37:38,195-0600 INFO AbstractQueuePoller Actively monitored: 0, New: 3, Done: 0 2011-01-29 13:37:38,223-0600 INFO SGEExecutor Job id from qsub: 712383 2011-01-29 13:37:38,388-0600 INFO SGEExecutor Job id from qsub: 712384 2011-01-29 13:37:38,397-0600 INFO SGEExecutor Job id from qsub: 712385 2011-01-29 13:37:38,574-0600 INFO SGEExecutor Job id from qsub: 712386 2011-01-29 13:37:38,911-0600 INFO SGEExecutor Job id from qsub: 712387 2011-01-29 13:37:38,917-0600 INFO SGEExecutor Job id from qsub: 712388 2011-01-29 13:37:39,220-0600 INFO SGEExecutor Job id from qsub: 712389 2011-01-29 13:37:50,037-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:50,048-0600 INFO vdl:dostageout START jobid=cat-agnyo65k - Staging out files 2011-01-29 13:37:50,053-0600 INFO vdl:dostageout END jobid=cat-agnyo65k - Staging out finished 2011-01-29 13:37:50,250-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:50,254-0600 INFO vdl:execute END_SUCCESS thread=0-3-4-1 tr=cat 2011-01-29 13:37:51,272-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:51,276-0600 INFO vdl:dostageout START jobid=cat-3gnyo65k - Staging out files 2011-01-29 13:37:51,277-0600 INFO vdl:dostageout END jobid=cat-3gnyo65k - Staging out finished 2011-01-29 13:37:51,396-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:51,401-0600 INFO vdl:execute END_SUCCESS thread=0-3-1-1 tr=cat 2011-01-29 13:37:52,634-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:52,634-0600 INFO AbstractQueuePoller Actively monitored: 3, New: 7, Done: 3 2011-01-29 13:37:52,636-0600 INFO vdl:dostageout START jobid=cat-7gnyo65k - Staging out files 
2011-01-29 13:37:52,637-0600 INFO vdl:dostageout END jobid=cat-7gnyo65k - Staging out finished 2011-01-29 13:37:52,788-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:37:52,790-0600 INFO vdl:execute END_SUCCESS thread=0-3-8-1 tr=cat 2011-01-29 13:38:03,788-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:03,791-0600 INFO vdl:dostageout START jobid=cat-2gnyo65k - Staging out files 2011-01-29 13:38:03,792-0600 INFO vdl:dostageout END jobid=cat-2gnyo65k - Staging out finished 2011-01-29 13:38:03,931-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:03,934-0600 INFO vdl:execute END_SUCCESS thread=0-3-2-1 tr=cat 2011-01-29 13:38:05,274-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:05,275-0600 INFO vdl:dostageout START jobid=cat-1gnyo65k - Staging out files 2011-01-29 13:38:05,276-0600 INFO vdl:dostageout END jobid=cat-1gnyo65k - Staging out finished 2011-01-29 13:38:05,375-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:05,377-0600 INFO vdl:execute END_SUCCESS thread=0-3-6-1 tr=cat 2011-01-29 13:38:06,790-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:06,792-0600 INFO vdl:dostageout START jobid=cat-9gnyo65k - Staging out files 2011-01-29 13:38:06,792-0600 INFO vdl:dostageout END jobid=cat-9gnyo65k - Staging out finished 2011-01-29 13:38:06,955-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:06,957-0600 INFO vdl:execute END_SUCCESS thread=0-3-10-1 tr=cat 2011-01-29 13:38:08,248-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:08,249-0600 INFO vdl:dostageout START jobid=cat-4gnyo65k - Staging out files 2011-01-29 13:38:08,250-0600 INFO vdl:dostageout END jobid=cat-4gnyo65k - Staging out finished 2011-01-29 13:38:08,324-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:08,326-0600 INFO vdl:execute END_SUCCESS thread=0-3-9-1 tr=cat 2011-01-29 13:38:09,482-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:09,483-0600 INFO vdl:dostageout START jobid=cat-5gnyo65k - Staging out files 2011-01-29 13:38:09,484-0600 INFO vdl:dostageout END jobid=cat-5gnyo65k - Staging out finished 2011-01-29 13:38:10,380-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:10,381-0600 INFO vdl:execute END_SUCCESS thread=0-3-7-1 tr=cat 2011-01-29 13:38:10,906-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:10,907-0600 INFO vdl:dostageout START jobid=cat-6gnyo65k - Staging out files 2011-01-29 13:38:10,907-0600 INFO vdl:dostageout END jobid=cat-6gnyo65k - Staging out finished 2011-01-29 13:38:11,362-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:11,363-0600 INFO vdl:execute END_SUCCESS thread=0-3-3-1 tr=cat 2011-01-29 13:38:12,732-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:12,732-0600 INFO AbstractQueuePoller Actively monitored: 7, New: 0, Done: 7 2011-01-29 13:38:12,733-0600 INFO vdl:dostageout START jobid=cat-8gnyo65k - Staging out files 2011-01-29 13:38:12,734-0600 INFO vdl:dostageout END jobid=cat-8gnyo65k - Staging out finished 2011-01-29 13:38:12,982-0600 INFO LateBindingScheduler JobQueue: 0 2011-01-29 13:38:12,984-0600 INFO vdl:execute END_SUCCESS thread=0-3-5-1 tr=cat 2011-01-29 13:38:12,996-0600 INFO vdl:cleanups START cleanups=[[catsin-20110129-1337-00pltm8f, sge]] 2011-01-29 13:38:12,997-0600 INFO vdl:cleanup START dir=catsin-20110129-1337-00pltm8f host=sge 2011-01-29 13:38:12,999-0600 INFO vdl:cleanup END dir=catsin-20110129-1337-00pltm8f host=sge 2011-01-29 13:38:12,999-0600 INFO vdl:cleanups END cleanups=[[catsin-20110129-1337-00pltm8f, sge]] 
2011-01-29 13:38:13,002-0600 INFO Loader Swift finished with no errors sir$ --- script, sites and properties files: sir$ more catsin.swift sge.xml cf :::::::::::::: catsin.swift :::::::::::::: type file; app cat (file i) { cat @i ; } file out[]; foreach j in [1:@toint(@arg("n","1"))] { file data<"data.txt">; cat(data); } :::::::::::::: sge.xml :::::::::::::: 00:01:00 shm 10000 .10 /home/mwilde/swiftwork :::::::::::::: cf :::::::::::::: wrapperlog.always.transfer=true sitedir.keep=true execution.retries=0 lazy.errors=false status.mode=provider use.provider.staging=false provider.staging.pin.swiftfiles=false sir$ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Sun Jan 30 03:05:56 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 30 Jan 2011 01:05:56 -0800 Subject: [Swift-devel] fast branch merged back to trunk Message-ID: <1296378356.27649.2.camel@blabla2.none> I haven't tested the results of the merge much. It compiles though. In terms of the code itself, I have been running it on my laptop for a few months now and I haven't seen any obvious problems. Mihael From hategan at mcs.anl.gov Mon Jan 31 18:27:42 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 31 Jan 2011 16:27:42 -0800 Subject: [Swift-devel] Re: Slow job processing by SGE execution provider? In-Reply-To: <1098751715.503.1296329704930.JavaMail.root@zimbra.anl.gov> References: <1098751715.503.1296329704930.JavaMail.root@zimbra.anl.gov> Message-ID: <1296520062.28637.12.camel@blabla2.none> On Sat, 2011-01-29 at 13:35 -0600, Michael Wilde wrote: > Mihael, > > Im running some simple Swift tests on the "siraf" SGE cluster in UC Radiology. > > The swift script runs 10 cat jobs with one tiny input file (and in > this case, no output files, even though an output dataset was mapped). > > I see the following unexpected behavior: > > The siraf cluster schedules jobs very quickly. The 10 jobs are > launched and finished (as seen by a "watch qstat" command) in the > first few seconds of the script's execution. > > But then, swift slowly logs the completions on stdout at a rate of > less than one per second, so the overall workflow takes almost 40 > seconds when it should finish in say 5-10 seconds. The same behavior > occurs when I send in 50 or 100 jobs - the jobs get through SGE *very* > quickly, and then swift sluggishly recognizes their completion. > > Its almost like SGE qstat is polled less than once per second, and > only one completed job is recognized per poll. It is polled less than once per second. There is one poll every 10 seconds. This doesn't explain the delay between the stageouts. I suspect it might be some side effect of the "fast" code, which clearly isn't very fast in this case. > > Also note that the progress log on stdout makes it look like Swift > thinks only one job is active at a time, when in fact all 10 of the > jobs have long finished as seen by an external qstat. Right. It looks like karajan code is executed in the task notification thread which serializes those notifications. There should be a simple fix for this. Mihael From hategan at mcs.anl.gov Mon Jan 31 22:52:20 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 31 Jan 2011 20:52:20 -0800 Subject: [Swift-devel] Re: Slow job processing by SGE execution provider? 
In-Reply-To: <1296520062.28637.12.camel@blabla2.none> References: <1098751715.503.1296329704930.JavaMail.root@zimbra.anl.gov> <1296520062.28637.12.camel@blabla2.none> Message-ID: <1296535940.31037.40.camel@blabla2.none> On Mon, 2011-01-31 at 16:27 -0800, Mihael Hategan wrote: > On Sat, 2011-01-29 at 13:35 -0600, Michael Wilde wrote: > > Mihael, > > > > Im running some simple Swift tests on the "siraf" SGE cluster in UC Radiology. > > > > The swift script runs 10 cat jobs with one tiny input file (and in > > this case, no output files, even though an output dataset was mapped). > > > > I see the following unexpected behavior: > > > > The siraf cluster schedules jobs very quickly. The 10 jobs are > > launched and finished (as seen by a "watch qstat" command) in the > > first few seconds of the script's execution. > > > > But then, swift slowly logs the completions on stdout at a rate of > > less than one per second, so the overall workflow takes almost 40 > > seconds when it should finish in say 5-10 seconds. The same behavior > > occurs when I send in 50 or 100 jobs - the jobs get through SGE *very* > > quickly, and then swift sluggishly recognizes their completion. > > > > Its almost like SGE qstat is polled less than once per second, and > > only one completed job is recognized per poll. > > It is polled less than once per second. There is one poll every 10 > seconds. > > This doesn't explain the delay between the stageouts. I suspect it might > be some side effect of the "fast" code, which clearly isn't very fast in > this case. > > > > Also note that the progress log on stdout makes it look like Swift > > thinks only one job is active at a time, when in fact all 10 of the > > jobs have long finished as seen by an external qstat. > > Right. It looks like karajan code is executed in the task notification > thread which serializes those notifications. That doesn't turn out to be the issue. So I'll need to dig more. From jacob at mcs.anl.gov Mon Jan 10 11:34:51 2011 From: jacob at mcs.anl.gov (Robert Jacob) Date: Mon, 10 Jan 2011 17:34:51 -0000 Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: <1019521736.40464.1294526433159.JavaMail.root@zimbra.anl.gov> References: <1019521736.40464.1294526433159.JavaMail.root@zimbra.anl.gov> Message-ID: <4D2B433C.60404@mcs.anl.gov> Lets use Fusion. Rob On 1/8/11 4:40 PM, Michael Wilde wrote: > Thanks, Justin. cc'ing back to the list, Rob, and Sheri. > > Sheri, maybe you can run on PADS or Fusion till this is fixed? > > - Mike > > ----- Original Message ----- >> Hello >> Right, Swift does not currently run on Eureka due to the following >> bug in Cobalt: >> >> http://trac.mcs.anl.gov/projects/cobalt/ticket/462 >> >> I got about half of a work-around for this done... >> >> Justin >> >> On Fri, 7 Jan 2011, Michael Wilde wrote: >> >>> Hi Rob and Sheri, >>> >>> I don't know the status of Swift on Eureka, but Im eager to see it >>> running there, so we'll make sure it works. >>> >>> A long while back I tried Swift there, and at the time we had a >>> minor >>> bug in the Cobalt provider. Justin may have fixed that recently on >>> the >>> BG/P's. So Im hoping it either works or has only some >>> readily-fixable >>> issues in the way. >>> >>> We'll try it and get back to you. >>> >>> In the mean time, Sheri, you might want to try a simple hello-world >>> test >>> on Eureka, and see if you can progress to replicating what John >>> Dennis >>> had done so far. 
>>> >>> Its best to send any errors you get to the swift-user list (which >>> you >>> should join) so that everyone on the Swift team is aware f any >>> issues >>> you encounter and can offer help. >>> >>> You should meet with Justin at Argonne (3rd floor, 240) who can >>> serve as >>> your Swift mentor. >>> >>> Sarah, David - lets add Eureka to the test matrix for release 0.92. >>> Cobalt is very very close to PBS's interface, but there is a >>> separate >>> Swift execution provider that handles the differences. >>> >>> Regards, >>> >>> Mike >>> >>> >>> ----- Original Message ----- >>>> Hi Mike, >>>> >>>> Sheri is going to take over some of the development work John >>>> Dennis >>>> was >>>> doing on using swift with the AMWG diag package. >>>> >>>> Our platform is Eureka. Is there a development version of Swift >>>> installed there? >>>> >>>> Rob >>> >>> >> >> -- >> Justin M Wozniak > From mickelso at mcs.anl.gov Mon Jan 10 11:40:40 2011 From: mickelso at mcs.anl.gov (Sheri Mickelson) Date: Mon, 10 Jan 2011 17:40:40 -0000 Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: <4D2B433C.60404@mcs.anl.gov> References: <1019521736.40464.1294526433159.JavaMail.root@zimbra.anl.gov> <4D2B433C.60404@mcs.anl.gov> Message-ID: <4D2B4495.6030805@mcs.anl.gov> Is swift already installed on fussion? -Sheri Robert Jacob wrote: > > Lets use Fusion. > > Rob > > > On 1/8/11 4:40 PM, Michael Wilde wrote: >> Thanks, Justin. cc'ing back to the list, Rob, and Sheri. >> >> Sheri, maybe you can run on PADS or Fusion till this is fixed? >> >> - Mike >> >> ----- Original Message ----- >>> Hello >>> Right, Swift does not currently run on Eureka due to the following >>> bug in Cobalt: >>> >>> http://trac.mcs.anl.gov/projects/cobalt/ticket/462 >>> >>> I got about half of a work-around for this done... >>> >>> Justin >>> >>> On Fri, 7 Jan 2011, Michael Wilde wrote: >>> >>>> Hi Rob and Sheri, >>>> >>>> I don't know the status of Swift on Eureka, but Im eager to see it >>>> running there, so we'll make sure it works. >>>> >>>> A long while back I tried Swift there, and at the time we had a >>>> minor >>>> bug in the Cobalt provider. Justin may have fixed that recently on >>>> the >>>> BG/P's. So Im hoping it either works or has only some >>>> readily-fixable >>>> issues in the way. >>>> >>>> We'll try it and get back to you. >>>> >>>> In the mean time, Sheri, you might want to try a simple hello-world >>>> test >>>> on Eureka, and see if you can progress to replicating what John >>>> Dennis >>>> had done so far. >>>> >>>> Its best to send any errors you get to the swift-user list (which >>>> you >>>> should join) so that everyone on the Swift team is aware f any >>>> issues >>>> you encounter and can offer help. >>>> >>>> You should meet with Justin at Argonne (3rd floor, 240) who can >>>> serve as >>>> your Swift mentor. >>>> >>>> Sarah, David - lets add Eureka to the test matrix for release 0.92. >>>> Cobalt is very very close to PBS's interface, but there is a >>>> separate >>>> Swift execution provider that handles the differences. >>>> >>>> Regards, >>>> >>>> Mike >>>> >>>> >>>> ----- Original Message ----- >>>>> Hi Mike, >>>>> >>>>> Sheri is going to take over some of the development work John >>>>> Dennis >>>>> was >>>>> doing on using swift with the AMWG diag package. >>>>> >>>>> Our platform is Eureka. Is there a development version of Swift >>>>> installed there? 
>>>>> >>>>> Rob >>>> >>>> >>> >>> -- >>> Justin M Wozniak >> From mickelso at mcs.anl.gov Mon Jan 10 13:48:50 2011 From: mickelso at mcs.anl.gov (Sheri Mickelson) Date: Mon, 10 Jan 2011 19:48:50 -0000 Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: References: <4D2B4495.6030805@mcs.anl.gov> <2049325529.45559.1294682720072.JavaMail.root@zimbra.anl.gov> Message-ID: <4D2B629F.2060903@mcs.anl.gov> fusion.lcrc.anl.gov -Sheri Sarah Kenny wrote: > what's the full hostname for fusion? i can see if my new account is > active there. > > On Mon, Jan 10, 2011 at 10:05 AM, Michael Wilde > wrote: > > No, we'll either need to build it for you or you can try to build it > yourself. > > Sarah or David, can you do a build and sanity test of Swift on > Fusion today? > (If not, I will do this later today...) > > We should get this installed as a softenv package on Fusion, PADS, > and MCS machines. > > Thanks, > > Mike > > > ----- Original Message ----- > > Is swift already installed on fussion? > > > > -Sheri > > > > Robert Jacob wrote: > > > > > > Lets use Fusion. > > > > > > Rob > > > > > > > > > On 1/8/11 4:40 PM, Michael Wilde wrote: > > >> Thanks, Justin. cc'ing back to the list, Rob, and Sheri. > > >> > > >> Sheri, maybe you can run on PADS or Fusion till this is fixed? > > >> > > >> - Mike > > >> > > >> ----- Original Message ----- > > >>> Hello > > >>> Right, Swift does not currently run on Eureka due to the > following > > >>> bug in Cobalt: > > >>> > > >>> http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > >>> > > >>> I got about half of a work-around for this done... > > >>> > > >>> Justin > > >>> > > >>> On Fri, 7 Jan 2011, Michael Wilde wrote: > > >>> > > >>>> Hi Rob and Sheri, > > >>>> > > >>>> I don't know the status of Swift on Eureka, but Im eager to see > > >>>> it > > >>>> running there, so we'll make sure it works. > > >>>> > > >>>> A long while back I tried Swift there, and at the time we had a > > >>>> minor > > >>>> bug in the Cobalt provider. Justin may have fixed that recently > > >>>> on > > >>>> the > > >>>> BG/P's. So Im hoping it either works or has only some > > >>>> readily-fixable > > >>>> issues in the way. > > >>>> > > >>>> We'll try it and get back to you. > > >>>> > > >>>> In the mean time, Sheri, you might want to try a simple > > >>>> hello-world > > >>>> test > > >>>> on Eureka, and see if you can progress to replicating what John > > >>>> Dennis > > >>>> had done so far. > > >>>> > > >>>> Its best to send any errors you get to the swift-user list > (which > > >>>> you > > >>>> should join) so that everyone on the Swift team is aware f any > > >>>> issues > > >>>> you encounter and can offer help. > > >>>> > > >>>> You should meet with Justin at Argonne (3rd floor, 240) who can > > >>>> serve as > > >>>> your Swift mentor. > > >>>> > > >>>> Sarah, David - lets add Eureka to the test matrix for release > > >>>> 0.92. > > >>>> Cobalt is very very close to PBS's interface, but there is a > > >>>> separate > > >>>> Swift execution provider that handles the differences. > > >>>> > > >>>> Regards, > > >>>> > > >>>> Mike > > >>>> > > >>>> > > >>>> ----- Original Message ----- > > >>>>> Hi Mike, > > >>>>> > > >>>>> Sheri is going to take over some of the development work John > > >>>>> Dennis > > >>>>> was > > >>>>> doing on using swift with the AMWG diag package. > > >>>>> > > >>>>> Our platform is Eureka. Is there a development version of Swift > > >>>>> installed there? 
> > >>>>> > > >>>>> Rob > > >>>> > > >>>> > > >>> > > >>> -- > > >>> Justin M Wozniak > > >> > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From mickelso at mcs.anl.gov Tue Jan 11 10:50:25 2011 From: mickelso at mcs.anl.gov (Sheri Mickelson) Date: Tue, 11 Jan 2011 16:50:25 -0000 Subject: [Swift-devel] Re: Swift on Eureka In-Reply-To: <1357754317.49737.1294764552708.JavaMail.root@zimbra.anl.gov> References: <1357754317.49737.1294764552708.JavaMail.root@zimbra.anl.gov> Message-ID: <4D2C8A4E.2040707@mcs.anl.gov> Thanks Michael Wilde wrote: > Sheri, All, > > I have built Swift on Fusion (using the 0.92 branches of swift and cog) > > To use: > > PATH=/home/wilde/swift/rev/0.92/bin:$PATH > swift -etc > > The source was built from /home/wilde/swift/src/0.92 on fusion > > I have yet to test, but Sheri this should help you get started. > > Please bear with us if the going starts out rough here. > > We will also work on making Swift run on Eureka. > > Sample (but untested) config files to use on PBS here are at: > /home/wilde/swift/lab/{pbs.xml,tc,cf} > > Please join and send all problems to swift-user: > http://www.ci.uchicago.edu/swift/support/index.php > > Im hoping that Sarah and David will soon have logins on Fusion and can help certify this release on that system. > > Regards, > > Mike > > > > > > > > ----- Original Message ----- >> No, we'll either need to build it for you or you can try to build it >> yourself. >> >> Sarah or David, can you do a build and sanity test of Swift on Fusion >> today? >> (If not, I will do this later today...) >> >> We should get this installed as a softenv package on Fusion, PADS, and >> MCS machines. >> >> Thanks, >> >> Mike >> >> >> ----- Original Message ----- >>> Is swift already installed on fussion? >>> >>> -Sheri >>> >>> Robert Jacob wrote: >>>> Lets use Fusion. >>>> >>>> Rob >>>> >>>> >>>> On 1/8/11 4:40 PM, Michael Wilde wrote: >>>>> Thanks, Justin. cc'ing back to the list, Rob, and Sheri. >>>>> >>>>> Sheri, maybe you can run on PADS or Fusion till this is fixed? >>>>> >>>>> - Mike >>>>> >>>>> ----- Original Message ----- >>>>>> Hello >>>>>> Right, Swift does not currently run on Eureka due to the >>>>>> following >>>>>> bug in Cobalt: >>>>>> >>>>>> http://trac.mcs.anl.gov/projects/cobalt/ticket/462 >>>>>> >>>>>> I got about half of a work-around for this done... >>>>>> >>>>>> Justin >>>>>> >>>>>> On Fri, 7 Jan 2011, Michael Wilde wrote: >>>>>> >>>>>>> Hi Rob and Sheri, >>>>>>> >>>>>>> I don't know the status of Swift on Eureka, but Im eager to see >>>>>>> it >>>>>>> running there, so we'll make sure it works. >>>>>>> >>>>>>> A long while back I tried Swift there, and at the time we had a >>>>>>> minor >>>>>>> bug in the Cobalt provider. Justin may have fixed that recently >>>>>>> on >>>>>>> the >>>>>>> BG/P's. So Im hoping it either works or has only some >>>>>>> readily-fixable >>>>>>> issues in the way. >>>>>>> >>>>>>> We'll try it and get back to you. >>>>>>> >>>>>>> In the mean time, Sheri, you might want to try a simple >>>>>>> hello-world >>>>>>> test >>>>>>> on Eureka, and see if you can progress to replicating what John >>>>>>> Dennis >>>>>>> had done so far. 
>>>>>>> >>>>>>> Its best to send any errors you get to the swift-user list >>>>>>> (which >>>>>>> you >>>>>>> should join) so that everyone on the Swift team is aware f any >>>>>>> issues >>>>>>> you encounter and can offer help. >>>>>>> >>>>>>> You should meet with Justin at Argonne (3rd floor, 240) who can >>>>>>> serve as >>>>>>> your Swift mentor. >>>>>>> >>>>>>> Sarah, David - lets add Eureka to the test matrix for release >>>>>>> 0.92. >>>>>>> Cobalt is very very close to PBS's interface, but there is a >>>>>>> separate >>>>>>> Swift execution provider that handles the differences. >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Mike >>>>>>> >>>>>>> >>>>>>> ----- Original Message ----- >>>>>>>> Hi Mike, >>>>>>>> >>>>>>>> Sheri is going to take over some of the development work John >>>>>>>> Dennis >>>>>>>> was >>>>>>>> doing on using swift with the AMWG diag package. >>>>>>>> >>>>>>>> Our platform is Eureka. Is there a development version of >>>>>>>> Swift >>>>>>>> installed there? >>>>>>>> >>>>>>>> Rob >>>>>>> >>>>>> -- >>>>>> Justin M Wozniak >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory > From richp at alcf.anl.gov Tue Jan 11 17:19:19 2011 From: richp at alcf.anl.gov (Paul M. Rich) Date: Tue, 11 Jan 2011 23:19:19 -0000 Subject: [Swift-devel] Re: [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? In-Reply-To: References: <1484991023.52267.1294786105485.JavaMail.root@zimbra.anl.gov> Message-ID: <4D2CE575.2090506@alcf.anl.gov> Michael, Unfortunately a fix for this will, at this point in time, take a minimum of four weeks to deploy to a production resource like Eureka, due to our testing, upgrade and maintenance procedures. As a workaround for this on Eureka, since every job effectively runs in script mode, you should be able to set environment variables within the script that you submit to Cobalt. We apologize for the inconvenience. Let us know if you have any other questions. -- Paul Rich ALCF Operations -- AIG richp at alcf.anl.gov On 1/11/11 4:48 PM, Michael Wilde wrote: > User info for wilde at mcs.anl.gov > ================================= > Username: wilde > Full Name: Michael Wilde > Projects: HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde > ('*' denotes INCITE projects) > ================================= > > > Hi ALCF Team, > > The following known issue in Cobalt is currently preventing us from running Swift on Eureka: > > http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > With some additional development effort we can work around this, but it would be much cleaner and better if this were fixed in Cobalt, instead, as suggested in ticket 462 above. > > Is there any chance that can be done in the next few days? > If not, please let me know, and we will implement the work-around instead. > > This is holding up work on the DOE ParVis project (Rob Jacob, PI) and we've had to move some work we want to run on Eureka to other platforms in the meantime. > > Thanks very much, > > Mike > > 462 is: > > Ticket #462 (new defect) > Opened 7 months ago > Cobalt on clusters ignores job script arguments > > Reported by: acherry > Priority: major > Component: clients > > Description > > It appears that cobalt-launcher.py does not support running a job script or executable with command arguments, even though qsub will accept the arguments, and the man page and help for qsub indicates that arguments are accepted. 
> > I'm filing this as a bug rather than a feature request, since the behavior isn't consistent with the documentation. But I'd rather the fix for this to be adding support for args, rather than changing the docs to say they aren't accepted. :-) > > From acherry at alcf.anl.gov Tue Jan 11 17:19:25 2011 From: acherry at alcf.anl.gov (Andrew Cherry) Date: Tue, 11 Jan 2011 23:19:25 -0000 Subject: [Swift-devel] Re: [alcf-support #60887] Can Cobalt command-line bug on Eureka be fixed? In-Reply-To: References: <1484991023.52267.1294786105485.JavaMail.root@zimbra.anl.gov> Message-ID: Mike- My initial reaction is that a fix would probably not be doable in the next few days, since it would almost certainly require scheduling downtime to bring Cobalt down, apply the fix, test, and restart. But I'll ping the Cobalt folks to find out how feasible this would be. My recollections from my previous investigation is that it would require changes to the cluster_system component as well as the launcher, so a shutdown wouldn't be avoidable. -Andrew On Jan 11, 2011, at 4:48 PM, Michael Wilde wrote: > Hi ALCF Team, > > The following known issue in Cobalt is currently preventing us from > running Swift on Eureka: > > http://trac.mcs.anl.gov/projects/cobalt/ticket/462 > > With some additional development effort we can work around this, but > it would be much cleaner and better if this were fixed in Cobalt, > instead, as suggested in ticket 462 above. > > Is there any chance that can be done in the next few days? > If not, please let me know, and we will implement the work-around > instead. > > This is holding up work on the DOE ParVis project (Rob Jacob, PI) > and we've had to move some work we want to run on Eureka to other > platforms in the meantime. > > Thanks very much, > > Mike > > 462 is: > > Ticket #462 (new defect) > Opened 7 months ago > Cobalt on clusters ignores job script arguments > > Reported by: acherry > Priority: major > Component: clients > > Description > > It appears that cobalt-launcher.py does not support running a job > script or executable with command arguments, even though qsub will > accept the arguments, and the man page and help for qsub indicates > that arguments are accepted. > > I'm filing this as a bug rather than a feature request, since the > behavior isn't consistent with the documentation. But I'd rather the > fix for this to be adding support for args, rather than changing the > docs to say they aren't accepted. :-) -------------- next part -------------- An HTML attachment was scrubbed... URL:
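For the Cobalt argument-passing bug above (ticket 462), the workaround Paul Rich describes - every job on Eureka effectively runs in script mode - amounts to folding the arguments and environment into the script that gets submitted. The sketch below only illustrates that idea; the wrapper path, the exported variable, the embedded command, and the qsub flags are illustrative assumptions, not the actual Swift Cobalt provider code.

#!/bin/bash
# Workaround sketch for Cobalt qsub ignoring arguments after the script name:
# generate a wrapper with the command line and environment baked in, then
# submit the wrapper with no trailing arguments.
WRAP=$(mktemp /tmp/cobalt-wrap.XXXXXX)
cat > "$WRAP" <<'EOF'
#!/bin/bash
# Everything the job needs is hard-coded here, because anything passed
# after the script name on the qsub command line would be dropped.
export SWIFT_JOBDIR=/home/someuser/swiftwork/job1    # illustrative
exec /home/someuser/bin/myapp --input data.txt --threads 4    # illustrative
EOF
chmod +x "$WRAP"
# -n (node count) and -t (walltime in minutes) are the usual Cobalt qsub
# options; check qsub --help on Eureka for the exact spelling.
qsub -n 1 -t 10 "$WRAP"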