From wilde at mcs.anl.gov Thu Sep 2 08:55:52 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 2 Sep 2010 07:55:52 -0600 (GMT-06:00)
Subject: [Swift-devel] Items needed for Grid/OSG Condor-G support
In-Reply-To:
Message-ID: <1018902194.181691283435752089.JavaMail.root@zimbra.anl.gov>

Glen, following up on these:

----- "Glen Hocky" wrote:

> oh but two things for the devels that we discussed before
> 1) if you could get someone to make restarting slightly easier (i.e.
> you don't have to specify all options to restart, see earlier email to
> list host)

I can do that in a swiftrun wrapper, right?

> 2) tagging the jobs submitted or at least making sure they get pulled
> out when a job fails or is canceled with the condor provider

I was going to put a "run #" tag on all condor-g jobs from a run, and
create a script that removes all the jobs from a run (swiftrm) using
the right condor_rm pattern-matching spell.

Sound OK?

- Mike

>
> On Sun, Aug 29, 2010 at 9:08 PM, Glen Hocky <hockyg at gmail.com>
> wrote:
>
> well 2 sites that would be productive, *vcell* and *mit* (forget exact
> names) both have jobs failing with "failed to transfer wrapper log"
> errors but since it works on so many other sites, i think that must be
> a problem on those sites...if we could work around or get that fixed
> that would add a lot of machines. otherwise i'm just gonna try to get
> some productive runs done (almost done one) so we can say that we used
> OSG productively....
>
> On Sun, Aug 29, 2010 at 8:40 PM, Michael Wilde <wilde at mcs.anl.gov>
> wrote:
>
> Very good, thanks Glen.
>
> What's the next prio on this workflow? Still some sites that are not
> building or running correctly?
>
> - Mike
>
> ----- "Glen Hocky" <hockyg at gmail.com> wrote:
>
> > it works now.
> > thanks a lot
> >
> > On Fri, Aug 27, 2010 at 2:52 PM, Glen Hocky <hockyg at gmail.com>
> > wrote:
> >
> > ok i'll try again
> >
> > On Fri, Aug 27, 2010 at 2:49 PM, Michael Wilde <wilde at mcs.anl.gov>
> > wrote:
> >
> > Updated; ~wilde/swift/rev/trunk is now at: swift-r3571 cog-r2868
> >
> > - Mike
> >
> > ----- "Glen Hocky" <hockyg at gmail.com> wrote:
> >
> > > Let me know when you update...
> > >
> > > Begin forwarded message:
> > >
> > > From: Mihael Hategan <hategan at mcs.anl.gov>
> > > Date: August 27, 2010 2:01:56 PM EDT
> > > To: Glen Hocky <hockyg at gmail.com>
> > > Cc: Mike Wilde <wilde at mcs.anl.gov>
> > > Subject: Re: [Swift-user] Re: Errors in 13-site OSG run: lazy error question
> > >
> > > swift trunk r3568

From hategan at mcs.anl.gov Thu Sep 2 12:20:04 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 02 Sep 2010 12:20:04 -0500
Subject: [Swift-devel] Items needed for Grid/OSG Condor-G support
In-Reply-To: <1018902194.181691283435752089.JavaMail.root@zimbra.anl.gov>
References: <1018902194.181691283435752089.JavaMail.root@zimbra.anl.gov>
Message-ID: <1283448005.3416.4.camel@blabla2.none>

On Thu, 2010-09-02 at 07:55 -0600, Michael Wilde wrote:
> Glen, following up on these:
>
> ----- "Glen Hocky" wrote:
> > 2) tagging the jobs submitted or at least making sure they get pulled
> > out when a job fails or is canceled with the condor provider
>
> I was going to put a "run #" tag on all condor-g jobs from a run, and
> create a script that removes all the jobs from a run (swiftrm) using
> the right condor_rm pattern-matching spell.

Condor has an automatic removal mode. However, it is explicitly
disabled because of the way the current condor provider figures out
status: it needs to see the job in a done or failed state. After that
happens, the job is marked as auto-remove.
However, if swift is killed in the middle of the process, that doesn't
happen. Now, if the condor provider were to use log files for tracking
status instead, this would not be a problem. So I think that's the
right way to go. Otherwise we could add some JVM cleanup tasks to mark
the jobs as removable when the JVM is shut down.

Mihael

From wilde at mcs.anl.gov Fri Sep 3 11:45:38 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Fri, 3 Sep 2010 10:45:38 -0600 (GMT-06:00)
Subject: [Swift-devel] Delay when coasters are shutting down
In-Reply-To: <1260100689.193451283444925306.JavaMail.root@zimbra.anl.gov>
Message-ID: <835707635.346781283532338582.JavaMail.root@zimbra.anl.gov>

Mihael,

There seems to be a brief (~ 2-5 second) delay at the end of a run
when swift seems to be shutting down the coaster service. (See example
below)

This delay degrades performance for the Swift R interface, where an R
script will be running many separate Swift scripts, one for each call
to parallel apply().

Can this delay be readily eliminated? If it's not present for
persistent coasters, then that's not a problem. I'll look into it when
I get a chance, but a pointer to where in the code the delay occurs
would be helpful.

Eventually we'll have an R script make multiple calls to the same
Swift JVM for successive apply() calls. That in fact would be a better
place to focus than eliminating this delay.

In the R interface, we use a pair of FIFOs to pass calls from a Swift
app() script to persistent R server processes, one per Swift worker
slot. We could use the same approach (and/or sockets) to provide a
client interface for running multiple Swift calls from the same JVM.

Thanks,

- Mike

swift output is in: swift.stdouterr.0745, pids in swift.workerpids.0745
Swift svn swift-r3591 cog-r2868 (cog modified locally)

RunID: 20100902-1058-k58r43yg
Progress:
Passive queue processor initialized. Callback URI is http://128.135.125.18:50003
/home/wilde/SwiftR/passive-coaster-swift: Coaster contact: http://128.135.125.18:50003
/home/wilde/SwiftR/passive-coaster-swift: started workers from these ssh processes: 30864
logilename: /home/wilde/swiftworkerlogs/worker-42.log
Progress: Active:1
Final status: Finished successfully:1
Cleaning up...
Shutting down service at https://128.135.125.18:50002
Got channel MetaChannel: 1855107489[1623980477: {}] -> null[1623980477: {}]
+ Done

<<< Delay occurs here: swift command doesn't exit for a few seconds after the "Cleaning up..." message >>>

Exitting, terminating worker processes 30864 and starter 30747:
bri$

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Fri Sep 3 12:15:55 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 03 Sep 2010 10:15:55 -0700
Subject: [Swift-devel] Re: Delay when coasters are shutting down
In-Reply-To: <835707635.346781283532338582.JavaMail.root@zimbra.anl.gov>
References: <835707635.346781283532338582.JavaMail.root@zimbra.anl.gov>
Message-ID: <1283534155.8685.1.camel@blabla2.none>

On Fri, 2010-09-03 at 10:45 -0600, wilde at mcs.anl.gov wrote:
> Mihael,
>
> There seems to be a brief (~ 2-5 second) delay at the end of a run
> when swift seems to be shutting down the coaster service. (See example
> below)
>
> This delay degrades performance for the Swift R interface, where an R
> script will be running many separate Swift scripts, one for each call
> to parallel apply().
>
> Can this delay be readily eliminated? If it's not present for
> persistent coasters, then that's not a problem. I'll look into it when
> I get a chance, but a pointer to where in the code the delay occurs
> would be helpful.

The delay is not there with persistent coasters. It is also not there
if, by some unknown mechanism, the sub-swift JVM is reused.
>
> Eventually we'll have an R script make multiple calls to the same
> Swift JVM for successive apply() calls. That in fact would be a better
> place to focus than eliminating this delay.
>
> In the R interface, we use a pair of FIFOs to pass calls
> from a Swift app() script to persistent R server processes, one per
> Swift worker slot. We could use the same approach (and/or sockets) to
> provide a client interface for running multiple Swift calls from the
> same JVM.

Can't the sub-swift be a swift function?

Mihael

From hategan at mcs.anl.gov Fri Sep 3 21:34:33 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 03 Sep 2010 19:34:33 -0700
Subject: [Swift-devel] Provider staging is failing
In-Reply-To: <969490906.75291283230364810.JavaMail.root@zimbra.anl.gov>
References: <969490906.75291283230364810.JavaMail.root@zimbra.anl.gov>
Message-ID: <1283567673.13115.2.camel@blabla2.none>

This should be fixed in trunk (swift r3600).

Two issues were addressed:
1. don't bother with creating input file dirs in the wrapper since the
   stage-in process, which happens before the wrapper gets invoked,
   should take care of that. If it does not, then the model is broken
   to begin with.
2. don't add empty directories to the list

Please test.

Mihael

On Mon, 2010-08-30 at 22:52 -0600, wilde at mcs.anl.gov wrote:
> Nope - I was wrong again. The "-d |outdir" form has been generated all along. The problem was that this causes a mkdir -p in _swiftwrap.staging to be invoked with a null value. This was obscured in _swiftwrap, which had a jobdir in front of the null input dir, and was thus silently ignored by mkdir -p.
>
> I committed a fix (skip mkdir if dir is null), but please keep an eye on _swiftwrap.staging in case it causes other issues.
>
> There was also a typo in a var $STDER -> STDERR.
>
> - Mike
>
> ----- wilde at mcs.anl.gov wrote:
>
> > ----- "Justin M Wozniak" wrote:
> >
> > > I think that's ok.
> >
> > Right: Mihael pointed out to me in IM that the exec'ed program is
> > /bin/bash with _swiftwrap.staging as an arg.
> >
> > Digging deeper it looks like _swiftwrap.staging is getting run with
> > this command line:
> >
> > /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.0001.out -err
> > stderr.txt -i -d '|outdir' -if data.txt -of outdir/f.0001.out -k
> > -cdmfile -status provider -a data.txt
> >
> > and the extra "|" separator in the -d 'outdir' arg (quotes mine) is
> > causing a spurious mkdir to get invoked for what would have been the
> > "in dirs" argument. That in turn is causing the ret code 254.
> >
> > I think that extra | separator is not supposed to be there when there
> > are no input directories (as in this case). vdl-int.staging has:
> > "-d", flatten(each(fileDirs)),
> > and I now suspect a null value for the dirs of stagein is not being
> > handled right, somewhere around:
> > fileDirs := fileDirs(stagein, stageout)
> >
> > - Mike
> >
> > > Do you have the wrapper.log/info files?
> > >
> > > On Mon, 30 Aug 2010, Michael Wilde wrote:
> > >
> > > > _swiftwrap.staging didn't seem to get marked executable:
> > > >
> > > > ----- "Michael Wilde" wrote:
> > > >
> > > >> With proxy the stageins seem to complete.
> > > >> Then I get a 254 when it tries to run; I'm looking at that now:
> > > >>
> > > >> 1283218480.397 DEBUG 000000 CWD: /
> > > >> 1283218480.397 DEBUG 000000 Running /bin/bash
> > > >> 1283218480.397 DEBUG 000000 Directory: /home/wilde/swiftwork/catsn-20100830-2034-hotqv61h-o-cat-oyun22yj
> > > >> 1283218480.397 DEBUG 000000 Command: _swiftwrap.staging -e /bin/cat -out outdir/f.0001.out -err stderr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k -cdmfile -status provider -a data.txt
> > > >> 1283218480.397 DEBUG 000000 Command: /bin/bash _swiftwrap.staging -e /bin/cat -out outdir/f.0001.out -err stderr.txt -i -d |outdir -if data.txt -of outdir/f.0001.out -k -cdmfile -status provider -a data.txt
> > > >> 1283218480.397 DEBUG 000000 1283218479990 Forked process 17949. Waiting for its completion
> > > >> 1283218480.408 DEBUG 000000 Checking jobs status (1 active)
> > > >> 1283218480.408 DEBUG 000000 1283218479990 Checking pid 17949
> > > >> 1283218480.408 DEBUG 000000 1283218479990 Job 17949 still running
> > > >> 1283218480.408 TRACE 000000 IN: len=2, actuallen=2, tag=4, flags=3, OK
> > > >> 1283218480.408 DEBUG 000000 Fin flag set
> > > >> 1283218480.508 DEBUG 000000 Checking jobs status (1 active)
> > > >> 1283218480.508 DEBUG 000000 1283218479990 Checking pid 17949
> > > >> 1283218480.508 DEBUG 000000 1283218479990 Child process 17949 terminated. Status is 254.
> > > >>
> > > >> - Mike
> > > >>
> > > >> ----- "Mihael Hategan" wrote:
> > > >>
> > > >>> On Mon, 2010-08-30 at 19:26 -0600, wilde at mcs.anl.gov wrote:
> > > >>>> I turned on the TRACE output level in worker.pl.
> > > >>>> I need to dig deeper but it looks to me that the pathnames it's trying to
> > > >>>> fetch are getting mangled/confused with the file:// portion of the URI:
> > > >>>>
> > > >>>> org.globus.cog.karajan.workflow.service.ProtocolException: java.io.FileNotFoundException: /autonfs/home/wilde/./file:/localhost/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging (No such file or directory)
> > > >>>>
> > > >>>> The file "/home/wilde/swift/rev/trunk/bin/../libexec/_swiftwrap.staging" does
> > > >>>> exist on the client side.
> > > >>>
> > > >>> Seems to. I gather "file" is broken.
> > > >>>
> > > >>> Can you try "proxy", and see if it fails? If not, I'll know a bit
> > > >>> better where to look.
> > > >>>
> > > >>> Mihael
> > > >>
> > > >> --
> > > >> Michael Wilde
> > > >> Computation Institute, University of Chicago
> > > >> Mathematics and Computer Science Division
> > > >> Argonne National Laboratory
> > >
> > > --
> > > Justin M Wozniak
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From wilde at mcs.anl.gov Sat Sep 4 06:51:48 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Sat, 4 Sep 2010 05:51:48 -0600 (GMT-06:00)
Subject: [Swift-devel] Provider staging is failing
In-Reply-To: <1874569087.371561283600831113.JavaMail.root@zimbra.anl.gov>
Message-ID: <1246035099.371581283601108918.JavaMail.root@zimbra.anl.gov>

Thanks, Mihael - I will test.

Can you clarify how to use provider staging?
My understanding is:

- set use.provider.staging=true in swift.properties

  (this turns it on for all sites as far as I know; that's not desirable
  moving forward, but fine for now for testing. Perhaps it should be
  controlled solely by the stagingMethod tag in sites.xml?)

- set stagingMethod in sites.xml. At the moment, the file method is
  broken and the proxy method works.

- set workdirectory in sites.xml to a *node local* directory

- element in sites.xml is ignored (as far as I could tell from tests),
  so there is no way to specify that the jobdirectory should be local.
  Instead, one does this by setting the workdirectory to a node-local
  dir.

- the element in sites.xml is ignored

Is that correct, and is anything else needed to make provider staging
work correctly?

- Mike

----- "Mihael Hategan" wrote:

> This should be fixed in trunk (swift r3600).
>
> Two issues were addressed:
> 1. don't bother with creating input file dirs in the wrapper since the
> stage-in process, which happens before the wrapper gets invoked, should
> take care of that. If it does not, then the model is broken to begin
> with.
> 2. don't add empty directories to the list
>
> Please test.
>
> Mihael
>
> On Mon, 2010-08-30 at 22:52 -0600, wilde at mcs.anl.gov wrote:
> > Nope - I was wrong again. The "-d |outdir" form has been generated
> > all along. The problem was that this causes a mkdir -p in
> > _swiftwrap.staging to be invoked with a null value. This was obscured
> > in _swiftwrap, which had a jobdir in front of the null input dir, and
> > was thus silently ignored by mkdir -p.
> >
> > I committed a fix (skip mkdir if dir is null), but please keep an
> > eye on _swiftwrap.staging in case it causes other issues.
> >
> > There was also a typo in a var $STDER -> STDERR.
> >
> > - Mike
> >
> > ----- wilde at mcs.anl.gov wrote:
> >
> > > ----- "Justin M Wozniak" wrote:
> > >
> > > > I think that's ok.
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Sat Sep 4 11:30:14 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 04 Sep 2010 09:30:14 -0700
Subject: [Swift-devel] Provider staging is failing
In-Reply-To: <1246035099.371581283601108918.JavaMail.root@zimbra.anl.gov>
References: <1246035099.371581283601108918.JavaMail.root@zimbra.anl.gov>
Message-ID: <1283617814.16820.5.camel@blabla2.none>

On Sat, 2010-09-04 at 05:51 -0600, wilde at mcs.anl.gov wrote:
> Thanks, Mihael - I will test.
>
> Can you clarify how to use provider staging? My understanding is:
>
> - set use.provider.staging=true in swift.properties

Yes.

>
> (this turns it on for all sites as far as I know; that's not desirable
> moving forward, but fine for now for testing. Perhaps it should be
> controlled solely by the stagingMethod tag in sites.xml?)

The problem was that vdl-int and _swiftwrap were getting too messy.

>
> - set stagingMethod in sites.xml. At the moment, the file method is
> broken and proxy method works.

Not for me. We need to double-check and see what the problem is there.

>
> - set workdirectory in sites.xml to a *node local* directory

Yes.

>
> - element in sites.xml is ignored (as far as I could tell
> from tests), so there is no way to specify that the jobdirectory
> should be local. Instead, one does this by setting the workdirectory
> to a node-local dir.

Right. Traditionally was used for the shared directory on a site.
Since there is no shared directory with provider staging, workdirectory
is meant to indicate where app data should be when the app is running.

>
> - the element in sites.xml is ignored

Yes.

>
> Is that correct, and is anything else needed to make provider staging
> work correctly?

That's it as far as I can tell.
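
For reference, the settings confirmed in the exchange above might be sketched as the sites.xml fragment below. This is only a sketch under assumptions: the pool handle, the execution element and the path are placeholders, and the profile syntax shown for stagingMethod is assumed from how other swift-namespace profile keys are usually written. Only the names use.provider.staging, stagingMethod and workdirectory come from this thread.

```xml
<!-- Sketch only. swift.properties would also need: use.provider.staging=true -->
<pool handle="examplesite">  <!-- placeholder handle -->
  <!-- placeholder execution settings; use whatever the site already has -->
  <execution provider="coaster" jobmanager="local:local" url="localhost"/>
  <!-- assumed syntax; "proxy" is the method Mike reports as working -->
  <profile namespace="swift" key="stagingMethod">proxy</profile>
  <!-- with provider staging this should be a node-local directory -->
  <workdirectory>/tmp/swiftwork</workdirectory>
</pool>
```

With provider staging there is no shared site directory, so workdirectory here only says where app data lives while the app runs.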
Mihael

From wilde at mcs.anl.gov Sun Sep 5 10:44:11 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Sun, 5 Sep 2010 09:44:11 -0600 (GMT-06:00)
Subject: [Swift-devel] A new use case for the persistent swift server
In-Reply-To: <908372634.381591283701281790.JavaMail.root@zimbra.anl.gov>
Message-ID: <994083981.381641283701451222.JavaMail.root@zimbra.anl.gov>

Mihael, as the Swift R interface progresses I realize we have another
use case to consider for the persistent Swift service:

- run the same script repeatedly, from different directories (perhaps
  with different arguments)

- since the script doesn't change, we can run a pre-compiled version:
  no need to regenerate the myscript.xml and myscript.kml files.

Can this be readily handled by the code you have underway to do
multiple script executions from a single Swift invocation?

For the R code, the script (for now) is always the same simple foreach
loop that executes an R list apply() function in parallel:

swiftapply.swift:

type file;
type RFile;

app (RFile result, file stout, file sterr) RunR (RFile rcall)
{
  RunR @rcall @result stdout=@stout stderr=@sterr;
}

RFile rcalls[] ;
RFile results[] ;
file stout[] ;
file sterr[] ;

foreach c, i in rcalls {
  (results[i],stout[i], sterr[i]) = RunR(c);
}

- Mike

From hategan at mcs.anl.gov Sun Sep 5 10:56:24 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 05 Sep 2010 08:56:24 -0700
Subject: [Swift-devel] Re: A new use case for the persistent swift server
In-Reply-To: <994083981.381641283701451222.JavaMail.root@zimbra.anl.gov>
References: <994083981.381641283701451222.JavaMail.root@zimbra.anl.gov>
Message-ID: <1283702184.21550.7.camel@blabla2.none>

On Sun, 2010-09-05 at 09:44 -0600, wilde at mcs.anl.gov wrote:
> Mihael, as the Swift R interface progresses I realize we have another use case to consider for the persistent Swift service:
>
> - run the same script repeatedly, from different directories (perhaps with different arguments)
>
> - since the script
> doesn't change, we can run a pre-compiled version: no need to
> regenerate the myscript.xml and myscript.kml files.

They are not typically re-generated unless either the .swift file
changes or swift is re-compiled.

>
> Can this be readily handled by the code you have underway to do multiple script executions from a single Swift invocation?

I didn't have code underway. I just said it was not difficult to write.

So, bounty:
- prize is 2 shiny virtual cookies (they are very good).
- write a simple java service (plenty of examples on the tubes) that
  can read a command line from a client, launch Loader.main() with the
  respective arguments, and send a message back when Loader.main()
  completes or fails.
- write a simple client (either perl or using netcat) that can connect
  to said service, pass its ARGV to it and wait for the completion
  message then exit.

Mihael

From hategan at mcs.anl.gov Sun Sep 5 11:03:09 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 05 Sep 2010 09:03:09 -0700
Subject: [Swift-devel] Re: A new use case for the persistent swift server
In-Reply-To: <1283702184.21550.7.camel@blabla2.none>
References: <994083981.381641283701451222.JavaMail.root@zimbra.anl.gov> <1283702184.21550.7.camel@blabla2.none>
Message-ID: <1283702589.21550.10.camel@blabla2.none>

On Sun, 2010-09-05 at 08:56 -0700, Mihael Hategan wrote:
> >
> > Can this be readily handled by the code you have underway to do multiple script executions from a single Swift invocation?
>
> I didn't have code underway. I just said it was not difficult to write.

Sorry there. I now remember I mentioned such code to you, but I didn't
make the connection. What I had was along the lines of a swift shell.
Not exactly the same in that there is no network involved.

I think it would be better and simpler to have the service.
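
The request/reply exchange described in the bounty could be sketched roughly as below. This is not the Java service the bounty asks for: it is a Python stand-in meant only to show the shape of the protocol. The function run_script is a hypothetical placeholder for launching Loader.main(), and the wire format (one tab-separated line in, one status line out) is an assumption, not something specified in this thread.

```python
import socket
import socketserver
import threading


def run_script(argv):
    """Hypothetical stand-in for launching Loader.main() with argv."""
    if not argv:
        raise ValueError("no script given")
    # A real service would run the Swift script here.


class RunHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Assumed protocol: one line of tab-separated arguments in,
        # one status line ("DONE" or "FAILED: ...") out.
        line = self.rfile.readline().decode("utf-8").rstrip("\n")
        argv = line.split("\t") if line else []
        try:
            run_script(argv)
            reply = "DONE"
        except Exception as exc:
            reply = "FAILED: %s" % exc
        self.wfile.write((reply + "\n").encode("utf-8"))


def start_service(host="127.0.0.1", port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), RunHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def run_remote(address, argv):
    """Client side: send argv, wait for the completion message, return it."""
    with socket.create_connection(address) as conn:
        conn.sendall(("\t".join(argv) + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as replies:
            return replies.readline().rstrip("\n")
```

A caller would use start_service() once and then run_remote(server.server_address, [...]) per script run. Arguments containing tabs or newlines would need a sturdier encoding than this sketch uses.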
Mihael

From hategan at mcs.anl.gov Tue Sep 7 00:31:35 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Sep 2010 22:31:35 -0700
Subject: [Swift-devel] Coaster persistent service issues - logs
In-Reply-To: <543522540.97321283274861849.JavaMail.root@zimbra.anl.gov>
References: <543522540.97321283274861849.JavaMail.root@zimbra.anl.gov>
Message-ID: <1283837495.8961.3.camel@blabla2.none>

-nosec should be fixed in cog r2879. That means that the service won't
complain about missing credentials, and swift should be able to submit
to it as long as you say http:// or tcp:// in the url.

I removed some small piece of code whose exact purpose was to provide
some default that was interfering with some of the other logic. Long
story short, please also test the normal coasters to ensure that there
are no unintended consequences there.

Mihael

On Tue, 2010-08-31 at 11:14 -0600, Michael Wilde wrote:
> ----- Forwarded Message -----
> From: "David Kelly"
> To: "Michael Wilde"
> Cc: "Jonathan Monette", "Justin Wozniak", "Mihael Hategan"
> Sent: Sunday, August 29, 2010 10:32:07 AM GMT -06:00 US/Canada Central
> Subject: Re: change skype call time today - and some to-do notes
>
> Hello all,
>
> A few things I've noticed while trying out various coaster configurations this weekend:
>
> Had similar problems with the -nosec option. Here is the output I got:
>
> davidk at churn:~/cog/modules/swift/dist/swift-svn/bin$ coaster-service -nosec
> Error loading credential: [JGLOBUS-10] Expired credentials (DC=org,DC=doegrids,OU=People,CN=David Kelly 16830,CN=753950975).
> Error loading credential
> org.globus.gsi.GlobusCredentialException: [JGLOBUS-10] Expired credentials (DC=org,DC=doegrids,OU=People,CN=David Kelly 16830,CN=753950975).
>     at org.globus.gsi.GlobusCredential.verify(GlobusCredential.java:321)
>     at org.globus.gsi.GlobusCredential.reloadDefaultCredential(GlobusCredential.java:593)
>     at org.globus.gsi.GlobusCredential.getDefaultCredential(GlobusCredential.java:575)
>     at org.globus.cog.abstraction.coaster.service.CoasterPersistentService.main(CoasterPersistentService.java:73)
>
> I tested multiple connections when using a coasters-persistent+active mode. That seemed to have worked fine, with each new swift connection waiting for the previous to finish. I noticed there were some java exceptions in the log files:
>
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>     at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
>     at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
> Channel IOException
> java.net.SocketException: Broken pipe
>     at java.net.SocketOutputStream.socketWrite0(Native Method)
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
>     at org.globus.gsi.gssapi.net.impl.GSIGssOutputStream.writeToken(GSIGssOutputStream.java:61)
>     at org.globus.gsi.gssapi.net.impl.GSIGssOutputStream.flush(GSIGssOutputStream.java:45)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.send(AbstractStreamKarajanChannel.java:298)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Sender.run(AbstractStreamKarajanChannel.java:247)
>
> Not sure if this is important or not, but I will include the logs.
>
> I can't quite get coasters-persistent working in passive mode. I am not sure if this is a configuration issue, a swift issue, or operator error.
> Here is what I am trying to do:
>
> sites.xml:
>
> passive
> 1
> 3500
> 1
> 1
> 1
>
> .31
> 10000
> /home/davidk/swiftwork/churn
>
> I run grid-proxy-init on the submit host (login*.mcs.anl.gov) and on the remote host (churn.mcs.anl.gov). From churn I run coaster-service. When I run the catsn.swift script on login, I notice these kinds of errors in the coaster-service output:
>
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
> GSSSChannel-null(1)[7990655: {}] REQ: Handler(HEARTBEAT)
>
> These messages seem to repeat several times per second and never stop. The script never finishes. The configurations and log files are attached.
>
> Once I can get this configuration working manually, I will start working on a script to automate this process for multiple hosts to make things a little easier.
>
> David

From wilde at mcs.anl.gov Thu Sep 16 13:14:11 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Thu, 16 Sep 2010 12:14:11 -0600 (GMT-06:00)
Subject: [Swift-devel] Swift hanging in complex iterate script
In-Reply-To: <883939304.24281284660644754.JavaMail.root@zimbra.anl.gov>
Message-ID: <224671941.24551284660851139.JavaMail.root@zimbra.anl.gov>

Mihael,

I've developed a Swift script that loops using iterate, reading
requests to process an R function from a named pipe (fifo), calling R,
and replying "done" on a response fifo.

This has been working very well, but I just hit a case where the
script hangs.

I exercise it using a small battery of R tests; I was manually
restarting the test battery (which does hundreds of R calls in 30
seconds or so) when it hung in the middle of the test suite.

As far as I can tell it hung after receiving a work request and mapping
the files for the work request, but it never called the app() function
that invokes R.

The log is in ~wilde/rserver-20100916-1159-y94hftt0.log
(I will try to post the script and all related files, but it looks
like bridled may have just gone down for patches)

Look for these trace lines in the log, which are issued at the start of
every R request:

line 37342:

2010-09-16 12:03:24,212-0500 INFO vdl:execute END_SUCCESS thread=0-1-86-4 tr=bash
2010-09-16 12:03:24,213-0500 INFO apply STARTCOMPOUND thread=0-1-87-2 name=apply
2010-09-16 12:03:24,215-0500 WARN trace SwiftScript trace: rserver: got dir, /autonfs/home/wilde/SwiftR/SwiftR.run.233
2010-09-16 12:03:24,215-0500 INFO SetFieldValue Set: done=false

The END_SUCCESS is the completion of the last app() in the prior
iterate pass, which signals the response ("done") fifo using a shell
script.
The trace says it's starting to process the next R request, #233 (randomly assigned).

After mapping 20 files (for 5 R datasets containing 2 R evaluation requests each), it just hangs, and all I see in the log after that point is coaster heartbeats.

The last request prior to this hanging request is in the log at line 37137:

2010-09-16 12:03:24,060-0500 INFO vdl:execute END_SUCCESS thread=0-1-85-4 tr=bash
2010-09-16 12:03:24,062-0500 INFO apply STARTCOMPOUND thread=0-1-86-2 name=apply
2010-09-16 12:03:24,062-0500 WARN trace SwiftScript trace: rserver: got dir, /autonfs/home/wilde/SwiftR/SwiftR.run.174
2010-09-16 12:03:24,062-0500 INFO SetFieldValue Set: done=false

R request #174 (and all prior ones) completed fine, and should illustrate the normal processing sequence.

Any ideas on what to look for regarding the cause of the hang?

I will try to reproduce it and try to get a karajan status trace using swift stdin.

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Thu Sep 16 13:50:00 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 16 Sep 2010 11:50:00 -0700
Subject: [Swift-devel] Re: Swift hanging in complex iterate script
In-Reply-To: <224671941.24551284660851139.JavaMail.root@zimbra.anl.gov>
References: <224671941.24551284660851139.JavaMail.root@zimbra.anl.gov>
Message-ID: <1284663000.18308.0.camel@blabla2.none>

I can't tell what's causing the problem, but it may generally be a good idea to do a jstack -l when you get a hang.

Mihael

From wilde at mcs.anl.gov Thu Sep 16 14:07:49 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 16 Sep 2010 13:07:49 -0600 (GMT-06:00)
Subject: [Swift-devel] Re: Swift hanging in complex iterate script
In-Reply-To: <1284663000.18308.0.camel@blabla2.none>
Message-ID: <831386786.28901284664069772.JavaMail.root@zimbra.anl.gov>

OK, that's in ~wilde/swiftrhang/jstack.out

The JVM is still running: pid 22435 on bridled. Let me know if you need any other traces from it.

It hung around 12:05.

- Mike

bri$ mp
UID        PID  PPID  PGID   SID  C STIME TTY      TIME     CMD
wilde    19025 19023 19023 19023  0 11:49 ?        00:00:00 sshd: wilde at pts/29
wilde    19026 19025 19026 19026  0 11:49 pts/29   00:00:00 -bash
wilde    19203 19026 19203 19026  0 11:49 pts/29   00:00:00 /usr/bin/screen -x
wilde    15373 15371 15371 15371  0 06:58 ?        00:00:00 sshd: wilde at pts/14
wilde    15374 15373 15374 15374  0 06:58 pts/14   00:00:00 -bash
wilde    18291 15374 18291 15374  0 11:47 pts/14   00:00:00 /home/wilde/R/R-2.11.0/bin/exec/R
wilde    11315 15374 11315 15374  0 13:59 pts/14   00:00:00 /usr/bin/screen -x
wilde    15183 15181 15181 15181  0 06:58 ?        00:00:00 sshd: wilde at pts/11
wilde    15184 15183 15184 15184  0 06:58 pts/11   00:00:00 -bash
wilde    15835 15184 15835 15184  0 06:59 pts/11   00:00:00 /usr/bin/screen
wilde    15836 15835 15836 15836  0 06:59 ?        00:00:00 /usr/bin/SCREEN
wilde    15837 15836 15837 15837  0 06:59 pts/15   00:00:00 bash
wilde    15839 15836 15839 15839  0 06:59 pts/16   00:00:00 bash
wilde    15841 15836 15841 15841  0 06:59 pts/19   00:00:00 bash
wilde    15843 15836 15843 15843  0 06:59 pts/20   00:00:00 bash
wilde    15845 15836 15845 15845  0 06:59 pts/21   00:00:00 bash
wilde    15847 15836 15847 15847  0 06:59 pts/22   00:00:00 bash
wilde    14845 15847 14845 15847  0 10:05 pts/22   00:00:00 emacs Swift/exec/start-swift-workers
wilde    22364 15847 22364 15847  0 11:59 pts/22   00:00:00 /bin/bash Swift/exec/RunRServer.sh
wilde    22378 22364 22364 15847  0 11:59 pts/22   00:00:00 /bin/sh /home/wilde/swift/rev/trunk/bin/swift -config swift.pr
wilde    22435 22378 22364 15847  0 11:59 pts/22   00:00:29 java -Xmx256M -Djava.endorsed.dirs=/home/wilde/swift/rev/tru
wilde    12506 15847 12506 15847  0 14:03 pts/22   00:00:00 ps -fjH -u wilde
wilde    22930     1 22364 15847  0 11:59 pts/22   00:00:05 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
wilde    22926     1 22364 15847  0 11:59 pts/22   00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
wilde    22920     1 22364 15847  0 11:59 pts/22   00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
wilde    22760     1 22364 15847  0 11:59 pts/22   00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
bri$

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Thu Sep 16 14:11:58 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 16 Sep 2010 12:11:58 -0700
Subject: [Swift-devel] Re: Swift hanging in complex iterate script
In-Reply-To: <831386786.28901284664069772.JavaMail.root@zimbra.anl.gov>
References: <831386786.28901284664069772.JavaMail.root@zimbra.anl.gov>
Message-ID: <1284664318.18500.0.camel@blabla2.none>

Beautiful. Deadlock right there. I'll see what I can do.

From wilde at mcs.anl.gov Thu Sep 16 14:13:37 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 16 Sep 2010 13:13:37 -0600 (GMT-06:00)
Subject: [Swift-devel] Re: Swift hanging in complex iterate script
In-Reply-To: <831386786.28901284664069772.JavaMail.root@zimbra.anl.gov>
Message-ID: <539130626.29311284664417734.JavaMail.root@zimbra.anl.gov>

I see this suspicious deadlock in that jstack output:

Found one Java-level deadlock:
=============================
"pool-1-thread-4":
  waiting to lock monitor 0x0000000052b88650 (object 0x00002aaab5542348, a org.griphyn.vdl.mapping.RootDataNode),
  which is held by "pool-1-thread-2"
"pool-1-thread-2":
  waiting to lock monitor 0x0000000052b879d8 (object 0x00002aaab5542540, a org.griphyn.vdl.mapping.RootArrayDataNode),
  which is held by "pool-1-thread-4"

Java stack information for the threads listed above:
===================================================
...

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Thu Sep 16 14:17:13 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 16 Sep 2010 12:17:13 -0700
Subject: [Swift-devel] Re: Swift hanging in complex iterate script
In-Reply-To: <539130626.29311284664417734.JavaMail.root@zimbra.anl.gov>
References: <539130626.29311284664417734.JavaMail.root@zimbra.anl.gov>
Message-ID: <1284664633.18674.0.camel@blabla2.none>

That's the one.
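The jstack report in this thread shows the classic lock-ordering pattern: each of two threads holds one monitor and waits for the monitor held by the other. The sketch below reproduces that pattern in Python, purely as an illustration. The lock names "root" and "array" are stand-ins for the RootDataNode and RootArrayDataNode monitors, the thread names and timeout are made up, and nothing here reflects how the Swift code takes these locks or how it was eventually fixed.

```python
import threading


def run_pair(order_a, order_b, timeout=0.3):
    """Run two threads that each take two locks in the given order.

    Returns the sorted names of the threads that could not get their
    second lock within `timeout` seconds (an empty list means no deadlock).
    """
    # Stand-ins for the two monitors named in the jstack report.
    locks = {"root": threading.Lock(), "array": threading.Lock()}
    holding = {"A": threading.Event(), "B": threading.Event()}
    stuck = []

    def worker(name, peer, order):
        first, second = locks[order[0]], locks[order[1]]
        with first:
            holding[name].set()
            # Give the peer a chance to grab its own first lock.
            holding[peer].wait(timeout)
            # A Java monitor would block forever here; the timeout lets
            # this demo report the problem instead of hanging.
            if second.acquire(timeout=timeout):
                second.release()
            else:
                stuck.append(name)

    threads = [
        threading.Thread(target=worker, args=("A", "B", order_a)),
        threading.Thread(target=worker, args=("B", "A", order_b)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(stuck)


if __name__ == "__main__":
    print("opposite order:", run_pair(("root", "array"), ("array", "root")))
    print("same order:    ", run_pair(("root", "array"), ("root", "array")))
```

With opposite orders at least one thread reports that it could not get its second lock; with the same order in both threads neither does. Taking locks in one agreed order everywhere is the usual general remedy for this kind of hang.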

From hategan at mcs.anl.gov Thu Sep 16 14:22:38 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 16 Sep 2010 12:22:38 -0700
Subject: [Swift-devel] Re: Swift hanging in complex iterate script
In-Reply-To: <539130626.29311284664417734.JavaMail.root@zimbra.anl.gov>
References: <539130626.29311284664417734.JavaMail.root@zimbra.anl.gov>
Message-ID: <1284664958.18780.0.camel@blabla2.none>

Can you point me to the swift script?

Mihael

From wilde at mcs.anl.gov Thu Sep 16 14:26:57 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Thu, 16 Sep 2010 13:26:57 -0600 (GMT-06:00)
Subject: [Swift-devel] Re: Swift hanging in complex iterate script
In-Reply-To: <403635151.30261284665153784.JavaMail.root@zimbra.anl.gov>
Message-ID: <2021927160.30321284665217575.JavaMail.root@zimbra.anl.gov>

~wilde/swiftrhang/rserver.swift

sites.xml, tc, and properties are in the same dir.

Launched swift from ~wilde/SwiftR/Swift/exec/RunRServer.sh:

swift -config swift.properties -tc.file tc -sites.file sites.xml $script \
  >& swift.stdouterr

- Mike

----- "Mihael Hategan" wrote:

> Can you point me to the swift script?
>
> Mihael

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Thu Sep 16 15:37:09 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 16 Sep 2010 13:37:09 -0700
Subject: [Swift-devel] Re: Swift hanging in complex iterate script
In-Reply-To: <2021927160.30321284665217575.JavaMail.root@zimbra.anl.gov>
References: <2021927160.30321284665217575.JavaMail.root@zimbra.anl.gov>
Message-ID: <1284669429.19374.11.camel@blabla2.none>

Well, so concurrency and all, it's a hairy issue.

We do seem to synchronize on stuff liberally. I tried to remove what I
thought were some unnecessary synchronizations (and I actually had these
removed in my local copy a while ago - and I also think they are in the
fast branch).

These are committed in swift r3628.

But reduced synchronizations may lead to other bad things, and while I
tried to avoid that, it is concurrency we're talking about.

Mihael

On Thu, 2010-09-16 at 13:26 -0600, wilde at mcs.anl.gov wrote:
> ~wilde/swiftrhang/rserver.swift
>
> sites.xml, tc, and properties are in the same dir.
>
> Launched swift from ~wilde/SwiftR/Swift/exec/RunRServer.sh:
>
> swift -config swift.properties -tc.file tc -sites.file sites.xml $script \
>   >& swift.stdouterr
>
> - Mike
>
> ----- "Mihael Hategan" wrote:
>
> > Can you point me to the swift script?
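
The jstack report quoted in this thread has the usual lock-ordering shape: one thread holds the RootDataNode monitor and waits for the RootArrayDataNode monitor, while the other thread holds them the other way around. The following is a minimal Java sketch of that pattern, assuming two plain Object monitors as stand-ins for the two nodes. The class name LockOrderDeadlock, its methods, and the thread names are hypothetical, and none of this is Swift's actual code. It also polls ThreadMXBean.findMonitorDeadlockedThreads(), which is roughly the in-process counterpart of the check jstack performs from outside the JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderDeadlock {

    // Starts a daemon thread that locks `first`, waits until the other
    // thread also holds its own first lock, and then tries to lock `second`.
    static Thread lockInOrder(String name, Object first, Object second,
                              CountDownLatch bothHold) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHold.countDown();
                try {
                    bothHold.await(); // both threads now hold one monitor each
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Not reached: the other thread owns `second` and is
                    // itself blocked waiting for `first`.
                }
            }
        }, name);
        t.setDaemon(true); // the JVM can still exit while the threads stay stuck
        t.start();
        return t;
    }

    // Creates the opposite-order pair and returns how many of the two
    // threads the JVM reports as monitor-deadlocked before the timeout.
    public static int runDemo(long timeoutMillis) throws InterruptedException {
        Object dataNode = new Object();  // stand-in for the RootDataNode monitor
        Object arrayNode = new Object(); // stand-in for the RootArrayDataNode monitor
        CountDownLatch bothHold = new CountDownLatch(2);
        Thread a = lockInOrder("worker-a", dataNode, arrayNode, bothHold);
        Thread b = lockInOrder("worker-b", arrayNode, dataNode, bothHold);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int found = 0;
        while (found < 2 && System.currentTimeMillis() < deadline) {
            found = 0;
            long[] ids = mx.findMonitorDeadlockedThreads(); // null if none
            if (ids != null) {
                for (long id : ids) {
                    if (id == a.getId() || id == b.getId()) {
                        found++;
                    }
                }
            }
            if (found < 2) {
                Thread.sleep(50);
            }
        }
        return found;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + runDemo(5000));
    }
}
```

Taking both monitors in one agreed order in every thread would remove the cycle in this sketch. Whether that, or dropping some synchronization as in r3628, is the right fix for Swift depends on code not shown in this thread.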

From wilde at mcs.anl.gov Thu Sep 16 20:09:48 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 16 Sep 2010 19:09:48 -0600 (GMT-06:00)
Subject: [Swift-devel] Re: Swift hanging in complex iterate script
In-Reply-To: <1284669429.19374.11.camel@blabla2.none>
Message-ID: <1925807685.43161284685788637.JavaMail.root@zimbra.anl.gov>

With r3628, I've run 100 passes of the R tests successfully. Before, it
was hanging in fewer than 10 passes.

- Mike

----- "Mihael Hategan" wrote:

> Well, so concurrency and all, it's a hairy issue.
>
> We do seem to synchronize on stuff liberally. I tried to remove what I
> thought were some unnecessary synchronizations (and I actually had these
> removed in my local copy a while ago - and I also think they are in the
> fast branch).
>
> These are committed in swift r3628.
>
> But reduced synchronizations may lead to other bad things, and while I
> tried to avoid that, it is concurrency we're talking about.
>
> Mihael

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From aespinosa at cs.uchicago.edu Mon Sep 20 15:27:36 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 20 Sep 2010 15:27:36 -0500
Subject: [Swift-devel] persistent coaster service
In-Reply-To: <1282860325.16378.1.camel@blabla2.none>
References: <1082193619.1353181282856421112.JavaMail.root@zimbra.anl.gov> <1282860325.16378.1.camel@blabla2.none>
Message-ID:

If the workers have a head node different from the coaster service,
then we only have option (b) right?

-Allan

2010/8/26 Mihael Hategan :
>> 2) Following up on Allan's last question, can you clarify:
>>
>> When you start a persistent coaster service do you have the option of either:
>>
>> (a) the Swift client starts workers per the sites.xml profile settings or
>>
>> (b) the user starts the workers manually, connecting to the persistent server, when you specify workerManager passive in the Globus profile tag:
>>
>> <profile namespace="globus" key="workerManager">passive</profile>

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From hategan at mcs.anl.gov Mon Sep 20 18:23:51 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 20 Sep 2010 16:23:51 -0700
Subject: [Swift-devel] persistent coaster service
In-Reply-To:
References: <1082193619.1353181282856421112.JavaMail.root@zimbra.anl.gov> <1282860325.16378.1.camel@blabla2.none>
Message-ID: <1285025031.9528.4.camel@blabla2.none>

On Mon, 2010-09-20 at 15:27 -0500, Allan Espinosa wrote:
> If the workers have a head node different from the coaster service,
> then we only have option (b) right?

Yes and no. The coaster service cannot right now start workers on a
different machine, but I don't see why that couldn't be possible in
theory (and therefore in some future version of the code). As usual,
virtual cookies for the actual code.

Mihael

>
> -Allan
>
> 2010/8/26 Mihael Hategan :
>
> >> 2) Following up on Allan's last question, can you clarify:
> >>
> >> When you start a persistent coaster service do you have the option of either:
> >>
> >> (a) the Swift client starts workers per the sites.xml profile settings or
> >>
> >> (b) the user starts the workers manually, connecting to the persistent server, when you specify workerManager passive in the Globus profile tag:
> >>
> >> <profile namespace="globus" key="workerManager">passive</profile>
>
>

From wilde at mcs.anl.gov Tue Sep 21 07:18:46 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 21 Sep 2010 06:18:46 -0600 (GMT-06:00)
Subject: [Swift-devel] Issues with provider staging
Message-ID: <1189789838.1161285071526433.JavaMail.root@zimbra.anl.gov>

In testing yesterday with provider staging and coasters, I noticed two
anomalies:

- my scripts would pause for a while - about 50 seconds, and then run,
  then hang
- I suspect a problem in transferring back zero-length result files from
  the remote site

Has anyone seen similar issues?

I will try to reproduce the problems and get logs, and sort through the
logs I already have.

- Mike

From iraicu at cs.uchicago.edu Tue Sep 21 11:49:57 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 21 Sep 2010 11:49:57 -0500
Subject: [Swift-devel] CFP: Workshop on Data Intensive Computing in the Clouds (DataCloud) 2011, co-located with IEEE IPDPS 2011
Message-ID: <4C98E235.70907@cs.uchicago.edu>

---------------------------------------------------------------------------------
*** Call for Papers ***
WORKSHOP ON DATA INTENSIVE COMPUTING IN THE CLOUDS (DATACLOUD 2011)
In conjunction with IPDPS 2011, May 16, Anchorage, Alaska
http://www.cct.lsu.edu/~kosar/DataCloud2011
---------------------------------------------------------------------------------

The First International Workshop on Data Intensive Computing in the
Clouds (DataCloud2011) will be held in conjunction with the 25th IEEE
International Parallel and Distributed Processing Symposium (IPDPS 2011),
in Anchorage, Alaska.

Applications and experiments in all areas of science are becoming
increasingly complex and more demanding in terms of their computational
and data requirements. Some applications generate data volumes reaching
hundreds of terabytes and even petabytes. As scientific applications
become more data intensive, the management of data resources and dataflow
between the storage and compute resources is becoming the main
bottleneck. Analyzing, visualizing, and disseminating these large data
sets has become a major challenge and data intensive computing is now
considered as the "fourth paradigm" in scientific discovery after
theoretical, experimental, and computational science.

DataCloud2011 will provide the scientific community a dedicated forum for
discussing new research, development, and deployment efforts in running
data-intensive computing workloads on Cloud Computing infrastructures.
The DataCloud2011 workshop will focus on the use of cloud-based technologies to meet the new data intensive scientific challenges that are not well served by the current supercomputers, grids or compute-intensive clouds. We believe the workshop will be an excellent place to help the community define the current state, determine future goals, and present architectures and services for future clouds supporting data intensive computing. TOPICS --------------------------------------------------------------------------------- - Data-intensive cloud computing applications, characteristics, challenges - Case studies of data intensive computing in the clouds - Performance evaluation of data clouds, data grids, and data centers - Energy-efficient data cloud design and management - Data placement, scheduling, and interoperability in the clouds - Accountability, QoS, and SLAs - Data privacy and protection in a public cloud environment - Distributed file systems for clouds - Data streaming and parallelization - New programming models for data-intensive cloud computing - Scalability issues in clouds - Social computing and massively social gaming - 3D Internet and implications - Future research challenges in data-intensive cloud computing IMPORTANT DATES --------------------------------------------------------------------------------- Abstract submission: December 1, 2010 Paper submission: December 8, 2010 Acceptance notification: January 7, 2011 Final papers due: February 1, 2011 PAPER SUBMISSION --------------------------------------------------------------------------------- DataCloud2011 invites authors to submit original and unpublished technical papers. All submissions will be peer-reviewed and judged on correctness, originality, technical strength, significance, quality of presentation, and relevance to the workshop topics of interest. 
Submitted papers may not have appeared in or be under consideration for another workshop, conference or a journal, nor may they be under review or submitted to another forum during the DataCloud2011 review process. Submitted papers may not exceed 10 single-spaced double-column pages using 10-point size font on 8.5x11 inch pages (IEEE conference style, document templates can be found at ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.pdf and ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.doc), including figures, tables, and references. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/DataCloud2011/ before the deadline of December 1st, 2010 at 11:59PM PST; the final 10 page papers in PDF format will be due on December 8th, 2010 at 11:59PM PST. WORKSHOP and PROGRAM CHAIRS --------------------------------------------------------------------------------- Tevfik Kosar, Louisiana State University Ioan Raicu, Illinois Institute of Technology STEERING COMMITTEE --------------------------------------------------------------------------------- Ian Foster, Univ of Chicago & Argonne National Lab Geoffrey Fox, Indiana University James Hamilton, Amazon Web Services Manish Parashar, Rutgers University & NSF Dan Reed, Microsoft Research Rich Wolski, University of California, Santa Barbara Liang-Jie Zhang, IBM Research PROGRAM COMMITTEE --------------------------------------------------------------------------------- David Abramson, Monash University, Australia Roger Barga, Microsoft Research John Bent, Los Alamos National Laboratory Umit Catalyurek, Ohio State University Abhishek Chandra, University of Minnesota Rong N. 
Chang, IBM Research Alok Choudhary, Northwestern University Brian Cooper, Google Ewa Deelman, University of Southern California Murat Demirbas, University at Buffalo Adriana Iamnitchi, University of South Florida Maria Indrawan, Monash University, Australia Alexandru Iosup, Delft University of Technology, Netherlands Peter Kacsuk, Hungarian Academy of Sciences, Hungary Dan Katz, University of Chicago Steven Ko, University at Buffalo Gregor von Laszewski, Rochester Institute of Technology Erwin Laure, CERN, Switzerland Ignacio Llorente, Universidad Complutense de Madrid, Spain Reagan Moore, University of North Carolina Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory Ian Taylor, Cardiff University, UK Douglas Thain, University of Notre Dame Bernard Traversat, Oracle Yong Zhao, Univ of Electronic Science & Tech of China -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email:iraicu at cs.iit.edu Web:http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= From wilde at mcs.anl.gov Thu Sep 23 18:54:09 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Thu, 23 Sep 2010 17:54:09 -0600 (GMT-06:00) Subject: [Swift-devel] Problems with file and sfs stagingMethods In-Reply-To: <1459040511.172421285285498481.JavaMail.root@zimbra.anl.gov> Message-ID: <1264402105.172511285286049599.JavaMail.root@zimbra.anl.gov> Hi Mihael, I've reproduced these problems I mentioned a week ago with provider staging. At the moment, only "proxy" mode works for me. 
On the CI net in /home/wilde/swift/lab/: All tests were done using run.sh which sets all the config files. Each test was a single cat of data.txt to outdir/f.0001.out Each run dir contains a coasters/ dir with the coaster worker log. run01: this run worked, with stagingMethod proxy run02: stagingMethod file, files. The remote side is opening a bad file name (the same one I showed you a week ago) with the file:// left in the *middle* of the path name. What gets returned as data to the worker is the Java exception and traceback text. run03: stagingMethod sfs. What happens here is (a) the sfs:// gets stripped off, but localhost// remains at the front of the source file name. When I strip that off, the next problem is that the destination directory has not been created. When I create that, the last problem is that the worker is referencing the relative input pathname "data.txt" from a different directory and without the necessary directory prefix. I explored this in dirs run04-run07 with mods to worker.pl that you can see in the coaster logs, which have been placed in each run dir below coasters/ using a symlink created in run.sh Can you fix these two stagingMethods? Or, if your tests are working, please point me to your config files, script, and swift revision. My swift is in: ~/swift/rev/trunk/bin/swift which is a symlink to the dist/ in my ~/swift/src/trunk Thanks, Mike From hategan at mcs.anl.gov Fri Sep 24 00:37:12 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 23 Sep 2010 22:37:12 -0700 Subject: [Swift-devel] Re: Problems with file and sfs stagingMethods In-Reply-To: <1264402105.172511285286049599.JavaMail.root@zimbra.anl.gov> References: <1264402105.172511285286049599.JavaMail.root@zimbra.anl.gov> Message-ID: <1285306632.23372.8.camel@blabla2.none> On Thu, 2010-09-23 at 17:54 -0600, wilde at mcs.anl.gov wrote: > [...] > Or, if your tests are working, please point me to your config files, script, and swift revision. My tests do work.
But regardless, the fact that some otherwise normal configuration can cause problems like that needs fixin'. Mihael From wilde at mcs.anl.gov Fri Sep 24 09:18:14 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Fri, 24 Sep 2010 08:18:14 -0600 (GMT-06:00) Subject: [Swift-devel] Re: Problems with file and sfs stagingMethods In-Reply-To: <1937982587.185751285337414091.JavaMail.root@zimbra.anl.gov> Message-ID: <109633049.186311285337894785.JavaMail.root@zimbra.anl.gov> I'd be surprised if the sfs method works as-is, as the code in worker.pl seemed insufficient to me, but it's certainly possible I've misunderstood how it's supposed to work. It's possible that in both cases, the formats of the file names that are passing through the code differ in your test case from my test case, possibly due to our sites.xml settings. I'm happy to help debug with you interactively if you have trouble reproducing the problem. I'll later try to reproduce the problem with a clean unmodified build. Thanks, Mike ----- "Mihael Hategan" wrote: > On Thu, 2010-09-23 at 17:54 -0600, wilde at mcs.anl.gov wrote: > > [...] > > Or, if your tests are working, please point me to your config files, > script, and swift revision. > > My tests do work. But regardless, the fact that some otherwise normal > configuration can cause problems like that needs fixin'.
> > Mihael -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From iraicu at cs.uchicago.edu Fri Sep 24 12:45:16 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 24 Sep 2010 12:45:16 -0500 Subject: [Swift-devel] CFP: Special Issue on Science-driven Cloud Computing, in the Scientific Programming Journal Message-ID: <4C9CE3AC.5000605@cs.uchicago.edu> Call for Papers --------------------------------------------------------------------------------- Scientific Programming Journal Special Issue on Science-driven Cloud Computing http://www.cs.iit.edu/~iraicu/SPJ_ScienceCloud_2011/ Overview --------------------------------------------------------------------------------- Cloud computing, first established in the business computing domain, is now a topic of research in computer science and an interesting execution platform for science applications. Today there are a number of commercial and science cloud deployments, including those provided by Amazon, Google, IBM, Microsoft, and others. Campus and national labs are also deploying their own cloud solutions. The ability to control the resources and the pay-as-you-go usage model enables new approaches to application development and resource provisioning. Science applications are looking towards the cloud to provide a stable and customizable execution environment. This special issue of the Scientific Programming Journal is dedicated to the computational challenges and opportunities of cloud computing. Topics --------------------------------------------------------------------------------- We invite the submission of original work that is related to the topics below.
Topics of interest include (in the context of Cloud Computing): * Scientific cloud applications * Novel programming models * High-performance computing * Many-task computing * Resource scheduling * Compute resource management * Resource provisioning and configuration (compute, data, and network) * Adaptive computing and resource usage * Power-aware use of cloud computing * Storage cloud architectures and implementations * Cloud scalability and elasticity * Performance Evaluations and Benchmarks * Quality of service and SLA management * Cloud heterogeneity * Charging models * Models, frameworks and systems for cloud security and privacy * Monitoring Paper Submission --------------------------------------------------------------------------------- Authors are encouraged to submit high quality, original work that has neither appeared in, nor is under consideration by other journals. The manuscript must follow the formatting instructions found at the Scientific Programming site at http://www.iospress.nl/html/10589244_ita.html. Papers should be not more than 25 pages of single column text using double spaced 10 point size on 8.5 x 11 inch pages and 1" margins (including all text, figures, and references). A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/SPJ_ScienceCloud_2011/ before the deadline of October 22nd, 2010 at 11:59PM PST; the final 25 page papers in PDF format will be due on October 29th, 2010 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the IOS Press. Notifications of the paper decisions will be sent out by December 1st, 2010. Accepted papers will be published by IOS Press without any fees to the authors.
Important dates --------------------------------------------------------------------------------- * Abstract Due: October 22nd, 2010 * Papers Due: October 29th, 2010 * Reviews Completed: December 1st, 2010 * Publication Date: Early 2011 Guest Editors: --------------------------------------------------------------------------------- Ivona Brandic, Vienna University of Technology, ivona at infosys.tuwien.ac.at Ewa Deelman, University of Southern California, deelman at isi.edu Ioan Raicu, Illinois Institute of Technology, iraicu at cs.iit.edu For more information on this special issue in Scientific Programming Journal, please visit http://www.cs.iit.edu/~iraicu/SPJ_ScienceCloud_2011/. -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= From dk0966 at cs.ship.edu Fri Sep 24 14:01:47 2010 From: dk0966 at cs.ship.edu (dk0966 at cs.ship.edu) Date: Fri, 24 Sep 2010 14:01:47 -0500 Subject: [Swift-devel] Swift usage summary Message-ID: <20100924190147.2B8008D00071@bridled.ci.uchicago.edu> An embedded and charset-unspecified text was scrubbed... Name: not available URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: usage.png Type: image/png Size: 10577 bytes Desc: not available URL: From iraicu at cs.uchicago.edu Fri Sep 24 14:07:47 2010 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 24 Sep 2010 14:07:47 -0500 Subject: [Swift-devel] Swift usage summary In-Reply-To: <20100924190147.2B8008D00071@bridled.ci.uchicago.edu> References: <20100924190147.2B8008D00071@bridled.ci.uchicago.edu> Message-ID: <4C9CF703.6010704@cs.uchicago.edu> What are the numbers? Is it the times Swift was invoked? Or the number of jobs Swift generated? Can you also plot the number of CPU hours used? Ioan -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor ================================================================= Computer Science Department Illinois Institute of Technology 10 W. 31st Street Stuart Building, Room 237D Chicago, IL 60616 ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ ================================================================= ================================================================= On 9/24/2010 2:01 PM, dk0966 at cs.ship.edu wrote: > Swift usage summary attached. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dk0966 at cs.ship.edu Fri Sep 24 14:49:21 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 24 Sep 2010 15:49:21 -0400 Subject: [Swift-devel] Swift usage summary In-Reply-To: <4C9CF703.6010704@cs.uchicago.edu> References: <20100924190147.2B8008D00071@bridled.ci.uchicago.edu> <4C9CF703.6010704@cs.uchicago.edu> Message-ID: It is the total number of times the Swift shell script was invoked by those who have a recent version of Swift from trunk and have not disabled usage tracking. Right now it does not track CPU hours. The closest thing would be the real start and stop time for each invocation. If you find that useful, I could add a graph which shows the total amount of real time it uses per day. David On Fri, Sep 24, 2010 at 3:07 PM, Ioan Raicu wrote: > What are the numbers? Is it the times Swift was invoked? Or the number of > jobs Swift generated? Can you also plot the number of CPU hours used? > > Ioan > > -- > ================================================================= > Ioan Raicu, Ph.D. > Assistant Professor > ================================================================= > Computer Science Department > Illinois Institute of Technology > 10 W. 31st Street > Stuart Building, Room 237D > Chicago, IL 60616 > ================================================================= > Cel: 1-847-722-0876 > Office: 1-312-567-5704 > Email: iraicu at cs.iit.edu > Web: http://www.cs.iit.edu/~iraicu/ > ================================================================= > ================================================================= > > > > On 9/24/2010 2:01 PM, dk0966 at cs.ship.edu wrote: > > Swift usage summary attached. > > > _______________________________________________ > Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From foster at anl.gov Fri Sep 24 17:56:03 2010 From: foster at anl.gov (Ian Foster) Date: Fri, 24 Sep 2010 17:56:03 -0500 Subject: [Swift-devel] Swift usage summary In-Reply-To: References: <20100924190147.2B8008D00071@bridled.ci.uchicago.edu> <4C9CF703.6010704@cs.uchicago.edu> Message-ID: David: This looks great. It would be good to have some information about the location where the script was invoked from (e.g., IP address) so we can analyze number of unique users. Information about number of nodes, tasks, data transfers, etc., would be neat too. Failure information can be interesting too. Ian. On Sep 24, 2010, at 2:49 PM, David Kelly wrote: > > It is the total number of times the Swift shell script was invoked by those who have a recent version of Swift from trunk and have not disabled usage tracking. > > Right now it does not track CPU hours. The closest thing would be the real start and stop time for each invocation. If you find that useful, I could add a graph which shows the total amount of real time it uses per day. > > David > > On Fri, Sep 24, 2010 at 3:07 PM, Ioan Raicu wrote: > What are the numbers? Is it the times Swift was invoked? Or the number of jobs Swift generated? Can you also plot the number of CPU hours used? > > Ioan > -- > ================================================================= > Ioan Raicu, Ph.D. > Assistant Professor > ================================================================= > Computer Science Department > Illinois Institute of Technology > 10 W. 31st Street > Stuart Building, Room 237D > Chicago, IL 60616 > ================================================================= > Cel: 1-847-722-0876 > Office: 1-312-567-5704 > Email: iraicu at cs.iit.edu > Web: http://www.cs.iit.edu/~iraicu/ > ================================================================= > ================================================================= > > > On 9/24/2010 2:01 PM, dk0966 at cs.ship.edu wrote: >> >> Swift usage summary attached. 
>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From dk0966 at cs.ship.edu Fri Sep 24 18:44:38 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 24 Sep 2010 19:44:38 -0400 Subject: [Swift-devel] Swift usage summary In-Reply-To: References: <20100924190147.2B8008D00071@bridled.ci.uchicago.edu> <4C9CF703.6010704@cs.uchicago.edu> Message-ID: IP addresses, return codes and anonymized user id's are among the data currently being collected. So far there have been 51 unique users from 35 different IP addresses. The only way to get this data right now is by manually connecting to the database. I will work on adding more of this information to the weekly email script. Right now all of the usage statistics are gathered outside of swift in a shell script, so it is somewhat limited. I am working on a way to get this integrated directly into swift, which should allow for more detail. David On Fri, Sep 24, 2010 at 6:56 PM, Ian Foster wrote: > David: > > This looks great. > > It would be good to have some information about the location where the > script was invoked from (e.g., IP address) so we can analyze number of > unique users. > > Information about number of nodes, tasks, data transfers, etc., would be > neat too. Failure information can be interesting too. > > Ian. > > > On Sep 24, 2010, at 2:49 PM, David Kelly wrote: > > > It is the total number of times the Swift shell script was invoked by those > who have a recent version of Swift from trunk and have not disabled usage > tracking. > > Right now it does not track CPU hours. 
The closest thing would be the real > start and stop time for each invocation. If you find that useful, I could > add a graph which shows the total amount of real time it uses per day. > > David > > On Fri, Sep 24, 2010 at 3:07 PM, Ioan Raicu wrote: > >> What are the numbers? Is it the times Swift was invoked? Or the number of >> jobs Swift generated? Can you also plot the number of CPU hours used? >> >> Ioan >> >> -- >> ================================================================= >> Ioan Raicu, Ph.D. >> Assistant Professor >> ================================================================= >> Computer Science Department >> Illinois Institute of Technology >> 10 W. 31st Street >> Stuart Building, Room 237D >> Chicago, IL 60616 >> ================================================================= >> Cel: 1-847-722-0876 >> Office: 1-312-567-5704 >> Email: iraicu at cs.iit.edu >> Web: http://www.cs.iit.edu/~iraicu/ >> ================================================================= >> ================================================================= >> >> >> >> On 9/24/2010 2:01 PM, dk0966 at cs.ship.edu wrote: >> >> Swift usage summary attached. >> >> >> _______________________________________________ >> Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From foster at anl.gov Fri Sep 24 18:50:31 2010 From: foster at anl.gov (Ian Foster) Date: Fri, 24 Sep 2010 18:50:31 -0500 Subject: [Swift-devel] Swift usage summary In-Reply-To: References: <20100924190147.2B8008D00071@bridled.ci.uchicago.edu> <4C9CF703.6010704@cs.uchicago.edu> Message-ID: <7225A43C-B374-437A-942F-8DD8BBB8271D@anl.gov> You're using Globus usage reporting, I assume?
There is lots of code people have written to generate various reports On Sep 24, 2010, at 6:44 PM, David Kelly wrote: > IP addresses, return codes and anonymized user id's are among the data currently being collected. So far there have been 51 unique users from 35 different IP addresses. The only way to get this data right now is by manually connecting to the database. I will work on adding more of this information to the weekly email script. > > Right now all of the usage statistics are gathered outside of swift in a shell script, so it is somewhat limited. I am working on a way to get this integrated directly into swift, which should allow for more detail. > > David > > On Fri, Sep 24, 2010 at 6:56 PM, Ian Foster wrote: > David: > > This looks great. > > It would be good to have some information about the location where the script was invoked from (e.g., IP address) so we can analyze number of unique users. > > Information about number of nodes, tasks, data transfers, etc., would be neat too. Failure information can be interesting too. > > Ian. > > > On Sep 24, 2010, at 2:49 PM, David Kelly wrote: > >> >> It is the total number of times the Swift shell script was invoked by those who have a recent version of Swift from trunk and have not disabled usage tracking. >> >> Right now it does not track CPU hours. The closest thing would be the real start and stop time for each invocation. If you find that useful, I could add a graph which shows the total amount of real time it uses per day. >> >> David >> >> On Fri, Sep 24, 2010 at 3:07 PM, Ioan Raicu wrote: >> What are the numbers? Is it the times Swift was invoked? Or the number of jobs Swift generated? Can you also plot the number of CPU hours used? >> >> Ioan >> -- >> ================================================================= >> Ioan Raicu, Ph.D. 
>> Assistant Professor >> ================================================================= >> Computer Science Department >> Illinois Institute of Technology >> 10 W. 31st Street >> Stuart Building, Room 237D >> Chicago, IL 60616 >> ================================================================= >> Cel: 1-847-722-0876 >> Office: 1-312-567-5704 >> Email: iraicu at cs.iit.edu >> Web: http://www.cs.iit.edu/~iraicu/ >> ================================================================= >> ================================================================= >> >> >> On 9/24/2010 2:01 PM, dk0966 at cs.ship.edu wrote: >>> >>> Swift usage summary attached. >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dk0966 at cs.ship.edu Fri Sep 24 19:43:55 2010 From: dk0966 at cs.ship.edu (David Kelly) Date: Fri, 24 Sep 2010 20:43:55 -0400 Subject: [Swift-devel] Swift usage summary In-Reply-To: <7225A43C-B374-437A-942F-8DD8BBB8271D@anl.gov> References: <20100924190147.2B8008D00071@bridled.ci.uchicago.edu> <4C9CF703.6010704@cs.uchicago.edu> <7225A43C-B374-437A-942F-8DD8BBB8271D@anl.gov> Message-ID: That's a good question. I was thinking of starting with Java replacements for the current implementation, but perhaps integrating Globus usage reporting should be the goal? Would it be overkill for the type of data that we need? David On Fri, Sep 24, 2010 at 7:50 PM, Ian Foster wrote: > You're using Globus usage reporting, I assume? 
There is lots of code people > have written to generate various reports > > > On Sep 24, 2010, at 6:44 PM, David Kelly wrote: > > IP addresses, return codes and anonymized user id's are among the data > currently being collected. So far there have been 51 unique users from 35 > different IP addresses. The only way to get this data right now is by > manually connecting to the database. I will work on adding more of this > information to the weekly email script. > > Right now all of the usage statistics are gathered outside of swift in a > shell script, so it is somewhat limited. I am working on a way to get this > integrated directly into swift, which should allow for more detail. > > David > > On Fri, Sep 24, 2010 at 6:56 PM, Ian Foster < > foster at anl.gov> wrote: > >> David: >> >> This looks great. >> >> It would be good to have some information about the location where the >> script was invoked from (e.g., IP address) so we can analyze number of >> unique users. >> >> Information about number of nodes, tasks, data transfers, etc., would be >> neat too. Failure information can be interesting too. >> >> Ian. >> >> >> On Sep 24, 2010, at 2:49 PM, David Kelly wrote: >> >> >> It is the total number of times the Swift shell script was invoked by >> those who have a recent version of Swift from trunk and have not disabled >> usage tracking. >> >> Right now it does not track CPU hours. The closest thing would be the real >> start and stop time for each invocation. If you find that useful, I could >> add a graph which shows the total amount of real time it uses per day. >> >> David >> >> On Fri, Sep 24, 2010 at 3:07 PM, Ioan Raicu < >> iraicu at cs.uchicago.edu> wrote: >> >>> What are the numbers? Is it the times Swift was invoked? Or the number >>> of jobs Swift generated? Can you also plot the number of CPU hours used? >>> >>> Ioan >>> >>> -- >>> ================================================================= >>> Ioan Raicu, Ph.D. 
>>> Assistant Professor
>>> =================================================================
>>> Computer Science Department
>>> Illinois Institute of Technology
>>> 10 W. 31st Street
>>> Stuart Building, Room 237D
>>> Chicago, IL 60616
>>> =================================================================
>>> Cel: 1-847-722-0876
>>> Office: 1-312-567-5704
>>> Email: iraicu at cs.iit.edu
>>> Web: http://www.cs.iit.edu/~iraicu/
>>> =================================================================
>>> =================================================================
>>>
>>> On 9/24/2010 2:01 PM, dk0966 at cs.ship.edu wrote:
>>>
>>> Swift usage summary attached.
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Fri Sep 24 20:51:05 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 24 Sep 2010 19:51:05 -0600 (GMT-06:00)
Subject: [Swift-devel] Globus contact for Swift usage logging support?
In-Reply-To:
Message-ID: <1960697637.221081285379465658.JavaMail.root@zimbra.anl.gov>

was: Re: [Swift-devel] Swift usage summary

Right David - once you move the logging hooks from the front end swift shell script to the Java code, you should use the Globus logging API. That will involve using an existing data collector/database server (preferably) or standing up a new one (which would be good to avoid).

Stu, who is the best contact on the Globus team for David to work with on this (Swift usage logging)?

- Mike

----- "David Kelly" wrote:

> That's a good question. I was thinking of starting with Java replacements for the current implementation, but perhaps integrating Globus usage reporting should be the goal? Would it be overkill for the type of data that we need?
>
> David
>
> On Fri, Sep 24, 2010 at 7:50 PM, Ian Foster < foster at anl.gov > wrote:
>
> You're using Globus usage reporting, I assume? There is lots of code people have written to generate various reports
>
> On Sep 24, 2010, at 6:44 PM, David Kelly < dk0966 at cs.ship.edu > wrote:
>
> IP addresses, return codes and anonymized user id's are among the data currently being collected. So far there have been 51 unique users from 35 different IP addresses. The only way to get this data right now is by manually connecting to the database. I will work on adding more of this information to the weekly email script.
>
> Right now all of the usage statistics are gathered outside of swift in a shell script, so it is somewhat limited. I am working on a way to get this integrated directly into swift, which should allow for more detail.
>
> David
>
> On Fri, Sep 24, 2010 at 6:56 PM, Ian Foster < foster at anl.gov > wrote:
>
> David:
>
> This looks great.
>
> It would be good to have some information about the location where the script was invoked from (e.g., IP address) so we can analyze number of unique users.
>
> Information about number of nodes, tasks, data transfers, etc., would be neat too. Failure information can be interesting too.
>
> Ian.
>
> On Sep 24, 2010, at 2:49 PM, David Kelly wrote:
>
> It is the total number of times the Swift shell script was invoked by those who have a recent version of Swift from trunk and have not disabled usage tracking.
>
> Right now it does not track CPU hours. The closest thing would be the real start and stop time for each invocation. If you find that useful, I could add a graph which shows the total amount of real time it uses per day.
>
> David
>
> On Fri, Sep 24, 2010 at 3:07 PM, Ioan Raicu < iraicu at cs.uchicago.edu > wrote:
>
> What are the numbers? Is it the times Swift was invoked? Or the number of jobs Swift generated? Can you also plot the number of CPU hours used?
>
> Ioan
> --
> =================================================================
> Ioan Raicu, Ph.D.
> Assistant Professor
> =================================================================
> Computer Science Department
> Illinois Institute of Technology
> 10 W. 31st Street
> Stuart Building, Room 237D
> Chicago, IL 60616
> =================================================================
> Cel: 1-847-722-0876
> Office: 1-312-567-5704
> Email: iraicu at cs.iit.edu
> Web: http://www.cs.iit.edu/~iraicu/
> =================================================================
> =================================================================
>
> On 9/24/2010 2:01 PM, dk0966 at cs.ship.edu wrote:
>
> Swift usage summary attached.

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Sat Sep 25 14:03:39 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 25 Sep 2010 12:03:39 -0700
Subject: [Swift-devel] Re: Problems with file and sfs stagingMethods
In-Reply-To: <1264402105.172511285286049599.JavaMail.root@zimbra.anl.gov>
References: <1264402105.172511285286049599.JavaMail.root@zimbra.anl.gov>
Message-ID: <1285441419.31704.0.camel@blabla2.none>

I can reproduce the problem with my existing configuration. I'm trying to figure out how that differs from what I saw before.

Mihael

On Thu, 2010-09-23 at 17:54 -0600, wilde at mcs.anl.gov wrote:
> Hi Mihael,
>
> I've reproduced these problems I mentioned a week ago with provider staging. At the moment, only "proxy" mode works for me.
>
> On the CI net in /home/wilde/swift/lab/:
>
> All tests were done using run.sh which sets all the config files. Each test was a single cat of data.txt to outdir/f.0001.out
>
> Each run dir contains a coasters/ dir with the coaster worker log.
>
> run01: this run worked, with stagingMethod proxy
>
> run02: stagingMethod file, files. The remote side is opening a bad file name (same as I showed you a week ago) with the file:// left in the *middle* of the path name. What gets returned as data to the worker is the Java exception and traceback text.
>
> run03: stagingMethod sfs.
> What happens here is (a) the sfs:// gets stripped off, but localhost// remains at the front of the source file name. When I strip that off, the next problem is that the destination directory has not been created. When I create that, the last problem is that the worker is referencing the relative input pathname "data.txt" from a different directory and without the necessary directory prefix. I explored this in dirs run04-run07 with mods to worker.pl that you can see in the coaster logs, which have been placed in each run dir below coasters/ using a symlink created in run.sh
>
> Can you fix these two stagingMethods? Or, if your tests are working, please point me to your config files, script, and swift revision.
>
> My swift is in: ~/swift/rev/trunk/bin/swift
> which is a symlink to the dist/ in my ~/swift/src/trunk
>
> Thanks,
>
> Mike

From hategan at mcs.anl.gov Sat Sep 25 17:10:55 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 25 Sep 2010 15:10:55 -0700
Subject: [Swift-devel] Re: Problems with file and sfs stagingMethods
In-Reply-To: <1285441419.31704.0.camel@blabla2.none>
References: <1264402105.172511285286049599.JavaMail.root@zimbra.anl.gov> <1285441419.31704.0.camel@blabla2.none>
Message-ID: <1285452655.5636.3.camel@blabla2.none>

I committed a few changes to deal with the problem. Cog r2890 contains them.

Justin, it's possible that I may have introduced some incompatibilities with pinned:

Essentially all staging paths need to now be of one of the following forms:
1. :///
2. :
3.

Also, direct providers (i.e. "sfs") need to have absolute paths for client files because the CWD of the client is unknown to the worker.

So please test and tell me if it works.

Mihael

From pr.jayathilake at gmail.com Mon Sep 27 06:25:58 2010
From: pr.jayathilake at gmail.com (Sukhitha Jayathilake)
Date: Mon, 27 Sep 2010 16:55:58 +0530
Subject: [Swift-devel] new to swift
Message-ID:

I am an undergrad student at the University of Moratuwa, Sri Lanka, and I'd like to contribute to the development of the Swift project. I have browsed through the Google Summer of Code 2010 ideas and would like to implement the idea listed as "Making the swift scripting system easy to install, evaluate and learn on readily available computer sources"
http://dev.globus.org/wiki/Google_Summer_of_Code_2010_Ideas#Making_the_Swift_parallel_scripting_system_easy_to_install.2C_evaluate_and_learn_on_readily_available_computing_resources

I'd like to know whether this idea has been implemented yet, and if not, how I can start contributing to it. I've already installed Swift and followed the tutorials. What else should I do? You should know that I am fairly new to open source development and very eager to contribute. Any help is welcome. Many thanks in advance.

From wilde at mcs.anl.gov Mon Sep 27 08:02:08 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 27 Sep 2010 07:02:08 -0600 (GMT-06:00)
Subject: [Swift-devel] new to swift
In-Reply-To:
Message-ID: <1479688862.246491285592527859.JavaMail.root@zimbra.anl.gov>

Hi Sukhitha,

Yes, the project you mention below was done as a GSoC project by David Kelly. You can find some of the discussion and announcements from this work in the swift-devel email list archives, and perhaps get from there some ideas for follow-on projects in the area of ease-of-configuration and learning.

Regards,

Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wozniak at mcs.anl.gov Mon Sep 27 15:00:38 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 27 Sep 2010 15:00:38 -0500 (CDT)
Subject: [Swift-devel] Re: Problems with file and sfs stagingMethods
In-Reply-To: <1285452655.5636.3.camel@blabla2.none>
References: <1264402105.172511285286049599.JavaMail.root@zimbra.anl.gov> <1285441419.31704.0.camel@blabla2.none> <1285452655.5636.3.camel@blabla2.none>
Message-ID:

Yeah, that's incompatible. Do you think it would help if I switched the URL prefix from "pinned:" to "pinned-" ?
--
Justin M Wozniak

From hategan at mcs.anl.gov Mon Sep 27 15:35:59 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 27 Sep 2010 13:35:59 -0700
Subject: [Swift-devel] Re: Problems with file and sfs stagingMethods
In-Reply-To:
References: <1264402105.172511285286049599.JavaMail.root@zimbra.anl.gov> <1285441419.31704.0.camel@blabla2.none> <1285452655.5636.3.camel@blabla2.none>
Message-ID: <1285619759.13432.3.camel@blabla2.none>

On Mon, 2010-09-27 at 15:00 -0500, Justin M Wozniak wrote:
> Yeah, that's incompatible. Do you think it would help if I switched the URL prefix from "pinned:" to "pinned-" ?

No. I think that we should fix whatever codepath is involved. Can you let me know how to reproduce the problem?

Mihael

From wilde at mcs.anl.gov Mon Sep 27 15:45:34 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Mon, 27 Sep 2010 14:45:34 -0600 (GMT-06:00)
Subject: [Swift-devel] Re: Problems with file and sfs stagingMethods
In-Reply-To: <568953877.294081285620226059.JavaMail.root@zimbra.anl.gov>
Message-ID: <509993886.295111285620334141.JavaMail.root@zimbra.anl.gov>

----- "Mihael Hategan" wrote:

...
> > > Essentially all staging paths need to now be of one of the following forms:
> > > 1. :///
> > > 2. :
> > > 3.
> > >
> > > Also, direct providers (i.e. "sfs") need to have absolute paths for client files because the CWD of the client is unknown to the worker.

By this sfs restriction do you mean that the Swift script needs to map files to absolute pathnames?

Does wrapper.invocation.mode=absolute also need to be set in swift.properties?
- Mike

From wozniak at mcs.anl.gov Mon Sep 27 16:20:53 2010
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 27 Sep 2010 16:20:53 -0500 (CDT)
Subject: [Swift-devel] Re: Problems with file and sfs stagingMethods
In-Reply-To: <1285619759.13432.3.camel@blabla2.none>
References: <1264402105.172511285286049599.JavaMail.root@zimbra.anl.gov> <1285441419.31704.0.camel@blabla2.none> <1285452655.5636.3.camel@blabla2.none> <1285619759.13432.3.camel@blabla2.none>
Message-ID:

On Mon, 27 Sep 2010, Mihael Hategan wrote:

> On Mon, 2010-09-27 at 15:00 -0500, Justin M Wozniak wrote:
>> Yeah, that's incompatible. Do you think it would help if I switched the URL prefix from "pinned:" to "pinned-" ?
>
> No. I think that we should fix whatever codepath is involved. Can you let me know how to reproduce the problem?

A convenient way to do that is to go into the directory that has cog/ and run:

cog/modules/swift/tests/nightly.sh -a -c -g -p -s

Give that a try; if it's not helpful I can bundle something up.

--
Justin M Wozniak

From hategan at mcs.anl.gov Mon Sep 27 16:29:30 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 27 Sep 2010 14:29:30 -0700
Subject: [Swift-devel] Re: Problems with file and sfs stagingMethods
In-Reply-To: <509993886.295111285620334141.JavaMail.root@zimbra.anl.gov>
References: <509993886.295111285620334141.JavaMail.root@zimbra.anl.gov>
Message-ID: <1285622970.14827.1.camel@blabla2.none>

On Mon, 2010-09-27 at 14:45 -0600, wilde at mcs.anl.gov wrote:
> ----- "Mihael Hategan" wrote:
>
> ...
> > > > Essentially all staging paths need to now be of one of the following forms:
> > > > 1. :///
> > > > 2. :
> > > > 3.
> > > >
> > > > Also, direct providers (i.e. "sfs") need to have absolute paths for client files because the CWD of the client is unknown to the worker.
>
> By this sfs restriction do you mean that the Swift script needs to map files to absolute pathnames?

The worker needs an absolute location. But that was what the change was about: transform a relative location received from swift/client into an absolute location. So no. You can now say file f <"relativepath">; in swift with sfs.

> Does wrapper.invocation.mode=absolute also need to be set in swift.properties?

No.

From wilde at mcs.anl.gov Tue Sep 28 05:46:07 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Sep 2010 04:46:07 -0600 (GMT-06:00)
Subject: [Swift-devel] Need to integrate SGE provider into trunk
Message-ID: <789287483.336151285670767227.JavaMail.root@zimbra.anl.gov>

Mihael, I just noticed that the SGE provider is in the stable branch but not in trunk.

Can you move it there? Is this a trivial copy or are there divergent changes in provider-localscheduler/ that need to be integrated?

I'm trying to make some BLAST scripts available to the CI-IBI team, and to test them for some Swift examples that were requested for an external book.

Thanks,

Mike

From chaturanganamal at gmail.com Tue Sep 28 06:21:42 2010
From: chaturanganamal at gmail.com (Chaturanga Wimalarathne)
Date: Tue, 28 Sep 2010 16:51:42 +0530
Subject: [Swift-devel] Concerning the project "Enhancing the Swift parallel scripting Library"
Message-ID:

I am an undergraduate student in the Department of Computer Science and Engineering at the University of Moratuwa, Sri Lanka, and I'd like to contribute to enhancing the Swift parallel scripting library as a programming project. I have browsed through the project ideas listed on Globus Swift, and I would like to implement the idea listed as "Enhancing the Swift parallel scripting Library"
http://dev.globus.org/wiki/Google_Summer_of_Code_2010_Ideas#Enhancing_the_Swift_parallel_scripting_Library

I'd like to know if this project is available and, if it is, what the prerequisites for this project are. I must inform you that I am fairly new to open source development and also very eager to contribute. Any help is welcome. Many thanks in advance.
From wilde at mcs.anl.gov Tue Sep 28 09:52:03 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Sep 2010 08:52:03 -0600 (GMT-06:00)
Subject: [Swift-devel] Concerning the project "Enhancing the Swift parallel scripting Library"
In-Reply-To:
Message-ID: <1815525148.343741285685523428.JavaMail.root@zimbra.anl.gov>

Chaturanga, is this for a class project? If so, can you tell me who the professor is, and send me a pointer to any web pages for the class, especially something describing the project requirements?

Thanks,

Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Tue Sep 28 10:22:06 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Tue, 28 Sep 2010 09:22:06 -0600 (GMT-06:00)
Subject: [Swift-devel] Pgraph misses dependency on mapper input file
In-Reply-To: <1165894844.346391285687269337.JavaMail.root@zimbra.anl.gov>
Message-ID: <688719493.346571285687326780.JavaMail.root@zimbra.anl.gov>

In the attached script, the dependency of the blastall stage on the file list produced by the initial split stage is missing from the plotted pgraph (as seen by the unconnected "split" job at the left of the png file).

- Mike

-------------- next part --------------
A non-text attachment was scrubbed...
Name: blastx.swift
Type: application/octet-stream
Size: 1513 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tiny.png
Type: image/png
Size: 18790 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tiny.dot
Type: text/vnd.graphviz
Size: 3220 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Tue Sep 28 12:13:47 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 28 Sep 2010 10:13:47 -0700
Subject: [Swift-devel] Re: Need to integrate SGE provider into trunk
In-Reply-To: <789287483.336151285670767227.JavaMail.root@zimbra.anl.gov>
References: <789287483.336151285670767227.JavaMail.root@zimbra.anl.gov>
Message-ID: <1285694027.18611.1.camel@blabla2.none>

On Tue, 2010-09-28 at 04:46 -0600, Michael Wilde wrote:
> Mihael, I just noticed that the SGE provider is in the stable branch but not in trunk.
>
> Can you move it there?
> Is this a trivial copy or are there divergent changes in provider-localscheduler/ that need to be integrated?

I will move it there.

Btw, my schedule is shaping up. Here's the outline:
M, W - no chance
T, F - some time in the morning
R, S, S - available

Mihael

From wilde at mcs.anl.gov Tue Sep 28 19:29:27 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Tue, 28 Sep 2010 18:29:27 -0600 (GMT-06:00)
Subject: [Swift-devel] Need precise throttle on local provider
In-Reply-To: <2116249194.421731285719686930.JavaMail.root@zimbra.anl.gov>
Message-ID: <179574332.422131285720167850.JavaMail.root@zimbra.anl.gov>

Mihael,

I have the need (for the Swift R interface) to either throttle the local execution provider to run *exactly* one job at a time, or to enhance the provider to set a SWIFT_JOB_SLOT env var to a value that signifies a virtual "slot number" for N concurrent jobs being run by the provider.

I use this env var to associate Swift jobs with persistent R evaluation servers that need to run serially: they can handle only one job at a time.

I've modified the coaster worker.pl script to do this and it works very well.

I'm now trying to get the same behavior from the local execution provider, and rather than tackle inserting this into the Java provider code, I tried the shortcut of configuring a small set of local provider pool entries, each with the throttle set to what I *thought* would guarantee me no more than one job at a time running on each "pool":

-0.001
10000

I thought that the correct value for jobThrottle would be 0.0 to ensure 1 job, but from experimentation I found that I needed to set it to a slightly negative value, as above (-0.001).

But it seems like even this is not sufficient: under heavy load, I'm seeing a second job start on the same pool before the prior job has completed (I use "mkdir" as a pseudo-mutex, and I'm running on a local filesystem under /tmp).
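A minimal Java sketch of the mkdir-style pseudo-mutex mentioned above, assuming one lock directory per job slot. The class name DirLock and the lock path are hypothetical, and the script actually used is not shown in this thread; the sketch only illustrates why directory creation can serve as a test-and-set: Files.createDirectory either creates the directory or fails because it already exists, as a single atomic operation.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: a directory-based lock, analogous to `mkdir lockdir` in sh. */
final class DirLock {
    private DirLock() {}

    /** Returns true if this caller created the directory, i.e. acquired the slot. */
    static boolean tryAcquire(Path lockDir) throws IOException {
        try {
            Files.createDirectory(lockDir); // atomic check-and-create
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // another job currently holds the slot
        }
    }

    /** Releases the slot by removing the (empty) lock directory. */
    static void release(Path lockDir) throws IOException {
        Files.delete(lockDir);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("slotdemo");
        Path slot = base.resolve("slot0.lock"); // hypothetical slot name
        System.out.println(tryAcquire(slot));   // first caller
        System.out.println(tryAcquire(slot));   // second caller while the slot is held
        release(slot);
        System.out.println(tryAcquire(slot));   // after release
        release(slot);
        Files.delete(base);
    }
}
```

A lock of this kind only serializes jobs that all check the same directory; it does not by itself explain why two jobs would overlap on one pool.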
So, my first question is: Is there some set of throttling or other sites.xml entries that will ensure <= 1 job per local provider pool?

Second question: If you can point me to the right place, Justin or I could do this the "right" way by modifying the local execution provider to set "SLOT" numbers. I initially thought the current hack would be easier, and it seemed to work under standalone testing, but seems to be failing now in the live setting.

Thanks,

Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Tue Sep 28 20:04:26 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 28 Sep 2010 19:04:26 -0600 (GMT-06:00)
Subject: [Swift-devel] Need precise throttle on local provider
In-Reply-To: <179574332.422131285720167850.JavaMail.root@zimbra.anl.gov>
Message-ID: <977798375.423071285722266569.JavaMail.root@zimbra.anl.gov>

I read a bit more on this in the User Guide property section. I tried throttle.host.submit=1 (along with the jobThrottle element in sites.xml) but this did not solve the problem.

The documentation on the throttle confused me, as I've always thought that the throttle behavior was to run (n * 100)+1 jobs, but the doc says +2.

Regardless, I implemented a workaround with a simple sh mutex, so this is low priority for now, and I'll probably try the provider enhancement route rather than counting on throttle behavior for concurrency control.

- Mike

From hategan at mcs.anl.gov Tue Sep 28 22:16:26 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 28 Sep 2010 20:16:26 -0700
Subject: [Swift-devel] Re: Need precise throttle on local provider
In-Reply-To: <179574332.422131285720167850.JavaMail.root@zimbra.anl.gov>
References: <179574332.422131285720167850.JavaMail.root@zimbra.anl.gov>
Message-ID: <1285730186.21714.4.camel@blabla2.none>

On Tue, 2010-09-28 at 18:29 -0600, wilde at mcs.anl.gov wrote:
> Mihael,
>
> I have the need (for the Swift R interface) to either throttle the local execution provider to run *exactly* one job at a time, or to enhance the provider to set a SWIFT_JOB_SLOT env var to a value that signifies a virtual "slot number" for N concurrent jobs being run by the provider.
>
> I use this env var to associate Swift jobs with persistent R evaluation servers that need to run serially: they can handle only one job at a time.
>
> I've modified the coaster worker.pl script to do this and it works very well.
>
> I'm now trying to get the same behavior from the local execution provider, and rather than tackle inserting this into the Java provider code, I tried the shortcut of configuring a small set of local provider pool entries, each with the throttle set to what I *thought* would guarantee me no more than one job at a time running on each "pool":
>

Try 1

> -0.001
> 10000
>
> I thought that the correct value for jobThrottle would be 0.0 to ensure 1 job, but from experimentation I found that I needed to set it to a slightly negative value, as above (-0.001).

Right. There's a +2 there somewhere in the formula.

> But it seems like even this is not sufficient: under heavy load, I'm seeing a second job start on the same pool before the prior job has completed (I use "mkdir" as a pseudo-mutex, and I'm running on a local filesystem under /tmp).

Explain "seeing" in the above sentence. But before that, try jobsPerCpu.

> So, my first question is: Is there some set of throttling or other sites.xml entries that will ensure <= 1 job per local provider pool?
>
> Second question: If you can point me to the right place, Justin or I could do this the "right" way by modifying the local execution provider to set "SLOT" numbers. I initially thought the current hack would be easier, and it seemed to work under standalone testing, but seems to be failing now in the live setting.

The right way, I would think, is to modify the relevant throttling parameters for the scheduler for that site. That is, the local provider should not have anything to do with this. Luckily there already is a parameter to limit the number of concurrent jobs (and I mentioned it before).

Mihael

From hategan at mcs.anl.gov Tue Sep 28 22:24:11 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 28 Sep 2010 20:24:11 -0700
Subject: [Swift-devel] Need precise throttle on local provider
In-Reply-To: <977798375.423071285722266569.JavaMail.root@zimbra.anl.gov>
References: <977798375.423071285722266569.JavaMail.root@zimbra.anl.gov>
Message-ID: <1285730651.21714.12.camel@blabla2.none>

On Tue, 2010-09-28 at 19:04 -0600, Michael Wilde wrote:
> I read a bit more on this in the User Guide property section. I tried throttle.host.submit=1 (along with the jobThrottle element in sites.xml) but this did not solve the problem.

That is not meant to solve that problem. It limits the number of concurrent calls to jobTaskHandler.submit() for a site, but not the number of (fictive) concurrent job.run() calls.

> The documentation on the throttle confused me, as I've always thought that the throttle behavior was to run (n * 100)+1 job, but the doc says +2.

+1 seems to be correct. The doc needs updating.

public double maxLoad() {
    return jobThrottle * tscore + 1;
}

> Regardless, I implemented a workaround with some simple sh mutex, so this is low prio for now, and I'll probably try the provider enhancement route rather than counting on throttle behavior for concurrency control.

If it weren't the case that there is (in my belief) a solution to this, I would ask: You are writing a throttle in one way or another. Why not have it in the "right" place?

In other words, whether the code is a shell wrapper or a piece of code in the scheduler that is reusable for any configuration/provider/site is a question of engineering, and I believe that the latter is a better option (due to the broader scope).

Mihael
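
The maxLoad() one-liner quoted above can be exercised in isolation. This is only a sketch: the real method reads jobThrottle and tscore from fields that the thread does not show, so here both are passed as parameters, the class name ThrottleSketch is hypothetical, and the tscore value of 100 is an assumption taken from the "(n * 100)+1" reading earlier in the thread.

```java
/** Sketch of the load limit quoted above; only the formula itself comes from the thread. */
final class ThrottleSketch {
    private ThrottleSketch() {}

    /** Mirrors the quoted one-liner: jobThrottle * tscore + 1. */
    static double maxLoad(double jobThrottle, double tscore) {
        return jobThrottle * tscore + 1;
    }

    public static void main(String[] args) {
        double tscore = 100.0; // assumed multiplier, not confirmed by the thread
        double[] throttles = {-0.001, 0.0, 0.01, 0.5};
        for (double jobThrottle : throttles) {
            System.out.printf("jobThrottle=%.3f -> maxLoad=%.2f%n",
                    jobThrottle, maxLoad(jobThrottle, tscore));
        }
    }
}
```

With jobThrottle = 0.0 the formula yields a limit of 1 whatever tscore is, and a slightly negative jobThrottle pushes the limit just below 1. How the scheduler rounds that limit or compares it against the current load is not shown in the thread, so the sketch makes no claim about how many jobs actually run.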